Shuu12121/CodeModernBERT-Owl-v1


	---
	license: apache-2.0
	datasets:
	- Shuu12121/javascript-treesitter-filtered-datasetsV2
	- Shuu12121/ruby-treesitter-filtered-datasetsV2
	- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
	- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
	- Shuu12121/rust-treesitter-filtered-datasetsV2
	- Shuu12121/php-treesitter-filtered-datasetsV2
	- Shuu12121/python-treesitter-filtered-datasetsV2
	- Shuu12121/typescript-treesitter-filtered-datasetsV2
	language:
	- en
	pipeline_tag: fill-mask
	tags:
	- code
	- python
	- java
	- javascript
	- typescript
	- go
	- ruby
	- rust
	- php
	---

	# CodeModernBERT-Owl-v1🦉

	## Model Details

	* Model type: Bi-encoder architecture based on ModernBERT
	* Architecture:
	  * Hidden size: 768
	  * Layers: 22
	  * Attention heads: 12
	  * Intermediate size: 1,152
	  * Max position embeddings: 8,192
	  * Local attention window size: 128
	  * RoPE positional encoding: θ = 160,000
	  * Local RoPE positional encoding: θ = 10,000
	* Sequence length: up to 2,048 tokens for code and docstring inputs during pretraining
	* Implementation: Back-end in Python; integrated into OwlSpotLight, a Visual Studio Code extension.
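
The architecture values above imply a per-head dimension and two sets of rotary frequencies. The following is a minimal sketch, assuming the hidden size is split evenly across heads and that RoPE uses the standard inverse-frequency formula `theta ** (-2i / head_dim)`; neither detail is stated in this card, and the function names are illustrative only.

```python
import math

# Values copied from the architecture list above.
HIDDEN_SIZE = 768
NUM_HEADS = 12
GLOBAL_ROPE_THETA = 160_000.0
LOCAL_ROPE_THETA = 10_000.0

# Assumption: heads split the hidden size evenly.
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS


def rope_inv_freq(theta, head_dim):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(theta, head_dim):
    """Wavelength (in token positions) of the slowest-rotating pair."""
    return 2 * math.pi / rope_inv_freq(theta, head_dim)[-1]


global_freqs = rope_inv_freq(GLOBAL_ROPE_THETA, HEAD_DIM)
local_freqs = rope_inv_freq(LOCAL_ROPE_THETA, HEAD_DIM)

print("head dim:", HEAD_DIM)
print("longest global wavelength:", longest_wavelength(GLOBAL_ROPE_THETA, HEAD_DIM))
print("longest local wavelength:", longest_wavelength(LOCAL_ROPE_THETA, HEAD_DIM))
```

Under these assumptions, the larger θ gives the global RoPE slower-rotating components than the local RoPE, which fits its use over longer distances.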

	## Pretraining

	* Tokenizer: Custom BPE tokenizer trained for code and docstring pairs.
	* Data: Functions and natural language descriptions extracted from GitHub repositories.
	* Masking strategy: Two-phase pretraining.
	  * Phase 1: Random Masked Language Modeling (MLM).
	    30% of tokens in code functions are randomly masked and predicted using standard MLM.
	  * Phase 2: Line-level Span Masking.
	    Inspired by SpanBERT, continued pretraining on the same data with span masking at line granularity:
	    1. Convert input tokens back to strings.
	    2. Detect newline tokens with regex and segment inputs by line.
	    3. Exclude whitespace-only tokens from masking.
	    4. Apply padding to align sequence lengths.
	    5. Randomly mask 30% of tokens in each line segment and predict them.
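
The Phase 2 steps can be sketched in plain Python as follows. This is an illustrative approximation, not the training code. It assumes the tokens are already decoded to strings (step 1), that a token containing `\n` closes its line, and that the per-line mask count is `round(0.3 * n)`. Padding (step 4) is omitted because only one sequence is processed. The names `split_into_lines`, `line_level_mask`, and the `[MASK]` string are hypothetical.

```python
import random
import re

MASK = "[MASK]"            # placeholder mask string (assumption)
NEWLINE = re.compile(r"\n")


def split_into_lines(tokens):
    """Step 2: group token indices into line segments."""
    segments, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if NEWLINE.search(tok):   # a newline token closes the segment
            segments.append(current)
            current = []
    if current:                   # trailing line without a newline
        segments.append(current)
    return segments


def line_level_mask(tokens, ratio=0.3, seed=0):
    """Steps 3 and 5: mask `ratio` of the non-whitespace tokens per line."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None marks positions that are not predicted
    for segment in split_into_lines(tokens):
        # Step 3: whitespace-only tokens are never masking candidates.
        candidates = [i for i in segment if tokens[i].strip()]
        k = round(len(candidates) * ratio)  # rounding rule is an assumption
        for i in rng.sample(candidates, k):
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels


tokens = ["def", " ", "add", "(", "a", ",", " ", "b", ")", ":", "\n",
          "    ", "return", " ", "a", " ", "+", " ", "b", "\n"]
masked, labels = line_level_mask(tokens)
print(masked)
```

Because the mask budget is computed per line, every sufficiently long line contributes masked tokens, whereas Phase 1 style random masking over the whole sequence may leave some lines untouched.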

	* Pretraining hyperparameters:
	  * Batch size: 20
	  * Gradient accumulation steps: 6
	  * Effective batch size: 120
	  * Optimizer: AdamW
	  * Learning rate: 5e-5
	  * Scheduler: Cosine
	  * Epochs: 2
	  * Precision: Mixed precision (fp16) using `transformers`
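
The effective batch size and the learning-rate schedule follow from simple arithmetic. The sketch below assumes a single device, so that 20 × 6 = 120, and a plain cosine decay from the peak rate to zero with no warmup. The card gives neither the warmup nor the total step count, so `total_steps` is a hypothetical value.

```python
import math

# Values copied from the hyperparameter list above.
BATCH_SIZE = 20
GRAD_ACCUM_STEPS = 6
PEAK_LR = 5e-5

# Assumption: one device, so the effective batch is batch size x accumulation.
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * GRAD_ACCUM_STEPS


def cosine_lr(step, total_steps, peak_lr=PEAK_LR):
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; the real step count is not given in the card
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps))
```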