---
license: apache-2.0
datasets:
- Shuu12121/javascript-treesitter-filtered-datasetsV2
- Shuu12121/ruby-treesitter-filtered-datasetsV2
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/rust-treesitter-filtered-datasetsV2
- Shuu12121/php-treesitter-filtered-datasetsV2
- Shuu12121/python-treesitter-filtered-datasetsV2
- Shuu12121/typescript-treesitter-filtered-datasetsV2
language:
- en
pipeline_tag: fill-mask
tags:
- code
- python
- java
- javascript
- typescript
- go
- ruby
- rust
- php
---
# CodeModernBERT-Owl-v1🦉
## Model Details
* **Model type**: Bi-encoder architecture based on ModernBERT
* **Architecture**:
  * Hidden size: 768
  * Layers: 22
  * Attention heads: 12
  * Intermediate size: 1,152
  * Max position embeddings: 8,192
  * Local attention window size: 128
  * RoPE positional encoding: θ = 160,000
  * Local RoPE positional encoding: θ = 10,000
* **Sequence length**: up to 2,048 tokens for code and docstring inputs during pretraining
* **Implementation**: Back-end in Python; integrated into **OwlSpotLight**, a Visual Studio Code extension.
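
For readers who want the architecture values above in machine-readable form, here is a minimal sketch. The key names are an assumption: they follow the argument names of `transformers.ModernBertConfig` as far as I know them, and the model's published config may use different names or include further fields (for example the vocabulary size, which is not stated here).

```python
# Values copied from the architecture list above.
# Key names are an assumption modelled on `transformers.ModernBertConfig`;
# check them against the config shipped with the model before relying on them.
architecture = {
    "hidden_size": 768,
    "num_hidden_layers": 22,
    "num_attention_heads": 12,
    "intermediate_size": 1152,
    "max_position_embeddings": 8192,
    "local_attention": 128,          # local attention window size
    "global_rope_theta": 160_000.0,  # RoPE theta for global attention layers
    "local_rope_theta": 10_000.0,    # RoPE theta for local attention layers
}

# The hidden size must divide evenly across the attention heads.
head_dim, remainder = divmod(
    architecture["hidden_size"], architecture["num_attention_heads"]
)
assert remainder == 0
print(head_dim)  # → 64
```

The per-head dimension of 64 follows directly from 768 / 12; the 2,048-token pretraining length sits well inside the 8,192 position limit.
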
## Pretraining
* **Tokenizer**: Custom BPE tokenizer trained for code and docstring pairs.
* **Data**: Functions and natural language descriptions extracted from GitHub repositories.
* **Masking strategy**: Two-phase pretraining.
  * **Phase 1: Random Masked Language Modeling (MLM)**
    30% of the tokens in each code function are randomly masked and predicted with standard MLM.
  * **Phase 2: Line-level Span Masking**
    Inspired by SpanBERT, pretraining continues on the same data with masking applied at line granularity:
    1. Convert input tokens back to strings.
    2. Detect newline tokens with a regex and segment the input by line.
    3. Exclude whitespace-only tokens from masking.
    4. Apply padding to align sequence lengths.
    5. Randomly mask 30% of the tokens in each line segment and predict them.
* **Pretraining hyperparameters**:
  * Batch size: 20
  * Gradient accumulation steps: 6
  * Effective batch size: 120
  * Optimizer: AdamW
  * Learning rate: 5e-5
  * Scheduler: Cosine
  * Epochs: 2
  * Precision: Mixed precision (fp16) using `transformers`
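
The Phase 2 line-level masking steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the training code: the function names, the `[MASK]` string, the newline and whitespace patterns (which guess at a byte-level BPE vocabulary using `Ċ` and `Ġ`), and the rounding rule for the 30% quota are all hypothetical. Padding (step 4) and the conversion from token IDs to strings (step 1) are left out; the sketch starts from token strings.

```python
import random
import re

# Assumption: a newline appears as "\n" or as the byte-level BPE symbol "Ċ".
NEWLINE_RE = re.compile(r"[\nĊ]")
# Assumption: tokens made only of whitespace or whitespace markers are skipped.
WHITESPACE_ONLY_RE = re.compile(r"[\sĠĊ▁]*")


def split_into_lines(tokens):
    """Group token indices into line segments, ending each at a newline token."""
    segments, current = [], []
    for index, token in enumerate(tokens):
        current.append(index)
        if NEWLINE_RE.search(token):
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def mask_lines(tokens, mask_prob=0.3, mask_token="[MASK]", seed=None):
    """Mask about `mask_prob` of the non-whitespace tokens in every line.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for segment in split_into_lines(tokens):
        candidates = [
            i for i in segment if not WHITESPACE_ONLY_RE.fullmatch(tokens[i])
        ]
        # Assumption: the per-line quota is rounded, so very short lines
        # may end up with no masked token at all.
        quota = round(mask_prob * len(candidates))
        for i in rng.sample(candidates, quota):
            labels[i] = tokens[i]
            masked[i] = mask_token
    return masked, labels


example = ["def", "Ġadd", "(", "a", ",", "Ġb", ")", ":", "Ċ",
           "Ġ", "return", "Ġa", "Ġ+", "Ġb", "Ċ"]
masked_example, label_example = mask_lines(example, seed=0)
print(masked_example)
```

Drawing the quota per line rather than over the whole sequence spreads the masked positions across every line of a function, which is what distinguishes this phase from the uniform random masking of Phase 1.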