khulnasoft committed on
Commit
6c400a4
·
verified ·
1 Parent(s): 09b5b45

Create README.md

  1. README.md +169 -0
README.md ADDED
@@ -0,0 +1,169 @@
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- transformers
- Qwen2
license: other
license_name: neoai-open-rail-m
license_link: LICENSE
pipeline_tag: sentence-similarity
library_name: sentence-transformers
base_model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
---

## NeoAI-Embed
**NeoAI-Embed** is a state-of-the-art code embedding model designed for retrieval tasks in the software development domain.
It is offered in two sizes: lite (1.5B) and medium (7B). The model is optimized for natural-language-to-code and code-to-code retrieval, which makes it well suited to code search, retrieval-augmented generation (RAG), and contextual understanding of programming languages.
This model outperforms all previous open-source models on the CoIR and MTEB leaderboards, achieving best-in-class performance at a significantly smaller size than competing models.

### Languages Supported:
* Python
* C++
* C#
* Go
* Java
* JavaScript
* PHP
* Ruby
* TypeScript

## Model Information
- Model Size: 1.5B
- Embedding Dimension: 1536
- Max Input Tokens: 32k

## Requirements
```
transformers>=4.39.2
flash_attn>=2.5.6
```

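Before loading the model, it can help to confirm that the installed packages meet these minimums. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `check_requirements` are illustrative, and the sketch assumes the distribution names match the names listed above (pip publishes `flash_attn` as `flash-attn`, and name normalization may vary across Python versions).

```python
from importlib import metadata

# Minimum versions copied from the requirements listed in this README.
MINIMUM_VERSIONS = {"transformers": "4.39.2", "flash_attn": "2.5.6"}


def version_tuple(version: str) -> tuple:
    """Turn '4.39.2' into (4, 39, 2), ignoring suffixes such as '+cu118'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS) -> dict:
    """Map each package name to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[package] = "ok"
        else:
            report[package] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

This is a simplified numeric comparison, not a full implementation of Python version ordering; pre-release tags are ignored.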
## Usage

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("khulnasoft/NeoAI-Embed")
# Run inference
sentences = [
    'accumulator = sum(item.value for item in collection)',
    'result = reduce(lambda acc, curr: acc + curr.amount, data, 0)',
    'matrix = [[i*j for j in range(n)] for i in range(n)]'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1536)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```

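`model.similarity` returns a matrix of pairwise scores. The sketch below reproduces that idea in plain Python on toy 2-dimensional vectors, assuming the similarity function is cosine similarity (the Sentence Transformers default unless a model configures otherwise). The helper names `cosine_similarity_matrix` and `most_similar` are illustrative and not part of the library.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarities of a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv)
         for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def most_similar(similarities, index):
    """Return the index of the item most similar to `index`, excluding itself."""
    candidates = [j for j in range(len(similarities)) if j != index]
    return max(candidates, key=lambda j: similarities[index][j])


# Toy stand-ins for real 1536-dimensional embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sims = cosine_similarity_matrix(toy_embeddings)

print(len(sims), len(sims[0]))  # 3 3
print(most_similar(sims, 0))    # 2
```

The diagonal of the matrix is each item compared with itself, so a search for near-duplicates should skip it, as `most_similar` does.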
### Transformers

```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # If every sequence attends to its final position, the batch is left-padded
    # and the last position holds the last real token of each sequence.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Right-padded batch: index each sequence at its last non-padding token.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


# Each query must come with a one-sentence instruction that describes the task
queries = [
    'how to handle memory efficient data streaming',
    'implement binary tree traversal'
]

documents = [
    """def process_in_chunks():
    buffer = deque(maxlen=1000)
    for record in source_iterator:
        buffer.append(transform(record))
        if len(buffer) >= 1000:
            yield from buffer
            buffer.clear()""",

    """class LazyLoader:
    def __init__(self, source):
        self.generator = iter(source)
        self._cache = []

    def next_batch(self, size=100):
        while len(self._cache) < size:
            try:
                self._cache.append(next(self.generator))
            except StopIteration:
                break
        return self._cache.pop(0) if self._cache else None""",

    """def dfs_recursive(root):
    if not root:
        return []
    stack = []
    stack.extend(dfs_recursive(root.right))
    stack.append(root.val)
    stack.extend(dfs_recursive(root.left))
    return stack"""
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('khulnasoft/NeoAI-Embed', trust_remote_code=True)
model = AutoModel.from_pretrained('khulnasoft/NeoAI-Embed', trust_remote_code=True)

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings so the dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
# Rows are the 2 queries, columns are the 3 documents
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```

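The comment in the example says each query needs a one-sentence task instruction, yet the snippet passes the raw queries. The sketch below shows one way to attach an instruction, assuming this model follows the `Instruct: ...\nQuery: ...` prompt template of its base model (`Alibaba-NLP/gte-Qwen2-1.5B-instruct`); this README does not confirm that. The task wording is a hypothetical placeholder, and documents are left without a prefix.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    """Prefix a query with its task instruction (assumed base-model template)."""
    return f"Instruct: {task_description}\nQuery: {query}"


# Hypothetical task description; adjust it to the retrieval task at hand.
task = "Given a natural language query, retrieve relevant code snippets"

queries = [
    "how to handle memory efficient data streaming",
    "implement binary tree traversal",
]
instructed_queries = [get_detailed_instruct(task, q) for q in queries]

print(instructed_queries[0])
```

The instructed queries would then take the place of `queries` when building `input_texts`.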
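To see what `last_token_pool` selects, the following plain-Python sketch mirrors its logic on nested lists instead of tensors. It is an illustration only: the toy hidden states and masks are made up, and the real function operates on `torch.Tensor` inputs.

```python
def last_token_pool_lists(hidden_states, attention_mask):
    """Pick the hidden state of the last real token of each sequence.

    hidden_states: [batch][seq_len][dim] nested lists.
    attention_mask: [batch][seq_len] lists of 0/1 (1 marks a real token).
    """
    # Left-padded batch: every sequence ends with a real token.
    left_padding = all(row[-1] == 1 for row in attention_mask)
    if left_padding:
        return [sequence[-1] for sequence in hidden_states]
    # Right-padded batch: the last real token sits at index (mask sum - 1).
    return [
        sequence[sum(row) - 1]
        for sequence, row in zip(hidden_states, attention_mask)
    ]


# Two sequences of three positions, each position a 2-dimensional state.
hidden = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]

right_padded_mask = [[1, 1, 1], [1, 1, 0]]  # second sequence padded at the end
left_padded_mask = [[1, 1, 1], [0, 1, 1]]   # second sequence padded at the start

print(last_token_pool_lists(hidden, right_padded_mask))  # [[3.0, 3.0], [5.0, 5.0]]
print(last_token_pool_lists(hidden, left_padded_mask))   # [[3.0, 3.0], [6.0, 6.0]]
```

With right padding the second sequence's embedding comes from position 1, not from the padded final position, which is why the mask is needed at all.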
## License
[NeoAI-Open-RAIL-M](https://www.neoai.khulnasoft.com/open-rail-m-license/)
<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->