satya007 committed
Commit 45dea68 · verified · 1 Parent(s): 6df5fb7

Upload SpanExtractBERT model

models/spanextractbert/README.md ADDED
@@ -0,0 +1,79 @@
+ ---
+ license: apache-2.0
+ tags:
+ - document-extraction
+ - span-prediction
+ - pytorch
+ datasets:
+ - bluecopa/smalldocs-jsonextract
+ language:
+ - en
+ ---
+
+ # SpanExtractBERT
+
+ This model is part of the SpanExtractBERT document extraction experiments.
+
+ ## Model Description
+
+ SpanExtractBERT is a query-conditioned encoder trained for structured document extraction via span prediction.
+ Given a document and a field query, it extracts the field's value by predicting the start and end token positions of the answer span in the document.
+
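+ The checkpoint appears to be a fully serialized PyTorch module (see Usage below), so there is no `transformers` model class to import. Purely as an illustration of the query-conditioned span-prediction idea, a head with the same call signature might look like the sketch below; the class name, pooling strategy, and head layout are assumptions, not the architecture stored in `pytorch_model.bin`.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from transformers import AutoModel
+
+ class QueryConditionedSpanHead(nn.Module):
+     """Illustrative sketch only: a shared ModernBERT encoder reads the document
+     and the query, each document token is conditioned on a pooled query vector,
+     and a linear head emits start/end logits. The real architecture may differ."""
+
+     def __init__(self, encoder_name="answerdotai/ModernBERT-base"):
+         super().__init__()
+         self.encoder = AutoModel.from_pretrained(encoder_name)
+         hidden = self.encoder.config.hidden_size
+         self.span_head = nn.Linear(2 * hidden, 2)  # -> (start, end) logits per token
+
+     def forward(self, doc_ids, doc_mask, query_ids, query_mask):
+         doc_h = self.encoder(input_ids=doc_ids, attention_mask=doc_mask).last_hidden_state
+         query_h = self.encoder(input_ids=query_ids, attention_mask=query_mask).last_hidden_state
+         # Mean-pool the query and broadcast it over document positions.
+         mask = query_mask.unsqueeze(-1).float()
+         query_vec = (query_h * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
+         query_vec = query_vec.unsqueeze(1).expand(-1, doc_h.size(1), -1)
+         logits = self.span_head(torch.cat([doc_h, query_vec], dim=-1))
+         start_logits, end_logits = logits.split(1, dim=-1)
+         return start_logits.squeeze(-1), end_logits.squeeze(-1)
+ ```
+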
+ ## Training Data
+
+ Trained on [bluecopa/smalldocs-jsonextract](https://huggingface.co/datasets/bluecopa/smalldocs-jsonextract):
+ - 78,290 examples from 1,593 documents
+ - Document types: invoices, receipts
+ - ~80% span extractions, ~20% NULL predictions
+
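+ A minimal sketch for loading the dataset with the `datasets` library (the split name is an assumption; check the dataset card for the actual splits):
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the extraction dataset from the Hub.
+ # "train" is an assumed split name; the dataset repo may define different splits.
+ ds = load_dataset("bluecopa/smalldocs-jsonextract", split="train")
+ print(ds[0])  # inspect one example to see the document / field-query schema
+ ```
+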
+ ## Results
+
+ | Metric | Value |
+ |--------|-------|
+ | Exact Match | 0.0% |
+ | Span F1 | 1.8% |
+ | NULL F1 | 0.0% |
+
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+
+ # Load the serialized model object. On PyTorch >= 2.6, torch.load defaults to
+ # weights_only=True, which cannot unpickle a full nn.Module, so pass
+ # weights_only=False if you trust the checkpoint.
+ model = torch.load("pytorch_model.bin", weights_only=False)
+ model.eval()
+ tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
+
+ # Inference
+ doc_text = "Invoice #12345..."
+ query = "What is the invoice number?"
+
+ # Tokenize the document and the query separately; the model consumes both.
+ inputs = tokenizer(doc_text, return_tensors="pt")
+ query_inputs = tokenizer(query, return_tensors="pt")
+
+ with torch.no_grad():
+     start_logits, end_logits = model(
+         inputs["input_ids"],
+         inputs["attention_mask"],
+         query_inputs["input_ids"],
+         query_inputs["attention_mask"]
+     )
+
+ # Highest-scoring start and end token positions in the document.
+ start_idx = start_logits.argmax(-1).item()
+ end_idx = end_logits.argmax(-1).item()
+
+ # Decode the predicted answer span from the document tokens.
+ answer = tokenizer.decode(inputs["input_ids"][0, start_idx:end_idx + 1])
+ ```
+
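+ About a fifth of the training targets are NULL (the requested field is absent from the document), but how the checkpoint signals NULL at inference time is not documented here. One common convention, assumed below and not confirmed for this model, is to reserve position 0 (SQuAD-2.0 style):
+
+ ```python
+ def decode_span(tokenizer, input_ids, start_idx, end_idx):
+     """Return the predicted field value, or None for NULL / ill-formed spans.
+
+     Assumption: NULL is signalled by start == end == 0 (the [CLS]/BOS position),
+     as in SQuAD-2.0-style extractors. Verify against the training code.
+     """
+     if (start_idx == 0 and end_idx == 0) or end_idx < start_idx:
+         return None
+     return tokenizer.decode(input_ids[0, start_idx:end_idx + 1], skip_special_tokens=True)
+
+ answer = decode_span(tokenizer, inputs["input_ids"], start_idx, end_idx)
+ ```
+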
+ ## Citation
+
+ ```bibtex
+ @article{spanextractbert2025,
+   title={SpanExtractBERT: High-Velocity Document Extraction via Query-Conditioned Encoders},
+   year={2025}
+ }
+ ```
models/spanextractbert/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fa6033da49846b60fee496135721148d0592f2f748456ca89b51044c5fedd941
+ size 615016031