Dalila Cuevas Rodriguez committed
Commit · 8759dad
1 Parent(s): dabc803
small code and readme changes
Browse files:
- README.md +37 -2
- chat_engine.py +1 -1
- installs.sh +1 -3
README.md
CHANGED
@@ -47,10 +47,10 @@ LexVA follows a pure RAG approach without model finetuning:
 ### **Generation Model: Qwen3-14B**
 Chosen for strong reasoning, thinking, and citation capability, as well as multilingual and long-context strengths.

-### **Retriever: Section-Level
+### **Retriever: Section-Level**
 - Each statute section was embedded using the **Qwen3 Embedding** model.
 - No internal chunking — the **complete statute text was the single retrieval unit**.
-
+- Vectors are stored in a SQLite DB.

 ### **Synthetic QA Creation (used for evaluation)**
 - Qwen3-14B generated the natural-language questions and draft answers.
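Editorial aside: the retriever hunk above (section-level embeddings, whole statutes as retrieval units, vectors in SQLite) could look roughly like the sketch below. The embedding checkpoint, table schema, and serialization are assumptions for illustration, not the repo's actual `setup.py`.

```python
# Minimal sketch, assuming a Qwen3 embedding checkpoint served through
# sentence-transformers and a simple (id, text, embedding) table in SQLite.
import sqlite3

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed embedding model

conn = sqlite3.connect("va_code.db")  # path assumed
conn.execute(
    "CREATE TABLE IF NOT EXISTS sections (id TEXT PRIMARY KEY, text TEXT, embedding BLOB)"
)

sections = [("55.1-1204", "Full text of the statute section ...")]  # placeholder rows
for section_id, text in sections:
    # One vector per complete statute section -- no internal chunking.
    vec = model.encode(text, normalize_embeddings=True)
    conn.execute(
        "INSERT OR REPLACE INTO sections VALUES (?, ?, ?)",
        (section_id, text, np.asarray(vec, dtype=np.float32).tobytes()),
    )
conn.commit()
```

Storing the float32 vectors as BLOBs keeps the retriever dependency-free; for a corpus of this size a brute-force cosine scan is typically fast enough that no dedicated vector index is needed.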
@@ -65,6 +65,8 @@ Chosen for strong reasoning, thinking, and citation capability, as well as multi

 No finetuning was performed; the evaluation tests the effectiveness of retrieval + prompting.

+Please note that prompts and answers are cached in the SQLite DB to speed up benchmarking tasks.
+
 - **Developed by:** Dalila Cuevas Rodriguez
 - **Model type:** RAG Pipeline

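The caching note added above could be implemented along these lines; the table name, keying scheme, and call shape are assumptions rather than the repo's actual code.

```python
# Hypothetical prompt/answer cache keyed by a hash of the prompt, stored in the
# same SQLite DB. The real chat_engine.py may structure this differently.
import hashlib
import sqlite3

conn = sqlite3.connect("va_code.db")  # path assumed
conn.execute(
    "CREATE TABLE IF NOT EXISTS answer_cache (prompt_hash TEXT PRIMARY KEY, answer TEXT)"
)

def cached_generate(prompt: str, generate_fn) -> str:
    """Return the cached answer if the prompt was seen before, else generate and store it."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT answer FROM answer_cache WHERE prompt_hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]
    answer = generate_fn(prompt)
    conn.execute("INSERT OR REPLACE INTO answer_cache VALUES (?, ?)", (key, answer))
    conn.commit()
    return answer
```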
@@ -80,6 +82,22 @@ LexVA is intended for:

 It is **not** intended for production legal advice or case outcome prediction.

+# Setup
+1. Clone this model repository: `git clone https://huggingface.co/dcrodriguez/virginia-legal-rag-lexva`
+2. Optionally use a Python 3.12 virtual environment.
+3. Install the pip packages listed in `installs.sh`.
+4. Run `python setup.py`. This will download the dataset, load it into a SQLite DB, and generate all the document embeddings.
+5. Run `python example.py`.
+6. Write your own scripts using the chat_engine and retriever classes.
+
+Notes:
+- If you're running in an ephemeral environment like Google Colab, you should put your SQLite DB in your Google Drive. Change the relative path in `setup.py` to point to your drive, and then instantiate the chat engine like this:
+
+```python
+sqlite_path = "/content/drive/MyDrive/LexVA/va_code.db"
+chat_engine = LLMChatEngine(sqlite_path)
+```
+

 ## Bias, Risks, and Limitations

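For the Colab note in the hunk above, Google Drive has to be mounted before that DB path exists. A minimal companion snippet, reusing the path and class from the README example (the import assumes the class lives in this repo's `chat_engine.py`):

```python
# Mount Google Drive first so the persistent SQLite path is available in Colab.
from google.colab import drive

drive.mount("/content/drive")

# LLMChatEngine is defined in this repo's chat_engine.py; the DB path matches the
# README snippet and should be adjusted to your own Drive layout.
from chat_engine import LLMChatEngine

sqlite_path = "/content/drive/MyDrive/LexVA/va_code.db"
chat_engine = LLMChatEngine(sqlite_path)
```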
@@ -131,8 +149,25 @@ Use the code below to get started with the model.
 | Mistral 7B (no RAG) | NA | NA | **0.855** | **0.224** | **0.717** |
 | Mistral 7B (with RAG) | NA | NA | **0.946** | **0.664** | **0.790** |

+Cosine similarity measures how semantically similar two answers are by comparing their embedding vectors. While it is quick to calculate, it turned out to be a poor comparison metric for legal Q&A: all of the similarity values fall within 8 percent of each other. Answers generated without RAG were semantically similar to the expected answers but contained many hallucinations and false statements.

+The DeepEval benchmarks both use an LLM as a judge. I used GPT-5 Mini as the judge model for these benchmarks; smaller local models had trouble with the reasoning, and their scores seemed inflated.
+
+Faithfulness measures how correct the pipeline output is when given the relevant statute. The inputs are the actual output, the expected output, and the retrieved context documents. It is a more standardized metric than GEval, but it struggled with RAG and non-RAG answers reading similarly even though the citations in non-RAG answers are made up. Medium-sized models like Qwen3-14B do a surprisingly good job answering these questions without RAG, but once again they make up most of their references, which makes them much less useful as a legal research tool. I think this is why the Faithfulness scores are as close as they are. Qwen3-14B scored higher than Mistral 7B, as expected.
+
+GEval was by far the most useful metric. The inputs are the actual output and the expected output. GEval supports custom grading criteria, which was necessary to properly benchmark this pipeline. The evaluation steps are below.
+
+```python
+evaluation_steps=[
+    "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
+    "Heavily penalize omission of important details from the expected output.",
+    "If a specific statute is referenced in 'expected output' but not 'actual output' lower the score, but numerically close references are ok. For example, 55.1-1243.2 is close to 55.1-1244.1, but is far from 19.2-308.",
+    "Additional statute references in actual output are ok",
+    "Minor stylistic differences or changes in wording are acceptable.",
+],
+```

+This benchmark was significantly more successful at identifying hallucinated citations: it evaluates the claims made in the answer and checks whether the expected statute citations were used. It shows the largest delta between RAG and non-RAG answers.

 ### Model Architecture and Objective

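To make the cosine-similarity discussion above concrete, here is a minimal way to compute that score for a pair of answers. The embedding model named here is an assumption, not necessarily the one used to produce the reported numbers, and the answer strings are placeholders.

```python
# Embed an expected answer and a generated answer, then take their cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed model

expected_answer = "Reference answer text for a statute question ..."  # placeholder
generated_answer = "Pipeline answer text for the same question ..."   # placeholder

embeddings = model.encode([expected_answer, generated_answer], normalize_embeddings=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {score:.3f}")
```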
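The GEval steps above could be wired into DeepEval roughly as follows. The metric name, the judge-model string, the threshold defaults, and the test-case contents are assumptions; the evaluation_steps are copied from the README.

```python
# Sketch of a DeepEval GEval metric using the evaluation steps from the README,
# judged by the GPT-5 Mini model mentioned there. Names and inputs are assumed.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Legal QA Correctness",  # assumed metric name
    model="gpt-5-mini",           # judge model mentioned in the README
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
        "Heavily penalize omission of important details from the expected output.",
        "If a specific statute is referenced in 'expected output' but not 'actual output' lower the score, but numerically close references are ok. For example, 55.1-1243.2 is close to 55.1-1244.1, but is far from 19.2-308.",
        "Additional statute references in actual output are ok",
        "Minor stylistic differences or changes in wording are acceptable.",
    ],
)

test_case = LLMTestCase(
    input="Hypothetical question about a Virginia statute",
    actual_output="...answer produced by the pipeline...",
    expected_output="...reference answer from the synthetic QA set...",
)
correctness.measure(test_case)
print(correctness.score, correctness.reason)
```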
chat_engine.py
CHANGED
@@ -177,7 +177,7 @@ class LLMChatEngine:
 quantization_config=quant_config,
 device_map=device_map, # spans GPU + CPU
 max_memory=max_memory, # triggers offloading once GPU limit is hit
-
+dtype=torch.bfloat16  # keep non-quantized modules in bfloat16
 )

 # Precompute the token ids corresponding to "</think>" if possible
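For context on the hunk above, here is a sketch of the kind of quantized loading call these keyword arguments belong to. The model id, quantization settings, and memory budget are assumptions, not the repo's exact values; as in the diff, `dtype=` is used, which requires a recent transformers release (older versions call it `torch_dtype=`).

```python
# Hypothetical loading call combining 4-bit quantization, CPU offload, and the
# bfloat16 dtype added in the hunk above. Values here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",                         # generation model named in the README
    quantization_config=quant_config,
    device_map="auto",                        # spans GPU + CPU
    max_memory={0: "14GiB", "cpu": "48GiB"},  # triggers offloading once the GPU limit is hit
    dtype=torch.bfloat16,                     # non-quantized modules kept in bfloat16
)
```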
installs.sh
CHANGED
@@ -1,3 +1 @@
-pip install datasets numpy pandas tqdm
-pip install torch torchvision
-pip install transformers sentence-transformers
+pip install datasets numpy pandas tqdm torch torchvision transformers sentence-transformers deepeval beautifulsoup4 bitsandbytes