Dalila Cuevas Rodriguez committed
Commit 8759dad · 1 Parent(s): dabc803

small code and readme changes

Files changed (3):
1. README.md +37 -2
2. chat_engine.py +1 -1
3. installs.sh +1 -3
README.md CHANGED
@@ -47,10 +47,10 @@ LexVA follows a pure RAG approach without model finetuning:
### **Generation Model: Qwen3-14B**
Chosen for strong reasoning, thinking, and citation capability, as well as multilingual and long-context strengths.

- ### **Retriever: Section-Level FAISS Index**
+ ### **Retriever: Section-Level**
- Each statute section was embedded using the **Qwen3 Embedding** model.
- No internal chunking — the **complete statute text was the single retrieval unit**.
- - Retrieval used vector similarity search (FAISS), typically with **top-k = 5**.
+ - Vectors are stored in SQLite DB.

### **Synthetic QA Creation (used for evaluation)**
- Qwen3-14B generated the natural-language questions and draft answers.
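
The retriever bullets above describe section-level embeddings kept in a SQLite DB rather than a FAISS index. Below is a minimal sketch of what that lookup could look like; the table and column names (`sections`, `section_id`, `text`, `embedding`), the float32 BLOB storage, and the embedding checkpoint are assumptions for illustration, not taken from the repository.

```python
# Illustrative section-level retrieval over embeddings stored in SQLite.
# Schema and model ID are assumptions, not the repository's actual code.
import sqlite3
import numpy as np
from sentence_transformers import SentenceTransformer

def top_k_sections(db_path: str, query: str, k: int = 5):
    embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed Qwen3 embedding checkpoint
    q = embedder.encode(query)
    q = q / np.linalg.norm(q)

    rows = sqlite3.connect(db_path).execute(
        "SELECT section_id, text, embedding FROM sections"       # assumed table/columns
    ).fetchall()

    scored = []
    for section_id, text, blob in rows:
        v = np.frombuffer(blob, dtype=np.float32)                # assumes float32 BLOBs
        scored.append((float(q @ (v / np.linalg.norm(v))), section_id, text))

    scored.sort(key=lambda s: s[0], reverse=True)                # brute-force cosine scan, no ANN index
    return scored[:k]                                            # one whole statute section per hit
```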
 
@@ -65,6 +65,8 @@ Chosen for strong reasoning, thinking, and citation capability, as well as multilingual and long-context strengths.

No finetuning was performed; the evaluation tests the effectiveness of retrieval + prompting.

+ Please note that prompts and answers are cached in the SQLite DB to speed up benchmarking tasks.
+
- **Developed by:** Dalila Cuevas Rodriguez
- **Model type:** RAG Pipeline
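
The note above mentions caching prompts and answers in the SQLite DB to speed up benchmarking. A minimal sketch of that kind of cache follows; the `qa_cache` table name, its schema, and the SHA-256 keying are illustrative assumptions.

```python
# Illustrative prompt/answer cache in SQLite; table name and schema are assumptions.
import hashlib
import sqlite3

def cached_generate(conn: sqlite3.Connection, prompt: str, generate) -> str:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qa_cache (prompt_hash TEXT PRIMARY KEY, answer TEXT)"
    )
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    row = conn.execute(
        "SELECT answer FROM qa_cache WHERE prompt_hash = ?", (key,)
    ).fetchone()
    if row:                       # cache hit: skip the expensive LLM call
        return row[0]

    answer = generate(prompt)     # cache miss: run the model once and store the result
    conn.execute("INSERT INTO qa_cache VALUES (?, ?)", (key, answer))
    conn.commit()
    return answer
```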
@@ -80,6 +82,22 @@ LexVA is intended for:

It is **not** intended for production legal advice or case outcome prediction.

+ # Setup
+ 1. Clone this model repository: `git clone https://huggingface.co/dcrodriguez/virginia-legal-rag-lexva`
+ 2. Optionally, use a Python 3.12 virtual environment.
+ 3. Install the pip packages listed in `installs.sh`.
+ 4. Run `python setup.py`. This downloads the dataset, loads it into a SQLite DB, and generates all the document embeddings.
+ 5. Run `python example.py`.
+ 6. Write your own scripts using the chat_engine and retriever classes (see the sketch below).
+
+ Notes:
+ - If you're running in an ephemeral environment like Google Colab, put your SQLite DB in your Google Drive. Change the relative path in `setup.py` to point to your drive, and then instantiate the chat engine like this:
+
+ ```python
+ sqlite_path = "/content/drive/MyDrive/LexVA/va_code.db"
+ chat_engine = LLMChatEngine(sqlite_path)
+ ```
+

## Bias, Risks, and Limitations

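
Step 6 above suggests scripting against the chat_engine and retriever classes. A sketch of such a driver script is below; the `ask` method name and its signature are hypothetical, so check `example.py` and `chat_engine.py` for the actual interface.

```python
# Hypothetical driver script; LLMChatEngine's query method name/signature may differ.
from chat_engine import LLMChatEngine

chat_engine = LLMChatEngine("va_code.db")   # DB path created by setup.py

question = "What notice must a landlord give before entering a rental unit?"
answer = chat_engine.ask(question)          # hypothetical method; see example.py for the real API
print(answer)
```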
 
@@ -131,8 +149,25 @@ Use the code below to get started with the model.
| Mistral 7B (no RAG) | NA | NA | **0.855** | **0.224** | **0.717** |
| Mistral 7B (with RAG) | NA | NA | **0.946** | **0.664** | **0.790** |

+ Cosine similarity measures how semantically similar two answers are by embedding both and comparing the resulting vectors. While it's quick to calculate, it ended up being a poor comparison metric for legal Q&A: all the similarity values are within about 8 percent of each other, and answers generated without RAG were semantically similar to the references even though they contained many hallucinations and false statements.

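As a concrete illustration of the cosine similarity score discussed above, here is a minimal computation with sentence-transformers. The embedding model ID and the two sentences are placeholders; any general-purpose embedding model behaves the same way.

```python
# Cosine similarity between two answers' embeddings; model ID and texts are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed embedding checkpoint

reference = "Tenants must receive written notice before the lease terminates."      # illustrative only
candidate = "The landlord must give the tenant written notice prior to ending the lease."

a, b = model.encode([reference, candidate])
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.3f}")  # stays high even when details or citations differ
```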
+ Both DeepEval benchmarks use an LLM as a judge. I used GPT-5 Mini as the judge model for these benchmarks; smaller local models had trouble with the reasoning, and their scores seemed inflated.
+
+ Faithfulness measures how correct the pipeline output is when given the relevant statute. Its inputs are the actual output, the expected output, and the retrieved context documents. It is a more standardized metric than GEval, but I think it struggled because RAG and non-RAG answers read similarly even though the citations in non-RAG answers are made up. Medium-size models like Qwen3-14B do a surprisingly good job answering these questions without RAG, but once again they invent most of their references, which makes them much less useful as a legal research tool. I think this is why the Faithfulness scores are as close as they are. Qwen3-14B scored higher than Mistral 7B, which is expected.
+
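A minimal sketch of scoring one QA pair with DeepEval's `FaithfulnessMetric` follows, assuming the judge is passed as an OpenAI model string. The example texts and the `"gpt-5-mini"` identifier are placeholders, not values from the repository.

```python
# Sketch of a Faithfulness check with DeepEval; texts and judge-model string are placeholders.
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="example question about a Virginia statute",
    actual_output="answer produced by the RAG pipeline",
    retrieval_context=["text of the retrieved statute section"],  # the retrieved context documents
)

metric = FaithfulnessMetric(model="gpt-5-mini")  # LLM judge; identifier is an assumption
metric.measure(test_case)
print(metric.score, metric.reason)
```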
+ GEval was by far the most useful metric. Its inputs are the actual output and the expected output. GEval supports custom grading criteria, which was necessary to properly benchmark this pipeline. The evaluation steps are below.
+
+ ```python
+ evaluation_steps=[
+     "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
+     "Heavily penalize omission of important details from the expected output.",
+     "If a specific statute is referenced in 'expected output' but not 'actual output' lower the score, but numerically close references are ok. For example, 55.1-1243.2 is close to 55.1-1244.1, but is far from 19.2-308.",
+     "Additional statute references in actual output are ok",
+     "Minor stylistic differences or changes in wording are acceptable.",
+ ],
+ ```
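
Those steps only become a metric once they are wrapped in DeepEval's `GEval` class. A sketch of how that might look is below, pointing GEval at the actual and expected outputs and at the same GPT-5 Mini judge; the metric name, test-case texts, and judge-model string are assumptions, not taken from the repository.

```python
# Sketch of wrapping the custom steps in DeepEval's GEval; names and judge string are assumptions.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Legal correctness",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
        # ... remaining steps from the block above ...
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    model="gpt-5-mini",  # LLM judge; identifier is an assumption
)

test_case = LLMTestCase(
    input="example question about a Virginia statute",
    actual_output="answer produced by the RAG pipeline",
    expected_output="reference answer from the synthetic QA set",
)

correctness.measure(test_case)
print(correctness.score)
```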

+ This benchmark was significantly more successful at identifying hallucinated citations: it evaluates the claims made in the answer and also checks whether the expected statute citations were used. The largest delta between RAG and non-RAG answers shows up under this benchmark.

### Model Architecture and Objective

chat_engine.py CHANGED
@@ -177,7 +177,7 @@ class LLMChatEngine:
quantization_config=quant_config,
device_map=device_map, # spans GPU + CPU
max_memory=max_memory, # triggers offloading once GPU limit is hit
- torch_dtype=torch.bfloat16 # or "auto", but be explicit if you like
+ dtype=torch.bfloat16
)

# Precompute the token ids corresponding to "</think>" if possible
installs.sh CHANGED
@@ -1,3 +1 @@
- pip install datasets numpy pandas tqdm
- pip install torch torchvision
- pip install transformers sentence-transformers
+ pip install datasets numpy pandas tqdm torch torchvision transformers sentence-transformers deepeval beautifulsoup4 bitsandbytes