seanpedrickcase committed on
Commit
d5b5291
·
1 Parent(s): 7bb945f

Minor update to cli_redact for new local OCR model options. Updated app_settings.qmd, user_guide.qmd, and readme.md with descriptions of new features

  1. README.md +137 -35
  2. cli_redact.py +2 -1
  3. pyproject.toml +1 -1
  4. src/app_settings.qmd +192 -6
  5. src/user_guide.qmd +136 -32
README.md CHANGED
@@ -10,7 +10,7 @@ license: agpl-3.0
10
  ---
11
  # Document redaction
12
 
13
- version: 1.5.0
14
 
15
  Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
16
 
@@ -249,7 +249,6 @@ Now you have the app installed, what follows is a guide on how to use it for bas
249
  - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
250
 
251
  ### Advanced user guide
252
- - [Advanced user guide](#advanced-user-guide)
253
  - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
254
  - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
255
  - [Using _for_review.pdf files with Adobe Acrobat](#using-_for_reviewpdf-files-with-adobe-acrobat)
@@ -261,7 +260,6 @@ Now you have the app installed, what follows is a guide on how to use it for bas
261
  - [Merging redaction review files](#merging-redaction-review-files)
262
 
263
  ### Features for expert users/system administrators
264
- - [Features for expert users/system administrators](#features-for-expert-userssystem-administrators)
265
  - [Advanced OCR options (Hybrid OCR)](#advanced-ocr-options-hybrid-ocr)
266
  - [Command Line Interface (CLI)](#command-line-interface-cli)
267
 
@@ -376,7 +374,17 @@ If you have used the AWS Textract option for extracting text, you may also see a
376
 
377
  ![Document upload alongside Textract](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/document_upload_with_textract.PNG)
378
 
379
- Similarly, if you have used the 'Local OCR method' to extract text, you may see a '..._ocr_results_with_words.json' file. This file works in the same way as the AWS Textract .json results described above, and can be uploaded alongside an input document to save time on text extraction in future in the same way.
380
 
381
  ### Downloading output files from previous redaction tasks
382
 
@@ -686,6 +694,7 @@ You can also write open text into an input box and redact that using the same me
686
  ### Redaction log outputs
687
  A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
688
 
 
689
  ## Identifying and redacting duplicate pages
690
 
691
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
@@ -916,45 +925,91 @@ AWS_SECRET_KEY= your-secret-key
916
 
917
  The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
918
 
919
- Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
920
-
921
- ## Advanced OCR options (Hybrid OCR)
922
 
923
- The app supports advanced OCR options that combine multiple OCR engines for improved accuracy. These options are not enabled by default but can be configured by your system administrator.
924
 
925
  ### Available OCR models
926
 
927
- - **Tesseract** (default): The standard OCR engine that works well for most documents
928
- - **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise
929
- - **Hybrid**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text
 
 
930
 
931
  ### Enabling advanced OCR options
932
 
933
- To enable these options, your system administrator needs to modify the configuration file (`config.py`) and set:
934
 
 
935
  ```
936
  SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
937
  ```
938
 
939
- Once enabled, users will see a "Change default local OCR model" section in the redaction settings where they can choose between:
940
- - tesseract
941
- - hybrid
942
- - paddle
943
 
944
- ### Hybrid OCR configuration
945
 
946
- The hybrid OCR mode uses several configurable parameters:
947
 
948
- - **HYBRID_OCR_CONFIDENCE_THRESHOLD** (default: 65): Tesseract confidence score below which PaddleOCR will be used for re-extraction
949
- - **HYBRID_OCR_PADDING** (default: 1): Padding added to word bounding boxes before re-extraction
950
- - **SAVE_EXAMPLE_HYBRID_IMAGES** (default: False): Save comparison images when using hybrid mode
951
- - **SAVE_PAGE_OCR_VISUALISATIONS** (default: False): Save images with PaddleOCR bounding boxes overlaid
952
 
953
  ### When to use different OCR models
954
 
955
- - **Tesseract**: Best for general use, good balance of speed and accuracy
956
- - **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision
957
- - **Hybrid**: Best for challenging documents where some text has low confidence scores, providing the benefits of both engines
 
 
958
 
959
 
960
 
@@ -1069,18 +1124,65 @@ python cli_redact.py --task textract --textract_action list
1069
 
1070
  ### Common CLI options
1071
 
 
 
1072
  - `--task`: Choose between "redact", "deduplicate", or "textract"
1073
- - `--input_file`: Path to input file(s)
1074
  - `--output_dir`: Directory for output files (default: output/)
1075
- - `--page_min` / `--page_max`: Process only specific page range
1076
- - `--ocr_method`: Choose text extraction method
1077
- - `--pii_detector`: Choose PII detection method
1078
- - `--local_redact_entities`: Specify local entities to redact
1079
- - `--allow_list_file` / `--deny_list_file`: Custom lists
1080
- - `--redact_whole_page_file`: List of pages to redact completely
1081
- - `--fuzzy_mistakes`: Number of spelling mistakes allowed in fuzzy matching
1082
- - `--similarity_threshold`: Threshold for duplicate detection
1083
- - `--anon_strategy`: Anonymization strategy for tabular data
1084
 
1085
  ### Output files
1086
 
 
10
  ---
11
  # Document redaction
12
 
13
+ version: 1.5.1
14
 
15
  Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
16
 
 
249
  - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
250
 
251
  ### Advanced user guide
 
252
  - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
253
  - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
254
  - [Using _for_review.pdf files with Adobe Acrobat](#using-_for_reviewpdf-files-with-adobe-acrobat)
 
260
  - [Merging redaction review files](#merging-redaction-review-files)
261
 
262
  ### Features for expert users/system administrators
 
263
  - [Advanced OCR options (Hybrid OCR)](#advanced-ocr-options-hybrid-ocr)
264
  - [Command Line Interface (CLI)](#command-line-interface-cli)
265
 
 
374
 
375
  ![Document upload alongside Textract](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/document_upload_with_textract.PNG)
376
 
377
+ #### Additional outputs in the log file outputs
378
+
379
+ On the Redaction settings tab, near the bottom of the page, there is a section called 'Log file outputs'. This section contains the following files:
380
+
381
+ You may see a '..._ocr_results_with_words... .json' file. This file works in the same way as the AWS Textract .json results described above, and can likewise be uploaded alongside an input document to save time on text extraction in future.
382
+
383
+ You will also see a 'decision_process_table.csv' file. This file contains a table of the decisions made by the app for each page of the document. This can be useful for debugging and understanding the decisions made by the app.
384
+
385
+ Additionally, if the option is enabled by your system administrator, you may see images of the output from the OCR model used to extract the text from the document, with file names ending in the page number and '_visualisations.jpg'. A separate image is created for each page of the document, like the one below. This can be useful for seeing at a glance whether the text extraction process for a page was successful, and whether word-level bounding boxes are correctly positioned.
386
+
387
+ ![Text analysis output](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/example_complaint_letter_1_textract_visualisations.jpg)
388
 
389
  ### Downloading output files from previous redaction tasks
390
 
 
694
  ### Redaction log outputs
695
  A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
696
 
697
+
698
  ## Identifying and redacting duplicate pages
699
 
700
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
 
925
 
926
  The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
927
 
928
+ ## Advanced OCR options
 
 
929
 
930
+ The app supports advanced OCR options that combine multiple OCR engines for improved accuracy. These options are not enabled by default, but can be enabled by editing the app_config.env file in your '/config' folder or by setting the equivalent system environment variables.
931
 
932
  ### Available OCR models
933
 
934
+ - **Tesseract** (default): The standard OCR engine that works well for most documents. Provides good word-level bounding box accuracy.
935
+ - **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise. Best for documents with clear, well-formatted text.
936
+ - **Hybrid-paddle**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text regions.
937
+ - **Hybrid-vlm**: Combines Tesseract with Vision Language Models (VLM) - uses Tesseract for initial extraction, then a VLM model (default: Dots.OCR) for re-extraction of low-confidence text.
938
+ - **Hybrid-paddle-vlm**: Combines PaddleOCR with Vision Language Models - uses PaddleOCR first, then a VLM model for low-confidence regions.
939
 
940
  ### Enabling advanced OCR options
941
 
942
+ To enable these options, you need to modify the app_config.env file in your '/config' folder and set the following environment variables:
943
 
944
+ **Basic OCR model selection:**
945
  ```
946
  SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
947
  ```
948
 
949
+ **To enable PaddleOCR options (paddle, hybrid-paddle):**
950
+ ```
951
+ SHOW_PADDLE_MODEL_OPTIONS = "True"
952
+ ```
953
+
954
+ **To enable Vision Language Model options (hybrid-vlm, hybrid-paddle-vlm):**
955
+ ```
956
+ SHOW_VLM_MODEL_OPTIONS = "True"
957
+ ```
958
+
959
+ Once enabled, users will see a "Change default local OCR model" section in the redaction settings where they can choose between the available models based on what has been enabled.
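The flags above are plain `"True"`/`"False"` strings. As a rough sketch of how such flags could gate the list of selectable models (the helper names `env_flag` and `available_ocr_models` are hypothetical and are not taken from the app, whose own logic in `tools/config.py` may differ):

```python
import os
from collections.abc import Mapping


def env_flag(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the variable holds the string "True" (any case)."""
    return env.get(name, "False").strip().lower() == "true"


def available_ocr_models(env: Mapping[str, str] = os.environ) -> list[str]:
    """Build the model list from the flags described in the README (illustrative only)."""
    models = ["tesseract"]  # Tesseract is documented as the default engine
    if env_flag("SHOW_PADDLE_MODEL_OPTIONS", env):
        models += ["paddle", "hybrid-paddle"]
    if env_flag("SHOW_VLM_MODEL_OPTIONS", env):
        models += ["hybrid-vlm", "hybrid-paddle-vlm"]
    return models


if __name__ == "__main__":
    print(available_ocr_models({"SHOW_PADDLE_MODEL_OPTIONS": "True"}))
```

Passing a dictionary instead of `os.environ` makes it easy to try different flag combinations without touching the real environment.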
960
 
961
+ ### OCR configuration parameters
962
 
963
+ The following parameters can be configured by your system administrator to fine-tune OCR behavior:
964
 
965
+ #### Hybrid OCR settings
966
+
967
+ - **HYBRID_OCR_CONFIDENCE_THRESHOLD** (default: 80): Tesseract confidence score below which the secondary OCR engine (PaddleOCR or VLM) will be used for re-extraction. Higher values mean more text will be re-extracted.
968
+ - **HYBRID_OCR_PADDING** (default: 1): Padding (in pixels) added to word bounding boxes before re-extraction with the secondary engine.
969
+ - **SAVE_EXAMPLE_HYBRID_IMAGES** (default: False): If enabled, saves comparison images showing Tesseract vs. secondary engine results when using hybrid modes.
970
+ - **SAVE_PAGE_OCR_VISUALISATIONS** (default: False): If enabled, saves images with detected bounding boxes overlaid for debugging purposes.
971
+
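To make the two numeric hybrid settings concrete, here is a minimal sketch of the selection step, assuming each word carries a Tesseract-style confidence (0-100) and a pixel bounding box. The `Word` class and both function names are hypothetical, and clamping the padded box to the image edges is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # Tesseract-style confidence, 0-100
    box: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def split_by_confidence(words: list[Word], threshold: float = 80):
    """Separate words to keep from words to send to the secondary engine."""
    keep = [w for w in words if w.confidence >= threshold]
    retry = [w for w in words if w.confidence < threshold]
    return keep, retry


def pad_box(box, padding: int, image_width: int, image_height: int):
    """Grow a box by `padding` pixels on every side, clamped to the image (assumed)."""
    left, top, right, bottom = box
    return (
        max(0, left - padding),
        max(0, top - padding),
        min(image_width, right + padding),
        min(image_height, bottom + padding),
    )


if __name__ == "__main__":
    words = [Word("Dear", 96, (10, 10, 50, 30)), Word("S1r", 42, (60, 10, 90, 30))]
    keep, retry = split_by_confidence(words, threshold=80)
    for w in retry:
        print(w.text, pad_box(w.box, padding=1, image_width=100, image_height=100))
```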
972
+ #### Tesseract settings
973
+
974
+ - **TESSERACT_SEGMENTATION_LEVEL** (default: 11): Tesseract PSM (Page Segmentation Mode) level. Valid values are 0-13. Each value selects a different page layout analysis mode rather than a finer or coarser level of detail; the default, 11, is Tesseract's sparse-text mode.
975
+
976
+ #### PaddleOCR settings
977
+
978
+ - **PADDLE_USE_TEXTLINE_ORIENTATION** (default: False): If enabled, PaddleOCR will detect and correct text line orientation.
979
+ - **PADDLE_DET_DB_UNCLIP_RATIO** (default: 1.2): Controls the expansion ratio of detected text regions. Higher values expand the detection area more.
980
+ - **CONVERT_LINE_TO_WORD_LEVEL** (default: False): If enabled, converts PaddleOCR line-level results to word-level for better precision in bounding boxes (not perfect, but pretty good).
981
+ - **LOAD_PADDLE_AT_STARTUP** (default: False): If enabled, loads the PaddleOCR model when the application starts, reducing latency for first use but increasing startup time.
982
+
983
+ #### Image preprocessing
984
+
985
+ - **PREPROCESS_LOCAL_OCR_IMAGES** (default: True): If enabled, images are preprocessed before OCR. This can improve accuracy but may slow down processing.
986
+ - **SAVE_PREPROCESS_IMAGES** (default: False): If enabled, saves the preprocessed images for debugging purposes.
987
+
988
+ #### Vision Language Model (VLM) settings
989
+
990
+ When VLM options are enabled, the following settings are available:
991
+
992
+ - **SELECTED_MODEL** (default: "Dots.OCR"): The VLM model to use. Options include: "Nanonets-OCR2-3B", "Dots.OCR", "Qwen3-VL-2B-Instruct", "Qwen3-VL-4B-Instruct", "PaddleOCR-VL".
993
+ - **MAX_SPACES_GPU_RUN_TIME** (default: 60): Maximum seconds to run GPU operations on Hugging Face Spaces.
994
+ - **MAX_NEW_TOKENS** (default: 30): Maximum number of tokens to generate for VLM responses.
995
+ - **MAX_INPUT_TOKEN_LENGTH** (default: 4096): Maximum number of tokens that can be input to the VLM.
996
+ - **VLM_MAX_IMAGE_SIZE** (default: 1000000): Maximum total pixels (width × height) for images. Larger images are resized while maintaining aspect ratio.
997
+ - **VLM_MAX_DPI** (default: 300.0): Maximum DPI for images. Higher DPI images are resized accordingly.
998
+ - **USE_FLASH_ATTENTION** (default: False): If enabled, uses flash attention for improved VLM performance.
999
+ - **SAVE_VLM_INPUT_IMAGES** (default: False): If enabled, saves input images sent to VLM for debugging.
1000
+
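The `VLM_MAX_IMAGE_SIZE` description (a total pixel budget, with resizing that keeps the aspect ratio) can be illustrated with a small calculation. This is a sketch under that reading of the setting; the function name `fit_within_pixel_budget` is hypothetical and the app's actual resizing code may round differently.

```python
import math


def fit_within_pixel_budget(width: int, height: int, max_pixels: int = 1_000_000):
    """Return (width, height) scaled down so width * height <= max_pixels.

    Both sides are multiplied by the same factor, so the aspect ratio is
    preserved up to integer rounding. Images already within budget are unchanged.
    """
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    # Rounding down guarantees the result never exceeds the budget.
    return max(1, int(width * scale)), max(1, int(height * scale))


if __name__ == "__main__":
    print(fit_within_pixel_budget(2000, 1000))
    print(fit_within_pixel_budget(800, 600))
```

With the default budget, a 2000×1000 image is reduced to 1414×707, while an 800×600 image is returned unchanged.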
1001
+ #### General settings
1002
+
1003
+ - **MODEL_CACHE_PATH** (default: "./model_cache"): Directory where OCR models are cached.
1004
+ - **OVERWRITE_EXISTING_OCR_RESULTS** (default: False): If enabled, always creates new OCR results instead of loading from existing JSON files.
1005
 
1006
  ### When to use different OCR models
1007
 
1008
+ - **Tesseract**: Best for general use, providing a good balance of speed and accuracy with precise word-level bounding boxes.
1009
+ - **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision.
1010
+ - **Hybrid-paddle**: Best for challenging documents where some text has low confidence scores, combining Tesseract's word-level precision with PaddleOCR's improved text recognition.
1011
+ - **Hybrid-vlm**: Best for very challenging documents with poor image quality or unusual text layouts, leveraging advanced vision models for difficult text.
1012
+ - **Hybrid-paddle-vlm**: Most comprehensive option, combining PaddleOCR's line-level detection with a VLM's advanced recognition capabilities.
1013
 
1014
 
1015
 
 
1124
 
1125
  ### Common CLI options
1126
 
1127
+ #### General options
1128
+
1129
  - `--task`: Choose between "redact", "deduplicate", or "textract"
1130
+ - `--input_file`: Path to input file(s) - can specify multiple files separated by spaces
1131
  - `--output_dir`: Directory for output files (default: output/)
1132
+ - `--input_dir`: Directory for input files (default: input/)
1133
+ - `--language`: Language of document content (e.g., "en", "es", "fr")
1134
+ - `--username`: Username for session tracking
1135
+ - `--pii_detector`: Choose PII detection method ("Local", "AWS Comprehend", or "None")
1136
+ - `--local_redact_entities`: Specify local entities to redact (space-separated list)
1137
+ - `--aws_redact_entities`: Specify AWS Comprehend entities to redact (space-separated list)
1138
+ - `--aws_access_key` / `--aws_secret_key`: AWS credentials for cloud services
1139
+ - `--aws_region`: AWS region for cloud services
1140
+ - `--s3_bucket`: S3 bucket name for cloud operations
1141
+ - `--cost_code`: Cost code for tracking usage
1142
+
1143
+ #### PDF/Image redaction options
1144
+
1145
+ - `--ocr_method`: Choose text extraction method ("AWS Textract", "Local OCR", or "Local text")
1146
+ - `--chosen_local_ocr_model`: Local OCR model to use (e.g., "tesseract", "paddle", "hybrid-paddle", "hybrid-vlm")
1147
+ - `--page_min` / `--page_max`: Process only specific page range (0 for max means all pages)
1148
+ - `--images_dpi`: DPI for image processing (default: 300.0)
1149
+ - `--preprocess_local_ocr_images`: Preprocess images before OCR (True/False)
1150
+ - `--compress_redacted_pdf`: Compress the final redacted PDF (True/False)
1151
+ - `--return_pdf_end_of_redaction`: Return PDF at end of redaction process (True/False)
1152
+ - `--allow_list_file` / `--deny_list_file`: Paths to custom allow/deny list CSV files
1153
+ - `--redact_whole_page_file`: Path to CSV file listing pages to redact completely
1154
+ - `--handwrite_signature_extraction`: Handwriting and signature extraction options for Textract ("Extract handwriting", "Extract signatures")
1155
+ - `--extract_forms`: Extract forms during Textract analysis (flag)
1156
+ - `--extract_tables`: Extract tables during Textract analysis (flag)
1157
+ - `--extract_layout`: Extract layout during Textract analysis (flag)
1158
+
1159
+ #### Tabular/Word anonymization options
1160
+
1161
+ - `--anon_strategy`: Anonymization strategy (e.g., "redact", "redact completely", "replace_redacted", "encrypt", "hash")
1162
+ - `--text_columns`: List of column names to anonymize (space-separated)
1163
+ - `--excel_sheets`: Specific Excel sheet names to process (space-separated)
1164
+ - `--fuzzy_mistakes`: Number of spelling mistakes allowed in fuzzy matching (default: 1)
1165
+ - `--match_fuzzy_whole_phrase_bool`: Match fuzzy whole phrase (True/False)
1166
+ - `--do_initial_clean`: Perform initial text cleaning for tabular data (True/False)
1167
+
1168
+ #### Duplicate detection options
1169
+
1170
+ - `--duplicate_type`: Type of duplicate detection ("pages" for OCR files or "tabular" for CSV/Excel)
1171
+ - `--similarity_threshold`: Similarity threshold (0-1) to consider content as duplicates (default: 0.95)
1172
+ - `--min_word_count`: Minimum word count for text to be considered (default: 10)
1173
+ - `--min_consecutive_pages`: Minimum number of consecutive pages to consider as a match (default: 1)
1174
+ - `--greedy_match`: Use greedy matching strategy for consecutive pages (True/False)
1175
+ - `--combine_pages`: Combine text from same page number within a file (True/False)
1176
+ - `--remove_duplicate_rows`: Remove duplicate rows from output (True/False)
1177
+
1178
+ #### Textract batch operations options
1179
+
1180
+ - `--textract_action`: Action to perform ("submit", "retrieve", or "list")
1181
+ - `--job_id`: Textract job ID for retrieve action
1182
+ - `--extract_signatures`: Extract signatures during Textract analysis (flag)
1183
+ - `--textract_bucket`: S3 bucket name for Textract operations
1184
+ - `--poll_interval`: Polling interval in seconds for job status (default: 30)
1185
+ - `--max_poll_attempts`: Maximum polling attempts before timeout (default: 120)
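As an illustration of how the options listed above fit together, the following sketch assembles a redaction command as an argument list. The flags are those listed above; the helper name `build_redact_command`, the file path `input/example.pdf` and the page range are hypothetical.

```python
import shlex


def build_redact_command(input_file, ocr_model="tesseract", page_range=None):
    """Assemble an argument list for a local-OCR redaction run (illustrative)."""
    cmd = [
        "python", "cli_redact.py",
        "--task", "redact",
        "--input_file", input_file,
        "--ocr_method", "Local OCR",
        "--chosen_local_ocr_model", ocr_model,
    ]
    if page_range is not None:
        page_min, page_max = page_range
        cmd += ["--page_min", str(page_min), "--page_max", str(page_max)]
    return cmd


command = build_redact_command(
    "input/example.pdf", ocr_model="hybrid-paddle", page_range=(1, 5)
)
# shlex.join quotes values that contain spaces, such as "Local OCR".
print(shlex.join(command))
```

Keeping the command as a list avoids quoting mistakes if it is later passed to `subprocess.run`.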
1186
 
1187
  ### Output files
1188
 
cli_redact.py CHANGED
@@ -36,6 +36,7 @@ from tools.config import (
36
  FULL_ENTITY_LIST,
37
  IMAGES_DPI,
38
  INPUT_FOLDER,
 
39
  LOCAL_PII_OPTION,
40
  OUTPUT_FOLDER,
41
  PADDLE_MODEL_PATH,
@@ -399,7 +400,7 @@ python cli_redact.py --task textract --textract_action list
399
  )
400
  pdf_group.add_argument(
401
  "--chosen_local_ocr_model",
402
- choices=["tesseract", "hybrid-paddle", "paddle"],
403
  default=CHOSEN_LOCAL_OCR_MODEL,
404
  help="Local OCR model to use.",
405
  )
 
36
  FULL_ENTITY_LIST,
37
  IMAGES_DPI,
38
  INPUT_FOLDER,
39
+ LOCAL_OCR_MODEL_OPTIONS,
40
  LOCAL_PII_OPTION,
41
  OUTPUT_FOLDER,
42
  PADDLE_MODEL_PATH,
 
400
  )
401
  pdf_group.add_argument(
402
  "--chosen_local_ocr_model",
403
+ choices=LOCAL_OCR_MODEL_OPTIONS,
404
  default=CHOSEN_LOCAL_OCR_MODEL,
405
  help="Local OCR model to use.",
406
  )
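The change above replaces a hard-coded `choices` list with one imported from configuration. A minimal, self-contained sketch of the same pattern follows; the two constants are stand-ins with assumed values, since the real ones are defined in `tools/config.py` and depend on which options are enabled.

```python
import argparse

# Stand-in values (assumptions); in the app these come from tools.config.
LOCAL_OCR_MODEL_OPTIONS = [
    "tesseract", "paddle", "hybrid-paddle", "hybrid-vlm", "hybrid-paddle-vlm",
]
CHOSEN_LOCAL_OCR_MODEL = "tesseract"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the OCR model option")
    pdf_group = parser.add_argument_group("PDF/Image redaction options")
    pdf_group.add_argument(
        "--chosen_local_ocr_model",
        choices=LOCAL_OCR_MODEL_OPTIONS,  # argparse rejects any value outside this list
        default=CHOSEN_LOCAL_OCR_MODEL,
        help="Local OCR model to use.",
    )
    return parser


args = build_parser().parse_args(["--chosen_local_ocr_model", "hybrid-vlm"])
print(args.chosen_local_ocr_model)
```

Driving `choices` from a single list means newly enabled models become valid on the command line without a second edit to the parser.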
pyproject.toml CHANGED
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
 
5
  [project]
6
  name = "doc_redaction"
7
- version = "1.5.0"
8
  description = "Redact PDF/image-based documents, Word, or CSV/XLSX files using a Gradio-based GUI interface"
9
  readme = "README.md"
10
  requires-python = ">=3.10"
 
4
 
5
  [project]
6
  name = "doc_redaction"
7
+ version = "1.5.1"
8
  description = "Redact PDF/image-based documents, Word, or CSV/XLSX files using a Gradio-based GUI interface"
9
  readme = "README.md"
10
  requires-python = ">=3.10"
src/app_settings.qmd CHANGED
@@ -97,11 +97,11 @@ Configuration for input and output file handling.
97
  * **Description:** If set to `'True'`, the application will save output and input files into session-specific subfolders.
98
  * **Default Value:** `'False'`
99
 
100
- * **`OUTPUT_FOLDER`**
101
  * **Description:** Specifies the default output folder for generated files. Can be set to `"TEMP"` to use a temporary directory.
102
  * **Default Value:** `'output/'`
103
 
104
- * **`INPUT_FOLDER`**
105
  * **Description:** Specifies the default input folder for files. Can be set to `"TEMP"` to use a temporary directory.
106
  * **Default Value:** `'input/'`
107
 
@@ -225,6 +225,14 @@ Configurations for the Gradio UI, server behavior, and application limits.
225
  * **Description:** Maximum number of characters for open text input.
226
  * **Default Value:** `50000`
227
 
228
  * **`TLDEXTRACT_CACHE`**
229
  * **Description:** Path to the cache directory used by the `tldextract` library.
230
  * **Default Value:** `'tmp/tld/'`
@@ -263,6 +271,14 @@ Configurations related to text extraction, PII detection, and the redaction proc
263
  * **Description:** Controls whether local (Tesseract) or AWS (Textract) text extraction options are shown in the UI.
264
  * **Default Value:** `"True"` for both.
265
 
266
  * **`SHOW_LOCAL_PII_DETECTION_OPTIONS`** / **`SHOW_AWS_PII_DETECTION_OPTIONS`**
267
  * **Description:** Controls whether local or AWS (Comprehend) PII detection options are shown in the UI.
268
  * **Default Value:** `"True"` for both.
@@ -309,7 +325,7 @@ Configurations related to text extraction, PII detection, and the redaction proc
309
 
310
  * **`HYBRID_OCR_CONFIDENCE_THRESHOLD`**
311
  * **Description:** In "hybrid-paddle" mode, this is the Tesseract confidence score below which PaddleOCR will be used for re-extraction.
312
- * **Default Value:** `65`
313
 
314
  * **`HYBRID_OCR_PADDING`**
315
  * **Description:** In "hybrid-paddle" mode, padding added to the word's bounding box before re-extraction.
@@ -333,6 +349,76 @@ Configurations related to text extraction, PII detection, and the redaction proc
333
 
334
  * **`PREPROCESS_LOCAL_OCR_IMAGES`**
335
  * **Description:** If set to `"True"`, images will be preprocessed before local OCR. Can slow down processing.
336
  * **Default Value:** `"False"`
337
 
338
  ### Entity and Search Options
@@ -367,6 +453,10 @@ Configurations related to text extraction, PII detection, and the redaction proc
367
  * **Description:** The default options selected for Textract's handwriting and signature detection.
368
  * **Default Value:** `['Extract handwriting']`
369
 
 
 
 
 
370
  * **`INCLUDE_FORM_EXTRACTION_TEXTRACT_OPTION`**
371
  * **`INCLUDE_LAYOUT_EXTRACTION_TEXTRACT_OPTION`**
372
  * **`INCLUDE_TABLE_EXTRACTION_TEXTRACT_OPTION`**
@@ -481,9 +571,105 @@ Settings for running the application from the command line (Direct Mode) or as a
481
  * **Description:** Path to the input file and output directory for the task.
482
  * **Default Values:** `''`, `output/`
483
 
484
- * **Other `DIRECT_MODE_*` variables:**
485
- * **Description:** These variables allow for setting nearly all application options (e.g., `DIRECT_MODE_PII_DETECTOR`, `DIRECT_MODE_SIMILARITY_THRESHOLD`) directly for a single CLI run, overriding other configurations.
486
- * **Default Value:** Defaults are inherited from the main application settings (e.g., `LOCAL_PII_OPTION`, `DEFAULT_DUPLICATE_DETECTION_THRESHOLD`).
487
 
488
  ### Lambda Configuration
489
 
 
97
  * **Description:** If set to `'True'`, the application will save output and input files into session-specific subfolders.
98
  * **Default Value:** `'False'`
99
 
100
+ * **`OUTPUT_FOLDER`** (environment variable: `GRADIO_OUTPUT_FOLDER`)
101
  * **Description:** Specifies the default output folder for generated files. Can be set to `"TEMP"` to use a temporary directory.
102
  * **Default Value:** `'output/'`
103
 
104
+ * **`INPUT_FOLDER`** (environment variable: `GRADIO_INPUT_FOLDER`)
105
  * **Description:** Specifies the default input folder for files. Can be set to `"TEMP"` to use a temporary directory.
106
  * **Default Value:** `'input/'`
107
 
 
225
  * **Description:** Maximum number of characters for open text input.
226
  * **Default Value:** `50000`
227
 
228
+ * **`PAGE_BREAK_VALUE`**
229
+ * **Description:** Number of pages to process before breaking and restarting from the last finished page (not currently activated).
230
+ * **Default Value:** `99999`
231
+
232
+ * **`MAX_TIME_VALUE`**
233
+ * **Description:** Maximum time value for processing operations.
234
+ * **Default Value:** `999999`
235
+
236
  * **`TLDEXTRACT_CACHE`**
237
  * **Description:** Path to the cache directory used by the `tldextract` library.
238
  * **Default Value:** `'tmp/tld/'`
 
271
  * **Description:** Controls whether local (Tesseract) or AWS (Textract) text extraction options are shown in the UI.
272
  * **Default Value:** `"True"` for both.
273
 
274
+ * **`SELECTABLE_TEXT_EXTRACT_OPTION`**, **`TESSERACT_TEXT_EXTRACT_OPTION`**, **`TEXTRACT_TEXT_EXTRACT_OPTION`**
275
+ * **Description:** Labels for text extraction model options displayed in the UI. Customize the display names for "Local model - selectable text", "Local OCR model - PDFs without selectable text", and "AWS Textract service - all PDF types" respectively.
276
+ * **Default Values:** `"Local model - selectable text"`, `"Local OCR model - PDFs without selectable text"`, `"AWS Textract service - all PDF types"`
277
+
278
+ * **`NO_REDACTION_PII_OPTION`**, **`LOCAL_PII_OPTION`**, **`AWS_PII_OPTION`**
279
+ * **Description:** Labels for PII detection model options displayed in the UI. Customize the display names for "Only extract text (no redaction)", "Local", and "AWS Comprehend" respectively.
280
+ * **Default Values:** `"Only extract text (no redaction)"`, `"Local"`, `"AWS Comprehend"`
281
+
282
  * **`SHOW_LOCAL_PII_DETECTION_OPTIONS`** / **`SHOW_AWS_PII_DETECTION_OPTIONS`**
283
  * **Description:** Controls whether local or AWS (Comprehend) PII detection options are shown in the UI.
284
  * **Default Value:** `"True"` for both.
 
325
 
326
  * **`HYBRID_OCR_CONFIDENCE_THRESHOLD`**
327
  * **Description:** In "hybrid-paddle" mode, this is the Tesseract confidence score below which PaddleOCR will be used for re-extraction.
328
+ * **Default Value:** `80`
329
 
330
  * **`HYBRID_OCR_PADDING`**
331
  * **Description:** In "hybrid-paddle" mode, padding added to the word's bounding box before re-extraction.
 
349
 
350
  * **`PREPROCESS_LOCAL_OCR_IMAGES`**
351
  * **Description:** If set to `"True"`, images will be preprocessed before local OCR. Can slow down processing.
352
+ * **Default Value:** `"True"`
353
+
354
+ * **`SAVE_PREPROCESS_IMAGES`**
355
+ * **Description:** If set to `"True"`, saves the preprocessed images for debugging purposes.
356
+ * **Default Value:** `"False"`
357
+
358
+ * **`SHOW_PADDLE_MODEL_OPTIONS`**
359
+ * **Description:** If set to `"True"`, allows the user to select PaddleOCR-related options (paddle, hybrid-paddle) from the UI.
360
+ * **Default Value:** `"False"`
361
+
362
+ * **`MODEL_CACHE_PATH`**
363
+ * **Description:** Path to the directory where models are cached.
364
+ * **Default Value:** `"./model_cache"`
365
+
366
+ * **`TESSERACT_SEGMENTATION_LEVEL`**
367
+ * **Description:** Tesseract PSM (Page Segmentation Mode) level to use for OCR. Valid values are 0-13.
368
+ * **Default Value:** `11`
369
+
370
+ * **`CONVERT_LINE_TO_WORD_LEVEL`**
371
+ * **Description:** If set to `"True"`, converts PaddleOCR line-level OCR results to word-level for better precision.
372
+ * **Default Value:** `"False"`
373
+
374
+ * **`LOAD_PADDLE_AT_STARTUP`**
375
+ * **Description:** If set to `"True"`, loads the PaddleOCR model at application startup.
376
+ * **Default Value:** `"False"`
377
+
378
+ ### Vision Language Model (VLM) Options
379
+
380
+ * **`SHOW_VLM_MODEL_OPTIONS`**
381
+ * **Description:** If set to `"True"`, VLM (Vision Language Model) options will be shown in the UI.
382
+ * **Default Value:** `"False"`
383
+
384
+ * **`SELECTED_MODEL`**
385
+ * **Description:** Selected vision model for OCR. Choose from: `"Nanonets-OCR2-3B"`, `"Dots.OCR"`, `"Qwen3-VL-2B-Instruct"`, `"Qwen3-VL-4B-Instruct"`, `"PaddleOCR-VL"`.
386
+ * **Default Value:** `"Dots.OCR"`
387
+
388
+ * **`MAX_SPACES_GPU_RUN_TIME`**
389
+ * **Description:** Maximum number of seconds to run the GPU on Spaces (Hugging Face Spaces).
390
+ * **Default Value:** `60`
391
+
392
+ * **`MAX_NEW_TOKENS`**
393
+ * **Description:** Maximum number of tokens to generate for VLM responses.
394
+ * **Default Value:** `30`
395
+
396
+ * **`DEFAULT_MAX_NEW_TOKENS`**
397
+ * **Description:** Default maximum number of tokens to generate for VLM responses.
398
+ * **Default Value:** `30`
399
+
400
+ * **`MAX_INPUT_TOKEN_LENGTH`**
401
+ * **Description:** Maximum number of tokens that can be input to the VLM.
402
+ * **Default Value:** `4096`
403
+
404
+ * **`VLM_MAX_IMAGE_SIZE`**
405
+ * **Description:** Maximum total pixels (width * height) for images passed to VLM. Images with more pixels will be resized while maintaining aspect ratio.
406
+ * **Default Value:** `1000000` (1000x1000)
407
+
408
+ * **`VLM_MAX_DPI`**
409
+ * **Description:** Maximum DPI for images passed to VLM. Images with higher DPI will be resized accordingly.
410
+ * **Default Value:** `300.0`
411
+
412
+ * **`USE_FLASH_ATTENTION`**
413
+ * **Description:** If set to `"True"`, uses flash attention for the VLM, which can improve performance.
414
+ * **Default Value:** `"False"`
415
+
416
+ * **`OVERWRITE_EXISTING_OCR_RESULTS`**
417
+ * **Description:** If set to `"True"`, always creates new OCR results instead of loading from existing JSON files.
418
+ * **Default Value:** `"False"`
419
+
420
+ * **`SAVE_VLM_INPUT_IMAGES`**
421
+ * **Description:** If set to `"True"`, saves input images sent to VLM OCR for debugging purposes.
422
  * **Default Value:** `"False"`
423
 
424
  ### Entity and Search Options
 
453
  * **Description:** The default options selected for Textract's handwriting and signature detection.
454
  * **Default Value:** `['Extract handwriting']`
455
 
456
+ * **`HANDWRITE_SIGNATURE_TEXTBOX_FULL_OPTIONS`**
457
+ * **Description:** Full list of available options for Textract's handwriting and signature detection. Can include `'Extract handwriting'`, `'Extract signatures'`, and optionally `'Extract forms'`, `'Extract layout'`, `'Extract tables'` if the corresponding include options are enabled.
458
+ * **Default Value:** `['Extract handwriting', 'Extract signatures']`
459
+
460
  * **`INCLUDE_FORM_EXTRACTION_TEXTRACT_OPTION`**
461
  * **`INCLUDE_LAYOUT_EXTRACTION_TEXTRACT_OPTION`**
462
  * **`INCLUDE_TABLE_EXTRACTION_TEXTRACT_OPTION`**
 
571
  * **Description:** Path to the input file and output directory for the task.
572
  * **Default Values:** `''`, `output/`
573
 
574
+ * **`DIRECT_MODE_DUPLICATE_TYPE`**
575
+ * **Description:** Type of duplicate detection for direct mode: `'pages'` or `'tabular'`.
576
+ * **Default Value:** `'pages'`
577
+
578
+ * **`DIRECT_MODE_LANGUAGE`**
579
+ * **Description:** Language for document processing in direct mode.
580
+ * **Default Value:** Inherits from `DEFAULT_LANGUAGE`
581
+
582
+ * **`DIRECT_MODE_PII_DETECTOR`**
583
+ * **Description:** PII detection method for direct mode.
584
+ * **Default Value:** Inherits from `LOCAL_PII_OPTION`
585
+
586
+ * **`DIRECT_MODE_OCR_METHOD`**
587
+ * **Description:** OCR method for PDF/image processing in direct mode.
588
+ * **Default Value:** `"Local OCR"`
589
+
590
+ * **`DIRECT_MODE_PAGE_MIN`** / **`DIRECT_MODE_PAGE_MAX`**
591
+ * **Description:** First and last page to process in direct mode. `0` for max means process all pages.
592
+ * **Default Values:** Inherit from `DEFAULT_PAGE_MIN` / `DEFAULT_PAGE_MAX`
593
+
594
+ * **`DIRECT_MODE_IMAGES_DPI`**
595
+ * **Description:** DPI for image processing in direct mode.
596
+ * **Default Value:** Inherits from `IMAGES_DPI`
597
+
598
+ * **`DIRECT_MODE_CHOSEN_LOCAL_OCR_MODEL`**
599
+ * **Description:** Local OCR model choice for direct mode.
600
+ * **Default Value:** Inherits from `CHOSEN_LOCAL_OCR_MODEL`
601
+
602
+ * **`DIRECT_MODE_PREPROCESS_LOCAL_OCR_IMAGES`**
603
+ * **Description:** If set to `"True"`, preprocesses images before OCR in direct mode.
604
+ * **Default Value:** Inherits from `PREPROCESS_LOCAL_OCR_IMAGES`
605
+
606
+ * **`DIRECT_MODE_COMPRESS_REDACTED_PDF`**
607
+ * **Description:** If set to `"True"`, compresses the redacted PDF output in direct mode.
608
+ * **Default Value:** Inherits from `COMPRESS_REDACTED_PDF`
609
+
610
+ * **`DIRECT_MODE_RETURN_PDF_END_OF_REDACTION`**
611
+ * **Description:** If set to `"True"`, returns a PDF at the end of redaction in direct mode.
612
+ * **Default Value:** Inherits from `RETURN_REDACTED_PDF`
613
+
614
+ * **`DIRECT_MODE_EXTRACT_FORMS`**
615
+ * **Description:** If set to `"True"`, extracts forms during Textract analysis in direct mode.
616
+ * **Default Value:** `"False"`
617
+
618
+ * **`DIRECT_MODE_EXTRACT_TABLES`**
619
+ * **Description:** If set to `"True"`, extracts tables during Textract analysis in direct mode.
620
+ * **Default Value:** `"False"`
621
+
622
+ * **`DIRECT_MODE_EXTRACT_LAYOUT`**
623
+ * **Description:** If set to `"True"`, extracts layout during Textract analysis in direct mode.
624
+ * **Default Value:** `"False"`
625
+
626
+ * **`DIRECT_MODE_EXTRACT_SIGNATURES`**
627
+ * **Description:** If set to `"True"`, extracts signatures during Textract analysis in direct mode.
628
+ * **Default Value:** `"False"`
629
+
630
+ * **`DIRECT_MODE_MATCH_FUZZY_WHOLE_PHRASE_BOOL`**
631
+ * **Description:** If set to `"True"`, matches fuzzy whole phrases in direct mode.
632
+ * **Default Value:** `"True"`
633
+
634
+ * **`DIRECT_MODE_ANON_STRATEGY`**
635
+ * **Description:** Anonymisation strategy for tabular data in direct mode.
636
+ * **Default Value:** Inherits from `DEFAULT_TABULAR_ANONYMISATION_STRATEGY`
637
+
638
+ * **`DIRECT_MODE_FUZZY_MISTAKES`**
639
+ * **Description:** Number of fuzzy spelling mistakes allowed in direct mode.
640
+ * **Default Value:** Inherits from `DEFAULT_FUZZY_SPELLING_MISTAKES_NUM`
641
+
642
+ * **`DIRECT_MODE_SIMILARITY_THRESHOLD`**
643
+ * **Description:** Similarity threshold for duplicate detection in direct mode.
644
+ * **Default Value:** Inherits from `DEFAULT_DUPLICATE_DETECTION_THRESHOLD`
645
+
646
+ * **`DIRECT_MODE_MIN_WORD_COUNT`**
647
+ * **Description:** Minimum word count for duplicate detection in direct mode.
648
+ * **Default Value:** Inherits from `DEFAULT_MIN_WORD_COUNT`
649
+
650
+ * **`DIRECT_MODE_MIN_CONSECUTIVE_PAGES`**
651
+ * **Description:** Minimum consecutive pages for duplicate detection in direct mode.
652
+ * **Default Value:** Inherits from `DEFAULT_MIN_CONSECUTIVE_PAGES`
653
+
654
+ * **`DIRECT_MODE_GREEDY_MATCH`**
655
+ * **Description:** If set to `"True"`, uses greedy matching for duplicate detection in direct mode.
656
+ * **Default Value:** Inherits from `USE_GREEDY_DUPLICATE_DETECTION`
657
+
658
+ * **`DIRECT_MODE_COMBINE_PAGES`**
659
+ * **Description:** If set to `"True"`, combines pages for duplicate detection in direct mode.
660
+ * **Default Value:** Inherits from `DEFAULT_COMBINE_PAGES`
661
+
662
+ * **`DIRECT_MODE_REMOVE_DUPLICATE_ROWS`**
663
+ * **Description:** If set to `"True"`, removes duplicate rows in tabular data in direct mode.
664
+ * **Default Value:** Inherits from `REMOVE_DUPLICATE_ROWS`
665
+
666
+ * **`DIRECT_MODE_TEXTRACT_ACTION`**
667
+ * **Description:** Textract action for batch operations in direct mode.
668
+ * **Default Value:** `''`
669
+
670
+ * **`DIRECT_MODE_JOB_ID`**
671
+ * **Description:** Job ID for Textract operations in direct mode.
672
+ * **Default Value:** `''`
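Several of the direct mode settings above are described as inheriting from a general default. A minimal sketch of that fallback pattern follows; it is an illustration, not the app's actual `config.py`. The helper names `get_setting` and `resolve_direct_mode` are hypothetical, and the fallback values `"en"` and `"0"` are assumptions made only so the example runs.

```python
import os


def get_setting(name, default, env=None):
    """Return env[name] if it is set, otherwise the supplied default."""
    env = os.environ if env is None else env
    return env.get(name, default)


def resolve_direct_mode(env=None):
    """Resolve two direct mode settings, falling back to the general defaults."""
    # Assumed fallback values ("en", "0"); the real defaults live in the app config.
    default_language = get_setting("DEFAULT_LANGUAGE", "en", env)
    default_page_max = get_setting("DEFAULT_PAGE_MAX", "0", env)

    language = get_setting("DIRECT_MODE_LANGUAGE", default_language, env)
    page_max = int(get_setting("DIRECT_MODE_PAGE_MAX", default_page_max, env))

    return {
        "language": language,
        "page_max": page_max,
        # A page_max of 0 means "process all pages".
        "process_all_pages": page_max == 0,
    }


if __name__ == "__main__":
    print(resolve_direct_mode({"DEFAULT_LANGUAGE": "fr"}))
```

With this pattern, setting only the general variable changes both modes, while setting the `DIRECT_MODE_*` variable overrides it for direct mode alone.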
673
 
674
  ### Lambda Configuration
675
 
src/user_guide.qmd CHANGED
@@ -150,7 +150,17 @@ If you have used the AWS Textract option for extracting text, you may also see a
150
 
151
  ![Document upload alongside Textract](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/document_upload_with_textract.PNG)
152
 
153
- Similarly, if you have used the 'Local OCR method' to extract text, you may see a '..._ocr_results_with_words.json' file. This file works in the same way as the AWS Textract .json results described above, and can be uploaded alongside an input document to save time on text extraction in future in the same way.
 
 
 
 
 
 
 
 
 
 
154
 
155
  ### Downloading output files from previous redaction tasks
156
 
@@ -460,6 +470,7 @@ You can also write open text into an input box and redact that using the same me
460
  ### Redaction log outputs
461
  A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
462
 
 
463
  ## Identifying and redacting duplicate pages
464
 
465
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
@@ -690,45 +701,91 @@ AWS_SECRET_KEY= your-secret-key
690
 
691
  The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
692
 
693
- Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
694
-
695
- ## Advanced OCR options (Hybrid OCR)
696
 
697
- The app supports advanced OCR options that combine multiple OCR engines for improved accuracy. These options are not enabled by default but can be configured by your system administrator.
698
 
699
  ### Available OCR models
700
 
701
- - **Tesseract** (default): The standard OCR engine that works well for most documents
702
- - **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise
703
- - **Hybrid**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text
 
 
704
 
705
  ### Enabling advanced OCR options
706
 
707
- To enable these options, your system administrator needs to modify the configuration file (`config.py`) and set:
708
 
 
709
  ```
710
  SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
711
  ```
712
 
713
- Once enabled, users will see a "Change default local OCR model" section in the redaction settings where they can choose between:
714
- - tesseract
715
- - hybrid
716
- - paddle
 
 
 
 
 
 
 
717
 
718
- ### Hybrid OCR configuration
719
 
720
- The hybrid OCR mode uses several configurable parameters:
721
 
722
- - **HYBRID_OCR_CONFIDENCE_THRESHOLD** (default: 65): Tesseract confidence score below which PaddleOCR will be used for re-extraction
723
- - **HYBRID_OCR_PADDING** (default: 1): Padding added to word bounding boxes before re-extraction
724
- - **SAVE_EXAMPLE_HYBRID_IMAGES** (default: False): Save comparison images when using hybrid mode
725
- - **SAVE_PAGE_OCR_VISUALISATIONS** (default: False): Save images with PaddleOCR bounding boxes overlaid
726
 
727
  ### When to use different OCR models
728
 
729
- - **Tesseract**: Best for general use, good balance of speed and accuracy
730
- - **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision
731
- - **Hybrid**: Best for challenging documents where some text has low confidence scores, providing the benefits of both engines
 
 
732
 
733
 
734
 
@@ -843,18 +900,65 @@ python cli_redact.py --task textract --textract_action list
843
 
844
  ### Common CLI options
845
 
 
 
846
  - `--task`: Choose between "redact", "deduplicate", or "textract"
847
- - `--input_file`: Path to input file(s)
848
  - `--output_dir`: Directory for output files (default: output/)
849
- - `--page_min` / `--page_max`: Process only specific page range
850
- - `--ocr_method`: Choose text extraction method
851
- - `--pii_detector`: Choose PII detection method
852
- - `--local_redact_entities`: Specify local entities to redact
853
- - `--allow_list_file` / `--deny_list_file`: Custom lists
854
- - `--redact_whole_page_file`: List of pages to redact completely
855
- - `--fuzzy_mistakes`: Number of spelling mistakes allowed in fuzzy matching
856
- - `--similarity_threshold`: Threshold for duplicate detection
857
- - `--anon_strategy`: Anonymization strategy for tabular data
858
 
859
  ### Output files
860
 
 
150
 
151
  ![Document upload alongside Textract](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/document_upload_with_textract.PNG)
152
 
153
+ #### Additional outputs in the log file outputs
154
+
155
+ On the Redaction settings tab, near the bottom of the page, there is a section called 'Log file outputs'. This section contains the following files:
156
+
157
+ You may see a '..._ocr_results_with_words... .json' file. This file works in the same way as the AWS Textract .json results described above, and can be uploaded alongside an input document to save time on text extraction in future in the same way.
158
+
159
+ You will also see a 'decision_process_table.csv' file. This file contains a table of the decisions made by the app for each page of the document. This can be useful for debugging and understanding the decisions made by the app.
160
+
161
+ Additionally, if the option is enabled by your system administrator, you may see on this tab an image of the output from the OCR model used to extract the text from the document. The image file name ends with the page number and '_visualisations.jpg'. A separate image is created for each page of the document, like the one below. This can be useful for seeing at a glance whether the text extraction process for a page was successful, and whether word-level bounding boxes are correctly positioned.
162
+
163
+ ![Text analysis output](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/example_complaint_letter_1_textract_visualisations.jpg)
164
 
165
  ### Downloading output files from previous redaction tasks
166
 
 
470
  ### Redaction log outputs
471
  A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
472
 
473
+
474
  ## Identifying and redacting duplicate pages
475
 
476
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
 
701
 
702
  The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
703
 
704
+ ## Advanced OCR options
 
 
705
 
706
+ The app supports advanced OCR options, including modes that combine multiple OCR engines for improved accuracy. These options are not enabled by default, but can be configured by editing the app_config.env file in your '/config' folder or by setting system environment variables.
707
 
708
  ### Available OCR models
709
 
710
+ - **Tesseract** (default): The standard OCR engine that works well for most documents. Provides good word-level bounding box accuracy.
711
+ - **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise. Best for documents with clear, well-formatted text.
712
+ - **Hybrid-paddle**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text regions.
713
+ - **Hybrid-vlm**: Combines Tesseract with Vision Language Models (VLM) - uses Tesseract for initial extraction, then a VLM model (default: Dots.OCR) for re-extraction of low-confidence text.
714
+ - **Hybrid-paddle-vlm**: Combines PaddleOCR with Vision Language Models - uses PaddleOCR first, then a VLM model for low-confidence regions.
715
 
716
  ### Enabling advanced OCR options
717
 
718
+ To enable these options, you need to modify the app_config.env file in your '/config' folder and set the following environment variables:
719
 
720
+ **Basic OCR model selection:**
721
  ```
722
  SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
723
  ```
724
 
725
+ **To enable PaddleOCR options (paddle, hybrid-paddle):**
726
+ ```
727
+ SHOW_PADDLE_MODEL_OPTIONS = "True"
728
+ ```
729
+
730
+ **To enable Vision Language Model options (hybrid-vlm, hybrid-paddle-vlm):**
731
+ ```
732
+ SHOW_VLM_MODEL_OPTIONS = "True"
733
+ ```
734
+
735
+ Once enabled, users will see a "Change default local OCR model" section in the redaction settings where they can choose between the available models based on what has been enabled.
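The way the three flags above gate the model choices can be sketched as follows. This is an illustration under stated assumptions, not the app's own code: the function names `is_true` and `available_ocr_models` are hypothetical, and the grouping of models under each flag simply follows the headings above.

```python
def is_true(value):
    """Interpret the string flags used in app_config.env, e.g. "True" or True."""
    return str(value).strip().strip('"').lower() == "true"


def available_ocr_models(env):
    """Return the local OCR models a user could pick, given the config flags."""
    models = ["tesseract"]  # Tesseract is the default and is always available.

    # Without the basic switch, no model selector is shown at all.
    if not is_true(env.get("SHOW_LOCAL_OCR_MODEL_OPTIONS", "False")):
        return models

    if is_true(env.get("SHOW_PADDLE_MODEL_OPTIONS", "False")):
        models += ["paddle", "hybrid-paddle"]

    if is_true(env.get("SHOW_VLM_MODEL_OPTIONS", "False")):
        models += ["hybrid-vlm", "hybrid-paddle-vlm"]

    return models


if __name__ == "__main__":
    config = {
        "SHOW_LOCAL_OCR_MODEL_OPTIONS": "True",
        "SHOW_PADDLE_MODEL_OPTIONS": "True",
    }
    print(available_ocr_models(config))
```

Whether `hybrid-paddle-vlm` also needs the PaddleOCR flag in the real app is not stated here, so the sketch keeps it under the VLM flag only.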
736
 
737
+ ### OCR configuration parameters
738
 
739
+ The following parameters can be configured by your system administrator to fine-tune OCR behavior:
740
 
741
+ #### Hybrid OCR settings
742
+
743
+ - **HYBRID_OCR_CONFIDENCE_THRESHOLD** (default: 80): Tesseract confidence score below which the secondary OCR engine (PaddleOCR or VLM) will be used for re-extraction. Higher values mean more text will be re-extracted.
744
+ - **HYBRID_OCR_PADDING** (default: 1): Padding (in pixels) added to word bounding boxes before re-extraction with the secondary engine.
745
+ - **SAVE_EXAMPLE_HYBRID_IMAGES** (default: False): If enabled, saves comparison images showing Tesseract vs. secondary engine results when using hybrid modes.
746
+ - **SAVE_PAGE_OCR_VISUALISATIONS** (default: False): If enabled, saves images with detected bounding boxes overlaid for debugging purposes.
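To make the threshold and padding settings concrete, here is a rough sketch of how a hybrid mode could decide which words to send to the secondary engine. It is not the app's implementation: the word dictionaries, the `conf` key, and the function names are hypothetical, and clamping the padded box to the image bounds is an assumption.

```python
def words_to_reextract(words, threshold=80):
    """Select words whose first-pass confidence is below the threshold."""
    return [w for w in words if w["conf"] < threshold]


def pad_box(box, padding, image_width, image_height):
    """Grow a (left, top, right, bottom) box by `padding` pixels, clamped to the image."""
    left, top, right, bottom = box
    return (
        max(0, left - padding),
        max(0, top - padding),
        min(image_width, right + padding),
        min(image_height, bottom + padding),
    )


if __name__ == "__main__":
    first_pass = [
        {"text": "Invoice", "conf": 96, "box": (10, 20, 50, 40)},
        {"text": "t0tal", "conf": 54, "box": (60, 20, 90, 40)},
        {"text": "due", "conf": 80, "box": (95, 20, 100, 40)},
    ]
    for word in words_to_reextract(first_pass, threshold=80):
        crop = pad_box(word["box"], 1, image_width=100, image_height=100)
        print(word["text"], "-> re-extract region", crop)
```

Because the comparison is "below the threshold", raising the threshold sends more words to the slower secondary engine, and lowering it sends fewer.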
747
+
748
+ #### Tesseract settings
749
+
750
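+ - **TESSERACT_SEGMENTATION_LEVEL** (default: 11): Tesseract PSM (Page Segmentation Mode) value. Valid values are 0-13. Each value selects a different segmentation mode rather than a level of detail; the default of 11 is Tesseract's 'sparse text' mode, which finds as much text as possible in no particular order.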
751
+
752
+ #### PaddleOCR settings
753
+
754
+ - **PADDLE_USE_TEXTLINE_ORIENTATION** (default: False): If enabled, PaddleOCR will detect and correct text line orientation.
755
+ - **PADDLE_DET_DB_UNCLIP_RATIO** (default: 1.2): Controls the expansion ratio of detected text regions. Higher values expand the detection area more.
756
+ - **CONVERT_LINE_TO_WORD_LEVEL** (default: False): If enabled, converts PaddleOCR line-level results to word-level results to give more precise bounding boxes. The conversion is approximate, so word boxes may not be exact.
757
+ - **LOAD_PADDLE_AT_STARTUP** (default: False): If enabled, loads the PaddleOCR model when the application starts, reducing latency for first use but increasing startup time.
758
+
759
+ #### Image preprocessing
760
+
761
+ - **PREPROCESS_LOCAL_OCR_IMAGES** (default: True): If enabled, images are preprocessed before OCR. This can improve accuracy but may slow down processing.
762
+ - **SAVE_PREPROCESS_IMAGES** (default: False): If enabled, saves the preprocessed images for debugging purposes.
763
+
764
+ #### Vision Language Model (VLM) settings
765
+
766
+ When VLM options are enabled, the following settings are available:
767
+
768
+ - **SELECTED_MODEL** (default: "Dots.OCR"): The VLM model to use. Options include: "Nanonets-OCR2-3B", "Dots.OCR", "Qwen3-VL-2B-Instruct", "Qwen3-VL-4B-Instruct", "PaddleOCR-VL".
769
+ - **MAX_SPACES_GPU_RUN_TIME** (default: 60): Maximum seconds to run GPU operations on Hugging Face Spaces.
770
+ - **MAX_NEW_TOKENS** (default: 30): Maximum number of tokens to generate for VLM responses.
771
+ - **MAX_INPUT_TOKEN_LENGTH** (default: 4096): Maximum number of tokens that can be input to the VLM.
772
+ - **VLM_MAX_IMAGE_SIZE** (default: 1000000): Maximum total pixels (width × height) for images. Larger images are resized while maintaining aspect ratio.
773
+ - **VLM_MAX_DPI** (default: 300.0): Maximum DPI for images. Higher DPI images are resized accordingly.
774
+ - **USE_FLASH_ATTENTION** (default: False): If enabled, uses flash attention for improved VLM performance.
775
+ - **SAVE_VLM_INPUT_IMAGES** (default: False): If enabled, saves input images sent to VLM for debugging.
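The `VLM_MAX_IMAGE_SIZE` limit is a cap on total pixels, so the resize factor is the square root of the ratio between the cap and the current pixel count. A small sketch of that arithmetic follows; the function name `fit_to_max_pixels` is hypothetical and the rounding choice (rounding down) is an assumption, so the app's own resizing may differ slightly.

```python
import math


def fit_to_max_pixels(width, height, max_pixels=1_000_000):
    """Return (width, height) scaled down so that width * height <= max_pixels.

    Both sides are scaled by the same factor, so the aspect ratio is kept
    (up to rounding). Images already within the limit are returned unchanged.
    """
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    # Round down so the result never exceeds the pixel budget.
    return max(1, int(width * scale)), max(1, int(height * scale))


if __name__ == "__main__":
    for size in [(800, 600), (2000, 2000), (4000, 2000)]:
        print(size, "->", fit_to_max_pixels(*size))
```

For example, a 2000 x 2000 page image has four times the default budget, so each side is halved.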
776
+
777
+ #### General settings
778
+
779
+ - **MODEL_CACHE_PATH** (default: "./model_cache"): Directory where OCR models are cached.
780
+ - **OVERWRITE_EXISTING_OCR_RESULTS** (default: False): If enabled, always creates new OCR results instead of loading from existing JSON files.
781
 
782
  ### When to use different OCR models
783
 
784
+ - **Tesseract**: Best for general use, providing a good balance of speed and accuracy with precise word-level bounding boxes.
785
+ - **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision.
786
+ - **Hybrid-paddle**: Best for challenging documents where some text has low confidence scores, combining Tesseract's word-level precision with PaddleOCR's improved text recognition.
787
+ - **Hybrid-vlm**: Best for very challenging documents with poor image quality or unusual text layouts, leveraging advanced vision models for difficult text.
788
+ - **Hybrid-paddle-vlm**: Most comprehensive option, combining PaddleOCR's line-level detection with a VLM's advanced recognition capabilities.
789
 
790
 
791
 
 
900
 
901
  ### Common CLI options
902
 
903
+ #### General options
904
+
905
  - `--task`: Choose between "redact", "deduplicate", or "textract"
906
+ - `--input_file`: Path to input file(s) - can specify multiple files separated by spaces
907
  - `--output_dir`: Directory for output files (default: output/)
908
+ - `--input_dir`: Directory for input files (default: input/)
909
+ - `--language`: Language of document content (e.g., "en", "es", "fr")
910
+ - `--username`: Username for session tracking
911
+ - `--pii_detector`: Choose PII detection method ("Local", "AWS Comprehend", or "None")
912
+ - `--local_redact_entities`: Specify local entities to redact (space-separated list)
913
+ - `--aws_redact_entities`: Specify AWS Comprehend entities to redact (space-separated list)
914
+ - `--aws_access_key` / `--aws_secret_key`: AWS credentials for cloud services
915
+ - `--aws_region`: AWS region for cloud services
916
+ - `--s3_bucket`: S3 bucket name for cloud operations
917
+ - `--cost_code`: Cost code for tracking usage
918
+
919
+ #### PDF/Image redaction options
920
+
921
+ - `--ocr_method`: Choose text extraction method ("AWS Textract", "Local OCR", or "Local text")
922
+ - `--chosen_local_ocr_model`: Local OCR model to use (e.g., "tesseract", "paddle", "hybrid-paddle", "hybrid-vlm")
923
+ - `--page_min` / `--page_max`: Process only specific page range (0 for max means all pages)
924
+ - `--images_dpi`: DPI for image processing (default: 300.0)
925
+ - `--preprocess_local_ocr_images`: Preprocess images before OCR (True/False)
926
+ - `--compress_redacted_pdf`: Compress the final redacted PDF (True/False)
927
+ - `--return_pdf_end_of_redaction`: Return PDF at end of redaction process (True/False)
928
+ - `--allow_list_file` / `--deny_list_file`: Paths to custom allow/deny list CSV files
929
+ - `--redact_whole_page_file`: Path to CSV file listing pages to redact completely
930
+ - `--handwrite_signature_extraction`: Handwriting and signature extraction options for Textract ("Extract handwriting", "Extract signatures")
931
+ - `--extract_forms`: Extract forms during Textract analysis (flag)
932
+ - `--extract_tables`: Extract tables during Textract analysis (flag)
933
+ - `--extract_layout`: Extract layout during Textract analysis (flag)
934
+
935
+ #### Tabular/Word anonymization options
936
+
937
+ - `--anon_strategy`: Anonymization strategy (e.g., "redact", "redact completely", "replace_redacted", "encrypt", "hash")
938
+ - `--text_columns`: List of column names to anonymize (space-separated)
939
+ - `--excel_sheets`: Specific Excel sheet names to process (space-separated)
940
+ - `--fuzzy_mistakes`: Number of spelling mistakes allowed in fuzzy matching (default: 1)
941
+ - `--match_fuzzy_whole_phrase_bool`: Match fuzzy whole phrase (True/False)
942
+ - `--do_initial_clean`: Perform initial text cleaning for tabular data (True/False)
943
+
944
+ #### Duplicate detection options
945
+
946
+ - `--duplicate_type`: Type of duplicate detection ("pages" for OCR files or "tabular" for CSV/Excel)
947
+ - `--similarity_threshold`: Similarity threshold (0-1) to consider content as duplicates (default: 0.95)
948
+ - `--min_word_count`: Minimum word count for text to be considered (default: 10)
949
+ - `--min_consecutive_pages`: Minimum number of consecutive pages to consider as a match (default: 1)
950
+ - `--greedy_match`: Use greedy matching strategy for consecutive pages (True/False)
951
+ - `--combine_pages`: Combine text from same page number within a file (True/False)
952
+ - `--remove_duplicate_rows`: Remove duplicate rows from output (True/False)
953
+
954
+ #### Textract batch operations options
955
+
956
+ - `--textract_action`: Action to perform ("submit", "retrieve", or "list")
957
+ - `--job_id`: Textract job ID for retrieve action
958
+ - `--extract_signatures`: Extract signatures during Textract analysis (flag)
959
+ - `--textract_bucket`: S3 bucket name for Textract operations
960
+ - `--poll_interval`: Polling interval in seconds for job status (default: 30)
961
+ - `--max_poll_attempts`: Maximum polling attempts before timeout (default: 120)
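As an illustration of how the options above combine, the sketch below assembles a redaction command as an argument list and prints it as a shell-safe string. Only flags listed above are used. The input path `example_data/example.pdf` and the helper name `build_redact_command` are hypothetical, and the command is only built, not executed.

```python
import shlex


def build_redact_command(
    input_file,
    output_dir="output/",
    ocr_method="Local OCR",
    local_ocr_model=None,
    page_min=None,
    page_max=None,
):
    """Build an argument list for a cli_redact.py redaction run."""
    cmd = [
        "python", "cli_redact.py",
        "--task", "redact",
        "--input_file", input_file,
        "--output_dir", output_dir,
        "--ocr_method", ocr_method,
    ]
    # Optional flags are added only when a value is supplied.
    if local_ocr_model is not None:
        cmd += ["--chosen_local_ocr_model", local_ocr_model]
    if page_min is not None:
        cmd += ["--page_min", str(page_min)]
    if page_max is not None:
        cmd += ["--page_max", str(page_max)]
    return cmd


if __name__ == "__main__":
    command = build_redact_command(
        "example_data/example.pdf",  # hypothetical input path
        local_ocr_model="hybrid-paddle",
        page_min=1,
        page_max=3,
    )
    # shlex.join quotes values containing spaces, such as "Local OCR".
    print(shlex.join(command))
```

Passing the list to `subprocess.run` would avoid shell quoting problems with values such as "Local OCR" or "AWS Textract".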
962
 
963
  ### Output files
964