Commit
·
d5b5291
1
Parent(s):
7bb945f
Minor update to cli_redact for new local OCR model options. Updated app_settings.qmd, user_guide.qmd, and readme.md with descriptions of new features
Browse files- README.md +137 -35
- cli_redact.py +2 -1
- pyproject.toml +1 -1
- src/app_settings.qmd +192 -6
- src/user_guide.qmd +136 -32
README.md
CHANGED
|
@@ -10,7 +10,7 @@ license: agpl-3.0
|
|
| 10 |
---
|
| 11 |
# Document redaction
|
| 12 |
|
| 13 |
-
version: 1.5.
|
| 14 |
|
| 15 |
Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
|
| 16 |
|
|
@@ -249,7 +249,6 @@ Now you have the app installed, what follows is a guide on how to use it for bas
|
|
| 249 |
- [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
|
| 250 |
|
| 251 |
### Advanced user guide
|
| 252 |
-
- [Advanced user guide](#advanced-user-guide)
|
| 253 |
- [Fuzzy search and redaction](#fuzzy-search-and-redaction)
|
| 254 |
- [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
|
| 255 |
- [Using _for_review.pdf files with Adobe Acrobat](#using-_for_reviewpdf-files-with-adobe-acrobat)
|
|
@@ -261,7 +260,6 @@ Now you have the app installed, what follows is a guide on how to use it for bas
|
|
| 261 |
- [Merging redaction review files](#merging-redaction-review-files)
|
| 262 |
|
| 263 |
### Features for expert users/system administrators
|
| 264 |
-
- [Features for expert users/system administrators](#features-for-expert-userssystem-administrators)
|
| 265 |
- [Advanced OCR options (Hybrid OCR)](#advanced-ocr-options-hybrid-ocr)
|
| 266 |
- [Command Line Interface (CLI)](#command-line-interface-cli)
|
| 267 |
|
|
@@ -376,7 +374,17 @@ If you have used the AWS Textract option for extracting text, you may also see a
|
|
| 376 |
|
| 377 |

|
| 378 |
|
| 379 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 380 |
|
| 381 |
### Downloading output files from previous redaction tasks
|
| 382 |
|
|
@@ -686,6 +694,7 @@ You can also write open text into an input box and redact that using the same me
|
|
| 686 |
### Redaction log outputs
|
| 687 |
A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
|
| 688 |
|
|
|
|
| 689 |
## Identifying and redacting duplicate pages
|
| 690 |
|
| 691 |
The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
|
|
@@ -916,45 +925,91 @@ AWS_SECRET_KEY= your-secret-key
|
|
| 916 |
|
| 917 |
The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
|
| 918 |
|
| 919 |
-
|
| 920 |
-
|
| 921 |
-
## Advanced OCR options (Hybrid OCR)
|
| 922 |
|
| 923 |
-
The app supports advanced OCR options that combine multiple OCR engines for improved accuracy. These options are not enabled by default but can be configured by your system
|
| 924 |
|
| 925 |
### Available OCR models
|
| 926 |
|
| 927 |
-
- **Tesseract** (default): The standard OCR engine that works well for most documents
|
| 928 |
-
- **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise
|
| 929 |
-
- **Hybrid**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text
|
|
|
|
|
|
|
| 930 |
|
| 931 |
### Enabling advanced OCR options
|
| 932 |
|
| 933 |
-
To enable these options,
|
| 934 |
|
|
|
|
| 935 |
```
|
| 936 |
SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
|
| 937 |
```
|
| 938 |
|
| 939 |
-
|
| 940 |
-
|
| 941 |
-
|
| 942 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 943 |
|
| 944 |
-
###
|
| 945 |
|
| 946 |
-
The
|
| 947 |
|
| 948 |
-
|
| 949 |
-
|
| 950 |
-
- **
|
| 951 |
-
- **
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 952 |
|
| 953 |
### When to use different OCR models
|
| 954 |
|
| 955 |
-
- **Tesseract**: Best for general use, good balance of speed and accuracy
|
| 956 |
-
- **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision
|
| 957 |
-
- **Hybrid**: Best for challenging documents where some text has low confidence scores,
|
|
|
|
|
|
|
| 958 |
|
| 959 |
|
| 960 |
|
|
@@ -1069,18 +1124,65 @@ python cli_redact.py --task textract --textract_action list
|
|
| 1069 |
|
| 1070 |
### Common CLI options
|
| 1071 |
|
|
|
|
|
|
|
| 1072 |
- `--task`: Choose between "redact", "deduplicate", or "textract"
|
| 1073 |
-
- `--input_file`: Path to input file(s)
|
| 1074 |
- `--output_dir`: Directory for output files (default: output/)
|
| 1075 |
-
- `--
|
| 1076 |
-
- `--
|
| 1077 |
-
- `--
|
| 1078 |
-
- `--
|
| 1079 |
-
- `--
|
| 1080 |
-
- `--
|
| 1081 |
-
- `--
|
| 1082 |
-
- `--
|
| 1083 |
-
- `--
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1084 |
|
| 1085 |
### Output files
|
| 1086 |
|
|
|
|
| 10 |
---
|
| 11 |
# Document redaction
|
| 12 |
|
| 13 |
+
version: 1.5.1
|
| 14 |
|
| 15 |
Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
|
| 16 |
|
|
|
|
| 249 |
- [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
|
| 250 |
|
| 251 |
### Advanced user guide
|
|
|
|
| 252 |
- [Fuzzy search and redaction](#fuzzy-search-and-redaction)
|
| 253 |
- [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
|
| 254 |
- [Using _for_review.pdf files with Adobe Acrobat](#using-_for_reviewpdf-files-with-adobe-acrobat)
|
|
|
|
| 260 |
- [Merging redaction review files](#merging-redaction-review-files)
|
| 261 |
|
| 262 |
### Features for expert users/system administrators
|
|
|
|
| 263 |
- [Advanced OCR options (Hybrid OCR)](#advanced-ocr-options-hybrid-ocr)
|
| 264 |
- [Command Line Interface (CLI)](#command-line-interface-cli)
|
| 265 |
|
|
|
|
| 374 |
|
| 375 |

|
| 376 |
|
| 377 |
+
#### Additional outputs in the log file outputs
|
| 378 |
+
|
| 379 |
+
On the Redaction settings tab, near the bottom of the pagethere is a section called 'Log file outputs'. This section contains the following files:
|
| 380 |
+
|
| 381 |
+
You may see a '..._ocr_results_with_words... .json' file. This file works in the same way as the AWS Textract .json results described above, and can be uploaded alongside an input document to save time on text extraction in future in the same way.
|
| 382 |
+
|
| 383 |
+
Also you will see a 'decision_process_table.csv' file. This file contains a table of the decisions made by the app for each page of the document. This can be useful for debugging and understanding the decisions made by the app.
|
| 384 |
+
|
| 385 |
+
Additionally, if the option is enabled by your system administrator, on this tab you may see an image of the output from the OCR model used to extract the text from the document, an image ending with page number and '_visualisations.jpg'. A separate image will be created for each page of the document like the one below. This can be useful for seeing at a glance whether the text extraction process for a page was successful, and whether word-level bounding boxes are correctly positioned.
|
| 386 |
+
|
| 387 |
+

|
| 388 |
|
| 389 |
### Downloading output files from previous redaction tasks
|
| 390 |
|
|
|
|
| 694 |
### Redaction log outputs
|
| 695 |
A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
|
| 696 |
|
| 697 |
+
|
| 698 |
## Identifying and redacting duplicate pages
|
| 699 |
|
| 700 |
The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
|
|
|
|
| 925 |
|
| 926 |
The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
|
| 927 |
|
| 928 |
+
## Advanced OCR options
|
|
|
|
|
|
|
| 929 |
|
| 930 |
+
The app supports advanced OCR options that combine multiple OCR engines for improved accuracy. These options are not enabled by default but can be configured by changing the app_config.env file in your '/config' folder, or system environment variables in your system.
|
| 931 |
|
| 932 |
### Available OCR models
|
| 933 |
|
| 934 |
+
- **Tesseract** (default): The standard OCR engine that works well for most documents. Provides good word-level bounding box accuracy.
|
| 935 |
+
- **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise. Best for documents with clear, well-formatted text.
|
| 936 |
+
- **Hybrid-paddle**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text regions.
|
| 937 |
+
- **Hybrid-vlm**: Combines Tesseract with Vision Language Models (VLM) - uses Tesseract for initial extraction, then a VLM model (default: Dots.OCR) for re-extraction of low-confidence text.
|
| 938 |
+
- **Hybrid-paddle-vlm**: Combines PaddleOCR with Vision Language Models - uses PaddleOCR first, then a VLM model for low-confidence regions.
|
| 939 |
|
| 940 |
### Enabling advanced OCR options
|
| 941 |
|
| 942 |
+
To enable these options, you need to modify the app_config.env file in your '/config' folder and set the following environment variables:
|
| 943 |
|
| 944 |
+
**Basic OCR model selection:**
|
| 945 |
```
|
| 946 |
SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
|
| 947 |
```
|
| 948 |
|
| 949 |
+
**To enable PaddleOCR options (paddle, hybrid-paddle):**
|
| 950 |
+
```
|
| 951 |
+
SHOW_PADDLE_MODEL_OPTIONS = "True"
|
| 952 |
+
```
|
| 953 |
+
|
| 954 |
+
**To enable Vision Language Model options (hybrid-vlm, hybrid-paddle-vlm):**
|
| 955 |
+
```
|
| 956 |
+
SHOW_VLM_MODEL_OPTIONS = "True"
|
| 957 |
+
```
|
| 958 |
+
|
| 959 |
+
Once enabled, users will see a "Change default local OCR model" section in the redaction settings where they can choose between the available models based on what has been enabled.
|
| 960 |
|
| 961 |
+
### OCR configuration parameters
|
| 962 |
|
| 963 |
+
The following parameters can be configured by your system administrator to fine-tune OCR behavior:
|
| 964 |
|
| 965 |
+
#### Hybrid OCR settings
|
| 966 |
+
|
| 967 |
+
- **HYBRID_OCR_CONFIDENCE_THRESHOLD** (default: 80): Tesseract confidence score below which the secondary OCR engine (PaddleOCR or VLM) will be used for re-extraction. Lower values mean more text will be re-extracted.
|
| 968 |
+
- **HYBRID_OCR_PADDING** (default: 1): Padding (in pixels) added to word bounding boxes before re-extraction with the secondary engine.
|
| 969 |
+
- **SAVE_EXAMPLE_HYBRID_IMAGES** (default: False): If enabled, saves comparison images showing Tesseract vs. secondary engine results when using hybrid modes.
|
| 970 |
+
- **SAVE_PAGE_OCR_VISUALISATIONS** (default: False): If enabled, saves images with detected bounding boxes overlaid for debugging purposes.
|
| 971 |
+
|
| 972 |
+
#### Tesseract settings
|
| 973 |
+
|
| 974 |
+
- **TESSERACT_SEGMENTATION_LEVEL** (default: 11): Tesseract PSM (Page Segmentation Mode) level. Valid values are 0-13. Higher values provide more detailed segmentation but may be slower.
|
| 975 |
+
|
| 976 |
+
#### PaddleOCR settings
|
| 977 |
+
|
| 978 |
+
- **PADDLE_USE_TEXTLINE_ORIENTATION** (default: False): If enabled, PaddleOCR will detect and correct text line orientation.
|
| 979 |
+
- **PADDLE_DET_DB_UNCLIP_RATIO** (default: 1.2): Controls the expansion ratio of detected text regions. Higher values expand the detection area more.
|
| 980 |
+
- **CONVERT_LINE_TO_WORD_LEVEL** (default: False): If enabled, converts PaddleOCR line-level results to word-level for better precision in bounding boxes (not perfect, but pretty good).
|
| 981 |
+
- **LOAD_PADDLE_AT_STARTUP** (default: False): If enabled, loads the PaddleOCR model when the application starts, reducing latency for first use but increasing startup time.
|
| 982 |
+
|
| 983 |
+
#### Image preprocessing
|
| 984 |
+
|
| 985 |
+
- **PREPROCESS_LOCAL_OCR_IMAGES** (default: True): If enabled, images are preprocessed before OCR. This can improve accuracy but may slow down processing.
|
| 986 |
+
- **SAVE_PREPROCESS_IMAGES** (default: False): If enabled, saves the preprocessed images for debugging purposes.
|
| 987 |
+
|
| 988 |
+
#### Vision Language Model (VLM) settings
|
| 989 |
+
|
| 990 |
+
When VLM options are enabled, the following settings are available:
|
| 991 |
+
|
| 992 |
+
- **SELECTED_MODEL** (default: "Dots.OCR"): The VLM model to use. Options include: "Nanonets-OCR2-3B", "Dots.OCR", "Qwen3-VL-2B-Instruct", "Qwen3-VL-4B-Instruct", "PaddleOCR-VL".
|
| 993 |
+
- **MAX_SPACES_GPU_RUN_TIME** (default: 60): Maximum seconds to run GPU operations on Hugging Face Spaces.
|
| 994 |
+
- **MAX_NEW_TOKENS** (default: 30): Maximum number of tokens to generate for VLM responses.
|
| 995 |
+
- **MAX_INPUT_TOKEN_LENGTH** (default: 4096): Maximum number of tokens that can be input to the VLM.
|
| 996 |
+
- **VLM_MAX_IMAGE_SIZE** (default: 1000000): Maximum total pixels (width × height) for images. Larger images are resized while maintaining aspect ratio.
|
| 997 |
+
- **VLM_MAX_DPI** (default: 300.0): Maximum DPI for images. Higher DPI images are resized accordingly.
|
| 998 |
+
- **USE_FLASH_ATTENTION** (default: False): If enabled, uses flash attention for improved VLM performance.
|
| 999 |
+
- **SAVE_VLM_INPUT_IMAGES** (default: False): If enabled, saves input images sent to VLM for debugging.
|
| 1000 |
+
|
| 1001 |
+
#### General settings
|
| 1002 |
+
|
| 1003 |
+
- **MODEL_CACHE_PATH** (default: "./model_cache"): Directory where OCR models are cached.
|
| 1004 |
+
- **OVERWRITE_EXISTING_OCR_RESULTS** (default: False): If enabled, always creates new OCR results instead of loading from existing JSON files.
|
| 1005 |
|
| 1006 |
### When to use different OCR models
|
| 1007 |
|
| 1008 |
+
- **Tesseract**: Best for general use, providing a good balance of speed and accuracy with precise word-level bounding boxes.
|
| 1009 |
+
- **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision.
|
| 1010 |
+
- **Hybrid-paddle**: Best for challenging documents where some text has low confidence scores, combining Tesseract's word-level precision with PaddleOCR's improved text recognition.
|
| 1011 |
+
- **Hybrid-vlm**: Best for very challenging documents with poor image quality or unusual text layouts, leveraging advanced vision models for difficult text.
|
| 1012 |
+
- **Hybrid-paddle-vlm**: Most comprehensive option, combining PaddleOCR's line-level detection with a VLM's advanced recognition capabilities.
|
| 1013 |
|
| 1014 |
|
| 1015 |
|
|
|
|
| 1124 |
|
| 1125 |
### Common CLI options
|
| 1126 |
|
| 1127 |
+
#### General options
|
| 1128 |
+
|
| 1129 |
- `--task`: Choose between "redact", "deduplicate", or "textract"
|
| 1130 |
+
- `--input_file`: Path to input file(s) - can specify multiple files separated by spaces
|
| 1131 |
- `--output_dir`: Directory for output files (default: output/)
|
| 1132 |
+
- `--input_dir`: Directory for input files (default: input/)
|
| 1133 |
+
- `--language`: Language of document content (e.g., "en", "es", "fr")
|
| 1134 |
+
- `--username`: Username for session tracking
|
| 1135 |
+
- `--pii_detector`: Choose PII detection method ("Local", "AWS Comprehend", or "None")
|
| 1136 |
+
- `--local_redact_entities`: Specify local entities to redact (space-separated list)
|
| 1137 |
+
- `--aws_redact_entities`: Specify AWS Comprehend entities to redact (space-separated list)
|
| 1138 |
+
- `--aws_access_key` / `--aws_secret_key`: AWS credentials for cloud services
|
| 1139 |
+
- `--aws_region`: AWS region for cloud services
|
| 1140 |
+
- `--s3_bucket`: S3 bucket name for cloud operations
|
| 1141 |
+
- `--cost_code`: Cost code for tracking usage
|
| 1142 |
+
|
| 1143 |
+
#### PDF/Image redaction options
|
| 1144 |
+
|
| 1145 |
+
- `--ocr_method`: Choose text extraction method ("AWS Textract", "Local OCR", or "Local text")
|
| 1146 |
+
- `--chosen_local_ocr_model`: Local OCR model to use (e.g., "tesseract", "paddle", "hybrid-paddle", "hybrid-vlm")
|
| 1147 |
+
- `--page_min` / `--page_max`: Process only specific page range (0 for max means all pages)
|
| 1148 |
+
- `--images_dpi`: DPI for image processing (default: 300.0)
|
| 1149 |
+
- `--preprocess_local_ocr_images`: Preprocess images before OCR (True/False)
|
| 1150 |
+
- `--compress_redacted_pdf`: Compress the final redacted PDF (True/False)
|
| 1151 |
+
- `--return_pdf_end_of_redaction`: Return PDF at end of redaction process (True/False)
|
| 1152 |
+
- `--allow_list_file` / `--deny_list_file`: Paths to custom allow/deny list CSV files
|
| 1153 |
+
- `--redact_whole_page_file`: Path to CSV file listing pages to redact completely
|
| 1154 |
+
- `--handwrite_signature_extraction`: Handwriting and signature extraction options for Textract ("Extract handwriting", "Extract signatures")
|
| 1155 |
+
- `--extract_forms`: Extract forms during Textract analysis (flag)
|
| 1156 |
+
- `--extract_tables`: Extract tables during Textract analysis (flag)
|
| 1157 |
+
- `--extract_layout`: Extract layout during Textract analysis (flag)
|
| 1158 |
+
|
| 1159 |
+
#### Tabular/Word anonymization options
|
| 1160 |
+
|
| 1161 |
+
- `--anon_strategy`: Anonymization strategy (e.g., "redact", "redact completely", "replace_redacted", "encrypt", "hash")
|
| 1162 |
+
- `--text_columns`: List of column names to anonymize (space-separated)
|
| 1163 |
+
- `--excel_sheets`: Specific Excel sheet names to process (space-separated)
|
| 1164 |
+
- `--fuzzy_mistakes`: Number of spelling mistakes allowed in fuzzy matching (default: 1)
|
| 1165 |
+
- `--match_fuzzy_whole_phrase_bool`: Match fuzzy whole phrase (True/False)
|
| 1166 |
+
- `--do_initial_clean`: Perform initial text cleaning for tabular data (True/False)
|
| 1167 |
+
|
| 1168 |
+
#### Duplicate detection options
|
| 1169 |
+
|
| 1170 |
+
- `--duplicate_type`: Type of duplicate detection ("pages" for OCR files or "tabular" for CSV/Excel)
|
| 1171 |
+
- `--similarity_threshold`: Similarity threshold (0-1) to consider content as duplicates (default: 0.95)
|
| 1172 |
+
- `--min_word_count`: Minimum word count for text to be considered (default: 10)
|
| 1173 |
+
- `--min_consecutive_pages`: Minimum number of consecutive pages to consider as a match (default: 1)
|
| 1174 |
+
- `--greedy_match`: Use greedy matching strategy for consecutive pages (True/False)
|
| 1175 |
+
- `--combine_pages`: Combine text from same page number within a file (True/False)
|
| 1176 |
+
- `--remove_duplicate_rows`: Remove duplicate rows from output (True/False)
|
| 1177 |
+
|
| 1178 |
+
#### Textract batch operations options
|
| 1179 |
+
|
| 1180 |
+
- `--textract_action`: Action to perform ("submit", "retrieve", or "list")
|
| 1181 |
+
- `--job_id`: Textract job ID for retrieve action
|
| 1182 |
+
- `--extract_signatures`: Extract signatures during Textract analysis (flag)
|
| 1183 |
+
- `--textract_bucket`: S3 bucket name for Textract operations
|
| 1184 |
+
- `--poll_interval`: Polling interval in seconds for job status (default: 30)
|
| 1185 |
+
- `--max_poll_attempts`: Maximum polling attempts before timeout (default: 120)
|
| 1186 |
|
| 1187 |
### Output files
|
| 1188 |
|
cli_redact.py
CHANGED
|
@@ -36,6 +36,7 @@ from tools.config import (
|
|
| 36 |
FULL_ENTITY_LIST,
|
| 37 |
IMAGES_DPI,
|
| 38 |
INPUT_FOLDER,
|
|
|
|
| 39 |
LOCAL_PII_OPTION,
|
| 40 |
OUTPUT_FOLDER,
|
| 41 |
PADDLE_MODEL_PATH,
|
|
@@ -399,7 +400,7 @@ python cli_redact.py --task textract --textract_action list
|
|
| 399 |
)
|
| 400 |
pdf_group.add_argument(
|
| 401 |
"--chosen_local_ocr_model",
|
| 402 |
-
choices=
|
| 403 |
default=CHOSEN_LOCAL_OCR_MODEL,
|
| 404 |
help="Local OCR model to use.",
|
| 405 |
)
|
|
|
|
| 36 |
FULL_ENTITY_LIST,
|
| 37 |
IMAGES_DPI,
|
| 38 |
INPUT_FOLDER,
|
| 39 |
+
LOCAL_OCR_MODEL_OPTIONS,
|
| 40 |
LOCAL_PII_OPTION,
|
| 41 |
OUTPUT_FOLDER,
|
| 42 |
PADDLE_MODEL_PATH,
|
|
|
|
| 400 |
)
|
| 401 |
pdf_group.add_argument(
|
| 402 |
"--chosen_local_ocr_model",
|
| 403 |
+
choices=LOCAL_OCR_MODEL_OPTIONS,
|
| 404 |
default=CHOSEN_LOCAL_OCR_MODEL,
|
| 405 |
help="Local OCR model to use.",
|
| 406 |
)
|
pyproject.toml
CHANGED
|
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|
| 4 |
|
| 5 |
[project]
|
| 6 |
name = "doc_redaction"
|
| 7 |
-
version = "1.5.
|
| 8 |
description = "Redact PDF/image-based documents, Word, or CSV/XLSX files using a Gradio-based GUI interface"
|
| 9 |
readme = "README.md"
|
| 10 |
requires-python = ">=3.10"
|
|
|
|
| 4 |
|
| 5 |
[project]
|
| 6 |
name = "doc_redaction"
|
| 7 |
+
version = "1.5.1"
|
| 8 |
description = "Redact PDF/image-based documents, Word, or CSV/XLSX files using a Gradio-based GUI interface"
|
| 9 |
readme = "README.md"
|
| 10 |
requires-python = ">=3.10"
|
src/app_settings.qmd
CHANGED
|
@@ -97,11 +97,11 @@ Configuration for input and output file handling.
|
|
| 97 |
* **Description:** If set to `'True'`, the application will save output and input files into session-specific subfolders.
|
| 98 |
* **Default Value:** `'False'`
|
| 99 |
|
| 100 |
-
* **`OUTPUT_FOLDER`**
|
| 101 |
* **Description:** Specifies the default output folder for generated files. Can be set to `"TEMP"` to use a temporary directory.
|
| 102 |
* **Default Value:** `'output/'`
|
| 103 |
|
| 104 |
-
* **`INPUT_FOLDER`**
|
| 105 |
* **Description:** Specifies the default input folder for files. Can be set to `"TEMP"` to use a temporary directory.
|
| 106 |
* **Default Value:** `'input/'`
|
| 107 |
|
|
@@ -225,6 +225,14 @@ Configurations for the Gradio UI, server behavior, and application limits.
|
|
| 225 |
* **Description:** Maximum number of characters for open text input.
|
| 226 |
* **Default Value:** `50000`
|
| 227 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 228 |
* **`TLDEXTRACT_CACHE`**
|
| 229 |
* **Description:** Path to the cache directory used by the `tldextract` library.
|
| 230 |
* **Default Value:** `'tmp/tld/'`
|
|
@@ -263,6 +271,14 @@ Configurations related to text extraction, PII detection, and the redaction proc
|
|
| 263 |
* **Description:** Controls whether local (Tesseract) or AWS (Textract) text extraction options are shown in the UI.
|
| 264 |
* **Default Value:** `"True"` for both.
|
| 265 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 266 |
* **`SHOW_LOCAL_PII_DETECTION_OPTIONS`** / **`SHOW_AWS_PII_DETECTION_OPTIONS`**
|
| 267 |
* **Description:** Controls whether local or AWS (Comprehend) PII detection options are shown in the UI.
|
| 268 |
* **Default Value:** `"True"` for both.
|
|
@@ -309,7 +325,7 @@ Configurations related to text extraction, PII detection, and the redaction proc
|
|
| 309 |
|
| 310 |
* **`HYBRID_OCR_CONFIDENCE_THRESHOLD`**
|
| 311 |
* **Description:** In "hybrid-paddle" mode, this is the Tesseract confidence score below which PaddleOCR will be used for re-extraction.
|
| 312 |
-
* **Default Value:** `
|
| 313 |
|
| 314 |
* **`HYBRID_OCR_PADDING`**
|
| 315 |
* **Description:** In "hybrid-paddle" mode, padding added to the word's bounding box before re-extraction.
|
|
@@ -333,6 +349,76 @@ Configurations related to text extraction, PII detection, and the redaction proc
|
|
| 333 |
|
| 334 |
* **`PREPROCESS_LOCAL_OCR_IMAGES`**
|
| 335 |
* **Description:** If set to `"True"`, images will be preprocessed before local OCR. Can slow down processing.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 336 |
* **Default Value:** `"False"`
|
| 337 |
|
| 338 |
### Entity and Search Options
|
|
@@ -367,6 +453,10 @@ Configurations related to text extraction, PII detection, and the redaction proc
|
|
| 367 |
* **Description:** The default options selected for Textract's handwriting and signature detection.
|
| 368 |
* **Default Value:** `['Extract handwriting']`
|
| 369 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 370 |
* **`INCLUDE_FORM_EXTRACTION_TEXTRACT_OPTION`**
|
| 371 |
* **`INCLUDE_LAYOUT_EXTRACTION_TEXTRACT_OPTION`**
|
| 372 |
* **`INCLUDE_TABLE_EXTRACTION_TEXTRACT_OPTION`**
|
|
@@ -481,9 +571,105 @@ Settings for running the application from the command line (Direct Mode) or as a
|
|
| 481 |
* **Description:** Path to the input file and output directory for the task.
|
| 482 |
* **Default Values:** `''`, `output/`
|
| 483 |
|
| 484 |
-
*
|
| 485 |
-
* **Description:**
|
| 486 |
-
* **Default Value:**
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 487 |
|
| 488 |
### Lambda Configuration
|
| 489 |
|
|
|
|
| 97 |
* **Description:** If set to `'True'`, the application will save output and input files into session-specific subfolders.
|
| 98 |
* **Default Value:** `'False'`
|
| 99 |
|
| 100 |
+
* **`OUTPUT_FOLDER`** (environment variable: `GRADIO_OUTPUT_FOLDER`)
|
| 101 |
* **Description:** Specifies the default output folder for generated files. Can be set to `"TEMP"` to use a temporary directory.
|
| 102 |
* **Default Value:** `'output/'`
|
| 103 |
|
| 104 |
+
* **`INPUT_FOLDER`** (environment variable: `GRADIO_INPUT_FOLDER`)
|
| 105 |
* **Description:** Specifies the default input folder for files. Can be set to `"TEMP"` to use a temporary directory.
|
| 106 |
* **Default Value:** `'input/'`
|
| 107 |
|
|
|
|
| 225 |
* **Description:** Maximum number of characters for open text input.
|
| 226 |
* **Default Value:** `50000`
|
| 227 |
|
| 228 |
+
* **`PAGE_BREAK_VALUE`**
|
| 229 |
+
* **Description:** Number of pages to process before breaking and restarting from the last finished page (not currently activated).
|
| 230 |
+
* **Default Value:** `99999`
|
| 231 |
+
|
| 232 |
+
* **`MAX_TIME_VALUE`**
|
| 233 |
+
* **Description:** Maximum time value for processing operations.
|
| 234 |
+
* **Default Value:** `999999`
|
| 235 |
+
|
| 236 |
* **`TLDEXTRACT_CACHE`**
|
| 237 |
* **Description:** Path to the cache directory used by the `tldextract` library.
|
| 238 |
* **Default Value:** `'tmp/tld/'`
|
|
|
|
| 271 |
* **Description:** Controls whether local (Tesseract) or AWS (Textract) text extraction options are shown in the UI.
|
| 272 |
* **Default Value:** `"True"` for both.
|
| 273 |
|
| 274 |
+
* **`SELECTABLE_TEXT_EXTRACT_OPTION`**, **`TESSERACT_TEXT_EXTRACT_OPTION`**, **`TEXTRACT_TEXT_EXTRACT_OPTION`**
|
| 275 |
+
* **Description:** Labels for text extraction model options displayed in the UI. Customize the display names for "Local model - selectable text", "Local OCR model - PDFs without selectable text", and "AWS Textract service - all PDF types" respectively.
|
| 276 |
+
* **Default Values:** `"Local model - selectable text"`, `"Local OCR model - PDFs without selectable text"`, `"AWS Textract service - all PDF types"`
|
| 277 |
+
|
| 278 |
+
* **`NO_REDACTION_PII_OPTION`**, **`LOCAL_PII_OPTION`**, **`AWS_PII_OPTION`**
|
| 279 |
+
* **Description:** Labels for PII detection model options displayed in the UI. Customize the display names for "Only extract text (no redaction)", "Local", and "AWS Comprehend" respectively.
|
| 280 |
+
* **Default Values:** `"Only extract text (no redaction)"`, `"Local"`, `"AWS Comprehend"`
|
| 281 |
+
|
| 282 |
* **`SHOW_LOCAL_PII_DETECTION_OPTIONS`** / **`SHOW_AWS_PII_DETECTION_OPTIONS`**
|
| 283 |
* **Description:** Controls whether local or AWS (Comprehend) PII detection options are shown in the UI.
|
| 284 |
* **Default Value:** `"True"` for both.
|
|
|
|
| 325 |
|
| 326 |
* **`HYBRID_OCR_CONFIDENCE_THRESHOLD`**
|
| 327 |
* **Description:** In "hybrid-paddle" mode, this is the Tesseract confidence score below which PaddleOCR will be used for re-extraction.
|
| 328 |
+
* **Default Value:** `80`
|
| 329 |
|
| 330 |
* **`HYBRID_OCR_PADDING`**
|
| 331 |
* **Description:** In "hybrid-paddle" mode, padding added to the word's bounding box before re-extraction.
|
|
|
|
| 349 |
|
| 350 |
* **`PREPROCESS_LOCAL_OCR_IMAGES`**
|
| 351 |
* **Description:** If set to `"True"`, images will be preprocessed before local OCR. Can slow down processing.
|
| 352 |
+
* **Default Value:** `"True"`
|
| 353 |
+
|
| 354 |
+
* **`SAVE_PREPROCESS_IMAGES`**
|
| 355 |
+
* **Description:** If set to `"True"`, saves the preprocessed images for debugging purposes.
|
| 356 |
+
* **Default Value:** `"False"`
|
| 357 |
+
|
| 358 |
+
* **`SHOW_PADDLE_MODEL_OPTIONS`**
|
| 359 |
+
* **Description:** If set to `"True"`, allows the user to select PaddleOCR-related options (paddle, hybrid-paddle) from the UI.
|
| 360 |
+
* **Default Value:** `"False"`
|
| 361 |
+
|
| 362 |
+
* **`MODEL_CACHE_PATH`**
|
| 363 |
+
* **Description:** Path to the directory where models are cached.
|
| 364 |
+
* **Default Value:** `"./model_cache"`
|
| 365 |
+
|
| 366 |
+
* **`TESSERACT_SEGMENTATION_LEVEL`**
|
| 367 |
+
* **Description:** Tesseract PSM (Page Segmentation Mode) level to use for OCR. Valid values are 0-13.
|
| 368 |
+
* **Default Value:** `11`
|
| 369 |
+
|
| 370 |
+
* **`CONVERT_LINE_TO_WORD_LEVEL`**
|
| 371 |
+
* **Description:** If set to `"True"`, converts PaddleOCR line-level OCR results to word-level for better precision.
|
| 372 |
+
* **Default Value:** `"False"`
|
| 373 |
+
|
| 374 |
+
* **`LOAD_PADDLE_AT_STARTUP`**
|
| 375 |
+
* **Description:** If set to `"True"`, loads the PaddleOCR model at application startup.
|
| 376 |
+
* **Default Value:** `"False"`
|
| 377 |
+
|
| 378 |
+
### Vision Language Model (VLM) Options
|
| 379 |
+
|
| 380 |
+
* **`SHOW_VLM_MODEL_OPTIONS`**
|
| 381 |
+
* **Description:** If set to `"True"`, VLM (Vision Language Model) options will be shown in the UI.
|
| 382 |
+
* **Default Value:** `"False"`
|
| 383 |
+
|
| 384 |
+
* **`SELECTED_MODEL`**
|
| 385 |
+
* **Description:** Selected vision model for OCR. Choose from: `"Nanonets-OCR2-3B"`, `"Dots.OCR"`, `"Qwen3-VL-2B-Instruct"`, `"Qwen3-VL-4B-Instruct"`, `"PaddleOCR-VL"`.
|
| 386 |
+
* **Default Value:** `"Dots.OCR"`
|
| 387 |
+
|
| 388 |
+
* **`MAX_SPACES_GPU_RUN_TIME`**
|
| 389 |
+
* **Description:** Maximum number of seconds to run the GPU on Spaces (Hugging Face Spaces).
|
| 390 |
+
* **Default Value:** `60`
|
| 391 |
+
|
| 392 |
+
* **`MAX_NEW_TOKENS`**
|
| 393 |
+
* **Description:** Maximum number of tokens to generate for VLM responses.
|
| 394 |
+
* **Default Value:** `30`
|
| 395 |
+
|
| 396 |
+
* **`DEFAULT_MAX_NEW_TOKENS`**
|
| 397 |
+
* **Description:** Default maximum number of tokens to generate for VLM responses.
|
| 398 |
+
* **Default Value:** `30`
|
| 399 |
+
|
| 400 |
+
* **`MAX_INPUT_TOKEN_LENGTH`**
|
| 401 |
+
* **Description:** Maximum number of tokens that can be input to the VLM.
|
| 402 |
+
* **Default Value:** `4096`
|
| 403 |
+
|
| 404 |
+
* **`VLM_MAX_IMAGE_SIZE`**
|
| 405 |
+
* **Description:** Maximum total pixels (width * height) for images passed to VLM. Images with more pixels will be resized while maintaining aspect ratio.
|
| 406 |
+
* **Default Value:** `1000000` (1000x1000)
|
| 407 |
+
|
| 408 |
+
* **`VLM_MAX_DPI`**
|
| 409 |
+
* **Description:** Maximum DPI for images passed to VLM. Images with higher DPI will be resized accordingly.
|
| 410 |
+
* **Default Value:** `300.0`
|
| 411 |
+
|
| 412 |
+
* **`USE_FLASH_ATTENTION`**
|
| 413 |
+
* **Description:** If set to `"True"`, uses flash attention for the VLM, which can improve performance.
|
| 414 |
+
* **Default Value:** `"False"`
|
| 415 |
+
|
| 416 |
+
* **`OVERWRITE_EXISTING_OCR_RESULTS`**
|
| 417 |
+
* **Description:** If set to `"True"`, always creates new OCR results instead of loading from existing JSON files.
|
| 418 |
+
* **Default Value:** `"False"`
|
| 419 |
+
|
| 420 |
+
* **`SAVE_VLM_INPUT_IMAGES`**
|
| 421 |
+
* **Description:** If set to `"True"`, saves input images sent to VLM OCR for debugging purposes.
|
| 422 |
* **Default Value:** `"False"`
|
| 423 |
|
| 424 |
### Entity and Search Options
|
|
|
|
| 453 |
* **Description:** The default options selected for Textract's handwriting and signature detection.
|
| 454 |
* **Default Value:** `['Extract handwriting']`
|
| 455 |
|
| 456 |
+
* **`HANDWRITE_SIGNATURE_TEXTBOX_FULL_OPTIONS`**
|
| 457 |
+
* **Description:** Full list of available options for Textract's handwriting and signature detection. Can include `'Extract handwriting'`, `'Extract signatures'`, and optionally `'Extract forms'`, `'Extract layout'`, `'Extract tables'` if the corresponding include options are enabled.
|
| 458 |
+
* **Default Value:** `['Extract handwriting', 'Extract signatures']`
|
| 459 |
+
|
| 460 |
* **`INCLUDE_FORM_EXTRACTION_TEXTRACT_OPTION`**
|
| 461 |
* **`INCLUDE_LAYOUT_EXTRACTION_TEXTRACT_OPTION`**
|
| 462 |
* **`INCLUDE_TABLE_EXTRACTION_TEXTRACT_OPTION`**
|
|
|
|
| 571 |
* **Description:** Path to the input file and output directory for the task.
|
| 572 |
* **Default Values:** `''`, `output/`
|
| 573 |
|
| 574 |
+
* **`DIRECT_MODE_DUPLICATE_TYPE`**
|
| 575 |
+
* **Description:** Type of duplicate detection for direct mode: `'pages'` or `'tabular'`.
|
| 576 |
+
* **Default Value:** `'pages'`
|
| 577 |
+
|
| 578 |
+
* **`DIRECT_MODE_LANGUAGE`**
|
| 579 |
+
* **Description:** Language for document processing in direct mode.
|
| 580 |
+
* **Default Value:** Inherits from `DEFAULT_LANGUAGE`
|
| 581 |
+
|
| 582 |
+
* **`DIRECT_MODE_PII_DETECTOR`**
|
| 583 |
+
* **Description:** PII detection method for direct mode.
|
| 584 |
+
* **Default Value:** Inherits from `LOCAL_PII_OPTION`
|
| 585 |
+
|
| 586 |
+
* **`DIRECT_MODE_OCR_METHOD`**
|
| 587 |
+
* **Description:** OCR method for PDF/image processing in direct mode.
|
| 588 |
+
* **Default Value:** `"Local OCR"`
|
| 589 |
+
|
| 590 |
+
* **`DIRECT_MODE_PAGE_MIN`** / **`DIRECT_MODE_PAGE_MAX`**
|
| 591 |
+
* **Description:** First and last page to process in direct mode. `0` for max means process all pages.
|
| 592 |
+
* **Default Values:** Inherit from `DEFAULT_PAGE_MIN` / `DEFAULT_PAGE_MAX`
|
| 593 |
+
|
| 594 |
+
* **`DIRECT_MODE_IMAGES_DPI`**
|
| 595 |
+
* **Description:** DPI for image processing in direct mode.
|
| 596 |
+
* **Default Value:** Inherits from `IMAGES_DPI`
|
| 597 |
+
|
| 598 |
+
* **`DIRECT_MODE_CHOSEN_LOCAL_OCR_MODEL`**
|
| 599 |
+
* **Description:** Local OCR model choice for direct mode.
|
| 600 |
+
* **Default Value:** Inherits from `CHOSEN_LOCAL_OCR_MODEL`
|
| 601 |
+
|
| 602 |
+
* **`DIRECT_MODE_PREPROCESS_LOCAL_OCR_IMAGES`**
|
| 603 |
+
* **Description:** If set to `"True"`, preprocesses images before OCR in direct mode.
|
| 604 |
+
* **Default Value:** Inherits from `PREPROCESS_LOCAL_OCR_IMAGES`
|
| 605 |
+
|
| 606 |
+
* **`DIRECT_MODE_COMPRESS_REDACTED_PDF`**
|
| 607 |
+
* **Description:** If set to `"True"`, compresses the redacted PDF output in direct mode.
|
| 608 |
+
* **Default Value:** Inherits from `COMPRESS_REDACTED_PDF`
|
| 609 |
+
|
| 610 |
+
* **`DIRECT_MODE_RETURN_PDF_END_OF_REDACTION`**
|
| 611 |
+
* **Description:** If set to `"True"`, returns a PDF at the end of redaction in direct mode.
|
| 612 |
+
* **Default Value:** Inherits from `RETURN_REDACTED_PDF`
|
| 613 |
+
|
| 614 |
+
* **`DIRECT_MODE_EXTRACT_FORMS`**
|
| 615 |
+
* **Description:** If set to `"True"`, extracts forms during Textract analysis in direct mode.
|
| 616 |
+
* **Default Value:** `"False"`
|
| 617 |
+
|
| 618 |
+
* **`DIRECT_MODE_EXTRACT_TABLES`**
|
| 619 |
+
* **Description:** If set to `"True"`, extracts tables during Textract analysis in direct mode.
|
| 620 |
+
* **Default Value:** `"False"`
|
| 621 |
+
|
| 622 |
+
* **`DIRECT_MODE_EXTRACT_LAYOUT`**
|
| 623 |
+
* **Description:** If set to `"True"`, extracts layout during Textract analysis in direct mode.
|
| 624 |
+
* **Default Value:** `"False"`
|
| 625 |
+
|
| 626 |
+
* **`DIRECT_MODE_EXTRACT_SIGNATURES`**
|
| 627 |
+
* **Description:** If set to `"True"`, extracts signatures during Textract analysis in direct mode.
|
| 628 |
+
* **Default Value:** `"False"`
|
| 629 |
+
|
| 630 |
+
* **`DIRECT_MODE_MATCH_FUZZY_WHOLE_PHRASE_BOOL`**
|
| 631 |
+
* **Description:** If set to `"True"`, matches fuzzy whole phrases in direct mode.
|
| 632 |
+
* **Default Value:** `"True"`
|
| 633 |
+
|
| 634 |
+
* **`DIRECT_MODE_ANON_STRATEGY`**
|
| 635 |
+
* **Description:** Anonymisation strategy for tabular data in direct mode.
|
| 636 |
+
* **Default Value:** Inherits from `DEFAULT_TABULAR_ANONYMISATION_STRATEGY`
|
| 637 |
+
|
| 638 |
+
* **`DIRECT_MODE_FUZZY_MISTAKES`**
|
| 639 |
+
* **Description:** Number of fuzzy spelling mistakes allowed in direct mode.
|
| 640 |
+
* **Default Value:** Inherits from `DEFAULT_FUZZY_SPELLING_MISTAKES_NUM`
|
| 641 |
+
|
| 642 |
+
* **`DIRECT_MODE_SIMILARITY_THRESHOLD`**
|
| 643 |
+
* **Description:** Similarity threshold for duplicate detection in direct mode.
|
| 644 |
+
* **Default Value:** Inherits from `DEFAULT_DUPLICATE_DETECTION_THRESHOLD`
|
| 645 |
+
|
| 646 |
+
* **`DIRECT_MODE_MIN_WORD_COUNT`**
|
| 647 |
+
* **Description:** Minimum word count for duplicate detection in direct mode.
|
| 648 |
+
* **Default Value:** Inherits from `DEFAULT_MIN_WORD_COUNT`
|
| 649 |
+
|
| 650 |
+
* **`DIRECT_MODE_MIN_CONSECUTIVE_PAGES`**
|
| 651 |
+
* **Description:** Minimum consecutive pages for duplicate detection in direct mode.
|
| 652 |
+
* **Default Value:** Inherits from `DEFAULT_MIN_CONSECUTIVE_PAGES`
|
| 653 |
+
|
| 654 |
+
* **`DIRECT_MODE_GREEDY_MATCH`**
|
| 655 |
+
* **Description:** If set to `"True"`, uses greedy matching for duplicate detection in direct mode.
|
| 656 |
+
* **Default Value:** Inherits from `USE_GREEDY_DUPLICATE_DETECTION`
|
| 657 |
+
|
| 658 |
+
* **`DIRECT_MODE_COMBINE_PAGES`**
|
| 659 |
+
* **Description:** If set to `"True"`, combines pages for duplicate detection in direct mode.
|
| 660 |
+
* **Default Value:** Inherits from `DEFAULT_COMBINE_PAGES`
|
| 661 |
+
|
| 662 |
+
* **`DIRECT_MODE_REMOVE_DUPLICATE_ROWS`**
|
| 663 |
+
* **Description:** If set to `"True"`, removes duplicate rows in tabular data in direct mode.
|
| 664 |
+
* **Default Value:** Inherits from `REMOVE_DUPLICATE_ROWS`
|
| 665 |
+
|
| 666 |
+
* **`DIRECT_MODE_TEXTRACT_ACTION`**
|
| 667 |
+
* **Description:** Textract action for batch operations in direct mode.
|
| 668 |
+
* **Default Value:** `''`
|
| 669 |
+
|
| 670 |
+
* **`DIRECT_MODE_JOB_ID`**
|
| 671 |
+
* **Description:** Job ID for Textract operations in direct mode.
|
| 672 |
+
* **Default Value:** `''`
|
| 673 |
|
| 674 |
### Lambda Configuration
|
| 675 |
|
src/user_guide.qmd
CHANGED
|
@@ -150,7 +150,17 @@ If you have used the AWS Textract option for extracting text, you may also see a
|
|
| 150 |
|
| 151 |

|
| 152 |
|
| 153 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 154 |
|
| 155 |
### Downloading output files from previous redaction tasks
|
| 156 |
|
|
@@ -460,6 +470,7 @@ You can also write open text into an input box and redact that using the same me
|
|
| 460 |
### Redaction log outputs
|
| 461 |
A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
|
| 462 |
|
|
|
|
| 463 |
## Identifying and redacting duplicate pages
|
| 464 |
|
| 465 |
The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
|
|
@@ -690,45 +701,91 @@ AWS_SECRET_KEY= your-secret-key
|
|
| 690 |
|
| 691 |
The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
|
| 692 |
|
| 693 |
-
|
| 694 |
-
|
| 695 |
-
## Advanced OCR options (Hybrid OCR)
|
| 696 |
|
| 697 |
-
The app supports advanced OCR options that combine multiple OCR engines for improved accuracy. These options are not enabled by default but can be configured by your system
|
| 698 |
|
| 699 |
### Available OCR models
|
| 700 |
|
| 701 |
-
- **Tesseract** (default): The standard OCR engine that works well for most documents
|
| 702 |
-
- **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise
|
| 703 |
-
- **Hybrid**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text
|
|
|
|
|
|
|
| 704 |
|
| 705 |
### Enabling advanced OCR options
|
| 706 |
|
| 707 |
-
To enable these options,
|
| 708 |
|
|
|
|
| 709 |
```
|
| 710 |
SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
|
| 711 |
```
|
| 712 |
|
| 713 |
-
|
| 714 |
-
|
| 715 |
-
|
| 716 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 717 |
|
| 718 |
-
###
|
| 719 |
|
| 720 |
-
The
|
| 721 |
|
| 722 |
-
|
| 723 |
-
|
| 724 |
-
- **
|
| 725 |
-
- **
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 726 |
|
| 727 |
### When to use different OCR models
|
| 728 |
|
| 729 |
-
- **Tesseract**: Best for general use, good balance of speed and accuracy
|
| 730 |
-
- **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision
|
| 731 |
-
- **Hybrid**: Best for challenging documents where some text has low confidence scores,
|
|
|
|
|
|
|
| 732 |
|
| 733 |
|
| 734 |
|
|
@@ -843,18 +900,65 @@ python cli_redact.py --task textract --textract_action list
|
|
| 843 |
|
| 844 |
### Common CLI options
|
| 845 |
|
|
|
|
|
|
|
| 846 |
- `--task`: Choose between "redact", "deduplicate", or "textract"
|
| 847 |
-
- `--input_file`: Path to input file(s)
|
| 848 |
- `--output_dir`: Directory for output files (default: output/)
|
| 849 |
-
- `--
|
| 850 |
-
- `--
|
| 851 |
-
- `--
|
| 852 |
-
- `--
|
| 853 |
-
- `--
|
| 854 |
-
- `--
|
| 855 |
-
- `--
|
| 856 |
-
- `--
|
| 857 |
-
- `--
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 858 |
|
| 859 |
### Output files
|
| 860 |
|
|
|
|
| 150 |
|
| 151 |

|
| 152 |
|
| 153 |
+
#### Additional outputs in the log file outputs
|
| 154 |
+
|
| 155 |
+
On the Redaction settings tab, near the bottom of the pagethere is a section called 'Log file outputs'. This section contains the following files:
|
| 156 |
+
|
| 157 |
+
You may see a '..._ocr_results_with_words... .json' file. This file works in the same way as the AWS Textract .json results described above, and can be uploaded alongside an input document to save time on text extraction in future in the same way.
|
| 158 |
+
|
| 159 |
+
Also you will see a 'decision_process_table.csv' file. This file contains a table of the decisions made by the app for each page of the document. This can be useful for debugging and understanding the decisions made by the app.
|
| 160 |
+
|
| 161 |
+
Additionally, if the option is enabled by your system administrator, on this tab you may see an image of the output from the OCR model used to extract the text from the document, an image ending with page number and '_visualisations.jpg'. A separate image will be created for each page of the document like the one below. This can be useful for seeing at a glance whether the text extraction process for a page was successful, and whether word-level bounding boxes are correctly positioned.
|
| 162 |
+
|
| 163 |
+

|
| 164 |
|
| 165 |
### Downloading output files from previous redaction tasks
|
| 166 |
|
|
|
|
| 470 |
### Redaction log outputs
|
| 471 |
A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
|
| 472 |
|
| 473 |
+
|
| 474 |
## Identifying and redacting duplicate pages
|
| 475 |
|
| 476 |
The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
|
|
|
|
| 701 |
|
| 702 |
The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
|
| 703 |
|
| 704 |
+
## Advanced OCR options
|
|
|
|
|
|
|
| 705 |
|
| 706 |
+
The app supports advanced OCR options that combine multiple OCR engines for improved accuracy. These options are not enabled by default but can be configured by changing the app_config.env file in your '/config' folder, or system environment variables in your system.
|
| 707 |
|
| 708 |
### Available OCR models
|
| 709 |
|
| 710 |
+
- **Tesseract** (default): The standard OCR engine that works well for most documents. Provides good word-level bounding box accuracy.
|
| 711 |
+
- **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise. Best for documents with clear, well-formatted text.
|
| 712 |
+
- **Hybrid-paddle**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text regions.
|
| 713 |
+
- **Hybrid-vlm**: Combines Tesseract with Vision Language Models (VLM) - uses Tesseract for initial extraction, then a VLM model (default: Dots.OCR) for re-extraction of low-confidence text.
|
| 714 |
+
- **Hybrid-paddle-vlm**: Combines PaddleOCR with Vision Language Models - uses PaddleOCR first, then a VLM model for low-confidence regions.
|
| 715 |
|
| 716 |
### Enabling advanced OCR options
|
| 717 |
|
| 718 |
+
To enable these options, you need to modify the app_config.env file in your '/config' folder and set the following environment variables:
|
| 719 |
|
| 720 |
+
**Basic OCR model selection:**
|
| 721 |
```
|
| 722 |
SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
|
| 723 |
```
|
| 724 |
|
| 725 |
+
**To enable PaddleOCR options (paddle, hybrid-paddle):**
|
| 726 |
+
```
|
| 727 |
+
SHOW_PADDLE_MODEL_OPTIONS = "True"
|
| 728 |
+
```
|
| 729 |
+
|
| 730 |
+
**To enable Vision Language Model options (hybrid-vlm, hybrid-paddle-vlm):**
|
| 731 |
+
```
|
| 732 |
+
SHOW_VLM_MODEL_OPTIONS = "True"
|
| 733 |
+
```
|
| 734 |
+
|
| 735 |
+
Once enabled, users will see a "Change default local OCR model" section in the redaction settings where they can choose between the available models based on what has been enabled.
|
| 736 |
|
| 737 |
+
### OCR configuration parameters
|
| 738 |
|
| 739 |
+
The following parameters can be configured by your system administrator to fine-tune OCR behavior:
|
| 740 |
|
| 741 |
+
#### Hybrid OCR settings
|
| 742 |
+
|
| 743 |
+
- **HYBRID_OCR_CONFIDENCE_THRESHOLD** (default: 80): Tesseract confidence score below which the secondary OCR engine (PaddleOCR or VLM) will be used for re-extraction. Lower values mean more text will be re-extracted.
|
| 744 |
+
- **HYBRID_OCR_PADDING** (default: 1): Padding (in pixels) added to word bounding boxes before re-extraction with the secondary engine.
|
| 745 |
+
- **SAVE_EXAMPLE_HYBRID_IMAGES** (default: False): If enabled, saves comparison images showing Tesseract vs. secondary engine results when using hybrid modes.
|
| 746 |
+
- **SAVE_PAGE_OCR_VISUALISATIONS** (default: False): If enabled, saves images with detected bounding boxes overlaid for debugging purposes.
|
| 747 |
+
|
| 748 |
+
#### Tesseract settings
|
| 749 |
+
|
| 750 |
+
- **TESSERACT_SEGMENTATION_LEVEL** (default: 11): Tesseract PSM (Page Segmentation Mode) level. Valid values are 0-13. Higher values provide more detailed segmentation but may be slower.
|
| 751 |
+
|
| 752 |
+
#### PaddleOCR settings
|
| 753 |
+
|
| 754 |
+
- **PADDLE_USE_TEXTLINE_ORIENTATION** (default: False): If enabled, PaddleOCR will detect and correct text line orientation.
|
| 755 |
+
- **PADDLE_DET_DB_UNCLIP_RATIO** (default: 1.2): Controls the expansion ratio of detected text regions. Higher values expand the detection area more.
|
| 756 |
+
- **CONVERT_LINE_TO_WORD_LEVEL** (default: False): If enabled, converts PaddleOCR line-level results to word-level for better precision in bounding boxes (not perfect, but pretty good).
|
| 757 |
+
- **LOAD_PADDLE_AT_STARTUP** (default: False): If enabled, loads the PaddleOCR model when the application starts, reducing latency for first use but increasing startup time.
|
| 758 |
+
|
| 759 |
+
#### Image preprocessing
|
| 760 |
+
|
| 761 |
+
- **PREPROCESS_LOCAL_OCR_IMAGES** (default: True): If enabled, images are preprocessed before OCR. This can improve accuracy but may slow down processing.
|
| 762 |
+
- **SAVE_PREPROCESS_IMAGES** (default: False): If enabled, saves the preprocessed images for debugging purposes.
|
| 763 |
+
|
| 764 |
+
#### Vision Language Model (VLM) settings
|
| 765 |
+
|
| 766 |
+
When VLM options are enabled, the following settings are available:
|
| 767 |
+
|
| 768 |
+
- **SELECTED_MODEL** (default: "Dots.OCR"): The VLM model to use. Options include: "Nanonets-OCR2-3B", "Dots.OCR", "Qwen3-VL-2B-Instruct", "Qwen3-VL-4B-Instruct", "PaddleOCR-VL".
|
| 769 |
+
- **MAX_SPACES_GPU_RUN_TIME** (default: 60): Maximum seconds to run GPU operations on Hugging Face Spaces.
|
| 770 |
+
- **MAX_NEW_TOKENS** (default: 30): Maximum number of tokens to generate for VLM responses.
|
| 771 |
+
- **MAX_INPUT_TOKEN_LENGTH** (default: 4096): Maximum number of tokens that can be input to the VLM.
|
| 772 |
+
- **VLM_MAX_IMAGE_SIZE** (default: 1000000): Maximum total pixels (width × height) for images. Larger images are resized while maintaining aspect ratio.
|
| 773 |
+
- **VLM_MAX_DPI** (default: 300.0): Maximum DPI for images. Higher DPI images are resized accordingly.
|
| 774 |
+
- **USE_FLASH_ATTENTION** (default: False): If enabled, uses flash attention for improved VLM performance.
|
| 775 |
+
- **SAVE_VLM_INPUT_IMAGES** (default: False): If enabled, saves input images sent to VLM for debugging.
|
| 776 |
+
|
| 777 |
+
#### General settings
|
| 778 |
+
|
| 779 |
+
- **MODEL_CACHE_PATH** (default: "./model_cache"): Directory where OCR models are cached.
|
| 780 |
+
- **OVERWRITE_EXISTING_OCR_RESULTS** (default: False): If enabled, always creates new OCR results instead of loading from existing JSON files.
|
| 781 |
|
| 782 |
### When to use different OCR models
|
| 783 |
|
| 784 |
+
- **Tesseract**: Best for general use, providing a good balance of speed and accuracy with precise word-level bounding boxes.
|
| 785 |
+
- **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision.
|
| 786 |
+
- **Hybrid-paddle**: Best for challenging documents where some text has low confidence scores, combining Tesseract's word-level precision with PaddleOCR's improved text recognition.
|
| 787 |
+
- **Hybrid-vlm**: Best for very challenging documents with poor image quality or unusual text layouts, leveraging advanced vision models for difficult text.
|
| 788 |
+
- **Hybrid-paddle-vlm**: Most comprehensive option, combining PaddleOCR's line-level detection with a VLM's advanced recognition capabilities.
|
| 789 |
|
| 790 |
|
| 791 |
|
|
|
|
| 900 |
|
| 901 |
### Common CLI options
|
| 902 |
|
| 903 |
+
#### General options
|
| 904 |
+
|
| 905 |
- `--task`: Choose between "redact", "deduplicate", or "textract"
|
| 906 |
+
- `--input_file`: Path to input file(s) - can specify multiple files separated by spaces
|
| 907 |
- `--output_dir`: Directory for output files (default: output/)
|
| 908 |
+
- `--input_dir`: Directory for input files (default: input/)
|
| 909 |
+
- `--language`: Language of document content (e.g., "en", "es", "fr")
|
| 910 |
+
- `--username`: Username for session tracking
|
| 911 |
+
- `--pii_detector`: Choose PII detection method ("Local", "AWS Comprehend", or "None")
|
| 912 |
+
- `--local_redact_entities`: Specify local entities to redact (space-separated list)
|
| 913 |
+
- `--aws_redact_entities`: Specify AWS Comprehend entities to redact (space-separated list)
|
| 914 |
+
- `--aws_access_key` / `--aws_secret_key`: AWS credentials for cloud services
|
| 915 |
+
- `--aws_region`: AWS region for cloud services
|
| 916 |
+
- `--s3_bucket`: S3 bucket name for cloud operations
|
| 917 |
+
- `--cost_code`: Cost code for tracking usage
|
| 918 |
+
|
| 919 |
+
#### PDF/Image redaction options
|
| 920 |
+
|
| 921 |
+
- `--ocr_method`: Choose text extraction method ("AWS Textract", "Local OCR", or "Local text")
|
| 922 |
+
- `--chosen_local_ocr_model`: Local OCR model to use (e.g., "tesseract", "paddle", "hybrid-paddle", "hybrid-vlm")
|
| 923 |
+
- `--page_min` / `--page_max`: Process only specific page range (0 for max means all pages)
|
| 924 |
+
- `--images_dpi`: DPI for image processing (default: 300.0)
|
| 925 |
+
- `--preprocess_local_ocr_images`: Preprocess images before OCR (True/False)
|
| 926 |
+
- `--compress_redacted_pdf`: Compress the final redacted PDF (True/False)
|
| 927 |
+
- `--return_pdf_end_of_redaction`: Return PDF at end of redaction process (True/False)
|
| 928 |
+
- `--allow_list_file` / `--deny_list_file`: Paths to custom allow/deny list CSV files
|
| 929 |
+
- `--redact_whole_page_file`: Path to CSV file listing pages to redact completely
|
| 930 |
+
- `--handwrite_signature_extraction`: Handwriting and signature extraction options for Textract ("Extract handwriting", "Extract signatures")
|
| 931 |
+
- `--extract_forms`: Extract forms during Textract analysis (flag)
|
| 932 |
+
- `--extract_tables`: Extract tables during Textract analysis (flag)
|
| 933 |
+
- `--extract_layout`: Extract layout during Textract analysis (flag)
|
| 934 |
+
|
| 935 |
+
#### Tabular/Word anonymization options
|
| 936 |
+
|
| 937 |
+
- `--anon_strategy`: Anonymization strategy (e.g., "redact", "redact completely", "replace_redacted", "encrypt", "hash")
|
| 938 |
+
- `--text_columns`: List of column names to anonymize (space-separated)
|
| 939 |
+
- `--excel_sheets`: Specific Excel sheet names to process (space-separated)
|
| 940 |
+
- `--fuzzy_mistakes`: Number of spelling mistakes allowed in fuzzy matching (default: 1)
|
| 941 |
+
- `--match_fuzzy_whole_phrase_bool`: Match fuzzy whole phrase (True/False)
|
| 942 |
+
- `--do_initial_clean`: Perform initial text cleaning for tabular data (True/False)
|
| 943 |
+
|
| 944 |
+
#### Duplicate detection options
|
| 945 |
+
|
| 946 |
+
- `--duplicate_type`: Type of duplicate detection ("pages" for OCR files or "tabular" for CSV/Excel)
|
| 947 |
+
- `--similarity_threshold`: Similarity threshold (0-1) to consider content as duplicates (default: 0.95)
|
| 948 |
+
- `--min_word_count`: Minimum word count for text to be considered (default: 10)
|
| 949 |
+
- `--min_consecutive_pages`: Minimum number of consecutive pages to consider as a match (default: 1)
|
| 950 |
+
- `--greedy_match`: Use greedy matching strategy for consecutive pages (True/False)
|
| 951 |
+
- `--combine_pages`: Combine text from same page number within a file (True/False)
|
| 952 |
+
- `--remove_duplicate_rows`: Remove duplicate rows from output (True/False)
|
| 953 |
+
|
| 954 |
+
#### Textract batch operations options
|
| 955 |
+
|
| 956 |
+
- `--textract_action`: Action to perform ("submit", "retrieve", or "list")
|
| 957 |
+
- `--job_id`: Textract job ID for retrieve action
|
| 958 |
+
- `--extract_signatures`: Extract signatures during Textract analysis (flag)
|
| 959 |
+
- `--textract_bucket`: S3 bucket name for Textract operations
|
| 960 |
+
- `--poll_interval`: Polling interval in seconds for job status (default: 30)
|
| 961 |
+
- `--max_poll_attempts`: Maximum polling attempts before timeout (default: 120)
|
| 962 |
|
| 963 |
### Output files
|
| 964 |
|