seanpedrickcase committed
Commit 419fb7d · 1 Parent(s): c2d2ccd

Allow Tesseract to run OCR in line-level mode and then query the LLM with line-level data. Added an option for running as an MCP server, and added an API for multi-word text search.

README.md CHANGED
@@ -10,11 +10,13 @@ license: agpl-3.0
 ---
 # Document redaction
 
-version: 1.5.1
+version: 1.5.2
 
 Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
 
-To identify text in documents, the 'Local' text extraction uses PikePDF, and OCR image analysis uses Tesseract, and works well only for documents with typed text or scanned PDFs with clear text. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. For PII identification, 'Local' (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.
+To identify text in documents, the 'Local' text extraction uses PikePDF, and OCR image analysis uses Tesseract, and works well only for documents with typed text or scanned PDFs with clear text. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. PaddleOCR and VLM support is also provided (see the installation instructions below).
+
+For PII identification, 'Local' (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.
 
 Additional options on the 'Redaction settings' include, the type of information to redact (e.g. people, places), custom terms to include/ exclude from redaction, fuzzy matching, language settings, and whole page redaction. After redaction is complete, you can view and modify suggested redactions on the 'Review redactions' tab to quickly create a final redacted document.
 
@@ -589,8 +591,6 @@ The workflow is designed to be simple: **Search → Select → Redact**.
 
 #### **Step 1: Search for Text**
 
-#### **Step 1: Search for Text**
-
 1. Navigate to the **"Search text to make new redactions"** tab.
 2. The main table will initially be populated with all the text extracted from the document for a page, broken down by word.
 3. To narrow this down, use the **"Multi-word text search"** box to type the word or phrase you want to find (this will search the whole document). If you want to do a regex-based search, tick the 'Enable regex pattern matching' box under 'Search options' below (Note this will only be able to search for patterns in text within each cell).
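The note in step 3 about regex matching applying "within each cell" is easier to see with a toy example. The sketch below is illustrative only (it is not the app's own search code): in a word-level OCR table each row holds one word, so a regex filter operates cell by cell, while a multi-word phrase has to be matched against the words joined back together.

import pandas as pd

# Toy word-level OCR table (one word per row), illustrative only
words = pd.DataFrame({"page": [1, 1, 1, 1], "text": ["Contact", "John", "Smith", "today"]})

# Regex search: evaluated against each cell individually, so it can only match single words
regex_hits = words[words["text"].str.contains(r"^Smi", case=False, regex=True)]
print(regex_hits)

# Multi-word search: join consecutive words and look for the whole phrase
phrase_found = "john smith" in " ".join(words["text"]).lower()
print("Phrase found:", phrase_found)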
app.py CHANGED
@@ -119,6 +119,7 @@ from tools.config import (
     RUN_AWS_FUNCTIONS,
     RUN_DIRECT_MODE,
     RUN_FASTAPI,
+    RUN_MCP_SERVER,
     S3_ACCESS_LOGS_FOLDER,
     S3_ALLOW_LIST_PATH,
     S3_COST_CODES_PATH,
@@ -1258,7 +1259,7 @@ with blocks:
         open=EXTRACTION_AND_PII_OPTIONS_OPEN_BY_DEFAULT,
     ):
         local_ocr_method_radio = gr.Radio(
-            label="""Choose local OCR model. "tesseract" is the default and will work for most documents. "paddle" is accurate for whole line text extraction, but word-level extract is not natively supported, and so word bounding boxes will be inaccurate. "hybrid-paddle" is a combination of the two - first pass through the redactions will be done with Tesseract, and then a second pass will be done with the chosen hybrid model (default PaddleOCR) on words with low confidence. "hybrid-vlm" is a combination of the two - first pass through the redactions will be done with Tesseract, and then a second pass will be done with the chosen vision model (default Dots.OCR) on words with low confidence. "hybrid-paddle-vlm" is a combination of PaddleOCR with the chosen vision model (default Dots.OCR) on words with low confidence.""",
+            label="""Choose a local OCR model. "tesseract" is the default and will work for documents with clear typed text. "paddle" is more accurate for text extraction where the text is not clear or well-formatted, but word-level extract is not natively supported, and so word bounding boxes will be inaccurate. The hybrid models will do a first pass with one model, and a second pass on words/phrases with low confidence with a more powerful model. "hybrid-paddle" will do the first pass with Tesseract, and the second with PaddleOCR. "hybrid-vlm" is a combination of Tesseract for OCR, and a second pass with the chosen vision model (VLM). "hybrid-paddle-vlm" is a combination of PaddleOCR with the chosen VLM.""",
             value=CHOSEN_LOCAL_OCR_MODEL,
             choices=LOCAL_OCR_MODEL_OPTIONS,
             interactive=True,
@@ -4755,7 +4756,7 @@ with blocks:
         duplicate_files_out,
         full_duplicate_data_by_file,
     ],
-    )
+    api_name="word_level_ocr_text_search")
 
     # Clicking on a cell in the redact items table will take you to that page
     all_page_line_level_ocr_results_with_words_df.select(
@@ -6549,6 +6550,7 @@ with blocks:
         max_file_size=MAX_FILE_SIZE,
         path=FASTAPI_ROOT_PATH,
         favicon_path=Path(FAVICON_PATH),
+        mcp_server=RUN_MCP_SERVER,
     )
 
     # Example command to run in uvicorn (in python): uvicorn.run("app:app", host=GRADIO_SERVER_NAME, port=GRADIO_SERVER_PORT)
@@ -6566,6 +6568,7 @@ with blocks:
             server_port=GRADIO_SERVER_PORT,
             root_path=ROOT_PATH,
             favicon_path=Path(FAVICON_PATH),
+            mcp_server=RUN_MCP_SERVER,
         )
     else:
         blocks.launch(
@@ -6576,6 +6579,7 @@ with blocks:
             server_port=GRADIO_SERVER_PORT,
             root_path=ROOT_PATH,
            favicon_path=Path(FAVICON_PATH),
+            mcp_server=RUN_MCP_SERVER,
        )

 else:
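Because the search event now has an api_name, it can be driven programmatically with gradio_client; and since launch() is now passed mcp_server=RUN_MCP_SERVER, Gradio can additionally expose the app's named endpoints as MCP tools when that flag is true. A rough sketch is below; the endpoint's exact inputs and outputs are not visible in this diff, so the URL, search phrase, and argument list are placeholders to be checked against view_api().

from gradio_client import Client

# Connect to a running instance of the app (URL is a placeholder)
client = Client("http://localhost:7860")

# Inspect the named endpoints, including the new "/word_level_ocr_text_search" route
client.view_api()

# Hypothetical call: the positional arguments must match the inputs reported by view_api()
result = client.predict(
    "John Smith",  # example multi-word search phrase
    api_name="/word_level_ocr_text_search",
)
print(result)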
pyproject.toml CHANGED
@@ -2,17 +2,16 @@
 requires = ["setuptools>=61.0", "wheel"]
 build-backend = "setuptools.build_meta"
 
+[project.urls]
+Homepage = "https://seanpedrick-case.github.io/doc_redaction/"
+Repository = "https://github.com/seanpedrick-case/doc_redaction"
+
 [project]
 name = "doc_redaction"
-version = "1.5.1"
+version = "1.5.2"
 description = "Redact PDF/image-based documents, Word, or CSV/XLSX files using a Gradio-based GUI interface"
 readme = "README.md"
 requires-python = ">=3.10"
-
-[project.urls]
-Homepage = "https://seanpedrick-case.github.io/doc_redaction/"
-Repository = "https://github.com/seanpedrick-case/doc_redaction"
-
 dependencies = [
     "pdfminer.six==20250506",
     "pdf2image==1.17.0",
@@ -61,7 +60,7 @@ paddle = [
 # Extra dependencies for VLM models
 # For torch you should use --index-url https://download.pytorch.org/whl/cu126 for cuda support for paddleocr, need to install manually
 vlm = [
-    "torch<=2.5.1,<=2.8.0",
+    "torch>=2.5.1,<=2.8.0",
     "torchvision>=0.20.1",
     "transformers==4.57.1",
     "accelerate==1.11.0",
tools/config.py CHANGED
@@ -281,6 +281,8 @@ FAVICON_PATH = get_or_create_env_var("FAVICON_PATH", "favicon.png")
 
 RUN_FASTAPI = convert_string_to_boolean(get_or_create_env_var("RUN_FASTAPI", "False"))
 
+RUN_MCP_SERVER = convert_string_to_boolean(get_or_create_env_var("RUN_MCP_SERVER", "False"))
+
 MAX_QUEUE_SIZE = int(get_or_create_env_var("MAX_QUEUE_SIZE", "5"))
 
 MAX_FILE_SIZE = get_or_create_env_var("MAX_FILE_SIZE", "250mb").lower()
@@ -492,7 +494,7 @@ OVERWRITE_EXISTING_OCR_RESULTS = convert_string_to_boolean(
 ### Local OCR model - Tesseract vs PaddleOCR
 CHOSEN_LOCAL_OCR_MODEL = get_or_create_env_var(
     "CHOSEN_LOCAL_OCR_MODEL", "tesseract"
-) # Choose between "tesseract", "hybrid-paddle", and "paddle". "paddle" is accurate for whole line text extraction, but word-level extract is not natively supported, and so word bounding boxes will be inaccurate. "hybrid-paddle" is a combination of the two - first pass through the redactions will be done with Tesseract, and then a second pass will be done with the chosen hybrid model (default PaddleOCR) on words with low confidence. "hybrid-vlm" is a combination of the two - first pass through the redactions will be done with Tesseract, and then a second pass will be done with the chosen vision model (default Dots.OCR) on words with low confidence. "hybrid-paddle-vlm" is a combination of PaddleOCR with the chosen vision model (default Dots.OCR) on words with low confidence.
+) # "tesseract" is the default and will work for documents with clear typed text. "paddle" is more accurate for text extraction where the text is not clear or well-formatted, but word-level extract is not natively supported, and so word bounding boxes will be inaccurate. The hybrid models will do a first pass with one model, and a second pass on words/phrases with low confidence with a more powerful model. "hybrid-paddle" will do the first pass with Tesseract, and the second with PaddleOCR. "hybrid-vlm" is a combination of Tesseract for OCR, and a second pass with the chosen vision model (VLM). "hybrid-paddle-vlm" is a combination of PaddleOCR with the chosen VLM.
 
 SHOW_LOCAL_OCR_MODEL_OPTIONS = convert_string_to_boolean(
     get_or_create_env_var("SHOW_LOCAL_OCR_MODEL_OPTIONS", "False")
@@ -525,6 +527,10 @@ HYBRID_OCR_PADDING = int(
     get_or_create_env_var("HYBRID_OCR_PADDING", "1")
 ) # The padding to add to the text when passing it to PaddleOCR for re-extraction using the hybrid OCR method.
 
+TESSERACT_WORD_LEVEL_OCR = convert_string_to_boolean(
+    get_or_create_env_var("TESSERACT_WORD_LEVEL_OCR", "True")
+) # Whether to use Tesseract word-level OCR.
+
 TESSERACT_SEGMENTATION_LEVEL = int(
     get_or_create_env_var("TESSERACT_SEGMENTATION_LEVEL", "11")
 ) # Tesseract segmentation level: PSM level to use for Tesseract OCR
@@ -553,6 +559,10 @@ SAVE_PAGE_OCR_VISUALISATIONS = convert_string_to_boolean(
     get_or_create_env_var("SAVE_PAGE_OCR_VISUALISATIONS", "False")
 ) # Whether to save visualisations of Tesseract, PaddleOCR, and Textract bounding boxes.
 
+SAVE_WORD_SEGMENTER_OUTPUT_IMAGES = convert_string_to_boolean(
+    get_or_create_env_var("SAVE_WORD_SEGMENTER_OUTPUT_IMAGES", "False")
+) # Whether to save output images from the word segmenter.
+
 # Model storage paths for Lambda compatibility
 PADDLE_MODEL_PATH = get_or_create_env_var(
     "PADDLE_MODEL_PATH", ""
tools/custom_image_analyser_engine.py CHANGED
@@ -10,6 +10,7 @@ import botocore
 import cv2
 import gradio as gr
 import numpy as np
+import pandas as pd
 import pytesseract
 from pdfminer.layout import LTChar
 from PIL import Image
@@ -34,6 +35,7 @@ from tools.config import (
     SAVE_VLM_INPUT_IMAGES,
     SELECTED_MODEL,
     TESSERACT_SEGMENTATION_LEVEL,
+    TESSERACT_WORD_LEVEL_OCR,
     VLM_MAX_DPI,
     VLM_MAX_IMAGE_SIZE,
 )
@@ -1238,11 +1240,13 @@ class CustomImageAnalyzerEngine:
             print(
                 f"Warning: Image dimension mismatch! Expected {image_width}x{image_height}, but got {actual_width}x{actual_height}"
             )
-            print(f"Using actual dimensions: {actual_width}x{actual_height}")
+            #print(f"Using actual dimensions: {actual_width}x{actual_height}")
             # Update to use actual dimensions
             image_width = actual_width
             image_height = actual_height
 
+        print("segmenting line-level OCR results to word-level...")
+
         segmenter = AdaptiveSegmenter(output_folder=self.output_folder)
 
         # Process each line
@@ -1591,6 +1595,30 @@ class CustomImageAnalyzerEngine:
             1,
         )
 
+    # Calculate line-level bounding boxes and average confidence
+    def _calculate_line_bbox(self, group):
+        # Get the leftmost and rightmost positions
+        left = group["left"].min()
+        top = group["top"].min()
+        right = (group["left"] + group["width"]).max()
+        bottom = (group["top"] + group["height"]).max()
+
+        # Calculate width and height
+        width = right - left
+        height = bottom - top
+
+        # Calculate average confidence
+        avg_conf = round(group["conf"].mean(), 0)
+
+        return pd.Series({
+            "text": " ".join(group["text"].astype(str).tolist()),
+            "left": left,
+            "top": top,
+            "width": width,
+            "height": height,
+            "conf": avg_conf,
+        })
+
     def _perform_hybrid_ocr(
         self,
         image: Image.Image,
@@ -1600,8 +1628,22 @@ class CustomImageAnalyzerEngine:
         image_name: str = "unknown_image_name",
     ) -> Dict[str, list]:
         """
-        Performs OCR using Tesseract for bounding boxes and PaddleOCR/VLM for low-confidence text.
-        Returns data in the same dictionary format as pytesseract.image_to_data.
+        Performs hybrid OCR on an image using Tesseract for initial OCR and PaddleOCR/VLM to enhance
+        results for low-confidence or uncertain words.
+
+        Args:
+            image (Image.Image): The input image (PIL format) to be processed.
+            confidence_threshold (int, optional): Tesseract confidence threshold below which words are
+                re-analyzed with secondary OCR (PaddleOCR/VLM). Defaults to HYBRID_OCR_CONFIDENCE_THRESHOLD.
+            padding (int, optional): Pixel padding (in all directions) to add around each word box when
+                cropping for secondary OCR. Defaults to HYBRID_OCR_PADDING.
+            ocr (Optional[Any], optional): An instance of the PaddleOCR or VLM engine. If None, will use the
+                instance's `paddle_ocr` attribute if available. Only necessary for PaddleOCR-based pipelines.
+            image_name (str, optional): Optional name of the image, useful for debugging and visualization.
+
+        Returns:
+            Dict[str, list]: OCR results in the dictionary format of pytesseract.image_to_data (keys:
+                'text', 'left', 'top', 'width', 'height', 'conf', 'model', ...).
         """
         # Determine if we're using VLM or PaddleOCR
         use_vlm = self.ocr_engine == "hybrid-vlm"
@@ -1615,15 +1657,37 @@ class CustomImageAnalyzerEngine:
                 "No OCR object provided and 'paddle_ocr' is not initialized."
            )
 
-        print("Starting hybrid OCR process...")
+        #print("Starting hybrid OCR process...")
 
         # 1. Get initial word-level results from Tesseract
         tesseract_data = pytesseract.image_to_data(
             image,
             output_type=pytesseract.Output.DICT,
             config=self.tesseract_config,
             lang=self.tesseract_lang,
         )
+
+        if TESSERACT_WORD_LEVEL_OCR is False:
+            ocr_df = pd.DataFrame(tesseract_data)
+
+            # Filter out invalid entries (confidence == -1)
+            ocr_df = ocr_df[ocr_df.conf != -1]
+
+            # Group by line and aggregate text
+            line_groups = ocr_df.groupby(["block_num", "par_num", "line_num"])
+
+            ocr_data = line_groups.apply(self._calculate_line_bbox).reset_index()
+
+            # Overwrite tesseract_data with the aggregated data
+            tesseract_data = {
+                "text": ocr_data["text"].tolist(),
+                "left": ocr_data["left"].astype(int).tolist(),
+                "top": ocr_data["top"].astype(int).tolist(),
+                "width": ocr_data["width"].astype(int).tolist(),
+                "height": ocr_data["height"].astype(int).tolist(),
+                "conf": ocr_data["conf"].tolist(),
+                "model": ["Tesseract"] * len(ocr_data),  # Add model field
+            }
 
         final_data = {
             "text": list(),
@@ -1708,7 +1772,7 @@ class CustomImageAnalyzerEngine:
                     text, new_text, conf, new_conf, ocr_type
                 )
 
-                if SAVE_EXAMPLE_HYBRID_IMAGES is True:
+                if SAVE_EXAMPLE_HYBRID_IMAGES:
                     # Normalize and validate image_name to prevent path traversal attacks
                     normalized_image_name = os.path.normpath(
                         image_name + "_" + ocr_type
@@ -2196,6 +2260,28 @@ class CustomImageAnalyzerEngine:
                 lang=self.tesseract_lang, # Ensure the Tesseract language data (e.g., fra.traineddata) is installed on your system.
             )
 
+            if TESSERACT_WORD_LEVEL_OCR is False:
+                ocr_df = pd.DataFrame(ocr_data)
+
+                # Filter out invalid entries (confidence == -1)
+                ocr_df = ocr_df[ocr_df.conf != -1]
+
+                # Group by line and aggregate text
+                line_groups = ocr_df.groupby(["block_num", "par_num", "line_num"])
+
+                ocr_data = line_groups.apply(self._calculate_line_bbox).reset_index()
+
+                # Convert DataFrame to dictionary of lists format expected by downstream code
+                ocr_data = {
+                    "text": ocr_data["text"].tolist(),
+                    "left": ocr_data["left"].astype(int).tolist(),
+                    "top": ocr_data["top"].astype(int).tolist(),
+                    "width": ocr_data["width"].astype(int).tolist(),
+                    "height": ocr_data["height"].astype(int).tolist(),
+                    "conf": ocr_data["conf"].tolist(),
+                    "model": ["Tesseract"] * len(ocr_data),  # Add model field
+                }
+
         elif self.ocr_engine == "paddle" or self.ocr_engine == "hybrid-paddle-vlm":
 
             if ocr is None:
@@ -2371,13 +2457,15 @@ class CustomImageAnalyzerEngine:
 
         # Convert line-level results to word-level if configured and needed
         if CONVERT_LINE_TO_WORD_LEVEL and self._is_line_level_data(ocr_data):
-            # print("Converting line-level OCR results to word-level...")
+            #print("Converting line-level OCR results to word-level...")
+
             # Check if coordinates need to be scaled to match the image we're cropping from
             # For PaddleOCR: _convert_paddle_to_tesseract_format converts coordinates to original image space
             # - If PaddleOCR processed the original image (image_path provided), crop from original image (no scaling)
             # - If PaddleOCR processed the preprocessed image (no image_path), scale coordinates to preprocessed space and crop from preprocessed image
-            # For Tesseract: OCR runs on preprocessed image, so coordinates are already in preprocessed space,
-            # matching the preprocessed image we're cropping from - no scaling needed
+            # For Tesseract: OCR runs on preprocessed image
+            # - If scale_factor != 1.0, rescale_ocr_data converted coordinates to original space, so crop from original image
+            # - If scale_factor == 1.0, coordinates are still in preprocessed space, so crop from preprocessed image
 
             needs_scaling = False
             crop_image = image  # Default to preprocessed image
@@ -2405,6 +2493,19 @@ class CustomImageAnalyzerEngine:
                 else:
                     # PaddleOCR processed the preprocessed image, so scale coordinates to preprocessed space
                     needs_scaling = True
+            elif self.ocr_engine == "tesseract":
+                # For Tesseract: if scale_factor != 1.0, rescale_ocr_data converted coordinates to original space
+                # So we need to crop from the original image, not the preprocessed image
+                if scale_factor != 1.0 and original_image_for_visualization is not None:
+                    # Coordinates are in original space, so crop from original image
+                    crop_image = original_image_for_visualization
+                    crop_image_width = original_image_width
+                    crop_image_height = original_image_height
+                    needs_scaling = False
+                else:
+                    # scale_factor == 1.0, so coordinates are still in preprocessed space
+                    # Crop from preprocessed image - no scaling needed
+                    needs_scaling = False
 
             if needs_scaling:
                 # Calculate scale factors from original to preprocessed
@@ -2488,7 +2589,8 @@ class CustomImageAnalyzerEngine:
         def get_model(idx):
             return default_model
 
-        return [
+
+        output = [
             OCRResult(
                 text=clean_unicode_text(ocr_result["text"][i]),
                 left=ocr_result["left"][i],
@@ -2497,11 +2599,12 @@ class CustomImageAnalyzerEngine:
                 height=ocr_result["height"][i],
                 conf=round(float(ocr_result["conf"][i]), 0),
                 model=get_model(i),
-                # line_number=ocr_result['abs_line_id'][i]
             )
             for i in valid_indices
         ]
 
+        return output
+
     def analyze_text(
         self,
         line_level_ocr_results: List[OCRResult],
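The line-level mode added above amounts to grouping pytesseract's word boxes by (block, paragraph, line) and merging each group into a single box with the averaged confidence, which is then passed along in place of word-level data. Below is a standalone sketch of that aggregation, independent of the class; "page.png" is a placeholder input.

import pandas as pd
import pytesseract
from PIL import Image

# Word-level Tesseract output as a dictionary of lists
data = pytesseract.image_to_data(Image.open("page.png"), output_type=pytesseract.Output.DICT)
df = pd.DataFrame(data)
df["conf"] = pd.to_numeric(df["conf"], errors="coerce")
df = df[df["conf"] != -1]  # drop entries Tesseract marks as non-text


def line_bbox(group: pd.DataFrame) -> pd.Series:
    # Union of the word boxes in the line, plus the mean word confidence
    left, top = group["left"].min(), group["top"].min()
    right = (group["left"] + group["width"]).max()
    bottom = (group["top"] + group["height"]).max()
    return pd.Series(
        {
            "text": " ".join(group["text"].astype(str)),
            "left": left,
            "top": top,
            "width": right - left,
            "height": bottom - top,
            "conf": round(group["conf"].mean(), 0),
        }
    )


lines = (
    df.groupby(["block_num", "par_num", "line_num"])
    .apply(line_bbox)
    .reset_index()
)
print(lines[["text", "left", "top", "width", "height", "conf"]])

Each resulting row carries the joined line text plus the union of its word boxes, mirroring what _calculate_line_bbox produces inside the engine.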
tools/word_segmenter.py CHANGED
@@ -4,7 +4,7 @@ from typing import Dict, List, Tuple
 import cv2
 import numpy as np
 
-from tools.config import OUTPUT_FOLDER
+from tools.config import OUTPUT_FOLDER, SAVE_WORD_SEGMENTER_OUTPUT_IMAGES
 
 INITIAL_KERNEL_WIDTH_FACTOR = 0.05 # Default 0.05
 INITIAL_VALLEY_THRESHOLD_FACTOR = 0.05 # Default 0.05
@@ -15,7 +15,6 @@ MIN_SPACE_FACTOR = 0.3 # Default 0.4
 MATCH_TOLERANCE = 0 # Default 0
 MIN_AREA_THRESHOLD = 6 # Default 6
 DEFAULT_TRIM_PERCENTAGE = 0.2 # Default 0.2
-SHOW_OUTPUT_IMAGES = False # Default False
 
 
 class AdaptiveSegmenter:
@@ -291,7 +290,7 @@ class AdaptiveSegmenter:
         # print(f"line_text: {line_text}")
         shortened_line_text = line_text.replace(" ", "_")[:10]
 
-        if SHOW_OUTPUT_IMAGES:
+        if SAVE_WORD_SEGMENTER_OUTPUT_IMAGES:
             os.makedirs(self.output_folder, exist_ok=True)
             output_path = f"{self.output_folder}/word_segmentation/{image_name}_{shortened_line_text}_original.png"
             os.makedirs(f"{self.output_folder}/word_segmentation", exist_ok=True)
@@ -346,7 +345,7 @@ class AdaptiveSegmenter:
             return ({}, False)
 
         # Save deskewed image (optional, only if image_name is provided)
-        if SHOW_OUTPUT_IMAGES:
+        if SAVE_WORD_SEGMENTER_OUTPUT_IMAGES:
            os.makedirs(self.output_folder, exist_ok=True)
            output_path = f"{self.output_folder}/word_segmentation/{image_name}_{shortened_line_text}_deskewed.png"
            os.makedirs(f"{self.output_folder}/word_segmentation", exist_ok=True)
@@ -402,7 +401,7 @@ class AdaptiveSegmenter:
             return ({}, False)
 
         # Save cropped image (optional, only if image_name is provided)
-        if SHOW_OUTPUT_IMAGES:
+        if SAVE_WORD_SEGMENTER_OUTPUT_IMAGES:
            os.makedirs(self.output_folder, exist_ok=True)
            output_path = f"{self.output_folder}/word_segmentation/{image_name}_{shortened_line_text}_binary.png"
            os.makedirs(f"{self.output_folder}/word_segmentation", exist_ok=True)
@@ -436,7 +435,7 @@ class AdaptiveSegmenter:
         # dilated_binary = cv2.dilate(closed_binary, kernel, iterations=1)
         # Use 'closed_binary' (or 'dilated_binary') from now on.
 
-        if SHOW_OUTPUT_IMAGES:
+        if SAVE_WORD_SEGMENTER_OUTPUT_IMAGES:
            os.makedirs(self.output_folder, exist_ok=True)
            output_path = f"{self.output_folder}/word_segmentation/{image_name}_{shortened_line_text}_closed_binary.png"
            os.makedirs(f"{self.output_folder}/word_segmentation", exist_ok=True)
@@ -633,7 +632,7 @@ class AdaptiveSegmenter:
         # print(f"Target word count: {target_word_count}")
 
         # Save cropped image (optional, only if image_name is provided)
-        if SHOW_OUTPUT_IMAGES:
+        if SAVE_WORD_SEGMENTER_OUTPUT_IMAGES:
            os.makedirs(self.output_folder, exist_ok=True)
            output_path = f"{self.output_folder}/word_segmentation/{image_name}_{shortened_line_text}_clean_binary.png"
            os.makedirs(f"{self.output_folder}/word_segmentation", exist_ok=True)
@@ -898,7 +897,7 @@ class AdaptiveSegmenter:
                 remapped_output[key].append(box[key])
 
         # Visualisation
-        if SHOW_OUTPUT_IMAGES:
+        if SAVE_WORD_SEGMENTER_OUTPUT_IMAGES:
            output_path = f"{self.output_folder}/word_segmentation/{image_name}_{shortened_line_text}_final_boxes.png"
            os.makedirs(f"{self.output_folder}/word_segmentation", exist_ok=True)
            output_image_vis = line_image.copy()