Commit History

Allowed Tesseract to run OCR in line-level mode and then query an LLM with the line-level data. Added an option to run as an MCP server and an API for multi-word text search
419fb7d

seanpedrickcase committed on

Merge pull request #92 from seanpedrick-case/regex_search
c2d2ccd
unverified

Sean Pedrick-Case committed on

Added regex search feature for multi-word text search
21318d3

seanpedrickcase committed on

Minor update to cli_redact for new local OCR model options. Updated app_settings.qmd, user_guide.qmd, and readme.md with descriptions of new features
d5b5291

seanpedrickcase committed on

Fixed minor bugs related to Textract API calls, pyproject format. Removed print statements and fixed some future concat deprecation issues
7bb945f

seanpedrickcase committed on

Merge pull request #89 from seanpedrick-case/textract_type_name_output
00011db
unverified

Sean Pedrick-Case committed on

Added suffix to textract output files according to tasks included (e.g. signature analysis). Improved reporting when Textract client doesn't exist. Fixed display for cost and time taken. Changes to config variables to allow exclusion of PaddleOCR from display
25e2089

seanpedrickcase committed on

Improved Paddle and hybrid OCR analysis across all options. Tried to revise requirements for Spaces
2c00d05

seanpedrickcase committed on

Added paddle to pre-requirements.txt
01c8eb6

seanpedrickcase committed on

Allowed Paddle to load at startup. Updated requirements for torch compatibility
bf83b6f

seanpedrickcase committed on

Updated requirements for torch. Updated main hf flow to force changes to spaces repo
e59fbb7

seanpedrickcase committed on

Updated dependencies, github to HF workflow
059a5f7

seanpedrickcase committed on

Updated sync to hf workflow for zero GPU space sync
27ed5c8

seanpedrickcase committed on

Updated readme for install instructions with paddle, vlms
c3ccad4

seanpedrickcase committed on

Merge pull request #88 from seanpedrick-case/vlm_support
cd01917
unverified

Sean Pedrick-Case committed on

Similar cleanup to requirements_lightweight.txt
ef8c72e

seanpedrickcase committed on

Updated test suites to use the lightweight version of requirements.txt
f5146c7

seanpedrickcase committed on

Optimised VLM model choice and prompting/parameters
ad60619

seanpedrickcase committed on

Added hybrid paddle + vlm option. Optimised word segmenters for single words. Optimised package installation in pyproject.toml
6d4f6e4

seanpedrickcase committed on

Added upgraded line to word parsing algorithm. Added dependencies and framework for Huggingface spaces deployment with ZeroGPU
c2becd8

seanpedrickcase committed on

Improved new requirements. Improved visual OCR outputs and word-level Paddle outputs and general bounding box positioning
e4493fe

seanpedrickcase committed on

Initial commit for VLM support. Created visualisations for OCR output. Corrected log_file_output_paths reference.
5e01004

seanpedrickcase committed on

Again revised spaCy language model load for different languages
2f34683

seanpedrickcase committed on

Modified model load for custom languages with spaCy. Languages should load successfully now.
2148ddd

seanpedrickcase committed on

Changed user folder ownership in the Dockerfile to cover the whole user folder. Minor changes to documentation
bf7b066

seanpedrickcase committed on

Ensured that AWS credentials are called correctly in logger settings.
43c7a6d

seanpedrickcase committed on

Updated user guide and app settings. Updated some additional lambda_entrypoint arguments. Ensured that examples are correctly displayed on GUI.
c543ba0

seanpedrickcase committed on

head attribute added to Gradio blocks context to enable enforcement of direct vs relative file paths. Updates to direct mode/lambda entrypoint to ensure as many options as possible can be user defined
febacad

seanpedrickcase committed on

Merge pull request #80 from seanpedrick-case/main
41e7358
unverified

Sean Pedrick-Case committed on

Fix condition check for SHOW_EXAMPLES
57de024
unverified

Sean Pedrick-Case committed on

Merge pull request #79 from seanpedrick-case/dev
b0dca2c
unverified

Sean Pedrick-Case committed on

Correction to PaddleOCR config variable. Minor print statement changes
6c62394

seanpedrickcase committed on

Revised environment variables for consistency.
5f824f4

seanpedrickcase committed on

Custom env variables should now overwrite defaults for lambda function. Usage logs should now be correctly created with lambda function
6806363

seanpedrickcase committed on

Updated some config variable defaults for lambda_entrypoint (e.g. page max, min) to ensure that they are correctly parsed
8da3518

seanpedrickcase committed on

cli_usage_logger should now use custom folder input
022f8a1

seanpedrickcase committed on

Corrected environment variable file references for log files and spacy/paddle folders for lambda_entrypoint
e347a56

seanpedrickcase committed on

Added logging folders to cli_redact to ensure correct saves with read-only file systems (e.g. lambda). Updated list-based parsing of arguments in lambda_entrypoint.py
40c65f7

seanpedrickcase committed on

Updated lambda_entrypoint dict references. Redaction functions should now return files even if MAX_TIME_VALUE is exceeded. load_all_output_files should now return subfolder files
260af8f

seanpedrickcase committed on

Updated file processing for more efficient redaction for specific page ranges. Updated lambda_entrypoint to allow for environment variables from .env files, and limits to compatible file types
59caba2

seanpedrickcase committed on

Updated some config variables for lambda functions to enable successful run
20046b2

seanpedrickcase committed on

Updated cdk_stack for build commands compatible with new dockerfile. Minor changes to lambda function to specify text extraction method correctly.
e7e4e50

seanpedrickcase committed on