Commits · seanpedrickcase/document

Allowed Tesseract to run OCR in line-level mode and then query an LLM with the line-level data. Added an option to run as an MCP server and an API for multi-word text search

419fb7d

seanpedrickcase committed on 11 days ago

Minor update to cli_redact for new local OCR model options. Updated app_settings.qmd, user_guide.qmd, and readme.md with descriptions of new features

d5b5291

seanpedrickcase committed on 11 days ago

Fixed minor bugs related to Textract API calls, pyproject format. Removed print statements and fixed some future concat deprecation issues

7bb945f

seanpedrickcase committed on 11 days ago

Improved paddle and hybrid OCR analysis across all options. Tried to revise requirements for spaces

2c00d05

seanpedrickcase committed on 12 days ago

Updated requirements

1935c45

seanpedrickcase committed on 13 days ago

Updated dependencies, github to HF workflow

059a5f7

seanpedrickcase committed on 14 days ago

Added hybrid paddle + vlm option. Optimised word segmenters for single words. Optimised package installation in pyproject.toml

6d4f6e4

seanpedrickcase committed on 15 days ago

Added upgraded line to word parsing algorithm. Added dependencies and framework for Huggingface spaces deployment with ZeroGPU

c2becd8

seanpedrickcase committed on 15 days ago

Updated user guide and app settings. Updated some additional lambda_entrypoint arguments. Ensured that examples are correctly displayed on GUI.

c543ba0

seanpedrickcase committed on 23 days ago

Updated paddleocr implementation to have a menu option on the GUI with a config value change. Minor package updates, favicon update, and update to Dockerfile to allow for Lambda function execution.

0d7ad2a

seanpedrickcase committed on 27 days ago

Fixed text inclusion in review pdf outputs from apply_redactions_to_review_df... function. Apply redaction pymupdf text/graphic/image options are now modifiable. Tables on review screen should now be able to use Gradio column filter options. Moved some functions to more logical location.

b597212

seanpedrickcase committed on Oct 2

Moved from gunicorn to uvicorn for AWS deployment

799caf1

seanpedrickcase committed on Oct 1

Added gunicorn to requirements for when building Dockerfile based on FastAPI rather than Gradio directly. Fixed some minor file path issues. Set return review PDF as default.

b38d4b9

seanpedrickcase committed on Sep 30

Added capability of loading in redaction annotations from PDF documents directly into the app. Minor function documentation improvements, GUI changes, package updates.

b61459d

seanpedrickcase committed on Sep 29

Removed some extraneous test steps. Improved Example loading and feedback, and redaction feedback. Minor security updates. Fixed Adobe xfdf file parsing.

1cb1897

seanpedrickcase committed on Sep 25

Updated Windows Tesseract install location for test

96b0e0e

seanpedrickcase committed on Sep 24

Fixed duplicate page argument mismatch. Readded Windows tests. Added refresh token options to cdk. Package updates

ad8fef5

seanpedrickcase committed on Sep 24

Fixed deprecated GitHub workflow functions. Applied linter and formatter to code throughout. Added tests for GUI load.

bafcf39

seanpedrickcase committed on Sep 23

Added a test suite based on the functions in cli_redact.py

084af54

seanpedrickcase committed on Sep 22

Added example data files. Greatly revised CLI redaction for redaction, deduplication, and AWS Textract batch calls. Various minor fixes and package updates.

d60759d

seanpedrickcase committed on Sep 21

Fix to tabular redaction, added tabular deduplication. Updated cli call capability for both

aa5c211

seanpedrickcase committed on Sep 16

Updated review functions to update with manual reviews. Minor package update

80268bb

seanpedrickcase committed on Aug 27

Corrected some issues with redacting multiple xlsx/docx files. Package updates.

6f96988

seanpedrickcase committed on Aug 22

Added PaddleOCR support

2878a94

seanpedrickcase committed on Aug 19

Added capability to redact Word files

57aca87

seanpedrickcase committed on Aug 15

Package updates

00f09d5

seanpedrickcase committed on Aug 15

Can now redact terms using a new redact search tab on the Review Redactions tab. Various minor improvements

ee6b7fb

seanpedrickcase committed on Aug 14

Updated packages. Corrected CSV logger headings, can now submit custom log csv names to S3. Started work on identifying and deduplicating at the line level

e424038

seanpedrickcase committed on Jul 2

Updated CDK code for custom KMS keys, new VPCs. Minor package updates.

9f51e70

seanpedrickcase committed on Jun 29

Updated duplicate pages interface to include subdocuments and review. Updated relevant user guide. Minor package updates

f47b137

seanpedrickcase committed on Jun 17

Adapted Dockerfile for systems with read only file system. Minor package updates.

a7566b9

seanpedrickcase committed on Jun 15

Update version numbers and readme

c28176d

seanpedrickcase committed on May 21

Updated version numbers

3270701

seanpedrickcase committed on May 20

Updated version numbers, gradio package version.

20b655f

seanpedrickcase committed on May 19

Updated gradio version. Minor changes to redactor function sequence. Minor formatting and wording changes.

5a21738

seanpedrickcase committed on May 7

Upgraded version numbers

3dbd1f7

seanpedrickcase committed on May 6

Corrected a couple of bugs. Textract whole-document API call outputs will now also load the input PDF into the app

10f46e9

seanpedrickcase committed on May 6

Updated version numbers, minor text revision

69c2af9

seanpedrickcase committed on Apr 29

Fix for image file redaction

36f8e9f

seanpedrickcase committed on Apr 29

Minor changes for cost codes, package updates. Added pyproject.toml file

47a3a80

seanpedrickcase committed on Apr 28
