BiospexOcrProcessor

Purpose

Performs optical character recognition (OCR) on images using Tesseract.js. It is optimized for label text extraction while filtering out noise like rulers or barcodes.

Workflow

Trigger: Runs automatically on S3 ObjectCreated events (typically after BiospexImageFetcher uploads an image).
Analysis: Retrieves the image and its metadata (from the S3 object's Metadata field).
OCR: Executes Tesseract.js with PSM 1 (Page Segmentation Mode: Automatic with OSD) using English and Latin language packs.
Filtering: Processes extracted text blocks to remove:
- Small "noise" blocks (bounding-box area < 2000 square pixels).
- Thin vertical/horizontal lines (rulers).
- Low-confidence results (confidence < 20%).
Cleanup: Deletes the source image from S3 after processing.
Callback: Sends the extracted text and status back to the Laravel app via SQS.
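
The Filtering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the processor's actual code: it assumes each text block carries a bbox with x0/y0/x1/y1 pixel coordinates, a confidence value in percent, and a text field. The area and confidence thresholds come from the Filtering list above; the aspect-ratio cutoff used to detect "thin" blocks and all function names are hypothetical.

```javascript
// Thresholds stated in the Workflow section.
const MIN_AREA_PX = 2000;   // drop blocks whose bounding-box area is smaller
const MIN_CONFIDENCE = 20;  // percent; drop blocks below this confidence

// Hypothetical value: the README does not say how "thin" is measured.
const MAX_ASPECT_RATIO = 15;

// Returns true when a block looks like real label text.
function keepBlock(block) {
  const width = block.bbox.x1 - block.bbox.x0;
  const height = block.bbox.y1 - block.bbox.y0;
  if (width <= 0 || height <= 0) return false;          // degenerate box
  if (width * height < MIN_AREA_PX) return false;       // small noise
  const aspect = Math.max(width / height, height / width);
  if (aspect > MAX_ASPECT_RATIO) return false;          // thin line, e.g. a ruler
  return block.confidence >= MIN_CONFIDENCE;            // low-confidence guess
}

// Keeps the text of surviving blocks, trimmed, skipping empty strings.
function filterBlocks(blocks) {
  return blocks
    .filter(keepBlock)
    .map((block) => block.text.trim())
    .filter((text) => text.length > 0);
}

// Illustrative input only; these are not real OCR results.
const sample = [
  { bbox: { x0: 100, y0: 100, x1: 400, y1: 200 }, confidence: 85, text: "Quercus alba L.\n" },
  { bbox: { x0: 10, y0: 10, x1: 40, y1: 40 }, confidence: 90, text: "." },         // area 900
  { bbox: { x0: 0, y0: 500, x1: 1200, y1: 520 }, confidence: 70, text: "|||||" },  // aspect 60
  { bbox: { x0: 200, y0: 300, x1: 400, y1: 400 }, confidence: 12, text: "q7#" },   // confidence 12
];

console.log(filterBlocks(sample)); // [ 'Quercus alba L.' ]
```

Keeping the three checks in one predicate makes each threshold easy to tune independently when label layouts change.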

Inputs/Outputs

Inputs (S3 Event):
- S3 Object Key: The path to the image in the bucket.
- S3 Object Metadata: queue-id, file-id, subject-id, updates-url.
Outputs:
- SQS: Extracted text and status notification sent to updates-url.
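
To make the inputs and outputs concrete, here is a rough sketch of reading one S3 event record and assembling the callback payload. The record layout (s3.bucket.name, s3.object.key, with the key URL-encoded) follows the standard S3 event notification format. Everything else is an assumption for illustration: the function names, the JSON field names in the message body, the status string, and the idea that updates-url holds the SQS queue URL. Fetching the object metadata and sending the message are left out.

```javascript
// Metadata keys listed under Inputs.
const REQUIRED_METADATA = ["queue-id", "file-id", "subject-id", "updates-url"];

// Extracts bucket and key from one S3 event record.
// S3 URL-encodes object keys in notifications and uses "+" for spaces.
function parseS3Record(record) {
  return {
    bucket: record.s3.bucket.name,
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
  };
}

// Builds a hypothetical callback message from object metadata and OCR output.
function buildCallback(metadata, text, status) {
  const missing = REQUIRED_METADATA.filter((name) => !metadata[name]);
  if (missing.length > 0) {
    throw new Error(`Missing S3 metadata: ${missing.join(", ")}`);
  }
  return {
    queueUrl: metadata["updates-url"], // assumed to be the SQS queue URL
    body: JSON.stringify({
      queueId: metadata["queue-id"],
      fileId: metadata["file-id"],
      subjectId: metadata["subject-id"],
      status,
      text,
    }),
  };
}

// Illustrative values only.
const record = {
  s3: { bucket: { name: "example-bucket" }, object: { key: "ocr/image+one%281%29.jpg" } },
};
const metadata = {
  "queue-id": "42",
  "file-id": "abc123",
  "subject-id": "subj-9",
  "updates-url": "https://sqs.example.invalid/updates",
};

console.log(parseS3Record(record).key); // ocr/image one(1).jpg
console.log(buildCallback(metadata, "Quercus alba L.", "success").body);
```

Validating the metadata before building the message means an image uploaded without the expected fields fails early instead of producing a callback the Laravel side cannot match to a record.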

Related Components

Laravel Command: App\Console\Commands\SqsListenerOcrUpdate (Listens for success status).
Laravel Job: App\Jobs\TesseractOcrUpdateJob (Processes the extracted text in the Laravel database).
Related Lambda: BiospexImageFetcher (The primary source of images for this processor).

Deployment

Use the deploy.sh script for interactive deployment to AWS (Region: us-east-2).

