AustinMastLab/BiospexOcrProcessor

BiospexOcrProcessor

Purpose

Performs optical character recognition (OCR) on images using Tesseract.js. It is optimized for extracting label text while filtering out noise such as rulers and barcodes.

Workflow

  1. Trigger: Automatically triggered by S3 ObjectCreated events (typically after BiospexImageFetcher uploads an image).
  2. Analysis: Retrieves the image and its metadata (from the S3 object's Metadata field).
  3. OCR: Executes Tesseract.js with PSM 1 (Page Segmentation Mode: Automatic with OSD) using English and Latin language packs.
  4. Filtering: Processes extracted text blocks to remove:
    • Small "noise" blocks (bounding-box area < 2000 px²).
    • Thin vertical/horizontal lines (rulers).
    • Low-confidence guesses (confidence < 20%).
  5. Cleanup: Deletes the source image from S3 after processing.
  6. Callback: Sends the extracted text and status back to the Laravel app via SQS.
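
The filtering step can be sketched as a pure function over the OCR blocks. This is a minimal illustration, not the repository's actual code. It assumes each block carries a bbox with x0, y0, x1, y1 pixel coordinates, a confidence from 0 to 100, and a text string. The area and confidence thresholds come from the list above; MAX_ASPECT_RATIO is a hypothetical value, since the README does not say how "thin" is measured.

```javascript
// Thresholds taken from the Workflow list.
const MIN_AREA = 2000;       // px^2: smaller blocks are treated as noise
const MIN_CONFIDENCE = 20;   // percent: lower-confidence blocks are dropped

// ASSUMPTION: the README gives no number for "thin" lines; 10:1 is illustrative.
const MAX_ASPECT_RATIO = 10;

// Returns true when a block looks like real label text.
function keepBlock(block) {
  const width = block.bbox.x1 - block.bbox.x0;
  const height = block.bbox.y1 - block.bbox.y0;
  if (width <= 0 || height <= 0) return false;      // degenerate box
  if (width * height < MIN_AREA) return false;      // small "noise" block
  const aspect = Math.max(width / height, height / width);
  if (aspect > MAX_ASPECT_RATIO) return false;      // thin line, e.g. a ruler
  return block.confidence >= MIN_CONFIDENCE;        // low-confidence guess
}

// Joins the text of all surviving blocks, one block per line.
function filterBlocks(blocks) {
  return blocks
    .filter(keepBlock)
    .map((block) => block.text.trim())
    .filter((text) => text.length > 0)
    .join('\n');
}
```

In a handler, filterBlocks would receive the block list produced by the OCR step, and its return value would be the text sent back in the callback.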

Inputs/Outputs

  • Inputs (S3 Event):
    • S3 Object Key: The path to the image in the bucket.
    • S3 Object Metadata: queue-id, file-id, subject-id, updates-url.
  • Outputs:
    • SQS: A message containing the extracted text and processing status, sent to the queue given by updates-url.
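
As a rough sketch of how these inputs and outputs fit together, the helpers below read the bucket and key from one S3 event record and build the parameters for an SQS send. Several things here are assumptions rather than facts from the repository: that updates-url holds an SQS queue URL, that the message body is JSON, and the body field names (queueId, fileId, subjectId, status, text). The function names are hypothetical as well. Note that the S3 event itself does not carry user metadata, so the metadata object is expected to come from a separate read of the S3 object.

```javascript
// Extracts bucket and key from a single S3 event record.
// S3 URL-encodes keys in event notifications and uses "+" for spaces.
function parseS3Record(record) {
  return {
    bucket: record.s3.bucket.name,
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
  };
}

// Metadata names listed under Inputs.
const REQUIRED_METADATA = ['queue-id', 'file-id', 'subject-id', 'updates-url'];

// Builds SQS send parameters (QueueUrl, MessageBody).
// ASSUMPTION: the JSON field names below are illustrative only.
function buildUpdateMessage(metadata, text, status) {
  const missing = REQUIRED_METADATA.filter((name) => !metadata[name]);
  if (missing.length > 0) {
    throw new Error(`Missing S3 metadata: ${missing.join(', ')}`);
  }
  return {
    QueueUrl: metadata['updates-url'], // assumed to be an SQS queue URL
    MessageBody: JSON.stringify({
      queueId: metadata['queue-id'],
      fileId: metadata['file-id'],
      subjectId: metadata['subject-id'],
      status,
      text,
    }),
  };
}
```

Validating the metadata before sending keeps a malformed upload from producing a callback that the Laravel side cannot match to a queue, file, or subject.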

Related Components

  • Laravel Command: App\Console\Commands\SqsListenerOcrUpdate (Listens for success status).
  • Laravel Job: App\Jobs\TesseractOcrUpdateJob (Processes the extracted text in the Laravel database).
  • Related Lambda: BiospexImageFetcher (The primary source of images for this processor).

Deployment

Use the deploy.sh script for interactive deployment to AWS (Region: us-east-2).
