Performs optical character recognition (OCR) on images using Tesseract.js. It is optimized for label text extraction while filtering out noise like rulers or barcodes.
- Trigger: Automatically triggered by S3 `ObjectCreated` events (typically after `BiospexImageFetcher` uploads an image).
- Analysis: Retrieves the image and its metadata (from the S3 object's `Metadata` field).
- OCR: Executes Tesseract.js with `PSM 1` (Page Segmentation Mode: automatic with OSD, i.e. orientation and script detection) using the English and Latin language packs.
- Filtering: Processes extracted text blocks to remove:
  - Small "noise" blocks (area < 2000 px²).
  - Thin vertical/horizontal lines (rulers).
  - Low-confidence guesses (< 20%).
- Cleanup: Deletes the source image from S3 after processing.
- Callback: Sends the extracted text and status back to the Laravel app via SQS.
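
The filtering step above can be sketched roughly as follows. This is a minimal illustration, not the Lambda's actual code: it assumes each block carries a `bbox` of the form `{ x0, y0, x1, y1 }` plus `confidence` and `text` fields, and the aspect-ratio cutoff used to detect "thin lines" (`MAX_ASPECT_RATIO`) is a hypothetical value, since the real threshold is not documented here.

```javascript
// Thresholds from the filtering rules above.
const MIN_AREA_PX = 2000;   // drop blocks smaller than this area
const MIN_CONFIDENCE = 20;  // drop blocks below this confidence (percent)
// ASSUMPTION: hypothetical cutoff for "thin line" detection.
const MAX_ASPECT_RATIO = 10;

// Returns true when a block looks like real label text.
function keepBlock(block) {
  const width = block.bbox.x1 - block.bbox.x0;
  const height = block.bbox.y1 - block.bbox.y0;

  // Degenerate boxes carry no usable text.
  if (width <= 0 || height <= 0) return false;

  // Small "noise" blocks.
  if (width * height < MIN_AREA_PX) return false;

  // Thin vertical or horizontal strips such as rulers.
  const aspect = Math.max(width / height, height / width);
  if (aspect > MAX_ASPECT_RATIO) return false;

  // Low-confidence guesses.
  if (block.confidence < MIN_CONFIDENCE) return false;

  return true;
}

// Keeps only the blocks that pass every rule.
function filterBlocks(blocks) {
  return blocks.filter(keepBlock);
}

// Joins the surviving blocks into a single text result.
function joinText(blocks) {
  return filterBlocks(blocks)
    .map((block) => block.text.trim())
    .filter(Boolean)
    .join('\n');
}
```

Keeping each rule as a separate early return makes it easy to tune one threshold without touching the others.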
- Inputs (S3 Event):
  - S3 Object Key: The path to the image in the bucket.
  - S3 Object Metadata: `queue-id`, `file-id`, `subject-id`, `updates-url`.
- Outputs:
  - SQS: Extracted text and status notification sent to `updates-url`.
- Laravel Command: `App\Console\Commands\SqsListenerOcrUpdate` (listens for `success` status).
- Laravel Job: `App\Jobs\TesseractOcrUpdateJob` (processes the extracted text in the Laravel database).
- Related Lambda: `BiospexImageFetcher` (the primary source of images for this processor).
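
As a rough sketch of how the inputs and output above fit together, the helpers below parse an S3 event record, validate the four metadata keys, and build SQS `SendMessage` parameters. Several things here are assumptions rather than documented behavior: the function names are hypothetical, the message body field names (`queueId`, `fileId`, `subjectId`, `status`, `text`) are illustrative, and `updates-url` is assumed to be the SQS queue URL. The metadata object is assumed to have been read separately from the S3 object, because the event record itself does not carry user metadata.

```javascript
// Metadata keys listed under "S3 Object Metadata" above.
const REQUIRED_METADATA = ['queue-id', 'file-id', 'subject-id', 'updates-url'];

// Extracts bucket and key from one S3 event record.
// Keys in S3 events are URL-encoded, with spaces sent as "+".
function parseS3Record(record) {
  const bucket = record.s3.bucket.name;
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
  return { bucket, key };
}

// Validates the object's user metadata and returns it in camelCase.
function readMetadata(metadata) {
  const missing = REQUIRED_METADATA.filter((name) => !metadata[name]);
  if (missing.length > 0) {
    throw new Error(`Missing S3 metadata: ${missing.join(', ')}`);
  }
  return {
    queueId: metadata['queue-id'],
    fileId: metadata['file-id'],
    subjectId: metadata['subject-id'],
    updatesUrl: metadata['updates-url'],
  };
}

// Builds SQS SendMessage parameters for the callback.
// ASSUMPTION: body field names are illustrative, not the real contract.
function buildCallback(meta, status, text) {
  return {
    QueueUrl: meta.updatesUrl,
    MessageBody: JSON.stringify({
      queueId: meta.queueId,
      fileId: meta.fileId,
      subjectId: meta.subjectId,
      status,
      text,
    }),
  };
}
```

Failing fast on missing metadata keeps a malformed upload from producing a callback that the Laravel listener cannot match to a record.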
Use the deploy.sh script for interactive deployment to AWS (Region: us-east-2).