Building a Serverless Email Document Extraction Solution with AWS Textract: Part 3 - Routing Objects for Downstream Processing

In the previous post of this series, we tackled how to land inbound emails routed to an entire domain using SES, a Lambda function, and an S3 bucket. As the whole point of these posts is parsing image-based invoices using AWS Textract, you’re probably wondering how we get from files in S3 to magic, OCR-extracted text. This post gets us most of the way there and addresses some points of improvement on our original solution. We also add the function to our serverless application that actually gets our extracted text back from Textract. In a future post, we’ll stand up a DynamoDB table for storing our outputs and then look at ways to interact with the data we’ve stored there.

Building a Serverless Email Document Extraction Solution with AWS Textract: Part 2 - Landing Inbound Emails

In the first post of this series, we looked at a solution that lets us define a serverless, email-based workflow to extract relevant information from auto maintenance invoices. Even in this age of accelerated digital transformation, there are still many scenarios in business and life where we receive data that is not in a machine-friendly format; we are building this solution to address these kinds of situations. We use the Serverless Framework to build the core of this solution. We create a few resources and configuration items manually in this post, but you are free to manage these elements with something like CloudFormation or Terraform if desired. This post focuses on the resources highlighted in the figure below, where we design a solution to land incoming emails with S3 and SES and sort them with a Lambda function:

Figure: Solution Focus

Building a Serverless Email Document Extraction Solution with AWS Textract: Part 1 - Overview

Earlier this year, I tried to consolidate all of my automotive maintenance histories into a database-backed system that would give me the lowest-friction way to keep up with my records. At the time, I settled on building out a solution using Airtable, and I was able to set it up very quickly. I am honestly quite happy with the outcome, except that I am still manually keying records into either the Airtable app or their site from the paper records that my auto shop gives me on every visit. Ideally, I would like a solution that handles the data extraction from the paper records I get from my shop and stores it in a structured format that I can easily consume.