Building a Serverless Email Document Extraction Solution with AWS Textract: Part 1 - Overview

Earlier this year, I tried to consolidate all of my automotive maintenance histories into a database-backed system, looking for the lowest-friction way to keep up with my records. At the time, I settled on Airtable and was able to set up a solution very quickly. I am honestly quite happy with the outcome, except that I am still manually keying records into either the Airtable app or their site from the paper records that my auto shop gives me on every visit. Ideally, I would like a solution that extracts the data from those paper records and stores it in a structured format that I can easily consume.

Separately, I consolidated, cleaned up, and fully automated the provisioning of one of the few EC2-based resources I run in my AWS account: a mail server that runs Postfix and Dovecot. Recently though, I began to think about moving this over to a serverless paradigm. In my research, I discovered this is quite easy to do using SES. I use mutt to read my mail most of the time, and it seemed feasible to build a small local CLI to keep S3-based emails in sync using a process like the one James Turner describes on his site. While reading mail locally that is stored in S3 isn’t the point of this post, it is worth mentioning that there are low-friction ways to make this work in certain situations.

The thesis of this post is this: despite next-generation systems, we still do a lot of data processing manually due to constraints imposed by outside forces. If we can’t change the “interfaces” by which we receive this data, can we process it in more efficient and effective ways? As it turns out, yes, we can!

In this first of several posts, we look at the overall architecture of the end-state solution.

Solution Architecture Overview

Let’s break this diagram down a bit. The underlying premise of the solution is that using SES’s ability to receive email, we can turn email processing into a cloud-native workflow.

  • The SES-to-S3 receipt mechanism drops all incoming emails in the root of the bucket.

  • Since all emails land by default in the root of the bucket with a random identifier as a filename, I use a Lambda (“inbound-ses-process”) to sort the emails by recipient, place them into an appropriate prefix in the bucket, and extract any attachments into a separate bucket prefix for subsequent processing.

  • To track auto maintenance records, I only care about emails that get sorted into the “auto-maintenance” prefix (i.e., emails sent to “auto-maintenance@tld.net”). Since this is constrained to a specific prefix, I can use an S3 event notification set up for this prefix to trigger the “auto-maintenance-to-textract” Lambda.

  • The “auto-maintenance-to-textract” Lambda fetches the appropriate attachment file(s) from the attachments prefix in the holding bucket and sends them off to the Textract service for async processing. When Textract finishes processing a file, it signals job completion by publishing to the “auto-maintenance-textract-done” SNS topic.

  • The “auto-maintenance-ddb-writer” Lambda subscribes to the “auto-maintenance-textract-done” SNS topic. It gets the Textract job ID from the message, which it uses to retrieve the results from the Textract API. It processes the results and then stores them in the “auto-maintenance” DynamoDB table. Lastly, the function pushes a SUCCESS/FAIL notification to the “auto-maintenance-done” SNS topic, intended for end-user notifications.

  • As I am the only user of the system, my mobile number is subscribed to the “auto-maintenance-done” topic. In business scenarios, this could instead be a subscription from a mailing list or any enterprise notification system.
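To make two of the steps above a bit more concrete, here is a minimal Python sketch of the pure-logic parts: sorting a raw email by recipient and pulling out its attachments (the “inbound-ses-process” step), and reading the Textract job ID out of the SNS completion message (the first thing “auto-maintenance-ddb-writer” does). This is not the code behind the solution. The function names, the “attachments/<prefix>/<id>/<filename>” key layout, and the “unsorted” fallback are assumptions made for illustration, and the S3 reads/writes and Textract calls are left out entirely.

```python
import json
from email import message_from_bytes, policy
from email.utils import getaddresses


def route_email(raw: bytes, message_id: str):
    """Return (destination_key, attachments) for one raw email.

    attachments is a list of (key, content_bytes) tuples. The key layout
    used here is hypothetical, not necessarily the one the solution uses.
    """
    msg = message_from_bytes(raw, policy=policy.default)

    # Use the local part of the first "To" address as the prefix,
    # e.g. auto-maintenance@tld.net -> "auto-maintenance".
    addresses = [addr for _, addr in getaddresses(msg.get_all("To", [])) if addr]
    prefix = addresses[0].split("@")[0].lower() if addresses else "unsorted"

    attachments = []
    for index, part in enumerate(msg.iter_attachments()):
        filename = part.get_filename() or f"part-{index}"
        content = part.get_payload(decode=True)  # decoded bytes of the part
        attachments.append(
            (f"attachments/{prefix}/{message_id}/{filename}", content)
        )

    return f"{prefix}/{message_id}", attachments


def textract_job_from_sns(event: dict):
    """Return (job_id, status) from an SNS-triggered Lambda event.

    The SNS "Message" field is a JSON string; Textract's completion
    notification carries "JobId" and "Status" among other fields.
    """
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]
```

In a real Lambda, the first function would sit between an S3 read of the object in the bucket root and S3 writes to the sorted and attachment prefixes. The second would be followed by a call to the Textract API with the returned job ID once the status indicates success.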

There are caveats to this solution that I want you to be aware of, however, so let’s call them out here.

  • This solution processes a single document format from a single vendor. Generalizing it to handle multiple formats from multiple vendors means that the “auto-maintenance-ddb-writer” function would require a more complex codebase than the one developed here. This is a simple personal use case, though the same programming model can be extended to cover more complex problems.

  • There is no UI/interface for looking at DynamoDB records directly (besides the console). If you want to build a solution like this and present the data to business/end-users, the development of a user interface is something you have to consider.

  • I am not placing my solution under the same sort of conditions (number of users, load, and edge cases such as corrupt or missing attachments) that you would encounter if you implemented it in a broader context, so keep that in mind if you consider building something like this for business use.

  • In this example, I am forwarding my entire domain’s email to S3 for storage. This is likely untenable outside of personal applications, though creating a dedicated subdomain for a solution like this one may be feasible (and could still be implemented similarly).

In future posts, I will break down the solution into consumable pieces, allowing you to implement it for yourself. Admittedly, I am most excited about getting some hands-on experience with Textract, so stay tuned!
