Why You Won't Be Running c7n-org in an AWS Lambda Function

Thu, May 14, 2020 c7n-org, lambda, aws, custodian, python, os error 38, batch, serverless

I recently had the good fortune to take on a really fun project at work. First off, the client was incredibly easy to work with, which makes any project (even something I might consider tedious and boring, like migration work) a win in my book. In any case, this wasn’t a boring project – the client asked us to roll out Cloud Custodian across their entire AWS footprint – which at this point consists of an AWS Organization with a decent number of accounts (and more to follow).

While I had never used it before, I was aware of c7n-org (a member of the overall Cloud Custodian product suite) and its design goal of allowing a user to run custodian across multiple accounts. Strictly speaking, AWS Organizations is not a prerequisite for using c7n-org, but there are benefits if you do use it: you can have c7n-org generate the YAML file that it uses to map out which accounts it will run against, and you can use StackSets to orchestrate the IAM roles that custodian runs under in child/spoke accounts.

My solution and the deployment mechanisms I designed for it worked fairly well overall. These included StackSets to provision the IAM roles that custodian runs under in child accounts, plus a few stacks in the main “security” account, which is home to most of Custodian’s standalone assets: c7n-mailer and its backing resources, and something to run c7n-org along with the backing resources for that. Outside of AWS, I stitched together a pretty nice policy-authoring tool in Python. It uses Jinja to pull out common config and inject those values into policy templates based on a dev/prod split, and it can aggregate policies spread across multiple YAML files into a single document under a unified policies key that c7n-org can ingest. (For reference, c7n-org run accepts only a single policy file as an argument, in contrast to the more standard custodian run, which can be pointed at a directory full of policy files.)
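To make the aggregation step concrete, here is a minimal sketch of the idea, not the actual tool. It assumes each policy file has already been parsed into a dict (for example with a YAML library) and uses the standard library's string.Template as a stand-in for Jinja. The function names, policy names, and queue name are hypothetical.

```python
from string import Template


def render_value(template_text, values):
    """Substitute per-environment values into a template string.

    string.Template is only a stand-in for Jinja here.
    """
    return Template(template_text).substitute(values)


def merge_policy_documents(documents):
    """Combine several parsed policy documents into one.

    Each document is expected to look like {"policies": [...]}.
    Duplicate policy names are rejected so that one file cannot
    silently shadow a policy defined in another.
    """
    merged = []
    seen = set()
    for doc in documents:
        for policy in (doc or {}).get("policies", []):
            name = policy["name"]
            if name in seen:
                raise ValueError(f"duplicate policy name: {name}")
            seen.add(name)
            merged.append(policy)
    return {"policies": merged}


# Hypothetical example: two small documents, one per source file.
docs = [
    {"policies": [{"name": "ec2-untagged", "resource": "ec2"}]},
    {"policies": [{"name": "s3-public", "resource": "s3"}]},
]
combined = merge_policy_documents(docs)
queue_name = render_value("c7n-${env}-queue", {"env": "dev"})
```

The combined document can then be written out as a single YAML file and handed to c7n-org run.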

So, it all came down to setting up something to run c7n-org… My first inclination was to try it out in a Lambda function. After reading a post in cloud-custodian’s gitter channel from a gent who indicated that he had pulled off this feat previously, I thought I was in good shape. (Maybe he did; possibly the changes in c7n-org that produced the problems I ran into were added at some later point. The version at the time I encountered this issue was 0.5.7.) The trick is that c7n-org is basically a CLI tool, so running it from a Lambda is a bit awkward. There were some kinks to work out in actually being able to invoke c7n-org run, and I was able to work those out, only to then discover this error in my CloudWatch logs…

Traceback (most recent call last):
  File "/var/task/c7n-org", line 10, in <module>
    sys.exit(cli())
  File "/var/task/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/var/task/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/var/task/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/var/task/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/var/task/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/var/task/c7n_org/cli.py", line 636, in run
    with executor(max_workers=WORKER_COUNT) as w:
  File "/var/lang/lib/python3.7/concurrent/futures/process.py", line 556, in __init__
    pending_work_items=self._pending_work_items)
  File "/var/lang/lib/python3.7/concurrent/futures/process.py", line 165, in __init__
    super().__init__(max_size, ctx=ctx)
  File "/var/lang/lib/python3.7/multiprocessing/queues.py", line 42, in __init__
    self._rlock = ctx.Lock()
  File "/var/lang/lib/python3.7/multiprocessing/context.py", line 67, in Lock
    return Lock(ctx=self.get_context())
  File "/var/lang/lib/python3.7/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/var/lang/lib/python3.7/multiprocessing/synchronize.py", line 59, in __init__
    unlink_now)
OSError: [Errno 38] Function not implemented

The important part in all that? I homed in immediately on the OSError: [Errno 38] Function not implemented.
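For context on how a CLI traceback ends up in CloudWatch at all: one way to invoke a command-line tool from a Lambda handler is to run it as a child process and log whatever it prints. The following is a minimal sketch under that assumption, not the code I actually deployed. The handler name and the event shape (an "argv" list) are hypothetical.

```python
import subprocess
import sys


def run_cli(argv, timeout=None):
    """Run a command-line tool as a child process and capture its output."""
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=False,  # report the exit code instead of raising
    )
    # Decode so the logs contain readable text rather than a bytes repr.
    stdout = completed.stdout.decode("utf-8", errors="replace")
    stderr = completed.stderr.decode("utf-8", errors="replace")
    return completed.returncode, stdout, stderr


def handler(event, context):
    """Hypothetical Lambda entry point; the event shape is an assumption."""
    argv = event["argv"]
    returncode, stdout, stderr = run_cli(argv)
    print(stdout)
    if returncode != 0:
        # Anything written to stderr (such as a traceback) lands in the logs.
        print(stderr, file=sys.stderr)
    return {"returncode": returncode}
```

Anything the child process writes, including an unhandled exception's traceback, shows up in the function's log stream this way.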

So, what the heck is OS Error 38?

A quick Google for OS Error 38, python, and Lambda turned up some interesting info. I checked SO first, where I found this post. A bit more digging turned up that Python’s multiprocessing module’s Queue implementation expects /dev/shm to be available, and the Lambda execution environment does not provide it. There is a quite long-running AWS Support thread on this very issue. Long story short: if you want to use multiprocessing in Lambda, don’t use its Queue implementation. That is all well and good if you’re the one writing the code; otherwise, you either maintain your own fork (and deal with the implications of changing the implementation) or find another way to run it.
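If you are the one writing the code, the usual workaround is to build on multiprocessing's Process and Pipe, which do not create the semaphore-backed locks that Queue needs. Here is a minimal sketch of that pattern. It is not a patch for c7n-org, the helper names are hypothetical, and it assumes a platform where the "fork" start method is available (as on Linux).

```python
import multiprocessing as mp


def _worker(func, arg, conn):
    """Run func(arg) in the child and send the outcome back over the pipe."""
    try:
        conn.send(("ok", func(arg)))
    except Exception as exc:  # report failures to the parent instead of hanging
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def pipe_map(func, args):
    """Apply func to each argument in its own process, without using a Queue."""
    ctx = mp.get_context("fork")  # assumption: fork is available
    jobs = []
    for arg in args:
        # duplex=False returns (receiving end, sending end)
        parent_conn, child_conn = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=_worker, args=(func, arg, child_conn))
        proc.start()
        child_conn.close()  # the parent only reads
        jobs.append((proc, parent_conn))

    results = []
    for proc, conn in jobs:
        status, value = conn.recv()
        proc.join()
        conn.close()
        if status == "error":
            raise RuntimeError(value)
        results.append(value)
    return results


def square(n):
    return n * n
```

Calling pipe_map(square, [1, 2, 3]) starts one process per item and collects the results in order. This starts every process up front, so a real implementation would want to cap the number of concurrent workers.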

Docker and AWS Batch to the Rescue

I decided to throw in the towel on the Lambda-based solution and pull out a trick that’s worked in the past. Using LambCI’s docker-lambda, I quickly containerized the code I had authored for Lambda. Drawing on some previous work I had done with Batch, I was able to get a Batch Compute Environment, Job Queue, and Job Definition set up in relatively short order, so, while not Lambda, still “serverless”. While there was some additional experimentation to be done setting parallelization flags, choosing proper instance sizes, etc., I can tell you that it is pretty straightforward to run c7n-org packaged as a Lambda function once you shim it into a container. No OS Error 38, just lots of nice log output. Roll that beautiful log footage!

[Image: Roll that beautiful log footage]

That looks more like it!
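For anyone wiring up something similar, kicking off a run comes down to submitting a job against the queue and job definition. The sketch below only builds the request. The job, queue, and definition names and the file names are hypothetical, and the c7n-org flags follow the usage shown in the c7n-org README, so check them against your installed version.

```python
def build_submit_job_request(job_name, job_queue, job_definition, command, environment=None):
    """Build keyword arguments for the AWS Batch SubmitJob call."""
    overrides = {"command": list(command)}
    if environment:
        # Batch expects environment variables as a list of name/value pairs.
        overrides["environment"] = [
            {"name": key, "value": str(value)}
            for key, value in sorted(environment.items())
        ]
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": overrides,
    }


# All names below are placeholders for illustration.
request = build_submit_job_request(
    job_name="c7n-org-nightly",
    job_queue="custodian-queue",
    job_definition="c7n-org-runner",
    command=["c7n-org", "run", "-c", "accounts.yml", "-s", "output", "-u", "policies.yml"],
    environment={"STAGE": "dev"},
)

# With boto3 installed and credentials configured, the request would be sent as:
#   import boto3
#   boto3.client("batch").submit_job(**request)
```

Keeping the request builder separate from the API call makes it easy to check the payload without touching AWS.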

A few more links related to Python, multiprocessing, and OS Error 38:

Using Python Multiprocessing Queue Inside AWS Lambda Function

OSError 38 [Errno 38] with multiprocessing

Multiprocessing vs Threading Python [duplicate]