An Adventure with Lambda and Pulumi
- When AWS Lambda Shines (and When It Hits a Wall)
- Why Pulumi for Infrastructure as Code?
- The Two-Day Bug
- Lessons Learned the Hard Way (So You Don’t Have to)
The role of the modern data scientist is changing. The lines between data science, data engineering, and MLOps are blurring, and success now demands skills that span the entire data lifecycle. I’ve experienced this evolution firsthand in my recent transition from model training to deploying full-scale data pipelines in the cloud.
This post shares the lessons and the errors I hit while deploying Lambda with Pulumi, starting with a core decision: when to use a tool like AWS Lambda in the first place.
When AWS Lambda Shines (and When It Hits a Wall)
AWS Lambda is an excellent choice for small, independent tasks that respond to events. Think of it as a highly scalable function that runs on demand. However, it’s crucial to be aware of its limitations:
- Code Size: Your Lambda function, including all its Lambda Layers (where you’d store libraries like Pandas), has an unzipped size limit of 250 MB. Much of the time, once Pandas is in, there is little room left for anything else.
- Memory: You can allocate up to 10 GB of memory. This is sufficient for many applications, but if your task demands hundreds of GBs, Lambda isn’t the right fit. Running near the 10 GB ceiling can also be less cost-effective than alternatives like EC2, ECS, or Apache Airflow.
- Time Limit: Lambda functions have a maximum execution time of 15 minutes. If your code runs longer, you’ll need a different solution.
For complex, long-running workflows, Lambda isn’t ideal. This is where tools like Apache Airflow (or AWS’s managed service, MWAA) come into play. Airflow is specifically designed for orchestrating multi-step data pipelines and long-running jobs.
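To make the contrast concrete, here is a minimal sketch of a scheduled Airflow DAG. It’s illustrative only: I’m assuming Airflow 2.x, and the DAG name and the extract/load callables are placeholders of mine.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")  # placeholder step

def load():
    print("writing to the warehouse")  # placeholder step

# Runs on a fixed daily schedule, the opposite of Lambda's event triggers.
with DAG(
    dag_id="daily_market_etl",  # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(task_id="extract", python_callable=extract) >> \
        PythonOperator(task_id="load", python_callable=load)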
Here’s a quick comparison:
| Feature | AWS Lambda | Apache Airflow (DAGs) |
| --- | --- | --- |
| Trigger | Events (e.g., S3, API Gateway) | Scheduled time (fixed/regular) |
| Use case | Short, event-driven tasks | Longer ETL, data pipelines |
| Memory | Up to 10 GB | Scales with underlying compute |
| Cost | Pay per execution and duration | Based on underlying infrastructure |
Why Pulumi for Infrastructure as Code?
Pulumi is my preferred tool for Infrastructure as Code (IaC). It’s an open-source solution that allows you to define and manage your cloud infrastructure using familiar programming languages like Python.
What I particularly appreciate about Pulumi is its ability to deploy and then completely tear down entire cloud environments. This feature is invaluable for avoiding accidentally leaving resources running and incurring unnecessary costs.
Compared to Terraform, which uses its own domain-specific language (HCL), Pulumi lets me write infrastructure definitions in Python (or another general-purpose language), so I can leverage all the familiar Python tools, libraries, and testing frameworks.
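As a small taste, here is a minimal Pulumi sketch that pins the Lambda knobs discussed above, memory and timeout included. The resource name, handler path, and IAM role ARN are placeholders of mine, not a production setup:

# __main__.py, a minimal Pulumi program; names and the role ARN are placeholders.
import pulumi
import pulumi_aws as aws

fn = aws.lambda_.Function(
    "parquet-reader",
    runtime="python3.9",
    handler="lambda_function.lambda_handler",
    code=pulumi.FileArchive("./app"),  # directory containing lambda_function.py
    role="arn:aws:iam::123456789012:role/lambda-exec-role",  # placeholder role
    memory_size=1024,  # MB; the service maximum is 10,240
    timeout=900,       # seconds; the 15-minute hard cap
)

pulumi.export("function_name", fn.name)

One pulumi up creates it, and one pulumi destroy removes it again, which is exactly the teardown behavior I praised above.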
Which IaC tool do you like to use? Let me know in the comments!
Recommended Reading for Pulumi Success
For initial setup, always refer to the official Pulumi documentation. For efficient and automated deployments, I highly recommend integrating Pulumi into your GitHub Actions. This creates a powerful CI/CD pipeline where every code push can automatically manage your cloud infrastructure.
Here are some resources that I found particularly helpful:
- YouTube Tutorial: Pulumi Introduction: Pulumi Tutorial: Introduction, Benefits, and Demo of Modern Infrastructure as Code
- Blog Post: Setting up Pulumi with GitHub Actions
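If you prefer to drive deployments from Python itself, in CI or locally, rather than from the Pulumi CLI alone, the Automation API wraps the same lifecycle. A sketch, assuming the Pulumi CLI is installed; the project and stack names are made up:

# deploy.py: drive a Pulumi deployment programmatically via the Automation API.
from pulumi import automation as auto

def pulumi_program():
    # Declare resources here, exactly as you would in __main__.py.
    pass

stack = auto.create_or_select_stack(
    stack_name="dev",                 # hypothetical stack name
    project_name="lambda-zstd-demo",  # hypothetical project name
    program=pulumi_program,
)
stack.up(on_output=print)        # deploy everything
# stack.destroy(on_output=print) # ...and tear it all down when you're done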
The Two-Day Bug
The task: read a Parquet file compressed with Zstandard (zstd) from S3 using a Lambda function.
Round 1
My immediate thought was to use Pandas; AWS even provides a public Pandas Lambda Layer that exposes the pip packages inside the Lambda environment.
import pandas as pd

def lambda_handler(event, context):
    bucket = event['bucket']
    key = event['key']
    s3_path = f"s3://{bucket}/{key}"
    print(f"Attempting to read: {s3_path}")

    df_day = pd.read_parquet(s3_path)  # The hopeful line
    print(f"Successfully read {len(df_day)} rows.")
    return {"status": "success", "rows": len(df_day)}
However, when deployed to AWS, it raised:
[ERROR] ImportError: Missing optional dependency 'fsspec'. Use pip or conda to install fsspec.
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 112, in lambda_handler
    df_day = pd.read_parquet(s3_path)  # This is what's failing
  File "/opt/python/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/opt/python/pandas/io/parquet.py", line 233, in read
    path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
  File "/opt/python/pandas/io/parquet.py", line 82, in _get_path_or_handle
    fsspec = import_optional_dependency("fsspec")
  File "/opt/python/pandas/compat/_optio
It turns out pandas.read_parquet() relies on fsspec for handling file systems beyond your local machine, like S3. Crucially, the AWS-provided Pandas layer I was using didn’t bundle fsspec.
Now, Lambda has limits on layer size (250 MB total uncompressed) and count (five per function). The official Pandas Lambda Layer is already quite chunky, often leaving little room for more.
Still, I tried to add a separate fsspec layer. This, predictably, led down a rabbit hole of more missing dependencies: s3fs (the S3-specific fsspec implementation) and, most importantly, zstd support itself.
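For the record, one way to sidestep fsspec entirely is to stage the object in /tmp with boto3, which the Lambda runtime already bundles, and read it as a local file. A sketch; note that pd.read_parquet still needs pyarrow, and the zstd codec problem below would have surfaced here too:

# Sketch: avoid fsspec by downloading to /tmp with boto3 first.
import boto3
import pandas as pd

def lambda_handler(event, context):
    local_path = "/tmp/data.parquet"  # /tmp is Lambda's only writable path
    boto3.client("s3").download_file(event["bucket"], event["key"], local_path)

    df = pd.read_parquet(local_path)  # local read: no fsspec/s3fs required
    return {"status": "success", "rows": len(df)}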
Round 2
I decided to pivot and fully embrace awswrangler. It’s designed for exactly these kinds of tasks on AWS:
import awswrangler as wr

def lambda_handler(event, context):
    bucket = event['bucket']
    key = event['key']
    s3_path = f"s3://{bucket}/{key}"
    print(f"Attempting to read with awswrangler: {s3_path}")

    df_day = wr.s3.read_parquet(path=s3_path)  # It works!
    print(f"Successfully read {len(df_day)} rows.")
    return {"status": "success", "rows": len(df_day)}
I checked the official AWS Data Wrangler Lambda Layers and hit two snags:
- Region Realignment: The layer wasn’t available in my target deployment region. No problem! Thanks to Infrastructure as Code (I’m using Pulumi), a quick config change moved my Lambda to us-east-1.
- Python Version Downgrade: The specific layer version also required a Python downgrade to 3.9. Again, a simple update in my Pulumi code (both tweaks are sketched below).
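Here is roughly what those two tweaks look like in my Pulumi code. The function details and the layer ARN are placeholders; substitute the real awswrangler layer ARN for your region and Python version:

# Sketch: pin the region and downgrade the runtime to match the official layer.
import pulumi
import pulumi_aws as aws

use1 = aws.Provider("use1", region="us-east-1")  # region where the layer exists

fn = aws.lambda_.Function(
    "parquet-reader",
    runtime="python3.9",  # downgraded to match the layer
    handler="lambda_function.lambda_handler",
    code=pulumi.FileArchive("./app"),
    role="arn:aws:iam::123456789012:role/lambda-exec-role",  # placeholder
    layers=["arn:aws:lambda:us-east-1:111122223333:layer:aws-wrangler:1"],  # placeholder ARN
    opts=pulumi.ResourceOptions(provider=use1),  # deploy into us-east-1
)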
With these adjustments, I deployed using the official awswrangler layer. Excitement! And then…
[WARNING] xxxx Could not read data for 2025-06-13. Path: s3://tsgs-market-data-prod-ap-southeast-ll/SFP/year=2025/month=06/day=13/. Error: Support for codec 'zstd' not built
[ERROR] xxxx Bootstrap failed: Could not read any data from the specified days. Exiting.
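In hindsight, this failure was cheap to detect before deploying: pyarrow can report whether a compression codec was compiled in. A quick probe, best run in the same environment as the layer:

# Probe which compression codecs this pyarrow build actually supports.
import pyarrow as pa

for codec in ("snappy", "gzip", "zstd"):
    print(codec, "->", pa.Codec.is_available(codec))
# On the layer that failed above, 'zstd' would have printed False.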
Round 3
If you want something done right (or with specific features), sometimes you have to build it yourself. It was time for a custom Lambda layer.
The official awswrangler layer doesn’t even include the zstd compression codec, so I had to build an awswrangler layer with zstd support myself.
The Plan:
- Create a local environment (e.g., a Docker container matching the Lambda runtime, like amazonlinux:2).
- Install awswrangler with the necessary extras, ensuring pyarrow is built with zstd. Typically, install awswrangler[zstd, s3].
- Package this into the required python/lib/pythonX.Y/site-packages structure for a Lambda layer.
- Zip it up as layer.zip.
Here’s my requirements.txt:
# requirements.txt
awswrangler[zstd, s3]
And the Dockerfile to build the layer:
# Dockerfile
FROM public.ecr.aws/lambda/python:3.9-x86_64
# Create the directory structure for the layer
RUN mkdir -p /asset/python/lib/python3.9/site-packages
# Copy requirements file
COPY requirements.txt /
# Install dependencies into the layer directory
RUN pip install -r /requirements.txt -t /asset/python/lib/python3.9/site-packages
# The final asset will be in the /asset directory
Build and package the layer:
# Build the Docker image
docker build -t lambda-layer-builder .
# Create a container from the image so we can copy the file out
docker create --name builder lambda-layer-builder
# Copy the built layer (the /asset/python directory) to your local machine
docker cp builder:/asset/python ./build/
# Clean up the container
docker rm builder
# Now zip the result
cd build
zip -r layer.zip python
cd ..
Size Matters: My custom layer.zip came out at almost 85 MB. Lambda allows direct uploads of up to 50 MB for the zip; for anything larger, best practice is to upload layer.zip to S3 and create the Lambda Layer version by pointing at the S3 object, which accepts layer zips up to 250 MB.
I followed this blog to integrate the custom layer into the Lambda function.
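In Pulumi terms, the S3-backed layer looks roughly like this. The bucket and layer names are mine, not from that blog:

# Sketch: publish a >50 MB layer by pointing LayerVersion at an S3 object.
import pulumi
import pulumi_aws as aws

bucket = aws.s3.Bucket("layer-artifacts")  # hypothetical artifacts bucket

layer_zip = aws.s3.BucketObject(
    "awswrangler-zstd-zip",
    bucket=bucket.id,
    key="layers/layer.zip",
    source=pulumi.FileAsset("./layer.zip"),  # the zip built above
)

layer = aws.lambda_.LayerVersion(
    "awswrangler-zstd",
    layer_name="awswrangler-zstd",
    s3_bucket=bucket.id,
    s3_key=layer_zip.key,
    compatible_runtimes=["python3.9"],
)

pulumi.export("layer_arn", layer.arn)  # attach this ARN to the function's layers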
Round 4
With my custom awswrangler[zstd, s3] layer in place, I deployed, triggered the Lambda, and…
[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 's3fs'
Traceback (most recent call last):
Seriously? Despite specifying s3 as an extra for awswrangler, which should pull in s3fs, the module apparently wasn’t available on the Python path in a way the Lambda runtime could resolve.
The Fix: One last layer. I decided to create a very minimal, separate layer just for s3fs.
# requirements.txt
s3fs
I repeated the Docker build process above with this new requirements file, creating s3fs_layer.zip. This zip was small (around 10MB), so it fit comfortably alongside my custom awswrangler layer without busting size limits.
With this new s3fs layer attached to my Lambda function… FINALLY! Success! The Lambda could now read the zstd-compressed Parquet file from S3. The final configuration involved the customized awswrangler[zstd, s3] layer plus the small s3fs layer.
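One last takeaway from this round: a layer smoke test before each deploy would have shortened Rounds 2 through 4 considerably. A minimal sketch, best run inside the same Docker image used to build the layer (the compiled wheels won’t necessarily import on your host OS):

# smoke_test.py: fail fast if the built layer is missing an import.
import sys

# Mimic how Lambda exposes a layer's site-packages via /opt/python.
sys.path.insert(0, "./build/python/lib/python3.9/site-packages")

import awswrangler  # noqa: E402
import s3fs         # noqa: E402  (the module missing in Round 4)
import pyarrow as pa  # noqa: E402

assert pa.Codec.is_available("zstd"), "pyarrow was built without zstd!"
print("layer looks complete")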
Lessons Learned the Hard Way (So You Don’t Have to)
- Strategic Layering is Key: Respect Lambda’s size limits, both the 250 MB unzipped cap across the function and its layers and the five-layer maximum per function.
- Embrace IaC (like Pulumi): Rapidly iterate on configurations (regions, Python versions, layers) in code, avoiding manual errors and saving time.
- Build Lambda Layers with Docker: For reliable compatibility, mirror the Lambda runtime using Docker. Don’t assume official layers have every feature.
This journey through Lambda, Pulumi, and Python packaging was a tough but valuable lesson. May it smooth your path in the serverless world!