An Adventure with Lambda and Pulumi
- When AWS Lambda Shines (and When It Hits a Wall)
- Why Pulumi for Infrastructure as Code?
- The Two-Day Bug
- Lessons Learned the Hard Way (So You Don’t Have to)
The role of the modern data scientist is changing. The lines between data science, data engineering, and MLOps are blurring, and success now demands skills that span the entire data lifecycle. I’ve experienced this evolution firsthand in my recent transition from model training to deploying full-scale data pipelines in the cloud.
This post shares the lessons and the errors I hit while deploying Lambda with Pulumi, starting with a core decision: when to use a tool like AWS Lambda in the first place.
When AWS Lambda Shines (and When It Hits a Wall)
AWS Lambda is an excellent choice for small, independent tasks that respond to events. Think of it as a highly scalable function that runs on demand. However, it’s crucial to be aware of its limitations:
- Code Size: Your Lambda function, including all its Lambda Layers (where you’d store libraries like Pandas), has an unzipped size limit of 250 MB. Much of the time, once Pandas is in, there is little room left for anything else.
- Memory: You can allocate up to 10 GB of memory. This is sufficient for many applications, but if your task demands hundreds of GBs, Lambda isn’t the right fit. Running near the 10 GB ceiling can also be less cost-effective than alternatives like EC2, ECS, or Apache Airflow.
- Time Limit: Lambda functions have a maximum execution time of 15 minutes. If your code runs longer, you’ll need a different solution.
For complex, long-running workflows, Lambda isn’t ideal. This is where tools like Apache Airflow (or AWS’s managed service, MWAA) come into play. Airflow is specifically designed for orchestrating multi-step data pipelines and long-running jobs.
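To make the contrast concrete, here is a minimal sketch of a scheduled Airflow DAG. It’s illustrative only: I’m assuming Airflow 2.x, and the DAG name and the extract/load callables are placeholders of mine.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")  # placeholder step

def load():
    print("writing to the warehouse")  # placeholder step

# Runs on a fixed daily schedule, the opposite of Lambda's event triggers.
with DAG(
    dag_id="daily_market_etl",  # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(task_id="extract", python_callable=extract) >> \
        PythonOperator(task_id="load", python_callable=load)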
Here’s a quick comparison:
| Feature | AWS Lambda | Apache Airflow (DAGs) |
| --- | --- | --- |
| Trigger | Events (e.g., S3, API Gateway) | Scheduled time (fixed/regular) |
| Use case | Short, event-driven tasks | Longer ETL, data pipelines |
| Memory | Up to 10 GB | Scales with underlying compute |
| Cost | Pay per execution and duration | Based on underlying infrastructure |
Why Pulumi for Infrastructure as Code?
Pulumi is my preferred tool for Infrastructure as Code (IaC). It’s an open-source solution that allows you to define and manage your cloud infrastructure using familiar programming languages like Python.
What I particularly appreciate about Pulumi is its ability to deploy and then completely tear down entire cloud environments. This feature is invaluable for avoiding accidentally leaving resources running and incurring unnecessary costs.
Compared to Terraform, which uses its own domain-specific language (HCL), Pulumi lets me write infrastructure definitions in Python (or another general-purpose language), so I can leverage all the familiar Python tools, libraries, and testing frameworks.
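As a small taste, here is a minimal Pulumi sketch that pins the Lambda knobs discussed above, memory and timeout included. The resource name, handler path, and IAM role ARN are placeholders of mine, not a production setup:

# __main__.py, a minimal Pulumi program; names and the role ARN are placeholders.
import pulumi
import pulumi_aws as aws

fn = aws.lambda_.Function(
    "parquet-reader",
    runtime="python3.9",
    handler="lambda_function.lambda_handler",
    code=pulumi.FileArchive("./app"),  # directory containing lambda_function.py
    role="arn:aws:iam::123456789012:role/lambda-exec-role",  # placeholder role
    memory_size=1024,  # MB; the service maximum is 10,240
    timeout=900,       # seconds; the 15-minute hard cap
)

pulumi.export("function_name", fn.name)

One pulumi up creates it, and one pulumi destroy removes it again, which is exactly the teardown behavior I praised above.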
Which IaC tool do you like to use? Let me know in the comments!
Recommended Reading for Pulumi Success
For initial setup, always refer to the official Pulumi documentation. For efficient and automated deployments, I highly recommend integrating Pulumi into your GitHub Actions. This creates a powerful CI/CD pipeline where every code push can automatically manage your cloud infrastructure.
Here are some resources that I found particularly helpful:
- YouTube Tutorial: Pulumi Introduction: Pulumi Tutorial: Introduction, Benefits, and Demo of Modern Infrastructure as Code
- Blog Post: Setting up Pulumi with GitHub Actions
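If you prefer to drive deployments from Python itself, in CI or locally, rather than from the Pulumi CLI alone, the Automation API wraps the same lifecycle. A sketch, assuming the Pulumi CLI is installed; the project and stack names are made up:

# deploy.py: drive a Pulumi deployment programmatically via the Automation API.
from pulumi import automation as auto

def pulumi_program():
    # Declare resources here, exactly as you would in __main__.py.
    pass

stack = auto.create_or_select_stack(
    stack_name="dev",                 # hypothetical stack name
    project_name="lambda-zstd-demo",  # hypothetical project name
    program=pulumi_program,
)
stack.up(on_output=print)        # deploy everything
# stack.destroy(on_output=print) # ...and tear it all down when you're done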
The Two-Day Bug
The task: read a Parquet file compressed with Zstandard (zstd) from S3 using a Lambda function.
Round 1
My immediate thought was to use Pandas; AWS even provides a public Pandas Lambda Layer that exposes the pip packages inside the Lambda environment.
import pandas as pd

def lambda_handler(event, context):
    bucket = event['bucket']
    key = event['key']
    s3_path = f"s3://{bucket}/{key}"
    print(f"Attempting to read: {s3_path}")

    df_day = pd.read_parquet(s3_path)  # The hopeful line
    print(f"Successfully read {len(df_day)} rows.")
    return {"status": "success", "rows": len(df_day)}
However, when deployed to AWS, it raised:
[ERROR] ImportError: Missing optional dependency 'fsspec'. Use pip or conda to install fsspec.
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 112, in lambda_handler
    df_day = pd.read_parquet(s3_path)  # This is what's failing
  File "/opt/python/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/opt/python/pandas/io/parquet.py", line 233, in read
    path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
  File "/opt/python/pandas/io/parquet.py", line 82, in _get_path_or_handle
    fsspec = import_optional_dependency("fsspec")
  File "/opt/python/pandas/compat/_optio
It turns out pandas.read_parquet() relies on fsspec for handling file systems beyond your local machine, like S3. Crucially, the AWS-provided Pandas layer I was using didn’t bundle fsspec.
Now, Lambda has limits on layer size (250 MB total uncompressed) and count (five per function). The official Pandas Lambda Layer is already quite chunky, often leaving little room for more.
Still, I tried to add a separate fsspec layer. This, predictably, led down a rabbit hole of more missing dependencies: s3fs (the S3-specific fsspec implementation) and, most importantly, zstd support itself.
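For the record, one way to sidestep fsspec entirely is to stage the object in /tmp with boto3, which the Lambda runtime already bundles, and read it as a local file. A sketch; note that pd.read_parquet still needs pyarrow, and the zstd codec problem below would have surfaced here too:

# Sketch: avoid fsspec by downloading to /tmp with boto3 first.
import boto3
import pandas as pd

def lambda_handler(event, context):
    local_path = "/tmp/data.parquet"  # /tmp is Lambda's only writable path
    boto3.client("s3").download_file(event["bucket"], event["key"], local_path)

    df = pd.read_parquet(local_path)  # local read: no fsspec/s3fs required
    return {"status": "success", "rows": len(df)}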
Round 2
I decided to pivot and fully embrace awswrangler. It’s designed for exactly these kinds of tasks on AWS:
import awswrangler as wr

def lambda_handler(event, context):
    bucket = event['bucket']
    key = event['key']
    s3_path = f"s3://{bucket}/{key}"
    print(f"Attempting to read with awswrangler: {s3_path}")

    df_day = wr.s3.read_parquet(path=s3_path)  # It works!
    print(f"Successfully read {len(df_day)} rows.")
    return {"status": "success", "rows": len(df_day)}
I checked the official AWS Data Wrangler Lambda Layers and hit two snags:
- Region Realignment: The layer wasn’t available in my target deployment region. No problem! Thanks to Infrastructure as Code (I’m using Pulumi), a quick config change moved my Lambda to us-east-1.
- Python Version Downgrade: The specific layer version also required a Python downgrade to 3.9. Again, a simple update in my Pulumi code (both tweaks are sketched below).
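Here is roughly what those two tweaks look like in my Pulumi code. The function details and the layer ARN are placeholders; substitute the real awswrangler layer ARN for your region and Python version:

# Sketch: pin the region and downgrade the runtime to match the official layer.
import pulumi
import pulumi_aws as aws

use1 = aws.Provider("use1", region="us-east-1")  # region where the layer exists

fn = aws.lambda_.Function(
    "parquet-reader",
    runtime="python3.9",  # downgraded to match the layer
    handler="lambda_function.lambda_handler",
    code=pulumi.FileArchive("./app"),
    role="arn:aws:iam::123456789012:role/lambda-exec-role",  # placeholder
    layers=["arn:aws:lambda:us-east-1:111122223333:layer:aws-wrangler:1"],  # placeholder ARN
    opts=pulumi.ResourceOptions(provider=use1),  # deploy into us-east-1
)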
With these adjustments, I deployed using the official awswrangler layer. Excitement! And then…
[WARNING] xxxx Could not read data for 2025-06-13. Path: s3://tsgs-market-data-prod-ap-southeast-ll/SFP/year=2025/month=06/day=13/. Error: Support for codec 'zstd' not built
[ERROR] xxxx Bootstrap failed: Could not read any data from the specified days. Exiting.
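In hindsight, this failure was cheap to detect before deploying: pyarrow can report whether a compression codec was compiled in. A quick probe, best run in the same environment as the layer:

# Probe which compression codecs this pyarrow build actually supports.
import pyarrow as pa

for codec in ("snappy", "gzip", "zstd"):
    print(codec, "->", pa.Codec.is_available(codec))
# On the layer that failed above, 'zstd' would have printed False.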
Round 3
If you want something done right (or with specific features), sometimes you have to build it yourself. It was time for a custom Lambda layer.
The official awswrangler layer doesn’t even include the zstd compression codec, so I had to build an awswrangler layer with zstd support myself.
The Plan:
- Create a local environment (e.g., a Docker container matching the Lambda runtime, like amazonlinux:2).
- Install awswrangler with the necessary extras, ensuring pyarrow is built with zstd. Typically, install awswrangler[zstd, s3].
- Package this into the required python/lib/pythonX.Y/site-packages structure for a Lambda layer.
- Zip it up as layer.zip.
Here’s my requirements.txt:
# requirements.txt
awswrangler[zstd, s3]
And the Dockerfile to build the layer:
# Dockerfile
FROM public.ecr.aws/lambda/python:3.9-x86_64
# Create the directory structure for the layer
RUN mkdir -p /asset/python/lib/python3.9/site-packages
# Copy requirements file
COPY requirements.txt /
# Install dependencies into the layer directory
RUN pip install -r /requirements.txt -t /asset/python/lib/python3.9/site-packages
# The final asset will be in the /asset directory
Build and package the layer:
# Build the Docker image
docker build -t lambda-layer-builder .
# Create a container from the image so we can copy the file out
docker create --name builder lambda-layer-builder
# Copy the built layer (the /asset/python directory) to your local machine
docker cp builder:/asset/python ./build/
# Clean up the container
docker rm builder
# Now zip the result
cd build
zip -r layer.zip python
cd ..
Size Matters: My custom layer.zip came out at almost 85 MB. Lambda allows direct uploads of up to 50 MB for the zip; for anything larger, best practice is to upload layer.zip to S3 and create the Lambda Layer version by pointing at the S3 object, which accepts layer zips up to 250 MB.
I followed this blog to integrate the custom layer into the Lambda function.
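In Pulumi terms, the S3-backed layer looks roughly like this. The bucket and layer names are mine, not from that blog:

# Sketch: publish a >50 MB layer by pointing LayerVersion at an S3 object.
import pulumi
import pulumi_aws as aws

bucket = aws.s3.Bucket("layer-artifacts")  # hypothetical artifacts bucket

layer_zip = aws.s3.BucketObject(
    "awswrangler-zstd-zip",
    bucket=bucket.id,
    key="layers/layer.zip",
    source=pulumi.FileAsset("./layer.zip"),  # the zip built above
)

layer = aws.lambda_.LayerVersion(
    "awswrangler-zstd",
    layer_name="awswrangler-zstd",
    s3_bucket=bucket.id,
    s3_key=layer_zip.key,
    compatible_runtimes=["python3.9"],
)

pulumi.export("layer_arn", layer.arn)  # attach this ARN to the function's layers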
Round 4
With my custom awswrangler[zstd, s3] layer in place, I deployed, triggered the Lambda, and…
[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 's3fs'
Traceback (most recent call last):
Seriously? Despite specifying s3 as an extra for awswrangler, which should pull in s3fs, the module apparently wasn’t available on the Python path in a way the Lambda runtime could resolve.
The Fix: One last layer. I decided to create a very minimal, separate layer just for s3fs.
# requirements.txt
s3fs
I repeated the Docker build process above with this new requirements file, creating s3fs_layer.zip. This zip was small (around 10MB), so it fit comfortably alongside my custom awswrangler layer without busting size limits.
With this new s3fs layer attached to my Lambda function… FINALLY! Success! The Lambda could now read the zstd-compressed Parquet file from S3. The final configuration involved the customized awswrangler[zstd, s3] layer plus the small s3fs layer.
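One last takeaway from this round: a layer smoke test before each deploy would have shortened Rounds 2 through 4 considerably. A minimal sketch, best run inside the same Docker image used to build the layer (the compiled wheels won’t necessarily import on your host OS):

# smoke_test.py: fail fast if the built layer is missing an import.
import sys

# Mimic how Lambda exposes a layer's site-packages via /opt/python.
sys.path.insert(0, "./build/python/lib/python3.9/site-packages")

import awswrangler  # noqa: E402
import s3fs         # noqa: E402  (the module missing in Round 4)
import pyarrow as pa  # noqa: E402

assert pa.Codec.is_available("zstd"), "pyarrow was built without zstd!"
print("layer looks complete")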
Lessons Learned the Hard Way (So You Don’t Have to)
- Strategic Layering is Key: Respect Lambda’s size limits, both the 250 MB unzipped cap across the function and its layers and the five-layer maximum per function.
- Embrace IaC (like Pulumi): Rapidly iterate on configurations (regions, Python versions, layers) in code, avoiding manual errors and saving time.
- Build Lambda Layers with Docker: For reliable compatibility, mirror the Lambda runtime using Docker. Don’t assume official layers have every feature.
This journey through Lambda, Pulumi, and Python packaging was a tough but valuable lesson. May it smooth your path in the serverless world!