Building Beam Python SDK Image Guide

There are two ways to build a Beam Python SDK image. If you only need to modify the Python SDK boot entrypoint binary, read Update Boot Entrypoint Application Only. If you need to build a Beam Python SDK image fully, read Build Beam Python SDK Image Fully.

Update Boot Entrypoint Application Only

If you only need to change the Python SDK boot entrypoint binary, you can rebuild only the boot application and include the updated binary in a preexisting image. Read the Python container Dockerfile for reference.

# From beam repo root, make changes to boot.go.
your_editor sdks/python/container/boot.go
# Rebuild the entrypoint.
./gradlew :sdks:python:container:gobuild
cd sdks/python/container/build/target/launcher/linux_amd64
# Create a simple Dockerfile to use custom boot entrypoint.
cat >Dockerfile <<EOF
FROM apache/beam_python3.10_sdk:2.60.0
COPY boot /opt/apache/beam/boot
EOF
# Build the image
docker build . --tag us-central1-docker.pkg.dev/<MY_PROJECT>/<MY_REPOSITORY>/beam_python3.10_sdk:2.60.0-custom-boot
docker push us-central1-docker.pkg.dev/<MY_PROJECT>/<MY_REPOSITORY>/beam_python3.10_sdk:2.60.0-custom-boot

You can build the Docker image locally if your environment has Java, Python, Golang, and Docker installed. Try ./gradlew :sdks:python:container:py<PYTHON_VERSION>:docker. For example, :sdks:python:container:py310:docker builds apache/beam_python3.10_sdk locally if it succeeds. If the build fails in your local environment, you can follow the rest of this guide to build a custom image from a VM.
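Before starting a local build, it can help to confirm that the required tools are on PATH. The following is a minimal sketch, not part of the Beam tooling: the function name missing_tools is hypothetical, and the command names java, python, go, and docker are assumed from the prerequisites listed above.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is NOT found on PATH,
# one per line. Prints nothing when every command is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (command names are assumptions based on the prerequisites above):
#   missing_tools java python go docker
```

An empty result suggests the local build can at least start; it does not check tool versions.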

Build Beam Python SDK Image Fully

This section describes how to build everything from scratch.

Prepare VM

Prepare a VM with Debian 11. This guide was tested on Debian 11.

Google Compute Engine

One option is to create the Debian 11 VM as a Google Compute Engine (GCE) instance.

gcloud compute instances create beam-builder \
  --zone=us-central1-a \
  --image-project=debian-cloud \
  --image-family=debian-11 \
  --machine-type=n1-standard-8 \
  --boot-disk-size=20GB \
  --scopes=cloud-platform

Log in to the VM. All the following steps are executed inside the VM.

gcloud compute ssh beam-builder --zone=us-central1-a --tunnel-through-iap 

Update the apt package list.

sudo apt-get update 

[!NOTE]

  • A high-CPU machine is recommended to reduce the compile time.
  • The image build needs a large disk. With the default 10GB disk, the build fails with “no space left on device”.
  • The cloud-platform scope is recommended to avoid permission issues with Google Cloud Artifact Registry. You can use the default scopes if you don’t push the image to Google Cloud Artifact Registry.
  • If you push the image to Artifact Registry, use a zone in the same region as your Artifact Registry Docker repository.
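The last point can be checked mechanically, since a Compute Engine zone name is its region followed by a zone letter, and the registry host used in this guide has the form <region>-docker.pkg.dev. A minimal sketch, with hypothetical function names:

```shell
#!/bin/sh
# Hypothetical helper: strip the trailing "-<letter>" from a zone name,
# e.g. "us-central1-a" becomes "us-central1".
zone_to_region() {
  printf '%s\n' "${1%-*}"
}

# Hypothetical helper: succeed when the zone ($1) is in the same region
# as a regional Artifact Registry host ($2).
zone_matches_registry() {
  [ "$(zone_to_region "$1")-docker.pkg.dev" = "$2" ]
}

# Usage:
#   zone_matches_registry us-central1-a us-central1-docker.pkg.dev && echo "same region"
```

This only compares names; it assumes a regional repository rather than a multi-region one.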

Prerequisite Packages

Java

You need Java to run Gradle tasks.

sudo apt-get install -y openjdk-11-jdk 

Golang

Download and install. Reference: https://go.dev/doc/install.

# Download and install.
curl -OL https://go.dev/dl/go1.23.2.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.23.2.linux-amd64.tar.gz
# Add go to PATH.
export PATH=/usr/local/go/bin:$PATH

Confirm the Golang version.

go version 

Expected output:

go version go1.23.2 linux/amd64 

[!NOTE] An old Go version (e.g. 1.16) fails at :sdks:python:container:goBuild.
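To catch an outdated Go toolchain before Gradle does, the version can be read from the go version output and compared against a threshold. This is a sketch under assumptions: the function names are hypothetical, and the threshold in the usage line is simply the version this guide installs, not a documented minimum for Beam.

```shell
#!/bin/sh
# Hypothetical helper: extract "1.23.2" from a line such as
# "go version go1.23.2 linux/amd64".
parse_go_version() {
  printf '%s\n' "$1" | sed -n 's/^go version go\([0-9][0-9.]*\).*$/\1/p'
}

# Hypothetical helper: succeed when dotted version $1 >= dotted version $2.
# Compares numeric components left to right; missing components count as 0.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Usage (1.23.2 is an assumed threshold, taken from the version installed above):
#   version_ge "$(parse_go_version "$(go version)")" 1.23.2 || echo "Go is too old"
```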

Python

This guide uses Pyenv to manage multiple Python versions. Reference: https://realpython.com/intro-to-pyenv/#build-dependencies

# Install dependencies
sudo apt-get install -y make build-essential libssl-dev zlib1g-dev \
  libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev \
  libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev
# Install Pyenv
curl https://pyenv.run | bash
# Add pyenv to PATH.
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

Install Python 3.9 and set the Python version. This will take several minutes.

pyenv install 3.9
pyenv global 3.9

Confirm the Python version.

python --version 

Expected output example:

Python 3.9.17 

[!NOTE] You can build with a different Python version by passing the -PpythonVersion option to the Gradle task. Otherwise, you should have python3.9 in the build environment for Apache Beam 2.60.0 or later (python3.8 for older Apache Beam versions). If you use the wrong version, the Gradle task :sdks:python:setupVirtualenv fails.
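The value for -PpythonVersion can be derived from the interpreter that is actually installed. The sketch below uses hypothetical function names and assumes python --version prints a line of the form "Python X.Y.Z", as in the expected output above.

```shell
#!/bin/sh
# Hypothetical helper: reduce "Python 3.9.17" to "3.9".
python_minor() {
  printf '%s\n' "$1" | sed -n 's/^Python \([0-9][0-9]*\.[0-9][0-9]*\).*$/\1/p'
}

# Hypothetical helper: build the Gradle property flag for that version.
python_version_flag() {
  printf '%s%s\n' '-PpythonVersion=' "$(python_minor "$1")"
}

# Usage:
#   ./gradlew :sdks:python:container:py310:docker "$(python_version_flag "$(python --version)")"
```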

Docker

Install Docker following the reference.

# Add GPG keys.
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the Apt repository.
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
# Install docker packages.
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

The Beam Python SDK image build needs to run docker commands without root privileges. You can enable this by adding your account to the docker group.

sudo usermod -aG docker $USER
newgrp docker

Confirm that you can run a container without root privileges.

docker run hello-world 
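If the container does not start, one thing to check is whether the current shell has picked up the docker group. A minimal sketch with a hypothetical function name, taking the space-separated group list that id -nG prints:

```shell
#!/bin/sh
# Hypothetical helper: succeed when group $2 appears as a whole word
# in the space-separated group list $1.
in_group_list() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   in_group_list "$(id -nG)" docker || echo "docker group not active in this shell"
```

Padding both the list and the name with spaces avoids a false match on a longer group name that merely contains "docker".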

Git

Git is not required to build the Python SDK image. This guide only uses it to download the Apache Beam code.

sudo apt-get install -y git 

Build Beam Python SDK Image

Download Apache Beam from the GitHub repository.

git clone https://github.com/apache/beam beam
cd beam

Make changes to the Apache Beam code.

Run the Gradle task to start the Docker image build. This will take several minutes. You can run :sdks:python:container:py<PYTHON_VERSION>:docker to build an image for a different Python version. See the supported Python version list. For example, py310 is for Python 3.10.

./gradlew :sdks:python:container:py310:docker 
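The task name and the resulting local image name both follow from the Python version. The sketch below spells out that mapping with hypothetical function names; the patterns are taken from the py310 and apache/beam_python3.10_sdk examples in this guide and are assumed to hold for other supported versions.

```shell
#!/bin/sh
# Hypothetical helper: turn "3.10" into ":sdks:python:container:py310:docker".
container_docker_task() {
  printf ':sdks:python:container:py%s:docker\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Hypothetical helper: turn "3.10" into "apache/beam_python3.10_sdk".
local_image_name() {
  printf 'apache/beam_python%s_sdk\n' "$1"
}

# Usage:
#   ./gradlew "$(container_docker_task 3.10)"
```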

If the build is successful, you can see the built image locally.

docker images 

Expected output:

REPOSITORY                   TAG      IMAGE ID       CREATED              SIZE
apache/beam_python3.10_sdk   2.60.0   33db45f57f25   About a minute ago   2.79GB
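In a script, the same check can be made without reading the table by eye. This is a minimal sketch with a hypothetical function name; it expects one "repository:tag" entry per line on standard input, which docker images can produce with its --format option.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the exact "repository:tag" string ($1)
# is one of the lines read from standard input.
has_image() {
  grep -Fxq -- "$1"
}

# Usage:
#   docker images --format '{{.Repository}}:{{.Tag}}' | has_image apache/beam_python3.10_sdk:2.60.0
```

The -F and -x flags make grep match the whole line literally, so the dots in the image name are not treated as wildcards.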

[!NOTE] If you run the build in your local environment and the Gradle task :sdks:python:setupVirtualenv fails because of an incompatible Python version, try passing -PpythonVersion with the Python version installed in your local environment (e.g. -PpythonVersion=3.10).

Push to Repository

You may push the custom image to an image repository. The image can then be used as a Dataflow custom container.

Google Cloud Artifact Registry

You can push the image to Artifact Registry. No additional authentication is necessary if you use Google Compute Engine.

docker tag apache/beam_python3.10_sdk:2.60.0 us-central1-docker.pkg.dev/<MY_PROJECT>/<MY_REPOSITORY>/beam_python3.10_sdk:2.60.0-custom
docker push us-central1-docker.pkg.dev/<MY_PROJECT>/<MY_REPOSITORY>/beam_python3.10_sdk:2.60.0-custom

If you push an image in an environment other than a VM in Google Cloud, you should configure docker authentication with gcloud before docker push.
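One way to do this is gcloud auth configure-docker, which takes the registry host as its argument. The sketch below composes the target image reference and extracts that host; the function names and the my-project and my-repo values are hypothetical placeholders, not values from this guide.

```shell
#!/bin/sh
# Hypothetical helper: compose
# "<region>-docker.pkg.dev/<project>/<repository>/<image>:<tag>".
ar_image() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical helper: take the registry host (the text before the
# first "/") from an image reference.
image_host() {
  printf '%s\n' "${1%%/*}"
}

# Usage (project and repository names are placeholders):
#   target=$(ar_image us-central1 my-project my-repo beam_python3.10_sdk 2.60.0-custom)
#   gcloud auth configure-docker "$(image_host "$target")"
#   docker tag apache/beam_python3.10_sdk:2.60.0 "$target"
#   docker push "$target"
```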

Docker Hub

You can push to your Docker Hub repository after running docker login.

docker tag apache/beam_python3.10_sdk:2.60.0 <my-account>/beam_python3.10_sdk:2.60.0-custom
docker push <my-account>/beam_python3.10_sdk:2.60.0-custom