Dataflow CI/CD with GitHub Actions

Sigal Shaharabani · Published in Israeli Tech Radar · Jan 4, 2022

Data pipelines are a very common tool for data processing. After writing my code for Dataflow and testing it, I turned to the task of continuous integration and deployment (aka CI/CD) of the pipeline jobs. This post details how I did it with GitHub Actions.

Dataflow

First, let’s remind ourselves what Dataflow is:

Dataflow is a managed service in the Google Cloud Platform (aka GCP) for “Unified stream and batch data processing that’s serverless, fast, and cost-effective.”

Dataflow is based on Apache Beam, a unified model for defining both batch and streaming data-parallel processing pipelines. In practice, this means I wrote a Beam job and ran it with the Dataflow runner to deploy the job to GCP.


My technical stack is:

  1. Apache Beam Java SDK
  2. Kotlin — cross-platform, statically typed, general-purpose programming language with type inference. Kotlin is designed to interoperate fully with Java
  3. Gradle — build automation tool for multi-language software development
  4. GitHub Actions — allows building, testing, and deploying your code right from GitHub
  5. gcloud — the primary CLI tool to create and manage Google Cloud resources

Running Dataflow jobs

For a Beam job to be sent to the Dataflow platform, the application (i.e. the main class) has to be run with the Dataflow runner and some GCP arguments. Submitting the job therefore looks something like:

java -cp my-jar.jar com.tikal.MainKt \
--runner=DataflowRunner \
--serviceAccount=**** \
--subnetwork=*** \
--region=****

The reason I specified the main class is that we have multiple jobs in one jar file.

CI

The goals of the CI were:

  1. Run unit tests
  2. Build a self-contained jar file
  3. Upload the jar to the company’s jar registry
  4. Make the file accessible to the CD stages that will follow

Run unit tests

To run unit tests, we used Gradle’s ‘build’ task which already includes running unit tests:

./gradlew build

Adding this command as a workflow step includes the unit tests in the CI.
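
For illustration, a minimal sketch of such a step (the step name is my own and not taken from our workflow) could be:

# A sketch of a workflow step running the Gradle build, which also runs the unit tests
- name: Gradle build and test
  run: ./gradlew build --info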

Self-contained jar

A self-contained jar means a jar that contains all of its dependencies and is ready to run with the command:

java -jar my-jar.jar

The reason for that is that the job runs on a serverless platform, so as a developer I have very little control over the classpath in which it runs. In addition, a self-contained jar makes running the job in the CD stage much easier, for the same reason.

To build the self-contained jar I used the com.github.johnrengelman.shadow Gradle plugin:

plugins {
    kotlin("jvm") version "1.5.21"
    id("com.github.johnrengelman.shadow") version "7.1.0"
}

Notice I use the Kotlin DSL for Gradle.

In addition, for the job to run successfully, all the service files in the jar (META-INF/services) need to be merged:

import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar

tasks.withType<ShadowJar> {
    mergeServiceFiles()
}

The GitHub workflow file contains:

jobs:
  build-jar:
    name: Build dataflow jobs jar
    runs-on: ubuntu-latest
    steps:
      - name: Checkout master
        uses: actions/checkout@master

      - name: Set up JDK 11
        uses: actions/setup-java@v1
        with:
          java-version: 11

      - name: Prepare gradlew
        run: gradle wrapper --info

      - name: Gradle build jar
        run: ./gradlew clean shadowJar --info

      - name: Cleanup Gradle Cache
        run: |
          rm -f ~/.gradle/caches/modules-2/modules-2.lock
          rm -f ~/.gradle/caches/modules-2/gc.properties

However, the ‘shadowJar’ task doesn’t run unit tests, and the ‘build’ task doesn’t run the ‘shadowJar’ task, forcing us to modify the workflow to:

jobs:
  build-jar:
    name: Build dataflow jobs jar
    runs-on: ubuntu-latest
    steps:
      - name: Checkout master
        uses: actions/checkout@master

      - name: Set up JDK 11
        uses: actions/setup-java@v1
        with:
          java-version: 11

      - name: Prepare gradlew
        run: gradle wrapper --info

      - name: Gradle build jar
        run: ./gradlew clean check shadowJar --info

      - name: Cleanup Gradle Cache
        run: |
          rm -f ~/.gradle/caches/modules-2/modules-2.lock
          rm -f ~/.gradle/caches/modules-2/gc.properties

The ‘check’ task runs all verification tasks, including ‘test’.

Upload the jar to the company’s jar registry

The company uses GitHub’s Maven registry. To upload the build artifacts I used the maven-publish Gradle plugin, so my Gradle plugins section was extended to:

plugins {
    kotlin("jvm") version "1.5.21"
    `maven-publish`
    id("com.github.johnrengelman.shadow") version "7.1.0"
}

Next, I specified the publication name and, of course, the registry and its credentials:

publishing {
    publications {
        create<MavenPublication>("dataflowJobs") {
            from(components["java"])
        }
    }

    repositories {
        maven {
            url = uri("https://maven.pkg.github.com/******/maven-packages/")
            credentials {
                username = System.getenv("MAVEN_USERNAME")
                password = System.getenv("MAVEN_PASSWORD")
            }
        }
    }
}

As you can see, the credentials are read from the environment, meaning they have to be provided by the CI process. They are provided in the workflow:

env:
  MAVEN_USERNAME: ${{ secrets.WRITE_PACKAGES_USER }}
  MAVEN_PASSWORD: ${{ secrets.WRITE_PACKAGES_TOKEN }}

The credentials are taken from company-wide secrets that are set for all company repositories, and they are injected into the environment at build time.
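
For completeness, here is a sketch of where that env block could sit; placing it at the job level (as below) is my assumption, and it can equally be declared per step:

jobs:
  build-jar:
    runs-on: ubuntu-latest
    # Declaring env at the job level makes the credentials visible to all steps of the job
    env:
      MAVEN_USERNAME: ${{ secrets.WRITE_PACKAGES_USER }}
      MAVEN_PASSWORD: ${{ secrets.WRITE_PACKAGES_TOKEN }}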

I then modified the build step in the workflow to publish:

- name: Gradle build and publish jar
  run: ./gradlew clean check publish --info

Note that ‘publish’ includes:

  • Building the shadow Jar (I suppose it detects the shadow plugin)
  • Creating a maven module
  • Publishing both the light jar and the shadow jar

Make the file accessible to the CD stages that will follow

At first, I thought that the CD stage would receive a parameter from the CI stage pointing to the jar built in the Maven registry; however, my research showed that this is not the way things are done with GitHub Actions.

My understanding is that the way to share a jar is to attach it to a build workflow, and for the next workflow to download it.

You may say: “If that is so, you no longer need to publish the jar to the Maven registry.” Though that is true, I did not want to break the company’s CI structure, so I kept publishing to the registry and set a very short retention for the attached artifacts.

To attach the artifact I used the upload-artifact GitHub action:

- name: 'Upload Artifact'
  uses: actions/upload-artifact@v2
  with:
    name: my-jar-all.jar
    path: /home/runner/work/my-jar/my-jar/build/libs/my-jar-*-all.jar
    retention-days: 1
    if-no-files-found: error

The result is a workflow run with an attached jar file: the shadow jar appears in the artifacts section of the workflow.

CD

Deploying Dataflow jobs means submitting the jobs to Google’s Dataflow platform to run. From our perspective, continuous deployment means deploying the jobs whenever a change is made to the main branch (i.e. on push).

At the time of writing this post, we have multiple jobs in the jar, so to achieve CD we added a new workflow job for each Dataflow job to the workflow discussed in the CI section:

deploy-job1:
  needs: build-jar
  name: Deploy job1
  runs-on: ubuntu-latest

The “needs” notation makes sure this job runs only after the jar is built, published and uploaded to the CI job artifacts.

The rest of this section describes the steps needed for each of the jobs; they are therefore repeated in our workflow for every job:

  1. Download the jar
  2. Prepare for running commands on GCP
  3. Run the job

Download the jar

To download the file we used the download-artifact GitHub action:

- name: Download artifact
  uses: actions/download-artifact@v2
  with:
    name: my-jar-all.jar

Now, this was strange: for some reason I did not dig into, the downloaded file keeps the jar’s original name from the CI job, i.e. the name it had before it was uploaded under a new file name. To resolve this, I renamed the file in one of the workflow’s steps (a simple mv in a “run” step), as sketched below.
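
A minimal sketch of that rename step (the file names here are illustrative assumptions, not the real ones):

- name: Rename downloaded jar
  # The artifact is downloaded under its original build name, so rename it to the expected name
  run: mv my-jar-1.0-all.jar my-jar-all.jar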

Prepare for running commands on GCP

There are multiple ways to authenticate with GCP; the workflow we wrote used a service account’s credentials. We used Google’s authentication GitHub action for this, and chose the authentication method after reading its documentation and Google’s recommendations.

- name: GCP Auth
  uses: google-github-actions/auth@v0.4.1
  with:
    *****

After the authentication, we installed the gcloud CLI using the gcloud setup GitHub action and, for convenience, set the GCP project ID so that we would not have to repeat it in all the following steps:

- name: GCP setup
  uses: google-github-actions/setup-gcloud@master
  with:
    project_id: ${{ secrets.GCP_PROJECT }}

Run the job

To run the job Java must be installed, so we added the JDK:

- name: Set up JDK 11
  uses: actions/setup-java@v1
  with:
    java-version: 11

At first I chose to “just deploy” the job with a unique name; however, on the next deployment the submission would fail because a job with the same name already exists. Eventually, after some research, I chose the following route:

  1. Check whether a job with the same name already exists. To check whether a job with that name is already running, you can use the gcloud CLI:
gcloud dataflow jobs list --region <region> --status=active | grep <job-name>

The result looks like the following:

<job ID> <Job name> <Type> <Creation Date> <Creation Time> <State> <Region>

2. Copy the state of the job named “my-job” to an output parameter named “status”:

- name: Check if dataflow job is running
  id: setStatus
  run: |
    status=`gcloud dataflow jobs list --region ${{ secrets.GCP_REGION }} --status=active | grep my-job | awk '{print $6}'`
    echo "::set-output name=status::$status"

3. Based on whether the status is ‘Running’, I decided whether to deploy the job as a new job or to run a rolling update.

The deployment steps are:

- name: Update dataflow job
  if: ${{ steps.setStatus.outputs.status == 'Running' }}
  run: <deploy rolling update>

- name: Create dataflow job
  if: ${{ steps.setStatus.outputs.status != 'Running' }}
  run: <deploy>

I do realize that a rolling update is not the best fit for every change, but in most cases it is the desired behavior; for the remaining scenarios the user can kill the jobs manually, or we may add a manual workflow to kill them (a sketch of such a workflow follows).
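
As an aside, a minimal sketch of such a manually triggered kill workflow (the workflow name, job name, and step layout are my assumptions, not part of our actual setup):

# A sketch, not one of the original workflows: a manually triggered
# workflow that cancels a running Dataflow job by its name.
name: Cancel dataflow job
on:
  workflow_dispatch:

jobs:
  cancel-job:
    runs-on: ubuntu-latest
    steps:
      # ... the same GCP Auth and GCP setup steps as in the deployment jobs ...
      - name: Cancel dataflow job
        run: |
          # Look up the running job's ID by its name, then cancel it
          job_id=`gcloud dataflow jobs list --region ${{ secrets.GCP_REGION }} --status=active | grep my-job | awk '{print $1}'`
          gcloud dataflow jobs cancel $job_id --region ${{ secrets.GCP_REGION }}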

The job deployment command to GCP Dataflow is in our case (the --update flag on the last line is added only in the rolling update step):

java -cp /home/runner/work/***/***/***-all.jar com.tikal.sigal.MyJobKt \
--runner=DataflowRunner \
--serviceAccount=${{ secrets._SA_EMAIL }} \
--subnetwork=<GCP subnetwork> \
--region=${{ secrets.GCP_REGION }} \
--tempLocation=gs://my-bucket/temp-location \
--usePublicIps=false \
--numWorkers=<numWorkers> \
--jobName=my-job
<--update>

When the workflow completes, the jars are on the Dataflow platform and the jobs are in the process of starting or being updated.

The result of the workflows described is:

  1. Our code compiled and published to the company’s jar registry
  2. Multiple Dataflow jobs deployed to GCP Dataflow’s platform and running (or preparing to run)

About me

I am a technical leader at Tikal and one of the Backend group’s leaders. I am passionate about backend and data systems.
