ERESULT TEAM DATA ENGINEERING

How to run a Talend ETL in Kubernetes (the easy way)

Maurizio Vivarelli
Jul 8, 2021 · 4 min read

Often, when IT professionals (or at least I) need to do a job for the first time, we look for help on the Internet.

Luckily, sometimes we find a good article on Medium … ahah

Now I have an integration project to realize between our Omniaplace data pipeline (MongoDB) and our Omniaplace application (SQL Server).

The first part concerns the development of the ETL itself, done with the Talend for Big Data Java framework.

The second part concerns where and how to run the resulting software module.

Given that I work on the pipeline side of our Omniaplace infrastructure, which is entirely deployed on Kubernetes, I decided to run this job within our on-premises cluster.

So this article covers the second part, but it's worth spending a few words on the first one.

The Talend module lets you create ETL jobs through a visual interface, keeping coding requirements to a minimum. The output is a Java package bundled with all the required libraries. So, to run it reliably, all you need to do is set up a Java environment with the same interpreter version.

In my case, on the Windows VM where I developed the job, I chose to install Java version 1.8.

So I identified the following Docker image:

openjdk:8u282-jdk-buster

given that Talend requires a full JDK.
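Before trusting the image, it is worth confirming it ships the same major Java version the job was built with. A small sketch (the captured line below is a sample; in practice you would capture yours with something like `docker run --rm openjdk:8u282-jdk-buster java -version 2>&1 | head -1` — note that java writes the version banner to stderr):

```shell
# Sample of the first line `java -version` prints for this image;
# replace it with the line captured from your container.
captured='openjdk version "1.8.0_282"'
# Check the major version matches the one the Talend job was built against.
case "$captured" in
  *'version "1.8'*) echo "version ok" ;;
  *)                echo "version mismatch" ;;
esac
# → version ok
```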

At this point I searched the Internet for a guide on how to test and schedule a Java application in Kubernetes, but I could not find one that suited me.

In particular, I wanted to find a methodology that:

  • does not require building or rebuilding Docker images;
  • gives a simple way to test and finalize the integration routine;
  • gives a simple way to schedule the finalized integration.

The rest of this article describes the steps I took to get there.

The key ideas were to:

  • work with persistent storage;
  • work with idle pods.

Let's see them in detail.

Testing

The first thing I wanted to do was test the script, with the ability to update the code without rebuilding the Docker image.

To achieve this you need a few building blocks:

  • a persistent storage provider, in the form of a default storage class;
  • an always-on pod with a fast and simple way to connect to its console;
  • optionally, custom name resolution for local services that reside outside the cluster (in my case, SQL Server).
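For the last point, one simple option is the pod spec's hostAliases field, which injects entries into the container's /etc/hosts. This is only a sketch: the IP and hostname below are placeholders, and the snippet belongs under the pod template's spec (at the same level as volumes and containers):

```yaml
spec:
  hostAliases:
    - ip: "10.0.0.25"                    # placeholder: IP of the on-premises SQL Server
      hostnames:
        - "sqlserver.mycompany.local"    # placeholder: name used in the job's connection string
```

With this in place, the Talend job can keep using the same hostname it used on the development VM.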

Putting it all together, below is a YAML manifest that creates an idle pod with persistent storage:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: java-scripts-pv
  labels:
    name: java-scripts-pv
spec:
  storageClassName: java-scripts-scn # same storage class as the PVC
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  nfs:
    path: <path to physical volume folder>
    server: <server name or IP>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: java-scripts-pvc
  namespace: java-scripts
  labels:
    name: java-scripts-pvc
spec:
  storageClassName: java-scripts-scn
  accessModes:
    - ReadWriteMany # must be the same as the PersistentVolume
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-container
  namespace: java-scripts
spec:
  replicas: 1
  selector:
    matchLabels:
      app: java-container
  template:
    metadata:
      labels:
        app: java-container
    spec:
      volumes:
        - name: java-container-storage
          persistentVolumeClaim:
            claimName: java-scripts-pvc
      containers:
        - name: java-container
          image: registry.hub.docker.com/library/openjdk:8u282-jdk-buster
          imagePullPolicy: "IfNotPresent"
          args: [/bin/bash, -c, 'while true; do echo $(date); sleep 30; done']
          volumeMounts:
            - mountPath: /usr/src/myapp
              name: java-container-storage
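The args line is what keeps the pod "idle": the container's only process loops forever, printing a timestamp every 30 seconds, so Kubernetes never sees it exit. A bounded stand-in for that loop (two iterations, one second apart, instead of forever) shows what it does:

```shell
# Bounded variant of the pod's keep-alive loop; the manifest's version is
# `while true; do echo $(date); sleep 30; done` and never terminates.
for i in 1 2; do
  echo "$(date)"
  sleep 1
done
```

Any long-lived no-op command would work here; the timestamp loop has the small bonus that the pod's logs show it is alive.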

Once the manifest is applied, the pod comes up and sits idle. With this command you can get to its console:

kubectl exec --namespace <namespace> --stdin --tty <pod name> -- /bin/bash

In this situation, all you need to do is decompress the job archive into the persistent volume folder and execute the script that Talend prepared for you.
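The steps inside the pod look roughly like the following. All the names here are hypothetical: a Talend build typically unzips into a job folder containing a <job>_run.sh launcher plus the bundled jars, so in the pod you would run something like `cd /usr/src/myapp && unzip myjob_0.1.zip && sh myjob_0.1/myjob_run.sh`. Below, that layout is simulated locally with a stub launcher just to show the run step:

```shell
# Simulate the unzipped Talend layout (hypothetical job name "myjob_0.1"):
# a job folder with a lib/ directory and a *_run.sh launcher script.
workdir=$(mktemp -d)
mkdir -p "$workdir/myjob_0.1/lib"
printf '#!/bin/sh\necho "myjob finished"\n' > "$workdir/myjob_0.1/myjob_run.sh"
chmod +x "$workdir/myjob_0.1/myjob_run.sh"
# Execute the launcher, as you would inside the pod.
"$workdir/myjob_0.1/myjob_run.sh"
# → myjob finished
```

Because the archive lives on the persistent volume, you can replace it and re-run without touching the image.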

Scheduling

At this point, with the routine finalized, below is the code to create the Kubernetes CronJob that schedules it:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: <job name>
  namespace: java-scripts
spec:
  schedule: "*/10 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  jobTemplate:
    metadata:
      labels:
        app: <label>
    spec:
      backoffLimit: 3
      template:
        spec:
          restartPolicy: "Never"
          volumes:
            - name: java-container-storage
              persistentVolumeClaim:
                claimName: java-scripts-pvc
          containers:
            - name: java-container
              image: registry.hub.docker.com/library/openjdk:8u282-jdk-buster
              imagePullPolicy: "IfNotPresent"
              args: [/bin/bash, -c, '<path to script>/<script name>_run.sh']
              volumeMounts:
                - mountPath: /usr/src/myapp
                  name: java-container-storage

And that's it.

Conclusion

Nothing difficult in this article; as the title says, it is just the easiest way for me to achieve the result.


Maurizio Vivarelli

Data Engineer, ERP developer. I like building things, sport, space and futurism.