ERESULT TEAM DATA ENGINEERING

How to run a Talend ETL in Kubernetes (the easy way)

Maurizio Vivarelli
Jul 8, 2021 · 4 min read

Often, when IT professionals (or at least I) need to do a job for the first time, we look for help on the Internet.

Luckily, sometimes we find a good article on Medium … ahah

Now I have an integration project to realize between our Omniaplace data pipeline (MongoDB) and our Omniaplace application (SQL Server).

The first part concerns the development of the ETL itself, done with the Talend for Big Data Java framework.

The second part concerns where and how to run the resulting software module.

Given that I work on the pipeline side of our Omniaplace infrastructure, which is entirely deployed on Kubernetes, I decided to run this job within our on-premises cluster.

So this article covers the second part, but it's worth spending a few words on the first one.

The Talend module lets you create ETL jobs through a visual interface, keeping coding requirements to a minimum. The output is a Java package bundled with all the required libraries. So, to run it reliably, all you need to do is set up a Java environment with the same interpreter version.

In my case, on the Windows VM where I developed the job, I chose to install Java version 1.8.

So I identified the following Docker image:

openjdk:8u282-jdk-buster

given that Talend requires a full JDK.
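Before trusting the image, it is worth confirming it ships the same major Java version the job was built with. A small sketch (the captured line below is a sample; in practice you would capture yours with something like `docker run --rm openjdk:8u282-jdk-buster java -version 2>&1 | head -1` — note that java writes the version banner to stderr):

```shell
# Sample of the first line `java -version` prints for this image;
# replace it with the line captured from your container.
captured='openjdk version "1.8.0_282"'
# Check the major version matches the one the Talend job was built against.
case "$captured" in
  *'version "1.8'*) echo "version ok" ;;
  *)                echo "version mismatch" ;;
esac
# → version ok
```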

At this point I searched the Internet for a guide on how to test and schedule a Java application in Kubernetes, but I could not find one that suited me.

In particular, I wanted to find a methodology that:

  • does not require building or rebuilding Docker images;
  • gives a simple way to test and finalize the integration routine;
  • gives a simple way to schedule the finalized integration.

The rest of this article describes the steps I took to get there.

The key ideas were to:

  • work with persistent storage;
  • work with idle pods.

Let's see them in detail.

Testing

The first thing I wanted to do was test the script, with the ability to update the code without rebuilding the Docker image.

To achieve this you need a few building blocks:

  • a persistent storage provider, in the form of a default storage class;
  • an always-on pod with a fast and simple way to connect to its console;
  • optionally, custom name resolution for local services that reside outside the cluster (in my case, SQL Server).
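For the last point, one simple option is the pod spec's hostAliases field, which injects entries into the container's /etc/hosts. This is only a sketch: the IP and hostname below are placeholders, and the snippet belongs under the pod template's spec (at the same level as volumes and containers):

```yaml
spec:
  hostAliases:
    - ip: "10.0.0.25"                    # placeholder: IP of the on-premises SQL Server
      hostnames:
        - "sqlserver.mycompany.local"    # placeholder: name used in the job's connection string
```

With this in place, the Talend job can keep using the same hostname it used on the development VM.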

Putting it all together, below is a YAML manifest that creates an idle pod with persistent storage:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: java-scripts-pv
  labels:
    name: java-scripts-pv
spec:
  storageClassName: java-scripts-scn # same storage class as the PVC
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  nfs:
    path: <path to physical volume folder>
    server: <server name or IP>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: java-scripts-pvc
  namespace: java-scripts
  labels:
    name: java-scripts-pvc
spec:
  storageClassName: java-scripts-scn
  accessModes:
    - ReadWriteMany # must be the same as the PersistentVolume
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-container
  namespace: java-scripts
spec:
  replicas: 1
  selector:
    matchLabels:
      app: java-container
  template:
    metadata:
      labels:
        app: java-container
    spec:
      volumes:
        - name: java-container-storage
          persistentVolumeClaim:
            claimName: java-scripts-pvc
      containers:
        - name: java-container
          image: registry.hub.docker.com/library/openjdk:8u282-jdk-buster
          imagePullPolicy: "IfNotPresent"
          args: [/bin/bash, -c, 'while true; do echo $(date); sleep 30; done']
          volumeMounts:
            - mountPath: /usr/src/myapp
              name: java-container-storage
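The args line is what keeps the pod "idle": the container's only process loops forever, printing a timestamp every 30 seconds, so Kubernetes never sees it exit. A bounded stand-in for that loop (two iterations, one second apart, instead of forever) shows what it does:

```shell
# Bounded variant of the pod's keep-alive loop; the manifest's version is
# `while true; do echo $(date); sleep 30; done` and never terminates.
for i in 1 2; do
  echo "$(date)"
  sleep 1
done
```

Any long-lived no-op command would work here; the timestamp loop has the small bonus that the pod's logs show it is alive.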

Once the manifest is applied, the pod comes up and sits idle. With this command you can get to its console:

kubectl exec --namespace <namespace> --stdin --tty <pod name> -- /bin/bash

In this situation, all you need to do is decompress the job archive into the persistent volume folder and execute the script that Talend prepared for you.
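The steps inside the pod look roughly like the following. All the names here are hypothetical: a Talend build typically unzips into a job folder containing a <job>_run.sh launcher plus the bundled jars, so in the pod you would run something like `cd /usr/src/myapp && unzip myjob_0.1.zip && sh myjob_0.1/myjob_run.sh`. Below, that layout is simulated locally with a stub launcher just to show the run step:

```shell
# Simulate the unzipped Talend layout (hypothetical job name "myjob_0.1"):
# a job folder with a lib/ directory and a *_run.sh launcher script.
workdir=$(mktemp -d)
mkdir -p "$workdir/myjob_0.1/lib"
printf '#!/bin/sh\necho "myjob finished"\n' > "$workdir/myjob_0.1/myjob_run.sh"
chmod +x "$workdir/myjob_0.1/myjob_run.sh"
# Execute the launcher, as you would inside the pod.
"$workdir/myjob_0.1/myjob_run.sh"
# → myjob finished
```

Because the archive lives on the persistent volume, you can replace it and re-run without touching the image.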

Scheduling

At this point, with the routine finalized, below is the code to create the Kubernetes CronJob that schedules it:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: <job name>
  namespace: java-scripts
spec:
  schedule: "*/10 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  jobTemplate:
    metadata:
      labels:
        app: <label>
    spec:
      backoffLimit: 3
      template:
        spec:
          restartPolicy: "Never"
          volumes:
            - name: java-container-storage
              persistentVolumeClaim:
                claimName: java-scripts-pvc
          containers:
            - name: java-container
              image: registry.hub.docker.com/library/openjdk:8u282-jdk-buster
              imagePullPolicy: "IfNotPresent"
              args: [/bin/bash, -c, '<path to script>/<script name>_run.sh']
              volumeMounts:
                - mountPath: /usr/src/myapp
                  name: java-container-storage

And that's it.

Conclusion

Nothing difficult in this article; as the title says, it is just the easiest way for me to achieve the result.


Maurizio Vivarelli

Data Engineer, ERP developer. I like building things, sport, space and futurism.