ERESULT TEAM DATA ENGINEERING

Spark Python - executor code debug

Maurizio Vivarelli
6 min read · May 15, 2022

A Visual Studio Code step by step guide

Introduction

Debugging Spark executor code is tricky. I struggled a lot before putting together all the pieces, so I decided to write down all the steps; maybe this will help somebody.

Let’s start with a brief introduction to the topics I will discuss in this article.

First of all, I had to investigate the various IDEs and related remote debugging tools available and choose between them. I wanted a solution that does not rely on licensed software packages, so I ended up choosing Visual Studio Code.

Then I had to verify connectivity between the executors and the debug machine (in my case the Spark cluster ran on an on-premises Kubernetes cluster).

Then I needed to find the correct place to put the server-side debug library call (this was not trivial).

Finally, it was necessary to take specific steps to handle the dynamic way in which Python code is run by the executors (closures).

So I will talk about:
- Connectivity
- Spark closures
- The actual debug procedure
- Remote print debugging
- Conclusion

Connectivity

This constraint is not really covered here because it depends on the platform on which the Spark cluster runs. In my case the Spark cluster ran on an on-premises Kubernetes cluster, and to get connectivity with the executors I just needed a static route on my development machine. It is worth noting that, for increased security, it is also possible to route the debug traffic through an SSH tunnel, as described here:

Spark closures

To execute jobs, Spark breaks up the processing operations into tasks, each of which is executed by an executor.

But what’s a closure?

A closure is a data structure that contains variables and methods that must be available to the executor to perform its computations.

The closure is prepared by the driver, then serialized and sent to each executor prior to execution (the closure is sent once per application).

This inner working can cause some confusion about how executors deal with global variables; in fact, they cannot deal with global variables at all:

And beyond that, it means that the entry point of execution is not backed by any file on the executors’ filesystem, which is a problem when you try to debug.
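The effect on global variables can be reproduced without a cluster using plain Python multiprocessing, which, like Spark, hands each worker process its own copy of the enclosing state. This is only an analogy, not Spark code:

```python
from multiprocessing import Pool

counter = 0  # a "driver-side" global

def increment(x):
    # Each worker process mutates its own copy of the global,
    # just as a Spark executor mutates its deserialized closure.
    global counter
    counter += 1
    return counter

if __name__ == "__main__":
    with Pool(2) as pool:
        pool.map(increment, range(8))
    print(counter)  # the "driver-side" copy is still 0
```

The workers really do increment their counters, but none of those updates ever reach the parent process, which is exactly what happens to a global updated inside a Spark task.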

We will discuss this and find a solution in the next paragraphs.

Actual debug procedure

First of all, it is necessary to install the debugpy library on the development machine and in the executor environment. In my case I had to rebuild the executor Docker image.

After that I had to find a place where code is executed once per Spark application and per executor.

It is possible to achieve that by wrapping the worker main code as shown below:
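A minimal sketch of such a wrapper, following the pattern from the PySpark debugging documentation (the module and port are the ones used in this article; adapt them to your setup):

```python
# remote_debug.py -- wraps the stock PySpark worker daemon so that every
# executor starts a debugpy server before it begins processing tasks.
import debugpy
import pyspark.daemon as original_daemon  # the standard worker entry point

if __name__ == "__main__":
    # Listen on all interfaces so the IDE on the development machine
    # can attach; 5680 is the port used later in this article.
    debugpy.listen(("0.0.0.0", 5680))
    original_daemon.manager()
```

This file must be present on the executors (e.g. baked into the Docker image) and importable from the worker’s Python path.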

After that, Spark needs to be told to run this file (named remote_debug.py) to start the Python application on the executors:
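Assuming the wrapper module is importable on the executors, one way to do this is the spark.python.daemon.module setting, for example from the driver:

```python
from pyspark.sql import SparkSession

# Start the executor Python workers through remote_debug instead of the
# default pyspark.daemon module.
spark = (
    SparkSession.builder
    .config("spark.python.daemon.module", "remote_debug")
    .getOrCreate()
)
```

The same setting can also be passed on the command line with `--conf`.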

Once that is done, every executor will, upon startup, listen on port 5680 waiting for the debugging IDE to connect.

As I said before, I chose Visual Studio Code as the tool for remote debugging.

To enable that, it is necessary to install the Python extension:

and then to create a remote debug configuration file:
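For reference, an attach configuration in .vscode/launch.json looks roughly like this (host and port follow this article’s setup; the path mapping is an assumption that depends on where your sources live on the executor):

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Attach to Spark executor",
            "type": "python",
            "request": "attach",
            "connect": {
                "host": "spark-exec-2",
                "port": 5680
            },
            "pathMappings": [
                {
                    "localRoot": "${workspaceFolder}",
                    "remoteRoot": "."
                }
            ]
        }
    ]
}
```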

It is worth noting that this is where you set which executor you will connect to, in my case spark-exec-2.

At this point it is possible to start the Spark application. I will do this from PyCharm:

and finally attach the debugger in VSC:

But we still have one last problem: if you try to set a breakpoint in the main file:

VSC tells you that there is no such file on the destination, so line-by-line debugging is not available.

You cannot set breakpoints, but you can still stop execution by putting the following statement in the code:

debugpy.breakpoint()

and you can see the stack trace and the content of variables, but after the break you still can’t step through line by line.

To overcome this I put the code I wanted to debug in a separate Python library file:

copied the file to the executors:

and referenced these functions from the main module:
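As a sketch, such a library file might look like this (the names are hypothetical; what matters is that the function lives in a real file, present at the same path on the executors, so the debugger can map breakpoints to it):

```python
# my_debug_lib.py -- hypothetical module holding the code to debug.

def process_row(value):
    # Breakpoints set on these lines resolve to a real file on the
    # executor, unlike code serialized inside the driver's closure.
    return value * 2

# In the main module, reference the function instead of an inline lambda:
#   import my_debug_lib
#   rdd.map(my_debug_lib.process_row).collect()
```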

Finally, at this point, I’m able to set breakpoints as usual ;-)

Remote print debugging

While remote debugging is obviously the solution of choice, sometimes old-school print debugging may work better.

The problem is that printing or logging on the executors goes to the console log of each executor, so from the driver you see nothing.

I always wanted to print or log something on an executor and see that information on the driver side.

Spark does not help with this; you have to build it yourself.

There are surely many ways to achieve this. My solution is based on a message-queue library named ZeroMQ:

Here’s how it works.

First of all, the ZeroMQ Python binding pyzmq must be installed on the development machine and on the executors, just like the VSC remote debug library.

Then the solution requires configuring the communication channel in PUSH/PULL mode.

The code on the development machine (which is also the driver) listens on a TCP port in PULL mode:
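A minimal sketch of the driver-side sink with pyzmq (the port and the handler are assumptions; the real code sinks the messages into the driver log):

```python
import threading
import zmq

def start_log_sink(port=5555, handler=print):
    """Bind a PULL socket and forward every executor message to
    `handler` (the local driver console by default), draining the
    pipe from a background thread."""
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.bind(f"tcp://*:{port}")

    def _drain():
        while True:
            handler(sock.recv_string())

    t = threading.Thread(target=_drain, daemon=True)
    t.start()
    return t
```

The thread is marked as a daemon so it does not keep the driver process alive after the Spark application finishes.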

The code above, running on a separate thread, listens on the communication pipe and sinks the received messages into the standard local driver log.

On each executor, the code opens a PUSH pipe:
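The executor side can be sketched like this (the driver hostname is whatever name the executors can resolve for the driver machine; an assumption here):

```python
import zmq

def open_log_pipe(driver_host, port=5555):
    """Connect a PUSH socket back to the driver's PULL sink; every
    string sent on it ends up in the driver's console log."""
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.connect(f"tcp://{driver_host}:{port}")
    return sock

# Inside a task, e.g. in a mapPartitions function, something like:
#   pipe = open_log_pipe("driver-host")
#   pipe.send_string(f"processing a partition of {n} rows")
```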

This way, logging commands like these:

result in messages displayed on the driver console.

Conclusion

It was not easy for me to put together all the material you saw here, but it was also a way to understand some concepts better. If I had found an article like this beforehand it would have been very helpful to me, so, as I said in the beginning, I hope somebody else will find it interesting and helpful too.

Other references

You can find source code here.
