Tame The Spark UI

Ani
4 min read · Jun 10, 2022

“An artist’s concern is to capture beauty wherever he finds it.”
Kazuo Ishiguro, An Artist of the Floating World

[Image: a Van Gogh painting]

We are data engineers, and Spark is our best friend and the natural choice when the job is massively parallel data processing. Many times a day we interact with Spark through the shell or an IDE, writing code or firing off intuitive commands to perform certain tasks. It is evident that not all of them will fly like butter on top of a hot, soft pancake! No, they don’t. When we try to understand the bottlenecks and failures of a Spark process, we look it in the face. Yeah, you guessed it right: the Spark UI.

I have seen many world-class packages over the years, and they do a fantastic job when we ask them about execution, monitoring, alerting or logging. In reality, though, few actually deliver on all of that, or provide the APIs for the consumer to extend and tweak them to achieve certain functionality.

Again, what I am going to show you might have been done by 1,000 people before me in 10,000 different ways, but here is my take on it.

You must have noticed one thing when you fire any Spark command and look into the UI: Spark spins up resources and produces the following.

Spark Job, Stage(s), Task(s)

When you go and look into the Spark UI, you basically don’t see the command you fired or the SQL you executed; you see only the actions, which are translated into job(s) and the DAG associated with them.
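A quick illustration of that, with a made-up range/filter chain; only the action at the end produces a job in the UI, the transformation itself shows up nowhere.

// Nothing appears in the Jobs tab for the transformation...
val big = spark.range(0, 100000000L).filter("id % 7 = 0")
// ...until an action fires; it shows up as a job named after its call site, e.g. "count at ...".
big.count()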

Enough of blabbering, let us get our hands dirty.

I have a simple method to execute SQLs (I am a fan of Scala, so expect me to show off 😊), and I am executing multiple SQLs one after another. Cool!
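If you want to picture that method, here is a minimal sketch, assuming an existing SparkSession; the name runSqls and the trailing collect() are purely illustrative, not the exact code from my project.

import org.apache.spark.sql.SparkSession

// Stand-in for the method described above: run a batch of SQL statements
// one after another on an existing SparkSession (all names are illustrative).
def runSqls(spark: SparkSession, sqls: Seq[String]): Unit =
  sqls.foreach { sql =>
    spark.sql(sql).collect()  // any action works; collect() just forces execution
  }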

So, how will the Spark UI Jobs tab show this to me? Eh! Is that what you ever wanted to see when you are debugging? No! Right?

Okay, you can tell me, “Hello stupid, open the SQL tab, you can see it there!” But, but, but... not here either? Yeah, that’s the pain.

When you use EMR or custom-configured Jupyter notebooks for execution you get a better experience, but when you run your own setup, you don’t. No worries, that is exactly what I am trying to achieve here.

Let us understand one thing: Spark is one of the best, maybe the best, piece of software of the last two decades, and it gives you tremendous opportunity to extend it, thanks to Scala.

Have you ever explored the SparkContext methods setLocalProperty, setJobGroup, setJobDescription, etc.? If not, go and read about them here.

The methods below are pretty cool, and by looking at their names and parameters you can guess what you can do with them.

public void setLocalProperty(String key, String value)
public void setJobGroup(String groupId, String description, boolean interruptOnCancel)
public void setJobDescription(String value)
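To give a feel for them, here is a rough sketch of how they could be called on a live SparkContext sc alongside a SparkSession spark; the property, group id, description and query are made up for illustration.

// Assumes a live SparkContext "sc" and SparkSession "spark"; all values are illustrative.
sc.setLocalProperty("owner", "ani")  // arbitrary key/value attached to jobs submitted from this thread
sc.setJobGroup("adhoc-check", "Row count for today's partition", true)
val rows = spark.sql("SELECT * FROM events WHERE dt = current_date()").count()
sc.cancelJobGroup("adhoc-check")  // cancels any jobs still running in this group; running tasks are interrupted because of the true flag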

So here is the deal: I am using CallSite for the short message, to display the SQL order.

public class CallSite
extends Object
implements scala.Product, scala.Serializable

CallSite represents a place in user code. It can have a short and a long form.
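Under the hood, the short and long forms are read from two local properties. As far as I can tell from the Spark source, the CallSite object is private[spark], so from user code I simply pass the literal keys:

// Local-property keys behind CallSite (see org.apache.spark.util.CallSite);
// the object itself is private[spark], so user code passes the string literals.
val SHORT_FORM = "callSite.short"
val LONG_FORM  = "callSite.long"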

I am doing two things here.

  1. In setLocalProperty I am telling the Spark context to set SQL_{n} as the short description of the job.
  2. In setJobDescription I am passing the SQL query as the job description.

Very simple
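Putting it together, here is a minimal sketch of the loop, assuming a SparkSession named spark; the queries are placeholders and collect() is just there to trigger the jobs.

val sc = spark.sparkContext

// Placeholder statements; in reality these come from wherever your SQLs live.
val sqls = Seq(
  "SELECT count(*) FROM orders",
  "SELECT count(*) FROM customers"
)

sqls.zipWithIndex.foreach { case (sql, i) =>
  sc.setLocalProperty("callSite.short", s"SQL_${i + 1}")  // SQL_{n} as the short description of the job
  sc.setJobDescription(sql)                                // the full SQL text as the job description
  spark.sql(sql).collect()                                 // run the statement
}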

The Magic

Now look into the Jobs tab in the Spark UI. Ta Da!

Hello Ani, can you show the SQL tab?

BINGO!

In this way, you can craft your own kind of messages and tame the Spark UI. Remember, Spark is the data engineer’s best friend and Scala is the data engineer’s closest guide.

For any kind of help regarding career counselling, resume building, discussing designs or knowing more about the latest data engineering trends and technologies, reach out to me at anigos.

P.S.: I don’t charge money.

