Apache Spark — Play with nested files

Ani
2 min read · Mar 2, 2023

“Setting goals gives you a life to live. When you have zero goals, it’s life that consumes you.” ― Thomas Vato

In general, when we want to read a file or a directory in Spark, we do something like this:

// read with default options: no header, "," as the delimiter
val path = "examples/src/main/resources/people.csv"

val df = spark.read.csv(path)
df.show()
+------------------+
| _c0|
+------------------+
| name;age;job|
|Jorge;30;Developer|
| Bob;32;Developer|
+------------------+
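Notice that everything landed in a single _c0 column: the file is semicolon-delimited, and we never told Spark about the separator. A minimal sketch of parsing it into proper columns (assuming the same people.csv shown above, with a header row):

val peopleDf = spark.read
  .option("sep", ";")     // the file uses ";" rather than the default ","
  .option("header", true) // treat the first line as column names
  .csv(path)

peopleDf.show()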

Imagine a situation where you need to read files underneath a directory that has multiple levels of subdirectories. In that case, the approach above will not work.

Now, Apache Spark is any data engineer’s best friend, so there is very little left for you to do. You can use the recursiveFileLookup option (available since Spark 3.0) for exactly this.

val filesDf = spark.read.format("csv")
  .option("recursiveFileLookup", "true") // descend into every subdirectory
  .option("pathGlobFilter", "*.csv")     // keep only files matching this pattern
  .option("header", true)                // first line of each file is the header
  .load("file:///tmp/data")

filesDf.show(false)
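For comparison, without recursiveFileLookup you could still reach nested files using glob wildcards in the load path, but only if you know the directory depth up front. A sketch, using the same hypothetical file:///tmp/data location:

val globDf = spark.read.format("csv")
  .option("header", true)
  .load("file:///tmp/data/*/*.csv") // matches CSVs exactly one level deep

recursiveFileLookup removes that limitation: it walks every level for you.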

Now say you have a requirement to know the name of the file each row is coming from. Yay, that’s easy too.

import org.apache.spark.sql.functions.input_file_name

The input_file_name function returns the full path of the file the current row was read from, so you can capture its output in a column and get the file name too. Isn’t it amazing?

val filesWithNameDf = filesDf.withColumn("file_name", input_file_name())

filesWithNameDf.show(false)
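Note that input_file_name() gives back the full file URI. If you only want the base file name, one way (a sketch using Spark’s built-in regexp_extract) is to strip everything up to the last slash:

import org.apache.spark.sql.functions.regexp_extract

val withBaseName = filesWithNameDf
  .withColumn("base_name", regexp_extract(input_file_name(), "[^/]+$", 0))

withBaseName.show(false)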

For any kind of help with career counselling, resume building, discussing designs, or learning about the latest data engineering trends and technologies, reach out to me at anigos.

P.S.: I don’t charge money.
