“Setting goals gives you a life to live. When you have zero goals, it's life that consumes you.” ― Thomas Vato
In general, when we read a directory in Spark, say with a pattern, we do this:
val path = "examples/src/main/resources/people.csv"
val df = spark.read.csv(path)
df.show()
+------------------+
| _c0|
+------------------+
| name;age;job|
|Jorge;30;Developer|
| Bob;32;Developer|
+------------------+
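Notice that everything lands in a single _c0 column: the file is semicolon-delimited, while Spark's CSV reader defaults to a comma separator and no header. A minimal sketch, assuming the same people.csv, that tells the reader about both:

```scala
// Sketch: parse a semicolon-delimited CSV with a header row.
// "sep" and "header" are standard DataFrameReader CSV options.
val peopleDf = spark.read
  .option("sep", ";")        // the file uses ";", not the default ","
  .option("header", "true")  // treat the first line as column names
  .csv("examples/src/main/resources/people.csv")
// peopleDf now has columns name, age, job instead of one _c0 column.
```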
Imagine a situation where you need to read files underneath a directory that has multiple levels of subdirectories. In that case, the approach above will not work.
Now, Apache Spark is any data engineer's best friend, so there is little left to do: you can use the recursiveFileLookup option to handle exactly this.
val filesDf = spark.read.format("csv")
.option("recursiveFileLookup", "true")
.option("pathGlobFilter","*.csv")
.option("header", true)
.load("file:///tmp/data")
filesDf.show(false)
Now say you also need to know which file each row came from. Yay, that's easy too.
import org.apache.spark.sql.functions.input_file_name
With the input_file_name function, you can capture its output in a column and get the file name as well. Isn't it amazing?
val filesWithNameDf = filesDf.withColumn("file_name", input_file_name())
filesWithNameDf.show(false)
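One thing to note: input_file_name() returns the full URI of the source file, e.g. something like "file:///tmp/data/sub/part-0001.csv". If you only want the base file name, a sketch using regexp_extract (the column name file_base is just an illustrative choice):

```scala
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

// "[^/]+$" matches everything after the last "/", i.e. the base name.
val withBaseNameDf = filesDf.withColumn(
  "file_base",
  regexp_extract(input_file_name(), "[^/]+$", 0)
)
withBaseNameDf.show(false)
```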
For any kind of help with career counselling, resume building, discussing designs, or learning about the latest data engineering trends and technologies, reach out to me at anigos.
P.S.: I don’t charge money.