I would rather walk with a friend in the dark, than alone in the light. — Helen Keller
Well, Apache Spark is that friend Helen was talking about, in the big data world. On many occasions we deal with such a low volume of data in some of the interfaces that introducing Kafka in between for staging is overkill; skipping it saves money and keeps us away from extra infrastructure management. In such situations, a simple approach in Scala to read the REST API data and convert it to a Spark DataFrame does a great job. Let’s see how we can do that.
The data
Here I am using a Node.js REST API service hosted on my MacBook at port 9090.
http://localhost:9090/epldata/
Sample
{
  "name": "Premier League 2015/16",
  "rounds": [
    {
      "name": "Matchday 1",
      "matches": [
        {
          "date": "2015-08-08",
          "team1": "Manchester United FC",
          "team2": "Tottenham Hotspur FC",
          "score": {
            "fullTime": [
              1,
              0
            ]
          }
        },
        {
          "date": "2015-08-08",
          "team1": "AFC Bournemouth",
          "team2": "Aston Villa FC",
          "score": {
            "fullTime": [
              0,
              1
            ]
          }
        },
        {
          "date": "2015-08-08",
          "team1": "Leicester City FC",
          "team2": "Sunderland AFC",
          "score": {
            "fullTime": [
              4,
              2
            ]
          }
        },
Schema
Here is the schema for the JSON response.
root
 |-- name: string (nullable = true)
 |-- matches: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- team1: string (nullable = true)
 |    |    |-- team2: string (nullable = true)
 |    |    |-- score: struct (nullable = true)
 |    |    |    |-- fullTime: array (nullable = true)
 |    |    |    |    |-- element: double (containsNull = false)
Approach
I have always loved the way Scala and Spark understand each other, and the relationship between Datasets and case classes is a testimonial of exactly that. For the very same reason, I am using case classes to receive the response, as sketched below.
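Here is a minimal sketch of those case classes, assuming field names that mirror the JSON payload above (fullTime is typed as List[Double] to line up with the printed schema, and Rounds is the name used later in the post):

case class Score(fullTime: List[Double])
case class Match(date: String, team1: String, team2: String, score: Score)
case class Rounds(name: String, matches: List[Match])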
My helper imports
import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
The body
Declaring an implicit Formats value lets json4s extract the parsed response straight into the case classes, which is how the schema gets inferred, so we use one here. The sketch below puts it all together.
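This is a sketch of that body, assuming the case classes above, the imports listed earlier, and that the endpoint on port 9090 is reachable (mySourceDataset is the name used in the rest of the post):

import scala.io.Source

val spark = SparkSession.builder()
  .appName("rest-to-dataset")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// json4s needs an implicit Formats in scope to extract JSON into case classes
implicit val formats: Formats = DefaultFormats

// Pull the raw JSON from the REST endpoint
val response = Source.fromURL("http://localhost:9090/epldata/").mkString

// Parse the payload and extract the rounds array into our case classes
val rounds = (parse(response) \ "rounds").extract[List[Rounds]]

// Turn the Scala collection into a typed Dataset[Rounds]
val mySourceDataset = spark.createDataset(rounds)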
All set!
Now you have mySourceDataset, which is nothing but a Dataset[Rounds]. Here you go!
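As a quick sanity check (a sketch; printSchema should reproduce the schema shown above):

mySourceDataset.printSchema()
mySourceDataset.show(false)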
Using the DataFrame/Dataset APIs or the Spark SQL API, you are good to work with the same data. You can now write it to a data lake, an RDBMS, or any cloud data warehouse.
mySourceDataset.createOrReplaceTempView("epl")
spark.sql("select epl.name as Matchday, explode(epl.matches) as matches from epl").show(false)
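And as a sketch of persisting it, assuming a local output path that you would swap for your own data lake, RDBMS, or warehouse sink:

spark.sql("select epl.name as Matchday, explode(epl.matches) as matches from epl")
  .write
  .mode("overwrite")
  .parquet("/tmp/epl_matches")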
Bingo!
For any kind of help with career counselling, resume building, or discussing designs, or to know more about the latest data engineering trends and technologies, reach out to me at anigos.
P.S.: I don’t charge money.