Design a Scalable Data Solution: Know the Requirements

Ani
3 min readJul 17, 2022

“A recipe has no soul. You, as the cook, must bring soul to the recipe.”
– Thomas Keller

What do you need?

To me, we data engineers should be no different from spiritual seekers. A seeker’s mind is always curious, always looking for more clarity and more transparency. Unless things become clear to the heart, there is no true realisation.

This is entirely my opinion, and I am not forcing anyone to see things my way.

When we talk about designing data engineering solutions, I have seen many people rush straight to creating tables and asking for dashboard requirements. That is the biggest mistake one can make on day one, and it is enough to ruin the design. There is no shortcut to a scalable, successful design.

If you follow, or are at least aware of, the software development life cycle (SDLC), then you know where I am asking you to put more effort: yes, Planning and Requirement Analysis, the fundamental and most critical phase in designing the best solution. Let me try to explain it in my natural storytelling style.

Questions, Questions

Your user, customer, or whoever your target consumer is deserves the maximum of your time in the initial phases of a solution. Ask as many questions as you can. Now the question is: what are you going to ask?

A good question needs qualification!

When you are gathering requirements, you need to understand the variety of data use cases across the enterprise and frame your set of questions accordingly.

Putting yourself in your customer’s shoes is the best way to understand their pain points. Do not chase shiny technologies and over-engineer a solution. You only need a fork to eat a sausage, not a chainsaw.

Let us ask the questions now.

Strengthen your Quants:

  • What are the different source systems?
  • How many are on-premises, and how many are on the cloud?
  • How many of the sources are batch, and how many are streaming (latency)?
  • How much volume are we talking about?
  • What is the frequency of data ingestion?
  • When do you need the data? Yes, the hot/cold path.
  • How many users do you have?
  • How many visualisation outcomes, aka reports/dashboards?
  • How many of the reports are live, extracts, or time-based?
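Writing the answers down as a structured record keeps them from living only in someone’s head, and lets early design decisions fall out of the data. The sketch below is purely illustrative; the field names and the `needs_streaming_stack` check are my own assumptions, not a standard intake form.

```python
from dataclasses import dataclass

@dataclass
class SourceIntake:
    """One row of answers to the quantitative questions, per source system."""
    name: str
    on_premises: bool        # on-premises vs. cloud
    streaming: bool          # batch vs. streaming (latency)
    daily_volume_gb: float   # rough volume estimate
    ingest_frequency: str    # e.g. "hourly", "daily", "continuous"
    hot_path: bool           # when is the data needed?

def needs_streaming_stack(sources):
    """A streaming platform is only justified if at least one source demands it."""
    return any(s.streaming or s.hot_path for s in sources)

sources = [
    SourceIntake("orders_db", True, False, 120.0, "daily", False),
    SourceIntake("clickstream", False, True, 800.0, "continuous", True),
]
print(needs_streaming_stack(sources))  # True
```

The point is not the code itself but the discipline: if no intake record says “streaming” or “hot path”, you have just saved yourself from over-engineering a chainsaw for a sausage.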

Understand the Qualitative Asks

  • What types of SQL/analytics/ML use cases will the users run?
  • Do the users need a sandbox area for wrangling, experimenting, creating objects, etc.?
  • Are you looking for data federation, i.e., combining heterogeneous datasets — say, joining Oracle data with Redshift or even Delta tables?
  • How do the users want to schedule their jobs?
  • Do the users need a report on data publication, for alerting and anomaly checking?
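The data federation question deserves a concrete picture. Federation means querying across heterogeneous systems without first consolidating them into one store. The plain-Python sketch below fakes two sources with hypothetical in-memory rows (standing in for, say, an Oracle extract and a Redshift extract) and joins them on a shared key; in practice a federation engine does this across live connections.

```python
# Hypothetical rows standing in for two heterogeneous sources.
oracle_orders = [
    {"order_id": 1, "customer_id": 101, "amount": 250.0},
    {"order_id": 2, "customer_id": 102, "amount": 90.0},
]
redshift_customers = [
    {"customer_id": 101, "region": "EMEA"},
    {"customer_id": 102, "region": "APAC"},
]

def federated_join(left, right, key):
    """Hash-join two row sets on a shared key, without moving either source."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

result = federated_join(oracle_orders, redshift_customers, "customer_id")
print(result[0]["region"])  # EMEA
```

If the users answer “yes” to federation, this join logic has to run somewhere — a query engine, a virtualisation layer, or a pipeline — and that single answer materially changes your candidate architectures.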

Once you have the answers to all the above questions, you can start sketching the design of your data pipelines. Without proper clarification, the system will fail badly, not just in delivery but in its very design. Do not always try to generalise; otherwise, instead of building the bicycle you planned to ride on the road, you might build an aeroplane and leave it parked in your courtyard because you have no permission to fly.

How to design the solution and how to choose between candidates, I will try to cover in my next article. Thanks for reading.

For help with career counselling, resume building, discussing designs, or learning more about the latest data engineering trends and technologies, reach out to me at anigos.

P.S.: I don’t charge money.


Ani

Big Data Architect — Passionate about designing robust distributed systems