How we scaled our data platform from 30 terabytes to 10 petabytes
Here are some of the key lessons we learned in scaling toward a petabyte-scale data platform.
I am writing this for data and engineering teams, and for CEOs and CTOs of small startups, to share lessons I’ve learned in evolving effective data platforms. This is not a discussion of strategy, just curated insights from one journey of building a data platform.
Introduction
As a startup or an established enterprise, you are probably inundated with a wave of data you need to acquire, process, store, operate, and secure, and so you have built a data platform. It started with collecting transactional data, stored either on your local network or, more recently, pushed to the cloud. With these first steps you built your first data-driven project: perhaps a simple forecast model or a machine learning application that powers your startup or enables your enterprise’s digital transformation.
“The journey that brought you here is different from the next phase of the journey, which will push you further.”
This is a collection of learnings from building data platforms that I want to share with you, so you might consider them as you push forward.
From centralised to localised teams
At the start of your journey, you needed specialized technology and people with specialized skills to help you bootstrap a data warehouse, data lake, or data platform. It made sense to centralize the effort and to prioritize solving your organization’s biggest problem.
Fast forward a few years: you have created a data capability that allows you to ingest, transform, and store the data needed for analytics and data science work. However, a centralized data team usually becomes a bottleneck, unable to react to ever-evolving business needs. Consider that each business domain is solving different problems for its own stakeholders and customers.
Building hyper-local data teams that are business-domain-aware is the logical next step. Each team can be composed of engineers, analysts, and data scientists, supported by a designer or product manager who determines which problems are worth solving. Problems should be translated into opportunities, and when designing solutions, the data requirements should be clear.
For example, suppose you are solving the cold-start problem: customers who use your app for the first time and do not yet know what content they like. Building a content recommendation engine is a good way forward. For this, you need to acquire data from transactional systems about which content is consumed by which cohort of users, and you need to organize content items that are related to each other. In this use case alone you need to collect how much time users spend on content (video, blog, audio), where in the journey users fall off and why, and the demographics and circumstances that lead users to consume content. A multitude of data points is needed to build a recommendation engine, so to scale this you need a hyper-local data team that can work together on these problems.
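To make the data needs concrete, here is a minimal sketch of the kind of interaction events such a team might collect and aggregate. The event fields, cohort labels, and function names are hypothetical, not a prescribed schema.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class ContentEvent:
    """One interaction between a user and a piece of content (video, blog, audio)."""
    user_id: str
    cohort: str           # e.g. "new_user", "returning" (illustrative labels)
    content_id: str
    content_type: str     # "video" | "blog" | "audio"
    seconds_spent: float
    completed: bool       # did the user finish, or fall off mid-journey?

def engagement_by_cohort(events):
    """Aggregate total time spent per (cohort, content) pair, a basic input
    for a cold-start recommendation model."""
    totals = defaultdict(float)
    for e in events:
        totals[(e.cohort, e.content_id)] += e.seconds_spent
    return dict(totals)

if __name__ == "__main__":
    events = [
        ContentEvent("u1", "new_user", "c42", "video", 120.0, False),
        ContentEvent("u2", "new_user", "c42", "video", 300.0, True),
        ContentEvent("u3", "returning", "c7", "blog", 45.0, True),
    ]
    print(engagement_by_cohort(events))
```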
From engineers to data product builders
You need to evolve your engineering teams so that data is not just a byproduct of their work but a first-class feature of any application they build, delivered as data products the whole company can build on. What is a data product?
A data product is the output of data processing, consumed by applications or humans, that drives business value.
Be it measuring product success, building a data-driven application, or establishing analytical capability, your engineering teams should take on the role of “data producers.”
What does that entail? It means the team creates datasets from the applications they build, makes them shareable through a standard interface (REST API or batch), and stores them following good data design (e.g. proper partitioning). The team also needs to operate these datasets: changes to the code should, as part of regression testing in their CI/CD pipelines, check every downstream consumer of the data. The team needs to make the data easily discoverable (see next section) for everyone in the organization and provide full documentation (e.g. column data types, column descriptions, and sample values). Finally, the team should be aware of the costs involved in producing, maintaining, and operating the data. Most cloud providers such as AWS, GCP, and Azure offer automated cost alerts against defined budgets; engineering teams should use these to keep data production cost-efficient.
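As one simplified illustration of that regression check, a producing team might declare the schema of each published dataset and fail the CI build if a code change removes a column that a registered downstream consumer depends on. The dataset columns and the consumer registry below are hypothetical.

```python
# Minimal sketch of a CI check: fail the build if a schema change breaks
# a column that a downstream consumer depends on. All names are illustrative.

PUBLISHED_SCHEMA = {          # schema produced by the new code under test
    "order_id": "string",
    "order_ts": "timestamp",
    "amount_eur": "double",
}

DOWNSTREAM_CONSUMERS = {      # hypothetical registry of consumers and the columns they read
    "finance_dashboard": ["order_id", "amount_eur"],
    "churn_model": ["order_id", "order_ts"],
}

def check_downstream_compatibility(schema, consumers):
    """Return, per consumer, the columns it needs that are missing from the schema."""
    broken = {
        name: [col for col in cols if col not in schema]
        for name, cols in consumers.items()
    }
    return {name: cols for name, cols in broken.items() if cols}

if __name__ == "__main__":
    problems = check_downstream_compatibility(PUBLISHED_SCHEMA, DOWNSTREAM_CONSUMERS)
    if problems:
        raise SystemExit(f"Schema change breaks downstream consumers: {problems}")
    print("All registered consumers still satisfied.")
```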
Make data discoverable, accessible, and secure
Ask your data science, analytics, or data engineering team: what is the hardest part of building an analytics or machine learning project?
Discovering and accessing data is the hardest part when bootstrapping any data-driven project.
Any data-driven project goes through a phase of exploring what datasets are available. These datasets are usually not in a central data lake; they reside in transactional databases or local storage. The goal is to make this data easy to find and access.
Discoverability: There are plenty of data catalog vendors in the market that solve the data discoverability problem. You can either use an off-the-shelf solution or build something internally. Data discoverability includes data catalog services, data lineage, and metadata management. Most importantly, for the catalog to work, each team should provide details about its data (a minimal catalog entry is sketched after the list):
Tables, columns, and column descriptions
Data entity, data types, and table refresh frequency
Owning team, producing team, and support contacts for tables
Lastly, as a bonus, the statistical shape of the data: density, frequency, skewness, and so on
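Putting those items together, a single catalog entry could look like this minimal sketch (assuming a simple in-house catalog; all names and values are illustrative).

```python
# A minimal, hypothetical catalog entry covering the items above.
catalog_entry = {
    "table": "analytics.content_engagement_daily",
    "description": "Daily time spent per user cohort and content item.",
    "columns": {
        "cohort":        {"type": "string", "description": "User cohort label."},
        "content_id":    {"type": "string", "description": "Content identifier."},
        "seconds_spent": {"type": "double", "description": "Total seconds spent."},
    },
    "refresh_frequency": "daily",
    "owning_team": "growth-data",
    "producing_team": "apps-platform",
    "support_contact": "growth-data@example.com",
    # Bonus: statistical shape of the data, refreshed alongside it.
    "profile": {"row_count": 1_250_000, "seconds_spent": {"skewness": 2.3}},
}
```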
Accessibility: To make data available, your engineering team might want to build a standard library for connecting to supported data sources, with proper authentication applied. This could be a Python library that connects to your known data sources and can be installed in any supported environment. You also need an environment where data can be explored. Some teams use notebook environments, for example Google Colab (free, usually for research) or an internal JupyterLab with dedicated resources where analysts, data scientists, and data engineers can do exploratory analysis. When a large amount of data needs to be processed, high-capacity compute technologies such as Databricks, AWS EMR, and Amazon SageMaker can be used; users should be reminded to monitor the costs of these carefully.
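A sketch of what such an internal access library could look like, assuming a SQL-speaking warehouse, SQLAlchemy as the client, and credentials coming from a central secrets store; all source names and environment variables are hypothetical.

```python
# Sketch of an internal data-access library: one entry point, central auth,
# and only registered sources can be reached. Everything here is illustrative.
import os
import sqlalchemy  # assumes SQLAlchemy is installed and sources speak SQL

_REGISTERED_SOURCES = {
    "orders_db": "postgresql://{user}:{password}@orders.internal:5432/orders",
    "warehouse": "postgresql://{user}:{password}@warehouse.internal:5439/dwh",
}

def connect(source_name: str):
    """Return a SQLAlchemy engine for a registered source, with credentials
    pulled from the environment (in practice, a central secrets manager)."""
    if source_name not in _REGISTERED_SOURCES:
        raise ValueError(f"Unknown data source: {source_name}")
    url = _REGISTERED_SOURCES[source_name].format(
        user=os.environ["DATA_PLATFORM_USER"],
        password=os.environ["DATA_PLATFORM_PASSWORD"],
    )
    return sqlalchemy.create_engine(url)

# Usage from a notebook might look like:
#   engine = connect("warehouse")
#   df = pd.read_sql("SELECT * FROM analytics.content_engagement_daily LIMIT 100", engine)
```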
A single source of truth is partially true
You have heard the saying:
“There is no single source of truth.”
While this is a clear problem that requires solving, it is not easy. With the explosion of data sources, the multiple data pipelines created by your team, and the multiple consumers of data, you need to accept that not all data is made equal. One data pipeline may be highly tailored to a specific use case, while other data is raw and general-purpose. You need to change your operating model.
The rule you can follow is:
Data is curated: if data is used by multiple analytics or BI applications, or serves as a source for multiple ML projects, then the KPIs calculated in curated datasets should be defined centrally so they are standard across the consuming applications. Curated data needs conformed dimensions, which allow facts and measures to be categorized and described in the same way across multiple fact tables and data marts, ensuring consistent reporting across the enterprise. Date is a common conformed dimension: attributes like week, month, and year have the same meaning when joined to any fact table (a small sketch follows these two rules).
Data is ad hoc: this data usually comes from early data discovery or is a view of another dataset built for a specific purpose, such as data scraped from websites or an initial transformation of raw data. Present it to consumers as raw as possible, without pre-calculations, and expect consumers to define the business logic themselves. Ad hoc data that becomes useful must be curated.
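Here is a minimal sketch of a conformed date dimension, assuming a pandas-style workflow; the fact table and column names are illustrative only.

```python
import pandas as pd

# Build a small conformed date dimension: one row per day, with week / month /
# year attributes that mean the same thing for every fact table that joins to it.
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
dim_date = pd.DataFrame({
    "date_key": dates.strftime("%Y%m%d").astype(int),
    "date": dates,
    "week": dates.isocalendar().week.values,
    "month": dates.month,
    "year": dates.year,
})

# Any fact table (orders, content views, ...) joins on date_key and inherits
# the same week/month/year semantics, which keeps KPIs consistent across marts.
fact_orders = pd.DataFrame({"date_key": [20240105, 20240212], "amount_eur": [42.0, 13.5]})
report = fact_orders.merge(dim_date, on="date_key").groupby("month")["amount_eur"].sum()
print(report)
```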
Data is not the new oil
This analogy is flawed. Many organizations fell into this trap during the early ramp-up of data and digitalization technologies, jumping on the bandwagon of creating data platforms and unloading as much data into them as possible.
Not all data is created equal. If you treat data as intrinsically valuable, you tend to just store it, and the attitude of “let’s store data just in case” does not scale in the long term.
While data storage is cheap, compute, operations, and maintenance are not! Only process data that is required to solve a clear stakeholder or customer problem, and have a transparent way to measure the impact of producing it. What ROI can you attribute to turning such data into curated datasets supported by properly engineered data pipelines?
Of course, you cannot always determine the value of data in the short term; what if the value lies in the future? Consider creating a lightweight, performant architecture where applications can share data (batch or via API) and automated pipelines land it in a very raw format. This data-publishing infrastructure should be operated as frugally as possible. If the data becomes useful, that is the time to scale the infrastructure.
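One frugal pattern, sketched below under the assumption of file- or object-based storage with date partitions (paths and names are illustrative), is to land shared events as raw, compressed JSON lines and defer any modelling until the data proves useful.

```python
import json
import gzip
from datetime import datetime, timezone
from pathlib import Path

def land_raw_events(events, root="data-lake/raw/app_events"):
    """Append events as compressed JSON lines under a date partition.
    No schema enforcement, no transformation: just cheap, raw landing."""
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    partition = Path(root) / f"dt={today}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "events.jsonl.gz"
    with gzip.open(path, "at", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return path

if __name__ == "__main__":
    print(land_raw_events([{"type": "page_view", "user_id": "u1"}]))
```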
Balancing autonomy & innovation while maintaining order
Teams can be given autonomy to innovate with data to help them solve stakeholder and customer problems. They need to be able to acquire, transform, and store data with ease, and to do it securely. For this, you need a central approach to creating platform services, owned by foundational central data teams, to ensure deep and consistent integration. This infrastructure can help with auto-provisioning of compute and storage and ensure that proper security (logical data isolation) and compliance (e.g. handling PII, personally identifiable information) are in place. The main goal of this central approach is to make bootstrapping data technologies easy so that new projects can hit the ground running with data.
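As one small example of what a central platform service could standardize, here is a sketch of a shared helper that pseudonymizes PII fields before data leaves a domain pipeline; the field list and salting scheme are illustrative, not a compliance recommendation.

```python
import hashlib

# Hypothetical shared helper from a central platform library: pseudonymize
# PII fields the same way in every domain pipeline.
PII_FIELDS = {"email", "phone", "full_name"}

def pseudonymize(record: dict, salt: str = "platform-wide-salt") -> dict:
    """Return a copy of the record with PII fields replaced by salted hashes."""
    cleaned = dict(record)
    for field in PII_FIELDS & cleaned.keys():
        value = str(cleaned[field]).encode("utf-8")
        cleaned[field] = hashlib.sha256(salt.encode("utf-8") + value).hexdigest()
    return cleaned

if __name__ == "__main__":
    print(pseudonymize({"email": "jane@example.com", "country": "SE"}))
```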
Summary
As you have probably noticed, I did not recommend any solutions, architectures, or designs here. These are merely learnings; the idea is for you, as a data leader, to figure this out with your team by starting conversations around these topics. If you found this useful, a clap would motivate me to write more. Thank you for reading!