It's Not All About Data Pipelines


Before data pipelines, there were spreadsheets taped together with array formulae and VBA.

It's easy to get sucked into data pipelines. They're a big part of data engineering, and they've made our lives significantly easier when we have to explain what we do. The term data pipeline is an abstraction that covers how we take meaningless data and transform it into something useful; dare I say it, insights... In future posts we'll discuss different types of pipeline: ELT, ETL, orchestration, and various no-code data extraction tools. There's plenty of time to dig into pipelines later.

Feathers and Frameworks

To help navigate the world of data engineering, I have enlisted the help of two birds: Owlgorithm the Owl and Petabyte the Penguin. Today they’ll be helping me to explain infrastructure.

Owlgorithm is known for his wisdom and Petabyte for her ability to adapt to harsh environmental factors.

A nest is only as good as the materials it’s made from. Use the wrong type of stick and you’re gonna have a bad time. We can say the same about infrastructure. Let’s talk about some of the components you might need to consider.

Compute

Before the days of cloud computing, compute was provisioned on bare-metal servers. This made infrastructure more costly and less flexible. Platform as a service (PaaS) became popular because it offered tools to simplify deployments, better scalability, and cost benefits.

Today, Databricks and Snowflake are the big compute players in the data space. The more data you have, the more compute and query optimisation you'll need to balance expenditure against speed. PaaS is still popular, but it now layers on top of cloud infrastructure to make it easier to use, albeit at premium prices.
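To make that trade-off concrete, here's a rough Python sketch using the snowflake-connector-python package: scale a warehouse up for a heavy query, then shrink it back down so you aren't paying for idle compute. The account details, warehouse name and query are all made up.

import snowflake.connector

# Connect to Snowflake; these connection details are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# Scale the warehouse up for a heavy aggregation...
cur.execute("ALTER WAREHOUSE analytics_wh SET warehouse_size = 'LARGE'")
cur.execute("SELECT customer_id, SUM(amount) FROM payments GROUP BY customer_id")

# ...then scale back down so idle compute doesn't burn credits.
cur.execute("ALTER WAREHOUSE analytics_wh SET warehouse_size = 'XSMALL'")
conn.close()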

There are other cases where you may choose Kubernetes or containers to run code. This is a valid choice if you have custom ingestion or export code driven by a workflow orchestrator such as Airflow or Prefect. I recently found that AWS Fargate in its ECS flavour works particularly well, without the base costs of EKS.
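As a sketch of what that custom code might look like, here's a minimal Prefect flow (assuming Prefect 2.x) that you could package into a container and run on Fargate. The source API, retry settings and load step are entirely illustrative.

import requests
from prefect import flow, task

@task(retries=3, retry_delay_seconds=30)
def extract() -> list[dict]:
    # Pull records from a (hypothetical) source API.
    response = requests.get("https://api.example.com/orders")
    response.raise_for_status()
    return response.json()

@task
def load(records: list[dict]) -> None:
    # In a real pipeline this would write to your warehouse or lake.
    print(f"Loaded {len(records)} records")

@flow(log_prints=True)
def ingest_orders():
    load(extract())

if __name__ == "__main__":
    ingest_orders()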

Data Lakehouse or Data Warehouse?

Unfortunately, a data lakehouse doesn't have a magical mailbox like the 2006 film The Lake House, but it is a pretty nifty and optimised way of storing data.

Data lakehouses are designed to store a wide range of data types in their original formats, using object storage. In contrast, data warehouses are optimised specifically for structured data and SQL-based analytics.

Owlgorithm likes the structure of data warehouses because every piece of data is meticulously placed, similar to the construction of his nest. In contrast, Petabyte's nest relies on a few different materials, so she prefers data lakehouses.

In the real world you might use S3 as the foundation for your data lakehouse, while Snowflake is a solid choice for data warehousing.
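For instance, landing raw data in an S3-backed lakehouse can be as simple as writing Parquet files to object storage. Here's a minimal pandas sketch, assuming the pyarrow and s3fs packages are installed, with a made-up bucket and schema.

import pandas as pd

# A tiny batch of "raw" events in their original, semi-structured shape.
events = pd.DataFrame(
    {
        "event_id": [1, 2, 3],
        "event_type": ["click", "view", "click"],
        "payload": ['{"page": "home"}', '{"page": "pricing"}', '{"page": "home"}'],
    }
)

# Write to object storage as Parquet, partitioned by event type so
# downstream queries can skip the data they don't need.
events.to_parquet(
    "s3://my-data-lake/raw/events/",
    partition_cols=["event_type"],
)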

Data freshness

Data freshness will impact how much compute and storage you need, but let’s face it, there are some cases where you need real-time data and other cases where you don’t.

The main reasons to reach for real-time (or at least near real-time) data are time sensitivity, or a source system that creates significant “back-pressure”.

Don’t get me wrong, there are plenty of use cases where real-time data is important: train timetables, credit card transactions, and health monitoring, to name a few. But before you build something real-time, consider whether a daily batch would be good enough and more cost-effective. This is especially attractive if you have a lot of data transformation to do!
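And if the daily batch wins, scheduling it is about as simple as a cron expression. Here's a minimal sketch using Prefect again (flow.serve assumes Prefect 2.10 or later; the flow name and schedule are just examples).

from prefect import flow

@flow(log_prints=True)
def daily_batch():
    print("Running the overnight transform...")

if __name__ == "__main__":
    # 06:00 every day is plenty fresh for a morning dashboard,
    # and far cheaper than keeping a streaming pipeline warm.
    daily_batch.serve(name="daily-batch", cron="0 6 * * *")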


Let's recap

Before building pipelines, it's important to lay a foundation. Consider your compute engine, how you want to store your data, and what level of freshness you need. Remember that it's okay for your platform to be scrappy in the beginning; you can scale and tweak it over time.

It's better to have a functional, adaptable platform than to chase perfection right from the start.