Big Data Explained: Everything You Need To Know

What is Big Data and why does it matter? In this video, we’ll break down the concept of Big Data in plain English so you can finally understand what the buzzword actually means. From how mankind processes massive amounts of information, to how Big Data pipelines work and the different types of infrastructure that make it possible, this explainer will give you a clear picture of one of the most important topics in tech today. Whether you’re a beginner curious about data, a student learning the basics, or just someone who wants to know what “Big Data” really is — this video is for you.

Thanks to OpenMetal for sponsoring today’s video! Check them out to see how they can assist you with your Big Data goals!


What is Big Data?

First, let’s spend a few moments talking about what “Big Data” is, and why it’s so important.

As I mentioned earlier in the video, mankind generates a ginormous amount of data regularly. Every second of every day, countless people are uploading photos, sharing their status, buying products, ordering meals, and getting directions to their destination. As I’m sure you can imagine, if we employed people to individually handle every request that’s made online, we’d never be able to hire enough workers to manually process it all. (Keep in mind, here on Earth we generate hundreds of exabytes of data every day!)

As the data we generate grows exponentially, technology needs to scale along with it. And as we scale our infrastructure, we’ll naturally run into a series of challenges along the way. And it’s how we overcome those hurdles that determines whether our company succeeds or fails. For example, let’s say you’ve developed a brand-new social media app and you have just a small number of users. Then, one day, perhaps your app starts to become popular and then turns into the next big thing.

One of the first challenges you’ll run into is that a single server just won’t cut it anymore. Eventually, you’ll have to create a cluster of servers in order to keep up with the traffic.

Another situation you’ll run into is storage constraints. Maybe you’ve created enough servers to keep up with the demand, but as people upload more and more photos to your service, you’ll have to build a separate storage cluster to hold everything.

At that point, you now have a large number of servers – and also some additional challenges. How do you set up your servers to sync with one another? How do you set up and queue messages to be delivered between them? What’s the most efficient way to configure a transactional database server to minimize response times? And once you have a large database, how do you make logical connections in order to understand your customers better?

Basically, “Big Data” translates to a large quantity of information. But the concept is more than just a term to describe a large flood of ones and zeroes; it also involves building a solution that’s able to process everything. This involves a specific set of instructions and a combination of services (called a “pipeline”) that keeps everything running smoothly. The pipelines that we create within Big Data consist of best practices and associated tools that give us the ability to handle an immense amount of data – automatically.

And like I mentioned during the intro, Linux is a major player in this space. As I’m sure you know, Linux powers just about everything these days – from servers to smartphones. Even when we focus on just servers, there’s a wide spectrum of use-cases, from a personal blog running on a single VM, to large services that run tens of thousands of servers. Since Linux scales very well, it’s a natural choice for Big Data – since our infrastructure should grow along with demand. Another benefit of choosing Linux for Big Data is its reliability, and the fact that it can be tuned in all kinds of ways to ensure peak performance. In fact, the vast majority of Big Data operations utilize Linux.

But Linux isn’t the only major player in this field; open-source software is also a natural fit for Big Data. When we create pipelines, it’s important that we configure each and every service to ensure our needs are met. And nothing is more customizable than open-source technologies. While proprietary technologies do exist within Big Data, Linux and open-source give us the most control and flexibility.

As an aside, when I first started with Linux, it had a very small footprint. Even the occasional server running it was a rare sight. During my entire career, I watched Linux start off as the underdog in virtually every category, and then grow to have the largest market share in both traditional data centers and Big Data. The growth has been incredible.

Anyway, like I mentioned, Big Data involves a pipeline, which refers to an “assembly line” of sorts geared toward processing huge workloads. But what exactly does a Big Data pipeline consist of? Well, let’s discuss that.

What is a Big Data pipeline?

A Big Data pipeline is a structured, automated process that takes massive amounts of raw data, often coming from multiple different sources, and turns it into something useful.

You can think of it as a supply chain for data:

• Raw data is collected from various sources, such as apps, logs, databases, and so on.
• The data is then organized and converted into a specific format.
• Finally, the data is stored and analyzed in order to extract insights or make decisions.

Each stage of the pipeline is able to handle an immense volume of data reliably and efficiently, and it’s able to get the job done in real time (or close to it). This allows organizations to scale their business and always keep up with their customers’ demand.
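
To make those stages concrete, here’s a toy sketch in Python. It’s purely illustrative: a real pipeline would swap each function for a distributed service (something like Kafka for collection, Spark for transformation, a data lake for storage), and the events.jsonl file name is just a placeholder.

```python
import json

def collect():
    # Stage 1: gather raw events from various sources.
    # (Hardcoded here; real pipelines ingest from apps, logs, databases.)
    return [
        '{"user": "alice", "action": "upload_photo"}',
        '{"user": "bob", "action": "purchase"}',
    ]

def transform(raw_events):
    # Stage 2: parse and convert each event into a consistent format.
    return [json.loads(event) for event in raw_events]

def store(events):
    # Stage 3: persist the cleaned events so they can be analyzed later.
    with open("events.jsonl", "w") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")

store(transform(collect()))
```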

A Big Data pipeline consists of any number of servers, applications, and services that are designed to communicate with one another and spread the workload. I’ve already mentioned Linux, which is the platform of choice for Big Data to run on. After that, it’s a question of which services in particular you should configure to handle the workload. Among these solutions, here are some of the highlights.

Messaging & Data Streaming

Apache Kafka → Kafka is a distributed platform for handling high-volume event streaming and messaging between systems. It allows different services in a Big Data pipeline to publish and subscribe to data streams in real time. This makes it a backbone for moving large amounts of data quickly and reliably.
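
To give you an idea of what publish/subscribe looks like in practice, here’s a minimal sketch using the kafka-python client. It assumes a broker running at localhost:9092, and the “user-events” topic name is a made-up example:

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish an event to the (hypothetical) "user-events" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-events", b'{"user": "alice", "action": "upload_photo"}')
producer.flush()  # block until the message is actually delivered

# Elsewhere in the pipeline, another service subscribes to the same topic.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the stream
)
for message in consumer:
    print(message.value)  # react to each event as it arrives
```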

Data Storage & Lakehouse

Delta Lake → Delta Lake is an open-source storage layer that adds reliability and structure to traditional data lakes. It supports ACID transactions, schema enforcement, and time-travel queries, making large datasets easier to manage. With Delta Lake, teams can trust their data while still benefiting from the flexibility of a lake.
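
As a rough sketch of what that looks like in code, here’s a PySpark session writing a Delta table and then reading it back at an earlier version (time travel). This assumes the delta-spark package is installed, and the /tmp/uploads_delta path is just an example:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a Spark session with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small DataFrame as a Delta table (an ACID-governed directory).
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["user", "uploads"])
df.write.format("delta").mode("overwrite").save("/tmp/uploads_delta")

# Time travel: read the table as it existed at an earlier version.
spark.read.format("delta").option("versionAsOf", 0) \
    .load("/tmp/uploads_delta").show()
```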

Ceph → Ceph is a distributed storage platform that provides object, block, and file storage in a single system. It’s designed to scale horizontally, allowing organizations to store petabytes of data reliably across clusters of commodity hardware. In Big Data pipelines, Ceph often serves as the foundation for cost-effective and resilient data storage.
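
Since Ceph’s RADOS Gateway speaks the S3 protocol, any standard S3 client can store objects in a Ceph cluster. Here’s a small sketch using boto3; the endpoint, bucket name, and credentials are all hypothetical placeholders:

```python
import boto3

# Point a standard S3 client at the Ceph RADOS Gateway endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-gateway.example.com:7480",  # hypothetical RGW address
    aws_access_key_id="ACCESS_KEY",          # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

# Buckets and objects behave just as they would on any S3-compatible store.
s3.create_bucket(Bucket="photo-uploads")
s3.put_object(Bucket="photo-uploads", Key="alice/cat.jpg",
              Body=b"...image bytes...")
```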

Data Processing & Machine Learning

Apache Spark → Spark is a powerful data processing engine known for speed and scalability. It allows developers to run analytics, ETL (Extract, Transform, Load) tasks, and machine learning workloads across massive datasets in memory. This makes it a key tool for deriving insights quickly from Big Data.
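
Here’s a minimal sketch of an ETL job in PySpark, reading the raw events file from the earlier pipeline example, aggregating it, and writing the results out. The file and output names are just placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

events = spark.read.json("events.jsonl")                  # Extract
counts = (
    events.groupBy("action")                              # Transform:
          .agg(F.count("*").alias("total"))               # count each action type
)
counts.write.mode("overwrite").parquet("action_counts")   # Load
```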

MLflow → MLflow is a platform that helps manage the entire machine learning lifecycle. It tracks experiments, organizes code and dependencies, and streamlines the deployment of models. By doing this, MLflow ensures reproducibility and simplifies collaboration in data science teams.
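
A basic tracking example looks like this; the parameter and metric values are made up for illustration, and the run data lands in a local mlruns directory by default:

```python
import mlflow

# Record one experiment run: its parameters and resulting metrics.
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)   # hypothetical hyperparameter
    mlflow.log_metric("accuracy", 0.87)       # hypothetical result
```
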
Database Analytics & Change Capture

ClickHouse → ClickHouse is a column-oriented database built for fast online analytical processing (OLAP). It excels at running real-time queries on large volumes of data, making it great for dashboards and analytics. Because of its speed and efficiency, ClickHouse has become a favorite for analyzing logs, metrics, and other high-volume datasets.
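
As a quick illustration, here’s a query against a hypothetical page_views table using the clickhouse-driver package, assuming a server running on localhost:

```python
from clickhouse_driver import Client

client = Client("localhost")

# Aggregate a high-volume table in real time: daily view counts.
rows = client.execute(
    "SELECT toDate(ts) AS day, count() AS views "
    "FROM page_views GROUP BY day ORDER BY day"
)
for day, views in rows:
    print(day, views)
```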

Debezium → Debezium is a tool for change data capture (CDC), which means it continuously streams changes from relational databases and other sources into systems like Kafka. This allows Big Data pipelines to react to updates in real time instead of relying on batch jobs. In practice, Debezium is key for keeping data synchronized across modern architectures.
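
Debezium connectors are registered by posting a JSON config to the Kafka Connect REST API. Here’s a sketch for a PostgreSQL source; every hostname, name, and credential below is a hypothetical placeholder:

```python
import json
import requests

connector = {
    "name": "shop-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.com",   # placeholder database details
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "shop",
        "topic.prefix": "shop",  # row changes stream to Kafka topics under this prefix
    },
}

# Register the connector; Debezium then streams every row change into Kafka.
resp = requests.post(
    "http://connect.example.com:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```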

Infrastructure Types

Now that we have a general idea of some of the solutions that are available, where do we run them? Well, there are three primary types of architecture that are used.

Bare metal deployments refer to software that’s installed on actual physical servers. Basically, the old-fashioned way. As you can imagine, running dedicated servers isn’t all that efficient nowadays, since anytime you want to add capacity, you’d need to place an order with your server provider and wait for the hardware to arrive at your door. For that reason, bare metal is the least common type of infrastructure for Big Data – but it does exist.

Second, public cloud deployments are those that are built within a cloud provider, with common providers being AWS, Google Cloud, Azure, DigitalOcean, and others. In this scenario, you’re using someone else’s server infrastructure to run the services your business needs. The downside, though, is that you don’t have full control or visibility into how your data is being handled. You’re basically trusting someone else to do a good job, and hoping that they do.

Third, private cloud deployments are similar to public cloud, with the difference being that the company runs everything themselves, even the underlying virtualization engine. While private clouds are more difficult when it comes to the initial setup, you gain full control and have full visibility into everything – since you’re running everything. Another benefit is that you can customize your environment from the ground up and tune it for the best possible performance, which directly benefits Big Data.

As you can see, Big Data is a very important topic, one that doesn’t get discussed all that often. In this article, I wanted to shed some light on the subject and go over key terms and concepts, so that any future videos I produce on this topic will have a “starting point” for new viewers.
