  • All You Need to Know about Isolation Levels and Read Inconsistencies

    As in real life, computer science is replete with trade-offs, and relational databases are no exception. When working with them, we face a tension between data consistency and transaction concurrency. The former guarantees that the data is trustworthy, while the latter lets the database process many transactions quickly. Both are desirable qualities, but we cannot achieve both to the fullest extent at the same time. Today, I will discuss how isolation levels can help us structure our decisions about this trade-off.

    ...
  • CAP Theorem But Better? Introduce the PACELC Theorem

    In the previous post, I introduced the famous CAP theorem (please give it a read before you start this one if you haven’t already). It describes a trilemma in which we must give up one of three qualities: consistency, availability, and partition tolerance. Since all three are desirable features of modern distributed systems, deciding which one to relinquish has become one of the most important and delicate decisions for designers of complex distributed systems. While the CAP theorem is widely known in computer science, its extension, the PACELC theorem, has received less attention. Today, I will shine a long-overdue spotlight on the PACELC theorem.

    ...
  • What You Need to Know about the CAP Theorem

    The world we live in is far from perfect. We constantly find ourselves in dilemmas, sometimes even trilemmas, that require us to make trade-offs. When shopping, we can only choose two out of “cheap,” “fast,” and “good.” In economics, a government cannot enjoy a “sovereign monetary policy,” a “fixed exchange rate,” and “free capital flow” at the same time; it can achieve only two of them by giving up the third. Similarly, the CAP (consistency, availability, and partition tolerance) theorem describes an equally head-scratching trilemma that has troubled computer scientists and software engineers ever since distributed computing became a popular solution to large-scale computation. Today, we will dive deep into the CAP theorem and learn how to make wise trade-offs based on our needs.

    ...
  • A Brief Introduction to Kafka

    There’s no denying that we have entered the era of big data. An enormous amount of information is generated every second. While decision-makers can gain invaluable insights from this ever-growing data, its sheer volume also poses considerable challenges to data engineers: greater demand for storage space, the need to handle increasingly complex data formats, and highly unpredictable network traffic. Luckily, recent years have seen the creation of various technologies devoted to efficiently digesting big data, and Kafka is one of them. Today, I will demonstrate how Kafka works to help kick-start your Kafka journey.

    ...
  • A Brief Introduction to the Inner Working of MapReduce

    As a data engineer, you have probably heard about Hadoop. It is one of the most popular frameworks for distributed processing of large data sets, and it is often regarded as cost-effective because it runs on clusters of commodity hardware. At its center is a programming model called MapReduce. Today, we will take a closer look at MapReduce to understand the inner workings of Hadoop.

    ...
  • ETL vs. ELT: Pick the Most Suitable Data Integration Method for Your Project

    As a data engineer, you have probably heard of the data integration methodology called ETL (Extract-Transform-Load). It has been around for a while, and many data engineers have used it to build data pipelines. However, ETL is not the only option available to us. Recently, ELT (Extract-Load-Transform) has also been gaining popularity. In this article, I will compare ETL and ELT to help you understand their respective advantages and drawbacks so that you can choose the methodology better suited to your project.

    ...