
Introduction to Database Replication

Jacqueline Cheong
January 15, 2024

What is it?

Database replication is a cornerstone of a robust, scalable, and fault-tolerant data management infrastructure. It involves creating and maintaining multiple copies of a database across different servers or locations, providing data redundancy, reliability, and performance. Data accuracy and latency are two important factors in database replication, as both determine whether downstream consumers can rely on the data for decision-making and operational continuity.

At a high level, there are two broad categories of database replication: internal and external.

Internal replication

Definition: Data is synced from a primary (or master) database to secondary (replica) databases.

Goals:
  • High availability: keep standby replicas available in case the primary database is unavailable.
  • Load balancing: spread workloads across replicas to achieve higher performance.

External replication

Definition: Data is synced to destinations outside of the database cluster.

Goals:
  • Vary by destination, but include supporting different access patterns and types of workloads.

Diagrams comparing internal database replication and external database replication

In this blog, we will focus on external database replication in the context of data warehousing.

Database replication to data warehouses

The importance of database replication extends to the realm of data warehousing, where it plays a vital role in maintaining up-to-date and synchronized data across various systems. Data in the data warehouse is then used by downstream workloads to perform machine learning, experimentation, business intelligence, and reverse ETL, among other use cases.

Replicating data from operational databases to data warehouses is a critical process for several reasons:

Separation of operational and analytical workloads. Operational (OLTP) databases are optimized for transactions, such as CRUD operations: creating, reading, updating, and deleting records. OLTP databases are characterized by high throughput and low latency, and are meant for applications with well-known access patterns.

Data warehouses, on the other hand, are designed for analytical processing (OLAP) and are specifically optimized for fast query performance on large volumes of data, with the ability to run more freeform queries. Replicating data from databases to data warehouses separates these two distinct types of workloads, ensuring that the performance of operational systems is not impacted by analytical queries, which can be resource-intensive.
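To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for both kinds of systems; the orders table and its data are illustrative:

```python
import sqlite3

# Hypothetical orders table used to contrast the two query shapes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 101, 25.0), (2, 102, 40.0), (3, 101, 15.0)])

# OLTP-style: a point lookup on a known key, touching a single row.
row = conn.execute("SELECT amount FROM orders WHERE id = ?", (2,)).fetchone()

# OLAP-style: a freeform aggregate that scans the whole table.
total_per_customer = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id ORDER BY customer_id"
).fetchall()

print(row[0])              # 40.0
print(total_per_customer)  # [(101, 40.0), (102, 40.0)]
```

The point lookup touches one row via the primary key; the aggregate must scan every row, which is why running many such queries against a production OLTP database degrades its performance.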

Read more on the differences between OLTP and OLAP databases.

Consolidating data in a centralized location. Data replication allows companies to consolidate data from multiple sources, including various databases and other data systems (e.g. SaaS sources such as Salesforce and Zendesk), into one centralized location (e.g. a data warehouse). This consolidated view in a data warehouse is essential for comprehensive analytics, business intelligence, customer support, financial reporting, and providing a more complete picture of the organization's operations.

Scalability of data systems. Data warehouses are designed to handle very large volumes of data and are more scalable for storing historical data than transactional databases. This helps organizations efficiently manage their data storage.

Backup and disaster recovery. Having data replicated in a data warehouse can serve as a form of backup. In case of a failure in the operational database, the data in the data warehouse can provide a recent copy that can be used for recovery purposes.

Role of CDC in database replication

Change Data Capture (CDC) is a technique used to identify and capture changes made to the data in a database, so these changes can be managed or replicated to other systems. CDC is a vital tool in database replication, offering an efficient way to sync data with minimal impact on source systems. It supports a wide range of applications, from data warehousing and analytics to real-time monitoring and event-driven architectures. CDC can be leveraged in various types of replication methods, which we will outline in the next section.
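As a rough illustration of the idea, the sketch below applies a stream of change events to an in-memory replica; the event format is hypothetical and not tied to any particular CDC tool:

```python
# A minimal sketch of applying CDC-style change events to a replica.
# The event format here is hypothetical, not tied to any specific CDC tool.
replica = {}  # primary key -> row dict

events = [
    {"op": "insert", "pk": 1, "row": {"id": 1, "email": "a@example.com"}},
    {"op": "update", "pk": 1, "row": {"id": 1, "email": "b@example.com"}},
    {"op": "delete", "pk": 1, "row": None},
    {"op": "insert", "pk": 2, "row": {"id": 2, "email": "c@example.com"}},
]

def apply_event(state, event):
    """Apply one change event to the in-memory replica state."""
    if event["op"] in ("insert", "update"):
        state[event["pk"]] = event["row"]
    elif event["op"] == "delete":
        state.pop(event["pk"], None)

for e in events:
    apply_event(replica, e)

print(replica)  # {2: {'id': 2, 'email': 'c@example.com'}}
```

Because only the changes flow through, the source system does not need to be re-read in full, which is what makes CDC efficient relative to bulk copies.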

Types of replication

Full table replication

The entire table is copied into the data warehouse at regular intervals.

Trade-offs:
  • Simple to set up but resource-intensive.
  • Processes every record regardless of whether it has changed.
  • Can be slow and does not scale well for large tables.
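A minimal sketch of full table replication, using SQLite in-memory databases as stand-ins for the source database and the warehouse (table and data are illustrative):

```python
import sqlite3

# Source (operational) and destination (warehouse) stand-ins.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for db in (src, dst):
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ana"), (2, "Bo")])

def full_table_replicate(source, dest, table):
    """Truncate the destination table and copy every source row over,
    whether or not it changed since the last run."""
    rows = source.execute(f"SELECT * FROM {table}").fetchall()
    dest.execute(f"DELETE FROM {table}")
    dest.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    dest.commit()
    return len(rows)

copied = full_table_replicate(src, dst, "users")
print(copied)  # 2
```

Every run re-reads and re-writes the whole table, which is exactly the resource cost the trade-offs above describe.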

Snapshot replication

Snapshots of the database are taken at specific points in time and replicated to the data warehouse.

Trade-offs:
  • Similar trade-offs to full table replication.
  • Not suitable for scenarios requiring up-to-date information, as data grows stale between snapshots.

Incremental replication

Only the changes made since the last replication (new, updated, or soft-deleted rows) are transferred to the data warehouse.

Trade-offs:
  • More efficient than full table replication.
  • More complicated to set up, and can lead to data inconsistency and complexity in handling schema changes.
  • Cannot replicate hard-deleted rows or tables that lack a last-updated column.

Key-based incremental replication

Similar to incremental replication, but uses a key such as a timestamp or an auto-incrementing ID to identify new and updated rows.

Trade-offs:
  • Similar trade-offs to incremental replication.
  • More suitable for append-only workloads.
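A rough sketch of key-based incremental replication using an updated_at timestamp as the replication key (SQLite stand-in; table and column names are illustrative). Note that a hard-deleted row would simply never appear in the result set:

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01T00:00:00"),
    (2, 20.0, "2024-01-02T00:00:00"),
    (3, 30.0, "2024-01-03T00:00:00"),
])

def incremental_pull(conn, watermark):
    """Fetch only rows changed since the last replicated watermark,
    and return the new watermark to persist for the next run."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# One run: only rows newer than the stored watermark are transferred.
batch, wm = incremental_pull(src, "2024-01-01T12:00:00")
print([r[0] for r in batch])  # [2, 3]
```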

Trigger-based replication

Triggers are set up in the operational database to capture changes to data. Whenever a specified event occurs (e.g. an insert, update, or delete), the trigger fires and the change is replicated to the data warehouse.

Trade-offs:
  • Can add significant overhead to database operations, since triggers execute with each relevant operation.
  • Can impact databases with high data volumes and adds latency to each operation.

Log-based replication

This method reads the database's transaction logs, which record every change to the data. Transaction logs are a fundamental part of a database's architecture and are how databases replicate internally, so log-based replication does not rely on triggers and places no additional queries or load on the source database. When set up correctly, it is a non-intrusive, highly efficient way to replicate data in real time.

Trade-offs:
  • Non-intrusive to the database and its performance.
  • More complicated to set up.
  • Efficient method for real-time data requirements.
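A simplified sketch of consuming decoded transaction-log output; the JSON payload is hypothetical, loosely modeled on what PostgreSQL logical-decoding plugins emit, and real log formats vary by database:

```python
import json

# A hypothetical transaction-log payload, loosely modeled on the JSON
# output of PostgreSQL logical-decoding plugins; real formats vary.
log_payload = json.dumps({
    "change": [
        {"kind": "insert", "table": "users",
         "columnnames": ["id", "name"], "columnvalues": [1, "Ana"]},
        {"kind": "update", "table": "users",
         "columnnames": ["id", "name"], "columnvalues": [1, "Anna"]},
    ]
})

def decode_changes(payload):
    """Turn one log payload into a list of (kind, table, row) tuples."""
    out = []
    for change in json.loads(payload)["change"]:
        row = dict(zip(change["columnnames"], change["columnvalues"]))
        out.append((change["kind"], change["table"], row))
    return out

changes = decode_changes(log_payload)
print(changes[1])  # ('update', 'users', {'id': 1, 'name': 'Anna'})
```

The decoded tuples can then be applied to the warehouse; because they come from logs the database already writes, no extra queries touch the source tables.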

Challenges with database replication

Database replication is a critical part of data management that offers significant benefits. However, it also presents several key challenges that organizations need to manage effectively. Here are four common challenges with database replication.

  1. Data consistency. Ensuring data consistency is a major challenge that arises for various reasons: transactions may be only partially replicated, schema changes may not be properly managed, and failures and recovery may be mishandled. Inconsistent data impacts all downstream workloads and can lead to erroneous results in analytics, reporting, and decision-making processes. As such, regular testing, proper configuration, and choosing robust software that can handle complex scenarios are essential to any data replication process.
  2. Performance overhead. Replication can introduce performance overhead on the source database system, which can affect the performance of the database and even crash the system. This is particularly true for synchronous replication or systems that use triggers. Properly setting up a replication solution to minimize database load and balancing the replication frequency and method with performance needs are some ways to minimize performance overhead.
  3. Latency. Maintaining low latency in database replication is a challenge due to several inherent and external factors, including data volume, data complexity, database resource limitations (CPU, memory, disk capacity), and security and compliance requirements (encryption, audit trails). Leveraging efficient data compression and optimizing query processing are some strategies that can be employed to lower latency. 
  4. Network reliability. Replicating data over a network can consume significant bandwidth. Network reliability issues can lead to delays in replication, resulting in stale data. In a worst case scenario, network outages can disrupt the replication process completely. Efficient network utilization strategies, such as compressing data before transmission and utilizing incremental or logical replication, can help optimize bandwidth usage. Implementing robust failover mechanisms is also crucial for maintaining data integrity.

Real-world applications of database replication

Ecommerce: customer personalization and inventory management

Replication allows real-time data from transactional databases (such as purchases and user interactions) to be quickly copied to analytical databases. This enables the generation of immediate insights into customer behavior, sales trends, and inventory levels. E-commerce businesses can use these insights to make informed decisions, such as personalizing the user experience (e.g. providing personalized offers based on customer browsing and purchase history), managing inventory, or tailoring marketing campaigns.

Fintech: scalable architecture and real-time analytics

Replication ensures that once a financial transaction occurs, it is immediately reflected across all replicated databases. As fintech companies grow, replicating to a separate analytical database/data warehouse allows them to scale their database infrastructure to handle increased load, ensuring that transaction processing remains fast and efficient. By replicating transactional data to a data warehouse, fintech companies can perform real-time analytics for risk assessment, fraud detection, and customer behavior analysis without impacting the performance of the main transactional database.

Artie: log-based database replication solution

Artie leverages change data capture (CDC) and stream processing to sync databases and data warehouses in real-time, enabling sub-minute latency and reducing compute costs. Artie also handles stateful data and schema evolution (DML and DDL) automatically in-flight, and is a fully managed SaaS solution that just works out of the box. Contact us to discuss your use case and start a 14-day free trial.