Blogs

Introducing Artie Transfer

Robin Tang
January 17, 2023

As an avid data warehouse (DWH) user in various roles, I have always run into limitations (and frustrations!) around the data latency between the DWH and the production data that sits in our transactional databases. Depending on where I worked, the lag ranged from multiple hours to days, and larger companies tended to be on the slower end.

What are some use cases that are hindered by this replication lag?

Given that the DWH is a platform, there are a multitude of use cases that can be enabled or strengthened by having shorter data lag. I’ll go over some examples below.

Operations heavy companies

Ops-heavy companies typically have requirements and business processes that are constantly changing. Engineering teams struggle to keep up, so no-code and low-code tools like Zapier, Typeform, Retool, and Tinybird have become part of the standard toolkit for these companies.

These tools can stack on top of each other and reference DWH data, and their efficacy is determined by the data replication lag. That said, not every table needs to be blazing fast (that would be nice to have), but most companies do have a handful of critical tables where low lag is necessary.

As an example, a food delivery company may set up a Retool application as a custom Zendesk app so that support agents can pull up a customer's most recent orders and interactions.

Beyond the examples above, companies in these industries care about reducing data lag for the following tables (what we call “critical tables”):

Critical production tables based on industry vertical

Lifecycle and Paid Marketing Use Cases

Within Lifecycle Marketing, it is common to purchase marketing automation tools like Iterable, Braze and Intercom. Each of these tools has its own version of what the user model and events should look like, so that marketers can create email templates like: Hi {{first_name}}!

Examples of use cases that rely on sending additional user attributes:

  1. Paid marketing: Once the customer requests a ride, we'd like to send as many customer traits as possible to destinations like Google and Facebook so they can optimize their algorithm and find more look-alikes. 
  2. Nurture campaigns: When the customer signs up on our website, we'd like to drop them in a drip campaign that effectively welcomes and onboards the customer. We'd like to reference dynamic fields like product interactions (Did they do more than just sign up? Did they view an article already? Have they requested a ride yet?) and other customer attributes.

> So how do we send this data today?

Typically, teams build separate pipelines with custom schemas for each destination. This type of setup has several drawbacks:

  1. Changes are blocked by engineering.
  2. User fields are almost never in one service. As a result, we’ll need to call other services (which means error handling and retries). Backfills may also generate too much traffic and DDoS internal services that are not equipped to handle the load.
  3. The pipelines require maintenance and ongoing support. If a field is added incorrectly and does not reference an indexed column, it can slow down the whole pipeline.
  4. Data is usually not in the DWH. As a result, it’s hard to do segmentation and reporting.

This could be solved by building materialized views, especially if we could use dbt (a more expressive framework) to create these views. However, these solutions are not mature enough to handle most marketing use cases.

> Why are dbt materialized views not mature enough?

When creating a materialized view, we need to specify how often the view is regenerated by setting a schedule (for example, with Snowflake tasks). The resulting latency can be written as a simple equation:

View staleness = MAX(Referenced Table Staleness) + View Regeneration Interval

For example, we can have the materialized view regenerate every 5 minutes, but it doesn't really matter if the referenced table is only updated every 6 hours: the resulting view is still roughly 6 hours behind.
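To make the arithmetic concrete, here is a small illustrative snippet; the numbers mirror the example above and are not tied to any particular warehouse:

```python
def view_staleness_minutes(referenced_table_staleness_minutes, view_refresh_interval_minutes):
    # Worst-case staleness of a materialized view is bounded by its stalest input,
    # plus however long we wait between view regenerations.
    return max(referenced_table_staleness_minutes) + view_refresh_interval_minutes

# Tables replicated every 6 hours, view rebuilt every 5 minutes:
print(view_staleness_minutes([360], 5))   # 365 minutes -- still roughly 6 hours behind

# Tables replicated in near real time (~1 minute), same 5 minute view schedule:
print(view_staleness_minutes([1], 5))     # 6 minutes
```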

This is why we built Artie. Artie Transfer addresses the first term of the equation, MAX(Referenced Table Staleness), by removing the raw OLTP (online transaction processing) table latency. Once that term is near zero, we can get fresher views simply by increasing the frequency of view regeneration.

How are companies solving the data lag problem today and why isn’t it a great fix?

The traditional process for updating DWH data looks something like this:

  • Take a snapshot of the Postgres table (as a CSV) via pg_dump
  • Parse the output and format the data for DWH consumption
  • Upload this data to the DWH

And repeat. How often can you realistically run this each day, though?
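To make that loop concrete, here is a minimal sketch of one iteration, assuming a Postgres source and a Snowflake destination. The credentials, table names, and the use of psql's \copy (to get CSV directly) are illustrative placeholders, not a recommended implementation:

```python
# A minimal sketch of one iteration of the snapshot-and-load loop described above.
import subprocess
import snowflake.connector

# 1. Snapshot the table to CSV.
subprocess.run(
    [
        "psql",
        "postgresql://app_user:app_password@localhost:5432/app_db",
        "-c",
        r"\copy orders TO 'orders.csv' WITH (FORMAT csv, HEADER true)",
    ],
    check=True,
)

# 2. Upload the CSV to the table's Snowflake stage, then 3. reload the table.
conn = snowflake.connector.connect(
    account="my_account",
    user="loader",
    password="loader_password",
    warehouse="load_wh",
    database="analytics",
    schema="public",
)
cur = conn.cursor()
cur.execute("PUT file://orders.csv @%orders OVERWRITE = TRUE")
cur.execute("TRUNCATE TABLE orders")  # full refresh: the entire table is reloaded each run
cur.execute("COPY INTO orders FROM @%orders FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")
conn.close()
```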

Savvy companies experiencing this pain typically purchase Fivetran for Databases or Fivetran Teleport. However, this is not always a viable option.

Engineering teams can and have tried to build this out themselves, but the solution is tricky and hard to scale. Why?

  • Change data capture (CDC) streams do not automatically handle Data Definition Language (DDL) changes such as adding or dropping columns
  • Companies typically use more than one database type and each type requires a different parser and potentially a new pipeline
  • Reliability is feature zero. Faster data replication is great, but only if the results are reliable. If an in-house process misses a CDC event or processes rows out of order, the resulting data is no longer eventually consistent and we end up with the wrong view
  • DWHs are not designed to handle high QPS (queries per second) of row-level writes; data is meant to arrive in batches via COPY commands, so handling data mutations (updates and deletes) requires a workaround (sketched after this list)
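For that last point, the typical workaround is to micro-batch change rows into a staging table and periodically merge them into the destination instead of issuing row-level updates. A hedged sketch using Snowflake syntax, with illustrative table and column names (including the __deleted flag):

```python
# Sketch of the usual workaround for applying mutations to a warehouse:
# change rows are micro-batched into a staging table, then merged into the destination.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="loader", password="loader_password",
    warehouse="load_wh", database="analytics", schema="public",
)
conn.cursor().execute("""
    MERGE INTO orders AS target
    USING orders_staging AS updates
        ON target.id = updates.id
    WHEN MATCHED AND updates.__deleted THEN DELETE
    WHEN MATCHED THEN UPDATE SET
        amount = updates.amount,
        status = updates.status,
        updated_at = updates.updated_at
    WHEN NOT MATCHED THEN INSERT (id, amount, status, updated_at)
        VALUES (updates.id, updates.amount, updates.status, updates.updated_at)
""")
conn.close()
```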

Enter Artie Transfer

As technologists, we believe that approaching zero replication lag between OLTP and OLAP databases should be the norm and that this should be widely accessible. As such, we have open-sourced Artie Transfer.

Artie Transfer enables OLTP data to be streamed and replicated continuously to the DWH and reduces the replication lag from hours/days down to mere seconds. We believe this will allow the whole ecosystem to generate better insights and unlock additional use cases that were previously unattainable due to this constraint.

Here is how Artie Transfer works under the hood to provide this capability:

Artie simple architecture diagram
  • OLTP change logs are read by Kafka connectors (such as Debezium), which publish change events to Kafka
  • Each table has its own Kafka topic, and we set the number of Kafka partitions to achieve the right throughput
  • Artie Transfer consumes from the Kafka topics and interfaces with the DWH
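For illustration only, here is a rough Python sketch of that consume-and-apply loop. This is not Artie Transfer's actual code; the topic name, the Debezium-style event envelope handling, and the helper functions are assumptions:

```python
# Illustrative only: consume Debezium-style change events from a per-table topic
# and hand them to the warehouse layer. Not Artie Transfer's actual implementation.
import json
from kafka import KafkaConsumer  # kafka-python client

def upsert_into_warehouse(row):
    print("upsert", row)   # placeholder: a real pipeline would batch rows and merge them

def delete_from_warehouse(row):
    print("delete", row)   # placeholder

consumer = KafkaConsumer(
    "dbserver1.public.orders",               # one topic per table, as described above
    bootstrap_servers=["localhost:9092"],
    group_id="transfer-demo",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    if message.value is None:
        continue                              # tombstone record
    payload = message.value.get("payload", {})
    op = payload.get("op")
    if op in ("c", "r", "u"):                 # create / snapshot read / update
        upsert_into_warehouse(payload.get("after"))
    elif op == "d":                           # delete
        delete_from_warehouse(payload.get("before"))
```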

To support this workload, Artie Transfer has the following features:

  1. Automatic retries & idempotency. We take reliability seriously and it's feature 0. Latency reduction is nice, but doesn't matter if the data is wrong. We provide automatic retries and idempotency such that we will always achieve eventual consistency.
  2. Automatic table creation. Transfer will create the table in the designated database if the table doesn't exist.
  3. Error reporting. Provide your Sentry API key and errors from data processing will appear in your Sentry project.
  4. Schema detection. Transfer will automatically detect column changes and apply them to the destination.
  5. Scalable architecture. Transfer's architecture stays the same whether we’re dealing with 1GB or 100+ TB of data.
  6. Sub-minute latency. Transfer is built with a consumer framework and is constantly streaming messages in the background. Say goodbye to schedulers!

In addition to this offering, we will also be providing a paid managed version that includes:

  1. Setting up your database to enable CDC
  2. Setting up connectors to read from your database logs and publish the change events to Kafka
  3. Providing a managed version of Artie Transfer

If you are interested in learning more: check out Artie Transfer’s GitHub repo, schedule a demo here, or drop us a note at hi@artie.so. To stay updated on our progress, check out our LinkedIn and Twitter pages!