DEV Community

Cover image for Reverse ETL with Hightouch, Census & RudderStack: Operational Analytics in Practice
Gowtham Potureddi
Gowtham Potureddi

Posted on

Reverse ETL with Hightouch, Census & RudderStack: Operational Analytics in Practice

reverse etl is the discipline that closes the loop a data team starts the first time it lands raw events in a warehouse and then realises the warehouse, however beautiful, is invisible to the GTM team. Forward ETL moved source data into the warehouse so analysts could ask questions; reverse ETL ships the answers back out into the operational tools — Salesforce, HubSpot, Marketo, Intercom, Slack, Facebook Ads, Iterable — where the people and systems that act on customers actually live. It is the bridge between analytical truth and operational action, and in 2026 it is the single fastest-growing surface in the modern data stack.

This guide walks the practitioner's view of operational analytics end to end. It defines the data activation pattern (model → audience → sync → destination), compares the three production-grade reverse etl tools — Hightouch, Census, and RudderStack — across destinations, dbt integration, hosting, and pricing, deconstructs the sync architecture that turns a warehouse query into a queue of API calls absorbing 429s and dead letters, and lays out the governance and observability layer that distinguishes a real data product from a fragile pipeline. Each section pairs a teaching block with a Solution-Tail worked answer — code, a step-by-step trace, an output table, and a concept-by-concept breakdown of why it works.

PipeCode blog header for a reverse ETL tutorial — bold white headline 'Reverse ETL · Operational Analytics' with subtitle 'Hightouch · Census · RudderStack · data activation' and a stylised flow showing a central warehouse cylinder sending glowing branches outward to SaaS-tool hexagons on a dark gradient with purple, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps while reading, drill the ETL practice library →, layer in API integration drills →, and stack the warehouse muscles with dimensional modelling problems →.


On this page


1. Why reverse ETL exists — operational analytics as a discipline

Forward ETL moves data INTO the warehouse so analysts can ask questions; reverse ETL moves data OUT so operational systems can act

The one-sentence invariant: forward ETL turns raw source data into warehouse rows that humans read on dashboards; reverse ETL turns those warehouse rows back into API calls that machines and SaaS tools execute against customers. Once you internalise that the warehouse is now the source of truth for every customer attribute, the question stops being "should we sync this?" and becomes "which destinations, which fields, how often, and with what governance?"

The data activation gap in three bullets.

  • Dashboards inform people; syncs inform systems. A lead score in Looker is a number a manager looks at on Monday. A lead score in Salesforce is a field a routing rule reads at midnight to assign the lead to the right rep. The two consumers want the same number but through different surfaces.
  • The warehouse aggregates across silos; SaaS tools cannot. Stripe knows about payments. HubSpot knows about emails. The product database knows about feature usage. Only the warehouse joins them. Reverse ETL ships that join back into every silo.
  • Manual CSV exports do not scale. A "send a CSV to ops once a week" workflow has zero observability, no schema contract, and breaks the first time a column is renamed. Reverse ETL turns the export into a versioned, scheduled, monitored data product.

Common destinations in 2026.

  • CRMs. Salesforce, HubSpot, Microsoft Dynamics, Pipedrive.
  • Marketing automation. Marketo, Iterable, Customer.io, Braze, Klaviyo, Mailchimp.
  • Support + success. Intercom, Zendesk, Gainsight, Vitally, ChurnZero.
  • Ad platforms. Facebook / Meta custom audiences, Google Ads customer match, TikTok audiences, LinkedIn matched audiences.
  • Collaboration + ops. Slack channels, Microsoft Teams webhooks, Notion databases, Asana tasks.
  • Product analytics. Amplitude cohorts, Mixpanel cohorts, Heap audiences.

Why the warehouse won as source of truth.

  • Compute and storage are now cheap. Snowflake, BigQuery, Databricks, Redshift — every cloud warehouse runs the joins at a price that makes "send the join result downstream" feasible at scale.
  • dbt made transformation governable. Once models/marts/customers.sql is the single SQL definition of "a customer," every downstream system can subscribe to its rows instead of recomputing them.
  • Data teams finally have leverage on the operational stack. Reverse ETL gives the data team a contract with marketing, sales, and CS without writing custom Python in five different SaaS APIs.

When NOT to use reverse ETL.

  • Sub-second latency requirements. Reverse ETL is a batch + micro-batch architecture. Hightouch ships syncs as fast as ~5 minutes; Census as fast as ~1 minute; RudderStack with streaming-event reverse ETL can hit seconds. Below that, you want event streaming (RudderStack event stream, Segment, Kafka → consumer) — not warehouse syncs.
  • True event streaming. "Page view fires → personalisation engine reacts in 200ms" is not a reverse ETL problem; it is a Kafka / Kinesis / event-bus problem.
  • One-off backfills. A 50k-row one-time list does not need a sync pipeline; a CSV import inside the destination is faster.

Worked example — the lead score sync that justifies reverse ETL

Detailed explanation. A B2B SaaS company computes a lead score in dbt by joining Salesforce contacts, product usage events, and marketing engagement. The score lives in marts.lead_scores. Sales wants the same score visible on the Salesforce Contact record so routing and prioritisation rules can act on it. Without reverse ETL the team writes a custom Python script, schedules it in Airflow, builds retries, builds dedupe, and rebuilds it every time the score model changes. With reverse ETL the team writes a one-page sync definition and inherits all of that infrastructure.

Question. Given the dbt model marts.lead_scores with columns (salesforce_contact_id, lead_score, last_engagement_at, churn_risk), how do you ship the row into Salesforce Contact.lead_score__c, Contact.last_engagement_at__c, and Contact.churn_risk__c so that routing rules can act on it?

Input — marts.lead_scores.

salesforce_contact_id lead_score last_engagement_at churn_risk
003A1 87 2026-06-12 0.12
003A2 41 2026-05-30 0.55
003A3 92 2026-06-14 0.08
003A4 NULL NULL NULL

Code.

-- The dbt model that becomes the sync source.
-- File: models/marts/lead_scores.sql
SELECT
    c.salesforce_contact_id,
    ROUND(s.raw_score, 0)                       AS lead_score,
    s.last_engagement_at,
    s.churn_risk
FROM {{ ref('dim_contacts') }}            c
LEFT JOIN {{ ref('int_lead_scoring') }}   s
       ON s.contact_id = c.contact_id
WHERE c.salesforce_contact_id IS NOT NULL
Enter fullscreen mode Exit fullscreen mode
# Hightouch sync definition (illustrative YAML).
# File: hightouch/syncs/salesforce_lead_score.yaml
model: marts.lead_scores
destination: salesforce_production
sync_mode: upsert
primary_key: salesforce_contact_id
schedule: "*/30 * * * *"   # every 30 minutes
mappings:
  - source: lead_score          -> Contact.lead_score__c
  - source: last_engagement_at  -> Contact.last_engagement_at__c
  - source: churn_risk          -> Contact.churn_risk__c
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The dbt model produces one row per Salesforce contact with a stable salesforce_contact_id primary key. The model is the contract — change the SQL, change every downstream consumer.
  2. Hightouch reads the model on the cron schedule. On the first run it stores a snapshot; on every later run it diffs the current rows against the previous snapshot to find changes.
  3. The sync_mode upsert tells the destination "insert if salesforce_contact_id does not exist, update otherwise." Salesforce External ID matching is configured in the Hightouch UI to map salesforce_contact_id to Salesforce's Id field.
  4. The three field mappings turn warehouse columns into Salesforce custom fields. NULL lead_score for 003A4 becomes a blank update on the Salesforce field; the destination keeps any previous value if the sync setting is "do not overwrite with NULL."
  5. The cron */30 runs every 30 minutes — far below Salesforce's daily API limit but fast enough for sales routing.

Output (Salesforce after sync).

Salesforce Contact Id lead_score__c last_engagement_at__c churn_risk__c
003A1 87 2026-06-12 0.12
003A2 41 2026-05-30 0.55
003A3 92 2026-06-14 0.08
003A4 (unchanged) (unchanged) (unchanged)

Rule of thumb. Every operational team that asks for "a number on the record so we can route on it" is asking for reverse ETL. Push back when they ask for "a CSV every Monday" — propose the sync instead, because it ships with observability, history, and a schema contract for free.

Worked example — the dashboards-vs-syncs contrast

Detailed explanation. A common mistake is treating a dashboard and a sync as the same artefact with a different surface. They are not. A dashboard runs on demand and serves humans; a sync runs on a schedule and serves machines. Different SLA, different failure mode, different governance, different consumer.

Question. Given a churn-risk metric, write the two access patterns side by side — Looker dashboard query vs reverse ETL sync — and explain why both exist.

Input.

Surface Cadence Consumer Failure mode
Looker dashboard on demand account manager empty card
Reverse ETL sync every 6h Intercom tag automation stale tag

Code.

-- Looker explore (shared view).
-- explore: account_health
-- view: marts.account_health
view: account_health {
  sql_table_name: marts.account_health ;;
  measure: avg_churn_risk {
    type: average
    sql: ${TABLE}.churn_risk ;;
  }
}
Enter fullscreen mode Exit fullscreen mode
# Hightouch sync — same underlying model, machine surface.
model: marts.account_health
destination: intercom
sync_mode: mirror
primary_key: account_id
schedule: "0 */6 * * *"
mappings:
  - source: churn_risk                 -> Company.churn_risk_attr
  - source: CASE WHEN churn_risk > 0.7
            THEN 'at_risk' ELSE 'ok' END  -> Company.health_tag
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The same marts.account_health model feeds both surfaces. There is exactly one definition of "churn risk" in the company.
  2. The dashboard query runs when a human opens it. The SLA is "the query returns in less than 10 seconds and the number is no older than the last warehouse refresh."
  3. The Hightouch sync runs every 6 hours regardless of human attention. The SLA is "the Intercom tag reflects yesterday's risk score by the end of every 6-hour window."
  4. Failure modes differ: a dashboard failure is loud (empty card, error toast); a sync failure is quiet (a stale tag still looks like data). Observability for the sync must be explicit.

Output.

Surface Behaviour when warehouse fails
Looker dashboard error visible immediately to the user
Hightouch sync last successful tag persists; alert fires only if observability is set up

Rule of thumb. Treat the sync as a different product than the dashboard, even when both subscribe to the same model. Stamp a SLA on the sync, add an explicit row-error alert, and surface the sync as a dbt exposure so it shows up in lineage.

Worked example — when reverse ETL is the wrong tool

Detailed explanation. Reverse ETL has a lower bound on latency around a minute (Census) and a typical floor of 15–30 minutes for cost-efficient syncs (Hightouch on shared infrastructure). For sub-second personalisation, fraud-blocking, or in-session experiences, reverse ETL is the wrong tool — you need an event stream.

Question. Given a "personalise the homepage banner based on the user's churn risk" requirement, decide between reverse ETL and an event-stream architecture. Show the latency budget that drives the decision.

Input.

Requirement Latency target Architecture
Sales Salesforce score 30 minutes reverse ETL
Marketing Intercom tag 6 hours reverse ETL
Ad audience refresh 24 hours reverse ETL
Homepage personalisation < 500 ms event stream
Fraud block at checkout < 200 ms online ML feature store

Code.

Decision rubric (pseudo-code):

if latency_target >= 5_minutes:
    use reverse_etl (Hightouch / Census / RudderStack)

elif latency_target >= 30_seconds:
    use event_stream_reverse_etl (RudderStack event stream)

elif latency_target >= 100_ms:
    use online_feature_store + low_latency_api (Tecton, Feast, custom)

else:
    use in_request_compute (edge function, cached cache lookup)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The latency floor for batch reverse ETL is a function of warehouse query time + diff computation + destination API throughput. On a shared tenant in Hightouch this typically lands at 5–15 minutes.
  2. RudderStack's event-stream reverse ETL closes the loop in seconds for individual event triggers but still cannot serve a single-millisecond synchronous API call.
  3. Online ML feature stores (Tecton, Feast) maintain a serving layer separate from the warehouse precisely for sub-100ms reads. Reverse ETL pre-materialises features into that layer on a slower cadence.
  4. The rubric ranks tools by the actual latency budget the use case requires. Picking the wrong tier wastes either money (using a feature store for a daily ad audience) or signal (using reverse ETL for sub-second personalisation).

Output.

Use case Tool Why
Lead score in Salesforce Hightouch upsert 30-min cadence, batch fine
Churn risk tag in Intercom Census sync 6h cadence, batch fine
Homepage banner Edge feature read sub-500ms, batch insufficient
Fraud rule at checkout Online feature store sub-200ms, must be pre-materialised

Rule of thumb. Sketch the latency budget first. Anything above 5 minutes is a reverse ETL problem. Anything below 5 minutes is a streaming or feature-store problem. Mixing the two architectures costs more than picking the right one from the start.

Reverse ETL interview question on the lift-up from forward ETL

A senior interviewer often asks: "Walk me through what changes in the data team's responsibility model when reverse ETL enters the stack. What dbt practices have to harden? What new SLAs do you accept?"

Solution Using the data activation contract

The data team takes on three new responsibilities the day reverse ETL ships:

1. Model stability is now an operational SLA.
   - Every sync model needs a stable primary key (renaming it
     breaks identity resolution downstream).
   - Column renames now break SaaS-tool fields that humans rely on.
   - Type changes can silently corrupt destination fields.
   - Solution: dbt contract tests + dbt exposures + protected branch
     for any model with downstream syncs.

2. Freshness is now a destination-level SLA.
   - Warehouse "fresh as of midnight" is no longer enough.
   - Each destination has its own freshness contract (Salesforce: 30m,
     Intercom: 6h, Facebook ads: 24h).
   - Solution: per-sync alerting, last_synced_at columns, freshness
     dashboards.

3. Governance now spans warehouse + SaaS tools.
   - PII synced to Marketo is now subject to Marketo's retention.
   - GDPR delete must propagate to every destination.
   - Solution: PII tags on every column, per-destination policy,
     destination-side deletes.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Responsibility Before reverse ETL After reverse ETL
Model PK stability nice-to-have hard contract
Column rename dashboard fix downstream sync break
Freshness warehouse-wide per-destination
PII warehouse policy propagated to N SaaS tools
Lineage dbt + BI dbt + BI + syncs

The data team learns to think of every model with at least one sync as an operational data product. The discipline is closer to backend engineering than to "writing SQL" — versioned, monitored, alerted, paged.

Output:

Practice New requirement once reverse ETL is in the stack
dbt contracts required on every sync model
dbt exposures every sync surfaced in lineage
PII tagging per-column tags propagated to destination policy
Alerting per-sync row-error rate and freshness SLA
On-call one person owns sync health

Why this works — concept by concept:

  • Models become contracts — the (primary_key, columns, types) tuple is now a stable API. Any change is a versioned migration with downstream blast-radius assessment.
  • Freshness becomes per-destination — the warehouse SLA is the upper bound; each sync has its own, often tighter, freshness contract because downstream SaaS automation acts on it.
  • PII becomes propagated — a column tagged "email PII" in the warehouse must inherit the same handling everywhere it lands. GDPR delete is the canonical stress test.
  • Lineage becomes end-to-end — dbt exposures are the standard way to surface "this model is consumed by this Hightouch sync" inside the dbt docs and the data catalog.
  • On-call gets a new pager — the day a sync fails silently is the day the data team learns operational analytics needs operational ownership. One person owns sync health, full stop.
  • Cost — the new responsibilities are mostly process; the dbt features (contracts, exposures, tags) ship out of the box. Marginal infrastructure cost is the reverse ETL vendor subscription itself.

ETL
Topic — etl
ETL pipeline problems (data engineering)

Practice →


2. The reverse ETL data model — models, audiences, syncs

Every reverse ETL platform organises around four nouns: model, audience, sync, destination — learn them once and every vendor feels the same

The mental model in one line: a model is a warehouse query that produces one row per entity; an audience is a filtered subset of a model; a sync is a mapping of model rows into a destination; a destination is the SaaS tool. Once you learn this four-noun vocabulary, every vendor UI collapses to the same shape and the differences become mostly cosmetic.

Visual diagram of the reverse ETL data model — a warehouse cylinder on the left feeding a 'model' card, which connects to an 'audience' subset card, which connects through a 'sync' card with mapping arrows to a destination hexagon on the right; small entity-id chips show identity resolution at the boundary, on a light PipeCode card.

The four-noun glossary.

  • Model. A SQL query (or dbt model reference) that returns rows of a single entity — one_row_per_user, one_row_per_account, one_row_per_subscription. The model has a primary key column and a set of attribute columns.
  • Audience. A filter expression layered on top of a model — WHERE plan = 'pro' AND last_seen_at < CURRENT_DATE - INTERVAL '30 days'. Audiences are reusable across syncs and across destinations.
  • Sync. The full specification: which model (or audience), which destination, which field mappings, which sync mode, which schedule. A sync is the deployable unit.
  • Destination. The SaaS tool credentials + the destination object (Salesforce Contact, HubSpot Company, Intercom User, Marketo Lead, Facebook Custom Audience).

Sync modes you will encounter.

  • Insert. New rows are inserted into the destination; existing rows are untouched. Used for append-only destinations like logging or analytics events.
  • Update. Existing rows are updated; new rows are not inserted. Used when the destination owns identity creation (e.g. only update Salesforce contacts that already exist via lead capture).
  • Upsert. Insert new rows, update existing rows. The most common mode for customer attribute syncs.
  • Mirror. Make the destination match the model exactly — insert new, update changed, delete rows no longer in the model. The most powerful and the most dangerous; usually scoped to audiences (e.g. "the at-risk audience").
  • Delete only. Remove rows from the destination based on a "tombstone" model. Often used for GDPR delete propagation.

Identity resolution at the sync boundary.

  • External ID matching. The most common pattern: the warehouse primary key (salesforce_contact_id, hubspot_vid) is the same as the destination's primary key. The sync upserts on that key.
  • Email / phone matching. When the warehouse and the destination both store contact PII, syncs can match on email or phone. Brittle to changes (a user's email change creates a "new" record) but works for greenfield setups.
  • Custom external_id field. Hightouch and Census both support designating a custom external ID field in the destination (e.g. Marketo's external_id_c). The sync writes the warehouse PK there once, then matches on it forever.
  • Composite key matching. Some destinations (Salesforce, Marketo) support compound external IDs (e.g. account_id + region). Rarely used; useful when the same person lives in multiple tenants.

Idempotency — the contract that saves the team.

  • Stable primary key on every model. If the warehouse PK can change, the sync will double-write or fail to dedupe — every reverse ETL platform assumes the model PK is stable across runs.
  • Idempotent upserts. A retry on the same row must produce the same destination state. Most SaaS APIs support id based upsert; some require a "create-or-update" two-step.
  • Diff-only by default. Sync only the rows that changed since the last successful run. Saves API quota, reduces destination clutter, simplifies observability ("zero diffs is a healthy sync, not a broken one").

Change detection — three strategies.

  • Full refresh. Read the entire model every run, ship every row. Simple, expensive, almost never the right answer above 100k rows.
  • Diff-only (snapshot). Store a hash of every (PK, attribute) tuple on each successful run. On the next run, compare hashes and only ship the diffs.
  • CDC mirror. Subscribe to the warehouse's change-data-capture stream (Snowflake streams, BigQuery change streams, Databricks CDC) and apply diffs incrementally. The lowest-latency option; vendor support varies.

Worked example — defining a model with a stable PK and clean attributes

Detailed explanation. A reverse ETL model is not a fact table. It is a one-row-per-entity row set with attributes the destination cares about. The biggest mistake newcomers make is reusing an analytics fact table as the model — fact tables have multiple rows per entity, and the sync will explode or drop most of them.

Question. Given a fact_orders table and a dim_customers table, write the right dbt model for a "current customer state" reverse ETL sync into Salesforce.

Input — fact_orders.

order_id customer_id amount order_date
1 C1 50 2026-06-01
2 C1 30 2026-06-10
3 C2 200 2026-05-20

Input — dim_customers.

customer_id name salesforce_contact_id
C1 Alice 003A1
C2 Bob 003A2

Code.

-- WRONG — multiple rows per customer; will fail upsert.
SELECT
    c.salesforce_contact_id,
    o.amount,
    o.order_date
FROM dim_customers c
JOIN fact_orders o ON o.customer_id = c.customer_id;

-- RIGHT — one row per customer with aggregated attributes.
-- File: models/marts/reverse_etl_customer_state.sql
SELECT
    c.salesforce_contact_id,
    c.name,
    COUNT(o.order_id)                              AS lifetime_orders,
    COALESCE(SUM(o.amount), 0)                     AS lifetime_revenue,
    MAX(o.order_date)                              AS last_order_at,
    COUNT(o.order_id) FILTER
        (WHERE o.order_date >= CURRENT_DATE - 30)  AS orders_last_30d
FROM dim_customers c
LEFT JOIN fact_orders o ON o.customer_id = c.customer_id
GROUP BY c.salesforce_contact_id, c.name;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The wrong model emits two rows for C1 (one per order). The Hightouch sync sees two rows with the same salesforce_contact_id, fails the "unique PK" assertion, and either rejects the sync or upserts the last row arbitrarily.
  2. The right model wraps fact_orders in a GROUP BY on customer_id, collapsing every customer to one row. Attributes are aggregated: COUNT for orders, SUM for revenue, MAX for last order date.
  3. LEFT JOIN preserves customers with zero orders. COALESCE(SUM(...), 0) turns the NULL sum into a clean 0 for downstream Salesforce automations.
  4. COUNT(...) FILTER (WHERE ...) produces the "last 30 days" attribute without a separate subquery. Postgres / Snowflake / BigQuery support FILTER; SQL Server uses COUNT(CASE WHEN ... THEN 1 END).

Output (the reverse ETL model).

salesforce_contact_id name lifetime_orders lifetime_revenue last_order_at orders_last_30d
003A1 Alice 2 80 2026-06-10 2
003A2 Bob 1 200 2026-05-20 0

Rule of thumb. A reverse ETL model is SELECT ... FROM ... GROUP BY entity_id plus joins. If the model emits more than one row per entity, the sync is wrong. Add a dbt-unique test on the PK column so the next CI run catches it.

Worked example — defining an audience from a model

Detailed explanation. Audiences are reusable filtered subsets of a model. A typical pattern: one underlying marts.reverse_etl_customer_state model, multiple audiences ("at-risk", "high-value", "trial-expiring"), each subscribed to a different destination.

Question. Define three audiences on top of the customer state model: at-risk (churn_risk > 0.7), high-value (lifetime_revenue > 5000), and active-trial (plan = 'trial' AND days_remaining < 7). Show how each maps to a different destination.

Input — marts.reverse_etl_customer_state.

salesforce_contact_id name plan lifetime_revenue churn_risk trial_ends_at
003A1 Alice pro 8000 0.05 NULL
003A2 Bob trial 0 NULL 2026-06-18
003A3 Cara pro 1200 0.82 NULL
003A4 Dan trial 0 NULL 2026-06-30

Code.

-- Audience: at_risk
SELECT * FROM {{ ref('marts.reverse_etl_customer_state') }}
WHERE churn_risk > 0.7;

-- Audience: high_value
SELECT * FROM {{ ref('marts.reverse_etl_customer_state') }}
WHERE lifetime_revenue > 5000;

-- Audience: active_trial
SELECT * FROM {{ ref('marts.reverse_etl_customer_state') }}
WHERE plan = 'trial'
  AND trial_ends_at IS NOT NULL
  AND trial_ends_at - CURRENT_DATE BETWEEN 0 AND 7;
Enter fullscreen mode Exit fullscreen mode
# Three syncs, one model, three destinations.
- name: at_risk_to_intercom
  audience: at_risk
  destination: intercom
  sync_mode: mirror
  mappings:
    - source: churn_risk         -> Company.churn_risk_attr
    - source: lifetime_revenue   -> Company.ltv_attr

- name: high_value_to_facebook_ads
  audience: high_value
  destination: facebook_ads
  sync_mode: mirror
  mappings:
    - source: email              -> custom_audience.email_hash

- name: active_trial_to_iterable
  audience: active_trial
  destination: iterable
  sync_mode: mirror
  mappings:
    - source: trial_ends_at      -> User.trial_end_date
    - source: name               -> User.first_name
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The single underlying model marts.reverse_etl_customer_state is the source of truth. Every audience is a filter on top of it.
  2. Audience at_risk mirrors to Intercom for CS alerting. The sync ships only the matching subset and removes the tag when a customer drops out of the audience (mirror mode).
  3. Audience high_value mirrors hashed emails to a Facebook custom audience. Add/remove behaviour follows audience membership automatically.
  4. Audience active_trial syncs to Iterable for an automated email sequence. The mirror mode adds users when they enter the trial window and removes them when the trial ends.
  5. Each sync inherits the same model contract — change the column, every audience and sync notices on the next run.

Output.

Audience Members Destination Effect
at_risk 003A3 (Cara) Intercom tagged as at_risk
high_value 003A1 (Alice) Facebook added to custom audience
active_trial 003A2 (Bob) Iterable trial-end sequence triggered

Rule of thumb. Build one model per entity, many audiences per model, one or more syncs per audience. The fan-out pattern (1 model → N audiences → M syncs) keeps the definition of an entity DRY and lets each downstream team pick the slice they care about.

Worked example — change detection: snapshot diff vs full refresh

Detailed explanation. Diff-only syncs are the default in every modern reverse ETL platform. They store a hash (or row checksum) per primary key after each successful run; on the next run they compare the new model output against the stored snapshot and emit only the changed rows. Full refresh is sometimes correct but very expensive.

Question. Given a 1M-row customer state model where 0.2% of rows change between runs, compare full-refresh API cost (every run ships every row) with diff-only (only changed rows shipped). Use a destination with a 200-row-per-API-call batch limit.

Input — assumptions.

Variable Value
Total model rows 1,000,000
Rows changed per run 2,000 (0.2%)
Destination batch size 200 rows
Syncs per day 24
Destination API call cost $0.001

Code.

Full refresh per run:
    api_calls    = ceil(1_000_000 / 200) = 5_000
    runs_per_day = 24
    daily_cost   = 5_000 * 24 * $0.001 = $120

Diff-only per run:
    api_calls    = ceil(2_000 / 200) = 10
    runs_per_day = 24
    daily_cost   = 10 * 24 * $0.001 = $0.24

Cost ratio: 500x cheaper with diff-only.
Time ratio: same — typical API latency dominated by call count, not payload size.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Full refresh ships every row on every run. With 1M rows and 200/batch, the platform issues 5,000 API calls per run. 24 runs/day is 120,000 calls/day.
  2. Diff-only ships only the 0.2% changed rows. 2,000 rows / 200 per batch = 10 API calls per run. 24 runs/day is 240 calls/day.
  3. The math is independent of vendor — every reverse ETL platform that supports diff-only will produce this savings on a typical attribute-update workload.
  4. Diff-only does require the platform to maintain the previous snapshot. The snapshot is typically stored in the reverse ETL platform's own metadata DB (Hightouch) or as a hidden audit table in the source warehouse (Census's "tracking table" pattern).

Output.

Strategy API calls / day Cost / day Quota risk
Full refresh 120,000 $120 high (Salesforce 15k cap)
Diff-only 240 $0.24 very low

Rule of thumb. Default to diff-only on every sync. Use full refresh only for "catch-up after a destination outage" or for small reference tables under ~10k rows. The 100–500× API quota savings are not optional at scale — Salesforce will hard-stop you at 15k API calls per 24h on the standard plan.

Reverse ETL interview question on idempotency

A senior interviewer often probes: "Your nightly sync runs, fails halfway through with a network blip, and reruns automatically. How do you guarantee the destination ends up in the same state it would have been if the sync had succeeded the first time?"

Solution Using the idempotent upsert contract

Idempotency is guaranteed if and only if:

1. Every model row has a stable primary key.
   - The PK is the natural identity (salesforce_contact_id),
     not a row number or a hash that changes between runs.
   - dbt test: unique + not_null on the PK column.

2. The sync mode is upsert (not insert) on a destination-side
   external ID field.
   - Salesforce: Upsert /sobjects/Contact/extId/{externalId}
   - HubSpot: Upsert /contacts/v1/contact/createOrUpdate/email/{email}
   - Marketo: leads/createOrUpdate with lookupField=externalId

3. The destination accepts a duplicate row as a no-op when
   nothing has actually changed.
   - Hightouch: built-in "skip unchanged rows" toggle.
   - Census: built-in idempotency cache.
   - RudderStack: ETag / If-Match conditional updates.

4. Retries on transient errors (5xx, network timeout) are
   safe because step 2 guarantees the second call lands the
   same destination state as the first.

5. Permanent errors (4xx) go to a dead-letter queue for
   manual inspection, NOT into the auto-retry loop.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Run Outcome Destination state Idempotent?
Run 1 (initial) success 1000 rows in Salesforce n/a
Run 2 (no diff) success 1000 rows unchanged yes — zero API calls
Run 3 (1 row change) success 1000 rows, 1 updated yes — 1 API call
Run 4 (mid-run network blip) partial fail at row 500 500 of 999 deltas applied next run resumes
Run 4 retry success all deltas applied yes — final state matches success-on-first-try

The fourth row shows the key behaviour: a half-applied sync is safe because each row's upsert is idempotent. The retry picks up the unfinished deltas without re-applying the already-applied ones.

Output:

Property Behaviour with idempotency contract
Network blip safe — retry resumes
Same model, two runs back-to-back second run is a no-op
Schema change downstream sync fails loudly, no half-update
Concurrent runs platform locks the sync to one instance
Duplicate row in model dbt test fails before sync starts

Why this works — concept by concept:

  • Stable PK — the primary key is the bridge between warehouse identity and destination identity. The whole upsert mechanism depends on it being stable across runs.
  • External ID upsert — every modern SaaS API offers an upsert primitive keyed on a custom external ID. Use it. Two-step "search-then-create-or-update" patterns are error-prone and not idempotent under concurrency.
  • Diff-only + skip-unchanged — short-circuits the destination call entirely when nothing has changed. A healthy sync run can legitimately make zero API calls.
  • Dead-letter queue — permanent errors (validation failure, missing required field) are not retried in a tight loop; they go to an inspect-and-fix queue. The retry loop is only for transient errors.
  • Concurrent-run lock — every reverse ETL platform single-instances each sync. Two parallel runs of the same sync would race on the diff snapshot and corrupt the next-run baseline.
  • Cost — idempotency is essentially free once the contract is in place. The cost is the up-front discipline of designing models with stable PKs and configuring destination external IDs.

Data
Topic — dimensional-modeling
Dimensional modelling problems (data engineering)

Practice →


3. Hightouch vs Census vs RudderStack — vendor comparison

Each vendor optimises for a different team shape — pick by who owns syncs and how dbt-native your stack is

The mental model in one line: Hightouch is the audience-builder-first managed platform, Census is the dbt-native data-team-first managed platform, RudderStack is the open-source CDP + reverse ETL combined platform with a self-hostable option. Once you map team shape and stack constraints to vendor identity, the choice becomes obvious — and obvious choices are easier to defend in a procurement meeting.

Three-column vendor comparison card — Hightouch (purple), Census (orange), RudderStack (green) each shown as a tall rounded card with a header strip, a short tagline, four feature badges (destinations, dbt integration, hosting model, pricing model), and a small icon at the top representing the vendor's identity, on a light PipeCode card.

The vendor matrix in one table.

Capability Hightouch Census RudderStack
Destinations (2026) 200+ 180+ 200+ (events + reverse ETL)
dbt integration strong (model picker, exposures) strongest (dbt exposures native, "data-team first") adequate
Audience builder first-class visual UI SQL-first, basic UI builder basic
Sequences / journeys yes (Hightouch sequences) yes (Census audiences with priority) partial
Identity resolution strong (configurable matching) strong (entity model) event-stream-first
Self-hosted option no (managed only) no (managed only) yes (RudderStack OSS + BYOC)
Combined CDP + reverse ETL no no yes (event stream + reverse ETL)
Observability strong (per-row, per-sync) strong (sync alerts) adequate
Pricing model per-destination + MTU per-row synced per-MTU + events

Hightouch — audience-builder first, GTM-team first.

  • Strengths. Best-in-class audience builder UI (drag-and-drop filters, custom calculations); broadest destination catalogue; "Hightouch sequences" let marketing build journeys without leaving the tool; deep observability with row-level error inspection.
  • Best fit. Teams where the audience definitions live half in SQL and half in marketing's head; companies with 5+ destinations across CRM + marketing + ads.
  • Trade-offs. Managed-only (no self-host); MTU-based pricing surprises mid-market companies as their user count grows; Hightouch's UI-first audience editor can drift from the dbt definition of an entity if not policed.

Census — data-team first, dbt-native.

  • Strengths. Tightest dbt integration of the three — Census reads dbt_project.yml, recognises exposures, and surfaces sync metadata back into the dbt docs; "entity" model is a first-class concept; sync alerting is mature.
  • Best fit. Data teams that already live in dbt and want the warehouse-to-SaaS contract owned by analytics engineers, not marketing ops.
  • Trade-offs. Audience-builder UI is intentionally minimal (SQL is the way); fewer "GTM goodies" like multi-channel journeys; managed-only; per-row pricing means batch refreshes can sting.

RudderStack — open-source CDP + reverse ETL combined.

  • Strengths. Open-source under AGPLv3 with a managed plan; combines event streaming (Segment-style) with reverse ETL in one tool; self-hostable for BYOC / on-prem / compliance-driven shops; the only one of the three that can serve sub-30-second event reverse ETL.
  • Best fit. Companies that need both event collection and reverse ETL but want to avoid SaaS sprawl; compliance / BYOC use cases; engineering-heavy teams comfortable running infra.
  • Trade-offs. UI is less polished than Hightouch / Census; destination catalogue runs slightly behind on long-tail SaaS tools; the self-hosted operational cost is real (operate Postgres, Kubernetes, observability).

Pricing dimensions to model before procurement.

  • MTU (Monthly Tracked Users). Most platforms charge per unique entity synced per month. The metric grows roughly with total customer base.
  • Per-row synced. Census's primary metric. Drives a "diff-only is required" discipline because full refresh becomes ruinously expensive.
  • Per-destination. Hightouch's standard plans cap the number of destinations on lower tiers. Multi-channel companies feel this fast.
  • Per-seat. Both Hightouch and Census charge per audience-builder seat above a baseline.
  • Events (RudderStack). Event-stream pricing is per event, not per unique user. Plan for both axes.

Self-hosted (RudderStack OSS) vs managed trade-off.

  • Self-hosted wins for. BYOC compliance, data-residency, "all data must stay in our VPC," low cost at very large MTU counts (>1M).
  • Managed wins for. Speed (live in a day vs a quarter), no infra ops burden, faster destination roll-outs, no upgrade cycles.
  • Hybrid pattern. Many shops run RudderStack OSS for event collection (zero per-event vendor cost) and Hightouch managed for reverse ETL (fastest catalogue + audience UI).

Worked example — picking Hightouch when GTM owns audiences

Detailed explanation. A B2B SaaS company has a 6-person revenue ops team that owns Salesforce, HubSpot, Marketo, Outreach, and a half-dozen ad accounts. They want to build "buying-committee" audiences without filing a Jira to data each time. The data team owns the underlying dbt model; revenue ops owns the audience layer on top.

Question. Given the company profile (GTM-heavy, 5+ destinations, audience-builder UI matters), justify Hightouch as the right pick. List the decisive feature differences vs Census and RudderStack.

Input — the company profile.

Property Value
Audience owners Revenue ops (non-SQL)
Destinations Salesforce, HubSpot, Marketo, Outreach, FB Ads, LinkedIn Ads
Warehouse Snowflake + dbt
Sync latency 30 minutes acceptable
Self-host requirement none

Code.

Decision matrix:

| Need                       | Hightouch | Census | RudderStack |
|----------------------------|-----------|--------|-------------|
| Drag-drop audience builder | strong    | basic  | basic       |
| 6+ destinations            | yes       | yes    | yes         |
| dbt exposure surfacing     | yes       | best   | adequate    |
| Multi-channel sequences    | yes       | partial| partial     |
| No-SQL revenue ops users   | strong    | weak   | weak        |

Decision: Hightouch wins on (1) audience builder, (4) sequences,
(5) non-SQL audience editors. Census's dbt-first stance is a real
strength but the GTM team owns audiences in this org.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The team's bottleneck is "GTM ops cannot self-serve audiences." Hightouch's audience builder is the only one of the three optimised for that exact persona.
  2. Census's strength (dbt-native) does not help when the audience layer is owned outside the data team. The model is still in dbt; the audience-on-top-of-model is what's UI-driven.
  3. RudderStack's event-stream story is not relevant — this team is not building real-time personalisation, just attribute syncs at 30-minute cadence.
  4. The decisive feature is the audience builder UI, with Hightouch sequences as a bonus for multi-step marketing journeys.

Output.

Decision Hightouch
Why audience builder + sequences + destination catalogue
Estimated MTU cost $$ (mid-market plan)
Implementation timeline 4 weeks to first sync

Rule of thumb. Hightouch wins when GTM owns the audience layer and non-SQL editors need to ship audiences without filing tickets. Census wins when the data team owns the audience layer and dbt is the single source of truth. RudderStack wins when you need both CDP event collection and reverse ETL or you must self-host.

Worked example — picking Census when dbt is the source of truth

Detailed explanation. A fintech with strict change-management has a small analytics engineering team that defines every metric, every entity, and every audience in dbt. Marketing ops "subscribes" to dbt models via tickets. The team wants the sync layer to inherit dbt's contract testing, exposures, and lineage natively.

Question. Given the company profile (dbt-first, analytics engineering owns audiences, strict change management), justify Census over Hightouch. List the decisive dbt integration features.

Input — the company profile.

Property Value
Audience owners Analytics engineering (SQL-fluent)
Source of truth dbt models, branch-protected
Destinations Salesforce, Iterable, Customer.io
Sync latency 1 hour acceptable
Compliance strict — every change reviewed

Code.

Census dbt-native features that decided it:

1. dbt project sync — Census reads dbt_project.yml directly.
   Models appear in Census with the same name as in dbt.

2. dbt exposures — every Census sync is automatically surfaced
   as a dbt exposure. Lineage in dbt docs shows the destination.

3. Git-backed sync definitions — sync YAML lives in the dbt
   repo, change-managed via PR.

4. dbt tests propagate — failing dbt tests block the sync.
   Census never ships a failing-test row to a destination.

5. Entity model — Census's "entity" concept is the equivalent
   of a dbt model with documented PK + columns. Discoverable
   across the team.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The data team's discipline is "everything ships via PR." Census's git-backed sync definitions extend that discipline to the reverse ETL layer.
  2. Hightouch supports a Terraform provider for sync-as-code, but the UI-first culture pulls non-engineers off the git workflow. Census's SQL-first culture matches the team.
  3. dbt exposures inside Census are decisive — every destination becomes a known consumer in the lineage graph. Census surfaces "this sync depends on this model" automatically.
  4. Failing dbt tests blocking the sync is the killer feature for compliance — it means a regression in the model never silently corrupts a downstream SaaS field.

Output.

Decision Census
Why dbt-native + git-backed syncs + exposures + test-gating
Estimated cost $$ (per-row pricing acceptable at this volume)
Implementation timeline 6 weeks to production sync

Rule of thumb. Census wins when the analytics engineering team owns the audience layer and "everything ships via PR" is a non-negotiable. The dbt integration is real, not cosmetic — it changes how the team operates day to day.

Worked example — picking RudderStack OSS for BYOC compliance

Detailed explanation. A healthcare SaaS must keep PII inside its own VPC. Sending raw email addresses through a multi-tenant SaaS reverse ETL platform is a compliance blocker. RudderStack OSS runs inside the customer VPC, never touches the vendor's infrastructure, and combines event collection (replacing Segment) with reverse ETL in one tool.

Question. Given the company profile (PII must stay in VPC, single tool preferred for events + syncs), justify RudderStack OSS over the managed options.

Input — the company profile.

Property Value
Compliance PII must stay in customer VPC
Existing event tool considering Segment replacement
Destinations Salesforce Health Cloud, HubSpot, internal API
Sync latency 5 minutes for high-priority
Team engineering-heavy, comfortable running infra

Code.

Why RudderStack OSS wins on this profile:

1. Self-hosted in customer VPC.
   - No PII leaves the customer's cloud account.
   - Audit trail end-to-end within customer-owned storage.

2. Combined event stream + reverse ETL.
   - Single tool covers Segment-like event collection AND
     Hightouch-like warehouse reverse ETL.
   - One destinations catalogue, one UI, one set of credentials.

3. Event-stream reverse ETL.
   - Sub-30-second latency on high-priority warehouse changes
     via the event-stream path (not the batch path).

4. AGPLv3 source-available.
   - Customer can patch, audit, and extend.
   - No vendor lock-in for compliance-critical features.

Trade-offs accepted:
- Operate Postgres, Redis, K8s yourself.
- Destination catalogue runs slightly behind Hightouch on
  long-tail tools.
- UI is less polished — engineers, not marketers, configure syncs.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The PII-in-VPC requirement removes Hightouch and Census from contention immediately — both are managed-only.
  2. The combined event-stream + reverse ETL story removes Segment from the picture and consolidates spend.
  3. RudderStack OSS's event-stream reverse ETL path is the only sub-30-second option in this comparison — relevant for the "high-priority sync" use case.
  4. The trade-off is operational burden. The team must own the Postgres metadata DB, Redis broker, and Kubernetes orchestration. An engineering-heavy org accepts this.

Output.

Decision RudderStack OSS
Why self-hosted compliance + combined CDP + sub-30s reverse ETL
Estimated cost infrastructure + 0.5 SRE FTE
Implementation timeline 8 weeks to production

Rule of thumb. RudderStack wins on three triggers: BYOC compliance, single-tool consolidation of CDP + reverse ETL, or sub-30-second latency requirements via event-stream reverse ETL. If none of those triggers fires, prefer Hightouch or Census for the operational simplicity of managed.

Reverse ETL interview question on the buy-vs-build decision

A senior interviewer often frames it as: "Your CTO is asking whether we can just build reverse ETL in-house with Airflow + Python + the destination SDKs. Walk me through the buy-vs-build decision."

Solution Using the operational-burden lens

The build-it-yourself stack:

1. Airflow / Dagster orchestration.
2. Custom Python writers for each destination API.
3. Snapshot diff engine (you build it).
4. Queue + worker pool with retry semantics (you build it).
5. Dead-letter queue + inspection UI (you build it).
6. Per-row error logging (you build it).
7. Schema-change detection (you build it).
8. Audit log + lineage (you build it).
9. Audience builder UI for non-engineers (... you build it).
10. PII tagging + governance UI (you build it).

The buy stack:

1. Hightouch / Census / RudderStack subscription.
2. Sync configuration (a week of work).

The break-even calculation:

- Year 1 build cost: 2 senior engineers × 6 months = ~$300k.
- Year 1 buy cost:   ~$30k–$80k subscription, depending on MTU.
- Year 2 build cost: 1 engineer × full year maintenance = ~$200k.
- Year 2 buy cost:   ~$50k–$120k subscription.

Buy wins decisively unless:
- You have a destination not on the vendor catalogue (rare).
- You have a sub-second latency requirement (use a feature store).
- You have a compliance constraint requiring on-prem (use OSS).
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Component Buy time Build time
Destination connectors 1 day per destination (config) 2 weeks per destination (code + tests)
Diff engine included 4 weeks
Queue + retry included 6 weeks
Dead-letter inspection included 2 weeks
Audience builder UI included 12+ weeks (and your data team has to maintain it)
Schema-change detection included 4 weeks

The "build everything" path lands at 6–9 months for a v1 covering 5 destinations with no UI. The "buy" path lands at 4–6 weeks for the same scope plus an audience UI.

Output:

Year Build cost Buy cost
1 ~$300k ~$50k
2 ~$200k ~$80k
3 ~$200k ~$100k

Why this works — concept by concept:

  • Connector breadth — vendors maintain hundreds of destination integrations as their full-time job. A 2-engineer team building from scratch will cover 5–10 destinations at best in year 1.
  • Diff engine is the moat — every reverse ETL platform's secret sauce is the diff/snapshot/incremental detection logic. Building a reliable one is a 6-month research project, not a weekend hack.
  • Audience UI — the moment a non-engineer needs to ship an audience, you need a UI. Building that internally is a years-long product investment that has nothing to do with your company's actual product.
  • Observability — per-row error tracking, dead-letter queues, sync success ring charts — all included in the vendor stack. Building them stalls your data team for months.
  • Compliance escape hatch — RudderStack OSS exists precisely for the rare cases where vendor managed cannot work. Use OSS, not in-house build.
  • Cost — over a 3-year window the buy path is 3–5× cheaper and ships in 1/10 the time. The only counter-arguments are scale (>10M MTU and you renegotiate hard) or compliance (and OSS solves that).

ETL
Topic — api-integration
API integration problems (data engineering)

Practice →


4. Sync architecture — incremental detection, queues, rate limits

A sync is a diff engine plus a queue plus a worker pool plus a rate-limited destination API — every reverse ETL platform implements the same four-stage pipeline

The mental model in one line: the warehouse query produces rows, the diff engine classifies each row as insert/update/delete vs the previous snapshot, the queue absorbs back-pressure, and the worker pool drains the queue into the destination API while respecting per-destination rate limits. Once you can draw the four stages on a whiteboard, every "why is my sync slow / failing / partial?" question becomes a probe of which stage is the bottleneck.

Visual sync architecture — a warehouse cylinder on the left feeds a 'snapshot diff' engine that produces a stream of insert/update/delete events into a queue, which is drained by parallel API workers that hit a destination card; rate-limit and retry annotations float above the workers, on a light PipeCode card.

Stage 1 — warehouse query and snapshot detection.

  • Query. The model SQL (or audience-filtered model SQL) runs against the warehouse. Result is materialised either into a temp table or streamed.
  • Snapshot store. The previous run's (pk, hash(attributes)) set lives somewhere — a hidden table in the warehouse, a Postgres metadata DB in the vendor's infra, or a CDC stream offset.
  • Diff classification. For each current row: if PK absent in snapshot → INSERT; if PK present and hash differs → UPDATE; for each snapshot PK absent in current → DELETE (or "tombstone").

Stage 2 — staging / queue.

  • Per-sync queue. Each sync gets its own queue, single-instanced. No parallel runs of the same sync.
  • Back-pressure absorption. When the destination's API is slow, the queue grows; workers pull at the destination's pace, not the warehouse's pace.
  • Persistence. Queues persist to disk so a vendor restart does not lose in-flight rows.

Stage 3 — worker pool.

  • Worker concurrency. Configured per destination; usually 1–8 parallel workers per sync.
  • Batch packing. Workers pack queue rows into destination-specific batches (Salesforce: 200/batch, HubSpot: 100/batch, Marketo: 300/batch).
  • Token-bucket rate limiter. Each worker checks the destination's quota before issuing the call.

Stage 4 — destination API.

  • Auth. OAuth, API key, service account — refreshed automatically by the platform.
  • Rate limit response. 429 (Too Many Requests) triggers exponential backoff and a slowdown of the worker pool.
  • Per-row error response. 4xx errors on specific rows are recorded as row-level failures, surfaced in the sync log, and either retried (transient) or dead-lettered (permanent).

Destination rate limits in the wild (2026 baselines).

Destination Limit Notes
Salesforce 15,000 / 24h (standard) per-org, all APIs share
HubSpot 100 / 10s + 250k / day per-portal
Marketo 100 / 20s + 50k / day per-instance
Intercom 1,000 / minute per-app
Iterable 4 / second list endpoints varies by endpoint
Facebook Custom Audience 200,000 users / API call batched mode
Slack 1 / second per webhook basic tier

Retry semantics.

  • Transient (5xx, 429, network timeout) — retry with exponential backoff. Typical: 1s → 2s → 4s → 8s → 16s → 32s, then dead-letter.
  • Permanent (4xx with validation error) — log and dead-letter immediately. Retrying will not help.
  • Auth (401, token expired) — refresh the token and retry once, then alert.
  • Quota exhausted (429 with daily-cap header) — pause the sync until the quota window resets; alert if the window is >12 hours.

Latency tiers.

  • Hourly batches. Default for most syncs. 5–60 minutes end-to-end.
  • Sub-minute batches. Census + small models. 30 seconds–5 minutes.
  • CDC mirror. Continuous; reflects warehouse changes in seconds.
  • Event-stream reverse ETL. RudderStack's path; reflects in 1–30 seconds.

Worked example — the diff engine in pseudo-code

Detailed explanation. The diff engine is the heart of every reverse ETL platform. It compares the current model row set against the previous snapshot and emits a stream of insert/update/delete events. Knowing the shape of this code helps debug "why did my sync ship row X?" questions.

Question. Write a pseudo-code sketch of a diff engine that takes (current_rows, previous_snapshot) and emits classified events. Explain how it handles deletes.

Input.

previous_snapshot:
  C1 -> hash("Alice|pro|0.05")
  C2 -> hash("Bob|trial|null")
  C3 -> hash("Cara|pro|0.40")

current_rows:
  C1 -> ("Alice", "pro", 0.05)         # unchanged
  C2 -> ("Bob",   "pro", 0.10)         # changed (trial -> pro)
  C4 -> ("Dan",   "trial", null)       # new
  # C3 missing -> deleted
Enter fullscreen mode Exit fullscreen mode

Code.

def diff_engine(current_rows, previous_snapshot):
    """Yield classified change events."""
    current_keys  = set(current_rows.keys())
    previous_keys = set(previous_snapshot.keys())

    # INSERTs — PKs in current but not previous.
    for pk in current_keys - previous_keys:
        yield ("INSERT", pk, current_rows[pk])

    # UPDATEs — PKs in both, hash differs.
    for pk in current_keys & previous_keys:
        new_hash = row_hash(current_rows[pk])
        if new_hash != previous_snapshot[pk]:
            yield ("UPDATE", pk, current_rows[pk])
        # else: unchanged, emit nothing (this is the big saving).

    # DELETEs — PKs in previous but not current.
    # Only if sync_mode == "mirror"; otherwise skip deletes.
    for pk in previous_keys - current_keys:
        yield ("DELETE", pk, None)

    # Persist new snapshot for next run.
    new_snapshot = {pk: row_hash(row) for pk, row in current_rows.items()}
    save_snapshot(new_snapshot)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The set difference current - previous yields rows present this run but not last run — INSERTs.
  2. The set intersection plus hash comparison yields rows present in both runs whose attributes changed — UPDATEs. Unchanged rows are skipped silently (zero API calls).
  3. The set difference previous - current yields rows present last run but absent this run — DELETEs. Only emitted in mirror sync mode; upsert mode ignores them.
  4. The new snapshot is persisted at the end. If the run crashes before this point, the next run sees the same previous snapshot and re-classifies the same diffs (idempotent recovery).
  5. The row hash function is typically MD5 / xxHash over the JSON serialisation of attributes in a canonical column order. Hash collisions are theoretically possible; in practice the rate is negligible at billion-row scale.

Output.

Event PK Attributes
INSERT C4 (Dan, trial, NULL)
UPDATE C2 (Bob, pro, 0.10)
DELETE C3

Rule of thumb. Always store the previous snapshot durably (warehouse table, Postgres, or S3). A lost snapshot triggers a "full diff against empty," which classifies every row as INSERT and floods the destination — the canonical "first run after vendor restart was a disaster" outage.

Worked example — the rate limiter and the 429 backoff loop

Detailed explanation. Every destination has rate limits. The worker pool must respect them or risk getting the entire integration locked. The token-bucket + exponential-backoff pattern is the universal solution.

Question. Sketch a worker loop that drains a queue of upsert events into a Salesforce-like API with a 15k/24h limit, handles 429 responses, and emits to dead-letter on permanent errors.

Input.

Queue items:
  - upsert C1 with payload P1
  - upsert C2 with payload P2
  - upsert C3 with payload P3 (will return 400 — invalid email)
  - upsert C4 with payload P4

Destination state:
  - quota_remaining = 14_998
  - quota_resets_at = 24h from now
Enter fullscreen mode Exit fullscreen mode

Code.

def worker(queue, destination, rate_limiter, dead_letter):
    while True:
        event = queue.pop()
        if event is None:
            break

        # 1. Respect the destination's rate limit.
        rate_limiter.acquire(destination)

        # 2. Make the API call.
        backoff = 1
        for attempt in range(7):
            try:
                response = destination.upsert(event.pk, event.payload)
                if response.status == 200:
                    break
                if response.status == 429:
                    # Rate limited — exponential backoff.
                    sleep(backoff)
                    backoff = min(backoff * 2, 60)
                    continue
                if 400 <= response.status < 500:
                    # Permanent error — dead letter.
                    dead_letter.push(event, response.body)
                    break
                if 500 <= response.status:
                    # Transient server error — retry.
                    sleep(backoff)
                    backoff = min(backoff * 2, 60)
                    continue
            except NetworkTimeout:
                sleep(backoff)
                backoff = min(backoff * 2, 60)
                continue
        else:
            # Out of attempts — dead letter.
            dead_letter.push(event, "max_retries_exceeded")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. rate_limiter.acquire blocks the worker until the token bucket has a slot. Implementation is typically a Redis script that decrements a per-destination counter and refills it at the destination's rate.
  2. The retry loop runs up to 7 attempts. On 429, the worker sleeps and retries (backoff 1s → 2s → 4s → ... capped at 60s).
  3. On 5xx transient server errors, the worker also retries — server-side issues are usually self-healing within seconds.
  4. On 4xx permanent errors (validation failure, malformed payload, missing required field), the worker stops retrying and pushes the event to the dead-letter queue for human inspection.
  5. Network timeouts (no response) are treated as transient — the worker retries with backoff.
  6. If all 7 attempts fail, the event is dead-lettered with max_retries_exceeded so on-call has visibility.

Output (events that reach the destination vs dead-letter).

Event Destination state Dead-letter?
C1 upserted no
C2 upserted no
C3 rejected (400 invalid email) yes
C4 upserted no

Rule of thumb. The retry loop should always distinguish transient (4 categories: 429, 5xx, timeout, auth-refresh) from permanent (4xx). Mixing them either burns rate limits on hopeless retries or silently drops fixable failures.

Worked example — back-pressure from a slow destination

Detailed explanation. When the destination API is slow (or rate-limit-restricted), the queue grows. A well-designed reverse ETL platform absorbs the growth and only fails when the queue passes a configured high-water mark — not every time the destination has a slow minute.

Question. Given a warehouse producing 10k rows/minute and a destination accepting 100 rows/minute, model the queue growth over an hour. Show why a "queue depth" alert is the right SLI and how to use it for early warning.

Input.

Variable Value
Warehouse output rate 10,000 rows/min
Destination accept rate 100 rows/min
Initial queue depth 0
Alert threshold 50,000 rows

Code.

def queue_growth(warehouse_rate, destination_rate, minutes):
    depth = 0
    growth_per_min = warehouse_rate - destination_rate
    log = []
    for minute in range(1, minutes + 1):
        depth += growth_per_min
        log.append((minute, depth))
    return log

# Compute for one hour:
growth = queue_growth(10_000, 100, minutes=60)
# Alert fires when depth crosses 50_000.
alert_minute = next(m for m, d in growth if d > 50_000)
print(f"Queue depth alert at minute {alert_minute}")
# -> Queue depth alert at minute 6
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Net growth per minute = warehouse output - destination accept = 10000 - 100 = 9900 rows/min.
  2. After 1 min: 9,900 rows queued. After 5 min: 49,500 queued. After 6 min: 59,400 — crosses the 50k alert threshold.
  3. The alert at minute 6 gives on-call 50 minutes of headroom before the queue passes a typical "platform refuses to enqueue" limit of ~500k rows.
  4. The right remediation depends on the cause: (a) destination is rate-limited — wait for the quota to reset and accept the lag; (b) destination is genuinely broken — pause the sync until the destination is healthy; (c) warehouse is producing duplicates — fix the model.
  5. Without the queue-depth alert the team only learns about the problem when the platform errors out at 500k+ — too late, downstream is already stale by hours.

Output.

Minute Queue depth Alert?
1 9,900 no
3 29,700 no
5 49,500 no
6 59,400 yes
60 594,000 platform errors

Rule of thumb. Alert on queue depth, not on sync errors. A sync error is the symptom; queue depth is the leading indicator. Set the alert threshold at 30–50% of the platform's enqueue ceiling to buy on-call time.

Reverse ETL interview question on rate-limit-aware design

A senior interviewer often asks: "Salesforce has a 15k API calls per day quota and our customer state model has 200k rows. How do you design a sync that fits inside the quota?"

Solution Using batching + diff-only + audience filtering

The math first:

  raw rows                            = 200_000
  Salesforce upsert batch size        = 200 rows / call
  full refresh calls                  = 1_000 calls / run
  diff-only on 0.5% changed rows      = 1_000 changed rows
  diff-only batch calls               = ceil(1_000 / 200) = 5 calls / run
  hourly cadence                      = 24 runs / day
  daily API calls                     = 5 * 24 = 120 calls / day

  Headroom under the 15k quota: 124x.

The design:

1. Composite Tooling API batching.
   - Use Salesforce's Composite/sObject Collections API:
     200 records per call vs 1 record per Standard upsert.

2. Diff-only sync mode (no full refresh).
   - Reverse ETL platform stores last-run snapshot.
   - Ship only rows whose attribute hash changed.

3. Audience scoping.
   - Many syncs only need the "active" subset of customers.
   - Filter at the audience layer (plan != 'churned')
     so the diff engine compares smaller sets.

4. Cadence sized to business need.
   - Sales routing: every 30 minutes.
   - Account health: every 6 hours.
   - LTV refresh: every 24 hours.
   - Do not over-spec freshness; quota is finite.

5. Per-sync quota guard.
   - Configure the reverse ETL platform's "max API calls per
     window" knob to a sub-quota share per sync.
   - Hightouch and Census both expose this; RudderStack via config.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Design choice Effect on quota
Full refresh 1,000 calls/run × 24 = 24,000/day — over quota
Diff-only 5 calls/run × 24 = 120/day
Audience scoping reduces diff size further
Per-sync quota guard prevents any one sync from monopolising quota
Hourly vs 30-min cadence doubles or halves daily API calls

The combination of (2) and (4) is decisive. Diff-only converts the metric from "rows in the model" to "rows that changed," which on most attribute syncs is 0.1–2% of the model.

Output:

Strategy Calls / day Inside 15k quota?
Full refresh hourly 24,000 no
Diff-only hourly 120 yes
Diff-only every 30m 240 yes
Diff-only every 5m 1,440 yes
Full refresh every 5m 288,000 catastrophic

Why this works — concept by concept:

  • Batched upsert — Salesforce's composite endpoint is the single biggest lever. Going from 1 row per call to 200 rows per call drops the call count by 200×.
  • Diff-only sync — the second biggest lever. Only ship rows that actually changed. Drops the call count by 50–500× on typical attribute workloads.
  • Audience filtering — shrinks the model to the rows that matter. Skipping churned customers saves both diff computation and quota.
  • Cadence sizing — the third lever. Match the sync frequency to the actual business cadence; "fresh every 5 minutes" is rarely needed for a CRM attribute.
  • Per-sync quota guard — defensive design. Even if one sync misbehaves (e.g. a model bug emits 200k diffs), the guard prevents it from burning the org-wide quota and breaking unrelated syncs.
  • Cost — the design is essentially free. All the levers are configuration, not code. The cost is the discipline to model the math up-front for each new sync.

ETL
Topic — etl
ETL design problems (data engineering)

Practice →


5. Governance, observability, and failure modes

A sync that has no governance, no observability, and no defined failure modes is not a data product — it is a time bomb

The mental model in one line: governance answers "who can sync what to where"; observability answers "is the sync healthy right now"; failure modes answer "what breaks and how do we know". The discipline that separates a hobbyist sync from a production data product is treating these three pillars as first-class — versioned, owned, and on-call paged.

Three-zone governance and observability card — left zone shows a 'governance' gate card with PII tags and an approval check; middle zone shows an observability dashboard card with a success-rate ring chart and a tiny row-error list; right zone shows a failure-mode card with three labelled warning chips (schema drift, mapping break, row cap), on a light PipeCode card.

Governance — five non-negotiables.

  • Field-level PII tagging. Every column tagged pii=email | phone | address | name | ssn. Tags propagate to the sync layer so destinations can enforce per-tag policy.
  • Per-destination policy. "Email PII can sync to Marketo; SSN PII cannot sync to anything." Hightouch and Census both support sync-level allow/deny rules.
  • Audience approval. New audiences > 10k members require analytics-engineering sign-off. Catches "I just synced 200k users to Facebook by accident."
  • GDPR delete propagation. A user's right-to-delete must reach every destination. The platform must support a "delete pipeline" sync (model = users_to_delete, mode = delete-only, fanned out to every destination).
  • Audit log. Every sync edit, schedule change, and credential rotation is logged with actor + timestamp.

Observability — six SLIs to track.

  • Sync success rate. Percent of runs that finished without a top-level error. Target: >99.5%.
  • Row-error rate. Percent of rows in a successful run that failed (typically destination 4xx validation). Target: <1%.
  • Freshness lag. Time since last successful run vs the scheduled cadence. Target: <2× cadence.
  • Queue depth. Pending rows waiting for the worker pool. Leading indicator of destination slowness.
  • Rejected payload sample. Stratified sample of dead-letter events for human inspection.
  • Latency p50 / p99. Wall-clock time from model row produced to destination row accepted.

Failure modes — the four most common.

  • Mapping drift. Warehouse column renamed; destination field still expects the old name; sync silently writes NULL or fails.
  • Schema drift. Column type changed (INT → BIGINT, VARCHAR(50) → VARCHAR(500)); destination rejects with type-mismatch error.
  • Row-cap breach. Audience suddenly grows from 5k to 200k members because a filter became overly permissive; destination quota burns out.
  • Credential expiry. OAuth refresh token expires; sync fails with 401; team finds out hours later when freshness lag alert fires.

Catalog + lineage — surfacing syncs as dbt exposures.

  • Every sync is a known consumer of one or more dbt models. The standard surface is a dbt exposure:
# models/marts/exposures.yml
exposures:
  - name: salesforce_lead_score_sync
    type: application
    owner:
      name: Analytics Engineering
      email: ae@example.com
    depends_on:
      - ref('reverse_etl_customer_state')
    description: |
      Hightouch sync into Salesforce.Contact.lead_score__c.
      Cadence: every 30 minutes.
      On-call: data-team rotation.
Enter fullscreen mode Exit fullscreen mode

Cost guardrails.

  • Per-sync row caps. "This sync will never ship more than 50k rows per run; abort if it tries."
  • Audience size caps. "This audience will never include more than 100k members; alert if it does."
  • Quota share caps. "This sync will use no more than 30% of the destination's daily API quota."
  • Frequency caps. "Even if scheduled hourly, no more than 24 runs per day."

Worked example — propagating PII tags from dbt to the sync layer

Detailed explanation. Field-level PII tagging is the foundation of governance. When a column is tagged in dbt, the tag must propagate to every downstream sync so per-destination policy can enforce "this PII can/cannot land here." Census and Hightouch both read dbt meta tags directly.

Question. Tag dim_users.email as pii=email in dbt, configure Census to read the tag, and define a per-destination policy that allows email to sync to Marketo but blocks it from a marketing experimentation tool.

Input.

# dbt model schema.yml
version: 2

models:
  - name: dim_users
    columns:
      - name: user_id
        tests: [unique, not_null]
      - name: email
        meta:
          pii: email
          contains_pii: true
      - name: ssn
        meta:
          pii: ssn
          contains_pii: true
Enter fullscreen mode Exit fullscreen mode

Code.

# Census destination policy.
destinations:
  marketo:
    allowed_pii_tags: [email, name]
    blocked_pii_tags: [ssn, phone, address]

  experimentation_tool:
    allowed_pii_tags: [name]
    blocked_pii_tags: [email, ssn, phone, address]
    # Note: email is blocked here.

# Census sync definition.
syncs:
  - name: users_to_marketo
    source: dim_users
    destination: marketo
    mappings:
      - source: email   -> Lead.Email          # OK — email allowed in Marketo
      - source: name    -> Lead.Name           # OK

  - name: users_to_experimentation
    source: dim_users
    destination: experimentation_tool
    mappings:
      - source: name    -> User.display_name   # OK
      - source: email   -> User.identifier     # BLOCKED — sync refuses to compile
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The dbt meta block tags the column with structured PII metadata. Census's dbt project reader picks up the tag automatically — no second source of truth.
  2. The destination policy lists allowed and blocked PII categories per destination. Marketo accepts email + name; the experimentation tool accepts only name.
  3. When the sync to Marketo compiles, every mapping is checked against the policy. Email → Lead.Email is allowed; the sync ships.
  4. When the sync to the experimentation tool compiles, the email mapping triggers a policy violation. Census refuses to compile the sync; the engineer sees a clear error and either removes the mapping or escalates for an exception approval.
  5. The policy is enforced at compile time, before any row hits a network. A misconfigured sync never reaches the destination.

Output.

Sync Policy decision
users_to_marketo compiles + ships
users_to_experimentation refused to compile (email blocked)

Rule of thumb. Tag PII at the dbt column level; let the reverse ETL platform read tags and enforce per-destination policy at compile time. Never enforce PII policy at the row level at runtime — at runtime the data has already left the warehouse.

Worked example — the freshness SLA alert

Detailed explanation. Every sync has a freshness contract — "fresh within 2 hours" — set by the consuming team. The platform tracks the actual freshness and alerts when the contract is breached. The alert wakes on-call before the marketing team complains.

Question. Configure a freshness alert for the salesforce_lead_score_sync (cadence 30 min, SLA 2h) and walk through the on-call response when it fires.

Input.

Sync Cadence SLA Freshness now
salesforce_lead_score_sync 30 min 2h 3h 15m ago

Code.

# Census alert definition (illustrative).
alerts:
  - name: lead_score_sync_freshness
    sync: salesforce_lead_score_sync
    condition: minutes_since_last_success > 120  # 2h SLA
    severity: page
    notify:
      - pagerduty: data-team-oncall
      - slack: "#data-alerts"
    runbook: |
      Sync has not succeeded in over 2 hours.
      Steps:
        1. Check Census dashboard for recent error.
        2. If 401 — refresh OAuth credential.
        3. If 429 — wait for quota reset; backfill afterwards.
        4. If model SQL error — open dbt repo, fix, redeploy.
        5. If destination outage — pause sync, monitor status page.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The alert condition minutes_since_last_success > 120 measures actual freshness against the 2h SLA. The 30-minute cadence is the target; the SLA is the deadline.
  2. When the alert fires, PagerDuty pages the on-call data engineer and posts to the Slack channel. The runbook is in the alert body, not in a separate wiki.
  3. The on-call reads the Census dashboard, identifies the failure category (auth, quota, model error, destination outage), and applies the matching runbook step.
  4. The runbook covers the four most-common failure modes. Steps 1–3 are operational; step 4 escalates to the model owner; step 5 escalates to the destination vendor.

Output (timeline of the on-call response).

Time Event
03:00 Last successful run.
03:30 Scheduled run fails — 401 (token expired).
04:00 Second scheduled run fails — 401.
04:30 Third scheduled run fails — 401.
05:00 Freshness alert fires (2h SLA breached). PagerDuty pages on-call.
05:05 On-call reads runbook, refreshes OAuth credential.
05:10 Sync retries successfully. Freshness lag drops to 10 minutes.

Rule of thumb. Freshness lag is the right top-line SLI for a sync — not "did the last run succeed." A sync that runs and succeeds every hour is fine. A sync that runs every 30 minutes but has failed for the last 4 runs is broken, and only the freshness lag catches it.

Worked example — schema drift catches before deploy

Detailed explanation. Schema drift happens when a model's column type or name changes in a way the downstream sync cannot accept. The right place to catch it is in dbt CI, before merge — not in production after the sync starts failing.

Question. Configure dbt contracts on the reverse_etl_customer_state model and walk through what happens when a developer tries to rename lifetime_revenue to lifetime_value without coordinating with the sync.

Input.

# dbt contract on the model.
models:
  - name: reverse_etl_customer_state
    config:
      contract:
        enforced: true
    columns:
      - name: salesforce_contact_id
        data_type: varchar
        tests: [unique, not_null]
      - name: name
        data_type: varchar
      - name: lifetime_orders
        data_type: integer
      - name: lifetime_revenue
        data_type: numeric
      - name: last_order_at
        data_type: timestamp
Enter fullscreen mode Exit fullscreen mode

Code.

-- Developer's PR — renames lifetime_revenue.
-- File: models/marts/reverse_etl_customer_state.sql
SELECT
    c.salesforce_contact_id,
    c.name,
    COUNT(o.order_id)                AS lifetime_orders,
    COALESCE(SUM(o.amount), 0)       AS lifetime_value,   -- renamed!
    MAX(o.order_date)                AS last_order_at
FROM dim_customers c
LEFT JOIN fact_orders o ON o.customer_id = c.customer_id
GROUP BY c.salesforce_contact_id, c.name;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The developer renames lifetime_revenue to lifetime_value in the SELECT clause.
  2. dbt CI runs dbt build. The contract check inspects the actual output schema against the declared columns: list.
  3. The output column lifetime_value does not match the declared lifetime_revenue. dbt fails the build with a clear error: "column lifetime_revenue not produced; column lifetime_value produced unexpectedly."
  4. The CI failure blocks the merge. The developer either reverts the rename or files a coordinated migration (rename in dbt + rename mapping in sync + cutover plan).
  5. Without the contract, the rename would merge, the next sync run would silently ship NULL for lifetime_revenue (Salesforce field overwritten with NULL), and the marketing team would discover the bug three days later when their nurture sequence fires for everyone.

Output.

Stage Outcome
Pre-contract rename merges, sync silently writes NULL, downstream stale
With contract rename blocked in CI, coordinated migration required

Rule of thumb. Every dbt model with at least one reverse ETL sync should have an enforced contract. The contract is the bridge between "data team owns the model" and "operational team owns the destination" — it makes drift loud instead of silent.

Reverse ETL interview question on the sync as a data product

A senior interviewer often asks: "How do you turn a one-off sync from a side-project into a production data product? What does the full lifecycle look like?"

Solution Using the data-product lifecycle

The data-product lifecycle for a reverse ETL sync:

1. INTAKE
   - Consumer team files a sync request.
   - Required fields: model, destination, fields, cadence,
     SLA, on-call owner.

2. DESIGN
   - Analytics engineer reviews the model PK + idempotency.
   - PII tags audited; destination policy verified.
   - Audience defined if filtering required.
   - dbt contract on the source model.
   - Cost estimate (quota + MTU).

3. BUILD
   - Sync YAML / config committed to git.
   - CI runs dbt build + sync linting.
   - PR review by analytics engineering.

4. DEPLOY
   - Sync deployed to staging destination first.
   - Manual QA on 10 sample rows.
   - Cut over to production destination.

5. MONITOR
   - dbt exposure surfaced in catalog.
   - Freshness alert + row-error alert configured.
   - Queue-depth alert configured.
   - On-call runbook attached.

6. ITERATE
   - Quarterly review of sync health metrics.
   - Audience drift review (size still in expected range?).
   - Destination policy review (PII still compliant?).
   - Cost review (still inside quota envelope?).

7. RETIRE
   - When the consumer no longer needs it: archive the sync,
     drop the dbt exposure, document the deprecation.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Phase Owner Output
Intake Consumer team + AE sync request ticket
Design Analytics engineering sync design doc
Build Analytics engineering sync YAML + PR
Deploy Analytics engineering staging then prod
Monitor Data on-call dashboards + alerts
Iterate Analytics engineering quarterly review notes
Retire Analytics engineering deprecation note

The discipline is the same as any backend service. The vocabulary borrows from product management (intake, MVP, monitoring, deprecation) more than from data engineering (model, refresh, materialise).

Output:

Artifact Where it lives
Sync config dbt repo / sync YAML
dbt contract model schema.yml
dbt exposure exposures.yml
Alerts observability platform
Runbook alert body + wiki
Cost budget per-sync row cap + quota share
On-call rota PagerDuty schedule

Why this works — concept by concept:

  • Intake gates entry — not every "we want a sync" idea becomes a sync. The intake form forces the consumer to articulate model, destination, SLA, and ownership before any engineering time is spent.
  • dbt contracts gate change — every sync model has an enforced contract. Drift is caught at PR time, not at production-failure time.
  • Exposures surface lineage — the data catalog knows every sync. When a model changes, the catalog shows every downstream sync that will be affected.
  • Alerts surface failure — freshness lag, row-error rate, and queue depth are the three SLIs. Every sync has them; on-call wakes up to them.
  • Quarterly review surfaces drift — audiences grow, costs shift, PII policy evolves. Quarterly review catches slow drift before it becomes an incident.
  • Retirement is explicit — syncs are retired explicitly, not abandoned. A retired sync is archived in git and removed from exposures so the catalog stays accurate.
  • Cost — the discipline is overhead. For a low-stakes internal sync, the full lifecycle is overkill. For any sync touching customer-facing automation, the lifecycle is the floor.

Data
Topic — data-transformation
Data transformation problems (data engineering)

Practice →


Cheat sheet — reverse ETL recipes

  • Lead score → Salesforce. Model fct_lead_score (one row per Salesforce contact) → audience "lead_score >= 80" → upsert into Contact.lead_score__c. Cadence: 30 minutes. Use composite API batching for 200 rows/call.
  • Account churn risk → Intercom. Model dim_accounts with churn_risk → audience "churn_risk > 0.7" → mirror sync sets Company.churn_risk_tag = at_risk and clears the tag when the account drops out of the audience.
  • High-value users → Facebook custom audience. Model dim_users joined to fct_user_revenue → audience "ltv_usd > 5000" → mirror sync hashes emails and pushes to a Meta custom audience. Reflects add/remove automatically on each run.
  • Slack high-value signup alert. Model fct_signups filtered to "plan = pro AND first_seen_at >= today" → RudderStack event sync → Slack webhook posts to #sales-alerts with the new account name + plan + region.
  • Marketing suppression list. Model dim_users filtered to "opted_out = true OR gdpr_deleted = true" → mirror sync to every marketing destination's suppression list (Marketo, Iterable, Customer.io, Mailchimp).
  • Reverse ETL → product analytics. Model marts.user_cohorts with (user_id, cohort_label) → upsert into Amplitude's cohorts API, mirrored to Mixpanel's cohort endpoint. Lets PMs filter funnels by warehouse-defined cohorts.
  • GDPR delete pipeline. Model users_to_delete (one row per requested deletion) → delete-only sync fanned out to Salesforce, HubSpot, Marketo, Intercom, Iterable, Facebook. Idempotent: a row deleted twice is a no-op.
  • Trial-ending sequence trigger. Model dim_users filtered to "plan = trial AND trial_ends_at BETWEEN today AND today + 7" → mirror sync to Iterable user property trial_end_date. Iterable workflow fires the in-app + email sequence.
  • Customer attribute fan-out. Single model marts.customer_attributes (one row per customer) → multiple syncs to Salesforce, HubSpot, Intercom, Iterable each picking the columns they need. One source, many destinations.
  • Sales territory routing. Model dim_accounts with territory_code → upsert into Salesforce Account.RoutingTerritory__c. Pairs with a Salesforce assignment rule that reads the field at lead creation.
  • NPS score sync. Model marts.nps (one row per account with rolling NPS) → upsert into Salesforce Account.nps_rolling__c. Customer success team filters Salesforce dashboards by NPS bucket.
  • Webhook fan-out. Model fct_account_events (one row per significant account event) → RudderStack event sync → internal API webhook, Slack channel, and Salesforce task creation in parallel.

Frequently asked questions

Is reverse ETL the same as a CDP?

Not quite — they overlap but solve different starting problems. A CDP (Customer Data Platform like Segment or RudderStack Event) collects events from your sources and forwards them to destinations; the warehouse is optional. Reverse ETL starts from the warehouse — it assumes you already have a single source of truth for customer attributes and ships that to destinations. The modern stack often uses both: a CDP collects events into the warehouse (forward path), and a reverse ETL tool ships warehouse-aggregated state back to operational tools (reverse path). RudderStack is unusual in offering both in one product; Hightouch and Census focus on the reverse ETL half only.

Do I need a customer data warehouse before reverse ETL?

Yes — you need a warehouse and a single canonical definition of the entity you want to sync. The warehouse can be Snowflake, BigQuery, Databricks, Redshift, or Postgres; it does not have to be branded a "customer data warehouse." What matters is that one SQL query produces one row per entity with the attributes you need to ship. If your data is still scattered across SaaS tools with no aggregation layer, you have a forward ETL problem first, and reverse ETL has nothing to sync.

How is Hightouch different from Census?

Hightouch optimises for the GTM / revenue ops persona — drag-and-drop audience builder, multi-channel journeys (Hightouch Sequences), broad destination catalogue (200+), strong observability with row-level error inspection. Census optimises for the analytics engineering / data team persona — tightest dbt integration of any vendor (reads dbt_project.yml, surfaces exposures, git-backed sync configs), SQL-first audience model, sync-test gating tied to dbt tests. Pick Hightouch when non-SQL users own the audience layer; pick Census when the data team owns it end-to-end and dbt is the source of truth.

Can I build reverse ETL myself with Airflow + APIs?

Yes, technically — and you should not, in practice. A v1 covering 5 destinations takes two senior engineers about 6 months to build: connectors, diff engine, queue + retry, dead-letter inspection, audience builder UI, schema-change detection, audit logging, PII governance. The three production vendors (Hightouch, Census, RudderStack) ship all of that for the price of about one engineer-year per year. The only cases where in-house build wins are (a) you have an extremely narrow scope (one destination, never more), (b) you are at a scale where MTU pricing genuinely hurts (>10M MTU and you can renegotiate hard), or (c) you have a hard BYOC compliance constraint and even RudderStack OSS does not fit.

What latency can reverse ETL realistically deliver?

Batch reverse ETL typically delivers 5–60 minute end-to-end latency, dominated by the warehouse query time plus the destination API throughput. Census claims sub-minute sync on small models with their fastest tier; Hightouch's shared infrastructure typically lands around 5–15 minutes. RudderStack's event-stream reverse ETL path closes the loop in seconds to a minute for individual event triggers but is not magic for batch attribute updates. If your use case requires sub-second response (in-session personalisation, fraud blocking, real-time bidding), reverse ETL is the wrong tool — you want an online feature store or an event-stream architecture that does not round-trip through a warehouse query.

How do I handle GDPR deletes through reverse ETL?

Build a dedicated delete pipeline. The pattern: one warehouse model users_to_delete with one row per requested deletion (user_id, email, requested_at), fanned out as a delete-only sync to every destination that received that user's PII. Each destination has a delete or "right-to-be-forgotten" API; Hightouch and Census both expose delete-only sync modes that wire into them. Idempotency matters — a user deleted twice should be a no-op. Audit-log every delete sync run for compliance evidence. Crucially, the platform itself must be able to delete its sync history for the deleted user; verify your vendor's GDPR posture before committing to PII-heavy syncs.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every reverse ETL recipe above ships with hands-on practice rooms where you design the model, write the idempotent upsert, and reason about rate limits against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your sync design will hold up at scale.

Practice ETL now → API integration drills →

Top comments (0)