CLIENT CASE STUDY — Subscription Platform — Conversion Attribution + Data Engine

Data Engineering · D11 · GA4 + Stripe + WordPress + Iterable → BigQuery + Airflow

Rev. 2026.05 · Live client

Content to cash.
Every article, every trial, every purchase stitched together.

→ LIVE CLIENT Subscription platform — unified conversion attribution + scalable BigQuery data infrastructure GA4 · Stripe · WordPress · Iterable · Airflow

A subscription media platform knew users were converting — but not which articles convinced them. Trial start lived in WordPress, purchase lived in Stripe, and behaviour lived in GA4 with no thread connecting them. We built a high-fidelity cross-platform identity layer that captures the last three articles read before every purchase, measures the exact days from trial to conversion, and centralises all operational data in BigQuery via Airbyte pipelines — with an Airflow DAG guaranteeing reliable automated processing.

Discuss a similar build → ← Back to all case studies

4 data sources unified in BigQuery 3 articles captured per conversion Cross-device identity stitching 99.99% workflow reliability

Section01

— The challenge

Trial in WordPress. Purchase in Stripe. Behaviour in GA4. No thread connecting any of it.

Data silos
→ single source of truth

"We knew a user started a trial and we knew they eventually purchased — but we didn't know who they were across devices, or which articles actually convinced them to buy. The most commercially important question in content publishing — 'which content drives paid subscriptions?' — was completely unanswerable."

→ Problem 01

Three disconnected data systems — no way to link trial to purchase to content

Trial sign-ups were tracked in WordPress using a platform-specific user ID. Subscription purchases were recorded in Stripe using a billing customer ID. Behavioural events — articles read, pages visited, engagement signals — were in GA4 with an anonymous client ID. None of these three identifiers were connected to each other, making it impossible to trace a single user's journey from first article to trial start to paid conversion.

→ Problem 02

No content attribution — the most valuable editorial question was unanswerable

The content team had no data to answer which articles were driving paid subscriptions. GA4 showed article traffic but had no link to subscription purchases. Stripe showed revenue but had no link to content consumption. The editorial team was making decisions about what to write — and what to feature in the trial experience — without any signal from actual conversion data. The data existed but was locked in separate systems.

→ Problem 03

No conversion velocity data — timing of nurture campaigns was guesswork

The marketing team was sending re-engagement sequences to trial users without knowing when users were most likely to convert. Without a calculated conversion window — the precise number of days between trial start and purchase — every nurture campaign used arbitrary timing. The 7-day, 14-day, and 30-day thresholds being used had no grounding in actual behaviour, meaning the most impactful intervention moments were being missed.

→ Problem 04

No reliable workflow automation — file processing was manual and error-prone

Processing large volumes of operational files — user data exports, bulk reports, upload batches — was being handled manually. There was no audit trail, no retry logic for failures, and no guarantee that a file wouldn't be processed twice or missed entirely. The operational overhead was significant, and the risk of data errors in business-critical workflows was high without a systematic, automated processing layer.

Section02

— What the system delivers

Three capabilities the business never had before — now built into the data layer.

Identity
Velocity
Attribution

→ Insight 01

Identity Stitching

The anonymous GA4 client ID is now linked to the WordPress user ID (wp_user_id) at the moment of trial sign-up. A user who reads articles on their phone and subscribes on their laptop is now recognised as one individual — their entire behavioural history is connected to their purchase event.

Business impact: Eliminates double-counting of users. Provides a precise view of true cross-device behaviour and removes the attribution gap between content consumption and revenue.

→ Insight 02

Conversion Velocity

The system calculates the exact number of days between trial start and subscription purchase for every converting user. This conversion window is captured as a parameter on the purchase event and stored in BigQuery — enabling cohort analysis by time-to-convert and identifying the conversion deadline thresholds where intervention has the highest impact.

Business impact: Marketing can build hyper-targeted campaigns for users approaching day 7, 14, or 30 — timed to real conversion behaviour, not arbitrary scheduling.

→ Insight 03

Behavioural Breadcrumb Trail

For every subscription purchase, the system captures the last three articles the user read before converting. These "closer" articles — the content that appeared in the final stage of the decision journey — are recorded as parameters on the purchase event and queryable in BigQuery alongside the full subscription record.

Business impact: Content teams can identify the specific articles that close subscriptions. Product teams know which content to surface prominently in the trial experience.

Section03

— How the tracking works

Three phases — trial start, content sequencing, conversion capture — all automated.

Phase 1 · Entry
Phase 2 · Journey
Phase 3 · Conversion

→ Phase 01 — Entry Point

Trial Start: Identity + Entry Article Captured

The moment a user clicks "Start Trial," the system fires a GA4 event that captures the WordPress user ID (wp_user_id) and the entry article — the content that drove the trial sign-up. This is the anchor point that stitches the anonymous browsing session to the known user identity, and records where in the content journey the trial began.

→ Phase 02 — The Journey

Content Sequencing: Breadcrumb Trail Maintained

As the user reads articles during their trial period, the tracking layer silently maintains a real-time rolling record of the most recently visited content. The breadcrumb trail updates with each article view — always holding the last three articles — without any user-visible interaction. The trail is ready to be sent at the moment of conversion.

→ Phase 03 — Conversion

Purchase: Full Data Package Sent to GA4

When the transaction confirms, a comprehensive GA4 event fires containing: the stitched user ID, the calculated conversion window (e.g., "11 days since trial start"), and the array of last three articles. This single event is the complete attribution record — linking user identity, conversion timing, and content influence in one queryable payload that lands in BigQuery.

01
→ Identity stitching — linking GA4 to WordPress
wp_user_id passed to GA4 at trial start — the session becomes permanently identified.

The critical architectural decision in this system is when and how the anonymous GA4 session gets linked to the known WordPress user. At the trial start event, the wp_user_id is pushed into the GA4 user_id field — this retroactively associates all prior anonymous events in that session with the known identity, and all future events in any session where the user is logged in. The entry article is captured simultaneously as an event parameter, recording which specific content piece was active in the session when the trial began.
- Trial start event fires with wp_user_id as user_id parameter in GA4
- Entry article ID captured as event parameter — the content that triggered the trial
- Cross-device linkage: same wp_user_id recognised on any device where user is logged in
- Anonymous pre-trial behaviour retroactively linked to the identified user in analysis
02
→ Conversion event — full attribution payload
Purchase fires with user ID, conversion window, and last 3 articles — all in one event.

The purchase event is the most data-rich event in the system. It carries: the wp_user_id for identity continuity; the conversion_window_days value calculated from the trial start timestamp to the purchase timestamp; and the articles_read_before_conversion array containing the IDs of the last three articles the user read before purchasing. All three values are registered as GA4 custom dimensions, making them available as dimensions in any GA4 report or BigQuery export query. The result is a single event that a content analyst, marketer, or product manager can query to understand the full conversion story for any individual subscriber.

→ Behaviour + identity

GA4 owns event collection and identity.

GA4 receives all three phase events — trial start, article views, and purchase — with the wp_user_id stitching them together as one user journey. The raw, unaggregated events export directly to BigQuery via the native GA4 → BigQuery export, preserving every event parameter at full fidelity for downstream modelling.

→ Billing + operational systems

BigQuery owns all operational source data.

Stripe billing records, WordPress user profiles, and Iterable campaign data all flow into BigQuery via Airbyte pipelines. This means the purchase event in GA4 can be joined to the Stripe record for revenue data, the WordPress profile for user attributes, and the Iterable record for the marketing campaign that was active — providing full-circle attribution from campaign to content to revenue.

→ Workflow reliability

Airflow owns all automated file processing.

The Airflow DAG handles all operational file processing — checking Hasura for pending files, retrieving from Google Cloud Storage, processing, and updating final status. Idempotent design ensures no file is processed twice even if the DAG runs more frequently than expected. Retry handling covers transient failures without manual intervention.

Section04

— The data warehouse

BigQuery as the single source of truth — four pipelines, four data models.

GA4 · Stripe
WordPress · Iterable
→ BigQuery

Data source	Ingestion method	Business value delivered
Google Analytics 4 (GA4)	Native GA4 → BigQuery export	Raw, unaggregated user events at full fidelity — preserves every custom parameter including user_id, articles_read, and conversion_window for precise modelling
Stripe (billing)	Airbyte → BigQuery	All subscription records, payment events, and revenue data — the financial ground truth for joining to behavioural events
WordPress (user profiles)	Airbyte → BigQuery	User profile data keyed by wp_user_id — enables joins between behavioural data, billing data, and user attributes in a single query
Iterable (campaigns)	Airbyte → BigQuery	Campaign send, open, and click records — enables attribution from marketing campaign to trial sign-up to paid subscription in one query

The GA4 → BigQuery native export was chosen over the GA4 API because it provides raw, unaggregated event-level data — every event parameter is preserved at full fidelity. Aggregated API responses would have obscured the custom parameters needed for conversion attribution modelling.

→ Four analytics-ready data models built in BigQuery

→ Model 01

User Profiles

Unified user record joining wp_user_id (WordPress), customer_id (Stripe), and GA4 client_id into a single identity key.

wp_user_id — primary key
Stripe customer_id
GA4 client_id
Registration date
Subscription status

→ Model 02

Sessions

Session-level engagement model — how users navigate the platform, which content they visit, and their engagement signals per session.

Session ID + user ID
Pages viewed per session
Articles read — IDs + timestamps
Session duration
Entry and exit pages

→ Model 03

Trial Events

Every trial start event with identity, entry article, and the trial start timestamp used to calculate conversion velocity downstream.

wp_user_id + GA4 client_id
Trial start timestamp
Entry article ID
User type at trial start
Campaign source (Iterable)

→ Model 04

Purchase Events

Every paid conversion with full attribution — user identity, conversion window in days, and the last three articles read before purchasing.

wp_user_id + Stripe customer_id
conversion_window_days
articles_read_before_conversion[3]
Subscription plan + revenue
Campaign attribution (Iterable)

Section05

— Operational reliability

Airflow DAG — hourly processing with retry handling and idempotency guarantees.

Hourly execution
Retry on failure
99.99% reliability

→ The problem with manual file processing

Slow, error-prone, and impossible to audit — manual processing doesn't scale.

Before the Airflow pipeline was in place, processing large volumes of operational files — user data exports, bulk reports, upload batches — was handled manually. There was no consistent audit trail showing which files had been processed and when. If a process failed midway, there was no automatic retry — someone had to notice and re-run it manually. And there was no protection against a file being processed twice, which could produce duplicate records in downstream systems.

For a subscription business where billing records and user data are the commercial ground truth, this level of operational unreliability in data processing is a direct business risk — not just a technical inconvenience.

→ Airflow DAG — processing logic

Runs hourly — checks Hasura for files with status "pending"
Retrieves each pending file from Google Cloud Storage
Processes the file according to its type and routing rules
Updates final status in Hasura on success or failure
Retry handling — transient failures retry automatically without manual intervention
Idempotent design — a file processed twice produces the same result as once
Full audit trail — every processing attempt logged with timestamp and outcome

01
→ Retry handling — transient failures covered automatically
A failed process retries safely — no manual intervention, no silent data gaps.

The Airflow DAG is built with explicit retry logic for each task. If a processing step fails — due to a transient network error, a temporary API unavailability, or a downstream service timeout — the task automatically retries with a configured back-off interval. Only if all retries are exhausted does the DAG mark the task as failed and surface it for investigation. This means that the vast majority of transient failures are resolved without any human involvement, and the audit log captures every retry attempt so the cause of any persistent failure is fully traceable.
02
→ Idempotency — the same file can be processed twice with identical results
No duplicate records, no data corruption — even if the DAG runs unexpectedly twice.

Idempotency is the architectural guarantee that processing the same input more than once produces exactly the same output as processing it once. In the Airflow pipeline, this is implemented by checking the Hasura status field before processing — if a file's status is already "complete," it is skipped without re-processing. This means that even if the hourly DAG runs more frequently than expected, or if a retry re-processes an item that was already successfully handled, the downstream data remains clean. This property is non-negotiable for billing and user data pipelines where duplicate processing would produce financial reporting errors.

Section06

— The stack

GA4, Airbyte, BigQuery, Airflow — each layer owns its function precisely.

No custom ingestion
proven toolchain
fully maintainable

→ Event collection + identity

Google Analytics 4

GA4 collects all three tracking phases — trial start, article views, and purchase — with custom event parameters capturing the wp_user_id, conversion_window_days, and articles_read array. The native GA4 → BigQuery export pushes raw unaggregated events to the warehouse daily, preserving every custom dimension at full event-level fidelity.

→ Operational data ingestion

Airbyte

Airbyte manages the ELT pipelines for Stripe, WordPress, and Iterable into BigQuery. Incremental sync keeps each source current without full daily reloads. Schema evolution is handled automatically — when Stripe or Iterable add new fields, Airbyte propagates them to BigQuery without manual intervention.

→ Warehouse + data models

Google BigQuery

BigQuery is the single source of truth — all four data sources land here. The four data models (User Profiles, Sessions, Trial Events, Purchase Events) are structured as queryable tables with shared identity keys linking across sources. Scheduled queries materialise the four models daily from the raw staging tables.

→ Workflow automation

Apache Airflow

Airflow runs the hourly operational file processing DAG — checking Hasura, retrieving from Google Cloud Storage, processing, and updating status. Retry handling and idempotency are built into the DAG design. Every run is logged with full task-level audit trails. The infrastructure runs on Google Cloud Composer, Airflow's managed deployment on GCP.

001 / Data sources in BigQuery

GA4 · Stripe · WordPress · Iterable — all centralised, all queryable

002 / Analytics models built

User Profiles · Sessions · Trial Events · Purchase Events

003 / Articles per conversion

Last 3 articles read before every paid subscription — captured at conversion point

004 / Conversion window precision

Days

Exact days from trial start to purchase — calculated for every converting subscriber

005 / Airflow processing cadence

Hourly

DAG runs every hour with retry handling and idempotency — 99.99% reliability

006 / Identity gaps eliminated

Zero

Cross-device identity stitching at trial start — anonymous GA4 session linked to wp_user_id

Section07

Subscriptions converting every day — but no idea which content is closing them or how long it's taking.

We build the identity layer, the BigQuery pipelines, and the data models that connect content consumption to revenue — so your team can finally answer which articles drive paid subscriptions.

Book a 30-min call →

Fixed-price engagements only. We scope it, price it, ship it.
US · UK · EU · APAC · Billing in USD.