Content to cash.
Every article, every trial, every purchase stitched together.
A subscription media platform knew users were converting — but not which articles convinced them. Trial start lived in WordPress, purchase lived in Stripe, and behaviour lived in GA4 with no thread connecting them. We built a high-fidelity cross-platform identity layer that captures the last three articles read before every purchase, measures the exact days from trial to conversion, and centralises all operational data in BigQuery via Airbyte pipelines — with an Airflow DAG guaranteeing reliable automated processing.
Trial in WordPress. Purchase in Stripe. Behaviour in GA4. No thread connecting any of it.
"We knew a user started a trial and we knew they eventually purchased — but we didn't know who they were across devices, or which articles actually convinced them to buy. The most commercially important question in content publishing — 'which content drives paid subscriptions?' — was completely unanswerable."
Three disconnected data systems — no way to link trial to purchase to content
Trial sign-ups were tracked in WordPress using a platform-specific user ID. Subscription purchases were recorded in Stripe using a billing customer ID. Behavioural events — articles read, pages visited, engagement signals — were in GA4 with an anonymous client ID. None of these three identifiers were connected to each other, making it impossible to trace a single user's journey from first article to trial start to paid conversion.
No content attribution — the most valuable editorial question was unanswerable
The content team had no data to answer which articles were driving paid subscriptions. GA4 showed article traffic but had no link to subscription purchases. Stripe showed revenue but had no link to content consumption. The editorial team was making decisions about what to write — and what to feature in the trial experience — without any signal from actual conversion data. The data existed but was locked in separate systems.
No conversion velocity data — timing of nurture campaigns was guesswork
The marketing team was sending re-engagement sequences to trial users without knowing when users were most likely to convert. Without a calculated conversion window — the precise number of days between trial start and purchase — every nurture campaign used arbitrary timing. The 7-day, 14-day, and 30-day thresholds being used had no grounding in actual behaviour, meaning the most impactful intervention moments were being missed.
No reliable workflow automation — file processing was manual and error-prone
Processing large volumes of operational files — user data exports, bulk reports, upload batches — was being handled manually. There was no audit trail, no retry logic for failures, and no guarantee that a file wouldn't be processed twice or missed entirely. The operational overhead was significant, and the risk of data errors in business-critical workflows was high without a systematic, automated processing layer.
Three capabilities the business never had before — now built into the data layer.
Identity Stitching
The anonymous GA4 client ID is now linked to the WordPress user ID (wp_user_id) at the moment of trial sign-up. A user who reads articles on their phone and subscribes on their laptop is now recognised as one individual — their entire behavioural history is connected to their purchase event.
Conversion Velocity
The system calculates the exact number of days between trial start and subscription purchase for every converting user. This conversion window is captured as a parameter on the purchase event and stored in BigQuery — enabling cohort analysis by time-to-convert and identifying the conversion deadline thresholds where intervention has the highest impact.
Behavioural Breadcrumb Trail
For every subscription purchase, the system captures the last three articles the user read before converting. These "closer" articles — the content that appeared in the final stage of the decision journey — are recorded as parameters on the purchase event and queryable in BigQuery alongside the full subscription record.
Three phases — trial start, content sequencing, conversion capture — all automated.
Trial Start: Identity + Entry Article Captured
The moment a user clicks "Start Trial," the system fires a GA4 event that captures the WordPress user ID (wp_user_id) and the entry article — the content that drove the trial sign-up. This is the anchor point that stitches the anonymous browsing session to the known user identity, and records where in the content journey the trial began.
Content Sequencing: Breadcrumb Trail Maintained
As the user reads articles during their trial period, the tracking layer silently maintains a real-time rolling record of the most recently visited content. The breadcrumb trail updates with each article view — always holding the last three articles — without any user-visible interaction. The trail is ready to be sent at the moment of conversion.
Purchase: Full Data Package Sent to GA4
When the transaction confirms, a comprehensive GA4 event fires containing: the stitched user ID, the calculated conversion window (e.g., "11 days since trial start"), and the array of last three articles. This single event is the complete attribution record — linking user identity, conversion timing, and content influence in one queryable payload that lands in BigQuery.
-
01
→ Identity stitching — linking GA4 to WordPress
wp_user_id passed to GA4 at trial start — the session becomes permanently identified.
The critical architectural decision in this system is when and how the anonymous GA4 session gets linked to the known WordPress user. At the trial start event, the wp_user_id is pushed into the GA4 user_id field — this retroactively associates all prior anonymous events in that session with the known identity, and all future events in any session where the user is logged in. The entry article is captured simultaneously as an event parameter, recording which specific content piece was active in the session when the trial began.
- Trial start event fires with wp_user_id as user_id parameter in GA4
- Entry article ID captured as event parameter — the content that triggered the trial
- Cross-device linkage: same wp_user_id recognised on any device where user is logged in
- Anonymous pre-trial behaviour retroactively linked to the identified user in analysis
-
02
→ Conversion event — full attribution payload
Purchase fires with user ID, conversion window, and last 3 articles — all in one event.
The purchase event is the most data-rich event in the system. It carries: the wp_user_id for identity continuity; the conversion_window_days value calculated from the trial start timestamp to the purchase timestamp; and the articles_read_before_conversion array containing the IDs of the last three articles the user read before purchasing. All three values are registered as GA4 custom dimensions, making them available as dimensions in any GA4 report or BigQuery export query. The result is a single event that a content analyst, marketer, or product manager can query to understand the full conversion story for any individual subscriber.
GA4 owns event collection and identity.
GA4 receives all three phase events — trial start, article views, and purchase — with the wp_user_id stitching them together as one user journey. The raw, unaggregated events export directly to BigQuery via the native GA4 → BigQuery export, preserving every event parameter at full fidelity for downstream modelling.
BigQuery owns all operational source data.
Stripe billing records, WordPress user profiles, and Iterable campaign data all flow into BigQuery via Airbyte pipelines. This means the purchase event in GA4 can be joined to the Stripe record for revenue data, the WordPress profile for user attributes, and the Iterable record for the marketing campaign that was active — providing full-circle attribution from campaign to content to revenue.
Airflow owns all automated file processing.
The Airflow DAG handles all operational file processing — checking Hasura for pending files, retrieving from Google Cloud Storage, processing, and updating final status. Idempotent design ensures no file is processed twice even if the DAG runs more frequently than expected. Retry handling covers transient failures without manual intervention.
BigQuery as the single source of truth — four pipelines, four data models.
| Data source | Ingestion method | Business value delivered |
|---|---|---|
| Google Analytics 4 (GA4) | Native GA4 → BigQuery export | Raw, unaggregated user events at full fidelity — preserves every custom parameter including user_id, articles_read, and conversion_window for precise modelling |
| Stripe (billing) | Airbyte → BigQuery | All subscription records, payment events, and revenue data — the financial ground truth for joining to behavioural events |
| WordPress (user profiles) | Airbyte → BigQuery | User profile data keyed by wp_user_id — enables joins between behavioural data, billing data, and user attributes in a single query |
| Iterable (campaigns) | Airbyte → BigQuery | Campaign send, open, and click records — enables attribution from marketing campaign to trial sign-up to paid subscription in one query |
The GA4 → BigQuery native export was chosen over the GA4 API because it provides raw, unaggregated event-level data — every event parameter is preserved at full fidelity. Aggregated API responses would have obscured the custom parameters needed for conversion attribution modelling.
User Profiles
Unified user record joining wp_user_id (WordPress), customer_id (Stripe), and GA4 client_id into a single identity key.
- wp_user_id — primary key
- Stripe customer_id
- GA4 client_id
- Registration date
- Subscription status
Sessions
Session-level engagement model — how users navigate the platform, which content they visit, and their engagement signals per session.
- Session ID + user ID
- Pages viewed per session
- Articles read — IDs + timestamps
- Session duration
- Entry and exit pages
Trial Events
Every trial start event with identity, entry article, and the trial start timestamp used to calculate conversion velocity downstream.
- wp_user_id + GA4 client_id
- Trial start timestamp
- Entry article ID
- User type at trial start
- Campaign source (Iterable)
Purchase Events
Every paid conversion with full attribution — user identity, conversion window in days, and the last three articles read before purchasing.
- wp_user_id + Stripe customer_id
- conversion_window_days
- articles_read_before_conversion[3]
- Subscription plan + revenue
- Campaign attribution (Iterable)
Airflow DAG — hourly processing with retry handling and idempotency guarantees.
Slow, error-prone, and impossible to audit — manual processing doesn't scale.
Before the Airflow pipeline was in place, processing large volumes of operational files — user data exports, bulk reports, upload batches — was handled manually. There was no consistent audit trail showing which files had been processed and when. If a process failed midway, there was no automatic retry — someone had to notice and re-run it manually. And there was no protection against a file being processed twice, which could produce duplicate records in downstream systems.
For a subscription business where billing records and user data are the commercial ground truth, this level of operational unreliability in data processing is a direct business risk — not just a technical inconvenience.
- Runs hourly — checks Hasura for files with status "pending"
- Retrieves each pending file from Google Cloud Storage
- Processes the file according to its type and routing rules
- Updates final status in Hasura on success or failure
- Retry handling — transient failures retry automatically without manual intervention
- Idempotent design — a file processed twice produces the same result as once
- Full audit trail — every processing attempt logged with timestamp and outcome
-
01
→ Retry handling — transient failures covered automatically
A failed process retries safely — no manual intervention, no silent data gaps.
The Airflow DAG is built with explicit retry logic for each task. If a processing step fails — due to a transient network error, a temporary API unavailability, or a downstream service timeout — the task automatically retries with a configured back-off interval. Only if all retries are exhausted does the DAG mark the task as failed and surface it for investigation. This means that the vast majority of transient failures are resolved without any human involvement, and the audit log captures every retry attempt so the cause of any persistent failure is fully traceable.
-
02
→ Idempotency — the same file can be processed twice with identical results
No duplicate records, no data corruption — even if the DAG runs unexpectedly twice.
Idempotency is the architectural guarantee that processing the same input more than once produces exactly the same output as processing it once. In the Airflow pipeline, this is implemented by checking the Hasura status field before processing — if a file's status is already "complete," it is skipped without re-processing. This means that even if the hourly DAG runs more frequently than expected, or if a retry re-processes an item that was already successfully handled, the downstream data remains clean. This property is non-negotiable for billing and user data pipelines where duplicate processing would produce financial reporting errors.
GA4, Airbyte, BigQuery, Airflow — each layer owns its function precisely.
Google Analytics 4
GA4 collects all three tracking phases — trial start, article views, and purchase — with custom event parameters capturing the wp_user_id, conversion_window_days, and articles_read array. The native GA4 → BigQuery export pushes raw unaggregated events to the warehouse daily, preserving every custom dimension at full event-level fidelity.
Airbyte
Airbyte manages the ELT pipelines for Stripe, WordPress, and Iterable into BigQuery. Incremental sync keeps each source current without full daily reloads. Schema evolution is handled automatically — when Stripe or Iterable add new fields, Airbyte propagates them to BigQuery without manual intervention.
Google BigQuery
BigQuery is the single source of truth — all four data sources land here. The four data models (User Profiles, Sessions, Trial Events, Purchase Events) are structured as queryable tables with shared identity keys linking across sources. Scheduled queries materialise the four models daily from the raw staging tables.
Apache Airflow
Airflow runs the hourly operational file processing DAG — checking Hasura, retrieving from Google Cloud Storage, processing, and updating status. Retry handling and idempotency are built into the DAG design. Every run is logged with full task-level audit trails. The infrastructure runs on Google Cloud Composer, Airflow's managed deployment on GCP.
Subscriptions converting every day — but no idea which content is closing them or how long it's taking.
We build the identity layer, the BigQuery pipelines, and the data models that connect content consumption to revenue — so your team can finally answer which articles drive paid subscriptions.
Book a 30-min call →US · UK · EU · APAC · Billing in USD.