The Problem: Fragmented Customer Context
Picture this: A sales rep is preparing for a crucial follow-up call with a prospect. They need to know what was discussed in previous meetings, which emails were sent, and who from their team has engaged with this account.
So they open:
Google Calendar to see when meetings happened
Gong to review call recordings and transcripts
Salesforce to check logged activities and notes
Gmail to scan through email threads
After 15 minutes of clicking through tabs and searching across systems, they've pieced together a fragmented picture. But they still miss critical context—like that meeting their colleague had last week, or the email thread that went cold after a specific call.
This isn't just inefficient. It's a fundamental problem: revenue teams have a trove of communication data locked away in different systems, but no single source of truth to analyze all interactions and surface actionable insights.
When a deal stalls, reps can't easily answer questions like "What happened after our demo?" or "When did engagement drop off?" The data exists, but it's scattered.
The Challenge: Messy Data Across Vendor Systems
Building a unified view isn't just about pulling data from multiple sources. It's about reconciling fundamentally incompatible representations of the same reality.
Three Sources of Complexity
1. Data Lives in Silos
Google/Microsoft own calendar and email data, Salesforce controls the CRM, and tools like Gong own the call transcripts
2. Every Vendor Uses Different Schemas
Each vendor structures meeting data differently. Google Calendar handles recurring meetings with instance identifiers, whereas Microsoft Outlook represents recurring meetings as linked series. Salesforce has a relaxed schema where reps manually log meetings with inconsistent field usage.
3. Vendor-Specific Anomalies
Beyond schema differences, each system has its own quirks: some record "scheduled start time" while others record "actual join time", and so on.
Our Normalization Pipeline
To combat this chaos, the Rox system implements a multi-stage normalization pipeline (a simplified sketch follows the list):
Source-Specific Standardization: As data arrives from each vendor API, we massage it into a common schema for that source type (calendar, call recorder, CRM)
Anomaly Removal: Before copying data to the warehouse for graph processing, we apply source-specific cleanup rules:
Normalize time zones to UTC
Trim and lowercase string fields for matching
Field Mapping: We map vendor-specific fields to our canonical event schema, preserving the original values as metadata in case we need to trace back to source systems
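To make this concrete, here is a minimal Python sketch of the standardization and field-mapping steps above. The field map, vendor payload shapes, and helper names are illustrative, not our production schema:

```python
from datetime import datetime, timezone

# Illustrative vendor-field -> canonical-field mappings, one per source type.
FIELD_MAP = {
    "google_calendar": {"summary": "title", "start.dateTime": "start_time", "id": "event_id"},
    "gong": {"title": "title", "scheduled": "start_time", "callId": "event_id"},
}

def get_path(record: dict, dotted_key: str):
    """Resolve a dotted key like 'start.dateTime' against a nested payload."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value

def normalize(record: dict, source: str) -> dict:
    """Map one raw vendor record onto the canonical event schema."""
    canonical = {"source": source, "raw": record}  # keep the original for traceability
    for vendor_field, canonical_field in FIELD_MAP[source].items():
        value = get_path(record, vendor_field)
        if canonical_field == "start_time" and value is not None:
            # Normalize timestamps to UTC (assumes ISO-8601 strings with offsets).
            value = datetime.fromisoformat(value).astimezone(timezone.utc)
        elif isinstance(value, str) and canonical_field != "event_id":
            # Trim and lowercase free-text fields so later matching is consistent;
            # IDs are left untouched.
            value = value.strip().lower()
        canonical[canonical_field] = value
    return canonical
```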
This normalized data becomes the foundation for the next challenge: figuring out which records across these different systems represent the same real-world meeting.

The Solution: Multi-Field Entity Resolution
The core challenge: determining that a calendar event, a Gong recording, and a Salesforce activity all represent the same real-world meeting—despite different schemas, missing fields, and vendor-specific quirks.
Field Relationship Architecture
Our entity resolution system works by defining pairwise field relationships between data sources. These relationships tell the system which fields can be compared to determine if records from different systems represent the same real-world event.

Key Design Principle: By explicitly defining which fields can be matched across systems, we enable flexible, multi-signal matching. The relationships are defined pairwise between every data source combination, allowing the system to determine entity equivalence even when some fields are missing or malformed.
When event_id is missing or malformed (common in manually-logged CRM data), the system automatically falls back to other defined relationships like title + start_time. This multi-signal approach increases recall by 35% over single-field matching.
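As a rough illustration, the pairwise relationships and the fallback behavior can be expressed like this; the relationship table and helper below are hypothetical names, not our production API:

```python
# Ordered field relationships per source pair: the first relationship that is
# fully populated on both sides decides the match.
FIELD_RELATIONSHIPS = {
    ("calendar", "gong"): [
        ["event_id"],                # strongest signal when both systems carry it
        ["title", "start_time"],     # fallback when event_id is missing or malformed
    ],
    ("gong", "salesforce"): [
        ["event_id"],
        ["title", "start_time"],
        ["organizer_email", "start_time"],
    ],
}

def records_match(a: dict, b: dict, source_pair: tuple[str, str]) -> bool:
    """Return True if any defined field relationship links the two records."""
    for fields in FIELD_RELATIONSHIPS[source_pair]:
        values_a = [a.get(f) for f in fields]
        values_b = [b.get(f) for f in fields]
        # A relationship only counts when every field in it is populated on both sides.
        if all(values_a) and all(values_b) and values_a == values_b:
            return True
    return False
```

Because the relationships are ordered and independent, a missing or malformed field simply means the next relationship gets a chance to match.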
Sequential Join Process
We process data sources sequentially, building up a unified graph. But there's a problem: after joining all sources, the same event may appear multiple times if different matching strategies created multiple candidate matches. A calendar event might match a Gong recording via event_id, and the same Gong recording might independently match a Salesforce activity via title + time. This creates duplicate unified entities that need consolidation.
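In PySpark terms, the sequential join looks roughly like the sketch below; the DataFrame and column names are illustrative, and the real pipeline tries every defined field relationship rather than the single title + start_time key shown here:

```python
from pyspark.sql import DataFrame, functions as F

def build_unified_graph(calendar_df: DataFrame,
                        gong_df: DataFrame,
                        sf_df: DataFrame) -> DataFrame:
    """Join sources one at a time, accumulating source IDs on unified rows."""
    # Calendar events form the backbone of the graph.
    unified = calendar_df.select(
        F.col("event_id").alias("calendar_id"), "title", "start_time")

    # Attach call recordings.
    unified = unified.join(
        gong_df.select(F.col("call_id").alias("gong_id"), "title", "start_time"),
        on=["title", "start_time"], how="full_outer")

    # Attach manually logged CRM activities.
    unified = unified.join(
        sf_df.select(F.col("activity_id").alias("sf_id"), "title", "start_time"),
        on=["title", "start_time"], how="full_outer")

    # The same meeting can now appear on multiple rows if different matching
    # strategies fired; the reducer described next consolidates them.
    return unified
```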
The Reducer Algorithm: Intelligent Deduplication
When we resolve events, we can't simply take the "first" match—we need to intelligently consolidate duplicates while preserving the most reliable data and maintaining referential integrity across all source systems.
The Challenge: Given duplicate rows representing the same event, select the canonical representation that:
Maximizes source coverage (prefers rows with more source IDs populated)
Prioritizes reliable sources (calendar > call recorder > CRM)
Runs in O(n) time (not O(n²) pairwise comparisons), since we're processing millions of records.
Our Solution: A multi-pass weighted reduction algorithm.
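Here is a simplified PySpark sketch of the reducer; the weights, column names, and reliability bonuses are illustrative rather than our production values:

```python
from pyspark.sql import DataFrame, Window, functions as F

SOURCE_COLS = ["calendar_id", "gong_id", "sf_id"]     # ordered by reliability
RELIABILITY = {"calendar_id": 100, "gong_id": 10, "sf_id": 1}

def reduce_duplicates(df: DataFrame) -> DataFrame:
    # Weight = coverage (how many source IDs are populated), plus a small
    # per-source reliability bonus so calendar-backed rows win ties.
    weight = sum(
        F.when(F.col(c).isNotNull(), F.lit(1000 + RELIABILITY[c])).otherwise(F.lit(0))
        for c in SOURCE_COLS
    )
    df = df.withColumn("weight", weight)

    # One pass per source column: among rows sharing that source ID, keep only
    # the heaviest row. Each pass is a partition + rank, never a nested loop.
    for col in SOURCE_COLS:
        window = Window.partitionBy(col).orderBy(F.col("weight").desc())
        df = (
            df.withColumn("rank", F.row_number().over(window))
              .filter(F.col(col).isNull() | (F.col("rank") == 1))  # rows without this source ID are untouched
              .drop("rank")
        )
    return df.drop("weight")
```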

Why This Works:
Completeness: By running multiple passes (one per source column), we ensure that rows with overlapping source IDs are compared and consolidated.
Weight-Based Prioritization: Records with more sources populated naturally get higher weights, and within ties, more reliable sources (calendar > call recorder > CRM) break the tie.
O(n) Complexity: Each pass partitions and ranks—no nested loops. Three passes handle all pairwise combinations transitively.
Correctness: If Row A beats Row B in Pass 1, and Row B would have beaten Row C in Pass 2, then Row A transitively beats Row C when they meet in Pass 2. The transitive property of "total weight" ensures consistent results.
Production Impact: This approach achieves 95%+ precision while processing millions of events in under 10 minutes per customer. The alternative—evaluating all O(n²) combinations—would be computationally infeasible at scale.
Outcome: A unified event graph where each row represents a unique real-world meeting, with source IDs from all systems that captured it. This foundation enables AI agents to reason about complete customer interaction history without manually correlating across silos.
The Scale Problem
Now, let's talk scale. For a mid-sized customer with 500 sales reps, we're looking at:
5-10 million calendar events per year
2-3 million call recordings from Gong/Attention
3-4 million CRM activities logged manually
15-20 million email threads
Every 30 minutes, our graph refresh pipeline needs to:
Ingest hundreds of thousands of new and updated records
Process complex multi-way joins across these massive datasets
Materialize the unified knowledge graph
Write results back to Postgres for real-time querying by AI agents
The naive approach - loading everything into memory and running nested loops - would take hours and exhaust memory. We needed something fundamentally different.
Efficiency & Optimization at Scale
Hash-Based Fingerprinting
Every 30 minutes, we rebuild the knowledge graph. To preserve entity IDs across refreshes (critical for AI agent consistency), we create deterministic fingerprints from source ID combinations.
For an event with calendar_id=cal_12345, gong_id=gong_67890, sf_id=sf_event_abc, we compute: hash("cal_12345|gong_67890|sf_event_abc").
If the same ID combination appears in the next refresh, we reuse the existing entity ID. This enables O(n) hash joins instead of O(n²) nested matching—reducing rebuild time from 45 minutes to under 10 minutes for large customers.
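A minimal sketch of the fingerprint computation; the concrete hash function and separator here are illustrative:

```python
import hashlib

def entity_fingerprint(calendar_id: str | None,
                       gong_id: str | None,
                       sf_id: str | None) -> str:
    """Deterministic fingerprint for a source-ID combination, stable across refreshes."""
    key = "|".join(part or "" for part in (calendar_id, gong_id, sf_id))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# The example above: the same ID combination always yields the same entity ID.
fingerprint = entity_fingerprint("cal_12345", "gong_67890", "sf_event_abc")
```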
Spark-Based Performance Optimizations
Columnar Storage: Delta Lake's columnar format enables predicate pushdown, column pruning, and data skipping via min/max statistics.
Hash Joins: By casting ID columns to consistent types (STRING), we ensure Spark's Catalyst optimizer uses hash joins—orders of magnitude faster than sort-merge joins.
Incremental Processing: We only reprocess data sources with new changes (detected via CDC timestamps), typically 5-10% of the total dataset per refresh, as sketched below.
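A small PySpark sketch of the last two points; the Delta table path, timestamp column, and watermark argument are assumed names:

```python
from datetime import datetime
from pyspark.sql import DataFrame, SparkSession, functions as F

def load_changed_events(spark: SparkSession, last_refresh_ts: datetime) -> DataFrame:
    """Load only the calendar events touched since the previous refresh."""
    events = (
        spark.read.format("delta").load("/lake/calendar_events")  # hypothetical path
        # Cast join keys to a single type (STRING) so Catalyst plans hash joins
        # instead of falling back to sort-merge joins on mismatched types.
        .withColumn("event_id", F.col("event_id").cast("string"))
    )
    # Incremental processing: keep only rows whose CDC timestamp has advanced.
    return events.filter(F.col("updated_at") > F.lit(last_refresh_ts))
```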
Change Data Capture for Write Optimization
Writing 20 million records to Postgres every 30 minutes would cause table locks, replication lag, and massive I/O.
Instead, we compute the diff between graph versions: new_graph EXCEPT old_graph. Only 2-5% of records typically change, reducing write volume from 20M to 400K rows—a 50x improvement. Transaction commits drop from 12 minutes to under 30 seconds.
We use Postgres COPY for bulk inserts, minimizing WAL overhead.
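Roughly, the write path looks like the sketch below; the staging table name, connection handling, and the choice to stream the (small) diff through the driver are simplifications for illustration:

```python
import io
import psycopg2
from pyspark.sql import DataFrame

def write_graph_diff(new_graph: DataFrame, old_graph: DataFrame, dsn: str) -> None:
    # Spark's EXCEPT ALL: rows present in the new graph version but not the old.
    changed = new_graph.exceptAll(old_graph)

    # The diff is typically 2-5% of rows, small enough to serialize on the driver.
    buffer = io.StringIO()
    changed.toPandas().to_csv(buffer, index=False, header=False)
    buffer.seek(0)

    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            # COPY keeps WAL overhead low compared with row-by-row INSERTs.
            cur.copy_expert("COPY unified_events_staging FROM STDIN WITH CSV", buffer)
    finally:
        conn.close()
```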
Post-Processing Task for Meeting Transcripts
Call transcripts present a unique challenge. Unlike calendar events or CRM activities, transcripts are:
Large (often 50KB-500KB per meeting)
Slow to generate (Gong/Attention take 30-60 minutes post-meeting to generate transcripts)
Not timing-critical for the graph build
If we waited for all transcripts before materializing the graph, we'd be perpetually 30-60 minutes behind. Instead, we treat transcripts as second-class citizens in our architecture:
Graph Materialization Happens First: We build the unified event graph using metadata (title, time, attendees, IDs) without waiting for transcripts.
Async Transcript Ingestion: As transcripts become available from vendor APIs, we ingest them into a separate event_transcripts table in Postgres.
Post-Processing Reconciliation: A separate, asynchronous task runs after every graph build to match unlinked transcripts to materialized events.
This asynchronous approach means events become queryable within 10 minutes of the meeting ending, while rich transcript data follows 30-60 minutes later. For AI agents, this provides a progressive enhancement - they can start reasoning about "a meeting happened between these people at this time" immediately, then gain access to "here's what was discussed" as transcripts arrive.
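For illustration, the reconciliation task can be as simple as the sketch below; the table and column names (event_transcripts, unified_events, gong_id) are assumed:

```python
import psycopg2

RECONCILE_SQL = """
    UPDATE event_transcripts t
    SET event_id = e.id
    FROM unified_events e
    WHERE t.event_id IS NULL              -- transcripts not yet linked to an event
      AND t.source_call_id = e.gong_id    -- match on the call recorder's ID
"""

def reconcile_transcripts(dsn: str) -> int:
    """Link newly arrived transcripts to already-materialized events."""
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute(RECONCILE_SQL)
            return cur.rowcount  # number of transcripts linked on this run
    finally:
        conn.close()
```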
Results & Impact
By building this unified knowledge graph, we've enabled AI agents to:
Answer complex temporal queries across data sources that would have previously required manual correlation
Provide complete context for any customer interaction, regardless of which system originally captured it
Surface insights like "which accounts have had meetings scheduled but consistently decline them" or "opportunities where email engagement dropped after a specific meeting"
For our customers, this translates to:
80% reduction in time spent searching for meeting context when preparing for calls
Complete visibility into all customer touchpoints in a single interface
AI-powered insights that simply weren't possible when data lived in silos
The technical approach scales efficiently - customers with 10x more data see less than 2x increase in processing time thanks to Spark's parallelization. And by treating entity resolution as a first-class problem with sophisticated algorithmic solutions, we've built a foundation that extends beyond events to contacts, companies, opportunities, and more.
Building a knowledge graph at scale isn't just about moving data around - it's about intelligently reconciling messy, real-world information into a coherent whole that unlocks genuinely new capabilities. That's the foundation for AI agents that actually understand your business.






