Knowledge Graph Deep Dive

Damon Lin

Elite sales teams are not differentiated by access to more data. They are differentiated by their ability to make sense of it and act on it.

Modern sales organizations already generate massive amounts of data. Emails flow through inboxes, call recorders produce transcripts, customers file support tickets, and CRMs track deals and contacts. The challenge is not a lack of data, but fragmentation. This data is spread across disconnected systems, making it difficult to extract meaningful insights.

The real problem is unifying and interpreting this data. For example, how do you recognize that a prospect you emailed three months ago, who never responded, is the same person now acting as the executive buyer on a deal at risk? How do you model relationships such as a contact being tied to a deal, or an email being sent to that contact?

At Rox, we are building a knowledge graph for sales teams to solve this exact problem. We ingest data from multiple sources, resolve entities across systems, and construct rich relationships between them. This creates a unified and queryable representation of the entire sales ecosystem.

All of this is computed directly in the data warehouse using Spark. This allows us to operate at massive scale with high parallelism while maintaining a single source of truth. As more sales data moves from traditional CRMs into the warehouse, this approach accelerates a broader shift. Siloed data is transformed into a unified knowledge graph that enables sales teams to operate with clarity and confidence.

In the rest of this post, we will dive into how this knowledge graph is built, from data ingestion and entity resolution to relationship modeling and large-scale, warehouse-native computation that supports customers across different data warehouses.

Constructing the Knowledge Graph

The Rox knowledge graph provides a unified representation of customer data that serves as context for agents. At its core, it models the key entities in sales: accounts, contacts, opportunities, events, and users, along with the relationships between them.

Each entity type is stored in its own table, identified by a primary key we call the Rox ID, along with a set of normalized fields. Relationships between entities are captured in a table called entity_link containing fields source_entity_id, target_entity_id, and relationship_type. In addition, we maintain a mapping table called graph that links each Rox ID to the corresponding ID in the original data source, allowing us to trace every record back to its origin. If an entity is created in Rox, then the Rox ID is the same as the data source ID.

account

| rox_id | name | domain | annual_revenue |
| --- | --- | --- | --- |
| a01 | Ramp | ramp.com | 1000000000 |
| a02 | OpenAI | openai.com | 25000000000 |

deal

| rox_id | name | amount | stage |
| --- | --- | --- | --- |
| o01 | OpenAI land | 50000 | S3 |
| o02 | Ramp expansion | 250000 | S5 |

user

| rox_id | name | email |
| --- | --- | --- |
| u01 | Damon Lin | damon@rox.com |

entity_link

| source_entity_id | target_entity_id | relationship_type |
| --- | --- | --- |
| a01 | o02 | DEAL_OF_COMPANY |
| a02 | u01 | OWNER_OF_ACCOUNT |

graph

| rox_entity_id | data_source_entity_id |
| --- | --- |
| a01 | a01 |
| a02 | f03ffa3d-2827-434e-82d0-bdbd3bda52bb |
| o01 | 69529349739 |
| o02 | o02 |
| u01 | 005Vp00000PKUgfVPB |
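To make the storage model concrete, here is a minimal Python sketch that loads the example rows from the entity_link and graph tables above into memory and answers two questions: which entities are linked to a given Rox ID, and which source record a Rox ID came from. The helper names (neighbors, source_id, same_as_source) are hypothetical, and the in-memory structures only stand in for the warehouse tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityLink:
    source_entity_id: str
    target_entity_id: str
    relationship_type: str


# Rows copied from the example entity_link table.
ENTITY_LINKS = [
    EntityLink("a01", "o02", "DEAL_OF_COMPANY"),
    EntityLink("a02", "u01", "OWNER_OF_ACCOUNT"),
]

# Rows copied from the example graph table: rox_entity_id -> data_source_entity_id.
GRAPH = {
    "a01": "a01",
    "a02": "f03ffa3d-2827-434e-82d0-bdbd3bda52bb",
    "o01": "69529349739",
    "o02": "o02",
    "u01": "005Vp00000PKUgfVPB",
}


def neighbors(rox_id, relationship_type):
    """Return the target IDs linked from rox_id by the given relationship type."""
    return [
        link.target_entity_id
        for link in ENTITY_LINKS
        if link.source_entity_id == rox_id
        and link.relationship_type == relationship_type
    ]


def source_id(rox_id):
    """Trace a Rox ID back to the ID used in the original data source."""
    return GRAPH[rox_id]


def same_as_source(rox_id):
    """True when the Rox ID equals the source ID, as it does for entities created in Rox."""
    return GRAPH[rox_id] == rox_id


print(neighbors("a01", "DEAL_OF_COMPANY"))
print(source_id("u01"))
```

In this example data, a01 and o02 have matching IDs on both sides of the graph table, which is consistent with entities created in Rox, while a02, o01, and u01 carry IDs from external systems.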

One of the main challenges in building this system is that every customer structures their data differently. Table names, field names, and schemas vary widely across sources. The knowledge graph is designed to be fully customizable to handle this variability. Instead of enforcing a rigid schema, we allow customers to configure how their data maps into the Rox model.

For example, a customer might store account data in a table called company with fields like property_name, property_domain, and property_annual_revenue. These fields can be mapped to the corresponding fields in Rox’s account entity through configuration. During graph build, the system reads these mappings and automatically translates data from the customer’s schema into the unified Rox schema.
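The mapping described here can be sketched in a few lines of Python. The configuration shape (a dictionary with source_table and fields keys) and the to_rox_record helper are assumptions made for illustration, not the actual Rox configuration format; only the table and field names come from the example above.

```python
# Hypothetical mapping config: Rox account field -> field in the customer's table.
ACCOUNT_MAPPING = {
    "source_table": "company",
    "fields": {
        "name": "property_name",
        "domain": "property_domain",
        "annual_revenue": "property_annual_revenue",
    },
}


def to_rox_record(row, mapping):
    """Translate one customer row into the Rox schema.

    Unmapped source fields are dropped; mapped fields missing from the row become None.
    """
    return {
        rox_field: row.get(source_field)
        for rox_field, source_field in mapping["fields"].items()
    }


customer_row = {
    "property_name": "Ramp",
    "property_domain": "ramp.com",
    "property_annual_revenue": 1000000000,
    "internal_notes": "not mapped, so it is dropped",
}

account = to_rox_record(customer_row, ACCOUNT_MAPPING)
print(account)
```

Keeping the mapping as data rather than code means a new customer schema needs a new configuration entry, not a new pipeline.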

Relationships are configured in a similar way. Once entities and their fields are defined, customers can specify how entities should be connected. For instance, the id field on the user table and the owner_id field on the account table can define a relationship such as OWNER_OF_ACCOUNT. Relationships can also be derived from non-ID fields. An email’s recipient address can be linked to a contact’s email to capture communication history. Customers have full control over which relationships are defined and how they are named.
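One way to picture relationship configuration is as a keyed join that emits entity_link rows. The sketch below is a simplified, in-memory illustration under that assumption. The build_links function, the EMAIL_TO_CONTACT relationship name, the link direction for emails, and the sample email and contact records are all hypothetical; the OWNER_OF_ACCOUNT example reuses IDs from the tables earlier in this post.

```python
def build_links(sources, targets, source_key, target_key, relationship_type,
                normalize=lambda value: value):
    """Join two entity lists on configured keys and emit entity_link rows."""
    # Index source entities by their (normalized) join key.
    index = {}
    for entity in sources:
        key = entity.get(source_key)
        if key is not None:
            index.setdefault(normalize(key), []).append(entity["rox_id"])

    links = []
    for entity in targets:
        key = entity.get(target_key)
        if key is None:
            continue
        for source_rox_id in index.get(normalize(key), []):
            links.append({
                "source_entity_id": source_rox_id,
                "target_entity_id": entity["rox_id"],
                "relationship_type": relationship_type,
            })
    return links


# ID-based relationship: account.owner_id matches user.id.
accounts = [{"rox_id": "a02", "owner_id": "005Vp00000PKUgfVPB"}]
users = [{"rox_id": "u01", "id": "005Vp00000PKUgfVPB"}]
owner_links = build_links(accounts, users, "owner_id", "id", "OWNER_OF_ACCOUNT")

# Non-ID relationship: an email's recipient address matches a contact's email.
emails = [{"rox_id": "e01", "recipient": "Jane@Example.com"}]
contacts = [{"rox_id": "c01", "email": "jane@example.com"}]
email_links = build_links(
    emails, contacts, "recipient", "email", "EMAIL_TO_CONTACT",
    normalize=lambda value: value.strip().lower(),
)

print(owner_links)
print(email_links)
```

Joins on non-ID fields usually need a normalization step like the lowercase-and-strip function here, since free-text values such as email addresses rarely match byte for byte across systems.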

A critical component of the knowledge graph is entity resolution. This is the process of determining when records from different data sources refer to the same underlying entity, and ensuring they are merged rather than duplicated.

Each entity type uses its own resolution logic:

  • Accounts are resolved using signals such as domain and firmographic attributes.

  • Contacts are resolved primarily by email address.

  • Events are resolved using a combination of title, start time, end time, and attendees.

  • Opportunities are intentionally not resolved. Their creation is typically controlled through strict processes, and merging them risks losing important context.
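As a simplified illustration of the contact rule above, the following Python sketch groups contact records from different sources by normalized email address. Each resulting group would become one contact in the graph. The record shape, the sample IDs and addresses, and the normalization rule (trim whitespace, lowercase) are assumptions for the example; production resolution logic is likely to weigh more signals than this.

```python
from collections import defaultdict


def normalize_email(email):
    """Assumed normalization: trim surrounding whitespace and lowercase."""
    return email.strip().lower()


def resolve_contacts(records):
    """Group contact records by normalized email.

    Returns {normalized_email: [source record IDs]}; each group represents
    one resolved contact, and the ID list preserves lineage to the sources.
    """
    groups = defaultdict(list)
    for record in records:
        groups[normalize_email(record["email"])].append(record["id"])
    return dict(groups)


# Hypothetical records from two different systems.
records = [
    {"source": "crm", "id": "003A", "email": "Jane@Example.com"},
    {"source": "inbox", "id": "msg-contact-7", "email": " jane@example.com "},
    {"source": "crm", "id": "003B", "email": "sam@example.com"},
]

resolved = resolve_contacts(records)
print(resolved)
```

Here the two records for Jane collapse into a single group while Sam stays separate, and the list of source IDs per group is what a mapping table like graph would record.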

The result is a knowledge graph that unifies data across disparate systems into a consistent set of entities and relationships, with duplicates resolved and lineage preserved. This foundation enables downstream systems and agents to reason over a complete and coherent view of the customer.

Scaling with Spark and Warehouse-Native Architecture

Operating this system in production requires handling both high data volume and continuous change. A single customer can have millions of accounts and tens of millions of contacts, with frequent inserts, updates, and deletes across multiple upstream systems. Keeping derived state consistent under these conditions is a distributed systems challenge.

Our current architecture uses a periodic batch pipeline that rebuilds the full state every 30 minutes. Each run performs a coordinated read across all connected data sources, recomputes entities and relationships, and applies the resulting changes. This imposes a hard constraint on end-to-end latency: each job must complete within its scheduling window, or jobs begin to back up.
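One plausible way to apply changes after a full rebuild is to diff the freshly computed state against the previous one and write only the differences. The sketch below shows that idea on small in-memory dictionaries, together with a check for the 30-minute window. Both functions and the sample records are hypothetical; the post does not specify how changes are computed.

```python
WINDOW_SECONDS = 30 * 60  # the 30-minute scheduling window


def diff_snapshots(previous, current):
    """Compare two {rox_id: record} snapshots.

    Returns (inserts, updates, deletes): new records, changed records,
    and the sorted IDs that no longer exist.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = sorted(k for k in previous if k not in current)
    return inserts, updates, deletes


def within_window(run_seconds, window_seconds=WINDOW_SECONDS):
    """True when a run finishes before the next one is scheduled to start."""
    return run_seconds < window_seconds


# Hypothetical snapshots from two consecutive runs.
previous = {"a01": {"stage": "S3"}, "a02": {"stage": "S5"}}
current = {"a01": {"stage": "S4"}, "a03": {"stage": "S1"}}

inserts, updates, deletes = diff_snapshots(previous, current)
print(inserts, updates, deletes)
print(within_window(25 * 60))
```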

Apache Spark is the execution engine that enables this model to operate reliably at scale. The pipeline is a series of distributed transformations over large datasets, including joins across heterogeneous sources, aggregations for entity resolution, and graph edge construction. Spark’s ability to push computation down to the data source, optimize execution plans, and parallelize work across partitions lets the pipeline keep pace as data volume grows.

The architecture is designed to be warehouse-native: we run the compute pipeline where the data already lives instead of requiring the data to live in Rox. This fits the broader trend of customer data moving into warehouses. For customers who already have their data in a warehouse, we build the knowledge graph by querying that warehouse directly rather than replicating the data into ours. Native data sharing features in warehouses like Snowflake, Databricks, and BigQuery make this possible, which keeps security boundaries intact and avoids unnecessary data duplication.

Spark sits at the center of this as the compute layer. It connects to these warehouses, reads from the shared tables, and runs the same transformations regardless of the underlying system. This lets us keep a single pipeline while supporting different warehouse setups across customers.
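The "one pipeline, many warehouses" idea can be illustrated with a small interface sketch. This is plain Python rather than Spark, and everything in it is hypothetical: TableReader stands in for whatever connector reads a customer's shared tables, InMemoryReader is a test double, and account_domains is a toy transformation. The point is only that the transformation does not change when the reader does.

```python
from typing import Iterable, Protocol


class TableReader(Protocol):
    """Anything that can return the rows of a named table."""

    def read(self, table: str) -> Iterable[dict]: ...


class InMemoryReader:
    """Stand-in for a warehouse connection; a real reader would query shared tables."""

    def __init__(self, tables):
        self._tables = tables

    def read(self, table):
        return list(self._tables[table])


def account_domains(reader: TableReader, table: str, domain_field: str) -> set:
    """The same transformation, regardless of which reader supplies the rows."""
    return {row[domain_field].lower() for row in reader.read(table)}


# Two customers with different table and field names.
warehouse_a = InMemoryReader({"company": [{"property_domain": "Ramp.com"}]})
warehouse_b = InMemoryReader({"accounts": [{"website": "ramp.com"}]})

domains_a = account_domains(warehouse_a, "company", "property_domain")
domains_b = account_domains(warehouse_b, "accounts", "website")
print(domains_a, domains_b)
```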

Future Work

Looking ahead, we are focused on improving both the entity resolution algorithm and the scalability of the pipeline.

On the entity resolution side, we plan to incorporate LLMs as part of the algorithm. Instead of relying on deterministic rules, LLMs can help assign confidence scores when determining whether entities from different data sources should be merged. High-confidence matches can be resolved automatically, while medium- and low-confidence cases can be surfaced to users for review.
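The routing step described here is independent of how the score is produced, so it can be sketched without an LLM. In the Python example below, candidate pairs arrive with a confidence score already attached. The triage function, the 0.9 threshold, and the sample pairs and scores are assumptions chosen for illustration.

```python
def triage(candidates, auto_merge_at=0.9):
    """Split scored candidate pairs into auto-merges and cases for human review.

    candidates: iterable of ((record_a, record_b), confidence) with
    confidence in [0, 1]. The threshold is a hypothetical default.
    """
    merged, review = [], []
    for pair, confidence in candidates:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        if confidence >= auto_merge_at:
            merged.append(pair)
        else:
            review.append(pair)
    return merged, review


# Hypothetical scores, as a model might return them.
candidates = [
    (("crm:acct-1", "billing:cust-9"), 0.97),
    (("crm:acct-2", "billing:cust-4"), 0.62),
    (("crm:acct-3", "billing:cust-8"), 0.15),
]

merged, review = triage(candidates)
print(merged)
print(review)
```

Because the threshold is a parameter, it can be tuned as reviewers confirm or reject the cases surfaced to them.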

On the scalability side, we are moving away from batch processing towards a streaming approach. Rather than relying on periodic full recomputation, changes will be processed as event streams and applied incrementally to the knowledge graph. This shift reduces latency, lowers compute overhead, and enables Rox to react to changes in source systems in near real time.
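To contrast incremental application with a full rebuild, here is a minimal Python sketch that folds a stream of change events into an in-memory state. The event shape (op, rox_id, fields) and the apply_event function are hypothetical; a real streaming system would also have to handle ordering, retries, and re-resolution of affected entities, none of which is shown.

```python
def apply_event(state, event):
    """Apply one change event to the {rox_id: record} state in place."""
    op, rox_id = event["op"], event["rox_id"]
    if op == "upsert":
        # Merge new field values over any existing record.
        state[rox_id] = {**state.get(rox_id, {}), **event["fields"]}
    elif op == "delete":
        state.pop(rox_id, None)
    else:
        raise ValueError(f"unknown op: {op}")
    return state


# Hypothetical event stream touching two deals.
events = [
    {"op": "upsert", "rox_id": "o01", "fields": {"name": "OpenAI land", "stage": "S3"}},
    {"op": "upsert", "rox_id": "o01", "fields": {"stage": "S4"}},
    {"op": "upsert", "rox_id": "o02", "fields": {"name": "Ramp expansion"}},
    {"op": "delete", "rox_id": "o02"},
]

state = {}
for event in events:
    apply_event(state, event)
print(state)
```

Each event touches only the records it names, which is where the latency and compute savings over periodic full recomputation would come from.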

If these are problems you want to work on, we're hiring.

Rox is committed to the privacy and security of its users. Customer data processed through the Rox platform is encrypted in transit and at rest using AES-256 encryption and is never used to train generalized machine learning models. Rox maintains SOC 2 Type II compliance and undergoes independent third-party security audits on an annual basis. All AI-generated outputs, including but not limited to prospect recommendations, message drafts, meeting summaries, and pipeline scoring, are provided for informational purposes and should be reviewed by authorized personnel before any action is taken. Performance metrics referenced on this website, including pipeline generation figures, response rates, and revenue impact, reflect results reported by individual customers under specific configurations and may not be representative of all deployments. Actual results will vary based on factors including but not limited to data quality, CRM configuration, outreach volume, market conditions, and target audience. Rox does not guarantee specific revenue outcomes. The Rox platform integrates with third-party services including Salesforce, HubSpot, Gmail, Microsoft Outlook, Slack, and others; availability and functionality of third-party integrations are subject to the respective providers' terms of service and may change without notice. Features described as "autopilot," "autonomous," or "automated" operate within user-defined parameters and require initial configuration and ongoing oversight. Rox, the Rox logo, and "Revenue on Autopilot" are trademarks of Rox, Inc. All other trademarks are the property of their respective owners. Service availability is subject to the terms outlined in your enterprise agreement. For questions regarding data processing, compliance certifications, or platform capabilities, contact security@rox.com.

Copyright © 2026 Rox. All rights reserved. 251 Rhode Island St, Suite 205, San Francisco, CA 94103
