How Rox Mines Signal from Sales Data at Scale

Brian Xu
A Shared Problem
One sales deal can balloon into a massive amount of data: a deal can consist of dozens of calls, each one transcribed into hundreds of text chunks. Scattered throughout these meetings are signals for buying interest, disagreements, competitor mentions, and commitments. When we started building our agent’s ability to reason over data, our first major challenge was meeting transcripts. <Link to Gopal’s post>
We built a simple LLM reranker that got the job done. But gradually, we saw this pattern emerge across all sorts of data. As we expanded our agent’s abilities to search across documents, emails, CRM notes, and even the web, we needed to aggressively filter out noise and extract clean insights for our agent to reason over.
We needed a single system to take any data source, extract relevant snippets, and do it fast enough at scale to prevent bottlenecking our agent. We ultimately built a unified reranking system that extracts the most relevant snippets across any data source, powering all of our agent’s reasoning.
What We’re Solving
Zooming out, building in sales presents a specialized problem space: signals are embedded everywhere, and the environment is constantly shifting. Let’s take a concrete example from a Rox user: a sales representative is conducting a single account review. Our agent will synthesize:
15 meeting transcripts (390 text chunks split over 7 calls)
40+ email threads
CRM notes from teammates
Recent news on the target company
Realtime LinkedIn data on critical contacts to detect job changes
Internal documents shared throughout the deal
Surfacing a single insight, such as “their VP of Engineering mentioned considering a competitor in last Tuesday’s call”, requires retrieving precise passages out of thousands and weaving them together. If we execute poorly, the rep misses a key detail, sends a follow-up email, and loses credibility with the account. But if we get it right, preparation that typically takes an hour is collapsed into just 30 seconds, and the rep can defend a deal that is silently slipping away to a competitor.
As a result, the Rox team has thought deeply about how to craft the right context from our data to unlock meaningful sales outcomes at scale. How do we crawl thousands of heterogeneous signals, select the most relevant snippets, and present them as clean context for our agent to reason over? This is one of the most important levers for our agent’s quality.
From Hacking to Scaling
Our team’s first instinct was to solve our challenge with transcripts and move on. Our bespoke reranker got the job done, raising the number of transcripts our agent could analyze from fewer than 5 to several dozen. But as we integrated emails, CRM notes, web search, and documents, it became clear that implementing context management for each would become a game of whack-a-mole.
We also considered several standard alternatives, designed for solving problems adjacent to ours. Retrieval-Augmented Generation (RAG) is a common approach, allowing a reasoning LLM to dynamically extract information stored in vector databases. However, RAG wasn’t the right fit. Sales reasoning often relies on non-obvious connections: a LinkedIn post announcing that a prospect just hired three backend engineers may signal interest toward building in-house over buying. This is a critical insight that likely has no keyword overlap with the deal’s CRM record. RAG’s focus on semantic similarity would be insufficient. Additionally, RAG requires pre-indexing all data, which is infeasible for realtime data such as web search results.
Another potential approach is to have an agentic harness rip through our raw data, running on a loop of data collection, reasoning, and summarization. While this represents a more generalized approach and can potentially maintain high quality, it leads to high latency, prohibitively high token consumption, and less steerability. Ideally, our agent’s scoring module should take about 2-4 seconds end to end. An agentic loop would likely take minutes of latency as it reasons through each chunk individually. At scale, this approach becomes unacceptable.
So, we stepped back and asked ourselves: what if we could solve this problem once for all our data? We looked at what all our data types had in common, and we saw a clear opportunity. By pulling our extraction logic into a single shared module, we could solve for quality, reliability and scale exactly once.
Our insight was that despite surface-level differences between a transcript and a LinkedIn post, each data source we examine shares an underlying structure. We landed on a simple abstraction: every data source has a parent and children.
Parent: a high-level unit of data; a transcript, document, batch of web search results, or set of email threads
Child: a smaller unit within the parent; a text chunk, passage or summary of a contact
The same abstraction applies across data sources: a transcript becomes a parent with text chunks as children; a web search query becomes a parent with result snippets as children. The scoring system treats both identically.
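To make the abstraction concrete, here is a minimal sketch of how the parent/child model might look in code. All names here are illustrative, not Rox’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Child:
    """A small unit of content: a transcript chunk, email passage, etc."""
    id: int   # sequential ID within the parent
    text: str

@dataclass
class Parent:
    """A high-level unit of data: one transcript, document, or search batch."""
    source: str  # e.g. "transcript", "web_search"
    children: list[Child] = field(default_factory=list)

# Two very different data sources map onto the same shape:
transcript = Parent("transcript", [
    Child(0, "We're evaluating a competitor for this use case."),
    Child(1, "Budget approval moved to Q3."),
])
web_query = Parent("web_search", [
    Child(0, "Prospect just hired three backend engineers."),
])
```

Because the downstream reranker only ever sees parents and children, adding a new data source amounts to writing one adapter that maps it into this shape.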
Our goal is to quickly and reliably find the top-K most relevant children given a query, which will be passed to our agent for reasoning. Critically, this query is not static. Our primary agent passes it down at runtime, meaning our reranker is dynamically scoped to each reasoning task, whether it’s an account review, competitor signal, or pricing update. Identical data gets ranked differently depending on the task at hand.
Today, our system solves this problem and powers seven data pipelines across Rox: transcripts, emails, CRM notes, documents, web search, contacts, and org charts. It runs in 2-4 seconds wall-clock and can handle ranking thousands of items. What began as a transcript reranker has evolved to become the foundation of Rox’s data reasoning.
How it All Works
Getting here required several rounds of experimentation and iteration. Here are a few of the lessons we learned.
The entire reranking pipeline. All LLM calls run concurrently, so wall-clock latency is bounded by the single slowest call. A production workload of 390 chunks completes in ~4 seconds.
Order, Don’t Score
Our initial approach was straightforward: we had an LLM-as-a-judge rate each chunk with a relevance score from 1-10. This sounded great in theory, but we quickly ran into an issue: LLMs are known to struggle with producing calibrated numerical outputs. We witnessed this firsthand, with almost all our outputs clustered around the 5-7 range. Prompt engineering around this problem helped marginally, but it was brittle across different models and data types.
After digging into other approaches, we were inspired by RankGPT. Instead of assigning numerical scores, we switched to having the LLM produce an ordered permutation, listing item IDs in order of relevance. On our internal benchmarks, we achieved a significantly higher precision on retrieving relevant items.
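A RankGPT-style reranker returns a permutation string such as `[3] > [1] > [2]` rather than per-item scores. A sketch of the parsing step, assuming that output format (the exact format and helper names here are illustrative):

```python
import re

def parse_permutation(output: str, num_items: int) -> list[int]:
    """Parse a permutation like '[3] > [1] > [2]' into an ordered ID list.

    Out-of-range or duplicate IDs are dropped; IDs the model omits are
    appended at the end (worst rank) so every item receives a position.
    """
    seen: set[int] = set()
    order: list[int] = []
    for token in re.findall(r"\[(\d+)\]", output):
        idx = int(token)
        if 1 <= idx <= num_items and idx not in seen:
            seen.add(idx)
            order.append(idx)
    # Degrade gracefully: unreturned items go to the bottom.
    order += [i for i in range(1, num_items + 1) if i not in seen]
    return order

print(parse_permutation("[3] > [1] > [2]", 4))  # → [3, 1, 2, 4]
```

Defensive parsing matters here: as discussed below, models sometimes hallucinate or drop IDs, so the parser must tolerate malformed output rather than crash.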
Shuffle, Don’t Randomize
However, our ordering approach highlighted a new problem: positional bias. Positional bias is a well-documented problem in LLM outputs, often favoring items appearing earlier in the context window. Our first solution here was to introduce randomized 6-digit IDs corresponding to each chunk, which would reduce correlation between position and identity. This exposed a new problem: models, especially reasoning models, often struggled to accurately reproduce the right 6-digit IDs in their output. We saw hallucinated IDs, swapped digits, and omitted items. Additionally, outputting 6-digit IDs per call cost us roughly double the token cost of using sequential IDs.
Our next approach was simpler and more effective. We kept sequential IDs but shuffled the order of items between ranking passes, so each pass sees the same items in a different randomized order with fresh IDs. We repeat this pass a few times and aggregate the results using Reciprocal Rank Fusion (RRF) to obtain a numerical score. At a high level, RRF assigns each item a score roughly inversely proportional to its rank. Items the model fails to return receive a worst-rank score so the system degrades gracefully; this happens for fewer than 1.2% of production items. This approach achieved higher-quality results on internal benchmarks at roughly half the output token cost of the randomized-ID approach.
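The shuffle-and-fuse step can be sketched as follows. This is a simplified illustration, not Rox’s production code: `rank_fn` stands in for the LLM ordering call, and the RRF constant `k = 60` is the conventional default from the RRF literature, not a value the post specifies:

```python
import random
from collections import defaultdict

def rrf_aggregate(passes: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse several rankings with Reciprocal Rank Fusion.

    Each pass is an ordered list of item keys, best first. An item's score
    is the sum over passes of 1 / (k + rank); items absent from a pass get
    no contribution, which acts as a worst-rank penalty.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in passes:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return dict(scores)

def rank_with_shuffles(items: list[str], rank_fn, n_passes: int = 3) -> list[str]:
    """Run rank_fn over independently shuffled copies, then fuse with RRF."""
    passes = []
    for _ in range(n_passes):
        shuffled = items[:]
        random.shuffle(shuffled)          # fresh order (and fresh sequential IDs)
        passes.append(rank_fn(shuffled))  # rank_fn returns items, best first
    fused = rrf_aggregate(passes)
    return sorted(items, key=lambda it: fused.get(it, 0.0), reverse=True)
```

Shuffling decorrelates an item’s position from its identity across passes, so any positional bias in a single pass washes out in the fused score.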
In a representative production trace (15 entities, 390 transcript chunks), our system batches the work into 6 groups, runs 18 concurrent LLM calls (6 batches x 3 passes), and returns the top 200 scored chunks in 4.35 seconds. Larger entities go through aggressive filtering (with only 23-40% of chunks surviving) while smaller entities pass through entirely.
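The fan-out pattern behind those latency numbers can be sketched with a toy asyncio example. `fake_llm_rank` is a placeholder that sleeps instead of hitting a model API; the point is that total wall-clock time tracks the slowest call, not the sum of all calls:

```python
import asyncio
import time

async def fake_llm_rank(batch_id: int, pass_id: int) -> str:
    # Stand-in for one reranking LLM call; a real call would hit an API.
    await asyncio.sleep(0.1)
    return f"batch={batch_id} pass={pass_id}"

async def run_pipeline(n_batches: int = 6, n_passes: int = 3) -> list[str]:
    # Fan out every (batch, pass) pair at once; wall-clock latency is
    # bounded by the single slowest call, not the sum of all 18.
    calls = [fake_llm_rank(b, p)
             for b in range(n_batches) for p in range(n_passes)]
    return await asyncio.gather(*calls)

start = time.perf_counter()
results = asyncio.run(run_pipeline())
elapsed = time.perf_counter() - start
print(len(results))  # → 18
```

Run serially, 18 calls of 0.1 s each would take 1.8 s; gathered concurrently, the whole batch finishes in roughly the time of one call.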
Pass Directly, Don’t Summarize
Our last major lesson was related to our summarization and formatting pipeline. Our initial architecture had an intermediate summarization step in several modules: an LLM would condense each data source into natural language before passing it to our agent’s reasoning layer. This seemed like an efficient way to reduce context, but it led to several subtle failures: summaries occasionally contained fabricated details or dropped non-obvious facts that were critical to the task at hand, compounding into errors and hallucinations downstream. We switched the system to output raw scored chunks directly to our agent, preserving the source of truth. Our agent can now cite quotes, ground claims, and even resolve conflicting information across a holistic set of sources. This eliminated a primary class of hallucinations that occurred when our agent referenced data.
Before: Intermediate summarization
This reads well, but the summarizer mixed up Competitor X with Competitor Y. The speaker also emphasized this point twice, emphasis that is lost in the summary.
After: Raw scored chunks
Every name, number, and attribution is exactly what was said. Our agent can now cite exact quotes and ground every claim in a specific call and timestamp.
What’s Next
The system we’ve described handles all of Rox’s data-related context engineering today. We’re continuing to evolve the architecture: the next frontier is closing the feedback loop. We will use outcomes from actual deals (won, lost, stalled) to continuously score our reranker and unlock automatic improvements. Today, our internal evals require a gold-standard reference generated by a frontier reasoning model; tomorrow, real deal outcomes will serve as our ground truth. Here at Rox, we dive deep into novel, challenging problems like this and iterate on solutions at scale. If these problems interest you, Rox is hiring! <insert hiring link>