Features
March 14, 2025

Evaluating the Performance of Enterprise Search Solutions

Overview

Enterprise search remains a significant challenge within organizations, with teams reportedly wasting up to 32 days a year navigating through multiple tools to find answers.

At Quench, we address this problem by building a robust enterprise search engine that seamlessly integrates with the tools teams already use – such as Notion, Google Drive, and Slack – via simple pre-built integrations and APIs.

In our work with businesses, we've learned that delivering correct answers is paramount. Enterprises are understandably wary of inaccuracies or hallucinations from Retrieval-Augmented Generation (RAG) solutions. 

With this in mind, we conducted a comprehensive analysis of enterprise search tools, comparing Quench against Sana Labs, NotebookLM, and Google Drive Gemini by running over 300 queries using content from two sample datasets. You can find our methodology at the end.

Our evaluation focused on two key metrics that matter most to businesses:

  • Correctness: How often each solution provided correct answers.
  • Completeness: How often each solution provided answers that thoroughly covered the full scope of the question. 

Here are our key findings:

Results

We evaluated four enterprise search tools – Quench, Sana Labs, NotebookLM, and Google Drive Gemini – by running 324 queries against content from two sample datasets, assessing each tool on the correctness and completeness metrics defined above.

Quench Scored Highest on Correctness

  • Quench is 4.2x more preferred than Sana Labs
  • Quench is 4.8x more preferred than NotebookLM
  • Quench is 5.7x more preferred than Google Drive Gemini

Our evaluation metric "preferred" represents the ratio of cases where one tool was rated better than another, excluding instances where both performed equally. 

For example, if Quench is 4.2x more preferred than Sana Labs in correctness, it means that for every time Sana Labs was rated better, Quench was rated better 4.2 times as often.
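
To make the arithmetic concrete, here is a minimal sketch of how this ratio is computed. The verdict counts below are hypothetical and only illustrate the calculation; they are not figures from our evaluation.

    # Hypothetical pairwise verdicts for one metric (e.g. correctness).
    # "A" means tool A was rated better, "B" means tool B was rated better,
    # and "tie" covers the "equally good" and "equally poor" verdicts.
    verdicts = ["A", "A", "tie", "B", "A", "tie", "A", "A"]

    a_wins = verdicts.count("A")  # 5
    b_wins = verdicts.count("B")  # 1

    # Ties are excluded: the ratio counts how often A won per win for B.
    preference_ratio = a_wins / b_wins
    print(f"Tool A is {preference_ratio:.1f}x more preferred than Tool B")  # 5.0x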

Quench Performs Best for Finding Correct Information on Fact-Based Queries

On fact-based queries alone:

  • Quench is 4.6x more preferred than Sana Labs
  • Quench is 9.6x more preferred than NotebookLM
  • Quench is 7.9x more preferred than Google Drive Gemini

Out of 163 fact-based queries, Quench achieved an 89.8% correctness rate, significantly outperforming the other enterprise search tools. This is also higher than its overall correctness score of 86.3% across all 324 queries.

Sana Labs demonstrated a correctness score of 50.9% for fact-based queries, which is slightly higher than its overall score of 48.1%.

In contrast, NotebookLM’s correctness decreased from 37.0% overall to 32.5% on fact-based queries.

Meanwhile, Google Drive Gemini's accuracy remained stable, with correctness scores of 35.0% on fact-based queries and 35.8% overall.

NotebookLM Provided the Most Complete Answers

NotebookLM was rated more complete than Quench, returning a complete answer 68% of the time compared with Quench's 63%.

We noted that NotebookLM tended to be more verbose and cited far more of the uploaded documents.

This makes NotebookLM ideal if your workflow involves open-ended research, brainstorming, and creative ideation.

However, there is a tradeoff: this verbosity leads to lower correctness scores and a higher risk of hallucinations and incorrect results.

Deep Dive into a Sample Question

Here is how each tool responded to the following question about a synthetic dataset:

“How does the organization ensure that employees’ interests are protected when they report issues?”

Quench

The organization ensures that employees' interests are protected when they report issues by valuing honesty and ensuring there is no retaliation against employees for being forthright. Any retaliatory actions taken by managers are addressed seriously, and managers are held accountable for inappropriate behavior, including receiving necessary retraining [1][3]. Additionally, the organization has implemented effective tracking systems that promote transparency regarding employee performance, which helps in maintaining a culture of accountability and straightforwardness [2]. This approach encourages employees to report any concerns without fear of negative repercussions, supporting a positive workplace culture.

Sana Labs

The organization ensures that employees' interests are protected when they report issues by:

Providing Direct Access to Leadership: Employees can directly message the CEO, COO, or CFO to report any issues, ensuring that their concerns are heard at the highest levels.

Encouraging Transparency: The organization fosters an environment where transparency is valued, and employees are encouraged to speak up without fear of retribution.

Addressing Coercion and Manipulation: Any use of threats, coercion, or manipulation by managers is taken seriously and addressed promptly to protect employees' interests

NotebookLM

NotebookLM produced an extremely detailed response that included several correct elements, but it also retrieved information that, according to our curated reference answers, isn’t correct for the question. For example, it mentioned the following points, which were irrelevant to the question:

  • “Anonymous Feedback: The company provides an anonymous feedback form to provide a platform for sharing feedback about the organisation, without capturing names or contact details…”
  • “Focus on Resolution over Blame: When dealing with conflicts, the emphasis is on understanding the root cause and fixing the issue”

Because the system aims for maximum completeness by presenting as much sourced material as possible, it may include contradictory or irrelevant information, which ultimately affects correctness – a critical factor for enterprise search applications.

Google Drive Gemini

"The document does not specify how the organization protects the interests of employees who report issues."

Performance Analysis

Quench demonstrated superior performance on correctness for the following reasons:

  1. Comprehensive coverage of key elements: Quench's response correctly included:
    • No retaliation policy
    • Manager accountability and retraining
    • Transparency through tracking systems
  2. Factual accuracy: All information provided was supported by the reference claims without hallucination.
  3. Balanced detail level: Provided sufficient information without overreaching beyond what was supported by the source document.

Sana Labs performed well and mentioned direct access to leadership, which is missing from Quench’s answer. However, it omitted key points, including manager retraining and tracking systems, which were essential elements in the reference claims.

NotebookLM provided excessive detail that included both correct and irrelevant information, reducing its overall correctness score.

Google Drive Gemini incorrectly claimed the document did not contain information on the topic, when in fact it did.

Why Does Quench Outperform Other Solutions?

Quench delivers superior results because of our innovative approach to RAG. While traditional systems often struggle with incomplete or disconnected information, Quench’s proprietary method returns accurate and complete answers.

Context is All You Need

When you use Quench, every piece of information is enriched with its full context. 

In an enterprise setting, knowledge is often poorly organized and missing context at scale. We leverage a contextual RAG approach that adds the necessary context to your assets – from documents to transcripts of recordings – creating self-contained knowledge blocks complete with relevant background information.

This addresses a common challenge: raw transcripts, for example, often contain fragmented dialogue and irrelevant information.

By reconstructing these into coherent, context-rich segments, we ensure each citation retrieved has the complete information needed to properly answer a question.
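
As a rough illustration of what this kind of contextual enrichment can look like, here is a minimal Python sketch. The function names and prompt wording are assumptions made for the example, not Quench's actual pipeline.

    # Minimal sketch of contextual chunk enrichment for RAG.
    # call_llm is a placeholder for whatever LLM client is available.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def enrich_chunk(document_text: str, chunk: str) -> str:
        """Prepend LLM-generated context so the chunk is self-contained."""
        prompt = (
            "Here is a document:\n"
            f"{document_text}\n\n"
            "Here is a chunk taken from that document:\n"
            f"{chunk}\n\n"
            "Write one or two sentences of context that situate this chunk "
            "within the overall document, so it can be understood on its own."
        )
        context = call_llm(prompt)
        # The enriched text (context + original chunk) is what gets embedded and indexed.
        return f"{context}\n\n{chunk}"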

Smart Search and Correction Features

Quench goes beyond basic search with the following capabilities (a rough sketch follows this list):

  • Query Expansion: We understand what you're really asking, even if your question is phrased differently from the source material.
  • Citation Reranking: Our system prioritizes the most relevant information for your specific question.
  • Custom Client Dictionary: Organizations can add custom corrections for company-specific terminology, names, and acronyms.
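
Below is a rough sketch of how query expansion and citation reranking can fit together in a retrieval pipeline. The callables generate_paraphrases, vector_search, and cross_encoder_score are illustrative placeholders for an expansion model, a vector index, and a reranker; they are not Quench's actual components.

    from typing import Callable, Iterable

    # Illustrative retrieval pipeline with query expansion and citation reranking.
    # Retrieved chunks are assumed to expose .id and .text attributes.
    def answer_query(
        query: str,
        generate_paraphrases: Callable[[str], list[str]],
        vector_search: Callable[[str, int], Iterable],
        cross_encoder_score: Callable[[str, str], float],
        top_k: int = 5,
    ) -> list[str]:
        # Query expansion: search with the original query plus paraphrased
        # variants, so differently worded source material can still match.
        variants = [query] + generate_paraphrases(query)

        candidates = {}
        for variant in variants:
            for chunk in vector_search(variant, 20):
                candidates[chunk.id] = chunk  # dedupe across expanded queries

        # Citation reranking: rescore every candidate against the original
        # question and keep only the most relevant citations.
        reranked = sorted(
            candidates.values(),
            key=lambda chunk: cross_encoder_score(query, chunk.text),
            reverse=True,
        )
        return [chunk.text for chunk in reranked[:top_k]]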

Why This Matters for Businesses

Enterprise environments have unique communication challenges.

Each organization has its own acronyms, naming conventions, and terminology that generalist LLMs don't understand.

Names like "Husayn" might be incorrectly transcribed as "Hussein," "Husain," or "Usain" in meeting recordings.

When someone asks "What did Husayn say about our Q2 goals?" standard systems might miss the answer entirely because they don't recognize the name variation.
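
To make this concrete, here is a minimal sketch of how a custom client dictionary could normalize transcription variants before retrieval. The mappings and helper function are illustrative, not Quench's actual implementation.

    # Illustrative custom client dictionary mapping common transcription
    # variants to the organization's canonical spelling.
    CLIENT_DICTIONARY = {
        "hussein": "Husayn",
        "husain": "Husayn",
        "usain": "Husayn",
    }

    def apply_client_dictionary(text: str) -> str:
        # Simplified word-level lookup; a production system would also need to
        # preserve punctuation on corrected words and handle multi-word terms.
        corrected = []
        for word in text.split():
            key = word.strip("?,.!").lower()
            corrected.append(CLIENT_DICTIONARY.get(key, word))
        return " ".join(corrected)

    print(apply_client_dictionary("What did Hussein say about our Q2 goals?"))
    # -> "What did Husayn say about our Q2 goals?"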

Our extensive testing shows that effective enterprise search requires customization. Quench recognizes that your business is not just another dataset but a unique environment with its own language. This explains why Quench consistently delivers more accurate, contextually rich answers than other solutions.

When to Use Alternatives?

NotebookLM is ideal if your workflow involves manually managing document collections and detailed citations - making it a great tool for open-ended research, brainstorming, and creative ideation.

On the other hand, Quench is purpose-built for fast, fact-based enterprise search. It automatically captures and indexes an organization's full spectrum of data - from Slack messages and meeting recordings to Notion pages - delivering swift, accurate, and context-rich answers with robust citations. In environments where speed and accuracy are critical for decision-making, Quench offers a more targeted solution.

Want to Try Quench Yourself? 

If you are curious about how Quench delivers high precision, or you want to benchmark our precision against your internal solution, we are up for the challenge. 

You can sign up here and a member of our team will be happy to help you get started. 

Our Method

  1. Choosing Sample Datasets and Creating User Personas:
    We selected two companies to test our system and created a user persona for them - employees seeking quick and accurate answers about company processes and product information.
  2. Generating Questions:
    We configured an LLM to create questions based on each dataset's content while fitting the persona. For each question, we also asked the LLM to state the asset that inspired it.
  3. Crafting Reference Answers:
    We used multiple LLM agents to scour the asset for claims relevant to the question. We then configured LLMs to leverage this information and craft a detailed, well-supported answer.
  4. Retrieving Answers from Enterprise Search Tools:
    We obtained results from Quench as well as the following knowledge management RAG tools, each of which had the same datasets uploaded:
    1. Sana Labs
    2. NotebookLM
    3. Google Drive Gemini
  5. Comparing Answers:
    We used an LLM judge to simulate human labelling (a sketch of the judging prompt follows this list). We set the labelling task in accordance with best practices learned from managing human labelling:
    1. The verdict must be a forced choice. When comparing two answers, the only possible verdicts are the following:
      1. Answer 1 > Answer 2
      2. Answer 1 < Answer 2
      3. Both are equally good
      4. Both are equally poor
    2. All the context required to correctly label the predictions needs to be included in the task, without relying on labellers’ memory or pre-trained knowledge where possible.
    3. Alongside this, we provided the list of reference claims and relevant citations for the LLM to make its decision. We asked the LLM to generate a verdict based on both completeness and correctness. We used GPT-4o as the judge, as it better aligns with human preferences.
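
For reference, here is a minimal sketch of how such a pairwise judging prompt can be constructed. The wording and verdict labels are illustrative and do not reproduce our exact evaluation harness.

    # Illustrative pairwise LLM-judge prompt. The four verdict labels mirror the
    # allowed outcomes listed above; all required context is placed in the prompt
    # so the judge never has to rely on its own pre-trained knowledge.
    VERDICTS = [
        "ANSWER_1_BETTER",
        "ANSWER_2_BETTER",
        "BOTH_EQUALLY_GOOD",
        "BOTH_EQUALLY_POOR",
    ]

    def build_judge_prompt(question: str, reference_claims: str, citations: str,
                           answer_1: str, answer_2: str) -> str:
        return (
            f"Question: {question}\n\n"
            f"Reference claims:\n{reference_claims}\n\n"
            f"Relevant citations:\n{citations}\n\n"
            f"Answer 1:\n{answer_1}\n\n"
            f"Answer 2:\n{answer_2}\n\n"
            "Judge both answers on correctness and completeness against the "
            "reference claims only. Respond with exactly one of: "
            + ", ".join(VERDICTS)
        )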

Current Limitations of Our Study

  1. Challenges in Assigning Correctness Scores: Our current approach flags responses as "complete but incorrect," but lacks a fine-grained way to measure partial correctness (e.g., a score of 0.5 instead of 0).
  2. Reliance on Generated Reference Answers: Our current approach leverages multiple LLMs to generate reference answers to questions. While we’ve engineered this to align as closely as possible with human preferences, it would be preferable for reference answers to come from human experts.
  3. Reliance on Automated Evaluations: Our method uses GPT-4o as a judge to simulate human preference labelling. While this method showed good alignment with human preferences, we didn’t measure its alignment with human SME preferences. Further validation against human SMEs in specialised fields could ensure robustness of our approach.