There has been much hand-wringing in recent weeks about RAG (Retrieval Augmented Generation) for legal use cases. For many vendors and innovators working with LLMs, RAG had been (perhaps inaccurately) viewed as the Holy Grail - a magical technique that eliminates hallucinations every time and finds the needle in the haystack with 100% accuracy. Last week’s pre-print of a study by researchers at Stanford's RegLab (https://hai.stanford.edu/news/ai-trial-legal-models-hallucinate-1-out-6-queries) challenged the idea that RAG is a ‘silver bullet’ that eliminates the risk of hallucination. And while there were issues with the lab’s methodology, it’s worth exploring what RAG is and what law firm and legal department buyers should look out for.
What is Retrieval Augmented Generation?
According to Nvidia, “Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.” In practical terms, this means that LLM ‘answer engines’ such as LexisNexis’s Lexis+ AI or Thomson Reuters’ Westlaw Edge are backed by databases, which fill gaps in the LLM’s knowledge. This can be very useful for answering specialised queries. For domains such as legal research, RAG has the potential to be very powerful.
The process has two main steps (a minimal code sketch follows the list):
Retrieval: A retrieval model searches the supplied database for information relevant to the input query. The documents and the query are converted into vectors (embeddings), candidate passages are ranked by similarity to the query, and the top results are passed to the next step.
Generation: A large language model uses the retrieved information as supplemental context to generate a more accurate, specific, and contextually relevant response to the query.
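To make the two steps concrete, here is a minimal, self-contained sketch in Python. The `embed` function and `call_llm` stub are placeholders invented for illustration; a real system would use a trained embedding model and a hosted LLM, but the retrieve-then-generate shape is the same.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Placeholder embedding: hash tokens into a bag-of-words vector so the
    # sketch runs without a model. A real system would call a trained
    # embedding model here.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def call_llm(prompt: str) -> str:
    # Stub standing in for a hosted LLM call.
    return f"[model response conditioned on]\n{prompt}"

# A few toy passages standing in for a legal corpus.
documents = [
    "The limitation period for simple contract claims is six years.",
    "A claim form must be served within four months of issue.",
    "Hearsay evidence is admissible in civil proceedings subject to notice.",
]

# Step 1 - Retrieval: embed the corpus once, embed the query, rank by
# cosine similarity (vectors are unit-normalised, so a dot product works),
# and keep the top-k passages.
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = doc_vectors @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

# Step 2 - Generation: hand the retrieved passages to the LLM as context.
def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer using ONLY the context below. If the context is "
        f"insufficient, say so.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

print(answer("How long do I have to serve a claim form?"))
```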
What are the issues with RAG?
Retrieval Augmented Generation can be fragile when detail matters. The technique works by encoding information about the data into embeddings; the system then searches these embeddings and passes the matching text to the LLM, enriching the existing ‘knowledge’ of the LLM and allowing the model to draw on this information.
However, when you embed a document that is hundreds of pages long, you cram so much information into each embedding that you hugely increase the risk of errors. In legal contexts, pinpoint accuracy is crucial. So when the embedding space returns sentences that seem similar but are in fact wholly different in the context of high-stakes litigation, you could be faced with the difference between a successful case and a suit for negligence.
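One common mitigation, sketched below as a continuation of the code above, is to split long documents into small overlapping chunks before embedding, so each vector only has to represent a short, coherent span (the window sizes are illustrative):

```python
def chunk(text: str, max_words: int = 150, overlap: int = 30) -> list[str]:
    # Split a long document into overlapping word windows so each embedding
    # only has to represent a short span, not hundreds of pages at once.
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

# Index chunks instead of whole documents; a retrieval hit now points at a
# specific passage that a lawyer can check against the source.
chunks = [c for doc in documents for c in chunk(doc)]
chunk_vectors = np.stack([embed(c) for c in chunks])
```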
Put simply, RAG hallucinates more than is commonly understood. It struggles to decline to answer, so it biases towards saying ‘something’ rather than nothing. Similarly problematic is its difficulty in indicating whether an answer is based on the underlying document set or on the model’s training data.
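One partial guard against the bias towards saying ‘something’, again extending the sketch above, is to decline whenever the best retrieval score falls below a threshold. This is a common pattern rather than a complete fix, and the threshold value here is invented for illustration:

```python
SIMILARITY_FLOOR = 0.35  # illustrative value; tuned empirically in practice

def answer_or_decline(query: str) -> str:
    scores = doc_vectors @ embed(query)
    if scores.max() < SIMILARITY_FLOOR:
        # Nothing in the corpus matched the query well: declining to answer
        # beats generating 'something' from weak context.
        return "No sufficiently relevant source found; declining to answer."
    return answer(query)
```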
Does Wexler use RAG?
Yes and no. Wexler’s core fact extraction engine does not use RAG, since it doesn’t give our system the accuracy needed to find and extract the key factual information in our customers’ matters. Instead, we built a custom 12-step LLM pipeline, which does everything from parsing the information for processing and clustering facts across documents, to ‘enriching’ facts with key contextual data and analysing them against the user-supplied synopsis.
We have been developing our conversational interface over the past few months, with the aim of giving our customers more flexibility in how they manipulate the facts. We’re building the ability to ask questions in plain English and get sourced answers drawn from the data in the tables, as well as to produce text summaries and ‘sub-chronologies’ with different, user-defined columns. This approach has involved RAG. However, Wexler only applies RAG over the facts we have already extracted, rather than the entire document set. The amount of information in each embedding is therefore substantially lower, which reduces the risk of the error modes described above and creates a much more accurate and robust system.
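To make the distinction concrete, here is a hedged sketch of indexing extracted facts rather than raw documents, reusing `embed` from the first example. The fact records and field names are invented for illustration and are not Wexler’s actual schema:

```python
# Hypothetical extracted-fact records; in a real system these come from an
# upstream extraction pipeline, and each fact keeps its source citation.
facts = [
    {"date": "2021-03-02",
     "fact": "Claimant signed the supply agreement.",
     "source": "bundle/contract.pdf, p. 4"},
    {"date": "2021-06-15",
     "fact": "First late delivery recorded by the defendant.",
     "source": "bundle/emails.pdf, p. 12"},
]

# Each embedding now represents a single short fact, so a retrieval hit
# maps cleanly back to one discrete, sourced statement.
fact_vectors = np.stack([embed(f["fact"]) for f in facts])

def retrieve_facts(query: str, k: int = 3) -> list[dict]:
    scores = fact_vectors @ embed(query)
    top = scores.argsort()[::-1][:k]
    return [facts[i] for i in top]
```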
A shift in perspective
Last week’s pre-print highlighted an uncomfortable truth for legal vendors: RAG over an entire document database can be risky and fraught with difficulty. At Wexler, we have worked extensively to reduce inaccuracy, but we know there may be occasions when the system returns something that’s not quite right. We make sure that everything is sourced and cited, so you can see where Wexler retrieved each piece of information from. This is a core part of our thesis that human-AI collaboration will be at the heart of the adoption of these tools. Our customers work with our systems, verifying the output and tweaking what Wexler produces for their different use cases, saving a huge amount of time.
We’ll be releasing benchmarks for our testing later this year. If you’d like to discuss RAG, fact gathering or generative AI for disputes you can book in a call with us directly at https://calendly.com/wexler-ai/30min.