Rescuing and securing unstructured data with RAG

Published by Alex Olivier on December 04, 2024

LLMs are incredibly flexible, with a list of use cases that grows every day.

One of the more interesting ones is helping companies get more value from their unstructured data.

In 2022, companies generated 57,280 exabytes of unstructured data.

This unstructured data is around 90% of the data companies generate and includes text, multimedia, business documents, email messages, videos, photos, webpages, audio files, and more. As companies archive Zoom recordings, Slack messages, emails, and other data, that number will continue to climb.

Historically, making use of unstructured data has posed major challenges around scaling, processing, and security, so most of it gets lost or forgotten.

Sean Falconer and Kirk Marple recently covered this in a video from the Partially Redacted community, “Privacy and Security Considerations for RAG with Graphlit's Kirk Marple”. The insights from that conversation form the basis of this article (shared with permission from the Partially Redacted community).

So in this article, we’re going to explore how Retrieval Augmented Generation (RAG), combined with Large Language Models (LLMs), is changing the way companies access and secure all this unstructured data. But before we do, we’ll cover how companies used to access unstructured data, and why those approaches no longer work.

Wading through workarounds

To organize and salvage data, companies improvised workarounds to make some of that unstructured data searchable.

For text-based data, Document Management Systems (DMSs) have long been used to organize, store, and retrieve documents. But as data has become increasingly multi-modal, DMSs have become less useful.

To manage more diverse media types, companies turned to centralized repositories called Data Lakes, which store raw, unstructured, and structured data in its native format. The problem is that building Data Lakes takes significant engineering effort, careful planning, in-depth data governance, and ongoing maintenance to ensure those lakes don’t turn into swamps. Often, all the work of storing information in Data Lakes ends up mapping a structured representation over the unstructured data. This flattens the context and often discards the nuanced complexity of unstructured data in order to make it usable.

While both DMSs and Data Lakes had significant drawbacks, they were still valuable for many companies because they allowed some visibility into unstructured data. However, as data generation continues to explode, both are fast becoming obsolete because they lack:

  • Sophistication to deal with rich multi-modal data.
  • Automation required to respond to the rapid expansion of data.
  • Advanced indexing and retrieval capabilities needed for semantic search.
  • Ability to integrate with LLMs.

RAG-enhanced LLMs are becoming more popular because they address those issues.

How RAG accesses unstructured data

RAG is a technique used in natural language processing that combines retrieval (or search) and generation (with an LLM) to improve the quality of responses generated by AI models.

Together with LLMs, RAG addresses the sophistication limitations of DMSs and Data Lakes by leveraging vector embeddings to enable semantic search.

Instead of flattening data to work with simple keyword matching, semantic search focuses on the meaning of queries. Then it retrieves relevant information (even if the exact words don't match) by using vector embeddings to represent the meaning of both the data and the search queries.

This allows for a more nuanced and accurate retrieval process of the inherently richer and more complex unstructured data.
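
As a rough sketch of how this works in practice, the snippet below indexes a few documents as vectors and ranks them against a query by cosine similarity. The `embed` function here is a toy stand-in (it just hashes character trigrams) rather than a real embedding model; in production you would call an actual embedding model and usually delegate the search to a vector database.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model.

    Hashes character trigrams into a fixed-size vector so the example runs
    end to end; swap in a real model for meaningful semantic behaviour.
    """
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "Q3 all-hands recording transcript ...",
    "Customer onboarding email thread ...",
    "Support call audio transcription ...",
]

# Index step: embed every document once and keep the vectors.
doc_vectors = np.array([embed(doc) for doc in documents])

def semantic_search(query: str, top_k: int = 2) -> list[str]:
    """Return the documents whose meaning is closest to the query."""
    q = embed(query)
    # Cosine similarity between the query and every document vector.
    sims = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    )
    best = np.argsort(sims)[::-1][:top_k]
    return [documents[i] for i in best]

# The retrieved passages are what RAG appends to the prompt as context
# before the LLM generates its answer.
context = semantic_search("what did we promise the customer during onboarding?")
```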

Security solutions for RAG

Though RAG is a powerful enhancement for accessing unstructured data and supplying LLMs with answers, the technique can also introduce security issues.

Without proper guardrails, RAG can access and supply an LLM with information the user should not see, whether that’s confidential information or unsuitable data.

Properly implemented, this security should address both incoming information (i.e., the prompts) and the information the LLM has access to, while adapting existing security protocols to this new paradigm.

Let’s take a look at three steps you can take to increase security for your AI chatbot or agent.


Sanitize the data pool

LLMs are not inherently equipped to handle complex security and privacy concerns. While it’s possible to build in controls at the query or output stage, the flexibility of LLMs often makes it possible to bypass these measures.

Instead of relying on an LLM to act as a gatekeeper, you should integrate data sanitization as a distinct step in the ingestion and preparation of data before it enters the vector database.

Typically, data should be cleansed of:

  • Personally Identifiable Information (PII) such as names, addresses, phone numbers, and social security numbers.
  • Proprietary or confidential information, including internal financial data, trade secrets, strategic plans, or other sensitive business information.
  • Content deemed inappropriate, including content that is offensive, discriminatory, or violates company policies.

Data sanitization requires careful consideration and balance. While you need to ensure your LLM can’t pull any of the above information, it must still have access to enough data to remain useful. Otherwise, you could make the retrieved context less accurate or insightful.
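
To make that concrete, here is a minimal sketch of a sanitization step that runs during ingestion, before anything is embedded or indexed. The regex patterns and the `ingest` helper are illustrative assumptions; real pipelines typically rely on dedicated PII-detection libraries or NER models rather than hand-written patterns.

```python
import re

# Illustrative patterns only -- production systems usually rely on a
# dedicated PII-detection library or service rather than hand-rolled regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def sanitize(text: str) -> str:
    """Replace detected PII with typed placeholders before embedding/indexing."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text

def ingest(documents: list[str]) -> list[str]:
    """Sanitization happens here, before anything reaches the vector database."""
    cleaned = [sanitize(doc) for doc in documents]
    # embed_and_store(cleaned)  # hand the cleaned text to your indexing step
    return cleaned

print(ingest(["Contact Jane at jane.doe@example.com or 555-123-4567."]))
```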

Incoming prompt security (sanitization)

Just like with data sanitization, prompts should undergo a sanitization step before they are given to the LLM for processing. This step ensures the LLM isn’t responsible for detecting whether a prompt is malicious or inappropriate. Instead, the prompt goes through a series of checks and modifications before it’s submitted, allowing the LLM to answer questions instead of vetting them.

Part of that is understanding what data the LLM is allowed to share. For example, if a company has an AI chatbot (AI agent), it shouldn’t respond to ‘not safe for work’ prompts, no matter who is making the request. This type of cleansing can be handled through simple rules that govern tasks or queries.

The other side is context-dependent. For example, a query from a North American employee of an international company shouldn’t turn up results from Asia or Europe. These responses are not only unhelpful, but they may also breach security protocols.

A secondary issue with prompts is context flooding, where malicious actors pad the prompt with irrelevant information. This flood can push security instructions out of the LLM’s context window, causing it to access information or give responses it shouldn’t.

So while standard rules are helpful to create a baseline, classifying and understanding the context of prompts is essential to ensure the right users get the right answers specific to their use case.
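
Below is a hedged sketch of what such a pre-submission check might look like: a rule-based pass for disallowed topics, a crude guard against context flooding, and a retrieval filter scoped to the caller’s region. The `User` shape, keyword list, and length threshold are illustrative assumptions rather than recommended values.

```python
from dataclasses import dataclass

@dataclass
class User:
    id: str
    region: str          # e.g. "NA", "EU", "APAC"
    department: str

# Baseline rules: topics the assistant should never engage with,
# regardless of who is asking. Keywords here are purely illustrative.
BLOCKED_TOPICS = {"nsfw", "salary of", "source code dump"}

def check_prompt(user: User, prompt: str) -> dict:
    """Vet and scope a prompt before it is ever sent to the LLM."""
    lowered = prompt.lower()

    # 1. Rule-based rejection for clearly disallowed requests.
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return {"allowed": False, "reason": "blocked topic"}

    # 2. Crude flooding guard: reject prompts that try to drown out
    #    system instructions with huge amounts of filler context.
    if len(prompt) > 8_000:
        return {"allowed": False, "reason": "prompt too long"}

    # 3. Context scoping: constrain retrieval to the caller's region so
    #    the RAG step never surfaces documents from other jurisdictions.
    return {
        "allowed": True,
        "prompt": prompt,
        "retrieval_filter": {"region": user.region},
    }

decision = check_prompt(User("u_42", "NA", "sales"),
                        "Summarise last week's NA pipeline calls")
```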

Leverage established security principles

RAG and LLMs will always need updated security measures, but that doesn’t mean previous methods are obsolete. You can tailor established principles to the unique challenges of RAG. That includes practices like adding authentication and authorization steps to LLM access, as well as logging and monitoring user activity to detect and respond to security breaches.

But one of the most important traditional practices for LLMs and RAG is ‘shifting security left’. As we covered above, LLMs were never designed to be gatekeepers, which makes it difficult to build security into the retrieval system. So rather than relying solely on reactive controls at the output stage, developers need to take a more proactive approach by shifting security measures from governing output to embedding them throughout the development and deployment lifecycle.

Role-based access control (RBAC) and granular permissions are other practices companies can implement to limit which data sources users can access. By partitioning databases and requiring users to be authorized before the RAG system accesses data, you limit its exposure and reduce the risk of unauthorized access or data breaches.
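
As an illustration, the sketch below filters retrieved chunks against the caller’s permissions before they are assembled into the prompt. The `is_allowed` function is a hypothetical stand-in for whatever authorization layer you use; in practice that check would typically be delegated to a policy engine such as Cerbos rather than hard-coded role comparisons.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source: str              # e.g. "finance/q3-forecast.pdf"
    required_role: str       # role needed to read the source document

@dataclass
class Principal:
    id: str
    roles: set = field(default_factory=set)

def is_allowed(principal: Principal, chunk: Chunk) -> bool:
    """Hypothetical policy check -- in practice this call would go to your
    authorization layer (e.g. a policy engine such as Cerbos) rather than
    comparing roles inline."""
    return chunk.required_role in principal.roles

def authorize_context(principal: Principal, retrieved: list[Chunk]) -> list[Chunk]:
    """Drop any retrieved chunk the caller is not permitted to see,
    *before* it is assembled into the LLM prompt."""
    return [c for c in retrieved if is_allowed(principal, c)]

chunks = [
    Chunk("Q3 revenue forecast ...", "finance/q3-forecast.pdf", "finance"),
    Chunk("Public product FAQ ...", "docs/faq.md", "employee"),
]
user = Principal("u_7", roles={"employee"})
safe_context = authorize_context(user, chunks)   # only the FAQ chunk survives
```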

The future of LLMs and RAG

RAG technology is still in its early stages, which means it will continue to evolve, especially as it moves from cutting-edge technology to wider adoption. This broader audience will drive innovation and lead to the development of more sophisticated and specialized RAG systems.

Context windows will continue to grow to both accommodate more context and combat flooding that causes LLMs to make poor decisions.

As RAG-enhanced LLMs become more powerful and use cases expand, maintaining security, both by applying foundational practices properly and by innovating new solutions, will require constant vigilance.

If you’re interested in this topic and want to dive deeper, check out Cerbos for AI systems.

Book a free Policy Workshop to discuss your requirements and get your first policy written by the Cerbos team