Published on 05/01/2024
Last updated on 06/18/2024

How to ensure AI data integrity in multi-source environments


As AI deployments increase in number and usage, scrutiny is rising over the reliability and safety of AI systems. Under that scrutiny, ensuring the quality of your data, a discipline known as AI data integrity, is critical. Data is the foundation of any AI system, but a robust and trustworthy AI system requires that data to be accurate, complete, and reliable.

Data integrity can be challenging enough with only a single source of data. Integrating data from multiple sources introduces additional complexities and challenges. This makes data integrity a crucial issue for organizations with large, complex datasets that feed into AI systems. 

The integrity of your AI data is critically important, but ensuring it is fraught with challenges. A strong understanding of the threat landscape, paired with an AI implementation guided by best practices, will get your enterprise where it needs to be.

Understanding data integrity in AI

Data integrity refers to the quality and reliability of your data. Within the context of AI systems, the data can come in several forms. For example:

  • Training data is used to create AI/ML models, typically in conjunction with validation and test datasets during the model training process.
  • External data is often used when working with large language models (LLMs) through a process known as retrieval-augmented generation (RAG).
  • Feedback data is provided by users to fine-tune the effectiveness of AI systems. A common example is email spam filtering, where users are encouraged to flag specific emails as spam.

In each of these cases, the data feeds into an AI system and affects the subsequent outputs of that system. On this point, we recall the classic computer programming maxim: Garbage in, garbage out. If our data integrity is lacking or if adversaries manage to pollute our data sources, then our AI systems will be unable to provide meaningful and reliable responses. While data integrity is important for non-AI systems as well, it takes on critical importance in AI systems, which are often “black box” in nature.

In traditional software development, tracing the root cause of bugs or issues is usually straightforward, offering a clear route to problem resolution. However, AI systems operate differently. Because they’re built on complex algorithms and large sets of training data, how they arrive at the outputs is often likened to a “black box.” If you get an undesirable output, it’s not always clear what caused it, and these are difficult flaws to resolve. This is why enterprises need to ensure their training data is free of hidden biases or inaccuracies. Bad inputs will be nearly impossible to fix down the road.

Therefore, data integrity is essential both to the successful deployment of an AI system and to maintaining its ongoing value. This is especially true when an enterprise depends on the system to make critical decisions, such as whether a job applicant is suitable or whether a business loan application should be approved.

Common threats to AI data integrity

To better understand the problem, organizations should be familiar with common threats to AI data integrity.

Inadequate vetting of data sources

With any given dataset, the origin of that data is critical information. You need to know where it came from and how it was collected. The data should be fully traceable, from its origin, through its collection and any other pipelines, to its current place in the dataset.

Datasets that are too small or not comprehensive

In any statistical discipline, relying on datasets that are too small can lead to conclusions that, though supported by the data, are not broadly applicable. This problem is amplified in AI systems, where hidden biases in the data can cause a system to perpetuate those biases across its responses.

When a dataset isn't representative of the broader population it's meant to model, the result is sampling bias, which leads to skewed results. In AI, this occurs when a set of training data leans too heavily toward certain characteristics or groups; the system's output will mirror those biases. It's like making a decision after hearing only one side of an argument: an AI system with sampling bias in its training data will develop a skewed understanding and produce skewed responses.
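One simple, automatable guard is to check whether any single group dominates a categorical attribute of the training data. The sketch below is illustrative: the function name, attribute, and 60% threshold are assumptions, not a standard, and real systems would tune thresholds per attribute.

```python
from collections import Counter

def flag_sampling_bias(labels, max_share=0.6):
    """Flag groups whose share of the dataset exceeds max_share.

    `labels` is any categorical attribute of the training examples
    (e.g. demographic group or class label). The 0.6 threshold is
    purely illustrative.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total
            for group, count in counts.items()
            if count / total > max_share}

# A toy dataset where one group dominates:
groups = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
print(flag_sampling_bias(groups))  # → {'A': 0.8}
```

A check like this can run as part of dataset ingestion, failing the build when the distribution drifts outside expected bounds.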

Improperly secured data pipelines vulnerable to attack

A piece of data in a large dataset typically goes through multiple stages of collection and processing before arriving at its final destination. This is known as the data pipeline. But it’s rare for all stages in a data pipeline to be equally secured. Naturally, adversaries will look for weak points to inject false data. Once false data is in the system it can be difficult to identify.

Insufficient data verification

Data verification is a crucial stage where data is checked to ensure its accuracy and consistency. This verification should be performed at each major stage in the data pipeline. Without sufficient data verification, errors in one process will affect the data used in subsequent downstream processes.
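For pass-through stages that should neither drop nor alter records, one common verification pattern is comparing a record count plus a content fingerprint before and after the stage. This is a minimal sketch under the assumption that records are strings; the function names are hypothetical.

```python
import hashlib

def stage_fingerprint(records):
    """Order-independent fingerprint of a batch of records."""
    digest = hashlib.sha256()
    for rec in sorted(records):
        digest.update(rec.encode("utf-8"))
    return len(records), digest.hexdigest()

def verify_stage(before, after):
    """Check that a pass-through stage neither dropped nor altered records."""
    return stage_fingerprint(before) == stage_fingerprint(after)

batch_in = ["user=1,score=0.9", "user=2,score=0.4"]
batch_out = list(batch_in)                       # a faithful stage
assert verify_stage(batch_in, batch_out)
assert not verify_stage(batch_in, batch_in[:1])  # a dropped record is caught
```

Stages that intentionally transform data need a different check, such as spot-comparing transformed records against recomputed values, but the count-plus-fingerprint pattern catches silent loss or corruption cheaply.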

This is especially important when input data comes from user feedback. Feedback can be weaponized, as in the case of Microsoft’s Tay chatbot (read more in the section on “Data poisoning” below).

Unreliable or malicious vendors

In the case where a dataset comes from a third party, it is possible that the third party has not sufficiently verified the dataset. It is also possible that the dataset has been tampered with—perhaps by an adversary and without the vendor’s knowledge, or even maliciously by the vendor itself.

Data poisoning

When data integrity is intentionally attacked by an adversary, it is known as data poisoning. Data poisoning happens when harmful data is deliberately inserted into a system's dataset to compromise its integrity and functionality. The aim is to manipulate the system's behavior to bring about inaccurate or harmful outcomes.

For example, Microsoft’s Tay chatbot had a feedback mechanism through which it learned from users’ posts. Within 24 hours of launch, the bot began posting hateful and racist remarks, and Microsoft had to take it offline.

Another instance of data poisoning occurred when adversaries attempted to skew the Gmail spam filter. Using dummy accounts to mark spam emails as “not spam,” these adversaries attempted to trick the filter into letting their spam emails through.
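One defensive pattern against this kind of feedback poisoning is to look for accounts whose feedback is anomalously one-sided before letting that feedback influence the model. The sketch below is a simplified illustration; the thresholds and the `flag_suspicious_feedback` name are assumptions, and production systems combine many more signals.

```python
from collections import Counter

def flag_suspicious_feedback(events, min_events=20, max_flip_rate=0.8):
    """Flag accounts whose 'not spam' feedback rate is anomalously high.

    `events` is a list of (account_id, feedback) pairs; both thresholds
    are illustrative and would be tuned per system.
    """
    totals, flips = Counter(), Counter()
    for account, feedback in events:
        totals[account] += 1
        if feedback == "not_spam":
            flips[account] += 1
    return [account for account in totals
            if totals[account] >= min_events
            and flips[account] / totals[account] > max_flip_rate]

# A dummy account marking nearly everything "not spam":
events = ([("bot-1", "not_spam")] * 25
          + [("user-2", "not_spam")] * 2
          + [("user-2", "spam")] * 8)
print(flag_suspicious_feedback(events))  # → ['bot-1']
```

Flagged accounts can then be excluded or down-weighted before their feedback reaches the training pipeline.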

Best practices for ensuring data integrity

Maintaining data integrity requires a culture of proactive, end-to-end monitoring across the organization. Organizations with high data integrity have in place the following best practices.

Comprehensive assessment of data sources

Data integrity can only be verified when each data source is known and vetted. Begin by documenting the data pipelines for all the data that flows into your AI system. With this information in hand, continue to validate each data source by answering the following questions for each stage of the data pipeline:

  1. Is the data correct and complete?
  2. Is the data going through a conversion of any kind? If so, is the accuracy maintained during that conversion?
  3. How often is this data updated? Is it possible for this data to become stale?
  4. Who is responsible for this data source? How do we communicate with them if we have any questions or concerns?
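Several of these questions lend themselves to automation. The sketch below encodes completeness, staleness, and ownership checks against a hypothetical per-source metadata record; the field names, thresholds, and email address are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical metadata record kept for every data source.
SOURCE = {
    "name": "crm-export",
    "owner": "data-platform-team@example.com",
    "last_updated": datetime.now(timezone.utc) - timedelta(days=2),
    "required_fields": ["customer_id", "created_at"],
}

def check_source(source, sample_record, max_age_days=7):
    """Run automated checks from the checklist above (illustrative thresholds)."""
    issues = []
    # Question 1: correct and complete -- required fields present and non-empty.
    for field in source["required_fields"]:
        if not sample_record.get(field):
            issues.append(f"missing field: {field}")
    # Question 3: staleness -- was the source refreshed recently enough?
    age = datetime.now(timezone.utc) - source["last_updated"]
    if age > timedelta(days=max_age_days):
        issues.append("data may be stale")
    # Question 4: ownership -- is someone accountable for this source?
    if not source.get("owner"):
        issues.append("no owner on record")
    return issues

print(check_source(SOURCE, {"customer_id": "c-42", "created_at": "2024-05-01"}))
# → [] when all checks pass
```

Conversion accuracy (question 2) is harder to automate generically and is typically handled by stage-level verification further down the pipeline.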

Vendor due diligence and audits

Assessing your data source becomes even more critical when that source is a third-party vendor. Ensure that your vendor’s standards for security and data integrity match or surpass those of your organization. As your data continues to be sourced from that vendor, this auditing must be conducted regularly. 

Dataset diversity and size

To ensure the efficacy of your AI dataset, make sure that it is:

  • Sufficiently large
  • Comprehensive
  • Diverse

This can be a tall order, and what constitutes “sufficiently large” or “diverse” will depend on the specific AI system and its intended usage.

For example, if you were developing a facial-recognition AI system, a sufficiently large and diverse dataset might be millions of images featuring people of various ages, ethnicities, and expressions, captured under different lighting conditions. This would ensure your system can accurately identify faces across a wide range of scenarios, reflecting the diversity it will encounter in the real world. Otherwise, your system would have limited effectiveness.
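Beyond overall size, you can mechanically check coverage: which combinations of attributes appear in the dataset at all. The sketch below reports missing attribute-value combinations; it only considers values actually observed in the data, so real-world requirements (e.g. required age bands) would come from a separate specification. The names here are illustrative.

```python
from itertools import product

def coverage_gaps(records, attributes):
    """Report observed attribute-value combinations absent from the dataset."""
    seen = {tuple(r[a] for a in attributes) for r in records}
    domains = [sorted({r[a] for r in records}) for a in attributes]
    return [combo for combo in product(*domains) if combo not in seen]

# A toy face dataset with one uncovered combination:
faces = [
    {"age_band": "18-30", "lighting": "day"},
    {"age_band": "18-30", "lighting": "night"},
    {"age_band": "60+",   "lighting": "day"},
]
print(coverage_gaps(faces, ["age_band", "lighting"]))
# → [('60+', 'night')]
```

Gaps like this point to where additional data collection is needed before the system can be trusted across the scenarios it will actually encounter.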

Data pipeline security

Data pipelines introduce the possibility of data poisoning at each step in the process, and thus they must be rigorously secured. Implement the following measures at each stage, as any weak links could pollute the rest of the pipeline: 

  • Data validation and verification 
  • Access control 
  • Data encryption during transmission and at rest 
  • Audit trails 
  • Ongoing monitoring and error resolution 
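One concrete building block for several of these measures is signing records so that downstream stages can detect tampering in transit. This is a minimal sketch using an HMAC; in practice the key would come from a secrets manager, not a hard-coded constant.

```python
import hashlib
import hmac

# Hypothetical shared key; in practice this comes from a secrets manager.
PIPELINE_KEY = b"replace-with-managed-secret"

def sign_record(payload: bytes) -> str:
    """Attach an HMAC so downstream stages can detect tampering."""
    return hmac.new(PIPELINE_KEY, payload, hashlib.sha256).hexdigest()

def verify_record(payload: bytes, signature: str) -> bool:
    """Constant-time check that a record's signature still matches."""
    return hmac.compare_digest(sign_record(payload), signature)

record = b'{"user": 7, "label": "spam"}'
sig = sign_record(record)
assert verify_record(record, sig)                              # intact record passes
assert not verify_record(b'{"user": 7, "label": "ham"}', sig)  # tampering detected
```

Signatures like this complement, rather than replace, encryption: encryption hides the data in transit, while the HMAC proves it wasn't altered.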

Rigorous data verification protocols 

Data verification is a multi-layered process. It begins with data validation before it even enters your systems. From there, verify the data by comparing it to other, verified sources. Further down the pipeline, periodically compare the data against the original data source to ensure that any transformations have been performed accurately.

Many enterprises today are relying on data tools that show lineage (how a piece of data changes as it goes through the pipeline) and provenance (the origin, history, and lifecycle of a piece of data). These methodologies help you trace back your data as a way of tracking changes.
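The core idea behind lineage tracking can be sketched very simply: every transformation appends a step to a trail carried alongside the record. Dedicated tools do far more, but this illustrates the shape of the metadata; the field names here are assumptions.

```python
from datetime import datetime, timezone

def apply_stage(record, stage_name, transform):
    """Transform a record while appending a step to its lineage trail."""
    new = dict(record)
    new["value"] = transform(record["value"])
    new["lineage"] = record.get("lineage", []) + [{
        "stage": stage_name,
        "at": datetime.now(timezone.utc).isoformat(),
    }]
    return new

rec = {"value": " 42 ",
       "lineage": [{"stage": "ingest:crm-export",
                    "at": "2024-05-01T00:00:00Z"}]}
rec = apply_stage(rec, "clean:strip-whitespace", str.strip)
rec = apply_stage(rec, "cast:to-int", int)
print([step["stage"] for step in rec["lineage"]])
# → ['ingest:crm-export', 'clean:strip-whitespace', 'cast:to-int']
```

When an output looks wrong, a trail like this lets you walk back stage by stage to find where the value diverged from the source.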

Suggested frameworks and tools 

As many organizations begin new endeavors with GenAI use cases, they are on the lookout for techniques and tools to help them implement AI data integrity best practices. 

For data pipelines and audit trail maintenance, data provenance tooling such as CamFlow or Linux Provenance Modules can be helpful. In addition, you can use techniques such as data unit tests and data observability monitoring to maintain data integrity over data pipelines. 
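A data unit test applies the familiar software-testing idea to datasets: assert properties every valid batch must satisfy, and fail the pipeline run when they don't hold. The sketch below shows the pattern for a hypothetical user-events table; the column names and rules are illustrative.

```python
def run_data_unit_tests(rows):
    """A few illustrative data unit tests for a user-events table."""
    failures = []
    if not rows:
        failures.append("dataset is empty")
    for i, row in enumerate(rows):
        if row.get("user_id") is None:
            failures.append(f"row {i}: user_id is null")
        if not (0.0 <= row.get("score", -1) <= 1.0):
            failures.append(f"row {i}: score out of range")
    if len({r.get("user_id") for r in rows}) != len(rows):
        failures.append("duplicate user_id values")
    return failures

good = [{"user_id": 1, "score": 0.2}, {"user_id": 2, "score": 0.9}]
bad = [{"user_id": 1, "score": 1.7}, {"user_id": 1, "score": 0.5}]
assert run_data_unit_tests(good) == []
assert run_data_unit_tests(bad) == ["row 0: score out of range",
                                    "duplicate user_id values"]
```

In practice, frameworks built for this purpose let you declare such expectations once and run them automatically on every batch.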

To maintain the security of your data pipelines, integrate the use of a Software Bill of Materials (SBOM) for your data pipeline tooling. An SBOM is a software supply chain security tool that helps you detect security vulnerabilities that may exist within your application’s dependencies. Data validation tooling can aid in automating validation and ongoing monitoring of your data. 

Building trust in AI systems through AI data integrity 

Ensuring your AI data integrity is crucial for the success of AI in your business. Everything boils down to starting with good data: making sure it’s accurate, secure, and comes from reliable sources. This foundation is essential not only for building AI systems that work well but also for gaining the trust of those who use them. 

To build these trustworthy AI systems, modern enterprises focus on establishing straightforward, effective strategies to protect data integrity. This means checking where data comes from, keeping it safe, and constantly verifying its accuracy. 

Outshift is at the forefront of pioneering trustworthy and user-friendly AI, helping businesses navigate the challenges of AI by offering the tools and expertise to ensure their AI systems are built on solid, reliable data. At Outshift, our aim is to empower businesses to use AI with confidence, knowing their systems are as trustworthy as they are powerful. 

Read more about Outshift’s take on AI in our blog post, “Why we care about trustworthy and responsible AI.”
