STRATEGY & INSIGHTS

8 min read

by

Ashley Altus

Published on 05/06/2024

Last updated on 02/03/2025

Responsible enterprise LLMs: Addressing accuracy and LLM bias challenges (Part 1)

After its release in November 2022, the sudden popularity of OpenAI’s ChatGPT marked a shift in artificial intelligence (AI). Based on a large language model (LLM), the tool promised to generate accurate responses to various prompts. LLMs are AI models trained on extensive datasets designed to understand and create human-like replies to your questions and instructions.

Fairly quickly, users discovered some of the tool’s main limitations, including its potential for bias and misleading information. While these issues may have concerned early enterprise adopters, models have since evolved, and a number of techniques can now help minimize bias and errors.

Reliable outputs are the first of three main challenges enterprises face when investing in LLM transformation. The other two are keeping LLMs safe and secure and complying with AI standards and regulations.

Enterprise applications of large language models

LLMs can be adapted and scaled to support a variety of business functions, from creating meeting transcripts to writing marketing content and publishing financial reports. These models help organizations increase efficiency and save costs while improving services like customer chatbots. As enterprise databases grow, LLMs can help your organization get more value out of that information at scale to inform strategic business decisions.

Common challenges with LLM outputs

To perform effectively in enterprise applications, LLMs are trained with immense data from sources like the web, internal databases, or user prompts. LLM outputs simply reflect this input, meaning they’re only as accurate, reliable, or unbiased as the training data from which they’re built.

Because humans create training data, it will always contain some bias that may surface through LLM outputs. There’s also the issue of accuracy. Unless your LLM has been trained on specialized information, it may generate content that isn’t helpful or detailed enough for your use case. In some instances, LLMs have been unreliable in discerning factual news from dubious sources, which could contribute to spreading misinformation.

These challenges raise ethical concerns, especially if outputs are discriminatory or violate content standards set by industry regulators. For example, LLM-generated medical documentation containing bias could have profound legal implications, not to mention causing harm to patients. Similarly, if the model makes a mistake when applied to cybersecurity software, this could lead to costly data breaches. Virtual assistants and customer service chatbots can also impact user trust when they generate unreliable or biased results.

Some of the common reliability challenges users may experience when prompting LLMs include:

Outdated information. Training data remains static, so LLM outputs can quickly become outdated—especially in areas like the sciences, politics, technology, or medical applications, where knowledge changes rapidly. Creating prompts from current trends or events can yield vague, obsolete, or false responses.
Surface-level outputs. Due to their vast and varied training datasets, LLMs often lack in-depth knowledge of specialized topics and develop a “jack of all trades, master of none” tendency. While the model can be prompted on many subjects, outputs may have a generic tone and simulate common knowledge rather than subject matter expertise.
Hallucinations. LLMs are designed to respond to prompts as well as possible with the knowledge they’ve gained during training. While these datasets are extensive, they still have limitations, and models sometimes fill the gaps with false, contradictory, or irrelevant information. For example, ChatGPT may cite research papers that don’t exist when prompted about complex or highly specialized medical topics. In one study on creating scientific abstracts, researchers found that GPT-4 had a hallucination rate of 29%.
Weak reasoning. Even with diverse, high-quality training data, LLMs often struggle with complex prompts. Models may provide incorrect responses to requests that require advanced reasoning or pose mathematical problems, even relatively straightforward ones.
Harmful or discriminatory outputs. When training data biases emerge within outputs, LLMs may communicate offensive stereotypes or discriminate based on gender, age, ethnicity, or disability. For example, research on gender bias found that LLMs are three to six times more likely to assign occupations to people based on traditional gender roles.

Practical techniques for more reliable outputs

Understanding these limitations is the first step in adopting practices and techniques to help your LLM produce better responses. Starting with optimized training data is beneficial, but using retrieval augmented generation (RAG) and advanced prompting techniques is crucial for success.

Responsible data management

Because LLM outputs reflect training data, it’s important to ensure that this data is diverse. Including information from a wide variety of different regions, languages, cultures, and perspectives exposes the LLM to many different representations of the human experience.

Some organizations also develop detection models designed to identify bias in training data. Although these models can mitigate bias, eradicating it entirely is an inherently complex challenge, because LLM biases are often deeply rooted in training data and identifying them is subjective.

As a best practice, cleansing your data before training—ensuring that it’s correct, complete, consistent, and relevant—is crucial for generating more accurate results.
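As a rough illustration of the cleansing step described above, the sketch below drops incomplete records and exact duplicates and normalizes whitespace. The record schema (`text` and `source` fields) and the function name `clean_records` are assumptions made for this example, not part of any particular training pipeline, and real-world cleansing typically involves many more checks.

```python
from typing import Iterable

# Hypothetical schema: every record must carry these non-empty fields.
REQUIRED_FIELDS = ("text", "source")


def clean_records(records: Iterable[dict]) -> list[dict]:
    """Drop incomplete records and duplicates, and normalize whitespace."""
    seen = set()
    cleaned = []
    for record in records:
        # Completeness: skip records with a missing or blank required field.
        incomplete = False
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if value is None or not str(value).strip():
                incomplete = True
                break
        if incomplete:
            continue

        # Consistency: collapse runs of whitespace into single spaces.
        text = " ".join(str(record["text"]).split())

        # Deduplication: compare case-insensitively on the normalized text.
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)

        cleaned.append({**record, "text": text})
    return cleaned
```

Checks for correctness and relevance usually depend on the domain, so they are left out of this sketch.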

Retrieval augmented generation (RAG)

RAG is a method of supplementing an LLM with external information at query time so that its outputs are more up to date. Unlike fine-tuning, it does not change the model itself. It’s also useful for improving output accuracy for specific topics. Put simply, RAG uses a knowledge base containing updated or specialized data, which supplements the model’s initial training data. This knowledge base is converted into embeddings (numerical vectors) so that its contents can be retrieved based on semantic meaning and context. When a user submits a prompt, the most relevant passages are retrieved and passed to the model alongside it. Users can then make domain-specific prompts and receive more detailed, current, and accurate responses, even if the model’s original training data remains outdated.

A knowledge base built with a more edited, bias-conscious dataset is an effective way to reduce output bias in existing models. RAG also helps users avoid hallucinations or errors caused by outdated training data that lacks subject matter expertise. LLMs supported with RAG have been shown to improve output accuracy significantly.

In one study, a RAG knowledge base was created using clinical documents for preoperative medicine. Outputs had 91.4% accuracy with RAG, compared to 80.1% without RAG and 86.3% for responses from junior doctors.
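The retrieval step of RAG can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `retrieve` and `build_prompt` are hypothetical. A production system would use a trained embedding model and a vector store instead.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k passages most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        knowledge_base,
        key=lambda passage: cosine(query_vector, embed(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, knowledge_base: list[str], top_k: int = 2) -> str:
    """Combine the retrieved passages and the question into one prompt."""
    passages = retrieve(question, knowledge_base, top_k)
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The string returned by `build_prompt` would then be sent to the LLM. Because the model sees the retrieved passages at query time, the knowledge base can be updated without retraining the model.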

Prompting techniques

Adjusting how the LLM is prompted is another way to generate more accurate results. The most straightforward technique is a zero-shot prompt, in which the LLM is asked to complete a task directly, without being shown any examples of the desired output. For example, imagine asking a model to translate a sentence from English to German. Even if it hasn’t been specifically trained on this task, its general understanding of language structure gained during training will most likely deliver an adequate response.

However, zero-shot prompting often isn’t enough to produce accurate results for more complex requests. In this case, several other prompting techniques can help improve output quality:

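Chain of thought prompting asks the LLM to work through intermediate reasoning steps before giving its final answer, often by including worked examples in the prompt. This does not retrain the model; it guides the model toward more accurate responses on multi-step problems.
Self-consistency gives the model the same prompt multiple times, sampling a different reasoning path each time. The answer that appears most often across the responses is used as the final output.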
Least-to-most prompting breaks down a complex prompt into smaller, simpler sub-prompts. The LLM then solves each of these prompts in sequence to generate the output.
Tree of thoughts (ToT) is an approach that enables an LLM to mimic human reasoning. With ToT, the LLM explores several different reasoning paths based on the prompt. Then, it self-evaluates the intermediate steps along each path and pursues the most promising one. In one study, GPT-4 solved the Game of 24 with 74% accuracy using ToT, compared to just 4% accuracy with standard chain of thought prompting.
Reasoning via planning (RAP) goes beyond ToT by incorporating a “world model.” A world model allows the LLM to examine the external variables and environments associated with each reasoning path so that it can anticipate future outcomes and rewards. By considering these variables, the LLM is better equipped to determine the best reasoning path.
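To make one of these techniques concrete, here is a minimal sketch of self-consistency layered on a chain of thought instruction. It assumes a caller-supplied `sample` function that sends a prompt to an LLM and returns one completion; that function, the `COT_SUFFIX` wording, and the `Answer:` output convention are all hypothetical choices for this example rather than any vendor’s API.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Hypothetical instruction asking for step-by-step reasoning and a
# machine-readable final line.
COT_SUFFIX = (
    "\nThink step by step, then give the final answer on a last line "
    "starting with 'Answer:'."
)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def self_consistent_answer(
    prompt: str,
    sample: Callable[[str], str],
    n_samples: int = 5,
) -> Optional[str]:
    """Sample several completions and return the most common final answer."""
    votes: Counter = Counter()
    for _ in range(n_samples):
        answer = extract_answer(sample(prompt + COT_SUFFIX))
        if answer is not None:
            votes[answer] += 1
    # Majority vote; returns None if no completion contained an answer.
    return votes.most_common(1)[0][0] if votes else None
```

For the votes to differ, `sample` needs to produce varied completions, which in practice usually means sampling with a nonzero temperature. Each extra sample is another model call, so accuracy gains come at a proportional cost.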

Ensuring reliable enterprise LLM outputs is a continuous journey

Enterprise LLM usage is exploding because of its compelling advantages. These models are an effective way to improve operational efficiency, enhance user experiences, and adapt and scale for a diverse range of applications. But to accomplish these goals, organizations are responsible for ensuring that LLMs generate outputs that are accurate and helpful and minimize harmful bias.

Using clean and diverse training data, retrieval frameworks like RAG, and advanced prompting techniques are effective ways to make outputs more reliable. However, as AI technology advances, enterprises will have to reevaluate these solutions regularly to further reduce bias and improve LLM performance.

While these practices can transform how your organization benefits from LLMs, output reliability is just one piece of the puzzle. In the next article in this series, we’ll discuss how to keep your LLM infrastructure and data safe and secure.

Explore the fundamentals of LLMs in our Breakdown series where we simplify emerging tech topics.



