[LangChain Series] Part 9: Evaluating LLM Applications

Posted by Freedom123 on 2024/05/25 20:34:33


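As large language models (LLMs) continue to advance, the applications built on them are becoming more complex and sophisticated. With that complexity, evaluating the performance and accuracy of LLM-based applications becomes harder as well. In this post we look at LLM application evaluation and at the frameworks and tools that can help you assess and improve your application's performance.

I. Building the QA Application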

import os
from dotenv import load_dotenv, find_dotenv
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores.docarray import DocArrayInMemorySearch
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_openai import ChatOpenAI

# Load environment variables (such as the OpenAI API key) from a .env file
_ = load_dotenv(find_dotenv())
# Notebook idiom: "__file__" is a plain string here, so this resolves
# relative to the current working directory
notebook_path = os.path.abspath("__file__")
notebook_directory = os.path.dirname(notebook_path)
csv_file_path = os.path.join(notebook_directory, '..', 'OutdoorClothingCatalog_1000.csv')
# Load every CSV row as a document and index the documents in memory
loader = CSVLoader(file_path=csv_file_path)
data = loader.load()
index = VectorstoreIndexCreator(vectorstore_cls=DocArrayInMemorySearch).from_loaders(
    [loader]
)
# temperature=0.0 keeps the answers as repeatable as possible
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)
# Retrieval QA chain: retrieve documents, then "stuff" them into one prompt
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={"document_separator": "<<<<>>>>>"},
)
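
To make the chain_type="stuff" setting above more concrete, here is a minimal sketch of the idea behind it: all retrieved documents are joined into a single context string with the separator and placed into one prompt. The helper name `build_stuff_prompt` and the prompt wording are assumptions made for illustration; this is not LangChain's internal implementation.

```python
from typing import List


# Hypothetical helper for illustration only; the real chain is more elaborate.
def build_stuff_prompt(docs: List[str], question: str,
                       separator: str = "<<<<>>>>>") -> str:
    """Join all retrieved documents into one context block and build one prompt."""
    context = separator.join(docs)
    return (
        "Use the following pieces of context to answer the user's question.\n"
        f"{context}\n"
        f"Question: {question}"
    )


# Two made-up documents standing in for retrieved catalog rows
docs = [
    "name: Product A\ndescription: has side pockets",
    "name: Product B\ndescription: no pockets",
]
prompt = build_stuff_prompt(docs, "Does Product A have side pockets?")
print(prompt)
```

Because everything goes into one prompt, this approach is simple but only works while the combined documents fit within the model's context window.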

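II. Building Test Data

Before we can evaluate an LLM application, we need a reliable set of test data. There are two main ways to generate it:

1. Creating Examples Manually

The traditional approach is to review your data by hand and write query-answer pairs. Suppose you are working with a clothing dataset. You can browse the descriptions and write questions such as "Does the Cozy Comfort Pullover Set have side pockets?" along with the corresponding answers. This approach gives you full control over the examples, but it is time-consuming and does not scale well to larger datasets.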

# Hard-coded examples
examples = [
    {
        # Adjacent string literals are joined, so the code indentation
        # does not end up inside the query text
        "query": "Do the Cozy Comfort Pullover Set "
        "have side pockets?",
        "answer": "Yes",
    },
    {
        "query": "What collection is the Ultra-Lofty "
        "850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]
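
Hand-written examples can easily pick up stray whitespace or miss a field. The following is a small sketch of a sanity check that could be run before evaluation; the helper name `clean_examples` and the list of required keys are assumptions for illustration, based on the "query" and "answer" keys used in this article.

```python
from typing import Dict, List

# Keys used by the examples in this article
REQUIRED_KEYS = ("query", "answer")


# Hypothetical helper, not part of LangChain
def clean_examples(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Check that each example has the required keys and collapse extra whitespace."""
    cleaned = []
    for i, ex in enumerate(examples):
        missing = [k for k in REQUIRED_KEYS if k not in ex]
        if missing:
            raise ValueError(f"example {i} is missing keys: {missing}")
        # " ".join(s.split()) collapses any run of whitespace into one space
        cleaned.append({k: " ".join(ex[k].split()) for k in REQUIRED_KEYS})
    return cleaned


raw = [{"query": "Do the Cozy Comfort Pullover Set         have side pockets?",
        "answer": "Yes"}]
cleaned = clean_examples(raw)
print(cleaned[0]["query"])
```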

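2. Generating Examples with an LLM

You can also use an LLM itself to generate test data. LangChain provides QAGenerateChain, which automatically generates query-answer pairs from your documents. In effect, it acts as an AI assistant that writes hypothetical questions and answers based on your data.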

from langchain.evaluation.qa import QAGenerateChain
from pprint import pprint
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))
new_examples = example_gen_chain.batch([{"doc": t} for t in data[:5]])
pprint(new_examples[0]["qa_pairs"])
# Output
# {'answer': "The approximate weight of the Women's Campside Oxfords per pair is "
#            '1 lb. 1 oz.',
#  'query': "What is the approximate weight of the Women's Campside Oxfords per "
#           'pair?'}

data[0]
# Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", 
#         metadata={'source': '/home/voldemort/Downloads/Code/Langchain_Harrison_Chase/Course_1/OutdoorClothingCatalog_1000.csv', 'row': 0})
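
As a rough illustration of how a generated question-answer text can be turned into the {'query': ..., 'answer': ...} dictionary shown above, here is a minimal parsing sketch. The "QUESTION: ... ANSWER: ..." text format, the regular expression, and the helper name `parse_qa` are assumptions for illustration; they are not taken from this article and may differ from what QAGenerateChain does internally.

```python
import re
from typing import Dict

# Assumed raw format of the model's reply: "QUESTION: ...\nANSWER: ..."
QA_PATTERN = re.compile(r"QUESTION:\s*(.*?)\n+ANSWER:\s*(.*)", re.DOTALL)


# Hypothetical helper for illustration only
def parse_qa(text: str) -> Dict[str, str]:
    """Extract a query/answer pair from a QUESTION/ANSWER formatted string."""
    match = QA_PATTERN.search(text)
    if match is None:
        raise ValueError("text does not contain a QUESTION/ANSWER pair")
    return {"query": match.group(1).strip(), "answer": match.group(2).strip()}


sample = (
    "QUESTION: What is the approximate weight of the Women's Campside Oxfords per pair?\n"
    "ANSWER: 1 lb. 1 oz."
)
pair = parse_qa(sample)
print(pair)
```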

By combining hand-written examples with LLM-generated ones, you can quickly build a solid test dataset.

examples.extend([inst["qa_pairs"] for inst in new_examples])

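III. Manual Evaluation and Debugging

With test data in hand, it is time to evaluate the performance of your LLM application. The simplest approach is to run the examples through the application and inspect the final output.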

qa.invoke(examples[-1]["query"])
# Output
# Entering new RetrievalQA chain...
# Finished chain.
# {'query': 'What technology is used in the EcoFlex 3L Storm Pants to make them more breathable and keep the wearer dry and comfortable?',
#  'result': 'The EcoFlex 3L Storm Pants use TEK O2 technology to make them more breathable and keep the wearer dry and comfortable.'}

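This approach has limits, however: it gives no insight into the intermediate steps of the application's flow or into where a problem might originate.

1. Running Examples Through the Application

To get a deeper view of your application's behavior, LangChain provides the langchain.debug flag. When it is enabled, detailed information is printed at every step of the application's execution, including prompts, context, and intermediate results.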

import langchain
langchain.debug = True
qa.invoke(examples[0]["query"])

By inspecting this output, you can spot potential problems in the retrieval or prompting steps and locate and fix issues more effectively.

"""
Output:

> Entering new RetrievalQA chain...
> Entering Chain run with input:
{
  "query": "Do the Cozy Comfort Pullover Set have side pockets?"
}
> Entering StuffDocumentsChain run with input:
[inputs]
> Entering LLMChain run with input:
{
  "question": "Do the Cozy Comfort Pullover Set have side pockets?",
  "context": ": 73\nname: Cozy Cuddles Knit Pullover Set\n...
}
[llm/start] Entering LLM run with input:
{
  "prompts": [
    "System: Use the following pieces of context to answer the user's question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n: 73\nname: Cozy Cuddles Knit Pullover Set\n...
Human: Do the Cozy Comfort Pullover Set have side pockets?"
  ]
}
[llm/end] [1.89s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "Yes, the Cozy Comfort Pullover Set does have side pockets.",
        ...
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "completion_tokens": 14,
      "prompt_tokens": 733,
      "total_tokens": 747
    },
    "model_name": "gpt-3.5-turbo",
    "system_fingerprint": "fp_3b956da36b"
  },
  "run": null
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "output_text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [2.36s] Exiting Chain run with output:
{
  "result": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
"""

# Final Output:
# {'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
#  'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}
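
One thing the debug output makes possible is checking whether the retrieved context actually mentions the product named in the query. In the trace above, the visible start of the context is the "Cozy Cuddles Knit Pullover Set", while the query asks about the "Cozy Comfort Pullover Set"; the rest of the context is elided, so this may or may not be a real retrieval mismatch. The sketch below shows a simple check of this kind; the helper name `mentions_phrase` is an assumption for illustration.

```python
# Hypothetical helper for illustration only
def mentions_phrase(context: str, phrase: str) -> bool:
    """Case-insensitive substring check that ignores differences in whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(phrase) in normalize(context)


# Only the visible start of the retrieved context from the debug output;
# the remainder is elided in the article and is not reproduced here.
context_excerpt = ": 73\nname: Cozy Cuddles Knit Pullover Set\n"

print(mentions_phrase(context_excerpt, "Cozy Comfort Pullover Set"))       # False
print(mentions_phrase(context_excerpt, "Cozy Cuddles Knit Pullover Set"))  # True
```

A check like this is crude, but it can flag answers that deserve a closer manual look even when the final result reads plausibly.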

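IV. LLM-Assisted Evaluation

Manual evaluation is valuable, but as the number of examples grows it quickly becomes tedious and subjective. This is where LLM-assisted evaluation comes in.

1. Getting Predictions for the Examples

The first step is to run your examples through the LLM application and collect the predictions.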

predictions = qa.batch(inputs=examples)

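2. Grading with QAEvalChain

LangChain provides QAEvalChain, an LLM-based chain designed to grade the correctness of your application's predictions. It relies on the LLM's ability to judge semantic similarity, so a prediction can be graded as correct even when it does not match the expected answer word for word.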

from langchain.evaluation import QAEvalChain
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)
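
To see why a semantic grader is useful here, consider what a plain string comparison would do with the first hand-written example. The sketch below is only a baseline for comparison; the helper name `exact_match` is an assumption for illustration and is not part of LangChain.

```python
# Hypothetical baseline grader for illustration only
def exact_match(predicted: str, expected: str) -> bool:
    """Return True only if both strings are equal, ignoring case and outer whitespace."""
    return predicted.strip().lower() == expected.strip().lower()


expected = "Yes"
predicted = "Yes, the Cozy Comfort Pullover Set does have side pockets."

# The prediction agrees with the expected answer, yet the strings differ
print(exact_match(predicted, expected))  # False
```

An exact-match grader would mark this prediction as wrong even though it agrees with the reference answer, which is the gap an LLM-based grader is meant to close.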

With the graded outputs, you can quickly identify areas that need improvement and iterate on your LLM application.

for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]["query"])
    print("Real Answer: " + predictions[i]["answer"])
    print("Predicted Answer: " + predictions[i]["result"])
    print("Predicted Grade: " + graded_outputs[i]["results"])
    print()

The final output looks similar to the following:

Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".
Predicted Answer: The dimensions of the small size of the Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28", and the dimensions of the medium size are 22.5" x 34.5".
Predicted Grade: CORRECT

Example 4:
Question: What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?
Real Answer: Some key features of the swimsuit include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom, secure fit, and maximum coverage.
Predicted Answer: Some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece are:
- Bright colors, ruffles, and exclusive whimsical prints
- Four-way-stretch and chlorine-resistant fabric
- UPF 50+ rated fabric for high sun protection
- Crossover no-slip straps and fully lined bottom for a secure fit and coverage
- Machine washable and line dry for best results
Predicted Grade: CORRECT

Example 5:
Question: What is the fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts?
Real Answer: The body of the tankini top is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts is as follows:
- Body: 82% recycled nylon, 18% Lycra® spandex
- Lining: 90% recycled nylon, 10% Lycra® spandex
Predicted Grade: CORRECT

Example 6:
Question: What technology is featured in the EcoFlex 3L Storm Pants that makes them more breathable?
Real Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which offers the most breathability ever tested.
Predicted Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which is a state-of-the-art air-permeable technology that offers the most breathability tested by the brand.
Predicted Grade: CORRECT

graded_outputs[-1]
# {'results': 'CORRECT'}
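
Once the number of examples grows, reading every grade by hand stops being practical. The following sketch summarizes a list of graded outputs into counts and an accuracy figure. It assumes each item has a "results" key holding "CORRECT" or "INCORRECT", matching the shape shown above; the helper name `summarize_grades` and the sample grades (including the one "INCORRECT" entry) are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Union


# Hypothetical helper for illustration only
def summarize_grades(graded_outputs: List[Dict[str, str]]) -> Dict[str, Union[int, float]]:
    """Count grades and compute the share of examples graded CORRECT."""
    counts = Counter(g["results"].strip().upper() for g in graded_outputs)
    total = sum(counts.values())
    correct = counts.get("CORRECT", 0)
    accuracy = correct / total if total else 0.0
    return {"total": total, "correct": correct, "accuracy": accuracy}


# Made-up grades, not the results reported in this article
sample_grades = [{"results": "CORRECT"}] * 3 + [{"results": "INCORRECT"}]
summary = summarize_grades(sample_grades)
print(summary)
```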

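Summary

Evaluating an LLM application is a key step in ensuring its reliability and performance. With tools such as LangChain's QAGenerateChain, langchain.debug, QAEvalChain, and the LangChain evaluation platform, you can streamline the evaluation process, gain deeper insight into your application's behavior, and iterate more efficiently. Whether you are an experienced machine learning practitioner or just getting started, these frameworks and tools can help you get the most out of your LLM applications.

The author is a columnist with a passion for artificial intelligence, dedicated to sharing the latest knowledge, techniques, and trends in the field. Here you can learn about new applications and innovations in AI, discuss its impact on society, and explore the science and engineering behind it. Likes, comments, and bookmarks are welcome. Let's explore AI together and watch the technology move forward!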

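[Notice] This content comes from a blogger in the Huawei Cloud Developer Community and does not represent the views or positions of Huawei Cloud or the Huawei Cloud Developer Community. When reposting, you must credit the source of the article (Huawei Cloud Community), the article link, the author, and other basic information; otherwise the author and the community reserve the right to pursue liability. If you find content in this community suspected of plagiarism, please report it by email with supporting evidence. Once verified, the community will immediately remove the infringing content. Report email: cloudbbs@huaweicloud.com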
