RAG Notes

以灰桦

以灰桦 · Published 2026-06-07 21:39:39

Abstract

This week I focused on learning and practicing vector storage and vector retrieval. I studied the core principles and use cases of vector stores, then implemented and verified both the built-in in-memory store (InMemoryVectorStore) and an external persistent vector database (Chroma), covering document embedding, ingestion, and similarity search. I also worked through the full flow of building prompts from retrieval results: the retrieved context is injected dynamically into a prompt template, which helps improve answer accuracy.

1. Vector Stores

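A vector store is a database designed specifically for vector data. Its core job is to store the vectors (embeddings) produced from text, images, and other data efficiently, and to support fast similarity search over them. It is one of the core components that lets large language models do knowledge-base question answering, semantic search, and similar tasks.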
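To make "similarity search" concrete, here is a minimal sketch of ranking stored vectors by cosine similarity to a query vector. The three-dimensional vectors and the names `cosine_similarity`, `top_k`, and `stored_vectors` are made up for illustration; a real embedding model returns much longer vectors, and production vector stores typically use approximate indexes rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Return the k (doc_id, score) pairs whose vectors are most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings, chosen by hand for this example.
stored_vectors = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 1.0, 0.0],
    "doc3": [0.7, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

ranking = top_k(query_vector, stored_vectors, k=2)
print(ranking)
```

The two highest-scoring entries come back first, which is the same Top-K idea that the retrieval step relies on.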

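Using vector stores

Built-in vector store (InMemoryVectorStore as the example): this is the in-memory vector store that LangChain provides. It needs no separate database installation, so it suits quick tests and development. The code imports InMemoryVectorStore, initializes the store, and binds DashScopeEmbeddings as the embedding generator. It then calls add_documents to add documents with custom IDs, which makes it possible to remove specific documents later by ID with the delete method. Finally, similarity_search runs a text-based similarity search and returns the matching documents.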

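# Import the in-memory vector store, the document type, and the embedding model
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_community.embeddings import DashScopeEmbeddings

# 1. Initialize the embedding model
#    (needs the dashscope package and an API key, e.g. the DASHSCOPE_API_KEY environment variable)
embeddings = DashScopeEmbeddings()

# 2. Create the built-in (in-memory) vector store
vector_store = InMemoryVectorStore(embedding=embeddings)

# 3. Add documents with custom IDs
docs = [
    Document(page_content="Vector stores enable efficient retrieval"),
    Document(page_content="InMemoryVectorStore is suited to testing")
]
vector_store.add_documents(docs, ids=["doc1", "doc2"])

# 4. Similarity search
results = vector_store.similarity_search("What is a vector store?")
print(results)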
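The add, delete-by-ID, and search steps described above can be pictured with a toy stand-in written in plain Python. This is only a sketch of the idea, not LangChain's implementation: `ToyMemoryStore` and `toy_embed` are hypothetical names, and the character-count "embedding" is a placeholder for a real model such as DashScopeEmbeddings.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical embedding: lowercase character counts (placeholder for a real model)."""
    return Counter(text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[ch] * b[ch] for ch in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyMemoryStore:
    """Keeps everything in a dict keyed by document ID; contents vanish when the process exits."""

    def __init__(self):
        self._docs = {}  # doc_id -> (text, vector)

    def add(self, texts, ids):
        for doc_id, text in zip(ids, texts):
            self._docs[doc_id] = (text, toy_embed(text))

    def delete(self, ids):
        for doc_id in ids:
            self._docs.pop(doc_id, None)  # unknown IDs are ignored

    def search(self, query, k=4):
        query_vec = toy_embed(query)
        ranked = sorted(self._docs.values(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = ToyMemoryStore()
store.add(
    ["vector stores enable fast retrieval", "in-memory stores suit testing"],
    ids=["doc1", "doc2"],
)
store.delete(["doc1"])  # remove one document by its custom ID
remaining = store.search("what is a vector store?", k=2)
print(remaining)
```

Because documents are keyed by ID, deleting "doc1" leaves only the second document available to later searches, which is why assigning IDs at insertion time is useful.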

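External vector store (Chroma as the example): Chroma is a standalone open-source vector database that supports persistent storage, which makes it a better fit for production use. The code imports Chroma and initializes the store with a collection name, DashScopeEmbeddings as the embedding generator, and a local persistence directory. Data is written to that path and survives program restarts, so compared with in-memory storage it is better suited to knowledge bases that are used over the long term.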

# Import the external vector store Chroma, the document type, and the embedding model
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import DashScopeEmbeddings

# 1. Initialize the embedding model
embeddings = DashScopeEmbeddings()

# 2. Create the external (persistent) vector store
vector_store = Chroma(
    collection_name="my_knowledge_base",
    embedding_function=embeddings,
    persist_directory="./chroma_db"  # local persistence directory
)

# 3. Add documents
docs = [
    Document(page_content="Chroma is an open-source vector database"),
    Document(page_content="It supports persistence and suits production environments")
]
vector_store.add_documents(docs)

# 4. Similarity search
results = vector_store.similarity_search("Which vector store should be used in production?")
print(results)
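The practical difference from the in-memory store is that records written in one run are still there in the next. The sketch below illustrates only that idea with a JSON file in a temporary directory; it is not how Chroma stores data on disk, and `save_store`, `load_store`, and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_store(path, records):
    """Write all records (doc_id -> text and vector) to disk."""
    Path(path).write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")

def load_store(path):
    """Read records back; an absent file means an empty store."""
    file_path = Path(path)
    if not file_path.exists():
        return {}
    return json.loads(file_path.read_text(encoding="utf-8"))

with tempfile.TemporaryDirectory() as tmp_dir:
    store_file = Path(tmp_dir) / "toy_store.json"

    # First "run": embed and save (the vector here is a made-up placeholder).
    save_store(store_file, {
        "doc1": {"text": "Chroma is an open-source vector database", "vector": [0.1, 0.9]},
    })

    # Second "run": nothing is kept in memory, yet the data can be loaded again.
    reloaded = load_store(store_file)

print(reloaded["doc1"]["text"])
```

With a persistent store, the embedding step for existing documents does not have to be repeated after a restart, which matters when the knowledge base is large or the embedding API is billed per call.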

2. Building Prompts from Vector Retrieval

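Building a prompt from vector retrieval works as follows. Private or domain-specific documents are first converted to vectors and stored in a vector store. When a user asks a question, the question is also converted to a vector, and the Top-K most relevant document chunks are recalled from the store. Those chunks and the user's question are then filled into a prompt template, so the large model answers from the supplied context, which reduces hallucination and improves accuracy. Example code: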

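from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_core.documents import Document

# 1. Initialize the embedding model and the external (persistent) vector store
embeddings = DashScopeEmbeddings()
vector_store = Chroma(
    collection_name="my_knowledge_base",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

# 2. Store the knowledge-base documents
#    (run this step only on the first execution; re-running it inserts the same documents again)
docs = [
    Document(page_content="Chroma is an open-source persistent vector database"),
    Document(page_content="It suits production RAG knowledge bases"),
    Document(page_content="Retrieval results can be injected directly into a prompt")
]
vector_store.add_documents(docs)

# 3. User question + vector retrieval (keep the 2 most similar documents)
user_query = "Which vector store should be used in production to build prompts?"
retrieved_docs = vector_store.similarity_search(user_query, k=2)

# 4. Build the prompt: join the retrieved text and fill the template
context = "\n".join([d.page_content for d in retrieved_docs])
prompt_template = """
You are a professional consultant. Answer strictly based on the context below.
Reference material:
{context}
User question:
{query}
"""
final_prompt = prompt_template.format(context=context, query=user_query)

print(final_prompt)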
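One possible refinement of the prompt-building step, sketched here under my own assumptions, is to number each retrieved chunk and to handle the case where retrieval returns nothing. The `build_prompt` helper and its wording are hypothetical; the function takes plain strings, so it would be called with the `page_content` of each retrieved document.

```python
def build_prompt(query, chunks):
    """Fill a RAG prompt, numbering each chunk so an answer can point back to its source."""
    if chunks:
        context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    else:
        # Make the empty case explicit instead of sending a blank context section.
        context = "(no relevant material was retrieved)"
    return (
        "You are a professional consultant. Answer strictly based on the reference material below.\n"
        "If the material does not contain the answer, say that you do not know.\n"
        f"Reference material:\n{context}\n"
        f"User question:\n{query}\n"
    )

example_chunks = [
    "It suits production RAG knowledge bases",
    "Chroma is an open-source persistent vector database",
]
prompt = build_prompt("Which vector store should be used in production?", example_chunks)
print(prompt)
```

Telling the model what to do when the context is insufficient is a common way to keep it from falling back on guesses, which is the failure that retrieval is meant to reduce.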
