Reproducing Microsoft GraphRAG with ollama (with troubleshooting notes)
1. Overview
GraphRAG
- Definition: an AI-based content interpretation and search capability. It uses large language models (LLMs) to parse data, build a knowledge graph, and answer user questions about a user-provided private dataset.
- GraphRAG establishes connections between pieces of information, so it can answer questions that keyword-based and vector-based search mechanisms struggle with or cannot answer at all.
- Intended uses:
  - Supporting key information discovery and analysis, e.g., extracting useful insights from multiple documents mixed with noise and misinformation, or answering abstract, thematic questions about the underlying data the user supplies.
  - Users need some analytical skill and should read the answers critically. GraphRAG can accurately extract the themes of complex information, but human analysis is still needed to verify and enrich the results.
  - It is deployed over a domain-specific text corpus and does not itself collect user data.
Source code: https://github.com/microsoft/graphrag
2. Walkthrough
- Configure the package sources:
- Reset the conda configuration to the official Anaconda channels:
conda config --remove-key channels
conda config --add channels defaults
conda config --add channels conda-forge
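The resulting channel order can be confirmed with conda's built-in inspection command:
conda config --show channels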
- Create the environment:
conda create -n graphrag-ollama-local python=3.10
conda activate graphrag-ollama-local
- Install Ollama
- Install the Ollama dependency
Install the IPEX-LLM library, which this Ollama setup depends on:
pip install --pre --upgrade ipex-llm[cpp]

- Install ollama with the installer (refer to the second link below if you need to install it on a different drive); both references are Chinese-language posts:
Ollama-0001-安装 - 知乎 (zhihu.com)
ollama修改模型的下载位置解决C盘空间不足问题_ollama修改模型位置-CSDN博客
- Pull the two open-source models used in this walkthrough with ollama: mistral (chat) and nomic-embed-text (embeddings), as shown below.
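For reference, the models are fetched by name with standard ollama pull commands (the original post showed a screenshot here):
ollama pull mistral
ollama pull nomic-embed-text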

- Check that the downloads succeeded (see below).
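ollama list prints the locally installed models; both mistral and nomic-embed-text should appear in its output:
ollama list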

- Start the service (see below).
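The serve command runs in the foreground and keeps the window occupied:
ollama serve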

An error appeared: the port is already in use (Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address...).
Cause: on Windows, Ollama is set to launch at boot by default, so an instance is already listening when ollama serve is run.
- Fix: quit Ollama. Press Win+X and open Task Manager: disable ollama under Startup apps and end the ollama task under Processes.
- Retry ollama serve.
Success:

- At this point the window accepts no further input and looks as if it has hung; this is expected, since ollama serve keeps running in the foreground.
- Open a new cmd window, run conda activate graphrag-ollama-local, and carry out the remaining steps and tests there.
- Test whether ollama is running (see below):
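Two quick checks from the new window, using Ollama's standard endpoints: the root URL should reply "Ollama is running", and /api/tags should list the pulled models:
curl http://localhost:11434
curl http://localhost:11434/api/tags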


- Download the source code (clone shown below) and enter the repository directory:
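Assuming the widely used graphrag-local-ollama repository (the exact URL is an assumption; substitute whichever fork you are reproducing):
git clone https://github.com/TheAiSingularity/graphrag-local-ollama.git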
cd graphrag-local-ollama/

- Install the dependency packages (very important!!!)
pip install -e .
If this step errors out:
Option 1: switch to a different pip mirror and retry;
Option 2: make sure the source was obtained via git clone, then run the command again.
- Create the GraphRAG working directory with an input/ subdirectory for the raw documents: mkdir ragtest\input
- Put the raw documents into ./ragtest/input (only txt files are supported; multiple files are fine): xcopy input\* ragtest\input\ /s /e

- Initialize the project: python -m graphrag.index --init --root ./ragtest
If it errors, install the missing modules the error message names.
If it succeeds, the ragtest directory contains five entries: output, input, settings.yaml, prompts, and .env (hidden by default).
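The resulting layout should look roughly like this:
ragtest/
├── input/
├── output/
├── prompts/
├── settings.yaml
└── .env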


- Move the settings.yaml file; this is the main predefined configuration file set up for local ollama models:
move settings.yaml .\ragtest\
- Edit the configuration file (paste the modified version from the appendix in full).
- Run indexing to build the graph: python -m graphrag.index --root ./ragtest
If it errors, first check that the configuration was pasted in full.
Current error status: unresolved.
3. Appendix
settings.yaml
The modified file is as follows:
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: mistral
  model_supports_json: false # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: http://localhost:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-embed-text
    api_base: http://localhost:11434/api
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: yes
  raw_entities: yes
  top_level_nodes: yes

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32