1. Overview

GraphRAG

  • Definition: an AI-based content interpretation and search capability. It uses large language models (LLMs) to parse data, build a knowledge graph, and answer user questions about a user-provided private dataset.
  • GraphRAG draws connections between pieces of information, so it can answer questions that keyword- and vector-based search mechanisms struggle with or cannot answer.
  • Intended uses:
  • Supporting critical information discovery and analysis use cases, e.g. extracting useful insights from many documents mixed with noise and misinformation, or answering abstract, thematic questions about the underlying data supplied by the user.
  • Users need some analytical skill and should read the answers critically. GraphRAG is good at extracting the themes of complex material, but human analysis is still required to verify and enrich the results.
  • It is deployed over a domain-specific text corpus and does not itself collect user data.

Source code: https://github.com/microsoft/graphrag

2. Procedure

  • Configure the package channels:
  • Reset the conda configuration to the official Anaconda channels

conda config --remove-key channels

conda config --add channels defaults

conda config --add channels conda-forge
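To confirm the change took effect, the channel list can be inspected (conda-forge should appear first, followed by defaults):

conda config --show channels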

  • Create and activate the environment:

conda create -n graphrag-ollama-local python=3.10

conda activate graphrag-ollama-local

  • Install Ollama
  • Install the Ollama dependency

Install the IPEX-LLM library, the dependency this Ollama setup runs on:

pip install --pre --upgrade ipex-llm[cpp]

  • Install ollama with the installer (see the second link below if you need to move it to a different drive)

Ollama-0001-安装 - 知乎 (zhihu.com)

ollama修改模型的下载位置解决C盘空间不足问题_ollama修改模型位置-CSDN博客

  • Pull the open-source models used in this walkthrough with ollama: mistral (chat) + nomic-embed-text (embeddings), as shown below
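A minimal sketch of the pull commands (model names taken from the step above):

ollama pull mistral

ollama pull nomic-embed-text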

  • Check that the downloads succeeded
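ollama list prints the locally available models; both mistral and nomic-embed-text should appear:

ollama list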

  • Start the service
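The server is started in the foreground with:

ollama serve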

An error appears: the port is already in use. Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (see the CSDN post of the same title).

Cause: on Windows, Ollama is set to launch at startup by default, so an instance is already listening when ollama serve is run.

  • Fix: quit Ollama:
    Press Win+X and open Task Manager: disable ollama under Startup apps, then end the ollama tasks under Processes (or kill them from the command line, as shown below).
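An equivalent command-line sketch, assuming the default Windows process names ollama.exe and "ollama app.exe":

taskkill /F /IM "ollama app.exe"

taskkill /F /IM ollama.exe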
  • Run ollama serve again

Success:

  • At this point the window accepts no further input and looks as if it has hung; this is expected, since ollama serve blocks the terminal.
  • Open a new cmd window and run conda activate graphrag-ollama-local to carry out the remaining steps and tests.
  • Test whether ollama is up:
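A quick check from the new window: the root endpoint answers with "Ollama is running", and /api/tags lists the pulled models.

curl http://localhost:11434

curl http://localhost:11434/api/tags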

  • Download the source code

git clone https://github.com/TheAiSingularity/graphrag-local-ollama.git

cd graphrag-local-ollama/

  • Install the dependency packages (very important!!!)

pip install -e .

Solutions if this step errors:

Option 1: switch to a different pip mirror and download again;

Option 2: make sure the source was obtained with git clone (not a zip download), then run the command again.
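Once the install succeeds, a quick sanity check (assuming the editable install registered the graphrag package):

python -m graphrag.index --help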

  • Create the GraphRAG working directory with an input/ subdirectory for the source documents: mkdir ragtest\input
  • Copy the source documents into ./ragtest/input (only txt files are supported; multiple files are fine): xcopy input\* ragtest\input\ /s /e

  • Initialize the project: python -m graphrag.index --init --root ./ragtest

If it errors, install whatever modules the error message names.

If it succeeds, the ragtest directory contains five items: the output, input, and prompts directories, plus the settings.yaml and .env files (.env is hidden by default).
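To confirm, list the directory with hidden files included:

dir /a ragtest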

  • Move the settings.yaml file, the main predefined configuration file set up for local ollama models:

move settings.yaml .\ragtest\

  • Edit the configuration file (paste the modified version from the appendix in full)
  • Run indexing to build the graph: python -m graphrag.index --root ./ragtest

If it errors, first check that the appendix configuration was pasted in full.

Current error (still unresolved).
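For reference, once indexing completes, queries go through the graphrag.query module (usage from the upstream graphrag CLI; not yet verified in this setup because of the error above):

python -m graphrag.query --root ./ragtest --method global "What are the top themes in this corpus?"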

3. Appendix

settings.yaml

The modified file is as follows:

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: mistral
  model_supports_json: false # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: http://localhost:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-embed-text
    api_base: http://localhost:11434/api
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: yes
  raw_entities: yes
  top_level_nodes: yes

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
