Claude Code Source Analysis (Bonus Chapter): The Truth About Code Indexing. No Embeddings, No Vector Database, Just ripgrep

qhvssonic · Published 2026-04-02 07:43:27

This series analyzes the TypeScript source code of Claude Code version 2.1.88. Copyright in the source code belongs to Anthropic; this article is for technical research only.

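Introduction

Before reading the Claude Code source, I expected to find an elaborate code indexing system: perhaps embedding-based semantic search, perhaps an AST-level symbol index, perhaps some kind of incrementally updated inverted index. After all, an AI coding assistant that can make sense of large codebases would seem to depend on that kind of infrastructure.

The source shows the opposite: Claude Code does no "code indexing" in the traditional sense. There are no embeddings, no vector database, and no offline semantic analysis. Its understanding of code rests entirely on an architecture of live search plus model comprehension.

That finding deserves a closer look in its own right, because it reflects a design philosophy quite different from mainstream RAG approaches.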

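Core source files covered:

src/native-ts/file-index/index.ts: fuzzy search engine for file paths
src/hooks/fileSuggestions.ts: building and refreshing the file index
src/tools/GrepTool/GrepTool.ts: ripgrep-based search of code contents
src/tools/GlobTool/: file pattern matching
src/tools/LSPTool/: Language Server Protocol integration
src/tools/ToolSearchTool/: index of tool capabilities
src/utils/codeIndexing.ts: detection of third-party code indexing tools (telemetry)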

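1. The Only Index: Fuzzy Search over File Paths

The only thing Claude Code truly indexes is file paths, not file contents.

1.1 FileIndex: a nucleo-style fuzzy search engine

src/native-ts/file-index/index.ts implements a fuzzy search engine in pure TypeScript. According to the source comment, it started as a Rust NAPI module built on nucleo (the fuzzy matching library used by the Helix editor) and was later ported to TypeScript to remove the native dependency: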

/**
 * Pure-TypeScript port of vendor/file-index-src (Rust NAPI module).
 *
 * The native module wraps nucleo for high-performance fuzzy file searching. 
 * This port reimplements the same API and scoring behavior without native 
 * dependencies.
 */
export class FileIndex {
  private paths: string[] = []
  private lowerPaths: string[] = []
  private charBits: Int32Array = new Int32Array(0)  // a-z bitmap
  private pathLens: Uint16Array = new Uint16Array(0)
  private readyCount = 0

  loadFromFileList(fileList: string[]): void { ... }
  search(query: string, limit: number): SearchResult[] { ... }
}

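This index backs the file path completion triggered by @ in the input box. It plays no part in searching code contents.

1.2 Bitmap acceleration: rejecting impossible matches in O(1) per path

At indexing time, each path gets a precomputed 26-bit bitmap with one bit per letter a-z: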

private indexPath(i: number): void {
  const lp = this.paths[i]!.toLowerCase()
  this.lowerPaths[i] = lp
  let bits = 0
  for (let j = 0; j < lp.length; j++) {
    const c = lp.charCodeAt(j)
    if (c >= 97 && c <= 122) bits |= 1 << (c - 97)
  }
  this.charBits[i] = bits
}

At search time, a single bitwise test checks whether the path contains every letter in the query:

// O(1) bitmap reject: path must contain every letter in the needle
if ((charBits[i]! & needleBitmap) !== needleBitmap) continue

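According to the source comments, this step rejects more than 90% of paths for queries containing rare characters. Even for broad queries such as "test", it yields a speedup of 10% or more at essentially no cost. The test is only a necessary condition: it ignores letter order and repetition, so paths that pass still go through full fuzzy matching.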
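To make the pre-filter concrete, here is a minimal sketch of the bitmap idea described above. The function names letterBitmap and mayContain are illustrative and do not come from the source; only the bit layout (one bit per letter a-z) and the mask test follow the excerpts.

```typescript
// Sketch: 26-bit letter bitmap used as a cheap pre-filter before fuzzy scoring.
// letterBitmap and mayContain are illustrative names, not taken from the source.
function letterBitmap(text: string): number {
  const lower = text.toLowerCase()
  let bits = 0
  for (let i = 0; i < lower.length; i++) {
    const c = lower.charCodeAt(i)
    // Only a-z contribute a bit; digits, '/', '.', etc. are ignored.
    if (c >= 97 && c <= 122) bits |= 1 << (c - 97)
  }
  return bits
}

function mayContain(pathBits: number, needleBits: number): boolean {
  // Every letter bit set in the needle must also be set in the path.
  return (pathBits & needleBits) === needleBits
}

const samplePaths = ['src/tools/GrepTool/GrepTool.ts', 'README.md', 'src/utils/bash/ast.ts']
const needleBits = letterBitmap('grep')
const candidates = samplePaths.filter(p => mayContain(letterBitmap(p), needleBits))
console.log(candidates)
```

Only the first path contains all of g, r, e and p, so it is the only one that would be handed to the more expensive scorer. In a real index the path bitmaps are computed once at load time, so the per-query cost is one AND and one comparison per path.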

1.3 Scoring algorithm

Match scoring is modeled on the fzf-v2 / nucleo scoring scheme:

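const SCORE_MATCH = 16          // base score per matched character
const BONUS_BOUNDARY = 8        // bonus for a match at a boundary (after /, _, or .)
const BONUS_CAMEL = 6           // bonus for a camelCase match
const BONUS_CONSECUTIVE = 4     // bonus for consecutive matches
const BONUS_FIRST_CHAR = 8      // bonus for a first-character match
const PENALTY_GAP_START = 3     // penalty for starting a gap
const PENALTY_GAP_EXTENSION = 1 // penalty for extending a gap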

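In addition, paths containing "test" have their score multiplied by 1.05 (capped at 1.0) so that non-test files rank slightly higher. Since the multiplier is described as a penalty, this suggests that a lower positionScore ranks better here: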

const finalScore = path.includes('test')
  ? Math.min(positionScore * 1.05, 1.0)
  : positionScore
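The constants above can be illustrated with a small scorer. This is only a sketch under stated assumptions: it matches greedily from left to right, whereas fzf-v2 style algorithms search for the best alignment; it assumes ASCII paths; and it applies BONUS_FIRST_CHAR when a match lands on the first character of the path, which is a guess at that constant's meaning. The helper name greedyScore is hypothetical.

```typescript
const SCORE_MATCH = 16
const BONUS_BOUNDARY = 8
const BONUS_CAMEL = 6
const BONUS_CONSECUTIVE = 4
const BONUS_FIRST_CHAR = 8
const PENALTY_GAP_START = 3
const PENALTY_GAP_EXTENSION = 1

function isLower(ch: string): boolean { return ch >= 'a' && ch <= 'z' }
function isUpper(ch: string): boolean { return ch >= 'A' && ch <= 'Z' }

// Greedy left-to-right subsequence scorer (illustrative, not the source algorithm).
// Returns null when the needle is not a subsequence of the path.
function greedyScore(path: string, needle: string): number | null {
  const lowerPath = path.toLowerCase()
  const lowerNeedle = needle.toLowerCase()
  let score = 0
  let searchFrom = 0
  let prevMatch = -1
  for (let n = 0; n < lowerNeedle.length; n++) {
    const found = lowerPath.indexOf(lowerNeedle.charAt(n), searchFrom)
    if (found === -1) return null
    score += SCORE_MATCH
    if (found === 0) {
      score += BONUS_FIRST_CHAR // assumption: bonus for matching the path's first character
    } else {
      const prev = path.charAt(found - 1)
      const cur = path.charAt(found)
      if (prev === '/' || prev === '_' || prev === '.') score += BONUS_BOUNDARY
      else if (isLower(prev) && isUpper(cur)) score += BONUS_CAMEL
    }
    if (prevMatch >= 0) {
      const gap = found - prevMatch - 1
      if (gap === 0) score += BONUS_CONSECUTIVE
      else score -= PENALTY_GAP_START + (gap - 1) * PENALTY_GAP_EXTENSION
    }
    prevMatch = found
    searchFrom = found + 1
  }
  return score
}

console.log(greedyScore('src/GrepTool.ts', 'gt'), greedyScore('grep.ts', 'gr'))
```

For the query "gt" against src/GrepTool.ts, the "G" earns the boundary bonus (it follows "/") and the "T" earns the camelCase bonus, while the three skipped characters between them cost a gap penalty. That is the intuition behind boundary-aware scoring: initials of path segments and words are worth more than letters buried mid-word.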

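1.4 Asynchronous incremental build

For large repositories (270k+ files), the index is built asynchronously and incrementally, yielding to the event loop every 4ms: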

loadFromFileListAsync(fileList: string[]): {
  queryable: Promise<void>  // first chunk indexed; partial results can be returned
  done: Promise<void>       // all paths indexed
}

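queryable resolves as soon as the first chunk is indexed, at which point searches can return partial results. The UI layer re-runs the search once done fires to upgrade to complete results. The source comment notes that chunk size is bounded by time rather than by count: "slow machines get smaller chunks and stay responsive".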
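Time-based chunking can be sketched as follows. The names indexSlice, buildAsync and onProgress are hypothetical, and the clock is injected so the slicing logic can be exercised deterministically; the 4ms budget is the only detail taken from the text above.

```typescript
// Index entries starting at `start` until the time budget is used up.
// Returns the index of the first entry that has not been indexed yet.
function indexSlice(
  start: number,
  count: number,
  indexOne: (i: number) => void,
  budgetMs: number,
  now: () => number,
): number {
  const sliceStart = now()
  let i = start
  while (i < count) {
    indexOne(i)
    i++
    // Time-based cutoff: a slow machine simply indexes fewer entries per slice.
    if (now() - sliceStart >= budgetMs) break
  }
  return i
}

// Drives the slices, handing control back to the event loop between them.
async function buildAsync(
  count: number,
  indexOne: (i: number) => void,
  onProgress: (readyCount: number) => void,
): Promise<void> {
  let ready = 0
  while (ready < count) {
    ready = indexSlice(ready, count, indexOne, 4, Date.now)
    onProgress(ready) // after the first call, partial results are searchable
    if (ready < count) {
      await new Promise<void>(resolve => setTimeout(resolve, 0))
    }
  }
}
```

In this sketch, the first onProgress call plays the role of queryable and the resolution of buildAsync plays the role of done. A search issued in between would only scan the first readyCount entries.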

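1.5 Data sources and refresh strategy

fileSuggestions.ts is responsible for populating FileIndex. There are two data sources:

The first choice is git ls-files (fast, since it reads the git index directly):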

const trackedResult = await execFileNoThrowWithCwd(
  gitExe(),
  ['-c', 'core.quotepath=false', 'ls-files', '--recurse-submodules'],
  { timeout: 5000, abortSignal, cwd: repoRoot },
)

The second is ripgrep (a fallback for non-git repositories):

const rgArgs = [
  '--files', '--follow', '--hidden',
  '--glob', '!.git/', '--glob', '!.svn/',
]
const files = await ripGrep(rgArgs, '.', abortSignal)

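A refresh is allowed under either of two conditions:

The mtime of .git/index has changed (detects operations such as git add, commit, and checkout)
At least 5 seconds have passed since the last refresh (picks up changes to untracked files, which do not alter the mtime of .git/index)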

const REFRESH_THROTTLE_MS = 5_000

export function startBackgroundCacheRefresh(): void {
  const indexMtime = getGitIndexMtime()
  if (fileIndex) {
    const gitStateChanged = indexMtime !== null && indexMtime !== lastGitIndexMtime
    if (!gitStateChanged && Date.now() - lastRefreshMs < REFRESH_THROTTLE_MS) {
      return  // throttled: git state unchanged and less than 5 seconds since the last refresh
    }
  }
}

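In addition, the system computes an FNV-1a hash signature over the path list to detect whether the list actually changed, avoiding unnecessary index rebuilds: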

export function pathListSignature(paths: string[]): string {
  const n = paths.length
  const stride = Math.max(1, Math.floor(n / 500))
  let h = 0x811c9dc5 | 0  // FNV-1a offset basis
  for (let i = 0; i < n; i += stride) {
    // Sample every stride-th path; for 270k paths the stride is 540, so only about 500 paths are hashed
  }
  return `${n}:${(h >>> 0).toString(16)}`
}
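The excerpt above elides the loop body. A complete version could look like the sketch below, which is an assumption rather than the source's implementation: it hashes UTF-16 code units instead of bytes (FNV-1a is defined over bytes) and mixes in a zero after each path as a separator. The name pathListSignatureSketch is hypothetical.

```typescript
// Sketch of a stride-sampled FNV-1a signature over a path list.
// The loop body is an assumption; the article's excerpt does not show it.
function pathListSignatureSketch(paths: string[]): string {
  const n = paths.length
  const stride = Math.max(1, Math.floor(n / 500))
  let h = 0x811c9dc5 | 0 // FNV-1a 32-bit offset basis
  for (let i = 0; i < n; i += stride) {
    const p = paths[i]!
    for (let j = 0; j < p.length; j++) {
      h ^= p.charCodeAt(j)
      h = Math.imul(h, 0x01000193) // FNV-1a 32-bit prime
    }
    // Hash a zero as a separator so ['ab', 'c'] and ['a', 'bc'] differ.
    h = Math.imul(h, 0x01000193)
  }
  return `${n}:${(h >>> 0).toString(16)}`
}

console.log(pathListSignatureSketch(['a.ts', 'b.ts']))
```

Because the list length is part of the signature, any pure addition or removal changes it. With fewer than 1000 paths the stride is 1 and every path is hashed. For larger lists, a rename that touches only unsampled paths and leaves the count unchanged would go unnoticed, which is the trade-off sampling makes in exchange for speed.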

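Untracked files are fetched asynchronously in the background; once available, they are merged with the tracked files and the index is rebuilt.

2. Code Content Search: Entirely ripgrep

Claude Code keeps no prebuilt index for code content search. Every search is a live ripgrep invocation.

2.1 GrepTool

GrepTool is the main tool for searching code contents. It is essentially a wrapper around ripgrep: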

export const GrepTool = buildTool({
  name: 'Grep',
  searchHint: 'search file contents with regex (ripgrep)',
  isConcurrencySafe() { return true },   // read-only, safe to run in parallel
  isReadOnly() { return true },
  
  async call({ pattern, path, glob, type, output_mode, ... }) {
    const args = ['--hidden']
    // Exclude version control directories
    for (const dir of ['.git', '.svn', '.hg', '.bzr', '.jj', '.sl']) {
      args.push('--glob', `!${dir}`)
    }
    // Cap line length so base64/minified content does not pollute the output
    args.push('--max-columns', '500')
    
    const results = await ripGrep(args, absolutePath, abortController.signal)
    // ...
  }
})

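GrepTool supports three output modes:

files_with_matches (default): returns only the paths of matching files, sorted by modification time
content: returns matching lines, with optional context lines (-A/-B/-C)
count: returns the number of matches per file

Results are capped at 250 entries by default (DEFAULT_HEAD_LIMIT). The source comment explains why: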

// Unbounded content-mode greps can fill up to the 20KB persist threshold 
// (~6-24K tokens/grep-heavy session). 250 is generous enough for 
// exploratory searches while preventing context bloat.
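As a rough illustration of how the three output modes and the head limit might map onto a ripgrep invocation, consider the sketch below. The helper names buildRgArgs and applyHeadLimit are hypothetical, and the choice of --files-with-matches, --count and --line-number for the three modes is an assumption; these are real ripgrep flags, but the article does not show which ones GrepTool actually passes.

```typescript
type OutputMode = 'files_with_matches' | 'content' | 'count'

const DEFAULT_HEAD_LIMIT = 250

// Builds an argument list for ripgrep. The mode-to-flag mapping is an assumption.
function buildRgArgs(pattern: string, mode: OutputMode): string[] {
  const args = ['--hidden', '--max-columns', '500']
  for (const dir of ['.git', '.svn', '.hg', '.bzr', '.jj', '.sl']) {
    args.push('--glob', `!${dir}`)
  }
  if (mode === 'files_with_matches') args.push('--files-with-matches')
  else if (mode === 'count') args.push('--count')
  else args.push('--line-number')
  // --regexp keeps patterns that start with '-' from being parsed as flags.
  args.push('--regexp', pattern)
  return args
}

// Truncates result lines so a single search cannot flood the model's context.
function applyHeadLimit(
  lines: string[],
  limit: number = DEFAULT_HEAD_LIMIT,
): { lines: string[]; truncated: boolean } {
  return { lines: lines.slice(0, limit), truncated: lines.length > limit }
}

console.log(buildRgArgs('TODO', 'count').join(' '))
```

Reporting whether the output was truncated matters for an agent: it lets the model decide to narrow the pattern or the path instead of assuming it has seen every match.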

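2.2 GlobTool

GlobTool handles file pattern matching (for example **/*.tsx). It is likewise built on ripgrep's --files mode, with no prebuilt index.

2.3 Steering search strategy through the system prompt

Rather than relying on a prebuilt index, Claude Code uses the system prompt to steer the model toward the right search strategy: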

ALWAYS use Grep for search tasks. NEVER invoke `grep` or `rg` as a Bash command. 
The Grep tool has been optimized for correct permissions and access.

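The model is thus guided to run a broad search with GrepTool first and then read specific files with FileReadTool. This "search, read, understand" loop takes the place of the traditional "index, then query" model.

3. LSP Integration: Borrowing the Language Server's Index

src/tools/LSPTool/ is the closest thing Claude Code has to a "semantic index", but the indexing is done by an external language server. Claude Code is only the client.

LSPTool supports 9 operations: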

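Operation	Function
`goToDefinition`	Jump to a symbol's definition
`findReferences`	Find all references
`hover`	Get hover information (documentation, types)
`documentSymbol`	List all symbols in a document
`workspaceSymbol`	Search symbols across the workspace
`goToImplementation`	Jump to implementations of an interface or abstract method
`prepareCallHierarchy`	Prepare a call hierarchy
`incomingCalls`	Find all callers of the current function
`outgoingCalls`	Find all functions called by the current function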

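These operations depend on the index maintained by the language server (for example rust-analyzer or typescript-language-server). The source includes retry logic for the case where indexing has not finished: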

// LSPServerInstance.ts
/**
 * LSP error code for "content modified" - indicates the server's state 
 * changed during request processing (e.g., rust-analyzer still indexing).
 * This is a transient error that can be retried.
 */
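A retry wrapper for this transient error might look like the sketch below. The numeric code -32801 is the ContentModified error code defined by the LSP specification; everything else (the helper name withContentModifiedRetry, the attempt limit, the linear backoff) is an assumption and not taken from LSPServerInstance.ts.

```typescript
const LSP_CONTENT_MODIFIED = -32801 // ContentModified in the LSP specification

function hasErrorCode(e: unknown): e is { code: number } {
  return typeof e === 'object' && e !== null && typeof (e as { code?: unknown }).code === 'number'
}

// Retries a request while the server reports "content modified".
// maxAttempts and delayMs are illustrative parameters.
async function withContentModifiedRetry<T>(
  request: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await request()
    } catch (e) {
      const retryable = hasErrorCode(e) && e.code === LSP_CONTENT_MODIFIED
      // Any other error, or running out of attempts, is passed to the caller.
      if (!retryable || attempt >= maxAttempts) throw e
      // Linear backoff gives the server time to finish indexing.
      await new Promise<void>(resolve => setTimeout(resolve, delayMs * attempt))
    }
  }
}
```

Only the one error code is retried. Treating every failure as transient would hide real problems such as a crashed server or an invalid request.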

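When no symbols are found, which can happen while the language server is still indexing, the workspaceSymbol operation returns an explanatory message: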

'No symbols found in workspace. This may occur if the workspace is empty, 
or if the LSP server has not finished indexing the project.'

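4. ToolSearch: A Keyword Index of Tool Capabilities

ToolSearchTool implements a special kind of "index": not an index of code, but an index of tool capabilities.

Claude Code has 40+ tools. To limit the token cost of the initial prompt, non-core tools are marked shouldDefer and their full schemas are left out of the initial prompt. The model has to discover these tools by keyword through ToolSearch: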

async function searchToolsWithKeywords(query, deferredTools, tools) {
  for (const tool of deferredTools) {
    const description = await getToolDescriptionMemoized(tool.name, tools)
    const hintNormalized = tool.searchHint?.toLowerCase() ?? ''
    
    let score = 0
    // Tool name match
    if (pattern.test(nameParts.full)) score += 10
    // searchHint match (a curated capability phrase, a stronger signal than the prompt)
    if (hintNormalized && pattern.test(hintNormalized)) score += 4
    // Description match
    if (pattern.test(descNormalized)) score += 2
  }
}

Each tool's searchHint is a capability phrase of 3 to 10 words used for keyword matching, such as 'execute shell commands' for BashTool or 'search file contents with regex (ripgrep)' for GrepTool.
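The weighted matching can be shown end to end with a simplified sketch. It uses plain substring matching instead of the regular expression in the excerpt, and the DeferredTool shape, the searchTools helper and the two sample tools with their hints are all hypothetical; only the 10 / 4 / 2 weights come from the source excerpt.

```typescript
interface DeferredTool {
  name: string
  searchHint?: string
  description: string
}

// Scores one tool against one keyword using the weights from the excerpt above.
function scoreTool(tool: DeferredTool, keyword: string): number {
  const k = keyword.toLowerCase()
  let score = 0
  if (tool.name.toLowerCase().includes(k)) score += 10
  if (tool.searchHint && tool.searchHint.toLowerCase().includes(k)) score += 4
  if (tool.description.toLowerCase().includes(k)) score += 2
  return score
}

// Sums keyword scores per tool and returns the best-matching tool names.
function searchTools(tools: DeferredTool[], query: string, limit: number): string[] {
  const keywords = query.split(/\s+/).filter(k => k.length > 0)
  return tools
    .map(tool => ({
      name: tool.name,
      score: keywords.reduce((sum, k) => sum + scoreTool(tool, k), 0),
    }))
    .filter(r => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(r => r.name)
}

// Hypothetical deferred tools, for illustration only.
const demoTools: DeferredTool[] = [
  { name: 'NotebookEdit', searchHint: 'edit jupyter notebook cells', description: 'Edits a cell in a notebook file' },
  { name: 'WebFetch', searchHint: 'fetch a url', description: 'Fetches content from a URL' },
]
console.log(searchTools(demoTools, 'url fetch', 5))
```

Weighting the name above the hint and the hint above the description reflects how much each field was written with discovery in mind: a name match is nearly always intentional, while a long description can mention a keyword in passing.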

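5. The Truth About codeIndexing.ts: Pure Telemetry

The file name src/utils/codeIndexing.ts is misleading. It does not implement code indexing. It is a telemetry module that detects whether the user is running third-party code indexing tools: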

export type CodeIndexingTool =
  | 'sourcegraph' | 'cody' | 'aider' | 'cursor' | 'github-copilot'
  | 'code-index-mcp' | 'local-code-search' | ...

export function detectCodeIndexingFromCommand(command: string): CodeIndexingTool | undefined {
  const firstWord = command.trim().split(/\s+/)[0]?.toLowerCase()
  return CLI_COMMAND_MAPPING[firstWord]
}

export function detectCodeIndexingFromMcpTool(toolName: string): CodeIndexingTool | undefined {
  // Detect whether an MCP tool name comes from a code indexing server
}

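These functions are called when BashTool executes a command and when the MCP client connects, purely for analytics. Presumably Anthropic wants to know how many users rely on code indexing tools outside Claude Code.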
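A self-contained version of the first-word lookup is sketched below. The mapping entries are hypothetical examples, since the contents of CLI_COMMAND_MAPPING are not shown in the article, and the function is named detectFromCommand to make clear it is not the source function. A Map is used so that words such as "constructor" cannot accidentally hit inherited object properties.

```typescript
type IndexingTool = 'sourcegraph' | 'aider' | 'cody'

// Hypothetical mapping from CLI executable name to tool; the real table is not shown.
const COMMAND_TO_TOOL = new Map<string, IndexingTool>([
  ['src', 'sourcegraph'],
  ['aider', 'aider'],
  ['cody', 'cody'],
])

// Looks only at the first word of the command line, case-insensitively.
function detectFromCommand(command: string): IndexingTool | undefined {
  const firstWord = command.trim().split(/\s+/)[0]?.toLowerCase()
  if (!firstWord) return undefined
  return COMMAND_TO_TOOL.get(firstWord)
}

console.log(detectFromCommand('aider --model sonnet'), detectFromCommand('ls -la'))
```

A first-word check is deliberately shallow. It misses tools invoked through a wrapper such as npx or env, which is acceptable for coarse usage statistics but would not be for anything security-relevant.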

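6. The Role of tree-sitter: Security Analysis, Not Code Indexing

tree-sitter is referenced throughout the source, but it is used for security analysis of Bash commands (src/utils/bash/ast.ts), not for code indexing: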

/**
 * AST-based bash command analysis using tree-sitter.
 *
 * This module replaces the shell-quote + hand-rolled char-walker approach.
 * Instead of detecting parser differentials one-by-one, we parse with 
 * tree-sitter-bash and walk the tree with an EXPLICIT allowlist of node types.
 */

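tree-sitter parses each Bash command into an AST, and an allowlist of node types then decides whether the command is considered safe. It plays no part in indexing or searching the codebase.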
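The allowlist idea can be illustrated without tree-sitter itself. In the sketch below, AstNode is a hypothetical minimal tree shape, and the allowed set is illustrative: the type names follow tree-sitter-bash conventions, but the actual allowlist in src/utils/bash/ast.ts is not shown in the article.

```typescript
interface AstNode {
  type: string
  children: AstNode[]
}

// Illustrative allowlist; the real list in the source is not shown.
const ALLOWED_NODE_TYPES = new Set(['program', 'command', 'command_name', 'word', 'pipeline'])

// Returns the first node type that is not on the allowlist, or null if all are allowed.
function findDisallowedNode(node: AstNode): string | null {
  if (!ALLOWED_NODE_TYPES.has(node.type)) return node.type
  for (const child of node.children) {
    const bad = findDisallowedNode(child)
    if (bad !== null) return bad
  }
  return null
}

// Hand-built tree standing in for a parse of: echo $(whoami)
const exampleTree: AstNode = {
  type: 'program',
  children: [{
    type: 'command',
    children: [
      { type: 'command_name', children: [{ type: 'word', children: [] }] },
      { type: 'command_substitution', children: [] },
    ],
  }],
}
console.log(findDisallowedNode(exampleTree))
```

The point of an explicit allowlist is that anything the analyzer does not recognize is rejected by default. A denylist would instead have to anticipate every risky construct in advance, which is the one-by-one approach the source comment says this module replaces.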

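7. Design Philosophy: Why No Code Index

Several reasons for Claude Code's choice not to index can be inferred from the source:

First, zero startup cost. Users can start working immediately after installing Claude Code, without waiting for an index to build. On large repositories, building an index can take minutes or longer.

Second, results are always in sync with the file system. Live search always reflects the current state of the files, so there is no stale index to worry about. In a workflow where code changes constantly, keeping an index fresh is an ongoing engineering challenge.

Third, the model's capabilities stand in for the index. Claude models are strong at understanding code. Given search results and file contents, the model can work out code structure, trace call chains, and infer type relationships. Traditional tools need an index to provide these capabilities; with an LLM, the model can do the work directly.

Fourth, ripgrep is fast enough. Its searches typically complete within milliseconds, even on large repositories, although exact timings depend on repository size and disk cache state. For an AI agent, where model inference latency sits between consecutive searches, that speed is more than sufficient.

Fifth, LSP fills the gaps. For tasks that need semantic understanding (go to definition, find references, call hierarchy), Claude Code relies on the language server's index rather than building its own.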

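8. Comparison with Mainstream Approaches

Dimension	Claude Code	Typical RAG approach
Indexed content	File paths only	Embeddings of code chunks
Content search	Live ripgrep	Vector similarity retrieval
Semantic understanding	Done directly by the model	Approximate matching via embeddings
Startup cost	Zero	Index must be built
Storage overhead	Zero	Vector database
Freshness	Always current	Requires incremental updates
Large-repository performance	Depends on ripgrep speed	Depends on index quality
Symbol-level analysis	Delegated to LSP	Self-built AST index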