fix(docs-fts): sanitize natural-language queries into safe FTS5 expressions#127

Merged: markshust merged 1 commit into develop from feature/docs-fts-nl-query-sanitization on Jun 23, 2026

Conversation

@markshust

Collaborator

Summary

search_docs (the MCP docs tool, backed by docs-fts) returned syntax errors or zero results for ordinary natural-language questions. FtsSearch::search() bound the raw query into docs_fts MATCH :q, and FTS5 interprets that operand as a query expression, which produced two failure modes:

  1. Punctuation = syntax error — an apostrophe (Marko's) or stray quote raised fts5: syntax error.
  2. Implicit AND — a full question (how do observers react to events) required every token in one document → zero results.
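
Both failure modes can be reproduced outside the project with a small sketch. This assumes a Python build whose sqlite3 module includes FTS5; the one-row table, its columns, and the raw_match helper are hypothetical stand-ins for the real docs_fts index, not the project's code.

```python
import sqlite3


def raw_match(conn, query):
    """Bind the query straight into MATCH, mimicking the pre-fix code path."""
    try:
        rows = conn.execute(
            "SELECT path FROM docs_fts WHERE docs_fts MATCH ? ORDER BY rank",
            (query,),
        ).fetchall()
        return [path for (path,) in rows]
    except sqlite3.OperationalError as exc:
        # FTS5 reports query-syntax problems as an OperationalError.
        return f"error: {exc}"


# Hypothetical one-document corpus standing in for the docs index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(path, body)")
conn.execute(
    "INSERT INTO docs_fts VALUES (?, ?)",
    ("concepts/events", "Observers react to events dispatched by a module."),
)

# 1. An apostrophe is not valid in a bare FTS5 term, so the query fails to parse.
print(raw_match(conn, "Marko's"))

# 2. Bare tokens are AND-ed; "how" and "do" are absent from the doc, so no rows.
print(raw_match(conn, "how do observers react to events"))

# For contrast: when every token is present, the same document matches.
print(raw_match(conn, "observers events"))
```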

Fix

New FtsQueryBuilder::toMatchExpression() converts natural-language input into a safe expression: it tokenizes the input into alphanumeric words (dropping punctuation), removes stop words and FTS5 boolean keywords, quotes each remaining term, and OR-joins the terms so BM25 ranks documents by term overlap. Empty input short-circuits to no results.
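
For illustration only, the pipeline described above could be sketched in Python as follows. The real FtsQueryBuilder lives in packages/docs-fts and is not shown in this description, so the function name to_match_expression, the stop-word list, the lowercasing step, and the None return for empty input are all assumptions.

```python
import re
from typing import Optional

# Assumed lists for illustration; the PR does not enumerate the real ones.
STOP_WORDS = {"a", "an", "the", "how", "do", "does", "to", "is", "are", "of", "in", "with", "what"}
FTS5_KEYWORDS = {"and", "or", "not", "near"}


def to_match_expression(query: str) -> Optional[str]:
    """Return an OR-joined FTS5 expression, or None when no terms survive."""
    # Keep only runs of ASCII letters and digits; punctuation acts as a separator.
    tokens = re.findall(r"[A-Za-z0-9]+", query.lower())
    # Drop stop words and words that FTS5 would read as operators.
    terms = [t for t in tokens if t not in STOP_WORDS and t not in FTS5_KEYWORDS]
    if not terms:
        # The caller is expected to return no results without running a query.
        return None
    # Double-quoting turns each term into an FTS5 string, so it cannot be
    # parsed as syntax; OR lets any single term match.
    return " OR ".join(f'"{t}"' for t in terms)


print(to_match_expression("how do observers react to events"))
```

With these assumed lists, the question "how do observers react to events" becomes `"observers" OR "react" OR "events"`, and a query made up only of stop words or punctuation yields None. Note that this sketch splits Marko's into the two terms marko and s; how the real builder treats such fragments is not stated here.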

Evidence (real docs-markdown corpus)

Queries that previously failed now surface the right doc in top-3:

| Query | Before | After |
| --- | --- | --- |
| observers react to events | zero | concepts/events |
| Preferences replace a class | miss | concepts/preferences (rank 1) |
| routes defined with attributes | miss | guides/routing (rank 1) |

This recovers ~3 of the 5 misses from an in-project audit (effective hit rate ~6/8). The two remaining misses (ranking for modularity, and Porter stemming of config vs. configuration) are genuine content gaps, tracked separately.
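
The config vs. configuration case can be illustrated in isolation. The sketch below assumes an FTS5-enabled Python sqlite3 build and a throwaway table using SQLite's built-in porter tokenizer; the table name stem_demo, its single row, and the hits helper are hypothetical and say nothing about how docs_fts itself is configured.

```python
import sqlite3

# Throwaway index using SQLite's built-in Porter stemmer (an assumption about
# the tokenizer, made only for this illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE stem_demo USING fts5(body, tokenize='porter')")
conn.execute("INSERT INTO stem_demo VALUES ('Module configuration is loaded at boot.')")


def hits(term):
    """Count rows matching a single quoted term."""
    return conn.execute(
        "SELECT count(*) FROM stem_demo WHERE stem_demo MATCH ?",
        (f'"{term}"',),
    ).fetchone()[0]


# Inflected forms share a stem with "configuration", so they match.
print(hits("configurations"))

# "config" is an abbreviation, not an inflection: the stemmer leaves it as-is,
# so it does not reach a document that only says "configuration".
print(hits("config"))
```

Stemming only conflates inflected forms of the same word, so an abbreviation such as config needs either matching content in the docs or a different strategy, which is consistent with tracking it separately.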

Testing

packages/docs-fts green: 36 passed. New FtsQueryBuilderTest (7 cases) + updated FtsSearchTest (malformed input is now sanitized, not thrown; NL + apostrophe queries return results).

🤖 Generated with Claude Code

…ssions

FtsSearch bound the raw query string straight into `docs_fts MATCH :q`. FTS5
treats that operand as a query expression, with two failure modes:

1. Punctuation is parsed as syntax — an apostrophe ("Marko's") or stray quote
   raised "fts5: syntax error".
2. Multiple bare tokens are implicitly AND-ed, so a full question ("how do
   observers react to events") required every word in one document and returned
   zero results.

New FtsQueryBuilder tokenizes to alphanumeric words, drops stop words / FTS5
boolean keywords, quotes each remaining term, and OR-joins them so BM25 ranks by
term overlap. Empty input short-circuits to no results.

Smoke-tested against the real docs-markdown corpus: queries that previously
returned zero or the wrong docs (events, preferences, routing) now surface the
correct concept doc in the top 3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@markshust markshust merged commit 00c58d4 into develop Jun 23, 2026
1 check passed
@markshust markshust deleted the feature/docs-fts-nl-query-sanitization branch June 23, 2026 20:32
@github-actions github-actions Bot added the bug Something isn't working label Jun 23, 2026