feat(compiler): opt-in top-K retrieval for concepts-plan (bound plan prompt as the KB grows) by cnndabbler · Pull Request #92 · VectifyAI/OpenKB

cnndabbler · 2026-06-09T19:20:16Z

Motivation

The concepts-plan step injects every existing concept/entity brief into the prompt (_read_concept_briefs / _read_entity_briefs read the whole concepts/ and entities/ dirs). So the plan prompt grows O(N) with the KB. On a 165-doc KB this was observed climbing from ~2k tokens early to ~15–18k tokens at a few hundred concepts — which hurts cost/latency on every subsequent doc and degrades the model's ability to reconcile a new doc against the right existing pages (it starts creating near-duplicate concepts).

What this does (opt-in, default off)

Adds openkb/retrieval.py and a small block in _compile_concepts that, only when enabled, keeps the top-K briefs most relevant to the current doc's summary instead of all of them. Two config keys (default off → behaviour byte-identical):

concepts_plan_retrieval: false     # set true to enable
concepts_plan_retrieval_k: 40
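
For illustration, the opt-in gate might look roughly like the sketch below. The helper name `maybe_filter_briefs`, the dict-shaped config, and the `ranker` callable are assumptions made for this example; this is not the actual block in _compile_concepts.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical ranker signature: (query, briefs, k) -> the k most relevant briefs.
Ranker = Callable[[str, Sequence[str], int], list[str]]


def maybe_filter_briefs(
    config: Mapping[str, object],
    summary: str,
    briefs: Sequence[str],
    ranker: Ranker,
) -> list[str]:
    """Return every brief unless retrieval is switched on in the config."""
    # Default off: a missing or false key leaves the brief list untouched.
    if not config.get("concepts_plan_retrieval", False):
        return list(briefs)
    # 40 mirrors the default shown in the config snippet above.
    k = int(config.get("concepts_plan_retrieval_k", 40))
    return ranker(summary, briefs, k)
```

With an empty config the function returns the briefs unchanged, which is the property the "default off" claim relies on.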

Default ranker is dependency-free TF-IDF cosine over the brief lines (no new deps, no extra API calls).
An optional embedding ranker (select_relevant_briefs_embed, provider injected — no SDK dependency in the module) is included for higher-drift corpora / future hybrid use.
No-ops when the brief set is already within budget, so small KBs are unaffected.
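
A minimal sketch of the dependency-free TF-IDF cosine idea described above, using only the standard library. The function name `top_k_briefs_tfidf`, the tokenizer, the smoothed IDF formula, and returning the survivors in their original order are all assumptions for illustration; the real openkb/retrieval.py may differ in each of these.

```python
import math
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Simple lowercase alphanumeric tokenizer (an assumption of this sketch).
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_briefs_tfidf(query: str, briefs: list[str], k: int) -> list[str]:
    """Keep the k briefs most similar to the query, in their original order."""
    # No-op when the brief set already fits the budget.
    if len(briefs) <= k:
        return list(briefs)

    docs = [Counter(_tokens(b)) for b in briefs]
    n = len(docs)
    # Document frequency: number of briefs that contain each term.
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF so that no term gets a zero or negative weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vec(counts: Counter) -> dict[str, float]:
        # Terms that appear in no brief cannot match anything, so drop them.
        return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(Counter(_tokens(query)))
    scored = [(cosine(q, vec(d)), i) for i, d in enumerate(docs)]
    # Highest score first; ties are broken by original position.
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    return [briefs[i] for i in sorted(i for _, i in best)]
```

Because everything is computed from the brief lines and the summary already in memory, a ranker of this shape adds no API calls per document.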

Benchmark (recall@K vs. ground-truth concept links)

On a real 335-concept / 489-entity KB, using each summary's [[concepts/X]] links as ground truth and the summary as the query (scripts/bench_retrieval.py):

K	TF-IDF	Embeddings (text-embedding-3-small)	prompt size
20	0.79	0.67	6% of full
40	0.90	0.79	12% of full

TF-IDF wins here (LLM-generated briefs share heavy lexical overlap with summaries) and is free per-doc, so it's the default. K=40 recovers ~90% of the relevant concepts at 12% of the full-inject prompt size.
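
For readers unfamiliar with the metric, recall@K against link-derived ground truth can be sketched as below. This is not scripts/bench_retrieval.py; the helper names, the link regex, and the treatment of `|` aliases and `#` anchors are assumptions for illustration.

```python
import re

# Matches [[concepts/X]], [[concepts/X|alias]] and [[concepts/X#anchor]] (assumed link forms).
LINK_RE = re.compile(r"\[\[concepts/([^\]|#]+)")


def linked_concepts(summary: str) -> set[str]:
    """Ground truth: the concept names a summary links to."""
    return {name.strip() for name in LINK_RE.findall(summary)}


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant concepts found among the top-k retrieved names."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant concepts")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)
```

How per-summary scores are aggregated into the table's single number (for example, a plain mean over summaries) is not stated here, so the sketch stops at the per-summary value.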

Tradeoff

At K=40, ~10% of relevant existing concepts may fall outside the window for a given doc (small risk of a duplicate concept) in exchange for a bounded plan prompt as the KB scales. Off by default; tune ..._k higher to trade prompt size for recall.

Tests: tests/test_retrieval.py (ranker behaviour) + full suite green.

The concepts-plan step injects every existing concept/entity brief, so the prompt grows O(N) with the KB (~2k->18k tokens at a few hundred concepts), hurting speed/cost and the model's ability to reconcile against the right existing pages (-> near-duplicate concepts). Add an opt-in top-K relevance filter (openkb/retrieval.py, dependency-free TF-IDF cosine over brief lines, query = doc summary). Off by default (concepts_plan_retrieval / concepts_plan_retrieval_k in config), so behaviour is unchanged unless enabled. The select_relevant_briefs() interface is swappable for an embedding-based ranker later. Prototype for the O(N)->O(K) plan-context scaling discussion.

Benchmark on a real 335-concept KB (ground truth = summary concept links): TF-IDF recall@40=0.90 vs embeddings 0.79. Dependency-free TF-IDF wins on this LLM-generated corpus (high lexical overlap) and is free per-doc, so it stays the default; embedding ranker kept as an option for higher-drift corpora.
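
A rough sketch of what an embedding ranker with an injected provider could look like, so the module itself needs no SDK. The name `top_k_briefs_embed` and the `EmbedFn` type are hypothetical; this does not reproduce the signature of select_relevant_briefs_embed.

```python
import math
from typing import Callable, Sequence

# Hypothetical provider type: maps a batch of texts to one vector per text.
EmbedFn = Callable[[Sequence[str]], list[list[float]]]


def top_k_briefs_embed(
    query: str, briefs: list[str], k: int, embed: EmbedFn
) -> list[str]:
    """Keep the k briefs whose embeddings are closest to the query's."""
    # Skip the provider call entirely when nothing would be filtered out.
    if len(briefs) <= k:
        return list(briefs)

    # One batched call: the query first, then every brief.
    vectors = embed([query, *briefs])
    q, rest = vectors[0], vectors[1:]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    # Highest similarity first; ties are broken by original position.
    order = sorted(range(len(briefs)), key=lambda i: (-cosine(q, rest[i]), i))[:k]
    return [briefs[i] for i in sorted(order)]
```

Passing the provider in as a plain callable keeps the choice of embedding service (and its cost) with the caller, and makes the ranker easy to test with a toy function.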

cnndabbler added 2 commits June 9, 2026 10:58

cnndabbler wants to merge 2 commits into VectifyAI:main from cnndabbler:feat/retrieval-concepts-plan
