feat(compiler): opt-in top-K retrieval for concepts-plan (bound plan prompt as the KB grows) #92

Open: cnndabbler wants to merge 2 commits into VectifyAI:main from cnndabbler:feat/retrieval-concepts-plan
Motivation

The concepts-plan step injects every existing concept/entity brief into the prompt (_read_concept_briefs / _read_entity_briefs read the whole concepts/ and entities/ dirs), so the plan prompt grows O(N) with the KB. On a 165-doc KB it was observed climbing from ~2k tokens early on to ~15–18k tokens at a few hundred concepts. That raises cost and latency on every subsequent doc and degrades the model's ability to reconcile a new doc against the right existing pages (it starts creating near-duplicate concepts).

What this does (opt-in, default off)

Adds openkb/retrieval.py and a small block in _compile_concepts that, only when enabled, keeps the top-K briefs most relevant to the current doc's summary instead of all of them. Two config keys control it; with the defaults (off), behaviour is byte-identical to before:

concepts_plan_retrieval: false     # set true to enable
concepts_plan_retrieval_k: 40
  • Default ranker is dependency-free TF-IDF cosine over the brief lines (no new deps, no extra API calls).
  • An optional embedding ranker (select_relevant_briefs_embed; the provider is injected, so the module has no SDK dependency) is included for higher-drift corpora / future hybrid use.
  • No-ops when the brief set is already within budget, so small KBs are unaffected.
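A minimal sketch of the idea behind the default ranker, for readers who want to see its shape without opening the diff. It is not the code in openkb/retrieval.py: the function name top_k_briefs, its signature, the tokenizer and the smoothed IDF formula are all assumptions made for illustration. It scores each brief line against the doc summary with TF-IDF cosine, keeps the K best in their original order, and returns the input unchanged when it already fits within K.

```python
import math
import re
from collections import Counter

_TOKEN = re.compile(r"[a-z0-9]+")


def _tokens(text):
    # Lowercased alphanumeric tokens; a deliberately simple tokenizer (assumption).
    return _TOKEN.findall(text.lower())


def top_k_briefs(query, briefs, k):
    """Hypothetical helper: keep the k briefs most similar to `query`.

    Similarity is TF-IDF cosine; IDF is computed over the briefs only.
    """
    if k <= 0:
        return []
    if len(briefs) <= k:
        # Already within budget: no-op, so small KBs are unaffected.
        return list(briefs)

    docs = [Counter(_tokens(b)) for b in briefs]
    n = len(docs)

    # Document frequency of each term across the briefs.
    df = Counter()
    for counts in docs:
        df.update(counts.keys())
    # Smoothed IDF (one common variant, chosen here as an assumption).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vec(counts):
        # Terms unseen in the briefs carry no weight.
        return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(Counter(_tokens(query)))
    ranked = sorted(range(n), key=lambda i: cosine(q, vec(docs[i])), reverse=True)
    keep = sorted(ranked[:k])  # restore original order for a stable prompt
    return [briefs[i] for i in keep]
```

With four briefs about attention, sourdough, tokenizers and gardening, and a summary mentioning transformers, attention and tokenizers, top_k_briefs(summary, briefs, 2) keeps the attention and tokenizer lines. With k at or above the number of briefs it returns them all untouched.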

Benchmark (recall@K vs. ground-truth concept links)

On a real 335-concept / 489-entity KB, using each summary's [[concepts/X]] links as ground truth and the summary as the query (scripts/bench_retrieval.py):

| K | TF-IDF recall@K | Embeddings recall@K (text-embedding-3-small) | Prompt size |
|---|---|---|---|
| 20 | 0.79 | 0.67 | 6% of full |
| 40 | 0.90 | 0.79 | 12% of full |

TF-IDF wins here (LLM-generated briefs share heavy lexical overlap with summaries) and is free per-doc, so it's the default. K=40 recovers ~90% of the relevant concepts at 12% of the full-inject prompt size.
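For reference, the recall@K metric used above can be sketched as follows. This is an illustration of the metric, not an excerpt from scripts/bench_retrieval.py: the helper names linked_concepts and recall_at_k are hypothetical, and the handling of `|alias` and `#heading` suffixes inside wiki links is an assumption about the link syntax.

```python
import re

# Captures X from [[concepts/X]], stopping at an optional |alias or #heading (assumed syntax).
_LINK = re.compile(r"\[\[concepts/([^\]|#]+)")


def linked_concepts(summary):
    """Ground truth for one doc: the set of concepts its summary links to."""
    return {name.strip() for name in _LINK.findall(summary)}


def recall_at_k(relevant, ranked, k):
    """Fraction of relevant concepts that appear in the top-k of `ranked`.

    Returns None when there is no ground truth, so such docs can be
    skipped rather than counted as 0 or 1.
    """
    if not relevant:
        return None
    hits = relevant & set(ranked[:k])
    return len(hits) / len(relevant)
```

Averaging recall_at_k over every summary that has at least one concept link gives a single figure per K, comparable to the table entries.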

Tradeoff

At K=40, ~10% of relevant existing concepts may fall outside the window for a given doc (small risk of a duplicate concept) in exchange for a bounded plan prompt as the KB scales. Off by default; tune ..._k higher to trade prompt size for recall.

Tests: tests/test_retrieval.py (ranker behaviour) + full suite green.

The concepts-plan step injects every existing concept/entity brief, so the
prompt grows O(N) with the KB (~2k->18k tokens at a few hundred concepts),
hurting speed/cost and the model's ability to reconcile against the right
existing pages (-> near-duplicate concepts).

Add an opt-in top-K relevance filter (openkb/retrieval.py, dependency-free
TF-IDF cosine over brief lines, query = doc summary). Off by default
(concepts_plan_retrieval / concepts_plan_retrieval_k in config), so behaviour
is unchanged unless enabled. The select_relevant_briefs() interface is
swappable for an embedding-based ranker later.

Prototype for the O(N)->O(K) plan-context scaling discussion.
Benchmark on a real 335-concept KB (ground truth = summary concept links):
TF-IDF recall@40=0.90 vs embeddings 0.79. Dependency-free TF-IDF wins on this
LLM-generated corpus (high lexical overlap) and is free per-doc, so it stays
the default; embedding ranker kept as an option for higher-drift corpora.