NightShift: an LLM cost-optimization agent runtime
An agent runtime that sits between your code and LLM APIs, intercepting every call to compress context, deduplicate content, route between cheap and expensive models, and track spending. The goal: run autonomous research agents overnight without burning through API budgets.
The core NightShift engine orchestrates a 7-step pipeline per call: compress, deduplicate, manage history, confidence-gate, budget-check, dispatch, track. Each step is a separate module with fallback behavior.
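The per-call flow can be sketched as a sequence of step functions with pass-through fallback. The step functions and `run_pipeline` helper below are hypothetical stand-ins, not the engine's actual module interfaces; the point is the ordering and the skip-on-failure behavior.

```python
def run_pipeline(request, steps):
    """Apply each (name, fn) step in order; a failing step is skipped
    and the request passes through unchanged (fallback behavior)."""
    for name, fn in steps:
        try:
            request = fn(request)
        except Exception:
            pass  # fallback: leave the request untouched and continue
    return request

def make_step(name):
    """Hypothetical step that only records its name; real modules do the work."""
    def step(request):
        request.setdefault("trace", []).append(name)
        return request
    return step

# Step order mirroring the 7-step description above.
STEPS = [(n, make_step(n)) for n in
         ("compress", "dedupe", "history", "confidence_gate",
          "budget_check", "dispatch", "track")]
```

A step that raises (e.g. a model that fails to load) simply drops out of the chain rather than aborting the call.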
Context compression uses a 4-stage pipeline: chunk by paragraphs, embed with all-MiniLM-L6-v2 for cosine dedup at a 0.92 threshold, abstractively summarize each chunk with T5-small, then rank and select the top-K chunks with a ms-marco-MiniLM-L6-v2 cross-encoder. An extractive fallback (first N sentences) kicks in when the models fail to load.
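The dedup stage and the extractive fallback can be sketched in pure Python. Bag-of-words cosine here is a stand-in for all-MiniLM-L6-v2 embeddings (an assumption made so the sketch is self-contained); the 0.92 threshold and first-N-sentences fallback are from the design above.

```python
import math
import re
from collections import Counter

def _bow(text):
    # Stand-in embedding: bag-of-words term counts, NOT all-MiniLM-L6-v2.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def dedup_chunks(chunks, threshold=0.92):
    """Drop any chunk whose similarity to an already-kept chunk
    meets or exceeds the threshold."""
    kept, vecs = [], []
    for chunk in chunks:
        vec = _bow(chunk)
        if all(cosine(vec, kv) < threshold for kv in vecs):
            kept.append(chunk)
            vecs.append(vec)
    return kept

def extractive_fallback(text, n_sentences=3):
    """First-N-sentences fallback used when summarization models fail to load."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n_sentences])
```

In the real pipeline the kept chunks would then be summarized with T5-small and reranked by the cross-encoder; both are omitted here.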
UCB1 bandit routing over 4 arms (explore, deepen, synthesize, evaluate) with exploration constant c = sqrt(2). The reward signal is log1p(facts_per_dollar)/10, capped at 1.0. The overnight loop uses the bandit to pick research actions, with convergence detection (a streak of 5 low-yield iterations) and hard stops on both time and budget.
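A minimal sketch of the bandit with the stated arms, exploration constant, and reward shaping. The class and method names are illustrative, not the runtime's actual API.

```python
import math

ARMS = ("explore", "deepen", "synthesize", "evaluate")

class UCB1:
    """UCB1 over the four research actions, c = sqrt(2)."""
    def __init__(self, arms=ARMS, c=math.sqrt(2)):
        self.arms = list(arms)
        self.c = c
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}  # running mean reward

    def select(self):
        # Play each arm once before applying the UCB formula.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())
        return max(self.arms, key=lambda a: self.values[a]
                   + self.c * math.sqrt(math.log(total) / self.counts[a]))

    def update(self, arm, facts, cost_usd):
        # Reward: log1p(facts per dollar) / 10, capped at 1.0.
        reward = min(1.0, math.log1p(facts / max(cost_usd, 1e-9)) / 10)
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
        return reward
```

The log1p/10 shaping keeps typical facts-per-dollar yields well inside [0, 1], which UCB1's confidence bound assumes.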
Confidence gate classifies tasks by keyword into extraction, retrieval, evaluation, or generation. Extraction and retrieval are always handled locally (confidence 0.95); complex generation routes to the expensive API; simple queries stay local.
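The gate can be sketched as keyword matching plus a routing rule. The keyword lists, the 500-token complexity cutoff, and the non-0.95 confidence values below are all assumptions for illustration; only the four task classes and the 0.95 local-confidence for extraction/retrieval come from the design.

```python
# Hypothetical keyword vocabulary; the real classifier's lists aren't given.
TASK_KEYWORDS = {
    "extraction": ("extract", "parse", "pull out", "list the"),
    "retrieval": ("find", "look up", "search", "retrieve"),
    "evaluation": ("evaluate", "score", "judge", "compare"),
    "generation": ("write", "draft", "compose", "design"),
}

def classify(prompt):
    text = prompt.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(k in text for k in keywords):
            return task
    return "generation"  # default to the most general class

def route(prompt, complexity_tokens=0):
    """Return (destination, confidence) for a prompt."""
    task = classify(prompt)
    if task in ("extraction", "retrieval"):
        return ("local", 0.95)  # always handled locally
    if task == "generation" and complexity_tokens > 500:  # assumed threshold
        return ("expensive_api", 0.5)
    return ("local", 0.7)  # simple queries stay local
```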
Knowledge graph uses a ChromaDB PersistentClient with cosine HNSW. Facts are extracted from API responses via heuristic regexes (bullet points, numbered lists, factual indicators), capped at 20 facts per response.
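The fact-extraction heuristics can be sketched with three regexes; the exact patterns below are assumptions matching the description (bullets, numbered lists, "X is/are Y" indicators), and the ChromaDB persistence step is omitted.

```python
import re

MAX_FACTS = 20  # cap per response

# Heuristic candidate-fact patterns (illustrative, not the runtime's exact regexes).
_PATTERNS = [
    re.compile(r"^\s*[-*•]\s+(.+)$", re.M),                             # bullet points
    re.compile(r"^\s*\d+[.)]\s+(.+)$", re.M),                           # numbered lists
    re.compile(r"^(.+\b(?:is|are|was|were|has|have)\b.+[.])$", re.M),   # factual indicators
]

def extract_facts(response_text):
    """Pull candidate facts from an API response, capped at MAX_FACTS."""
    facts, seen = [], set()
    for pattern in _PATTERNS:
        for match in pattern.finditer(response_text):
            fact = match.group(1).strip()
            if fact and fact not in seen:
                seen.add(fact)
                facts.append(fact)
            if len(facts) >= MAX_FACTS:
                return facts
    return facts
```

Each extracted fact would then be embedded and upserted into the ChromaDB collection.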
3-tier sliding-window history: active (last N turns kept in full), compressed archive (T5-summarized), and knowledge graph (extracted facts, persisted). System messages are always preserved.
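A sketch of the first two tiers plus the system-message rule. The T5 summarizer is stubbed as simple truncation (an assumption), and the knowledge-graph tier is omitted since it lives in ChromaDB.

```python
from collections import deque

class TieredHistory:
    """Sliding-window history: last N turns kept in full, older turns
    compressed into an archive, system messages always preserved."""
    def __init__(self, active_turns=4, summarize=lambda text: text[:80]):
        self.n = active_turns
        self.summarize = summarize  # stand-in for T5-small summarization
        self.active = deque()       # tier 1: full recent turns
        self.archive = []           # tier 2: compressed summaries
        self.system = []            # always kept verbatim

    def add(self, role, content):
        if role == "system":
            self.system.append({"role": role, "content": content})
            return
        self.active.append({"role": role, "content": content})
        while len(self.active) > self.n:
            evicted = self.active.popleft()
            self.archive.append(self.summarize(evicted["content"]))

    def messages(self):
        """Assemble the prompt: system msgs, archive summary, active turns."""
        summary = "; ".join(self.archive)
        archived = ([{"role": "system", "content": f"[Archive] {summary}"}]
                    if summary else [])
        return self.system + archived + list(self.active)
```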
Content dedup via SHA-256: content that has already been sent is replaced with a '[Previously provided in call #N]' reference, so the same tokens are never paid for twice.
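The hash-and-replace scheme is straightforward to sketch; the class below is an illustrative shape, assuming content arrives as a list of text blocks per call.

```python
import hashlib

class ContentDeduper:
    """Replace already-sent content blocks with a reference to the call
    that first carried them, so identical tokens are never paid for twice."""
    def __init__(self):
        self._seen = {}   # sha256 hex digest -> first call number
        self._call = 0

    def process(self, blocks):
        self._call += 1
        out = []
        for block in blocks:
            digest = hashlib.sha256(block.encode("utf-8")).hexdigest()
            if digest in self._seen:
                out.append(f"[Previously provided in call #{self._seen[digest]}]")
            else:
                self._seen[digest] = self._call
                out.append(block)
        return out
```

Hashing the exact bytes means only verbatim repeats are caught; near-duplicates are left to the embedding-based dedup in the compression stage.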