
nightshift

LLM cost-optimization agent runtime

Overview

An agent runtime that sits between your code and LLM APIs, intercepting every call to compress context, deduplicate content, route between cheap and expensive models, and track spending. The goal: run autonomous research agents overnight without burning through API budgets.

Architecture

1. The core NightShift engine orchestrates a seven-step pipeline per call: compress, deduplicate, manage history, confidence-gate, budget-check, dispatch, track. Each step is a separate module with its own fallback behavior.
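A rough sketch of that shape (the step names and `CallContext` fields here are illustrative, not nightshift's actual internals): each stage is a callable that transforms the call, and a failing stage is skipped rather than aborting the call.

```python
# Illustrative pipeline skeleton; names and fields are assumptions,
# not nightshift's real modules.
from dataclasses import dataclass, field


@dataclass
class CallContext:
    messages: list
    model: str
    spent_usd: float = 0.0
    metadata: dict = field(default_factory=dict)


class Pipeline:
    def __init__(self, steps):
        # steps: list of (name, transform) pairs; each transform takes
        # and returns a CallContext, raising on failure
        self.steps = steps

    def run(self, ctx: CallContext) -> CallContext:
        for name, step in self.steps:
            try:
                ctx = step(ctx)
            except Exception:
                # fallback behavior: skip the step, keep the call alive
                ctx.metadata[f"{name}_skipped"] = True
        return ctx
```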

2. Context compression uses a four-stage pipeline: chunk by paragraphs, embed with all-MiniLM-L6-v2 and deduplicate by cosine similarity at a 0.92 threshold, abstractively summarize each surviving chunk with T5-small, then rank and select the top K with the ms-marco-MiniLM cross-encoder. When the models fail to load, an extractive fallback keeps the first N sentences.
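A condensed sketch of the four stages, assuming sentence-transformers and transformers are installed. The 0.92 threshold comes from the design; the chunking rule, length limits, and `top_k` default are illustrative (the cross-encoder checkpoint on the Hugging Face hub is spelled ms-marco-MiniLM-L-6-v2).

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")
summarizer = pipeline("summarization", model="t5-small")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def compress(text: str, query: str, top_k: int = 5) -> str:
    # 1. chunk by paragraphs
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    # 2. embed, then drop near-duplicates (cosine similarity >= 0.92)
    embs = embedder.encode(chunks, normalize_embeddings=True)
    kept, kept_embs = [], []
    for chunk, emb in zip(chunks, embs):
        if all(np.dot(emb, e) < 0.92 for e in kept_embs):
            kept.append(chunk)
            kept_embs.append(emb)
    # 3. abstractive summary per surviving chunk
    summaries = [
        summarizer(c, max_length=60, min_length=10, truncation=True)[0]["summary_text"]
        for c in kept
    ]
    # 4. cross-encoder rerank against the query, keep top-K
    scores = reranker.predict([(query, s) for s in summaries])
    ranked = sorted(zip(scores, summaries), key=lambda p: p[0], reverse=True)
    return "\n".join(s for _, s in ranked[:top_k])
```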

3. UCB1 bandit routing with four arms (explore, deepen, synthesize, evaluate) and exploration constant c = sqrt(2). The reward signal is log1p(facts_per_dollar)/10, capped at 1.0. The overnight loop uses the bandit to pick research actions, with convergence detection (a streak of five low-yield iterations) and hard stops on time and budget.
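The bandit itself fits in a few lines; this sketch is textbook UCB1 with the reward shaping described above, using the four arm names from the overnight loop.

```python
import math

ARMS = ["explore", "deepen", "synthesize", "evaluate"]
C = math.sqrt(2)  # exploration constant


class UCB1:
    def __init__(self):
        self.counts = {a: 0 for a in ARMS}
        self.values = {a: 0.0 for a in ARMS}  # running mean reward per arm
        self.total = 0

    def select(self) -> str:
        # play each arm once before applying the confidence bound
        for arm in ARMS:
            if self.counts[arm] == 0:
                return arm
        return max(
            ARMS,
            key=lambda a: self.values[a]
            + C * math.sqrt(math.log(self.total) / self.counts[a]),
        )

    def update(self, arm: str, facts: int, cost_usd: float):
        # reward = log1p(facts_per_dollar) / 10, capped at 1.0
        reward = min(1.0, math.log1p(facts / max(cost_usd, 1e-9)) / 10)
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Dividing log1p(facts_per_dollar) by 10 keeps realistic yields well inside [0, 1], the reward range UCB1's confidence bound assumes.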

4. The confidence gate classifies each task by keyword into extraction, retrieval, evaluation, or generation. Extraction and retrieval are always handled locally (confidence 0.95), complex generation routes to the expensive API, and simple queries stay local.
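A minimal sketch of the gate; only the 0.95 local confidence for extraction and retrieval comes from the design, while the keyword lists and the other confidence values are assumptions for illustration.

```python
# Hypothetical keyword lists; nightshift's actual lists are not shown here.
TASK_KEYWORDS = {
    "extraction": ["extract", "parse", "pull out", "list the"],
    "retrieval": ["recall", "what did", "previously", "look up"],
    "evaluation": ["compare", "rate", "judge", "which is better"],
    "generation": ["write", "design", "explain why", "create"],
}


def gate(prompt: str) -> tuple[str, float]:
    lowered = prompt.lower()
    for task, words in TASK_KEYWORDS.items():
        if any(w in lowered for w in words):
            if task in ("extraction", "retrieval"):
                return "local", 0.95        # always handled locally
            if task == "generation":
                return "expensive_api", 0.6  # complex generation escalates
            return "local", 0.8
    return "local", 0.7                      # simple/unclassified stays local
```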

5. The knowledge graph uses a ChromaDB PersistentClient with a cosine HNSW index. Facts are extracted from API responses via heuristic regexes (bullet points, numbered lists, factual indicators), capped at 20 facts per response.
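Setting up the store takes a few lines of the public chromadb API; the path and collection name here are illustrative, and the extraction heuristics themselves are sketched under Design Decisions below.

```python
import chromadb

client = chromadb.PersistentClient(path="./nightshift_kg")  # illustrative path
facts = client.get_or_create_collection(
    name="facts",
    metadata={"hnsw:space": "cosine"},  # cosine-distance HNSW index
)


def store_facts(extracted: list[str], call_id: int):
    # cap enforced upstream: at most 20 facts per response
    batch = extracted[:20]
    if batch:
        facts.add(
            documents=batch,
            ids=[f"call{call_id}-fact{i}" for i in range(len(batch))],
        )
```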

6. A three-tier sliding-window history: an active tier (the last N turns kept in full), a compressed archive (T5-summarized older turns), and the knowledge graph (extracted facts, persisted). System messages are always preserved.
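A sketch of the first two tiers (the third, the knowledge graph, persists separately as described above); the tier size and the truncating stand-in for the T5 summarizer are illustrative.

```python
from collections import deque


class TieredHistory:
    def __init__(self, active_turns: int = 6, summarize=lambda text: text[:200]):
        self.system = []                           # always preserved verbatim
        self.active = deque(maxlen=active_turns)   # tier 1: recent turns, full
        self.archive = []                          # tier 2: summarized older turns
        self.summarize = summarize                 # stand-in for T5 summarization

    def push(self, msg: dict):
        if msg["role"] == "system":
            self.system.append(msg)
            return
        if len(self.active) == self.active.maxlen:
            # the oldest turn is about to slide out: compress it first
            self.archive.append(self.summarize(self.active[0]["content"]))
        self.active.append(msg)

    def render(self) -> list:
        archived = (
            [{"role": "system", "content": "Earlier context: " + " ".join(self.archive)}]
            if self.archive
            else []
        )
        return self.system + archived + list(self.active)
```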

7. Content dedup via SHA-256: content that has already been sent is replaced with a '[Previously provided in call #N]' reference, so the same tokens are never paid for twice.
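The mechanism is a hash-keyed lookup; this sketch uses the reference string quoted above, with the per-session scope of the registry as an assumption.

```python
import hashlib


class Deduper:
    def __init__(self):
        self.seen: dict[str, int] = {}  # sha256 hex digest -> first call number

    def filter(self, content: str, call_no: int) -> str:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self.seen:
            # same tokens already paid for: send a cheap reference instead
            return f"[Previously provided in call #{self.seen[digest]}]"
        self.seen[digest] = call_no
        return content
```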

Design Decisions

Lazy model loading with graceful fallbacks
The T5 summarizer, the MiniLM embedder, and the cross-encoder reranker are loaded only on first use. If any model fails to load (missing dependencies, GPU issues), the system falls back to simpler alternatives (extractive summarization, keyword matching), so nightshift keeps working even in degraded environments.
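The pattern, sketched for the summarizer (the function name and the three-sentence extractive fallback are illustrative):

```python
_summarizer = None


def get_summarizer():
    """Load T5 on first use; fall back to extractive summarization."""
    global _summarizer
    if _summarizer is None:
        try:
            from transformers import pipeline
            _summarizer = pipeline("summarization", model="t5-small")
        except Exception:
            # degraded mode: first-N-sentences fallback, mimicking the
            # pipeline's output shape
            _summarizer = lambda text, **kw: [
                {"summary_text": ". ".join(text.split(". ")[:3])}
            ]
    return _summarizer
```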
SDK monkey-patching over explicit wrapper
NightShift patches the create methods of both the OpenAI and Anthropic SDKs at runtime and returns SimpleNamespace mock objects in place of SDK response types. Existing agent code works without modification -- just import nightshift and every call is intercepted. The tradeoff is fragility if SDK internals change.
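A minimal sketch of the interception for the OpenAI side. The resource path is an assumption about the v1 SDK's internals (exactly the fragility noted above), and `optimize`/`dispatch` are hypothetical nightshift hooks stubbed out here.

```python
from types import SimpleNamespace
import openai.resources.chat.completions as chat


def optimize(messages):
    """Hypothetical hook: compress, dedup, confidence-gate."""
    return messages


def dispatch(kwargs):
    """Hypothetical hook: the raw-httpx dispatcher (see next decision)."""
    return {"choices": [{"message": {"content": "stub"}}]}


_original_create = chat.Completions.create  # kept for uninstall/debugging


def patched_create(self, *args, **kwargs):
    kwargs["messages"] = optimize(kwargs.get("messages", []))
    data = dispatch(kwargs)
    # mimic the SDK's response shape with plain namespaces
    return SimpleNamespace(
        choices=[
            SimpleNamespace(
                message=SimpleNamespace(content=data["choices"][0]["message"]["content"])
            )
        ]
    )


chat.Completions.create = patched_create
```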
Raw httpx dispatch over SDK clients
The dispatcher uses raw httpx rather than provider SDKs, supporting OpenAI, Anthropic, Google, and DeepSeek through a unified interface. This avoids multiple SDK dependencies and version conflicts, but means maintaining API compatibility manually.
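A sketch of the dispatcher for two of the four providers, using the public REST endpoints and auth headers; retries, streaming, and error normalization are omitted.

```python
import os
import httpx


def dispatch(provider: str, model: str, messages: list, max_tokens: int = 1024) -> dict:
    if provider == "openai":
        r = httpx.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={"model": model, "messages": messages, "max_tokens": max_tokens},
            timeout=60,
        )
    elif provider == "anthropic":
        r = httpx.post(
            "https://api.anthropic.com/v1/messages",
            headers={
                "x-api-key": os.environ["ANTHROPIC_API_KEY"],
                "anthropic-version": "2023-06-01",
            },
            json={"model": model, "messages": messages, "max_tokens": max_tokens},
            timeout=60,
        )
    else:
        raise ValueError(f"unsupported provider: {provider}")
    r.raise_for_status()
    return r.json()
```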
Heuristic fact extraction over LLM-based extraction
Facts are extracted from responses using regex patterns (bullet points, numbered lists, factual phrases) rather than an LLM call. Using an LLM to extract facts from LLM responses would be more accurate but would add cost and latency -- defeating the purpose of a cost-optimization tool.
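Illustrative patterns in that spirit; the regexes nightshift actually ships are not reproduced here.

```python
import re

# Hypothetical heuristics: bullet/numbered-list items and "X is/has ..."
# style factual sentences.
PATTERNS = [
    re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+(.{15,})$", re.MULTILINE),
    re.compile(r"\b((?:[A-Z][\w-]*\s)+(?:is|are|was|has|can|supports)\b[^.]{10,100}\.)"),
]


def extract_facts(text: str, cap: int = 20) -> list[str]:
    facts = []
    for pattern in PATTERNS:
        facts.extend(pattern.findall(text))
    # dedupe while preserving order, then enforce the per-response cap
    seen = set()
    unique = [f for f in facts if not (f in seen or seen.add(f))]
    return unique[:cap]
```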

Tech Stack

Python · T5-small (summarization) · MiniLM-L6-v2 (embeddings) · ms-marco-MiniLM (reranking) · ChromaDB · tiktoken · httpx · 141 tests