C++20 / AVX2 SIMD intrinsics / Memory-mapped I/O

VectorVault

HNSW vector search engine from scratch in C++20

Overview

A from-scratch approximate nearest neighbor search engine implementing the HNSW (Hierarchical Navigable Small World) algorithm. The goal was to understand vector search at every level -- graph construction, distance calculation, persistence, and concurrency -- without relying on any existing library.

Architecture

Graph construction follows the HNSW paper: exponential decay level assignment via -log(uniform)/log(2), greedy descent from max level, ef_construction-bounded search at each layer. Default parameters: M=16, ef_construction=200, max_M0=32.

Distance functions (L2 and cosine) are AVX2-accelerated using _mm256_loadu_ps and _mm256_fmadd_ps, processing 8 floats per cycle with scalar tail loops for remainder. Runtime detection falls back to naive scalar implementations when AVX2 is unavailable.

Persistence uses cross-platform memory-mapped I/O (POSIX mmap + Windows CreateFileMapping). Files include a magic header (0x56564C54 / 'VVLT'), format version, and CRC32 checksum. Loads are atomic -- all nodes deserialize into temporaries, CRC validates, neighbor references are integrity-checked, then state is swapped in.

Concurrency uses std::shared_mutex: search() takes shared_lock (concurrent reads), add() takes unique_lock (exclusive writes). A thread pool handles API requests, though index inserts remain serialized.

Neighbor selection uses simple nearest-M rather than the diversity heuristic from the original paper -- a deliberate simplicity tradeoff for faster build times at the cost of potentially lower recall in highly clustered data.

Design Decisions

Simple nearest-M neighbor selection over diversity heuristic

The original HNSW paper proposes a diversity-aware selection that avoids redundant neighbors. I chose the simpler approach: sort by distance, take top M. This is faster to build and easier to reason about. The recall difference matters mainly in highly clustered data, which isn't the primary use case.

Memory-mapped snapshots over write-ahead log

WAL provides incremental persistence and crash recovery, but adds complexity. mmap + CRC32 gives simpler snapshot-oriented persistence. The tradeoff is that a crash mid-write loses the entire update, but the atomic load validation catches corruption.

AVX2-only SIMD over portable SIMD (SSE4/NEON)

Targeting x86-64 server hardware where AVX2 is ubiquitous. Adding SSE4 and ARM NEON paths would broaden compatibility but triple the distance function code. The scalar fallback covers edge cases.

shared_mutex over lock-free or partitioned index

Lock-free HNSW is possible but extremely complex to implement correctly. Partitioned indices require a merging strategy. shared_mutex is correct-first: concurrent reads with exclusive writes, simple to audit.

Per-node heap allocation over arena/pool allocator

Each node is a unique_ptr<Node>. An arena allocator would improve cache locality for sequential access, but adds memory management complexity. For a project focused on algorithmic correctness, simplicity wins.

Tech Stack

C++20AVX2 SIMD intrinsicsMemory-mapped I/OCRC32 checksumsstd::shared_mutexCMakeDockerhttplib (REST API)