HPC cluster observability · Go + Python

Pulse

HPC operators juggle GPU telemetry, job scheduling, and alerting across disconnected tools.

What I built

A containerized observability stack (Go + Python microservices, React 19) with Prometheus / OpenTelemetry instrumentation (DCGM-style GPU metrics), a SLURM-style priority job scheduler, Alertmanager + Grafana, and an Ollama-backed ops assistant that injects live cluster state into prompts.

Stack

GoPythonPrometheusOllamaReact

Status

Demo stack with simulated telemetry; real instrumentation wiring.