Back to work
HPC cluster observability · Go + Python
Pulse
HPC operators juggle GPU telemetry, job scheduling, and alerting across disconnected tools.
View sourceWhat I built
A containerized observability stack (Go + Python microservices, React 19) with Prometheus / OpenTelemetry instrumentation (DCGM-style GPU metrics), a SLURM-style priority job scheduler, Alertmanager + Grafana, and an Ollama-backed ops assistant that injects live cluster state into prompts.
Stack
GoPythonPrometheusOllamaReact
Status
Demo stack with simulated telemetry; real instrumentation wiring.