ClusterCockpit
ClusterCockpit is an open-source framework for job-specific and cluster-wide performance and energy monitoring in HPC environments. With this toolkit, users can analyze how their jobs utilize compute resources, spot inefficiencies, and optimize overall performance.
Developed as a collaborative project led by the NHR center NHR@FAU, ClusterCockpit benefits from contributions from several other HPC centers. All components are released under the MIT license and freely available for use.
Technical Background
ClusterCockpit consists of several modular components implemented in Go; the web frontend is built with Svelte:
- cc-backend: A REST and GraphQL API backend, which also provides the web interface. This is the core component of ClusterCockpit.
- cc-metric-collector: A lightweight agent running on each compute node, responsible for collecting hardware performance and energy metrics (CPU load, memory usage, GPU utilization, power consumption, etc.) and forwarding them to the backend.
- cc-metric-store: An in-memory time-series cache for fast metric access and visualization.
While ClusterCockpit can serve as a complete, integrated monitoring solution, its components are also designed for seamless integration with external systems.
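To make the idea of an in-memory time-series cache more concrete, the following is a minimal sketch of a fixed-capacity ring buffer that keeps only the most recent samples of one metric. It is an illustration only and does not reflect how cc-metric-store is actually implemented; the names `Sample`, `RingSeries`, `Append`, and `Range` are hypothetical.

```go
package main

import "fmt"

// Sample is one measurement of a metric (hypothetical type for illustration).
type Sample struct {
	Timestamp int64   // Unix time in seconds
	Value     float64 // metric value, e.g. CPU load
}

// RingSeries holds the newest samples of a single metric in a fixed-size buffer.
type RingSeries struct {
	buf   []Sample
	start int // index of the oldest retained sample
	count int // number of valid samples in buf
}

// NewRingSeries allocates a series; capacity must be positive.
func NewRingSeries(capacity int) *RingSeries {
	return &RingSeries{buf: make([]Sample, capacity)}
}

// Append stores a sample, overwriting the oldest one once the buffer is full.
func (s *RingSeries) Append(x Sample) {
	if s.count < len(s.buf) {
		s.buf[(s.start+s.count)%len(s.buf)] = x
		s.count++
		return
	}
	s.buf[s.start] = x
	s.start = (s.start + 1) % len(s.buf)
}

// Range returns the retained samples with from <= Timestamp <= to, oldest first.
func (s *RingSeries) Range(from, to int64) []Sample {
	var out []Sample
	for i := 0; i < s.count; i++ {
		x := s.buf[(s.start+i)%len(s.buf)]
		if x.Timestamp >= from && x.Timestamp <= to {
			out = append(out, x)
		}
	}
	return out
}

func main() {
	load := NewRingSeries(3) // keep only the three newest samples
	for i := 0; i < 5; i++ {
		load.Append(Sample{Timestamp: int64(60 * i), Value: float64(i)})
	}
	// Only the samples that still fit in the buffer are returned.
	fmt.Println(load.Range(0, 240))
}
```

Bounding the buffer keeps memory use constant per metric, which is the usual trade-off for a cache: recent data is served quickly, while older data has to come from elsewhere.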
User Interface Features
The web interface provides a comprehensive set of tools for monitoring and analysis. Users can:
- Browse their running and completed jobs
- Analyze key metrics such as:
  - CPU load
  - Memory bandwidth and usage
  - Floating-Point Operations per Second (FLOPS)
  - GPU utilization (if available)
  - Energy and power consumption (where supported)
- Identify performance issues such as:
  - Underutilized CPU or GPU resources
  - Memory bandwidth bottlenecks
  - Inefficient process pinning
- Explore Roofline diagrams to visualize computational intensity and performance
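As background for reading a Roofline diagram: the model bounds attainable performance by the smaller of the peak compute rate and the product of peak memory bandwidth and operational intensity (FLOP per byte). The sketch below computes that bound for one job from its measured FLOP rate and memory bandwidth. The type `Roofline`, the function names, and all numbers are illustrative assumptions, not part of ClusterCockpit or measurements of a real system.

```go
package main

import (
	"fmt"
	"math"
)

// Roofline describes a node by two peak rates (hypothetical example type).
type Roofline struct {
	PeakGFlops    float64 // peak floating-point rate in GFLOP/s
	PeakBandwidth float64 // peak memory bandwidth in GB/s
}

// Attainable returns min(peak compute, bandwidth * intensity) in GFLOP/s
// for an operational intensity given in FLOP/byte.
func (r Roofline) Attainable(intensity float64) float64 {
	return math.Min(r.PeakGFlops, r.PeakBandwidth*intensity)
}

// Ridge returns the intensity at which a code stops being memory-bound.
func (r Roofline) Ridge() float64 {
	return r.PeakGFlops / r.PeakBandwidth
}

// Intensity derives FLOP/byte from a measured FLOP rate and memory bandwidth.
func Intensity(gflops, gbytesPerSec float64) float64 {
	return gflops / gbytesPerSec
}

func main() {
	// Assumed machine limits, chosen only to keep the arithmetic simple.
	node := Roofline{PeakGFlops: 2000, PeakBandwidth: 200}

	// Assumed job measurements: 45 GFLOP/s at 180 GB/s memory traffic.
	measuredGFlops, measuredBW := 45.0, 180.0
	oi := Intensity(measuredGFlops, measuredBW)
	bound := node.Attainable(oi)

	fmt.Printf("intensity: %.2f FLOP/byte (ridge at %.1f)\n", oi, node.Ridge())
	fmt.Printf("attainable: %.1f GFLOP/s, measured: %.1f GFLOP/s\n", bound, measuredGFlops)
	if oi < node.Ridge() {
		fmt.Println("job sits under the bandwidth roof: memory-bound")
	} else {
		fmt.Println("job sits under the compute roof: compute-bound")
	}
}
```

A job whose point lies far below both roofs is limited by neither peak, which often points to issues such as poor pinning or underutilized cores rather than a hardware limit.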
Example Views
Job List View
Browse all of your jobs.
Detailed Job View
See time-series plots for CPU load, memory usage, GPU utilization and other metrics.