ClusterCockpit
Latest revision as of 16:04, 4 September 2025
ClusterCockpit is an open-source framework for job-specific and cluster-wide performance and energy monitoring in HPC environments. With this toolkit, users can analyze how their jobs utilize compute resources, spot inefficiencies, and optimize overall performance.
Developed as a collaborative project led by the NHR center NHR@FAU, ClusterCockpit benefits from contributions from several other HPC centers. All components are released under the MIT license and freely available for use.
Technical Background
ClusterCockpit consists of several modular components. The components are implemented in Go, and the web frontend is built with Svelte:
- cc-backend: A REST and GraphQL API backend, which also provides the web interface. This is the core component of ClusterCockpit.
- cc-metric-collector: A lightweight agent running on each compute node, responsible for collecting hardware performance and energy metrics (CPU load, memory usage, GPU utilization, power consumption, etc.) and forwarding them to the backend.
- cc-metric-store: An in-memory time-series cache for fast metric access and visualization.
While ClusterCockpit can serve as a complete, integrated monitoring solution, its components are also designed for seamless integration with external systems.
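To make the idea of an in-memory time-series cache more concrete, here is a minimal conceptual sketch in Python. It is not the cc-metric-store implementation (which is written in Go) and does not reflect its API; the class `MetricCache`, its methods, and the `max_samples` retention parameter are hypothetical names chosen purely for illustration.

```python
from collections import defaultdict, deque


class MetricCache:
    """Hypothetical in-memory cache keeping the newest samples per (node, metric)."""

    def __init__(self, max_samples=1000):
        # Each series is a bounded deque, so the oldest samples are dropped
        # automatically once max_samples is reached.
        self._series = defaultdict(lambda: deque(maxlen=max_samples))

    def write(self, node, metric, timestamp, value):
        """Append one sample; timestamps are assumed to arrive in increasing order."""
        self._series[(node, metric)].append((timestamp, value))

    def query(self, node, metric, start, end):
        """Return all cached samples with start <= timestamp <= end."""
        series = self._series.get((node, metric), ())
        return [(t, v) for t, v in series if start <= t <= end]


# Illustrative usage with made-up node name and load values.
cache = MetricCache(max_samples=3)
for t, load in enumerate([0.5, 1.0, 7.5, 8.0]):
    cache.write("node01", "cpu_load", t, load)

recent = cache.query("node01", "cpu_load", 0, 10)
print(recent)
```

Because the cache is bounded, only the three newest samples remain after four writes. A bounded buffer trades history for fast access, which is the general motivation for pairing a cache like this with a separate long-term store.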
User Interface Features
The web interface provides a comprehensive set of tools for monitoring and analysis. Users can:
- Browse their running and completed jobs
- Analyze key metrics such as:
  - CPU load
  - Memory bandwidth and usage
  - Floating-Point Operations per Second (FLOPS)
  - GPU utilization (if available)
  - Energy and power consumption (where supported)
- Identify performance issues such as:
  - Underutilized CPU or GPU resources
  - Memory bandwidth bottlenecks
  - Inefficient process pinning
- Explore Roofline diagrams to visualize computational intensity and performance
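The basic Roofline model behind such diagrams bounds attainable performance by the smaller of two limits: peak compute throughput, and arithmetic intensity multiplied by peak memory bandwidth. The sketch below works through that formula in Python. The machine limits are made-up illustrative values, not measurements of any real system, and the function name `roofline` is a hypothetical helper, not part of ClusterCockpit.

```python
def roofline(intensity_flop_per_byte, peak_gflops, peak_bandwidth_gbs):
    """Attainable performance in GFLOP/s under the basic Roofline model."""
    return min(peak_gflops, intensity_flop_per_byte * peak_bandwidth_gbs)


# Assumed, illustrative machine limits (not taken from a real system).
PEAK_GFLOPS = 2000.0
PEAK_BANDWIDTH_GBS = 200.0

# Intensity at which the memory roof meets the compute roof.
ridge_point = PEAK_GFLOPS / PEAK_BANDWIDTH_GBS

for intensity in (0.5, 10.0, 50.0):
    bound = roofline(intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    regime = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"I = {intensity:5.1f} FLOP/byte -> at most {bound:7.1f} GFLOP/s ({regime})")
```

A job whose measured point sits far below the roof at its intensity has headroom; one sitting on the sloped part of the roof is limited by memory bandwidth rather than by compute.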
Example Views
Job List View
Browse all of your jobs.
Detailed Job View
See time-series plots for CPU load, memory usage, GPU utilization, and other metrics.
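A time series like the ones plotted in this view can also be reduced to a single efficiency figure. The following Python sketch assumes CPU load is reported as the number of busy cores per sample. The function name, the sample values, and the 50% threshold are hypothetical choices for illustration, not ClusterCockpit defaults.

```python
from statistics import mean


def cpu_utilization(load_samples, allocated_cores):
    """Fraction of allocated cores that were busy on average (0.0 to 1.0)."""
    return mean(load_samples) / allocated_cores


# Made-up samples: about 4 busy cores on a 16-core allocation.
samples = [4.0, 4.0, 3.5, 4.5]
util = cpu_utilization(samples, allocated_cores=16)

# Assumed threshold: flag jobs using less than half of their cores.
underutilized = util < 0.5
print(f"{util:.0%} of allocated cores busy; underutilized: {underutilized}")
```

A low value here often points to a job that requested more cores than it uses, or to threads that are pinned onto too few cores.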