Difference between revisions of "ClusterCockpit"

Latest revision as of 16:04, 4 September 2025

ClusterCockpit is an open-source framework for job-specific and cluster-wide performance and energy monitoring in HPC environments. With this toolkit, users can analyze how their jobs utilize compute resources, spot inefficiencies, and optimize overall performance.

Developed as a collaborative project led by the NHR center NHR@FAU, ClusterCockpit benefits from contributions from several other HPC centers. All components are released under the MIT license and freely available for use.

Technical background

ClusterCockpit consists of several modular components implemented in Go; the web frontend is built with Svelte:

* cc-backend: A REST and GraphQL API backend, which also provides the web interface. This is the core component of ClusterCockpit.
* cc-metric-collector: A lightweight agent running on each compute node, responsible for collecting hardware performance and energy metrics (CPU load, memory usage, GPU utilization, power consumption, etc.) and forwarding them to the backend.
* cc-metric-store: An in-memory time-series cache for fast metric access and visualization.
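To illustrate what an in-memory time-series cache does in general, the following is a minimal Python sketch. It is not the cc-metric-store implementation or its API (cc-metric-store is written in Go); the class `MetricCache`, its methods, and the host and metric names are hypothetical and chosen only for this example. The sketch keeps a bounded number of recent samples per host and metric, so that reads for plotting stay fast and memory use stays fixed.

```python
from collections import defaultdict, deque


class MetricCache:
    """Toy in-memory time-series cache (hypothetical, for illustration only)."""

    def __init__(self, max_samples=1000):
        # One bounded deque per (host, metric) pair.
        # When a deque is full, appending drops the oldest sample.
        self._series = defaultdict(lambda: deque(maxlen=max_samples))

    def write(self, host, metric, timestamp, value):
        """Append one sample; timestamps are assumed to arrive in increasing order."""
        self._series[(host, metric)].append((timestamp, value))

    def read(self, host, metric, start, end):
        """Return all samples with start <= timestamp <= end."""
        # .get() avoids creating an empty series for unknown keys.
        samples = self._series.get((host, metric), ())
        return [(t, v) for t, v in samples if start <= t <= end]


cache = MetricCache(max_samples=3)
for t, load in enumerate([0.5, 0.9, 1.0, 0.7]):
    cache.write("node01", "cpu_load", t, load)

# Only the three most recent samples are retained.
recent = cache.read("node01", "cpu_load", 0, 10)
print(recent)
```

Bounding each series trades history for predictable memory use; a real deployment would typically pair such a cache with persistent storage for completed jobs.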

While ClusterCockpit can serve as a complete, integrated monitoring solution, its components are also designed for seamless integration with external systems.

User Interface Features

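The web interface provides tools for monitoring and analysis. Users can:

* Browse their running and completed jobs
* Analyze key metrics such as:
** [https://hpc-wiki.info/hpc/Performance_metrics#CPU_load CPU load]
** [https://hpc-wiki.info/hpc/Performance_metrics#Memory_bandwidth Memory bandwidth] and [https://hpc-wiki.info/hpc/Performance_metrics#Memory_usage usage]
** [https://hpc-wiki.info/hpc/Performance_metrics#Flops Floating-Point Operations per Second (FLOPS)]
** GPU utilization (if available)
** [https://hpc-wiki.info/hpc/Performance_metrics#Power Energy and power consumption] (where supported)
* Identify performance issues such as:
** [https://hpc-wiki.info/hpc/Job_efficiency#Resource_underutilization Underutilized CPU or GPU resources]
** Memory bandwidth bottlenecks
** Inefficient [https://hpc-wiki.info/hpc/Binding/Pinning process pinning]
* Explore Roofline diagrams to visualize computational intensity and performance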
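A Roofline diagram relates a code's computational intensity (floating-point operations per byte of memory traffic) to the performance the hardware can sustain at that intensity. The Python sketch below shows the standard Roofline bound, min(peak performance, intensity × memory bandwidth). The function names and the machine numbers are illustrative assumptions, not values or code taken from ClusterCockpit.

```python
def roofline_limit(intensity, peak_gflops, peak_bandwidth_gbs):
    """Upper performance bound in GFLOP/s for a given intensity in FLOP/byte."""
    return min(peak_gflops, intensity * peak_bandwidth_gbs)


def classify(intensity, peak_gflops, peak_bandwidth_gbs):
    """Label a code as memory-bound or compute-bound relative to the ridge point."""
    # The ridge point is the intensity at which both limits are equal.
    ridge = peak_gflops / peak_bandwidth_gbs
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical node: 2000 GFLOP/s peak, 200 GB/s memory bandwidth (ridge at 10 FLOP/byte).
PEAK_GFLOPS = 2000.0
PEAK_BANDWIDTH_GBS = 200.0

low = roofline_limit(0.5, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)    # limited by bandwidth
high = roofline_limit(50.0, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)  # limited by peak FLOP rate
print(low, classify(0.5, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS))
print(high, classify(50.0, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS))
```

A job whose measured performance lies far below this bound at its intensity is a candidate for closer inspection; a job sitting on the sloped part of the roof is limited by memory bandwidth rather than by arithmetic throughput.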

Example Views

Job List View

Browse all of your jobs.

Detailed Job View

See time-series plots for CPU load, memory usage, GPU utilization and other metrics.
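As a rough illustration of how a CPU load time series can point to underutilization, the following Python sketch compares the average load with the number of allocated cores. It assumes that CPU load is reported as the number of busy cores per sample; the function names and the 0.5 threshold are hypothetical and are not taken from ClusterCockpit.

```python
def cpu_utilization(load_samples, allocated_cores):
    """Mean CPU load divided by the number of allocated cores."""
    if not load_samples:
        raise ValueError("load_samples must not be empty")
    if allocated_cores <= 0:
        raise ValueError("allocated_cores must be positive")
    mean_load = sum(load_samples) / len(load_samples)
    return mean_load / allocated_cores


def is_underutilized(load_samples, allocated_cores, threshold=0.5):
    """Flag a job whose average utilization falls below the (assumed) threshold."""
    return cpu_utilization(load_samples, allocated_cores) < threshold


# Hypothetical job: 32 cores allocated, but only about 8 are busy.
samples = [8.0, 8.0, 8.0, 8.0]
utilization = cpu_utilization(samples, 32)
print(utilization, is_underutilized(samples, 32))
```

A single average hides phases, so a low value is best treated as a prompt to look at the full time-series plot rather than as a verdict.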
