ClusterCockpit

From HPC Wiki
Revision as of 16:06, 30 May 2025 by Robert-externbrink-21b8@ruhr-uni-bochum.de (talk | contribs) (Creation of ClusterCockpit)

ClusterCockpit is an open-source framework for job-specific and cluster-wide performance and energy monitoring in HPC environments. With this toolkit, users can analyze how their jobs utilize compute resources, spot inefficiencies, and optimize overall performance.

Developed as a collaborative project led by the NHR center NHR@FAU, ClusterCockpit benefits from contributions from several other HPC centers. All components are released under the MIT license and freely available for use.

Technical background

ClusterCockpit consists of several modular components, all implemented in Go, with a web frontend built on Svelte:

  • cc-backend: A REST and GraphQL API backend, which also provides the web interface. This is the core component of ClusterCockpit.
  • cc-metric-collector: A lightweight agent running on each compute node, responsible for collecting hardware performance and energy metrics (CPU load, memory usage, GPU utilization, power consumption, etc.) and forwarding them to the backend.
  • cc-metric-store: An in-memory time-series cache for fast metric access and visualization.
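
To illustrate what an in-memory time-series cache does, the following Python sketch keeps a fixed number of recent samples per host and metric and answers time-range queries. It is a conceptual illustration only, not the cc-metric-store implementation: the class `MetricCache`, its methods, and the host and metric names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


class MetricCache:
    """Hypothetical cache: one bounded buffer per (host, metric) pair."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._series: Dict[Tuple[str, str], Deque[Sample]] = {}

    def write(self, host: str, metric: str, timestamp: int, value: float) -> None:
        key = (host, metric)
        if key not in self._series:
            # A deque with maxlen drops the oldest sample once it is full.
            self._series[key] = deque(maxlen=self.capacity)
        self._series[key].append((timestamp, value))

    def read(self, host: str, metric: str, start: int, end: int) -> List[Sample]:
        # Return samples with start <= timestamp <= end, oldest first.
        buf = self._series.get((host, metric), ())
        return [(t, v) for (t, v) in buf if start <= t <= end]


# Assumed example data: four CPU load samples at 60-second intervals.
cache = MetricCache(capacity=3)
for i, load in enumerate([0.5, 1.0, 7.5, 8.0]):
    cache.write("node01", "cpu_load", i * 60, load)

recent = cache.read("node01", "cpu_load", 0, 180)
print(recent)
```

Because the buffer is bounded, memory use stays constant per series, and the oldest sample (timestamp 0 in this example) is discarded once the capacity is exceeded.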

While ClusterCockpit can serve as a complete, integrated monitoring solution, its components are also designed for seamless integration with external systems.

User Interface Features

The web interface provides a comprehensive set of tools for monitoring and analysis. Users can:

  • Browse their running and completed jobs
  • Analyze key metrics such as:
    • CPU load
    • Memory bandwidth and usage
    • Floating-Point Operations per Second (FLOPS)
    • GPU utilization (if available)
    • Energy and power consumption (where supported)
  • Identify performance issues such as:
    • Underutilized CPU or GPU resources
    • Memory bandwidth bottlenecks
    • Inefficient process pinning
  • Explore Roofline diagrams to visualize computational intensity and performance
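
As background for reading a Roofline diagram, the sketch below applies the standard Roofline model: attainable performance is the minimum of the peak floating-point rate and the product of operational intensity (FLOP per byte) and peak memory bandwidth. The function names and all hardware and job numbers are assumptions chosen for illustration. They are not taken from ClusterCockpit or from any specific system.

```python
def attainable_gflops(intensity: float, peak_gflops: float, peak_bw_gbs: float) -> float:
    """Roofline ceiling in GFLOP/s for a given operational intensity (FLOP/byte)."""
    return min(peak_gflops, intensity * peak_bw_gbs)


def classify(flop_rate_gflops: float, mem_bw_gbs: float,
             peak_gflops: float, peak_bw_gbs: float) -> str:
    """Label a job as memory-bound or compute-bound from its measured rates."""
    if mem_bw_gbs <= 0:
        raise ValueError("measured memory bandwidth must be positive")
    intensity = flop_rate_gflops / mem_bw_gbs  # FLOP per byte
    ridge = peak_gflops / peak_bw_gbs          # intensity where both limits meet
    return "memory-bound" if intensity < ridge else "compute-bound"


# Assumed node: 1000 GFLOP/s peak, 200 GB/s peak memory bandwidth.
PEAK_GFLOPS = 1000.0
PEAK_BW_GBS = 200.0

# Assumed job measurement: 100 GFLOP/s while moving 150 GB/s.
job_intensity = 100.0 / 150.0
ceiling = attainable_gflops(job_intensity, PEAK_GFLOPS, PEAK_BW_GBS)
bound = classify(100.0, 150.0, PEAK_GFLOPS, PEAK_BW_GBS)

print(f"intensity={job_intensity:.2f} FLOP/byte, ceiling={ceiling:.1f} GFLOP/s, {bound}")
```

In this assumed example the job sits left of the ridge point (5 FLOP/byte), so its performance is limited by memory bandwidth rather than by the floating-point units, which corresponds to the memory bandwidth bottleneck case listed above.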

Example Views

Job List View

Browse all of your jobs.

[Screenshot: ClusterCockpit job list]

Detailed Job View

See time-series plots for CPU load, memory usage, GPU utilization and other metrics.

[Screenshot: ClusterCockpit job view]

External Links