Metrics

1. Basics

Rarog Platform gathers metrics from all components and exposes them in OpenMetrics format, so you can use the monitoring and alerting tool of your choice.

By default, metrics are available in OpenMetrics format at the URL <base_url>/rest/telemetry/1.0/expose/openmetrics/. They are also exported as telemetry to the Rarog Team Graphite server.
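As an illustration of reading the endpoint described above, here is a minimal sketch using only the Python standard library. The helper names (metrics_url, fetch_metrics) and the example host are hypothetical; only the endpoint path comes from this page.

```python
from urllib.request import urlopen

# Endpoint path taken from this page; everything else is illustrative.
METRICS_PATH = "rest/telemetry/1.0/expose/openmetrics/"


def metrics_url(base_url: str) -> str:
    """Join the instance base URL with the OpenMetrics endpoint path."""
    return base_url.rstrip("/") + "/" + METRICS_PATH


def fetch_metrics(base_url: str, timeout: float = 10.0) -> str:
    """Download the metrics page and return it as text."""
    with urlopen(metrics_url(base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (hypothetical host, replace with your instance address):
#   text = fetch_metrics("https://rarog.example.com")
```

This works only while anonymous access is allowed; once a token is configured, the request also needs the Authorization header described in section 1.1.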

This behaviour can be disabled or adjusted (see section 1.2).

1.1. Securing metrics

By default, the metrics page allows anonymous access: any service or user can read it without authorization. Exposing metrics data may be undesirable, because it can be used to infer the system state, version or information about installed plugins.

Setting the system property rarog.metrics.open.metric.auth.token makes an authentication token required for the metrics endpoint. The system detects the token automatically and requires users and services to provide it in the request headers.

When authorization is enabled, every request must include the header Authorization: Code <authorization_token>. Requests without a valid header are rejected.
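The header format above can be sketched as follows. This only builds the request object; the function name, host and token value are hypothetical, while the header scheme ("Code") and the endpoint path come from this page.

```python
from urllib.request import Request


def authorized_metrics_request(base_url: str, token: str) -> Request:
    """Build a GET request for the metrics page with the token header attached."""
    url = base_url.rstrip("/") + "/rest/telemetry/1.0/expose/openmetrics/"
    # Header format documented on this page: Authorization: Code <authorization_token>
    return Request(url, headers={"Authorization": f"Code {token}"})


# Hypothetical host and token, for illustration only.
request = authorized_metrics_request("https://rarog.example.com", "example-token")
print(request.get_header("Authorization"))
```

Passing the returned object to urllib.request.urlopen would perform the actual call. The token must match the value of rarog.metrics.open.metric.auth.token configured on the instance.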

1.2. Disabling and configuring metrics

By default, all metrics are enabled, but they can be disabled and configured using the REST API. The full REST API documentation is available on the Rarog application instance at the URL <base_url>/docs/swagger.html.

Quick guide:

  • To disable all metrics, send {"value": false} to the endpoint <base_url>/rest/telemetry/1.0/metrics/control/enabled

  • To disable sending metrics to external systems, send {"value": false} to the endpoint <base_url>/rest/telemetry/1.0/metrics/control/snapshot/enabled

All of these endpoints are available only to admins, so you must first log in to the instance or use one of the available authentication methods.
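The quick guide above could be scripted roughly as shown here. This page does not state which HTTP method the control endpoints expect, so PUT is an assumption; check the Swagger documentation on your instance. The function name and host are hypothetical, and admin authentication is deliberately left out because it depends on the method you use.

```python
import json
from urllib.request import Request

# Endpoint paths taken from the quick guide on this page.
ALL_METRICS = "/rest/telemetry/1.0/metrics/control/enabled"
EXTERNAL_EXPORT = "/rest/telemetry/1.0/metrics/control/snapshot/enabled"


def control_request(base_url: str, path: str, enabled: bool, method: str = "PUT") -> Request:
    """Build a request that sends {"value": <enabled>} to a control endpoint.

    The default method is an assumption, not something this page confirms.
    """
    body = json.dumps({"value": enabled}).encode("utf-8")
    return Request(
        base_url.rstrip("/") + path,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host: disable export to external systems, keep metrics enabled.
request = control_request("https://rarog.example.com", EXTERNAL_EXPORT, False)
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

The request still has to carry admin credentials (for example a session cookie or another supported authentication method) before it is sent.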

2. Metrics reference

2.1. Meta-metrics - metrics about metrics

They can help to find performance issues related to the metrics themselves.

  • rarog_metrics_snapshot_generation - histogram - seconds

    Measures the time the system took to create a snapshot of the current metrics state.

  • rarog_metrics_snapshot_export - histogram - seconds

    Measures the time the system took to export a saved metrics snapshot to external metrics gathering systems (such as Graphite).

  • rarog_open_metrics_expose - histogram - seconds

    Measures how long it took to generate the metrics page in OpenMetrics format.

  • rarog_open_metrics_expose - histogram - kilobytes

    Measures the size of the generated metrics page in OpenMetrics format.

  • rarog_open_metrics_forbidden_attempts - counter

    Counts attempts to access the OpenMetrics page that failed due to invalid authorization. Usually this indicates that some system has outdated credentials, but it can also hint that automated scanners tried to get data about the system.

  • rarog_open_metrics_exposed - counter

    Counts successful accesses to the OpenMetrics page. It is most valuable when compared with rarog_open_metrics_forbidden_attempts and rarog_open_metrics_exceptions.

  • rarog_open_metrics_exceptions - counter

    Counts access attempts to the OpenMetrics page that ended with an internal error. Usually this indicates a problem inside the system. When it occurs, the best course of action is to investigate the logs.
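The comparison between the three access counters suggested above can be sketched like this. The sample text and its numbers are made up for illustration, the parser handles only unlabelled samples, and the _total suffix is an assumption based on the usual OpenMetrics counter naming, so the lookup accepts both forms.

```python
from typing import Dict

# Illustrative sample only; real values come from the metrics page.
SAMPLE = """\
# TYPE rarog_open_metrics_exposed counter
rarog_open_metrics_exposed_total 90
# TYPE rarog_open_metrics_forbidden_attempts counter
rarog_open_metrics_forbidden_attempts_total 8
# TYPE rarog_open_metrics_exceptions counter
rarog_open_metrics_exceptions_total 2
# EOF
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Parse unlabelled 'name value' samples, skipping comments and labelled lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#") or "{" in parts[0]:
            continue
        samples[parts[0]] = float(parts[1])
    return samples


def counter(samples: Dict[str, float], name: str) -> float:
    """Read a counter by name, with or without the _total suffix."""
    return samples.get(name + "_total", samples.get(name, 0.0))


def access_shares(samples: Dict[str, float]) -> Dict[str, float]:
    """Return the share of forbidden and failed accesses among all access attempts."""
    ok = counter(samples, "rarog_open_metrics_exposed")
    forbidden = counter(samples, "rarog_open_metrics_forbidden_attempts")
    errors = counter(samples, "rarog_open_metrics_exceptions")
    total = ok + forbidden + errors
    if total == 0:
        return {"forbidden": 0.0, "exceptions": 0.0}
    return {"forbidden": forbidden / total, "exceptions": errors / total}


print(access_shares(parse_samples(SAMPLE)))
```

A rising forbidden share with a stable exceptions share points at credentials or scanners rather than at an internal fault.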

2.2. Startup metrics

These metrics help to detect problems during startup of the Rarog application.

  • rarog_startup_memory_usage - gauge - megabytes

    Measures memory usage during startup steps.

  • rarog_startup_memory_percent - gauge - percent

    Measures the percentage of available memory in use. High values (>90%) may suggest that the application has not been assigned enough memory.

  • rarog_startup_phase_duration - gauge - milliseconds

    Measures the time the system took to complete a startup step. Useful for identifying performance problems during startup.

  • rarog_startup_total_duration - gauge - milliseconds

    Measures the cumulative time the system took to complete all steps from start up to the current step. Useful for identifying performance problems during startup.
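The 90% rule of thumb for rarog_startup_memory_percent can be turned into a simple check. This is a sketch under stated assumptions: this page does not say how startup steps are labelled, so the step names and the per-step mapping are hypothetical, and values are assumed to be on a 0-100 scale.

```python
from typing import Dict, List


def steps_over_threshold(percent_by_step: Dict[str, float], threshold: float = 90.0) -> List[str]:
    """Return the startup steps whose memory percent exceeds the threshold."""
    return [step for step, percent in percent_by_step.items() if percent > threshold]


# Made-up readings with hypothetical step names, for illustration only.
readings = {"plugins": 62.5, "database": 93.1, "web": 88.0}
print(steps_over_threshold(readings))
```

Any step returned here is a candidate for checking whether the application has been assigned enough memory.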

2.3. Background tasks

  • rarog_task_manager_active_task_count - gauge

    Measures the number of tasks that are currently executing.

  • rarog_task_manager_executed_task_count - counter

    Counts the tasks executed since system start.

  • rarog_task_manager_failed_task_count - counter

    Counts the tasks that ended with an exception since system start. A growing value may signal that the system is in an unstable state. Please refer to the logs for details.

  • rarog_task_manager_task_duration - histogram - seconds

    Measures the time a task took to finish execution. Many long-running tasks may signal that a misbehaving plugin is installed or that the system is overloaded.

  • rarog_task_manager_task_declared_execution_diff - histogram - seconds

    Measures how long tasks waited between their scheduled start and their actual start. Many high values mean that either the system is overloaded or the thread pool is too small. You can try to increase the maximum thread pool size or investigate trace logs for eu.rarogsoftware.rarog.platform.core.task.DefaultTaskManager and eu.rarogsoftware.rarog.platform.core.task.MonitoredScheduledThreadPoolExecutor.

  • rarog_task_manager_active_thread_count - counter

    Measures the number of threads created with TaskManager that are currently active.

  • rarog_task_manager_created_thread_count - counter

    Counts the threads created with TaskManager since system start.

  • rarog_task_manager_thread_life_duration - histogram - seconds

    Measures how long threads created with TaskManager were running.
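For the histogram metrics in this section, a quick first indicator is the mean observation, computed from the _sum and _count samples that OpenMetrics histograms normally expose. The sketch below assumes those samples are present and unlabelled; the numbers are made up and the helper name is hypothetical.

```python
from typing import Dict, Optional


def histogram_mean(samples: Dict[str, float], name: str) -> Optional[float]:
    """Return sum / count for a histogram, or None when there is nothing to average."""
    total = samples.get(name + "_sum")
    count = samples.get(name + "_count")
    if total is None or not count:
        return None
    return total / count


# Made-up values: 4 tasks waited 12 seconds in total before they started.
samples = {
    "rarog_task_manager_task_declared_execution_diff_sum": 12.0,
    "rarog_task_manager_task_declared_execution_diff_count": 4.0,
}
print(histogram_mean(samples, "rarog_task_manager_task_declared_execution_diff"))
```

The mean hides outliers, so a high value is a reason to look at the individual buckets, not a replacement for them.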