Metrics
1. Basics
Rarog Platform gathers metrics from all components and exposes them in OpenMetrics format, so you can use the monitoring and alerting tool of your choice.
By default, metrics are available in OpenMetrics format under the URL <base_url>/rest/telemetry/1.0/expose/openmetrics/ and are also exported as telemetry to the Rarog Team Graphite server.
This behaviour can be disabled or adjusted.
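As a rough illustration of consuming this endpoint, the Python sketch below builds the metrics URL and extracts numeric samples from an OpenMetrics response. The helper names (metrics_url, parse_samples, fetch_metrics) are made up for this example, and the parser is deliberately simplified: it ignores timestamps and exemplars and does not handle a } character inside a label value.

```python
from urllib.request import urlopen

# Path taken from the documentation above; the base URL is deployment-specific.
OPENMETRICS_PATH = "/rest/telemetry/1.0/expose/openmetrics/"


def metrics_url(base_url: str) -> str:
    """Join the instance base URL with the OpenMetrics path."""
    return base_url.rstrip("/") + OPENMETRICS_PATH


def parse_samples(text: str) -> list[tuple[str, float]]:
    """Return (series, value) pairs from OpenMetrics text (simplified parser)."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and metadata such as "# TYPE", "# HELP" and "# EOF".
        if not line or line.startswith("#"):
            continue
        space = line.find(" ")
        brace = line.find("{")
        if brace != -1 and (space == -1 or brace < space):
            # Series with labels: keep everything up to the closing brace.
            end = line.index("}", brace) + 1
        else:
            end = space
        if end <= 0:
            continue  # malformed line without a value
        fields = line[end:].split()
        if fields:
            samples.append((line[:end], float(fields[0])))
    return samples


def fetch_metrics(base_url: str) -> list[tuple[str, float]]:
    """Download and parse the metrics page (anonymous access assumed)."""
    with urlopen(metrics_url(base_url)) as response:
        return parse_samples(response.read().decode("utf-8"))
```

In practice a monitoring tool that understands OpenMetrics would scrape the URL directly; a hand-written parser like this is mainly useful for quick checks.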
1.1. Securing metrics
By default, the metrics page allows anonymous access, meaning any service or user can read it without authorization. Exposing metrics data may be undesirable, because it can be used to infer the system state, version, or information about installed plugins.
Setting the system property rarog.metrics.open.metric.auth.token makes the metrics endpoint require an authentication token. The system detects the token automatically and requires users and services to provide it in request headers.
When authorization is enabled, every request must include the following header, and requests without a valid header are rejected:
Authorization: Code <authorization_token>
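A minimal sketch of an authorized request, assuming the token property has already been set on the server. The function name authorized_metrics_request and the host and token values in the usage note are hypothetical; only the endpoint path and the header format come from the text above.

```python
from urllib.request import Request


def authorized_metrics_request(base_url: str, token: str) -> Request:
    """Build a GET request for the metrics page carrying the required header."""
    url = base_url.rstrip("/") + "/rest/telemetry/1.0/expose/openmetrics/"
    # Header format as documented: "Authorization: Code <authorization_token>".
    return Request(url, headers={"Authorization": f"Code {token}"})
```

The request can then be sent with urllib.request.urlopen, for example urlopen(authorized_metrics_request("https://rarog.example.com", "my-token")).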
1.2. Disabling and configuring metrics
By default, all metrics are enabled, but they can be disabled and configured using the REST API.
Documentation for the whole REST API is available at <base_url>/docs/swagger.html on the Rarog application instance.
Quick guide:
-
To disable all metrics, send
{"value": false}
to the endpoint <base_url>/rest/telemetry/1.0/metrics/control/enabled
-
To disable sending metrics to external systems, send
{"value": false}
to the endpoint <base_url>/rest/telemetry/1.0/metrics/control/snapshot/enabled
All endpoints are available only to admins, so you must first log in to the instance or use one of the available authentication methods.
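The control calls from the quick guide could be prepared as in the sketch below. The HTTP method is not stated here, so the default of PUT is purely an assumption to verify against the Swagger docs; the function name control_request and the "all"/"snapshot" keys are also made up for this example. Admin authentication is left out and must be added by the caller.

```python
import json
from urllib.request import Request

# Endpoint paths as listed in the quick guide.
CONTROL_PATHS = {
    "all": "/rest/telemetry/1.0/metrics/control/enabled",
    "snapshot": "/rest/telemetry/1.0/metrics/control/snapshot/enabled",
}


def control_request(base_url: str, target: str, enabled: bool,
                    method: str = "PUT") -> Request:
    """Build a request that switches metrics ("all") or export ("snapshot").

    The default method is an ASSUMPTION; check <base_url>/docs/swagger.html.
    """
    body = json.dumps({"value": enabled}).encode("utf-8")
    return Request(
        base_url.rstrip("/") + CONTROL_PATHS[target],
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )
```

For example, control_request("https://rarog.example.com", "snapshot", False) prepares a request that disables export to external systems while leaving the metrics page itself enabled.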
2. Metrics reference
2.1. Meta-metrics - metrics about metrics
They can help find performance issues related to the metrics themselves.
-
rarog_metrics_snapshot_generation - histogram - seconds
Measures how long the system took to create a snapshot of the current metrics state.
-
rarog_metrics_snapshot_export - histogram - seconds
Measures how long the system took to export a saved metrics snapshot to external metrics-gathering systems (like Graphite).
-
rarog_open_metrics_expose - histogram - seconds
Measures how long it took to generate the metrics page in OpenMetrics format.
-
rarog_open_metrics_expose - histogram - kilobytes
Measures the size of the generated metrics page in OpenMetrics format.
-
rarog_open_metrics_forbidden_attempts - counter
Counts attempts to access the OpenMetrics page that failed due to invalid authorization. This usually indicates that some system has outdated authorization, but it can also hint that automated scanners tried to get data about the system.
-
rarog_open_metrics_exposed - counter
Counts successful accesses to the OpenMetrics page. It is most valuable when compared with rarog_open_metrics_forbidden_attempts and rarog_open_metrics_exceptions.
-
rarog_open_metrics_exceptions - counter
Counts attempts to access the OpenMetrics page that ended with an internal error. This usually indicates a problem inside the system. When it occurs, the best course is to investigate the logs.
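One way to compare the three access counters is to turn them into shares of all access attempts, as sketched below. The function name access_outcome_shares is hypothetical, and the counter values are assumed to have been read from the metrics page beforehand.

```python
def access_outcome_shares(exposed: float, forbidden: float,
                          exceptions: float) -> dict[str, float]:
    """Return each outcome as a fraction of all metrics page access attempts."""
    total = exposed + forbidden + exceptions
    if total == 0:
        # No attempts recorded yet, so every share is zero.
        return {"exposed": 0.0, "forbidden": 0.0, "exceptions": 0.0}
    return {
        "exposed": exposed / total,
        "forbidden": forbidden / total,
        "exceptions": exceptions / total,
    }
```

A rising "forbidden" or "exceptions" share is easier to alert on than the raw counters, which only ever grow.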
2.2. Startup metrics
Useful metrics for detecting problems during startup of the Rarog application.
-
rarog_startup_memory_usage - gauge - megabytes
Measures memory usage during startup steps.
-
rarog_startup_memory_percent - gauge - percent
Measures the percentage of available memory in use. High values (>90%) may suggest that the application has not been assigned enough memory.
-
rarog_startup_phase_duration - gauge - milliseconds
Measures how long the system took to complete a startup step. Useful for identifying performance problems during startup.
-
rarog_startup_total_duration - gauge - milliseconds
Measures the cumulative time the system took to complete all steps from start up to the current step. Useful for identifying performance problems during startup.
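The 90% guideline for rarog_startup_memory_percent can be checked with a small helper like the one below. It is only a sketch: the function name, the step names in the usage note, and the idea that values arrive as a mapping keyed by startup step are all assumptions, since the labels attached to these gauges are not described here.

```python
def steps_over_memory_threshold(percent_by_step: dict[str, float],
                                threshold: float = 90.0) -> list[str]:
    """Return startup steps whose memory usage exceeds the threshold, in input order."""
    return [step for step, percent in percent_by_step.items()
            if percent > threshold]
```

For instance, steps_over_memory_threshold({"plugins": 95.5, "database": 42.0}) returns ["plugins"], which would point at the step to inspect before raising the memory limit.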
2.3. Background tasks
-
rarog_task_manager_active_task_count - gauge
Measures the number of tasks currently executing.
-
rarog_task_manager_executed_task_count - counter
Measures the number of tasks executed since system start.
-
rarog_task_manager_failed_task_count - counter
Measures the number of tasks that ended with an exception since system start. This may signal that the system is in an unstable state. Please refer to the logs for details.
-
rarog_task_manager_task_duration - histogram - seconds
Measures how long a task took to finish executing. Many long-running tasks may signal that a misbehaving plugin is installed or that the system is overloaded.
-
rarog_task_manager_task_declared_execution_diff - histogram - seconds
Measures how long tasks waited between their scheduled start and their actual start. Many high values mean that either the system is overloaded or the thread pool is too small. You can try changing the maximum thread pool size or investigate trace logs for eu.rarogsoftware.rarog.platform.core.task.DefaultTaskManager and eu.rarogsoftware.rarog.platform.core.task.MonitoredScheduledThreadPoolExecutor.
-
rarog_task_manager_active_thread_count - counter
Measures the number of threads created with TaskManager that are active at the current moment.
-
rarog_task_manager_created_thread_count - counter
Measures the number of threads created with TaskManager since system start.
-
rarog_task_manager_thread_life_duration - histogram - seconds
Measures how long threads created with TaskManager were running.
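To judge whether "a lot of high values" applies to a histogram such as rarog_task_manager_task_declared_execution_diff, the cumulative buckets can be reduced to the share of observations above a chosen bound. The sketch below relies on OpenMetrics histograms exposing cumulative bucket counts keyed by their upper bound (the le label), including a +Inf bucket. The function name and the bucket boundaries in the usage note are hypothetical, since the actual boundaries are not documented here.

```python
def share_above(cumulative_buckets: dict[float, float], bound: float) -> float:
    """Fraction of observations greater than `bound`.

    `cumulative_buckets` maps each bucket upper bound (le) to its cumulative
    count and must contain float("inf"). `bound` must be an existing bucket
    boundary, because a histogram cannot resolve values inside a bucket.
    """
    if bound not in cumulative_buckets:
        raise ValueError(f"{bound} is not a bucket boundary")
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 0.0
    return (total - cumulative_buckets[bound]) / total
```

With buckets {0.1: 50, 1.0: 90, 5.0: 98, float("inf"): 100}, share_above(buckets, 1.0) gives 0.1, meaning one task in ten waited more than a second past its scheduled start.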