Monitoring Gitaly and Gitaly Cluster
You can use the available logs and Prometheus metrics to monitor Gitaly and Gitaly Cluster (Praefect).
Metric definitions are available:
- Directly from the Prometheus /metrics endpoint configured for Gitaly.
- Using Grafana Explore on a Grafana instance configured against Prometheus.
Monitor Gitaly rate limiting
Gitaly can be configured to limit requests based on:
- Concurrency of requests.
- A rate limit.
Monitor Gitaly request limiting with the gitaly_requests_dropped_total Prometheus metric. This metric provides a total count of requests dropped due to request limiting. The reason label indicates why a request was dropped:
- rate, due to rate limiting.
- max_size, because the concurrency queue size was reached.
- max_time, because the request exceeded the maximum queue wait time as configured in Gitaly.
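As a starting point, the following query breaks the drop rate down by the reason label described above. The 5-minute window is an arbitrary choice for illustration; adjust it to your scrape interval and traffic.

```promql
# Per-second rate of dropped requests over the last 5 minutes, grouped by drop reason
sum(rate(gitaly_requests_dropped_total[5m])) by (reason)
```

A sustained non-zero rate for max_size or max_time suggests that requests are piling up in the concurrency queue, while rate points to the rate limiter.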
Monitor Gitaly concurrency limiting
You can observe specific behavior of concurrency-queued requests using the Gitaly logs and Prometheus:
- In the Gitaly logs, look for the string (or structured log field) acquire_ms. Messages that have this field are reporting about the concurrency limiter.
- In Prometheus, look for the following metrics:
  - gitaly_concurrency_limiting_in_progress indicates how many concurrent requests are being processed.
  - gitaly_concurrency_limiting_queued indicates how many requests for an RPC for a given repository are waiting due to the concurrency limit being reached.
  - gitaly_concurrency_limiting_acquiring_seconds indicates how long a request has to wait due to concurrency limits before being processed.
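The concurrency metrics above could be queried along these lines. This is a sketch, not a recommended dashboard: it assumes that gitaly_concurrency_limiting_acquiring_seconds is exported as a Prometheus histogram (with _bucket series), and the 5-minute window and 0.95 quantile are illustrative choices.

```promql
# Requests currently waiting in the concurrency queue, summed across all label values
sum(gitaly_concurrency_limiting_queued)

# Approximate 95th percentile of time spent waiting for a concurrency slot.
# Assumes the metric is a histogram that exposes _bucket series.
histogram_quantile(0.95, sum(rate(gitaly_concurrency_limiting_acquiring_seconds_bucket[5m])) by (le))
```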
Monitor Gitaly cgroups
You can observe the status of control groups (cgroups) using Prometheus:
- gitaly_cgroups_reclaim_attempts_total, a gauge for the total number of times there has been a memory reclaim attempt. This number resets each time a server is restarted.
- gitaly_cgroups_cpu_usage, a gauge that measures CPU usage per cgroup.
- gitaly_cgroup_procs_total, a gauge that measures the total number of processes Gitaly has spawned under the control of cgroups.
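One possible way to look at these cgroup metrics is sketched below. The 1-hour window and the choice of five series are assumptions for illustration, and the labels attached to these metrics are not described here, so the queries do not filter or group by any label.

```promql
# Memory reclaim attempts observed over the last hour.
# increase() treats the drop to zero after a restart as a reset.
increase(gitaly_cgroups_reclaim_attempts_total[1h])

# The five cgroup series currently reporting the highest CPU usage
topk(5, gitaly_cgroups_cpu_usage)
```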
pack-objects cache
The following pack-objects cache metrics are available:
- gitaly_pack_objects_cache_enabled, a gauge set to 1 when the cache is enabled. Available labels: dir and max_age.
- gitaly_pack_objects_cache_lookups_total, a counter for cache lookups. Available label: result.
- gitaly_pack_objects_generated_bytes_total, a counter for the number of bytes written into the cache.
- gitaly_pack_objects_served_bytes_total, a counter for the number of bytes read from the cache.
- gitaly_streamcache_filestore_disk_usage_bytes, a gauge for the total size of cache files. Available label: dir.
- gitaly_streamcache_index_entries, a gauge for the number of entries in the cache. Available label: dir.
Some of these metrics start with gitaly_streamcache because they are generated by the streamcache internal library package in Gitaly.
Example:
gitaly_pack_objects_cache_enabled{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache",max_age="300"} 1
gitaly_pack_objects_cache_lookups_total{result="hit"} 2
gitaly_pack_objects_cache_lookups_total{result="miss"} 1
gitaly_pack_objects_generated_bytes_total 2.618649e+07
gitaly_pack_objects_served_bytes_total 7.855947e+07
gitaly_streamcache_filestore_disk_usage_bytes{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 2.6200152e+07
gitaly_streamcache_filestore_removed_total{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
gitaly_streamcache_index_entries{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
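To judge whether the cache is paying off, you might derive ratios from these counters. The queries below are a sketch using only the metric names and the result label values shown in the example; the 5-minute window is an illustrative choice.

```promql
# Fraction of pack-objects cache lookups that were hits over the last 5 minutes
sum(rate(gitaly_pack_objects_cache_lookups_total{result="hit"}[5m]))
/
sum(rate(gitaly_pack_objects_cache_lookups_total[5m]))

# Bytes served from the cache per byte generated into it
sum(rate(gitaly_pack_objects_served_bytes_total[5m]))
/
sum(rate(gitaly_pack_objects_generated_bytes_total[5m]))
```

For reference, the lifetime counters in the example output work out to 2 hits out of 3 lookups, and 3 bytes served for every byte generated.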
Useful queries
The following are useful queries for monitoring Gitaly:
- Use the following Prometheus query to observe the type of connections Gitaly is serving in a production environment:

  sum(rate(gitaly_connections_total[5m])) by (type)

- Use the following Prometheus query to monitor the authentication behavior of your GitLab installation:

  sum(rate(gitaly_authentications_total[5m])) by (enforced, status)

  In a system where authentication is configured correctly and where you have live traffic, you see something like this:

  {enforced="true",status="ok"} 4424.985419441742

  There may also be other numbers with rate 0, but you only have to take note of the non-zero numbers.

  The only non-zero number should have enforced="true",status="ok". If you have other non-zero numbers, something is wrong in your configuration.

  The status="ok" number reflects your current request rate. In the example above, Gitaly is handling about 4000 requests per second.

- Use the following Prometheus query to observe the Git protocol versions being used in a production environment:

  sum(rate(gitaly_git_protocol_requests_total[1m])) by (grpc_method,git_protocol,grpc_service)
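Building on the authentication query, a variation like the following could surface only the unexpected combinations, that is, anything other than enforced="true",status="ok" with a non-zero rate. This is a sketch that relies only on the enforced and status labels shown above.

```promql
# Non-zero authentication rates where enforcement is not "true" ...
(sum(rate(gitaly_authentications_total{enforced!="true"}[5m])) by (enforced, status) > 0)
or
# ... or where the status is anything other than "ok"
(sum(rate(gitaly_authentications_total{status!="ok"}[5m])) by (enforced, status) > 0)
```

An empty result means that every request with a non-zero rate was authenticated with enforcement enabled.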
Monitor Gitaly Cluster
To monitor Gitaly Cluster (Praefect), you can use these Prometheus metrics. There are two separate metrics endpoints from which metrics can be scraped:
- The default /metrics endpoint.
- /db_metrics, which contains metrics that require database queries.
Default Prometheus /metrics endpoint
The following metrics are available from the /metrics endpoint:
- gitaly_praefect_read_distribution, a counter to track distribution of reads. It has two labels:
  - virtual_storage.
  - storage.

  They reflect configuration defined for this instance of Praefect.
- gitaly_praefect_replication_latency_bucket, a histogram measuring the amount of time it takes for replication to complete after the replication job starts. Available in GitLab 12.10 and later.
- gitaly_praefect_replication_delay_bucket, a histogram measuring how much time passes between when the replication job is created and when it starts. Available in GitLab 12.10 and later.
- gitaly_praefect_node_latency_bucket, a histogram measuring the latency in Gitaly returning health check information to Praefect. This indicates Praefect connection saturation. Available in GitLab 12.10 and later.
- gitaly_praefect_connections_total, the total number of connections to Praefect. Introduced in GitLab 14.7.
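The histograms and the read counter above lend themselves to queries such as the following. Treat these as a sketch: the 0.95 quantile and the 5-minute window are illustrative choices, not recommended thresholds.

```promql
# Approximate 95th percentile of replication job duration over the last 5 minutes
histogram_quantile(0.95, sum(rate(gitaly_praefect_replication_latency_bucket[5m])) by (le))

# Approximate 95th percentile of the delay before a replication job starts
histogram_quantile(0.95, sum(rate(gitaly_praefect_replication_delay_bucket[5m])) by (le))

# Read rate per Gitaly storage within each virtual storage
sum(rate(gitaly_praefect_read_distribution[5m])) by (virtual_storage, storage)
```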
To monitor strong consistency, you can use the following Prometheus metrics:
- gitaly_praefect_transactions_total, the number of transactions created and voted on.
- gitaly_praefect_subtransactions_per_transaction_total, the number of times nodes cast a vote for a single transaction. This can happen multiple times if multiple references are getting updated in a single transaction.
- gitaly_praefect_voters_per_transaction_total, the number of Gitaly nodes taking part in a transaction.
- gitaly_praefect_transactions_delay_seconds, the server-side delay introduced by waiting for the transaction to be committed.
- gitaly_hook_transaction_voting_delay_seconds, the client-side delay introduced by waiting for the transaction to be committed.
To monitor the number of repositories that have no healthy, up-to-date replicas:
gitaly_praefect_unavailable_repositories
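For alerting, an expression along these lines might be used. It is a sketch that assumes Prometheus scrapes the /db_metrics endpoint, where this metric is listed, and it sums across all label values rather than assuming any particular label.

```promql
# Returns a result only when at least one repository has no healthy, up-to-date replica
sum(gitaly_praefect_unavailable_repositories) > 0
```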
To monitor repository verification, use the following Prometheus metrics:
- gitaly_praefect_verification_queue_depth, the total number of replicas pending verification. This metric is scraped from the database and is only available when Prometheus is scraping the database metrics.
- gitaly_praefect_verification_jobs_dequeued_total, the number of verification jobs picked up by the worker.
- gitaly_praefect_verification_jobs_completed_total, the number of verification jobs completed by the worker. The result label indicates the end result of the jobs:
  - valid indicates the expected replica existed on the storage.
  - invalid indicates the replica expected to exist did not exist on the storage.
  - error indicates the job failed and has to be retried.
- gitaly_praefect_stale_verification_leases_released_total, the number of stale verification leases released.
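The verification metrics can be combined into queries like the ones sketched here. The 5-minute window is an illustrative choice, and the second query only returns data when Prometheus scrapes the database metrics.

```promql
# Completed verification jobs per second, split by outcome (valid, invalid, error)
sum(rate(gitaly_praefect_verification_jobs_completed_total[5m])) by (result)

# Replicas still pending verification
sum(gitaly_praefect_verification_queue_depth)
```

A rising rate for the invalid or error results is the signal to investigate first.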
You can also monitor the Praefect logs.
Database metrics /db_metrics endpoint
Introduced in GitLab 14.5.
The following metrics are available from the /db_metrics endpoint:
- gitaly_praefect_unavailable_repositories, the number of repositories that have no healthy, up-to-date replicas.
- gitaly_praefect_read_only_repositories, the number of repositories in read-only mode in a virtual storage. This metric is available for backwards compatibility reasons. gitaly_praefect_unavailable_repositories is more accurate.
- gitaly_praefect_replication_queue_depth, the number of jobs in the replication queue.
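A replication queue that never drains can indicate that replication is falling behind. The following sketch checks for that condition; the 15-minute window is an assumption chosen for illustration, and the query keeps whatever labels the metric carries.

```promql
# Series whose replication queue never dropped to zero during the last 15 minutes
min_over_time(gitaly_praefect_replication_queue_depth[15m]) > 0
```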