lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michał Politowski (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SOLR-13234) Prometheus Metric Exporter Not Threadsafe
Date Tue, 14 May 2019 10:18:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839293#comment-16839293
] 

Michał Politowski commented on SOLR-13234:
------------------------------------------

Wouldn't it be worth noting somewhere that this goes against the best practices for writing
exporters given in [Prometheus documentation|https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling]?

> Prometheus Metric Exporter Not Threadsafe
> -----------------------------------------
>
>                 Key: SOLR-13234
>                 URL: https://issues.apache.org/jira/browse/SOLR-13234
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: metrics
>    Affects Versions: 7.6, 8.0
>            Reporter: Danyal Prout
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>              Labels: metric-collector
>             Fix For: 7.7.2, 8.1, master (9.0)
>
>         Attachments: SOLR-13234-branch_7x.patch
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> The Solr Prometheus Exporter collects metrics when it receives a HTTP request from Prometheus.
Prometheus sends this request, on its [scrape interval|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config].
When the time taken to collect the Solr metrics is greater than the scrape interval of the
Prometheus server, this results in concurrent metric collection occurring in this [method|https://github.com/apache/lucene-solr/blob/master/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/collector/SolrCollector.java#L86].
This method doesn’t appear to be thread safe, for instance you could have concurrent modifications
of a [map|https://github.com/apache/lucene-solr/blob/master/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/collector/SolrCollector.java#L119].
After a while the Solr Exporter processes becomes nondeterministic, we've observed NPE and
loss of metrics.
> To address this, I'm proposing the following fixes:
> 1. Read/parse the configuration at startup and make it immutable. 
>  2. Collect metrics from Solr on an interval which is controlled by the Solr Exporter
and cache the metric samples to return during Prometheus scraping. Metric collection can be
expensive, for example executing arbitrary Solr searches, it's not ideal to allow for concurrent
metric collection and on an interval which is not defined by the Solr Exporter.
> There are also a few other performance improvements that we've made while fixing this,
for example using the ClusterStateProvider instead of sending multiple HTTP requests to each
Solr node to lookup all the cores.
> I'm currently finishing up these changes which I'll submit as a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message