lucene-solr-user mailing list archives

From Jeff Wartes <jwar...@whitepages.com>
Subject Effects of insert order on query performance
Date Thu, 11 Aug 2016 17:39:50 GMT

This isn’t really a question, although some validation would be nice. It’s more of a warning.

TL;DR: the insert order of documents in my collection appears to have had a huge effect
on my query speed.


I have a very large (sharded) SolrCloud 5.4 index. One aspect of this index is a multi-valued
field (“permissions”) that for 90% of docs contains one particular value (“A”) and
for 10% of docs contains another distinct value (“B”). It’s intended to represent something
like permissions, so more values are possible in the future, but none are present currently. In
fact, the addition of docs with value “B” to this index was very recent; previously all docs
had value “A”. All queries, in addition to various other Boolean-query type restrictions,
have a terms query on this field, like {!terms f=permissions v=A} or
{!terms f=permissions v=A,B}.

Last week, I tried to re-index the whole collection from scratch, using source data. Query
performance on the resulting re-index proved to be abysmal: I could get barely 10% of my previous
query throughput, and even that was at latencies that were orders of magnitude higher than
what I had in production.

I hooked up some CPU profiling to a server that had shards from both the old and new version
of the collection, and eventually it looked like the significant difference in processing
the two collections was coming from ConstantWeight.scorer().
Specifically, this line
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/solr/core/src/java/org/apache/solr/search/SolrConstantScoreQuery.java#L102
was far more expensive in my re-indexed collection. From there, the call chain goes through
an LRUQueryCache, down to a BulkScorer, and ends up with the extra work happening here:
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/lucene/core/src/java/org/apache/lucene/search/Weight.java#L169

I don’t pretend to understand all that code, but the difference in my re-index appears to
have something to do either with that cache, or with the aggregate docIdSets that need weights
generated simply being much bigger in my re-index.


But the queries didn’t change, and the data is basically the same. What else could have
changed?

The documents with the “B” distinct value were added recently to the high-performance
collection, but the A’s and the B’s were all mixed up in the source data dump I used to
re-index. On a hunch, I manually ordered the docs such that the A’s were all first and re-indexed
again, and performance is great!
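
As a rough illustration of that reordering step, here is a minimal Python sketch. It assumes the source dump can be loaded as a list of dicts with a multi-valued "permissions" field; the "id" field and the helper name order_for_indexing are hypothetical, and the actual indexing call is left out.

```python
def order_for_indexing(docs, dominant="A"):
    """Return docs with the dominant-only documents first.

    sorted() is stable, so the relative order inside each group is kept.
    The key is False for docs whose permissions are exactly {dominant},
    and False sorts before True.
    """
    return sorted(docs, key=lambda d: set(d["permissions"]) != {dominant})


# Hypothetical sample of a mixed source dump.
docs = [
    {"id": "1", "permissions": ["A"]},
    {"id": "2", "permissions": ["B"]},
    {"id": "3", "permissions": ["A"]},
    {"id": "4", "permissions": ["B"]},
    {"id": "5", "permissions": ["A"]},
]

ordered = order_for_indexing(docs)
print([d["id"] for d in ordered])  # ['1', '3', '5', '2', '4']
```

The ordered list would then be sent to the indexer in that sequence, so the earliest (and eventually largest) segments see only A's.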

Here’s my theory: Using TieredMergePolicy, the vast majority of the documents in an index
are contained in the largest segments. I’m guessing there’s an optimization somewhere
that says something like “This segment only has A’s”. By indexing all the A’s first,
those biggest segments only contain A’s, and only the smallest, newest segments are unable
to make use of that optimization.

Here’s the scary part: Although my re-index is now performing well, if this theory is right,
some random insert (or a deliberate optimize) at some random point in the future could cascade
a segment merge such that the largest segment(s) now contain both A’s and B’s, and performance
suddenly goes over a cliff. I have no way to prevent this possibility except to stop doing
inserts.
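
The theory and the merge risk can be made concrete with a toy model. This is only a sketch under loud assumptions: segments are modeled as fixed-size chunks in insert order, which is not how TieredMergePolicy actually builds segments, and the "pure A" measure stands in for whatever optimization (if any) really exists. The function names and sizes are made up for illustration.

```python
def make_docs(total=10_000, b_every=10):
    """Mixed dump: every b_every-th doc carries "B", the rest carry "A"."""
    return ["B" if i % b_every == 0 else "A" for i in range(total)]


def segments(docs, size=1_000):
    """Toy stand-in for index segments: fixed-size chunks in insert order."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def fraction_in_pure_a_segments(segs):
    """Share of all docs that live in segments containing only "A"."""
    total = sum(len(s) for s in segs)
    pure = sum(len(s) for s in segs if set(s) == {"A"})
    return pure / total


mixed = make_docs()          # A's and B's interleaved, 90% / 10%
a_first = sorted(mixed)      # "A" < "B", so all A's are inserted first

print(fraction_in_pure_a_segments(segments(mixed)))    # 0.0
print(fraction_in_pure_a_segments(segments(a_first)))  # 0.9
print(fraction_in_pure_a_segments([a_first]))          # 0.0 after merging into one segment
```

In this model, interleaved inserts leave no segment free of B's, A-first inserts keep 90% of docs in A-only segments, and a single merge down to one segment loses that property again.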

My current thinking is that I need to pull the terms-query part out of the query and do a
filter query for it instead. Probably as a post-filter, since I’ve had bad luck with very
large filter queries and the filter cache. I’d tested this originally (when I only had A’s),
but found the performance was a bit worse than just leaving it in the query. I’ll take
slightly worse but predictable over slightly better with a time bomb, though, if those are my choices.
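
A minimal sketch of what that change might look like at the request level, in Python. The main query string "name:smith" and the helper name build_params are hypothetical. The cache=false and cost local params are standard Solr filter-query options, but whether a cost of 100 or more actually makes this run as a post-filter depends on the query type implementing Solr's PostFilter interface, which is an assumption not verified here for {!terms}.

```python
from urllib.parse import urlencode


def build_params(main_query, permissions, cost=100):
    """Keep the scoring query in q and move the permissions restriction to fq.

    cache=false keeps the filter out of the filter cache; cost is a hint
    that orders non-cached filters (higher cost runs later).
    """
    fq = "{!terms f=permissions cache=false cost=%d v=%s}" % (
        cost,
        ",".join(permissions),
    )
    return {"q": main_query, "fq": fq}


# "name:smith" is a made-up stand-in for the other Boolean restrictions.
params = build_params("name:smith", ["A", "B"])
print(params["fq"])        # {!terms f=permissions cache=false cost=100 v=A,B}
print(urlencode(params))   # URL-encoded form for a /select request
```

Comparing throughput with and without this change on both the mixed and the A-first index would show whether the filter form really removes the dependence on insert order.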


If anyone has any comments refuting or supporting this theory, I’d certainly like to hear
them. This is the first time I’ve encountered anything about insert order mattering from a
performance perspective, and it raises a more general question about how to handle low-cardinality
fields.
