Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Fri, 25 Sep 2015 10:05:04 +0000 (UTC)
From: "Uwe Schindler (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <JIRA.12896310.1443150920000.71999.1443175504722@Atlassian.JIRA>
In-Reply-To: <JIRA.12896310.1443150920000@Atlassian.JIRA>
References: <JIRA.12896310.1443150920000@Atlassian.JIRA>
 <JIRA.12896310.1443150920290@arcas>
Subject: [jira] [Comment Edited] (SOLR-8096) Major faceting performance
 regressions
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/SOLR-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907870#comment-14907870 ] 

Uwe Schindler edited comment on SOLR-8096 at 9/25/15 10:04 AM:
---------------------------------------------------------------

bq. Use of the highly optimized faceting that Solr had for multi-valued fields over relatively static indexes was secretly removed as part of LUCENE-5666, causing severe performance regressions.

Hi, the removal was not "secret". The removal of FieldCache from Lucene (and its replacement by UninvertingReader) was discussed on the issue tracker, although interest from Solr people was small. I think this is the main issue here. Sometimes it would be good to have Solr committers taking part in discussions on Lucene issues. If you want to make Solr better, you should also help in making Lucene better!

The old field cache was also put into a separate module (with the new DocValues-emulating API), because we (Lucene committers) knew that Solr still uses it. Sure, we could have used UninvertingReader on top of SlowCompositeReaderWrapper, but this would have brought other slowness! So the committers decided to step forward and remove the top-level faceting (which was long overdue).

It was announced in several talks about Lucene 5 that FieldCache was removed and that all faceting in Solr was implicitly changed to use only per-segment field caches (e.g., see my talks at FOSDEM 2015, JAX 2015, or Berlin Buzzwords, around one of the last slides). Maybe a changes entry about this should also have been added to the Solr CHANGES.txt, but the CHANGES.txt entry for this issue is quoted below, and its first line mentions that faceting in Solr is involved. Any Solr committer could have looked into the code and brought up complaints about those changes in the issue tracker, even after the commit was made:

{quote}
* LUCENE-5666: Change uninverted access (sorting, faceting, grouping, etc)
  to use the DocValues API instead of FieldCache. For FieldCache functionality,
  use UninvertingReader in lucene/misc (or implement your own FilterReader).
  UninvertingReader is more efficient: supports multi-valued numeric fields,
  detects when a multi-valued field is single-valued, reuses caches
  of compatible types (e.g. SORTED also supports BINARY and SORTED_SET access
  without insanity).  "Insanity" is no longer possible unless you explicitly want it. 
  Rename FieldCache* and DocTermOrds* classes in the search package to DocValues*. 
  Move SortedSetSortField to core and add SortedSetFieldSource to queries/, which
  takes the same selectors. Add helper methods to DocValues.java that are better 
  suited for search code (never return null, etc).  (Mike McCandless, Robert Muir)
{quote}

So everybody was informed.

bq. The people who did this are elasticsearch employees. That is one way to deal with Solr's faster faceting!

This is speculation and really bad behaviour on an open source issue tracker. We should discuss technical matters here, not make assumptions about what people intend to do. This statement was posted by a person ([~mmurphy3141]) whom I have never met in person and who has very seldom taken part in Lucene/Solr discussions at all. So I don't think we should count on that. It is also bad behaviour to accuse committers of sabotage on Twitter: https://twitter.com/mmurphy3141/status/647254551356162048; please don't do this. I would ask that this tweet be removed, thanks.

I was informed about the changes mentioned here and I strongly agree with the committers behind LUCENE-5666. I was always in favour of removing those top-level faceting algorithms, so they still have my strong +1. Among my Solr customers I have seen nobody who complained about slow top-level faceting recently (because I told them a long time ago to stop using those outdated top-level algorithms if they have dynamic indexes). Of course I don't know about people using static indexes.

The right thing to do for Solr people would be to remove that top-level stuff completely. It no longer fits the new reader structure (composite and atomic/leaf readers) of Lucene 3 (with API cleanups in Lucene 4 to better reflect the new structure). Lucene 3 has already been retired for several years! So there has been a long time to move Solr's faceting away from top-level. People with static indexes can still force merge their index and will have the same performance with the new algorithms.
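
To illustrate the per-segment point, here is a minimal sketch in plain Java. It is a toy model and not Lucene or Solr code: the class and method names are made up for illustration, a "segment" is just a list of documents, and a document is the list of values it holds in one multi-valued field. It counts facet values inside each segment and merges the results by value, then shows that after everything is merged into a single segment the counts are unchanged and only one segment is left to process. It says nothing about real timings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only (hypothetical names, no Lucene/Solr API involved).
final class SegmentFacetSketch {

    // Count value occurrences inside one segment
    // (assumes a document lists each value at most once).
    static Map<String, Integer> countSegment(List<List<String>> segment) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : segment) {
            for (String value : doc) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Per-segment faceting: count within every segment, then merge by value.
    static Map<String, Integer> facet(List<List<List<String>>> segments) {
        Map<String, Integer> merged = new TreeMap<>();
        for (List<List<String>> segment : segments) {
            countSegment(segment).forEach((value, n) -> merged.merge(value, n, Integer::sum));
        }
        return merged;
    }

    // Stand-in for a force merge down to one segment: concatenate all documents.
    static List<List<List<String>>> forceMergeToOne(List<List<List<String>>> segments) {
        List<List<String>> single = new ArrayList<>();
        segments.forEach(single::addAll);
        return Collections.singletonList(single);
    }

    public static void main(String[] args) {
        List<List<List<String>>> segments = Arrays.asList(
            Arrays.asList(Arrays.asList("red", "blue"), Arrays.asList("red")),
            Arrays.asList(Arrays.asList("blue"), Arrays.asList("green", "red")));

        System.out.println("two segments: " + facet(segments));
        System.out.println("one segment:  " + facet(forceMergeToOne(segments)));
    }
}
```

Both calls produce the same counts; the only difference is how many segment-local results have to be combined, which is the part a force merge on a static index takes away.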

Please keep in mind that it took about half a year until the first person noticed a problem like this, which makes me think that only a few people are using those mostly-static indexes.

*We should work on this issue to fix the problem, not to accuse people, thanks!*



> Major faceting performance regressions
> --------------------------------------
>
>                 Key: SOLR-8096
>                 URL: https://issues.apache.org/jira/browse/SOLR-8096
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.0, 5.1, 5.2, 5.3, Trunk
>            Reporter: Yonik Seeley
>            Priority: Critical
>
> Use of the highly optimized faceting that Solr had for multi-valued fields over relatively static indexes was *secretly removed* as part of LUCENE-5666, causing severe performance regressions.
> Here are some quick benchmarks to gauge the damage, on a 5M document index, with each field having between 0 and 5 values per document.  *Higher numbers represent worse 5x performance*.
> Solr 5.4_dev faceting time as a percentage of Solr 4.10.3 faceting time, by percent of index being faceted:
> ||num_unique_values|| 10% || 50% || 90% ||
> | 10 | 351.17% | 1587.08% | 3057.28% |
> | 100 | 158.10% | 203.61% | 1421.93% |
> | 1000 | 143.78% | 168.01% | 1325.87% |
> | 10000 | 137.98% | 175.31% | 1233.97% |
> | 100000 | 142.98% | 159.42% | 1252.45% |
> | 1000000 | 255.15% | 165.17% | 1236.75% |
> For example, a field with 1000 unique values in the whole index, faceting with 5x took 143% of the 4x time, when ~10% of the docs in the index were faceted.
> One user who brought the performance problem to our attention: http://markmail.org/message/ekmqh4ocbkwxv3we
> "faceting is unusable slow since upgrade to 5.3.0" (from 4.10.3)
> The disabling of the UnInvertedField algorithm was previously discovered in SOLR-7190, but we didn't know just how bad the problem was at that time.

