lucene-dev mailing list archives

From "Neil Ireson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-6803) Pivot Performance
Date Tue, 12 May 2015 11:30:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539672#comment-14539672 ]

Neil Ireson commented on SOLR-6803:
-----------------------------------

I also made the naive change of removing the offending line from the code, replacing

{code}
        DocSet subset = getSubset(docs, sfield, fieldValue);
{code}
with
{code}
        DocSet subset = null;
        if (subField != null || ((isShard || 0 < pivotCount) && !statsFields.isEmpty())) {
          subset = getSubset(docs, sfield, fieldValue);
        }
{code}
This shows that, with the change, the pivot is the fastest of the three approaches at every size tested.

| Values     |  Combined |     Facet |     Pivot |
| 100        |       202 |       133 |        67 |
| 1000       |       215 |       183 |        73 |
| 10000      |       255 |       392 |       145 |
| 100000     |       464 |      1301 |       395 |
| 500000     |      1307 |      4458 |      1179 |
| 1000000    |      2471 |      7783 |      2148 |

Note that with this change the code passed all the compile tests, so it's still not clear
to me why getSubset has to be called every time.
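
To make the effect of the guard concrete, here is a minimal, self-contained sketch of the condition in isolation. It is not Solr code: `needsSubset`, `getSubsetStandIn` and the call counter are hypothetical stand-ins, and the parameters only mirror the names used in the snippet above (`subField`, `isShard`, `pivotCount`, `statsFields`). It assumes the subset is only consumed by a sub-pivot or by stats, which is exactly the assumption the passing tests seem to support but do not prove.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SubsetGuardSketch {
    /** Counts how often the (hypothetical) expensive subset computation runs. */
    static int subsetCalls = 0;

    /** Same boolean condition as the guarded snippet; all parameters are stand-ins. */
    static boolean needsSubset(String subField, boolean isShard, int pivotCount,
                               List<String> statsFields) {
        return subField != null || ((isShard || 0 < pivotCount) && !statsFields.isEmpty());
    }

    /** Stand-in for getSubset(docs, sfield, fieldValue); it only records the call. */
    static Object getSubsetStandIn() {
        subsetCalls++;
        return new Object();
    }

    public static void main(String[] args) {
        List<String> noStats = Collections.emptyList();
        List<String> someStats = Arrays.asList("price"); // hypothetical stats field

        // Leaf pivot level, no stats requested: the guard skips every subset computation.
        for (int i = 0; i < 1000; i++) {
            Object subset = needsSubset(null, false, 1, noStats) ? getSubsetStandIn() : null;
        }
        System.out.println("leaf level, no stats: " + subsetCalls + " subset calls");

        // A sub-pivot field exists: the subset is computed for every value, as before.
        subsetCalls = 0;
        for (int i = 0; i < 1000; i++) {
            Object subset = needsSubset("time", false, 1, noStats) ? getSubsetStandIn() : null;
        }
        System.out.println("with sub-pivot field: " + subsetCalls + " subset calls");

        // Stats requested at a leaf with a positive count: the subset is also needed.
        System.out.println("leaf with stats needs subset: "
                + needsSubset(null, false, 1, someStats));
    }
}
```

In this sketch the saving comes entirely from the leaf level, where there is one value per term/time combination, which would be consistent with the timings above improving most at the larger value counts.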


> Pivot Performance
> -----------------
>
>                 Key: SOLR-6803
>                 URL: https://issues.apache.org/jira/browse/SOLR-6803
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.1
>            Reporter: Neil Ireson
>            Priority: Minor
>         Attachments: PivotPerformanceTest.java
>
>
> I found that my pivot search for terms per day was taking an age, so I knocked up a quick test, using a collection of 1 million documents with a different number of random terms and times, to compare different ways of getting the counts.
> 1) Combined = combining the term and time in a single field.
> 2) Facet = for each term, set the query to the term and then get the time facet.
> 3) Pivot = use the term/time pivot facet.
> The following two tables present the results for version 4.9.1 vs 4.10.1, as an average of five runs.
> 4.9.1 (Processing time in ms)
> |Values (#)   |  Combined (ms)|     Facet (ms)|     Pivot (ms)|
> |100       |        22|        21|        52|
> |1000      |       178|        57|       115|
> |10000     |      1363|       211|       310|
> |100000    |      2592|      1009|       978|
> |500000    |      3125|      3753|      2476|
> |1000000   |      3957|      6789|      3725|
> 4.10.1 (Processing time in ms)
> |Values (#)   |  Combined (ms)|     Facet (ms)|     Pivot (ms)|
> |100       |        21|        21|        75|
> |1000      |       188|        60|       265|
> |10000     |      1438|       215|      1826|
> |100000    |      2768|      1073|     16594|
> |500000    |      3266|      3686|     99682|
> |1000000   |      4080|      6777|    208873|
> The results show that, as the number of pivot values increases (i.e. number of terms * number of times), pivot performance in 4.10.1 gets progressively worse.
> I tried to look at the code, but there were a lot of changes in pivoting between 4.9 and 4.10, so it is not clear to me what has caused the performance issues. However, the results seem to indicate that if the pivot were simply a combined facet search, it could potentially produce better and more robust performance.
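
The idea that a pivot could be served by a combined facet can be illustrated with a small in-memory sketch. This is not how Solr implements either feature: the class and method names are hypothetical, documents are reduced to (term, day) pairs, and the `|` separator assumes terms never contain that character. It only shows that one flat count over a combined key carries the same information as the nested term/time pivot.

```java
import java.util.Map;
import java.util.TreeMap;

class CombinedVsPivotSketch {
    /** Pivot-style counts: term -> (day -> count), built as a nested map. */
    static Map<String, Map<String, Integer>> pivotCounts(String[][] docs) {
        Map<String, Map<String, Integer>> out = new TreeMap<>();
        for (String[] d : docs) {
            out.computeIfAbsent(d[0], k -> new TreeMap<>()).merge(d[1], 1, Integer::sum);
        }
        return out;
    }

    /** Combined-style counts: a single flat facet over a "term|day" key. */
    static Map<String, Integer> combinedCounts(String[][] docs) {
        Map<String, Integer> out = new TreeMap<>();
        for (String[] d : docs) {
            out.merge(d[0] + "|" + d[1], 1, Integer::sum);
        }
        return out;
    }

    /** Regroups the flat combined counts into the nested pivot shape. */
    static Map<String, Map<String, Integer>> regroup(Map<String, Integer> combined) {
        Map<String, Map<String, Integer>> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : combined.entrySet()) {
            int sep = e.getKey().indexOf('|'); // assumes the term itself has no '|'
            String term = e.getKey().substring(0, sep);
            String day = e.getKey().substring(sep + 1);
            out.computeIfAbsent(term, k -> new TreeMap<>()).put(day, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample documents, each a (term, day) pair.
        String[][] docs = {
            {"solr", "2015-05-11"}, {"solr", "2015-05-11"},
            {"solr", "2015-05-12"}, {"lucene", "2015-05-12"}
        };
        // Both lines print the same nested map.
        System.out.println(pivotCounts(docs));
        System.out.println(regroup(combinedCounts(docs)));
    }
}
```

The regrouping step is a single pass over the combined counts, so in this toy model its cost grows with the number of distinct term/time combinations rather than with one subset computation per value.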


