lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4795) Add FacetsCollector based on SortedSetDocValues
Date Mon, 25 Feb 2013 15:02:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585903#comment-13585903 ]

Michael McCandless commented on LUCENE-4795:
--------------------------------------------

bq. If you want to integrate that with the current classes, all you need to do is to implement
a partial TaxonomyReader, which resolves ordinals to CPs using the global ord map? Or actually
make that TR the entity that's responsible for managing the global ordinal map, so that
TR.doOpenIfChanged opens the new segments and updates the global map?

That sounds great!

bq. Since this taxonomy, at least currently, doesn't support hierarchical facets, you'll need
to hack something as a ParallelTaxoArray, but that should be easy .. I think.

OK.

I think it could be hierarchical w/o so much work, ie on reopen as it
walks the terms it should be able to easily build up the parent/child
arrays since the terms are in sorted order.  Hmm, except, with SSDV
you cannot have a term/ord that had no docs indexed.  So the
"ancestor" ords would not exist... hmm.  Better start
non-hierarchical.
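
To illustrate the missing-ancestor problem, here is a minimal sketch in plain
Java (no Lucene classes). It assumes facet paths are stored as '/'-delimited
strings in a sorted array whose index stands in for the ord; the class name,
the delimiter and the two sentinel values are hypothetical, chosen only for
this illustration. Parents are found by binary search over the sorted terms.

```java
import java.util.Arrays;

/** Toy model: derive parent ords from a sorted array of '/'-delimited paths. */
final class SortedPathParents {
    static final int NO_PARENT = -1;        // top-level path, has no parent
    static final int MISSING_ANCESTOR = -2; // parent path was never indexed, so it has no ord

    /** Returns, for each ord, the ord of its parent path or one of the sentinels above. */
    static int[] parentOrds(String[] sortedPaths) {
        int[] parents = new int[sortedPaths.length];
        for (int ord = 0; ord < sortedPaths.length; ord++) {
            String path = sortedPaths[ord];
            int slash = path.lastIndexOf('/');
            if (slash < 0) {
                parents[ord] = NO_PARENT;
                continue;
            }
            // The array is sorted, so the parent (if present) can be located by binary search.
            int parentOrd = Arrays.binarySearch(sortedPaths, path.substring(0, slash));
            parents[ord] = parentOrd >= 0 ? parentOrd : MISSING_ANCESTOR;
        }
        return parents;
    }
}
```

With the terms "Author", "Author/Bob", "Author/Lisa" and "Date/2013", the two
Author children resolve to ord 0, while "Date/2013" has no parent ord because
no doc was indexed with the bare "Date" value.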

I guess if we are non-hierarchical then we don't really need to
integrate at indexing time?  Ie, app can just add the facet values
using SortedSetDVF.

bq. Is the only benefit in this approach that you don't need to manage a sidecar taxonomy
index?

I think so?

                
> Add FacetsCollector based on SortedSetDocValues
> -----------------------------------------------
>
>                 Key: LUCENE-4795
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4795
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>         Attachments: LUCENE-4795.patch, LUCENE-4795.patch, pleaseBenchmarkMe.patch
>
>
> Recently (LUCENE-4765) we added multi-valued DocValues field
> (SortedSetDocValuesField), and this can be used for faceting in Solr
> (SOLR-4490).  I think we should also add support in the facet module?
> It'd be an option with different tradeoffs.  Eg, it wouldn't require
> the taxonomy index, since the main index handles label/ord resolving.
> There are at least two possible approaches:
>   * On every reopen, build the seg -> global ord map, and then on
>     every collect, get the seg ord, map it to the global ord space,
>     and increment counts.  This adds cost during reopen in proportion
>     to number of unique terms ...
>   * On every collect, increment counts based on the seg ords, and then
>     do a "merge" in the end just like distributed faceting does.
> The first approach is much easier so I built a quick prototype using
> that.  The prototype does the counting, but it does NOT do the top K
> facets gathering in the end, and it doesn't "know" parent/child ord
> relationships, so there's tons more to do before this is real.  I also
> was unsure how to properly integrate it since the existing classes
> seem to expect that you use a taxonomy index to resolve ords.
> I ran a quick performance test.  base = trunk except I disabled the
> "compute top-K" in FacetsAccumulator to make the comparison fair; comp
> = using the prototype collector in the patch:
> {noformat}
>                     Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
>                OrHighLow       18.79      (2.5%)       14.36      (3.3%)  -23.6% ( -28% -  -18%)
>                 HighTerm       21.58      (2.4%)       16.53      (3.7%)  -23.4% ( -28% -  -17%)
>                OrHighMed       18.20      (2.5%)       13.99      (3.3%)  -23.2% ( -28% -  -17%)
>                  Prefix3       14.37      (1.5%)       11.62      (3.5%)  -19.1% ( -23% -  -14%)
>                  LowTerm      130.80      (1.6%)      106.95      (2.4%)  -18.2% ( -21% -  -14%)
>               OrHighHigh        9.60      (2.6%)        7.88      (3.5%)  -17.9% ( -23% -  -12%)
>              AndHighHigh       24.61      (0.7%)       20.74      (1.9%)  -15.7% ( -18% -  -13%)
>                   Fuzzy1       49.40      (2.5%)       43.48      (1.9%)  -12.0% ( -15% -   -7%)
>          MedSloppyPhrase       27.06      (1.6%)       23.95      (2.3%)  -11.5% ( -15% -   -7%)
>                  MedTerm       51.43      (2.0%)       46.21      (2.7%)  -10.2% ( -14% -   -5%)
>                   IntNRQ        4.02      (1.6%)        3.63      (4.0%)   -9.7% ( -15% -   -4%)
>                 Wildcard       29.14      (1.5%)       26.46      (2.5%)   -9.2% ( -13% -   -5%)
>         HighSloppyPhrase        0.92      (4.5%)        0.87      (5.8%)   -5.4% ( -15% -    5%)
>              MedSpanNear       29.51      (2.5%)       27.94      (2.2%)   -5.3% (  -9% -    0%)
>             HighSpanNear        3.55      (2.4%)        3.38      (2.0%)   -4.9% (  -9% -    0%)
>               AndHighMed      108.34      (0.9%)      104.55      (1.1%)   -3.5% (  -5% -   -1%)
>          LowSloppyPhrase       20.50      (2.0%)       20.09      (4.2%)   -2.0% (  -8% -    4%)
>                LowPhrase       21.60      (6.0%)       21.26      (5.1%)   -1.6% ( -11% -   10%)
>                   Fuzzy2       53.16      (3.9%)       52.40      (2.7%)   -1.4% (  -7% -    5%)
>              LowSpanNear        8.42      (3.2%)        8.45      (3.0%)    0.3% (  -5% -    6%)
>                  Respell       45.17      (4.3%)       45.38      (4.4%)    0.5% (  -7% -    9%)
>                MedPhrase      113.93      (5.8%)      115.02      (4.9%)    1.0% (  -9% -   12%)
>               AndHighLow      596.42      (2.5%)      617.12      (2.8%)    3.5% (  -1% -    8%)
>               HighPhrase       17.30     (10.5%)       18.36      (9.1%)    6.2% ( -12% -   28%)
> {noformat}
> I'm impressed that this approach is only ~24% slower in the worst
> case!  I think this means it's a good option to make available?  Yes
> it has downsides (NRT reopen more costly, small added RAM usage,
> slightly slower faceting), but it's also simpler (no taxo index to
> manage).
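
The two approaches listed in the issue description can be compared with a toy
model in plain Java. This is only a sketch under stated assumptions, not the
code in the attached patch: each "segment" is a sorted array of unique terms
(the array index stands in for the segment ord), each doc is an array of
segment ords, and all class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of counting facets by ordinals across segments (no Lucene classes). */
final class GlobalOrdSketch {

    /** Sorted union of all segment terms; the index in the result is the global ord. */
    static String[] globalTerms(List<String[]> segTerms) {
        TreeSet<String> union = new TreeSet<>();
        for (String[] terms : segTerms) {
            union.addAll(Arrays.asList(terms));
        }
        return union.toArray(new String[0]);
    }

    /** Built once per reopen: maps[seg][segOrd] is the global ord. Cost grows with unique terms. */
    static int[][] buildOrdMaps(List<String[]> segTerms, String[] global) {
        int[][] maps = new int[segTerms.size()][];
        for (int seg = 0; seg < segTerms.size(); seg++) {
            String[] terms = segTerms.get(seg);
            maps[seg] = new int[terms.length];
            for (int ord = 0; ord < terms.length; ord++) {
                maps[seg][ord] = Arrays.binarySearch(global, terms[ord]);
            }
        }
        return maps;
    }

    /** First approach: on every collect, map the seg ord to the global ord and increment. */
    static int[] countWithGlobalMap(int[][][] docs, int[][] maps, int globalSize) {
        int[] counts = new int[globalSize];
        for (int seg = 0; seg < docs.length; seg++) {
            for (int[] docOrds : docs[seg]) {
                for (int segOrd : docOrds) {
                    counts[maps[seg][segOrd]]++;
                }
            }
        }
        return counts;
    }

    /** Second approach: count in seg ord space, then merge per-segment counts by term at the end. */
    static int[] countPerSegmentThenMerge(int[][][] docs, List<String[]> segTerms, String[] global) {
        int[] counts = new int[global.length];
        for (int seg = 0; seg < docs.length; seg++) {
            String[] terms = segTerms.get(seg);
            int[] segCounts = new int[terms.length];
            for (int[] docOrds : docs[seg]) {
                for (int segOrd : docOrds) {
                    segCounts[segOrd]++;
                }
            }
            // Merge step: resolve each seg ord to its term once per segment, not once per doc.
            for (int ord = 0; ord < terms.length; ord++) {
                counts[Arrays.binarySearch(global, terms[ord])] += segCounts[ord];
            }
        }
        return counts;
    }
}
```

Both methods return the same counts; they differ in where the cost lands. The
first pays for the map at reopen and one extra lookup per collected ord, while
the second defers all term resolution to the final merge.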
