lucene-dev mailing list archives

From "Toke Eskildsen (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (LUCENE-3079) Faceting module
Date Wed, 29 Jun 2011 11:03:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057148#comment-13057148 ]

Toke Eskildsen edited comment on LUCENE-3079 at 6/29/11 11:02 AM:
------------------------------------------------------------------

I re-ran the edge case (5M documents, 6.8 paths/doc, max depth 6, 22M unique paths), drilling
down to level 6 (d6) and to level 1 (d1). A test with the path 'root/a/b' as the starting point
for the facet request was added. I also checked that indexing does indeed run with 48 MB for
LUCENE-2369, but again, this is just plain Lucene indexing.

||                             || LUCENE-3079 (d6) || LUCENE-3079 (d1) || LUCENE-3079 (d1+'deep/a/b') || LUCENE-2369 (d6) ||
| Index build time             | 752 s   | 771 s   | -       | 347 s      |
| Memory required for indexing | 2500 MB | 2500 MB | 2500 MB | 48 MB      |
| First facet request          | 3840 ms | 1929 ms | 1963 ms | 147,000 ms |
| Best of 5 requests           | 2688 ms | 1172 ms | 1246 ms | 2673 ms    |
| Memory usage after faceting  | 435 MB  | 435 MB  | 435 MB  | 294 MB     |

Going to depth 1 helped a lot. I would have expected that requesting from 'deep/a/b' would be
even faster, but I guess I'll have to dig into the code to understand why it was not.
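
For reference, the drill-down variants above correspond to requests along these lines. This
is a hedged sketch against the facet module API as I read it from the patch; the exact class
names, packages, and the setDepth() call are assumptions on my part, not verified against
the current patch:

{code:java}
import java.util.List;

import org.apache.lucene.facet.search.FacetsCollector;
import org.apache.lucene.facet.search.params.CountFacetRequest;
import org.apache.lucene.facet.search.params.FacetSearchParams;
import org.apache.lucene.facet.search.results.FacetResult;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DrillDepthSketch {
  // depth = 1 reproduces the d1 runs, depth = 6 the d6 runs.
  static List<FacetResult> facet(IndexSearcher searcher, Query query,
      IndexReader indexReader, TaxonomyReader taxoReader, int depth)
      throws Exception {
    CountFacetRequest request =
        new CountFacetRequest(new CategoryPath("root"), 10);
    request.setDepth(depth); // how far below the root to aggregate and return
    // The d1+'deep/a/b' variant would start from a deeper node instead:
    // new CountFacetRequest(new CategoryPath("deep/a/b", '/'), 10)
    FacetSearchParams params = new FacetSearchParams();
    params.addFacetRequest(request);
    FacetsCollector collector =
        new FacetsCollector(params, indexReader, taxoReader);
    searcher.search(query, collector);
    return collector.getFacetResults();
  }
}
{code}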

bq. It actually speeds up counting overall. If you think about it, when we encounter category
ordinals, we just increment the count by 1 in the respective location in the count array.
No need to ask whether this is an ordinal the user asked to count at all. Later, when we compute
the top-K, we know which root ordinal the user requested to count, and its children, so it's
just a matter of putting everything into a heap and returning the top-K.

LUCENE-2369 uses exactly the same counting strategy (brute counting of everything). Not having
explicit duplicates speeds this phase up and saves memory, but the numbers for d6 and d1 very
clearly show that it is faster overall to skip the drill-down in the extraction phase. At
least for this test (and then we're back to creating a proper test suite).
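
To make the comparison concrete, here is a minimal sketch of that shared strategy in plain
Java: count every encountered ordinal blindly, then filter only at top-K extraction. It
assumes a first-child/next-sibling encoding of the taxonomy; all class and method names are
illustrative, not the actual internals of either patch:

{code:java}
import java.util.PriorityQueue;

final class BruteCountSketch {

  // Counting phase: one increment per encountered ordinal, no filtering.
  static void count(int[] counts, int[] ordinalsForDoc) {
    for (int ord : ordinalsForDoc) {
      counts[ord]++;
    }
  }

  // Extraction phase: walk only the children of the requested root ordinal
  // and keep the k largest counts in a min-heap.
  static int[] topK(int[] counts, int rootOrd, int[] firstChild,
      int[] nextSibling, int k) {
    PriorityQueue<int[]> heap =
        new PriorityQueue<int[]>(k, (a, b) -> Integer.compare(a[1], b[1]));
    for (int ord = firstChild[rootOrd]; ord != -1; ord = nextSibling[ord]) {
      if (heap.size() < k) {
        heap.add(new int[] { ord, counts[ord] });
      } else if (counts[ord] > heap.peek()[1]) {
        heap.poll(); // evict the current smallest of the top-k
        heap.add(new int[] { ord, counts[ord] });
      }
    }
    int[] topOrdinals = new int[heap.size()];
    for (int i = topOrdinals.length - 1; i >= 0; i--) {
      topOrdinals[i] = heap.poll()[0]; // poll ascending, fill descending
    }
    return topOrdinals;
  }
}
{code}

The d6/d1 difference then lives entirely in how much of this child walk (and the subsequent
label resolution) the extraction phase has to perform.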

Just for kicks, I tried guesstimating the layout of a corpus with addresses by upping the
document count and lowering the path count: 100M documents, 0.84 paths/doc, max depth 4
(2.5M unique paths):
||                             || LUCENE-3079 (d4) || LUCENE-3079 (d1) || LUCENE-2369 (d6) ||
| Index build time             | 28 min   | -        | 17 min    |
| Memory required for indexing | ? MB     | ? MB     | ? MB      |
| First facet request          | 13933 ms | 13367 ms | 46,000 ms |
| Best of 5 requests           | 11718 ms | 11036 ms | 2989 ms   |
| Memory usage after faceting  | 240 MB   | 240 MB   | 475 MB    |

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (e.g., Bobo Browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has a multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (i.e., lose functionality or performance), we can
> later cut over.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
