lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3079) Facetiing module
Date Tue, 28 Jun 2011 14:59:19 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056553#comment-13056553 ]

Shai Erera commented on LUCENE-3079:
------------------------------------

You wrote LUCENE-3097, which is about "post group faceting", while this issue is LUCENE-3079.
I assume you meant the latter, but want to confirm :). You also wrote LUCENE-2309, which is
about decoupling IW from Analyzers. Are you perhaps referring to a Solr issue, or a different
Lucene issue? If so, can you please let me know which one?

This is a great test, and it matches more or less the test we've been running. Is it in 'benchmark'
form? Can you post it on this issue so I can try the same?

What do you mean by "top 5 facets/tags"? If I were to speak of dimensions, where a dimension
is something like "tags", "authors" or "date", do you mean that you requested counts for 5 dimensions,
or that you indexed just one dimension (i.e. one "root") and requested the top-5 results
for it? I assume it's the latter, but again, I'm confirming my understanding.

So assuming I understood correctly the terminology and test setup, you execute one query which
matches 50% of the documents and ask to count the top-5 facets under a single "root"/"dimension",
and record the time as 'first facet request'. And then you execute it 4-5 additional times,
and record 'best of 5 requests'. Do I understand it correctly?
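
A minimal sketch of the measurement protocol described above, to pin down what the two numbers mean. It assumes that 'best of 5 requests' covers only the repeated executions and excludes the first, cold one. FacetTiming, Result and measure are hypothetical names and are not part of Lucene or the LUCENE-3079 patch; the Runnable stands in for "run the query and collect the top-5 facets".

```java
/** Hypothetical timing harness; not part of Lucene or the LUCENE-3079 patch. */
final class FacetTiming {

    /** The two reported numbers, in nanoseconds. */
    static final class Result {
        final long firstNanos;          // 'first facet request' (cold)
        final long bestOfRepeatsNanos;  // 'best of N requests' (warm)

        Result(long firstNanos, long bestOfRepeatsNanos) {
            this.firstNanos = firstNanos;
            this.bestOfRepeatsNanos = bestOfRepeatsNanos;
        }
    }

    /**
     * Runs the request once to get the cold time, then 'repeats' more times
     * and keeps the fastest of those. Assumption: the cold run is excluded
     * from the best-of figure.
     */
    static Result measure(Runnable facetRequest, int repeats) {
        if (repeats < 1) {
            throw new IllegalArgumentException("repeats must be >= 1");
        }
        long first = timeOnce(facetRequest);
        long best = Long.MAX_VALUE;
        for (int i = 0; i < repeats; i++) {
            best = Math.min(best, timeOnce(facetRequest));
        }
        return new Result(first, best);
    }

    private static long timeOnce(Runnable request) {
        long start = System.nanoTime();
        request.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Stand-in workload; a real test would execute the query here.
        Runnable fakeRequest = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum == 42) {
                System.out.println(); // keeps the loop from being optimized away
            }
        };
        Result r = measure(fakeRequest, 5);
        System.out.printf("first facet request: %d us%n", r.firstNanos / 1_000);
        System.out.printf("best of 5 requests:  %d us%n", r.bestOfRepeatsNanos / 1_000);
    }
}
```

Reporting the cold and warm figures separately keeps one-time costs, such as cache warm-up or disk reads, out of the steady-state number.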

One difference between the two approaches, assuming you're referring to a faceting approach
that uses the FieldCache, is that by default the faceting approach here reads everything from
disk. So it would be interesting to run with the facets-in-memory feature.

I don't know what to make of the memory usage -- on the last test it consumed 50% less than
the other approach, on the first it consumed nearly the same and on the second test it consumed
150% more. This is odd. Do you trust this measurement?

The 'first facet request' result is not surprising, because it takes time to warm up the FieldCache
(assuming that's what you use).

I am interested in the memory observed for indexing, because that too seems to fluctuate. I.e.,
in the second test the difference is nearly x20, which is weird.

Also, the difference in indexing time is interesting, as it too is not very consistent.
And I find the x2 factor suspicious - I would like to understand it better. Since trunk is reported
to improve indexing speed by a large factor (nearly 200%), I think it would be wise to wait
with this comparison until I bring the patch up to date with trunk.

I like it that you test the default behavior. I think it's very important that we have the
greatest out-of-the-box experience. Since one approach reads from disk and the other from memory, I
would first like to test the in-memory facets using this approach, so we can at least compare the
same thing. I know that trunk plays some role here (definitely at indexing time), so we can
focus on search time for now.

This is great stuff, Toke!

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big reasons
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements
> that can be made to take even more advantage of knowledge solr has because
> it "owns" the index that no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> cut over later.
