lucene-java-user mailing list archives

From Sascha Fahl <sascha.f...@googlemail.com>
Subject Re: Indexstructure design
Date Mon, 09 Jun 2008 07:13:22 GMT
Hi and thanks a lot,

I'll take a look at the HitCollector thing.
I expect around 500,000 to 1,000,000 docs per class, so I think separate
indices are a good idea, especially because half of the requests will
target only one document class rather than all classes.
Is there a way to get the index a result came from? Then I would not need
the special field for identifying the document classes.

Best,
sascha
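
A minimal sketch of how a result could be traced back to its index, assuming
the combined reader numbers documents by concatenating the sub-indexes in the
order they were passed in. SubIndexLocator and its maxDocs argument are
hypothetical names for illustration, not part of Lucene; the caller is assumed
to know the number of document slots in each sub-index.

```java
// Hypothetical helper (not a Lucene class): maps a document number from a
// combined reader back to the sub-index it belongs to.
final class SubIndexLocator {
    private final int[] starts; // starts[i] = first combined doc number of sub-index i
    private final int total;    // total number of document slots across all sub-indexes

    // maxDocs[i] = number of document slots in sub-index i (assumed known by the caller)
    SubIndexLocator(int[] maxDocs) {
        starts = new int[maxDocs.length];
        int base = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i] = base;
            base += maxDocs[i];
        }
        total = base;
    }

    // Position of the sub-index that contains the combined doc number.
    int subIndexOf(int doc) {
        if (doc < 0 || doc >= total) {
            throw new IllegalArgumentException("doc out of range: " + doc);
        }
        // Scan from the end so that empty sub-indexes (same start) are skipped.
        for (int i = starts.length - 1; i >= 0; i--) {
            if (doc >= starts[i]) {
                return i;
            }
        }
        throw new IllegalStateException("unreachable for a valid doc number");
    }

    // Doc number relative to its own sub-index.
    int localDoc(int doc) {
        return doc - starts[subIndexOf(doc)];
    }
}
```

With one index per class, the position returned by subIndexOf would identify
the class directly, so the extra field would not be needed for that purpose.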

Am 08.06.2008 um 17:49 schrieb Erick Erickson:

> Well, why not just include a field that identifies each document's class?
> Then, to search over all classes, just don't mention the class field.
>
> When you *do* want to restrict by class, include "AND class:blah" in your
> query.
>
> This assumes you don't have a huge number of classes and it wouldn't be a
> problem to have a clause like AND (class:cl1 OR class:cl2 OR class:cl3......).
> That said, a clause containing 100s of class:blah isn't a problem.
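
A small sketch of building such a restriction clause as a query string. The
ClassClause helper is hypothetical, and it assumes the field is literally
named "class" and that class names contain no characters that would need
escaping in the query syntax.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: appends an "AND (class:a OR class:b ...)" restriction
// to a user query string.
final class ClassClause {
    static String restrict(String userQuery, List<String> classes) {
        if (classes.isEmpty()) {
            return userQuery; // no restriction: search over all classes
        }
        String alternatives = classes.stream()
                .map(c -> "class:" + c)
                .collect(Collectors.joining(" OR "));
        // Parenthesize both sides so AND/OR precedence cannot surprise us.
        return "(" + userQuery + ") AND (" + alternatives + ")";
    }
}
```

For example, restricting the query "foo bar" to cl1 and cl2 yields
(foo bar) AND (class:cl1 OR class:cl2).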
>
> And if you want to get really fancy, create a Filter for each class and you
> can just combine the filters appropriately when you want to query and
> restrict to classes.
>
> Limiting the results to the top 10/20 per class is trickier, but see
> HitCollectors for a way to intervene in the hit selection process. You could
> keep a simple map that counted by class and reject each doc that came through
> the HitCollector once its class had reached your limit. Beware of doing a
> Reader.get() on each hit; you'd have to do some work with TermDocs/TermEnum.
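
The counting map described above could be sketched like this. PerClassLimiter
is a hypothetical name and is deliberately independent of Lucene; how the
class of each hit is looked up cheaply inside the collector is left open here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: counts accepted hits per class and rejects further
// hits for a class once it has reached the limit.
final class PerClassLimiter {
    private final int limit;
    private final Map<String, Integer> counts = new HashMap<>();

    PerClassLimiter(int limit) {
        this.limit = limit;
    }

    // Returns true if the hit should be kept, false if its class is already full.
    boolean accept(String docClass) {
        int seen = counts.getOrDefault(docClass, 0);
        if (seen >= limit) {
            return false;
        }
        counts.put(docClass, seen + 1);
        return true;
    }
}
```

Note that this keeps the first N hits seen per class. If hits do not arrive
ordered by score, those are not necessarily the best N, so a per-class
score-ordered queue may be needed instead.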
>
> All that said, I haven't really kept up on the faceting that's been talked
> about in the archive, so you may want to look at the searchable mail archive
> and see what's up with that.
>
> How big is your index? That is, how many documents and how many fields
> (approx) are you talking about here? That'll influence whether you would
> be better off keeping them all in one index or splitting them up. But if you
> can keep them in a single index, your maintenance will be *much* easier.
>
> Best
> Erick
>
> On Sat, Jun 7, 2008 at 10:34 AM, Sascha Fahl <sascha.fahl@googlemail.com> wrote:
>
>> Hi,
>> I am quite new to the Lucene scene and I need your help :-)
>> There are several document classes. Let's say documents from class A, B, C,
>> D and E. What I need is the following:
>>
>> 1) I want to search over all classes together. So the query should hit
>> results from all different classes. Ideally it is possible to limit the
>> results from each class to, let's say, 10 or 20 results per class.
>>
>> 2) I want to search over all classes separately.
>>
>> My first idea was to have one index per class and use a MultiReader for
>> searching over all classes and use an IndexReader for searching over the
>> classes separately. Right now I have 3 questions:
>>       1. Is that a good idea?
>>       2. Is there a way to identify the index a result comes from?
>>       3. Is it possible to limit the results to a number of hits from each
>>          index?
>>
>>
>> Thank you,
>> Sascha
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>



