lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anshum <>
Subject Re: Indexstructure design
Date Mon, 09 Jun 2008 07:57:34 GMT
Considering you have that number of documents for each class, you may think
of splitting the index (as I believe that the total number would be high).
What exactly would you mean by 'get the index' from the result? Do you mean
that you would want to fetch the class as well (without actually fetching
If thats the case you might just collect documents from different indexes
into different holders so you would know what belongs to which index.


On Mon, Jun 9, 2008 at 12:43 PM, Sascha Fahl <>

> Hi and thanks a lot,
> I'll take a look at the HitCollector thing.
> I think I will have around 500.000 - 1.000.000 docs per class.
> So having different indeces is a good idea I think. Especially because
> half of the requests will point to only one document class and not to all
> classes.
> Is there a way to get the index from a result? So I would not need to have
> this one
> special field for identifying the document classes.
> Best,
> sascha
> Am 08.06.2008 um 17:49 schrieb Erick Erickson:
>  well, why not just include a field that identifies each document's class?
>> Then, to search over
>> all classes, just don't mention the class field.
>> When you *do* want to restrict by class, include "AND class:blah" in your
>> query.
>> This assumes you don't have a huge number of classes and it wouldn't be a
>> problem
>> to have a clause like AND (class:cl1 OR class:cl2 OR class:cl3......).
>> That
>> said, a
>> clause containing 100s of class:blah isn't a problem.
>> And if you want to get really fancy, create a Filter for each class and
>> you
>> can just
>> combine the filters appropriately when you want to query and restrict to
>> classes.
>> Limiting the results to the top 10/20 per class is tricker, but see
>> HitCollectors
>> for a way to intervene in the hit selection process. You could keep a
>> simple
>> map
>> that counted by class and reject each doc that came through the
>> hitcollector
>> class after it reached your limit. Beware of doing a Reader.get() on each
>> hit, you'd have to do some work with TermDocs/TermEnum.
>> All that said, I haven't really kept up on the faceting that's been talked
>> about
>> in the archive, so you may want to look at the searchable mail archive and
>> see what's up with that.
>> How big is your index? That is, how many documents and how many fields
>> (approx) are you talking about here? That'll influence whether you would
>> be better off keeping them all in one index or splitting them up. But if
>> you
>> can keep them in a single index, your maintenance will be *much* easier.
>> Best
>> Erick
>> On Sat, Jun 7, 2008 at 10:34 AM, Sascha Fahl <>
>> wrote:
>>  Hi,
>>> I am quite new to the lucene scene and I need your help :-)
>>> There are several document classes. Lets say documents from class A, B,
>>> C,
>>> D and E. What I need is the following:
>>> 1) I want to search over all classes together. So the query should hit
>>> results from all different classes - ideally it is possible
>>> to limit the results from each class to lets say 10 or 20 results per
>>> class.
>>> 2) I want to search over all classes seperately.
>>> My first idea was to have one index per class and use a MultiReader for
>>> searching over all classes and use an IndexReader for
>>> searching over the classes seperately. Right now I have 3 questions:
>>>      1. Is that a good idea?
>>>      2. Is there a way to identify the index a result comes from?
>>>      3. Is it possible to limit the results to a number of hits from each
>>> index?
>>> Thank you,
>>> Sascha
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

The facts expressed here belong to everybody, the opinions to me.
The distinction is yours to draw............

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message