lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Bamford <>
Subject Re: searching in 2 indexes
Date Mon, 15 Dec 2008 17:44:08 GMT
Hi Erick,

You raise some excellent points. First let me explain why our structure 
is the way it is ..

There is not actually a strict 1-1 relationship between entries in the 
two indexes. One represents content, the other, references.
There can only ever be 1 content item, but there can be several 
references to it (think: symlinks to a file).
So if a content item has 2 references to it, my search needs to return 2 
hits - 1 per reference with matching content.

You've really got me thinking, especially your point (1) about putting 
different shaped data into the same index. We could presumably then 
combine the content and ref indexes into one and improve performance ....?

Regarding size of indexes, it varies per user, but the largest example I 
can find is:

Ref Index: 24Mb
Content Index: 193Mb

Most users are way smaller than that, though..

One question (thinking ahead now): how to differentiate between content 
and reference hits if they are retrieved from the same index? Presumably 
I will need to try to retrieve a field which only exists in one and if 
it fails, I know it must be of the other type?

- Chris

Erick Erickson wrote:
> Stop it right now <G>. You've gotta take off your DB hat and put
> on your searching hat to get the most out of Lucene. So I'd
> think about the following:
> 1> Why do you have two indexes? Why not just put all
>      the data into a single index? The fields are disjoint anyway....
>      Note that there is no requirement that all documents in an
>      index have the same fields either, if that makes things easier....
> 2> Disk space is cheap. Very cheap. I know it goes against the
>      grain to de-normalize your data, but think about doing just that.
>      The idea is to be able to submit a single *search* where
>      each document returned is complete rather than thinking in
>      terms of joins etc....
> 3> Storing and indexing are two separate concepts. Your index (at
>      least the searchable part) won't grow if you store (but don't index)
>      lots of data. So if you need to pile a bunch of junk into your records
>      but *not* search them, having a humongous index where most
>      of it is simply stored isn't nearly as costly as you might think.
> 4> Some of this depends upon how much data you're talking about. If
>      both indexes total 10M, there's no reason in the world to keep them
>      separate. If they total 100G, that's another story. Some more
>      details would be helpful.
> 5> I almost guarantee that if you've merely translated database tables
>      into Lucene indexes on a one-for-one basis, you won't be very
>      satisfied with the results........
> Best
> Erick
> On Mon, Dec 15, 2008 at 11:33 AM, Chris Bamford <>wrote:
>> Hi
>> I have a situation where I have two related indexes which are logically
>> linked by a common field called INDEXID. All other fields differ between the
>> two indexes. For any given INDEXID I would like to be able to retrieve the
>> matching pair of documents, one from each index. (Logically this is an AND
>> /i.e. /only return anything if there is a document with INDEXID /X/ in index
>> A *and* in index B.)
>> Is there a nifty way to do this with a single query or must I first search
>> one, then the other?
>> I thought perhaps MultiSearcher might do it, but now I'm not so sure ...
>> Thanks...
>> - Chris

Chris Bamford
Senior Development Engineer
Tel: +44 (0)1344 381814

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message