lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: searching in 2 indexes
Date Mon, 15 Dec 2008 18:12:18 GMT
This is where things get exciting, when theory runs right up
against the particular problem at hand...

What if your document consisted of a content field (perhaps broken
into as many pieces as necessary) and multiple references?
Something like:

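Document doc = new Document();
// The Store/Index flags below are one plausible choice, not a requirement.
doc.add(new Field("content", ....., Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("reference", ref1, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("reference", ref2, Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);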

Your "reference" field can still be searchable, but you have to
give some thought to the analyzers you use, depending upon
how you want to search it (or not, depending upon your
requirements).

Document.getValues("reference") will return an array with one entry
for each call to doc.add("reference", ....) in the original, provided
the field is stored.
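
To make the shape of that document concrete, here is a minimal sketch in
plain Java. It deliberately does not use the Lucene API: the class
MultiRefSketch, the Doc type and the sample paths are all hypothetical
stand-ins, and the "search" is a naive substring match. It only shows how
one content item carrying several references can be expanded into one hit
per reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java stand-in, NOT the Lucene API: one content value plus a
// multi-valued "reference" field per document.
class MultiRefSketch {
    // Hypothetical document shape used only for this illustration.
    static final class Doc {
        final String content;
        final List<String> references;

        Doc(String content, List<String> references) {
            this.content = content;
            this.references = references;
        }
    }

    // Returns one hit per reference of every document whose content
    // contains the term (case-insensitive substring match).
    static List<String> search(List<Doc> docs, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (d.content.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.addAll(d.references);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("quarterly budget report", Arrays.asList("inbox/42", "archive/7")),
            new Doc("holiday photos", Arrays.asList("inbox/43")));
        // One matching content item with two references yields two hits.
        System.out.println(search(docs, "budget")); // [inbox/42, archive/7]
    }
}
```

In a real index the expansion step would presumably read the stored
"reference" values of each matching document instead of a Java list.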

Note that when you index AND store a value, the value is "split". That
is, the data is analyzed and indexed for searching, AND stored verbatim
without any analysis. The two operations are orthogonal, even though
both can be specified at once.
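
The split between indexing and storing can be illustrated with a toy
model in plain Java. This is a sketch under stated assumptions, not
Lucene's implementation: IndexVsStoreSketch is a hypothetical class, the
"analysis" is just lowercasing and splitting on non-alphanumeric
characters, and the postings map stands in for the searchable part of an
index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java stand-in, NOT the Lucene API.
class IndexVsStoreSketch {
    // "Indexed" side: analyzed token -> ids of documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // "Stored" side: document id -> the original value, untouched.
    private final List<String> stored = new ArrayList<>();

    // Adds a value that is both indexed (after analysis) and stored verbatim.
    int add(String value) {
        int id = stored.size();
        stored.add(value);
        for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Searching consults only the analyzed tokens.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                                     Collections.<Integer>emptySet());
    }

    // Retrieval returns the verbatim value, never the tokens.
    String storedValue(int id) {
        return stored.get(id);
    }

    public static void main(String[] args) {
        IndexVsStoreSketch idx = new IndexVsStoreSketch();
        int id = idx.add("Re: Searching in 2 Indexes");
        System.out.println(idx.search("searching")); // [0]
        System.out.println(idx.storedValue(id));     // Re: Searching in 2 Indexes
    }
}
```

Dropping the postings update would give a stored-only value, and dropping
the stored list would give an indexed-only value, which is the sense in
which the two operations are independent.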


At < 200 MB, there's no real need to skimp for space unless you have an
index per user and want to keep many of them open at once. Which also
raises the question of whether it would be better to put them all
together in one index and include a "user" field for each doc. But
that's largely a tuning question for later....
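
As a rough picture of the single-index option, the sketch below keeps
every user's documents in one list and restricts each search by a "user"
value. It is plain Java rather than Lucene code, and SharedIndexSketch,
Doc and the sample data are hypothetical. In Lucene the restriction would
presumably be a required clause on the "user" field added to the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in, NOT the Lucene API: one shared collection of
// documents, each tagged with the user it belongs to.
class SharedIndexSketch {
    // Hypothetical document shape used only for this illustration.
    static final class Doc {
        final String user;
        final String content;

        Doc(String user, String content) {
            this.user = user;
            this.content = content;
        }
    }

    // Every search carries a mandatory user restriction in addition to
    // the content term, so one user never sees another user's hits.
    static List<String> search(List<Doc> index, String user, String term) {
        List<String> hits = new ArrayList<>();
        for (Doc d : index) {
            if (d.user.equals(user) && d.content.contains(term)) {
                hits.add(d.content);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> index = Arrays.asList(
            new Doc("chris", "budget report"),
            new Doc("erick", "budget forecast"),
            new Doc("chris", "holiday photos"));
        System.out.println(search(index, "chris", "budget")); // [budget report]
    }
}
```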

Best
Erick

On Mon, Dec 15, 2008 at 12:44 PM, Chris Bamford <chris.bamford@scalix.com> wrote:

> Hi Erick,
>
> You raise some excellent points. First let me explain why our structure is
> the way it is ..
>
> There is not actually a strict 1-1 relationship between entries in the two
> indexes. One represents content, the other, references.
> There can only ever be 1 content item, but there can be several references
> to it (think: symlinks to a file).
> So if a content item has 2 references to it, my search needs to return 2
> hits - 1 per reference with matching content.
>
> You've really got me thinking, especially your point (1) about putting
> different shaped data into the same index. We could presumably then combine
> the content and ref indexes into one and improve performance ....?
>
> Regarding size of indexes, it varies per user, but the largest example I
> can find is:
>
> Ref Index: 24 MB
> Content Index: 193 MB
>
> Most users are way smaller than that, though..
>
> One question (thinking ahead now): how to differentiate between content and
> reference hits if they are retrieved from the same index? Presumably I will
> need to try to retrieve a field which only exists in one and if it fails, I
> know it must be of the other type?
>
> - Chris
>
>
> Erick Erickson wrote:
>
>> Stop it right now <G>. You've gotta take off your DB hat and put
>> on your searching hat to get the most out of Lucene. So I'd
>> think about the following:
>>
>> 1> Why do you have two indexes? Why not just put all
>>     the data into a single index? The fields are disjoint anyway....
>>     Note that there is no requirement that all documents in an
>>     index have the same fields either, if that makes things easier....
>>
>> 2> Disk space is cheap. Very cheap. I know it goes against the
>>     grain to de-normalize your data, but think about doing just that.
>>     The idea is to be able to submit a single *search* where
>>     each document returned is complete rather than thinking in
>>     terms of joins etc....
>>
>> 3> Storing and indexing are two separate concepts. Your index (at
>>     least the searchable part) won't grow if you store (but don't index)
>>     lots of data. So if you need to pile a bunch of junk into your records
>>     but *not* search them, having a humongous index where most
>>     of it is simply stored isn't nearly as costly as you might think.
>>
>> 4> Some of this depends upon how much data you're talking about. If
>>     both indexes total 10M, there's no reason in the world to keep them
>>     separate. If they total 100G, that's another story. Some more
>>     details would be helpful.
>>
>> 5> I almost guarantee that if you've merely translated database tables
>>     into Lucene indexes on a one-for-one basis, you won't be very
>>     satisfied with the results........
>>
>> Best
>> Erick
>>
>> On Mon, Dec 15, 2008 at 11:33 AM, Chris Bamford <chris.bamford@scalix.com> wrote:
>>
>>
>>
>>> Hi
>>>
>>> I have a situation where I have two related indexes which are logically
>>> linked by a common field called INDEXID. All other fields differ between
>>> the two indexes. For any given INDEXID I would like to be able to
>>> retrieve the matching pair of documents, one from each index. (Logically
>>> this is an AND, i.e. only return anything if there is a document with
>>> INDEXID X in index A *and* in index B.)
>>>
>>> Is there a nifty way to do this with a single query or must I first
>>> search one, then the other?
>>> I thought perhaps MultiSearcher might do it, but now I'm not so sure ...
>>>
>>> Thanks...
>>>
>>> - Chris
>>>
>>>
>>>
>>
>
> --
> Chris Bamford
> Senior Development Engineer
> *Scalix*
> chris.bamford@scalix.com
> Tel: +44 (0)1344 381814
> www.scalix.com
>
