lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Almost parallel indexes
Date Fri, 28 Sep 2007 00:13:45 GMT
OK, this isn't well thought out, more the first thing that
pops to mind...

You're right, Lucene doesn't do joins. But would it serve
to keep two indexes? One the slow-changing stuff
and one the fast-changing stuff. They are related by
some *external* (as in "not the Lucene doc id")
field.

You'd have to custom roll something that searched
across both indexes. Depending on your search
semantics this may be hard or easy. It would be
easy if your search were simple, something like
(stuff in the fast-changing index) OR/AND/NOT
(stuff in the slow-changing index). Handling
(some in the fast) AND/OR/NOT (some in the
slow) would be much harder...

The idea is to collect a set of your external doc
IDs, then use TermEnum/TermDocs in each
index to get at the underlying document parts.
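
To make the set-combining step concrete, here is a rough sketch in
plain Java. It assumes you have already run a query against each index
and pulled the external ID of every hit into a Set; that collection
step (and the later TermEnum/TermDocs lookup) is not shown. The class
and method names are made up for illustration and are not Lucene API.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: combines external-ID sets gathered from two indexes.
// Nothing here touches Lucene; the sets stand in for per-index hit lists.
final class ExternalIdJoin {

    // IDs that matched in both the fast-changing and slow-changing index.
    static Set<String> and(Set<String> fast, Set<String> slow) {
        Set<String> result = new TreeSet<>(fast); // copy so inputs stay untouched
        result.retainAll(slow);
        return result;
    }

    // IDs that matched in either index.
    static Set<String> or(Set<String> fast, Set<String> slow) {
        Set<String> result = new TreeSet<>(fast);
        result.addAll(slow);
        return result;
    }

    // IDs that matched in the fast-changing index but not the slow one.
    static Set<String> not(Set<String> fast, Set<String> slow) {
        Set<String> result = new TreeSet<>(fast);
        result.removeAll(slow);
        return result;
    }

    public static void main(String[] args) {
        // Assumed inputs: external IDs of the hits from each index.
        Set<String> fastHits = new TreeSet<>(Arrays.asList("doc1", "doc2", "doc3"));
        Set<String> slowHits = new TreeSet<>(Arrays.asList("doc2", "doc3", "doc4"));

        System.out.println(and(fastHits, slowHits)); // [doc2, doc3]
        System.out.println(or(fastHits, slowHits));  // [doc1, doc2, doc3, doc4]
        System.out.println(not(fastHits, slowHits)); // [doc1]
    }
}
```

Each surviving external ID would then be looked up in whichever index
holds the fields you need. Holding whole ID sets in memory is fine for
small hit counts but gets expensive as they grow.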

A lot depends on how many hits you expect. 100 is
one thing, 1,000,000 is another.

This, of course, adds a layer of complexity that makes
things harder, but it might be worth a shot.

Disclaimer: I haven't personally done something like this,
so take it for what it's worth.

Best
Erick

On 9/27/07, Tim Sturge <tsturge@metaweb.com> wrote:
>
> Hi,
>
> I have an index which contains two very distinct types of fields:
>
> - Some fields are large (many term documents) and change fairly slowly.
> - Some fields are small (mostly titles, names, anchor text) and change
> fairly rapidly.
>
> Right now I keep around the large fields in raw form and when the small
> fields change, I retokenize the large and the small fields together. The
> problem is that this retokenization is sucking up most of my CPU time,
> making the indexing process too slow (this index needs to track changes in
> almost real time; I'm using one of the reopen() patches from LUCENE-743 in
> JIRA to achieve this).
>
> I can't really use ParallelReader to keep the indexes the same; it
> requires me to add documents to both indexes which means I have to
> retokenize the large fields anyway. I would want to do a "join" on an
> external id, and as far as I can tell, Lucene doesn't support that.
>
> Alternatively, what I'd like is a way to either store a pre-tokenized
> version of the large fields, or to be able to add fields to a document that
> come from an existing document in the index.
>
> I suspect there is more to this question than meets the eye, but I'd be
> interested in any strategies that people have used in the past.
>
> Thanks,
>
> Tim
