lucene-java-user mailing list archives

From Brandon Mintern <>
Subject Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index
Date Mon, 23 Apr 2012 20:53:17 GMT
On Mon, Apr 23, 2012 at 1:25 PM, Jong Kim <> wrote:
> Thanks for the reply.
> Our metadata is not stored in a single field, but is rather a collection of
> fields. So, it requires a boolean search that spans multiple fields. My
> understanding is that it is not possible to iterate over the matching
> documents efficiently using termDocs() when the search involves multiple
> terms and/or multiple fields, right?
> /Jong

You can do this by defining your own hits Collector that simply pulls
the matching ID out of each result. Since searching the second index
returns fewer results, you could do something like this:

Two indexes:
LightWeight - stores metadata fields and document ID
HeavyWeight - stores static data and document ID

Search query:
1. Metadata portion: query LightWeight and retrieve all matching IDs
(NOT Lucene IDs, but your own stored document IDs) in a gnu.trove
set (the TIntSet referred to in step 2).

Some queries won't even need to hit the second index; for those, you
already have your full match. If you need to match against the second
index as well:

2. Pass in the TIntSet as an argument to another Collector.
3. For each match in the HeavyWeight index, if it is also in the
TIntSet, add it to the final TIntSet result set. Otherwise ignore it.
4. After the collector has been visited by each match, the final
result set is your hits.
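
As a rough illustration, steps 1-4 could look something like the sketch
below. This is not real Lucene or Trove code: IdCollector is a
hypothetical stand-in for a Lucene hits Collector, java.util.HashSet
stands in for the Trove TIntSet so the example needs only the JDK, and
the int arrays simulate the stored document IDs that each index search
would return.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a Lucene hits Collector: called once per match
// with your own stored document ID (not the Lucene-internal ID).
interface IdCollector {
    void collect(int storedDocId);
}

// Step 1: gather every stored ID matched in the LightWeight (metadata) index.
final class AllIdsCollector implements IdCollector {
    final Set<Integer> ids = new HashSet<Integer>(); // stand-in for a TIntSet

    @Override
    public void collect(int storedDocId) {
        ids.add(storedDocId);
    }
}

// Steps 2-3: keep a HeavyWeight match only if the metadata search matched it too.
final class IntersectingCollector implements IdCollector {
    private final Set<Integer> allowed;
    final Set<Integer> result = new HashSet<Integer>();

    IntersectingCollector(Set<Integer> allowed) {
        this.allowed = allowed;
    }

    @Override
    public void collect(int storedDocId) {
        if (allowed.contains(storedDocId)) {
            result.add(storedDocId);
        }
        // Otherwise ignore the match.
    }
}

final class TwoIndexSearchSketch {
    // Simulates running a search: every hit is handed to the collector.
    static void search(int[] hits, IdCollector collector) {
        for (int id : hits) {
            collector.collect(id);
        }
    }

    // Step 4: after both searches, the second collector holds the final hits.
    static Set<Integer> andSearch(int[] lightWeightHits, int[] heavyWeightHits) {
        AllIdsCollector metadata = new AllIdsCollector();
        search(lightWeightHits, metadata);
        IntersectingCollector both = new IntersectingCollector(metadata.ids);
        search(heavyWeightHits, both);
        return both.result;
    }

    public static void main(String[] args) {
        System.out.println(andSearch(new int[] {1, 2, 3, 5}, new int[] {2, 5, 8}));
    }
}
```

In a real implementation the collectors would read the stored ID for
each Lucene hit, and a primitive set would avoid the boxing that
HashSet<Integer> does here.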

You now have the set of document IDs for the complete match. Because
it uses primitives and lightweight objects, this isn't much worse than
letting Lucene do the collection.

Of course, this approach only works if the intersection between
metadata and big data is an AND relationship. If you need other logic,
step 3 above obviously changes.
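
To make that concrete, here is one way step 3 might change for other
relationships. Again this is only a sketch under assumptions: the
Combine enum and CombiningCollector class are hypothetical names, and
java.util.HashSet stands in for the Trove set. AND keeps IDs found in
both indexes, OR keeps IDs found in either, and AND_NOT keeps metadata
matches that the second index did not match.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical ways of combining the metadata matches with the second index.
enum Combine { AND, OR, AND_NOT }

// Hypothetical collector for the second (HeavyWeight) search.
final class CombiningCollector {
    private final Combine mode;
    private final Set<Integer> lightIds;
    private final Set<Integer> result;

    CombiningCollector(Combine mode, Set<Integer> lightIds) {
        this.mode = mode;
        this.lightIds = lightIds;
        // AND starts empty and grows; OR and AND_NOT start from the metadata matches.
        this.result = (mode == Combine.AND)
                ? new HashSet<Integer>()
                : new HashSet<Integer>(lightIds);
    }

    // Called once per match in the second index.
    void collect(int storedDocId) {
        switch (mode) {
            case AND:
                if (lightIds.contains(storedDocId)) {
                    result.add(storedDocId);
                }
                break;
            case OR:
                result.add(storedDocId);
                break;
            case AND_NOT:
                result.remove(Integer.valueOf(storedDocId));
                break;
        }
    }

    Set<Integer> result() {
        return result;
    }
}
```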

Another caveat is that if you are relying on Lucene to store and
return the full document for each query, this approach isn't the best
for fetching information out of Lucene. We use a standard relational
database for storing our data and use Lucene only to query for sets of
document IDs; we then fetch the remaining document fields from our
DB (or, in some cases, from elsewhere: some information lives on S3,
etc.).
