lucene-java-user mailing list archives

From "Joe MA" <mrj...@comcast.net>
Subject RE: No subsearcher in Lucene 3.3?
Date Tue, 30 Aug 2011 16:29:43 GMT

Thanks for the replies.  Here is why I need the subreader (or subsearcher in earlier Lucene
versions):

I have multiple collections of documents, say broken out by years (it's more complex than
this, but this illustrates the use case):

Collection1 >>> 	D:/some folder/2009/*.pdf 			(lots of PDF files)
Collection2 >>> 	D:/another folder/2010/*.pdf			(lots of different PDF files)

And so forth.  So in the example above, I would have two indices, one for each year.  When
I index, I store the *relative* path of each document as a field.  For example, 'link:2009/file1.pdf'
or 'link:2010/file1.pdf', etc.  I do not store the full path to the files in the index.  This
has a huge advantage because we can move the documents to another file system or server or
path without rebuilding the index.  I store the required base path to the documents in each
collection in a database, external to the collection.  For example, in the case above,
Collection1 would have a base path of "D:/some folder/".  Therefore, to actually access
a document referenced in a collection, you would concatenate the base_path retrieved from the
database with the "link" field retrieved from the collection.  I would think this is a very
common approach.
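
To make the base_path + link lookup concrete, here is a minimal sketch in plain Java. It does
not touch Lucene or a real database: the in-memory map stands in for the external database
table, and the class and method names (BasePathSketch, fullPath) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class BasePathSketch {
    // Stand-in for the external database: collection name -> base path.
    static final Map<String, String> BASE_PATHS = new HashMap<String, String>();
    static {
        BASE_PATHS.put("Collection1", "D:/some folder/");
        BASE_PATHS.put("Collection2", "D:/another folder/");
    }

    // Joins the collection's base path with the relative "link" value stored in the index.
    static String fullPath(String collection, String link) {
        String base = BASE_PATHS.get(collection);
        if (base == null) {
            throw new IllegalArgumentException("unknown collection: " + collection);
        }
        // Tolerate base paths stored with or without a trailing slash.
        return base.endsWith("/") ? base + link : base + "/" + link;
    }

    public static void main(String[] args) {
        System.out.println(fullPath("Collection1", "2009/file1.pdf"));
    }
}
```

Moving a collection then only means updating its base path entry; the index itself is untouched.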

When searching a single collection, no problem.  But if I want to search the two collections
at the same time, I need to know which collection the hit came from so I can retrieve the
base_path from the database.  These base_paths can be different.  As mentioned, this was trivial
in Lucene 1.x and 2.x as I just grabbed the subsearcher from the result, which would for example
return a 1 or 2 indicating which of the two collections the result came from.  Then I can
build the path to the file.  In other words, subsearcher gave me the foreign key I needed
to map to additional external information associated with each index during a multisearch.
 That is now gone in Lucene 3.3.

I guess a really simple solution is just to store a new field with each document uniquely identifying
its collection.  So in the example above, I could create a new field "foreign_key_index"
for each document which would be "Collection1" or "Collection2" respectively.  This would
surely work, but it would break backwards compatibility of my system and would require me
to rebuild every collection.  It also seems pretty expensive for something so simple.

If there is another way to do this, please advise.  Thanks in advance and much appreciated.

- JMA



-----Original Message-----
From: Uwe Schindler [mailto:uwe@thetaphi.de] 
Sent: Monday, August 29, 2011 8:05 PM
To: java-user@lucene.apache.org
Subject: RE: No subsearcher in Lucene 3.3?

Why do you need to know the subreader? If you want to get the document's stored fields, use
the MultiReader.

If you really want to know the subreader, use this:
http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/util/ReaderUtil.html#subReader(int, org.apache.lucene.index.IndexReader)

But this is "somewhat slow", so don't use it in inner loops.
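
For illustration, the mapping from a MultiReader document number back to one of the top-level
indexes can be sketched without any Lucene calls. A MultiReader numbers documents by adding
the maxDoc of all preceding sub-readers to each sub-reader's local document number, so the
sub-reader is the last one whose starting offset is <= the hit's document number. The class
and method names below are hypothetical, and the maxDoc values are made up; in real code they
would come from calling maxDoc() on each IndexReader passed to the MultiReader.

```java
class SubIndexSketch {
    // Starting doc number of each sub-reader: the running sum of the preceding maxDoc values.
    static int[] docStarts(int[] maxDocs) {
        int[] starts = new int[maxDocs.length];
        int base = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i] = base;
            base += maxDocs[i];
        }
        return starts;
    }

    // Binary search for the last sub-reader whose start is <= docId.
    // Taking the last match skips over empty sub-readers that share a start offset.
    static int subIndex(int docId, int[] starts) {
        int lo = 0;
        int hi = starts.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= docId) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] maxDocs = {1000, 250};          // assumed sizes of Collection1 and Collection2
        int[] starts = docStarts(maxDocs);    // computed once per MultiReader, not per hit
        int docId = 1003;                     // e.g. a ScoreDoc.doc value from a search
        int idx = subIndex(docId, starts);
        System.out.println("Collection" + (idx + 1) + ", local doc " + (docId - starts[idx]));
    }
}
```

Computing the start offsets once and reusing them for every hit keeps the per-hit cost to a
binary search, and the resulting index can serve as the foreign key into the external database.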

Devon suggested:
> If I'm understanding your question correctly, in the Collector, you are told which IndexReader
> you are working with when the setNextReader method is called. Hopefully that helps.

This does not work as expected, because the Collector gets the lowest-level readers, which
are in fact sub-sub-readers (each single IndexReader itself consists of several "SegmentReaders",
unless you have optimized sub-indexes).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Joseph MarkAnthony [mailto:mrjama@comcast.net]
> Sent: Monday, August 29, 2011 8:54 PM
> To: java-user@lucene.apache.org
> Subject: No subsearcher in Lucene 3.3?
> 
> Greetings,
>     In the past (Lucene version 2.x) I successfully used
> MultiSearcher.subsearcher() to identify the searchable within a 
> MultiSearcher to which a hit belonged.
> 
> In moving to Lucene 3.3, MultiSearcher is now deprecated, and I am 
> trying to create a standard IndexSearcher over a MultiReader.  I 
> haven't gotten this to work yet but it appears to be the correct 
> approach.  However, I cannot find any corresponding "subsearcher" 
> method that could identify which subreader is the one that finds the hit.
> 
> For example, it used to be straightforward:
> 
> Create a MultiSearcher over several Searchables, and call 
> MultiSearcher.subsearcher to get the searchable that holds each search hit.
> 
> Now, I am creating an IndexSearcher over a MultiReader, which is created over
> an array of IndexReaders.   So when I get a hit, what's the best way to
> determine which of the several subReaders the hit came from?
> 
> Thanks in advance,
> JMA
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

