lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Best way to index document page by page?
Date Fri, 24 Jun 2005 10:48:00 GMT

On Jun 24, 2005, at 3:28 AM, JMA wrote:

> Greetings,
> I have a requirement to search documents page by page.  For example, in a
> 500 page document, if someone searches for "foo", I need to return
> "Found foo on page 4,6,24,100,223,401, and 455".
>
> The way I've implemented this is to index each *page* separately, so my
> 500 page document is actually treated as not one but 500 documents.  Then
> when I get hits, I can play sort games to aggregate the results to look
> as necessary.
>
> Is this the best way to do this?

That's a great way to do it.  For comparison, lucenebook.com slices
"Lucene in Action" by section, so each Lucene Document represents a
single section of the book.  Each Document also carries some additional
information, such as the starting page and the number of pages, and even
(though it is not presented at the moment) per-page section information
for sections that span multiple pages.

> Is there a way to store location information associated with each term
> within a field?  Note that there can be thousands of documents containing
> thousands of pages.

I believe what you want is to store a document identifier for every
Lucene Document.  In other words, add a field to each Document (which
represents a page) holding the identifier of the document that page
belongs to.  You can then query across documents or pages in various
ways, narrowing a search to a particular document by AND'ing a query
with a TermQuery for the document identifier.  Does that cover what
you're after?
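
As a rough illustration of the idea, here is a minimal sketch in plain
Java collections rather than the Lucene API.  Every name in it
(PageIndexSketch, addPage, search, docIdFilter) is made up for this
example.  Each indexed page carries a document identifier, hits are
grouped into page lists per document, and an optional identifier filter
stands in for AND'ing the query with a TermQuery on the identifier
field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java stand-in for a per-page index. None of these names are Lucene API.
final class PageIndexSketch {

    // Parallel lists: entry i plays the role of one Lucene Document (one page).
    private final List<String> docIds = new ArrayList<>();
    private final List<Integer> pageNumbers = new ArrayList<>();

    // term -> ids of the page entries that contain the term
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Index one page, tagging it with the identifier of its parent document.
    void addPage(String docId, int pageNumber, String text) {
        int entryId = docIds.size();
        docIds.add(docId);
        pageNumbers.add(pageNumber);

        // Crude tokenizer (an assumption): lowercase, split on non-alphanumerics.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        Set<String> uniqueTerms = new HashSet<>(Arrays.asList(tokens));
        for (String term : uniqueTerms) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(entryId);
            }
        }
    }

    // Returns docId -> sorted page numbers whose text contains the term.
    // A non-null docIdFilter keeps only that document's pages, mimicking
    // "term query AND identifier query".
    Map<String, TreeSet<Integer>> search(String term, String docIdFilter) {
        Map<String, TreeSet<Integer>> pagesByDoc = new TreeMap<>();
        String key = term.toLowerCase(Locale.ROOT);
        for (int entryId : postings.getOrDefault(key, List.of())) {
            String docId = docIds.get(entryId);
            if (docIdFilter == null || docIdFilter.equals(docId)) {
                pagesByDoc.computeIfAbsent(docId, k -> new TreeSet<>())
                          .add(pageNumbers.get(entryId));
            }
        }
        return pagesByDoc;
    }

    public static void main(String[] args) {
        PageIndexSketch index = new PageIndexSketch();
        index.addPage("manual", 4, "Configure foo before starting");
        index.addPage("manual", 5, "Nothing relevant here");
        index.addPage("manual", 6, "More about foo and bar");
        index.addPage("guide", 2, "foo appears in another document");

        System.out.println(index.search("foo", null));     // {guide=[2], manual=[4, 6]}
        System.out.println(index.search("foo", "manual")); // {manual=[4, 6]}
    }
}
```

In a real Lucene index the identifier would be an untokenized, indexed
field on each page Document, and the page number a stored field read
back from each hit.  The grouping step is the "sort games" part, done
here with a sorted map of sorted sets.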

     Erik



