lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Best way to index document page by page?
Date Sat, 25 Jun 2005 22:21:25 GMT

: Is this the best way to do this?  Is there a way to store location
: information associated with each term within a field?  Note that there can
: be thousands of documents containing thousands of pages.

It depends on what's important to you.

(FYI: i'm document with "file" in the rest of this mail, to avoid
confusion with lucene's Document concept)

if your goal is just to find files containing words and order those
files by date, or some other fixed attribute of the file -- then
making one Document per page is probably fine.

If you really want to take advantage of score *per file* then indexing
each page is going to make it hard for you to say which files matches a
particular search 'best'

Something to consider is a two pronged approach: one Document per
file (with a special doctype:file field), and one Document per page
(doctype:page).  Initial searches can be BooleanQueries with a mandatory
"doctype:file" clause, and then once you've got a list of fileIds
sorted by score, do a secondary BooleanSearch on: your initial search, the
fileIds from the first page of your results, "doctype:page".



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message