lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Searching for a phrase which spans on 2 pages
Date Tue, 11 Jul 2006 20:23:44 GMT
I can think of several approaches, but the experts will no doubt show me up
<G>..

1> index the entire book as a single document. Also, index the beginning and
ending offset of each page in separate "documents". Assuming you can find
the offset in the big doc of each matching phrase, you can also find out
what pages each match starts on and ends on, and if they are different you'd
know to display two pages. I'm not sure what this does to relevancy, though.

2> Index, say, the last 10 words of the previous page and the first 10 words
of the next page along with the current page. When you returned data, you'd
have to make sure (again by match position) that your match wasn't entirely
within the 10 words you prepended or appended to the "match" page.

3> Have a series of "joiner" "documents". One for the last 9 words of page n
and the first 9 words of page n + 1 (along with the page number). Another set
for 8 before and 8 after, etc., down to 1. If your phrase was 10 words, you'd
search your normal pages and the 9 word "joiner" pages. Since each side of a
joiner holds one word fewer than the phrase, any match in the joiners would
have to be a page spanner. Again, what does that do to relevancy?
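
For 1>, the page lookup could be sketched roughly as below. This is plain Java, not Lucene API: PageLocator and its methods are made-up names, and it assumes pages are contiguous (so only the start offsets are needed) and that you already have the start and end character offsets of the match in the big document.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Lucene): maps character offsets in the
// whole-book text to 1-based page numbers.
final class PageLocator {
    // pageStarts[i] is the offset of the first character of page i + 1.
    // Assumed sorted ascending, with pageStarts[0] == 0.
    private final int[] pageStarts;

    PageLocator(int[] pageStarts) {
        this.pageStarts = pageStarts.clone();
    }

    // Returns the 1-based page that contains the given offset (offset >= 0).
    int pageOf(int offset) {
        int idx = Arrays.binarySearch(pageStarts, offset);
        if (idx < 0) {
            // On a miss, binarySearch returns -(insertionPoint) - 1. The page
            // containing the offset is the one just before the insertion point.
            idx = -idx - 2;
        }
        return idx + 1;
    }

    // True when the match [startOffset, endOffset) starts and ends on
    // different pages, i.e. two pages should be displayed.
    boolean spansPages(int startOffset, int endOffset) {
        return pageOf(startOffset) != pageOf(endOffset - 1);
    }
}
```

With page starts of 0, 100 and 250, a match covering offsets 95 to 105 starts on page 1 and ends on page 2, so it would be reported as a spanner.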

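For 2>, the "was the match entirely in the padding" check might look something like this. Again plain Java with invented names (PaddedPage, Hit, classify), and it assumes you can get the word position of the match within the padded page document and that a page is longer than the longest phrase.

```java
// Hypothetical helper (not part of Lucene): classifies a phrase match found
// in a page document that was padded with words from its neighbours.
final class PaddedPage {
    enum Hit { IGNORE, WITHIN_PAGE, SPANS_PREVIOUS, SPANS_NEXT }

    private final int prefixWords; // words borrowed from the end of the previous page
    private final int pageWords;   // words that really belong to this page

    PaddedPage(int prefixWords, int pageWords) {
        this.prefixWords = prefixWords;
        this.pageWords = pageWords;
    }

    // matchStart is the 0-based word position of the first matched word in the
    // padded document; matchLength is the number of words in the phrase.
    Hit classify(int matchStart, int matchLength) {
        int matchEnd = matchStart + matchLength;   // exclusive
        int pageStart = prefixWords;               // first real word of this page
        int pageEnd = prefixWords + pageWords;     // exclusive

        // Entirely inside the prepended or appended words: not this page's hit.
        if (matchEnd <= pageStart || matchStart >= pageEnd) {
            return Hit.IGNORE;
        }
        if (matchStart < pageStart) {
            return Hit.SPANS_PREVIOUS;
        }
        if (matchEnd > pageEnd) {
            return Hit.SPANS_NEXT;
        }
        return Hit.WITHIN_PAGE;
    }
}
```

One thing to watch: a phrase that straddles pages n and n + 1 will show up in both padded documents (as SPANS_NEXT on page n and SPANS_PREVIOUS on page n + 1), so you'd probably want to report only one of them.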

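For 3>, building the joiner text for a pair of adjacent pages could be sketched like this. The names (Joiners, build, joiner, joinerSizeFor) are invented for illustration, the pages are assumed to be already split into words the same way your analyzer would split them, and actually adding the result to the index (with the page number) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene): builds the text of "joiner"
// documents for the boundary between page n and page n + 1.
final class Joiners {
    // Returns one joiner text per k, for k = maxPhraseWords - 1 down to 1.
    static List<String> build(String[] pageN, String[] pageNext, int maxPhraseWords) {
        List<String> joiners = new ArrayList<>();
        for (int k = maxPhraseWords - 1; k >= 1; k--) {
            joiners.add(joiner(pageN, pageNext, k));
        }
        return joiners;
    }

    // Last k words of page n followed by the first k words of page n + 1
    // (fewer if a page is shorter than k words).
    static String joiner(String[] pageN, String[] pageNext, int k) {
        int tail = Math.min(k, pageN.length);
        int head = Math.min(k, pageNext.length);
        List<String> words = new ArrayList<>();
        words.addAll(Arrays.asList(pageN).subList(pageN.length - tail, pageN.length));
        words.addAll(Arrays.asList(pageNext).subList(0, head));
        return String.join(" ", words);
    }

    // A phrase of m words is searched against the joiners with m - 1 words per
    // side: neither side alone can hold the phrase, so any hit crosses the boundary.
    static int joinerSizeFor(int phraseWords) {
        return phraseWords - 1;
    }
}
```

The joiners for different k would need to be kept apart at search time (say, by a field that records k, which is my assumption rather than anything Lucene requires), so that a 10 word phrase is only run against the 9 word joiners.
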
Note that there is no requirement that every document have the same fields,
so your searches can be disjoint. Also, I'm assuming that you can reasonably
decide that, say, 10 word phrases are the max you'll respect, which may not
be true.

I have no idea whether these are reasonable approaches given your problem
domain....

Best
Erick
