lucene-java-user mailing list archives

From "Greg Shackles" <gshack...@gmail.com>
Subject Lucene implementation/performance question
Date Wed, 12 Nov 2008 15:47:47 GMT
I hope this isn't a dumb question. I'm fairly new to Lucene, so I've been
picking it up as I go.  Without going into too much detail, I need to store
pages of text and, for each word on each page, store detailed information
about it.  To do this, I have 2 indexes:

1) pages: this stores the full text of each page, along with identifying
information about it
2) words: this stores each word as its own entry, along with the page it was
on; entries are stored in the order the words appear on the page

When doing a search, not only do I need to return the page it was found on,
but also the details of the matching words.  Since I couldn't think of a
better way to do it, I first search the pages index and find any matching
pages.  Then I iterate the words on those pages to find where the match
occurred.  Obviously this is costly as far as execution time goes, but at
least it only has to get done for matching pages rather than every page.
Searches still take way longer than I'd like though, and the bottleneck is
almost entirely in the code to find the matches on the page.
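
For readers trying to picture the two-step lookup described above, here is a
minimal sketch in plain Java. It does not use the Lucene API at all: the
class TwoStepSearch, the WordHit type, and the two in-memory maps standing in
for the "pages" and "words" indexes are all hypothetical names made up for
illustration, and whitespace tokenization is an assumption. It only shows why
the per-page scan in step 2 is where the time goes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TwoStepSearch {
    /** One match: which page it was on and the word's position on that page. */
    public static final class WordHit {
        public final int pageId;
        public final int position;
        public final String word;

        WordHit(int pageId, int position, String word) {
            this.pageId = pageId;
            this.position = position;
            this.word = word;
        }
    }

    // Hypothetical stand-in for the "pages" index: term -> IDs of pages containing it.
    private final Map<String, Set<Integer>> pagesByTerm = new HashMap<>();
    // Hypothetical stand-in for the "words" index: page ID -> words in page order.
    private final Map<Integer, List<String>> wordsByPage = new HashMap<>();

    /** Adds a page to both stand-in indexes, splitting on whitespace (an assumption). */
    public void addPage(int pageId, String text) {
        List<String> words = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            words.add(raw);
            pagesByTerm.computeIfAbsent(raw, k -> new TreeSet<>()).add(pageId);
        }
        wordsByPage.put(pageId, words);
    }

    /** Step 1: find matching pages. Step 2: scan each page's words for the match. */
    public List<WordHit> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<WordHit> hits = new ArrayList<>();
        Set<Integer> pages = pagesByTerm.getOrDefault(needle, Collections.emptySet());
        for (int pageId : pages) {
            List<String> words = wordsByPage.get(pageId);
            // This linear scan over every word on the page is the costly part.
            for (int pos = 0; pos < words.size(); pos++) {
                if (words.get(pos).equals(needle)) {
                    hits.add(new WordHit(pageId, pos, needle));
                }
            }
        }
        return hits;
    }
}
```

The cost of step 2 grows with the total number of words on all matching
pages, not with the number of matches, which is consistent with the
bottleneck described above.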

One simple optimization I can think of is to store the pages in smaller
blocks so that the scope of the iteration is made smaller.  This is not
really ideal, since I also need the ability to narrow down results based on
other words that can/can't appear on the same page, which would mean storing
3 full copies of every word on every page (one in each of the 3 resulting
indexes).

I know this isn't a Java performance forum so I'll try to keep this Lucene
related, but has anyone done anything similar to this, or have any
comments/ideas on how to improve it?  I'm in the process of trying to speed
things up since I need to perform many searches often over very large sets
of pages.  Thanks!

- Greg
