lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Lucene implementation/performance question
Date Wed, 12 Nov 2008 16:17:17 GMT
If I may suggest, could you expand upon what you're trying to
accomplish? Why do you care about the detailed information
about each word? The reason I'm suggesting this is "the XY
problem". That is, people often ask for details about a specific
approach when what they really need is a different approach.

There are TermFrequencies, TermPositions,
TermVectorOffsetInfo and a bunch of other stuff that I don't
know the details of. Some of it may work for you, but it's hard
to say without a better idea of what it is you're trying to
accomplish...
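
To illustrate why positional information might replace the per-page scan
described in the quoted message, here is a toy sketch in plain Java. It is
not Lucene code: PositionalIndex, addPage, and positions are hypothetical
names made up for this example, and the whitespace tokenization is an
assumption. The idea is only that if word positions are recorded at index
time, a lookup returns the matching pages and the positions of the match
together, with no second pass over the page. Lucene's TermPositions exposes
comparable per-document positions, though whether that fits depends on what
"detailed information" needs to come back.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional index (hypothetical, not Lucene): term -> page id -> word positions. */
final class PositionalIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    /** Splits the page text on whitespace and records the position of every word. */
    void addPage(int pageId, String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return; // nothing to index
        }
        String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            postings.computeIfAbsent(words[pos], t -> new TreeMap<>())
                    .computeIfAbsent(pageId, p -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns page id -> positions of the term, without rescanning any page text. */
    Map<Integer, List<Integer>> positions(String term) {
        Map<Integer, List<Integer>> hit = postings.get(term.toLowerCase(Locale.ROOT));
        return hit == null ? Collections.<Integer, List<Integer>>emptyMap() : hit;
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.addPage(1, "the quick brown fox jumps over the lazy dog");
        index.addPage(2, "a lazy afternoon");
        // Each lookup yields both the matching pages and where the word occurs.
        System.out.println(index.positions("lazy"));
        System.out.println(index.positions("the"));
    }
}
```

Per-word details could then be keyed by (page id, position) and fetched only
for the positions that actually matched, rather than by iterating every word
on every matching page.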

Best
Erick

On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <gshackles@gmail.com> wrote:

> I hope this isn't a dumb question; I'm fairly new to Lucene, so I've been
> picking it up as I go.  Without going into too much detail, I need to store
> pages of text and, for each word on each page, store detailed information
> about it.  To do this, I have 2 indexes:
>
> 1) pages: this stores the full text of the page, and identifying
> information about it
> 2) words: this stores a single word per entry, along with the page it was
> on; entries are stored in the order the words appear on the page
>
> When doing a search, not only do I need to return the page it was found on,
> but also the details of the matching words.  Since I couldn't think of a
> better way to do it, I first search the pages index and find any matching
> pages.  Then I iterate the words on those pages to find where the match
> occurred.  Obviously this is costly as far as execution time goes, but at
> least it only has to get done for matching pages rather than every page.
> Searches still take way longer than I'd like though, and the bottleneck is
> almost entirely in the code to find the matches on the page.
>
> One simple optimization I can think of is to store the pages in smaller
> blocks so that the scope of the iteration is made smaller.  This is not
> really ideal, since I also need the ability to narrow down results based on
> other words that can/can't appear on the same page, which would mean storing
> 3 full copies of every word on every page (one in each of the 3 resulting
> indexes).
>
> I know this isn't a Java performance forum so I'll try to keep this Lucene
> related, but has anyone done anything similar to this, or have any
> comments/ideas on how to improve it?  I'm in the process of trying to speed
> things up since I need to perform many searches often over very large sets
> of pages.  Thanks!
>
> - Greg
>
