lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Greg Shackles" <gshack...@gmail.com>
Subject Re: Lucene implementation/performance question
Date Wed, 12 Nov 2008 16:47:39 GMT
Hi Erick,

Thanks for the response, sorry that I was somewhat vague in the reasoning
for my implementation in the first post.  I should have mentioned that the
word details are not details of the Lucene document, but are attributes
about the word that I am storing.  Some examples are position on the actual
page, color, size, bold/italic/underlined, and most importantly, the text as
it appeared on the page.  The reason the last one matters is that things
like punctuation, spacing and capitalization can vary between the result and
the search term, and can affect how I need to process the results
afterwords.  I am certainly open to the idea of a new approach if it would
improve on things, I admit I am new to Lucene so if there are options I'm
unaware of I'd love to learn about them.

Just to sum it up with an example, let's say we have a page of text that
stores "This is a page of text."  We want to search for the text "of text",
which would span multiple words in the word index.  The final result would
need to contain "of" and "text", along with the details about each as
described before.  I hope this is more helpful!

- Greg

On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <erickerickson@gmail.com>wrote:

> If I may suggest, could you expand upon what you're trying to
> accomplish? Why do you care about the detailed information
> about each word? The reason I'm suggesting this is "the XY
> problem". That is, people often ask for details about a specific
> approach when what they really need is a different approach
>
> There are TermFrequencies, TermPositions,
> TermVectorOffsetInfo and a bunch of other stuff that I don't
> know the details of that may work for you if we had
> a better idea of what it is you're trying to accomplish...
>
> Best
> Erick
>
> On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <gshackles@gmail.com>
> wrote:
>
> > I hope this isn't a dumb question or anything, I'm fairly new to Lucene
> so
> > I've been picking it up as I go pretty much.  Without going into too much
> > detail, I need to store pages of text, and for each word on each page,
> > store
> > detailed information about it.  To do this, I have 2 indexes:
> >
> > 1) pages: this stores the full text of the page, and identifying
> > information
> > about it
> > 2) words: this stores a single word, along with the page it was on and is
> > stored in the order they appear on the page
> >
> > When doing a search, not only do I need to return the page it was found
> on,
> > but also the details of the matching words.  Since I couldn't think of a
> > better way to do it, I first search the pages index and find any matching
> > pages.  Then I iterate the words on those pages to find where the match
> > occurred.  Obviously this is costly as far as execution time goes, but at
> > least it only has to get done for matching pages rather than every page.
> > Searches still take way longer than I'd like though, and the bottleneck
> is
> > almost entirely in the code to find the matches on the page.
> >
> > One simple optimization I can think of is store the pages in smaller
> blocks
> > so that the scope of the iteration is made smaller.  This is not really
> > ideal, since I also need the ability to narrow down results based on
> other
> > words that can/can't appear on the same page which would mean storing 3
> > full
> > copies of every word on every page (one in each of the 3 resulting
> > indexes).
> >
> > I know this isn't a Java performance forum so I'll try to keep this
> Lucene
> > related, but has anyone done anything similar to this, or have any
> > comments/ideas on how to improve it?  I'm in the process of trying to
> speed
> > things up since I need to perform many searches often over very large
> sets
> > of pages.  Thanks!
> >
> > - Greg
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message