lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <markrmil...@gmail.com>
Subject Re: Lucene implementation/performance question
Date Wed, 12 Nov 2008 17:01:03 GMT
If your new to Lucene, this might be a little much (and maybe I am not 
fully understand the problem), but you might try:

Add the attributes to the words in a payload with a PayloadAnalyzer. Do 
searching as normal. Use the new PayloadSpanUtil class to get the 
payloads for the matching words. (Think of the PayloadSpanUtil as a 
highlighter - you give it a query, it gives you the payloads to the 
terms that match). The PayloadSpanUtil class is a bit experimental, but 
I'll fix anything you run into with it.

- Mark

Greg Shackles wrote:
> Hi Erick,
>
> Thanks for the response, sorry that I was somewhat vague in the reasoning
> for my implementation in the first post.  I should have mentioned that the
> word details are not details of the Lucene document, but are attributes
> about the word that I am storing.  Some examples are position on the actual
> page, color, size, bold/italic/underlined, and most importantly, the text as
> it appeared on the page.  The reason the last one matters is that things
> like punctuation, spacing and capitalization can vary between the result and
> the search term, and can affect how I need to process the results
> afterwords.  I am certainly open to the idea of a new approach if it would
> improve on things, I admit I am new to Lucene so if there are options I'm
> unaware of I'd love to learn about them.
>
> Just to sum it up with an example, let's say we have a page of text that
> stores "This is a page of text."  We want to search for the text "of text",
> which would span multiple words in the word index.  The final result would
> need to contain "of" and "text", along with the details about each as
> described before.  I hope this is more helpful!
>
> - Greg
>
> On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <erickerickson@gmail.com>wrote:
>
>   
>> If I may suggest, could you expand upon what you're trying to
>> accomplish? Why do you care about the detailed information
>> about each word? The reason I'm suggesting this is "the XY
>> problem". That is, people often ask for details about a specific
>> approach when what they really need is a different approach
>>
>> There are TermFrequencies, TermPositions,
>> TermVectorOffsetInfo and a bunch of other stuff that I don't
>> know the details of that may work for you if we had
>> a better idea of what it is you're trying to accomplish...
>>
>> Best
>> Erick
>>
>> On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <gshackles@gmail.com>
>> wrote:
>>
>>     
>>> I hope this isn't a dumb question or anything, I'm fairly new to Lucene
>>>       
>> so
>>     
>>> I've been picking it up as I go pretty much.  Without going into too much
>>> detail, I need to store pages of text, and for each word on each page,
>>> store
>>> detailed information about it.  To do this, I have 2 indexes:
>>>
>>> 1) pages: this stores the full text of the page, and identifying
>>> information
>>> about it
>>> 2) words: this stores a single word, along with the page it was on and is
>>> stored in the order they appear on the page
>>>
>>> When doing a search, not only do I need to return the page it was found
>>>       
>> on,
>>     
>>> but also the details of the matching words.  Since I couldn't think of a
>>> better way to do it, I first search the pages index and find any matching
>>> pages.  Then I iterate the words on those pages to find where the match
>>> occurred.  Obviously this is costly as far as execution time goes, but at
>>> least it only has to get done for matching pages rather than every page.
>>> Searches still take way longer than I'd like though, and the bottleneck
>>>       
>> is
>>     
>>> almost entirely in the code to find the matches on the page.
>>>
>>> One simple optimization I can think of is store the pages in smaller
>>>       
>> blocks
>>     
>>> so that the scope of the iteration is made smaller.  This is not really
>>> ideal, since I also need the ability to narrow down results based on
>>>       
>> other
>>     
>>> words that can/can't appear on the same page which would mean storing 3
>>> full
>>> copies of every word on every page (one in each of the 3 resulting
>>> indexes).
>>>
>>> I know this isn't a Java performance forum so I'll try to keep this
>>>       
>> Lucene
>>     
>>> related, but has anyone done anything similar to this, or have any
>>> comments/ideas on how to improve it?  I'm in the process of trying to
>>>       
>> speed
>>     
>>> things up since I need to perform many searches often over very large
>>>       
>> sets
>>     
>>> of pages.  Thanks!
>>>
>>> - Greg
>>>
>>>       
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message