Message-ID: <b63a7e9e0811120915r1f34ec34j7dc65e2ab76ef55a@mail.gmail.com>
Date: Wed, 12 Nov 2008 12:15:32 -0500
From: "Greg Shackles" <gshackles@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Lucene implementation/performance question
In-Reply-To: <491B0BCF.2000302@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_7568_238726.1226510132405"
References: <b63a7e9e0811120747h3a17509ehef3c4efa7d53a69d@mail.gmail.com>
	 <359a92830811120817h18df7a98o19063aeb28dfc22c@mail.gmail.com>
	 <b63a7e9e0811120847l3aba0f5qb7aec98bc65b69a1@mail.gmail.com>
	 <491B0BCF.2000302@gmail.com>

------=_Part_7568_238726.1226510132405
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Hey Mark,

This sounds very interesting.  Is there any documentation or examples I
could look at?  I did a quick search but didn't find much.  It might just
be that I don't know how payloads work in Lucene, but I don't yet see how
this would do what I need.  My reasoning is this: you'd have an index that
stores all the text for a particular page.  Would you be able to attach
payload information to individual words on that page?  In my head that
seems like the job of a second index, which is exactly why I added the
word index.
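
To make the question concrete, here is a rough sketch of the kind of
per-word data I would want to attach, packed into a byte array.  This is
plain Java and does not touch any Lucene API.  The class name, the field
layout and the style flags are all made up for illustration, and I'm only
assuming that a payload can carry an arbitrary byte array per word.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: packs the attributes of one word into a byte array.
// Layout (an assumption, not a Lucene format):
//   4 ints (x, y, rgb color, size in points), 1 style byte, then UTF-8 text.
final class WordPayloadSketch {
    static final int BOLD = 1, ITALIC = 2, UNDERLINED = 4;
    static final int HEADER_BYTES = 4 * 4 + 1;

    static byte[] pack(int x, int y, int rgb, int sizePt, int style,
                       String originalText) {
        byte[] text = originalText.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + text.length);
        buf.putInt(x).putInt(y).putInt(rgb).putInt(sizePt);
        buf.put((byte) style);
        buf.put(text);
        return buf.array();
    }

    // Horizontal position of the word on the page.
    static int x(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt(0);
    }

    // True if the given style flag (BOLD, ITALIC, UNDERLINED) is set.
    static boolean hasStyle(byte[] payload, int flag) {
        return (payload[HEADER_BYTES - 1] & flag) != 0;
    }

    // The word exactly as it appeared on the page, punctuation included.
    static String originalText(byte[] payload) {
        return new String(payload, HEADER_BYTES,
                          payload.length - HEADER_BYTES,
                          StandardCharsets.UTF_8);
    }
}
```

If something like this can be stored per word in the page index and read
back for the matching terms, I would not need the separate word index.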

Any details you can give would be great, as I need to keep moving on this
project quickly.  I will also say that I'm somewhat wary of using an
experimental class, since this is an important project that can't wait on
many development cycles for the class to become fully working.  That said,
if it can give me serious speed improvements it's definitely worth
considering.
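
For reference, the step I'm trying to speed up is roughly the following
scan over the words of each matching page.  This is a simplified sketch
and not my actual code: the class and method names are invented, and the
normalization rule (lowercase, drop punctuation) is just an assumption
about how the search term and the stored words get compared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the per-page scan: given the words of a page in
// order, find where a multi-word search phrase occurs.
final class PhraseScanSketch {
    // Lowercase and strip everything except letters and digits, so that
    // "text." on the page matches "text" in the query.
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Returns the index of the first word of every occurrence of the phrase.
    static List<Integer> findPhrase(List<String> pageWords, String phrase) {
        String[] terms = phrase.trim().split("\\s+");
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + terms.length <= pageWords.size(); i++) {
            boolean match = true;
            for (int j = 0; j < terms.length && match; j++) {
                match = normalize(pageWords.get(i + j))
                        .equals(normalize(terms[j]));
            }
            if (match) {
                starts.add(i);
            }
        }
        return starts;
    }
}
```

For the page "This is a page of text." and the query "of text", this
returns the single start index 4, and the word details are then looked up
for the words at positions 4 and 5.  Doing that for every word of every
matching page is where the time goes.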

- Greg


On Wed, Nov 12, 2008 at 12:01 PM, Mark Miller <markrmiller@gmail.com> wrote:

> If you're new to Lucene, this might be a little much (and maybe I am not
> fully understanding the problem), but you might try:
>
> Add the attributes to the words in a payload with a PayloadAnalyzer. Do
> searching as normal. Use the new PayloadSpanUtil class to get the payloads
> for the matching words. (Think of the PayloadSpanUtil as a highlighter - you
> give it a query, it gives you the payloads to the terms that match). The
> PayloadSpanUtil class is a bit experimental, but I'll fix anything you run
> into with it.
>
> - Mark
>
>
> Greg Shackles wrote:
>
>> Hi Erick,
>>
>> Thanks for the response, and sorry that I was vague about the reasoning
>> for my implementation in the first post.  I should have mentioned that the
>> word details are not details of the Lucene document, but attributes of
>> the word that I am storing.  Some examples are position on the actual
>> page, color, size, bold/italic/underlined, and most importantly, the text
>> as it appeared on the page.  The reason the last one matters is that
>> things like punctuation, spacing and capitalization can vary between the
>> result and the search term, and can affect how I need to process the
>> results afterwards.  I am certainly open to a new approach if it would
>> improve on things.  I admit I am new to Lucene, so if there are options
>> I'm unaware of I'd love to learn about them.
>>
>> Just to sum it up with an example, let's say we have a page of text that
>> stores "This is a page of text."  We want to search for the text "of
>> text", which would span multiple words in the word index.  The final
>> result would need to contain "of" and "text", along with the details
>> about each as described before.  I hope this is more helpful!
>>
>> - Greg
>>
>> On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson
>> <erickerickson@gmail.com> wrote:
>>
>>
>>
>>> If I may suggest, could you expand upon what you're trying to
>>> accomplish? Why do you care about the detailed information
>>> about each word? The reason I'm suggesting this is "the XY
>>> problem". That is, people often ask for details about a specific
>>> approach when what they really need is a different approach.
>>>
>>> There are TermFrequencies, TermPositions,
>>> TermVectorOffsetInfo and a bunch of other stuff that I don't
>>> know the details of that may work for you if we had
>>> a better idea of what it is you're trying to accomplish...
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <gshackles@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>>> I hope this isn't a dumb question or anything, I'm fairly new to Lucene
>>>> so I've been picking it up as I go pretty much.  Without going into too
>>>> much detail, I need to store pages of text, and for each word on each
>>>> page, store detailed information about it.  To do this, I have 2
>>>> indexes:
>>>>
>>>> 1) pages: this stores the full text of the page, and identifying
>>>> information about it
>>>> 2) words: this stores a single word, along with the page it was on and
>>>> is stored in the order they appear on the page
>>>>
>>>> When doing a search, not only do I need to return the page it was found
>>>> on, but also the details of the matching words.  Since I couldn't think
>>>> of a better way to do it, I first search the pages index and find any
>>>> matching pages.  Then I iterate the words on those pages to find where
>>>> the match occurred.  Obviously this is costly as far as execution time
>>>> goes, but at least it only has to get done for matching pages rather
>>>> than every page.  Searches still take way longer than I'd like though,
>>>> and the bottleneck is almost entirely in the code to find the matches
>>>> on the page.
>>>>
>>>> One simple optimization I can think of is store the pages in smaller
>>>> blocks so that the scope of the iteration is made smaller.  This is not
>>>> really ideal, since I also need the ability to narrow down results
>>>> based on other words that can/can't appear on the same page which would
>>>> mean storing 3 full copies of every word on every page (one in each of
>>>> the 3 resulting indexes).
>>>>
>>>> I know this isn't a Java performance forum so I'll try to keep this
>>>> Lucene related, but has anyone done anything similar to this, or have
>>>> any comments/ideas on how to improve it?  I'm in the process of trying
>>>> to speed things up since I need to perform many searches often over
>>>> very large sets of pages.  Thanks!
>>>>
>>>> - Greg
>>>>
>>>>
>>>>
>>>
>>
>>
>
>

------=_Part_7568_238726.1226510132405--