Date: Wed, 12 Nov 2008 12:15:32 -0500
From: "Greg Shackles"
To: java-user@lucene.apache.org
Subject: Re: Lucene implementation/performance question

Hey Mark,

This sounds very interesting. Is there any documentation or are there
any examples I could see? I did a quick search but didn't really find
much. It might just be that I don't know how payloads work in Lucene,
but I'm not sure how this would actually do what I need. My reasoning
is this: you'd have an index that stores all the text for a particular
page. Would you be able to attach payload information to individual
words on that page? In my head it seems like that would be the job of a
second index, which is exactly why I added the word index.

Any details you can give would be great, as I need to keep moving on
this project quickly. I will also say that I'm somewhat wary of using an
experimental class, since this is a really important project that won't
be able to wait on a lot of development cycles to get the class fully
working. That said, if it can give me serious speed improvements, it's
definitely worth considering.
- Greg

On Wed, Nov 12, 2008 at 12:01 PM, Mark Miller wrote:

> If you're new to Lucene, this might be a little much (and maybe I am
> not fully understanding the problem), but you might try:
>
> Add the attributes to the words in a payload with a PayloadAnalyzer.
> Do searching as normal. Use the new PayloadSpanUtil class to get the
> payloads for the matching words. (Think of the PayloadSpanUtil as a
> highlighter: you give it a query, it gives you the payloads of the
> terms that match.) The PayloadSpanUtil class is a bit experimental,
> but I'll fix anything you run into with it.
>
> - Mark
>
> Greg Shackles wrote:
>
>> Hi Erick,
>>
>> Thanks for the response, and sorry that I was somewhat vague about
>> the reasoning for my implementation in the first post. I should have
>> mentioned that the word details are not details of the Lucene
>> document, but attributes of the word that I am storing. Some examples
>> are position on the actual page, color, size, bold/italic/underlined,
>> and most importantly, the text as it appeared on the page. The reason
>> the last one matters is that things like punctuation, spacing and
>> capitalization can vary between the result and the search term, and
>> can affect how I need to process the results afterwards. I am
>> certainly open to a new approach if it would improve on things. I
>> admit I am new to Lucene, so if there are options I'm unaware of I'd
>> love to learn about them.
>>
>> Just to sum it up with an example, let's say we have a page of text
>> that stores "This is a page of text." We want to search for the text
>> "of text", which would span multiple words in the word index. The
>> final result would need to contain "of" and "text", along with the
>> details about each as described before. I hope this is more helpful!
>>
>> - Greg
>>
>> On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson wrote:
>>
>>> If I may suggest, could you expand upon what you're trying to
>>> accomplish?
>>> Why do you care about the detailed information about each word?
>>> The reason I'm suggesting this is "the XY problem". That is, people
>>> often ask for details about a specific approach when what they
>>> really need is a different approach.
>>>
>>> There are TermFrequencies, TermPositions, TermVectorOffsetInfo and a
>>> bunch of other stuff that I don't know the details of that may work
>>> for you, if we had a better idea of what it is you're trying to
>>> accomplish...
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles wrote:
>>>
>>>> I hope this isn't a dumb question; I'm fairly new to Lucene, so
>>>> I've pretty much been picking it up as I go. Without going into
>>>> too much detail, I need to store pages of text and, for each word
>>>> on each page, store detailed information about it. To do this, I
>>>> have 2 indexes:
>>>>
>>>> 1) pages: this stores the full text of the page, and identifying
>>>> information about it
>>>> 2) words: this stores a single word, along with the page it was on;
>>>> words are stored in the order they appear on the page
>>>>
>>>> When doing a search, not only do I need to return the page the
>>>> match was found on, but also the details of the matching words.
>>>> Since I couldn't think of a better way to do it, I first search
>>>> the pages index and find any matching pages. Then I iterate over
>>>> the words on those pages to find where the match occurred.
>>>> Obviously this is costly as far as execution time goes, but at
>>>> least it only has to be done for matching pages rather than every
>>>> page. Searches still take way longer than I'd like though, and the
>>>> bottleneck is almost entirely in the code that finds the matches
>>>> on the page.
>>>>
>>>> One simple optimization I can think of is to store the pages in
>>>> smaller blocks so that the scope of the iteration is made smaller.
>>>> This is not really ideal, since I also need the ability to narrow
>>>> down results based on other words that can/can't appear on the
>>>> same page, which would mean storing 3 full copies of every word on
>>>> every page (one in each of the 3 resulting indexes).
>>>>
>>>> I know this isn't a Java performance forum, so I'll try to keep
>>>> this Lucene related, but has anyone done anything similar to this,
>>>> or does anyone have comments/ideas on how to improve it? I'm in
>>>> the process of trying to speed things up, since I need to perform
>>>> many searches, often over very large sets of pages. Thanks!
>>>>
>>>> - Greg
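
The question in the reply at the top of this thread, whether extra
information can be attached to individual words of a page inside one
index, can be pictured with a small plain-Java sketch. It is not Lucene
code and does not use PayloadSpanUtil or any other Lucene class; the
names PagePayloadSketch, WordInfo, addWord and findPhrase are made up
for illustration, and the "bold" flag stands in for whatever attributes
are stored per word. It only shows the general idea: each occurrence of
a term keeps its position and its own attributes, so a phrase match can
return the details of the matching words without iterating over every
word on the page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only; none of these types are Lucene classes.
final class PagePayloadSketch {

    /** Hypothetical per-word attributes, playing the role of a payload. */
    static final class WordInfo {
        final int position;     // 0-based word position on the page
        final String original;  // text exactly as it appeared on the page
        final boolean bold;     // example formatting attribute

        WordInfo(int position, String original, boolean bold) {
            this.position = position;
            this.original = original;
            this.bold = bold;
        }
    }

    // Normalized term -> its occurrences on the page, each with its attributes.
    private final Map<String, List<WordInfo>> postings = new HashMap<>();

    /** Lower-cases a word and strips everything except letters and digits. */
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    /** Records one word of the page under its normalized form. */
    void addWord(int position, String original, boolean bold) {
        String term = normalize(original);
        if (term.isEmpty()) {
            return; // pure punctuation: nothing to index
        }
        postings.computeIfAbsent(term, k -> new ArrayList<>())
                .add(new WordInfo(position, original, bold));
    }

    /** Returns, for every occurrence of the phrase, the attributes of its words. */
    List<List<WordInfo>> findPhrase(String phrase) {
        String[] terms = phrase.trim().split("\\s+");
        List<List<WordInfo>> matches = new ArrayList<>();
        for (WordInfo first : occurrences(normalize(terms[0]))) {
            List<WordInfo> match = new ArrayList<>();
            match.add(first);
            for (int i = 1; i < terms.length; i++) {
                // The i-th phrase term must sit exactly i positions after the first.
                WordInfo next = at(normalize(terms[i]), first.position + i);
                if (next == null) {
                    break;
                }
                match.add(next);
            }
            if (match.size() == terms.length) {
                matches.add(match);
            }
        }
        return matches;
    }

    private List<WordInfo> occurrences(String term) {
        return postings.getOrDefault(term, Collections.<WordInfo>emptyList());
    }

    private WordInfo at(String term, int position) {
        for (WordInfo w : occurrences(term)) {
            if (w.position == position) {
                return w;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PagePayloadSketch page = new PagePayloadSketch();
        String[] words = "This is a page of text.".split(" ");
        for (int i = 0; i < words.length; i++) {
            page.addWord(i, words[i], i == 5); // pretend the last word is bold
        }
        for (List<WordInfo> match : page.findPhrase("of text")) {
            for (WordInfo w : match) {
                System.out.println(w.position + " " + w.original + " bold=" + w.bold);
            }
        }
    }
}
```

The example page and query are the ones from the thread ("This is a
page of text." searched for "of text"). The original surface form
"text." is kept in the attributes even though the lookup key is the
normalized "text", which is the distinction between search term and
on-page text described earlier in the thread. How closely this maps to
Lucene payloads, and what it would cost at scale, is not something this
sketch tries to answer.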