From: Mark Miller <markrmiller@gmail.com>
Date: Wed, 12 Nov 2008 14:22:59 -0500
To: java-user@lucene.apache.org
Subject: Re: Lucene implementation/performance question

Greg Shackles wrote:
> Thanks! This all actually sounds promising, I just want to make sure I'm
> thinking about this correctly. Does this make sense?
>
> Indexing process:
>
> 1) Get list of all words for a page and their attributes, stored in some
> sort of data structure
> 2) Concatenate the text from those words (space separated) into a string
> that represents the entire page
> 3) When adding the page document to the index, run it through a custom
> analyzer that attaches the payloads to the tokens
>    * this would have to follow along in the word list from #1 to get the
> payload information for each token
>    * would also have to tokenize the word we are storing to see how many
> Lucene tokens it would translate to (to make sure the right payloads go
> with the right tokens)

Right, sounds like you have it spot on. That second * from 3 looks like a
possible tricky part.
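To make that concrete, here's a very rough sketch of the kind of filter I
mean (untested, written against the 2.4-style TokenStream API; WordInfo and
encode() are stand-ins for whatever structure you build in step 1). You'd
wrap your tokenizer with it in a small custom Analyzer:

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

/** Stand-in for the per-word attributes from step 1. */
class WordInfo {
  int x, y;          // position on the page
  boolean bold;
  String original;   // the text exactly as it appeared on the page

  /** Pack the attributes into bytes however you like. */
  byte[] encode() {
    return (x + "," + y + "," + (bold ? 1 : 0) + "," + original).getBytes();
  }
}

/** Attaches one payload per token, walking the word list in page order. */
class WordInfoPayloadFilter extends TokenFilter {
  private final Iterator words;

  WordInfoPayloadFilter(TokenStream in, List wordInfos) {
    super(in);
    this.words = wordInfos.iterator();
  }

  public Token next(Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token != null && words.hasNext()) {
      WordInfo info = (WordInfo) words.next();
      token.setPayload(new Payload(info.encode()));
    }
    return token;
  }
}

Note the naive one-payload-per-token advance is exactly where that second *
bites: when one of your words tokenizes into several Lucene tokens, you'd
advance the word iterator based on the token's offsets rather than once per
token.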
> I haven't totally analyzed the searching process yet since I want to get
> my head around the storage part first, but I imagine that would be the
> easier part anyway. Does this approach sound reasonable?

Sounds good.

> My other concern is your comment about isolating results. If I'm reading
> it correctly, it means that I'd have to do the search in multiple passes,
> one to get the individual docs containing the matches, and then one query
> for each of those to get the payloads within them?

Right...you'd do it essentially how highlighting works: you do the search
to get the docs of interest, and then redo the search somewhat to get the
highlights/payloads for an individual doc at a time. You are redoing some
work, but if you think about it, getting that info for every match (there
could be tons) doesn't make much sense when someone might only look at the
top couple of results, or say 10 at a time. Whether it's feasible depends
on your use case, though. Most find it efficient enough to do highlighting
with, so I'm assuming it should be good enough here.
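If it helps, the second pass might look something like this (untested, and
assuming I'm remembering the PayloadSpanUtil signatures right; I use a
throwaway one-doc RAMDirectory index rather than the highlighter's
MemoryIndex just to keep the sketch obvious; the "text" field name is made
up, and payloadAnalyzer must be built with the same word list you used at
index time so the payloads get re-attached):

import java.io.IOException;
import java.util.Collection;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.PayloadSpanUtil;
import org.apache.lucene.store.RAMDirectory;

class SecondPass {
  /** Re-runs the query against a one-doc index so the returned payloads
      all belong to a single page. Returns a Collection of byte[]. */
  static Collection payloadsForPage(String pageText, Query query,
      Analyzer payloadAnalyzer) throws IOException {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, payloadAnalyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("text", pageText, Field.Store.NO,
        Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    try {
      PayloadSpanUtil psu = new PayloadSpanUtil(reader);
      return psu.getPayloadsForQuery(query);
    } finally {
      reader.close();
    }
  }
}

So for your "of text" example, you'd pass in the page's text plus a
PhraseQuery on the same field and get back the encoded attributes for "of"
and "text". You only get a bag of payload bytes back, so encode enough in
each payload (the word's offsets, the original text) to line the results up
afterwards.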
> Thanks again for your help on this one.
>
> - Greg
>
> On Wed, Nov 12, 2008 at 12:52 PM, Mark Miller wrote:
>
>> Here is a great power point on payloads from Michael Busch:
>> www.us.apachecon.com/us2007/downloads/AdvancedIndexing*Lucene*.ppt.
>> Essentially, you can store metadata at each term position, so it's an
>> excellent place to store attributes of the term - they are very fast to
>> load, efficient, etc.
>>
>> You can check out the spans test classes for a small example using the
>> PayloadSpanUtil...it's actually fairly simple and short, and the main
>> reason I consider it experimental is that it hasn't really been used too
>> much to my knowledge (who knows though). If you have a problem, you'll
>> know quickly and I'll fix it quickly. It should work fine though.
>> Overall, the approach wouldn't take that much code, so I don't think
>> you'd be out a lot of time.
>>
>> The PayloadSpanUtil takes an IndexReader and a query and returns the
>> payloads for the terms in the IndexReader that match the query. If you
>> end up with multiple docs in the IndexReader, be sure to isolate the
>> query down to the exact doc you want the payloads from (the Span scoring
>> mode of the highlighter actually puts the doc in a fast MemoryIndex,
>> which only holds one doc, and uses an IndexReader from the MemoryIndex).
>>
>> Greg Shackles wrote:
>>
>>> Hey Mark,
>>>
>>> This sounds very interesting. Is there any documentation or examples I
>>> could see? I did a quick search but didn't really find much. It might
>>> just be that I don't know how payloads work in Lucene, but I'm not sure
>>> how I would see this actually doing what I need. My reasoning is
>>> this...you'd have an index that stores all the text for a particular
>>> page. Would you be able to attach payload information to individual
>>> words on that page? In my head it seems like that would be the job of a
>>> second index, which is exactly why I added the word index.
>>>
>>> Any details you can give would be great as I need to keep moving on
>>> this project quickly. I will also say that I'm somewhat wary of using
>>> an experimental class since this is a really important project that
>>> won't be able to wait on a lot of development cycles to get the class
>>> fully working. That said, if it can give me serious speed improvements
>>> it's definitely worth considering.
>>>
>>> - Greg
>>>
>>> On Wed, Nov 12, 2008 at 12:01 PM, Mark Miller wrote:
>>>
>>>> If you're new to Lucene, this might be a little much (and maybe I
>>>> don't fully understand the problem), but you might try:
>>>>
>>>> Add the attributes to the words in a payload with a PayloadAnalyzer.
>>>> Do searching as normal. Use the new PayloadSpanUtil class to get the
>>>> payloads for the matching words. (Think of the PayloadSpanUtil as a
>>>> highlighter - you give it a query, it gives you the payloads for the
>>>> terms that match). The PayloadSpanUtil class is a bit experimental,
>>>> but I'll fix anything you run into with it.
>>>>
>>>> - Mark
>>>>
>>>> Greg Shackles wrote:
>>>>
>>>>> Hi Erick,
>>>>>
>>>>> Thanks for the response, sorry that I was somewhat vague in the
>>>>> reasoning for my implementation in the first post. I should have
>>>>> mentioned that the word details are not details of the Lucene
>>>>> document, but are attributes about the word that I am storing. Some
>>>>> examples are position on the actual page, color, size,
>>>>> bold/italic/underlined, and most importantly, the text as it
>>>>> appeared on the page. The reason the last one matters is that things
>>>>> like punctuation, spacing and capitalization can vary between the
>>>>> result and the search term, and can affect how I need to process the
>>>>> results afterwards. I am certainly open to the idea of a new
>>>>> approach if it would improve on things; I admit I am new to Lucene,
>>>>> so if there are options I'm unaware of I'd love to learn about them.
>>>>>
>>>>> Just to sum it up with an example, let's say we have a page of text
>>>>> that stores "This is a page of text." We want to search for the text
>>>>> "of text", which would span multiple words in the word index. The
>>>>> final result would need to contain "of" and "text", along with the
>>>>> details about each as described before. I hope this is more helpful!
>>>>>
>>>>> - Greg
>>>>>
>>>>> On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson
>>>>> <erickerickson@gmail.com> wrote:
>>>>>
>>>>>> If I may suggest, could you expand upon what you're trying to
>>>>>> accomplish? Why do you care about the detailed information about
>>>>>> each word? The reason I'm suggesting this is "the XY problem". That
>>>>>> is, people often ask for details about a specific approach when
>>>>>> what they really need is a different approach.
>>>>>>
>>>>>> There are TermFreqVectors, TermPositions, TermVectorOffsetInfo and
>>>>>> a bunch of other stuff that I don't know the details of that may
>>>>>> work for you if we had a better idea of what it is you're trying to
>>>>>> accomplish...
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>> On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles wrote:
>>>>>>
>>>>>>> I hope this isn't a dumb question or anything, I'm fairly new to
>>>>>>> Lucene so I've been picking it up as I go pretty much. Without
>>>>>>> going into too much detail, I need to store pages of text, and for
>>>>>>> each word on each page, store detailed information about it. To do
>>>>>>> this, I have 2 indexes:
>>>>>>>
>>>>>>> 1) pages: this stores the full text of the page, and identifying
>>>>>>> information about it
>>>>>>> 2) words: this stores a single word, along with the page it was
>>>>>>> on, and is stored in the order the words appear on the page
>>>>>>>
>>>>>>> When doing a search, not only do I need to return the page it was
>>>>>>> found on, but also the details of the matching words. Since I
>>>>>>> couldn't think of a better way to do it, I first search the pages
>>>>>>> index and find any matching pages. Then I iterate the words on
>>>>>>> those pages to find where the match occurred. Obviously this is
>>>>>>> costly as far as execution time goes, but at least it only has to
>>>>>>> get done for matching pages rather than every page. Searches still
>>>>>>> take way longer than I'd like though, and the bottleneck is almost
>>>>>>> entirely in the code to find the matches on the page.
>>>>>>>
>>>>>>> One simple optimization I can think of is to store the pages in
>>>>>>> smaller blocks so that the scope of the iteration is made smaller.
>>>>>>> This is not really ideal, since I also need the ability to narrow
>>>>>>> down results based on other words that can/can't appear on the
>>>>>>> same page, which would mean storing 3 full copies of every word on
>>>>>>> every page (one in each of the 3 resulting indexes).
>>>>>>>
>>>>>>> I know this isn't a Java performance forum so I'll try to keep
>>>>>>> this Lucene related, but has anyone done anything similar to this,
>>>>>>> or have any comments/ideas on how to improve it? I'm in the
>>>>>>> process of trying to speed things up since I need to perform many
>>>>>>> searches often over very large sets of pages. Thanks!
>>>>>>>
>>>>>>> - Greg

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org