From: Grant Ingersoll
To: java-user@lucene.apache.org
Subject: Re: Payloads, Tokenizers, and Filters. Oh My!
Message-Id: <70CC07CE-5F57-4884-8E23-35F541FCE1CD@apache.org>
In-Reply-To: <473E21AC.5000702@uwaterloo.ca>
Date: Sat, 17 Nov 2007 07:41:20 -0500

Inline below

On Nov 16, 2007, at 6:03 PM, Tricia Williams wrote:

> Hi All,
>
> I'll explain what I'm working on, and then I'll ask my two
> questions.
>
> I'm working on the issue https://issues.apache.org/jira/browse/SOLR-380
> which is a feature request that allows one to index a "Structured
> Document" which is anything that can be represented by XML in order
> to provide more context to hits in the result set. This allows us
> to do things like query the index for "Canada" and be able to not
> only say that that query matched a document titled "Some Nonsense"
> but also that the query term appeared on page 7 of chapter 1. We
> can then take this one step further and markup/highlight the image
> of this page based on our OCR and position hit.
> For example:
>
> Some
> text from page one of a book.Some more text
> from page seven of a book. Oh and I'm from Canada. book>
>
> I accomplished this by creating a custom Tokenizer which strips
> the xml elements and stores them as a Payload at each of the Tokens
> created from the character data in the input. The payload is the
> string that describes the XPath at that location. So for
> the payload is
> "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"
>
> The other part of this work is the SolrHighlighter which is less
> important to this list. I retrieve the TermPositions for the
> Query's Terms and use the TermPosition functionality to get back the
> payload for the hits and build output which shows hit positions
> categorized by the payload they are associated with.
>
> QUESTION 1: Applying TokenFilters to my Tokenizer creates some
> strange (in my opinion) behavior. First of all the TermPositions
> change and second the Payload is removed. Is this the expected
> behavior, or is this a bug?
> With the Payload being an "experimental
> feature" I can understand if this persistence just hasn't been
> implemented yet. But is it, or will it be?
>

Do you have other TokenFilters in your Analyzer? Are you reusing the
same Token or creating a new one in your TokenFilters? If you are
creating a new one, you will have to set the payload on it yourself,
as it won't be copied over. Perhaps we should add a constructor that
takes a payload. On the other hand, I think we are going to remove the
Payload object in favor of just using the byte array.

> In the following example I will denote a token by {pos, text>,}:
>
> input: Dog, and Cat
>
> XmlPayloadTokenizer:
> {1,,},{2,, class[name='mammalia'][startPos='0']>},{3,, class[name='mammalia'][startPos='0']>}
> StopFilter:
> {1,,},{2,, class[name='mammalia'][startPos='0']>}
> WordDelimiterFilter:
> {1,,<>} {2,,}
> LowerCaseFilter:
> {1,,<>} {2,,}
>
>
> QUESTION 2: As I explained I'm storing the String representing the
> XPath of the token as the Payload (well the ByteArray of the String)
> of each token. Is there a more efficient way to do this? Is this
> exploiting Payload functionality and will it turn around and bite me
> when I get to indexing hundreds of thousands of documents? Perhaps
> I shouldn't be relying on the Payload functionality before it is
> deemed not experimental?
>

I think this is reasonable. Michael Busch had a nice talk at ApacheCon
on payloads that you can find at
http://people.apache.org/~buschmi/apachecon/AdvancedIndexingLuceneAtlanta07.ppt

I guess you just want to be careful about how big your payloads get,
since a payload is stored with every token position. One of the
original use cases for payloads was for doing XPath queries.

Also, the only thing experimental about Payloads is the actual
signature of the methods, not the need for them. If anything, I think
you will see an expansion of payload capability in the future. Also
note that you will probably be interested in adding more Payload
querying capability.
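
To make the earlier point about filters that create a new Token a bit
more concrete, here is a minimal, self-contained sketch in Java. It
deliberately does not use the Lucene classes: the Token class, its
fields, and both helper methods below are hypothetical stand-ins made
up for illustration, so treat it as a picture of the failure mode and
the fix rather than as the actual Lucene API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class PayloadCopySketch {

    /** Hypothetical stand-in for a token: text, position increment, optional payload. */
    static final class Token {
        final String text;
        final int positionIncrement;
        byte[] payload; // null when no payload is attached

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Builds a NEW token for the lowercased text and forgets the payload. */
    static Token lowerCaseDroppingPayload(Token in) {
        return new Token(in.text.toLowerCase(Locale.ROOT), in.positionIncrement);
    }

    /** Builds a new token as above, but carries the payload across explicitly. */
    static Token lowerCaseKeepingPayload(Token in) {
        Token out = new Token(in.text.toLowerCase(Locale.ROOT), in.positionIncrement);
        if (in.payload != null) {
            out.payload = in.payload.clone(); // copy so the two tokens do not share the array
        }
        return out;
    }

    /** Renders a payload as text for printing; "none" when it is missing. */
    static String describe(Token t) {
        String p = (t.payload == null) ? "none" : new String(t.payload, StandardCharsets.UTF_8);
        return t.text + " payload=" + p;
    }

    public static void main(String[] args) {
        Token canada = new Token("Canada", 1);
        canada.payload = "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println(describe(lowerCaseDroppingPayload(canada)));
        System.out.println(describe(lowerCaseKeepingPayload(canada)));
    }
}
```

The first line printed shows the token with no payload, the second
shows it with the XPath string intact. A filter that modifies and
returns the same Token instance would not need the copy at all.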
And also note, I am in the process of adding the ability to get
payloads from Spans, but I am not sure if this gets into 2.3 or not.

Cheers,
Grant