From: Tricia Williams
To: java-user@lucene.apache.org
Date: Fri, 16 Nov 2007 16:03:08 -0700
Subject: Payloads, Tokenizers, and Filters. Oh My!

Hi All,

I'll explain what I'm working on, and then I'll ask my two questions.

I'm working on SOLR-380 (https://issues.apache.org/jira/browse/SOLR-380), a feature request for indexing a "Structured Document", meaning anything that can be represented as XML, so that hits in the result set carry more context. This lets us query the index for "Canada" and report not only that the query matched a document titled "Some Nonsense" but also that the query term appeared on page 7 of chapter 1. We can then take this one step further and mark up/highlight the image of that page based on our OCR and the hit position.

For example:

Some text from page one of a book.
Some more text from page seven of a book. Oh and I'm from Canada.

I accomplished this by creating a custom Tokenizer that strips the XML elements and stores them as a Payload on each of the Tokens created from the character data in the input. The payload is the string that describes the XPath at that location. So for the tokens on page seven the payload is "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"

The other part of this work is the SolrHighlighter, which is less important to this list. I retrieve the TermPositions for the Query's Terms and use them to get back the payload for each hit, then build output that shows hit positions grouped by the payload they are associated with.

QUESTION 1:
Applying TokenFilters to my Tokenizer produces some strange (in my opinion) behavior. First, the term positions change; second, the Payload is removed. Is this the expected behavior, or is it a bug? With the Payload being an "experimental feature" I can understand if carrying it through the filters just hasn't been implemented yet. But is it, or will it be?
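
For readers who want the tokenizer idea described above in concrete form, here is a minimal sketch of attaching the enclosing XPath to every token. It is not the SOLR-380 code and uses no Lucene classes: XPathPayloadSketch, PathToken, and tokenize are hypothetical names, the word-splitting rule is a simplification, and the sample markup in main is only a guess at XML that would yield the payload string quoted above.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-alone sketch; NOT the SOLR-380 tokenizer.
class XPathPayloadSketch {

    // One emitted token: 1-based position, term text, and the path string
    // that would be stored as the token's payload.
    static final class PathToken {
        final int position;
        final String term;
        final String xpath;

        PathToken(int position, String term, String xpath) {
            this.position = position;
            this.term = term;
            this.xpath = xpath;
        }
    }

    static List<PathToken> tokenize(String xml) {
        List<PathToken> tokens = new ArrayList<>();
        Deque<String> steps = new ArrayDeque<>(); // one entry per open element
        int position = 0;
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Coalesce adjacent text so a word is never split across events.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Build a step such as page[name='7'] from the element and its attributes.
                    StringBuilder step = new StringBuilder(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        step.append('[').append(reader.getAttributeLocalName(i))
                            .append("='").append(reader.getAttributeValue(i)).append("']");
                    }
                    steps.addLast(step.toString());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    steps.removeLast();
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    String path = currentPath(steps);
                    // Simplified splitting: runs of letters, digits, and apostrophes are terms.
                    for (String word : reader.getText().split("[^\\p{L}\\p{N}']+")) {
                        if (!word.isEmpty()) {
                            tokens.add(new PathToken(++position, word, path));
                        }
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return tokens;
    }

    // Joins the open-element steps, outermost first, into a single path string.
    private static String currentPath(Deque<String> steps) {
        StringBuilder path = new StringBuilder();
        for (String step : steps) {
            path.append('/').append(step);
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Assumed sample markup, for illustration only.
        String xml = "<book title='Some Nonsense'><chapter title='One'>"
            + "<page name='1'>Some text from page one of a book.</page>"
            + "<page name='7'>Some more text from page seven of a book. "
            + "Oh and I'm from Canada.</page></chapter></book>";
        for (PathToken t : tokenize(xml)) {
            System.out.println("{" + t.position + "," + t.term + "," + t.xpath + "}");
        }
    }
}
```

In a real Lucene Tokenizer the path string would be converted to bytes and set as the Token's payload rather than kept in a separate field.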

In the following example I will denote a token by {pos,,}:

input: Dog, and Cat
XmlPayloadTokenizer: {1,,},{2,,},{3,,}
StopFilter: {1,,},{2,,}
WordDelimiterFilter: {1,,<>} {2,,}
LowerCaseFilter: {1,,<>} {2,,}

QUESTION 2:
As I explained, I'm storing the String representing the XPath of the token as the Payload (well, the byte array of the String) of each token. Is there a more efficient way to do this? Is this exploiting Payload functionality, and will it turn around and bite me when I get to indexing hundreds of thousands of documents? Perhaps I shouldn't be relying on the Payload functionality until it is no longer deemed experimental?

I feel these questions both relate to Lucene proper rather than Solr, which is why I've posted here. If you think solr-user is a better place for them, let me know.

Thanks for your input!
Tricia
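
P.S. On Question 2, one possible way to shrink the per-token payload is sketched below, purely as an illustration and not as something Lucene or Solr provides. Since many tokens share the same XPath, each distinct path could be given a small integer id, with only that id stored as the payload in a variable-length encoding. PathDictionarySketch and its methods are hypothetical names, and the sketch assumes the id-to-path table is kept somewhere retrievable at search time (for example alongside the document), which is a design decision not covered here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: dictionary-encode path strings into compact payload bytes.
class PathDictionarySketch {
    private final Map<String, Integer> idsByPath = new HashMap<>();
    private final List<String> pathsById = new ArrayList<>();

    // Returns the payload bytes for a path, assigning a new id on first sight.
    byte[] encode(String xpath) {
        Integer id = idsByPath.get(xpath);
        if (id == null) {
            id = pathsById.size();
            idsByPath.put(xpath, id);
            pathsById.add(xpath);
        }
        return writeVInt(id);
    }

    // Maps payload bytes back to the original path string.
    String decode(byte[] payload) {
        return pathsById.get(readVInt(payload));
    }

    // Variable-length encoding of a non-negative int: 7 data bits per byte,
    // low-order group first, high bit set on every byte except the last.
    static byte[] writeVInt(int value) {
        byte[] buffer = new byte[5];
        int length = 0;
        while ((value & ~0x7F) != 0) {
            buffer[length++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buffer[length++] = (byte) value;
        return Arrays.copyOf(buffer, length);
    }

    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // last byte of the encoded value
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        String path = "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']";
        PathDictionarySketch dictionary = new PathDictionarySketch();
        byte[] compact = dictionary.encode(path);
        byte[] raw = path.getBytes(StandardCharsets.UTF_8);
        System.out.println("raw bytes: " + raw.length + ", dictionary bytes: " + compact.length);
        System.out.println("decoded: " + dictionary.decode(compact));
    }
}
```

The trade-off is that the payload alone no longer describes the location; whatever reads the payload also needs the dictionary, so this only helps if that table is cheap to store and load.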