From: Shahan Khatchadourian
Date: Fri, 13 Jul 2007 12:43:59 -0400
To: java-user@lucene.apache.org
Subject: Token offset values for custom Tokenizer
Message-ID: <4697ABCF.1090208@cs.toronto.edu>

Hi,

I am storing custom values in the Tokens provided by a Tokenizer, but when retrieving them from the index the values don't match.
I've looked in the LIA (Lucene in Action) book, but it's not current on this point, since it says term vectors aren't stored. I'm using Lucene Nightly 146, but the same thing has happened with older versions.

Looking at the internals, DocumentWriter seems to keep track of the end offset that was placed into the index and modifies the token values (adding 1), but I'm not sure whether I should be concerned about that. No existing analyzers are used when adding the document, so all the offsets are generated manually.

Any suggestions on how the token offsets should be stored? Is this valid?

Token, start, end
aaa, 0, 3
bbb, 4, 7
ccc, 8, 11

Thanks,
Shahan
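
For reference, the offsets in the table can be checked with a small standalone sketch. It does not use the Lucene API: OffsetToken and tokenize are hypothetical stand-ins, and the input text "aaa bbb ccc" (single spaces between terms) is an assumption, since the post does not show the source text. The sketch follows the convention described in Lucene's Token documentation, where the start offset is the position of the first character and the end offset is one greater than the position of the last character.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetSketch {

    // Hypothetical stand-in for a Lucene Token: term text plus character offsets.
    static final class OffsetToken {
        final String term;
        final int start; // position of the first character (inclusive)
        final int end;   // one greater than the position of the last character

        OffsetToken(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + ", " + start + ", " + end;
        }
    }

    // Split on whitespace, recording [start, end) offsets into the original text.
    static List<OffsetToken> tokenize(String text) {
        List<OffsetToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new OffsetToken(text.substring(start, i), start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "aaa bbb ccc"; // assumed input, not given in the post
        for (OffsetToken t : tokenize(text)) {
            // Slicing the original text with the offsets should give back the term.
            System.out.println(t + " -> \"" + text.substring(t.start, t.end) + "\"");
        }
    }
}
```

Under that assumed input, the sketch yields the same three rows as the table (aaa 0-3, bbb 4-7, ccc 8-11), and text.substring(start, end) returns each term unchanged. It says nothing about what DocumentWriter does with the offsets at indexing time.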