From: "Grant Ingersoll"
Date: Thu, 22 Jul 2004 15:36:56 -0400
Subject: Re: Can I retrieve token offsets from Hits?
List-Id: "Lucene Users List"

I am sensing a common theme throughout a variety of threads here:
namely, a need for a pluggable set of Readers and Writers (think
Interface) that can write metadata about an Index/Document/Field/Term
(which I see the TermVector stuff as being an instance of) and can be
given to Lucene from the application level (or at least the application
specifies which ones to use).

I proposed something like this a bit earlier, but didn't see any interest.
I suppose I should implement it, as this is how things get going, but it
would be nice to have some input on requirements and on whether the
people who know Lucene better than I do think this is possible.

Just my two cents on this one. Doesn't help you w/ an immediate solution,
but I think it would help us all in the long run. If this existed, one
could easily implement a Token position store and ask it for all of this
information, I think. :-)

-Grant

>>> markharw00d@yahoo.co.uk 07/22/04 03:19PM >>>
> I wonder if the information in termPositions or termVector can be used
> to restore token position from indices?

TermFreqVector gives you term frequencies (not positions). This can be of
use in computing document similarities.

TermPositions gives you the sequence number, e.g. in the last sentence
the word "sequence" was token number 5 (not character position 5). This
is used for PhraseQueries to determine proximity.

Character position is what is required to do highlighting, and this isn't
stored anywhere currently. The requirements for such a store would be
indexed access by doc number, and a compact means of storing
term/character position info. This could add considerable size to the
index.

Previously we concluded that highlighting is only typically done on the
first 10 or so records in a result set anyway and that re-analyzing the
text shouldn't add too much of an overhead.
If you want to limit the size of an individual document's text to be
tokenized, use highlighter.setMaxDocBytesToAnalyze().
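
To make the distinction between token sequence numbers and character
offsets concrete, here is a minimal sketch in plain Java. It does not use
Lucene: TokenInfo and TokenOffsetDemo are hypothetical names invented for
illustration, the tokenizer simply splits on non-alphanumeric characters,
and tokens are counted from 1 only to match the example above. It shows
the kind of information that re-analyzing the text recovers for
highlighting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these classes are not part of Lucene.
final class TokenInfo {
    final String term;
    final int position;    // token sequence number, counted from 1 in this sketch
    final int startOffset; // character offset of the first character (inclusive)
    final int endOffset;   // character offset just past the last character (exclusive)

    TokenInfo(String term, int position, int startOffset, int endOffset) {
        this.term = term;
        this.position = position;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class TokenOffsetDemo {
    // Re-analyzes the text: splits it into runs of letters/digits and records
    // both the sequence number and the character offsets of each token.
    static List<TokenInfo> analyze(String text) {
        List<TokenInfo> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            position++;
            tokens.add(new TokenInfo(text.substring(start, i), position, start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "TermPositions gives you the sequence number";
        for (TokenInfo t : analyze(text)) {
            System.out.println(t.position + "\t" + t.startOffset + "-" + t.endOffset + "\t" + t.term);
        }
    }
}
```

For the sample sentence, "sequence" comes out as token 5 but spans
characters 28 to 36; a phrase query needs the former, a highlighter the
latter.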
If you find tokenizing slow, check you aren't using StandardAnalyzer - I
have found that to be slow (see
http://marc.theaimsgroup.com/?l=lucene-dev&m=108080820315779&w=2)

Cheers
Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org