From: "Keiron McCammon (JIRA)"
To: java-dev@lucene.apache.org
Date: Thu, 25 May 2006 21:20:29 +0000 (GMT+00:00)
Message-ID: <1268543.1148592029790.JavaMail.jira@brutus>
Subject: [jira] Created: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars

TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars
----------------------------------------------------------------------------------------------------------

         Key: LUCENE-579
         URL: http://issues.apache.org/jira/browse/LUCENE-579
     Project: Lucene - Java
        Type: Bug
  Components: Analysis
    Versions: 1.9
    Reporter: Keiron McCammon

If you add multiple values for a field with term vector positions and offsets enabled and one of the values ends with a non-term then the offsets for the terms from subsequent values are wrong.

For example (note the '.' in the first value):

    IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
    Document doc = new Document();

    doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED,
                      Field.TermVector.WITH_POSITIONS_OFFSETS));
    doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED,
                      Field.TermVector.WITH_POSITIONS_OFFSETS));

    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    IndexSearcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(new MatchAllDocsQuery());

    Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
        new QueryScorer(new TermQuery(new Term("", "camera")),
                        searcher.getIndexReader(), ""));

    for (int i = 0; i < hits.length(); ++i) {
        TermPositionVector v = (TermPositionVector)
            searcher.getIndexReader().getTermFreqVector(hits.id(i), "");

        StringBuilder str = new StringBuilder();
        for (String s : hits.doc(i).getValues("")) {
            str.append(s);
            str.append(" ");
        }
        System.out.println(str);

        TokenStream tokenStream = TokenSources.getTokenStream(v, false);

        String[] terms = v.getTerms();
        int[] freq = v.getTermFrequencies();

        for (int j = 0; j < terms.length; ++j) {
            System.out.print(terms[j] + ":" + freq[j] + ":");

            int[] pos = v.getTermPositions(j);
            System.out.print(Arrays.toString(pos));

            TermVectorOffsetInfo[] offset = v.getOffsets(j);
            for (int k = 0; k < offset.length; ++k) {
                System.out.print(":");
                System.out.print(str.substring(offset[k].getStartOffset(),
                                               offset[k].getEndOffset()));
            }
            System.out.println();
        }
    }
    searcher.close();

If I run the above I get:

    one:1:[0]:one
    two:1:[1]: tw

Note that the offsets for the second term are off by 1. It seems that the length of the stored value is not taken into account when calculating the offsets for the terms of the next value.

I noticed this problem when using the highlight contrib package, which can make use of term vectors for highlighting. I also noticed that the offsets for the second value start one past the end of the previous value, so when concatenating the field values to pass to the highlighter I had to append a ' ' character after each string. That is quite useful, but not documented anywhere.
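
The arithmetic behind the output above can be sketched without Lucene. This is only an illustration under an assumption that the report does not confirm: that the indexer advances the base offset for the next value by the end offset of the last token plus one, rather than by the length of the value plus one. The class OffsetSketch and its methods are hypothetical names made up for this sketch, and the tokenizer is a rough stand-in for SimpleAnalyzer that only splits on non-letters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene code.
final class OffsetSketch {

    /** Returns {start, end} spans of letter runs within a single value. */
    static List<int[]> letterTokens(String value) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            if (!Character.isLetter(value.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < value.length() && Character.isLetter(value.charAt(i))) {
                i++;
            }
            spans.add(new int[] {start, i});
        }
        return spans;
    }

    /**
     * Computes {start, end} offsets across all values of one field.
     * useValueLength == false: base advances by (last token end + 1),
     *   the ASSUMED behaviour that would explain the report.
     * useValueLength == true: base advances by (value length + 1),
     *   which matches values joined with a single ' ' separator.
     * Simplification: a value with no tokens advances the base by 1
     * in the first mode.
     */
    static List<int[]> fieldOffsets(String[] values, boolean useValueLength) {
        List<int[]> out = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int lastEnd = 0;
            for (int[] span : letterTokens(value)) {
                out.add(new int[] {base + span[0], base + span[1]});
                lastEnd = span[1];
            }
            base += (useValueLength ? value.length() : lastEnd) + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] values = {"one.", "two"};
        String joined = "one. two "; // values joined with ' ' as in the report
        for (boolean useValueLength : new boolean[] {false, true}) {
            for (int[] o : fieldOffsets(values, useValueLength)) {
                System.out.println("useValueLength=" + useValueLength
                    + " [" + o[0] + "," + o[1] + ") -> '"
                    + joined.substring(o[0], o[1]) + "'");
            }
        }
    }
}
```

Under that assumption the second term gets offsets 4 to 7 in the first mode, which selects " tw" from the joined string as in the output above, and 5 to 8 in the second mode, which selects "two".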