lucene-dev mailing list archives

From "Shahan Khatchadourian (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars
Date Sat, 14 Jul 2007 21:42:04 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512743 ]

Shahan Khatchadourian commented on LUCENE-579:
----------------------------------------------

DocumentWriter seems to be the culprit: it adds 1 to the previous token's endOffset. It may
not be possible to provide token offsets that "undo" this operation, because tokens are grouped
by field and the order in which they are handled does not necessarily correspond to document
order. This would also interfere with custom synonym tokens, since custom token offsets are
no longer guaranteed to be preserved.
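
To make the arithmetic concrete, here is a small stand-alone sketch that does not use Lucene at
all. It assumes the base offset for each subsequent field value advances by the last token's
endOffset + 1, as described above, and compares that with a hypothetical alternative that
advances by the value's length + 1. The class and method names (OffsetDriftDemo, letterTokens,
slices) are made up for illustration, and the letter-run tokenizer is only a rough stand-in for
SimpleAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetDriftDemo {

    // Finds runs of letters in one value; each int[] is {start, end}, relative to that value.
    static List<int[]> letterTokens(String value) {
        List<int[]> tokens = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            if (Character.isLetter(value.charAt(i))) {
                int start = i;
                while (i < value.length() && Character.isLetter(value.charAt(i))) {
                    i++;
                }
                tokens.add(new int[] {start, i});
            } else {
                i++;
            }
        }
        return tokens;
    }

    // Computes modelled offsets for every token and slices them out of the values
    // joined with a trailing ' ' after each one (as the issue reporter does).
    // useValueLength == false: base += last token endOffset + 1 (assumed current behaviour).
    // useValueLength == true:  base += value.length() + 1      (hypothetical alternative).
    static List<String> slices(List<String> values, boolean useValueLength) {
        String joined = String.join(" ", values) + " ";
        List<String> out = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            List<int[]> tokens = letterTokens(value);
            for (int[] t : tokens) {
                out.add(joined.substring(base + t[0], base + t[1]));
            }
            if (useValueLength) {
                base += value.length() + 1;
            } else if (!tokens.isEmpty()) {
                base += tokens.get(tokens.size() - 1)[1] + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = List.of("one.", "two");
        // The trailing '.' is not part of any token, so the first model lands one char early.
        System.out.println("last endOffset + 1: " + slices(values, false));
        System.out.println("value length + 1:   " + slices(values, true));
    }
}
```

With the values from the issue ("one." and "two"), the first model yields " tw" for the second
term, which matches the reported output.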

I suggest adding a flag to Fieldable or IndexWriter that allows the exact provided offsets
to be stored rather than increased by one. There do not seem to be any sanity checks on
offset values when reading or writing the term vector.

A current workaround to this issue is to store the correct startOffset, but leave the endOffset
as -1. This has the effect of undoing the +1 of the previous token's endOffset but prevents
endOffset information from being available without retokenizing/reparsing the original document.
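
A rough sketch of why the -1 trick works, again without Lucene. It assumes the base offset for
the next value is computed as previous base + last token's endOffset + 1 and that stored offsets
are the token's offsets shifted by that base; the names EndOffsetWorkaroundDemo, nextBase and
stored are hypothetical.

```java
class EndOffsetWorkaroundDemo {

    // Assumed update rule: previous base + last token's endOffset + 1.
    static int nextBase(int base, int lastTokenEndOffset) {
        return base + lastTokenEndOffset + 1;
    }

    // Assumed storage rule: both offsets are shifted by the running base.
    static int[] stored(int base, int startOffset, int endOffset) {
        return new int[] {base + startOffset, base + endOffset};
    }

    public static void main(String[] args) {
        // Hypothetical custom tokens, one per field value, carrying the correct
        // document-wide startOffset and an endOffset of -1.
        int[][] lastTokenPerValue = { {0, -1}, {5, -1} };
        int base = 0;
        for (int[] t : lastTokenPerValue) {
            int[] s = stored(base, t[0], t[1]);
            System.out.println("stored start=" + s[0] + " end=" + s[1]);
            // -1 + 1 cancels out, so the base never moves away from 0.
            base = nextBase(base, t[1]);
        }
    }
}
```

Under these assumptions the startOffset is stored exactly as supplied, while every stored
endOffset is -1, which is the loss of information mentioned above.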



> TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-579
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9
>            Reporter: Keiron McCammon
>
> If you add multiple values for a field with term vector positions and offsets enabled,
> and one of the values ends with a non-term character, then the offsets for the terms from
> subsequent values are wrong. For example (note the '.' in the first value):
>         IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
>         Document doc = new Document();
>         doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>         doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>         writer.addDocument(doc);
>         writer.optimize();
>         writer.close();
>         IndexSearcher searcher = new IndexSearcher(directory);
>         Hits hits = searcher.search(new MatchAllDocsQuery());
>         Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
>             new QueryScorer(new TermQuery(new Term("", "camera")), searcher.getIndexReader(), ""));
>         for (int i = 0; i < hits.length(); ++i) {
>             TermPositionVector v = (TermPositionVector) searcher.getIndexReader().getTermFreqVector(
>                 hits.id(i), "");
>             StringBuilder str = new StringBuilder();
>             for (String s : hits.doc(i).getValues("")) {
>                 str.append(s);
>                 str.append(" ");
>             }
>             
>             System.out.println(str);
>             TokenStream tokenStream = TokenSources.getTokenStream(v, false);
>             String[] terms = v.getTerms();
>             int[] freq = v.getTermFrequencies();
>             for (int j = 0; j < terms.length; ++j) {
>                 System.out.print(terms[j] + ":" + freq[j] + ":");
>                 
>                 int[] pos = v.getTermPositions(j);
>                 
>                 System.out.print(Arrays.toString(pos));
>                 
>                 TermVectorOffsetInfo[] offset = v.getOffsets(j); 
>                 for (int k = 0; k < offset.length; ++k) {
>                     
>                     System.out.print(":");
>                     System.out.print(str.substring(offset[k].getStartOffset(), offset[k].getEndOffset()));
>                 }
>                 
>                 System.out.println();
>             }
>         }
>         searcher.close();
> If I run the above I get:
>         one:1:[0]:one
>         two:1:[1]: tw
> Note that the offsets for the second term are off by 1.
> It seems that the length of the stored value is not taken into account when calculating
> the offset for the terms of the next value.
> I noticed this problem when using the highlight contrib package, which can make use of
> term vectors for highlighting. I also noticed that the offset for the second string is +1
> from the end of the previous value, so when concatenating the field values to pass to the
> highlighter I had to append a ' ' character after each string... which is quite useful, but
> not documented anywhere.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

