lucene-dev mailing list archives

From "Andrew Duffy (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars
Date Wed, 31 Dec 2008 03:28:44 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Duffy updated LUCENE-579:
--------------------------------

    Attachment: offsets.patch

I've attached a patch to 2.4's DocInverterPerField.java that fixes this. The problem is in
line 160, which stores the starting offset for the next value of the same field:

 - If a field value has delimiter text after its last token, that text is ignored.
 - If there is no extra delimiter text after the last token, the offsets are off by +1 for
the tokens in the second value, +2 for the third value and so on.
 - The problem is hidden when there is exactly one delimiter character after each value.

The patch removes the +1 completely and uses the length of the string to adjust offsets for
fields with a string value. Fields with reader or token stream values can't easily be fixed,
but they can't be stored either, so they are much less likely to affect anyone.
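
To illustrate the arithmetic, here is a minimal standalone sketch. It is not the Lucene code or the patch itself. It assumes a toy tokenizer that treats each run of letters as a token, and the names OffsetSketch, tokenize, offsets and render are made up for this example. The "old" rule starts the next value at the last token's end offset + 1; the "new" rule advances by the length of the string value. Offsets are checked against the values concatenated with no separator.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSketch {
    /** Toy tokenizer (an assumption): each run of letters is a token; offsets are shifted by base. */
    static List<int[]> tokenize(String value, int base) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            if (Character.isLetter(value.charAt(i))) {
                int start = i;
                while (i < value.length() && Character.isLetter(value.charAt(i))) {
                    i++;
                }
                out.add(new int[] {base + start, base + i});
            } else {
                i++;
            }
        }
        return out;
    }

    /** Token offsets for all values of one field, under the old or the new base-offset rule. */
    static List<int[]> offsets(String[] values, boolean useValueLength) {
        List<int[]> all = new ArrayList<>();
        int base = 0;
        for (String v : values) {
            List<int[]> toks = tokenize(v, base);
            all.addAll(toks);
            if (useValueLength) {
                base += v.length();                        // new rule: advance by the string length
            } else if (!toks.isEmpty()) {
                base = toks.get(toks.size() - 1)[1] + 1;   // old rule: last end offset + 1
            }
        }
        return all;
    }

    /** Slices the concatenated values (no separator) at each token's offsets, joined by '|'. */
    static String render(String[] values, boolean useValueLength) {
        String joined = String.join("", values);
        StringBuilder sb = new StringBuilder();
        for (int[] o : offsets(values, useValueLength)) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            // Clamp so that drifting offsets cannot run past the end of the text.
            int start = Math.min(o[0], joined.length());
            int end = Math.min(o[1], joined.length());
            sb.append(joined, start, end);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[][] cases = {{"one", "two"}, {"one.", "two"}, {"one..", "two"}};
        for (String[] values : cases) {
            System.out.println(String.join("", values)
                + "  old=" + render(values, false)
                + "  new=" + render(values, true));
        }
    }
}
```

With one trailing delimiter ("one.", "two") both rules agree, which is why the problem is easy to miss. With none or with two, only the length-based rule slices out "two" correctly.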

> TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-579
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9
>            Reporter: Keiron McCammon
>         Attachments: offsets.patch
>
>
> If you add multiple values for a field with term vector positions and offsets enabled
> and one of the values ends with a non-term character, then the offsets for the terms from
> subsequent values are wrong. For example (note the '.' in the first value):
>         IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
>         Document doc = new Document();
>         doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>         doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>         writer.addDocument(doc);
>         writer.optimize();
>         writer.close();
>         IndexSearcher searcher = new IndexSearcher(directory);
>         Hits hits = searcher.search(new MatchAllDocsQuery());
>         Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
>             new QueryScorer(new TermQuery(new Term("", "camera")), searcher.getIndexReader(), ""));
>         for (int i = 0; i < hits.length(); ++i) {
>             TermPositionVector v = (TermPositionVector) searcher.getIndexReader().getTermFreqVector(
>                 hits.id(i), "");
>             StringBuilder str = new StringBuilder();
>             for (String s : hits.doc(i).getValues("")) {
>                 str.append(s);
>                 str.append(" ");
>             }
>             
>             System.out.println(str);
>             TokenStream tokenStream = TokenSources.getTokenStream(v, false);
>             String[] terms = v.getTerms();
>             int[] freq = v.getTermFrequencies();
>             for (int j = 0; j < terms.length; ++j) {
>                 System.out.print(terms[j] + ":" + freq[j] + ":");
>                 
>                 int[] pos = v.getTermPositions(j);
>                 
>                 System.out.print(Arrays.toString(pos));
>                 
>                 TermVectorOffsetInfo[] offset = v.getOffsets(j); 
>                 for (int k = 0; k < offset.length; ++k) {
>                     
>                     System.out.print(":");
>                     System.out.print(str.substring(offset[k].getStartOffset(), offset[k].getEndOffset()));
>                 }
>                 
>                 System.out.println();
>             }
>         }
>         searcher.close();
> If I run the above I get:
>         one:1:[0]:one
>         two:1:[1]: tw
> Note that the offsets for the second term are off by 1.
> It seems that the length of the stored value is not taken into account when calculating
> the offsets for the next value of the field.
> I noticed this problem when using the highlight contrib package, which can make use of
> term vectors for highlighting. I also noticed that the offset for the second string is the
> end of the previous value + 1, so when concatenating the field's values to pass to the
> highlighter I had to append a ' ' character after each string...which is quite useful, but
> not documented anywhere.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

