lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Problem with TermVector offsets and positions not being preserved
Date Fri, 24 Aug 2012 16:51:59 GMT
Calling IndexReader.document() does not restore your 'original Document'
completely. This is really an age-old trap.
So don't update documents this way: it's fine to fetch their contents,
but nothing goes through the effort to ensure that things like term
vector parameters are the same as what you originally provided. That
would require extra disk seeks.

See https://issues.apache.org/jira/browse/LUCENE-3312 for an effort to
fix this trap as a Google Summer of Code project.
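
To make the trap concrete, here is a toy model in plain Java. It is only a
sketch: RoundTripSketch, ToyIndex, ToyField and VectorOpt are hypothetical
names invented for illustration and are not Lucene classes. It assumes you
still know, from your own schema, which fields were indexed with term
vector positions and offsets, and it shows why writing a fetched document
straight back drops those options while rebuilding the fields keeps them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are Lucene classes. */
class RoundTripSketch {

    /** Hypothetical stand-in for a field's term vector setting. */
    enum VectorOpt { NONE, WITH_POSITIONS_OFFSETS }

    /** A field as the caller builds it: name, stored value, indexing option. */
    static final class ToyField {
        final String name;
        final String value;
        final VectorOpt opt;

        ToyField(String name, String value, VectorOpt opt) {
            this.name = name;
            this.value = value;
            this.opt = opt;
        }
    }

    /** Toy index that keeps stored values and vector options in separate maps. */
    static final class ToyIndex {
        private final Map<String, Map<String, String>> stored = new HashMap<>();
        private final Map<String, Map<String, VectorOpt>> vectors = new HashMap<>();

        /** Replaces the document with this id, indexing each field as flagged. */
        void update(String id, List<ToyField> doc) {
            Map<String, String> values = new HashMap<>();
            Map<String, VectorOpt> opts = new HashMap<>();
            for (ToyField f : doc) {
                values.put(f.name, f.value);
                opts.put(f.name, f.opt);
            }
            stored.put(id, values);
            vectors.put(id, opts);
        }

        /** Like a stored-fields fetch: values come back, options do not. */
        List<ToyField> document(String id) {
            List<ToyField> doc = new ArrayList<>();
            for (Map.Entry<String, String> e : stored.get(id).entrySet()) {
                doc.add(new ToyField(e.getKey(), e.getValue(), VectorOpt.NONE));
            }
            return doc;
        }

        /** Reports how the given field of the given document was indexed. */
        VectorOpt vectorOpt(String id, String field) {
            return vectors.get(id).get(field);
        }
    }

    /** The trap: fetch, append a field, write the fetched document back. */
    static void naiveUpdate(ToyIndex index, String id, ToyField extra) {
        List<ToyField> doc = index.document(id);
        doc.add(extra);
        index.update(id, doc);
    }

    /** Safer: reuse the fetched values but reapply options from a known schema. */
    static void rebuildUpdate(ToyIndex index, String id, ToyField extra,
                              Map<String, VectorOpt> schema) {
        List<ToyField> doc = new ArrayList<>();
        for (ToyField f : index.document(id)) {
            doc.add(new ToyField(f.name, f.value,
                    schema.getOrDefault(f.name, VectorOpt.NONE)));
        }
        doc.add(extra);
        index.update(id, doc);
    }

    public static void main(String[] args) {
        Map<String, VectorOpt> schema = new HashMap<>();
        schema.put("body", VectorOpt.WITH_POSITIONS_OFFSETS);

        ToyIndex index = new ToyIndex();
        List<ToyField> original = new ArrayList<>();
        original.add(new ToyField("body", "some text", VectorOpt.WITH_POSITIONS_OFFSETS));
        index.update("42", original);

        naiveUpdate(index, "42", new ToyField("freqVal", "0.5", VectorOpt.NONE));
        System.out.println("naive:   " + index.vectorOpt("42", "body"));

        index.update("42", original);   // restore the original document
        rebuildUpdate(index, "42", new ToyField("freqVal", "0.5", VectorOpt.NONE), schema);
        System.out.println("rebuilt: " + index.vectorOpt("42", "body"));
    }
}
```

In real Lucene 3.6 code the equivalent of rebuildUpdate would mean
constructing each field again with the Store, Index and TermVector
settings you used at indexing time, rather than passing the fetched
Document back to updateDocument.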

On Wed, Aug 22, 2012 at 5:23 PM, Mike O'Leary <tmoleary@uw.edu> wrote:
> I have one more question about term vector positions and offsets being
> preserved. My co-worker is working on updating the documents in an index
> with a field that contains a numerical value derived from the term
> frequencies and inverse document frequencies of terms in the document.
> His first pass at doing this calculates these values, writes them along
> with document ids to a text file and then updates the documents by
> reading lines from the file, searching for the document that contains
> the id, adding the field to the document, and replacing the document in
> the index. Some of the fields in these documents have term vectors with
> offsets and positions. After the revised document is updated in the
> index, those fields' term vector offsets and positions are still found.
> After closing the searcher, reader and writer that are used in this
> process, the fields that have term vectors no longer have positions and
> offsets in them. His code looks like this:
>
> IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, _analyzer);
> IndexWriter writer = new IndexWriter(indexDir, config);
> IndexReader reader = IndexReader.open(writer, true);
> IndexSearcher searcher = new IndexSearcher(reader);
>
> while ((s = in.readLine()) != null) {
>     String[] tokens = s.split(",");
>     float fieldValue = Float.parseFloat(tokens[1].trim());
>     NumericField nField = new NumericField("freqVal", Field.Store.YES, true);
>     nField.setFloatValue(fieldValue);
>     String docId = tokens[0].trim();
>     Term docIdTerm = new Term("DocId", docId);
>     TermQuery query = new TermQuery(docIdTerm);
>     TopDocs hits = searcher.search(query, 2);
>
>     if (hits.scoreDocs.length != 1) {
>         throw new Exception("Unexpected number of documents in index with docId = " + docId);
>     }
>     int docNum = hits.scoreDocs[0].doc;
>     Document doc = searcher.doc(docNum);
>     doc.add(nField);
>     writer.updateDocument(docIdTerm, doc);
> }
> displayTermVectorInfo(dir);   // for debugging
> writer.close();
> displayTermVectorInfo(dir);   // for debugging
> reader.close();
> searcher.close();
>
> private static void displayTermVectorInfo(Directory dir) throws IOException, CorruptIndexException {
>     IndexReader reader = null;
>
>     try {
>         reader = IndexReader.open(dir);
>
>         for (int i = 0; i < reader.numDocs(); i++) {
>             Document doc = reader.document(i);
>             List<Fieldable> docFields = doc.getFields();
>
>             for (Fieldable field : docFields) {
>                 TermFreqVector termFreqVector = reader.getTermFreqVector(i, field.name());
>
>                 if (termFreqVector != null && termFreqVector instanceof TermPositionVector) {
>                     TermPositionVector termPositionVector = (TermPositionVector)termFreqVector;
>                     System.out.println("Field " + field.name());
>
>                     for (int j = 0; j < termFreqVector.size(); j++) {
>                         TermVectorOffsetInfo[] offsets = termPositionVector.getOffsets(j);
>
>                         for (TermVectorOffsetInfo offsetInfo : offsets) {
>                             System.out.println("offset: " + offsetInfo.getStartOffset()
>                                     + " " + offsetInfo.getEndOffset());
>                         }
>                     }
>                     for (int k = 0; k < termFreqVector.size(); k++) {
>                         int[] positions = termPositionVector.getTermPositions(k);
>
>                         for (int position : positions) {
>                             System.out.println("position: " + position);
>                         }
>                     }
>                 }
>             }
>         }
>     } finally {
>         if (reader != null) {
>             reader.close();
>         }
>     }
> }
>
> The first time displayTermVectorInfo is called, it displays offsets and
> positions for the fields that have term vectors with offsets and
> positions. The second time it is called, it doesn't display anything
> because none of the term vectors satisfy termFreqVector instanceof
> TermPositionVector. Is it supposed to work this way? What is it about
> closing the writer that alters the term vectors in the affected fields?
> Is there a way to add a field to the documents in an index in which this
> doesn't occur?
> Thanks,
> Mike
>
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Friday, July 20, 2012 5:59 PM
> To: java-user@lucene.apache.org
> Subject: Re: Problem with TermVector offsets and positions not being preserved
>
> On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary <tmoleary@uw.edu> wrote:
>> Hi Robert,
>> I'm not trying to determine whether a document has term vectors, I'm trying to determine
>> whether the term vectors that are in the index have offsets and positions stored.
>
> Right: what I'm trying to tell you is that offsets and positions are not an
> index-wide setting for a field: it's per-document.
>
> I think all the tools you are using to check these values are not doing it correctly:
> 1. DumpIndex is wrongly using values from the Document returned by
>    IndexReader.document(), but that doesn't and never did retrieve these
>    values (it would be 2 extra disk seeks per document to figure out the
>    term vector flags).
> 2. I haven't looked at Luke, but it's probably printing the "global" bits
>    from FieldInfos. It used to be that we wrote some bits for these
>    options. I don't even know what the purpose was, since these options
>    can be controlled on/off at a per-document level: they make no sense.
> Because of this we stopped writing these bits in 3.6 (we only write into
> FieldInfos if the field has any term vectors at all), and that's probably
> what's confusing you there.
>
> Again, if you really want to validate that a specific document has
> offsets/positions in its term vectors, you need to check that specific
> document with IndexReader.getTermFreqVector. There is no other way, since
> this can be controlled on a per-document basis for a field.
>
>
> --
> lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



-- 
lucidworks.com
