lucene-java-user mailing list archives

From "Mike O'Leary" <tmole...@uw.edu>
Subject RE: Problem with TermVector offsets and positions not being preserved
Date Sat, 21 Jul 2012 00:24:00 GMT
Hi Robert,
I'm not trying to determine whether a document has term vectors; I'm trying to determine whether
the term vectors that are in the index have offsets and positions stored. Shouldn't the Field
instance variables storeOffsetWithTermVector and storePositionWithTermVector be set
to true for a field that is defined to store offsets and positions in term vectors? They are
set to true in 3.5, but not in 3.6. When I open an index that I created with 3.6 in Luke,
it says the fields in question have term vectors enabled, but that offsets and positions are not
stored. Maybe once term vectors with offsets and positions are created, the values of
storeOffsetWithTermVector and storePositionWithTermVector no longer matter,
but I'd like to find out for sure whether offsets and positions are being handled correctly
in 3.6, because I need to produce indexes that a co-worker can use with a UI that does
fast vector term highlighting, and I'd like to be sure the indexes I have created work for
him.
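In case it helps, this is the kind of check I have in mind: an untested sketch against the 3.6 API that reads the per-field flags the index itself recorded (via ReaderUtil.getMergedFieldInfos), rather than the Field objects reconstructed from stored fields. The class name and the exact FieldInfos iteration are my assumptions; I haven't run this.

```java
import java.io.File;

import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.ReaderUtil;

public class ShowFieldFlags {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    try {
      // Merge the FieldInfos of all segments and print the term vector
      // flags the index actually recorded for each field.
      FieldInfos infos = ReaderUtil.getMergedFieldInfos(reader);
      for (int i = 0; i < infos.size(); i++) {
        FieldInfo fi = infos.fieldInfo(i);
        System.out.println(fi.name
            + " tv=" + fi.storeTermVector
            + " offsets=" + fi.storeOffsetWithTermVector
            + " positions=" + fi.storePositionWithTermVector);
      }
    } finally {
      reader.close();
    }
  }
}
```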
Thanks,
Mike

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Friday, July 20, 2012 4:05 PM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved

I think it's wrong for DumpIndex to look at term vector information from the Document that
was retrieved from IndexReader.document; that's basically just a way of getting access to your
stored fields.

This tool should be using something like IndexReader.getTermFreqVector for the document to
determine whether it has term vectors.
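Something along these lines (an untested sketch against the 3.x API, assuming `reader` is an open IndexReader over your index and document 0 has the field):

```java
// Untested sketch: ask the index for the vector actually written for
// document 0's "title" field. When positions/offsets were stored, the
// returned TermFreqVector is also a TermPositionVector.
TermFreqVector tfv = reader.getTermFreqVector(0, "title");
if (tfv == null) {
  System.out.println("no term vector for this field");
} else if (tfv instanceof TermPositionVector) {
  TermPositionVector tpv = (TermPositionVector) tfv;
  // getTermPositions/getOffsets return null when that data wasn't stored
  System.out.println("positions stored: " + (tpv.getTermPositions(0) != null));
  System.out.println("offsets stored:   " + (tpv.getOffsets(0) != null));
} else {
  System.out.println("term vector stored without positions/offsets");
}
```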

On Fri, Jul 20, 2012 at 5:10 PM, Mike O'Leary <tmoleary@uw.edu> wrote:
> Hi Robert,
> I put together the following two small applications to try to separate the problem I am having from my own software and any bugs it contains. One of the applications is called CreateTestIndex; it comes with the Lucene in Action book's source code, which you can download from Manning Publications. I changed it slightly to get rid of a special analyzer that is irrelevant to what I am looking at, to get rid of a few warnings about deprecated functions, and to add a loop that writes the names of the fields and their TermVector, offset and position settings to the console.
>
> The other application is called DumpIndex, and I got it from a web site about 6 months ago. I changed a few lines to get rid of deprecated function warnings and added the same line of code that writes field information to the console.
>
> What I am seeing is that when I run CreateTestIndex, when the fields are first created, added to a document, and are about to be added to the index, the fields for which Field.TermVector.WITH_POSITIONS_OFFSETS is specified correctly print true for field.isTermVectorStored(), field.isStoreOffsetWithTermVector() and field.isStorePositionWithTermVector(). When I run DumpIndex on the index that was created, those same fields print true for field.isTermVectorStored() and false for the other two functions.
> Thanks,
> Mike
>
> This is the source code for CreateTestIndex:
>
> ////////////////////////////////////////////////////////////////////////////////
> package myLucene;
>
> /**
>  * Copyright Manning Publications Co.
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License");
>  * you may not use this file except in compliance with the License.
>  * You may obtain a copy of the License at
>  *
>  *     http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Fieldable;
> import org.apache.lucene.document.NumericField;
> import org.apache.lucene.document.DateTools;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.util.Version;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.util.Properties;
> import java.util.Date;
> import java.util.List;
> import java.util.ArrayList;
> import java.text.ParseException;
>
> public class CreateTestIndex {
>
>   public static Document getDocument(String rootDir, File file) throws IOException {
>     Properties props = new Properties();
>     props.load(new FileInputStream(file));
>
>     Document doc = new Document();
>
>     // category comes from relative path below the base directory
>     String category = file.getParent().substring(rootDir.length());    //1
>     category = category.replace(File.separatorChar, '/');              //1
>
>     String isbn = props.getProperty("isbn");         //2
>     String title = props.getProperty("title");       //2
>     String author = props.getProperty("author");     //2
>     String url = props.getProperty("url");           //2
>     String subject = props.getProperty("subject");   //2
>
>     String pubmonth = props.getProperty("pubmonth"); //2
>
>     System.out.println(title + "\n" + author + "\n" + subject + "\n" + pubmonth + "\n" + category + "\n---------");
>
>     doc.add(new Field("isbn",                     // 3
>                       isbn,                       // 3
>                       Field.Store.YES,            // 3
>                       Field.Index.NOT_ANALYZED)); // 3
>     doc.add(new Field("category",                 // 3
>                       category,                   // 3
>                       Field.Store.YES,            // 3
>                       Field.Index.NOT_ANALYZED)); // 3
>     doc.add(new Field("title",                    // 3
>                       title,                      // 3
>                       Field.Store.YES,            // 3
>                       Field.Index.ANALYZED,       // 3
>                       Field.TermVector.WITH_POSITIONS_OFFSETS));   // 3
>     doc.add(new Field("title2",                   // 3
>                       title.toLowerCase(),        // 3
>                       Field.Store.YES,            // 3
>                       Field.Index.NOT_ANALYZED_NO_NORMS,   // 3
>                       Field.TermVector.WITH_POSITIONS_OFFSETS));  // 3
>
>     // split multiple authors into unique field instances
>     String[] authors = author.split(",");            // 3
>     for (String a : authors) {                       // 3
>       doc.add(new Field("author",                    // 3
>                         a,                           // 3
>                         Field.Store.YES,             // 3
>                         Field.Index.NOT_ANALYZED,    // 3
>                         Field.TermVector.WITH_POSITIONS_OFFSETS));   // 3
>     }
>
>     doc.add(new Field("url",                        // 3
>                       url,                           // 3
>                       Field.Store.YES,                // 3
>                       Field.Index.NOT_ANALYZED_NO_NORMS));   // 3
>     doc.add(new Field("subject",                     // 3  //4
>                       subject,                       // 3  //4
>                       Field.Store.YES,               // 3  //4
>                       Field.Index.ANALYZED,          // 3  //4
>                       Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3  //4
>
>     doc.add(new NumericField("pubmonth",          // 3
>                              Field.Store.YES,     // 3
>                              true).setIntValue(Integer.parseInt(pubmonth)));   // 3
>
>     Date d; // 3
>     try { // 3
>       d = DateTools.stringToDate(pubmonth); // 3
>     } catch (ParseException pe) { // 3
>       throw new RuntimeException(pe); // 3
>     }                                             // 3
>     doc.add(new NumericField("pubmonthAsDay")      // 3
>                  .setIntValue((int) (d.getTime()/(1000*3600*24))));   // 3
>
>     for(String text : new String[] {title, subject, author, category}) {   // 3 // 5
>       doc.add(new Field("contents", text,                             // 3 // 5
>                         Field.Store.NO, Field.Index.ANALYZED,         // 3 // 5
>                         Field.TermVector.WITH_POSITIONS_OFFSETS));    // 3 // 5
>     }
>
>     List<Fieldable> fields = doc.getFields();
>
>     for (Fieldable field : fields) {
>         System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
>                         field.isStoreOffsetWithTermVector() + " " + field.isStorePositionWithTermVector());
>     }
>     return doc;
>   }
>
>   private static void findFiles(List<File> result, File dir) {
>     for(File file : dir.listFiles()) {
>       if (file.getName().endsWith(".properties")) {
>         result.add(file);
>       } else if (file.isDirectory()) {
>         findFiles(result, file);
>       }
>     }
>   }
>
>   public static void main(String[] args) throws IOException {
>     String dataDir = args[0];
>     String indexDir = args[1];
>     List<File> results = new ArrayList<File>();
>     findFiles(results, new File(dataDir));
>     System.out.println(results.size() + " books to index");
>     Directory dir = FSDirectory.open(new File(indexDir));
>     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
>     IndexWriter w = new IndexWriter(dir, config);
>     for(File file : results) {
>       Document doc = getDocument(dataDir, file);
>       w.addDocument(doc);
>     }
>     w.close();
>     dir.close();
>   }
> }
>
> /*
>   #1 Get category
>   #2 Pull fields
>   #3 Add fields to Document instance
>   #4 Flag subject field
>   #5 Add catch-all contents field
>   #6 Custom analyzer to override multi-valued position increment */
> ////////////////////////////////////////////////////////////////////////////////
> And for DumpIndex:
> ////////////////////////////////////////////////////////////////////////////////
> package myLucene;
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Fieldable;
>
> import org.apache.lucene.index.CorruptIndexException;
> import org.apache.lucene.index.IndexReader;
>
> import org.apache.lucene.store.FSDirectory;
>
> import java.io.File;
> import java.io.IOException;
>
> import javax.xml.stream.FactoryConfigurationError;
> import javax.xml.stream.XMLOutputFactory;
> import javax.xml.stream.XMLStreamException;
> import javax.xml.stream.XMLStreamWriter;
>
> /**
>  * Dumps a Lucene index as XML. Dumps all documents with their fields and values to stdout.
>  *
>  * Blog post at
>  * http://ktulu.com.ar/blog/2009/10/12/dumping-lucene-indexes-as-xml/
>  *
>  * @author Luis Parravicini
>  */
> public class DumpIndex {
>         /**
>          * Reads the index from the directory passed as argument, or "index" if no arguments are given.
>          */
>         public static void main(String[] args) throws Exception {
>                 String index = (args.length > 0 ? args[0] : "index");
>
>                 new DumpIndex(index).dump();
>         }
>
>         private String dir;
>
>         public DumpIndex(String dir) {
>                 this.dir = dir;
>         }
>
>         public void dump() throws XMLStreamException, FactoryConfigurationError, CorruptIndexException, IOException {
>                 XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(System.out);
>                 IndexReader reader = IndexReader.open(FSDirectory.open(new File(dir)));
>
>                 out.writeStartDocument();
>                 out.writeStartElement("documents");
>
>                 for (int i = 0; i < reader.numDocs(); i++) {
>                         dumpDocument(reader.document(i), out);
>                 }
>                 out.writeEndElement();
>                 out.writeEndDocument();
>                 out.flush();
>                 reader.close();
>         }
>
>         private void dumpDocument(Document document, XMLStreamWriter out) throws XMLStreamException {
>                 out.writeStartElement("document");
>
>                 for (Fieldable field : document.getFields()) {
>                         System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
>                                 field.isStoreOffsetWithTermVector() + " " + field.isStorePositionWithTermVector());
>
>                         out.writeStartElement("field");
>                         out.writeAttribute("name", field.name());
>                         out.writeAttribute("value", field.stringValue());
>                         out.writeEndElement();
>                 }
>                 out.writeEndElement();
>         }
> }
> ////////////////////////////////////////////////////////////////////////////////
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Friday, July 20, 2012 6:11 AM
> To: java-user@lucene.apache.org
> Subject: Re: Problem with TermVector offsets and positions not being preserved
>
> Hi Mike:
>
> I wrote up some tests last night against 3.6, trying to find some way to reproduce what you are seeing, e.g. adding additional segments with the field specified without term vectors, without tv offsets, omitting TF, then merging them and checking everything out. I couldn't find any problems.
>
> Can you provide more information?
>
> On Thu, Jul 19, 2012 at 7:16 PM, Mike O'Leary <tmoleary@uw.edu> wrote:
>> I created an index using Lucene 3.6.0 in which I specified that a certain text field in each document should be indexed, stored, analyzed with no norms, with term vectors, offsets and positions. Later I looked at that index in Luke, and it said that term vectors were created for this field, but offsets and positions were not. The code I used for indexing couldn't be simpler. It looks like this for the relevant field:
>>
>> doc.add(new Field("ReportText", reportTextContents, Field.Store.YES, 
>> Field.Index.ANALYZED_NO_NORMS, 
>> Field.TermVector.WITH_POSITIONS_OFFSETS));
>>
>> The indexer adds these documents to the index and commits them. I ran the indexer in a debugger and watched the Lucene code set the Field instance variables called storeTermVector, storeOffsetWithTermVector and storePositionWithTermVector to true for this field.
>>
>> When the indexing was done, I ran a simple program in a debugger that opens an index, reads each document and writes out its information as XML. The values of storeOffsetWithTermVector and storePositionWithTermVector in the ReportText Field objects were false. Is there something other than specifying Field.TermVector.WITH_POSITIONS_OFFSETS when constructing a Field that needs to be done in order for offsets and positions to be saved in the index? Or are there circumstances under which the Field.TermVector setting for a Field object is ignored? This doesn't make sense to me, and I could swear that offsets and positions were being saved in some older indexes I created that I unfortunately no longer have around for comparison. I'm sure that I am just overlooking something or have made some kind of mistake, but I can't see what it is at the moment. Thanks for any help or advice you can give me.
>> Mike
>
>
>
> --
> lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



--
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
