lucene-java-user mailing list archives

From "Mike O'Leary" <tmole...@uw.edu>
Subject RE: Problem with TermVector offsets and positions not being preserved
Date Fri, 20 Jul 2012 22:53:41 GMT
I neglected to mention that CreateTestIndex uses a collection of data files with .properties
extensions that are included in the Lucene In Action source code download.
Mike

-----Original Message-----
From: Mike O'Leary [mailto:tmoleary@uw.edu] 
Sent: Friday, July 20, 2012 2:10 PM
To: java-user@lucene.apache.org
Subject: RE: Problem with TermVector offsets and positions not being preserved

Hi Robert,
I put together the following two small applications to try to separate the problem I am having
from my own software and any bugs it contains. One of the applications is called CreateTestIndex,
and it comes with the Lucene In Action book's source code that you can download from Manning
Publications. I changed it a tiny bit to get rid of a special analyzer that is irrelevant
to what I am looking at, to get rid of a few warnings about deprecated functions, and to add
a loop that writes names of fields and their TermVector, offset and position settings to the
console.

The other application is called DumpIndex; I got it from a web site about 6 months ago. I
changed a few lines to get rid of deprecated function warnings and added the same line of
code that writes field information to the console.

What I am seeing is this: when I run CreateTestIndex, at the point where the fields have been
created, added to a document, and are about to be written to the index, the fields for which
Field.TermVector.WITH_POSITIONS_OFFSETS was specified correctly print true for
field.isTermVectorStored(), field.isStoreOffsetWithTermVector() and
field.isStorePositionWithTermVector(). When I run DumpIndex on the index that was created,
those same fields print true for field.isTermVectorStored() and false for the other two
methods.
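Incidentally, the term vector data can also be checked directly against the index, rather than through flags on Field objects reconstructed from stored documents. Below is a minimal sketch against the Lucene 3.6 API (the hard-coded "title" field name and the index-path argument are my assumptions, not part of either program above). A vector written with positions and offsets should come back as a TermPositionVector:

```java
package myLucene;

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.store.FSDirectory;

public class CheckTermVectors {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    for (int i = 0; i < reader.maxDoc(); i++) {
      // Ask the index itself for the vector, bypassing Field flags entirely.
      TermFreqVector tv = reader.getTermFreqVector(i, "title");
      if (tv == null) {
        System.out.println("doc " + i + ": no term vector for title");
      } else {
        // A vector stored with positions and offsets comes back as a
        // TermPositionVector; a plain TermFreqVector has frequencies only.
        System.out.println("doc " + i + ": positions/offsets readable = "
            + (tv instanceof TermPositionVector));
      }
    }
    reader.close();
  }
}
```

If this prints true, the offsets and positions made it into the index, which would suggest the false values from DumpIndex only mean that Fields retrieved via reader.document(i) do not carry the original index-time settings.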
Thanks,
Mike

This is the source code for CreateTestIndex:

////////////////////////////////////////////////////////////////////////////////
package myLucene;

/**
 * Copyright Manning Publications Co.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.text.ParseException;

public class CreateTestIndex {
  
  public static Document getDocument(String rootDir, File file) throws IOException {
    Properties props = new Properties();
    props.load(new FileInputStream(file));

    Document doc = new Document();

    // category comes from relative path below the base directory
    String category = file.getParent().substring(rootDir.length());    //1
    category = category.replace(File.separatorChar, '/');              //1

    String isbn = props.getProperty("isbn");         //2
    String title = props.getProperty("title");       //2
    String author = props.getProperty("author");     //2
    String url = props.getProperty("url");           //2
    String subject = props.getProperty("subject");   //2

    String pubmonth = props.getProperty("pubmonth"); //2

    System.out.println(title + "\n" + author + "\n" + subject + "\n" + pubmonth + "\n" + category + "\n---------");

    doc.add(new Field("isbn",                     // 3
                      isbn,                       // 3
                      Field.Store.YES,            // 3
                      Field.Index.NOT_ANALYZED)); // 3
    doc.add(new Field("category",                 // 3
                      category,                   // 3
                      Field.Store.YES,            // 3
                      Field.Index.NOT_ANALYZED)); // 3
    doc.add(new Field("title",                    // 3
                      title,                      // 3
                      Field.Store.YES,            // 3
                      Field.Index.ANALYZED,       // 3
                      Field.TermVector.WITH_POSITIONS_OFFSETS));   // 3
    doc.add(new Field("title2",                   // 3
                      title.toLowerCase(),        // 3
                      Field.Store.YES,            // 3
                      Field.Index.NOT_ANALYZED_NO_NORMS,   // 3
                      Field.TermVector.WITH_POSITIONS_OFFSETS));  // 3

    // split multiple authors into unique field instances
    String[] authors = author.split(",");            // 3
    for (String a : authors) {                       // 3
      doc.add(new Field("author",                    // 3
                        a,                           // 3
                        Field.Store.YES,             // 3
                        Field.Index.NOT_ANALYZED,    // 3
                        Field.TermVector.WITH_POSITIONS_OFFSETS));   // 3
    }

    doc.add(new Field("url",                        // 3
                      url,                           // 3
                      Field.Store.YES,                // 3
                      Field.Index.NOT_ANALYZED_NO_NORMS));   // 3
    doc.add(new Field("subject",                     // 3  //4
                      subject,                       // 3  //4
                      Field.Store.YES,               // 3  //4
                      Field.Index.ANALYZED,          // 3  //4
                      Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3  //4

    doc.add(new NumericField("pubmonth",          // 3
                             Field.Store.YES,     // 3
                             true).setIntValue(Integer.parseInt(pubmonth)));   // 3

    Date d; // 3
    try { // 3
      d = DateTools.stringToDate(pubmonth); // 3
    } catch (ParseException pe) { // 3
      throw new RuntimeException(pe); // 3
    }                                             // 3
    doc.add(new NumericField("pubmonthAsDay")      // 3
                 .setIntValue((int) (d.getTime()/(1000*3600*24))));   // 3

    for(String text : new String[] {title, subject, author, category}) {  // 3 // 5
      doc.add(new Field("contents", text,                             // 3 // 5
                        Field.Store.NO, Field.Index.ANALYZED,         // 3 // 5
                        Field.TermVector.WITH_POSITIONS_OFFSETS));    // 3 // 5
    }

    List<Fieldable> fields = doc.getFields();
    
    for (Fieldable field : fields) {
      System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
          field.isStoreOffsetWithTermVector() + " " + field.isStorePositionWithTermVector());
    }
    return doc;
  }

  private static void findFiles(List<File> result, File dir) {
    for(File file : dir.listFiles()) {
      if (file.getName().endsWith(".properties")) {
        result.add(file);
      } else if (file.isDirectory()) {
        findFiles(result, file);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    String dataDir = args[0];
    String indexDir = args[1];
    List<File> results = new ArrayList<File>();
    findFiles(results, new File(dataDir));
    System.out.println(results.size() + " books to index");
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter w = new IndexWriter(dir, config);
    for(File file : results) {
      Document doc = getDocument(dataDir, file);
      w.addDocument(doc);
    }
    w.close();
    dir.close();
  }
}

/*
  #1 Get category
  #2 Pull fields
  #3 Add fields to Document instance
  #4 Flag subject field
  #5 Add catch-all contents field
  #6 Custom analyzer to override multi-valued position increment
*/
////////////////////////////////////////////////////////////////////////////////
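One small aside on the pubmonthAsDay field above: the value is just epoch milliseconds truncated to whole days. The same arithmetic in isolation (class and method names are mine, chosen for illustration):

```java
public class DayConversion {
  // Same arithmetic as the pubmonthAsDay field in CreateTestIndex:
  // milliseconds since the epoch, truncated to a whole number of days.
  static int daysSinceEpoch(long millis) {
    return (int) (millis / (1000L * 3600 * 24));
  }

  public static void main(String[] args) {
    System.out.println(daysSinceEpoch(0L));             // the epoch itself -> 0
    System.out.println(daysSinceEpoch(86400000L));      // exactly one day later -> 1
    System.out.println(daysSinceEpoch(1342828800000L)); // 2012-07-21 00:00 UTC -> 15542
  }
}
```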
And for DumpIndex:
////////////////////////////////////////////////////////////////////////////////
package myLucene;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;

import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;

import org.apache.lucene.store.FSDirectory;

import java.io.File;
import java.io.IOException;

import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/**
 * Dumps a Lucene index as XML. Dumps all documents with their fields and values to stdout.
 *
 * Blog post at
 * http://ktulu.com.ar/blog/2009/10/12/dumping-lucene-indexes-as-xml/
 *
 * @author Luis Parravicini
 */
public class DumpIndex {
	/**
	 * Reads the index from the directory passed as argument or "index" if no arguments are given.
	 */
	public static void main(String[] args) throws Exception {
		String index = (args.length > 0 ? args[0] : "index");

		new DumpIndex(index).dump();
	}

	private String dir;

	public DumpIndex(String dir) {
		this.dir = dir;
	}

	public void dump() throws XMLStreamException, FactoryConfigurationError, CorruptIndexException, IOException {
		XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(System.out);
		IndexReader reader = IndexReader.open(FSDirectory.open(new File(dir)));

		out.writeStartDocument();
		out.writeStartElement("documents");

		for (int i = 0; i < reader.numDocs(); i++) {
			dumpDocument(reader.document(i), out);
		}
		out.writeEndElement();
		out.writeEndDocument();
		out.flush();
		reader.close();
	}

	private void dumpDocument(Document document, XMLStreamWriter out) throws XMLStreamException {
		out.writeStartElement("document");

		for (Fieldable field : document.getFields()) {
			System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
					field.isStoreOffsetWithTermVector() + " " + field.isStorePositionWithTermVector());

			out.writeStartElement("field");
			out.writeAttribute("name", field.name());
			out.writeAttribute("value", field.stringValue());
			out.writeEndElement();
		}
		out.writeEndElement();
	}
}
////////////////////////////////////////////////////////////////////////////////
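The XML output in DumpIndex is plain javax.xml.stream from the JDK. Here is a self-contained sketch of the same per-field pattern (start element, attributes, end element), writing to a StringWriter instead of System.out; the field name and value are made up:

```java
import java.io.StringWriter;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class XmlSketch {
  // Serializes one name/value pair the same way DumpIndex does per field.
  public static String dumpOneField(String name, String value) throws Exception {
    StringWriter buf = new StringWriter();
    XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(buf);
    out.writeStartElement("field");
    out.writeAttribute("name", name);
    out.writeAttribute("value", value);
    out.writeEndElement();
    out.flush();
    return buf.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(dumpOneField("title", "Lucene in Action"));
  }
}
```

Note that the writer escapes attribute values for you, which is why DumpIndex can pass field.stringValue() straight through.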

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com]
Sent: Friday, July 20, 2012 6:11 AM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved

Hi Mike:

I wrote up some tests last night against 3.6 trying to find some way to reproduce what you
are seeing, e.g. adding additional segments with the field specified without term vectors,
without tv offsets, omitting TF, and merging them and checking everything out. I couldn't find
any problems.

Can you provide more information?

On Thu, Jul 19, 2012 at 7:16 PM, Mike O'Leary <tmoleary@uw.edu> wrote:
> I created an index using Lucene 3.6.0 in which I specified that a certain text field
> in each document should be indexed, stored, analyzed with no norms, with term vectors,
> offsets and positions. Later I looked at that index in Luke, and it said that term vectors
> were created for this field, but offsets and positions were not. The code I used for
> indexing couldn't be simpler. It looks like this for the relevant field:
>
> doc.add(new Field("ReportText", reportTextContents, Field.Store.YES,
>                   Field.Index.ANALYZED_NO_NORMS,
>                   Field.TermVector.WITH_POSITIONS_OFFSETS));
>
> The indexer adds these documents to the index and commits them. I ran the indexer in
> a debugger and watched the Lucene code set the Field instance variables called
> storeTermVector, storeOffsetWithTermVector and storePositionWithTermVector to true for
> this field.
>
> When the indexing was done, I ran a simple program in a debugger that opens an index,
> reads each document and writes out its information as XML. The values of
> storeOffsetWithTermVector and storePositionWithTermVector in the ReportText Field objects
> were false. Is there something other than specifying Field.TermVector.WITH_POSITIONS_OFFSETS
> when constructing a Field that needs to be done in order for offsets and positions to be
> saved in the index? Or are there circumstances under which the Field.TermVector setting for
> a Field object is ignored? This doesn't make sense to me, and I could swear that offsets and
> positions were being saved in some older indexes I created that I unfortunately no longer
> have around for comparison. I'm sure that I am just overlooking something or have made some
> kind of mistake, but I can't see what it is at the moment. Thanks for any help or advice you
> can give me.
> Mike



--
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
