lucene-java-user mailing list archives

From: Yuan <yuanh...@yahoo.com>
Subject: Incremental Indexing in Lucene 4.7
Date: Mon, 24 Mar 2014 16:56:27 GMT
We are using Lucene 3.6 to perform incremental indexing, following an
algorithm we found on the web.

1.  For each file that we index, we create a UID field to associate with
it. The UID is computed from the file path and the last-modified time (a
sketch follows after this list).
2.  When reindexing, we use the following lines to obtain a UID iterator
that walks the UIDs in alphabetical order.

IndexReader reader = IndexReader.open(writer,true); 
TermEnum uidIter = reader.terms(new Term("uid", ""));       

3.  We then sort the files to be indexed and compare each file's UID with
the UID returned by the iterator.  If the two UIDs are equal, the file has
not changed.
If the iterator's UID is less than the file's UID, the file associated with
the UID the iterator points at has been deleted, so we remove it from the
index.  If the iterator's UID is greater than the file's UID, the file is
either newly added or an old document that has been updated, so we add the
document to the index.
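
For concreteness, here is a simplified sketch of how such a UID can be
built, modeled on the Lucene demo's FileDocument.uid() (our real code
differs in the details):

    import java.io.File;
    import org.apache.lucene.document.DateTools;

    public final class Uids {
      // Path plus last-modified time, so a modified file gets a lexically
      // later UID than the previously indexed version of the same path.
      public static String uid(File f) {
        return f.getPath().replace(File.separatorChar, '\u0000') + '\u0000'
            + DateTools.timeToString(f.lastModified(),
                                     DateTools.Resolution.SECOND);
      }
    }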

Here is the code snippet for the algorithm:

  private void indexDirectory(File docDir, File catalogDir)
  {
    try {
      Directory dir = FSDirectory.open(catalogDir);
      boolean indexExists = IndexReader.indexExists(dir);

      IndexWriter writer = getIndexWriter(dir);
      IndexReader reader = null;
      TermEnum uidIter = null;

      if (indexExists)
      {
        reader = IndexReader.open(writer, true);
        uidIter = reader.terms(new Term("uid", ""));  // init uid iterator
      }

      // Add all new and updated documents to the index; stale versions are
      // deleted inside updateDocumentIndexes as the iterator advances.
      // ('results' is an IndexingResults field of this class, omitted here.)
      updateDocumentIndexes(uidIter, writer, docDir, results);

      // Clean up index entries the iterator never reached: files deleted
      // from the file system that have not yet been removed from the index.
      if (uidIter != null) {
        cleanupIndexes(uidIter, writer, results);
      }

      writer.commit();

      if (indexExists)
      {
        uidIter.close();
        reader.close();
      }

      writer.close();

    } catch (IOException ex) {
      writeUserMessage(Level.ERROR,
          "Index failed for directory: " + docDir.getPath(), ex);
    }
  }

  private void updateDocumentIndexes(TermEnum uidIter, IndexWriter writer,
                                     File fileToBeIndexed,
                                     IndexingResults results)
  {
      try
      {
        if (uidIter != null)
        {
          String docUid = FileDocument.uid(fileToBeIndexed);
          // Index entries whose UID sorts before this file's UID belong to
          // files that no longer exist on disk; delete them.
          while (uidIter.term() != null && "uid".equals(uidIter.term().field())
              && uidIter.term().text().compareTo(docUid) < 0) {
            writer.deleteDocuments(uidIter.term());
            uidIter.next();
          }
          if (uidIter.term() != null && "uid".equals(uidIter.term().field())
              && uidIter.term().text().compareTo(docUid) == 0)
          {
            // Same UID: the file is unchanged.
            uidIter.next();
            results.incrementUnchangedFiles();
          }
          else
          {
            // Iterator UID is greater (or the enum is exhausted): the file
            // is new, or an old document that has been updated.
            if (uidIter.term() != null) {
              if (isIndexableFile(fileToBeIndexed.getName())) {
                Document doc = FileDocument.Document(fileToBeIndexed);
                writer.addDocument(doc);
                results.incrementIndexedFiles();
              }
            } else {
              addDocument(writer, fileToBeIndexed, results);
            }
          }
        }
        else
        {
          addDocument(writer, fileToBeIndexed, results);
        }
      }
      catch (IOException ioe)
      {
        results.incrementErrors();
        logger.log(Level.ERROR, "Unable to process document at: "
            + fileToBeIndexed.getPath(), ioe);
      }
      catch (Exception ex)
      {
        results.incrementErrors();
        logger.log(Level.ERROR, "Unable to process document at: "
            + fileToBeIndexed.getPath(), ex);
      }
  }
   

Now we are trying to upgrade to Lucene 4.7.  The call "reader.terms(new
Term("uid", ""))" is no longer available in 4.7, so I tried to work around
it by following the Apache Lucene Migration Guide
(http://lucene.apache.org/core/4_0_0/MIGRATE.html).

Instead of "reader.terms(new Term("uid", ""))", I used the following:


        Fields fields = MultiFields.getFields(reader);        
        if (fields != null) {
          Terms terms = fields.terms("uid");
          
          if (terms != null) {
            uidIter = terms.iterator(null); 
          }
        }
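
For reference, here is a simplified sketch of how we then walk the
TermsEnum and compare against the file UID (my adaptation of the old loop;
the BytesRef/Term handling is the part I am least sure about):

        BytesRef t = uidIter.next();  // first term, or null if field empty
        String docUid = FileDocument.uid(fileToBeIndexed);
        // Terms that sort before this file's UID belong to deleted files.
        while (t != null && t.utf8ToString().compareTo(docUid) < 0) {
          // The enum may reuse its BytesRef, so copy before handing it off.
          writer.deleteDocuments(new Term("uid", BytesRef.deepCopyOf(t)));
          t = uidIter.next();
        }
        if (t != null && t.utf8ToString().compareTo(docUid) == 0) {
          t = uidIter.next();         // unchanged file
          results.incrementUnchangedFiles();
        }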
        
However, I found that the terms the uidIter returns are no longer in
alphabetical order, which breaks the algorithm.  Is there any way to work
around this?

Thank you!



