lucene-java-user mailing list archives

From Ankit Murarka <ankit.mura...@rancoretech.com>
Subject Re: Files greater than 20 MB not getting Indexed. No files generated except write.lock even after 8-9 minutes.
Date Thu, 29 Aug 2013 12:05:05 GMT
Yes, I know that Lucene should not have any document size limits. All I
get is a lock file inside my index folder; there is no other file in the
index folder at all. Then I get an OOM exception.
Please provide some guidance...
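For context on the OOM: storing every line of a 20 MB file as stored fields on a single in-memory Document is already a large fraction of a 128 MB heap before Lucene's own indexing buffers are counted. A rough back-of-envelope sketch, plain JDK only — the average line length and per-object overhead below are assumed numbers, not measurements:

```java
public class HeapEstimate {
    public static void main(String[] args) {
        long fileBytes = 20L * 1024 * 1024; // a 20 MB input file
        long avgLineBytes = 80;             // assumed average line length
        long lines = fileBytes / avgLineBytes;

        // Java Strings hold text as UTF-16, so for mostly-ASCII input that
        // is roughly 2 bytes of char data per input byte, plus per-line
        // object/array/Field overhead.
        long charData = fileBytes * 2;
        long perLineOverhead = lines * 64;  // rough guess: String + Field + list slot

        long totalMb = (charData + perLineOverhead) / (1024 * 1024);
        System.out.println("lines ~ " + lines
                + ", heap for one Document ~ " + totalMb + " MB");
    }
}
```

That is on the order of 56 MB for one Document alone, before the analyzer, the indexing RAM buffer, and a second full read of the file via the contents Reader are accounted for.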

Here is the example:

package com.issue;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.util.Date;

/** Indexes all text files under a directory. */
public class D {

    static String[] filenames;

    public static void main(String[] args) {
        String indexPath = "D:\\Issue"; // where the index will be created
        String docsPath = "Issue";      // where the source files are kept
        String ch = "OverAll";

        final File docDir = new File(docsPath);
        if (!docDir.exists() || !docDir.canRead()) {
            System.out.println("Document directory '" + docDir.getAbsolutePath()
                    + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }

        Date start = new Date();
        try {
            Directory dir = FSDirectory.open(new File(indexPath));
            Analyzer analyzer =
                    new com.rancore.demo.CustomAnalyzerForCaseSensitive(Version.LUCENE_44);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44, analyzer);
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);

            IndexWriter writer = new IndexWriter(dir, iwc);
            if (ch.equalsIgnoreCase("OverAll")) {
                indexDocs(writer, docDir, true);
            } else {
                filenames = args[2].split(",");
                // indexDocs(writer, docDir);
            }
            writer.commit();
            writer.close();
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // "OverAll" mode: recursively index every file under the directory
    static void indexDocs(IndexWriter writer, File file, boolean flag)
            throws IOException {
        if (!file.canRead()) {
            return;
        }
        if (file.isDirectory()) {
            String[] files = file.list(); // may be null if an IO error occurs
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    indexDocs(writer, new File(file, files[i]), flag);
                }
            }
            return;
        }

        FileInputStream fis = new FileInputStream(file);
        LineNumberReader lnr = null;
        try {
            Document doc = new Document();
            doc.add(new StringField("path", file.getPath(), Field.Store.YES));
            doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
            doc.add(new StringField("name", file.getName(), Field.Store.YES));
            doc.add(new TextField("contents",
                    new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

            // Every line of the file is also stored on the SAME document:
            lnr = new LineNumberReader(new FileReader(file));
            String line;
            while (null != (line = lnr.readLine())) {
                doc.add(new StringField("SC", line.trim(), Field.Store.YES));
            }

            if (writer.getConfig().getOpenMode() == OpenMode.CREATE_OR_APPEND) {
                writer.addDocument(doc);
                writer.commit();
            } else {
                writer.updateDocument(new Term("path", file.getPath()), doc);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (lnr != null) {
                lnr.close();
            }
            fis.close();
        }
    }
}
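One way to avoid holding the whole file on a single Document — indexing each line as its own small document, as suggested further down the thread — could be sketched like this. This is illustrative only: it assumes Lucene 4.4 on the classpath, and the class and field names are my own, not from the code above:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class LinePerDocument {

    // One tiny document per line: only the current line is ever in memory,
    // so file size no longer matters for heap usage.
    static void indexFile(IndexWriter writer, File file) throws Exception {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
        try {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                Document doc = new Document();
                doc.add(new StringField("path", file.getPath(), Field.Store.YES));
                doc.add(new IntField("lineNo", lineNo, Field.Store.YES));
                doc.add(new TextField("contents", line, Field.Store.YES));
                writer.addDocument(doc); // note: no per-line commit
            }
        } finally {
            reader.close();
        }
        writer.commit(); // commit once per file, not once per document
    }
}
```

Committing once per file rather than after every addDocument also matters: each commit forces a flush and fsync, which is one reason indexing a single large file can drag on for minutes.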



On 8/29/2013 4:20 PM, Michael McCandless wrote:
> Lucene doesn't have document size limits.
>
> There are default limits for how many tokens the highlighters will process ...
>
> But, if you are passing each line as a separate document to Lucene,
> then Lucene only sees a bunch of tiny documents, right?
>
> Can you boil this down to a small test showing the problem?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 29, 2013 at 1:51 AM, Ankit Murarka
> <ankit.murarka@rancoretech.com>  wrote:
>    
>> Hello all,
>>
>> I am facing a tricky issue.
>> I have many files which I am indexing.
>>
>> Problem faced:
>> a. Files smaller than 20 MB are successfully indexed and merged.
>>
>> b. Files larger than 20 MB are not getting indexed. No exception is
>> thrown; only a lock file is created in the index directory. The
>> indexing process for a single file larger than 20 MB continues for more
>> than 8 minutes, after which my code merges the generated index into the
>> existing index.
>>
>> Since no index is being generated now, I get an exception during the
>> merge process.
>>
>> Why are files larger than 20 MB not being indexed? I am indexing each
>> line of the file. Why is IndexWriter not throwing any error?
>>
>> Do I need to change any parameter or otherwise tweak the Lucene
>> settings? The Lucene version is 4.4.0.
>>
>> My current Lucene deployment is on a server running with a 128 MB and
>> 512 MB heap.
>>
>> --
>> Regards
>>
>> Ankit Murarka
>>
>> "What lies behind us and what lies before us are tiny matters compared with
>> what lies within us"
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>


-- 
Regards

Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with what lies within us"



