lucene-java-user mailing list archives

From Toke Eskildsen ...@statsbiblioteket.dk>
Subject Re: how to index large number of files?
Date Thu, 21 Oct 2010 08:57:04 GMT
On Thu, 2010-10-21 at 05:01 +0200, Sahin Buyrukbilen wrote:
> Unfortunately both methods didnt go through. I am getting memory error even
> at reading the directory contents.

Then your problem is probably not Lucene related, but the sheer number
of files returned by listFiles.

A Java File contains the full path name for the file. Let's say that
this is 50 characters, which translates to about (50 * 2 + 45) ~ 150
bytes for the Java String. Add an int (4 bytes) plus bookkeeping and
we're up to about 200 bytes/File.

4.5 million Files thus takes up about 1 GB. Not enough to explain the
OOM, but if the full path name of your files is 150 characters, the list
takes up 2 GB.
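
The arithmetic above can be reproduced with a small sketch. It only
restates the rough figures from this message (2 bytes per character, 45
bytes of String overhead, a 4-byte int). The 51 bytes of bookkeeping is
an assumed value, chosen so that the 50-character case lands on the
200 bytes/File quoted above; real JVM object sizes vary. The class and
method names are hypothetical.

```java
class FileListMemoryEstimate {
    // Rough figures taken from the message; not measured JVM values.
    static final long BYTES_PER_CHAR = 2;
    static final long STRING_OVERHEAD = 45;
    static final long INT_FIELD = 4;
    // Assumption: bookkeeping chosen so a 50-char path gives ~200 bytes/File.
    static final long BOOKKEEPING = 51;

    /** Estimated heap bytes for one File with a path of the given length. */
    static long bytesPerFile(int pathChars) {
        return pathChars * BYTES_PER_CHAR + STRING_OVERHEAD + INT_FIELD + BOOKKEEPING;
    }

    /** Estimated heap bytes for an array of fileCount such File objects. */
    static long totalBytes(long fileCount, int pathChars) {
        return fileCount * bytesPerFile(pathChars);
    }

    public static void main(String[] args) {
        long fileCount = 4_500_000L;
        for (int pathChars : new int[] {50, 150}) {
            System.out.println(pathChars + " chars: "
                    + bytesPerFile(pathChars) + " bytes/File, "
                    + totalBytes(fileCount, pathChars) / 1_000_000 + " MB total");
        }
    }
}
```

With these assumptions a 50-character path costs 200 bytes per File and
a 150-character path costs 400 bytes, which is where the estimates of
about 1 GB and 2 GB for 4.5 million files come from.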

> Now, I am thinking this: What if I split 4.5million files into 100.000 (or
> less depending on java error) files directories, index each of them
> separately and merge those indexes(if possible).

You don't need to create separate indexes and merge them. Just split
your 4.5 million files into folders of more manageable sizes and perform
a recursive descent with a single IndexWriter. Something like

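import java.io.File;
import org.apache.lucene.index.IndexWriter;

public static void addFolder(IndexWriter writer, File folder) {
  File[] files = folder.listFiles();
  if (files == null) {
    return; // listFiles returns null if folder is not a directory or on I/O error
  }
  for (File file : files) {
    if (file.isDirectory()) {
      addFolder(writer, file); // descend into the subfolder
    } else {
      // Create Document from file and add it using the writer
    }
  }
}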
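
To try the traversal without a Lucene dependency, here is a
self-contained sketch of the same recursive descent. It is not the
indexing code itself: a hypothetical Consumer<File> handler stands in
for "create a Document and add it using the writer", and the class name
FolderWalkSketch is made up for illustration. It assumes Java 8 or
later for the lambda. Only one folder's listing per recursion level is
held in memory at a time, which is why splitting the files into smaller
folders helps.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class FolderWalkSketch {
    /** Recursively passes every non-directory entry under folder to handler. */
    static void walk(File folder, Consumer<File> handler) {
        File[] entries = folder.listFiles();
        if (entries == null) {
            return; // not a directory, or an I/O error occurred
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                walk(entry, handler); // descend into the subfolder
            } else {
                handler.accept(entry); // stand-in for building and adding a Document
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny temporary tree: root/a.txt and root/sub/b.txt
        File root = Files.createTempDirectory("walk-demo").toFile();
        File sub = new File(root, "sub");
        sub.mkdir();
        new File(root, "a.txt").createNewFile();
        new File(sub, "b.txt").createNewFile();

        List<String> seen = new ArrayList<>();
        walk(root, file -> seen.add(file.getName()));
        System.out.println("Visited " + seen.size() + " files: " + seen);
    }
}
```

In the real indexer the handler would build a Document from the file
and pass it to the shared IndexWriter, so all folders end up in one
index and no merge step is needed.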

- Toke