lucene-java-user mailing list archives

From Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com>
Subject Re: how to index large number of files?
Date Thu, 21 Oct 2010 19:26:13 GMT
I don't know why I was getting this error, but it looks normal to me now,
because when I try to list the contents of the folder I cannot get a
response from the Linux shell either. I have now created a folder with
100,000 files and am running Eclipse with the -Xmx2G parameter. It has been
indexing for about 15 minutes so far, but I am happy that it works.

After this I will try Toke's method: create folders of 100,000 files each
and index them recursively.

On Thu, Oct 21, 2010 at 4:57 AM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:

> On Thu, 2010-10-21 at 05:01 +0200, Sahin Buyrukbilen wrote:
> > Unfortunately, neither method worked. I am getting a memory error even
> > when reading the directory contents.
>
> Then your problem is probably not Lucene related, but the sheer number
> of files returned by listFiles.
>
> A Java File contains the full path name for the file. Let's say that
> this is 50 characters, which translates to about (50 * 2 + 45) ~ 150
> bytes for the Java String. Add an int (4 bytes) plus bookkeeping and
> we're up to about 200 bytes/File.
>
> 4.5 million Files thus take up about 1 GB. Not enough to explain the
> OOM, but if the full path name of your files is 150 characters (about
> 400 bytes/File), the list takes up about 2 GB.
>
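The estimate above can be reproduced with a few lines of Java. This is only a back-of-envelope sketch: the class name, the method names, and the 55-byte `FILE_OVERHEAD` constant are assumptions chosen so that a 50-character path lands on the quoted figure of roughly 200 bytes/File. Real numbers depend on the JVM and its settings.

```java
// Rough estimate of the heap needed to hold a large File[] listing.
final class FileListMemoryEstimate {
    // Assumed constants derived from the figures quoted in this thread, not measured.
    static final long BYTES_PER_CHAR = 2;   // UTF-16 code units (newer JVMs may store Latin-1 text more compactly)
    static final long STRING_OVERHEAD = 45; // String bookkeeping, as quoted above
    static final long FILE_OVERHEAD = 55;   // assumption: int field plus object bookkeeping

    // Approximate bytes for one File whose full path has pathLength characters.
    static long bytesPerFile(int pathLength) {
        return pathLength * BYTES_PER_CHAR + STRING_OVERHEAD + FILE_OVERHEAD;
    }

    // Approximate bytes for fileCount such File objects.
    static long totalBytes(long fileCount, int pathLength) {
        return fileCount * bytesPerFile(pathLength);
    }

    public static void main(String[] args) {
        long files = 4_500_000L;
        for (int len : new int[] {50, 150}) {
            long mb = totalBytes(files, len) / (1024 * 1024);
            System.out.printf("path length %d: ~%d MB%n", len, mb);
        }
    }
}
```

With these assumptions, 4.5 million files come to roughly 0.9 GB at 50 characters per path and roughly 1.8 GB at 150 characters, which matches the rounded figures quoted above.
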
> > Now I am thinking this: what if I split the 4.5 million files into
> > directories of 100,000 files (or fewer, depending on the Java error),
> > index each of them separately, and merge those indexes (if possible)?
>
> You don't need to create separate indexes and merge them. Just split
> your 4.5 million files into folders of more manageable sizes and perform
> a recursive descent. Something like:
>
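> import java.io.File;
> import java.io.IOException;
> import org.apache.lucene.index.IndexWriter;
>
> // Declares IOException because adding documents to the writer can throw it.
> public static void addFolder(IndexWriter writer, File folder) throws IOException {
>   File[] files = folder.listFiles();
>   if (files == null) {
>     return; // not a directory, or an I/O error occurred while listing it
>   }
>   for (File file : files) {
>     if (file.isDirectory()) {
>       addFolder(writer, file); // descend into the subfolder
>     } else {
>       // Create Document from file and add it using the writer
>     }
>   }
> }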
>
> - Toke
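
If a single folder is still too large for `listFiles`, one alternative that is not part of the suggestion above is to iterate the directory lazily. The sketch below assumes Java 7 or later (`java.nio.file` did not exist when this thread was written) and uses `Files.newDirectoryStream`, which hands out entries one at a time instead of building the whole array. The class name and the `handler` callback are hypothetical; the callback stands in for "create a Document and add it using the writer".

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

final class StreamingFolderWalk {
    // Visits every non-directory entry under 'folder', one at a time,
    // and returns how many entries were passed to the handler.
    static long addFolder(Path folder, Consumer<Path> handler) throws IOException {
        long count = 0;
        // try-with-resources closes the directory handle when iteration ends.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(folder)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    count += addFolder(entry, handler); // recursive descent
                } else {
                    handler.accept(entry); // hypothetical stand-in for indexing
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Small self-contained demo on a temporary directory tree.
        Path root = Files.createTempDirectory("walk-demo");
        Path sub = Files.createDirectory(root.resolve("sub"));
        Files.createFile(root.resolve("a.txt"));
        Files.createFile(sub.resolve("b.txt"));
        long visited = addFolder(root, p -> System.out.println("would index " + p));
        System.out.println(visited + " files visited");
    }
}
```

Only the paths currently being processed are held in memory, so the size of the directory listing no longer drives heap usage.
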