lucene-java-user mailing list archives

From Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com>
Subject Re: how to index large number of files?
Date Fri, 22 Oct 2010 17:01:46 GMT
Thank you all. I tried Peter's suggestion and it worked: I added -Xmx2G to
the VM arguments of the run/debug configuration. However, before getting his
advice I had already started splitting the files into folders of 100,000
each, and I am now indexing them recursively :) Anyway, I have learned a lot
from all of you.

Thank you.

Sahin.

On Fri, Oct 22, 2010 at 9:22 AM, Peter Keegan <peterlkeegan@gmail.com> wrote:

> > running eclipse with -Xmx2G parameter.
>
> This only affects the Eclipse JVM, not the JVM launched by Eclipse to run
> your application.
> Did you add -Xmx2G to the 'VM arguments' of your Debug or Run
> configuration?
>
> Peter
>
> On Thu, Oct 21, 2010 at 3:26 PM, Sahin Buyrukbilen <sahin.buyrukbilen@gmail.com> wrote:
>
> > I don't know why I am getting this error, but it looks normal to me now,
> > because when I try to list the contents of the folder I cannot get a
> > response from the Linux shell either. Now I have created a folder with
> > 100,000 files and am running Eclipse with the -Xmx2G parameter. It has
> > been indexing for about 15 minutes now, but I am happy it works.
> >
> > After this I will try Toke's method: create folders of 100,000 files each
> > and try to index them recursively.
> >
> > On Thu, Oct 21, 2010 at 4:57 AM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> >
> > > On Thu, 2010-10-21 at 05:01 +0200, Sahin Buyrukbilen wrote:
> > > > Unfortunately, neither method worked. I am getting a memory error
> > > > even when reading the directory contents.
> > >
> > > Then your problem is probably not Lucene related, but the sheer number
> > > of files returned by listFiles.
> > >
> > > A Java File contains the full path name for the file. Let's say that
> > > this is 50 characters, which translates to about (50 * 2 + 45) ~ 150
> > > bytes for the Java String. Add an int (4 bytes) plus bookkeeping and
> > > we're up to about 200 bytes/File.
> > >
> > > 4.5 million Files thus take up about 1 GB. Not enough to explain the
> > > OOM, but if the full path name of your files is 150 characters, the
> > > list takes up about 2 GB.
> > >
> > > > Now I am thinking this: what if I split the 4.5 million files into
> > > > directories of 100,000 files (or fewer, depending on the Java error),
> > > > index each of them separately, and merge those indexes (if possible)?
> > >
> > > You don't need to create separate indexes and merge them. Just split
> > > your 4.5 million files into folders of more manageable sizes and
> > > perform a recursive descent. Something like
> > >
> > > public static void addFolder(IndexWriter writer, File folder) {
> > >  File[] files = folder.listFiles();
> > >  if (files == null) {
> > >    return; // not a directory, or an I/O error occurred while listing it
> > >  }
> > >  for (File file : files) {
> > >    if (file.isDirectory()) {
> > >      addFolder(writer, file); // descend into the sub-folder
> > >    } else {
> > >      // Create Document from file and add it using the writer
> > >    }
> > >  }
> > > }
> > >
> > > - Toke

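For readers who hit the same OutOfMemoryError while listing a huge folder: the recursive descent Toke describes can also be written so that no File[] of every entry is ever built. The following is only a sketch under assumptions, not something posted in this thread. It swaps listFiles for java.nio.file.Files.newDirectoryStream (available since Java 7), which iterates directory entries lazily. The class name StreamingFolderWalk, the method visitFiles, and the handler callback are hypothetical names chosen for illustration; the handler is where a Document would be created and passed to the IndexWriter.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper, not part of Lucene: walks a folder tree lazily.
final class StreamingFolderWalk {

    private StreamingFolderWalk() {
    }

    /**
     * Hands every non-directory entry under root to the handler, one entry
     * at a time, and returns how many entries were handed over.
     */
    static long visitFiles(Path root, Consumer<Path> handler) throws IOException {
        long count = 0;
        // DirectoryStream yields entries as they are read from the directory,
        // so memory use does not grow with the number of files in it.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                // NOFOLLOW_LINKS: a symbolic link to a directory is treated as
                // a plain entry, which avoids endless recursion on link cycles.
                if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                    count += visitFiles(entry, handler);
                } else {
                    handler.accept(entry); // e.g. build a Document and add it
                    count++;
                }
            }
        }
        return count;
    }
}
```

A call such as StreamingFolderWalk.visitFiles(Paths.get("/path/to/files"), path -> { /* index path */ }) would then cover the whole tree, whether or not the files have been split into sub-folders. The JVM that runs the indexer still needs enough heap for Lucene itself, so the -Xmx setting in the Run/Debug configuration discussed above may remain necessary.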