lucene-java-user mailing list archives

From Peter Keegan <peterlkee...@gmail.com>
Subject Re: how to index large number of files?
Date Fri, 22 Oct 2010 13:22:02 GMT
> running eclipse with -Xmx2G parameter.

This only affects the Eclipse JVM, not the JVM launched by Eclipse to run
your application.
Did you add -Xmx2G to the 'VM arguments' of your Debug or Run configuration?
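For reference, the heap setting for the launched program lives in the launch configuration, not in eclipse.ini. A sketch of where it goes (menu names as in Eclipse 3.x; your labels may differ slightly):

```
Run > Run Configurations... > (your Java application)
    Arguments tab > VM arguments:
        -Xmx2G
```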

Peter

On Thu, Oct 21, 2010 at 3:26 PM, Sahin Buyrukbilen <
sahin.buyrukbilen@gmail.com> wrote:

> I don't know why I am getting this error, but it looks normal to me now,
> because when I try to list the contents of the folder I cannot get a
> response from the Linux shell. Now I have created a folder with 100,000
> files and I am running Eclipse with the -Xmx2G parameter. It has been
> indexing for about 15 minutes now, but I am happy it works.
>
> After this I will try Toke's method: create folders of 100,000 files each
> and index them recursively.
>
> On Thu, Oct 21, 2010 at 4:57 AM, Toke Eskildsen <te@statsbiblioteket.dk
> >wrote:
>
> > On Thu, 2010-10-21 at 05:01 +0200, Sahin Buyrukbilen wrote:
> > > Unfortunately neither method went through. I am getting a memory error
> > > even when reading the directory contents.
> >
> > Then your problem is probably not Lucene related, but the sheer number
> > of files returned by listFiles.
> >
> > A Java File contains the full path name for the file. Let's say that
> > this is 50 characters, which translates to about (50 * 2 + 45) ~ 150
> > bytes for the Java String. Add an int (4 bytes) plus bookkeeping and
> > we're up to about 200 bytes/File.
> >
> > 4.5 million Files thus take up about 1 GB. Not enough to explain the
> > OOM on its own, but if the full path name of your files is 150
> > characters, each File costs closer to 400 bytes and the list takes up
> > about 2 GB.
> >
> > > Now, I am thinking this: What if I split 4.5million files into 100.000
> > (or
> > > less depending on java error) files directories, index each of them
> > > separately and merge those indexes(if possible).
> >
> > You don't need to create separate indexes and merge them. Just split
> > your 4.5 million files into folders of more manageable sizes and perform
> > a recursive descent. Something like
> >
> > public static void addFolder(IndexWriter writer, File folder) {
> >   File[] files = folder.listFiles();
> >   if (files == null) { // null on I/O error or if folder is not a directory
> >     return;
> >   }
> >   for (File file : files) {
> >     if (file.isDirectory()) {
> >       addFolder(writer, file);
> >     } else {
> >       // Create a Document from the file and add it with writer.addDocument(...)
> >     }
> >   }
> > }
> >
> > - Toke
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
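Toke's recursive sketch above can be turned into a self-contained, runnable example. Since Lucene's IndexWriter is not needed to demonstrate the traversal, this version substitutes a hypothetical file counter for the "add Document" step and builds a small temporary directory tree to walk:

```java
import java.io.File;
import java.io.IOException;

public class RecursiveIndexDemo {
    // Stand-in for "create a Document and add it with the writer":
    // we just count files so the sketch runs without Lucene on the classpath.
    static int indexed = 0;

    static void addFolder(File folder) {
        File[] files = folder.listFiles();
        if (files == null) { // null on I/O error or if folder is not a directory
            return;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                addFolder(file);   // descend into the subfolder
            } else {
                indexed++;         // real code: writer.addDocument(...)
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny tree: root/sub{0,1,2}/file{0,1}.txt
        File root = File.createTempFile("idxdemo", "");
        root.delete();
        root.mkdir();
        for (int d = 0; d < 3; d++) {
            File sub = new File(root, "sub" + d);
            sub.mkdir();
            for (int f = 0; f < 2; f++) {
                new File(sub, "file" + f + ".txt").createNewFile();
            }
        }
        addFolder(root);
        System.out.println("indexed " + indexed + " files"); // indexed 6 files
    }
}
```

Because each folder holds a bounded number of entries, only one modest File[] array per recursion level is alive at a time, which avoids materializing all 4.5 million File objects at once.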
