lucene-java-user mailing list archives

From <Paul_Murd...@emainc.com>
Subject RE: Indexing large files? - No answers yet...
Date Fri, 11 Sep 2009 14:33:49 GMT
Thanks Glen!

I will take a look at your project.  Unfortunately I will only have 512 MB
to 1024 MB to work with, as Lucene is only one component in a larger
software system running on one machine.  I agree with you on the C/C++
comment; that is what I would normally use for memory-intensive software.
It turns out that the larger the file you want to index, the larger the
heap space you will need.  What I would like to see is a way to "throttle"
the indexing process to control the memory footprint.  I understand that
this would take longer, but if I perform the task during off hours it
shouldn't matter.  At least the file will be indexed correctly.
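
Something along these lines is what I'm picturing (a rough sketch against
the Lucene 2.4 IndexWriter API; the buffer size and index path are only
illustrations, not tested values):

  // Cap the indexing RAM buffer so postings are flushed to disk early
  // and often instead of accumulating in the heap.
  IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/path/to/index"),
          new StandardAnalyzer(), true,
          IndexWriter.MaxFieldLength.UNLIMITED);
  writer.setRAMBufferSizeMB(16.0);  // flush once ~16 MB of postings accumulate
  writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH); // flush by RAM, not doc count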

Thanks,
Paul


-----Original Message-----
From: java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org]
On Behalf Of Glen Newton
Sent: Friday, September 11, 2009 9:53 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing large files? - No answers yet...

In this project:
 http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html

I concatenate the text of all articles of a single journal into a single
text file.  This can create a text file that is 500 MB in size.
Lucene is OK indexing files this size (in parallel even), but I
have a heap size of 8 GB.

I would suggest increasing your heap to as large a size as your machine
can reasonably take.
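For example, via the standard -Xmx flag (the heap size and class name here
are only placeholders for your own values):

  java -Xmx1536m -cp <your-classpath> YourIndexer
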
The reality is that Java programs (like Lucene) take up more memory than
similar C or even C++ programs.
Java may approach C/C++ in speed, but not in memory.

We don't use Java because of its memory footprint!  ;-)

See:
 Programming language shootout: speed:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
 Programming language shootout: memory:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0

-glen

2009/9/11 Dan OConnor <doconnor@acquiremedia.com>:
> Paul:
>
> My first suggestion would be to update your JVM to the latest version (or
> at least update 14).  Several garbage-collection-related issues were
> resolved in updates 10 through 13 (especially ones dealing with large
> heaps).
>
> Next, your IndexWriter parameters would help figure out why you are using
> so much RAM:
>        getMaxFieldLength()
>        getMaxBufferedDocs()
>        getMaxMergeDocs()
>        getRAMBufferSizeMB()
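>
> Something like this (an untested sketch; pass in your actual writer
> instance) will dump them:
>
>     static void dumpWriterSettings(org.apache.lucene.index.IndexWriter w) {
>         System.out.println("maxFieldLength  = " + w.getMaxFieldLength());
>         System.out.println("maxBufferedDocs = " + w.getMaxBufferedDocs());
>         System.out.println("maxMergeDocs    = " + w.getMaxMergeDocs());
>         System.out.println("ramBufferSizeMB = " + w.getRAMBufferSizeMB());
>     }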
>
> How often are you calling commit?
> Do you close your IndexWriter after every document?
> How many documents of this size are you indexing?
> Have you used Luke to look at your index?
> If this is a large index, have you optimized it recently?
> Are there any searches going on while you are indexing?
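>
> Also, if you are closing the writer after every document, one long-lived
> writer with a periodic commit is usually much cheaper.  Roughly (an
> untested sketch; "writer", "files", and buildDocument() stand in for your
> own code):
>
>     int count = 0;
>     for (File f : files) {
>         writer.addDocument(buildDocument(f));
>         if (++count % 100 == 0) {
>             writer.commit(); // flush buffered documents and sync the index
>         }
>     }
>     writer.commit();
>     writer.close();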
>
>
> Regards,
> Dan
>
>
> -----Original Message-----
> From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com]
> Sent: Friday, September 11, 2009 7:57 AM
> To: java-user@lucene.apache.org
> Subject: RE: Indexing large files? - No answers yet...
>
> This issue is still open.  Any suggestions/help with this would be
> greatly appreciated.
>
> Thanks,
>
> Paul
>
>
> -----Original Message-----
> From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
> [mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
> ] On Behalf Of Paul_Murdoch@emainc.com
> Sent: Monday, August 31, 2009 10:28 AM
> To: java-user@lucene.apache.org
> Subject: Indexing large files?
>
> Hi,
>
> I'm working with Lucene 2.4.0 on JDK 1.6.0_07.  I'm consistently
> receiving "OutOfMemoryError: Java heap space" when trying to index large
> text files.
>
> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB max
> heap size.  So I increased the max heap size to 512 MB.  This worked for
> the 5 MB text file, but Lucene still used 84 MB of heap space to do it.
> Why so much?
>
> The class FreqProxTermsWriterPerField appears to be the biggest memory
> consumer by far, according to JConsole and the TPTP Memory Profiling
> plugin for Eclipse Ganymede.
>
> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
> max heap size.  Increasing the max heap size to 1024 MB works, but Lucene
> uses 826 MB of heap space in the process.  That still seems like far too
> much memory.  I'm sure larger files would trigger the error, since heap
> usage appears to grow with file size.
>
> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
> practice for indexing large files?  Here is a code snippet that I'm
> using:
>
> // Index the content of a text file.
> private boolean saveTXTFile(File textFile, Document textDocument)
>         throws CIDBException {
>     try {
>         boolean isFile = textFile.isFile();
>         boolean hasTextExtension = textFile.getName().endsWith(".txt");
>
>         if (isFile && hasTextExtension) {
>             System.out.println("File " + textFile.getCanonicalPath()
>                     + " is being indexed");
>
>             // Field(String, Reader) streams the file's tokens into the
>             // index; the field is tokenized but not stored.
>             Reader textFileReader = new FileReader(textFile);
>             if (textDocument == null)
>                 textDocument = new Document();
>             textDocument.add(new Field("content", textFileReader));
>             indexWriter.addDocument(textDocument); // BREAKS HERE!!!!
>         }
>     } catch (FileNotFoundException fnfe) {
>         System.out.println(fnfe.getMessage());
>         return false;
>     } catch (CorruptIndexException cie) {
>         throw new CIDBException("The index has become corrupt.");
>     } catch (IOException ioe) {
>         System.out.println(ioe.getMessage());
>         return false;
>     }
>     return true;
> }
>
>
> Thanks much,
>
> Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
