lucene-java-user mailing list archives

From Brian Pinkerton <brian.pinker...@lucidimagination.com>
Subject Re: Indexing large files? - No answers yet...
Date Fri, 11 Sep 2009 15:50:10 GMT
Quite possibly, but shouldn't one expect Lucene's resource usage to track
the size of the problem in question?  Paul's two examples below use
input files of 5 and 62 MB, hardly the size of input I'd expect to
handle in a memory-compromised environment.

bri

On Sep 11, 2009, at 7:43 AM, Glen Newton wrote:

> Paul,
>
> I saw your last post and now understand the issues you face.
>
> I don't think there has been any effort to produce a
> reduced-memory-footprint configurable (RMFC) Lucene. With the many
> mobile devices, embedded and other reduced memory devices, should this
> perhaps be one of the areas the Lucene community looks into?
>
> -Glen
>
> 2009/9/11  <Paul_Murdoch@emainc.com>:
>> Thanks Glen!
>>
>> I will take a look at your project.  Unfortunately I will only have
>> 512 MB to 1024 MB to work with, as Lucene is only one component in a
>> larger software system running on one machine.  I agree with you on
>> the C/C++ comment.  That is what I would normally use for
>> memory-intensive software.  It turns out that the larger the file you
>> want to index, the larger the heap space you will need.  What I would
>> like to see is a way to "throttle" the indexing process to control the
>> memory footprint.  I understand that this will take longer, but if I
>> perform the task during off hours it shouldn't matter.  At least the
>> file will be indexed correctly.
>>
>> Thanks,
>> Paul
>>
>>
>> -----Original Message-----
>> From: java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org
>> [mailto:java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org]
>> On Behalf Of Glen Newton
>> Sent: Friday, September 11, 2009 9:53 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Indexing large files? - No answers yet...
>>
>> In this project:
>> http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
>>
>> I concatenate all the text of all of the articles of a single journal
>> into a single text file.
>> This can create a text file that is 500MB in size.
>> Lucene is OK indexing files of this size (even in parallel), but I
>> have a heap size of 8GB.
>>
>> I would suggest increasing your heap to as large as your machine can
>> reasonably take.
>> The reality is that Java programs (like Lucene) take up more memory
>> than a similar C or even C++ program.
>> Java may approach C/C++ in speed, but not memory.
>>
>> We don't use Java because of its memory footprint!  ;-)
>>
>> See:
>> Programming language shootout: speed:
>> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>> Programming language shootout: memory:
>> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>>
>> -glen
>>
>> 2009/9/11 Dan OConnor <doconnor@acquiremedia.com>:
>>> Paul:
>>>
>>> My first suggestion would be to update your JVM to the latest
>>> version (or at least 1.6.0_14). Several garbage-collection
>>> issues were resolved in updates 10 through 13 (especially ones
>>> dealing with large heaps).
>>>
>>> Next, your IndexWriter parameters would help figure out why you
>>> are using so much RAM:
>>>       getMaxFieldLength()
>>>       getMaxBufferedDocs()
>>>       getMaxMergeDocs()
>>>       getRAMBufferSizeMB()
>>>
>>> How often are you calling commit?
>>> Do you close your IndexWriter after every document?
>>> How many documents of this size are you indexing?
>>> Have you used Luke to look at your index?
>>> If this is a large index, have you optimized it recently?
>>> Are there any searches going on while you are indexing?
>>>
>>>
>>> Regards,
>>> Dan
>>>
>>>
>>> -----Original Message-----
>>> From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com]
>>> Sent: Friday, September 11, 2009 7:57 AM
>>> To: java-user@lucene.apache.org
>>> Subject: RE: Indexing large files? - No answers yet...
>>>
>>> This issue is still open.  Any suggestions/help with this would be
>>> greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>>
>>> -----Original Message-----
>>> From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>>> [mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org]
>>> On Behalf Of Paul_Murdoch@emainc.com
>>> Sent: Monday, August 31, 2009 10:28 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Indexing large files?
>>>
>>> Hi,
>>>
>>> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
>>> consistently receiving "OutOfMemoryError: Java heap space" when trying
>>> to index large text files.
>>>
>>> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
>>> max. heap size.  So I increased the max. heap size to 512 MB.  This
>>> worked for the 5 MB text file, but Lucene still used 84 MB of heap space
>>> to do this.  Why so much?
>>>
>>> The class FreqProxTermsWriterPerField appears to be the biggest memory
>>> consumer by far according to JConsole and the TPTP Memory Profiling
>>> plugin for Eclipse Ganymede.
>>>
>>> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
>>> max. heap size.  Increasing the max. heap size to 1024 MB works, but
>>> Lucene uses 826 MB of heap space while performing this.  That still seems
>>> like far too much memory for the task.  I'm sure larger files would cause
>>> the error again, since memory use appears to grow with file size.
>>>
>>> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
>>> practice for indexing large files?  Here is a code snippet that I'm
>>> using:
>>>
>>> // Index the content of a text file.
>>>     private Boolean saveTXTFile(File textFile, Document textDocument)
>>>             throws CIDBException {
>>>         try {
>>>             Boolean isFile = textFile.isFile();
>>>             Boolean hasTextExtension = textFile.getName().endsWith(".txt");
>>>
>>>             if (isFile && hasTextExtension) {
>>>                 System.out.println("File " + textFile.getCanonicalPath()
>>>                         + " is being indexed");
>>>                 Reader textFileReader = new FileReader(textFile);
>>>                 if (textDocument == null)
>>>                     textDocument = new Document();
>>>                 textDocument.add(new Field("content", textFileReader));
>>>                 indexWriter.addDocument(textDocument); // BREAKS HERE!!!!
>>>             }
>>>         } catch (FileNotFoundException fnfe) {
>>>             System.out.println(fnfe.getMessage());
>>>             return false;
>>>         } catch (CorruptIndexException cie) {
>>>             throw new CIDBException("The index has become corrupt.");
>>>         } catch (IOException ioe) {
>>>             System.out.println(ioe.getMessage());
>>>             return false;
>>>         }
>>>         return true;
>>>     }
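
One possible way to get the "throttling" Paul asks for, which nobody in this thread proposes: split each large file into bounded pieces and index every piece as its own document. As far as I understand, IndexWriter can only flush its buffered postings between documents, so a single huge document has to fit in memory in full, while many small ones need not. This assumes per-piece search hits are acceptable and that phrase matches spanning a piece boundary do not matter. The sketch below uses only the JDK. ChunkedTextSplitter, ChunkHandler and forEachChunk are hypothetical names, and the Lucene calls appear only in a comment, following the snippet above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ChunkedTextSplitter {

    /** Receives one piece of text at a time (hypothetical callback type). */
    public interface ChunkHandler {
        void handle(String chunk) throws IOException;
    }

    /**
     * Reads the whole reader, passing pieces of at most maxChars characters
     * to the handler. Pieces end just after a whitespace character when the
     * buffer contains one, so words are not cut; a run longer than maxChars
     * with no whitespace is split at the buffer boundary.
     * Returns the number of pieces handed out.
     */
    public static int forEachChunk(Reader reader, int maxChars, ChunkHandler handler)
            throws IOException {
        if (maxChars < 1) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        char[] buf = new char[maxChars];
        int filled = 0; // number of valid characters currently in buf
        int chunks = 0;
        while (true) {
            int n = reader.read(buf, filled, maxChars - filled);
            boolean eof = n < 0;
            if (n > 0) {
                filled += n;
            }
            if (filled < maxChars && !eof) {
                continue; // keep reading until the buffer is full or input ends
            }
            if (filled == 0) {
                break; // input ended and nothing is left to emit
            }
            int cut = filled;
            if (!eof) {
                // More input may follow: back up to the last whitespace, if any.
                int ws = filled - 1;
                while (ws >= 0 && !Character.isWhitespace(buf[ws])) {
                    ws--;
                }
                if (ws >= 0) {
                    cut = ws + 1; // the whitespace stays with this piece
                }
            }
            handler.handle(new String(buf, 0, cut));
            chunks++;
            // Move the unfinished word (if any) to the front of the buffer.
            System.arraycopy(buf, cut, buf, 0, filled - cut);
            filled -= cut;
            if (eof) {
                break;
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        int n = forEachChunk(new StringReader("aaa bbb ccc ddd"), 6, chunk -> {
            // In the indexing code, this is where one would create a new
            // Document, add new Field("content", new StringReader(chunk)),
            // and call indexWriter.addDocument(...), as in the snippet above.
            System.out.println("[" + chunk + "]");
        });
        System.out.println(n + " chunks");
    }
}
```

For a real file one would pass new FileReader(textFile) and a much larger maxChars. A suitable value is something to find by measurement against the available heap, not a number this sketch can supply.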
>>>
>>> Thanks much,
>>>
>>> Paul
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>>
>>
>> --
>>
>> -
>>
>>
>>
>
>
>
> -- 
>
> -
>
>



