lucene-solr-user mailing list archives

From Siegfried Goeschl <sgoes...@gmx.at>
Subject Re: OutOfMemoryError for PDF document upload into Solr
Date Fri, 16 Jan 2015 08:25:10 GMT
Hi Dan,

neat idea - made a mental note :-)

That brings us back to the point that in complex setups you should not
do the document pre-processing directly in Solr but in a separate import
process, which can safely crash when processing a 4 GB PDF file.
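
A minimal sketch of such an import step, assuming the pdftotext approach Dan describes below and assuming poppler's `pdftotext` binary is on the PATH. The function names and the timeout value are hypothetical and not part of Solr or Tika. The converter runs in a child process, so a crash or hang on a huge PDF only kills the child:

```python
import subprocess
import sys
from typing import List, Optional


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build a pdftotext call that writes HTML with metadata, as UTF-8, to stdout."""
    return ["pdftotext", "-htmlmeta", "-enc", "UTF-8", pdf_path, "-"]


def run_isolated(command: List[str], timeout_s: float) -> Optional[bytes]:
    """Run a converter in a child process; return its stdout, or None on failure.

    A crash, non-zero exit, timeout, or missing binary only affects the child,
    so the importer (and Solr) keep running.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed if it runs longer than this
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python extract.py big.pdf
    html = run_isolated(pdftotext_command(sys.argv[1]), timeout_s=600)
    print("extraction failed" if html is None else f"extracted {len(html)} bytes")
```

The importer would then send the extracted text to Solr as an ordinary document and skip or log the files for which `run_isolated` returns `None`. For very large outputs it is probably better to have the converter write to a file than to hold its output in memory.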

Cheers,

Siegfried Goeschl

On 16.01.15 05:02, Dan Davis wrote:
> Why re-write all the document conversion in Java ;)  Tika is very slow.   5
> GB PDF is very big.
>
> If you have a lot of PDFs like that, try pdftotext in HTML and UTF-8 output
> mode.   The HTML mode captures some meta-data that would otherwise be lost.
>
>
> If you need to go faster still, you can also write some stuff linked
> directly against the poppler library.
>
> Before you jump down my throat about Tika being slow - I wrote a PDF
> indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
> setjmp/longjmp.   But fast...
>
>
>
> On Thu, Jan 15, 2015 at 1:54 PM, <Ganesh.Yadav@sungard.com> wrote:
>
>> Siegfried and Michael Thank you for your replies and help.
>>
>> -----Original Message-----
>> From: Siegfried Goeschl [mailto:sgoeschl@gmx.at]
>> Sent: Thursday, January 15, 2015 3:45 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: OutOfMemoryError for PDF document upload into Solr
>>
>> Hi Ganesh,
>>
>> you can increase the heap size but parsing a 4 GB PDF document will very
>> likely consume A LOT OF memory - I think you need to check if that large
>> PDF can be parsed at all :-)
>>
>> Cheers,
>>
>> Siegfried Goeschl
>>
>> On 14.01.15 18:04, Michael Della Bitta wrote:
>>> Yep, you'll have to increase the heap size for your Tomcat container.
>>>
>>> http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly
>>>
>>> Michael Della Bitta
>>>
>>> Senior Software Engineer
>>>
>>> o: +1 646 532 3062
>>>
>>> appinions inc.
>>>
>>> “The Science of Influence Marketing”
>>>
>>> 18 East 41st Street
>>>
>>> New York, NY 10017
>>>
>>> t: @appinions <https://twitter.com/Appinions> | g+:
>>> plus.google.com/appinions
>>> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
>>> w: appinions.com <http://www.appinions.com/>
>>>
>>> On Wed, Jan 14, 2015 at 12:00 PM, <Ganesh.Yadav@sungard.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Can someone pass on some hints for getting around the following error? Is
>>>> there a heap size parameter I can set in Tomcat or in the Solr webapp that
>>>> gets deployed in Tomcat?
>>>>
>>>> I am running the Solr webapp inside Tomcat on my local machine, which has
>>>> 12 GB of RAM. I have a PDF document, at most 4 GB in size, that
>>>> needs to be loaded into Solr.
>>>>
>>>>
>>>>
>>>>
>>>> Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
>>>>           at java.util.AbstractCollection.toArray(Unknown Source)
>>>>           at java.util.ArrayList.<init>(Unknown Source)
>>>>           at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
>>>>           at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
>>>>           at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
>>>>           at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
>>>>           at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
>>>>           at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
>>>>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>           at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>>>           at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>>>>           at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>           at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>           at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
>>>>           at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>>>>           at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>>>>           at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>>>>           at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>>>>           at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
>>>>           at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
>>>>           at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
>>>>           at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
>>>>           at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
>>>>           at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
>>>>           at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
>>>>           at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
>>>>           at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
>>>>           at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
>>>>           at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
>>>>           at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
>>>>           at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)
>>>>
>>>>
>>>> Thanks
>>>> Ganesh
>>>>
>>>>
>>>
>>
>>
>

