lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-2886) Out of Memory Error with DIH and TikaEntityProcessor
Date Fri, 07 Sep 2012 22:22:07 GMT

     [ https://issues.apache.org/jira/browse/SOLR-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-2886:
---------------------------

    Fix Version/s:     (was: 4.0)

removing fixVersion=4.0 since there is no evidence that anyone is currently working on this
issue.  (this can certainly be revisited if volunteers step forward)

FWIW: it's not clear to me from reading the comments how Solr would/could use the suggested workaround
in the PDFBOX issue, since Solr doesn't invoke PDFBox directly and instead delegates to Tika.

If someone with more Tika knowledge can suggest a way in which Solr users can configure/influence
how Tika uses PDFBox to control this setting, that seems like it would resolve things.
                
> Out of Memory Error with DIH and TikaEntityProcessor
> ----------------------------------------------------
>
>                 Key: SOLR-2886
>                 URL: https://issues.apache.org/jira/browse/SOLR-2886
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
>    Affects Versions: 4.0-ALPHA
>            Reporter: Tricia Jenkins
>
> I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to apache-solr-4.0-2011-10-14_08-56-59.war
> and then apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 pdfs, of various sizes, using
> the TikaEntityProcessor.  My indexing would run to completion and was completely successful
> under the June build.  The only error was readability of the fulltext in highlighting.  This
> was fixed in Tika 0.10 (TIKA-611).  I chose to use the October 14 build of Solr because Tika
> 0.10 had recently been included (SOLR-2372).
> On the same machine without changing any memory settings my initial problem is a Perm Gen
> error.  Fine, I increase the PermGen space.
> I've set the "onError" parameter to "skip" for the TikaEntityProcessor.  Now I get several (6)
> SEVERE: Exception thrown while getting data
> java.net.SocketTimeoutException: Read timed out
> SEVERE: Exception in entity : tika:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url <url removed> # 2975
> pairs.  And after ~3881 documents, with auto commit set unreasonably frequently I consistently
> get an Out of Memory Error
> SEVERE: Exception while processing: f document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
> The stack trace points to org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
> and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).
> The October 30 build performs identically.
> Funny thing is that monitoring via JConsole doesn't reveal any memory issues.
> Because the out of Memory error did not occur in June, this leads me to believe that
> a bug has been introduced to the code since then.
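
For readers unfamiliar with the setup described in the report, a DIH data-config.xml along these lines is roughly what is being discussed. This is only a sketch under assumptions: the entity names "f" and "tika" come from the log output above, but the outer entity's processor, the list URL, the xpath expressions, the field names, and the timeout values are all hypothetical and not taken from the reporter's configuration. It shows onError="skip" on the TikaEntityProcessor entity, and the readTimeout attribute on the binary data source that may be relevant to the "Read timed out" errors. It does not address the OutOfMemoryError itself.

```xml
<dataConfig>
  <!-- Hypothetical source that returns an XML list of PDF URLs -->
  <dataSource name="list" type="URLDataSource"/>
  <!-- Binary source used to fetch each PDF; timeout values (ms) are illustrative -->
  <dataSource name="bin" type="BinURLDataSource"
              connectionTimeout="5000" readTimeout="30000"/>
  <document>
    <!-- Outer entity "f": how it enumerates documents is an assumption -->
    <entity name="f" dataSource="list" processor="XPathEntityProcessor"
            url="http://example.com/pdf-list.xml" forEach="/docs/doc">
      <field column="id" xpath="/docs/doc/id"/>
      <field column="pdfUrl" xpath="/docs/doc/url"/>
      <!-- Inner entity "tika": skip a document when extraction fails -->
      <entity name="tika" dataSource="bin" processor="TikaEntityProcessor"
              url="${f.pdfUrl}" format="text" onError="skip">
        <field column="text" name="fulltext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With onError="skip", a failure in the inner entity causes that document to be skipped rather than aborting the whole import, which matches the behaviour the reporter describes for the timeout errors.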

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

