lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
Date Mon, 09 Apr 2018 17:24:00 GMT
+1

https://lucidworks.com/2012/02/14/indexing-with-solrj/

We should add a chatbot to the list that includes Charlie's advice and the link to Erick's
blog post whenever Tika is used. 😊


-----Original Message-----
From: Charlie Hull [mailto:charlie@flax.co.uk] 
Sent: Monday, April 9, 2018 12:44 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of
Solr's MostlyPassthroughHtmlMapper ?

I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of
problem and prevent it bringing down your Solr installation.

Cheers

Charlie
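
One way to picture the advice above is a small client that runs extraction outside Solr and only sends plain text to the index. This is a minimal sketch, not the approach from the linked blog post (which uses SolrJ). It assumes a standalone tika-server listening on localhost:9998, a Solr collection named "docs", and fields named "id" and "content"; all of those names are illustrative assumptions. The point is that a Tika failure on one document is caught and recorded in the client instead of affecting Solr.

```python
import json
import urllib.request

# Assumed endpoints; adjust to your own installation.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def extract_text(path, tika_url=TIKA_URL):
    """Send one file to tika-server and return the extracted plain text."""
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            tika_url,
            data=fh.read(),
            method="PUT",
            headers={"Accept": "text/plain"},
        )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


def build_solr_request(docs, solr_url=SOLR_UPDATE_URL):
    """Build a JSON update request carrying a list of Solr documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def post_to_solr(request):
    """Send the prepared request to Solr and return the HTTP status."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status


def index_files(paths, extract=extract_text, send=post_to_solr):
    """Extract each file outside Solr; collect failures instead of crashing."""
    docs, failures = [], []
    for path in paths:
        try:
            # "id" and "content" are assumed field names in the Solr schema.
            docs.append({"id": path, "content": extract(path)})
        except Exception as exc:
            # A parse error (for example a suspected zip bomb) stays here,
            # in the client, and never reaches the Solr process.
            failures.append((path, str(exc)))
    if docs:
        send(build_solr_request(docs))
    return docs, failures
```

Calling index_files(["report.html"]) would return the indexed documents together with a list of (path, error) pairs for anything Tika could not parse, so problem files can be logged and reviewed separately.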

On 9 April 2018 at 16:59, Hanjan, Harinder <Harinder.Hanjan@calgary.ca>
wrote:

> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents 
> we have in our SharePoint system. I have used the tika-app.jar 
> directly to extract the document in question and it does _not_ throw 
> an exception and extracts the contents just fine. So it would seem Solr 
> is doing something different than a Tika standalone installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the 
> HTML document to Tika. As Tika limits nested elements to 100, this 
> causes Tika to throw an exception: Suspected zip bomb: 100 levels of 
> XML element nesting. This is mentioned in TIKA-2091 
> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> The "solution" is to use Tika's default parsing/mapping mechanism but no 
> details have been provided on how to configure this in Solr.
>
> I'm hoping some folks here know how to configure Solr 
> to effectively bypass its built-in MostlyPassthroughHtmlMapper and 
> use Tika's implementation.
>
> Thank you!
> Harinder
>
>
> ________________________________
> NOTICE -
> This communication is intended ONLY for the use of the person or 
> entity named above and may contain information that is confidential or 
> legally privileged. If you are not the intended recipient named above 
> or a person responsible for delivering messages or communications to 
> the intended recipient, YOU ARE HEREBY NOTIFIED that any use, 
> distribution, or copying of this communication or any of the 
> information contained in it is strictly prohibited. If you have 
> received this communication in error, please notify us immediately by 
> telephone and then destroy or delete this communication, or return it 
> to us by mail if requested by us. The City of Calgary thanks you for 
> your attention and co-operation.
>
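
The quoted message attributes the exception to HTML element nesting reaching 100 levels. As a rough diagnostic, the sketch below estimates the deepest nesting in an HTML string using only Python's standard library, which may help spot documents likely to hit that limit before they are sent for extraction. It is an approximation under stated assumptions: the helper names are hypothetical, and Tika's own HTML parser normalizes markup differently, so the depth it sees may not match this count exactly.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML, so they add no depth.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class DepthCounter(HTMLParser):
    """Track the deepest stack of open elements seen while parsing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        self.open_tags.append(tag)
        self.max_depth = max(self.max_depth, len(self.open_tags))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def max_nesting_depth(html_text):
    """Return the maximum element nesting depth found in html_text."""
    counter = DepthCounter()
    counter.feed(html_text)
    counter.close()
    return counter.max_depth
```

A document for which max_nesting_depth returns a value near or above 100 would be a candidate for the exception described above, though the only reliable confirmation is running it through Tika itself.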