lucene-solr-user mailing list archives

From: Alexandre Rafalovitch <arafa...@gmail.com>
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
Date: Tue, 10 Apr 2018 14:40:40 GMT

I know it was a joke, but I've been thinking of something like that.
Not a chatbot per se, but perhaps something that uses Machine
Learning/topic clustering on the past discussions and matches them to
new questions. It would still need to be rechecked by a human for the
final response, but could be very helpful. I certainly wished for that
many times as I was answering newbies' questions (or my own).

And, I feel, the current version of Solr actually has all the pieces to
make such a thing happen. Could be a fun project/demo/service for
the next Lucene/Solr Revolution for somebody with time on their hands
:-)

Regards,
   Alex.

On 9 April 2018 at 13:24, Allison, Timothy B. <tallison@mitre.org> wrote:
> +1
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> We should add a chatbot to the list that includes Charlie's advice and the link to Erick's
> blog post whenever Tika is used. 😊
>
>
> -----Original Message-----
> From: Charlie Hull [mailto:charlie@flax.co.uk]
> Sent: Monday, April 9, 2018 12:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead
> of Solr's MostlyPassthroughHtmlMapper ?
>
> I'd recommend you run Tika externally to Solr, which will allow you to catch this kind
> of problem and prevent it bringing down your Solr installation.
>
> Cheers
>
> Charlie
>
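
A minimal sketch of what running Tika outside Solr could look like,
assuming Java is on the PATH and a tika-app.jar is available locally.
The jar location, the content_txt field name, and the helper names are
hypothetical choices for illustration, not anything prescribed in this
thread. The point is only that a parser failure or hang stays in a
child process, and Solr receives plain text that has already been
extracted.

```python
import json
import subprocess
import urllib.request
from typing import Optional

# Assumed location of the Tika app jar; adjust for your installation.
TIKA_JAR = "tika-app.jar"


def extract_text(path: str, timeout_s: int = 60, run=subprocess.run) -> Optional[str]:
    """Run tika-app on one file; return plain text, or None on failure."""
    cmd = ["java", "-jar", TIKA_JAR, "--text", path]
    try:
        result = run(cmd, capture_output=True, text=True,
                     timeout=timeout_s, check=True)
    except (subprocess.CalledProcessError,
            subprocess.TimeoutExpired, OSError) as exc:
        # The failure is confined to the child process; log it and move on.
        print(f"Tika failed on {path}: {exc}")
        return None
    return result.stdout


def to_solr_doc(doc_id: str, text: str) -> dict:
    """Build a Solr document; 'content_txt' is a hypothetical field name."""
    return {"id": doc_id, "content_txt": text}


def post_to_solr(update_url: str, docs: list) -> int:
    """POST a JSON array of documents to a Solr update handler URL."""
    req = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status
```

A caller would loop over its files, skip any for which extract_text
returns None, and send the rest with post_to_solr against the
collection's /update URL. The Lucidworks post linked above describes
the same idea using SolrJ.
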
> On 9 April 2018 at 16:59, Hanjan, Harinder <Harinder.Hanjan@calgary.ca>
> wrote:
>
>> Hello!
>>
>> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
>> we have in our Sharepoint system. I have used the tika-app.jar
>> directly to extract the document in question and it does _not_ throw
>> an exception and extracts the contents just fine. So it would seem Solr
>> is doing something different than a Tika standalone installation.
>>
>> After some Googling, I found out that Solr uses its custom HtmlMapper
>> (MostlyPassthroughHtmlMapper) which passes through all elements in the
>> HTML document to Tika. As Tika limits nested elements to 100, this
>> causes Tika to throw an exception: Suspected zip bomb: 100 levels of
>> XML element nesting. This is mentioned in TIKA-2091
>> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
>> The "solution" is to use Tika's default parsing/mapping mechanism but no
>> details have been provided on how to configure this in Solr.
>>
>> I'm hoping some folks here know how to configure Solr
>> to effectively bypass its built-in MostlyPassthroughHtmlMapper and
>> use Tika's implementation.
>>
>> Thank you!
>> Harinder
>>
