lucene-issues mailing list archives

From "Robert Muir (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-7633) Change the ExtractingRequestHandler to use Tika-Server
Date Wed, 04 Dec 2019 13:29:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987865#comment-16987865 ]

Robert Muir commented on SOLR-7633:
-----------------------------------

Trying to resurrect interest in this ancient issue.

Tika has its own server, so it seems the integration could be greatly simplified (either
server-side or client-side) by just using Tika's server and then indexing the result. I have
not looked at Tika's API there, but it is probably easy to mock its responses for tests, and
TONS of third-party dependencies go away.
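
As a rough illustration of the client-side variant, here is a minimal sketch, not a proposed
patch. It assumes a Tika Server listening on its default port 9998 and a local Solr; the
collection name `docs`, the field name `content_txt`, and all function names are hypothetical.
The HTTP call is passed in as a parameter so that Tika's responses can be stubbed in tests.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998"       # assumption: Tika Server on its default port
SOLR_URL = "http://localhost:8983/solr"  # assumption: Solr running locally
COLLECTION = "docs"                      # hypothetical collection name


def http_send(request):
    """Send a prepared request and return the response body as bytes."""
    with urllib.request.urlopen(request) as response:
        return response.read()


def tika_request(data, tika_url=TIKA_URL):
    """Build a PUT to Tika Server's /tika endpoint, asking for plain text back."""
    return urllib.request.Request(
        tika_url + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def solr_request(doc, solr_url=SOLR_URL, collection=COLLECTION):
    """Build a POST of a one-document JSON array to Solr's update handler."""
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        f"{solr_url}/{collection}/update?commit=true",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(doc_id, data, send=http_send):
    """Extract text with Tika Server, then index the result in Solr."""
    text = send(tika_request(data)).decode("utf-8")
    doc = {"id": doc_id, "content_txt": text}  # hypothetical field name
    send(solr_request(doc))
    return doc
```

In a test, `send` can be replaced with a function that returns canned bytes, so neither Tika
nor Solr has to be running and no Tika jars are needed on the classpath.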

> Change the ExtractingRequestHandler to use Tika-Server
> ------------------------------------------------------
>
>                 Key: SOLR-7633
>                 URL: https://issues.apache.org/jira/browse/SOLR-7633
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Chris A. Mattmann
>            Priority: Major
>              Labels: memex
>             Fix For: 5.0.1
>
>
> It's a pain to upgrade Tika's jars every time we release, and if Tika fails it messes up
> the ExtractingRequestHandler (e.g., the document type caused Tika to fail, etc.). A more
> reliable, decoupled, and easier-to-deploy version of the ExtractingRequestHandler would
> make a network call to the Tika JAXRS server, calling Tika from the Solr server side, get
> the results, and then index the information that way. I have a patch in the works from the
> DARPA Memex project and I hope to post it soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


