lucene-solr-user mailing list archives

From Justin Lee <lee.justi...@gmail.com>
Subject Bypassing ExtractingRequestHandler
Date Fri, 10 Jun 2016 01:20:07 GMT
Has anybody had any experience bypassing ExtractingRequestHandler and
simply managing Tika manually?  I want to make a small modification to Tika
to get and save additional data from my PDFs, but I have been
procrastinating in no small part due to the unpleasant prospect of setting
up a development environment where I could compile and debug modifications
that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
occurs to me that it would be much easier if Tika and Solr were decoupled,
so I could have direct control over Tika and just submit the text to Solr
after extraction.  Am I going to regret this approach?  I'm not sure what
ExtractingRequestHandler really does for me that Tika doesn't already do.
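
To make the idea concrete, here is a rough sketch of the decoupled flow I
have in mind. It assumes Tika runs as a standalone server (tika-server,
listening on its default port 9998) and that Solr accepts JSON documents on
its /update handler. The core name "mycore", the field name "content", and
the helper function names are placeholders I made up for illustration, not
anything defined by Solr or Tika.

```python
import json
import urllib.request

# Assumed endpoints; adjust to your own Tika server and Solr core.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def build_tika_request(pdf_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking the Tika server for plain-text output."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def build_solr_request(doc_id, text, extra_fields=None, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that sends one JSON document to Solr."""
    # "content" is a hypothetical field name; it must exist in your schema.
    doc = {"id": doc_id, "content": text}
    # Any additional data pulled out of the PDF goes in as ordinary fields.
    doc.update(extra_fields or {})
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_pdf(path, doc_id, extra_fields=None):
    """Extract text with Tika, then submit it to Solr as a separate step."""
    with open(path, "rb") as f:
        pdf_bytes = f.read()
    with urllib.request.urlopen(build_tika_request(pdf_bytes)) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(doc_id, text, extra_fields)) as resp:
        return resp.status
```

With something like this, the extraction step could be swapped for a
modified Tika build without touching Solr at all.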

Also, I was reading this
<http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly>
Stack Overflow entry, where someone mentioned offhandedly that
ExtractingRequestHandler might be separated out in the future anyway. Is
there a public roadmap for the project, or does one have to follow the
developers' mailing list and hunt through JIRA entries to keep up with the
pulse of the project?

Thanks,
Justin
