I’m using ManifoldCF with Solr and trying to get it working as a web crawler. Crawling the websites (HTML, text) works fine. The problem is that links to binary documents (PDF, XLSX, DOCX, …) don’t work, even though I added a Tika transformation to the job. I haven’t even found written confirmation that the web crawler connector supports binary documents, although some posts on the mailing lists indicate that it is possible.
The documents are apparently recognized: I put a direct link to a PDF document in the seeds, and it is processed when I run the job.
But there is no error (Tika errors are not ignored!) and the document is not transferred to Solr. Without an error message I have nothing to work with …
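For anyone who wants to reproduce the "not transferred to Solr" check, here is a minimal sketch of how it can be verified directly against Solr's standard `/select` handler. It assumes the crawled document's URL ends up in the `id` field; the base URL, the core name `collection1` and the PDF URL are placeholders, not values from my setup.

```python
import json
import urllib.parse
import urllib.request


def build_solr_query_url(solr_base, core, doc_id, id_field="id"):
    """Build a Solr /select URL that looks up one document by its id field.

    Assumption: the crawled document's URL is stored in `id_field`.
    """
    # Quote the value so the colon and slashes of a URL-style id
    # are not interpreted as Solr query syntax.
    escaped = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    params = urllib.parse.urlencode(
        {"q": f'{id_field}:"{escaped}"', "wt": "json", "rows": 1}
    )
    return f"{solr_base.rstrip('/')}/{core}/select?{params}"


def is_indexed(solr_base, core, doc_id):
    """Return True if Solr reports at least one hit for the given id."""
    url = build_solr_query_url(solr_base, core, doc_id)
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return data["response"]["numFound"] > 0


if __name__ == "__main__":
    # Placeholder values: adjust to the local Solr instance and seed URL.
    print(is_indexed("http://localhost:8983/solr", "collection1",
                     "http://example.com/docs/report.pdf"))
```

If this prints `False` for the PDF but `True` for an HTML page from the same crawl, the document is being dropped somewhere between the crawler and the index rather than just being hard to find in search results.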
Any ideas or hints on what to do? Does anybody know of a tutorial for setting up a web crawler with Solr and Tika? I haven’t found any on the web, which makes me wonder whether I’m trying something impossible here.
Thanks in advance.