manifoldcf-user mailing list archives

From "Farrenkopf, Sven" <Sven.Farrenk...@dreso.com>
Subject Using ManifoldCF as a web crawler with Tika and Solr
Date Tue, 14 Aug 2018 09:40:11 GMT
I'm using ManifoldCF with Solr and trying to get it working as a web crawler. Crawling web pages
(HTML, text) works fine. The problem is that links to binary documents (pdf, xlsx, docx, ...)
don't work, even when I add a Tika transformation to the job. I haven't even found written
confirmation that the web crawler connector supports binary documents, although some posts
on the mailing lists indicate that it is possible.

The documents are apparently recognized: I put a direct link to a PDF document in the seeds,
and it is processed when I run the job.

But no error is reported (Tika errors are not ignored!) and the document is not transferred to
Solr. Without an error message I have nothing to work with ...
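
A minimal sketch of how the "not transferred to Solr" observation could be checked from
outside ManifoldCF. It is not taken from my setup; it assumes a Solr instance at
http://localhost:8983/solr, a core named `collection1`, and that the indexed document's `id`
field holds the document URL. The helper names (`build_select_url`, `num_found`,
`is_indexed`) and the example PDF URL are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_select_url(solr_base, core, doc_id, id_field="id"):
    """Build a Solr /select URL that looks up a single document by its id field."""
    # Wrap the id in a quoted phrase so ':' and '/' in a URL-style id
    # are not interpreted as query syntax; escape backslashes and quotes.
    escaped = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    query = f'{id_field}:"{escaped}"'
    # rows=0: only the hit count is needed, not the stored fields.
    params = urllib.parse.urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{solr_base.rstrip('/')}/{core}/select?{params}"


def num_found(response_text):
    """Extract the hit count from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]


def is_indexed(solr_base, core, doc_id):
    """Return True if Solr reports at least one document with this id."""
    url = build_select_url(solr_base, core, doc_id)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return num_found(resp.read().decode("utf-8")) > 0


if __name__ == "__main__":
    # Assumed values; replace with the real Solr location and seed URL.
    print(is_indexed("http://localhost:8983/solr", "collection1",
                     "http://example.com/docs/report.pdf"))
```

If this prints False for the PDF but True for an HTML page from the same crawl, the document
is being dropped somewhere between the crawler and Solr rather than merely being hard to find
in the index.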

Any ideas or hints on what to do? Does anybody know of a tutorial for setting up a web crawler with
Solr and Tika? I haven't found any on the web, which makes me wonder whether I'm trying something
impossible here.

Thanks in advance.

Sven
