manifoldcf-user mailing list archives

From Arcadius Ahouansou <arcad...@menelic.com>
Subject Extracting Content from Web Crawler using the new PipeLine
Date Thu, 23 Oct 2014 01:21:56 GMT
Hello.

Given that we now have pipelines in ManifoldCF, how feasible is it to:

- use Tika's Boilerpipe support to get cleaner content from web sites?
- extract specific HTML tags, such as all h1 or h2 elements, and map
them to a Solr field?
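
To make the second point concrete, here is a minimal sketch of the kind of extraction I have in mind. It is plain Python using only the standard library, outside ManifoldCF, and is not a ManifoldCF or Tika API. The names HeadingExtractor and headings_to_solr_fields, and the Solr field names h1_txt and h2_txt, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collects the text of every heading element listed in `tags`."""

    def __init__(self, tags=("h1", "h2")):
        super().__init__(convert_charrefs=True)
        self.tags = set(tags)
        self.fields = {tag: [] for tag in tags}
        self._current = None  # heading tag currently open, if any
        self._buffer = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        # Start buffering only when a wanted heading opens.
        if tag in self.tags and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <em>) is kept as well.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.fields[tag].append(text)
            self._current = None


def headings_to_solr_fields(html):
    """Map h1/h2 text to hypothetical multi-valued Solr field names."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return {"h1_txt": parser.fields["h1"], "h2_txt": parser.fields["h2"]}
```

Inside ManifoldCF, the equivalent logic would presumably live in a transformation connector in the pipeline and add the values as document metadata before the Solr output connector runs; whether that is practical is exactly what I am asking.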

Thank you very much.

Arcadius.
