manifoldcf-user mailing list archives

From Shinichiro Abe <shinichiro.ab...@gmail.com>
Subject Re: Extracting Content from Web Crawler using the new PipeLine
Date Thu, 23 Oct 2014 06:03:41 GMT
Hi Arcadius,

> - use Tika's BoilerPipe to get cleaner content from web sites?
Yes, the Tika extractor removes the HTML tags
and sends the content and metadata to the downstream pipeline stage or output connection.

> - What about extracting specific HTML tags such as all h1 or h2 and map them to a Solr field?
No. Currently it can map only the metadata extracted by Tika to Solr fields.
The Tika extractor doesn't capture tags such as h1, h2, or p, and doesn't treat them as metadata.
For now, to capture these tags and map them to fields,
we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
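
As a rough sketch of what that looks like on the Solr side: CAPTURE_ELEMENTS corresponds to the "capture" request parameter of the ExtractingRequestHandler, and "fmap.<tag>" maps a captured element to a field. The snippet below only builds the request URL. The host, core name (collection1), document id, and target field names (heading1_t, heading2_t) are assumptions for illustration, and build_extract_url is a hypothetical helper, not part of ManifoldCF or Solr.

```python
from urllib.parse import urlencode


def build_extract_url(solr_core_url, doc_id, tag_to_field):
    """Build a /update/extract URL that captures HTML tags into Solr fields.

    Hypothetical helper for illustration; the field names passed in must
    exist in (or match a dynamic field of) the target Solr schema.
    """
    params = [("literal.id", doc_id), ("commit", "true")]
    for tag, field in tag_to_field.items():
        # "capture" asks Solr Cell to collect the text of this element separately
        params.append(("capture", tag))
        # "fmap.<tag>" renames the captured element to the target Solr field
        params.append((f"fmap.{tag}", field))
    return f"{solr_core_url.rstrip('/')}/update/extract?{urlencode(params)}"


# Assumed host, core, id, and field names; adjust to the actual setup.
url = build_extract_url(
    "http://localhost:8983/solr/collection1",
    "doc1",
    {"h1": "heading1_t", "h2": "heading2_t"},
)
print(url)
```

The HTML document itself would then be POSTed to that URL as the request body. In this setup the tag extraction happens inside Solr, not in the ManifoldCF Tika extractor.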

Regards,
Shinichiro Abe

On 2014/10/23, at 10:21, Arcadius Ahouansou <arcadius@menelic.com> wrote:

> 
> Hello.
> 
> Given that we now have pipelines in ManifoldCF, how feasible is it to:
> 
> - use Tika's BoilerPipe to get cleaner content from web sites?
> - What about extracting specific HTML tags such as all h1 or h2 and map them to a Solr field?
> 
> Thank you very much.
> 
> Arcadius.
>   

