manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Extracting Content from Web Crawler using the new PipeLine
Date Thu, 23 Oct 2014 07:51:26 GMT
Hi Abe-san,

Is this capability a configurable function of Tika?  We could add Tika
configuration to the Tika Extractor if so.

Karl

On Thu, Oct 23, 2014 at 2:03 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com>
wrote:

> Hi Arcadius,
>
> > - use Tika's BoilerPipe to get cleaner content from web sites?
> Yes, the Tika extractor will strip the HTML tags
> and send the content and metadata to the downstream pipeline/output connection.
>
> > - What about extracting specific HTML tags such as all h1 or h2 and map
> > them to a Solr field?
> No, currently it can map only the metadata extracted by Tika to Solr
> fields.
> The Tika extractor doesn't capture h1, h2, p tags etc. and doesn't
> treat them as metadata.
> Currently, to capture these tags and map them to fields,
> we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
>
> Regards,
> Shinichiro Abe
>
> On 2014/10/23, at 10:21, Arcadius Ahouansou <arcadius@menelic.com> wrote:
>
> >
> > Hello.
> >
> > Given that we now have pipelines in ManifoldCF, how feasible is it to:
> >
> > - use Tika's BoilerPipe to get cleaner content from web sites?
> > - What about extracting specific HTML tags such as all h1 or h2 and map
> > them to a Solr field?
> >
> > Thank you very much.
> >
> > Arcadius.
> >
>
>

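For readers following the ExtractingRequestHandler suggestion in the quoted reply, here is a minimal sketch of what such a request might look like. It only builds the URL and sends nothing. It assumes the CAPTURE_ELEMENTS setting corresponds to the handler's `capture` request parameter, combined with `fmap.<tag>` to rename each captured element to a Solr field. The host, core name, document id, and the `h1_txt`/`h2_txt` field names are hypothetical and would need to match your own Solr schema.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration only; it does not come from the thread.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/collection1/update/extract"


def build_extract_url(doc_id, tags, field_suffix="_txt"):
    """Build an /update/extract URL that captures the given HTML tags.

    Each tag is requested via `capture` and renamed via `fmap.<tag>` to a
    field called <tag><field_suffix> (an assumed dynamic-field naming scheme).
    """
    params = [("literal.id", doc_id)]
    for tag in tags:
        # capture=<tag>: ask the handler to index this element's text separately
        params.append(("capture", tag))
        # fmap.<tag>=<field>: map the captured element to a Solr field name
        params.append((f"fmap.{tag}", f"{tag}{field_suffix}"))
    return SOLR_EXTRACT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url("doc1", ["h1", "h2"]))
```

The document body itself would still have to be posted to that URL, which is omitted here. Note that this path does the extraction on the Solr side rather than in the ManifoldCF pipeline.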