manifoldcf-dev mailing list archives

From "Arcadius Ahouansou (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1088) Augment Tika extractor to allow full use of boilerpipe content extraction
Date Sat, 15 Nov 2014 11:04:34 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213524#comment-14213524 ]

Arcadius Ahouansou commented on CONNECTORS-1088:
------------------------------------------------

Thanks [~kwright@metacarta.com] for addressing this.

Arcadius.

> Augment Tika extractor to allow full use of boilerpipe content extraction
> -------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1088
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1088
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 1.8, ManifoldCF 2.0
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.8, ManifoldCF 2.0
>
>
> Boilerpipe can process content further than our current Tika extractor
> implementation allows.  Specifically, we should allow a user to specify a Boilerpipe
> extractor class from within the following package (or, one expects, from other places too):
> http://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html
> If the extractor is specified, then our ContentHandler creation code in the Tika extractor
> changes from:
> {code}
>             ContentHandler handler = new BodyContentHandler(w);
> {code}
> to:
> {code}
>             ContentHandler handler = new BodyContentHandler(w);
>             if (boilerpipe != null && boilerpipe.length() > 0) {
>               // Resolve the short class name against the Boilerpipe extractors package
>               String extractorClassName = "de.l3s.boilerpipe.extractors." + boilerpipe;
>               try {
>                 ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
>                 Class<?> extractorClass = loader.loadClass(extractorClassName);
>                 BoilerpipeExtractor boilerpipeExtractor = (BoilerpipeExtractor)extractorClass.newInstance();
>                 // Wrap the body handler so that Boilerpipe filters the content first
>                 handler = new BoilerpipeContentHandler(handler, boilerpipeExtractor);
>               } catch (ClassNotFoundException e) {
>                 log.warn("BoilerpipeExtractor " + extractorClassName + " not found!");
>               } catch (InstantiationException e) {
>                 log.warn("Could not instantiate " + extractorClassName);
>               } catch (Exception e) {
>                 log.warn(e.toString());
>               }
>             }
> {code}
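
The class-loading pattern in the snippet above can be tried in isolation. The following is only a minimal sketch, assuming nothing about the actual ManifoldCF change: TextExtractor, TrimmingExtractor, IdentityExtractor, load() and the warnings list are hypothetical stand-ins for BoilerpipeExtractor, a concrete extractor class, the unwrapped handler, the loading code and the logger, so that it runs without Tika or Boilerpipe on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractorLoaderSketch {

  /** Hypothetical stand-in for de.l3s.boilerpipe.BoilerpipeExtractor. */
  public interface TextExtractor {
    String extract(String input);
  }

  /** Hypothetical stand-in for a concrete extractor class. */
  public static class TrimmingExtractor implements TextExtractor {
    @Override
    public String extract(String input) {
      return input.trim();
    }
  }

  /** Pass-through used when no extractor is configured or loading fails. */
  public static class IdentityExtractor implements TextExtractor {
    @Override
    public String extract(String input) {
      return input;
    }
  }

  /** Collected warnings; a real connector would write to its logger instead. */
  public static final List<String> warnings = new ArrayList<>();

  /**
   * Resolves shortName against prefix, loads the class and instantiates it.
   * Falls back to the pass-through extractor on any failure.
   */
  public static TextExtractor load(String prefix, String shortName) {
    if (shortName == null || shortName.isEmpty()) {
      return new IdentityExtractor();
    }
    String className = prefix + shortName;
    try {
      ClassLoader loader = TextExtractor.class.getClassLoader();
      Class<?> cls = loader.loadClass(className);
      return (TextExtractor) cls.getDeclaredConstructor().newInstance();
    } catch (ClassNotFoundException e) {
      warnings.add("Extractor " + className + " not found");
    } catch (ReflectiveOperationException | ClassCastException e) {
      warnings.add("Could not instantiate " + className + ": " + e);
    }
    return new IdentityExtractor();
  }

  public static void main(String[] args) {
    // Nested classes have binary names of the form Outer$Inner.
    String prefix = ExtractorLoaderSketch.class.getName() + "$";
    TextExtractor found = load(prefix, "TrimmingExtractor");
    TextExtractor missing = load(prefix, "NoSuchExtractor");
    System.out.println(found.getClass().getSimpleName());
    System.out.println(missing.getClass().getSimpleName());
    System.out.println(warnings);
  }
}
```

As in the snippet above, a bad class name in this sketch only produces a warning and leaves the default behaviour in place, so a misconfigured extractor name does not abort processing.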



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
