manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CONNECTORS-1088) Augment Tika extractor to allow full use of boilerpipe content extraction
Date Tue, 28 Oct 2014 07:58:33 GMT
Karl Wright created CONNECTORS-1088:
---------------------------------------

             Summary: Augment Tika extractor to allow full use of boilerpipe content extraction
                 Key: CONNECTORS-1088
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1088
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Tika extractor
    Affects Versions: ManifoldCF 1.8, ManifoldCF 2.0
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 1.8, ManifoldCF 2.0


Boilerpipe can process content further than our current Tika extractor implementation
allows.  Specifically, we should allow a user to specify a Boilerpipe extractor class
from the following package (and possibly from other packages as well):

http://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html

If the extractor is specified, then our ContentHandler creation code in the Tika extractor
changes from:

{code}
            ContentHandler handler = new BodyContentHandler(w);
{code}

to:

{code}
            // Default handler; it is wrapped only if the extractor loads successfully.
            ContentHandler handler = new BodyContentHandler(w);
            // Resolve the configured short name against the boilerpipe extractors package.
            boilerpipe = "de.l3s.boilerpipe.extractors." + boilerpipe;
            try {
              ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
              Class<?> extractorClass = loader.loadClass(boilerpipe);

              BoilerpipeExtractor boilerpipeExtractor = (BoilerpipeExtractor)extractorClass.newInstance();
              handler = new BoilerpipeContentHandler(handler, boilerpipeExtractor);
            } catch (ClassNotFoundException e) {
              log.warn("BoilerpipeExtractor " + boilerpipe + " not found!");
            } catch (InstantiationException e) {
              log.warn("Could not instantiate " + boilerpipe);
            } catch (Exception e) {
              // Any other failure (e.g. a class that is not a BoilerpipeExtractor) keeps the default handler.
              log.warn(e.toString());
            }
{code}
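
The load-by-short-name pattern with a fallback can be tried in isolation. The following is only a rough sketch, not the proposed ManifoldCF change: so that it runs without boilerpipe or Tika on the classpath, it assumes the java.util package as a stand-in for de.l3s.boilerpipe.extractors and java.util.List as a stand-in for BoilerpipeExtractor. The names ExtractorLoaderSketch, PACKAGE_PREFIX and loadOrDefault are hypothetical. It also checks the loaded class's type before instantiating it, which is one possible way to get a clearer warning than a ClassCastException.

```java
import java.util.LinkedList;
import java.util.List;

/** Stand-in sketch: resolves a short class name under a fixed package prefix. */
final class ExtractorLoaderSketch {

  // Assumption: stands in for "de.l3s.boilerpipe.extractors." so the sketch runs on a bare JDK.
  static final String PACKAGE_PREFIX = "java.util.";

  /**
   * Returns a new instance of PACKAGE_PREFIX + shortName when that class can be
   * loaded, is a List, and has an accessible no-argument constructor.
   * Otherwise logs a warning to stderr and returns the supplied fallback.
   */
  static List<?> loadOrDefault(String shortName, List<?> fallback) {
    if (shortName == null || shortName.isEmpty()) {
      return fallback; // nothing configured: keep the default
    }
    String className = PACKAGE_PREFIX + shortName;
    try {
      Class<?> candidate = Class.forName(className);
      // Check the type up front instead of relying on a failed cast later.
      if (!List.class.isAssignableFrom(candidate)) {
        System.err.println(className + " is not a List; using default");
        return fallback;
      }
      return (List<?>) candidate.getDeclaredConstructor().newInstance();
    } catch (ClassNotFoundException e) {
      System.err.println(className + " not found; using default");
    } catch (ReflectiveOperationException e) {
      // Covers missing, inaccessible, or failing no-argument constructors.
      System.err.println("Could not instantiate " + className + ": " + e);
    }
    return fallback;
  }

  public static void main(String[] args) {
    List<?> fallback = new LinkedList<String>();
    // Loadable List implementation: a fresh instance is returned.
    System.out.println(loadOrDefault("ArrayList", fallback).getClass().getName());
    // Unknown class: the fallback is returned.
    System.out.println(loadOrDefault("NoSuchList", fallback).getClass().getName());
    // Class exists but is not a List: the fallback is returned.
    System.out.println(loadOrDefault("HashMap", fallback).getClass().getName());
  }
}
```

In the Tika extractor the fallback would correspond to the plain BodyContentHandler, so a misconfigured extractor name would degrade to the current behavior rather than fail the document.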




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
