manifoldcf-dev mailing list archives

From "Donald Van den Driessche (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CONNECTORS-1557) HTML Tag extractor
Date Wed, 21 Nov 2018 08:24:00 GMT
Donald Van den Driessche created CONNECTORS-1557:
----------------------------------------------------

             Summary: HTML Tag extractor
                 Key: CONNECTORS-1557
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
             Project: ManifoldCF
          Issue Type: New Feature
            Reporter: Donald Van den Driessche


I wrote an HTML Tag extractor, based on the HTML Extractor.

I needed to extract specific HTML tags and map each one to its own field in my output repository.

Input
 * Englobing tag (CSS selector)
 * Blacklist (CSS selector)
 * Fieldmapping (CSS selector)
 * Strip HTML

Process
 * Retrieve Englobing tag
 * Remove blacklist
 * Map each CSS selector in Fieldmapping to its field (as an array if there are multiple matches) and strip HTML (if requested)
 * Englobing tag minus blacklist: strip HTML (if requested) and return as output (content)
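
The steps above can be illustrated with a minimal sketch. It is not the connector's source code: it is written in Python with only the standard library, it assumes well-formed markup, and it supports only very simple selectors (tag, .class, #id, tag.class) rather than full CSS. All function and parameter names (extract, englobing, blacklist, field_mapping, strip_html) are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def matches(el, selector):
    """Match a simple selector: 'tag', '.class', '#id', 'tag.class' or 'tag#id'."""
    for sep, attr in (("#", "id"), (".", "class")):
        if sep in selector:
            tag, value = selector.split(sep, 1)
            if tag and el.tag != tag:
                return False
            return value in el.get(attr, "").split()
    return el.tag == selector


def select(root, selector):
    """Return root and all descendants that match the selector, in document order."""
    return [el for el in root.iter() if matches(el, selector)]


def remove_blacklisted(scope, blacklist):
    """Remove every descendant of scope that matches a blacklist selector.

    Simplification: text that directly follows a removed element is dropped too.
    """
    for parent in list(scope.iter()):
        for child in list(parent):
            if any(matches(child, sel) for sel in blacklist):
                parent.remove(child)


def render(el, strip_html):
    """Return the element as plain text (strip_html=True) or as markup."""
    if strip_html:
        # Join text nodes with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(el.itertext()).split())
    clone = copy.copy(el)
    clone.tail = None  # do not serialize text that follows the element
    return ET.tostring(clone, encoding="unicode")


def extract(html, englobing, blacklist, field_mapping, strip_html):
    """Run the four steps: englobing tag, blacklist, field mapping, content."""
    root = ET.fromstring(html)
    found = select(root, englobing)
    if not found:
        return None, {}
    scope = found[0]
    remove_blacklisted(scope, blacklist)
    fields = {}
    for field, selector in field_mapping.items():
        values = [render(el, strip_html) for el in select(scope, selector)]
        if values:
            # A single match stays a scalar; several matches become an array.
            fields[field] = values[0] if len(values) == 1 else values
    return render(scope, strip_html), fields


page = (
    '<html><body><div id="main"><h1>Title</h1>'
    '<p class="tag">a</p><p class="tag">b</p>'
    '<div class="ads">buy</div></div></body></html>'
)
content, fields = extract(
    page, "#main", [".ads"], {"title": "h1", "tags": "p.tag"}, strip_html=True
)
print(content)
print(fields)
```

In this sketch the blacklist is applied before the field mapping, so a blacklisted element never reaches a mapped field or the returned content. The actual extractor presumably relies on a real CSS selector engine, which this example does not attempt to reproduce.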

How can I best deliver the source code?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
