manifoldcf-dev mailing list archives

From "Olivier Tavard (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CONNECTORS-1500) HTML Extractor transformation connector contribution
Date Thu, 15 Mar 2018 14:51:00 GMT
Olivier Tavard created CONNECTORS-1500:
------------------------------------------

             Summary: HTML Extractor transformation connector contribution
                 Key: CONNECTORS-1500
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
             Project: ManifoldCF
          Issue Type: Improvement
    Affects Versions: ManifoldCF 2.9.1
            Reporter: Olivier Tavard
         Attachments: html_extractor_transformation_connector.txt

Hi,

I developed a transformation connector based on Jsoup. The goal of this code is simply to
choose an enclosing tag in an HTML document from which to extract text. Inside this tag, the
connector lets you remove the subparts that you do not want: for example, all tags of
declared types, or tags matching specific attribute names.
The code is under the Apache License, Version 2.0, and is attached.
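
To illustrate the idea only, here is a rough sketch. It is not the attached connector code
and it does not use Jsoup: it relies on Python's standard html.parser so that it runs
without dependencies. The names EnclosingTagExtractor and extract_text, the choice of "id"
to pick the enclosing tag, and the choice of "class" to pick subparts to drop are all
assumptions made for this example. It also assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class EnclosingTagExtractor(HTMLParser):
    """Collect text inside one enclosing element, skipping excluded subparts.

    Hypothetical helper for illustration; not part of ManifoldCF or Jsoup.
    """

    def __init__(self, keep_tag, keep_id, drop_tags=(), drop_classes=()):
        super().__init__(convert_charrefs=True)
        self.keep_tag = keep_tag
        self.keep_id = keep_id
        self.drop_tags = set(drop_tags)
        self.drop_classes = set(drop_classes)
        self.keep_depth = 0  # nesting depth inside the enclosing element
        self.drop_depth = 0  # nesting depth inside an excluded subpart
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.keep_depth == 0:
            # Still outside: only the enclosing element can open the region.
            if tag == self.keep_tag and attrs.get("id") == self.keep_id:
                self.keep_depth = 1
            return
        self.keep_depth += 1
        classes = set((attrs.get("class") or "").split())
        if (self.drop_depth > 0 or tag in self.drop_tags
                or classes & self.drop_classes):
            self.drop_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.keep_depth == 0:
            return
        self.keep_depth -= 1
        if self.drop_depth > 0:
            self.drop_depth -= 1

    def handle_data(self, data):
        # Keep text only inside the enclosing element and outside dropped parts.
        if self.keep_depth > 0 and self.drop_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, keep_tag, keep_id, drop_tags=(), drop_classes=()):
    parser = EnclosingTagExtractor(keep_tag, keep_id, drop_tags, drop_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ('<html><body><p>header</p><div id="content"><p>Hello</p>'
        '<nav><a href="#">menu</a></nav><p class="ad">buy</p>'
        '<p>World</p></div><p>footer</p></body></html>')
print(extract_text(page, "div", "content", drop_tags=["nav"], drop_classes=["ad"]))
```

In this sketch, only the text of the div with id "content" is kept, and the nav element and
the paragraph with class "ad" are left out of the result.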

It needs some work, including code refactoring, class renaming, and unit tests, which I will
be able to do if you are interested in the code.
The documentation is here:

https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector

 

It does not use any libraries other than the ones already present in the MCF project. It is
based on the Jsoup library in the lib folder.

Best regards,

Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
