manifoldcf-dev mailing list archives

From "Olivier Tavard (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution
Date Fri, 10 Aug 2018 07:14:00 GMT


Olivier Tavard commented on CONNECTORS-1500:


I made a minor patch that fixes the log levels of the messages emitted by the connector and
removes some of them. Could you please integrate it into trunk?




> HTML Extractor transformation connector contribution
> ----------------------------------------------------
>                 Key: CONNECTORS-1500
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Improvement
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Olivier Tavard
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.10
>         Attachments: fix_englobing_tag_selection.txt, global_patch.txt, html_extractor_transformation_connector.txt,
> Hi,
> I developed a transformation connector based on Jsoup. Its goal is simply to let you
> choose an enclosing tag in an HTML document for text extraction. Inside this tag, the
> connector lets you remove the subparts that you do not want: for example, all the tags
> corresponding to declared types or to specific attribute names.
> The code is under the Apache License, Version 2.0, and is attached.
> It needs some work, including code refactoring, class renaming, and unit tests, which I
> will be able to do if you are interested in the code.
> The documentation is here :
> []<[]
> It does not use any libraries other than the ones already present in the MCF project. It
> is based on the Jsoup library in the lib folder.
> Best regards,
> Olivier

This message was sent by Atlassian JIRA
