manifoldcf-dev mailing list archives

From "Olivier Tavard (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution
Date Sat, 17 Mar 2018 22:46:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403778#comment-16403778
] 

Olivier Tavard commented on CONNECTORS-1500:
--------------------------------------------

Hello,
  
 First, I have attached a patch that fixes an issue with the selection of the enclosing
("englobing") tag.
  
 To answer your question, here is a usage example:
 Let’s say that we want to crawl the documentation page of MCF. We do not want the extracted
text to include the menu on the left of the web page, the text in the h3 headers, or the text
of any links in the page.
 To do that in MCF, we first add a Web repository connector with standard parameters. Then
we add a job that uses this Web repository connector and the HTML extractor transformation
connector.
 The seed is: [https://manifoldcf.apache.org/release/release-2.9.1/en_US/end-user-documentation.html]
 In the HTML extractor tab, the configuration is:
*englobing tag*: div#content
*html extractor tags to remove*: h3, a, div#menu

So the transformation connector will extract the text in the enclosing tag _div id="content"_.
Then it will delete all the text included in the _h3_ tags, the _a_ tags, and the _div
id="menu"_ section. It also keeps all the meta tags from the header; they will be accessible
with this syntax: jsoup_meta_name.
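
The select-then-remove behaviour described above can be sketched as follows. This is only an illustration of the logic, not the connector's code: the connector is written in Java on top of Jsoup, while this stand-in uses Python's standard `html.parser`. The names `EnclosingTagExtractor`, `extract` and `SAMPLE` are hypothetical, only `tag` and `tag#id` selectors are handled, and meta tags are stored under their plain names rather than under the connector's jsoup_meta_name field names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def parse_selector(selector):
    """Split 'div#content' into ('div', 'content'); a bare 'h3' gives ('h3', None)."""
    tag, _, elem_id = selector.strip().partition("#")
    return tag.lower(), (elem_id or None)


class EnclosingTagExtractor(HTMLParser):
    """Hypothetical helper: keep text inside one tag, minus the removed tags."""

    def __init__(self, enclosing, to_remove):
        super().__init__(convert_charrefs=True)
        self.enclosing = parse_selector(enclosing)
        self.to_remove = [parse_selector(s) for s in to_remove]
        self.stack = []        # (tag, role) for every open element
        self.keep_depth = 0    # > 0 while inside the enclosing tag
        self.skip_depth = 0    # > 0 while inside a tag to remove
        self.chunks = []       # text fragments that survive
        self.meta = {}         # <meta name=... content=...> pairs

    @staticmethod
    def _matches(selector, tag, attrs):
        sel_tag, sel_id = selector
        return sel_tag == tag and (sel_id is None or attrs.get("id") == sel_id)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        if tag in VOID_TAGS:
            return
        role = None
        if self._matches(self.enclosing, tag, attrs):
            role = "keep"
            self.keep_depth += 1
        elif any(self._matches(s, tag, attrs) for s in self.to_remove):
            role = "skip"
            self.skip_depth += 1
        self.stack.append((tag, role))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not any(t == tag for t, _ in self.stack):
            return
        # Pop up to the matching open tag, tolerating unclosed <p>, <li>, etc.
        while self.stack:
            open_tag, role = self.stack.pop()
            if role == "keep":
                self.keep_depth -= 1
            elif role == "skip":
                self.skip_depth -= 1
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Keep text only inside the enclosing tag and outside every removed tag.
        if self.keep_depth > 0 and self.skip_depth == 0:
            self.chunks.append(data)


def extract(html, enclosing, to_remove):
    """Return (whitespace-normalised text, meta dict) for the given selectors."""
    parser = EnclosingTagExtractor(enclosing, to_remove)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split()), parser.meta


# Made-up page, used only to exercise the configuration from the example above.
SAMPLE = """<html><head><meta name="author" content="MCF team"></head>
<body>
<div id="menu"><a href="/">Home</a> Left menu</div>
<div id="content">
  <h3>Section title</h3>
  <p>Body text with a <a href="x.html">link</a> inside.</p>
  <div id="menu">Nested menu</div>
</div>
</body></html>"""

text, meta = extract(SAMPLE, "div#content", ["h3", "a", "div#menu"])
print(text)   # Body text with a inside.
print(meta)   # {'author': 'MCF team'}
```

Note that the text of the _a_ tag is dropped along with the tag itself, so "link" disappears from the sentence; that matches the behaviour described above, where the text included in the removed tags is deleted.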

> HTML Extractor transformation connector contribution
> ----------------------------------------------------
>
>                 Key: CONNECTORS-1500
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
>             Project: ManifoldCF
>          Issue Type: Improvement
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Olivier Tavard
>            Assignee: Karl Wright
>            Priority: Minor
>         Attachments: fix_englobing_tag_selection.txt, html_extractor_transformation_connector.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code is simply to
> choose an enclosing tag in an HTML document for text extraction. Inside this tag, the
> connector lets you remove subparts that you do not want: for example, all tags of the
> declared types, or tags with specific attribute names.
> The code is under the Apache License 2.0 and is attached.
> It needs some work, including code refactoring, renaming classes, and unit tests, which I
> will be able to do if you are interested in the code.
> The documentation is here:
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use any libraries other than the ones already present in the MCF project. It
> is based on the Jsoup library in the lib folder.
> Best regards,
> Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
