manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1557) HTML Tag extractor
Date Wed, 21 Nov 2018 08:51:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694406#comment-16694406 ]

Karl Wright commented on CONNECTORS-1557:
-----------------------------------------

The best way to deliver the code is as a patch attachment to a ticket like this.

I hope that the transformer you wrote is consistent with the other transformers that ManifoldCF
provides, e.g. the HTML Extractor and the Metadata Adjuster. Generally we are not fond of
transformers that take on more than the most basic part of what could be structured as a multi-part
transformation. From your description, it sounds like you've basically extended the HTML Extractor
and added functionality similar to what the Metadata Adjuster does. If that's true,
it might be better to contribute only the CSS selector extraction as an extension to the HTML
Extractor, and let the Metadata Adjuster handle the field mappings.
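
To illustrate the split suggested above, here is a minimal sketch of two single-purpose
stages chained in a pipeline: one stage only adds extracted values, the other only renames
fields. This is not ManifoldCF's transformer API; the names add_extracted, rename_fields
and run_pipeline, and the dict-of-lists metadata shape, are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical metadata shape: field name -> list of values.
Metadata = Dict[str, List[str]]
Stage = Callable[[Metadata], Metadata]


def add_extracted(extracted: Metadata) -> Stage:
    """Stand-in for an extraction stage: it only adds raw extracted values."""
    def stage(metadata: Metadata) -> Metadata:
        merged = dict(metadata)
        merged.update(extracted)
        return merged
    return stage


def rename_fields(mapping: Dict[str, str]) -> Stage:
    """Stand-in for a mapping stage: it only renames fields, passing others through."""
    def stage(metadata: Metadata) -> Metadata:
        return {mapping.get(name, name): values for name, values in metadata.items()}
    return stage


def run_pipeline(metadata: Metadata, stages: List[Stage]) -> Metadata:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        metadata = stage(metadata)
    return metadata
```

Because each stage does one thing, the mapping step can be reused or reconfigured without
touching the extraction step, which is the point of keeping the two apart.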

Please let me know how you want to proceed.


> HTML Tag extractor
> ------------------
>
>                 Key: CONNECTORS-1557
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Donald Van den Driessche
>            Priority: Major
>
> I wrote an HTML Tag extractor, based on the HTML Extractor.
> I needed to extract specific HTML tags and transfer each to its own field in my output repository.
> Input
>  * Enclosing tag (CSS selector)
>  * Blacklist (CSS selector)
>  * Field mapping (CSS selector)
>  * Strip HTML
> Process
>  * Retrieve the enclosing tag
>  * Remove the blacklisted elements
>  * Map the CSS selectors in the field mapping to fields (arrays when a selector matches more than once) and strip HTML (if requested)
>  * Enclosing tag minus blacklist: strip HTML (if requested) and return as output (content)
> How can I best deliver the source code?
>  
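
The process described in the issue (retrieve the enclosing, or "englobing", tag; remove the
blacklist; map selectors to fields; optionally strip HTML) can be sketched roughly as
follows. This is not the reporter's code, which is not shown in this thread. It is a
standard-library Python sketch under loud assumptions: the input is well-formed XHTML
without namespaces, selectors are limited to 'tag', '.class', '#id' and 'tag.class',
field values are always returned as lists, and fields are looked up inside the enclosing
tag after the blacklist has been removed. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def matches(el, selector):
    """Match a tiny CSS subset only: 'tag', '.class', '#id' or 'tag.class'."""
    if selector.startswith("#"):
        return el.get("id") == selector[1:]
    tag, _, cls = selector.partition(".")
    if tag and el.tag != tag:
        return False
    return not cls or cls in el.get("class", "").split()


def select(root, selector):
    """Return all matching elements in document order (root included)."""
    return [el for el in root.iter() if matches(el, selector)]


def render(el, strip_html):
    """Return whitespace-normalized text, or the element's markup if not stripping."""
    if strip_html:
        return " ".join("".join(el.itertext()).split())
    tail, el.tail = el.tail, None  # tostring() would otherwise append the tail text
    try:
        return ET.tostring(el, encoding="unicode")
    finally:
        el.tail = tail


def remove_keep_tail(parent, el):
    """Remove el from parent but keep the text that followed it."""
    idx = list(parent).index(el)
    if el.tail:
        if idx == 0:
            parent.text = (parent.text or "") + el.tail
        else:
            parent[idx - 1].tail = (parent[idx - 1].tail or "") + el.tail
    parent.remove(el)


def extract(html, enclosing, blacklist, field_selectors, strip_html=True):
    """Return (content, fields) following the four process steps in the issue."""
    found = select(ET.fromstring(html), enclosing)
    if not found:
        return "", {}
    scope = found[0]  # step 1: the enclosing tag (first match, by assumption)

    # Step 2: remove blacklisted elements found inside the enclosing tag.
    parents = {child: parent for parent in scope.iter() for child in parent}
    for selector in blacklist:
        for el in select(scope, selector):
            parent = parents.get(el)
            if parent is not None:  # the enclosing tag itself is never removed
                remove_keep_tail(parent, el)

    # Step 3: map each field selector to the list of values it matches.
    fields = {
        name: [render(el, strip_html) for el in select(scope, selector)]
        for name, selector in field_selectors.items()
    }

    # Step 4: the enclosing tag minus the blacklist is the output content.
    return render(scope, strip_html), fields
```

A real implementation would more likely rely on a full HTML parser with CSS selector
support; the hand-rolled matcher here only keeps the sketch self-contained.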



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
