manifoldcf-dev mailing list archives

From "Olivier Tavard (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1523) HTML Extractor transformation connector - "No englobing tag specified"
Date Fri, 10 Aug 2018 07:07:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575850#comment-16575850 ]

Olivier Tavard commented on CONNECTORS-1523:
--------------------------------------------

Hello,

In fact the connector does two jobs: it extracts the part of the HTML document that you want,
using the englobing tag and the filters to remove, and it also extracts metadata from the
meta tags and from some other tags such as the title (the complete list is in the JsoupProcessing class).

For the englobing tag, the connector only uses the first one. You can see that in the HtmlExtractor
class at line 153, where only sp.includeFilters.get(0) is passed:

    metadataExtracted = JsoupProcessing.extractTextAndMetadataHtmlDocument(
        document.getBinaryStream(), sp.includeFilters.get(0), sp.excludeFilters, sp.striphtml);
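
To illustrate what "only the first one" means in practice, here is a minimal stand-alone sketch. It is not code from the connector: the class EnglobingTagSketch, both method names, the sample selectors and the "body" fallback are all hypothetical and chosen only for illustration. It only shows the behaviour of List.get(0), which returns the first entry, ignores any later entries, and throws IndexOutOfBoundsException when the list is empty.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone sketch; not taken from the ManifoldCF HtmlExtractor class.
class EnglobingTagSketch {

    // Mirrors the includeFilters.get(0) call quoted above: only the first
    // entry is returned, every later entry is ignored. An empty list makes
    // get(0) throw IndexOutOfBoundsException.
    static String firstEnglobingTag(List<String> includeFilters) {
        return includeFilters.get(0);
    }

    // Assumed defensive variant (not the connector's behaviour): returns a
    // caller-supplied fallback when the list is null or empty.
    static String firstEnglobingTagOrDefault(List<String> includeFilters, String fallback) {
        if (includeFilters == null || includeFilters.isEmpty()) {
            return fallback;
        }
        return includeFilters.get(0);
    }

    public static void main(String[] args) {
        // Two sample selectors; the second one is never used.
        List<String> filters = Arrays.asList("div#content", "article");
        System.out.println(firstEnglobingTag(filters));

        // With no filter configured, the guarded variant falls back to "body".
        System.out.println(firstEnglobingTagOrDefault(Collections.<String>emptyList(), "body"));
    }
}
```

With the sample values above, the second selector never reaches the extraction step, so configuring several englobing tags has no more effect than configuring the first one alone.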
 
 

> HTML Extractor transformation connector - "No englobing tag specified"
> ----------------------------------------------------------------------
>
>                 Key: CONNECTORS-1523
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1523
>             Project: ManifoldCF
>          Issue Type: Bug
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Steph van Schalkwyk
>            Priority: Major
>
> When adding Englobing tag to HTML Extractor transformation, Englobing tag is not persisted. 
> Can add on config screen in job edit, but value is not persisted.
> Results in "No englobing tag specified".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
