any23-dev mailing list archives

From "Andrey Kutuzov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ANY23-240) Option to process html tags as spaces in Microdata
Date Wed, 22 Oct 2014 13:24:33 GMT

     [ https://issues.apache.org/jira/browse/ANY23-240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Kutuzov updated ANY23-240:
---------------------------------
    Description: 
When extracting Microdata from HTML pages, any23 silently drops all HTML tags inside predicates'
values. See, for example, http://schema.org/Recipe/ingredients at http://kuking.net/3_2070.htm.
The problem is that on this page (and many others) the ingredients are separated from each other
only by a '<br>' tag. After any23 drops it, the ingredients run together and become unintelligible.
By contrast, the Google Structured Data Testing Tool separates them properly with spaces.

Is it possible to implement this behavior (replacing <br> tags with spaces) in any23
as an option?

  was:
When extracting Microdata from html pages, any23 silently drops all html tags inside predicates'
values. See, for example, http://schema.org/Recipe/ingredients at http://kuking.net/3_2070.htm.
The problem is that on this page (and many others) ingredients are separated from each other
only with '<br>' tag. After any23 drops it, the content becomes mixed and unintelligible.
At the same time, Google Structured Data Testing Tool separates them properly with spaces.
Is it possible to implement this behavior (replacing <br> tags with spaces) in any23
as option?


> Option to process html tags as spaces in Microdata
> --------------------------------------------------
>
>                 Key: ANY23-240
>                 URL: https://issues.apache.org/jira/browse/ANY23-240
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: extractors, microdata
>            Reporter: Andrey Kutuzov
>
> When extracting Microdata from HTML pages, any23 silently drops all HTML tags inside
> predicates' values. See, for example, http://schema.org/Recipe/ingredients at
> http://kuking.net/3_2070.htm.
> The problem is that on this page (and many others) the ingredients are separated from each
> other only by a '<br>' tag. After any23 drops it, the ingredients run together and become
> unintelligible. By contrast, the Google Structured Data Testing Tool separates them
> properly with spaces.
> Is it possible to implement this behavior (replacing <br> tags with spaces) in
> any23 as an option?
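
The difference between the two behaviors described above can be illustrated with a small, self-contained sketch. It is not any23 code and does not use the any23 API: it relies only on Python's standard html.parser module, and the names TextExtractor, extract_text and tags_as_spaces, as well as the sample ingredient string, are hypothetical and chosen purely for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of an HTML fragment (hypothetical helper, not any23 code)."""

    def __init__(self, tags_as_spaces=False):
        super().__init__()
        # Assumed option name: when True, every tag contributes one space.
        self.tags_as_spaces = tags_as_spaces
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.tags_as_spaces:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if self.tags_as_spaces:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(fragment, tags_as_spaces=False):
    """Return the fragment's text with runs of whitespace collapsed."""
    parser = TextExtractor(tags_as_spaces)
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


# Made-up ingredient list separated only by <br> tags.
sample = "2 eggs<br>1 cup flour<br>salt"
dropped = extract_text(sample)                      # tags silently dropped
spaced = extract_text(sample, tags_as_spaces=True)  # tags treated as spaces
```

With tags dropped, adjacent ingredients are fused into a single token at each former <br>; with the option enabled, each ingredient stays separated by a space. A real implementation might restrict the replacement to <br> and block-level tags rather than every tag, which is a design choice this sketch leaves open.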



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
