any23-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.
Date Fri, 25 Mar 2016 22:24:25 GMT


ASF GitHub Bot commented on ANY23-247:

Github user lewismc commented on the pull request:
    I agree. Jumping through this in the debugger made me think the same.
    I think it is different if Any23 is to be a PURE implementation... But that
    is clearly not the case. Any23 fits in best when it can be used to extract
    semantics from any old crap input that it is fed. Parsers and extractors
    *should not* fail when there is a piece of crap input HTML. Currently,
    that's exactly what happens and it is extremely limiting.
    I would like to propose that this PR is committed to master as is. We then
    open a brand new issue which acts on exactly your comments: refactoring out
    the content extractor, reusing the input stream which has been fixed, etc.
    Any thoughts, Peter? Thanks for the quick response.
    On Friday, March 25, 2016, Peter Ansell <> wrote:
    > The system does seem a little too complex for our purposes and isn't
    > usable because of that.
    > Removing generics would be the first step IMO as there are too many
    > rawtypes definitions which indicate generics are being used badly.
    > ContentExtractor may be able to be completely removed instead of being
    > refitted into the process after that and the parser should always be set to
    > parse as far as practical for our purposes.
    > It is a little strange that there isn't a buffered, markable, InputStream
    > provided for all of the steps to reuse as necessary rather than pushing a
    > raw InputStream or other source into different extractors.
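
    To make the markable-stream idea above concrete, here is a minimal Java
    sketch of how one step could read a buffered stream and then rewind it for
    the next step, using the standard mark/reset calls. The class ReusableInput
    and the method readThenRewind are hypothetical names for illustration, not
    part of Any23, and marking with Integer.MAX_VALUE assumes the document is
    small enough to buffer in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an Any23 class.
final class ReusableInput {

    /**
     * Reads the remaining content of a markable stream as UTF-8 text, then
     * rewinds the stream so a later step can read the same bytes again.
     */
    static String readThenRewind(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        // Assumption: the whole document fits in the stream's buffer.
        in.mark(Integer.MAX_VALUE);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        // Return to the marked position for the next consumer.
        in.reset();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

    Wrapping the raw source once in a java.io.BufferedInputStream and passing
    that single instance to each step would let every step call a helper like
    this without reopening the source.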

> FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.
> --------------------------------------------------------------------------------------------------------------
>                 Key: ANY23-247
>                 URL:
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.2
> In the following markup
> {code}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
> <html xmlns="" xmlns:og="" xmlns:fb="" version="HTML+RDFa 1.0" xml:lang="en" itemscope
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> <meta http-equiv="X-UA-Compatible" content="IE=edge" />
> <meta name="generator" content="ToolTwist" />
> ...
> {code}
> Due to the absence of any subsequent value for *itemscope*, we get the following error in our web server logs
> {code}
> [Fatal Error] :2:185: Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.
> {code}
> Although the markup semantics are incorrect, Any23 should simply check whether the itemscope value is null and, if so, add *=""*. There is a precedent for us doing something like this before; I just can't find the ticket right now!
> The code we need to add is present within either 
> core/src/main/java/org/apache/any23/extractor/microdata/
> core/src/main/java/org/apache/any23/extractor/microdata/
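
As an illustration of the check proposed in the description above, here is a minimal Java sketch that gives a bare itemscope attribute an empty value before the markup reaches a strict XML parser. It is not Any23's actual code: the class name ItemscopeFixer and the method addEmptyValue are hypothetical, and because the regex works on raw text it would also rewrite a bare " itemscope>" that happens to occur outside a tag.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an Any23 class.
final class ItemscopeFixer {

    // "itemscope" preceded by whitespace, NOT followed by optional whitespace
    // and "=", and followed by whitespace, "/" or ">" (i.e. a valueless attribute).
    private static final Pattern BARE_ITEMSCOPE = Pattern.compile(
            "(?<=\\s)itemscope(?!\\s*=)(?=[\\s/>])", Pattern.CASE_INSENSITIVE);

    /** Appends ="" to every valueless itemscope attribute in the given markup. */
    static String addEmptyValue(String html) {
        // $0 is the matched attribute name, so its original casing is preserved.
        return BARE_ITEMSCOPE.matcher(html).replaceAll("$0=\"\"");
    }
}
```

With this sketch, an input such as `<div itemscope itemtype="x">` would become `<div itemscope="" itemtype="x">`, while an attribute that already carries a value is left untouched.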

