lucene-dev mailing list archives

From "Daniel Naber (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-259) HTML Parser doesn't decode character references in attributes
Date Thu, 15 Jun 2006 21:38:30 GMT
     [ http://issues.apache.org/jira/browse/LUCENE-259?page=all ]

Daniel Naber updated LUCENE-259:
--------------------------------

    Bugzilla Id:   (was: 30621)
      Assign To:     (was: Lucene Developers)
       Priority: Minor  (was: Major)

Decreasing the priority because this affects only the demo.


> HTML Parser doesn't decode character references in attributes
> -------------------------------------------------------------
>
>          Key: LUCENE-259
>          URL: http://issues.apache.org/jira/browse/LUCENE-259
>      Project: Lucene - Java
>         Type: Bug
>   Components: Examples
>     Versions: 1.4
>  Environment: Operating System: All
>               Platform: All
>     Reporter: Dave Sparks
>     Priority: Minor
>
> The HTML Parser includes the values of certain attributes in the summary, the
> metaTags and the output stream.  Character references in the attribute values
> are not decoded.  Specifically:
>
> 1. The value of the alt= attribute of an <img ...> tag is included in the
>    summary and the output stream.  This value is case-significant, and may
>    include character references.  The character references are not decoded.
>
> 2. The value of the content= attribute of a <meta ...> tag is included in the
>    metaTags if the tag also has a name= or http-equiv= attribute.  This value
>    is case-significant, and may include character references.  The character
>    references are not decoded, and the value is downcased (since the fix to
>    bug #27423).
>
> I've patched our version of the parser to decode the character references, by
> adding a decodeAll method to Entities to parse a String for character references
> and return a String where the references have been replaced by the corresponding
> characters (or the original String, if no change is needed).  This method is
> called to decode alt= attributes and content= attributes.  I've removed the
> .toLowerCase() on the content= value.  I'm not really happy with this fix, as it
> seems to me to be wrong to parse a value which was previously parsed as a single
> token; there ought to be a way to get it right the first time.
>
> I've left the name= and http-equiv= values alone.  It's not entirely clear (to
> me) whether character references are allowed, and it would be perverse to use
> them here.  I also appreciate the convenience of having a single combined
> namespace, with downcased names, even though this is technically wrong.
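
For illustration only, the decodeAll behaviour described in the report could
look roughly like the sketch below. This is not the reporter's patch and not
the demo's actual Entities class: the class name EntityDecoder, the helper
decodeOne, and the tiny NAMED table are assumptions made up for this example,
and a real table would need the full HTML entity set. It decodes named,
decimal and hexadecimal references that end in a semicolon, leaves anything
it does not recognise untouched, and returns the original String when
nothing changed.

```java
import java.util.HashMap;
import java.util.Map;

public class EntityDecoder {
    // Hypothetical, deliberately tiny table; not the demo's real entity list.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("eacute", "\u00E9");
    }

    /** Replaces character references in s; returns s itself if nothing was decoded. */
    public static String decodeAll(String s) {
        int amp = s.indexOf('&');
        if (amp < 0) {
            return s; // no '&' at all, so there is nothing to decode
        }
        StringBuilder out = new StringBuilder(s.length());
        boolean changed = false;
        int pos = 0;
        while (amp >= 0) {
            out.append(s, pos, amp); // copy the text before the '&'
            int semi = s.indexOf(';', amp + 1);
            String decoded = (semi > amp + 1) ? decodeOne(s.substring(amp + 1, semi)) : null;
            if (decoded != null) {
                out.append(decoded);
                pos = semi + 1;
                changed = true;
            } else {
                out.append('&'); // unrecognised reference: keep the '&' literally
                pos = amp + 1;
            }
            amp = s.indexOf('&', pos);
        }
        out.append(s, pos, s.length());
        return changed ? out.toString() : s;
    }

    // ref is the non-empty text between '&' and ';'; returns null if it is not a known reference.
    private static String decodeOne(String ref) {
        if (ref.charAt(0) != '#') {
            return NAMED.get(ref); // entity names are case-sensitive
        }
        boolean hex = ref.length() > 1 && (ref.charAt(1) == 'x' || ref.charAt(1) == 'X');
        String digits = hex ? ref.substring(2) : ref.substring(1);
        if (digits.isEmpty() || digits.charAt(0) == '+' || digits.charAt(0) == '-') {
            return null; // reject empty or signed numbers
        }
        try {
            int cp = Integer.parseInt(digits, hex ? 16 : 10);
            return Character.isValidCodePoint(cp) ? new String(Character.toChars(cp)) : null;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Under these assumptions, a parser would call EntityDecoder.decodeAll(value)
on the raw alt= or content= token and store the result without lowercasing
it. As the reporter notes, this still parses the attribute value a second
time instead of decoding it during tokenization.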

