lucene-dev mailing list archives

From "Mike Sokolov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SOLR-2597) XmlCharFilter
Date Wed, 15 Jun 2011 13:12:47 GMT
XmlCharFilter
-------------

                 Key: SOLR-2597
                 URL: https://issues.apache.org/jira/browse/SOLR-2597
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
    Affects Versions: 4.0
            Reporter: Mike Sokolov


This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content
and recording offsets, just as HTMLStripCharFilter does, but it respects XML conventions such as
entities defined in a DTD.  XmlCharFilter can also exclude (or include) the content of
specified named elements.
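
As a rough illustration of the stripping and element-exclusion behaviour described above, here is a
minimal sketch. It is not the code in the patch: it uses the JDK's built-in StAX API
(javax.xml.stream) rather than Woodstox-specific features, it does no offset tracking, and the
class name XmlTextSketch, the method extractText, and the excluded-name set are hypothetical names
chosen for this example.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: keep only the text content of an XML document,
// skipping everything inside elements whose local name is in `excluded`.
class XmlTextSketch {

    static String extractText(String xml, Set<String> excluded) {
        StringBuilder text = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            // Depth inside an excluded subtree; 0 means "currently emitting text".
            int excludedDepth = 0;
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        // Entering an excluded element, or nesting deeper inside one.
                        if (excludedDepth > 0 || excluded.contains(reader.getLocalName())) {
                            excludedDepth++;
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (excludedDepth > 0) {
                            excludedDepth--;
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                    case XMLStreamConstants.SPACE:
                        // Tags, comments and processing instructions never reach here;
                        // entity references arrive already expanded to their text.
                        if (excludedDepth == 0) {
                            text.append(reader.getText());
                        }
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return text.toString();
    }
}
```

For example, calling XmlTextSketch.extractText on
"<doc><title>Hi</title><note>skip <b>me</b></note> &lt;there&gt;</doc>" with Set.of("note")
yields "Hi <there>": the note element and its nested content are dropped, and the entity
references come back as plain characters.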

To compute character offsets correctly when the input mixes line-termination styles
(\r, \r\n) or contains XML entity references (&lt;, &quot;, &amp;), the filter
requires a newer version of Woodstox (4.1.1) than the one currently in solr/lib.  Earlier
versions of the parser could not report these entity events, so we couldn't tell the difference
between "<" and "&lt;", and the resulting offsets could be wrong.  The upgraded parser is
included in the patch.
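
To show why entity references and \r\n line endings shift offsets, here is a small self-contained
sketch. It is an assumption-laden illustration, not the patch's implementation: it uses a
hand-rolled scanner instead of Woodstox, handles only the five predefined XML entities, and the
names OffsetMapSketch, text, and correctOffset are hypothetical. The idea is to record the
cumulative length difference each time the source text and the output text diverge, then add that
difference back when mapping an output offset to a source offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: decode predefined entities and normalize \r / \r\n to \n,
// remembering how far output offsets have drifted from source offsets.
class OffsetMapSketch {
    private static final Map<String, String> ENTITIES = Map.of(
            "&lt;", "<", "&gt;", ">", "&amp;", "&", "&quot;", "\"", "&apos;", "'");

    private final StringBuilder out = new StringBuilder();
    // Each entry is {outputOffset, cumulativeDiff}: from outputOffset onward,
    // sourceOffset = outputOffset + cumulativeDiff (until the next entry).
    private final List<int[]> corrections = new ArrayList<>();

    OffsetMapSketch(String source) {
        int diff = 0;
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            String replacement = null;
            int consumed = 1;
            if (c == '&') {
                int semi = source.indexOf(';', i);
                if (semi != -1) {
                    String decoded = ENTITIES.get(source.substring(i, semi + 1));
                    if (decoded != null) {
                        replacement = decoded;
                        consumed = semi + 1 - i;
                    }
                }
            } else if (c == '\r') {
                // Both a bare \r and a \r\n pair become a single \n.
                replacement = "\n";
                consumed = (i + 1 < source.length() && source.charAt(i + 1) == '\n') ? 2 : 1;
            }
            if (replacement == null) {
                out.append(c);
            } else {
                out.append(replacement);
                if (consumed != replacement.length()) {
                    diff += consumed - replacement.length();
                    corrections.add(new int[] {out.length(), diff});
                }
            }
            i += consumed;
        }
    }

    String text() {
        return out.toString();
    }

    // Map an offset in the decoded text back to an offset in the original source.
    int correctOffset(int outputOffset) {
        int diff = 0;
        for (int[] entry : corrections) {
            if (entry[0] > outputOffset) {
                break;
            }
            diff = entry[1];
        }
        return outputOffset + diff;
    }
}
```

With the source "a &lt; b\r\nc &amp; d", text() is "a < b\nc & d", and correctOffset(4) (the "b")
returns 7, its position in the source.  A parser that silently expands "&lt;" without reporting
the event gives the caller no way to record the three-character difference, which is the problem
described above.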

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
