lucene-dev mailing list archives

From Koji Sekiguchi <k...@r.email.ne.jp>
Subject Re: [jira] [Created] (SOLR-2597) XmlCharFilter
Date Wed, 15 Jun 2011 13:28:40 GMT

Did you mean Xml*Strip*CharFilter?

koji
-- 
http://www.rondhuit.com/en/

(11/06/15 22:12), Mike Sokolov (JIRA) wrote:
> XmlCharFilter
> -------------
>
>                   Key: SOLR-2597
>                   URL: https://issues.apache.org/jira/browse/SOLR-2597
>               Project: Solr
>            Issue Type: Improvement
>            Components: Schema and Analysis
>      Affects Versions: 4.0
>              Reporter: Mike Sokolov
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text
> content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions
> like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and
> include) the content of certain named elements.
>
> In order to compute character offsets properly when mixed line termination styles are
> present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present,
> we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier
> versions of the parser could not report these entity events, so we couldn't tell the difference
> between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in the
> patch.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
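
For readers following the thread, the offset bookkeeping described in the quoted
issue can be illustrated with a small sketch. This is not the code from the
SOLR-2597 patch and it does not use Woodstox; it is a simplified, hypothetical
Python illustration of the general idea: strip markup, decode entities, and
record how far each output offset has drifted from the input so that it can be
mapped back. The names strip_xml and correct_offset are made up for this
example. It handles only simple tags and the five predefined XML entities, and
ignores DTD-defined entities, comments, CDATA, line-ending normalization, and
element include/exclude lists.

```python
import re

# Only the five entities predefined by XML (assumption: no DTD entities).
ENTITIES = {"&lt;": "<", "&gt;": ">", "&amp;": "&", "&quot;": '"', "&apos;": "'"}

# A deliberately naive tokenizer: a tag is "<" ... ">", an entity is one of
# the five above. Real XML needs a real parser.
TOKEN = re.compile(r"<[^>]*>|&(?:lt|gt|amp|quot|apos);")


def strip_xml(source):
    """Return (text, corrections) for a simple XML string.

    corrections is a list of (output_offset, diff) pairs, in order. For an
    offset in the stripped text, the last pair whose output_offset is <= that
    offset gives the amount to add to get the offset in the original input.
    """
    pieces = []
    corrections = []
    out_len = 0   # characters emitted so far
    diff = 0      # input characters consumed minus output characters emitted
    pos = 0
    for match in TOKEN.finditer(source):
        # Copy the literal text that precedes this tag or entity.
        chunk = source[pos:match.start()]
        pieces.append(chunk)
        out_len += len(chunk)

        token = match.group()
        replacement = ENTITIES.get(token, "")  # tags are dropped entirely
        pieces.append(replacement)
        out_len += len(replacement)

        # Everything emitted from here on is shifted by the extra characters.
        diff += len(token) - len(replacement)
        corrections.append((out_len, diff))
        pos = match.end()
    pieces.append(source[pos:])
    return "".join(pieces), corrections


def correct_offset(corrections, out_offset):
    """Map an offset in the stripped text back to the original input."""
    diff = 0
    for start, cumulative in corrections:
        if start > out_offset:
            break
        diff = cumulative
    return out_offset + diff


if __name__ == "__main__":
    src = "<doc><t>a &lt; b</t></doc>"
    text, corr = strip_xml(src)
    print(repr(text))
    for i, ch in enumerate(text):
        print(i, repr(ch), "->", correct_offset(corr, i))
```

The sketch shows why the parser has to report entity events: "&lt;" occupies
four characters in the input but one in the output, so without knowing that an
entity was expanded, every offset after it would be off by three.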
