lucene-dev mailing list archives

From "Mike Sokolov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-2597) XmlCharFilter
Date Fri, 17 Jun 2011 02:44:47 GMT

     [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated SOLR-2597:
-------------------------------

    Attachment: SOLR-2597.patch

The updated patch addresses (most of) Robert's and Hoss's comments (thanks for the speedy review!):

The test now uses the random instance provided by the test framework.

I added a test for the factory (actually, all the tests now use the factory, since it now
creates the parser), but I haven't plumbed this all the way through to a schema declaration.


Moved to org.apache.solr.analysis: I don't know if this is the right place for it, but it
should at least resolve any jar and Java 1.6 dependency problems, I think. I can compile and
run the tests from both Eclipse and the ant command line, although I'm not sure exactly what
that proves.

The parser is now created in the factory rather than being kept as a static field in the
reader class.
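
To make the last point concrete, here is a rough sketch of the shape this takes. It is not the code in the patch: the class names (SketchXmlFilterFactory, SketchXmlTextReader) are made up for illustration, and it uses the JDK's standard StAX API rather than anything Woodstox-specific.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical names throughout; standard StAX API, not Woodstox-specific classes.
class SketchXmlFilterFactory {
    // Parser configuration lives in the factory instance, not in a static field of the reader.
    private final XMLInputFactory inputFactory = XMLInputFactory.newInstance();

    // Each call builds a fresh parser and hands it to a new reader.
    SketchXmlTextReader create(Reader in) {
        try {
            return new SketchXmlTextReader(inputFactory.createXMLStreamReader(in));
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not create XML parser", e);
        }
    }
}

class SketchXmlTextReader {
    private final XMLStreamReader parser; // owned by this reader only

    SketchXmlTextReader(XMLStreamReader parser) {
        this.parser = parser;
    }

    // Concatenates the text of all CHARACTERS events and drops the markup.
    String readAllText() {
        StringBuilder sb = new StringBuilder();
        try {
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.CHARACTERS) {
                    sb.append(parser.getText());
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
        return sb.toString();
    }
}
```

With this arrangement, a test would call something like new SketchXmlFilterFactory().create(new StringReader(xml)).readAllText(), and no parser state is shared between readers.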

> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch, SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text
> content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions
> like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and
> include) the content of certain named elements.
> In order to compute character offsets properly when mixed line termination styles are
> present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present,
> we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier
> versions of the parser could not report these entity events, so we couldn't tell the difference
> between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in
> the patch.
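
As an illustration of the offset problem described above, here is a minimal hand-rolled sketch. It is an assumption-laden toy, not the patch's implementation: the patch relies on Woodstox, whereas this class (the hypothetical OffsetMappingSketch) only skips tags naively and decodes the five predefined entities. The point it shows is that one output character such as "<" can stand for four input characters ("&lt;"), so every later offset has to be corrected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, hand-rolled sketch; the patch itself relies on Woodstox for parsing.
class OffsetMappingSketch {
    private final StringBuilder text = new StringBuilder();
    // inputOffsets.get(i) is the input offset at which output character i started.
    private final List<Integer> inputOffsets = new ArrayList<Integer>();
    private final int inputLength;

    OffsetMappingSketch(String xml) {
        inputLength = xml.length();
        int i = 0;
        while (i < xml.length()) {
            char c = xml.charAt(i);
            if (c == '<') {
                // Skip a tag. Naive: assumes no '>' in attribute values, comments or CDATA.
                int close = xml.indexOf('>', i);
                i = (close < 0) ? xml.length() : close + 1;
            } else if (c == '&') {
                int semi = xml.indexOf(';', i);
                int decoded = (semi < 0) ? -1 : decode(xml.substring(i + 1, semi));
                if (decoded < 0) {
                    emit(c, i); // unknown reference: pass the '&' through unchanged
                    i++;
                } else {
                    emit((char) decoded, i); // one output char stands for the whole reference
                    i = semi + 1;
                }
            } else {
                emit(c, i);
                i++;
            }
        }
    }

    private void emit(char c, int inputOffset) {
        text.append(c);
        inputOffsets.add(inputOffset);
    }

    // Only the five predefined XML entities; returns -1 for anything else.
    private static int decode(String name) {
        if (name.equals("lt")) return '<';
        if (name.equals("gt")) return '>';
        if (name.equals("amp")) return '&';
        if (name.equals("quot")) return '"';
        if (name.equals("apos")) return '\'';
        return -1;
    }

    String text() {
        return text.toString();
    }

    // Maps an offset in the stripped text back to an offset in the original XML.
    int correctOffset(int outputOffset) {
        return outputOffset < inputOffsets.size() ? inputOffsets.get(outputOffset) : inputLength;
    }
}
```

For the input "<p>a &lt; b</p>", the stripped text is "a < b". The "b" sits at offset 4 in the stripped text but at offset 10 in the original; a filter that could not see the entity would have no way to account for the three extra characters.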
