lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2597) XmlCharFilter
Date Thu, 16 Jun 2011 02:11:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050184#comment-13050184
] 

Hoss Man commented on SOLR-2597:
--------------------------------

Mike: thanks for the patch!

as Koji mentioned on the mailing list, might want to consider naming this XmlStripCharFilter
... that was my first opinion, but reading the docs the "include" and "exclude" options definitely
make it a bit more generic, so i'm leaning towards the opinion that XmlCharFilter is better.

(there's an argument to be made that we should have an XmlStripCharFilter that only removes
pi/comments/whitespace and resolves entities, and then a distinct XmlTagCharFilter that does
the include/exclude -- but i'm guessing that would be less efficient since this makes it possible
to do in one pass, and anyone who wants include/exclude at the "tag" level is almost certainly
going to want the stripping/entities as well)

skimming the patch i'm +1 except for the "new Random" in the test case ... if you take a look
at the existing test cases you'll see how you can hook into the solr test framework to get
random values that are consistent with a global seed -- that way if a test fails, it will
report which seed was used and people can reproduce it using system properties.
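
To illustrate the idea of a reproducible seed (not the Solr test framework's actual API), here is a minimal self-contained sketch. The class name `SeededRandomSketch` and the system property `sketch.tests.seed` are hypothetical, made up for this example; the real framework has its own hooks, which the existing test cases demonstrate.

```java
import java.util.Random;

/** Hypothetical helper for illustration; not part of the Solr or Lucene test framework. */
final class SeededRandomSketch {
    /** Hypothetical system property name, chosen for this example only. */
    static final String SEED_PROPERTY = "sketch.tests.seed";

    /** Uses the seed from the system property if set, otherwise picks a fresh one. */
    static long resolveSeed() {
        String configured = System.getProperty(SEED_PROPERTY);
        return configured != null ? Long.parseLong(configured) : System.nanoTime();
    }

    /** All random values in a test run should derive from this one seed. */
    static Random newRandom(long seed) {
        return new Random(seed);
    }

    public static void main(String[] args) {
        long seed = resolveSeed();
        Random random = newRandom(seed);
        int value = random.nextInt(1000);
        // Report the seed so a failing run can be replayed with -Dsketch.tests.seed=<seed>.
        System.out.println("seed=" + seed + " value=" + value);
    }
}
```

A bare "new Random()" hides its seed, so a failure that depends on the random input cannot be replayed; routing every random value through one reported seed is what makes the failure reproducible.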

would also be nice to have a test of the Factory (using a schema.xml declaration) but that's
not nearly as important.

and of course: would be great if "the xml policeman" uwe could review.

bq. I tried to include the upgraded Woodstox jars, but I don't think I figured how to put
binaries in the patch actually.

it's not possible, so don't worry about it.  the important thing is noting in a comment (like
you did) exactly what new/upgraded jars are needed.


> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text
> content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions
> like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and
> include) the content of certain named elements.
> In order to compute character offsets properly when mixed line termination styles are
> present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present,
> we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier
> versions of the parser could not report these entity events, so we couldn't tell the difference
> between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in
> the patch.
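
The stripping and exclude behaviour described in the issue can be sketched roughly as follows. This is not the code from SOLR-2597.patch: it is a minimal illustration that uses the JDK's built-in StAX API (javax.xml.stream) rather than Woodstox-specific features, it does not record offsets, and the class and method names (`XmlTextSketch`, `stripXml`) are hypothetical.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: keeps text content, drops markup and excluded elements. */
final class XmlTextSketch {

    /** Returns the text of the document, skipping the content of the excluded element names. */
    static String stripXml(String xml, Set<String> excluded) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        int excludedDepth = 0; // > 0 while inside an excluded element
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        // Nested elements inside an excluded element are excluded too.
                        if (excludedDepth > 0 || excluded.contains(reader.getLocalName())) {
                            excludedDepth++;
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (excludedDepth > 0) {
                            excludedDepth--;
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                    case XMLStreamConstants.SPACE:
                        if (excludedDepth == 0) {
                            text.append(reader.getText());
                        }
                        break;
                    default:
                        // Comments, processing instructions, etc. are dropped.
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<doc><!-- note --><title>Hi</title><body>a &lt; b</body></doc>";
        System.out.println(stripXml(xml, Set.of("title")));
    }
}
```

A real CharFilter would additionally have to map each position in the stripped text back to its position in the original input, which is where the entity and line-termination reporting mentioned in the issue matters.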

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
