lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2597) XmlCharFilter
Date Thu, 16 Jun 2011 02:35:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050189#comment-13050189 ]

Robert Muir commented on SOLR-2597:
-----------------------------------

Just one comment: looking at the patch, it currently won't compile, because the analysis
module has no external dependencies and therefore no Woodstox.
(But thanks for trying to integrate it here!)

One step would be: rather than making this static, could the constructors just take a
general XMLInputFactory instead, e.g.
{noformat}
public XmlCharFilter(CharStream reader, XMLInputFactory inputFactory) {
{noformat}

The corresponding Solr CharFilterFactory could then configure it with all the Woodstox-specific
parameters.
But this still wouldn't solve the problem that all of Lucene and the modules are on Java 5 (and
it looks like this uses Java 6-specific APIs).
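
To make the constructor suggestion concrete, here is a minimal sketch of the injection idea. It is not the code from the patch: the class name XmlTextSketch and the method newConfiguredFactory are hypothetical, a plain java.io.Reader stands in for Lucene's CharStream, and no offset tracking is attempted. Only the standard javax.xml.stream (StAX) API is used, so the Java 6 requirement mentioned above still applies. The factory method is where a Solr-side factory class could presumably set Woodstox-specific properties; only standard StAX properties are set here.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-in for the proposed filter. It only illustrates passing
// the parser factory through the constructor; it does not extend CharFilter.
class XmlTextSketch {
    private final Reader input;
    private final XMLInputFactory inputFactory;

    // The factory is supplied by the caller instead of living in a static field.
    XmlTextSketch(Reader input, XMLInputFactory inputFactory) {
        this.input = input;
        this.inputFactory = inputFactory;
    }

    // Returns the concatenated character data, dropping all markup.
    String extractText() throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        XMLStreamReader xml = inputFactory.createXMLStreamReader(input);
        try {
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    text.append(xml.getText());
                }
            }
        } finally {
            xml.close();
        }
        return text.toString();
    }

    // Assumption: roughly what a factory class would do once at setup time.
    // Parser-specific properties would be set here as well.
    static XMLInputFactory newConfiguredFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        return factory;
    }

    public static void main(String[] args) throws XMLStreamException {
        XMLInputFactory factory = newConfiguredFactory();
        Reader in = new StringReader("<doc><t>Tom &amp; Jerry</t></doc>");
        System.out.println(new XmlTextSketch(in, factory).extractText());
    }
}
```

With this shape the filter class itself depends only on the XMLInputFactory interface, and the choice of parser implementation stays with whoever constructs it.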

I don't think it makes sense to block the patch on these issues, so one workaround would
be to just add it to Solr only.
If/when we ever move to Java 6 in Lucene, we could then move it into the analysis module.
Another option would be if the XML policeman knows some workaround (sorry, not my thing).


> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text
> content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions
> like XML entities defined in a DTD. XmlCharFilter also provides the ability to exclude (and
> include) the content of certain named elements.
> In order to compute character offsets properly when mixed line termination styles are
> present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present,
> we require a newer version of Woodstox (4.1.1) than is currently in solr/lib. The earlier
> versions of the parser could not report these entity events, so we couldn't tell the difference
> between "<" and "&lt;" and the offsets could be wrong. The upgraded version is in
> the patch.
