lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Unsupported encoding GB18030
Date Sun, 03 Apr 2011 08:25:59 GMT
Hi,

> : I don't see the reason why "exampledocs" should contain docs with narrow
> : charsets not guaranteed to be supported.
> : In my opinion this file belongs in the test suite, also since it only
> : contains "test" content, unsuitable for demoing.
> 
> its purpose for being there is to let *users* "test" if their servlet
> container + solr combination is working with alternate encodings -- much
> the same reason utf8-example.xml and test_utf8.sh are included in
> exampledocs.
>
> It's a perfectly valid exampledoc for Solr.  it may not work on all
> platforms, but the *.sh files aren't guaranteed to work on all platforms
> either.  if we moved it to the test directory, end users fetching binary
> releases wouldn't get it, and may not be aware that their servlet
> container isn't supporting that charset.
>
> personally i would like to see us add a lot more exampledocs in a lot more
> esoteric encodings, precisely to help end users sanity test this sort of
> thing. we frequently get questions from people about character encoding
> wonkiness, and things like test_utf8.sh, utf8-example.xml, and now
> gb18030-example.xml can help us narrow down the problem: their client
> code, their servlet container, or solr?

Same here. In my opinion, an example set of files should also contain "more
complicated" ones to show what Solr can do. If some of them don't work, it's
not really a problem. Maybe we should simply add a "tag" to the filename to
mark them as not working in every configuration.

The servlet container can no longer break those files! Solr now *only*
communicates with the servlet container using Input/OutputStreams. All
charsets are handled by the XML parser or by Readers/Writers created by
Solr's own code (this change also brought a serious speed improvement,
because Jetty's servlet Writers are very inefficient...).
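
To illustrate what "the XML parser handles the charset" means, here is a
minimal sketch (not Solr's actual code): a StAX parser is handed raw bytes
through an InputStream and picks the charset from the XML declaration
itself, so no container-side Reader is involved. The class and method names
are made up for this example, and it uses ISO-8859-1 only because every JVM
is required to support it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamEncodingSketch {

    // Hypothetical helper: wraps 'text' in a tiny XML document, encodes it
    // with the given charset, and lets the parser decode the raw bytes.
    // 'text' must not contain XML special characters such as '<' or '&'.
    static String readText(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        String xml = "<?xml version=\"1.0\" encoding=\"" + charsetName + "\"?>"
                + "<doc>" + text + "</doc>";
        byte[] bytes = xml.getBytes(charset);
        try {
            // Only bytes are passed in; the parser reads the declared encoding.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(bytes));
            StringBuilder out = new StringBuilder();
            while (reader.hasNext()) {
                // Character data may arrive in several chunks, so append all.
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    out.append(reader.getText());
                }
            }
            reader.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Parser could not read the stream", e);
        }
    }

    public static void main(String[] args) {
        String charsetName = args.length > 0 ? args[0] : "ISO-8859-1";
        System.out.println(readText("caf\u00e9", charsetName));
    }
}
```

Passing another charset name as the first argument (for example GB18030)
exercises the same path, provided both the JVM and the parser support it.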

To come back to the original issue: I did extensive testing with different
*Sun/Oracle* JDKs and operating systems in VirtualBox, and none of them
failed! To get to the bottom of the issue, Jan should tell us his complete
configuration:
- Was the JDK freshly installed (not upgraded or whatever)?
- Was it a clean binary Solr distribution (I tested only those)? If it was
an SVN checkout, maybe the SVN client broke the file itself. The error
message in the exception was strange: it contained some trash behind the
encoding name, so maybe the file itself was corrupted. Maybe Jan opened and
saved it with an incompatible text editor that cannot handle this encoding!
We should know what Jan changed. Maybe he used the already modified Solr
installation of his project.
- Maybe the classpath on Jan's Solr installation contains some "older" XML
parser libs that cannot handle this GB encoding. From the exception, we
cannot see if the StAX parser that produced these exceptions is really the
one from Solr itself. Maybe there is some other Wstx in his classpath. A
good test (as he uses JDK 1.6) would be to remove wstx from Solr's lib
folder. If the same exception still appears afterwards (and not a different
one), there is another Wstx somewhere. JDK 6 has an internal (but slower)
StAX parser, so removing the file is a good test case.
- Jan should run a short Java program to test whether e.g. new String(new
byte[0], "GB18030") works for him!
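
The last two checks could be sketched roughly as follows. This is only an
illustration under some assumptions: the class and method names are
invented here, and the program has to be run with the same JDK and the same
classpath as the Solr installation for the parser lookup to say anything
useful.

```java
import java.io.UnsupportedEncodingException;
import java.security.CodeSource;
import javax.xml.stream.XMLInputFactory;

public class Gb18030Diagnostics {

    // The test proposed above: decoding an empty byte array fails only if
    // the JVM does not know the charset at all.
    static boolean canDecode(String charsetName) {
        try {
            new String(new byte[0], charsetName);
            return true;
        } catch (UnsupportedEncodingException e) {
            return false;
        }
    }

    // Reports which StAX implementation the factory lookup selects and,
    // where the JVM exposes it, the jar or directory it was loaded from.
    static String describeStaxFactory() {
        Class<?> impl = XMLInputFactory.newInstance().getClass();
        CodeSource source = impl.getProtectionDomain().getCodeSource();
        // Classes that ship with the JDK itself may report no code source.
        String location = (source == null || source.getLocation() == null)
                ? "JDK runtime (no code source)"
                : source.getLocation().toString();
        return impl.getName() + " loaded from " + location;
    }

    public static void main(String[] args) {
        System.out.println("GB18030 decodable: " + canDecode("GB18030"));
        System.out.println("StAX factory: " + describeStaxFactory());
    }
}
```

A factory class name starting with com.ctc.wstx would point to a Woodstox
jar on the classpath, and the printed location would show which one.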

Finally, the JDK's charset support is in charsets.jar in the JDK lib folder.
It has nothing to do with any operating system support for charsets, so it
also works on any Windows version that is e.g. US English only. The charset
support in Windows XP exists only to display those characters on screen (it
contains only fonts and basic OS support). As Solr has no GUI and the
charset conversions are handled by charsets.jar internally, installing any
operating system patches has no effect.
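
As a rough way to see that the conversion happens inside the JVM, the
following sketch encodes two Chinese characters to bytes and decodes them
again without touching fonts or OS locale settings. The class and method
names are hypothetical, and the result only reflects the JDK the program is
run with.

```java
import java.nio.charset.Charset;

public class CharsetRoundTrip {

    // Returns true if the JVM supports the charset and the text survives
    // an encode/decode round trip unchanged.
    static boolean roundTrips(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return false;
        }
        Charset charset = Charset.forName(charsetName);
        byte[] encoded = text.getBytes(charset);
        // Unmappable characters are replaced during encoding, so a lossy
        // charset makes the comparison fail.
        return text.equals(new String(encoded, charset));
    }

    public static void main(String[] args) {
        String sample = "\u4e2d\u6587"; // two Chinese characters
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("GB18030 supported: " + Charset.isSupported("GB18030"));
        System.out.println("GB18030 round trip: " + roundTrips(sample, "GB18030"));
    }
}
```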

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

