camel-dev mailing list archives

From fedd <feddkr...@hotmail.com>
Subject A possible bug in IOConverter with Win-1251 charset
Date Sat, 05 Mar 2016 20:39:30 GMT
Hi, I believe I have found a bug. It is recommended to discuss it on the
forum before posting to Jira.

I found it impossible to unmarshal a Win-1251 CSV file with Cyrillic
strings, at least on a machine where the JVM default charset is Win-1251.
I dug into the code and saw that in IOConverter you subclass InputStream
with this read method:

    @Override
    public int read() throws IOException {
        if (bufferBytes == null || bufferBytes.remaining() <= 0) {
            bufferedChars.clear();
            int len = reader.read(bufferedChars);
            bufferedChars.flip();
            if (len == -1) {
                return -1;
            }
            bufferBytes = defaultStreamCharset.encode(bufferedChars);
        }
        return bufferBytes.get();
    }

I tried to find out why you are converting the character buffer to a byte
buffer when you already have chars and need to return integers. It may work
for other languages, but it doesn't work for Russian: the character "ya" (я)
has the code 0xFF in Win-1251, an encoding invented by Microsoft and the most
widespread one in Russia, Ukraine and some other countries that use Cyrillic
letters. (And "ya" is a very frequent character :)

bufferBytes.get() returns that value as a signed byte, so 0xFF comes back
as -1. Whoever calls this read() method expects -1 only as the EOF signal,
so reading stops at a legitimate Cyrillic letter.
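
To illustrate the effect described above, here is a minimal, self-contained sketch. The class and method names are hypothetical and not part of Camel; the stream is a simplified stand-in for the quoted read() method, and it assumes the JVM provides the windows-1251 charset. The `mask` flag switches on an unsigned conversion (`b & 0xFF`), which is only one possible remedy, shown here for contrast.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class SignedByteEofDemo {

    // Hypothetical stand-in for the quoted stream: encodes chars from the
    // reader and hands the bytes out one at a time. When 'mask' is true the
    // byte is converted to the 0..255 range before being returned.
    static InputStream encodingStream(Reader reader, Charset charset, boolean mask) {
        return new InputStream() {
            private final CharBuffer chars = CharBuffer.allocate(16);
            private ByteBuffer bytes;

            @Override
            public int read() throws IOException {
                if (bytes == null || !bytes.hasRemaining()) {
                    chars.clear();
                    int len = reader.read(chars);
                    chars.flip();
                    if (len == -1) {
                        return -1; // genuine end of input
                    }
                    bytes = charset.encode(chars);
                }
                byte b = bytes.get();         // signed: 0xFF is -1 here
                return mask ? (b & 0xFF) : b; // unmasked, -1 looks like EOF
            }
        };
    }

    // Counts how many bytes a caller sees before read() reports -1.
    static int countBytes(String text, boolean mask) {
        // Assumption: this charset is available in the running JVM.
        Charset cp1251 = Charset.forName("windows-1251");
        try (InputStream in = encodingStream(new StringReader(text), cp1251, mask)) {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u0418\u043C\u044F"; // three Cyrillic letters ending in "ya"
        System.out.println("unmasked: " + countBytes(name, false));
        System.out.println("masked:   " + countBytes(name, true));
    }
}
```

Without the mask the loop stops as soon as it reaches "ya"; with the mask all three bytes are delivered before the real end of input.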

You probably have some reasoning for this, but I would totally omit the
"encode" step and return the next integer straight from the "bufferedChars"
buffer.

What do you think?

Regards,
Fyodor





