commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Neidhart (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (IO-356) CharSequenceInputStream#reset() behaves incorrectly in case when buffer size is not dividable by data size
Date Sun, 24 Feb 2013 15:34:13 GMT

    [ https://issues.apache.org/jira/browse/IO-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585387#comment-13585387
] 

Thomas Neidhart edited comment on IO-356 at 2/24/13 3:33 PM:
-------------------------------------------------------------

In the reset method we have to modify also the bbuf variable, as it actually contains the
data to be read.
If you take a look at the fillBuffer method, it actually fills bbuf with as many data from
cbuf as bbuf can hold.

A simple scenario, input string is (AAAAABBBBB), bbuf has a buffer size of 10:

{noformat}
  is.mark();
  byte[] data = new byte[5];
  is.read(data);
{noformat}

data will contain "AAAAA", but bbuf will contain the full input string (AAAAABBBBB).
When we now call:

{noformat}
  is.reset();
  is.read(data);
{noformat}

we just reposition cbuf, but we continue to read from bbuf, thus the result will be "BBBBB".
I think calling bbuf.limit(0) is correct and simple, although it might be possible to improve
it.

Regarding the other failing unit tests:

We do call the encode method with the endOfInput flag always set to true. This means we have
to reset the coder the next time we use it (calling also flush is advised according to javadoc
of CharsetEncoder):

{noformat}
    private void fillBuffer() throws CharacterCodingException {
        this.bbuf.compact();
        this.encoder.reset();
        final CoderResult result = this.encoder.encode(this.cbuf, this.bbuf, true);
        this.encoder.flush(bbuf);
        if (result.isError()) {
            result.throwException();
        }
        this.bbuf.flip();
    }
{noformat}

This was probably introduced as the CharsetEncoder always precedes the data with with the
byte-order mark when using UTF-16 charset.
In that way, the BOM is only output the first time the encoder is called, but it also means
that mark/reset will not work with such charsets, as subsequent calls will not generate the
BOM again.

I do not know how to fix this in a clean way atm, but I would consider the CharSequenceInputStream
for UTF-16 charset as broken and document it.
                
      was (Author: tn):
    In the reset method we have to modify also the bbuf variable, as it actually contains
the data to be read.
If you take a look at the fillBuffer method, it actually fills bbuf with as many data from
cbuf as bbuf can hold.

A simple scenario, input string is (AAAAABBBBB), bbuf has a buffer size of 10:

{noformat}
  is.mark();
  byte[] data = new byte[5];
  is.read(data);
{noformat}

data will contain "AAAAA", but bbuf will contain the full input string (AAAAABBBBB).
When we now call:

{noformat}
  is.reset();
  is.read(data);
{noformat}

we just reposition cbuf, but we continue to read from bbuf, thus the result will be "BBBBB".
I think calling bbuf.limit(0) is correct and simple, although it might be possible to improve
it.

Regarding the other failing unit tests:

We do call the encode method with the endOfInput flag always set to true. This means we have
to reset the coder the next time we use it (calling also flush is advised according to javadoc
of CharsetEncoder):

{noformat}
    private void fillBuffer() throws CharacterCodingException {
        this.bbuf.compact();
        this.encoder.reset();
        final CoderResult result = this.encoder.encode(this.cbuf, this.bbuf, true);
        this.encoder.flush(bbuf);
        if (result.isError()) {
            result.throwException();
        }
        this.bbuf.flip();
    }
{noformat}

this error was silently ignored as we created the encoder like this:

{noformat}
        this.encoder = charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
{noformat}

Another thing that I encountered, the CharsetEncoder always precedes the data with with the
byte-order mark when using UTF-16 charset, so we might get strange behavior in the output
bytes, as the byte-order mark may also appear multiple times (everytime the fillBuffer method
is called) in the middle of the data. The number of bytes actually read do not correspond
to the bytes from the input, as the BOM is inserted. So imho, any charset that produces a
BOM does not work for the CharSequenceInputStream atm, and I have not yet an idea how to fix
this.
                  
> CharSequenceInputStream#reset() behaves incorrectly in case when buffer size is not dividable
by data size
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: IO-356
>                 URL: https://issues.apache.org/jira/browse/IO-356
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Streams/Writers
>    Affects Versions: 2.4
>            Reporter: Dmitry Katsubo
>         Attachments: CharSequenceInputStreamTest.java
>
>
> The size effect happens when buffer size of input stream is not dividable by requested
data size. The bug is hidden in {{CharSequenceInputStream#reset()}} method which should also
call (I think) {{bbuf.limit(0)}} otherwise next call to {{CharSequenceInputStream#read()}}
will return the remaining tail which {{bbuf}} has accumulated.
> In the attached test case the test fails, if {{dataSize = 13}} (not dividable by 10)
and runs OK if {{dataSize = 20}} (dividable by 10).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message