commons-issues mailing list archives

From "Sebb (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
Date Tue, 16 Apr 2013 22:49:15 GMT

    [ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633510#comment-13633510 ]

Sebb commented on IO-354:
-------------------------

Thanks for the patch.

There's a minor issue with the patch: the conversion from bytes to String relies
on the platform default encoding.
This was probably also true of the original implementation.

Perhaps the class needs to support methods which take an encoding parameter?
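
As a rough illustration of what an encoding-aware method might look like, the sketch below buffers the raw bytes of each line and decodes them with a caller-supplied Charset. The class and method names (CharsetLineSplitter, splitLines) are hypothetical and are not part of Commons IO. It also simplifies line handling by dropping every CR, whereas the existing readLines treats a lone CR as a line terminator. Splitting on the LF byte is safe for UTF-8 and other ASCII-compatible encodings, but not for UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Commons IO.
final class CharsetLineSplitter {

    // Splits the first len bytes of buf into lines and decodes each
    // complete line with the given charset. Bytes after the last LF
    // are ignored here; a real tailer would keep them for the next read.
    static List<String> splitLines(byte[] buf, int len, Charset charset) {
        List<String> lines = new ArrayList<>();
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        for (int i = 0; i < len; i++) {
            byte b = buf[i];
            if (b == '\n') {
                // Decode the whole line at once so multi-byte sequences stay intact.
                lines.add(new String(line.toByteArray(), charset));
                line.reset();
            } else if (b != '\r') {
                // Simplification: every CR is dropped rather than treated as a terminator.
                line.write(b);
            }
        }
        return lines;
    }
}
```

Decoding per line rather than per byte is the point of the sketch: the charset is applied only once a complete byte sequence is available.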
                
> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
>                 Key: IO-354
>                 URL: https://issues.apache.org/jira/browse/IO-354
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.3
>         Environment: JDK 7 
> RHEL Linux
> Apache Commons IO version 2.4
>            Reporter: Liyu Yi
>              Labels: Charset, Encoding, Tailer
>         Attachments: Tailer-commonsio-354.patch
>
>
> I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java".
> Basically, the current implementation does not work for multi-byte encoded files. See the
> following snippet:
> 448    private long readLines(RandomAccessFile reader) throws IOException {
> 449        StringBuilder sb = new StringBuilder();
> 450
> 451        long pos = reader.getFilePointer();
> 452        long rePos = pos; // position to re-read
> 453
> 454        int num;
> 455        boolean seenCR = false;
> 456        while (run && ((num = reader.read(inbuf)) != -1)) {
> 457            for (int i = 0; i < num; i++) {
> 458                byte ch = inbuf[i];
> 459                switch (ch) {
> 460                case '\n':
> 461                    seenCR = false; // swallow CR before LF
> 462                    listener.handle(sb.toString());
> 463                    sb.setLength(0);
> 464                    rePos = pos + i + 1;
> 465                    break;
> 466                case '\r':
> 467                    if (seenCR) {
> 468                        sb.append('\r');
> 469                    }
> 470                    seenCR = true;
> 471                    break;
> 472                default:
> 473                    if (seenCR) {
> 474                        seenCR = false; // swallow final CR
> 475                        listener.handle(sb.toString());
> 476                        sb.setLength(0);
> 477                        rePos = pos + i + 1;
> 478                    }
> 479                    sb.append((char) ch); // add character, not its ascii value
> 480                }
> 481            }
> 482
> 483            pos = reader.getFilePointer();
> 484        }
> 485
> 486        reader.seek(rePos); // Ensure we can re-read if necessary
> 487        return rePos;
> 488    }
> At line 479, the conversion of byte to char type breaks the encoding.
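
To make the effect of the cast at line 479 concrete, here is a small sketch comparing a per-byte cast with decoding the same bytes as UTF-8. The class and method names (ByteCastDemo, castEachByte, decodeUtf8) are made up for this example. In Java, casting a negative byte to char sign-extends it first, so each byte of a multi-byte sequence becomes its own unrelated character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class ByteCastDemo {

    // Mirrors the approach at line 479: one char per byte.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b); // a byte >= 0x80 is negative and sign-extends
        }
        return sb.toString();
    }

    // Decodes the whole byte sequence with an explicit charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

For the two UTF-8 bytes of U+00E9 (0xC3 0xA9), castEachByte yields the two chars U+FFC3 and U+FFA9, while decodeUtf8 yields the single expected character.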

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
