commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Liyu Yi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IO-354) Commons IO Tailer does not respect UTF-8 Charset
Date Sat, 27 Oct 2012 00:11:12 GMT

     [ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Liyu Yi updated IO-354:
-----------------------

    Description: 
I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java".
Basically, the current implementation does not work for multi-byte encoded files. See the
following snippet,

448    private long readLines(RandomAccessFile reader) throws IOException {
449        StringBuilder sb = new StringBuilder();
450
451        long pos = reader.getFilePointer();
452        long rePos = pos; // position to re-read
453
454        int num;
455        boolean seenCR = false;
456        while (run && ((num = reader.read(inbuf)) != -1)) {
457            for (int i = 0; i < num; i++) {
458                byte ch = inbuf[i];
459                switch (ch) {
460                case '\n':
461                    seenCR = false; // swallow CR before LF
462                    listener.handle(sb.toString());
463                    sb.setLength(0);
464                    rePos = pos + i + 1;
465                    break;
466                case '\r':
467                    if (seenCR) {
468                        sb.append('\r');
469                    }
470                    seenCR = true;
471                    break;
472                default:
473                    if (seenCR) {
474                        seenCR = false; // swallow final CR
475                        listener.handle(sb.toString());
476                        sb.setLength(0);
477                        rePos = pos + i + 1;
478                    }
479                    sb.append((char) ch); // add character, not its ascii value
480                }
481            }
482
483            pos = reader.getFilePointer();
484        }
485
486        reader.seek(rePos); // Ensure we can re-read if necessary
487        return rePos;
488    }

At line 479, the conversion of byte to char type breaks the encoding.

  was:
I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java".
Basically, the current implementation does not work for multi-byte encoded files. See the
following snippet,

448    private long readLines(RandomAccessFile reader) throws IOException {
449        StringBuilder sb = new StringBuilder();
450
451        long pos = reader.getFilePointer();
452        long rePos = pos; // position to re-read
453
454        int num;
455        boolean seenCR = false;
456        while (run && ((num = reader.read(inbuf)) != -1)) {
457            for (int i = 0; i < num; i++) {
458                byte ch = inbuf[i];
459                switch (ch) {
460                case '\n':
461                    seenCR = false; // swallow CR before LF
462                    listener.handle(sb.toString());
463                    sb.setLength(0);
464                    rePos = pos + i + 1;
465                    break;
466                case '\r':
467                    if (seenCR) {
468                        sb.append('\r');
469                    }
470                    seenCR = true;
471                    break;
472                default:
473                    if (seenCR) {
474                        seenCR = false; // swallow final CR
475                        listener.handle(sb.toString());
476                        sb.setLength(0);
477                        rePos = pos + i + 1;
478                    }
479                    sb.append((char) ch); // add character, not its ascii value
480                }
481            }
482
483            pos = reader.getFilePointer();
484        }
485
486        reader.seek(rePos); // Ensure we can re-read if necessary
487        return rePos;
488    }

At line 479, the conversion of byte to char types breaks the encoding.

    
> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
>                 Key: IO-354
>                 URL: https://issues.apache.org/jira/browse/IO-354
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.3
>         Environment: JDK 7 
> RHEL Linux
> Apache Commons IO version 2.4
>            Reporter: Liyu Yi
>              Labels: Charset, Encoding, Tailer
>
> I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java".
Basically, the current implementation does not work for multi-byte encoded files. See the
following snippet,
> 448    private long readLines(RandomAccessFile reader) throws IOException {
> 449        StringBuilder sb = new StringBuilder();
> 450
> 451        long pos = reader.getFilePointer();
> 452        long rePos = pos; // position to re-read
> 453
> 454        int num;
> 455        boolean seenCR = false;
> 456        while (run && ((num = reader.read(inbuf)) != -1)) {
> 457            for (int i = 0; i < num; i++) {
> 458                byte ch = inbuf[i];
> 459                switch (ch) {
> 460                case '\n':
> 461                    seenCR = false; // swallow CR before LF
> 462                    listener.handle(sb.toString());
> 463                    sb.setLength(0);
> 464                    rePos = pos + i + 1;
> 465                    break;
> 466                case '\r':
> 467                    if (seenCR) {
> 468                        sb.append('\r');
> 469                    }
> 470                    seenCR = true;
> 471                    break;
> 472                default:
> 473                    if (seenCR) {
> 474                        seenCR = false; // swallow final CR
> 475                        listener.handle(sb.toString());
> 476                        sb.setLength(0);
> 477                        rePos = pos + i + 1;
> 478                    }
> 479                    sb.append((char) ch); // add character, not its ascii value
> 480                }
> 481            }
> 482
> 483            pos = reader.getFilePointer();
> 484        }
> 485
> 486        reader.seek(rePos); // Ensure we can re-read if necessary
> 487        return rePos;
> 488    }
> At line 479, the conversion of byte to char type breaks the encoding.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message