From: "Liyu Yi (JIRA)"
To: issues@commons.apache.org
Date: Sat, 27 Oct 2012 00:11:11 +0000 (UTC)
Message-ID: <1548606425.34175.1351296672173.JavaMail.jiratomcat@arcas>
In-Reply-To: <1749514562.34174.1351296554719.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

    [ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485300#comment-13485300 ]

Liyu Yi commented on IO-354:
----------------------------

I used a "hacky" fix to reconstruct the String with the right encoding in the handler class.
    private String rebuildUTF8String(String line) {
        int len = line.length();
        byte[] bytes = new byte[len];
        for (int i=0; i [rest of the method is truncated in the archived message]

> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
>                 Key: IO-354
>                 URL: https://issues.apache.org/jira/browse/IO-354
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.3
>         Environment: JDK 7
> RHEL Linux
> Apache Commons IO version 2.4
>            Reporter: Liyu Yi
>              Labels: Charset, Encoding, Tailer
>
> I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". Basically, the current implementation does not work for multi-byte encoded files. See the following snippet,
> 448     private long readLines(RandomAccessFile reader) throws IOException {
> 449         StringBuilder sb = new StringBuilder();
> 450
> 451         long pos = reader.getFilePointer();
> 452         long rePos = pos; // position to re-read
> 453
> 454         int num;
> 455         boolean seenCR = false;
> 456         while (run && ((num = reader.read(inbuf)) != -1)) {
> 457             for (int i = 0; i < num; i++) {
> 458                 byte ch = inbuf[i];
> 459                 switch (ch) {
> 460                 case '\n':
> 461                     seenCR = false; // swallow CR before LF
> 462                     listener.handle(sb.toString());
> 463                     sb.setLength(0);
> 464                     rePos = pos + i + 1;
> 465                     break;
> 466                 case '\r':
> 467                     if (seenCR) {
> 468                         sb.append('\r');
> 469                     }
> 470                     seenCR = true;
> 471                     break;
> 472                 default:
> 473                     if (seenCR) {
> 474                         seenCR = false; // swallow final CR
> 475                         listener.handle(sb.toString());
> 476                         sb.setLength(0);
> 477                         rePos = pos + i + 1;
> 478                     }
> 479                     sb.append((char) ch); // add character, not its ascii value
> 480                 }
> 481             }
> 482
> 483             pos = reader.getFilePointer();
> 484         }
> 485
> 486         reader.seek(rePos); // Ensure we can re-read if necessary
> 487         return rePos;
> 488     }
> At line 479, the conversion of byte to char types breaks the encoding.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
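
For readers of the archive, here is a minimal, self-contained sketch of the effect described at line 479 and of one way a handler might undo it. It is an illustration only, not the code from the comment above (which is truncated) and not part of Commons IO: the class name TailerEncodingSketch and the methods castBytesToChars and rebuildUtf8 are hypothetical names chosen for this example. The sketch assumes the file is UTF-8 and that every char in the delivered line came from a single byte cast, as in the quoted readLines().

```java
import java.nio.charset.StandardCharsets;

public class TailerEncodingSketch {

    // Mimics line 479 of the quoted readLines(): each byte is cast to a char
    // on its own, so the bytes of a multi-byte UTF-8 sequence are never decoded together.
    static String castBytesToChars(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Hypothetical handler-side repair (an assumption, not the author's exact code):
    // keep the low 8 bits of each char to recover the original bytes,
    // then decode the whole array as UTF-8 in one step.
    static String rebuildUtf8(String line) {
        byte[] bytes = new byte[line.length()];
        for (int i = 0; i < line.length(); i++) {
            bytes[i] = (byte) line.charAt(i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // the last character takes two bytes in UTF-8
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        String garbled = castBytesToChars(raw);
        System.out.println("garbled equals original: " + garbled.equals(original));

        String repaired = rebuildUtf8(garbled);
        System.out.println("repaired equals original: " + repaired.equals(original));
    }
}
```

The repair only works because the cast is reversible byte by byte; a cleaner fix would probably collect the raw bytes of each line and decode them with the intended Charset before calling the listener.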