Date: Sat, 27 Oct 2012 00:26:04 +0100
Subject: Re: Commons IO Tailer does not respect UTF-8 Charset
From: sebb
To: Commons Developers List
Cc: user@commons.apache.org, Liyu Yi

On 26 October 2012 23:03, Liyu Yi wrote:
> I just realized there is a defect in the source code of
> "org.apache.commons.io.input.Tailer.java".

Thanks for the feedback.

Bug fixes are better provided as JIRA bugs.
It's hard to keep track of bug reports and patches when they are mixed
in with all the other mailing list traffic.

Could you create a JIRA issue for this problem please?

Also, in future, please do not cross-post to both the developer and
user lists. The developers follow the user list.

> Basically, the current
> implementation does not work for multi-byte encoded files.
> See the following snippet,
>
> 448  private long readLines(RandomAccessFile reader) throws IOException {
> 449      StringBuilder sb = new StringBuilder();
> 450
> 451      long pos = reader.getFilePointer();
> 452      long rePos = pos; // position to re-read
> 453
> 454      int num;
> 455      boolean seenCR = false;
> 456      while (run && ((num = reader.read(inbuf)) != -1)) {
> 457          for (int i = 0; i < num; i++) {
> 458              byte ch = inbuf[i];
> 459              switch (ch) {
> 460              case '\n':
> 461                  seenCR = false; // swallow CR before LF
> 462                  listener.handle(sb.toString());
> 463                  sb.setLength(0);
> 464                  rePos = pos + i + 1;
> 465                  break;
> 466              case '\r':
> 467                  if (seenCR) {
> 468                      sb.append('\r');
> 469                  }
> 470                  seenCR = true;
> 471                  break;
> 472              default:
> 473                  if (seenCR) {
> 474                      seenCR = false; // swallow final CR
> 475                      listener.handle(sb.toString());
> 476                      sb.setLength(0);
> 477                      rePos = pos + i + 1;
> 478                  }
> 479                  sb.append((char) ch); // add character, not its ascii value
> 480              }
> 481          }
> 482
> 483          pos = reader.getFilePointer();
> 484      }
> 485
> 486      reader.seek(rePos); // Ensure we can re-read if necessary
> 487      return rePos;
> 488  }
>
> At line 479, the conversion of byte to char types breaks the encoding.
>
> I used a "hacky" fix to reconstruct the String with right encoding in the
> handler class.
>
>     private String rebuildUTF8String(String line) {
>         int len = line.length();
>         byte[] bytes = new byte[len];
>         for (int i = 0; i < len; i++) {
>             bytes[i] = (byte) line.charAt(i);
>         }
>         return new String(bytes, UTF8);
>     }
>
> However, the right approach is to pass in the encoding to the "create"
> method and handle it in the Tailer.
>
> Regards,
>
> Liyu Yi
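
For illustration only, here is a minimal sketch of what handling the
encoding at read time could look like. It is not the Tailer code and not
a proposed patch: the class name CharsetAwareLineSplitter and the method
splitLines are hypothetical, and the sketch returns the lines in a List
where the snippet above calls listener.handle(). The idea is to collect
the raw bytes of each line and decode them with the caller's Charset
once the line is complete, instead of casting each byte to char. It
assumes an encoding such as UTF-8, where the bytes 0x0A and 0x0D never
occur inside a multi-byte sequence; it would not be correct for UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Commons IO.
final class CharsetAwareLineSplitter {

    private CharsetAwareLineSplitter() {
    }

    /**
     * Splits the first num bytes of buf into lines and decodes each
     * complete line with the given charset. Bytes after the last line
     * terminator are not returned (the original code re-reads them later).
     */
    static List<String> splitLines(byte[] buf, int num, Charset charset) {
        List<String> lines = new ArrayList<String>();
        ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();
        boolean seenCR = false;
        for (int i = 0; i < num; i++) {
            byte ch = buf[i];
            switch (ch) {
            case '\n':
                seenCR = false; // swallow CR before LF
                lines.add(new String(lineBytes.toByteArray(), charset));
                lineBytes.reset();
                break;
            case '\r':
                if (seenCR) {
                    lineBytes.write('\r');
                }
                seenCR = true;
                break;
            default:
                if (seenCR) {
                    seenCR = false; // a lone CR ends the previous line
                    lines.add(new String(lineBytes.toByteArray(), charset));
                    lineBytes.reset();
                }
                lineBytes.write(ch); // keep the raw byte, decode later
            }
        }
        return lines;
    }
}
```

Decoding a whole line at once means a multi-byte character is never split
across two decode calls within a line, which is what goes wrong with the
per-byte cast at line 479.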