Date: Wed, 8 Jul 2009 00:03:03 -0700
Message-ID: <34fd060d0907080003n42ee211fhc11ef3b01d23c2bc@mail.gmail.com>
Subject: Re: unicode supporting in hive
From: Zheng Shao
To: hive-user@hadoop.apache.org

Hi Min,

The separators used in Hive are by default ^A, ^B, ^C ... (ASCII codes 1, 2, 3, etc.).
These byte values won't appear as part of any multi-byte character in either UTF-8 or GBK.
Please see these code maps for details:
http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/GBK

Zheng

On Tue, Jul 7, 2009 at 11:59 PM, Min Zhou wrote:
> Hi all,
> It seems that Hive would go wrong when storing unicode strings. Hive uses
> byte comparison for delimiting the fields of a record
> (see LazyStruct.java:92, a parse method).
> If we use GBK or UTF-8 encoding, where characters may need more than 1
> byte, perhaps 2-3 bytes, then the separator used for delimiting fields
> could by coincidence equal one of the bytes of a GBK/UTF-8 encoded character,
> and thus things go wrong.
> Can Hive solve the problem above?
>
> Thanks,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>

--
Yours,
Zheng
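
For anyone who wants to check the claim above rather than read the code maps, here is a rough sketch in Python. It is not Hive code: the helper names (delimiter_bytes_in, lowest_multibyte_byte, join_row, split_row) and the sample strings are made up for illustration, and it relies on Python's built-in "utf-8" and "gbk" codecs. It scans the multi-byte characters of the Basic Multilingual Plane in each encoding and reports the smallest byte value found, then round-trips a ^A-delimited row.

```python
# Hive's default delimiters as stated above: ^A, ^B, ^C (byte values 1, 2, 3).
DELIMITERS = {0x01, 0x02, 0x03}


def delimiter_bytes_in(text, encoding):
    """Return the delimiter byte values that occur in the encoded text."""
    return DELIMITERS & set(text.encode(encoding))


def lowest_multibyte_byte(encoding):
    """Smallest byte value used in any multi-byte character (BMP only)."""
    lowest = 0xFF
    for code_point in range(0x80, 0x10000):
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            continue  # surrogate, or not representable in this encoding
        if len(encoded) > 1:
            lowest = min(lowest, min(encoded))
    return lowest


def join_row(fields, encoding):
    """Hypothetical helper: build one ^A-delimited record as bytes."""
    return b"\x01".join(field.encode(encoding) for field in fields)


def split_row(row, encoding):
    """Hypothetical helper: split on the raw ^A byte, then decode each field."""
    return [part.decode(encoding) for part in row.split(b"\x01")]


for enc in ("utf-8", "gbk"):
    fields = ["编号", "名称", "Hive 支持中文"]  # illustrative sample data
    print(enc, "lowest byte in multi-byte chars:", hex(lowest_multibyte_byte(enc)))
    print(enc, "round trip ok:", split_row(join_row(fields, enc), enc) == fields)
```

Splitting on the raw byte is safe here only because no multi-byte character in these two encodings contains a byte that low. A field that itself contains a literal ^A would still be split, which is a separate matter from the encoding.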