Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3762411BC9 for ; Wed, 17 Sep 2014 21:05:25 +0000 (UTC) Received: (qmail 89944 invoked by uid 500); 17 Sep 2014 21:05:23 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 89872 invoked by uid 500); 17 Sep 2014 21:05:23 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 89862 invoked by uid 99); 17 Sep 2014 21:05:23 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Sep 2014 21:05:23 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of charles.robertson@gmail.com designates 209.85.217.180 as permitted sender) Received: from [209.85.217.180] (HELO mail-lb0-f180.google.com) (209.85.217.180) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Sep 2014 21:05:18 +0000 Received: by mail-lb0-f180.google.com with SMTP id b12so2613589lbj.25 for ; Wed, 17 Sep 2014 14:04:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=XOauqwfUx7nNA2Z59mF6w6lFfCI1RXm50y4k6S24y9w=; b=LSnefASm61KEbQpOPnmxtTdrJCPbi08fI+9hpZi8ZHVsHsVSaIuORv86O5J/iChw0+ WwRiFGZ55CvrB6ev9KkXef3TSFIhPuMN+ww3Zg4gbj+YCjpfppJ+426HS1Pr3mxcC4Zo d9UgaxESdCCezKMqgZTxn0wEr4eRsEuuEWhQP22+Xz33jYFXc3+ZyzUES4CVRSOB403x 5oVWsbVRkxNrjrSoV5ZMayX/uNt+1m9te0deJvNsR3R+l/tb2vK9xzj2qtt6OjIzjDiY oZH1qdUN7sjzoa+RligxUL244QrOJnCPp40/CVf/1qbEEM5Zn4fqygKWpweHyuTav+3J BiCQ== MIME-Version: 1.0 X-Received: by 10.152.29.129 with SMTP id k1mr181138lah.81.1410987897095; Wed, 17 Sep 2014 14:04:57 -0700 (PDT) Received: by 10.112.77.196 with HTTP; Wed, 17 Sep 2014 14:04:57 -0700 (PDT) In-Reply-To: References: Date: Wed, 17 Sep 2014 22:04:57 +0100 Message-ID: Subject: Re: Remove non-ascii characters From: Charles Robertson To: user@hive.apache.org Content-Type: multipart/alternative; boundary=089e0160b7da638ac00503493822 X-Virus-Checked: Checked by ClamAV on apache.org --089e0160b7da638ac00503493822 Content-Type: text/plain; charset=UTF-8 Have answered my own question - dropped the serde properties from the tweets_raw table and used the get_json_object function in the select statement. Regards, Charles On 17 September 2014 21:12, Charles Robertson wrote: > I've written a script in python to strip out the ascii characters, and am > using the streaming mapreduce functionality to scrub the text of non-ascii > characters. > > However, this has now broken my hive queries because whereas I previously > had one JSON object per line, I now have a JSON object, a tab and blank, so > using LOAD DATA is no longer working. I've tried loading the data into a > two-column table and then doing an INSERT SELECT on the field containing > the JSON, but it doesn't like this because I'm selecting one column and the > receiving table has five - this is the table definition: > > CREATE EXTERNAL TABLE tweets_raw ( > id BIGINT, > created_at STRING, > text STRING, > screen_name STRING, > name STRING > ) > ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde' > WITH SERDEPROPERTIES ( > 'paths'='id, created_at, text, user.screen_name, user.name' > ); > > Can anyone suggest how to get around this? > > Thanks, > Charles > > On 17 September 2014 09:33, Charles Robertson > wrote: > >> Hi all, >> >> What is the best way to remove non-ascii characters in Hive? I found this >> thread from over a year ago: >> http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3CCAKm=R7VE+VrQCCOhK69aPEEqNB=9zO8UAJ01ys8AuCAVvd9vag@mail.gmail.com%3E >> which sort of implies there isn't a pre-defined way of doing it, and the >> JIRA raised hasn't been implemented yet. >> >> Does anyone have any suggestions? >> >> Thanks, >> Charles >> > > --089e0160b7da638ac00503493822 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Have answered my own question - dropped the serde pro= perties from the tweets_raw table and used the get_json_object function in = the select statement.

Regards,
Charles

On 17 September 2014 21:12,= Charles Robertson <charles.robertson@gmail.com> w= rote:
I've= written a script in python to strip out the ascii characters, and am using= the streaming mapreduce functionality to scrub the text of non-ascii chara= cters.

However, this has now broken my hive queries because wh= ereas I previously had one JSON object per line, I now have a JSON object, = a tab and blank, so using LOAD DATA is no longer working. I've tried lo= ading the data into a two-column table and then doing an INSERT SELECT on t= he field containing the JSON, but it doesn't like this because I'm = selecting one column and the receiving table has five - this is the table d= efinition:

CREATE EXTERNAL TABLE tweets_raw (
=C2=A0=C2=A0 id BIG= INT,
=C2=A0=C2=A0 created_at STRING,
=C2=A0=C2=A0 text STRING,
=C2= =A0=C2=A0 screen_name STRING,
=C2=A0=C2=A0 name STRING
)
ROW FORMA= T SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPER= TIES (
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 'paths'=3D'id, create= d_at, text, user.screen_name, user.name'
=C2=A0=C2=A0=C2=A0 =C2=A0 );

Can anyo= ne suggest how to get around this?

Thanks,
Charles

On 17 September 2014 09:33, Charles Robertson <charles.robertson@gmail.com> wrote:
Hi all,

What is the best = way to remove non-ascii characters in Hive? I found this thread from over a= year ago:=C2=A0http://mail-archives.apache.org/mod_mbo= x/hive-user/201303.mbox/%3CCAKm=3DR7VE+VrQCCOhK69aPEEqNB=3D9zO8UAJ01ys8AuCA= Vvd9vag@mail.gmail.com%3E which sort of implies there isn't a pre-d= efined way of doing it, and the JIRA raised hasn't been implemented yet= .

Does anyone have any suggestions?

=
Thanks,
Charles


--089e0160b7da638ac00503493822--