hive-user mailing list archives

From Charles Robertson <charles.robert...@gmail.com>
Subject Re: Remove non-ascii characters
Date Wed, 17 Sep 2014 20:12:40 GMT
I've written a script in Python to strip out the non-ASCII characters, and
am running it through the streaming MapReduce functionality to scrub the
text.
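
For illustration, a minimal sketch of what such a scrubbing script could
look like. This is not the actual script; the names strip_non_ascii and
scrub_lines are hypothetical, and it assumes the input is plain text with
one record per line.

```python
import sys

def strip_non_ascii(text):
    """Return text with every character outside the ASCII range removed."""
    return "".join(ch for ch in text if ord(ch) < 128)

def scrub_lines(lines):
    """Yield each input line with its non-ASCII characters removed."""
    for line in lines:
        yield strip_non_ascii(line)

# As a streaming mapper, this would be driven with:
#     sys.stdout.writelines(scrub_lines(sys.stdin))
```

Newlines are ASCII, so the one-record-per-line layout of the input is
preserved by the script itself.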

However, this has now broken my Hive queries: where I previously had one
JSON object per line, each line now holds a JSON object followed by a tab
and an empty value, so using LOAD DATA no longer works. I've tried loading
the data into a two-column table and then doing an INSERT SELECT on the
column containing the JSON, but Hive rejects this because I'm selecting one
column and the receiving table has five. This is the table definition:

CREATE EXTERNAL TABLE tweets_raw (
   id BIGINT,
   created_at STRING,
   text STRING,
   screen_name STRING,
   name STRING
)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
   'paths'='id, created_at, text, user.screen_name, user.name'
);

Can anyone suggest how to get around this?
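
One direction that might work, as an untested sketch: strip the trailing
whitespace from each line of the job output before loading it, so that
each line is again a single JSON object. This assumes the extra tab comes
from the streaming job writing each line as a key followed by an empty
value, and that the JSON text itself never ends in whitespace. The name
drop_trailing_tab is hypothetical.

```python
def drop_trailing_tab(line):
    """Remove trailing whitespace (tab, blanks, newline) and re-add one newline."""
    return line.rstrip() + "\n"

def clean_lines(lines):
    """Yield each output line reduced to its JSON text plus a newline."""
    for line in lines:
        yield drop_trailing_tab(line)
```

Applied to the job's output files, a line such as '{"id": 1}' followed by
a tab comes back as '{"id": 1}' alone.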

Thanks,
Charles

On 17 September 2014 09:33, Charles Robertson <charles.robertson@gmail.com>
wrote:

> Hi all,
>
> What is the best way to remove non-ascii characters in Hive? I found this
> thread from over a year ago:
> http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3CCAKm=R7VE+VrQCCOhK69aPEEqNB=9zO8UAJ01ys8AuCAVvd9vag@mail.gmail.com%3E
> which sort of implies there isn't a pre-defined way of doing it, and the
> JIRA raised hasn't been implemented yet.
>
> Does anyone have any suggestions?
>
> Thanks,
> Charles
>
