hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Artem Ervits <artemerv...@gmail.com>
Subject Re: Need suggestions on processing JSON junk (e.g., invalid double quotes) data using HIVE
Date Fri, 23 Oct 2015 10:11:40 GMT
Hive 0.13 and up has json serde built in, no need to register another serde
. Flume has a Hive streaming sink so you could directly stream to Hive as
well with flume 1.6.the json serde is from hcatalog BTW. You sure the text
field doesn't have the quote? User-entered data may be malformed. Your
other option is to stream dta as string and then use json_object function
to query your data.
On Oct 22, 2015 6:02 PM, "Sam Joe" <games2013.sam@gmail.com> wrote:

> Hi,
>
> After streaming twitter data to HDFS using Flume, I'm trying to analyze it
> using some HIVE queries. The data is in JSON format and not clean having
> double quotes (") in wrong places causing the HIVE queries to fail. I am
> getting the following error:
>
> Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException:
org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was expecting closing '"'
for name
>
> The script used for creating the external table:
>
> ADD JAR /usr/local/hive/apache-hive-1.2.1-bin/lib/hive-serdes-1.0-SNAPSHOT.jar;set hive.support.sql11.reserved.keywords
= false;
> CREATE EXTERNAL TABLE tweets (
> id BIGINT,
> created_at STRING,
> source STRING,
> favorited BOOLEAN,
> retweet_count INT,
> retweeted_status STRUCT<
> text:STRING,
> user:STRUCT<screen_name:STRING,name:STRING>>,
> entities STRUCT<
> urls:ARRAY<STRUCT<expanded_url:STRING>>,
> user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
> hashtags:ARRAY<STRUCT<text:STRING>>>,
> text STRING,
> user STRUCT<
> screen_name:STRING,
> name:STRING,
> friends_count:INT,
> followers_count:INT,
> statuses_count:INT,
> verified:BOOLEAN,
> utc_offset:INT,
> time_zone:STRING>,
> in_reply_to_screen_name STRING)
> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
> LOCATION '/usr/local/hadoop/bin/tweets';
>
> Since I would not know for which row the extra double quotes is present, I
> can't put an escape character. How can I escape the junk characters and
> process the data successfully?
>
> Appreciate any help.
>
> Thanks,
>
> Joel
>

Mime
View raw message