hive-user mailing list archives

From Sam Joe <games2013....@gmail.com>
Subject Re: Need suggestions on processing JSON junk (e.g., invalid double quotes) data using HIVE
Date Thu, 22 Oct 2015 23:00:13 GMT
Hi,

Please see the logs below:

hive> SELECT t.retweeted_screen_name,
    >        Sum(retweets) AS total_retweets,
    >        Count(*)      AS tweet_count
    > FROM   (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
    >                retweeted_status.text,
    >                Max(retweet_count)                AS retweets
    >         FROM   tweets
    >         GROUP  BY retweeted_status.user.screen_name,
    >                   retweeted_status.text) t
    > GROUP  BY t.retweeted_screen_name
    > ORDER  BY total_retweets DESC
    > LIMIT  1;
Query ID = joe_20151022143018_f680c6fd-5d6d-4d5e-8d20-df25396a84d5
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1445537142761_0002, Tracking URL = http://localhost:8088/proxy/application_1445537142761_0002/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1445537142761_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 2
2015-10-22 14:30:51,478 Stage-1 map = 0%,  reduce = 0%
2015-10-22 14:39:14,950 Stage-1 map = 69%,  reduce = 17%, Cumulative CPU 62.09 sec
2015-10-22 14:39:17,556 Stage-1 map = 70%,  reduce = 17%, Cumulative CPU 63.9 sec
2015-10-22 14:39:20,209 Stage-1 map = 71%,  reduce = 17%, Cumulative CPU 65.86 sec
2015-10-22 14:39:25,098 Stage-1 map = 72%,  reduce = 17%, Cumulative CPU 67.68 sec
2015-10-22 14:39:26,126 Stage-1 map = 74%,  reduce = 17%, Cumulative CPU 69.33 sec
2015-10-22 14:39:29,943 Stage-1 map = 75%,  reduce = 17%, Cumulative CPU 71.09 sec
2015-10-22 14:39:34,993 Stage-1 map = 77%,  reduce = 17%, Cumulative CPU 74.86 sec
2015-10-22 14:39:43,505 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 25.47 sec
MapReduce Total cumulative CPU time: 25 seconds 470 msec
Ended Job = job_1445537142761_0002 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1445537142761_0002_m_000001 (and more) from job job_1445537142761_0002

Task with the most failures(1):
-----
Task ID:
  task_1445537142761_0002_m_000000
URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1445537142761_0002&tipid=task_1445537142761_0002_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":true,"truncated":false,"lang":"it","in_reply_to_status_id_str":null,"id":654395624406675456,"in_reply_to_user_id_str":null,"timestamp_ms":"1444855049598","in_reply_to_status_id":null,"created_at":"Wed
Oct 14 20:37:29 +0000
2015","favorite_count":0,"place":null,"coordinates":null,"text":"Samaritani:
\"E-fattura strumento chiave per la business intelligence nella PA\"
https://t.co/CIckEdB9EG","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[{"expanded_url":"https://lnkd.in/eDRb_sv","indices":[79,102],"display_url":"lnkd.in/eDRb_sv","url":"https://t.co/CIckEdB9EG"}],"hashtags":[],"user_mentions":[]},"is_quote_status":false,"source":"<a
href=\"http://www.linkedin.com/\"
rel=\"nofollow\">LinkedIn<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"654395624406675456","user":{"location":"pisa","default_profile":true,"profile_background_tile":false,"statuses_count":2924,"lang":"it","profile_link_color":"0084B4","profile_banner_url":"https://pbs.twimg.com/profile_banners/145360070/1422279238","id":145360070,"following":null,"protected":false,"favourites_count":660,"profile_text_color":"333333","verified":false,"description":null,"contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name":"marco
andreozzi","profile_background_color":"C0DEED","created_at":"Tue May
18 19:49:58 +0000
2010","default_profile_image":false,"followers_count":178,"profile_image_url_https":"https://pbs.twimg.com/profile_images/643826897231724544/odFpg1zd_normal.jpg","geo_enabled":true,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","follow_request_sent":null,"url":null,"utc_off
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":true,"truncated":false,"lang":"it","in_reply_to_status_id_str":null,"id":654395624406675456,"in_reply_to_user_id_str":null,"timestamp_ms":"1444855049598","in_reply_to_status_id":null,"created_at":"Wed
Oct 14 20:37:29 +0000
2015","favorite_count":0,"place":null,"coordinates":null,"text":"Samaritani:
\"E-fattura strumento chiave per la business intelligence nella PA\"
https://t.co/CIckEdB9EG","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[{"expanded_url":"https://lnkd.in/eDRb_sv","indices":[79,102],"display_url":"lnkd.in/eDRb_sv","url":"https://t.co/CIckEdB9EG"}],"hashtags":[],"user_mentions":[]},"is_quote_status":false,"source":"<a
href=\"http://www.linkedin.com/\"
rel=\"nofollow\">LinkedIn<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"654395624406675456","user":{"location":"pisa","default_profile":true,"profile_background_tile":false,"statuses_count":2924,"lang":"it","profile_link_color":"0084B4","profile_banner_url":"https://pbs.twimg.com/profile_banners/145360070/1422279238","id":145360070,"following":null,"protected":false,"favourites_count":660,"profile_text_color":"333333","verified":false,"description":null,"contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name":"marco
andreozzi","profile_background_color":"C0DEED","created_at":"Tue May
18 19:49:58 +0000
2010","default_profile_image":false,"followers_count":178,"profile_image_url_https":"https://pbs.twimg.com/profile_images/643826897231724544/odFpg1zd_normal.jpg","geo_enabled":true,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","follow_request_sent":null,"url":null,"utc_off
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:516)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
        ... 8 more
Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was expecting closing '"' for name
 at [Source: java.io.StringReader@14305ac; line: 1, column: 3683]
        at com.cloudera.hive.serde.JSONSerDe.deserialize(JSONSerDe.java:128)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:141)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:105)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
        ... 9 more
Caused by: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was expecting closing '"' for name
 at [Source: java.io.StringReader@14305ac; line: 1, column: 3683]
        at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
        at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
        at org.codehaus.jackson.impl.JsonParserMinimalBase._reportInvalidEOF(JsonParserMinimalBase.java:454)
        at org.codehaus.jackson.impl.ReaderBasedParser._parseFieldName2(ReaderBasedParser.java:1025)
        at org.codehaus.jackson.impl.ReaderBasedParser._parseFieldName(ReaderBasedParser.java:1008)
        at org.codehaus.jackson.impl.ReaderBasedParser.nextToken(ReaderBasedParser.java:418)
        at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:219)
        at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
        at org.codehaus.jackson.map.deser.std.MapDeserializer._readAndBind(MapDeserializer.java:319)
        at org.codehaus.jackson.map.deser.std.MapDeserializer.deserialize(MapDeserializer.java:249)
        at org.codehaus.jackson.map.deser.std.MapDeserializer.deserialize(MapDeserializer.java:33)
        at org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:2732)
        at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1863)
        at com.cloudera.hive.serde.JSONSerDe.deserialize(JSONSerDe.java:126)
        ... 12 more

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143


java.net.ConnectException: Call From joe-virtual-machine/127.0.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Reduce: 2   Cumulative CPU: 25.47 sec   HDFS Read: 50988492 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 25 seconds 470 msec
hive>
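
The Jackson message in the trace says the input ended while the parser was still reading a field name (line 1, column 3683), which suggests that at least one record is cut off rather than merely mis-quoted. A minimal sketch for finding such records outside Hive follows. It assumes a local copy of one of the Flume output files with one JSON record per line; the function name find_bad_lines is made up for illustration, and Python's json module is not the same parser as Jackson, so the reported columns and messages will differ from the ones above.

```python
import json


def find_bad_lines(lines):
    """Return (line_number, column, message) for every line that is not valid JSON.

    `lines` is any iterable of strings, for example an open text file.
    Blank lines are skipped and line numbers start at 1.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # ignore empty lines between records
        try:
            json.loads(text)
        except json.JSONDecodeError as err:
            # err.colno is the 1-based column where parsing stopped
            bad.append((number, err.colno, err.msg))
    return bad
```

Passing an open file handle to find_bad_lines and printing the result gives the line numbers to inspect; a record that stops in the middle of a key such as "utc_off would show up in that list.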



Thanks,
Joel

On Thu, Oct 22, 2015 at 6:02 PM, Sam Joe <games2013.sam@gmail.com> wrote:

> Hi,
>
> After streaming Twitter data to HDFS using Flume, I'm trying to analyze it
> with some Hive queries. The data is in JSON format and is not clean: it has
> double quotes (") in the wrong places, which cause the Hive queries to fail.
> I am getting the following error:
>
> Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was expecting closing '"' for name
>
> The script used for creating the external table:
>
> ADD JAR /usr/local/hive/apache-hive-1.2.1-bin/lib/hive-serdes-1.0-SNAPSHOT.jar;
> set hive.support.sql11.reserved.keywords = false;
> CREATE EXTERNAL TABLE tweets (
> id BIGINT,
> created_at STRING,
> source STRING,
> favorited BOOLEAN,
> retweet_count INT,
> retweeted_status STRUCT<
> text:STRING,
> user:STRUCT<screen_name:STRING,name:STRING>>,
> entities STRUCT<
> urls:ARRAY<STRUCT<expanded_url:STRING>>,
> user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
> hashtags:ARRAY<STRUCT<text:STRING>>>,
> text STRING,
> user STRUCT<
> screen_name:STRING,
> name:STRING,
> friends_count:INT,
> followers_count:INT,
> statuses_count:INT,
> verified:BOOLEAN,
> utc_offset:INT,
> time_zone:STRING>,
> in_reply_to_screen_name STRING)
> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
> LOCATION '/usr/local/hadoop/bin/tweets';
>
> Since I don't know which rows contain the extra double quotes, I can't
> add an escape character by hand. How can I escape the junk characters and
> process the data successfully?
>
> Appreciate any help.
>
> Thanks,
>
> Joel
>
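
One possible way to get past the bad rows without knowing where they are is to drop every line that does not parse before Hive reads it. The sketch below assumes one record per line and that losing the malformed tweets is acceptable. valid_json_lines is a hypothetical helper, not part of Hive or the Cloudera SerDe, and Python's json module may accept or reject slightly different input than Jackson does.

```python
import json


def valid_json_lines(lines):
    """Yield only the lines that parse as a JSON object.

    `lines` is any iterable of strings (a file, sys.stdin, a list).
    Blank lines, unparseable lines, and non-object values are dropped.
    """
    for line in lines:
        text = line.strip()
        if not text:
            continue
        try:
            record = json.loads(text)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
        if isinstance(record, dict):
            yield text
```

Used as a filter, for example `sys.stdout.writelines(t + "\n" for t in valid_json_lines(sys.stdin))` after `import sys`, it writes only the parseable records. The cleaned output could then be stored in a separate directory and the external table's LOCATION pointed there, leaving the raw Flume files untouched.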
