hive-user mailing list archives

From Sadananda Hegde <saduhe...@gmail.com>
Subject Re: Bucketing external tables
Date Fri, 05 Apr 2013 22:02:33 GMT
Thanks, Mark.

I found the problem. For some reason, Hive is not able to write an Avro output
file when the schema has a complex field with a NULL option. It reads such a
field without any problem, but cannot write with that structure. For example,
the insert was failing on this array-of-structs field.

{ "name": "Passenger", "type":
    [{"type": "array", "items":
        {"type": "record",
         "name": "PAXStruct",
         "fields": [
             { "name": "PAXCode", "type": ["string", "null"] },
             { "name": "PAXQuantity", "type": ["int", "null"] }
         ]
        }
     }, "null"]
}

I removed the last "null" clause and it's working okay now.
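
The workaround above (dropping the trailing "null" from the union that wraps
the array) can be sketched in Python. This is only an illustration using the
standard json module; the helper name drop_null_from_complex_union is made up
for this example and is not part of Hive or Avro. Note that after this change
the field is no longer nullable, so every record has to supply an array
(possibly an empty one).

```python
import json

# Field definition copied from the message above:
# a union of an array-of-records type and "null".
PASSENGER_FIELD = json.loads("""
{ "name": "Passenger", "type":
    [{"type": "array", "items":
        {"type": "record",
         "name": "PAXStruct",
         "fields": [
             {"name": "PAXCode", "type": ["string", "null"]},
             {"name": "PAXQuantity", "type": ["int", "null"]}
         ]
        }
     }, "null"]
}
""")


def drop_null_from_complex_union(field):
    """Hypothetical helper: turn a [complex_type, "null"] union into complex_type.

    Unions of primitive types (e.g. ["string", "null"]) are left untouched,
    since the thread only reports trouble with the complex (array) case.
    """
    field_type = field["type"]
    if isinstance(field_type, list):
        non_null = [branch for branch in field_type if branch != "null"]
        # Unwrap only when exactly one branch remains and it is a complex type.
        if len(non_null) == 1 and isinstance(non_null[0], dict):
            return {**field, "type": non_null[0]}
    return field


fixed = drop_null_from_complex_union(PASSENGER_FIELD)
print(json.dumps(fixed, indent=2))
```

The nested ["string", "null"] and ["int", "null"] unions inside PAXStruct are
kept as they are; only the outer union around the array is unwrapped.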

Regards,
Sadu


On Thu, Apr 4, 2013 at 12:36 AM, Mark Grover <grover.markgrover@gmail.com> wrote:

> Can you please check your JobTracker logs? This is a generic error related
> to grabbing the Task Attempt Log URL; the real error is in the JT logs.
>
>
> On Wed, Apr 3, 2013 at 7:17 PM, Sadananda Hegde <saduhegde@gmail.com> wrote:
>
>> Hi Dean,
>>
>> I tried inserting into a bucketed Hive table from a non-bucketed table using
>> an INSERT OVERWRITE ... SELECT statement, but I get the following error.
>>
>> ----------------------------------------------------------------------------------
>> Exception in thread "Thread-225" java.lang.NullPointerException
>>         at org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Shims.java:44)
>>         at org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(JobDebugger.java:186)
>>         at org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger.java:142)
>>         at java.lang.Thread.run(Thread.java:662)
>> FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
>>
>> --------------------------------------------------------------------------------------------------------------------------
>>
>> Both tables have the same structure, except that one has a CLUSTERED BY
>> clause and the other does not.
>>
>> Some columns are defined as Array of Structs. The Insert statement works
>> fine if I take out those complex columns. Are there any known issues
>> loading STRUCT or ARRAY OF STRUCT fields?
>>
>>
>> Thanks for your time and help.
>>
>> Sadu
>>
>>
>>
>>
>> On Sat, Mar 30, 2013 at 7:00 PM, Dean Wampler <
>> dean.wampler@thinkbiganalytics.com> wrote:
>>
>>> The table can be external. You should be able to use this data with
>>> other tools, because all bucketing does is ensure that all occurrences for
>>> records with a given key are written into the same block. This is why
>>> clustered/blocked data can be joined on those keys using map-side joins;
>>> Hive knows it can cache an individual block in memory and the block will
>>> hold all records across the table for the keys in that block.
>>>
>>> So, Java MR apps and Pig can still read the records, but they won't
>>> necessarily understand how the data is organized. I.e., it might appear
>>> unsorted. Perhaps HCatalog will allow other tools to exploit the structure,
>>> but I'm not sure.
>>>
>>> dean
>>>
>>>
>>> On Sat, Mar 30, 2013 at 5:44 PM, Sadananda Hegde <saduhegde@gmail.com> wrote:
>>>
>>>> Thanks, Dean.
>>>>
>>>> Does that mean this bucketing is exclusively a Hive feature and not
>>>> available to others like Java, Pig, etc.?
>>>>
>>>> And also, my final tables have to be managed tables; not external
>>>> tables, right?
>>>>
>>>> Thanks again for your time and help.
>>>>
>>>> Sadu
>>>>
>>>>
>>>>
>>>> On Fri, Mar 29, 2013 at 5:57 PM, Dean Wampler <
>>>> dean.wampler@thinkbiganalytics.com> wrote:
>>>>
>>>>> I don't know of any way to avoid creating new tables and moving the
>>>>> data. In fact, that's the official way to do it, from a temp table to the
>>>>> final table, so Hive can ensure the bucketing is done correctly:
>>>>>
>>>>>  https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>>>>
>>>>> In other words, you might have a big move now, but going forward,
>>>>> you'll want to stage your data in a temp table, use this procedure to put
>>>>> it in the final location, then delete the temp data.
>>>>>
>>>>> dean
>>>>>
>>>>> On Fri, Mar 29, 2013 at 4:58 PM, Sadananda Hegde <saduhegde@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We run M/R jobs to parse and process large and highly complex XML
>>>>>> files into Avro files. Then we build external Hive tables on top of the
>>>>>> parsed Avro files. The Hive tables are partitioned by day, but they are
>>>>>> still huge partitions and joins do not perform that well. So I would like
>>>>>> to try out creating buckets on the join key. How do I create the buckets
>>>>>> on the existing HDFS files? I would prefer to avoid creating another set
>>>>>> of tables (bucketed) and loading data from the non-bucketed tables into
>>>>>> the bucketed tables, if at all possible. Is it possible to do the
>>>>>> bucketing in Java as part of the M/R jobs while creating the Avro files?
>>>>>>
>>>>>> Any help / insight would be greatly appreciated.
>>>>>>
>>>>>> Thank you very much for your time and help.
>>>>>>
>>>>>> Sadu
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Dean Wampler, Ph.D.*
>>>>> thinkbiganalytics.com
>>>>> +1-312-339-1330
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> *Dean Wampler, Ph.D.*
>>> thinkbiganalytics.com
>>> +1-312-339-1330
>>>
>>>
>>
>
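
As a rough illustration of the bucketing idea discussed in this thread (all
records with a given key end up in the same bucket), here is a small Python
sketch. It assumes the bucket is chosen as the non-negative hash of the key
modulo the number of buckets, and it uses a re-implementation of Java's
String.hashCode as a stand-in hash. The hash Hive actually applies depends on
the column type and the Hive version, so treat the function names and the hash
choice here as assumptions for illustration only, not as Hive's implementation.

```python
from collections import defaultdict


def java_string_hash(s):
    """Stand-in hash: Java's String.hashCode as a signed 32-bit integer.

    Matches Java for characters in the Basic Multilingual Plane.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert from unsigned to signed 32-bit, as Java would report it.
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(key, num_buckets):
    """Assumed rule: clear the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_buckets


def bucket_rows(rows, key_field, num_buckets):
    """Group rows by bucket number; rows sharing a key share a bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_field], num_buckets)].append(row)
    return buckets


# Example rows with a hypothetical join key column named "key".
rows = [
    {"key": "A100", "value": 1},
    {"key": "B200", "value": 2},
    {"key": "A100", "value": 3},
    {"key": "C300", "value": 4},
]

for bucket_id, members in sorted(bucket_rows(rows, "key", 4).items()):
    print(bucket_id, members)
```

If two tables are bucketed on the same key with the same number of buckets
under the same rule, rows for a given key can only be found in the bucket with
the same number in each table, which is what makes a bucket-by-bucket map-side
join possible.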
