hive-user mailing list archives

From Nitin Pawar <nitinpawar...@gmail.com>
Subject Re: Handling hierarchical data in Hive
Date Tue, 25 Mar 2014 17:29:59 GMT
Bucketing is certainly helpful when a column other than the partition
column has a finite number of values within each partition.
However, bucketing means that loading data into the table can't be a
straightforward LOAD DATA INPATH; you will need to load it through
Hive queries (which does not seem to be a problem, at least from the look of
it).

Bucket counts used to be powers of 2, like 2, 4, 8, 16, etc. Not sure if
that has changed now.
Also, when loading data into a bucketed table, it's advised that you run
set hive.enforce.bucketing = true;
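
A rough sketch of what such a load could look like. The table names
analyze_bucketed and analyze_staging are made up for illustration, and I am
assuming the staging table already exists over the raw data with the same
columns and a dt partition. The bucket count of 128 is an arbitrary choice.

```sql
-- Ask Hive to honor the table's bucket definition when writing.
SET hive.enforce.bucketing = true;

-- Hypothetical bucketed table: one partition per day, bucketed by generated_by.
CREATE TABLE analyze_bucketed (generated_by INT, other_representative_field INT)
PARTITIONED BY (dt STRING)
CLUSTERED BY (generated_by) INTO 128 BUCKETS;

-- Populate one day through a query instead of LOAD DATA INPATH.
-- analyze_staging is an assumed, pre-existing table over the raw files.
INSERT OVERWRITE TABLE analyze_bucketed PARTITION (dt = '20140320')
SELECT generated_by, other_representative_field
FROM analyze_staging
WHERE dt = '20140320';
```

Each day then adds a single partition rather than one partition per
generated_by value.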

I have rarely used indexing in Hive, but I do remember that Hive indexes used
to provide better data access for certain queries, and that the storage layout
helps improve search and lookup of the data.
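
If you want to try it, an index on generated_by might look roughly like
this. The index name and the table name analyze_by_day (assumed to be
partitioned by dt only) are hypothetical, and I have not measured whether
this helps for your data.

```sql
-- Hypothetical compact index on generated_by; the build is deferred
-- until the explicit REBUILD below.
CREATE INDEX analyze_generated_by_idx
ON TABLE analyze_by_day (generated_by)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Build (or refresh) the index data.
ALTER INDEX analyze_generated_by_idx ON analyze_by_day REBUILD;
```

The index has to be rebuilt after new data is added, so that cost needs to
be weighed against any query speedup.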

It would be really helpful if you could note down the performance you get after
fine-tuning the parameters.

On Tue, Mar 25, 2014 at 10:37 PM, Saumitra Shahapure (Vizury) <
saumitra.shahapure@vizury.com> wrote:

> Hi Nitin/Prasan,
>
> Thanks for your replies, I appreciate your help :)
>
> Clustering looks to be quite close to what we want. However, one main gap
> is that we need to fire a Hive query to populate the clusters. In our case, the
> clustered data is already there, so the computation in a Hive query would be
> redundant. If
>
> CREATE TABLE analyze (generated_by INT, other_representative_field INT)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (generated_by) INTO 100 BUCKETS;
>
> just accepted the S3 directory hierarchy that we have (as explained in the first
> mail), our problem would be solved.
>
> Another interesting solution seems to be creating a partition on the dt field
> and creating a Hive index/view on the *generated_by* field.
>
> If anyone has insights around these, they would be really helpful.
> Meanwhile we will try to solve our problem by buckets/indices.
>
>
> --
> Regards,
> Saumitra Shahapure
>
>
> On Tue, Mar 25, 2014 at 7:44 PM, Prasan Samtani <prasan.samtani@hulu.com> wrote:
>
>> Hi Saumitra,
>>
>> You might want to look into clustering within the partition. That is,
>> partition by "day", but cluster by "generated by" (within those
>> partitions), and see if that improves performance. Refer to the CLUSTER BY
>> command in the Hive language Manual.
>>
>> -Prasan
>>
>>
>> On Mar 25, 2014, at 4:26 AM, "Saumitra Shahapure (Vizury)" <
>> saumitra.shahapure@vizury.com> wrote:
>>
>> Hi Nitin,
>>
>> We are not facing the small files problem since the data is in S3. Also, we do
>> not want to merge files. Merging files and creating a large analyze table for,
>> say, one day would slow down queries fired on a specific day and *generated_by*.
>>
>> Let me explain my problem in other words.
>> Right now we are over-partitioning our table. Over-partitioning gives
>> us the benefit that a query on 1-2 partitions is very fast. Its side effect is
>> that if we try to query a large number of partitions, the query is too slow. Is
>> there a way to get good performance in both scenarios?
>>
>> --
>> Regards,
>> Saumitra Shahapure
>>
>>
>> On Tue, Mar 25, 2014 at 4:25 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>
>>> see if this is what you are looking for
>>> https://github.com/sskaje/hive_merge
>>>
>>>
>>>
>>>
>>> On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) <
>>> saumitra.shahapure@vizury.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are using Hive to query S3 data. For one of our tables named
>>>> analyze, we generate data hierarchically. First level of hierarchy is date
>>>> and second level is a field named *generated_by*. e.g. for 20 march we
>>>> may have S3 directories as
>>>> s3://analyze/20140320/111/
>>>> s3://analyze/20140320/222/
>>>> s3://analyze/20140320/333/
>>>> The files in each folder are typically small.
>>>>
>>>> Till now we have been using static partitioning so that queries on
>>>> specific date and *generated_by* would be faster.
>>>>
>>>> Now the problem is that the number of *generated_by* folders has increased
>>>> to 1000s. Every day we end up adding 1000s of partitions to Hive, so queries
>>>> on analyze over one month have slowed down.
>>>>
>>>> Is there any way to get rid of partitions, and at the same time
>>>> maintain good  performance of queries which are fired on specific day and
>>>> *generated_by*?
>>>> --
>>>> Regards,
>>>> Saumitra Shahapure
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>


-- 
Nitin Pawar
