hive-user mailing list archives

From Ryan LeCompte <lecom...@gmail.com>
Subject Re: Performance of using map column in schema
Date Tue, 13 Oct 2009 07:29:38 GMT
Thanks Zheng! I was able to get this up and running, and it has been working
out great so far.

On Tue, Oct 13, 2009 at 12:06 AM, Zheng Shao <zshao9@gmail.com> wrote:

> Hi Ryan,
>
> Here is a list of commands to get you started along this route:
>
> CREATE TABLE apache_log (
>   a STRING,
>   b STRING,
>   c STRING,
>   extra MAP<STRING,STRING>
> ) ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY ' '
> MAP KEYS TERMINATED BY '=';
>
> LOAD DATA LOCAL INPATH 'myapache.log' OVERWRITE INTO TABLE apache_log;
>
> SELECT a, b, c, extra['key1'], extra['key2'] FROM apache_log;
>
>
> Zheng
>
>
> On Mon, Oct 12, 2009 at 1:48 PM, Ashish Thusoo <athusoo@facebook.com> wrote:
>
>> One issue could be that the key names are stored for every entry in the
>> map, which would increase the data size. A good compromise is to have the
>> common fields in the log as top-level columns in Hive and then have a
>> catch-all map for the rest.
>>
>> Ashish
>>
>>  ------------------------------
>> *From:* Ryan LeCompte [mailto:lecompte@gmail.com]
>> *Sent:* Sunday, October 11, 2009 4:19 AM
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Performance of using map column in schema
>>
>> Hello all,
>>
>> I was wondering if there are any performance hits in using a
>> map<string,string> column in a Hive schema to represent a line of an apache
>> log. My issue is that frequently new parameters are added to apache log
>> lines, and it would be nice to not have to always explicitly define these
>> new typed columns in the Hive schema table. If we could specify a single
>> column of map<string,string> that represented all of the param key=value
>> pairs of the apache log line, then we could write ad-hoc queries that
>> referenced whichever log params we wanted. However, it seems that Hive wants
>> typed columns for each parameter to perform well. Any thoughts?
>>
>> Thanks,
>> Ryan
>>
>>
>
>
> --
> Yours,
> Zheng
>
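
For reference, Ashish's compromise and Zheng's row format can be combined
roughly as follows. This is only a sketch: the table name, the column names
(host, request_time, status) and the map keys (ref, sess) are made up for
illustration, and only the delimiters are taken from Zheng's example.

```sql
-- Hypothetical names throughout; adjust to the fields your logs actually contain.
-- Common fields become top-level columns; everything else lands in the map.
CREATE TABLE apache_log_hybrid (
  host STRING,
  request_time STRING,
  status STRING,
  extra MAP<STRING,STRING>
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ' '
MAP KEYS TERMINATED BY '=';

-- With these delimiters, an input line would be expected to look like this
-- (<TAB> stands for a literal tab character):
-- 10.0.0.1<TAB>2009-10-11T04:19:00<TAB>200<TAB>ref=home sess=abc123

-- Ad-hoc query mixing top-level columns with map lookups.
-- A key that is absent from a row's map comes back as NULL.
SELECT host, status, extra['ref']
FROM apache_log_hybrid
WHERE status = '200' AND extra['sess'] IS NOT NULL;
```

New log parameters then show up as additional keys in extra without a schema
change, while the frequently queried fields avoid the per-row cost of storing
key names.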
