hive-user mailing list archives

From Ashish Thusoo <athu...@facebook.com>
Subject RE: Performance of using map column in schema
Date Mon, 12 Oct 2009 20:48:17 GMT
One issue could be that the key names are stored for every entry in the map, which
increases the data size. A good compromise is to keep the common log fields as top-level
columns in Hive and add a catch-all map column for the rest.
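
A minimal sketch of that layout in HiveQL. The table name, column names, delimiters and
the 'utm_source' key are all hypothetical and chosen only for illustration; they are not
taken from this thread, and a real Apache log would likely need its own parsing step
before it fits a delimited layout like this.

```sql
-- Hypothetical table: frequently queried fields become typed top-level
-- columns, and everything else lands in a catch-all map.
CREATE TABLE apache_log (
  host         STRING,
  request_time STRING,
  request      STRING,
  status       INT,
  params       MAP<STRING, STRING>   -- catch-all for the remaining key=value pairs
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'          -- assumed field separator
  COLLECTION ITEMS TERMINATED BY '&' -- assumed separator between map entries
  MAP KEYS TERMINATED BY '=';        -- assumed separator between key and value

-- Ad-hoc query mixing a typed column with a map lookup.
-- A key that is absent from the map yields NULL.
SELECT host, params['utm_source'] AS utm_source
FROM apache_log
WHERE status = 200
  AND params['utm_source'] IS NOT NULL;
```

With this split, a newly added log parameter shows up in params without a schema change,
while the per-row cost of repeating key names applies only to the less common fields.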

Ashish

________________________________
From: Ryan LeCompte [mailto:lecompte@gmail.com]
Sent: Sunday, October 11, 2009 4:19 AM
To: hive-user@hadoop.apache.org
Subject: Performance of using map column in schema

Hello all,

I was wondering if there is any performance hit from using a map<string,string> column
in a Hive schema to represent a line of an Apache log. My issue is that new parameters are
frequently added to Apache log lines, and it would be nice not to have to explicitly define
a new typed column in the Hive table schema each time. If we could specify a single
map<string,string> column that held all of the key=value param pairs of the Apache log line,
then we could write ad-hoc queries that referenced whichever log params we wanted. However,
it seems that Hive wants typed columns for each parameter to perform well. Any thoughts?

Thanks,
Ryan

