hive-user mailing list archives

Subject small files with hive and hadoop
Date Mon, 31 Jan 2011 16:08:06 GMT

I would like to build reporting with Hive on something like tracking data.
The raw data, about 2 GB or more per day, is what I want to query with Hive. This already
works for me without problems.
I also want to aggregate the reporting data down to levels such as client and date, e.g. a
Hive table partitioned by (client String, date String).
That means I have multiple aggregation levels. I would like to do all levels in Hive so that
there is one consistent reporting source.
And here is the thing: might it be a problem if this results in many small files?
An aggregation level such as client/date might produce about 1000 files a day, each about
1 MB in size.
Is this a problem? I read about the "too many open files" problem with Hadoop. Could this
also lead to bad Hive/MapReduce performance?
Maybe someone has some pointers on this...
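
To put rough numbers on the file counts mentioned above, here is a minimal back-of-the-envelope sketch. It only multiplies out the approximate figures from this message (about 1000 files a day at about 1 MB each); the 365-day retention window and the function name are assumptions made for illustration, not measurements.

```python
# Rough estimate only: inputs are the approximate figures from the message.
FILES_PER_DAY = 1000   # client/date aggregation files written per day
FILE_SIZE_MB = 1       # approximate size of each file in MB
DAYS = 365             # assumed retention window (hypothetical)


def small_file_estimate(files_per_day, file_size_mb, days):
    """Return (total number of files, total size in MB) after `days` days."""
    total_files = files_per_day * days
    total_mb = total_files * file_size_mb
    return total_files, total_mb


total_files, total_mb = small_file_estimate(FILES_PER_DAY, FILE_SIZE_MB, DAYS)
print(f"{total_files} files, about {total_mb / 1024:.1f} GB after {DAYS} days")
```

So the data volume at this level stays modest, while the number of files grows by about 1000 every day.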

Thanks in advance
