hadoop-hdfs-user mailing list archives

From KayVajj <vajjalak...@gmail.com>
Subject Lots of files in a directory vs files in sub directories
Date Mon, 03 Jun 2013 04:11:41 GMT

I am trying to figure out a partitioning strategy for Hive. I'm considering
either monthly or daily partitions. My usage (querying etc.) points me
toward the daily partition scheme, but I'm not sure what HDFS or NameNode
limitations that would run into.

With daily partitions I would have one 3-4 GB file in each partition, so
over 2 years I might end up with 700-odd directories holding one file each.
With monthly partitions, by contrast, I would have 24 directories, each
holding 30 or 31 files of about 4 GB.
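
To put rough numbers on the two layouts, here is a minimal sketch that counts the namespace objects (directories, files, blocks) the NameNode would track in each case. The 128 MB block size and the ~150 bytes of NameNode heap per object are assumptions, not figures from this post: the first depends on the cluster's configured block size and the second is only a commonly quoted rule of thumb. The function name is made up for illustration.

```python
import math

FILE_SIZE_GB = 4        # upper end of the 3-4 GB per day mentioned above
BLOCK_SIZE_MB = 128     # ASSUMPTION: cluster block size; check your own config
BYTES_PER_OBJECT = 150  # ASSUMPTION: rough rule of thumb for NameNode heap
DAYS = 730              # two years, ignoring leap days
MONTHS = 24


def namespace_objects(directories, files):
    """Count the directories, files and blocks the NameNode has to track."""
    blocks_per_file = math.ceil(FILE_SIZE_GB * 1024 / BLOCK_SIZE_MB)
    blocks = files * blocks_per_file
    total = directories + files + blocks
    return {
        "directories": directories,
        "files": files,
        "blocks": blocks,
        "total": total,
        "approx_heap_mb": total * BYTES_PER_OBJECT / (1024 * 1024),
    }


# Both layouts store the same 730 daily files; only the directory count differs.
daily = namespace_objects(directories=DAYS, files=DAYS)
monthly = namespace_objects(directories=MONTHS, files=DAYS)

for name, layout in (("daily", daily), ("monthly", monthly)):
    print(name, layout)
```

Under these assumptions both layouts come to roughly 24,000 objects, and they differ only by the 706 extra directories in the daily scheme, because the file and block counts are identical.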

Most of my queries filter on a date range, so I think daily partitions
would be more efficient: a query would not have to scan all of the files
for the month, as it would with monthly partitions.
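
The scanning difference can be estimated with a small sketch like the one below. It assumes one 4 GB file per day and that a query reads every file in each partition its date range touches (i.e. partitions are pruned on the partition column, and nothing finer). The function names and the example dates are hypothetical.

```python
import calendar
from datetime import date, timedelta

FILE_SIZE_GB = 4  # one file of about 4 GB per day, as described above


def days_in_range(start, end):
    """Return every date from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def scanned_gb_daily(start, end):
    """Daily partitions: only the days inside the range are read."""
    return len(days_in_range(start, end)) * FILE_SIZE_GB


def scanned_gb_monthly(start, end):
    """Monthly partitions: every file of each touched month is read."""
    months = {(d.year, d.month) for d in days_in_range(start, end)}
    days_read = sum(calendar.monthrange(y, m)[1] for y, m in months)
    return days_read * FILE_SIZE_GB


# Hypothetical one-week query inside May 2013.
start, end = date(2013, 5, 10), date(2013, 5, 16)
print("daily partitions  :", scanned_gb_daily(start, end), "GB")
print("monthly partitions:", scanned_gb_monthly(start, end), "GB")
```

For this one-week example the daily layout reads 28 GB, while the monthly layout reads all 31 files for May, or 124 GB.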

I would like to know what other considerations I should think about before
making a decision:

1) NameNode / HDFS limitations
2) Archiving files
3) Compression

and maybe more.

I would really appreciate any input on this.

