hadoop-common-user mailing list archives

From "Ted Dunning" <tdunn...@veoh.com>
Subject RE: Use HDFS as a long term storage solution?
Date Fri, 07 Sep 2007 02:26:03 GMT

That size and number of files is fine for storage in HDFS.  Your situation is comparable to
mine, except that I collect 2-5 files per hour and these files are slightly larger.  My files
are in a compressed and encrypted format, but keeping them in a block-compressed SequenceFile
format would make map/reduce noticeably more efficient.  This is because the file splits can
be arranged by the JobTracker to coincide with the disk blocks in HDFS.  That can result in a
significantly higher percentage of tasks working against local inputs.

Note that at these accumulation rates, you are really only talking about < 100K files over
a year.  That still counts as a small number of files.  A large number of files is 10M or more.

You might also be well served if you were to keep your data in a (block-compressed) tab-delimited
form, even at the cost of some grotesqueness.  That storage format would allow you to use Pig.
Pig is, unfortunately, still quite limited in that input data must be fielded and in a simple format.

-----Original Message-----
From: C G [mailto:parallelguy@yahoo.com]
Sent: Thu 9/6/2007 6:30 PM
To: hadoop-user@lucene.apache.org
Subject: RE: Use HDFS as a long term storage solution?
Right, my preference would be to use HDFS exclusively...except that there are potential issues
with many small files in HDFS, and a suggestion that perhaps MogileFS might be better for many
small files.  My strong preference is to store everything in HDFS, then do map/reduce with
the small files to produce results.  Since there is a concern about storing a lot of small
files in HDFS, I now wonder if I should collect small files into MogileFS, then periodically
merge them together to create large files, store those in HDFS, and then issue my
map/reduces.  Ick, that sounds complex/time-consuming just writing about it :-(.
  The files I anticipate processing are all compressed (gzip), and are on the order of 80-200M
compressed.  I expect to collect 4-8 of these files per hour for most hours in the day.
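[Editor's note: since the files described above are plain gzip, the "merge small files into large ones" step has a convenient shortcut: gzip files can be concatenated byte-for-byte into a valid multi-member gzip stream (per RFC 1952), with no recompression required.  A minimal sketch of that merge step, with hypothetical file paths:]

```python
def merge_gzip_files(parts, merged_path):
    # Concatenating gzip files byte-for-byte yields a valid
    # multi-member gzip stream, so merging many small .gz files
    # into one large HDFS-friendly file needs no recompression.
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                out.write(f.read())
```

Standard gzip readers (and Hadoop's gzip codec) decompress all members of the merged stream in sequence, so downstream map/reduce jobs see the concatenated contents.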
