hadoop-general mailing list archives

From Jonathan Holloway <jonathan.hollo...@gmail.com>
Subject Hadoop and Small Files
Date Mon, 14 Sep 2009 17:33:36 GMT
Hi all,

I'm new to Hadoop and am currently evaluating it for a project with around a
few TB of data that needs to be stored in a format suitable for MapReduce
jobs.  The problem is that the data consists of small text files (including
metadata), ranging from about 10 KB up to a few MB, and the files need to be
accessed randomly with very low latency.  I've been through the docs and
previous posts on the mailing list, and looked at the following options:

* HDFS - not suitable "as is": the default 64 MB block size is a poor fit
for 10 KB files, and each small file adds per-file metadata overhead on the
NameNode
* HAR (Hadoop Archives) - not sure whether files inside an archive can be
accessed randomly
* Sequence Files - slow to convert into this format, and individual files
can't be accessed randomly by key
* CombineFileInputFormat - assuming the files still can't be accessed
randomly (https://issues.apache.org/jira/browse/HADOOP-4565)
* MapFile - looks good... but not sure about latency (see the first sketch
after this list):
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
* HBase (or a similar distributed key-value store) - not sure about
latency; has this improved with the 0.20 release? (second sketch below)
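
For reference, here's a minimal sketch of how I imagine MapFile would be
used for this (Java, 0.20-era API).  The directory path and key scheme are
made up for illustration - the idea is one sorted key per small file, with
the file's bytes as the value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "/data/files.map";  // hypothetical HDFS directory

    // Write: MapFile requires keys to be appended in sorted order,
    // so the small files would have to be sorted by key first.
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, Text.class, BytesWritable.class);
    byte[] contents = "example file contents".getBytes("UTF-8");
    writer.append(new Text("doc-00001"), new BytesWritable(contents));
    writer.close();

    // Random read: the index (loaded into memory) narrows the search,
    // then a seek into the data file finds the exact key.
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    BytesWritable value = new BytesWritable();
    if (reader.get(new Text("doc-00001"), value) != null) {
      System.out.println("read " + value.getLength() + " bytes");
    }
    reader.close();
  }
}

If I understand it right, each get() costs an in-memory binary search over
the index plus roughly one disk seek, so latency is my main question rather
than whether random access works at all.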
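
And the equivalent lookup against HBase, as I understand the new 0.20
client API (the table, column family, and row names are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "files");  // hypothetical table

    // Store one small file as a single cell in the "content" family.
    Put put = new Put(Bytes.toBytes("doc-00001"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"),
            Bytes.toBytes("example file contents"));
    table.put(put);

    // Random read by row key.
    Get get = new Get(Bytes.toBytes("doc-00001"));
    Result result = table.get(get);
    byte[] bytes = result.getValue(Bytes.toBytes("content"),
                                   Bytes.toBytes("raw"));
    System.out.println("read " + bytes.length + " bytes");
  }
}

The appeal here would be that HBase handles the sorting, splitting, and
caching itself; the open question for me is still read latency under load.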

Please correct me if I'm wrong about the assumptions above.  Which of these
is the most appropriate option here?

Many thanks...
Jon.
