hadoop-general mailing list archives

From Sam Baskinger <sam.baskin...@networkedinsights.com>
Subject Re: Hadoop and Small Files
Date Mon, 14 Sep 2009 20:19:00 GMT
Hey Jon,

I don't know how many seconds would constitute low latency for your application, but I would
guess that Hadoop simply will not cut it. I would recommend something closer to Grid-SQL.

If you absolutely must process all the files using MapReduce, perhaps you can group them
into files of roughly 1 GB each and process each one as a fraction of the larger problem? Just a thought.
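
One reading of the suggestion above is to pack the small files into bundles of roughly 1 GB and
hand each bundle to MapReduce as one unit of work. The sketch below only plans the grouping. It
is plain Python, not Hadoop code, and the name plan_bundles and the greedy strategy are
assumptions made for illustration.

```python
from typing import Dict, List


def plan_bundles(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Greedily group files so each bundle stays at or under target_bytes.

    Hypothetical helper, not part of Hadoop. A single file larger than
    target_bytes ends up in a bundle of its own.
    """
    bundles: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name in sorted(file_sizes):  # deterministic order
        size = file_sizes[name]
        # Close the current bundle if adding this file would overflow it.
        if current and current_size + size > target_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bundles.append(current)
    return bundles
```

With a target of 1 GB, plan_bundles(sizes, 1024 ** 3) would return lists of file names, each of
which could be concatenated into one larger input file.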

Sam


On 9/14/09 12:33 PM, "Jonathan Holloway" <jonathan.holloway@gmail.com> wrote:

Hi all,

I'm new to Hadoop and am currently evaluating it for a project with a few TB of data
that needs to be stored in a format suitable for MapReduce functions. The problem is that
I'm dealing with small text files (including metadata), from 10 KB up to a few MB in size,
that need to be stored in some format. The files need to be accessed randomly with very low
latency. I've been through the docs and previous posts on the mailing list, and looked at
the following options:

* HDFS - not suitable "as is" because of the 64MB block size
* HAR (Hadoop Archives) - not sure about random access to files within the archive
* Sequence Files - slow to convert into this format, can't randomly access the files
* CombineFileInputFormat - I assume you still can't access the files randomly
  https://issues.apache.org/jira/browse/HADOOP-4565
* MapFile - looks good... but not sure about latency
  http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
* HBase (or similar distributed key-value store) - not sure about latency; has this
  improved with the 0.20 release?

Please correct me if I'm wrong about the assumptions above.  Which is the most
appropriate option here?
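
To make the random-access question concrete, here is a minimal sketch of the general idea behind
a sorted, indexed container such as MapFile: many small records are written into one large data
file in key order, and an index maps each key to a byte range. This is plain Python and not the
Hadoop API; write_packed and PackedStore are hypothetical names, and the index here is kept
fully in memory, which is an assumption of the sketch.

```python
import bisect
import io
from typing import BinaryIO, List, Optional, Tuple

IndexEntry = Tuple[str, int, int]  # (key, offset, length)


def write_packed(records: List[Tuple[str, bytes]], data: BinaryIO) -> List[IndexEntry]:
    """Write payloads in key order and return a sorted index of byte ranges."""
    index: List[IndexEntry] = []
    for key, payload in sorted(records, key=lambda r: r[0]):
        offset = data.tell()
        data.write(payload)
        index.append((key, offset, len(payload)))
    return index


class PackedStore:
    """Random access by key: binary search the index, then one seek and read."""

    def __init__(self, data: BinaryIO, index: List[IndexEntry]) -> None:
        self._data = data
        self._keys = [key for key, _, _ in index]
        self._spans = [(offset, length) for _, offset, length in index]

    def get(self, key: str) -> Optional[bytes]:
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            return None  # key not present
        offset, length = self._spans[i]
        self._data.seek(offset)
        return self._data.read(length)


# Usage with an in-memory buffer standing in for one large file.
buf = io.BytesIO()
idx = write_packed([("b.txt", b"beta"), ("a.txt", b"alpha")], buf)
store = PackedStore(buf, idx)
```

Each lookup costs one index search plus one seek, so the latency in practice depends mostly on
where the data file lives and how quickly it can be read.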

Many thanks...
Jon.


Sam Baskinger
Software Engineer
Networked Insights, Inc. <http://www.networkedinsights.com/>


