hadoop-general mailing list archives

From Amr Awadallah <...@cloudera.com>
Subject Re: Hadoop and Small Files
Date Tue, 15 Sep 2009 05:13:42 GMT
 > The files need to be accessed randomly with very low latency

Then use:

* HBase (or similar distributed key-value store) - not sure about latency,
has this improved with the 0.20 release?

Yes, latency is significantly better with 0.20; see the presentation from 
the Hadoop Summit for the results:

http://devblog.streamy.com/2009/07/24/streamy-hadoop-summit-hbase-goes-realtime/
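
To give a rough sense of the access pattern, here is a minimal sketch of 
storing and fetching one small file as a single cell via the 0.20 client 
API. The table name "files", column family "content", and row key are 
placeholders for illustration only; the table would need to be created up 
front (e.g. create 'files', 'content' from the HBase shell):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallFileStoreSketch {
  public static void main(String[] args) throws Exception {
    // Assumes an existing table "files" with one column family "content".
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "files");

    // Write: one row per file, the raw file bytes in a single cell.
    byte[] fileBytes = Bytes.toBytes("...file contents go here...");
    Put put = new Put(Bytes.toBytes("doc-00001"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), fileBytes);
    table.put(put);

    // Random read by row key: a single Get, no MapReduce job involved.
    Get get = new Get(Bytes.toBytes("doc-00001"));
    Result result = table.get(get);
    byte[] stored = result.getValue(Bytes.toBytes("content"),
                                    Bytes.toBytes("data"));
    System.out.println("read " + stored.length + " bytes");
  }
}

With files in the 10 KB to a few MB range, one cell per file keeps a random 
read to a single Get; very large values per cell are where HBase starts to 
struggle.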

-- amr

Jonathan Holloway wrote:
> Hi all,
>
> I'm new to Hadoop and am currently looking at it for a project where a few
> TB of data need to be stored in a format suitable for MapReduce jobs.  The
> problem is that I'm dealing with small text files (including metadata),
> around 10 KB in size (and up to a few MB), that need to be stored in some
> format.  The files need to be accessed randomly with very low latency.  I've
> been through the docs and previous posts on the mailing list, and looked at
> the following options:
>
> * HDFS - not suitable "as is" because of the 64MB block size
> * HAR (Hadoop Archives) - not sure about random access to files within
> format
> * Sequence Files - slow to convert into this format, can't randomly access
> the files
> * CombineFileInputFormat - assuming you still can't access the files
> randomly https://issues.apache.org/jira/browse/HADOOP-4565
> * MapFile - looks good... but not sure about latency
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
> * HBase (or similar distributed key-value store) - not sure about latency,
> has this improved with the 0.20 release?
>
> Please correct me if I'm wrong re: the assumptions above.  Which is the most
> appropriate option here?
>
> Many thanks...
> Jon.
>
