hadoop-general mailing list archives

From Jonathan Holloway <jonathan.hollo...@gmail.com>
Subject Re: Hadoop and Small Files
Date Tue, 15 Sep 2009 05:21:37 GMT
Hi all...

Many thanks for your help and the responses. I'm currently investigating HBase
0.20 as a potential option...

2009/9/14 Amr Awadallah <aaa@cloudera.com>

> > The files need to be accessed randomly with very low latency
>
> Then use:
>
> * HBase (or similar distributed key-value store) - not sure about latency,
> has this improved with the 0.20 release?
>
> Yes, latency is significantly better with 0.20, see preso from hadoop
> summit on results:
>
>
> http://devblog.streamy.com/2009/07/24/streamy-hadoop-summit-hbase-goes-realtime/
>
> -- amr
>
>
> Jonathan Holloway wrote:
>
>> Hi all,
>>
>> I'm new to Hadoop and currently looking at it for a project where there are
>> a few TB of data that need to be stored in a format suitable for MapReduce
>> functions.  The problem is that I'm dealing with small text files (including
>> metadata) of 10 KB in size (and up to a few MB) that need to be stored in
>> some format.  The files need to be accessed randomly with very low latency.
>> I've been through the docs and previous posts on the mailing list, and
>> looked at the following options:
>>
>> * HDFS - not suitable "as is" because of the 64MB block size
>> * HAR (Hadoop Archives) - not sure about random access to files within
>>   format
>> * Sequence Files - slow to convert into this format, can't randomly access
>>   the files
>> * CombineFileInputFormat - assuming you still can't access the files
>>   randomly https://issues.apache.org/jira/browse/HADOOP-4565
>> * MapFile - looks good... but not sure about latency
>>   http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
>> * HBase (or similar distributed key-value store) - not sure about latency,
>>   has this improved with the 0.20 release?
>>
>> Please correct me if I'm wrong re: the assumptions above.  Which is the most
>> appropriate option here?
>>
>> Many thanks...
>> Jon.
>>
>>
>>
>
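
For anyone weighing the MapFile option listed above, here is a rough sketch of
the general idea behind it as I understand it: many small files packed into one
large file in sorted key order, plus a sparse index that is searched first so
that a lookup only has to scan a short run of records. This is plain Python,
not the Hadoop API. The record layout, the names pack and lookup, and the
index interval of 4 are all made up for illustration, so treat it as a sketch
under those assumptions rather than how MapFile is actually implemented.

```python
import bisect
import io
import struct

# Hypothetical setting for this sketch: index every 4th record.
INDEX_INTERVAL = 4


def pack(files, out, interval=INDEX_INTERVAL):
    """Write {name: bytes} to `out` in sorted key order.

    Returns a sparse index: a list of (name, offset) pairs covering
    every `interval`-th record.
    """
    index = []
    for i, name in enumerate(sorted(files)):
        key = name.encode("utf-8")
        value = files[name]
        if i % interval == 0:
            index.append((name, out.tell()))
        # Record layout (illustrative): key length, value length, key, value.
        out.write(struct.pack(">II", len(key), len(value)))
        out.write(key)
        out.write(value)
    return index


def lookup(name, data, index):
    """Return the stored bytes for `name`, or None if it is absent."""
    keys = [k for k, _ in index]
    # Find the last indexed key that is <= name.
    pos = bisect.bisect_right(keys, name) - 1
    if pos < 0:
        return None  # name sorts before every stored key
    data.seek(index[pos][1])
    while True:
        header = data.read(8)
        if len(header) < 8:
            return None  # reached end of data without a match
        klen, vlen = struct.unpack(">II", header)
        key = data.read(klen).decode("utf-8")
        if key == name:
            return data.read(vlen)
        if key > name:
            return None  # keys are sorted, so name cannot appear later
        data.seek(vlen, io.SEEK_CUR)  # skip this value and keep scanning
```

With ten small files and an interval of 4, the index holds three entries and a
lookup reads at most four records after one seek. Whether that is fast enough
for "very low latency" on a real cluster is a separate question that this
sketch does not answer.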


-- 
Design and Tech-noogly
Web: http://www.oogly.co.uk
Mail:  jonathan.holloway@oogly.co.uk
IM: jonathan_philip_holloway@hotmail.com
