hadoop-common-user mailing list archives

From: Raghu Angadi <rang...@yahoo-inc.com>
Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
Date: Thu, 12 Mar 2009 21:44:40 GMT
TCK wrote:
> How well does the read throughput from HDFS scale with the number of data nodes?
> For example, if I had a large file (say 10GB) on a 10 data node cluster, would
> the time taken to read this whole file in parallel (ie, with multiple reader
> client processes requesting different parts of the file in parallel) be halved
> if I had the same file on a 20 data node cluster?
It depends: yes, if whatever was the bottleneck with 10 nodes continues to be 
the bottleneck with 20 (i.e. you are able to saturate it in both cases) and 
that resource (disk or network) is what gets scaled.

> Is this not possible because HDFS doesn't support random seeks? 
HDFS does support random seeks for reading... your case should work.
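
For example, here is a minimal sketch of a positioned read against HDFS (the 
file path is made up, and error handling is omitted):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RangeRead {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // hypothetical 10GB file; each reader process picks its own offset
      FSDataInputStream in = fs.open(new Path("/data/big-10gb.log"));
      byte[] buf = new byte[64 * 1024];
      long offset = 5L * 1024 * 1024 * 1024;  // start reading 5 GB in
      // positioned read (pread): does not move the stream position,
      // so multiple threads can safely share one open stream
      int n = in.read(offset, buf, 0, buf.length);
      System.out.println("read " + n + " bytes at offset " + offset);
      in.close();
    }
  }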

Raghu.


> What about if the file was split up into multiple smaller files before
> placing in the HDFS?
> Thanks for your input.
> -TCK
> 
> --- On Wed, 2/4/09, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> From: Brian Bockelman <bbockelm@cse.unl.edu>
> Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
> To: core-user@hadoop.apache.org
> Date: Wednesday, February 4, 2009, 1:50 PM
> 
> Sounds overly complicated.  Complicated usually leads to mistakes :)
> 
> What about just having a single cluster and only running the tasktrackers on
> the fast CPUs?  No messy cross-cluster transferring.
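> 
One way to arrange that (a sketch; the host names are made up) is to start 
datanodes on every machine but tasktrackers only on the fast ones:

  bin/start-dfs.sh        # datanodes on every host listed in conf/slaves
  # tasktrackers only on the fast CPUs:
  for h in fast01 fast02 fast03; do
    ssh $h "$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker"
  done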
> 
> Brian
> 
> On Feb 4, 2009, at 12:46 PM, TCK wrote:
> 
>>
>> Thanks, Brian. This sounds encouraging for us.
>>
>> What are the advantages/disadvantages of keeping a persistent storage
>> (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster?
>> The advantage I can think of is that a permanent storage cluster has
>> different requirements from a map-reduce processing cluster -- the
>> permanent storage cluster would need faster, bigger hard disks, and would
>> need to grow as the total volume of all collected logs grows, whereas the
>> processing cluster would need fast CPUs and would only need to grow with
>> the rate of incoming data. So it seems to make sense to me to copy a piece
>> of data from the permanent storage cluster to the processing cluster only
>> when it needs to be processed. Is my line of thinking reasonable? How would
>> this compare to running the map-reduce processing on the same cluster the
>> data is stored in? Which approach is used by most people?
>> Best Regards,
>> TCK
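
If you did keep two clusters, the copy step would typically be distcp, along 
these lines (the namenode hosts, ports, and paths are made up):

  hadoop distcp hdfs://storage-nn:9000/logs/2009-02-04 \
                hdfs://compute-nn:9000/work/2009-02-04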
>>
>> --- On Wed, 2/4/09, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
>> From: Brian Bockelman <bbockelm@cse.unl.edu>
>> Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
>> To: core-user@hadoop.apache.org
>> Date: Wednesday, February 4, 2009, 1:06 PM
>>
>> Hey TCK,
>>
>> We use HDFS+FUSE solely as a storage solution for an application which
>> doesn't understand MapReduce.  We've scaled this solution to around
>> 80Gbps.  For 300 processes reading from the same file, we get about 20Gbps.
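>>
For reference, that kind of mount is done with the contrib/fuse-dfs wrapper, 
roughly like this (the namenode host/port and mount point are made up):

  ./fuse_dfs_wrapper.sh dfs://namenode.example.com:9000 /mnt/hdfs
  # ordinary POSIX readers can then open files under /mnt/hdfs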
>> Do consider your data retention policies -- I would say that Hadoop as a
>> storage system is thus far about 99% reliable for storage and is not a
>> backup solution.  If you're scared of losing more than 1% of your logs,
>> have a good backup solution.  I would also add that while you are learning
>> your operational staff's abilities, expect even more data loss.  As you
>> gain experience, data loss goes down.
>>
>> I don't believe we've lost a single block in the last month, but it took
>> us 2-3 months of 1%-level losses to get here.
>>
>> Brian
>>
>> On Feb 4, 2009, at 11:51 AM, TCK wrote:
>>
>>> Hey guys,
>>>
>>> We have been using Hadoop to do batch processing of logs. The logs get
>>> written and stored on a NAS. Our Hadoop cluster periodically copies a
>>> batch of new logs from the NAS, via NFS, into Hadoop's HDFS, processes
>>> them, and copies the output back to the NAS. The HDFS is cleaned up at
>>> the end of each batch (ie, everything in it is deleted).
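>>>
In outline, one such batch cycle looks like this (the paths and the jar/job 
names are made up):

  hadoop fs -copyFromLocal /mnt/nas/logs/batch-0042 /in/batch-0042   # NAS -> HDFS over NFS
  hadoop jar logproc.jar LogJob /in/batch-0042 /out/batch-0042       # process the batch
  hadoop fs -copyToLocal /out/batch-0042 /mnt/nas/out/batch-0042     # results back to NAS
  hadoop fs -rmr /in/batch-0042 /out/batch-0042                      # clean up HDFS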
>>> The problem is that reads off the NAS via NFS don't scale, even if we
>>> try to scale the copying process by adding more threads to read in
>>> parallel.
>>> If we instead stored the log files on an HDFS cluster (instead of the
>>> NAS), it seems like the reads would scale, since the data can be read
>>> from multiple data nodes at the same time without any contention (except
>>> network IO, which shouldn't be a problem).
>>> I would appreciate it if anyone could share any similar experience they
>>> have had with doing parallel reads from a storage HDFS.
>>> Also, is it a good idea to have a separate HDFS for storage vs. one for
>>> doing the batch processing?
>>> Best Regards,
>>> TCK

