hadoop-general mailing list archives

From Bardia Afshin <brandon...@gmail.com>
Subject Re: Why single thread for HDFS?
Date Mon, 05 Jul 2010 05:47:02 GMT
What's the unsubscribe link?

Sent from my iPhone

On Jul 2, 2010, at 8:24 AM, Jay Booth <jaybooth@gmail.com> wrote:

> Yeah, a good way to think of it is that parallelism is achieved at the
> application level.
>
> On the input side, you can process multiple files in parallel, or one
> file in parallel by logically splitting it and opening multiple readers
> of the same file at multiple points.  Each of these readers is
> single-threaded, because, well, you're returning a stream of bytes in
> order.  It's inherently serial.
>
> On the reduce side, multiple reduces run, writing to multiple files in
> the same directory.  Again, you can't really write to a single file in
> parallel effectively -- you can't write byte 26 before byte 25,
> because the file's not that long yet.
>
> Theoretically, maybe you could have all reduces write to the same file
> by allocating some amount of space ahead of time and writing to the
> blocks in parallel - in practice, you very rarely know how big your
> output is going to be before it's produced, so this doesn't really
> work.  Multiple files in the same directory achieve the same goal
> much more elegantly, without exposing a bunch of internal details of
> the filesystem to user space.
>
> Does that make sense?
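Jay's input-side point -- one file read in parallel by opening several single-threaded readers over logical splits -- can be sketched in plain Java. This is a hypothetical stand-in, not Hadoop code: a byte array plays the role of an HDFS file, and `SplitReader`, `splits`, and `parallelRead` are illustrative names, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: read one "file" (a byte array standing in for an HDFS file) in
// parallel by giving each worker thread its own logical split. Each reader
// is single-threaded over its split, mirroring how MapReduce opens one
// stream per split rather than parallelizing a single stream.
public class SplitReader {

    // Compute {start, length} pairs so the splits cover the whole file.
    static long[][] splits(long fileLen, long splitSize) {
        List<long[]> out = new ArrayList<long[]>();
        for (long off = 0; off < fileLen; off += splitSize) {
            out.add(new long[] { off, Math.min(splitSize, fileLen - off) });
        }
        return out.toArray(new long[0][]);
    }

    // Read every split in its own thread; each thread touches only its range.
    static byte[] parallelRead(byte[] file, long splitSize) {
        byte[] result = new byte[file.length];
        long[][] parts = splits(file.length, splitSize);
        Thread[] workers = new Thread[parts.length];
        for (int i = 0; i < parts.length; i++) {
            final int start = (int) parts[i][0];
            final int len = (int) parts[i][1];
            workers[i] = new Thread(() ->
                System.arraycopy(file, start, result, start, len));
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "hello parallel hdfs readers".getBytes();
        byte[] copy = parallelRead(data, 8); // 8-byte "blocks"
        System.out.println(new String(copy)); // prints "hello parallel hdfs readers"
    }
}
```

Within each split the bytes still arrive in order -- the parallelism is across splits, which is exactly the application-level parallelism described above.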
> On Fri, Jul 2, 2010 at 9:26 AM, Segel, Mike <msegel@navteq.com> wrote:
>> Actually they also listen here, and this is a basic question...
>>
>> I'm not an expert, but how does having multiple threads really help
>> this problem?  I'm assuming you're talking about a map/reduce job and
>> not some specific client code being run on a client outside of the
>> cloud/cluster....
>>
>> I wasn't aware that you could easily synchronize threads running on
>> different JVMs. ;-)
>>
>> Your parallelism comes from multiple tasks running on different
>> nodes within the cloud.  By default you get one map task per
>> block.  You can write your own splitter to increase this and then
>> get more parallelism.
>>
>> HTH
>> -Mike
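Mike's "one map task per block, more with a custom splitter" comes down to simple arithmetic. A small hypothetical sketch (`numSplits` is an illustrative helper, not a Hadoop API; 64 MB is assumed as the era's default block size):

```java
// Sketch of the arithmetic behind "one map task per block": the number of
// input splits (hence map tasks) for a file is roughly
// ceil(fileLength / splitSize), where splitSize defaults to the HDFS block
// size. Shrinking the split size (e.g. via a custom InputFormat) yields
// more splits and therefore more parallel map tasks.
public class SplitCount {

    // Ceiling division: how many splits cover a file of the given length.
    static long numSplits(long fileLen, long splitSize) {
        return (fileLen + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;                  // assumed 64 MB block
        long file = 200L * 1024 * 1024;                  // a 200 MB file
        System.out.println(numSplits(file, block));      // default: prints 4
        System.out.println(numSplits(file, block / 4));  // finer splits: prints 13
    }
}
```

More splits only help up to a point: each one costs task-startup overhead, which is why the default ties splits to blocks.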
>> -----Original Message-----
>> From: Hemanth Yamijala [mailto:yhemanth@gmail.com]
>> Sent: Friday, July 02, 2010 2:56 AM
>> To: general@hadoop.apache.org
>> Subject: Re: Why single thread for HDFS?
>>
>> Hi,
>>
>> Can you please post this on hdfs-dev@hadoop.apache.org ?  I suspect the
>> most qualified people to answer this question would all be on that
>> list.
>>
>> Hemanth
>> On Fri, Jul 2, 2010 at 11:43 AM, elton sky <eltonsky9404@gmail.com> wrote:
>>> I guess this question was ignored, so I just post it again.
>>>
>>> From my understanding, HDFS uses a single thread to do reads and
>>> writes.  Since a file is composed of many blocks, and each block is
>>> stored as a file in the underlying FS, we could do some parallelism
>>> on a per-block basis.  When reading across multiple blocks, threads
>>> could be used to read all the blocks.  When writing, we could
>>> calculate the offset of each block and write to all of them
>>> simultaneously.
>>>
>>> Is this right?
