Subject: Re: Why single thread for HDFS?
From: elton sky <eltonsky9404@gmail.com>
To: general@hadoop.apache.org
Date: Wed, 7 Jul 2010 12:10:10 +1000
In-Reply-To: <4C334C20.9010105@apache.org>

Steve,

It seems HP has already done block-based parallel reading from different
datanodes. Though they don't go down to the disk level, they achieve a
4 Gb/s rate with 9 readers (500 Mb/s each). I didn't see anywhere I can
download their code to play around with, a pity.

BTW, can we specify which disk to read from with Java?
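As far as I can tell, from plain Java the only handle you get is a path;
which physical disk that hits depends on where the path is mounted. So if
every disk has its own mount point (the way dfs.data.dir is usually laid
out on a datanode), picking the directory is as close as you can get. A toy
sketch, with made-up mount points and file names:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class ReadFromOneMount {
        public static void main(String[] args) throws IOException {
            // Made-up layout: each physical disk mounted at its own
            // directory, e.g. /data/disk0 .. /data/disk11, the way
            // dfs.data.dir is often configured on a datanode.
            File f = new File("/data/disk3/scratch/somefile");  // made-up path

            // All Java sees is the path; the OS maps it to whatever device
            // backs that mount point. There is no "read from disk 3" API.
            FileInputStream in = new FileInputStream(f);
            byte[] buf = new byte[64 * 1024];
            int n;
            long total = 0;
            while ((n = in.read(buf)) != -1) {
                total += n;               // just count bytes in this toy
            }
            in.close();
            System.out.println("read " + total + " bytes from " + f);
        }
    }

Or is there something lower level that I'm missing?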

On Wed, Jul 7, 2010 at 1:30 AM, Steve Loughran wrote:

> Michael Segel wrote:
>
>> Uhm...
>>
>> That's not really true. It gets a bit more complicated than that.
>>
>> If you're talking about M/R jobs, you don't want to do threads in your
>> map() routine. While it is possible, it's going to be really hard to
>> justify the extra parallelism along with the need to wait for all of the
>> threads to complete before you can end the map() method.
>>
>> If you're talking about a way to copy files from one cluster to another
>> in Hadoop, you can find out the list of blocks that make up the file. As
>> long as the file is static, meaning no one is writing/splitting/compacting
>> the file, you could copy it. Here being multi-threaded could work: you'd
>> have one thread per block that reads from one machine and then writes
>> directly to the other. Of course you'll need to figure out where to write
>> the block, or rather tie into HDFS.
>
> There's a paper by Russ Perry on using HDFS as a filestore for raster
> processing, where he modified DfsClient to get all the locations of a
> file and let the caller decide where to read blocks from.
>
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html
>
> The advantage of this is that the caller can do the striping across
> machines, keeping every server busy by asking for files from each of
> them. Of course, this ignores the trend to many-HDD servers; DfsClient
> can't currently see which physical disk a file is on, which you'd need
> if the client wanted to keep every disk on every server busy during a
> big read.
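
Block locations are at least visible from the public API already, so a
client can do its own striping at block granularity. It isn't the
per-replica control Perry added (the stock client still decides which
replica each read goes to), but it gives block-level parallelism. A rough,
untested sketch of what I mean, one thread per block, with a made-up path;
reading whole blocks into byte arrays is just to keep the example short:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelBlockRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            final FileSystem fs = FileSystem.get(conf);
            final Path path = new Path("/user/elton/bigfile");  // made-up path

            FileStatus status = fs.getFileStatus(path);
            // One BlockLocation per block, including the datanodes holding it.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            // One thread per block -- fine for a toy example.
            ExecutorService pool = Executors.newFixedThreadPool(blocks.length);
            for (final BlockLocation block : blocks) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // Separate stream per thread; positional reads
                            // don't share a seek pointer.
                            FSDataInputStream in = fs.open(path);
                            byte[] buf = new byte[(int) block.getLength()];
                            in.readFully(block.getOffset(), buf, 0, buf.length);
                            in.close();
                            System.out.println("read block at " + block.getOffset()
                                + " hosted on " + Arrays.toString(block.getHosts()));
                            // ... hand buf to whoever consumes it, or write it
                            // out to the other cluster ...
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }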