From: Raghu Angadi
Date: Wed, 19 Aug 2009 10:58:10 -0700
To: common-user@hadoop.apache.org
Subject: Re: Faster alternative to FSDataInputStream
Message-ID: <4A8C3D32.3010905@yahoo-inc.com>
References: <4A8B8D11.8020008@yahoo-inc.com>

Edward Capriolo wrote:
>> On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo wrote:
>>
>>>>> It would be as fast as the underlying filesystem goes.
>>>
>>> I would not agree with that statement. There is overhead.

You might be misinterpreting my comment. There is of course some overhead
(at the least the procedure calls). Depending on your underlying filesystem,
there could be extra buffer copies and CRC overhead. But none of that
explains a transfer as slow as 1 MBps (if my interpretation of the results
is correct).

Raghu.
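As a rough illustration of the distinction drawn above (fixed per-call and
checksum overhead versus raw HDFS throughput), one way to see where the time
goes is to time fs.open() separately from a plain read loop that uses a
larger buffer and does no printing per chunk. The sketch below is only that,
a sketch: the ReadTimer class, the method name, and the 64 KB buffer size
are assumptions made for the example, not anything prescribed in this thread.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadTimer {

        // Times open() and the read loop separately, so open latency and
        // read throughput can be told apart.
        public static void timeRead(FileSystem fs, String pathStr)
                throws IOException {
            Path path = new Path(pathStr);
            byte[] buf = new byte[64 * 1024];  // larger than the 1 KB buffer quoted below

            long t0 = System.currentTimeMillis();
            FSDataInputStream in = fs.open(path);
            long t1 = System.currentTimeMillis();

            long bytes = 0;
            int n;
            while ((n = in.read(buf)) >= 0) {
                bytes += n;                    // no printing inside the loop
            }
            in.close();
            long t2 = System.currentTimeMillis();

            System.out.println("open took " + (t1 - t0) + " ms; read "
                    + bytes + " bytes in " + (t2 - t1) + " ms");
        }
    }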
>>> In some testing I did writing a small file can take 30-300 ms. So if
>>> you have 9000 small files (like I did) and you are single threaded this
>>> takes a long time.
>>>
>>> If you orchestrate your task to use FSDataInput and FSDataOutput in the
>>> map or reduce phase, then each mapper or reducer is writing a file at a
>>> time. Now that is fast.
>>>
>>> Ananth, are you doing your r/w inside a map/reduce job or are you just
>>> using FS* in a top down program?
>>>
>>> On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi wrote:
>>>> Ananth T. Sarathy wrote:
>>>>> I am trying to download binary files stored in Hadoop but there is
>>>>> like a 2 minute wait on a 20mb file when I try to execute the
>>>>> in.read(buf).
>>>>
>>>> What does this mean: 2 min to pipe 20mb, or one of your in.read()
>>>> calls took 2 minutes? Your code actually measures time for read and
>>>> write. There is nothing in FSInputStream to cause this slow down. Do
>>>> you think anyone would use Hadoop otherwise? It would be as fast as
>>>> the underlying filesystem goes.
>>>>
>>>> Raghu.
>>>>
>>>>> is there a better way to be doing this?
>>>>>
>>>>>     private void pipe(InputStream in, OutputStream out) throws IOException
>>>>>     {
>>>>>         System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
>>>>>         byte[] buf = new byte[1024];
>>>>>         int read = 0;
>>>>>         while ((read = in.read(buf)) >= 0)
>>>>>         {
>>>>>             out.write(buf, 0, read);
>>>>>             System.out.println(System.currentTimeMillis() + " Piping Data");
>>>>>         }
>>>>>         out.flush();
>>>>>         System.out.println(System.currentTimeMillis() + " Finished Piping Data");
>>>>>     }
>>>>>
>>>>>     public void readFile(String fileToRead, OutputStream out)
>>>>>         throws IOException
>>>>>     {
>>>>>         System.out.println(System.currentTimeMillis() + " Start Read File");
>>>>>         Path inFile = new Path(fileToRead);
>>>>>         System.out.println(System.currentTimeMillis() + " Set Path");
>>>>>
>>>>>         // Validate the input path before reading.
>>>>>         if (!fs.exists(inFile))
>>>>>         {
>>>>>             throw new HadoopFileException("Specified file " + fileToRead
>>>>>                 + " not found.");
>>>>>         }
>>>>>         if (!fs.isFile(inFile))
>>>>>         {
>>>>>             throw new HadoopFileException("Specified file " + fileToRead
>>>>>                 + " is not a regular file.");
>>>>>         }
>>>>>
>>>>>         // Open inFile for reading.
>>>>>         System.out.println(System.currentTimeMillis() + " Opening Data Stream");
>>>>>         FSDataInputStream in = fs.open(inFile);
>>>>>         System.out.println(System.currentTimeMillis() + " Opened Data Stream");
>>>>>
>>>>>         // Read from input stream and write to output stream until EOF.
>>>>>         pipe(in, out);
>>>>>
>>>>>         // Close the streams when done.
>>>>>         out.close();
>>>>>         in.close();
>>>>>     }
>>>>>
>>>>> Ananth T Sarathy
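For anyone reading this thread in the archive with the same question ("is
there a better way to be doing this?"): a minimal sketch of the kind of
change the replies point toward is to drop the println call on every 1 KB
chunk and let Hadoop's IOUtils.copyBytes do the copy with a larger buffer.
The HdfsFileReader class name and the 64 KB buffer size are arbitrary
choices for the example, and the FileSystem handle is assumed to be already
configured; none of this is prescribed by the thread itself.

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsFileReader {

        private final FileSystem fs;   // assumed to be configured by the caller

        public HdfsFileReader(FileSystem fs) {
            this.fs = fs;
        }

        // Streams an HDFS file to 'out' in 64 KB chunks, with no logging
        // inside the copy loop.
        public void readFile(String fileToRead, OutputStream out)
                throws IOException {
            Path inFile = new Path(fileToRead);
            if (!fs.exists(inFile) || !fs.isFile(inFile)) {
                // The quoted code throws its own HadoopFileException here.
                throw new FileNotFoundException("Specified file " + fileToRead
                        + " not found.");
            }

            long start = System.currentTimeMillis();
            FSDataInputStream in = fs.open(inFile);
            try {
                // 64 KB per read; 'false' leaves closing the streams to us.
                IOUtils.copyBytes(in, out, 64 * 1024, false);
                out.flush();
            } finally {
                in.close();
            }
            System.out.println("Copied " + fileToRead + " in "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }

Whether this is actually faster than the quoted pipe() depends on where the
time is going; if the two minutes are spent inside in.read() itself, the
copy loop is not the bottleneck, which is the point of the questions above.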