From: Raghu Angadi
Date: Wed, 19 Aug 2009 10:58:10 -0700
To: common-user@hadoop.apache.org
Subject: Re: Faster alternative to FSDataInputStream
Message-ID: <4A8C3D32.3010905@yahoo-inc.com>
References: <4A8B8D11.8020008@yahoo-inc.com>

Edward Capriolo wrote:
>> On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo wrote:
>>
>>>>> It would be as fast as the underlying filesystem goes.
>>>
>>> I would not agree with that statement. There is overhead.

You might be misinterpreting my comment. There is of course some overhead
(at the least the procedure calls). Depending on your underlying filesystem,
there could be extra buffer copies and CRC overhead. But none of that
explains a transfer as slow as 1 MBps (if my interpretation of the results
is correct).

Raghu.
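As a rough illustration of the distinction drawn above (fixed per-call and
checksum overhead versus raw HDFS throughput), one way to see where the time
goes is to time fs.open() separately from a plain read loop that uses a
larger buffer and does no printing per chunk. The sketch below is only that,
a sketch: the ReadTimer class, the method name, and the 64 KB buffer size
are assumptions made for the example, not anything prescribed in this thread.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadTimer {

        // Times open() and the read loop separately, so open latency and
        // read throughput can be told apart.
        public static void timeRead(FileSystem fs, String pathStr)
                throws IOException {
            Path path = new Path(pathStr);
            byte[] buf = new byte[64 * 1024];  // larger than the 1 KB buffer quoted below

            long t0 = System.currentTimeMillis();
            FSDataInputStream in = fs.open(path);
            long t1 = System.currentTimeMillis();

            long bytes = 0;
            int n;
            while ((n = in.read(buf)) >= 0) {
                bytes += n;                    // no printing inside the loop
            }
            in.close();
            long t2 = System.currentTimeMillis();

            System.out.println("open took " + (t1 - t0) + " ms; read "
                    + bytes + " bytes in " + (t2 - t1) + " ms");
        }
    }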
>>> In some testing I did writing a small file can take 30-300 ms. So if
>>> you have 9000 small files (like I did) and you are single threaded this
>>> takes a long time.
>>>
>>> If you orchestrate your task to use FSDataInput and FSDataOutput in the
>>> map or reduce phase, then each mapper or reducer is writing a file at a
>>> time. Now that is fast.
>>>
>>> Ananth, are you doing your r/w inside a map/reduce job or are you just
>>> using FS* in a top down program?
>>>
>>> On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi wrote:
>>>> Ananth T. Sarathy wrote:
>>>>> I am trying to download binary files stored in Hadoop but there is
>>>>> like a 2 minute wait on a 20mb file when I try to execute the
>>>>> in.read(buf).
>>>>
>>>> What does this mean: 2 min to pipe 20mb, or one of your in.read()
>>>> calls took 2 minutes? Your code actually measures time for read and
>>>> write. There is nothing in FSInputStream to cause this slow down. Do
>>>> you think anyone would use Hadoop otherwise? It would be as fast as
>>>> the underlying filesystem goes.
>>>>
>>>> Raghu.
>>>>
>>>>> is there a better way to be doing this?
>>>>>
>>>>>     private void pipe(InputStream in, OutputStream out) throws IOException
>>>>>     {
>>>>>         System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
>>>>>         byte[] buf = new byte[1024];
>>>>>         int read = 0;
>>>>>         while ((read = in.read(buf)) >= 0)
>>>>>         {
>>>>>             out.write(buf, 0, read);
>>>>>             System.out.println(System.currentTimeMillis() + " Piping Data");
>>>>>         }
>>>>>         out.flush();
>>>>>         System.out.println(System.currentTimeMillis() + " Finished Piping Data");
>>>>>     }
>>>>>
>>>>>     public void readFile(String fileToRead, OutputStream out)
>>>>>         throws IOException
>>>>>     {
>>>>>         System.out.println(System.currentTimeMillis() + " Start Read File");
>>>>>         Path inFile = new Path(fileToRead);
>>>>>         System.out.println(System.currentTimeMillis() + " Set Path");
>>>>>
>>>>>         // Validate the input path before reading.
>>>>>         if (!fs.exists(inFile))
>>>>>         {
>>>>>             throw new HadoopFileException("Specified file " + fileToRead
>>>>>                 + " not found.");
>>>>>         }
>>>>>         if (!fs.isFile(inFile))
>>>>>         {
>>>>>             throw new HadoopFileException("Specified file " + fileToRead
>>>>>                 + " is not a regular file.");
>>>>>         }
>>>>>
>>>>>         // Open inFile for reading.
>>>>>         System.out.println(System.currentTimeMillis() + " Opening Data Stream");
>>>>>         FSDataInputStream in = fs.open(inFile);
>>>>>         System.out.println(System.currentTimeMillis() + " Opened Data Stream");
>>>>>
>>>>>         // Read from input stream and write to output stream until EOF.
>>>>>         pipe(in, out);
>>>>>
>>>>>         // Close the streams when done.
>>>>>         out.close();
>>>>>         in.close();
>>>>>     }
>>>>>
>>>>> Ananth T Sarathy
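For anyone reading this thread in the archive with the same question ("is
there a better way to be doing this?"): a minimal sketch of the kind of
change the replies point toward is to drop the println call on every 1 KB
chunk and let Hadoop's IOUtils.copyBytes do the copy with a larger buffer.
The HdfsFileReader class name and the 64 KB buffer size are arbitrary
choices for the example, and the FileSystem handle is assumed to be already
configured; none of this is prescribed by the thread itself.

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsFileReader {

        private final FileSystem fs;   // assumed to be configured by the caller

        public HdfsFileReader(FileSystem fs) {
            this.fs = fs;
        }

        // Streams an HDFS file to 'out' in 64 KB chunks, with no logging
        // inside the copy loop.
        public void readFile(String fileToRead, OutputStream out)
                throws IOException {
            Path inFile = new Path(fileToRead);
            if (!fs.exists(inFile) || !fs.isFile(inFile)) {
                // The quoted code throws its own HadoopFileException here.
                throw new FileNotFoundException("Specified file " + fileToRead
                        + " not found.");
            }

            long start = System.currentTimeMillis();
            FSDataInputStream in = fs.open(inFile);
            try {
                // 64 KB per read; 'false' leaves closing the streams to us.
                IOUtils.copyBytes(in, out, 64 * 1024, false);
                out.flush();
            } finally {
                in.close();
            }
            System.out.println("Copied " + fileToRead + " in "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }

Whether this is actually faster than the quoted pipe() depends on where the
time is going; if the two minutes are spent inside in.read() itself, the
copy loop is not the bottleneck, which is the point of the questions above.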