Mailing-List: contact hdfs-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-dev@hadoop.apache.org
MIME-Version: 1.0
Date: Tue, 22 Dec 2015 13:39:34 -0500
Message-ID: 
 <CAPCi2CmuTVZANTKxk=UMPSaJAWuHZO7JjXEL7HUga2KHR2WkEw@mail.gmail.com>
Subject: Re: Revive HADOOP-2705?
From: "dam6923 ." <dam6923@gmail.com>
To: hdfs-dev@hadoop.apache.org
Content-Type: text/plain; charset=UTF-8

Colin,

I will continue my investigation into the matter.  Thanks.
I will just point out that
org.apache.hadoop.hdfs.server.datanode.BlockSender overrides the
configured buffer size with a 64KB value, if necessary (line 116).
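
To illustrate the kind of override I mean, here is a minimal sketch.
The class and method names are hypothetical and this is not the actual
BlockSender code; it only assumes the behavior described above, i.e.
that a configured size below 64KB is replaced by 64KB.

```java
public class TransferBufferSketch {
    // Assumed floor, taken from the 64KB value mentioned above.
    static final int MIN_TRANSFER_BUFFER = 64 * 1024;

    // Hypothetical helper: returns the configured size unless it is
    // below the floor, in which case the floor is used instead.
    public static int effectiveBufferSize(int configuredSize) {
        return Math.max(configuredSize, MIN_TRANSFER_BUFFER);
    }

    public static void main(String[] args) {
        System.out.println(effectiveBufferSize(4096));
        System.out.println(effectiveBufferSize(128 * 1024));
    }
}
```

Under that assumption, any configured value up to 64KB has no effect on
this code path.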

---------------------

On a side note, can you explain the purpose of:

org.apache.hadoop.hdfs.DFSUtilClient.getSmallBufferSize(Configuration)

This method seems to be an undocumented "feature" that overrides the
user's configuration without explaining why.  In most cases it appears
to be used when creating a buffer for sending small messages between
DataNodes.  If that is the case, I would think that the message size
should be the main consideration in setting the buffer size, not the
value specified in the user's configuration.  For maintainability and
predictability, I would think a hard-coded 512 would be most
appropriate, or simply the default buffer size of
BufferedOutputStream/BufferedInputStream.
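
A rough sketch of why a constant might be just as good, under the
assumption that the method returns the smaller of half the configured
buffer size and 512.  The names below are hypothetical stand-ins, not
the Hadoop API, and the formula is an assumption that should be checked
against the source.

```java
public class SmallBufferSketch {
    // Assumed upper bound for the "small" buffer.
    static final int SMALL_BUFFER_CAP = 512;

    // Hypothetical stand-in for the method under discussion:
    // half of the configured size, capped at SMALL_BUFFER_CAP.
    public static int smallBufferSize(int configuredSize) {
        return Math.min(configuredSize / 2, SMALL_BUFFER_CAP);
    }

    public static void main(String[] args) {
        int[] configured = {512, 1024, 4096, 65536, 131072};
        for (int size : configured) {
            System.out.println(size + " -> " + smallBufferSize(size));
        }
    }
}
```

If that assumption holds, every configured value of 1024 bytes or more
yields the same 512-byte buffer, so the user's setting only matters when
it is unusually small.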

The one notable exception I see is in:

org.apache.hadoop.hdfs.server.datanode.DataNode.DataTransfer.run() - line 2261

It appears that the OutputStream used for sending blocks is using this
smaller buffer size to send entire data blocks, but no comment exists
to indicate why this smaller buffer is utilized instead of the size
configured by the user.
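
To show why the buffer size could matter here, the sketch below counts
how many writes reach the underlying stream for a given
BufferedOutputStream size.  It is a standalone illustration, not the
DataTransfer code: CountingSink, countWrites and the sizes used are all
made up for the example.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class BufferWriteCount {
    // Sink that discards data and counts the write calls that reach it.
    static class CountingSink extends OutputStream {
        int writes = 0;
        @Override public void write(int b) { writes++; }
        @Override public void write(byte[] b, int off, int len) { writes++; }
    }

    // Writes totalBytes through a BufferedOutputStream of bufferSize in
    // pieces of chunkSize and returns the number of writes to the sink.
    // Assumes totalBytes is a multiple of chunkSize.
    public static int countWrites(int bufferSize, int chunkSize, int totalBytes) {
        CountingSink sink = new CountingSink();
        byte[] chunk = new byte[chunkSize];
        try (OutputStream out = new BufferedOutputStream(sink, bufferSize)) {
            for (int written = 0; written < totalBytes; written += chunkSize) {
                out.write(chunk, 0, chunkSize);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.writes;
    }

    public static void main(String[] args) {
        int total = 64 * 1024;
        System.out.println("512-byte buffer, 128-byte chunks:  " + countWrites(512, 128, total));
        System.out.println("4096-byte buffer, 128-byte chunks: " + countWrites(4096, 128, total));
        System.out.println("512-byte buffer, 64KB chunks:      " + countWrites(512, total, total));
    }
}
```

With small chunks, the smaller buffer results in more writes to the
underlying stream.  BufferedOutputStream passes any write at least as
large as its buffer straight through, so whether this matters for block
transfers depends on how large the individual writes are.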

Thanks!

On Fri, Dec 18, 2015 at 9:59 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
> Reading files from HDFS has different performance characteristics than
> reading local files.  For one thing, HDFS does a few megabytes of
> readahead internally by default.  If you are going to make a
> performance improvement suggestion, I would strongly encourage you to
> test it first.
>
> cheers,
> Colin
>
>
> On Tue, Dec 15, 2015 at 2:22 PM, dam6923 . <dam6923@gmail.com> wrote:
>> Here was the justification from 2004:
>>
>> https://bugs.openjdk.java.net/browse/JDK-4953311
>>
>>
>> Also, some research into the matter (not my own):
>>
>> http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
>>
>> One of the conclusions:
>>
>> "Minimize I/O operations by reading an array at a time, not a byte at
>> a time. An 8Kbyte array is a good size."
>>
>>
>> On Tue, Dec 15, 2015 at 3:41 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
>>> Hi David,
>>>
>>> Do you have benchmarks to justify changing this configuration?
>>>
>>> best,
>>> Colin
>>>
>>> On Wed, Dec 9, 2015 at 8:05 AM, dam6923 . <dam6923@gmail.com> wrote:
>>>> Hello!
>>>>
>>>> A while back, in Java 1.6, the size of the internal file-reading
>>>> buffers was bumped up to 8192 bytes.
>>>>
>>>> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/io/BufferedInputStream.java
>>>>
>>>> Perhaps it's time to update Hadoop to at least this default level too. :)
>>>>
>>>> https://issues.apache.org/jira/browse/HADOOP-2705
>>>>
>>>> Thanks,
>>>> David