hadoop-user mailing list archives

From Brad Sarsfield <b...@bing.com>
Subject RE: io.file.buffer.size
Date Wed, 21 Nov 2012 17:00:14 GMT
If you are hitting an IO bottleneck, you may want to consider adjusting not only io.file.buffer.size
but a few other settings as well. There is a relationship between io.file.buffer.size for writes
and dfs.write.packet.size: if you squint at it a little bit, your file IO buffers wait for network
packets to fill up before a write goes out. There is also a relationship, at a much higher
cluster-wide level, to dfs.block.size, which can change your IO patterns as well.
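To make that relationship concrete, here is a back-of-the-envelope sketch. The values are the illustrative ones quoted later in this mail (128 KB buffer assumed as an example), not a recommendation for your cluster:

```python
# Illustrative sizing arithmetic for the write path described above.
# Values follow the settings quoted later in this mail; adjust for
# your own environment.
io_file_buffer_size = 128 * 1024        # io.file.buffer.size: 128 KB (assumed)
dfs_write_packet_size = 1024 * 1024     # dfs.write.packet.size: 1 MB
dfs_block_size = 512 * 1024 * 1024      # dfs.block.size: 512 MB

# How many buffered writes accumulate before one network packet is sent,
# and how many packets make up one block:
buffers_per_packet = dfs_write_packet_size // io_file_buffer_size
packets_per_block = dfs_block_size // dfs_write_packet_size

print(buffers_per_packet)   # 8
print(packets_per_block)    # 512
```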



All of these settings are highly dependent on your hardware / network environment / workload.
 What works for someone may not work for you ☺  IO is a ‘finicky’ thing to say the least.



Here are some things to consider when you are trying to push your writes > 400MB/s …
 Though all of this will change when: https://issues.apache.org/jira/browse/HDFS-3529 and
https://issues.apache.org/jira/browse/HDFS-3528 are implemented.



This is what the datanode write path looks like (this may not be 100% accurate, but it looks
fairly correct). Huge credit goes to John Gordon for this diagram and writeup.

[inline diagram of the datanode write path; image not preserved in the archive]

Though if there’s a blocking IO call in here somewhere… well, then all bets are off ☺



The key here is that there appears to be a single-thread default for ipc.server.read.threadpool.size,
which may limit the overall throughput.

<!-- Network packet size -->
<property><name>dfs.write.packet.size</name><value>1048576</value></property>

<!-- Block size -->
<property><name>dfs.block.size</name><value>536870912</value></property>

<!-- TaskTracker -->
<property><name>tasktracker.http.threads</name><value>128</value></property>

<!-- DataNode -->
<property><name>ipc.server.listen.queue.size</name><value>128</value></property>
<property><name>ipc.server.read.threadpool.size</name><value>100</value></property>
<property><name>ipc.server.handler.queue.size</name><value>1600</value></property>
<property><name>dfs.datanode.handler.count</name><value>16</value></property>
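As an aside, fragments like the above are easy to get wrong by a factor of 1024, so a throwaway sanity-check script can help. This is just a sketch; it only assumes the property names used in this mail:

```python
import xml.etree.ElementTree as ET

# Throwaway sanity check for a configuration fragment like the one above.
fragment = """
<configuration>
  <property><name>dfs.write.packet.size</name><value>1048576</value></property>
  <property><name>dfs.block.size</name><value>536870912</value></property>
</configuration>
"""

props = {p.findtext("name"): int(p.findtext("value"))
         for p in ET.fromstring(fragment).findall("property")}

# Packets should be much smaller than blocks, or the write pipeline
# degenerates to one packet per block.
assert props["dfs.write.packet.size"] < props["dfs.block.size"]
print(props["dfs.block.size"] // props["dfs.write.packet.size"])  # 512
```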



Gory details:  (Proceed at your own risk!!)



Clean start:

------------------------------------

Client 1 posts an RPC request for writeBlock and starts sending a lot of bytes such that:

It gets interrupted by the OS after M milliseconds or when the OS decides to put it in a resource
wait queue because the device write buffer is full.

Server gets the read request, processes the header, and enters the blocking read section of
the code where it is marshaling the RPC arguments.

Client 2 is context switched in and posts an RPC request for writeBlock and starts sending
lots of bytes:

                It gets interrupted for reasons defined above.

…

Client N is context switched in and does the same thing.

Client 1 back, starts writing some more – but this time it has a higher probability of being
put in a resource wait queue – because the device write buffer for the NIC has the local
read requests in it buffered from all the other clients that haven’t been serviced.

Client 2 …

…

Now the write buffer is full of read requests and the clients are all in the OS resource wait
queue for the network.

Server read thread has been popping in and out of the resource wait queue to fulfill its sync
reads.

Client 1 is pulled off of the resource wait queue and writes until the buffer is full again.

Server is pulled off of the resource wait queue and reads those bytes.

Client 1 is pulled off of the resource wait queue…

…

Server reads the last bytes of the RPC request including the block data to be written and
puts the serialized “Call” object in the LinkedBlockingQueue to be satisfied by the code
that writes to disk.

Server reader thread exits synchronous point in its code.

Server Reader thread calls Select().

Clients 2-N's sockets come back.

Server asynchronously processes all their headers – very fast.

Clients start being pulled off of resource queues to send again.

Server reader thread Selects again – gets Client 2’s connection.

                If (block size < device buffer size allocated per socket) => Server
reader thread processes RPC synchronously but instantaneously (all data in buffer).

                Else => reader thread reads as much as it can, then goes back to the “clean
state”
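The walkthrough above can be sketched, very loosely, as many clients feeding bytes to a single reader thread, which marshals complete requests and hands finished "Call" objects to a blocking queue for the disk-writing side. This is a toy model, not the actual DataNode code:

```python
import queue
import threading

# Toy model of the write path above: many clients enqueue raw "RPC bytes",
# ONE reader thread marshals them (the synchronous section), and finished
# Call objects land on a blocking queue for the disk-writer side.
NUM_CLIENTS = 4
incoming = queue.Queue()        # bytes arriving from client sockets
call_queue = queue.Queue()      # marshaled "Call" objects awaiting disk write

def client(client_id):
    # Each client posts one writeBlock request as a stream of chunks.
    for chunk in range(3):
        incoming.put((client_id, chunk))
    incoming.put((client_id, None))  # end of this client's RPC

def reader():
    # Single reader thread: all clients' chunks are serialized through
    # this one loop, which is the potential bottleneck discussed above.
    done = 0
    while done < NUM_CLIENTS:
        client_id, chunk = incoming.get()
        if chunk is None:
            call_queue.put(("Call", client_id))  # fully marshaled request
            done += 1

threads = [threading.Thread(target=client, args=(i,)) for i in range(NUM_CLIENTS)]
reader_t = threading.Thread(target=reader)
for t in threads + [reader_t]:
    t.start()
for t in threads + [reader_t]:
    t.join()

calls = sorted(call_queue.get() for _ in range(NUM_CLIENTS))
print(calls)   # one marshaled Call per client
```

Widening ipc.server.read.threadpool.size corresponds, in this toy model, to running several reader loops against the same incoming queue.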





What I think you will see for the aggregate local->local IO is one of these patterns:



Block size < 2 * net buffer size (client write buffer + server read buffer):

| Phase/operation | Net usage | Disk usage | When, for how long? |
|---|---|---|---|
| Reading headers async | High, short duration | N/A, or draining previous queues | Short; mostly startup and after net buffer drains |
| Draining net buffers (simulated pure async) | High/medium | Sustained high, or spikes just after (draining producer/consumer queue); spike rate linearly dependent on duration of the drain | After one or more "stutter" events; duration follows a probability density P(buffer drain throughput) ~= # full buffers / # connections |
| Blocking marshal, other connection buffers filling | OS net usage decreases linearly with duration of the stutter (buffers fill up); client-perceived usage is uniformly low (only one connection being read) | None, or still clearing the queue from a previous net buffer drain | After a net buffer drain; duration is probabilistic, based on 1 - P(drain) |




Block size >= 2 * net buffer size:

| Phase/operation | Net usage | Disk usage | When, for how long? |
|---|---|---|---|
| Reading headers async | High, short duration | N/A, or draining previous queues | Short; mostly startup and after net buffer drains |
| Draining net buffers (simulated pure async) | (n/a) | (n/a) | Never |
| Blocking marshal, other connection buffers filling | Medium/low; instead of blocking on nearly empty buffers, blocks fairly reliably to read the tail end of the RPC request while the other clients fill the buffers | Erratic: very high, then 0, then very high | Always |


(again credit: John Gordon)





Thanks,

Brad Sarsfield



-----Original Message-----
From: Kartashov, Andy [mailto:Andy.Kartashov@mpac.ca]
Sent: Wednesday, November 21, 2012 8:18 AM
To: user@hadoop.apache.org
Subject: io.file.buffer.size



Guys,



I've read that increasing the above number (default 4 KB) to, say, 128 KB might speed things up.



My input is 40 million serialized records coming from an RDBMS, and I noticed that with the
increased IO buffer size my job actually runs a tiny bit slower. Is that possible?



p.s. got two questions:

1. During Sqoop import I see that two additional files are generated in the HDFS folder, namely:
.../_log/history/...conf.xml
.../_log/history/...sqoop_generated_class.jar

Is there a way to redirect these files to a different directory? I cannot find an answer.



2. I run multiple reducers and each generates its own output. If I were to merge all the output,
would running either of the commands below be recommended?



hadoop dfs -getmerge <output/*> <localdst>

or

hadoop dfs -cat output/* > output_All
hadoop dfs -get output_All <localdst>



Thanks,

AK





NOTICE: This e-mail message and any attachments are confidential, subject to copyright and
may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not
the intended recipient, please delete and contact the sender immediately. Please consider
the environment before printing this e-mail.