hbase-user mailing list archives

From "Liu, Raymond" <raymond....@intel.com>
Subject Trouble shooting process for a random lag region issue.
Date Thu, 17 Jan 2013 02:09:55 GMT
Hi

I have recorded the troubleshooting process for my random lag region in MR scan issue, and share
it here in case you meet a similar problem to diagnose.

Full text with image here: http://blog.csdn.net/colorant/article/details/8510254

Text-only version below:

----------------------------------------------------------

=== Problem observation ===

When scanning some specific tables, there are always a few slow map tasks lagging behind (usually
costing 150%~200% of the average task run time). The top 10 slowest tasks usually sit on
the same Region Server, and if the same scan job is run multiple times, the slowest tasks and
their location do not change.

Judging only by the above behavior, you would suspect that the lagging Region Server must
have some problem which slows down the whole system. But the truth is: if you run the scan job
on a different table, the lagging Region Server is not the same one. For example, with table
1, region server A has a lot of lagging tasks, while for table 2, it might be region
server B which lags behind.

Last but not least, all these tables worked fine a few days ago. The problem seems to occur (or
be observed) only after a few cluster restarts.

=== Environment ===

1 master node + 4 data/region nodes, each with 4 disks, 48G RAM and 16 CPU cores.
Hadoop 1.1.1, HBase 0.94.1, 24/20 Map/Reduce slots on each node.

Each table is around 50GB, with 64~96 regions distributed evenly across the 4 Region Servers. The data
is generated, and each region has exactly the same number of KeyValues and almost exactly
the same size. All tables have had a Major Compact done.

A MapReduce job is used to do a whole-table scan. Each region is assigned to a local map task;
the map task just scans the local region and counts rows. Since the number of map slots is equal to or
larger than the number of regions, all tasks can be assigned within one batch.
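(For reference, a comparable whole-table count, one map task per region, can be run with the stock RowCounter job that ships with HBase. My real job is a custom one, but RowCounter behaves similarly for this kind of test; "mytable" below is just a placeholder name.)

  # whole-table scan + row count, one map task per region
  hbase org.apache.hadoop.hbase.mapreduce.RowCounter mytable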

=== Trouble shooting ===

My troubleshooting procedure is recorded below (including some dead ends and misleading detours,
but also a bit of knowledge gained as a byproduct, at least for a newbie like me).

== Any bottleneck? ==

First of all, monitor the lagging Region Server to check whether there is any bottleneck while
the scan job is running. Nothing appears abnormal. CPU and disk IO are OK and do not reach their peak,
except that the overall disk IO throughput is a little bit lower than on the other Region Servers.

== Data locality? == 

If a region's data does not actually reside on the local data node, that can also lead
to a hot spot region, since it will need to read data from other nodes.

To make sure that all data is actually read from the local data node, I do a second Major Compact
on the table to eliminate the possibility that regions got relocated and rebalanced since the
last major compact, then inspect the network IO while running the MapReduce scan job.

Knowledge: a simple, fast way to inspect network IO together with other system resources is
"dstat", e.g. "dstat -cdnm" can monitor CPU / disk IO / network IO / memory, cache and buffer
all together.
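For example, to watch the data disks and the cluster NIC individually during a scan (the device and interface names below are assumptions for my boxes; adjust to yours):

  # CPU / per-disk IO / network / memory, refreshed every 5 seconds
  dstat -cdnm -D sdb,sdc,sdd,sde -N eth0 5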

The observation shows that there is no data locality issue: all data is read from the local
data node and there is no notable network IO. The lagging issue still exists after another Major
Compact, but some changes are observed: after each Major Compact, the top 10 slow
regions seem to change randomly, with only a weak relationship to the previous run (say, probably
still on the same region server before/after the major compact).

Thus, this issue is not related to data locality.

== Cluster configuration == 

Since this problem is random across tables, I also wonder whether any configuration change
I made over the past days impacted the cluster's stability, e.g. memory related
settings, or some fine tuning of the map reduce framework parameters.

* First of all I look into GC behavior, since GC brings a lot of randomness, and a lot
of settings might influence it: Hadoop/HBase heap size, GC strategy, Eden area
size, HBase block cache enable/disable, etc.
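The kind of knobs I compared live in hbase-env.sh; a minimal sketch, where the values are only illustrative assumptions and not the settings of my cluster:

  # hbase-env.sh (placeholder values used for the comparison runs)
  export HBASE_HEAPSIZE=8000                     # heap size in MB
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70 \
    -Xmn512m \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/var/log/hbase/gc-regionserver.log"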

After tuning and comparing different settings for these parameters (including restoring them
to the settings that I know were working before this problem occurred), the lagging issue still
exists. Some settings do behave better in terms of GC time, but they don't solve the
lag region issue.

Knowledge: disabling the HBase block cache reduces GC time a lot for whole-table-scan style
jobs; for my 50G of data it saves about 10s of GC time, observed with the JVisualVM GC plugin.
By default, TableInputFormat disables the block cache (obviously, since all the data is accessed
only once it doesn't need to be cached), but if you are writing a custom InputFormat, you
need to disable it yourself.

* Then I try to tune some parameters related to HDFS/MapReduce/HBase's concurrency capability,
e.g. data node xceiver/handler count, RegionServer handler count, map slot count, client
scan cache size, etc. These settings are kept in sync across the nodes, so they should not cause
random issues, but since I did change them over the past days, to make sure I double check and
compare different settings of these parameters. Not surprisingly, they don't help with this
issue. (A quick way to confirm the configs really are identical on every node is sketched below.)
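A consistency check like the following can rule out a node running with a stale config file. The host names match my region servers, but the config file paths are assumptions; adjust them to your install layout.

  # checksums should be identical on every node
  for h in sr173 sr174 sr175 sr176; do
    echo -n "$h: "
    ssh $h 'cat /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/mapred-site.xml \
            /etc/hbase/conf/hbase-site.xml 2>/dev/null | md5sum'
  done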

Knowledge: when there is no obvious bottleneck, fine tuning these concurrency related parameters
doesn't have a significant impact on overall cluster performance.

Thus, I believe cluster configuration can be ruled out.

== Profile on slow region server == 

Since, for a specific table, the lagging region server is fixed and reproducible across jobs,
I use JVisualVM to sample the lagging map tasks on that Region Server and also to sample the region
server process itself. From the sampling results, the map tasks don't show any noticeable difference
from other tasks. (That is reasonable, since for a scan most of the work is done by the Region Server;
the client just counts the results.) At the same time, the Region Server's hotspot method distribution
and percentages also don't show any noticeable difference from the other Region Servers (well, a
bit more idle time).

Still, I noticed one difference: on the lagging Region Server, around the end of the job,
each IPC Server Handler thread shows a lot of lock contention, as the following figure
shows:

http://img.my.csdn.net/uploads/201301/16/1358327469_2557.png

By dumping thread details around the locking time and checking the stacks, I can see that this
happens at the HFile block read level:

"IPC Server handler 10 on 60020" daemon prio=10 tid=0x00007f39b02a1800 nid=0x68e4 waiting for
monitor entry [0x00007f39e4fec000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1348)
- waiting to lock <0x000000060fc01798> (a org.apache.hadoop.hdfs.DFSClient$DFSDataInputStream)
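(I took these dumps while sampling; from the shell, the equivalent is roughly the sketch below, which grabs a few dumps of the region server near the end of the job and keeps only the IPC handler stacks.)

  RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
  for i in 1 2 3; do
    jstack $RS_PID | grep -A 15 'IPC Server handler' >> rs-handler-stacks.txt
    sleep 5
  done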

By reading the code, it appears that this is due to multiple scans reading the same store file,
even though they actually read at different offsets. Each scan runs in its own thread and has
its own scan instance and environment, but the store file handle/reader itself is shared within the
region server. (I don't quite understand the purpose of this limit. To reduce the number of open
handles? To serialize file access?)

So I can now reason that this contention is due to the MapReduce speculative execution feature: a map
task can be started as multiple instances in case the slow task is really slow. Since this
second task scans the same region, and my table is major compacted, it reads the
same store file. And since there is no block cache during the scan, it must do a real read from the data
node, so it waits for the lock.

And this speculative execution feature actually doesn't help in my case. The second task is always
waiting for the lock to be released by the original task, so it doesn't overtake the original task, but
finishes at about the same time, and the original task always wins.
(I am wondering: since the data is replicated on other nodes, couldn't we read directly
from another node to benefit from the replica, instead of putting a further burden
on the lagging node?)

http://img.my.csdn.net/uploads/201301/16/1358327487_3595.png

So I disable speculative execution in the MapReduce framework.
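In Hadoop 1.x this is either a cluster-wide mapred-site.xml setting or a per-job property; a sketch, again using RowCounter and the placeholder table name "mytable" as a stand-in for my custom job:

  # per-job, passed through GenericOptionsParser
  hbase org.apache.hadoop.hbase.mapreduce.RowCounter \
    -Dmapred.map.tasks.speculative.execution=false mytable

  # or cluster-wide in mapred-site.xml:
  #   <property>
  #     <name>mapred.map.tasks.speculative.execution</name>
  #     <value>false</value>
  #   </property>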

But this is not the root cause of my random lag issue, just a side effect. The slow
task is already lagging behind; speculative execution just makes it worse.

== Benchmark Hard disk performance ==

Since no significant data traffic is needed on the network, I skip checking the network.
Finally, I decided to check my hard disks. This didn't come to mind at first, because
the issue happens randomly on every region server, and I could hardly believe that disks on different
region servers would go wrong at the same time. But now it is the last thing left to check.

* First, I run an "hdparm -t" check on every hard disk in my cluster, roughly as sketched below.
Results are shown in the linked table (marked as OLD*).
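(Device names below are assumptions for my nodes.)

  # buffered sequential read benchmark on each raw device (needs root)
  for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    echo "== $dev =="
    hdparm -t $dev
  done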

http://img.my.csdn.net/uploads/201301/16/1358327500_6061.png

Note *: old disk set, mostly ST31000528AS with 3 ST31000340NS
Note **: new disk set, all ST1000DM003-9YN162

Two disks were found with very low performance, 3 and 8MB/s, while the other disks' average
result is 110~120MB/s, with a few disks at 90MB/s. (On sr171, disk2 is actually not used since
this is the name/master node; only disk1 is used.)

However, this slow-disk pattern doesn't match my job's task slowness pattern. Actually most of the
tasks on sr174 run faster than average; e.g. in the earlier speculative execution example, sr174
is actually helping sr175.

Also, a lot of the lagging cases happen on sr175 and sr176, while according to hdparm sr176's disks
seem fine. And as slow as 8MB/s? That is hard to believe.

Since I don't know the internal mechanism of the hdparm test, I write a script to run dd
on every disk to double check hard disk performance (https://github.com/colorant/misc-scripts/blob/master/ddtest.sh).
Results are shown in the same table. This time I can see that a few disks perform
20-30% lower than the other disks, including one on sr176. (hdparm didn't flag these other
poorly performing disks, but the disks it did flag as bad also performed badly in the dd test.)
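A minimal version of such a dd check might look like the sketch below; the mount points are assumptions, and the linked script is what I actually ran.

  # sequential write then read, bypassing the page cache, one disk at a time
  for mnt in /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4; do
    echo "== $mnt =="
    dd if=/dev/zero of=$mnt/ddtest.tmp bs=1M count=2048 oflag=direct 2>&1 | tail -n 1
    dd if=$mnt/ddtest.tmp of=/dev/null bs=1M iflag=direct 2>&1 | tail -n 1
    rm -f $mnt/ddtest.tmp
  done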

In order to find out why these disks perform badly, I run both e4defrag to check file fragmentation
and fsck -c for a full-disk bad block scan. No bad blocks are found, so why these disks
perform badly remains unknown to me. (Their models are almost all the same, as the notes above mention,
with a few exceptions, but both models include good disks and bad disks.)
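For reference, the checks were along these lines. The device and mount point names are assumptions; badblocks is the read-only scan that fsck -c drives under the hood, and the scan should only touch an unmounted or idle partition.

  # fragmentation score for one data filesystem (read-only report)
  e4defrag -c /mnt/disk4
  # read-only surface scan for bad blocks
  badblocks -sv /dev/sdd1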

Anyway, since a lot of the disks don't perform well, I decided to replace them all with
a new batch of disks. The performance of the new disks is also shown in the same table; this
time it is much better in terms of both absolute and relative speed.

== Check blocks allocation == 

With the disks replaced and the table rebuilt with the same data generator, I hoped the
issue would be fixed. But, as always, I am not a lucky guy: the random lag problem still exists, though
it is a lot better (the slowest task now costs around 130~150% of the average task time; is that typical
in your clusters?). It approaches, but still does not reach, the performance I got on the previous
slower disks before this issue appeared (even though the new disks seem to have a 150% raw speed gain).

Then I decided to check each disk's throughput on the currently slow region server during the job,
using dstat to show disk IO every second on sr175; screenshot below:

http://img.my.csdn.net/uploads/201301/16/1358327516_4856.png

This screenshot shows the IO of each disk when approaching the end of the job. Notice
that disk4 (sdd1) is kept busy (maybe it has reached the peak speed for HBase reads, though
not the peak raw speed), while the other disks all have idle time and their throughput goes up
and down...

Thus the best guess is that disk4 has much more data to serve than the other disks.

Since there is no ready-made command to find out which block on which disk belongs to which table or
region, I had to gather the list of regions belonging to a single node, then write scripts that use a
command like "hadoop fsck /path/to/file -files -blocks -locations" to find out which blocks each
region has, and the "find" command to search all the disks for each block's location
(script for finding blocks: https://github.com/colorant/misc-scripts/blob/master/hadoop/findblock.sh),
and then aggregate the results. There may be a quicker solution using the HDFS API, but these
scripts are simple to write, and work fine for me.
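The core of it is roughly the sketch below. The table path, region directory name and the HDFS data directory layout are assumptions; my actual scripts, linked above, differ in detail.

  # 1. list the block IDs of one region's store files
  hadoop fsck /hbase/mytable/0123456789abcdef -files -blocks -locations \
    | grep -oE 'blk_-?[0-9]+' | sort -u > region_blocks.txt

  # 2. find which local disk holds each block file, then count blocks per disk
  while read blk; do
    find /mnt/disk*/hdfs/data -name "$blk" 2>/dev/null
  done < region_blocks.txt | cut -d/ -f3 | sort | uniq -c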

The following table shows the local block distribution for a 96-region table:
http://img.my.csdn.net/uploads/201301/16/1358327526_5298.png

Just read the data marked with * for now. And yes, disk4 has 45 blocks, which is 50% more than
the other disks. No wonder it is always busy. I guess this is the reason the regions
on sr175 lag behind: it has a hot spot disk (though this disk is not the slowest). On
sr174, the blocks are well distributed, and thus sr174 owns most of the fastest tasks.

But you may wonder: HDFS should even out the block distribution across disks, so how can
this happen? My idea is that it does even out the blocks for the data as a whole, but since it
doesn't know which block belongs to which table, and it doesn't know which data belongs to a local
region and which is a replica of a region served by another region server, it does not (and cannot)
balance the blocks across disks well for a single table's local region data. To confirm
that the overall blocks are evenly distributed across disks, the following table shows the
total number of blocks (local, or replicas of remote regions) on each disk:
http://img.my.csdn.net/uploads/201301/16/1358327537_2883.png

This also explains why a Major Compact can change the slowness pattern but won't resolve the
slowness issue: a major compact rewrites the blocks and thus changes the block distribution, but if
the blocks are still not evenly distributed, there will still be hot spot disks. And this uneven
block issue might always exist. This is confirmed by doing another major compact and counting the
blocks again; the results are shown in the previous table as the data without the * mark.

One thing I am not sure about: you probably noticed in the earlier disk speed benchmark that
sr175 owns the slowest disk in the cluster (though only 5-10% slower), and it happens that
most of the lag also occurs on this server, even though the hot spot disk is not necessarily the
slowest disk itself but one of the others. Is that because a slow disk makes the uneven block
distribution issue worse on that node?

Also note that the uneven block distribution does not necessarily map 1:1 to slow regions;
it just increases the chance of a hot spot disk. The actual read contention also
depends on each region's block scan sequence. Take this table for example: after a second
major compact, sr173's disk2 happens to have the most blocks (45), similar to the previous case
on sr175, but the overall job run time is different.
http://img.my.csdn.net/uploads/201301/16/1358327548_7370.png

As the table above shows, the overall block distribution still has similar hot spot disks (these
disks have similar raw speed), but on a different node, while the overall run time shows a 10%
performance difference. The results before/after the second major compact were measured several
times, with very little variance.

And actually, the overall job time (95s) is now a little bit faster than the best result
I got from the previous old-disk cluster, though not 50% faster as the raw disk speed would suggest.

=== Why did this happen suddenly? ===

Then comes the question: why didn't I observe this issue before? My best guess is that the original
tables were major compacted by me several times to get the best results out of them, and I stopped
compacting them when they happened to be in a "good" block distribution state. I counted the random
slowness seen before they stabilized as "real" randomness. Then the recent cluster restarts
somehow messed up the region locality, so I did a major compact again to restore
data locality, but this time I paid special attention to the lag issue, and I was also not lucky
enough to reach another "good" block distribution state after several tries. Thus it looked to
me as if this issue happened suddenly. Hmm, this might not be the true story, but it is the best
guess I have.

=== Solution? ===

So it seems that this issue is inevitable; is there some way to avoid it? The possible
solutions I can imagine are:

* Improve the block (and replica) allocation strategy across disks, making it region aware?
	○ I doubt it will help. E.g. one region has a perfect block distribution on disk1/2/3/4,
while another region also allocates its blocks on disk1/2/3/4. Although overall there is no
hot spot disk, when you scan every region at the same time in an MR job, they conflict on each
individual disk if the region scans proceed at a similar speed. Maybe there is a better, ideal
block allocation?
* More disks?
	○ My cluster has 16 CPU cores and 4 disks, so there will always be multiple CPUs reading from
the same disks. With more disks, say 12, the chance of disk read conflicts might be reduced.
	○ By the way, what is the typical CPU/disk ratio on industry clusters?
* Faster disks?
	○ To reduce the hot spot disks' impact. Actually, if you run HDFS on a RAM disk, you hardly
observe lagging regions at all. A RAM disk is an extreme example, though; in my case, faster disks
help, but not that much.
* Smaller block size?
	○ That means more blocks for the same data size, which hopefully leads to a more even block distribution.
	○ But a smaller block size might impact HDFS name node capacity and batch processing throughput?
* More blocks?
	○ Similar to a smaller block size: the aim is to increase the number of blocks to get a better
chance of a good block distribution, but without changing the block size; instead, hope that the
table itself is bigger and occupies more blocks.
* Tune the data node read size?
	○ Not sure whether this helps. A bigger read size might reduce contention time, or help busy
disks by spending less time seeking?
* Make it CPU bound?
	○ If the job is CPU bound, the hot disk won't matter as much. We can achieve this by e.g. encoding
the data on disk. The "more disks" option above also, in a way, shifts the bottleneck from
disks to CPU.

Best Regards,
Raymond


 
> Hi
> 
> For feedback.
> 
> With a lot of profiling works, I guess I found the most promise cause of my
> problem.
> 
> It's not because one disk is slow or something ( though I do have slow disk on
> different region servers, but the lagging behind pattern seems not related to
> the disk slowness pattern)
> 
> It seems to me this issue is caused by HDFS blocks not distributed evenly across
> disks on the same region server.
> 
> For a specific table, I do have even region across region servers. But the HFILE
> data blocks that belong to the local region on the slow region server don't
> distributed evenly. Say disk1 have 45 blocks, while disk4 have 30 blocks etc.
> 
> While, on the large, Add up all the blocks from all tables including both local
> region's block and remote region's replica data block on this region server, they
> are evenly distributed.
> 
> Thus I guess the HDFS did try to even out the data block on the disks, But since
> it do not know which block is belong to which region, and there are incoming
> replica data blocks, so even with round robin strategy, it could not even out the
> "local region"'s data block across disks. Seems this could hardly been avoid?
> 
> Thus, there are hotspot disk, and with defined scan sequence which lead to lag
> behind region, which lead to lag behind map task and on this region server.
> 
> This is the best guess I have gain up to now. But not knowing why this issue
> come out suddenly on my cluster, or why I don't observe it before...
> 
> 
> >
> > Hi there,
> >
> > The HBase RefGuide has a comprehensive case study on such a case.
> > This might not be the exact problem, but the diagnostic approach should
> help.
> >
> > http://hbase.apache.org/book.html#casestudies.slownode
> >
> >
> >
> >
> >
> > On 1/4/13 10:37 PM, "Liu, Raymond" <raymond.liu@intel.com> wrote:
> >
> > >Hi
> > >
> > >I encounter a weird lag behind map task issue here :
> > >
> > >I have a small hadoop/hbase cluster with 1 master node and 4
> > >regionserver node all have 16 CPU with map and reduce slot set to 24.
> > >
> > >A few table is created with regions distributed on each region node
> > >evenly ( say 16 region for each region server). Also each region has
> > >almost the same number of kvs with very similar size. All table had
> > >major_compact done to ensure data locality
> > >
> > >I have a MR job which simply do local region scan in every map task (
> > >so
> > >16 map task for each regionserver node).
> > >
> > >By theory, every map task should finish within similar time.
> > >
> > >But the real case is that some regions on the same region server
> > >always lags behind a lot, say cost 150 ~250% of the other map tasks
> > >average times.
> > >
> > >If this is happen to a single region server for every table, I might
> > >doubt it is a disk issue or other reason that bring down the
> > >performance of this region server.
> > >
> > >But the weird thing is that, though with each single table, almost
> > >all the map task on the the same single regionserver is lag behind.
> > >But for different table, this lag behind regionserver is different!
> > >And the region and region size is distributed evenly which I double
> > >checked for a lot of times. ( I even try to set replica to 4 to
> > >ensure every node have a copy of local data)
> > >
> > >Say table 1, all map task on regionserver node 2 is slow. While for
> > >table 2, maybe all map task on regionserver node 3 is slow, and with
> > >table 1, it will always be regionserver node 2 which is slow
> > >regardless of cluster restart, and the slowest map task will always
> > >be the very same one. And it won't go away even I do major compact
> again.....
> > >
> > >So, anyone could give me some clue on what reason might possible lead
> > >to this weird behavior? Any wild guess is welcome!
> > >
> > >(BTW. I don't encounter this issue a few days ago with the same table.
> > >While I do restart cluster and do a few changes upon config file
> > >during that period, But restore the config file don't help)
> > >
> > >
> > >Best Regards,
> > >Raymond Liu
> > >
> > >
> >
