hbase-issues mailing list archives

From "Elliott Clark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9338) Test Big Linked List fails on Hadoop 2.1.0
Date Fri, 06 Sep 2013 20:35:51 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760594#comment-13760594 ]

Elliott Clark commented on HBASE-9338:

h1.What's happened

When the new Hadoop release was in RC and we were getting close to a 0.96 release of our own,
I changed all of the IT tests to run for longer. When I did this, IntegrationTestBigLinkedList
and IntegrationTestLoadAndVerify started failing.  At first I thought this was due to 2.1.0,
but now I'm pretty sure that was a red herring, since the tests pass with a shorter run time.
So I started creating new chaos monkeys with different actions turned on and off.  Doing that,
I found that Online Schema Change was what was causing Load and Verify to fail, so that
was disabled in 0.96.

Then I started working on Big Linked List.  That one has been much harder.


The calm monkey always passes.  The unbalance one almost always passes.  When it doesn't,
the failure is usually due to the assignment manager getting stuck, and there appears to be
no data loss in that case.
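For readers unfamiliar with the test, its verify step classifies every row key as defined (present), referenced (named by some row's prev pointer), or both. A minimal sketch of that accounting, under a deliberately simplified in-memory model (not the real MapReduce job):

```python
# Minimal model of IntegrationTestBigLinkedList's verify step (a sketch,
# not the real MapReduce verify): each row stores a pointer to the
# previous node in a circular chain.  Verify flags keys that are
# UNDEFINED (referenced but not present) or UNREFERENCED (present but
# never referenced).

def verify(rows):
    """rows: dict mapping key -> prev-key for the rows actually present."""
    present = set(rows)
    referenced = set(rows.values())
    undefined = referenced - present      # a prev pointer to a missing row
    unreferenced = present - referenced   # a row no prev pointer names
    return undefined, unreferenced

# Build a small circular chain: node i's prev pointer is i-1 (mod 10).
chain = {i: (i - 1) % 10 for i in range(10)}

# Losing a single row (here, row 4) yields exactly one UNDEFINED key
# (4, still named by row 5's prev pointer) and one UNREFERENCED key
# (3, which only the lost row pointed at).
lost = dict(chain)
del lost[4]
und, unref = verify(lost)
print(sorted(und), sorted(unref))   # [4] [3]
```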

Lots of the failures are like this one: 
{code}
	Map-Reduce Framework
		Map input records=599999660
		Map output records=1199999320
		Map output bytes=38999977900
		Map output materialized bytes=41399980716
		Input split bytes=7360
		Combine input records=0
		Combine output records=0
		Reduce input groups=600000000
		Reduce shuffle bytes=41399980716
		Reduce input records=1199999320
		Reduce output records=680
		Spilled Records=3576554634
		Shuffled Maps =696
		Failed Shuffles=0
		Merged Map outputs=696
		GC time elapsed (ms)=96902
		CPU time spent (ms)=14269000
		Physical memory (bytes) snapshot=86934548480
		Virtual memory (bytes) snapshot=157839667200
		Total committed heap usage (bytes)=86516498432
	HBase Counters
	Shuffle Errors
{code}

The number of map input records is too low; it should always be a multiple of 30M.  The number
of unreferenced rows is equal to the number of undefined rows, and if you add that number to
the number of map input records then all rows are accounted for.  That suggests that for this
job there are lots of places where we are missing exactly one link at each location.

However that pattern isn't universal.  There are other test runs where the number of
unreferenced nodes is not equal to the number of undefined rows, so those runs have some
places in the chain where more than one link is missing.  To me this suggests that it's not
an off-by-one error, and that the problem is on the client side.
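The accounting above can be checked directly against the counters from the failing run. Plain arithmetic, assuming (as the counter names suggest) that this run targeted 600M rows and that the 680 reduce output records are the emitted UNREFERENCED plus UNDEFINED keys:

```python
# Accounting check on the Verify counters above (no HBase needed).
# Assumption: this run generated 600M rows (a multiple of 30M).
expected = 600_000_000
map_input_records = 599_999_660          # rows the Verify job actually read
missing = expected - map_input_records   # rows that were lost
print(missing)                           # 340

# If every lost row costs exactly one UNREFERENCED plus one UNDEFINED
# key, the 680 reduce output records are exactly 2 * 340 -- consistent
# with "one link missing at each location".
reduce_output_records = 680
print(reduce_output_records == 2 * missing)   # True
print(map_input_records + missing == expected)  # True
```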

h1.Things I've tried

* Tried to see if it's just a badly written test
** The test is hard to follow and could be cleaned up
** But its logic seems solid.
* I thought maybe YARN was the one messing up our data, so I changed the output keys to match
what Load Test and Verify was doing
** This didn't seem to have any real impact once Online Schema Change was turned off.
* Tried to see if one map task was failing, so I changed the job to log more and to use a
better client column.
** It's not just one map task that has failures.
** The missing data was always written by the last generation job.
* Then I tried to see if it's just the beginning or the end of the linked list that is missing
** It's not.  There is missing data everywhere in the chain.
* Ran the Verify step with no monkey
** This failed, suggesting that we are really losing data.
* Then I thought it might be the same bug that was causing replication to lose edits, so I
re-ran the tests before and after JD's HLog reader change.
** Tests fail with JD's patch
* Then I tried running the test with a monkey that kills RS and Master
** This passed
** I even ran this with more loop iterations and it still passed
* I thought maybe it was region merges, so I turned off merging regions
** The tests still failed.
* So I ran a job where, inside the generation job, I run a get right after the puts.
** This was SLOW.
** Like really really really slow.  10x slower.  What has happened to our get speed?
** It failed.  To me this suggested that this was related to the client; my thought was that
the client was silently failing a put.
* Then I thought it was possible that the gets were just arriving at the region servers
after the data loss had happened, so I moved the get *directly* after the puts.
** This required flushing a lot more, so it was slow.
** The first two iterations passed, but it would have taken 3 days to finish all five
iterations, so I stopped the run.
** Everything was passing, though.
* Next I thought it could be a JVM issue; 1.7.0_15 had been a little flaky on me, so I
upgraded to 1.7.0_25.
** This had no effect.
* Next run I started looking at where the data was lost.
** {color:red}All the data I lost in my last run was in the same region.{color}
** All of it was around the same time.
** Different mappers.
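The get-right-after-put probe above can be sketched against a toy in-memory client (hypothetical; the real experiment used the HBase Java client inside the generator mappers). The point is that a client which silently drops a put raises no error, but a read-your-writes check catches it:

```python
# Sketch of the read-after-write check, against a toy key-value
# "client" that silently drops a fraction of its puts.

import random

class FlakyClient:
    """Toy client: a dropped put raises no error -- a silent failure."""
    def __init__(self, drop_rate, seed=42):
        self.store = {}
        self.rng = random.Random(seed)      # fixed seed: deterministic run
        self.drop_rate = drop_rate

    def put(self, key, value):
        if self.rng.random() >= self.drop_rate:
            self.store[key] = value         # otherwise: silently dropped

    def get(self, key):
        return self.store.get(key)

client = FlakyClient(drop_rate=0.01)        # drop ~1% of puts
silent_failures = 0
for i in range(10_000):
    client.put(i, i)
    if client.get(i) is None:               # read-your-writes check
        silent_failures += 1

# The get-after-put probe surfaces losses the put call never reported.
print(silent_failures > 0)   # True
```

The cost shown in the bullets above is the catch: every probe get adds a round trip, which is why the real run was roughly 10x slower.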

h1.Sample Dangling Pointers
hbase(main):002:0> get 'IntegrationTestBigLinkedList', "Y7\xEC\xF6w\xDB\x92\x88L\x96\x90\xA0\x17\xCE\xB9\x92"
COLUMN                                              CELL                                 
 meta:client                                        timestamp=1378465754975, value=Job: job_1378337631389_0032
Task: attempt_1378337631389_0032_m_000007_0                                              
 meta:count                                         timestamp=1378465754975, value=\x00\x00\x00\x00\x01:{\xE9
 meta:prev                                          timestamp=1378465754975, value=\xAE\xE1\x9C\xDAY~",\x08\xE2L\xB8\x06\x13\x81H
3 row(s) in 0.0920 seconds

hbase(main):003:0> get 'IntegrationTestBigLinkedList', "\xB0\x85\x1D\xF4\x15%\xA2\xB2D>\x94\xE5\x9D\xE3\x9B\xA4"
COLUMN                                              CELL                                 
 meta:client                                        timestamp=1378465679500, value=Job: job_1378337631389_0032
Task: attempt_1378337631389_0032_m_000001_0                                              
 meta:count                                         timestamp=1378465679500, value=\x00\x00\x00\x00\x01_\x97z
 meta:prev                                          timestamp=1378465679500, value=\xAE\xDE\x05\xFDVT\xA2p\x08L\x15\x93+\xEB\xBD\x19
3 row(s) in 0.0940 seconds

hbase(main):004:0> get 'IntegrationTestBigLinkedList', "s\xC3Z\x8B\xAD*\xEF\xF3\x1C\xDAa\x5C#\xE29\x94"
COLUMN                                              CELL                                 
 meta:client                                        timestamp=1378465746735, value=Job: job_1378337631389_0032
Task: attempt_1378337631389_0032_m_000002_0                                              
 meta:count                                         timestamp=1378465746735, value=\x00\x00\x00\x00\x01Z:\xEB
 meta:prev                                          timestamp=1378465746735, value=\xAE\xD5GcLOJ\xBD\x1B\xC2\xEE\xDD\x02\xBCi\x0A
3 row(s) in 0.1850 seconds

hbase(main):005:0> get 'IntegrationTestBigLinkedList', "T\x91$\xE0\xB3~\xB6\x88\x03\x9Df\x99\xE1\xA7qX"
COLUMN                                              CELL                                 
 meta:client                                        timestamp=1378465679216, value=Job: job_1378337631389_0032
Task: attempt_1378337631389_0032_m_000000_0                                              
 meta:count                                         timestamp=1378465679216, value=\x00\x00\x00\x00\x01P\xC1!
 meta:prev                                          timestamp=1378465679216, value=\xAE\xCE\xA8\xCDX\x8A\xC7H(t\xBAx\xEF|\x0E\xCD
3 row(s) in 0.1400 seconds

hbase(main):006:0> get 'IntegrationTestBigLinkedList', "\xB1\xC3+h\xEE\xE0{\x03p\x0C\x83\x87L\x96\xA1\xBB"
COLUMN                                              CELL                                 
 meta:client                                        timestamp=1378465684324, value=Job: job_1378337631389_0032
Task: attempt_1378337631389_0032_m_000003_0                                              
 meta:count                                         timestamp=1378465684324, value=\x00\x00\x00\x00\x01b\x12\xCA
 meta:prev                                          timestamp=1378465684324, value=\xAE\xC5\x17H[\xD2o\x9C\xF5\xE0\x8F\xB87\x81\xA7C
3 row(s) in 0.0730 seconds

hbase(main):007:0> get 'IntegrationTestBigLinkedList', "|@\x1AL\x17\x9E\xF2s\xD8\xD3\x07%\x11l\xDA\x83"
COLUMN                                              CELL                                 
 meta:client                                        timestamp=1378465730726, value=Job: job_1378337631389_0032
Task: attempt_1378337631389_0032_m_000000_0                                              
 meta:count                                         timestamp=1378465730726, value=\x00\x00\x00\x00\x01R\x99\x0D
 meta:prev                                          timestamp=1378465730726, value=\xAD\xE10\x10t\xA9\x0D+]\xFC\xA4\xD3\x9A`\x06\xD6
3 row(s) in 0.1530 seconds

hbase(main):008:0> get 'IntegrationTestBigLinkedList', "|0\xFF8/\x86p ^V\xA4\xD0@\x0A\x0E\xC1"
COLUMN                                              CELL                                 
 meta:client                                        timestamp=1378465752879, value=Job: job_1378337631389_0032
Task: attempt_1378337631389_0032_m_000002_0                                              
 meta:count                                         timestamp=1378465752879, value=\x00\x00\x00\x00\x01Z\xC8%
 meta:prev                                          timestamp=1378465752879, value=\xAE\x94)\x93+\xDEa{_\x81\x9Ft,\x9B^\xB6
3 row(s) in 0.0780 seconds

All of the missing rows start with "\xAE" or "\xAD"

hbase(main):002:0> java.util.Date.new(1378465752879).toString
=> "Fri Sep 06 04:09:12 PDT 2013"
hbase(main):003:0> java.util.Date.new(1378465730726).toString
=> "Fri Sep 06 04:08:50 PDT 2013"
hbase(main):004:0> java.util.Date.new(1378465684324).toString
=> "Fri Sep 06 04:08:04 PDT 2013"
hbase(main):005:0> java.util.Date.new(1378465679216).toString
=> "Fri Sep 06 04:07:59 PDT 2013"
hbase(main):006:0> java.util.Date.new(1378465746735).toString
=> "Fri Sep 06 04:09:06 PDT 2013"
hbase(main):007:0> java.util.Date.new(1378465679500).toString
=> "Fri Sep 06 04:07:59 PDT 2013"
hbase(main):008:0> java.util.Date.new(1378465754975).toString
=> "Fri Sep 06 04:09:14 PDT 2013"

The dangling pointers were all written within a window of about 75 seconds (04:07:59 to 04:09:14).
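The jruby conversions above can be redone in Python; HBase cell timestamps are epoch milliseconds, so divide by 1000 before converting (the jruby output is PDT, i.e. UTC-7):

```python
# Convert the dangling-pointer cell timestamps (epoch millis) to UTC.
from datetime import datetime, timezone

stamps = [1378465754975, 1378465679500, 1378465746735,
          1378465679216, 1378465684324, 1378465730726, 1378465752879]

for ts in sorted(stamps):
    print(datetime.fromtimestamp(ts / 1000, tz=timezone.utc))

# Spread of the whole cluster of dangling writes, in seconds.
spread_seconds = (max(stamps) - min(stamps)) / 1000
print(spread_seconds)   # 75.759
```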

h2.Side Note

JDK7 seems to core dump a lot when starting up region servers.  Once things are all set up
it's stable.  I need to turn on core dumps to a known folder so that Jenkins can archive them.
> Test Big Linked List fails on Hadoop 2.1.0
> ------------------------------------------
>                 Key: HBASE-9338
>                 URL: https://issues.apache.org/jira/browse/HBASE-9338
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
>            Priority: Blocker
>             Fix For: 0.96.0
>         Attachments: HBASE-HBASE-9338-TESTING.patch

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
