hadoop-common-user mailing list archives

From "dhruba Borthakur" <dhr...@yahoo-inc.com>
Subject RE: Re: HBase PerformanceEvaluation failing
Date Mon, 26 Nov 2007 18:00:49 GMT
Hi there,

The "failing to allocate a datanode for a block" already shows up as an exception in the namenode
logs:

> java.io.IOException: File
> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
> could only be replicated to 0 nodes, instead of 1

Do you agree that this message is sufficient?
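
If it helps, a quick way to confirm the exception is there (just a sketch, assuming the default
log layout under $HADOOP_HOME/logs and the usual hadoop-<user>-namenode-<host>.log naming):

  grep "could only be replicated" logs/hadoop-*-namenode-*.log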

Thanks,
dhruba

-----Original Message-----
From: stack [mailto:stack@duboce.net] 
Sent: Monday, November 26, 2007 9:21 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Re: HBase PerformanceEvaluation failing

Thanks for jumping in, Dhruba.

Should the 'failing to allocate a datanode for a block' show up in the logs
as an exception? (I don't see one.) Would something show in the logs if we
were at DEBUG level?
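
For what it's worth, one way to coax more detail out of the namenode would be to raise the
relevant loggers to DEBUG and restart it -- a sketch only, assuming a stock conf/log4j.properties,
with logger names taken from the package/class names that appear in the logs:

  echo "log4j.logger.org.apache.hadoop.dfs=DEBUG" >> conf/log4j.properties
  echo "log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG" >> conf/log4j.properties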

Thanks,
St.Ack


dhruba Borthakur wrote:
> Hi Kareem,
>
> I think you have run into 
>   http://issues.apache.org/jira/browse/HADOOP-2275
>
> I have a fix for this problem. You can find that fix as an attachment to the above JIRA.
> If you typically build your Hadoop distribution from source, could you please apply the above
> patch to your source and then try out your experiment? If not, that's fine; this fix will
> probably go into the next release of Hadoop.
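>
> Roughly, assuming an ant-based source build (a sketch only -- check the JIRA for the actual
> attachment name, which is called HADOOP-2275.patch here):
>
>   cd /path/to/hadoop-src
>   patch -p0 < HADOOP-2275.patch   # apply the fix attached to the JIRA
>   ant jar                         # rebuild the hadoop jar with the fix in place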
>
> Thanks for your help,
> dhruba
>
> -----Original Message-----
> From: Kareem Dana [mailto:kareem.dana@gmail.com] 
> Sent: Saturday, November 24, 2007 9:13 PM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Re: HBase PerformanceEvaluation failing
>
> I cleaned the logs, cleaned and reformatted the DFS. Started up DFS,
> mapred, and hbase.
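>
> For reference, roughly the sequence that amounts to -- a sketch from memory; the hbase start
> script path is a guess at where the contrib build lives, and "cleaning" the DFS here means
> also wiping the datanode data dirs so the reformat takes cleanly:
>
>   bin/stop-all.sh                        # stop mapred and dfs if still running
>   rm -rf logs/*                          # clear out the old logs
>   bin/hadoop namenode -format            # reformat the DFS
>   bin/start-dfs.sh
>   bin/start-mapred.sh
>   src/contrib/hbase/bin/start-hbase.sh   # start the hbase master and region servers
>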
> Ran fsck just to make sure:
> kcd@hadoop07:~/hadoop$ bin/hadoop fsck /
> .....Status: HEALTHY
>  Total size:    450 B
>  Total blocks:  3 (avg. block size 150 B)
>  Total dirs:    16
>  Total files:   5
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Target replication factor:     1
>  Real replication factor:       1.0
>
>
> The filesystem under path '/' is HEALTHY
>
> Started the PE Test:
> kcd@hadoop07:~/hadoop$ bin/hadoop jar
> /usr/research/home/kcd/hadoop/build/contrib/hbase/hadoop-0.15.0-dev-hbase-test.jar
> sequentialWrite 2
> 07/11/24 23:14:03 INFO hbase.PerformanceEvaluation: Table name:
> TestTable, families: {info:={name: info, max versions: 3, compression:
> NONE, in memory: false, max length: 2147483647, bloom filter: none}}
> created
> 07/11/24 23:14:04 INFO mapred.FileInputFormat: Total input paths to process : 1
> 07/11/24 23:14:06 INFO mapred.JobClient: Running job: job_200711242311_0001
> 07/11/24 23:14:07 INFO mapred.JobClient:  map 0% reduce 0%
> 07/11/24 23:20:53 INFO mapred.JobClient:  map 5% reduce 0%
> 07/11/24 23:21:03 INFO mapred.JobClient:  map 5% reduce 1%
> 07/11/24 23:22:35 INFO mapred.JobClient:  map 10% reduce 1%
> 07/11/24 23:22:46 INFO mapred.JobClient:  map 10% reduce 3%
> 07/11/24 23:24:10 INFO mapred.JobClient:  map 14% reduce 3%
> 07/11/24 23:24:22 INFO mapred.JobClient:  map 19% reduce 3%
> 07/11/24 23:24:34 INFO mapred.JobClient:  map 19% reduce 5%
> 07/11/24 23:29:31 INFO mapred.JobClient:  map 24% reduce 5%
> 07/11/24 23:29:42 INFO mapred.JobClient:  map 24% reduce 6%
> 07/11/24 23:31:37 INFO mapred.JobClient:  map 29% reduce 6%
> 07/11/24 23:31:45 INFO mapred.JobClient:  map 30% reduce 6%
> 07/11/24 23:31:56 INFO mapred.JobClient:  map 30% reduce 10%
> 07/11/24 23:32:12 INFO mapred.JobClient:  map 35% reduce 10%
> 07/11/24 23:32:18 INFO mapred.JobClient:  map 35% reduce 11%
> 07/11/24 23:37:33 INFO mapred.JobClient:  map 40% reduce 11%
> 07/11/24 23:37:39 INFO mapred.JobClient:  map 40% reduce 13%
> 07/11/24 23:37:45 INFO mapred.JobClient:  map 45% reduce 13%
> 07/11/24 23:37:54 INFO mapred.JobClient:  map 45% reduce 15%
> 07/11/24 23:42:02 INFO mapred.JobClient:  map 50% reduce 15%
> 07/11/24 23:52:14 INFO mapred.JobClient:  map 100% reduce 100%
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.runNIsMoreThanOne(PerformanceEvaluation.java:248)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:588)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.doCommandLine(PerformanceEvaluation.java:680)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:699)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>
>
> Checked DFS status again:
> kcd@hadoop07:~/hadoop$ bin/hadoop fsck /
> .....
> /tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data:
> MISSING 1 blocks of total size 0 B.
> .......
> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/434460773960343228/data:
> MISSING 1 blocks of total size 0 B.
> ....................................
> /tmp/hadoop-kcd/hbase/hregion_TestTable,1152738,8007445145595019431/info/mapfiles/5636229081894866081/data:
> MISSING 1 blocks of total size 0 B.
> ....................................................
> ....................................................
> /tmp/hadoop-kcd/hbase/hregion_TestTable,8426559,7501417187825603851/info/mapfiles/6826621922173844792/data:
> MISSING 1 blocks of total size 0 B.
> ...........
> /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1556193815149242895_60020/hlog.dat.025:
> MISSING 1 blocks of total size 0 B.
> .Status: CORRUPT
>  Total size:    1823865158 B
>  Total blocks:  174 (avg. block size 10481983 B)
>  Total dirs:    180
>  Total files:   164
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Target replication factor:     1
>  Real replication factor:       1.0
>
>
> The filesystem under path '/' is CORRUPT
>
> All my logs are uploaded at http://www.cs.duke.edu/~kcd/hadoop-logs2/
>
> The namenode log looks fine until the very end when it has a few of
> these errors:
> 2007-11-24 23:43:00,798 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data.
> blk_-6879410450276751752
> 2007-11-24 23:43:00,798 WARN org.apache.hadoop.fs.FSNamesystem: Not
> able to place enough replicas, still in need of 1
> 2007-11-24 23:43:00,808 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 54310, call
> addBlock(/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data,
> DFSClient_-1049549396) from 172.16.6.56:56970: error:
> java.io.IOException: File
> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
> could only be replicated to 0 nodes, instead of 1
> java.io.IOException: File
> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
> could only be replicated to 0 nodes, instead of 1
> 	at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
> 	at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
> 	at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:585)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
>
> The hbase region servers seem to be having a lot of problems, though,
> according to their logs. Not quite sure what could be the cause.
>
> - Kareem
>
> On Nov 24, 2007 9:56 PM, Ted Dunning <tdunning@veoh.com> wrote:
>   
>> Can you clear the logs, clean out the file system and run the test again?
>>
>> The namenode logs should tell an interesting story.
>>
>>
>>
>> On 11/24/07 6:12 PM, "Kareem Dana" <kareem.dana@gmail.com> wrote:
>>
>>     
>>> I ran hadoop fsck and sure enough the DFS was corrupted. It seems that
>>> the PerformanceEvaluation test is corrupting it. Before I run the
>>> test, I ran fsck and the DFS was reported as HEALTHY. Once the PE
>>> fails, the DFS is reported as corrupt. I tried to simplify my setup
>>> and run the PE again. My new config is as follows:
>>>
>>> hadoop07 - DFS Master, Mapred master, hbase master
>>> hadoop09-10 - 2 hbase region servers
>>> hadoop11-12 - 2 datanodes, task trackers
>>>
>>> mapred.map.tasks = 2
>>> mapred.reduce.tasks = 1
>>> dfs.replication = 1
>>>
>>> I ran the distributed PE in that configuration and it still failed
>>> with similar errors. The output of the hadoop fsck for this run was:
>>>
>>> ..........
>>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/6434881831082231493/data:
>>> MISSING 1 blocks of total size 0 B.
>>> ......................................
>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,11566878,1227092681544002579/info/mapfiles/5263238643231358600/data:
>>> MISSING 1 blocks of total size 0 B.
>>> ....
>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/2024298319068625138/data:
>>> MISSING 1 blocks of total size 0 B.
>>> ....
>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/5071453667327337040/data:
>>> MISSING 1 blocks of total size 0 B.
>>> .........
>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,13932,4738192747521322482/info/mapfiles/4400784113695734765/data:
>>> MISSING 1 blocks of total size 0 B.
>>> ...................................
>>> ........................................................................
>>> /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1823376333333123807_60020/hlog.dat.027:
>>> MISSING 1 blocks of total size 0 B.
>>> .Status: CORRUPT
>>>  Total size:    1890454330 B
>>>  Total blocks:  180 (avg. block size 10502524 B)
>>>  Total dirs:    190
>>>  Total files:   173
>>>  Over-replicated blocks:        0 (0.0 %)
>>>  Under-replicated blocks:       0 (0.0 %)
>>>  Target replication factor:     1
>>>  Real replication factor:       1.0
>>>
>>>
>>> The filesystem under path '/' is CORRUPT
>>>
>>>
>>> On Nov 24, 2007 6:21 PM, Ted Dunning <tdunning@veoh.com> wrote:
>>>       
>>>> I think that stack was suggesting an HDFS fsck, not a disk level fsck.
>>>>
>>>> Try [hadoop fsck /]
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 11/24/07 4:09 PM, "Kareem Dana" <kareem.dana@gmail.com> wrote:
>>>>
>>>>         
>>>>> I do not have root access on the xen cluster I'm using. I will ask the
>>>>> admin to make sure the disk is working properly. Regarding the
>>>>> mismatch versions though, are you suggesting that different region
>>>>> servers might be running different versions of hbase/hadoop? They are
>>>>> all running the same code from the same shared storage. There isn't
>>>>> even another version of hadoop anywhere for the other nodes to run. I
>>>>> think I'll try dropping my cluster down to 2 nodes and working back
>>>>> up... maybe I can pin point a specific problem node. Thanks for taking
>>>>> a look at my logs.
>>>>>
>>>>> On Nov 24, 2007 5:49 PM, stack <stack@duboce.net> wrote:
>>>>>           
>>>>>> I took a quick look Kareem.  As with the last time, hbase keeps having
>>>>>> trouble w/ the hdfs.  Things start out fine around 16:00 then go bad
>>>>>> because it can't write reliably to the hdfs -- a variety of reasons.  You
>>>>>> then seem to restart the cluster around 17:37 or so and things seem to
>>>>>> go along fine for a while until 19:05 when again, all regionservers
>>>>>> report trouble writing the hdfs.  Have you run an fsck?
>>>>>>             
>>>>         
>>     

