kylin-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tuo Zhu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KYLIN-4000) Merge task hang at
Date Fri, 10 May 2019 17:39:00 GMT
Tuo Zhu created KYLIN-4000:
------------------------------

             Summary: Merge task hang at 
                 Key: KYLIN-4000
                 URL: https://issues.apache.org/jira/browse/KYLIN-4000
             Project: Kylin
          Issue Type: Bug
          Components: Job Engine
    Affects Versions: v2.6.1
            Reporter: Tuo Zhu


Auto triggered merge tasks all hanged at step: Convert Cuboid Data to HFile. The last reduce
step in "Kylin_HFile_Generator_mycube_Step".

Hadoop UI shows job status:
 attempt_1555603539483_1251_r_000000_0 78.56 RUNNING reduce > reduce

>From the syslog, following outputs are printed at first few hours:

{code}
 2019-05-10 22:27:48,699 INFO [IPC Server handler 27 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl:
Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.8629564
 2019-05-10 22:28:24,743 INFO [Socket Reader #1 for port 46084|#1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server:
Auth successful for job_1555603539483_1149 (auth:SIMPLE)
 2019-05-10 22:28:24,746 INFO [IPC Server handler 23 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl:
Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.86303306
 2019-05-10 22:29:00,794 INFO [Socket Reader #1 for port 46084|#1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server:
Auth successful for job_1555603539483_1149 (auth:SIMPLE)
 2019-05-10 22:29:00,796 INFO [IPC Server handler 20 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl:
Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.86305153
 2019-05-10 22:29:36,833 INFO [Socket Reader #1 for port 46084|#1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server:
Auth successful for job_1555603539483_1149 (auth:SIMPLE)
 2019-05-10 22:29:36,836 INFO [IPC Server handler 3 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl:
Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.8634469
 2019-05-10 22:30:12,883 INFO [Socket Reader #1 for port 46084|#1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server:
Auth successful for job_1555603539483_1149 (auth:SIMPLE)
 2019-05-10 22:30:12,887 INFO [IPC Server handler 18 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl:
Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.8643631
{code}


 After that, only ZK logs keeps printed following lines constantly:

{code}
2019-05-11 00:02:23,445 WARN [Thread-1388778] org.apache.hadoop.hdfs.DataStreamer: These favored
nodes were specified but not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
 2019-05-11 00:02:23,452 INFO [main] org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Connect
0x29758ab9 to boron:2181,carbon:2181,neon:2181 with session timeout=90000ms, retries 6, retry
interval 1000ms, keepAlive=60000ms
 2019-05-11 00:02:23,452 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9]
org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=bo
 ron:2181,carbon:2181,neon:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$24/171414890@19fbce98
 2019-05-11 00:02:23,452 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
org.apache.zookeeper.ClientCnxn: Opening socket connection to server boron/172.16.211.20:2181.
Will not attempt to authenticate using SASL (unknown error)
 2019-05-11 00:02:23,453 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client:
/172.16.211.8:52214, server: boron/172.16.211.20:2181
 2019-05-11 00:02:23,453 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
org.apache.zookeeper.ClientCnxn: Session establishment complete on server boron/172.16.211.20:2181,
sessionid = 0x16a061f0770a354, negotiated timeout = 60000
 2019-05-11 00:02:23,455 INFO [main] org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Close
zookeeper connection 0x29758ab9 to boron:2181,carbon:2181,neon:2181
 2019-05-11 00:02:23,455 INFO [main] org.apache.hadoop.hbase.io.hfile.CacheConfig: Created
cacheConfig: CacheConfig:disabled
 2019-05-11 00:02:23,456 INFO [main] org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2:
Writer=hdfs://berylium:8020/kylin/kylin_metadata/kylin-772502e8-c126-fec2-5808-08aa2df1fdea/pbs_uv_daily/hfile/_temporary/1/_temporary/attempt_1555603539483_1149_r_000000_0/F1/fdbf1bd8933242f2bd1da40a03c7a635,
wrote=43
 2019-05-11 00:02:23,457 WARN [Thread-1388781] org.apache.hadoop.hdfs.DataStreamer: These
favored nodes were specified but not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
{code}

It lasts for tens of hours. No error, no exceptions. In the end, it created too many blocks
in HDFS. It only happens when merge task is triggered. Daily build jobs went very good.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message