hawq-dev mailing list archives

From "Ming LI (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HAWQ-1094) Select on INTERNAL table returns wrong results when hdfs blocks have checksum errors
Date Mon, 10 Oct 2016 14:39:20 GMT
Ming LI created HAWQ-1094:
-----------------------------

             Summary: Select on INTERNAL table returns wrong results when hdfs blocks have checksum errors
                 Key: HAWQ-1094
                 URL: https://issues.apache.org/jira/browse/HAWQ-1094
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Fault Tolerance
            Reporter: Ming LI
            Assignee: Lei Chang


I created a Parquet table and inserted the following values into it:

{code}
sr37228_repro=# select * from number;
 id
----
  1
  1
  1
  1
  1
(5 rows)
{code}

I then modified the data in two of the three replicas of the table's single HDFS block and tried reading the data again.

{code}
Modifying contents of internal table blocks...

Found hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 in hdfs

Modifying block /hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008 on 172.28.21.155
block_script.sh                               100%  228     0.2KB/s   00:00
Modifying block /hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008 on 172.28.21.156
block_script.sh                               100%  228     0.2KB/s   00:00

Running count query again, this time with bad data in two of the three blocks
 count |    id
-------+----------
     1 |        0
     2 |        1
     1 | 16777216
     1 | 16777217
(4 rows)


Checking Showing file health:

Checking hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 health
Connecting to namenode via http://hdm1.hdp.local:50070/fsck?ugi=gpadmin&blocks=1&locations=1&files=1&path=%2Fhawq_default%2F16385%2F16543%2F17000%2F10
FSCK started by gpadmin (auth:SIMPLE) from /172.28.21.157 for path /hawq_default/16385/16543/17000/10
at Mon Sep 26 12:07:53 PDT 2016
/hawq_default/16385/16543/17000/10 206 bytes, 1 block(s):  OK
0. BP-2023073008-172.28.21.63-1462922052672:blk_1073742008_1186 len=206 repl=3 [DatanodeInfoWithStorage[172.28.21.155:50010,DS-1a18c785-48e5-4ab8-9228-b3f6857b952a,DISK],
DatanodeInfoWithStorage[172.28.19.211:50010,DS-6bf49ae7-6745-448b-803d-d12d93acad1d,DISK],
DatanodeInfoWithStorage[172.28.21.156:50010,DS-d22b0f7f-7065-42c4-bb66-ea361ec5e56a,DISK]]

Status: HEALTHY
 Total size:    206 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      1 (avg. block size 206 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Mon Sep 26 12:07:53 PDT 2016 in 0 milliseconds
{code}
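
The wrong values in the count output above (0, 16777216, 16777217) look like small byte-level changes to the original value 1. The sketch below compares the encodings, assuming the id column is stored as a plain 4-byte little-endian integer. That is an assumption: the table definition and the Parquet encoding are not shown in this report, and the helper names are made up for illustration.

```python
import struct

# Distinct id values returned after the replicas were modified (from the output above).
observed = [0, 1, 16777216, 16777217]


def le_bytes(value):
    """Encode value as a 4-byte little-endian signed integer (assumed layout)."""
    return struct.pack("<i", value)


def differing_offsets(a, b):
    """Return the byte offsets at which two equal-length byte strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


original = le_bytes(1)  # every row originally held id = 1
for value in observed:
    encoded = le_bytes(value)
    print(value, encoded.hex(), differing_offsets(original, encoded))
```

Under that assumption, each wrong value differs from the original only in the first and/or last byte of the 4-byte field, which is consistent with the modified replica data being returned to the scan unverified.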

When setupBlockReader reads the corrupted replica through LocalBlockReader, the reader correctly detects
the bad checksum.

{code}
2016-09-26 13:02:09.267021 PDT,,,p380682,th795609216,,,,0,,,seg-10000,,,,,"LOG","00000","Resource
manager discovered local host IPv4 address 127.0.0.1",,,,,,,0,,"network_utils.c",210,
2016-09-26 13:02:09.267171 PDT,,,p380682,th795609216,,,,0,,,seg-10000,,,,,"LOG","00000","Resource
manager discovered local host IPv4 address 172.28.21.155",,,,,,,0,,"network_utils.c",210,
2016-09-26 13:02:16.239048 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG1","00000","Dropping in memory mapping
OidInMemHeapMapping",,,,,,"SET log_min_messages TO 'debug5'",0,,"cdbinmemheapam.c",293,
2016-09-26 13:02:16.239289 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","00000","CommitTransactionCommand",,,,,,"SET
log_min_messages TO 'debug5'",0,,"postgres.c",3131,
2016-09-26 13:02:16.239435 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","00000","CommitTransaction",,,,,,"SET
log_min_messages TO 'debug5'",0,,"xact.c",5103,
2016-09-26 13:02:16.239819 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","00000","name: unnamed; blockState:
      STARTED; state: INPROGR, xid/subid/cid: 6227/1/0, nestlvl: 1, children: <>",,,,,,"SET
log_min_messages TO 'debug5'",0,,"xact.c",5128,
2016-09-26 13:02:16.239978 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG1","00000","Dropping in memory mapping
OidInMemOnlyMapping",,,,,,"SET log_min_messages TO 'debug5'",0,,"cdbinmemheapam.c",293,
2016-09-26 13:02:25.600367 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,0,con143,,seg1,,,,,"DEBUG5","00000","First char: 'M'; gp_role = 'execute'.",,,,,,,0,,"postgres.c",4737,
2016-09-26 13:02:25.600639 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","Message type M received by from libpq,
len = 1412",,,,,,,0,,"postgres.c",4813,
2016-09-26 13:02:25.600742 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG5","00000","MPP dispatched stmt from QD: explain
analyze select * from number;.",,,,,,,0,,"postgres.c",4893,
2016-09-26 13:02:25.600847 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","SetupProcessIdentity: receive msg:
ProcessIdentity_Begin_slice_1_idx_0_gang_1_cmd_74_writer_t_End_ProcessIdentity",,,,,,,0,,"identity.c",365,
2016-09-26 13:02:25.600997 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","ProcessIdentity is not init",,,,,,,0,,"identity.c",599,
2016-09-26 13:02:25.601129 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","ProcessIdentity: slice 1 id 0 gang
num 1 writer t",,,,,,,0,,"identity.c",602,
2016-09-26 13:02:25.601250 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,0,con143,cmd74,seg0,slice1,,,,"DEBUG5","00000","Get a temporary directory:/tmp/hawq/segment",,,,,,,0,,"cdbtmpdir.c",48,
2016-09-26 13:02:25.601351 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,0,con143,cmd74,seg0,slice1,,,,"DEBUG1","00000","getLocalTmpDirFromSegmentConfig
session_id:143 command_id:74 qeidx:0 tmpdir:/tmp/hawq/segment",,,,,,,0,,"identity.c",418,
2016-09-26 13:02:25.601784 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,0,con143,cmd74,seg0,slice1,,,,"DEBUG3","00000","StartTransactionCommand",,,,,,"explain
analyze select * from number;",0,,"postgres.c",3107,
2016-09-26 13:02:25.602075 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG3","00000","StartTransaction",,,,,,"explain
analyze select * from number;",0,,"xact.c",5103,
2016-09-26 13:02:25.602195 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG3","00000","name: unnamed; blockState:
      DEFAULT; state: INPROGR, xid/subid/cid: 6228/1/0, nestlvl: 1, children: <>",,,,,,"explain
analyze select * from number;",0,,"xact.c",5128,
2016-09-26 13:02:25.602578 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 0 key 17000
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.602703 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 1 key 17000
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.602836 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 2 key 17000
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.602994 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 3 key 17000
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.603104 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 4 key 17000
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.603211 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 5 key 17000
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.603317 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 6 key 17000
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.603572 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 7 key 17000
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.603751 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 8 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.603881 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 9 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.604003 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 10 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.604110 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 11 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.604216 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 12 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.604323 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 13 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.604555 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 14 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.604697 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 15 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.604848 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 16 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.604959 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 17 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.605064 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 18 key 17002
relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
2016-09-26 13:02:25.605591 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG3","00000","Resource enforcer
finds cpu sub-system is disabled",,,,,,"explain analyze select * from number;",0,,"resourceenforcer.c",908,
2016-09-26 13:02:25.605716 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG2","00000","Current nice level
of the process: 19",,,,,,"explain analyze select * from number;",0,,"postgres.c",283,
2016-09-26 13:02:25.605856 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG2","00000","Reniced process to
level 19",,,,,,"explain analyze select * from number;",0,,"postgres.c",302,
2016-09-26 13:02:25.606073 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG5","00000","GetSnapshotData setting
globalxmin and xmin to 6228",,,,,,"explain analyze select * from number;",0,,"procarray.c",552,
2016-09-26 13:02:25.606306 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","Inserted entry for
query (sessionid=143, commandcnt=74)",,,,,,"explain analyze select * from number;",0,,"workfile_queryspace.c",283,
2016-09-26 13:02:25.606748 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","Have both IPv6 and
IPv4 choices",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1291,
2016-09-26 13:02:25.606978 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","receive socket ai_family
10 ai_socktype 2 ai_protocol 17",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1303,
2016-09-26 13:02:25.607098 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","receive socket 6 ai_family
10 ai_socktype 2 ai_protocol 17",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1307,
2016-09-26 13:02:25.607207 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","bind addrlen 28 fam
10",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1318,
2016-09-26 13:02:25.607320 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit default
buffer size 124928 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2200,
2016-09-26 13:02:25.607555 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit use buffer
size 2097152 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2215,
2016-09-26 13:02:25.607678 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit default
buffer size 124928 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2200,
2016-09-26 13:02:25.607787 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit use buffer
size 2097152 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2215,
2016-09-26 13:02:25.607939 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","GetSockAddr socket
ai_family 2 ai_socktype 2 ai_protocol 17 for 172.28.21.157",,,,,,"explain analyze select *
from number;",0,,"ic_udp.c",3058,
2016-09-26 13:02:25.608052 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","We are inet6, remote
is inet.  Converting to v4 mapped address.",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",3137,
2016-09-26 13:02:25.608249 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 0 key 17000
for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
2016-09-26 13:02:25.608706 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 1 key 17000
for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
2016-09-26 13:02:25.608836 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 2 key 17000
for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
2016-09-26 13:02:25.608966 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 3 key 17000
for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
2016-09-26 13:02:25.609083 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 4 key 17000
for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
2016-09-26 13:02:25.609200 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 5 key 17000
for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
2016-09-26 13:02:25.609316 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 6 key 17000
for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
2016-09-26 13:02:25.609657 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 7 key 17000
for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
2016-09-26 13:02:25.613152 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG5","00000","Parquet metadata file
footer length index: 198",,,,,,"explain analyze select * from number;",0,,"cdbparquetfooterprocessor.c",141,
2016-09-26 13:02:25.676719 PDT,,,p380675,th795609216,,,,0,,,seg-10000,,,,,"LOG","00000","3rd
party error log:
2016-09-26 13:02:25.676477, p384452, th140708219193472, ERROR cannot setup block reader for
Block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186]
file /hawq_default/16385/16543/17000/10 on Datanode: hdw2.hdp.local(172.28.21.155).
LocalBlockReader.cpp: 127: HdfsIOException: Failed to construct LocalBlockReader for block:
[block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186].
        @       Hdfs::Internal::LocalBlockReader::LocalBlockReader(boost::shared_ptr<Hdfs::Internal::ReadShortCircuitInfo>
const&, Hdfs::Internal::ExtendedBlock const&, long, bool, Hdfs::Internal::SessionConfig&,
std::vector<char, std::allocator<char> >&)
        @       Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
        @       Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
        @       Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
        @       Hdfs::Internal::InputStreamImpl::read(char*, int)
        @       hdfsRead
        @       gpfs_hdfs_read
        @       HdfsRead
        @       FileRead
        @       readParquetFooter
        @       ParquetStorageRead_OpenFile
        @       parquet_getnext
        @       ParquetScanNext
        @       ExecTableScan
        @       ExecProcNode
        @       ExecMotion
        @       ExecProcNode
        @       ExecutePlan
        @       ExecutorRun
        @       PortalRunSelect
        @       PortalRun
        @       PostgresMain
        @       BackendStartup
        @       ServerLoop
        @       PostmasterMain
        @       main
        @       __libc_start_main
        @       Unknown
Caused by
LocalBlockReader.cpp: 283: HdfsIOException: LocalBlockReader failed to skip from position:
0, length: 0, block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186].
        @       Hdfs::Internal::LocalBlockReader::skip(long)
        @       Hdfs::Internal::LocalBlockReader::LocalBlockReader(boost::shared_ptr<Hdfs::Internal::ReadShortCircuitInfo>
const&, Hdfs::Internal::ExtendedBlock const&, long, bool, Hdfs::Internal::SessionConfig&,
std::vector<char, std::allocator<char> >&)
        @       Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
        @       Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
        @       Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
        @       Hdfs::Internal::InputStreamImpl::read(char*, int)
        @       hdfsRead
        @       gpfs_hdfs_read
        @       HdfsRead
        @       FileRead
        @       readParquetFooter
        @       ParquetStorageRead_OpenFile
        @       parquet_getnext
        @       ParquetScanNext
        @       ExecTableScan
        @       ExecProcNode
        @       ExecMotion
        @       ExecProcNode
        @       ExecutePlan
        @       ExecutorRun
        @       PortalRunSelect
        @       PortalRun
        @       PostgresMain
        @       BackendStartup
        @       ServerLoop
        @       PostmasterMain
        @       main
        @       __libc_start_main
        @       Unknown
Caused by
LocalBlockReader.cpp: 156: ChecksumException: LocalBlockReader checksum not match for block:
[block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186]
        @       Hdfs::Internal::LocalBlockReader::readAndVerify(int)
        @       Hdfs::Internal::LocalBlockReader::skip(long)
        @       Hdfs::Internal::LocalBlockReader::LocalBlockReader(boost::shared_ptr<Hdfs::Internal::ReadShortCircuitInfo>
const&, Hdfs::Internal::ExtendedBlock const&, long, bool, Hdfs::Internal::SessionConfig&,
std::vector<char, std::allocator<char> >&)
        @       Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
        @       Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
        @       Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
        @       Hdfs::Internal::InputStreamImpl::read(char*, int)
        @       hdfsRead
        @       gpfs_hdfs_read
        @       HdfsRead
        @       FileRead
        @       readParquetFooter
        @       ParquetStorageRead_OpenFile
        @       parquet_getnext
        @       ParquetScanNext
        @       ExecTableScan
        @       ExecProcNode
        @       ExecMotion
        @       ExecProcNode
        @       ExecutePlan
        @       ExecutorRun
        @       PortalRunSelect
        @       PortalRun
        @       PostgresMain
        @       BackendStartup
        @       ServerLoop
        @       PostmasterMain
        @       main
        @       __libc_start_main
        @       Unknown

retry the same node but disable read shortcircuit feature",,,,,,,,"SysLoggerMain","syslogger.c",518,
2016-09-26 13:02:25.680638 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26

{code}

Although LocalBlockReader correctly detects the bad checksum, the subsequent read through
RemoteBlockReader does not appear to detect it, and the read is allowed to go through.

{code}
sr37228_repro=# select * from number;
    id
----------
 16777217
 16777216
        0
        1
        1
(5 rows)

Checking hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 health

Connecting to namenode via http://hdm1.hdp.local:50070/fsck?ugi=gpadmin&blocks=1&locations=1&files=1&path=%2Fhawq_default%2F16385%2F16543%2F17000%2F10
FSCK started by gpadmin (auth:SIMPLE) from /172.28.21.157 for path /hawq_default/16385/16543/17000/10
at Mon Sep 26 12:07:53 PDT 2016
/hawq_default/16385/16543/17000/10 206 bytes, 1 block(s):  OK
0. BP-2023073008-172.28.21.63-1462922052672:blk_1073742008_1186 len=206 repl=3 [DatanodeInfoWithStorage[172.28.21.155:50010,DS-1a18c785-48e5-4ab8-9228-b3f6857b952a,DISK],
DatanodeInfoWithStorage[172.28.19.211:50010,DS-6bf49ae7-6745-448b-803d-d12d93acad1d,DISK],
DatanodeInfoWithStorage[172.28.21.156:50010,DS-d22b0f7f-7065-42c4-bb66-ea361ec5e56a,DISK]]

Status: HEALTHY
 Total size:    206 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      1 (avg. block size 206 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Mon Sep 26 12:07:53 PDT 2016 in 0 milliseconds


The filesystem under path '/hawq_default/16385/16543/17000/10' is HEALTHY
{code}

InputStreamImpl::setupBlockReader appears to behave as follows:

1. Attempt to read the block locally using LocalBlockReader
2. If the local block read fails, attempt to read the block from the next available node using
RemoteBlockReader
3. Continue to read all the available blocks using RemoteBlockReader until we have no more
blocks to read.
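
The fallback I would expect from the steps above can be sketched as follows. This is not the libhdfs3 implementation; `read_block`, `ChecksumError`, and the reader callables are hypothetical stand-ins, and the point is only that every reader, local or remote, goes through the same verification before its data is returned.

```python
class ChecksumError(Exception):
    """Raised when no replica passes checksum verification."""


def read_block(readers, verify):
    """Try each reader in order and return the first payload that verifies.

    readers: ordered (name, read_fn) pairs; hypothetical stand-ins for a
             local reader followed by remote readers.
    verify:  callable returning True when the payload matches its checksum.
    Returns (data, failed_names) so the caller can see which replicas failed.
    """
    failed = []
    for name, read_fn in readers:
        data = read_fn()
        if verify(data):
            return data, failed
        failed.append(name)  # remember the replica that failed verification
    raise ChecksumError("all replicas failed verification: %s" % failed)


# Usage: two corrupted replicas and one intact replica, as in this report.
good = b"\x01\x00\x00\x00"
readers = [
    ("local", lambda: b"\x00\x00\x00\x00"),
    ("remote-1", lambda: b"\x01\x00\x00\x01"),
    ("remote-2", lambda: good),
]
data, failed = read_block(readers, verify=lambda payload: payload == good)
print(data, failed)
```

In this sketch the list of failed replicas is handed back to the caller, which is the information a client would need in order to report corrupt replicas.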

In this case, the RemoteBlockReader appears to ignore the bad checksum in the block, and returns
wrong results.

Questions:

1. When we detect a bad checksum on the local block, why do we not mark the block as corrupt
with the NameNode?
2. When we read the block using RemoteBlockReader, why doesn't it detect the bad block?
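
For reference on question 2, HDFS keeps one checksum per `dfs.bytes-per-checksum` bytes of block data (512 by default, CRC32C by default), and a reader is expected to recompute and compare them. The sketch below illustrates that per-chunk comparison only. It uses `zlib.crc32` (CRC-32, not CRC32C) as a stand-in, and the function names are hypothetical rather than anything from libhdfs3.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # HDFS default for dfs.bytes-per-checksum


def chunk_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Compute one CRC per chunk_size bytes (the last chunk may be shorter)."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def corrupt_chunks(data, expected, chunk_size=BYTES_PER_CHECKSUM):
    """Return the indexes of chunks whose CRC differs from the stored value.

    Assumes data has the same length as when expected was computed.
    """
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Usage: a 206-byte block (the size reported by fsck), with one byte changed.
block = bytes(range(206))
stored = chunk_checksums(block, chunk_size=64)
tampered = bytearray(block)
tampered[130] ^= 0xFF  # flip one byte inside the third 64-byte chunk
print(corrupt_chunks(bytes(tampered), stored, chunk_size=64))
```

A reader that performs this comparison on the remote path would reject the modified replicas in the same way LocalBlockReader does.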





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
