Return-Path:
X-Original-To: apmail-hbase-issues-archive@www.apache.org
Delivered-To: apmail-hbase-issues-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B5BE31060A for ; Thu, 29 Aug 2013 20:44:52 +0000 (UTC)
Received: (qmail 28530 invoked by uid 500); 29 Aug 2013 20:44:52 -0000
Delivered-To: apmail-hbase-issues-archive@hbase.apache.org
Received: (qmail 28493 invoked by uid 500); 29 Aug 2013 20:44:52 -0000
Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Delivered-To: mailing list issues@hbase.apache.org
Received: (qmail 28439 invoked by uid 99); 29 Aug 2013 20:44:52 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 29 Aug 2013 20:44:52 +0000
Date: Thu, 29 Aug 2013 20:44:52 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (HBASE-9373) Fix more log spam in replication for 0.96.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/HBASE-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754030#comment-13754030 ]

stack commented on HBASE-9373:
------------------------------

Looking, it looks like corruption given the pb code -- see below.  Odd is that we recover on subsequent read (JD says that replication will reopen the file if it does not get a record because of failed parse).  Let me add a patch w/ more detail around what is going on in here.

{code}
    /**
     * Parse a single field from {@code input} and merge it into this set.
     * @param tag The field's tag number, which was already parsed.
     * @return {@code false} if the tag is an end group tag.
     */
    public boolean mergeFieldFrom(final int tag, final CodedInputStream input)
                                  throws IOException {
      final int number = WireFormat.getTagFieldNumber(tag);
      switch (WireFormat.getTagWireType(tag)) {
        case WireFormat.WIRETYPE_VARINT:
          getFieldBuilder(number).addVarint(input.readInt64());
          return true;
        case WireFormat.WIRETYPE_FIXED64:
          getFieldBuilder(number).addFixed64(input.readFixed64());
          return true;
        case WireFormat.WIRETYPE_LENGTH_DELIMITED:
          getFieldBuilder(number).addLengthDelimited(input.readBytes());
          return true;
        case WireFormat.WIRETYPE_START_GROUP:
          final Builder subBuilder = newBuilder();
          input.readGroup(number, subBuilder,
                          ExtensionRegistry.getEmptyRegistry());
          getFieldBuilder(number).addGroup(subBuilder.build());
          return true;
        case WireFormat.WIRETYPE_END_GROUP:
          return false;
        case WireFormat.WIRETYPE_FIXED32:
          getFieldBuilder(number).addFixed32(input.readFixed32());
          return true;
        default:
          throw InvalidProtocolBufferException.invalidWireType();
      }
    }
{code}

> Fix more log spam in replication for 0.96.0
> -------------------------------------------
>
>                 Key: HBASE-9373
>                 URL: https://issues.apache.org/jira/browse/HBASE-9373
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.95.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.98.0, 0.96.0
>
>
> Two things that are bugging me.
> First this one where we try to be more responsive now and only sleep 1 second if we didn't get data. Let's set it down to TRACE.
> bq.
2013-08-28 23:17:47,421 DEBUG [regionserver60020.replicationSource,1] org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing to replicate, sleeping 1000 times 1
> Then I've seen cases where we can hit an EOF and instead of just being silent we hit this:
> {noformat}
> 2013-08-28 23:16:07,182 ERROR [ReplicationExecutor-0.replicationSource,1-jdec2hbase0403-5,60020,1377730319617] org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader: Invalid PB while reading WAL, probably an unexpected EOF, ignoring
> com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
> 	at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:99)
> 	at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:498)
> 	at com.google.protobuf.GeneratedMessage.parseUnknownField(GeneratedMessage.java:193)
> 	at org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey.(WALProtos.java:686)
> 	at org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey.(WALProtos.java:644)
> 	at org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey$1.parsePartialFrom(WALProtos.java:771)
> 	at org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey$1.parsePartialFrom(WALProtos.java:766)
> 	at org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey$Builder.mergeFrom(WALProtos.java:1444)
> 	at org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey$Builder.mergeFrom(WALProtos.java:1218)
> 	at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:220)
> 	at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:912)
> 	at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
> 	at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:290)
> 	at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:926)
> 	at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:296)
> 	at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:918)
> 	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:197)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:98)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:89)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:390)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:298)
> {noformat}
> The problem here is it shows up as an ERROR, so the intention is that there really could be a problem? Or would it manifest itself in some other way anyway if we silence this exception? [~stack]? FWIW I verified that I had all my data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
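
For anyone reading the "invalid wire type" message in the trace above: a protobuf tag carries the wire type in its low 3 bits and the field number in the remaining bits. Wire types 0 through 5 are the ones handled by the switch in mergeFieldFrom; the values 6 and 7 fall through to the default branch. The sketch below only mirrors that bit layout to show which tag values would land in the default branch. WireTypeDemo and its helpers are hypothetical names for illustration, not protobuf classes, and the sample tags are made up rather than taken from the WAL in question.

```java
public class WireTypeDemo {
    // The wire type occupies the low 3 bits of a tag (illustrative constants).
    static final int TAG_TYPE_BITS = 3;
    static final int TAG_TYPE_MASK = (1 << TAG_TYPE_BITS) - 1;

    // Low 3 bits: the wire type.
    static int wireType(int tag) {
        return tag & TAG_TYPE_MASK;
    }

    // Remaining high bits: the field number.
    static int fieldNumber(int tag) {
        return tag >>> TAG_TYPE_BITS;
    }

    // Wire types 0..5 are the cases in the switch; 6 and 7 hit the default branch.
    static boolean isKnownWireType(int tag) {
        return wireType(tag) <= 5;
    }

    public static void main(String[] args) {
        // Made-up sample tags: two well-formed, two with wire types 6 and 7.
        int[] tags = {0x0A, 0x10, 0x0E, 0x0F};
        for (int tag : tags) {
            System.out.printf("tag=0x%02X field=%d wireType=%d known=%b%n",
                tag, fieldNumber(tag), wireType(tag), isKnownWireType(tag));
        }
    }
}
```

Two of the eight possible low-bit patterns are rejected, so a parser that reads a tag from bytes that are not actually a tag (for instance, from the wrong offset or from a partially written record) can plausibly end up in that default branch. This does not say which of those happened here.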