Date: Thu, 5 Apr 2018 19:54:00 +0000 (UTC)
From: "Ari Uka (JIRA)" <jira@apache.org>
To: jira@kafka.apache.org
Message-ID: <JIRA.13146066.1521399875000.195453.1522958040170@Atlassian.JIRA>
In-Reply-To: <JIRA.13146066.1521399875000@Atlassian.JIRA>
References: <JIRA.13146066.1521399875000@Atlassian.JIRA> <JIRA.13146066.1521399875746@jira-lw-us.apache.org>
Subject: [jira] [Commented] (KAFKA-6679) Random corruption (CRC validation issues)


    [ https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427504#comment-16427504 ]

Ari Uka commented on KAFKA-6679:
--------------------------------

Similar issue: https://issues.apache.org/jira/browse/KAFKA-3240

> Random corruption (CRC validation issues)
> ------------------------------------------
>
>                 Key: KAFKA-6679
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6679
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, replication
>    Affects Versions: 0.10.2.0, 1.0.1
>         Environment: FreeBSD 11.0-RELEASE-p8
>            Reporter: Ari Uka
>            Priority: Major
>
> I'm running into a really strange issue in production. I have 3 brokers, and at random, consumers start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now, but the issue started on 0.10.2; I upgraded in the hope that it would fix the issue.
> On the Kafka side, I see errors related to this across all 3 brokers:
> ```
> [2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
> [2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
> ```
>
> To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I find a non-corrupt message, then push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.
> The error popped up again the next day after I fixed it, though, so I'm trying to find the root cause.
> I'm using the Go consumer libraries [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster].
> At first, I thought it could be the consumer libraries, but the error also happens with kafka-console-consumer.sh when a specific message is corrupted in Kafka. I don't think it's possible for Kafka producers to actually push corrupt messages to Kafka and then cause all consumers to break, right? I assume Kafka would reject corrupt messages, so I'm not sure what's going on here.
> Should I just re-create the cluster? I don't think it's a hardware failure across all 3 machines, though.
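
For readers following the workaround described in the report, the binary search for the first non-corrupt offset could look roughly like the sketch below. It is only an illustration, not the reporter's actual procedure: `first_readable_offset` and the `is_readable` probe are hypothetical names, and the probe is assumed to wrap a single fetch attempt at the given offset, returning False when the fetch fails with a CRC error. The sketch also assumes the corrupt offsets form one contiguous run starting at the stuck offset, so that everything from the first readable offset onward is readable.

```python
from typing import Callable


def first_readable_offset(lo: int, hi: int,
                          is_readable: Callable[[int], bool]) -> int:
    """Return the smallest offset in [lo, hi] that can be read.

    Assumption: is_readable is False for a contiguous run of offsets
    starting at lo and True for every offset after that run.
    """
    if not is_readable(hi):
        raise ValueError("upper bound is not readable; widen the range")
    while lo < hi:
        mid = (lo + hi) // 2
        if is_readable(mid):
            hi = mid        # mid is fine; the answer is mid or earlier
        else:
            lo = mid + 1    # mid is corrupt; the answer is after mid
    return lo


# Simulated probe: offsets 23848795..23848900 are pretended to be corrupt.
# A real probe would attempt a fetch at the offset (hypothetical here).
SIMULATED_CORRUPT = range(23848795, 23848901)


def fake_probe(offset: int) -> bool:
    return offset not in SIMULATED_CORRUPT


print(first_readable_offset(23848795, 23850000, fake_probe))  # 23848901
```

The returned offset is the one the consumer group would then be moved to with kafka-consumer-groups.sh. If corruption is scattered rather than contiguous, the monotonicity assumption fails and the search may land between two corrupt regions.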


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)