From: Alexandre Normand
Date: Sun, 5 Feb 2017 11:32:28 -0800
Subject: Seeking advice on skipped/lost data during data migration from and to a hbase table
To: user@hbase.apache.org
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary=f403045fc1a0498eef0547cd933e
archived-at: Sun, 05 Feb 2017 19:32:37 -0000

--f403045fc1a0498eef0547cd933e
Content-Type: text/plain; charset=UTF-8

We're migrating data from a previous iteration of a table to a new one. The
process involves an MR job that scans data from the source table and writes
the equivalent data to the new table. The source table has 6000+ regions and
splits frequently because we're still ingesting time series data into it. We
used buffered writes on the destination side, and we have a YARN resource pool
to limit the concurrent writing.

First, I should say that this job took a long time but still mostly worked.
However, we've built a mechanism that fetches the same requested data from
each of the two tables and compares the results, and it found that some rows
(0.02%) are missing from the destination. We've ruled out a few things
already:

* A functional bug in the job that would have resulted in skipping that 0.02%
  of the rows.
* The possibility that the data did not yet exist when the migration job
  initially ran.

At a high level, the suspects could be:

* The source table splitting could have resulted in some input keys not being
  read. However, since an HBase split is defined by a startKey/endKey, we
  would not expect this unless there were a bug in there somehow.
* The writing/flushing losing a batch. Since we buffer writes and flush
  everything in the cleanup of each map task, we would expect write failures
  to cause task failures/retries and therefore not to be a problem in the end.
  Given that this flush is synchronous and, according to our understanding,
  completes when the data is in the WAL and memstore, this also seems unlikely
  unless there's a bug.

I should add that we've extracted a sample of 1% of the source rows (doing all
of them is really time consuming because of the size of the data) and found
that missing data often appears in clusters of the source HBase row keys. This
doesn't really help point to the scan side or the write side (since a failure
in either would produce similar output), but we thought it was interesting.
That said, we do have a few missing keys that aren't clustered. This could be
because we've only run the comparison on 1% of the data, or it could be that
whatever is causing this can also affect very isolated cases.

We're now trying to understand how this could have happened, both to
understand how it could impact other jobs/applications and to increase our
confidence before we write a modified version of the migration job to
re-migrate the skipped/missing data.

Any ideas or advice would be much appreciated.

Thanks!

-- Alex

--f403045fc1a0498eef0547cd933e--
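One way to probe the first suspect (scan ranges missing part of the key space) is to check whether the startKey/endKey ranges the job actually processed tile the key space without gaps, and whether the missing row keys fall inside any gap. The following is a minimal Python sketch, not part of the original job. It assumes the ranges have already been collected somewhere (for example from task logs); the names `find_gaps` and `keys_in_gaps` and the sample ranges are hypothetical. Keys are compared as raw bytes, start keys are inclusive, end keys are exclusive, and an empty key means "unbounded", which matches the usual HBase convention.

```python
from typing import List, Tuple

# (start_key inclusive, end_key exclusive); b"" means unbounded on that side.
KeyRange = Tuple[bytes, bytes]


def find_gaps(ranges: List[KeyRange]) -> List[KeyRange]:
    """Return the key intervals not covered by any of the given scan ranges."""
    gaps: List[KeyRange] = []
    covered_to = b""   # every key below this one is covered so far
    unbounded = False  # becomes True once a range with an open end is seen
    for start, end in sorted(ranges, key=lambda r: r[0]):
        if unbounded:
            break  # everything from here on is already covered
        if start > covered_to:
            gaps.append((covered_to, start))
        if end == b"":
            unbounded = True
        elif end > covered_to:
            covered_to = end
    if not unbounded:
        gaps.append((covered_to, b""))  # tail of the key space was never scanned
    return gaps


def keys_in_gaps(missing: List[bytes], gaps: List[KeyRange]) -> List[bytes]:
    """Return the missing row keys that fall inside one of the gaps."""
    return [
        key
        for key in missing
        if any(lo <= key and (hi == b"" or key < hi) for lo, hi in gaps)
    ]


if __name__ == "__main__":
    # Hypothetical ranges with a hole between b"m" and b"p".
    scanned = [(b"", b"b"), (b"b", b"m"), (b"p", b"")]
    gaps = find_gaps(scanned)
    print(gaps)
    print(keys_in_gaps([b"c", b"n"], gaps))
```

If no gaps are found, or none of the missing keys fall inside one, incomplete scan coverage becomes a less likely explanation and the write/flush path deserves a closer look. This check alone cannot prove either cause.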