Date: Wed, 25 Feb 2015 12:57:42 -0600
Subject: Re: HBase scan time range, inconsistency
From: Stephen Durfey
To: user@hbase.apache.org

> What's the TTL setting for your table?
>
> Which hbase release are you using?
>
> Was there compaction in between the scans?
>
> Thanks

The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don't want to say compactions aren't a factor, but the jobs are short-lived (4-5 minutes), and I have run them frequently over the last couple of days, trying to gather stats around what was being extracted and to find the difference and intersection in row keys between job runs.

> These numbers have varied wildly, from being off by 2-3 between subsequent scans to 40 row increases, followed by a drop of 70 rows.
>
> When you say there is a variation in the number of rows retrieved - the 40 rows that got increased - are those rows in the expected time range? Or is the system retrieving some rows which are not in the specified time range?
>
> And when the rows drop by 70, are you using any row which was needed to be retrieved got missed out?

As best I can tell, if there is an increase in counts, those rows are not coming from outside of the time range. In the job, I am maintaining a list of rows that have a timestamp outside of my provided time range, and then writing those out to hdfs at the end of the map task.
So far, nothing has been written out.

> Any filters in your scan?
>
> Regards
> Ram

There are some column filters. There is an API abstraction on top of hbase that I am using to allow users to easily extract data from columns that start with a provided column prefix. So, the column filters are in place to ensure I am only getting back data from columns that start with the provided prefix.

To add a little more detail, my row keys are separated out by partition. At periodic times (through oozie), data is loaded from a source into the appropriate partition. I ran some scans against a partition that hadn't been updated in almost a year (with a scan range around the times of the 2nd to last load into the table), and the row key counts were consistent across multiple scans. I then chose another partition that is actively being updated once a day. With a scan time around the 4th most recent load, the results were inconsistent from scan to scan (fluctuating up and down). With the begin time set to 4 days in the past and the end time on the scan range set to 'right now', using System.currentTimeMillis() (with the time being after the daily load), the results also fluctuated up and down. So, it kind of seems like there is some sort of temporal recency that is causing the counts to fluctuate.

On Feb 24, 2015, at 10:20 PM, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:

These numbers have varied wildly, from being off by 2-3 between subsequent scans to 40 row increases, followed by a drop of 70 rows.

When you say there is a variation in the number of rows retrieved - the 40 rows that got increased - are those rows in the expected time range? Or is the system retrieving some rows which are not in the specified time range?

And when the rows drop by 70, are you using any row which was needed to be retrieved got missed out?

Any filters in your scan?

Regards
Ram

On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu wrote:

What's the TTL setting for your table?
Which hbase release are you using?

Was there compaction in between the scans?

Thanks

On Feb 24, 2015, at 2:32 PM, Stephen Durfey wrote:

I have some code that accepts a time range and looks for data written to an HBase table during that range. If anything has been written for a row during that range, the row key is saved off, and sometime later in the pipeline those row keys are used to extract the entire row. I'm testing against a fixed time range, at some point in the past. This is being done as part of a Map/Reduce job (using Apache Crunch). I have some job counters set up to keep track of the number of rows extracted. Since the time range is fixed, I would expect the scan to return the same number of rows with data in the provided time range. However, I am seeing this number vary from scan to scan (bouncing between increasing and decreasing).

I've eliminated the possibility that data is being pulled in from outside the time range. I did this by scanning for one column qualifier (and only using this as the qualifier for whether a row had data in the time range), getting the timestamp on the cell for each returned row, and comparing it against the begin and end times for the scan. I didn't find any that fell outside that range. I've observed some row keys show up in the 1st scan, then drop out in the 2nd scan, only to show back up again in the 3rd scan (all with the exact same Scan object). These numbers have varied wildly, from being off by 2-3 between subsequent scans to 40 row increases, followed by a drop of 70 rows.

I'm kind of looking for ideas to try to track down what could be causing this to happen. The code itself is pretty simple: it creates a Scan object, scans the table, and then in the map phase extracts the row key, and at the end dumps the row keys to a directory in hdfs.
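For anyone wanting to reproduce the run-to-run comparison described in this thread, here is a minimal sketch in plain Java with no HBase dependency. The class name RowKeyRunDiff, the use of String row keys, and the sample keys are all assumptions made for illustration; real row keys are byte arrays and would be read back from the files the job dumps to hdfs. The inRange helper assumes a half-open [begin, end) interval, which is how the HBase Javadoc describes Scan.setTimeRange.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of HBase or Crunch) for comparing row keys from two job runs. */
public class RowKeyRunDiff {

    /** Keys present in the current run but absent from the previous one. */
    public static Set<String> appeared(Set<String> previous, Set<String> current) {
        Set<String> result = new TreeSet<>(current);
        result.removeAll(previous);
        return result;
    }

    /** Keys present in the previous run but absent from the current one. */
    public static Set<String> dropped(Set<String> previous, Set<String> current) {
        Set<String> result = new TreeSet<>(previous);
        result.removeAll(current);
        return result;
    }

    /** Keys returned by both runs. */
    public static Set<String> stable(Set<String> previous, Set<String> current) {
        Set<String> result = new TreeSet<>(previous);
        result.retainAll(current);
        return result;
    }

    /** Half-open check [begin, end); assumed to match how the scan's time range is interpreted. */
    public static boolean inRange(long timestamp, long begin, long end) {
        return timestamp >= begin && timestamp < end;
    }

    public static void main(String[] args) {
        // Made-up row keys standing in for the output of two runs of the same job.
        Set<String> run1 = new TreeSet<>(Arrays.asList("p1|row-a", "p1|row-b", "p1|row-c"));
        Set<String> run2 = new TreeSet<>(Arrays.asList("p1|row-b", "p1|row-c", "p1|row-d"));

        System.out.println("appeared: " + appeared(run1, run2));
        System.out.println("dropped:  " + dropped(run1, run2));
        System.out.println("stable:   " + stable(run1, run2));
    }
}
```

Keeping the appeared and dropped sets per pair of runs, rather than only the counts, makes it possible to go back and inspect the cell timestamps of the specific rows that come and go.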