hbase-user mailing list archives

From Stephen Durfey <sjdur...@gmail.com>
Subject Re: HBase scan time range, inconsistency
Date Wed, 25 Feb 2015 22:46:50 GMT
> Are you writing any Deletes? Are you writing any duplicates?

No physical deletes are occurring in my data, and there is a very real
possibility of duplicates.

> How is the partitioning done?

The key structure would be /partition_id/person_id .... I'm dealing with
clinical data, with a data source identified by the partition, and the
person data is associated with that particular partition at load time.
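(A minimal sketch of that key composition, with made-up partition and person ids, just to make the shape concrete:)

```java
public class RowKeySketch {
    // Row keys take the form /partition_id/person_id; the ids below
    // are invented for illustration.
    static String rowKey(String partitionId, String personId) {
        return "/" + partitionId + "/" + personId;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("sourceA", "12345")); // /sourceA/12345
    }
}
```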

> Are you doing the column filtering with a custom filter or one of the
> prepackaged ones?

They all appear to be prepackaged filters: FamilyFilter, KeyOnlyFilter,
QualifierFilter, and ColumnPrefixFilter are used under various conditions,
depending on what is requested on the Scan object.
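(For illustration, this is roughly the per-cell check a ColumnPrefixFilter performs, sketched in plain Java rather than the HBase API; the qualifier names and prefix below are invented:)

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixFilterSketch {
    // Stand-in for what ColumnPrefixFilter does per cell: keep the cell
    // only if its qualifier's bytes start with the configured prefix.
    static boolean matchesPrefix(byte[] qualifier, byte[] prefix) {
        if (qualifier.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (qualifier[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] prefix = "lab_".getBytes(StandardCharsets.UTF_8);
        Map<String, String> row = new LinkedHashMap<>();
        row.put("lab_glucose", "110");    // kept: matches prefix
        row.put("lab_a1c", "5.9");        // kept: matches prefix
        row.put("rx_metformin", "500mg"); // dropped: no match

        for (Map.Entry<String, String> cell : row.entrySet()) {
            byte[] qualifier = cell.getKey().getBytes(StandardCharsets.UTF_8);
            if (matchesPrefix(qualifier, prefix)) {
                System.out.println(cell.getKey() + " = " + cell.getValue());
            }
        }
    }
}
```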


On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey <busbey@cloudera.com> wrote:

> Are you writing any Deletes? Are you writing any duplicates?
>
> How is the partitioning done?
>
> What does the entire key structure look like?
>
> Are you doing the column filtering with a custom filter or one of the
> prepackaged ones?
>
> On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <sjdurfey@gmail.com>
> wrote:
>
> > >
> > > What's the TTL setting for your table ?
> > >
> > > Which hbase release are you using ?
> > >
> > > Was there compaction in between the scans ?
> > >
> > > Thanks
> > >
> >
> > The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don’t
> > want to say compactions aren’t a factor, but the jobs are short lived
> > (4-5 minutes), and I have run them frequently over the last couple of
> > days trying to gather stats around what was being extracted, and trying
> > to find the difference and intersection in row keys between job runs.
> >
> > These numbers have varied wildly, from being off by 2-3 between
> > subsequent scans to 40 row increases, followed by a drop of 70 rows.
> >
> > > When you say there is a variation in the number of rows retrieved - the
> > > 40 rows that got increased - are those rows in the expected time range?
> > > Or is the system retrieving some rows which are not in the specified
> > > time range?
> > >
> > > And when the rows drop by 70, did any row which needed to be retrieved
> > > get missed out?
> > >
> >
> > The best I can tell, if there is an increase in counts, those rows are
> > not coming from outside of the time range. In the job, I am maintaining
> > a list of rows that have a timestamp outside of my provided time range,
> > and then writing those out to hdfs at the end of the map task. So far,
> > nothing has been written out.
> >
> > > Any filters in your scan?
> > >
> > > Regards
> > > Ram
> >
> > There are some column filters. There is an API abstraction on top of
> > HBase that I am using to allow users to easily extract data from
> > columns that start with a provided column prefix. So, the column
> > filters are in place to ensure I am only getting back data from columns
> > that start with the provided prefix.
> >
> > To add a little more detail, my row keys are separated out by partition.
> > At periodic times (through oozie), data is loaded from a source into the
> > appropriate partition. I ran some scans against a partition that hadn't
> > been updated in almost a year (with a scan range around the times of the
> > 2nd to last load into the table), and the row key counts were consistent
> > across multiple scans. I chose another partition that is actively being
> > updated once a day. I chose a scan time around the 4th most recent load,
> > and the results were inconsistent from scan to scan (fluctuating up and
> > down). Setting the begin time to 4 days in the past and the end time on
> > the scan range to 'right now', using System.currentTimeMillis() (with
> > the time being after the daily load), the results also fluctuated up and
> > down. So, it kind of seems like there is some sort of temporal recency
> > that is causing the counts to fluctuate.
> >
> >
> >
> > On Feb 24, 2015, at 10:20 PM, ramkrishna vasudevan <
> > ramkrishna.s.vasudevan@gmail.com> wrote:
> >
> > These numbers have varied wildly, from being off by 2-3 between
> > subsequent scans to 40 row increases, followed by a drop of 70 rows.
> >
> > When you say there is a variation in the number of rows retrieved - the
> > 40 rows that got increased - are those rows in the expected time range?
> > Or is the system retrieving some rows which are not in the specified
> > time range?
> >
> > And when the rows drop by 70, did any row which needed to be retrieved
> > get missed out?
> >
> > Any filters in your scan?
> >
> > Regards
> > Ram
> >
> > On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > What's the TTL setting for your table ?
> >
> > Which hbase release are you using ?
> >
> > Was there compaction in between the scans ?
> >
> > Thanks
> >
> >
> > On Feb 24, 2015, at 2:32 PM, Stephen Durfey <sjdurfey@gmail.com> wrote:
> >
> > I have some code that accepts a time range and looks for data written to
> > an HBase table during that range. If anything has been written for that
> > row during that range, the row key is saved off, and sometime later in
> > the pipeline those row keys are used to extract the entire row. I’m
> > testing against a fixed time range, at some point in the past. This is
> > being done as part of a Map/Reduce job (using Apache Crunch). I have
> > some job counters set up to keep track of the number of rows extracted.
> > Since the time range is fixed, I would expect the scan to return the
> > same number of rows with data in the provided time range. However, I am
> > seeing this number vary from scan to scan (bouncing between increasing
> > and decreasing).
> >
> >
> > I’ve eliminated the possibility that data is being pulled in from
> > outside the time range. I did this by scanning for one column qualifier
> > (and only using this as the qualifier for whether a row had data in the
> > time range), getting the timestamp on the cell for each returned row,
> > and comparing it against the begin and end times for the scan; I didn’t
> > find any that fell outside the range. I’ve observed some row keys show
> > up in the 1st scan, then drop out in the 2nd scan, only to show back up
> > again in the 3rd scan (all with the exact same Scan object). These
> > numbers have varied wildly, from being off by 2-3 between subsequent
> > scans to 40 row increases, followed by a drop of 70 rows.
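(That verification step amounts to collecting row keys whose cell timestamp falls outside [begin, end); a stand-in sketch in plain Java, with invented row keys and timestamps:)

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RangeCheckSketch {
    // Stand-in for the per-row check in the map task: record any row
    // whose cell timestamp falls outside the scan's [begin, end) range.
    static List<String> outOfRange(Map<String, Long> rowToTs, long begin, long end) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Long> e : rowToTs.entrySet()) {
            long ts = e.getValue();
            if (ts < begin || ts >= end) bad.add(e.getKey());
        }
        return bad;
    }

    public static void main(String[] args) {
        Map<String, Long> rows = new LinkedHashMap<>();
        rows.put("/p1/personA", 100L); // inside [0, 200)
        rows.put("/p1/personB", 250L); // outside [0, 200)
        System.out.println(outOfRange(rows, 0L, 200L)); // [/p1/personB]
    }
}
```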
> >
> >
> > I’m kind of looking for ideas to try to track down what could be
> > causing this to happen. The code itself is pretty simple: it creates a
> > Scan object, scans the table, and then in the map phase extracts the
> > row key, and at the end it dumps them to a directory in hdfs.
> >
>
>
>
> --
> Sean
>
