hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HBase scan time range, inconsistency
Date Thu, 26 Feb 2015 20:37:11 GMT
The maxVersions field of the Scan object is 1 by default:

  private int maxVersions = 1;
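
To pick up older cells, a scan has to ask for more versions explicitly. A
minimal sketch, assuming the 0.94 client API:

  Scan scan = new Scan();
  scan.setMaxVersions();    // consider every version the table retains
  // or cap it, e.g.: scan.setMaxVersions(5);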

Cheers

On Thu, Feb 26, 2015 at 12:31 PM, Stephen Durfey <sjdurfey@gmail.com> wrote:

> >
> > 1) What do you mean by saying you have a partitioned HBase table?
> > (Regions and partitions are not the same)
>
>
> By partitions, I just mean logical partitions, using the row key to keep
> data from separate data sources apart from each other.
>
> I think the issue may be resolved now, but it isn't obvious to me why the
> change works. The table is set to save the max number of versions, but the
> number of versions is not specified in the Scan object. Once I changed the
> Scan to request the max number of versions, the counts remained the same
> across all subsequent job runs. Can anyone provide some insight as to why
> this is the case?
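>
> A sketch of the change, assuming the 0.94 client API (beginMs/endMs stand in
> for the actual range bounds, which are not shown here):
>
>   Scan scan = new Scan();
>   scan.setTimeRange(beginMs, endMs); // only cells written within [beginMs, endMs)
>   scan.setMaxVersions();             // the change: request all versions, not just the latest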
>
> On Thu, Feb 26, 2015 at 8:35 AM, Michael Segel <michael_segel@hotmail.com>
> wrote:
>
> > Ok…
> >
> > Silly question time… so just humor me for a second.
> >
> > 1) What do you mean by saying you have a partitioned HBase table?
> > (Regions and partitions are not the same)
> >
> > 2) There’s a question of the isolation level during the scan. What
> > happens when there is a compaction running or there’s RLL taking place?
> >
> > Does your scan get locked/blocked? Does it skip the row?
> > (This should be documented.)
> > Do you count the number of rows scanned when building the list of rows
> > that need to be processed further?
> >
> >
> >
> >
> >
> > > On Feb 25, 2015, at 4:46 PM, Stephen Durfey <sjdurfey@gmail.com> wrote:
> >
> > >
> > >>
> > >> Are you writing any Deletes? Are you writing any duplicates?
> > >
> > >
> > > No physical deletes are occurring in my data, and there is a very real
> > > possibility of duplicates.
> > >
> > >> How is the partitioning done?
> > >>
> > >
> > > The key structure would be /partition_id/person_id .... I'm dealing
> > > with clinical data, with a data source identified by the partition, and
> > > the person data is associated with that particular partition at load
> > > time.
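> > >
> > > As a sketch (the variable names here are illustrative, not the real
> > > code), a key would be built along the lines of:
> > >
> > >   byte[] rowKey = Bytes.toBytes("/" + partitionId + "/" + personId);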
> > >
> > >> Are you doing the column filtering with a custom filter or one of the
> > >> prepackaged ones?
> > >>
> > >
> > > They appear to be all prepackaged filters: FamilyFilter, KeyOnlyFilter,
> > > QualifierFilter, and ColumnPrefixFilter are used under various
> > > conditions, depending upon what is requested on the Scan object.
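> > >
> > > For illustration only (a sketch, not the abstraction's actual code;
> > > prefix and scan are placeholders), one such combination might look like:
> > >
> > >   import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
> > >   import org.apache.hadoop.hbase.filter.FilterList;
> > >   import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
> > >   import org.apache.hadoop.hbase.util.Bytes;
> > >
> > >   FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
> > >   filters.addFilter(new ColumnPrefixFilter(Bytes.toBytes(prefix))); // keep columns starting with the prefix
> > >   filters.addFilter(new KeyOnlyFilter()); // return keys only, drop cell values
> > >   scan.setFilter(filters);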
> > >
> > >
> > > On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey <busbey@cloudera.com> wrote:
> > >
> > >> Are you writing any Deletes? Are you writing any duplicates?
> > >>
> > >> How is the partitioning done?
> > >>
> > >> What does the entire key structure look like?
> > >>
> > >> Are you doing the column filtering with a custom filter or one of the
> > >> prepackaged ones?
> > >>
> > >> On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <sjdurfey@gmail.com>
> > >> wrote:
> > >>
> > >>>>
> > >>>> What's the TTL setting for your table ?
> > >>>>
> > >>>> Which hbase release are you using ?
> > >>>>
> > >>>> Was there compaction in between the scans ?
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>
> > >>> The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I
> > >>> don’t want to say compactions aren’t a factor, but the jobs are
> > >>> short-lived (4-5 minutes), and I have run them frequently over the
> > >>> last couple of days trying to gather stats around what was being
> > >>> extracted, and trying to find the difference and intersection in row
> > >>> keys before job runs.
> > >>>
> > >>> These numbers have varied wildly, from being off by 2-3 between
> > >>> subsequent scans to 40 row increases, followed by a drop of 70 rows.
> > >>>
> > >>>> When you say there is a variation in the number of rows retrieved -
> > >>>> the 40 rows that got increased - are those rows in the expected time
> > >>>> range? Or is the system retrieving some rows which are not in the
> > >>>> specified time range?
> > >>>>
> > >>>> And when the rows drop by 70, did any row that needed to be retrieved
> > >>>> get missed?
> > >>>>
> > >>>
> > >>> The best I can tell, if there is an increase in counts, those rows are
> > >>> not coming from outside of the time range. In the job, I am
> > >>> maintaining a list of rows that have a timestamp outside of my
> > >>> provided time range, and then writing those out to hdfs at the end of
> > >>> the map task. So far, nothing has been written out.
> > >>>
> > >>>> Any filters in your scan?
> > >>>>
> > >>>> Regards
> > >>>> Ram
> > >>>
> > >>>
> > >>> There are some column filters. There is an API abstraction on top of
> > >>> hbase that I am using to allow users to easily extract data from
> > >>> columns that start with a provided column prefix. So, the column
> > >>> filters are in place to ensure I am only getting back data from
> > >>> columns that start with the provided prefix.
> > >>>
> > >>> To add a little more detail, my row keys are separated out by
> > >>> partition. At periodic times (through Oozie), data is loaded from a
> > >>> source into the appropriate partition. I ran some scans against a
> > >>> partition that hadn't been updated in almost a year (with a scan range
> > >>> around the times of the 2nd to last load into the table), and the row
> > >>> key counts were consistent across multiple scans. I chose another
> > >>> partition that is actively being updated once a day. I chose a scan
> > >>> time around the 4th most recent load, and the results were
> > >>> inconsistent from scan to scan (fluctuating up and down). Setting the
> > >>> begin time to 4 days in the past and the end time on the scan range to
> > >>> 'right now', using System.currentTimeMillis() (with the time being
> > >>> after the daily load), the results also fluctuated up and down. So, it
> > >>> kind of seems like there is some sort of temporal recency that is
> > >>> causing the counts to fluctuate.
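> > >>>
> > >>> For that last probe, the range was built roughly like this (a sketch,
> > >>> not the exact code):
> > >>>
> > >>>   long endMs = System.currentTimeMillis();          // 'right now'
> > >>>   long beginMs = endMs - 4L * 24 * 60 * 60 * 1000;  // 4 days back
> > >>>   scan.setTimeRange(beginMs, endMs);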
> > >>>
> > >>>
> > >>>
> > >>> On Feb 24, 2015, at 10:20 PM, ramkrishna vasudevan <
> > >>> ramkrishna.s.vasudevan@gmail.com> wrote:
> > >>>
> > >>> These numbers have varied wildly, from being off by 2-3 between
> > >>> subsequent scans to 40 row increases, followed by a drop of 70 rows.
> > >>>
> > >>> When you say there is a variation in the number of rows retrieved -
> > >>> the 40 rows that got increased - are those rows in the expected time
> > >>> range? Or is the system retrieving some rows which are not in the
> > >>> specified time range?
> > >>>
> > >>> And when the rows drop by 70, did any row that needed to be retrieved
> > >>> get missed?
> > >>>
> > >>> Any filters in your scan?
> > >>>
> > >>> Regards
> > >>> Ram
> > >>>
> > >>> On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >>>
> > >>> What's the TTL setting for your table ?
> > >>>
> > >>> Which hbase release are you using ?
> > >>>
> > >>> Was there compaction in between the scans ?
> > >>>
> > >>> Thanks
> > >>>
> > >>>
> > >>> On Feb 24, 2015, at 2:32 PM, Stephen Durfey <sjdurfey@gmail.com> wrote:
> > >>>
> > >>> I have some code that accepts a time range and looks for data written
> > >>> to an HBase table during that range. If anything has been written for
> > >>> that row during that range, the row key is saved off, and sometime
> > >>> later in the pipeline those row keys are used to extract the entire
> > >>> row. I’m testing against a fixed time range, at some point in the
> > >>> past. This is being done as part of a Map/Reduce job (using Apache
> > >>> Crunch). I have some job counters set up to keep track of the number
> > >>> of rows extracted. Since the time range is fixed, I would expect the
> > >>> scan to return the same number of rows with data in the provided time
> > >>> range. However, I am seeing this number vary from scan to scan
> > >>> (bouncing between increasing and decreasing).
> > >>>
> > >>>
> > >>> I’ve eliminated the possibility that data is being pulled in from
> > >>> outside the time range. I did this by scanning for one column
> > >>> qualifier (and only using this as the qualifier for whether a row had
> > >>> data in the time range), getting the timestamp on the cell for each
> > >>> returned row, and comparing it against the begin and end times for the
> > >>> scan; I didn’t find any that fell outside that range. I’ve observed
> > >>> some row keys show up in the 1st scan, then drop out in the 2nd scan,
> > >>> only to show back up again in the 3rd scan (all with the exact same
> > >>> Scan object). These numbers have varied wildly, from being off by 2-3
> > >>> between subsequent scans to 40 row increases, followed by a drop of
> > >>> 70 rows.
> > >>>
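> > >>> That check was along these lines (a sketch; result is a single scan
> > >>> Result, outOfRange is the list mentioned above, and beginMs/endMs
> > >>> stand in for the range bounds):
> > >>>
> > >>>   for (KeyValue kv : result.raw()) {    // 0.94 API: raw cells of one Result
> > >>>     long ts = kv.getTimestamp();
> > >>>     if (ts < beginMs || ts >= endMs) {  // setTimeRange is end-exclusive
> > >>>       outOfRange.add(Bytes.toStringBinary(kv.getRow()));
> > >>>     }
> > >>>   }
> > >>>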
> > >>> I’m kind of looking for ideas to try to track down what could be
> > >>> causing this to happen. The code itself is pretty simple: it creates a
> > >>> Scan object, scans the table, and then in the map phase extracts the
> > >>> row key, and at the end it dumps them to a directory in hdfs.
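> > >>>
> > >>> The shape of it, written as a plain TableMapper rather than the actual
> > >>> Crunch pipeline (a sketch, with made-up names):
> > >>>
> > >>>   import java.io.IOException;
> > >>>   import org.apache.hadoop.hbase.client.Result;
> > >>>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > >>>   import org.apache.hadoop.hbase.mapreduce.TableMapper;
> > >>>   import org.apache.hadoop.io.NullWritable;
> > >>>   import org.apache.hadoop.io.Text;
> > >>>
> > >>>   public class RowKeyMapper extends TableMapper<Text, NullWritable> {
> > >>>     @Override
> > >>>     protected void map(ImmutableBytesWritable row, Result value, Context ctx)
> > >>>         throws IOException, InterruptedException {
> > >>>       ctx.getCounter("scan", "rowsSeen").increment(1);    // job counter per row returned
> > >>>       ctx.write(new Text(row.get()), NullWritable.get()); // row key, dumped to hdfs
> > >>>     }
> > >>>   }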
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Sean
> > >>
> >
> > The opinions expressed here are mine, while they may reflect a cognitive
> > thought, that is purely accidental.
> > Use at your own risk.
> > Michael Segel
> > michael_segel (AT) hotmail.com
> >
> >
> >
> >
> >
> >
>
