hbase-user mailing list archives

From Andrew Kettmann <andrew.kettm...@evolve24.com>
Subject RE: Inconsistent rows exported/counted when looking at a set, unchanged past time frame.
Date Fri, 09 Feb 2018 22:30:53 GMT
A simpler question would be this:

Given:


  *   a set timeframe in the past (2-3 days, roughly a year ago)
  *   we are NOT removing records from the table at all
  *   we ARE actively inserting into this table

Should I expect two consecutive runs of the rowcounter mapreduce job to return an identical
number?


Andrew Kettmann
Consultant, Platform Services Group

From: Andrew Kettmann
Sent: Thursday, February 08, 2018 11:35 AM
To: user@hbase.apache.org
Subject: Inconsistent rows exported/counted when looking at a set, unchanged past time frame.

First the version details:

Running HBase/YARN/HDFS using Cloudera Manager 5.12.1.
HBase: version 1.2.0-cdh5.8.0
HDFS/YARN: Hadoop 2.6.0-cdh5.8.0
hbck and hdfs fsck both report healthy

15 nodes, recently sized down from 30 (requirements from other services such as Solr were reduced)


The simplest example of the inconsistency is using rowcounter. If I run the same mapreduce
job twice in a row, I get different counts:

hbase org.apache.hadoop.hbase.mapreduce.Driver rowcounter \
-Dmapreduce.map.speculative=false \
TABLENAME --starttime=1485907200000 --endtime=1486058400000

Looking at org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters:
Run 1: 4876683
Run 2: 4866351
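
For reference, the --starttime/--endtime values in the command above are epoch milliseconds. A minimal sketch decoding them, assuming they are meant to be read as UTC (the helper name ms_to_utc is only illustrative):

```python
from datetime import datetime, timedelta, timezone

# Bounds copied from the rowcounter/export commands in this message.
START_MS = 1485907200000
END_MS = 1486058400000

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

start = ms_to_utc(START_MS)
end = ms_to_utc(END_MS)
window_hours = (end - start).total_seconds() / 3600

print("start :", start.isoformat())
print("end   :", end.isoformat())
print("window:", window_hours, "hours")
```

Under that assumption the window runs from 2017-02-01 00:00 UTC to 2017-02-02 18:00 UTC, i.e. 42 hours.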

The same happens with exports over the same time range. Consecutive runs of the export give
different results:
hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.map.tasks.speculative.execution=false \
-Dmapred.reduce.tasks.speculative.execution=false \
TABLENAME \
HDFSPATH 1 1485907200000 1486058400000

From the Map input/output records counters:
Run 1: 4296778
Run 2: 4297307
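
To put the four counts above side by side, here is a small sketch of the arithmetic. The numbers are copied from the runs in this message; the variable and function names are only illustrative:

```python
# Counts copied from the two rowcounter runs and the two export runs above.
rowcounter_runs = [4876683, 4866351]
export_runs = [4296778, 4297307]

def spread(runs):
    """Return (absolute difference, difference as % of the larger count)."""
    diff = max(runs) - min(runs)
    return diff, 100.0 * diff / max(runs)

rc_diff, rc_pct = spread(rowcounter_runs)
ex_diff, ex_pct = spread(export_runs)

# Smallest gap between any rowcounter run and any export run.
job_gap = min(rowcounter_runs) - max(export_runs)

print(f"rowcounter run-to-run: {rc_diff} rows ({rc_pct:.3f}%)")
print(f"export run-to-run:     {ex_diff} rows ({ex_pct:.3f}%)")
print(f"rowcounter vs export:  at least {job_gap} rows apart")
```

So the two rowcounter runs differ by 10332 rows, the two export runs by 529 rows, and the rowcounter and export totals are at least 569044 rows apart.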

None of the job results show any spilled records or failed maps. Sometimes the row count
increases between runs, sometimes it decreases. We aren’t using any row filters; we just want
to export chunks of the data for a specific time range. The table is actively being read from
and written to, but in this case I am asking about a date range in early 2017, so I would not
have expected that to have any impact. Another point is that the rowcounter job and the export
return very different numbers. There should be no older versions of rows involved, since the
table is set to keep only the newest version, and I can confirm that some rows are consistently
missing from the exports. The table definition is below.

hbase(main):001:0> describe 'TABLENAME'
Table TABLENAME is ENABLED
TABLENAME
COLUMN FAMILIES DESCRIPTION
{NAME => 'text', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
COMPRESSION => 'SNAPPY', VERSIONS => '1', MIN_VERSIONS => '0', TTL => 'FOREVER',
KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.2800 seconds

Any advice or suggestions would be greatly appreciated. Are some of my assumptions about
import/export wrong, namely that the results should be consistent given a consistent time range?


Andrew Kettmann
Platform Services Group
