Subject: Re: silently aborted scans when using hbase.client.scanner.max.result.size
From: Jean-Daniel Cryans <jdcryans@gmail.com>
To: user@hbase.apache.org
Date: Thu, 26 Jul 2012 10:44:25 -0700

Damn! Well, that's a big bug then, but it seems that HBASE-2214 would
fix it since the client would pass its own max size? Although, reading
the patch, it doesn't seem so: if it wasn't configured on the client
and it wasn't passed on the Scan, then the region server will pick up
its own configured value.

In the patch:

-    this.maxScannerResultSize = conf.getLong(
+    if (scan.getMaxResultSize() > 0) {
+      this.maxScannerResultSize = scan.getMaxResultSize();
+    } else {
+      this.maxScannerResultSize = conf.getLong(
           HConstants.HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE_KEY,
           HConstants.DEFAULT_HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE);
+    }

If in the else clause you set the new value on the scan, then the
region server would always receive the right amount of data.
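Something like this (just a sketch on top of that hunk, assuming the
setter that goes with the patch's Scan.getMaxResultSize()):

    if (scan.getMaxResultSize() > 0) {
      this.maxScannerResultSize = scan.getMaxResultSize();
    } else {
      this.maxScannerResultSize = conf.getLong(
          HConstants.HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE_KEY,
          HConstants.DEFAULT_HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE);
      // Ship the resolved value on the Scan itself, so the region
      // server applies exactly the same limit as the client instead
      // of falling back to its own config.
      scan.setMaxResultSize(this.maxScannerResultSize);
    }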
Then you have to wonder why the region server would even set its own
value, since it's just likely to cause trouble. Or maybe it's the
client that shouldn't care. I'll add a comment to that jira too.

J-D

On Thu, Jul 26, 2012 at 1:05 AM, Ferdy Galema wrote:
> Thanks man!! It is really that simple! That is crazy. I've been
> running this property server-side only for such a long time, but
> never really experienced the effects until using a higher caching
> value (which is perfectly explainable). Wherever this property is
> mentioned, it surely must be documented that it is critical to set
> it on both server and client, unless you enjoy missing rows at
> random.
>
> Thanks again.
> Ferdy
>
> On Wed, Jul 25, 2012 at 9:07 PM, Jean-Daniel Cryans wrote:
>
>> That looks nasty.
>>
>> Could it be that your client doesn't know about the max result
>> size? Looking at ClientScanner.next(), we iterate while this is
>> true:
>>
>>     } while (remainingResultSize > 0 && countdown > 0 &&
>>         nextScanner(countdown, values == null));
>>
>> Let's say the region server returns fewer rows than requested,
>> like 1240, but the caching is set to 1241. The remaining size
>> would still be higher than zero and so would the countdown (its
>> value would be 1), so it's going to try to get the next scanner.
>> If you have just one region, it stops right there. And that's
>> exactly your case: one region, and the config not set on the
>> client side.
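>>
>> To make that concrete with your numbers (a toy model of the exit
>> condition, not the real client code):
>>
>>     // The client never saw hbase.client.scanner.max.result.size,
>>     // so its byte budget is effectively unlimited:
>>     long remainingResultSize = Long.MAX_VALUE;
>>     int countdown = 1241;  // hbase.client.scanner.caching
>>
>>     // The region server stops at its own 90100100-byte limit and
>>     // returns only 1240 rows:
>>     countdown -= 1240;     // countdown == 1
>>
>>     // Both remainingResultSize and countdown are still positive,
>>     // so the loop condition sends the client to nextScanner() for
>>     // the *next* region instead of calling next() again on this
>>     // one. With a single region there is no next scanner, and the
>>     // scan ends silently short.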
>>
>> J-D
>>
>> On Wed, Jul 25, 2012 at 5:04 AM, Ferdy Galema wrote:
>> > I was experiencing aborted scans under certain conditions. In
>> > these cases I was simply missing so many rows that only a
>> > fraction was inputted, without warning. After lots of testing I
>> > was able to pinpoint and reproduce the error when scanning over
>> > a single region, single column family, single store file. So
>> > really just a single (major_compacted) storefile. I scan over
>> > this region using a single Scan in a local jobtracker context.
>> > (So not mapreduce, although that has exactly the same
>> > behaviour.) Finally, I noticed that the number of input rows
>> > depends on the hbase.client.scanner.caching property. See the
>> > following example runs that scan over this region with a
>> > specific start and stop key:
>> >
>> > -Dhbase.client.scanner.caching=1      inputrows=1506
>> > -Dhbase.client.scanner.caching=10000  inputrows=1240
>> > -Dhbase.client.scanner.caching=1240   inputrows=1506
>> > -Dhbase.client.scanner.caching=1241   inputrows=1240
>> >
>> > This is weird huh? So setting the caching to 1241 in this case
>> > aborts the scan silently. Removing the stoprow yields the same
>> > amount. Setting the caching to 1 with no stoprow yields all rows
>> > (several hundreds of thousands).
>> >
>> > Neither the client nor the regionserver logs any warning
>> > whatsoever. I had hbase.client.scanner.max.result.size set to
>> > 90100100. After removing this property it all works like a
>> > charm!!! All rows are properly inputted, regardless of
>> > hbase.client.scanner.caching. As an extra verification I checked
>> > the regionserver for the responseTooLarge warnings I would
>> > expect once the property is removed, and indeed they show up:
>> >
>> > 2012-07-25 11:46:52,889 WARN org.apache.hadoop.ipc.HBaseServer:
>> > IPC Server handler 8 on 60020, responseTooLarge for:
>> > next(-1937592840574159040, 10000) from x.x.x.x:39398: Size: 338.1m
>> > 2012-07-25 11:47:14,359 WARN org.apache.hadoop.ipc.HBaseServer:
>> > IPC Server handler 9 on 60020, responseTooLarge for:
>> > next(-1937592840574159040, 10000) from x.x.x.x:39407: Size: 186.6m
>> >
>> > So, anyone know what this could be? I am willing to debug this
>> > behaviour at the regionserver level, but before I do I want to
>> > make sure I am not running into something that has already been
>> > solved. This is on hbase-0.90.6-cdh3u4, using snappy.
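>> >
>> > (For reference, the property only lived in the server's
>> > hbase-site.xml here. If you do want the limit, the client-side
>> > equivalent would be roughly the following, where "mytable" is
>> > just a placeholder:
>> >
>> >     Configuration conf = HBaseConfiguration.create();
>> >     // Mirror the server's limit so ClientScanner's byte
>> >     // accounting stops at the same point the region server does.
>> >     conf.setLong("hbase.client.scanner.max.result.size", 90100100L);
>> >     HTable table = new HTable(conf, "mytable");
>> >
>> > That way both sides agree on when a batch is full.)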