hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Latency related configs for 0.90
Date Thu, 21 Apr 2011 16:28:08 GMT
Sorry, I mostly focused on the performance-related issues I saw in your
code, given this thread is called "Latency related configs for 0.90".
So that's not the case anymore, right?

What you are describing indeed sounds exactly like a timestamp issue,
such as https://issues.apache.org/jira/browse/HBASE-2256

In the code you just pasted I don't see anything regarding the setting
of timestamps BTW.

Finally, it seems odd to me that the issue happens only when the
machines are distant... HBASE-2256 happens when the client is as close
as possible to the server. You might want to take a look at the
systems' clocks and see if there's any skew.
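
One quick way to check that from the client side: write a cell without an
explicit timestamp, read it back, and compare the server-assigned timestamp
to the client's clock. A rough, untested sketch (the table and family names
below are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SkewCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "some_table");   // placeholder table
    byte[] row = Bytes.toBytes("skew_probe");
    byte[] fam = Bytes.toBytes("familyA");           // placeholder family

    // No explicit timestamp, so the region server stamps the cell with its
    // own current time when it applies the put.
    Put p = new Put(row);
    p.add(fam, Bytes.toBytes("q"), Bytes.toBytes("v"));
    table.put(p);
    long clientNow = System.currentTimeMillis();

    // Read the cell back; the difference between the server-assigned
    // timestamp and the client clock is roughly the skew (plus one round trip).
    Result r = table.get(new Get(row));
    for (KeyValue kv : r.raw()) {
      System.out.println("server ts = " + kv.getTimestamp()
          + ", client now = " + clientNow
          + ", delta(ms) = " + (clientNow - kv.getTimestamp()));
    }
    table.close();
  }
}

A delta of more than a second or so would be worth chasing down.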

J-D

On Thu, Apr 21, 2011 at 5:48 AM, George P. Stathis <gstathis@traackr.com> wrote:
> Thanks J-D. Here is an updated one: http://pastebin.com/MZDgVBam
>
> I posted this test case as a sample of the type of operations we are doing;
> it's not the actual code itself though. In our actual code, the HTable pool
> and config are all Spring-managed singleton instances available across the
> entire app, so we don't keep creating and dropping them. I fixed the unit
> test to take your pointers into consideration. That allowed me to drop
> hbase.zookeeper.property.maxClientCnxns back to the default of 30, so
> thanks for that.
>
> But this example was simply meant to illustrate what we are trying to do
> with hbase; basically, create a secondary index row for a given record.
>
> The actual symptoms we are experiencing are not maxClientCnxns issues.
> We are seeing data not being persisted when we think it is, or not being
> entirely deleted when we think it is; this mostly happens when we
> introduce a network in between the client and the HBase server (although
> it has been seen to happen, much less frequently, when the client and
> server are on the same box).
>
> As an example, we see things like this (pseudo-code):
>
> // Insert data
> Put p = new Put("some_row_id");
> p.add("familiyA","qualifierA","valueA");
> p.add("familiyA","qualifierB","valueB");
> p.add("familiyA","qualifierC","valueC");
> table.put(p);
>
> // Validate row presence
> Result row = table.get(new Get("some_row_id"));
> System.out.println(row.toString());
> => keyvalues={some_row_id/familyA:qualifierA/1303389288609/Put/vlen=13,
> some_row_id/familyA:qualifierB/1303389288610/Put/vlen=13,
> some_row_id/familyA:qualifierC/1303389289262/Put/vlen=13}
>
> // Delete row
> table.delete(new Delete("some_row_id"));
>
> // Validate row deletion
> Result deletedRow = table.get(new Get("some_row_id"));
> System.out.println(deletedRow.toString());
> => keyvalues={some_row_id/familyA:qualifierC/1303389289262/Put/vlen=13}  // orphaned cell !!!
>
> I was seeing this case happen last night for hours on end with the same test
> data. I began suspecting timestamp issues as possible culprits. I went to
> bed and left the test environment alone overnight (no processes running on
> it). This morning, I re-ran the same test case: the orphaned cell phenomenon
> is no longer happening. So it's very hit or miss, but the example I gave
> above was definitely reproducible at will for a few hours.
>
> Are there any known cases where a deliberate delete of an entire row will
> still leave data behind? Could we be messing up our timestamps in such a
> way that we cause this?
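>
> For example, if one of our writes ever goes in with a timestamp that is
> ahead of the region server's clock (an explicit timestamp, or a skewed
> client stamping its own cells), I'd expect something like the following to
> leave a cell behind (an illustrative sketch, not our actual code; same
> client classes as above, plus HBaseConfiguration and Bytes):
>
> HTable table = new HTable(HBaseConfiguration.create(), "some_table");
> byte[] row = Bytes.toBytes("some_row_id");
> byte[] fam = Bytes.toBytes("familyA");
>
> // Cell written with a timestamp 60s ahead of this client's clock.
> Put p = new Put(row);
> p.add(fam, Bytes.toBytes("qualifierC"),
>     System.currentTimeMillis() + 60000, Bytes.toBytes("valueC"));
> table.put(p);
>
> // A plain row delete gets its tombstone stamped with the server's current
> // time, which is older than the cell above, so the cell is not masked.
> table.delete(new Delete(row));
>
> // The "deleted" cell still comes back.
> System.out.println(table.get(new Get(row)));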
>
> -GS
>
>
>
> On Wed, Apr 20, 2011 at 6:58 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
>> Regarding the test:
>>
>>  - Try to keep only one HBaseAdmin and one HTablePool, and always reuse
>> the same conf between tests; creating a new HBA or HTP creates a new
>> HBaseConfiguration and thus a new connection. Use methods like
>> setUpBeforeClass. Another option is to close the connection once you
>> are done with those classes, and then close in tearDown the one you
>> created in setUp. Right now I can count 25 connections being created
>> in this test (I know it sucks, it's a regression in 0.90).
>>  - The fact that you are creating new HTablePools in do* means you are
>> re-creating new HTables for almost every request you make, and
>> that's a pretty expensive operation. Again, keeping only a single
>> instance will help a lot (a sketch follows below).
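>>
>> Something along these lines (untested; the class and table names are made
>> up, adjust to your test):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.HBaseAdmin;
>> import org.apache.hadoop.hbase.client.HConnectionManager;
>> import org.apache.hadoop.hbase.client.HTableInterface;
>> import org.apache.hadoop.hbase.client.HTablePool;
>> import org.junit.AfterClass;
>> import org.junit.BeforeClass;
>> import org.junit.Test;
>>
>> public class SecondaryIndexTest {
>>   private static Configuration conf;
>>   private static HTablePool pool;
>>   private static HBaseAdmin admin;
>>
>>   @BeforeClass
>>   public static void setUpBeforeClass() throws Exception {
>>     conf = HBaseConfiguration.create();  // one conf == one shared connection
>>     pool = new HTablePool(conf, 10);     // one pool for the whole test class
>>     admin = new HBaseAdmin(conf);        // piggybacks on the same connection
>>   }
>>
>>   @AfterClass
>>   public static void tearDownAfterClass() throws Exception {
>>     HConnectionManager.deleteConnection(conf, true);  // drop the shared connection
>>   }
>>
>>   @Test
>>   public void testSomething() throws Exception {
>>     HTableInterface table = pool.getTable("some_table");  // borrowed, not created
>>     try {
>>       // ... puts/gets/deletes against 'table' ...
>>     } finally {
>>       pool.putTable(table);  // return it to the pool
>>     }
>>   }
>> }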
>>
>> That's the most obvious stuff I saw.
>>
>> J-D
>>
>> On Wed, Apr 20, 2011 at 12:46 PM, George P. Stathis
>> <gstathis@traackr.com> wrote:
>> > On Wed, Apr 20, 2011 at 12:48 PM, Stack <stack@duboce.net> wrote:
>> >
>> >> On Tue, Apr 19, 2011 at 12:08 PM, George P. Stathis
>> >> <gstathis@traackr.com> wrote:
>> >> > We have several unit tests that have started mysteriously failing in
>> >> > random ways as soon as we migrated our EC2 CI build to the new 0.90
>> >> > CDH3. Those tests used to run against 0.89 and never failed before.
>> >> > They also run OK on our local MacBooks. On EC2, we are seeing lots of
>> >> > issues where the setup data is not being persisted in time for the
>> >> > tests to assert against them. They are also not always being torn
>> >> > down properly.
>> >> >
>> >>
>> >> These are your tests, George, that depend on HBase.  What are they
>> >> asking of HBase?  You are spinning up a cluster and then the teardown
>> >> is not working?  Want to pastebin some logs?  We might see something.
>> >>
>> >
>> >
>> > It's not practical to paste all the secondary-indexing code we have in
>> > place. It's very likely that there is an issue in our code though, so I
>> > don't want to send folks down a rabbit hole. I just wanted to validate
>> > that there are no new configs in 0.90 (from 0.89) that could affect
>> > read/write consistency.
>> >
>> > I created a test that simulates what most of our secondary-indexing code
>> > does:
>> >
>> > http://pastebin.com/M9qKv87u
>> >
>> > It's a simplified version and, of course, this one does not fail; or
>> > rather, I have not been able to make it fail in the same way. The only
>> > thing I've hit with this test in pseudo-distributed mode is
>> > hbase.zookeeper.property.maxClientCnxns, which I bumped up and was then
>> > able to get past it. The issue we are seeing does not throw any errors
>> > in any of the master/regionserver/zookeeper logs, so, right now, all
>> > indications are that the problem is on our side. I just need to dig
>> > deeper.
>> >
>> > BTW, we are not spinning up a temporary mini-cluster to test; instead,
>> > we have a dedicated dev pseudo-distributed machine against which our CI
>> > tests run. That's the environment that is presenting issues at the
>> > moment. Again, the odd part is that we have set up our local instances
>> > the same way as our dev pseudo-distributed machine and the tests pass.
>> > The difference is that we run on Macs and the dev instance is on EC2.
>> >
>> >
>> >>
>> >> > We first started seeing issues running our Hudson build on the same
>> >> > machine as the HBase pseudo-cluster. We figured that was putting too
>> >> > much load on the box, so we created a separate large instance on EC2
>> >> > to host just the 0.90 stack. This migration nearly quadrupled the
>> >> > number of unit tests failing at times. The only difference between
>> >> > the first and second CI setups is the network in between.
>> >> >
>> >>
>> >> Yeah.  EC2.  But we should be able to manage with a flakey network
>> >> anyways.
>> >>
>> >
>> > Just wanted to make sure that this was indeed the case.
>> >
>> >
>> >>
>> >>
>> >> > Before we start tearing down our code line by line, I'd like to see
>> >> > if there are latency-related configuration tweaks we could try to
>> >> > make the setup more resilient to network lag. Are there any
>> >> > hbase/zookeeper settings that might help? For instance, we see things
>> >> > such as HBASE_SLAVE_SLEEP in hbase-env.sh. Can that help?
>> >> >
>> >>
>> >> You've seen that HBase uses a different config when it runs tests;
>> >> it's in src/tests/resources/hbase-site.xml.
>> >>
>> >> But if stuff used to work on 0.89 w/ the old config, this is probably
>> >> not it.
>> >>
>> >
>> > I reverted all our configs back to the defaults but the issue remains.
>> > I'll take a look at the test config and see if any of those settings
>> > may help out. From what I can gather at first glance, the test settings
>> > are actually more aggressive, so they seem even less tolerant of delays.
>> >
>> > Will keep digging; I'll post an update when we get somewhere.
>> >
>> >
>> >>
>> >> > Any suggestions are more than welcome. Also, the overview above may
>> >> > not be enough to go on, so please let me know if I could provide more
>> >> > details.
>> >> >
>> >>
>> >> I think a pastebin of a failing test, one that used to pass, with a
>> >> description (or code) of what is being done, would be the place to
>> >> start; we might recognize the diff between 0.89 and 0.90.
>> >>
>> >> St.Ack
>> >>
>> >
>>
>
