hbase-user mailing list archives

From "M.Deniz OKTAR" <deniz.ok...@gmail.com>
Subject Re: region servers dying - flush request - YCSB
Date Mon, 07 Mar 2011 17:22:26 GMT
I ran every kind of benchmark I could find on those machines and they seemed
to work fine. I did memory/disk tests too.

The master node and the other nodes log some information and exceptions
saying that they can't reach the dead node.

Btw, sometimes the process does not die but loses the connection.

--

deniz

On Mon, Mar 7, 2011 at 7:19 PM, Stack <stack@duboce.net> wrote:

> I'm stumped.  I have nothing to go on when there are no death throes or
> complaints.  This hardware for sure is healthy?  Other stuff runs w/o
> issue?
> St.Ack
>
> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <deniz.oktar@gmail.com>
> wrote:
> > I don't know if it's normal, but I see a lot of '0's in the test results
> > when it tends to fail, such as:
> >
> >  1196 sec: 7394901 operations; 0 current ops/sec;
> >
> > --
> > deniz
> >
> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <deniz.oktar@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> Thanks for the effort, answers below:
> >>
> >>
> >>
> >>
> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <stack@duboce.net> wrote:
> >>
> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <deniz.oktar@gmail.com>
> >>> wrote:
> >>> > We have a 5 node cluster, 4 of them being region servers. I am running a
> >>> > custom workload with YCSB and when the data is loading (heavy insert), at
> >>> > least one of the region servers is dying after about 600000 operations.
> >>>
> >>>
> >>> Tell us the character of your 'custom workload' please.
> >>>
> >>>
> >> The workload is below; the part that fails is the loading part (-load),
> >> which inserts all the records first.
> >>
> >> recordcount=10000000
> >> operationcount=3000000
> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
> >>
> >> readallfields=true
> >>
> >> readproportion=0.5
> >> updateproportion=0.1
> >> scanproportion=0
> >> insertproportion=0.35
> >> readmodifywriteproportion=0.05
> >>
> >> requestdistribution=zipfian
> >>
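> >> (For reference, the load phase is started with the standard YCSB client in
> >> -load mode, roughly like the following; the classpath, the workload file
> >> path, and the columnfamily value here are placeholders, not the exact
> >> command used:)
> >>
> >> java -cp build/ycsb.jar:$HBASE_HOME/conf:$HBASE_HOME/lib/* \
> >>     com.yahoo.ycsb.Client -load \
> >>     -db com.yahoo.ycsb.db.HBaseClient \
> >>     -P workloads/customworkload \
> >>     -p columnfamily=family -threads 10 -s
> >>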
> >>
> >>
> >>
> >>>
> >>> > There are no abnormalities in the logs as far as I can see. The only common
> >>> > point is that all of them (in different trials, different region servers
> >>> > fail) request a flush in the last log entries, given below. The .out files
> >>> > are empty. I am looking at the /var/log/hbase folder for logs. Running the
> >>> > latest Sun Java 6. I couldn't find any logs that indicate a problem with
> >>> > Java. Tried the tests with OpenJDK and had the same results.
> >>> >
> >>>
> >>> It's strange that flush is the last thing in your log.  The process is
> >>> dead?  We are exiting w/o a note in logs?  That's unusual.  We usually
> >>> scream loudly when dying.
> >>>
> >>
> >> Yes, that's the strange part. The last line is a flush, as if the process
> >> never failed. Yes, the process is dead and HBase cannot see the node.
> >>
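> >> (For reference, the usual quick checks for a JVM that dies without writing
> >> anything to its own logs are along these lines; the exact grep patterns are
> >> just a sketch:)
> >>
> >> # did the Linux OOM killer take the process down?
> >> dmesg | egrep -i 'oom|killed process'
> >> # did HotSpot crash? it writes hs_err_pid<pid>.log to the directory
> >> # the regionserver was started from
> >> ls hs_err_pid*.log
> >>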
> >>
> >>>
> >>> > I have set ulimits (50000) and xceivers (20000) for multiple users and am
> >>> > certain that they are correct.
> >>>
> >>> The first line in an hbase log prints out the ulimit it sees.  You
> >>> might check that the hbase process for sure is picking up your ulimit
> >>> setting.
> >>>
> >> That was a mistake I made a couple of days ago; I checked it with cat
> >> /proc/<pid of regionserver>/limits and all related users like 'hbase' have
> >> those limits. Checked the logs:
> >>
> >> Mon Mar  7 06:41:15 EET 2011 Starting regionserver on test-1
> >> ulimit -n 52768
> >>
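> >> (A quick way to cross-check the limit the running regionserver process
> >> actually got, assuming jps from the JDK is on the path:)
> >>
> >> pid=$(jps | awk '/HRegionServer/ {print $1}')
> >> grep -i 'open files' /proc/$pid/limits
> >>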
> >>>
> >>> > Also in the kernel logs, there are no apparent problems.
> >>> >
> >>>
> >>> (The mystery compounds)
> >>>
> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> >>> > Compaction requested for
> >>> > usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
> >>> > because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> >>> > NOT flushing memstore for region
> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
> >>> > flushing=false, writesEnabled=false
> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> >>> > Started memstore flush for
> >>> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
> >>> > current region memstore size 68.6m
> >>> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> >>> > Flush requested on
> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
> >>> > -end of log file-
> >>> > ---
> >>> >
> >>>
> >>> Nothing more?
> >>>
> >>>
> >> No, nothing after that. But there are quite a lot of logs before that; I can
> >> send them if you'd like.
> >>
> >>
> >>
> >>> Thanks,
> >>> St.Ack
> >>>
> >>
> >> Thanks a lot!
> >>
> >>
> >
>
