hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject RE: HBase on same boxes as HDFS Data nodes
Date Thu, 08 Jul 2010 22:40:58 GMT
Major compaction does not do any "checking" for current file locations.  Rather, major compactions
take all files of a region and compact them into a single file per family.  By rewriting the
region to HDFS, all blocks will be written to the local node.  That's how it gives locality.

By default, major compactions are run every 24 hours.
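
For anyone who wants to poke at this by hand: a major compaction can be kicked off from the command line, and the 24-hour schedule is just the default value of the hbase.hregion.majorcompaction property (in milliseconds) in hbase-site.xml. A minimal sketch, with 'usertable' standing in for your own table name:

  # trigger a major compaction of one table without opening an interactive shell
  echo "major_compact 'usertable'" | hbase shell

  # the periodic schedule is hbase.hregion.majorcompaction in hbase-site.xml;
  # the default of 86400000 ms is the 24 hours mentioned above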

JG

> -----Original Message-----
> From: Jamie Cockrill [mailto:jamie.cockrill@gmail.com]
> Sent: Thursday, July 08, 2010 3:24 PM
> To: user@hbase.apache.org
> Subject: Re: HBase on same boxes as HDFS Data nodes
>
> Hi Venkatesh,
>
> I've had a read of the article that JD suggested and I think the
> following is useful for your situation (it's the paragraph before the
> last):
>
> "So this means for HBase that as the region server stays up for long
> enough (which is the default) that after a major compaction on all
> tables - which can be invoked manually or is triggered by a
> configuration setting - it has the files local on the same host. The
> data node that shares the same physical host has a copy of all data
> the region server requires. If you are running a scan or get or any
> other use-case you can be sure to get the best performance."
>
> If I understand this correctly, if you do a major compaction on the
> table, it'll check that the data blocks and regions are co-located
> properly and move any that are not. You can do a major compact in the
> hbase shell manually with:
>
> major_compact '<your tablename here>'
>
> Without the angle brackets, but with the quotes. I think it does
> do major compactions on a periodic basis, but I couldn't say for sure.
>
> Thanks,
>
> Jamie
>
> PS, JD, thanks for the tip about locality. I'll get the hang of HBase at
> some point!
>
> On 8 July 2010 18:51,  <vramanathan00@aol.com> wrote:
> >
> >  Thank you.
> > I have some more questions. I've been spending quite a bit of time over the
> > last few weeks developing one of our applications using HBase/Hadoop, using
> > 0.20.4.
> >
> > HBase - Table X
> > rows 1 - 100     -> Region A -> RegionServer A -> DataNode A
> > ....
> > rows 1500 - 1600 -> Region M -> RegionServer B -> DataNode B
> >
> > So based on what I have read so far, I'm thinking of pairing Region Server A
> > & Data Node A on the same host to make use of locality.
> >
> > As per your answer, if we restart the cluster, locality is gone because of
> > random assignment, so Region Server B -> Region A -> data blocks will still
> > be in Data Node A, if I understand correctly.
> > Will the data move over time though, for example if I have lots of access to
> > data in DataNode A, without the current work that is in progress?
> >
> > thanks again for your reply
> >
> > venkatesh
> >
> > -----Original Message-----
> > From: Jean-Daniel Cryans <jdcryans@apache.org>
> > To: user@hbase.apache.org
> > Sent: Thu, Jul 8, 2010 1:35 pm
> > Subject: Re: HBase on same boxes as HDFS Data nodes
> >
> > Former, "Now imagine you stop HBase after saving a lot of data and
> > restarting it subsequently. The region servers are restarted and
> > assign a seemingly random number of regions"
> >
> > It's not really because we enjoy it that way, but because the work
> > required just isn't done. If this is of interest to you, Jonathan and
> > Karthik at Facebook started rewriting our load balancer. See
> > https://issues.apache.org/jira/browse/HBASE-2699 and
> > https://issues.apache.org/jira/browse/HBASE-2480
> >
> > J-D
> >
> > On Thu, Jul 8, 2010 at 10:30 AM,  <vramanathan00@aol.com> wrote:
> >
> >>  Hi
> >> Fairly new to hbase & the listserv. Following up on this thread & the
> >> article: could someone elaborate why locality is lost upon restart? Is it
> >> because of random assignment by the HMaster, and/or because the
> >> HRegionServer is stateless, or other reasons?
> >>
> >> thanks
> >> venkatesh
> >>
> >
> >> -----Original Message-----
> >> From: Jean-Daniel Cryans <jdcryans@apache.org>
> >> To: user@hbase.apache.org
> >> Sent: Thu, Jul 8, 2010 1:11 pm
> >> Subject: Re: HBase on same boxes as HDFS Data nodes
> >>
> >> More info on this blog post:
> >>
> >> http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
> >>
> >> J-D
> >>
> >
> >> On Thu, Jul 8, 2010 at 10:11 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>> This would be done at the expense of network IO, since you will lose
> >>> locality for jobs that read/write to HBase. Also I guess the datanodes
> >>> are also there, so HBase will lose locality with HDFS.
> >>>
> >>> J-D
> >>>
> >
> >>> On Thu, Jul 8, 2010 at 10:07 AM, Jamie Cockrill
> >>> <jamie.cockrill@gmail.com> wrote:
> >>>> Thanks all for your help with this, everything seems much more stable
> >>>> for the meantime. I have a backlog loading job to run over a great
> >>>> deal of data, so I might separate out my region servers from my task
> >>>> trackers for the meantime.
> >>>>
> >>>> Thanks again,
> >>>>
> >>>> Jamie
> >>>>
> >>>> On 8 July 2010 17:46, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>>>> OS cache is good, glad you figured out your memory problem.
> >>>>>
> >>>>> J-D
> >>>>>
> >
> >>>>> On Thu, Jul 8, 2010 at 2:03 AM, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
> >>>>>> Morning all. Day 2 begins...
> >>>>>>
> >>>>>> I discussed this with someone else earlier and they pointed out that
> >>>>>> we also have task trackers running on all of those nodes, which will
> >>>>>> affect the amount of memory being used when jobs are being run. Each
> >>>>>> tasktracker had a maximum of 8 maps and 8 reduces configured per node,
> >>>>>> with a JVM Xmx of 512mb each. Clearly this implies a fully utilised
> >>>>>> node will use 8*512mb + 8*512mb = 8GB of memory on tasks alone. That's
> >>>>>> before the datanode does anything, or HBase for that matter.
> >>>>>>
> >>>>>> As such, I've dropped it to 4 maps, 4 reduces per node and reduced the
> >>>>>> Xmx to 256mb, giving a potential maximum task overhead of 2GB per
> >>>>>> node. Running 'vmstat 20' now, under load from mapreduce jobs,
> >>>>>> suggests that the actual free memory is about the same, but the memory
> >>>>>> cache is much much bigger, which presumably is healthier as, in
> >>>>>> theory, that ought to relinquish memory to processes that request it.
> >>>>>>
> >>>>>> Let's see if that does the trick!
> >>>>>>
> >>>>>> ta
> >>>>>>
> >>>>>> Jamie
> >>>>>>
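
For reference, the per-node task arithmetic above is driven by three mapred-site.xml properties on each tasktracker; a rough sketch of checking them (property names from stock Hadoop 0.20, and the path below is the CDH default, so adjust for other installs):

  # per-node task memory ceiling = (map slots + reduce slots) * child heap
  # e.g. 4 maps + 4 reduces at -Xmx256m  =>  8 * 256MB = 2GB for tasks alone
  grep -A1 -E 'mapred.tasktracker.(map|reduce).tasks.maximum|mapred.child.java.opts' \
      /etc/hadoop/conf/mapred-site.xml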
> >>>>>> On 7 July 2010 19:30, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>>>>>> YouAreDead means that the region server's session was expired, GC
> >>>>>>> seems like your major problem. (file problems can happen after a GC
> >>>>>>> sleep because they were moved around while the process was sleeping,
> >>>>>>> you also get the same kind of messages with xcievers issue... sorry
> >>>>>>> for the confusion)
> >>>>>>>
> >>>>>>> By over committing the memory I meant trying to fit too much stuff in
> >>>>>>> the amount of RAM that you have. I guess it's the map and reduce tasks
> >>>>>>> that eat all the free space? Why not lower their number?
> >>>>>>>
> >>>>>>> J-D
> >>>>>>>
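
On the GC side, the usual first step is to turn on GC logging for the region servers and make sure the concurrent collector is enabled; a rough sketch of the hbase-env.sh lines (standard HotSpot flags, and the log path is only an example, so adapt to your layout and restart the region servers afterwards):

  # in conf/hbase-env.sh on each region server
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC"
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"

A long pause in that log right before a YouAreDeadException is the usual sign of a ZooKeeper session expiry.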
> >>>>>>> On Wed, Jul 7, 2010 at 11:22 AM, Jamie Cockrill
> >>>>>>> <jamie.cockrill@gmail.com> wrote:
> >>>>>>>> PS, I've now reset my MAX_FILESIZE back to the default (from the 1GB
> >>>>>>>> I raised it to). It caused me to run into a delightful
> >>>>>>>> 'YouAreDeadException' which looks very related to the Garbage
> >>>>>>>> collection issues on the Troubleshooting page, as my Zookeeper session
> >>>>>>>> expired.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> Jamie
> >>>>>>>>
> >>>>>>>> On 7 July 2010 19:19, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
> >>>>>>>>> By overcommit, do you mean make my overcommit_ratio higher on each box
> >>>>>>>>> (it's at the default 50 at the moment)? What I'm noticing at the moment
> >>>>>>>>> is that hadoop is taking up the vast majority of the memory on the
> >>>>>>>>> boxes.
> >>>>>>>>>
> >>>>>>>>> I found this article:
> >>>>>>>>> http://blog.rapleaf.com/dev/2010/01/05/the-wrath-of-drwho-or-unpredictable-hadoop-memory-usage/
> >>>>>>>>> which Todd, it looks like you replied to. Does this sound like a
> >>>>>>>>> similar problem? No worries if you can't remember, it was back in
> >>>>>>>>> January! This article suggests reducing the amount of memory allocated
> >>>>>>>>> to Hadoop at startup, how would I go about doing this?
> >>>>>>>>>
> >>>>>>>>> Thank you everyone for your patience so far. Sorry if this is taking
> >>>>>>>>> up a lot of your time.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Jamie
> >>>>>>>>>
> >>>>>>>>> On 7 July 2010 19:03, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>>>>>>>>> swappiness at 0 is good, but also don't overcommit your memory!
> >>>>>>>>>>
> >>>>>>>>>> J-D
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Jul 7, 2010 at 10:53 AM, Jamie Cockrill
> >>>>>>>>>> <jamie.cockrill@gmail.com> wrote:
> >>>>>>>>>>> I think you're right.
> >>>>>>>>>>>
> >>>>>>>>>>> Unfortunately the machines are on a separate network to this laptop,
> >>>>>>>>>>> so I'm having to type everything across, apologies if it doesn't
> >>>>>>>>>>> translate well...
> >>>>>>>>>>>
> >>>>>>>>>>> free -m gave:
> >>>>>>>>>>>
> >>>>>>>>>>>          total     used     free
> >>>>>>>>>>> Mem:      7992     7939       53
> >>>>>>>>>>> b/c:               7877      114
> >>>>>>>>>>> Swap:    23415      895    22519
> >>>>>>>>>>>
> >>>>>>>>>>> I did this on another node that isn't being smashed at the moment and
> >>>>>>>>>>> the numbers came out similar, but the buffers/cache free was higher.
> >>>>>>>>>>>
> >>>>>>>>>>> vmstat 20 is giving non-zero si and so's ranging between 3 and just
> >>>>>>>>>>> short of 5000.
> >>>>>>>>>>>
> >>>>>>>>>>> That seems to be it I guess. Hadoop troubleshooting suggests setting
> >>>>>>>>>>> swappiness to 0, is that just a case of changing the value in
> >>>>>>>>>>> /proc/sys/vm/swappiness?
> >>>>>>>>>>>
> >>>>>>>>>>> thanks
> >>>>>>>>>>>
> >>>>>>>>>>> Jamie
> >>>>>>>>>>>
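
To answer the swappiness question above: yes, on stock Linux it is exactly that sysctl. A minimal sketch (run as root on each node; the sysctl.conf line is only needed to survive reboots):

  cat /proc/sys/vm/swappiness                      # check the current value (often 60)
  sysctl -w vm.swappiness=0                        # apply immediately
  echo 'vm.swappiness = 0' >> /etc/sysctl.conf     # persist across reboots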
> >>>>>>>>>>> On 7 July 2010 18:40, Todd Lipcon <todd@cloudera.com> wrote:
> >>>>>>>>>>>> On Wed, Jul 7, 2010 at 10:32 AM, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On the subject of GC and heap, I've left those as defaults. I could
> >>>>>>>>>>>>> look at those if that's the next logical step? Would there be anything
> >>>>>>>>>>>>> in any of the logs that I should look at?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One thing I have noticed is that it does take an absolute age to log
> >>>>>>>>>>>>> in to the DN/RS to restart the RS once it's fallen over, in one
> >>>>>>>>>>>>> instance it took about 10 minutes. These are 8GB, 4 core amd64 boxes
> >>>>>>>>>>>>>
> >>>>>>>>>>>> That indicates swapping. Can you run "free -m" on the node?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also let "vmstat 20" run while running your job and observe the "si" and
> >>>>>>>>>>>> "so" columns. If those are nonzero, it indicates you're swapping, and
> >>>>>>>>>>>> you've oversubscribed your RAM (very easy on 8G machines)
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>>> ta
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Jamie
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 7 July 2010 18:30, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
> >>>>>>>>>>>>> > Bad news, it looks like my xcievers is set as it should be, it's in
> >>>>>>>>>>>>> > the hdfs-site.xml and looking at the job.xml of one of my jobs in the
> >>>>>>>>>>>>> > job-tracker, it's showing that property as set to 2047. I've cat |
> >>>>>>>>>>>>> > grepped one of the datanode logs and although there were a few in
> >>>>>>>>>>>>> > there, they were from a few months ago. I've upped my MAX_FILESIZE on
> >>>>>>>>>>>>> > my table to 1GB to see if that helps (not sure if it will!).
> >>>>>>>>>>>>> >
> >>>>>>>>>>>>> > Thanks,
> >>>>>>>>>>>>> >
> >>>>>>>>>>>>> > Jamie
> >>>>>>>>>>>>> >
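
For the record, MAX_FILESIZE is changed from the hbase shell, and on a 0.20-era shell the table has to be disabled first; a rough sketch with 'mytable' as a placeholder (1073741824 bytes = 1GB):

  # from inside the hbase shell
  disable 'mytable'
  alter 'mytable', METHOD => 'table_att', MAX_FILESIZE => '1073741824'
  enable 'mytable'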
> >>>>>>>>>>>>> > On 7 July 2010 18:12, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>>>>>>>>>>>> >> xcievers exceptions will be in the datanodes' logs, and your problem
> >>>>>>>>>>>>> >> totally looks like it. 0.20.5 will have the same issue (since it's on
> >>>>>>>>>>>>> >> the HDFS side)
> >>>>>>>>>>>>> >>
> >>>>>>>>>>>>> >> J-D
> >>>>>>>>>>>>> >>
> >>>>>>>>>>>>> >> On Wed, Jul 7, 2010 at
10:08 AM, Jamie Cockrill
> >
> >>
> >
> >>>>>>>>>>>>> >> <jamie.cockrill@gmail.com>
wrote:
> >
> >>
> >
> >>>>>>>>>>>>> >>> Hi Todd & JD,
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> Environment:
> >
> >>
> >
> >>>>>>>>>>>>> >>> All (hadoop and HBase)
installed as of karmic-cdh3,
> which
> >
> > means:
> >
> >>
> >
> >>>>>>>>>>>>> >>> Hadoop 0.20.2+228
> >
> >>
> >
> >>>>>>>>>>>>> >>> HBase 0.89.20100621+17
> >
> >>
> >
> >>>>>>>>>>>>> >>> Zookeeper 3.3.1+7
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> Unfortunately my whole
cluster of regionservers have
> now
> >
> >>
> >
> >> crashed, so I
> >
> >>
> >
> >>>>>>>>>>>>> >>> can't really say if
it was swapping too much. There
> is a DEBUG
> >
> >>
> >
> >>>>>>>>>>>>> >>> statement just before
it crashes saying:
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> org.apache.hadoop.hbase.regionserver.wal.HLog:
> closing hlog
> >
> >>
> >
> >> writer in
> >
> >>
> >
> >>>>>>>>>>>>> >>> hdfs://<somewhere
on my HDFS, in /hbase>
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> What follows is:
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> WARN org.apache.hadoop.hdfs.DFSClient:
DataStreamer
> Exception:
> >
> >>
> >
> >>>>>>>>>>>>> >>> org.apache.hadoop.ipc.RemoteException:
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
> >
> > No
> >
> >>
> >
> >> lease
> >
> >>
> >
> >>>>>>>>>>>>> >>> on <file location
as above> File does not exist.
> Holder
> >
> >>
> >
> >>>>>>>>>>>>> >>> DFSClient_-11113603
does not have any open files
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> It then seems to try
and do some error recovery
> (Error Recovery
> >
> >>
> >
> >> for
> >
> >>
> >
> >>>>>>>>>>>>> >>> block null bad datanode[0]
nodes == null), fails
> (Could not get
> >
> >>
> >
> >> block
> >
> >>
> >
> >>>>>>>>>>>>> >>> locations. Source file
"<hbase file as before>" -
> Aborting).
> >
> >>
> >
> >> There is
> >
> >>
> >
> >>>>>>>>>>>>> >>> then an ERROR org.apache...HRegionServer:
Close and
> delete
> >
> >>
> >
> >> failed.
> >
> >>
> >
> >>>>>>>>>>>>> >>> There is then a similar
LeaseExpiredException as
> above.
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> There are then a couple
of messages from
> HRegionServer saying
> >
> >>
> >
> >> that
> >
> >>
> >
> >>>>>>>>>>>>> >>> it's notifying master
of its shutdown and stopping
> itself. The
> >
> >>
> >
> >>>>>>>>>>>>> >>> shutdown hook then
fires and the RemoteException and
> >
> >>
> >
> >>>>>>>>>>>>> >>> LeaseExpiredExceptions
are printed again.
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> ulimit is set to 65000
(it's in the regionserver log,
> printed
> >
> > as
> >
> >>
> >
> >> I
> >
> >>
> >
> >>>>>>>>>>>>> >>> restarted the regionserver),
however I haven't got
> the xceivers
> >
> >>
> >
> >> set
> >
> >>
> >
> >>>>>>>>>>>>> >>> anywhere. I'll give
that a go. It does seem very odd
> as I did
> >
> >>
> >
> >> have a
> >
> >>
> >
> >>>>>>>>>>>>> >>> few of them fall over
one at a time with a few early
> loads, but
> >
> >>
> >
> >> that
> >
> >>
> >
> >>>>>>>>>>>>> >>> seemed to be because
the regions weren't splitting
> properly, so
> >
> >>
> >
> >> all
> >
> >>
> >
> >>>>>>>>>>>>> >>> the traffic was going
to one node and it was being
> overwhelmed.
> >
> >>
> >
> >> Once I
> >
> >>
> >
> >>>>>>>>>>>>> >>> throttled it, after
one load it a region split seemed
> to get
> >
> >>
> >
> >>>>>>>>>>>>> >>> triggered, which flung
regions all over, which made
> subsequent
> >
> >>
> >
> >> loads
> >
> >>
> >
> >>>>>>>>>>>>> >>> much more distributed.
However, perhaps the time-bomb
> was
> >
> >>
> >
> >> ticking...
> >
> >>
> >
> >>>>>>>>>>>>> >>> I'll  have a go at
specifying the xcievers property.
> I'm pretty
> >
> >>
> >
> >>>>>>>>>>>>> >>> certain i've got everything
else covered, except the
> patches as
> >
> >>
> >
> >>>>>>>>>>>>> >>> referenced in the JIRA.
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> I just grepped some
of the log files and didn't get
> an explicit
> >
> >>
> >
> >>>>>>>>>>>>> >>> exception with 'xciever'
in it.
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> I am considering downgrading(?)
to 0.20.5, however
> because
> >
> >>
> >
> >> everything
> >
> >>
> >
> >>>>>>>>>>>>> >>> is installed as per
karmic-cdh3, I'm a bit reluctant
> to do so
> >
> > as
> >
> >>
> >
> >>>>>>>>>>>>> >>> presumably Cloudera
has tested each of these versions
> against
> >
> >>
> >
> >> each
> >
> >>
> >
> >>>>>>>>>>>>> >>> other? And I don't
really want to introduce further
> versioning
> >
> >>
> >
> >> issues.
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> Thanks,
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> Jamie
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>>
> >
> >>
> >
> >>>>>>>>>>>>> >>> On 7 July 2010 17:30, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>>>>>>>>>>>> >>>> Jamie,
> >>>>>>>>>>>>> >>>>
> >>>>>>>>>>>>> >>>> Does your configuration meet the requirements?
> >>>>>>>>>>>>> >>>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
> >>>>>>>>>>>>> >>>>
> >>>>>>>>>>>>> >>>> ulimit and xcievers, if not set, are usually time bombs that blow off
> >>>>>>>>>>>>> >>>> when the cluster is under load.
> >>>>>>>>>>>>> >>>>
> >>>>>>>>>>>>> >>>> J-D
> >>>>>>>>>>>>> >>>>
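
A quick way to audit both of those time bombs on a datanode/regionserver box (the config path below is the CDH default, so adjust as needed, and datanodes need a restart after changing the xcievers value):

  # file descriptor limit for the user running HBase/HDFS; the 1024 default is far too low
  ulimit -n

  # the (misspelled) property name really is "xcievers"; this cluster has it at 2047
  grep -A1 'dfs.datanode.max.xcievers' /etc/hadoop/conf/hdfs-site.xml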
> >>>>>>>>>>>>> >>>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
> >>>>>>>>>>>>> >>>> <jamie.cockrill@gmail.com> wrote:
> >>>>>>>>>>>>> >>>>> Dear all,
> >>>>>>>>>>>>> >>>>>
> >>>>>>>>>>>>> >>>>> My current HBase/Hadoop architecture has HBase region servers on the
> >>>>>>>>>>>>> >>>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
> >>>>>>>>>>>>> >>>>> of region server crashes. The last thing that happens appears to be a
> >>>>>>>>>>>>> >>>>> DroppedSnapshotException, caused by an IOException: could not
> >>>>>>>>>>>>> >>>>> complete write to file <file on HDFS>. I am running it under load;
> >>>>>>>>>>>>> >>>>> how heavy that is I'm not sure how to quantify, but I'm guessing it
> >>>>>>>>>>>>> >>>>> is a load issue.
> >>>>>>>>>>>>> >>>>>
> >>>>>>>>>>>>> >>>>> Is it common practice to put region servers on data-nodes? Is it
> >>>>>>>>>>>>> >>>>> common to see region server crashes when either the HDFS or region
> >>>>>>>>>>>>> >>>>> server (or both) is under heavy load? I'm guessing that is the case as
> >>>>>>>>>>>>> >>>>> I've seen a few similar posts. I've not got a great deal of capacity
> >>>>>>>>>>>>> >>>>> to be separating region servers from HDFS data nodes, but it might be
> >>>>>>>>>>>>> >>>>> an argument I could make.
> >>>>>>>>>>>>> >>>>>
> >>>>>>>>>>>>> >>>>> Thanks
> >>>>>>>>>>>>> >>>>>
> >>>>>>>>>>>>> >>>>> Jamie
> >>>>>>>>>>>>> >>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Todd Lipcon
> >>>>>>>>>>>> Software Engineer, Cloudera
