hbase-user mailing list archives

From Lucian Iordache <lucian.george.iorda...@gmail.com>
Subject Re: Lease does not exist exceptions
Date Thu, 27 Oct 2011 07:35:48 GMT
Yep, I did, but it did not work entirely.

I had a job to run on 1000 regions, with the scanner caching set to 200. The
job crashed with a lot of ClosedChannelExceptions and LeaseExceptions.

Set the caching to 10 ==> the same.
Set the caching to 1 ==> ~600 successfully completed tasks, but still a lot
of them crashed ==> job crashed
Set the hbase.rpc.timeout to 240000 (which is the lease timeout on the
region server) ==> the job completed successfully, without any failed
attempts.

The problem was that we have some very large regions (2 GB), and some of
them have very little data, which is why it takes more than 60 seconds to
get even the first row. As Daniel said, the documentation for the region
server lease timeout and for hbase.rpc.timeout should warn people to be
careful when modifying them, because you can run into problems like the one
in our case.
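
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch.
It is not HBase code and uses no HBase API. The class name, the method names
and the numbers (rows scanned per match, rows scanned per second) are all
made up for illustration; only the 60000 and 240000 timeouts come from this
thread. It assumes that a next() call has to scan rows on the region server
until it has collected "caching" matching rows, and that the client gives up
once that takes longer than hbase.rpc.timeout.

```java
/** Illustrative timing model only; every name and number here is hypothetical. */
final class ScanTimingSketch {

    /**
     * Estimates how long one next() call spends on the region server, in
     * milliseconds, assuming the server must scan rowsScannedPerMatch rows
     * to find each row that it returns.
     */
    public static long estimateNextCallMillis(int caching,
                                              long rowsScannedPerMatch,
                                              long rowsScannedPerSecond) {
        if (caching <= 0 || rowsScannedPerMatch <= 0 || rowsScannedPerSecond <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        long rowsToScan = caching * rowsScannedPerMatch;
        return rowsToScan * 1000L / rowsScannedPerSecond;
    }

    /** The client gives up when the call outlasts its RPC timeout. */
    public static boolean clientTimesOut(long nextCallMillis, long rpcTimeoutMillis) {
        return nextCallMillis > rpcTimeoutMillis;
    }

    public static void main(String[] args) {
        long defaultRpcTimeoutMillis = 60_000L;  // the 60 second default mentioned in this thread
        long raisedRpcTimeoutMillis = 240_000L;  // the value that made the job pass

        // Hypothetical workload: one match per 10 million rows, 100k rows scanned per second.
        long firstRowMillis = estimateNextCallMillis(1, 10_000_000L, 100_000L);

        System.out.println("estimated next() time with caching=1: " + firstRowMillis + " ms");
        System.out.println("times out at 60000:  "
                + clientTimesOut(firstRowMillis, defaultRpcTimeoutMillis));
        System.out.println("times out at 240000: "
                + clientTimesOut(firstRowMillis, raisedRpcTimeoutMillis));
    }
}
```

With numbers like these, even a caching of 1 does not help, because the
first matching row alone takes longer than the default timeout. Lowering the
caching only shortens a call when matches are reasonably frequent.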

Regards,
Lucian

On Wed, Oct 26, 2011 at 7:53 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Did you try setting the scanner caching down like I mentioned?
>
> J-D
>
> On Wed, Oct 26, 2011 at 8:48 AM, Lucian Iordache
> <lucian.george.iordache@gmail.com> wrote:
> > Problem solved. It was like I said, the server took more than the
> > hbase.rpc.timeout to run the call and the client closed the connection.
> >
> > Best Regards,
> > Lucian
> >
> > On Tue, Oct 25, 2011 at 11:15 AM, Lucian Iordache <
> > lucian.george.iordache@gmail.com> wrote:
> >
> >> Yes, I will try to find the SocketTimeoutException after setting the log
> >> level to debug because, as it says here
> >> https://issues.apache.org/jira/browse/HBASE-3154 , it is logged at debug
> >> level on the client side.
> >>
> >> Regards,
> >> Lucian
> >>
> >>
> >> On Mon, Oct 24, 2011 at 8:22 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>
> >>> So you should see the SocketTimeoutException in your *client* logs (in
> >>> your case, mappers), not LeaseException. At this point, yes, you're
> >>> going to time out, but if you spend so much time cycling on the server
> >>> side then you shouldn't set a high caching configuration on your
> >>> scanner, as IO isn't your bottleneck.
> >>>
> >>> J-D
> >>>
> >>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
> >>> <lucian.george.iordache@gmail.com> wrote:
> >>> > Hi,
> >>> >
> >>> > The servers have been restarted (I have had this configuration for more
> >>> > than a month, so this is not the problem).
> >>> > About the stack traces, they show exactly the same thing: a lot of
> >>> > ClosedChannelExceptions and LeaseExceptions.
> >>> >
> >>> > But I found something that could be the problem: hbase.rpc.timeout. This
> >>> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> >>> > could happen in the following way:
> >>> > - the mapper makes a scanner.next call to the region server
> >>> > - the region server needs more than 60 seconds to execute it (I use
> >>> > multiple filters, and it could take a lot of time)
> >>> > - the scan client gets the timeout and cuts the connection
> >>> > - the region server tries to send the results to the client ==>
> >>> > ClosedChannelException
> >>> >
> >>> > I will take a deeper look into it tomorrow. If you have other
> >>> > suggestions, please let me know!
> >>> >
> >>> > Thanks,
> >>> > Lucian
> >>> >
> >>> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>> >
> >>> >> Did you restart the region servers after changing the config?
> >>> >>
> >>> >> Are you sure it's the same exception/stack trace?
> >>> >>
> >>> >> J-D
> >>> >>
> >>> >> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
> >>> >> <lucian.george.iordache@gmail.com> wrote:
> >>> >> > Hi all,
> >>> >> >
> >>> >> > I have exactly the same problem that Eran had.
> >>> >> > But there is something I don't understand: in my case, I have set the
> >>> >> > lease time to 240000 (4 minutes). But most of the map tasks that are
> >>> >> > failing run for about 2 minutes. How is it possible to get a
> >>> >> > LeaseException if the task runs for less than the configured lease time?
> >>> >> >
> >>> >> > Regards,
> >>> >> > Lucian Iordache
> >>> >> >
> >>> >> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <eran@gigya.com> wrote:
> >>> >> >
> >>> >> >> Perfect! Thanks.
> >>> >> >>
> >>> >> >> -eran
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>> >> >>
> >>> >> >> > hbase.regionserver.lease.period
> >>> >> >> >
> >>> >> >> > Set it bigger than 60000.
> >>> >> >> >
> >>> >> >> > J-D
> >>> >> >> >
> >>> >> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <eran@gigya.com> wrote:
> >>> >> >> > >
> >>> >> >> > > Thanks J-D!
> >>> >> >> > > Since my main table is expected to continue growing, I guess at some
> >>> >> >> > > point even setting the cache size to 1 will not be enough. Is there a
> >>> >> >> > > way to configure the lease timeout?
> >>> >> >> > >
> >>> >> >> > > -eran
> >>> >> >> > >
> >>> >> >> > >
> >>> >> >> > >
> >>> >> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>> >> >> > >
> >>> >> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <eran@gigya.com> wrote:
> >>> >> >> > > >
> >>> >> >> > > > > Hi J-D,
> >>> >> >> > > > > Thanks for the detailed explanation.
> >>> >> >> > > > > So if I understand correctly, the lease we're talking about is a
> >>> >> >> > > > > scanner lease and the timeout is between two scanner calls, correct?
> >>> >> >> > > > > I think that makes sense, because I now realize that the jobs that
> >>> >> >> > > > > fail (some jobs continued to fail even after reducing the number of
> >>> >> >> > > > > map tasks as Stack suggested) use filters to fetch relatively few
> >>> >> >> > > > > rows out of a very large table, so they could be spending a lot of
> >>> >> >> > > > > time on the region server scanning rows until they reached my
> >>> >> >> > > > > setCaching value, which was 1000. Setting the caching value to 1
> >>> >> >> > > > > seems to allow these jobs to complete.
> >>> >> >> > > > > I think it has to be the above, since my rows are small, with just
> >>> >> >> > > > > a few columns, and processing them is very quick.
> >>> >> >> > > > >
> >>> >> >> > > >
> >>> >> >> > > > Excellent!
> >>> >> >> > > >
> >>> >> >> > > >
> >>> >> >> > > > >
> >>> >> >> > > > > However, there are still a couple of things I don't understand:
> >>> >> >> > > > > 1. What is the difference between setCaching and setBatch?
> >>> >> >> > > > >
> >>> >> >> > > >
> >>> >> >> > > > * setBatch: Set the maximum number of values to return for each call
> >>> >> >> > > > to next()
> >>> >> >> > > >
> >>> >> >> > > > VS
> >>> >> >> > > >
> >>> >> >> > > > * setCaching: Set the number of rows for caching that will be passed
> >>> >> >> > > > to scanners.
> >>> >> >> > > >
> >>> >> >> > > > The former is useful if you have rows with millions of columns and you
> >>> >> >> > > > could setBatch to get only 1000 of them at a time. You could call that
> >>> >> >> > > > intra-row scanning.
> >>> >> >> > > >
> >>> >> >> > > >
> >>> >> >> > > > > 2. Examining the region server logs more closely than I did yesterday,
> >>> >> >> > > > > I see a lot of ClosedChannelExceptions in addition to the expired
> >>> >> >> > > > > leases (but no UnknownScannerException). Is that expected? You can see
> >>> >> >> > > > > an excerpt of the log from one of the region servers here:
> >>> >> >> > > > > http://pastebin.com/NLcZTzsY
> >>> >> >> > > >
> >>> >> >> > > >
> >>> >> >> > > > It means that when the server got to process that client request and
> >>> >> >> > > > started reading from the socket, the client was already gone. Killing a
> >>> >> >> > > > client does that (or killing a MR that scans), so does
> >>> >> >> > > > SocketTimeoutException. This should probably go in the book. We should
> >>> >> >> > > > also print something nicer :)
> >>> >> >> > > >
> >>> >> >> > > > J-D
> >>> >> >> > > >
> >>> >> >> >
> >>> >> >>
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>
> >>
> >>
> >> --
> >> Numai bine,
> >> Lucian
> >>
> >
> >
> >
> > --
> > Numai bine,
> > Lucian
> >
>



-- 
Numai bine,
Lucian
