hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Scanner timeouts
Date Fri, 28 Oct 2016 18:29:14 GMT
bq. with 400 threads hitting HBase at the same time

How many regions are serving the 400 threads ?
How many region servers do you have ?

If the regions are spread relatively evenly across the cluster, the above
may not be a big issue.
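A quick way to answer those questions is the HBase shell's status command: `status 'simple'` prints a per-region-server summary including region counts, and `status 'detailed'` lists every region with its hosting server (requires a running cluster, of course):

```
# from the HBase shell:
status 'simple'     # one line per region server, with region counts and load
status 'detailed'   # every region and the server currently hosting it
```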

On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

> Ok, will do.
>
> So the scanner timeout does not by itself indicate that I’ve missed
> something in how I handle the data. If not an index, should I have made a
> fast lookup “key”? I ask because the timeout change may work but not be
> the optimal solution. The stage that fails is very long compared to other
> stages, and with 400 threads hitting HBase at the same time, this seems
> like something I may need to restructure; any advice would be welcome.
>
> HBase is 1.2.3
>
>
> On Oct 28, 2016, at 10:36 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> For your first question, you need to pass hbase-site.xml (which carries the
> config parameters affecting client operations) to the Spark executors.
>
> bq. missed indexing some column
>
> HBase doesn't have secondary indexes (in the sense of a traditional RDBMS).
>
> Let's see what happens after hbase-site.xml is passed to executors.
>
> BTW, can you tell us the release of HBase you're using?
>
>
>
> On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <pat@occamsmachete.com>
> wrote:
>
> > So to clarify: there are some values in hbase/conf/hbase-site.xml that
> > are needed by the calling code in the Spark driver and executors, and so
> > must be passed using --files to spark-submit? If so I can do this.
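That is the usual approach; a sketch of the invocation (the master URL, class, and jar names here are placeholders — the relevant part is --files shipping hbase-site.xml so executors see the same client timeouts the servers use):

```
spark-submit \
  --master spark://master:7077 \
  --files /etc/hbase/conf/hbase-site.xml \
  --class com.example.MyJob \
  my-job.jar
```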
> >
> > But do I have a deeper issue? Is it typical to need a scan like this?
> > Have I missed indexing some column, maybe?
> >
> >
> > On Oct 28, 2016, at 9:59 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > Mich:
> > bq. on table 'hbase:meta' *at region=hbase:meta,,1.1588230740
> >
> > What you observed was a different issue.
> > The above looks like trouble locating region(s) during the scan.
> >
> > On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> > mich.talebzadeh@gmail.com>
> > wrote:
> >
> >> This is an example I got
> >>
> >> warning: there were two deprecation warnings; re-run with -deprecation
> >> for details
> >> rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77]
> >> at map at <console>:151
> >> defined class columns
> >> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
> >> string]
> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> >> attempts=36, exceptions:
> >> *Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> >> callTimeout=60000, callDuration=68411: row
> >> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' *at
> >> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
> >> seqNum=0
> >> at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
> >> at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210)
> >> at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
> >> at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
> >>
> >>
> >>
> >> Dr Mich Talebzadeh
> >>
> >> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> >> loss, damage or destruction of data or any other property which may arise
> >> from relying on this email's technical content is explicitly disclaimed.
> >> The author will in no case be liable for any monetary damages arising
> >> from such loss, damage or destruction.
> >>
> >>
> >> On 28 October 2016 at 17:52, Pat Ferrel <pat@occamsmachete.com> wrote:
> >>
> >>> I will check that, but if that is a server startup thing I was not
> >>> aware I had to send it to the executors. So it’s like a connection
> >>> timeout from executor code?
> >>>
> >>>
> >>> On Oct 28, 2016, at 9:48 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>
> >>> How did you change the timeout(s) ?
> >>>
> >>> bq. timeout is currently set to 60000
> >>>
> >>> Did you pass hbase-site.xml using --files to Spark job ?
> >>>
> >>> Cheers
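For reference, the two client-side timeouts discussed in this thread live in hbase-site.xml; a hedged example for HBase 1.x (the 600000 ms values are purely illustrative, not a recommendation):

```
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>600000</value> <!-- scanner lease timeout, in ms -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>600000</value> <!-- per-RPC call timeout, in ms -->
</property>
```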
> >>>
> >>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <pat@occamsmachete.com>
> >> wrote:
> >>>
> >>>> Using standalone Spark. I don’t recall seeing connection-lost errors,
> >>>> but there are lots of logs. I’ve set the scanner and RPC timeouts to
> >>>> large numbers on the servers.
> >>>>
> >>>> But I also saw in the logs:
> >>>>
> >>>>  org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> >>>> passed since the last invocation, timeout is currently set to 60000
> >>>>
> >>>> Not sure where that is coming from. Does the driver machine making
> >>>> queries need to have the timeout config also?
> >>>>
> >>>> And why so large, am I doing something wrong?
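The log line above shows the mechanics: the region server keeps a lease per open scanner, and if more time passes between two successive next() calls than the configured timeout, the server expires the scanner; a long-running Spark task between calls is enough to trigger it. A toy sketch of that check (illustrative only, not HBase's actual code):

```python
def scanner_lease_expired(ms_since_last_next, timeout_ms=60000):
    """True if the server would have dropped the scanner lease:
    more time elapsed between two next() calls than the timeout allows."""
    return ms_since_last_next > timeout_ms

# The gap reported in the log (381788 ms) far exceeds the 60000 ms default:
print(scanner_lease_expired(381788))  # True
print(scanner_lease_expired(45000))   # False
```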
> >>>>
> >>>>
> >>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>
> >>>> Mich:
> >>>> The OutOfOrderScannerNextException indicated problem with read from
> >>> hbase.
> >>>>
> >>>> How did you know connection to Spark cluster was lost ?
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> >>>> mich.talebzadeh@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Looks like it lost the connection to the Spark cluster.
> >>>>>
> >>>>> What mode are you using with Spark: Standalone, Yarn, or something
> >>>>> else? The issue looks like a resource manager issue.
> >>>>>
> >>>>> I have seen this when running Zeppelin with Spark on HBase.
> >>>>>
> >>>>> HTH
> >>>>>
> >>>>> Dr Mich Talebzadeh
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 28 October 2016 at 16:38, Pat Ferrel <pat@occamsmachete.com>
> >> wrote:
> >>>>>
> >>>>>> I’m getting data from HBase using a large Spark cluster with
> >>>>>> parallelism of near 400. The query fails quite often with the message
> >>>>>> below. Sometimes a retry will work and sometimes the ultimate failure
> >>>>>> results (below).
> >>>>>>
> >>>>>> If I reduce parallelism in Spark it slows other parts of the
> >>>>>> algorithm unacceptably. I have also experimented with very large
> >>>>>> RPC/Scanner timeouts of many minutes, to no avail.
> >>>>>>
> >>>>>> Any clues about what to look for or what may be set up wrong in my
> >>>>>> tables?
> >>>>>>
> >>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> >>>>>> times, most recent failure: Lost task 44.3 in stage 147.0 (TID 24833,
> >>>>>> ip-172-16-3-9.eu-central-1.compute.internal):
> >>>>>> org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of
> >>>>>> OutOfOrderScannerNextException: was there a rpc timeout?
> >>>>>> at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:403)
> >>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:232)
> >>>>>> at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>
>
