impala-user mailing list archives

From Piyush Narang <p.nar...@criteo.com>
Subject Re: Debugging Impala query that consistently hangs
Date Tue, 13 Feb 2018 19:50:29 GMT
Yeah, unfortunately all I see in our Hive metastore logs (our setup isn't ideal; we pipe the
logs to Kibana and they get split up across lines ☹) is:
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,078 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(401)) - ugi=foo@bar  ip=/1.2.3.4  cmd=alter_partitions : db=bossdb_perf tbl=bi_arbitrage_full
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, AS,
0, 0]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 10489)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:735)
    at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:730)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth...
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, AS,
0, 1]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, AS,
1, 0]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, EU,
1, 0]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:730)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, EU,
1, 1]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,092 INFO log.PerfLogger (PerfLogger.java:PerfLogEnd(168))
- </PERFLOG method=alter_partitions start=1518539013077 end=1518539013092 duration=15 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler
threadId=25300 retryCount=-1 error=true>
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,077 INFO log.PerfLogger (PerfLogger.java:PerfLogBegin(127))
- <PERFLOG method=alter_partitions from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,078 INFO metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(832)) - 25300: alter_partitions : db=bossdb_perf tbl=bi_arbitrage_full
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, EU,
0, 1]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, US,
0, 1]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, US,
1, 1]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, AS,
1, 1]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, EU,
0, 0]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, US,
1, 0]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,079 INFO metastore.HiveMetaStore
(HiveMetaStore.java:alter_partitions(3648)) - New partition values:[1, 2017-10-04, 0, US,
0, 0]
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 16:23:33,092 ERROR metastore.RetryingHMSHandler (RetryingHMSHandler.java:invokeInternal(195)) - InvalidOperationException(message:Alter partition operation failed: NoSuchObjectException(message:partition values=[1, 2017-10-04, 0, AS, 0, 0]))
    at org.apache.hadoop.hive.metastore.HiveAlterHandler.alterPartitions(HiveAlterHandler.java:631)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_partitions(HiveMetaStore.java:3659)
    at sun.reflect.GeneratedMethodAccessor368.invoke(Unknown Source)
    at sun.reflect.DelegatingMethod...
Feb 13 11:23:33 1c-98-ec-1d-dd-84 hivemetastore: [ ] 43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:140)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
    at com.sun.proxy.$Proxy11.alter_partitions(Unknown Source)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$alter_partitions.getResult(ThriftHiveMetastore.java:10505)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetast...


-- Piyush


From: Alexander Behm <alex.behm@cloudera.com>
Reply-To: "user@impala.apache.org" <user@impala.apache.org>
Date: Tuesday, February 13, 2018 at 1:03 PM
To: "user@impala.apache.org" <user@impala.apache.org>
Subject: Re: Debugging Impala query that consistently hangs

If you are willing to share, it would be great to get some details on why compute stats failed.

On Tue, Feb 13, 2018 at 7:57 AM, Piyush Narang <p.narang@criteo.com>
wrote:
Thanks Mostafa, that did help. I was able to compute stats on bi_dim_campaign and advertiser_event_rich,
and I updated the row counts manually for the arbitrage table. I do see the query completing successfully
now. I'll dig into the stats issue on the bigger arbitrage table over the next couple of days. The last
time we ran into this we didn't see any useful logs on the Hive server side; our Hive team has updated
Hive since then, so I'll try this again and investigate a bit more.
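
For what it's worth, to sanity-check what the planner now sees after the manual alters, I think the
standard SHOW statements are enough (table names taken from the plan further down the thread):

show table stats bi_arbitrage_full;
show column stats bi_arbitrage_full;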

-- Piyush


From: Mostafa Mokhtar <mmokhtar@cloudera.com>
Reply-To: "user@impala.apache.org" <user@impala.apache.org>
Date: Monday, February 12, 2018 at 5:11 PM
To: "user@impala.apache.org" <user@impala.apache.org>
Subject: Re: Debugging Impala query that consistently hangs

Hi Piyush,

Please run the "alter table" commands below to set the number of rows per table. Doing that
should yield better plans than what you currently have and avoid building a hash table that
consumes lots of memory.

You can also set the number of distinct values for the columns involved in joins, aggregations,
and predicates (there's a rough example of the column-level syntax after the alter table statements below).

None of this manual work is needed if "compute stats foo" is run successfully.

alter table bi_dim_campaign set tblproperties ('numRows'='213');
alter table advertiser_event_rich set tblproperties ('numRows'='177000');
alter table bi_arbitrage_full set tblproperties ('numRows'='2170000000');
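
For the column-level stats the manual route looks roughly like the statements below. The column
names and numDVs values here are only placeholders; substitute the actual join, group-by, and
predicate columns and your own estimates:

alter table bi_arbitrage_full set column stats user_id ('numDVs'='50000000');
alter table advertiser_event_rich set column stats uid ('numDVs'='1000000');
alter table bi_dim_campaign set column stats campaign_id ('numDVs'='213');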

https://www.cloudera.com/documentation/enterprise/5-10-x/topics/impala_perf_stats.html#perf_table_stats_manual
https://www.cloudera.com/documentation/enterprise/5-10-x/topics/impala_perf_stats.html#perf_column_stats_manual

P.S. On tables with a large number of partitions, I have seen the HMS hit various scalability
limitations in the JVM and the backing store.



On Mon, Feb 12, 2018 at 6:55 PM, Tim Armstrong <tarmstrong@cloudera.com>
wrote:
Let us know if we can help figure out what went wrong with compute stats.

- Tim

On Mon, Feb 12, 2018 at 6:07 AM, Piyush Narang <p.narang@criteo.com>
wrote:
Got it, thanks for the explanation, Tim. I'll chase down the compute stats issue for that table.

-- Piyush


From: Tim Armstrong <tarmstrong@cloudera.com>
Reply-To: "user@impala.apache.org" <user@impala.apache.org>
Date: Sunday, February 11, 2018 at 2:31 PM

To: "user@impala.apache.org" <user@impala.apache.org>
Subject: Re: Debugging Impala query that consistently hangs

Piyush,

  I can't recommend strongly enough that you figure out how to get compute stats working.
You will not have a good experience with Impala without statistics; there's no way you will
get good plans for all your queries.

- Tim

On Fri, Feb 9, 2018 at 11:25 AM, Piyush Narang <p.narang@criteo.com>
wrote:
Thanks Tim. I had issues running compute stats on some of our tables (calling alter table
on Hive was failing and I wasn’t able to resolve it) and I think this was one of them. I’ll
try switching over to a shuffle join and see if that helps.
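
If I have the hint syntax right, the change is just adding the hint after the join keyword, along
these lines (a sketch only; I'm guessing the join key from the HASH(uid) / HASH(user_id) exchanges
in the plan below):

select ...
from advertiser_event_rich e
  left outer join /* +shuffle */ bi_arbitrage_full a
    on e.uid = a.user_id
where ...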

-- Piyush


From: Tim Armstrong <tarmstrong@cloudera.com>
Reply-To: "user@impala.apache.org" <user@impala.apache.org>
Date: Friday, February 9, 2018 at 12:24 PM

To: "user@impala.apache.org" <user@impala.apache.org>
Subject: Re: Debugging Impala query that consistently hangs

I suspect it's busy building the hash tables in the join with id=7. If you drill down into
the profile I suspect you'll see a bunch of time spent there. The top-level time counter isn't
necessarily updated live for the time spent building the hash tables, but the fact it's using
179GB of memory is a big hint that it's building some big hash tables.

The plan you're getting is really terrible, btw. That join has > 2B rows on the right side
and 0 rows on the left side, which is the exact opposite of what you want.

I'd suggest running compute stats on the input tables to get a better plan. I suspect that
will solve your problem.
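
Concretely, something like the following, using the three tables from your plan (compute stats on
the big fact table will take a while, but it only needs to be rerun when the data changes
substantially):

compute stats bi_dim_campaign;
compute stats advertiser_event_rich;
compute stats bi_arbitrage_full;

After that, re-running EXPLAIN on the query should show the 2B-row table on the probe side of that
join rather than the build side.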

On Thu, Feb 8, 2018 at 12:06 PM, Piyush Narang <p.narang@criteo.com>
wrote:
Yeah, like I mentioned, the summary tab isn't getting updated either:
Operator              #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
---------------------------------------------------------------------------------------------------------
22:AGGREGATE               1  142.755us  142.755us        0           1    4.00 KB       10.00 MB  FINALIZE
21:EXCHANGE                1    0.000ns    0.000ns        0           1          0              0  UNPARTITIONED
10:AGGREGATE               1  126.704us  126.704us        0           1    4.00 KB       10.00 MB
20:AGGREGATE               1  312.493us  312.493us        0          -1  119.12 KB      128.00 MB  FINALIZE
19:EXCHANGE                1    0.000ns    0.000ns        0          -1          0              0  HASH(day,country,…)
09:AGGREGATE               1  216.614us  216.614us        0          -1  119.12 KB      128.00 MB  STREAMING
18:AGGREGATE               1  357.918us  357.918us        0          -1  170.12 KB      128.00 MB  FINALIZE
17:EXCHANGE                1    0.000ns    0.000ns        0          -1          0              0  HASH(…)
08:AGGREGATE               1   27.985us   27.985us        0          -1  170.12 KB      128.00 MB  STREAMING
07:HASH JOIN               1    0.000ns    0.000ns        0          -1  179.72 GB        2.00 GB  LEFT OUTER JOIN, PARTITIONED
|--16:EXCHANGE             1      6m47s      6m47s    2.17B          -1          0              0  HASH(user_id)
|  05:HASH JOIN           22    8s927ms   14s258ms    2.17B          -1   68.07 MB        2.00 GB  LEFT OUTER JOIN, BROADCAST
|  |--14:EXCHANGE         22   11s626ms   11s844ms    1.08M          -1          0              0  BROADCAST
|  |  04:SCAN HDFS         2  103.838ms  138.573ms    1.08M          -1   10.48 MB       96.00 MB  dim_publisher pub
|  03:SCAN HDFS           22      8m40s      10m9s    2.17B          -1    1.03 GB      616.00 MB  bi_arbitrage_full a
15:EXCHANGE                1   22s489ms   22s489ms        0          -1          0              0  HASH(uid)
06:HASH JOIN               1   51.613ms   51.613ms   88.70K          -1   46.04 MB        2.00 GB  INNER JOIN, BROADCAST
|--13:EXCHANGE             1   22s928ms   22s928ms  177.30K          -1          0              0  BROADCAST
|  02:SCAN HDFS           22   14s311ms   21s235ms  177.30K          -1  798.47 MB      440.00 MB  advertiser_event_rich
12:AGGREGATE               1    7.971ms    7.971ms       36          -1   36.18 MB      128.00 MB  FINALIZE
11:EXCHANGE                1    1s892ms    1s892ms       56          -1          0              0  HASH(..)
01:AGGREGATE               1    0.000ns    0.000ns       56          -1   35.73 MB      128.00 MB  STREAMING
00:SCAN HDFS               1    2s012ms    2s012ms      213          -1    3.34 MB      128.00 MB  bi_dim_campaign


-- Piyush


From: Jeszy <jeszyb@gmail.com>
Reply-To: "user@impala.apache.org" <user@impala.apache.org>
Date: Thursday, February 8, 2018 at 2:59 PM
To: "user@impala.apache.org" <user@impala.apache.org>
Subject: Re: Debugging Impala query that consistently hangs

Not sure that's what you're referring to, but scan progress isn't necessarily indicative of
overall query progress. Can you attach the text profile of the cancelled query?
If you cannot upload attachments, the Summary section is the best starting point, so please
include that.

On 8 February 2018 at 20:53, Piyush Narang <p.narang@criteo.com>
wrote:
Hi folks,

I have a query that I'm running on Impala that consistently stops making progress after reaching
45-50%. It stays at that split count for a couple of hours (before I cancel it). I don't see any
progress on the summary page either. I'm running 2.11.0-cdh5.14.0 RELEASE (build
d68206561bce6b26762d62c01a78e6cd27aa7690). It appears to be stuck at an exchange hash step.
Has anyone run into this in the past? Any suggestions on the best way to debug this? (I could take
stack dumps on the coordinator / workers, but I'm not sure if there's a better way.)

Thanks,

-- Piyush






