ambari-issues mailing list archives

From "Hoc Phan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMBARI-22701) hive CLI process leak on metastore alert
Date Thu, 28 Dec 2017 00:06:01 GMT

     [ https://issues.apache.org/jira/browse/AMBARI-22701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoc Phan updated AMBARI-22701:
------------------------------
    Description: 
alert_hive_metastore.py leaves orphaned processes running, and they accumulate over time. Below is one example:


{code:none}
1001     593317 593316  0 Dec24 ?        00:00:00 -bash -c export  PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/sbin/:/usr/hdp/current/hive-metastore/bin' ; export HIVE_CONF_DIR="/usr/hdp/current/hive-metastore/conf/conf.server" ; hive --hiveconf hive.metastore.uris=thrift://demo.local:9083 --hiveconf hive.metastore.client.connect.retry.delay=1 --hiveconf hive.metastore.failure.retries=1 --hiveconf hive.metastore.connect.retries=1 --hiveconf hive.metastore.client.socket.timeout=14 --hiveconf hive.execution.engine=mr -e "show databases;"
{code}


Over many months, thousands of these can pile up on the Hive Metastore host. To check, run the following two commands:


{code:none}
ps -ef | grep "[s]how databases" | wc -l
ps h -Led -o user | sort | uniq -c | sort -n
{code}
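The first check can also be scripted without ps and grep. Below is a minimal sketch that assumes a Linux host with /proc. The function names are hypothetical and not part of Ambari. The sketch reads each /proc/<pid>/cmdline, joins the NUL-separated arguments with spaces and looks for "show databases". The grep pipeline needs the [s] trick so it does not match itself. This scanner does not need it, because when it runs from a script file its own command line does not contain the pattern.

```python
import os

PATTERN = "show databases"


def cmdline_matches(raw_cmdline, pattern=PATTERN):
    """Return True if a /proc/<pid>/cmdline blob contains `pattern`.

    The kernel stores argv as NUL-separated strings, so the arguments are
    joined with spaces first, roughly as `ps -ef` displays them.
    """
    args = raw_cmdline.rstrip(b"\x00").split(b"\x00")
    text = " ".join(a.decode("utf-8", "replace") for a in args)
    return pattern in text


def count_matching_processes(pattern=PATTERN, proc_root="/proc"):
    """Count processes whose command line contains `pattern` (Linux /proc only)."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are PIDs
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited or is not readable
        if raw and cmdline_matches(raw, pattern):  # kernel threads have an empty cmdline
            count += 1
    return count
```

On the Metastore host, `count_matching_processes()` should report roughly the same number as the first `ps -ef | grep` pipeline.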


Eventually this hits the nproc limit and crashes other services on the same host.
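On Linux, RLIMIT_NPROC is counted in threads per real user ID. That is presumably why the second command lists threads (-L) rather than processes. Below is a minimal sketch that assumes Linux /proc and Python's resource module. It sums the Threads: field of /proc/<pid>/status per real UID and reads the soft nproc limit. threads_per_uid() and nproc_usage() are hypothetical helpers.

```python
import os
import resource


def threads_per_uid(proc_root="/proc"):
    """Sum the Threads: field of /proc/<pid>/status per real UID (Linux only)."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries
        uid = threads = None
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    if line.startswith("Uid:"):
                        uid = int(line.split()[1])  # first value is the real UID
                    elif line.startswith("Threads:"):
                        threads = int(line.split()[1])
        except (OSError, ValueError):
            continue  # process exited, unreadable, or unexpected format
        if uid is not None and threads is not None:
            counts[uid] = counts.get(uid, 0) + threads
    return counts


def nproc_usage(uid=None, proc_root="/proc"):
    """Return (threads owned by uid, soft RLIMIT_NPROC of *this* process).

    The limit is read from the calling process, so it only matches the limit
    of `uid` when the script runs as that user.
    """
    if uid is None:
        uid = os.getuid()
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return threads_per_uid(proc_root).get(uid, 0), soft
```

When `soft` equals `resource.RLIM_INFINITY`, no nproc limit applies. Otherwise, a `used` value close to `soft` means new threads for that user are about to fail.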

The fixes are:
1. Run the check as the "hive" user instead of the "ambari-qa" user:
https://issues.apache.org/jira/browse/AMBARI-22142

2. Switch from the hive CLI to beeline:
https://issues.apache.org/jira/browse/AMBARI-17006

For unknown reasons, the hive CLI processes never get killed and keep lingering.
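The issue does not say why the processes survive. One general technique, which is not part of the proposed fix, is to start a check in its own process group and kill the whole group on timeout. That way, a wrapper shell like the `-bash -c ... ; hive ...` process above cannot leave children behind. Below is a minimal sketch that assumes Python 3.3+, since subprocess timeouts and start_new_session are not available in Python 2. run_with_group_timeout() is a hypothetical helper.

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout):
    """Run `cmd` as the leader of a new process group and, if it exceeds
    `timeout` seconds, SIGKILL the entire group rather than only the direct child.

    Returns (returncode, output); returncode is None when the timeout fired.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # setsid(): new session and process group, pgid == pid
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        try:
            # Reaches the wrapper shell and everything it started in the same group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited between the timeout and the kill
        proc.communicate()  # reap the child and close the pipe
        return None, b""
```

For example, `run_with_group_timeout(["bash", "-c", cmd], timeout=60)` returns `(None, b"")` when the check hangs, and nothing from that group is left running.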

Proposed fix to alert_hive_metastore.py in /var/lib/ambari-server/resources/common-services/HIVE/0.12.0.2.0/package/alerts:

Instructions:

1. Add the following lines after "HIVE_METASTORE_URIS_KEY = '{{hive-site/hive.metastore.uris}}'":

{code:python}
HIVE_SERVER_THRIFT_PORT_KEY = '{{hive-site/hive.server2.thrift.port}}'
HIVE_SERVER_THRIFT_HTTP_PORT_KEY = '{{hive-site/hive.server2.thrift.http.port}}'
HIVE_SERVER_TRANSPORT_MODE_KEY = '{{hive-site/hive.server2.transport.mode}}'
THRIFT_PORT_DEFAULT = 10000
HIVE_SERVER_TRANSPORT_MODE_DEFAULT = 'binary'
{code}

2. Change SMOKEUSER_DEFAULT = 'ambari-qa' to:

{code:python}
SMOKEUSER_DEFAULT = 'hive'
{code}

3. Replace

{code:python}
  return (SECURITY_ENABLED_KEY,SMOKEUSER_KEYTAB_KEY,SMOKEUSER_PRINCIPAL_KEY, HIVE_METASTORE_URIS_KEY,
    SMOKEUSER_KEY, KERBEROS_EXECUTABLE_SEARCH_PATHS_KEY, STACK_ROOT)
{code}

with this:

{code:python}
  return (SECURITY_ENABLED_KEY,SMOKEUSER_KEYTAB_KEY,SMOKEUSER_PRINCIPAL_KEY, HIVE_METASTORE_URIS_KEY,
    SMOKEUSER_KEY, KERBEROS_EXECUTABLE_SEARCH_PATHS_KEY, STACK_ROOT, HIVE_SERVER_THRIFT_PORT_KEY,
    HIVE_SERVER_THRIFT_HTTP_PORT_KEY, HIVE_SERVER_TRANSPORT_MODE_KEY)
{code}

4. Replace this

{code:python}
  return (HIVE_METASTORE_URIS_KEY, HADOOPUSER_KEY)
{code}

with this:

{code:python}
  return (HIVE_SERVER_THRIFT_PORT_KEY, HIVE_SERVER_THRIFT_HTTP_PORT_KEY, HIVE_SERVER_TRANSPORT_MODE_KEY,
    HIVE_METASTORE_URIS_KEY, HADOOPUSER_KEY)
{code}


5. Comment out these lines, because they keep injecting the ambari-qa user back (the configured smoke user overrides SMOKEUSER_DEFAULT):

{code:python}
  #if SMOKEUSER_KEY in configurations:
  #  smokeuser = configurations[SMOKEUSER_KEY]
{code}

6. Replace this code block:

{code:python}
    cmd = format("export HIVE_CONF_DIR='{conf_dir}' ; "
                 "hive --hiveconf hive.metastore.uris={metastore_uri}\
                 --hiveconf hive.metastore.client.connect.retry.delay=1\
                 --hiveconf hive.metastore.failure.retries=1\
                 --hiveconf hive.metastore.connect.retries=1\
                 --hiveconf hive.metastore.client.socket.timeout=14\
                 --hiveconf hive.execution.engine=mr -e 'show databases;'")
{code}

with this block:

{code:python}
    transport_mode = HIVE_SERVER_TRANSPORT_MODE_DEFAULT
    if HIVE_SERVER_TRANSPORT_MODE_KEY in configurations:
      transport_mode = configurations[HIVE_SERVER_TRANSPORT_MODE_KEY]

    port = THRIFT_PORT_DEFAULT
    if transport_mode.lower() == 'binary' and HIVE_SERVER_THRIFT_PORT_KEY in configurations:
      port = int(configurations[HIVE_SERVER_THRIFT_PORT_KEY])
    elif transport_mode.lower() == 'http' and HIVE_SERVER_THRIFT_HTTP_PORT_KEY in configurations:
      port = int(configurations[HIVE_SERVER_THRIFT_HTTP_PORT_KEY])

    cmd = format("export HIVE_CONF_DIR='{conf_dir}' ; "
                 "beeline -u jdbc:hive2://{host_name}:{port}/\
                 --hiveconf hive.metastore.client.connect.retry.delay=1\
                 --hiveconf hive.metastore.failure.retries=1\
                 --hiveconf hive.metastore.connect.retries=1\
                 --hiveconf hive.metastore.client.socket.timeout=14\
                 --hiveconf hive.execution.engine=mr -e 'show databases;'")
{code}
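In HTTP transport mode, a HiveServer2 JDBC URL normally also needs the ;transportMode=http;httpPath=<path> session variables, which the beeline command above does not add. Below is a minimal sketch of the step 6 port selection as a standalone function that also appends those variables. HIVE_SERVER_HTTP_PATH_KEY and HTTP_PATH_DEFAULT are assumptions that do not appear in the proposal. Like the proposal, the sketch assumes HiveServer2 listens on the alert's own host_name. A new key would presumably also have to be added to the token tuples in steps 3 and 4.

```python
# Token names from step 1 of the proposal.
HIVE_SERVER_THRIFT_PORT_KEY = '{{hive-site/hive.server2.thrift.port}}'
HIVE_SERVER_THRIFT_HTTP_PORT_KEY = '{{hive-site/hive.server2.thrift.http.port}}'
HIVE_SERVER_TRANSPORT_MODE_KEY = '{{hive-site/hive.server2.transport.mode}}'
THRIFT_PORT_DEFAULT = 10000
HIVE_SERVER_TRANSPORT_MODE_DEFAULT = 'binary'

# Hypothetical additions, not part of the proposal: the HTTP endpoint path.
HIVE_SERVER_HTTP_PATH_KEY = '{{hive-site/hive.server2.thrift.http.path}}'
HTTP_PATH_DEFAULT = 'cliservice'  # HiveServer2's default for hive.server2.thrift.http.path


def build_jdbc_url(configurations, host_name):
    """Choose the HiveServer2 port for the transport mode and build a beeline -u URL."""
    transport_mode = configurations.get(
        HIVE_SERVER_TRANSPORT_MODE_KEY, HIVE_SERVER_TRANSPORT_MODE_DEFAULT).lower()

    # Same port selection as step 6; it falls back to 10000 in both modes.
    port = THRIFT_PORT_DEFAULT
    if transport_mode == 'binary' and HIVE_SERVER_THRIFT_PORT_KEY in configurations:
        port = int(configurations[HIVE_SERVER_THRIFT_PORT_KEY])
    elif transport_mode == 'http' and HIVE_SERVER_THRIFT_HTTP_PORT_KEY in configurations:
        port = int(configurations[HIVE_SERVER_THRIFT_HTTP_PORT_KEY])

    url = 'jdbc:hive2://{0}:{1}/'.format(host_name, port)
    if transport_mode == 'http':
        # Without these session variables the driver would use binary Thrift.
        http_path = configurations.get(HIVE_SERVER_HTTP_PATH_KEY, HTTP_PATH_DEFAULT)
        url += ';transportMode=http;httpPath={0}'.format(http_path)
    return url
```

The HTTP form contains semicolons, so the URL would have to be quoted inside the shell command, e.g. `beeline -u '{jdbc_url}'` in the format() string. Unquoted, bash would split the command at each `;`.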



> hive CLI process leak on metastore alert
> ----------------------------------------
>
>                 Key: AMBARI-22701
>                 URL: https://issues.apache.org/jira/browse/AMBARI-22701
>             Project: Ambari
>          Issue Type: Bug
>          Components: alerts
>    Affects Versions: 2.4.0
>         Environment: CentOS 6.9
> Ambari 2.4.0.1
> Hortonworks Hadoop 2.5.0.0-1245
> Hive installed
> Tez installed
>            Reporter: Hoc Phan
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
