hive-issues mailing list archives

From "Subramanyam Pattipaka (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability
Date Fri, 22 Sep 2017 17:50:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Subramanyam Pattipaka updated HIVE-15947:
-----------------------------------------
    Description: 
Currently the Templeton service doesn't restrict the number of job operation requests. It simply accepts
and tries to run all operations. If a large number of concurrent job submit requests arrive, then
the time to submit job operations can increase significantly. Templeton uses HDFS to store
the staging files for a job. If HDFS storage can't respond to a large number of requests and throttles,
then job submission can take a very long time, on the order of minutes.

This behavior may not be suitable for all applications. Client applications may be looking
for a predictable, low response time for successful requests, or for a throttle response that tells
the client to wait for some time before re-requesting the job operation.

In this JIRA, I am trying to address the following job operations:
1) Submit new Job
2) Get Job Status
3) List jobs

These three operations have different complexity because they differ in their use of cluster
resources like YARN/HDFS.

The idea is to introduce a new config, templeton.parallellism.job.submit, which controls the maximum
number of concurrent active job submissions within Templeton, and to use this config to keep
response times under control. If a new job submission request sees that there are already templeton.parallellism.job.submit
jobs being submitted concurrently, then the request will fail with HTTP error 503 and the reason


   “Too many concurrent job submission requests received. Please wait for some time before
retrying.”
 
The client is expected to catch this response and retry after waiting for some time. The default
value for the config templeton.parallellism.job.submit is ‘0’. This means that by default
job submission requests are always accepted. The limit needs to be enabled based on requirements.
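
A client-side retry loop for the 503 response could look roughly like the sketch below. It is only an illustration: the ServiceBusy exception, the submit_with_retry helper, and the exponential backoff schedule are assumptions made for this example, not part of the patch. The description only says the client should wait "for some time" before retrying.

```python
import time


class ServiceBusy(Exception):
    """Hypothetical stand-in for an HTTP 503 response from the service."""


def submit_with_retry(submit, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call submit(); on ServiceBusy, wait and try again.

    The delay doubles after every failed attempt (an assumed policy).
    The last ServiceBusy is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return submit()
        except ServiceBusy:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The sleep function is passed in so that callers can substitute their own waiting strategy.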

We can have similar behavior for the Status and List operations with the configs templeton.parallellism.job.status
and templeton.parallellism.job.list respectively.
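
The admission check described above can be sketched as a small counter guarded by a lock. This is a minimal illustration in Python, not the Templeton implementation: OperationLimiter and TooManyRequests are hypothetical names, and the only behavior taken from the description is that a limit of 0 accepts every request and that a rejected request maps to HTTP 503.

```python
import threading


class TooManyRequests(Exception):
    """Hypothetical exception that a web layer would map to HTTP 503."""
    status = 503


class OperationLimiter:
    """Rejects an operation when max_concurrent operations are already running."""

    def __init__(self, max_concurrent, message):
        self._max = max_concurrent  # 0 means "no limit", as in the description
        self._message = message
        self._active = 0
        self._lock = threading.Lock()

    def run(self, operation):
        # Admit or reject while holding the lock, then run without it.
        with self._lock:
            if self._max > 0 and self._active >= self._max:
                raise TooManyRequests(self._message)
            self._active += 1
        try:
            return operation()
        finally:
            with self._lock:
                self._active -= 1


# Example: a limiter for submit requests, sized from the config value.
config = {"templeton.parallellism.job.submit": 2}
submit_limiter = OperationLimiter(
    config["templeton.parallellism.job.submit"],
    "Too many concurrent job submission requests received. "
    "Please wait for some time before retrying.",
)
```

Limiters for the Status and List operations would be built the same way from their own config values.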

Once a job operation has started, it can take a long time. The client that
requested the job operation may not be willing to wait indefinitely. This work introduces
the configurations

templeton.job.submit.timeout
templeton.job.status.timeout
templeton.job.list.timeout

to specify the maximum amount of time a job operation can execute. If a timeout happens, then list
and status job requests return to the client with the message

"List job request got timed out. Please retry the operation after waiting for some time."

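The timeout for list and status requests can be sketched by running the operation on a thread pool and bounding the wait for its result. The names run_with_timeout and OperationTimedOut are hypothetical, and expressing the timeout in seconds is an assumption of this example; the description does not state the unit of the timeout configs.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

LIST_TIMEOUT_MESSAGE = (
    "List job request got timed out. "
    "Please retry the operation after waiting for some time."
)


class OperationTimedOut(Exception):
    """Hypothetical exception carrying the timeout message for the client."""


def run_with_timeout(pool, operation, timeout_seconds, message):
    """Run operation on pool and wait at most timeout_seconds for its result."""
    future = pool.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a running
        # task keeps going in the background.
        future.cancel()
        raise OperationTimedOut(message)
```
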
If a submit job request times out, then
      i) The job submit request thread that receives the timeout will check whether a valid job id
has been generated for the job request.
      ii) If it has been generated, then the thread issues a kill job request on the cancel thread pool.
It doesn't wait for that operation to complete and returns to the client with a timeout message.
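
The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: submit_with_timeout, SubmitTimedOut, the request dictionary, and the wording of the timeout message are all hypothetical, and submit_job is assumed to record the job id in the request as soon as it is known.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class SubmitTimedOut(Exception):
    """Hypothetical exception for a submit request that exceeded its timeout."""


def submit_with_timeout(submit_pool, cancel_pool, submit_job, kill_job, timeout_seconds):
    """Submit a job, and on timeout issue a best-effort kill without waiting for it.

    submit_job(request) is assumed to store the job id in request["job_id"]
    once it has been generated, and to return that id when it finishes.
    """
    request = {"job_id": None}
    future = submit_pool.submit(submit_job, request)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        job_id = request["job_id"]
        if job_id is not None:
            # Fire and forget: the kill runs on the cancel pool and the
            # caller does not wait for its outcome.
            cancel_pool.submit(kill_job, job_id)
        raise SubmitTimedOut("Submit job request timed out.")
```

Because the kill is not awaited and may fail, the sketch offers the same best-effort behavior as described, with no guarantee that the job is actually gone.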

Side effects of enabling timeouts for submit operations
1) The job may remain active for some time after the client gets the timeout response,
and a list operation from the client could potentially show the newly created job before it gets
killed.
2) We make a best effort to kill the job, with no guarantees. This means there is a possibility
of a duplicate job being created. One possible reason is a case where the job is created
and then the operation times out, but the kill request fails because the resource manager is unavailable.
When the resource manager restarts, it will restart the job that was created.

Fixing this scenario is outside the scope of this JIRA. This functionality
should be enabled only if the above side effects are acceptable.


  was:
Currently the Templeton service doesn't restrict the number of job operation requests. It simply accepts
and tries to run all operations. If a large number of concurrent job submit requests arrive, then
the time to submit job operations can increase significantly. Templeton uses HDFS to store
the staging files for a job. If HDFS storage can't respond to a large number of requests and throttles,
then job submission can take a very long time, on the order of minutes.

This behavior may not be suitable for all applications. Client applications may be looking
for a predictable, low response time for successful requests, or for a throttle response that tells
the client to wait for some time before re-requesting the job operation.

In this JIRA, I am trying to address the following job operations:
1) Submit new Job
2) Get Job Status
3) List jobs

These three operations have different complexity because they differ in their use of cluster
resources like YARN/HDFS.

The idea is to introduce a new config, templeton.job.submit.exec.max-procs, which controls the maximum
number of concurrent active job submissions within Templeton, and to use this config to keep
response times under control. If a new job submission request sees that there are already templeton.job.submit.exec.max-procs
jobs being submitted concurrently, then the request will fail with HTTP error 503 and the reason


   “Too many concurrent job submission requests received. Please wait for some time before
retrying.”
 
The client is expected to catch this response and retry after waiting for some time. The default
value for the config templeton.job.submit.exec.max-procs is ‘0’. This means that by
default job submission requests are always accepted. The limit needs to be enabled based
on requirements.

We can have similar behavior for Status and List operations with configs templeton.job.status.exec.max-procs
and templeton.list.job.exec.max-procs respectively.

Once a job operation has started, it can take a long time. The client that
requested the job operation may not be willing to wait indefinitely. This work introduces
the configurations

templeton.exec.job.submit.timeout
templeton.exec.job.status.timeout
templeton.exec.job.list.timeout

to specify the maximum amount of time a job operation can execute. If a timeout happens, then list
and status job requests return to the client with the message

"List job request got timed out. Please retry the operation after waiting for some time."

If a submit job request times out, then
      i) The job submit request thread that receives the timeout will check whether a valid job id
has been generated for the job request.
      ii) If it has been generated, then the thread issues a kill job request on the cancel thread pool.
It doesn't wait for that operation to complete and returns to the client with a timeout message.

Side effects of enabling timeouts for submit operations
1) The job may remain active for some time after the client gets the timeout response,
and a list operation from the client could potentially show the newly created job before it gets
killed.
2) We make a best effort to kill the job, with no guarantees. This means there is a possibility
of a duplicate job being created. One possible reason is a case where the job is created
and then the operation times out, but the kill request fails because the resource manager is unavailable.
When the resource manager restarts, it will restart the job that was created.

Fixing this scenario is outside the scope of this JIRA. This functionality
should be enabled only if the above side effects are acceptable.



> Enhance Templeton service job operations reliability
> ----------------------------------------------------
>
>                 Key: HIVE-15947
>                 URL: https://issues.apache.org/jira/browse/HIVE-15947
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Subramanyam Pattipaka
>            Assignee: Subramanyam Pattipaka
>              Labels: TODOC2.2
>             Fix For: 2.3.0
>
>         Attachments: HIVE-15947.10.patch, HIVE-15947.2.patch, HIVE-15947.3.patch, HIVE-15947.4.patch,
HIVE-15947.6.patch, HIVE-15947.7.patch, HIVE-15947.8.patch, HIVE-15947.9.patch, HIVE-15947.patch



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
