hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-1408) add option to let hive automatically run in local mode based on tunable heuristics
Date Thu, 08 Jul 2010 09:27:50 GMT

     [ https://issues.apache.org/jira/browse/HIVE-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-1408:
------------------------------------

    Attachment: 1408.1.patch

v1 - i will update with tests.

couple of main objectives:
1. decide whether each mr job can be run locally
2. decide whether local disk can be used for intermediate data (if all jobs are going to run
locally)

right now - both #1 and #2 are code complete - but only #1 has been enabled in the code (#2
needs more testing)

the general strategy is:
- after compilation/optimization - look at input size of each mr job.
- if all the jobs are small - then we can use local disk for intermediate data (#2)
- else - we use hdfs for intermediate input and before launching each job - we (re)test whether
the input data set is such that we can execute locally.
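
The strategy above can be sketched roughly as follows. This is only an illustration, not the code in the patch: the class name, the method names and the single byte threshold are all assumptions, and the real heuristics may look at more than input size.

```java
import java.util.List;

// Hypothetical sketch of the size-based decision; not taken from 1408.1.patch.
public class LocalModeSketch {
    // Assumed tunable: largest input (in bytes) a job may have and still run locally.
    private final long maxLocalInputBytes;

    public LocalModeSketch(long maxLocalInputBytes) {
        this.maxLocalInputBytes = maxLocalInputBytes;
    }

    // Objective #1: decided per job, and (re)tested just before each job is launched.
    public boolean canRunLocally(long inputBytes) {
        return inputBytes <= maxLocalInputBytes;
    }

    // Objective #2: local disk is usable for intermediate data only if
    // every job in the plan is small enough to run locally.
    public boolean canUseLocalScratch(List<Long> jobInputBytes) {
        for (long bytes : jobInputBytes) {
            if (!canRunLocally(bytes)) {
                return false;
            }
        }
        return true;
    }
}
```

If canUseLocalScratch returns false, the plan would keep intermediate data on hdfs and call canRunLocally again for each job right before it launches.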

had to do substantial restructuring to make this happen:
a. MapRedTask is now a wrapper around ExecDriver. This allows us to have a single task implementation
for running mr jobs. mapredtask decides at execute time whether it should run locally or not.
b. Context.java is pretty much rewritten - the path management code was somewhat buggy (in
particular isMRTmpFileURI was incorrect). the code was rewritten to make it easy to
swizzle tmp paths to be directed to local disk after plan generation
c. added a small cache for DFS file metadata (sizes). this is because we now look up file
metadata many times over (for determining local mode as well as for estimating reducer
count) and this cuts the overhead of repeated DFS rpcs
d. most test output changes are because of altered temporary path naming convention due to
(b)
e. bug fixes: CTAS and RCFileOutputFormat were broken for local mode execution. some cleanup
(debug log statements should be wrapped in an isDebugEnabled() check).
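
The idea behind item (c), the small metadata cache, can be sketched like this. It is a minimal illustration under assumptions: the class name is made up, the remote lookup is passed in as a plain function standing in for the DFS rpc, and the patch's actual cache may be keyed or scoped differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch of a per-query file size cache; not taken from 1408.1.patch.
public class FileSizeCacheSketch {
    private final Map<String, Long> sizes = new HashMap<>();
    // Stand-in for the DFS metadata rpc (assumption: path in, size in bytes out).
    private final ToLongFunction<String> remoteLookup;
    private int remoteCalls = 0;

    public FileSizeCacheSketch(ToLongFunction<String> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    // Returns the cached size if present, otherwise does one remote lookup and remembers it.
    public long sizeOf(String path) {
        Long cached = sizes.get(path);
        if (cached != null) {
            return cached;
        }
        remoteCalls++;
        long size = remoteLookup.applyAsLong(path);
        sizes.put(path, size);
        return size;
    }

    // Number of lookups that actually went to the remote side.
    public int remoteCalls() {
        return remoteCalls;
    }
}
```

With this shape, both the local mode check and the reducer count estimate can ask for the same path and only the first call pays for the rpc.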


> add option to let hive automatically run in local mode based on tunable heuristics
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-1408
>                 URL: https://issues.apache.org/jira/browse/HIVE-1408
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Joydeep Sen Sarma
>         Attachments: 1408.1.patch
>
>
> as a followup to HIVE-543 - we should have a simple option (enabled by default) to let hive run in local mode if possible.
> two levels of options are desirable:
> 1. hive.exec.mode.local.auto=true/false // control whether local mode is automatically chosen
> 2. Options to control different heuristics, some naive examples:
>      hive.exec.mode.local.auto.input.size.max=1G // don't choose local mode if data > 1G
>      hive.exec.mode.local.auto.script.enable=true/false // choose if local mode is enabled for queries with user scripts
> this can be implemented as a pre/post execution hook. It makes sense to provide this as a standard hook in the hive codebase since it's likely to improve response time for many users (especially for test queries).
> the initial proposal is to choose this at a query level and not at per hive-task (i.e. hadoop job) level. per job-level requires more changes to compilation (to not pre-commit to hdfs or local scratch directories at compile time).
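
How the proposed options in the issue description might combine can be sketched as below. The three property names come from the description; everything else is an assumption: the class and method names are hypothetical, the size limit is read as a plain byte count rather than a value like "1G", and the fallback defaults are placeholders, not Hive's.

```java
import java.util.Properties;

// Hypothetical sketch of combining the proposed options; not Hive's implementation.
public class AutoLocalConfigSketch {
    public static boolean chooseLocal(Properties conf, long inputBytes, boolean usesScripts) {
        // Master switch (the description suggests it be enabled by default).
        boolean auto = Boolean.parseBoolean(
            conf.getProperty("hive.exec.mode.local.auto", "true"));
        if (!auto) {
            return false;
        }
        // Assumed here to be a byte count; 1073741824 bytes = 1G placeholder default.
        long maxBytes = Long.parseLong(
            conf.getProperty("hive.exec.mode.local.auto.input.size.max", "1073741824"));
        if (inputBytes > maxBytes) {
            return false;
        }
        // Queries with user scripts go local only if explicitly allowed.
        boolean scriptsAllowed = Boolean.parseBoolean(
            conf.getProperty("hive.exec.mode.local.auto.script.enable", "false"));
        return !usesScripts || scriptsAllowed;
    }
}
```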

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

