hive-user mailing list archives

From Ning Zhang <>
Subject Re: Why two map stages for a simple select query?
Date Sat, 14 Aug 2010 03:45:40 GMT
The second map-reduce job is probably the merge job, which takes the output of the first map-only
job (the real query) and merges the resulting files. The merge job is not always triggered.
If you look at the plan, you may find that it is a child of a conditional task, which means it is
triggered conditionally, based on the results of the first map-only job.
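
One way to check this for the first query in the original message is to prefix it with EXPLAIN, which prints the plan without running the query. This is only a sketch that reuses the table and column names from the question. Stage numbers and labels in the output vary by Hive version, so treat the comments as a guide to what to look for rather than as expected output.

```sql
-- Print the query plan instead of executing the statement.
-- In the stage dependencies, look for a conditional stage that sits
-- between the map-only stage and the merge stage.
EXPLAIN
INSERT OVERWRITE TABLE alogs_test_extracted1
SELECT raw.client_ip, raw.cookie, raw.referrer_flag
FROM alogs_test_rc6 raw;
```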

You can disable the merge task by setting hive.merge.mapfiles=false. Likewise, hive.merge.mapredfiles
controls whether the output of a map-reduce job is merged.
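
As a minimal sketch, assuming the settings are applied per session in the Hive CLI before the query runs, the first query from the original message could be run without the merge step like this:

```sql
-- Skip the conditional merge job after map-only jobs (this session only).
SET hive.merge.mapfiles=false;
-- Optionally also skip the merge after full map-reduce jobs.
SET hive.merge.mapredfiles=false;

INSERT OVERWRITE TABLE alogs_test_extracted1
SELECT raw.client_ip, raw.cookie, raw.referrer_flag
FROM alogs_test_rc6 raw;
```

With merging turned off, the table keeps one output file per mapper, so the many-small-files concern raised in the original message still applies.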

On Aug 13, 2010, at 8:16 PM, Leo Alekseyev wrote:

> Hi all,
> I'm mystified by Hive's behavior for two types of queries.
> 1: consider the following simple select query:
> insert overwrite table alogs_test_extracted1
> select raw.client_ip, raw.cookie, raw.referrer_flag
> from alogs_test_rc6 raw;
> Both tables are stored as rcfiles, and LZO compression is turned on.
> Hive runs this in two jobs: a map-only, and a map-reduce.  Question:
> can someone explain to me _what_ hive is doing in the two map jobs?..
> I stared at the output of EXPLAIN, but can't figure out what is going
> on.  When I do similar extractions by hand, I have a mapper that pulls
> out fields from records, and (optionally) a reducer that combines the
> results -- that is, one map stage.  Why are there two here?..  (about
> 30% of the time is spent on the first map stage, 45% on the second map
> stage, and 25% on the reduce step).
> 2: consider the "transform..using" query below:
> insert overwrite table alogs_test_rc6
> select
>  transform (d.ll)
>    using 'java myProcessingClass'
>    as (field1, field2, field3)
> from (select logline as ll from raw_log_test1day) d;
> Here, Hive plan (as shown via EXPLAIN) also suggests two MR stages: a
> map, and a map-reduce.  However, when the job actually runs, Hive says
> "Launching job 1 out of 2", runs the transform script in mappers,
> writes the table, and never launches job 2 (the map-reduce stage in
> the plan)!  Why is this happening, and can I control this behavior?..
> Sometimes it would be preferable for me to run a map-only job (perhaps
> combining input data for mappers with CombineFileInputFormat to avoid
> generating thousands of 20MB files).
> Thanks in advance to anyone who can clarify Hive's behavior here...
> --Leo
