hive-user mailing list archives

From LLBian <linanmengxia...@126.com>
Subject Re: Re: what is the difference between "hive.compute.splits.in.am=true" and "hive.compute.splits.in.am=false"
Date Tue, 19 Jan 2016 07:04:34 GMT

Thank you very much, Gopal. I got it, and I will study the slides you shared
carefully.
Best Regards.

--LLBian


At 2016-01-19 14:16:27, "Gopal Vijayaraghavan" <gopalv@apache.org> wrote:
>
>
>>Thank you so much for your quick response. Yes, the option is used only
>>for Hive on Tez. I want to know where it comes from and how it works.
>
>in.am=true is the better option as it computes the splits after a job has
>been submitted.
>
>Imagine you have 3 tables in your query - with in.am=false, all the splits
>have to be generated before the 1st task is spun up.
>
>with in.am=true, the 1st task can spin up when at least one of the tables
>has already generated splits. GetSplits() is not blocking across all
>tables - only within 1 table.
>
>In some cases, you can wait for the 1st task to even finish executing
>before starting the split-gen for the 2nd task, producing ~1000x speedups.
>
>For example,
>
>insert into bigtable partition(dt)
>select ... from small left outer join bigtable
>on date(small.ts) = bigtable.dt and small.txnid = bigtable.txnid
>where bigtable.txnid is null
>;
>
>With in.am=true + Tez DPP (dynamic partition pruning), the split-gen is
>dynamic and will only generate splits for the bigtable partitions that
>match small, not for 100% of bigtable (assuming small table is just today).
>
>>Maybe this resource
>>“http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/29” is very
>>useful,
>
>It has diagrams, but here's an original .pptx
>
>http://people.apache.org/~gopalv/W-235p-Pandey.pptx
>
>MD5 (W-235p-Pandey.pptx) = fd3d5c7eb6360f9654bdfbfb20031ba4
>
>
>Cheers,
>Gopal
>
>
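
For reference, the settings discussed above could be applied per session roughly as follows. This is a minimal sketch, not a tuned configuration. Only hive.compute.splits.in.am is named in the thread; hive.execution.engine and hive.tez.dynamic.partition.pruning are assumed here to be the properties behind "hive-on-tez" and "tez DPP", and their defaults vary by Hive version.

```sql
-- Assumption: the query runs on Tez; the thread says in.am applies only there.
set hive.execution.engine=tez;

-- Compute splits after the job has been submitted, so the first task can
-- start as soon as one table has its splits.
set hive.compute.splits.in.am=true;

-- Assumption: this is the switch for the "tez DPP" mentioned above, which
-- lets split generation for bigtable skip partitions that small never matches.
set hive.tez.dynamic.partition.pruning=true;
```

With these set, the insert ... select example above should only need splits for the bigtable partitions whose dt values actually appear in small.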