hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiaobin She <xiaobin...@gmail.com>
Subject Re: how to avoid scan the same table multi times?
Date Fri, 13 Jan 2012 13:09:29 GMT
hi,

I use the multiple inserts method, and I write an sql like this:

from td
INSERT OVERWRITE  DIRECTORY '/tmp/total.out' select count(v1)
INSERT OVERWRITE  DIRECTORY '/tmp/totaldistinct.out' select count(distinct
v1)
INSERT OVERWRITE  DIRECTORY '/tmp/distinctuin.out' select distinct v1

INSERT OVERWRITE  DIRECTORY '/tmp/v4.out' select v4 , count(v1),
count(distinct v1) group by v4
INSERT OVERWRITE  DIRECTORY '/tmp/v3v4.out' select v3, v4 , count(v1),
count(distinct v1) group by v3, v4

INSERT OVERWRITE  DIRECTORY '/tmp/v426.out' select count(v1),
count(distinct v1)  where v4=2 or v4=6
INSERT OVERWRITE  DIRECTORY '/tmp/v3v426.out' select v3, count(v1),
count(distinct v1) where v4=2 or v4=6 group by v3

INSERT OVERWRITE  DIRECTORY '/tmp/v415.out' select count(v1),
count(distinct v1)  where v4=1 or v4=5
INSERT OVERWRITE  DIRECTORY '/tmp/v3v415.out' select v3, count(v1),
count(distinct v1) where v4=1 or v4=5 group by v3


it works, and the output result is what I want.

but there is one problem, hive generate 9 mapreduce jobs and run these jobs
one by one.

I run explain on this query, and I got the following message:


STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-0 depends on stages: Stage-9
  Stage-10 depends on stages: Stage-9
  Stage-1 depends on stages: Stage-10
  Stage-11 depends on stages: Stage-9
  Stage-2 depends on stages: Stage-11
  Stage-12 depends on stages: Stage-9
  Stage-3 depends on stages: Stage-12
  Stage-13 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-13
  Stage-14 depends on stages: Stage-9
  Stage-5 depends on stages: Stage-14
  Stage-15 depends on stages: Stage-9
  Stage-6 depends on stages: Stage-15
  Stage-16 depends on stages: Stage-9
  Stage-7 depends on stages: Stage-16
  Stage-17 depends on stages: Stage-9
  Stage-8 depends on stages: Stage-17


it seems that stage 9-17 is corresponding to mapreduce job 0-8
but from the explain message above, stage 10-17 only depends on stage 9,
so I have an question, why job 1-8 can't run concurrently?

Or how can I make job 1-8 run concurrently?

thank you very much for your help again!

xiaobin



在 2012年1月13日 下午12:01,Xiaobin She <xiaobinshe@gmail.com>写道:

> to Martin, Mark and Edward,
>
> thank you for your advices, I will try it out.
>
> And to Martin, by "appropriate data format", do you mean something like "
> 2012011202" ?
>
> thanks!
>
> xiaobin
>
> 在 2012年1月12日 下午10:20,Martin Kuhn <martin.kuhn@affinitas.de>写道:
>
> Hi there,
>>
>> > Select count(*), count(distinct u), type from t group by type where
>> plat=1 and dt=”2012-1-12-02”
>> > Select count(*), count(distinct u), type from t where (type =2 or type
>> =6) and dt=”2012-1-12-02” group by type;
>>
>> > Is there a better way to do these queries?
>>
>> You could try something like this:
>>
>> SELECT
>>    type
>>  , count(*)
>>  , count(DISTINCT u)
>>  , count(CASE WHEN plat=1 THEN u ELSE NULL)
>>  , count(DISTINCT CASE WHEN plat=1 THEN u ELSE NULL)
>>  , count(CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
>>  , count(DISTINCT CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
>> FROM
>>    t
>> WHERE
>>    dt in ("2012-1-12-02", "2012-1-12-03")
>> GROUP BY
>>    type
>> ORDER BY
>>    type
>> ;
>>
>> Good luck :)
>> Martin Kuhn
>>
>>
>> P.S.  You'ge got a strange date format there. For sorting purposes it
>> would be more appropriate to use something like "2012-01-12-02".
>>
>>
>

Mime
View raw message