hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiaobin She <xiaobin...@gmail.com>
Subject Re: how to avoid scan the same table multi times?
Date Sun, 15 Jan 2012 07:37:57 GMT
oh, I change the name of the column in the query,  so the corresponding
table should be:

create table td(id int, time string, ip string, v1 bigint, v2 int, v3 int,
v4 int, v5 bigint, v6 int)  PARTITIONED BY(dt STRING)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','  lines TERMINATED BY '\n' ;


Any advices on how to make the jobs run concurrently?

thank you!




在 2012年1月13日 下午9:09,Xiaobin She <xiaobinshe@gmail.com>写道:

> hi,
>
> I use the multiple inserts method, and I write an sql like this:
>
> from td
> INSERT OVERWRITE  DIRECTORY '/tmp/total.out' select count(v1)
> INSERT OVERWRITE  DIRECTORY '/tmp/totaldistinct.out' select count(distinct
> v1)
> INSERT OVERWRITE  DIRECTORY '/tmp/distinctuin.out' select distinct v1
>
> INSERT OVERWRITE  DIRECTORY '/tmp/v4.out' select v4 , count(v1),
> count(distinct v1) group by v4
> INSERT OVERWRITE  DIRECTORY '/tmp/v3v4.out' select v3, v4 , count(v1),
> count(distinct v1) group by v3, v4
>
> INSERT OVERWRITE  DIRECTORY '/tmp/v426.out' select count(v1),
> count(distinct v1)  where v4=2 or v4=6
> INSERT OVERWRITE  DIRECTORY '/tmp/v3v426.out' select v3, count(v1),
> count(distinct v1) where v4=2 or v4=6 group by v3
>
> INSERT OVERWRITE  DIRECTORY '/tmp/v415.out' select count(v1),
> count(distinct v1)  where v4=1 or v4=5
> INSERT OVERWRITE  DIRECTORY '/tmp/v3v415.out' select v3, count(v1),
> count(distinct v1) where v4=1 or v4=5 group by v3
>
>
> it works, and the output result is what I want.
>
> but there is one problem, hive generate 9 mapreduce jobs and run these
> jobs one by one.
>
> I run explain on this query, and I got the following message:
>
>
> STAGE DEPENDENCIES:
>   Stage-9 is a root stage
>   Stage-0 depends on stages: Stage-9
>   Stage-10 depends on stages: Stage-9
>   Stage-1 depends on stages: Stage-10
>   Stage-11 depends on stages: Stage-9
>   Stage-2 depends on stages: Stage-11
>   Stage-12 depends on stages: Stage-9
>   Stage-3 depends on stages: Stage-12
>   Stage-13 depends on stages: Stage-9
>   Stage-4 depends on stages: Stage-13
>   Stage-14 depends on stages: Stage-9
>   Stage-5 depends on stages: Stage-14
>   Stage-15 depends on stages: Stage-9
>   Stage-6 depends on stages: Stage-15
>   Stage-16 depends on stages: Stage-9
>   Stage-7 depends on stages: Stage-16
>   Stage-17 depends on stages: Stage-9
>   Stage-8 depends on stages: Stage-17
>
>
> it seems that stage 9-17 is corresponding to mapreduce job 0-8
> but from the explain message above, stage 10-17 only depends on stage 9,
> so I have an question, why job 1-8 can't run concurrently?
>
> Or how can I make job 1-8 run concurrently?
>
> thank you very much for your help again!
>
> xiaobin
>
>
>
> 在 2012年1月13日 下午12:01,Xiaobin She <xiaobinshe@gmail.com>写道:
>
> to Martin, Mark and Edward,
>>
>> thank you for your advices, I will try it out.
>>
>> And to Martin, by "appropriate data format", do you mean something like "
>> 2012011202" ?
>>
>> thanks!
>>
>> xiaobin
>>
>> 在 2012年1月12日 下午10:20,Martin Kuhn <martin.kuhn@affinitas.de>写道:
>>
>> Hi there,
>>>
>>> > Select count(*), count(distinct u), type from t group by type where
>>> plat=1 and dt=”2012-1-12-02”
>>> > Select count(*), count(distinct u), type from t where (type =2 or type
>>> =6) and dt=”2012-1-12-02” group by type;
>>>
>>> > Is there a better way to do these queries?
>>>
>>> You could try something like this:
>>>
>>> SELECT
>>>    type
>>>  , count(*)
>>>  , count(DISTINCT u)
>>>  , count(CASE WHEN plat=1 THEN u ELSE NULL)
>>>  , count(DISTINCT CASE WHEN plat=1 THEN u ELSE NULL)
>>>  , count(CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
>>>  , count(DISTINCT CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
>>> FROM
>>>    t
>>> WHERE
>>>    dt in ("2012-1-12-02", "2012-1-12-03")
>>> GROUP BY
>>>    type
>>> ORDER BY
>>>    type
>>> ;
>>>
>>> Good luck :)
>>> Martin Kuhn
>>>
>>>
>>> P.S.  You'ge got a strange date format there. For sorting purposes it
>>> would be more appropriate to use something like "2012-01-12-02".
>>>
>>>
>>
>

Mime
View raw message