impala-user mailing list archives

From Alexander Behm <alex.b...@cloudera.com>
Subject Re: Any plans for approximate topN query?
Date Wed, 29 Nov 2017 04:48:14 GMT
Agree that the techniques (approximation and sampling) are different and
complementary.

Our current user base tends to require exact query responses, so this is a
direction we have not seriously explored.

You are certainly welcome to flesh out your ideas in more detail and
propose/make a contribution! Perhaps other members of the community agree
with you and are willing to help.


On Tue, Nov 28, 2017 at 6:24 PM, Jason Heo <jason.heo.sde@gmail.com> wrote:

> Hi, Jeszy
>
> Thank you for your reply.
>
> My understanding is that you're referring to sampling.
>
> Although both topN and sampling are approximate techniques for making
> queries run faster, I think they are different concepts.
>
> With topN, each node returns only its top N aggregated items, which
> eliminates the expensive shuffle operation, whereas sampling reduces the
> amount of input data.
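
A minimal Python sketch of the per-node top-N idea described above, for illustration only. It is not how Druid or Impala implement it; the names `local_top_n` and `approx_top_n` and the sample data are hypothetical. Each simulated node aggregates its own rows and forwards only its N largest groups, so the merge step sees at most N items per node instead of every distinct key.

```python
import heapq
from collections import Counter

def local_top_n(rows, n):
    """Aggregate one node's (url, pageviews) rows and keep its n largest groups."""
    counts = Counter()
    for url, pageviews in rows:
        counts[url] += pageviews
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

def approx_top_n(partitions, n):
    """Merge the per-node top-n lists into one approximate global top-n."""
    merged = Counter()
    for rows in partitions:
        # Only n items per node cross the "network" here, not all groups.
        for url, pageviews in local_top_n(rows, n):
            merged[url] += pageviews
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])

# Two simulated nodes, each holding part of the data (hypothetical values).
partitions = [
    [("/a", 5), ("/b", 3), ("/c", 1)],
    [("/a", 4), ("/c", 2), ("/b", 1)],
]
result = approx_top_n(partitions, 2)
print(result)  # [('/a', 9), ('/b', 3)]
```

The result is approximate: the exact total for `/b` is 4, but the second node dropped `/b` from its local top 2, so the merged count is only 3. Groups cut off on a node are undercounted or missed, which is the accuracy traded for avoiding a full shuffle.
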
>
> topN can be used without sampling, and sampling can be used without topN,
> and they can be used at the same time.
>
> My experiment with Druid 0.10.0 on my dataset shows that "topN without
> sampling" is 100 times faster than GroupBy & OrderBy, and "topN with
> sampling" is 200 times faster than GroupBy & OrderBy.
>
> Currently, not many distributed SQL engines support topN. By implementing
> topN, Impala could be adopted by many types of analytic systems.
>
> Thanks.
>
> Regards,
>
> Jason
>
>
> 2017-11-28 23:19 GMT+09:00 Jeszy <jeszyb@gmail.com>:
>
>> Hello Jason,
>>
>> IMPALA-5300 (https://issues.apache.org/jira/browse/IMPALA-5300) is in
>> the works, and I think it fits your use case. Can you take a look?
>>
>> Thanks!
>>
>> On 28 November 2017 at 15:11, Jason Heo <jason.heo.sde@gmail.com> wrote:
>> > Hi,
>> >
>> > I'm wondering whether the Impala team has any plans for approximate topN
>> > for a single dimension.
>> >
>> > My web analytics system mostly serves top-n URLs. A query such as
>> > "GROUP BY url ORDER BY pageview LIMIT n" is slow, especially for
>> > high-cardinality fields. Approximate topN can be used instead of GroupBy
>> > for a single dimension, with much lower latency.
>> >
>> > Elasticsearch, Druid, and ClickHouse already provide this feature.
>> >
>> > It would be great if I could use it on Impala.
>> >
>> > Thanks.
>> >
>> > Regards,
>> >
>> > Jason
>>
>
>
