hive-user mailing list archives

From "j.barrett Strausser"
Subject Re: percentile_approx slowness
Date Thu, 25 Sep 2014 20:35:48 GMT
Not an answer to your question, but you can compute approximate percentiles
with the memory overhead of only a single integer (two integers if you
want better results).

So you could pretty easily implement such an algorithm as a Python UDF
and then have a reduce step that averages the results.
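
A minimal sketch of that idea follows. The message does not name the
algorithm, so this assumes a "frugal" style streaming estimator that keeps
one integer and nudges it up or down by 1 per row. The names
frugal_quantile and average_estimates are hypothetical, the values are
assumed to be integer-like, and the wiring into Hive is left out.

```python
import random


def frugal_quantile(values, q, seed=None, start=0):
    """Estimate the q-quantile (0 < q < 1) of a stream with one integer of state.

    Assumption: this is a one-integer "frugal" estimator, used here only as
    an illustration of the kind of algorithm the message describes.
    """
    rng = random.Random(seed)
    estimate = start  # the only state carried between rows
    for x in values:
        r = rng.random()
        if x > estimate and r > 1.0 - q:
            # Row is above the estimate: step up with probability q.
            estimate += 1
        elif x < estimate and r > q:
            # Row is below the estimate: step down with probability 1 - q.
            estimate -= 1
    return estimate


def average_estimates(estimates):
    """Reduce step: average the per-partition estimates."""
    estimates = list(estimates)
    return sum(estimates) / len(estimates)
```

Each mapper would run frugal_quantile over its own rows and emit one
number, and the reducer would call average_estimates on those numbers.
Because the estimate moves by 1 per row, it needs roughly as many rows as
the distance between start and the true percentile before it settles, so
a sensible start value (or rescaled input) matters for wide value ranges.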

On Thu, Sep 25, 2014 at 3:06 PM, Kevin Weiler wrote:

>  Hi All,
>  I have a query that attempts to compute percentiles on some datasets
> that are well in excess of 100,000,000 rows and have thus opted to use
> percentile_approx as we are routinely overrunning the memory. I’m having
> trouble finding the threshold at which I want to begin sampling. Before this
> dataset got so large, the maximum number of rows I would need to include in
> the percentile was about 1,000,000. I’ve tried using 1,000,000 as a
> sampling threshold, 100,000, and even the default 10,000. For some reason
> this query, which previously took about 20 minutes to run, is now taking
> around 13 hours to complete (in the case of 100,000 as my sampling rate).
> Are there some hive settings I should be investigating to see if I can have
> this query complete in a reasonable time?
> --
>   *Kevin Weiler*
> IT
>  IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
> 60606 |
>  Phone: +1 312-204-7439 | Fax: +1 312-244-3301

