hive-user mailing list archives

From "j.barrett Strausser" <j.barrett.straus...@gmail.com>
Subject Re: percentile_approx slowness
Date Thu, 25 Sep 2014 20:35:48 GMT
Not an answer to your question, but you can compute approximate percentiles
with the memory overhead of only a single integer (two integers if you
want better results):

http://link.springer.com/chapter/10.1007/978-3-642-40273-9_7

So you could pretty easily implement the algorithm above as a Python UDF
and then have a reduce step that averages the results.
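
A rough sketch of what I mean, assuming the linked chapter is the "frugal
streaming" one-variable estimator (nudge the estimate up or down by one per
item, with a coin flip weighted by the target quantile). The names
frugal_quantile and combine are made up for illustration, the values are
assumed to be integers, and none of the Hive wiring (TRANSFORM script, stdin
parsing) is shown:

```python
import random


def frugal_quantile(stream, q, seed=None, start=0):
    """Estimate the q-quantile (0 < q < 1) of an integer stream.

    Only one integer of state (the running estimate m) is kept.
    Assumed update rule: step up by 1 with probability q when an item is
    above m, step down by 1 with probability 1 - q when it is below m.
    """
    rng = random.Random(seed)
    m = start
    for x in stream:
        r = rng.random()
        if x > m and r > 1.0 - q:
            m += 1  # item above the estimate: move up with probability q
        elif x < m and r > q:
            m -= 1  # item below the estimate: move down with probability 1 - q
    return m


def combine(estimates):
    """Reduce step: average the per-mapper estimates."""
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Four hypothetical "mappers", each seeing its own slice of the data.
    parts = [[data_rng.randrange(1000) for _ in range(50000)] for _ in range(4)]
    per_mapper = [frugal_quantile(p, 0.5, seed=i) for i, p in enumerate(parts)]
    print(per_mapper, combine(per_mapper))
```

Because the estimate moves by only 1 per item, it needs many rows to walk
from the starting value to the true percentile, and it only settles to
within some noise of it. Averaging across mappers only makes sense if each
mapper sees a roughly random slice of the data.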





On Thu, Sep 25, 2014 at 3:06 PM, Kevin Weiler <Kevin.Weiler@imc-chicago.com>
wrote:

>  Hi All,
>
>  I have a query that attempts to compute percentiles on some datasets
> that are well in excess of 100,000,000 rows and have thus opted to use
> percentile_approx as we are routinely overrunning the memory. I’m having
> trouble finding a threshold at which I want to begin sampling. Before this
> dataset got so large, the maximum number of rows I would need to include in
> the percentile was about 1,000,000. I’ve tried using 1,000,000 as a
> sampling threshold, 100,000, and even the default 10,000. For some reason
> this query, which previously took about 20 minutes to run, is now taking
> around 13 hours to complete (in the case of 100,000 as my sampling rate).
> Are there some Hive settings I should be investigating to see if I can have
> this query complete in a reasonable time?
>
> --
>   *Kevin Weiler*
>
> IT
>  IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
> 60606 | http://imc-chicago.com/
>  Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: *kevin.weiler@imc-chicago.com
> <Kevin.Weiler@imc-chicago.com>*
>
>



-- 


https://github.com/bearrito
@deepbearrito
