hive-user mailing list archives

From Kevin Weiler <Kevin.Wei...@imc-chicago.com>
Subject Re: percentile_approx slowness
Date Thu, 02 Oct 2014 14:15:18 GMT
Hi all,

I wanted to note that I figured out a better solution to my problem. I was selecting each
percentile I wanted to compute (0.1, 0.5, 0.9, etc.) as an individual percentile calculation,
which was blowing up my query. It turns out that if you do it like this:

SELECT
  PERCENTILE(col, array(0.1, 0.5, 0.9))
FROM my_table;  -- my_table is a hypothetical table name

the aggregation doesn’t need to run multiple times and my query runs just fine.

I now have one additional question.

I would like to store each percentile as a field in another Hive table. This calculation returns
an array. How can I break the array out into individual fields to insert into a new table?

--
Kevin Weiler
IT
IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.weiler@imc-chicago.com

On Sep 25, 2014, at 3:35 PM, j.barrett Strausser <j.barrett.strausser@gmail.com> wrote:

Not an answer to your question, but you can compute approximate percentiles with only the
memory overhead of a single integer (two integers if you want better results):

http://link.springer.com/chapter/10.1007/978-3-642-40273-9_7

So you could pretty easily implement the above algorithm as a Python UDF and then have a
reduce step that averages the results.
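
As a rough illustration of the idea, here is a minimal Python sketch. It assumes the linked
chapter describes a "frugal" streaming estimator in which a single integer is nudged up or
down by one as each value arrives; the exact update rule used here is an assumption, not taken
from this thread. The names frugal_quantile and averaged_estimate are hypothetical, the data
is synthetic, and the step size of 1 only makes sense for integer-valued columns. Wiring this
into Hive as a UDF or streaming script is not shown.

```python
import random


def frugal_quantile(stream, q, rng, start=0):
    """Estimate the q-quantile of an integer stream with one integer of state.

    Assumed update rule: move the estimate up by 1 with probability q when a
    value is above it, and down by 1 with probability (1 - q) when a value is
    below it. The estimate drifts toward the point where a fraction q of the
    values lie below it.
    """
    estimate = start
    for value in stream:
        r = rng.random()
        if value > estimate and r > 1.0 - q:
            estimate += 1
        elif value < estimate and r > q:
            estimate -= 1
    return estimate


def averaged_estimate(partitions, q, rng):
    """Stand-in for the reduce step: average the per-partition estimates."""
    estimates = [frugal_quantile(part, q, rng) for part in partitions]
    return sum(estimates) / len(estimates)


# Synthetic data: four partitions of uniform integers in [0, 999].
rng = random.Random(42)
partitions = [[rng.randint(0, 999) for _ in range(50_000)] for _ in range(4)]

median_estimate = averaged_estimate(partitions, 0.5, rng)
p90_estimate = averaged_estimate(partitions, 0.9, rng)
print(median_estimate, p90_estimate)
```

The estimate is a random walk around the true quantile, so it is noisy for any one partition;
averaging across partitions, as suggested above, reduces that noise. It also needs enough rows
to walk from the starting value to the neighborhood of the quantile.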

On Thu, Sep 25, 2014 at 3:06 PM, Kevin Weiler <Kevin.Weiler@imc-chicago.com> wrote:
Hi All,

I have a query that computes percentiles on datasets well in excess of 100,000,000 rows, and
I have opted to use percentile_approx because we are routinely running out of memory. I’m having
trouble choosing the threshold at which sampling should begin. Before this dataset got so large,
the maximum number of rows I would need to include in the percentile was about 1,000,000. I’ve
tried 1,000,000 as the sampling threshold, then 100,000, and even the default of 10,000. For some
reason this query, which previously took about 20 minutes to run, now takes around 13 hours to
complete (with a threshold of 100,000). Are there Hive settings I should investigate to get this
query to complete in a reasonable time?

--
Kevin Weiler
IT
IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.weiler@imc-chicago.com


________________________________

The information in this e-mail is intended only for the person or entity to which it is addressed.

It may contain confidential and /or privileged material. If someone other than the intended
recipient should receive this e-mail, he / she shall not be entitled to read, disseminate,
disclose or duplicate it.

If you receive this e-mail unintentionally, please inform us immediately by "reply" and then
delete it from your system. Although this information has been compiled with great care, neither
IMC Financial Markets & Asset Management nor any of its related entities shall accept
any responsibility for any errors, omissions or other inaccuracies in this information or
for the consequences thereof, nor shall it be bound in any way by the contents of this e-mail
or its attachments. In the event of incomplete or incorrect transmission, please return the
e-mail to the sender and permanently delete this message and any attachments.

Messages and attachments are scanned for all known viruses. Always scan attachments before
opening them.



--


https://github.com/bearrito
@deepbearrito


