hive-user mailing list archives

From Prasanth Jayachandran <pjayachand...@hortonworks.com>
Subject Re: percentile_approx slowness
Date Thu, 02 Oct 2014 18:30:00 GMT
You can look at the explode() and posexplode() UDFs in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
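If the goal is fixed columns rather than rows, plain array indexing may be simpler than explode(); a sketch, where percentile_summary, source_table, and col are hypothetical names:

```sql
-- percentile_approx(col, array(...)) returns an array<double>;
-- index into it to turn each percentile into its own column.
INSERT INTO TABLE percentile_summary
SELECT
  pct[0] AS p10,
  pct[1] AS p50,
  pct[2] AS p90
FROM (
  SELECT percentile_approx(col, array(0.1, 0.5, 0.9)) AS pct
  FROM source_table
) t;
```

posexplode() is the alternative if you want one row per percentile (position plus value) instead of one wide row.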

Thanks
Prasanth Jayachandran

On Oct 2, 2014, at 7:15 AM, Kevin Weiler <Kevin.Weiler@imc-chicago.com> wrote:

> Hi all,
> 
> I wanted to note that I figured out a better solution to my problem. I was selecting each percentile I wanted to compute (0.1, 0.5, 0.9, etc.) as an individual percentile calculation, which was blowing up my query. It turns out that if you do it like this:
> 
> SELECT
>   PERCENTILE(col, array(0.1, 0.5, 0.9))
> 
> the aggregation doesn’t need to run multiple times and my query runs just fine.
> 
> I now have one additional question.
> 
> I would like to store each percentile as a field in another Hive table. This calculation returns an array. How can I break out the array into individual fields to be put into a new table?
> 
> --
> Kevin Weiler
> IT
> 
> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.weiler@imc-chicago.com
> 
> On Sep 25, 2014, at 3:35 PM, j.barrett Strausser <j.barrett.strausser@gmail.com> wrote:
> 
>> Not an answer to your question, but you can compute approximate percentiles with only the memory overhead of a single integer (two integers if you want better results):
>> 
>> http://link.springer.com/chapter/10.1007/978-3-642-40273-9_7
>> 
>> So you could pretty easily implement the above algorithm as a Python UDF and then have a reduce step that averages the results.
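A minimal sketch of that frugal-streaming idea (the Frugal-1U algorithm from the linked paper), assuming integer-valued input; the function name and defaults here are placeholders, not a ready-made Hive UDF:

```python
import random

def frugal_1u(stream, q=0.5, seed=None):
    """Estimate the q-th quantile of a stream using one integer of state.

    Frugal-1U: on each item, move the estimate up by 1 with probability q
    when the item is above it, and down by 1 with probability 1 - q when
    the item is below it. The estimate drifts toward the true quantile.
    """
    rng = random.Random(seed)
    m = 0  # the single integer of state: the current quantile estimate
    for x in stream:
        if x > m and rng.random() < q:
            m += 1
        elif x < m and rng.random() < (1 - q):
            m -= 1
    return m
```

Each mapper could run this over its split, with a reduce step averaging the per-split estimates, as suggested above; the estimate is approximate and wanders around the true quantile rather than converging exactly.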
>> 
>> On Thu, Sep 25, 2014 at 3:06 PM, Kevin Weiler <Kevin.Weiler@imc-chicago.com> wrote:
>> Hi All,
>> 
>> I have a query that attempts to compute percentiles on some datasets that are well in excess of 100,000,000 rows, and I have thus opted to use percentile_approx as we are routinely overrunning the memory. I’m having trouble finding a threshold at which I want to begin sampling. Before this dataset got so large, the maximum number of rows I would need to include in the percentile was about 1,000,000. I’ve tried using 1,000,000 as a sampling threshold, 100,000, and even the default 10,000. For some reason this query, which previously took about 20 minutes to run, is now taking around 13 hours to complete (in the case of 100,000 as my sampling threshold). Are there some Hive settings I should be investigating to see if I can have this query complete in a reasonable time?
>> 
>> --
>> Kevin Weiler
>> IT
>> 
>> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
>> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.weiler@imc-chicago.com
>> 
>> 
>> 
>> The information in this e-mail is intended only for the person or entity to which it is addressed.
>> 
>> It may contain confidential and/or privileged material. If someone other than the intended recipient should receive this e-mail, he/she shall not be entitled to read, disseminate, disclose or duplicate it.
>> 
>> If you receive this e-mail unintentionally, please inform us immediately by "reply" and then delete it from your system. Although this information has been compiled with great care, neither IMC Financial Markets & Asset Management nor any of its related entities shall accept any responsibility for any errors, omissions or other inaccuracies in this information or for the consequences thereof, nor shall it be bound in any way by the contents of this e-mail or its attachments. In the event of incomplete or incorrect transmission, please return the e-mail to the sender and permanently delete this message and any attachments.
>> 
>> Messages and attachments are scanned for all known viruses. Always scan attachments before opening them.
>> 
>> 
>> 
>> -- 
>> https://github.com/bearrito
>> @deepbearrito
> 


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
