hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "HBase Review Board (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column
Date Tue, 15 Jun 2010 18:21:27 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879054#action_12879054
] 

HBase Review Board commented on HIVE-1397:
------------------------------------------

Message from: "Mayank Lahiri" <mayank.lahiri@facebook.com>


bq.  On None, John Sichi wrote:
bq.  > http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java,
line 346
bq.  > <http://review.hbase.org/r/181/diff/1/?file=1351#file1351line346>
bq.  >
bq.  >     Since eventually we would like to support histograms on non-numeric fields such
as STRING, I think we should rename this one numeric_histogram (likewise for the Java class)
to avoid confusion later when we have other algorithms.
bq.  >

It might seem a little odd since histograms are generally used as approximations of numerical
distributions. I would suggest either (a) overloading histogram() to behave differently on
STRING arguments (perhaps STRING arguments that cause a NumberFormatException), or (b) creating
a factor_histogram() function for general strings. 

I could add in the code for computing frequencies of STRINGs quite easily, although there's
no way to prevent it from choking if there are too many unique strings.


bq.  On None, John Sichi wrote:
bq.  > http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java,
line 239
bq.  > <http://review.hbase.org/r/181/diff/1/?file=1352#file1352line239>
bq.  >
bq.  >     Shouldn't this method be private?

yes! thanks!


bq.  On None, John Sichi wrote:
bq.  > http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java,
line 186
bq.  > <http://review.hbase.org/r/181/diff/1/?file=1352#file1352line186>
bq.  >
bq.  >     Formatting nitpick:  Here and elsewhere, use braces even for single-line blocks;
we use the NeedBraces checkstyle rule to flag this.
bq.  >     
bq.  >     http://stackoverflow.com/questions/382633/can-the-checkstyle-module-needbraces-work-with-nested-if-else-blocks
bq.  >

Will re-submit patch, thanks.


bq.  On None, John Sichi wrote:
bq.  > http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java,
line 362
bq.  > <http://review.hbase.org/r/181/diff/1/?file=1352#file1352line362>
bq.  >
bq.  >     Under what conditions can this exception be encountered?  Shouldn't it be impossible
since we already checked the type up front?

This could happen when a STRING row contains a non-numeric value. As a follow-up to my earlier
comment, we could either drop the value or somehow "intelligently" switch to computing the
histogram over strings instead of doubles.


bq.  On None, John Sichi wrote:
bq.  > http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java,
line 153
bq.  > <http://review.hbase.org/r/181/diff/1/?file=1352#file1352line153>
bq.  >
bq.  >     Is it possible to return an ARRAY<STRUCT{x DOUBLE; y DOUBLE}> instead?
 That seems more natural (and compact) than a MAP<DOUBLE,DOUBLE>.
bq.  >     
bq.  >     But if you already got feedback that MAP is prefereable, ignore this comment.
bq.  >

Not a problem. I was using the map to avoid an extra level of indirection, and possibly to
make it compatible with an explode() extension that explodes maps as well as arrays. 


- Mayank


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/181/#review224
-----------------------------------------------------------





> histogram() UDAF for a numerical column
> ---------------------------------------
>
>                 Key: HIVE-1397
>                 URL: https://issues.apache.org/jira/browse/HIVE-1397
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Mayank Lahiri
>            Assignee: Mayank Lahiri
>             Fix For: 0.6.0
>
>         Attachments: Histogram_quality.png.jpg, HIVE-1397.1.patch
>
>
> A histogram() UDAF to generate an approximate histogram of a numerical (byte, short,
double, long, etc.) column. The result is returned as a map of (x,y) histogram pairs, and
can be plotted in Gnuplot using impulses (for example). The algorithm is currently adapted
from "A streaming parallel decision tree algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010),
and uses space proportional to the number of histogram bins specified. It has no approximation
guarantees, but seems to work well when there is a lot of data and a large number (e.g. 50-100)
of histogram bins specified.
> A typical call might be:
> SELECT histogram(val, 10) FROM some_table;
> where the result would be a histogram with 10 bins, returned as a Hive map object.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message