hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/StatisticsAndDataMining" by MayankLahiri
Date Thu, 19 Aug 2010 22:36:54 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/StatisticsAndDataMining" page has been changed by MayankLahiri.
The comment on this change is: finished datamining and statistics wiki page.
http://wiki.apache.org/hadoop/Hive/StatisticsAndDataMining?action=diff&rev1=1&rev2=2

--------------------------------------------------

  = Statistics and Data Mining in Hive =
  
- This page is a central repository for the slightly more advanced statistical and data mining
functions that are being integrated into Hive, and especially the functions that warrant more
than one-line descriptions. 
+ This page is the secondary documentation for the slightly more advanced statistical and
data mining functions that are being integrated into Hive, and especially the functions that
warrant more than one-line descriptions. 
  
  <<TableOfContents(3)>>
  
@@ -74, +74 @@

  
  == histogram_numeric(): Estimating frequency distributions ==
  
+ A histogram represents the frequency distribution of empirical data. The histograms produced
here have variable-sized bins: this UDAF returns a list of (x,y) pairs that give the bin
centers and bin heights. It is up to you to plot them in Excel, Gnuplot, Matlab, or Mathematica
to get a graphical display.
+ 
+ === Use Cases ===
+ 
+  1. Estimating the frequency distribution of a column, possibly grouped by other attributes.
+  2. Choosing discretization points in a continuous-valued column.
+ 
+ === Usage ===
+ 
+ {{{
+ SELECT histogram_numeric(age, 10) FROM users GROUP BY gender;
+ }}}
+ 
+ The query above returns one histogram per `gender` group. In the Example section further
down, the second argument (10) is the number of bins. Converting the output into a graphical
display is a bit more involved. The following [[http://www.gnuplot.info/|Gnuplot]] command
should do it, assuming that you've parsed the output from `histogram_numeric()` into a text
file of (x,y) pairs called `data.txt`.
+ 
+ {{{
+ plot 'data.txt' u 1:2 w impulses lw 5
+ }}}
+ 
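The parsing step mentioned above could be sketched roughly as follows. This is only an illustration in Python, not part of Hive: it assumes the query output has the shape shown in the Example section (one `{"x":...,"y":...}` object per line, as produced with `explode()`), and the function names `parse_histogram_lines` and `write_gnuplot_data` are made up for this sketch.

```python
import json


def parse_histogram_lines(lines):
    """Turn lines such as {"x":-3.65,"y":20.0} into (x, y) float pairs.

    Assumption: every histogram row is a JSON object on its own line.
    Anything else (blank lines, the echoed query, log output) is skipped.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        record = json.loads(line)
        pairs.append((float(record["x"]), float(record["y"])))
    return pairs


def write_gnuplot_data(pairs, path="data.txt"):
    """Write one 'x y' pair per line, the layout that 'u 1:2' reads."""
    with open(path, "w") as out:
        for x, y in pairs:
            out.write(f"{x} {y}\n")


if __name__ == "__main__":
    # Two rows copied from the Example section, used as a small demo.
    sample = [
        '{"x":-3.6505464999999995,"y":20.0}',
        '{"x":-2.7514727901960785,"y":510.0}',
    ]
    print(parse_histogram_lines(sample))
```

Calling `write_gnuplot_data(parse_histogram_lines(open("hive_output.txt")))` would then produce `data.txt`, where `hive_output.txt` is a hypothetical file holding the saved query output.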
+ === Example ===
+ 
+ {{{
+ SELECT explode(histogram_numeric(val, 10)) AS x FROM normal;
+ {"x":-3.6505464999999995,"y":20.0}
+ {"x":-2.7514727901960785,"y":510.0}
+ {"x":-1.7956678951954481,"y":8263.0}
+ {"x":-0.9878507685761995,"y":19167.0}
+ {"x":-0.2625338380837097,"y":31737.0}
+ {"x":0.5057392319427763,"y":31502.0}
+ {"x":1.2774146480311135,"y":14526.0}
+ {"x":2.083955560712489,"y":3986.0}
+ {"x":2.9209550254545484,"y":275.0}
+ {"x":3.674835214285715,"y":14.0}
+ }}}
+ 
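As a rough illustration of the two use cases, the Python sketch below takes the (x,y) pairs from the example output, converts the bin heights to relative frequencies, and proposes candidate discretization points. Taking the midpoints between adjacent bin centers is an assumption made for this sketch, not something Hive does, and the function names are hypothetical.

```python
# (bin center, bin height) pairs copied from the example output above.
bins = [
    (-3.6505464999999995, 20.0),
    (-2.7514727901960785, 510.0),
    (-1.7956678951954481, 8263.0),
    (-0.9878507685761995, 19167.0),
    (-0.2625338380837097, 31737.0),
    (0.5057392319427763, 31502.0),
    (1.2774146480311135, 14526.0),
    (2.083955560712489, 3986.0),
    (2.9209550254545484, 275.0),
    (3.674835214285715, 14.0),
]


def relative_frequencies(bins):
    """Scale bin heights so that they sum to 1."""
    total = sum(height for _, height in bins)
    return [(center, height / total) for center, height in bins]


def midpoint_cut_points(bins):
    """Candidate discretization points: midpoints between adjacent centers.

    This is one simple heuristic chosen for illustration only.
    """
    centers = sorted(center for center, _ in bins)
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


total = sum(height for _, height in bins)
print(total)  # 110000.0, the number of values summarized by the histogram

for center, freq in relative_frequencies(bins):
    print(f"{center:8.3f}  {freq:.4f}")

print(midpoint_cut_points(bins))
```

The bin heights sum to 110000, which gives a quick sanity check that the histogram accounts for every value that was aggregated.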
