Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/StatisticsAndDataMining" page has been changed by MayankLahiri.
The comment on this change is: finished datamining and statistics wiki page.
http://wiki.apache.org/hadoop/Hive/StatisticsAndDataMining?action=diff&rev1=1&rev2=2

--------------------------------------------------

  = Statistics and Data Mining in Hive =
- This page is a central repository for the slightly more advanced statistical and data mining functions that are being integrated into Hive, and especially the functions that warrant more than one-line descriptions.
+ This page is the secondary documentation for the slightly more advanced statistical and data mining functions that are being integrated into Hive, and especially the functions that warrant more than one-line descriptions.

  <>

@@ -74, +74 @@

  == histogram_numeric(): Estimating frequency distributions ==

+ Histograms represent frequency distributions from empirical data. The histograms referred to here have variable-sized bins. Specifically, this UDAF returns a list of (x,y) pairs that represent histogram bin centers and heights. It's up to you to then plot them in Excel / Gnuplot / Matlab / Mathematica to get a nice graphical display.
+
+ === Use Cases ===
+
+ 1. Estimating the frequency distribution of a column, possibly grouped by other attributes.
+ 2. Choosing discretization points in a continuous valued column.
+
+ === Usage ===
+
+ {{{
+ SELECT histogram_numeric(age) FROM users GROUP BY gender;
+ }}}
+
+ The query above returns one histogram of `age` for each value of `gender`. Converting the output into a graphical display is a bit more involved. The following [[http://www.gnuplot.info/|Gnuplot]] command should do it, assuming that you've parsed the output from `histogram_numeric()` into a text file of (x,y) pairs called `data.txt`.
+
+ {{{
+ plot 'data.txt' u 1:2 w impulses lw 5
+ }}}
+
+ === Example ===
+
+ {{{
+ SELECT explode(histogram_numeric(val, 10)) AS x FROM normal;
+ {"x":-3.6505464999999995,"y":20.0}
+ {"x":-2.7514727901960785,"y":510.0}
+ {"x":-1.7956678951954481,"y":8263.0}
+ {"x":-0.9878507685761995,"y":19167.0}
+ {"x":-0.2625338380837097,"y":31737.0}
+ {"x":0.5057392319427763,"y":31502.0}
+ {"x":1.2774146480311135,"y":14526.0}
+ {"x":2.083955560712489,"y":3986.0}
+ {"x":2.9209550254545484,"y":275.0}
+ {"x":3.674835214285715,"y":14.0}
+ }}}
+
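The Usage section assumes the histogram output has already been parsed into `data.txt`, but does not show that step. A minimal Python sketch of it follows, assuming the query output was captured as one JSON object per line, as in the Example above. The helper names `parse_histogram_rows` and `write_pairs` are hypothetical and not part of Hive.

```python
import json


def parse_histogram_rows(lines):
    """Turn rows such as {"x":0.5,"y":20.0} into (x, y) float pairs.

    Assumes one JSON object per line, as in the Example output.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the captured output
        row = json.loads(line)
        pairs.append((float(row["x"]), float(row["y"])))
    return pairs


def write_pairs(pairs, path="data.txt"):
    """Write one 'x y' pair per line, the layout that 'u 1:2' reads."""
    with open(path, "w") as out:
        for x, y in pairs:
            out.write(f"{x} {y}\n")


if __name__ == "__main__":
    # Three rows copied from the Example output above.
    sample = [
        '{"x":-3.6505464999999995,"y":20.0}',
        '{"x":-0.2625338380837097,"y":31737.0}',
        '{"x":3.674835214285715,"y":14.0}',
    ]
    write_pairs(parse_histogram_rows(sample))
```

The resulting `data.txt` holds the bin center in column 1 and the bin height in column 2, which is what the Gnuplot command in the Usage section plots.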