hadoop-hdfs-user mailing list archives

From unmesha sreeveni <unmeshab...@gmail.com>
Subject Binning for numerical dataset
Date Tue, 04 Feb 2014 09:40:55 GMT
I am able to normalize a given dataset, for example
100,1:2:3
101,2:3:4

into
100 1
100 2
100 3
101 2
101 3
101 4
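
For reference, the flattening above can be sketched in plain Java. This is only a minimal illustration: the class name FlattenRecords and the method flatten are made up for this example, and it assumes every record has the form "id,v1:v2:v3".

```java
import java.util.ArrayList;
import java.util.List;

public class FlattenRecords {
    // Hypothetical helper: turns "id,v1:v2:v3" into one "id v" line per value.
    static List<String> flatten(String record) {
        String[] parts = record.split(",", 2);   // parts[0] = id, parts[1] = "v1:v2:v3"
        List<String> out = new ArrayList<>();
        for (String value : parts[1].split(":")) {
            out.add(parts[0] + " " + value);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"100,1:2:3", "101,2:3:4"}) {
            flatten(line).forEach(System.out::println);
        }
    }
}
```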

How can I do binning for a numerical dataset, say iris.csv?

I worked out the maths behind it, using the Iris dataset:
http://archive.ics.uci.edu/ml/datasets/Iris

1. Find the minimum and maximum values of each attribute in the data set.

       Sepal Length   Sepal Width   Petal Length   Petal Width
Min    4.3            2.0           1.0            0.1
Max    7.9            4.4           6.9            2.5
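
The min/max step can be sketched as a single pass over the rows in plain Java. This is a rough sketch, not a MapReduce job: the class name ColumnRange and the method minMax are hypothetical, and the three rows in main are only a small sample in the Iris column order, not the full dataset.

```java
import java.util.Arrays;

public class ColumnRange {
    // Hypothetical helper: returns {min[], max[]}, one entry per column.
    static double[][] minMax(double[][] rows) {
        int cols = rows[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : rows) {
            for (int c = 0; c < cols; c++) {
                min[c] = Math.min(min[c], row[c]);
                max[c] = Math.max(max[c], row[c]);
            }
        }
        return new double[][] {min, max};
    }

    public static void main(String[] args) {
        // Sample rows only (sepal length, sepal width, petal length, petal width).
        double[][] rows = {
            {5.1, 3.5, 1.4, 0.2},
            {4.9, 3.0, 1.4, 0.2},
            {4.7, 3.2, 1.3, 0.2}
        };
        double[][] range = minMax(rows);
        System.out.println("Min " + Arrays.toString(range[0]));
        System.out.println("Max " + Arrays.toString(range[1]));
    }
}
```

Since min and max can be combined from partial results, the same logic would presumably fit a reducer keyed by attribute, but that is an assumption and not something I have tested.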

2. Divide the data values of each attribute into 'n' buckets, say n = 5.
Bucket width = (Max - Min) / n


Eg: Sepal Length
Bucket width = (7.9 - 4.3) / 5
             = 0.72
So, the intervals will be as follows:
4.30 - 5.02
5.02 - 5.74
5.74 - 6.46
6.46 - 7.18
7.18 - 7.90
Likewise, continue for all the other attributes.
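
The bucket arithmetic for Sepal Length can be sketched in plain Java as below. The names EqualWidthBinning, bucketWidth and bucketIndex are made up for illustration. One assumption that the maths above does not state: a value equal to Max is placed in the last bucket, so the final interval is closed on both ends.

```java
public class EqualWidthBinning {
    // Width of each bucket: (max - min) / n.
    static double bucketWidth(double min, double max, int n) {
        return (max - min) / n;
    }

    // Zero-based bucket index for a value. Values at or above max are clamped
    // into the last bucket, values below min into the first (an assumption).
    static int bucketIndex(double value, double min, double max, int n) {
        int idx = (int) Math.floor((value - min) / bucketWidth(min, max, n));
        return Math.max(0, Math.min(idx, n - 1));
    }

    public static void main(String[] args) {
        double min = 4.3, max = 7.9;   // Sepal Length range from the table
        int n = 5;
        double width = bucketWidth(min, max, n);
        // Print the n interval boundaries.
        for (int i = 0; i < n; i++) {
            System.out.printf("%.2f - %.2f%n", min + i * width, min + (i + 1) * width);
        }
        // A sepal length of 6.0 lies in the third interval (index 2).
        System.out.println(bucketIndex(6.0, min, max, n));
    }
}
```

Because bucketIndex needs only the value and the per-attribute Min, Max and n, it could in principle be applied record by record once the ranges are known.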
How can I do the same in MapReduce?



-- 
*Thanks & Regards*

Unmesha Sreeveni U.B
