hadoop-hdfs-user mailing list archives

From unmesha sreeveni <unmeshab...@gmail.com>
Subject Re: WholeFileInputFormat in hadoop
Date Mon, 30 Jun 2014 04:01:08 GMT
I am trying to implement the DBSCAN algorithm. I referred to the algorithm in "Data Mining - Concepts
and Techniques (3rd Ed)", chapter 10, page 474.
In this algorithm we need to find the distance between every pair of points.
say my sample input is

So in DBSCAN we have to pick one element and then find its distance to all the other points.
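
To make the step above concrete, here is a minimal in-memory sketch of that neighbourhood search in plain Java (no Hadoop involved). It assumes Euclidean distance and numeric points; the class name EpsilonNeighbors, the method names, the eps value, and every sample point other than (5,6) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class EpsilonNeighbors {
    // Euclidean distance between two points of equal dimension (assumed metric).
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Indices of all points within eps of points.get(seed), the seed included.
    public static List<Integer> regionQuery(List<double[]> points, int seed, double eps) {
        List<Integer> neighbors = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) {
            if (distance(points.get(seed), points.get(i)) <= eps) {
                neighbors.add(i);
            }
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; only (5,6) is taken from the message.
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {5, 6});
        points.add(new double[] {5, 7});
        points.add(new double[] {9, 9});
        System.out.println(regionQuery(points, 0, 1.5)); // prints [0, 1]
    }
}
```

Each query scans every point once, which is why the picked element needs access to the whole data set.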

While implementing this, I cannot get the whole file in the map function,
which I need in order to find the distances.
I tried a couple of approaches:
1. Used WholeFileInputFormat and did the entire algorithm in the map itself. I don't
think this is a good approach (and it ended with a heap space error).
2. This one is not implemented, as I thought it was not feasible:
  - Read one line of the input data set in the driver and write it to a new file (say
  - This centroid can be read in setup(); map then calculates the distance and
emits the data which satisfies the DBSCAN condition as
map(id, epsilon neighbour). In the reducer we will be able to aggregate all the
epsilon neighbours of (5,6) which come from different maps, and then
find the neighbours of each epsilon neighbour.
  - The next iteration has to do the same again: read the input file and find a node
which is not visited....
If the input is a 1 GB file, the MR job executes as many times as the total
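
One pass of approach 2 can be simulated in plain Java to show the data flow. This is only a sketch under assumptions, not Hadoop code: mapSplit stands in for a mapper working on one input split (with the seed point already loaded, as setup() would do), and reduce stands in for the reducer that merges what the different maps emitted for that seed. The class and method names, the eps value, and all points except (5,6) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SeedIterationSketch {
    // Euclidean distance between two points of equal dimension (assumed metric).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Stand-in for map(): emits every point of one split that lies within eps of the seed.
    public static List<double[]> mapSplit(List<double[]> split, double[] seed, double eps) {
        List<double[]> emitted = new ArrayList<>();
        for (double[] p : split) {
            if (distance(seed, p) <= eps) {
                emitted.add(p);
            }
        }
        return emitted;
    }

    // Stand-in for reduce(): merges the neighbours emitted by all maps for the same seed.
    public static List<double[]> reduce(List<List<double[]>> perSplit) {
        List<double[]> all = new ArrayList<>();
        for (List<double[]> part : perSplit) {
            all.addAll(part);
        }
        return all;
    }

    public static void main(String[] args) {
        double[] seed = {5, 6};
        // Two hypothetical input splits of the same data set.
        List<double[]> split1 = new ArrayList<>();
        split1.add(new double[] {5, 6});
        split1.add(new double[] {5, 7});
        List<double[]> split2 = new ArrayList<>();
        split2.add(new double[] {6, 6});
        split2.add(new double[] {9, 9});

        List<List<double[]>> mapOutputs = new ArrayList<>();
        mapOutputs.add(mapSplit(split1, seed, 1.5));
        mapOutputs.add(mapSplit(split2, seed, 1.5));
        System.out.println(reduce(mapOutputs).size()); // prints 3
    }
}
```

The sketch handles a single seed per pass; expanding a cluster means repeating the pass for each newly found neighbour, which is the repeated-job cost described above.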

Can anyone suggest a better way to do this?

I hope the use case is understandable; if not, please tell me. I will explain

*Thanks & Regards *

*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
