hadoop-common-user mailing list archives

From Chris Mawata <chris.maw...@gmail.com>
Subject Re: Split files into 80% and 20% for building model and prediction
Date Fri, 12 Dec 2014 14:12:06 GMT
How about doing something along the lines of bucketing: pick a field that is
unique for each record and, if the hash of that field mod 10 is less than 8, put
the record in one bin (the 80%); otherwise put it in the other (the 20%).
Cheers
Chris
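
A minimal map-only sketch of that idea, assuming the unique field is the first
tab-separated column of each line; the class name and the "train"/"test" named
outputs are illustrative, not from the original thread:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical map-only splitter: hashes a unique field and routes each
// record to a "train" (~80%) or "test" (~20%) named output.
public class SplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> outputs;

  @Override
  protected void setup(Context context) {
    outputs = new MultipleOutputs<NullWritable, Text>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumption: the unique field is the first tab-separated column.
    String uniqueField = value.toString().split("\t", 2)[0];
    // Mask the sign bit so the bucket is always in 0..9.
    int bucket = (uniqueField.hashCode() & Integer.MAX_VALUE) % 10;
    if (bucket < 8) {
      outputs.write("train", NullWritable.get(), value);  // ~80% of records
    } else {
      outputs.write("test", NullWritable.get(), value);   // ~20% of records
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    outputs.close();
  }
}

In the driver you would register each named output, e.g.
MultipleOutputs.addNamedOutput(job, "train", TextOutputFormat.class,
NullWritable.class, Text.class) (and likewise for "test"), and set the number
of reducers to 0 so the split happens entirely in the map phase. Note that
String.hashCode() is not a uniform hash, so the split is only approximately
80/20.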
On Dec 12, 2014 1:32 AM, "unmesha sreeveni" <unmeshabiju@gmail.com> wrote:

> I am trying to divide my HDFS file into 2 parts/files,
> 80% and 20%, for a classification algorithm (80% for modelling and 20% for
> prediction).
> Please provide suggestions for the same.
> To put 80% and 20% into 2 separate files we need to know the exact number
> of records in the data set,
> and that is only known if we go through the data set once.
> So we need to write 1 MapReduce job just for counting the number of
> records and
> a 2nd MapReduce job for separating the 80% and 20% into 2 files using
> MultipleOutputs.
>
>
> Am I on the right track, or is there an alternative for the same?
> But again, a small confusion: how to check whether the reducer has received
> 80% of the data.
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>
