hadoop-mapreduce-user mailing list archives

From: unmesha sreeveni <unmeshab...@gmail.com>
Subject: Re: Split files into 80% and 20% for building model and prediction
Date: Fri, 12 Dec 2014 11:00:20 GMT
Hi Mikael,
So you won't need to write an MR job to count the number of records in that
file in order to find the 80% and 20% split?
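
A minimal sketch of the mapper-side random split Mikael describes below,
assuming the Hadoop 2.x Java API and MultipleOutputs; the class name and the
"train"/"test" output names are illustrative, not something from this thread:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Map-only job: each input line is routed at random to a "train" or "test"
// named output, so roughly 80% of the rows end up in the training file.
public class RandomSplitMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> mos;
  private final Random random = new Random();

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<NullWritable, Text>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (random.nextDouble() < 0.8) {
      mos.write("train", NullWritable.get(), value);   // ~80% of rows
    } else {
      mos.write("test", NullWritable.get(), value);    // ~20% of rows
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();
  }
}

In the driver the two named outputs would be registered with
MultipleOutputs.addNamedOutput(job, "train", TextOutputFormat.class,
NullWritable.class, Text.class) (and likewise for "test"), and the number of
reducers set to 0, so the whole split happens in a single map-only pass
without counting the records first.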

On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk <mikael.sitruk@gmail.com>
wrote:
>
> I would use a different approach. For each row, the mapper would invoke
> random.Next(); if the generated number is below 0.8, the row goes to the key
> for training, otherwise it goes to the key for the test set.
> Mikael.s
> ------------------------------
> From: Susheel Kumar Gadalay <skgadalay@gmail.com>
> Sent: 12/12/2014 12:00
> To: user@hadoop.apache.org
> Subject: Re: Split files into 80% and 20% for building model and
> prediction
>
> Simple solution..
>
> Copy the HDFS file to the local filesystem and use OS commands to count the number of lines
>
> cat file1 | wc -l
>
> and cut it based on line number.
>
>
> On 12/12/14, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
> > I am trying to divide my HDFS file into two parts/files,
> > 80% and 20%, for a classification algorithm (80% for modelling and 20% for
> > prediction).
> > Please provide suggestions for the same.
> > To put the 80% and 20% into two separate files we need to know the exact
> > number of records in the data set,
> > and that is only known if we go through the data set once.
> > So we need to write one MapReduce job just for counting the number of
> > records,
> > and a second MapReduce job for separating the 80% and 20% into two files
> > using MultipleOutputs.
> >
> >
> > Am I on the right track, or is there an alternative for the same?
> > But again, a small confusion: how do I check whether the reducer has
> > received 80% of the data?
> >
> >
> > --
> > *Thanks & Regards *
> >
> >
> > *Unmesha Sreeveni U.B*
> > *Hadoop, Bigdata Developer*
> > *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> > http://www.unmeshasreeveni.blogspot.in/
> >
>
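
For the two-job plan in the quoted question above, the first (counting) job
can be as small as the sketch below. This is an illustration under the same
assumptions as the sketch earlier in this message, not code from the thread;
note that the same total is also available from the framework's
MAP_INPUT_RECORDS counter of any job that reads the data, so a dedicated
counting pass may not even be necessary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RecordCount {

  // Mapper that emits nothing; the framework counts its input records.
  public static class CountMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) {
      // No output needed; MAP_INPUT_RECORDS is incremented automatically.
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "record count");
    job.setJarByClass(RecordCount.class);
    job.setMapperClass(CountMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.waitForCompletion(true);

    // Total number of records across all mappers.
    long total = job.getCounters()
        .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
    System.out.println("records = " + total);
  }
}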


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
