Subject: Re: Split files into 80% and 20% for building model and prediction
From: Chris Mawata <chris.mawata@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 12 Dec 2014 09:12:06 -0500

How about doing something along the lines of bucketing: pick a field that is
unique for each record; if the hash of that field mod 10 is 7 or less the
record goes into one bin (eight of the ten buckets, i.e. roughly 80% of the
data), otherwise into the other one (see the sketch at the end of this mail).

Cheers
Chris

On Dec 12, 2014 1:32 AM, "unmesha sreeveni" wrote:

> I am trying to divide my HDFS file into two parts/files, 80% and 20%, for a
> classification algorithm (80% for building the model and 20% for prediction).
> Please suggest how to do this.
>
> To write 80% and 20% into two separate files we need to know the exact
> number of records in the data set, and that is only known after going
> through the data set once. So we would need one MapReduce job just to count
> the records, and a second MapReduce job to separate the 80% and 20% into two
> files using Multiple Inputs.
>
> Am I on the right track, or is there an alternative?
> But again, a small confusion: how do I check whether the reducer gets
> filled with 80% of the data?
>
> --
> Thanks & Regards
>
> Unmesha Sreeveni U.B
> Hadoop, Bigdata Developer
> Centre for Cyber Security | Amrita Vishwa Vidyapeetham
> http://www.unmeshasreeveni.blogspot.in/
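A minimal map-only sketch of the bucketing idea above, assuming plain text
input where the unique field is the first tab-separated column; the class
name, the field position, and the "train"/"test" output names are
illustrative assumptions, not anything specified in this thread:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class TrainTestSplitMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> out;

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assume the unique field is the first tab-separated column.
    String uniqueField = value.toString().split("\t", 2)[0];

    // hashCode() can be negative, so mask off the sign bit before taking mod.
    int bucket = (uniqueField.hashCode() & Integer.MAX_VALUE) % 10;

    // Buckets 0-7 (roughly 80% of the records) go to "train", 8-9 to "test".
    if (bucket <= 7) {
      out.write("train", NullWritable.get(), value);
    } else {
      out.write("test", NullWritable.get(), value);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    out.close();
  }
}

In the driver you would register both named outputs with
MultipleOutputs.addNamedOutput(job, "train", TextOutputFormat.class,
NullWritable.class, Text.class) (and likewise for "test"), set the number of
reduce tasks to 0, and optionally use LazyOutputFormat to suppress the empty
default part files. Because each record's destination is decided by its own
hash, the split is a single map-only pass and the separate counting job from
the original question is not needed.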
