Subject: Re: SequenceFile split question
From: Bejoy Ks
To: common-user@hadoop.apache.org
Date: Thu, 15 Mar 2012 20:28:32 +0530

Hi Mohit

You are right. If your smaller XML files are in HDFS, then MR would be the
best approach to combine them into a sequence file; it would do the job in
parallel.

Regards
Bejoy.K.S

On Thu, Mar 15, 2012 at 8:17 PM, Mohit Anchlia wrote:

> Thanks! That helps. I am reading small XML files from an external file
> system and then writing them to the SequenceFile. I made it a standalone
> client, thinking that MapReduce may not be the best way to do this type
> of writing. My understanding was that MapReduce is best suited for
> processing data within HDFS. Is MapReduce also one of the options I
> should consider?
>
> On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks wrote:
>
> > Hi Mohit
> >
> > If you are using a standalone client application to do this, there is
> > definitely just one instance of it running, and you'd be writing the
> > sequence file to one HDFS block at a time. Once it reaches the HDFS
> > block size, writing continues to the next block; in the meantime the
> > first block is replicated. If you are doing the same job distributed
> > as MapReduce, you'd be writing to n files at a time, where n is the
> > number of tasks in your MapReduce job.
> >
> > AFAIK, the data node where the blocks are placed is determined by
> > Hadoop; it is not controlled by the end-user application.
> > But if you are triggering the standalone job on a particular data
> > node and it has space, one replica will be stored on that node. The
> > same applies to MR tasks as well.
> >
> > Regards
> > Bejoy.K.S
> >
> > On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia wrote:
> >
> > > I have a client program that creates a sequence file, which
> > > essentially merges small files into a big file. I was wondering how
> > > the sequence file splits the data across nodes. When I start, the
> > > sequence file is empty. Does it get split when it reaches the
> > > dfs.block size? If so, does that mean I am always writing to just
> > > one node at a given point in time?
> > >
> > > If I start a new client writing a new sequence file, is there a way
> > > to select a different data node?