Subject: Re: SequenceFile split question
From: Bejoy Ks
To: common-user@hadoop.apache.org
Date: Thu, 15 Mar 2012 20:28:32 +0530

Hi Mohit

You are right. If your smaller XML files are in HDFS, then MR would be the
best approach to combine them into a sequence file; it would do the job in
parallel.

Regards
Bejoy.K.S

On Thu, Mar 15, 2012 at 8:17 PM, Mohit Anchlia wrote:

> Thanks! That helps. I am reading small XML files from an external file
> system and then writing them to the SequenceFile. I made it a standalone
> client, thinking that MapReduce may not be the best way to do this type
> of writing. My understanding was that MapReduce is best suited for
> processing data within HDFS. Is MapReduce also one of the options I
> should consider?
>
> On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks wrote:
>
> > Hi Mohit
> >
> > If you are using a standalone client application to do this, there is
> > definitely just one instance of it running, and you'd be writing the
> > sequence file to one HDFS block at a time. Once it reaches the HDFS
> > block size, writing continues to the next block; in the meantime the
> > first block is replicated. If you are doing the same job distributed
> > as MapReduce, you'd be writing to n files at a time, where n is the
> > number of tasks in your MapReduce job.
> >
> > AFAIK, the data node where the blocks are placed is determined by
> > Hadoop; it is not controlled by the end-user application.
> > But if you are triggering the standalone job on a particular data
> > node and it has space, one replica will be stored on that node. The
> > same applies to MR tasks as well.
> >
> > Regards
> > Bejoy.K.S
> >
> > On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia wrote:
> >
> > > I have a client program that creates a sequence file, which
> > > essentially merges small files into a big file. I was wondering how
> > > the sequence file splits the data across nodes. When I start, the
> > > sequence file is empty. Does it get split when it reaches the
> > > dfs.block size? If so, does that mean I am always writing to just
> > > one node at a given point in time?
> > >
> > > If I start a new client writing a new sequence file, is there a way
> > > to select a different data node?