From: Ted Dunning
Date: Fri, 29 Mar 2013 14:05:43 +0100
Subject: Re: Million docs and word count scenario
To: "common-user@hadoop.apache.org", pathurun@yahoo.com

Putting each document into a separate file is not likely to be a good idea.

On the other hand, putting them all into one file may not be what you want either.

It is probably best to find a middle ground: create files that each contain many documents and are each a few gigabytes in size.
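A minimal sketch of that packing step, assuming the documents sit on local disk and the target is an HDFS SequenceFile of (document name, document text) pairs. The DocPacker class name and all paths are made up for illustration; error handling is kept minimal.

// Sketch: pack many small documents into one SequenceFile so HDFS sees a
// few large files instead of a million tiny ones.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DocPacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);   // e.g. a part file under an HDFS docs directory

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
            // Remaining arguments are local document paths; key = doc name, value = text.
            for (int i = 1; i < args.length; i++) {
                String body = new String(
                        Files.readAllBytes(Paths.get(args[i])), StandardCharsets.UTF_8);
                writer.append(new Text(args[i]), new Text(body));
            }
        } finally {
            writer.close();
        }
    }
}

Run something like this once per batch of documents until each output file reaches a few gigabytes, then point the word-count job at the directory of packed files.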
On Fri, Mar 29, 2013 at 1:15 PM, <pathurun@yahoo.com> wrote:

> If there are 1 million docs in an enterprise and we need to perform word
> count computation on all the docs, what is the first step to be done? Is it
> to extract all the text of all the docs into a single file and then put that
> into HDFS, or to put each one into HDFS separately?
>
> Thanks
>
> Sent from BlackBerry® on Airtel
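The counting step itself is the standard Hadoop word-count pattern. A rough sketch against the Hadoop 2.x mapreduce API, assuming the input is a directory of the (name, text) SequenceFiles sketched above; class names and paths are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each input record is one document: key = doc name, value = doc text.
    public static class TokenizerMapper extends Mapper<Text, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // packed docs directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}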