From: Ted Dunning
Date: Fri, 29 Mar 2013 14:05:43 +0100
Subject: Re: Million docs and word count scenario
To: "common-user@hadoop.apache.org", pathurun@yahoo.com

Putting each document into a separate file is not likely to be a good idea.

On the other hand, putting them all into one file may not be what you want either.

It is probably best to find a middle ground: create files that each contain many documents and are each a few gigabytes in size.
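A minimal sketch of that packing step, assuming the documents sit on local disk and the target is an HDFS SequenceFile of (document name, document text) pairs. The DocPacker class name and all paths are made up for illustration; error handling is kept minimal.

// Sketch: pack many small documents into one SequenceFile so HDFS sees a
// few large files instead of a million tiny ones.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DocPacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);   // e.g. a part file under an HDFS docs directory

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
            // Remaining arguments are local document paths; key = doc name, value = text.
            for (int i = 1; i < args.length; i++) {
                String body = new String(
                        Files.readAllBytes(Paths.get(args[i])), StandardCharsets.UTF_8);
                writer.append(new Text(args[i]), new Text(body));
            }
        } finally {
            writer.close();
        }
    }
}

Run something like this once per batch of documents until each output file reaches a few gigabytes, then point the word-count job at the directory of packed files.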
On Fri, Mar 29, 2013 at 1:15 PM, <pathurun@yahoo.com> wrote:

> If there are 1 million docs in an enterprise and we need to perform word
> count computation on all the docs, what is the first step to be done? Is it
> to extract all the text of all the docs into a single file and then put that
> into HDFS, or to put each one into HDFS separately?
>
> Thanks
>
> Sent from BlackBerry® on Airtel
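The counting step itself is the standard Hadoop word-count pattern. A rough sketch against the Hadoop 2.x mapreduce API, assuming the input is a directory of the (name, text) SequenceFiles sketched above; class names and paths are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each input record is one document: key = doc name, value = doc text.
    public static class TokenizerMapper extends Mapper<Text, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // packed docs directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}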