Subject: Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
From: Edward Choi <mp2893@gmail.com>
Date: Sat, 11 Dec 2010 03:15:28 +0900
To: common-user@hadoop.apache.org

God, I never knew they had a project like this.
I should definitely check it out. I may even be able to use it at my workplace. Thanks for the tip!!

From mp2893's iPhone

On 2010. 12. 10., at 10:36 PM, "Jones, Nick" wrote:

> It might be worth looking into Nutch; it can probably be configured to do the type of crawling you need.
>
> Nick Jones
>
> -----Original Message-----
> From: Edward Choi [mailto:mp2893@gmail.com]
> Sent: Friday, December 10, 2010 6:24 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
>
> Wow, thanks for the info. I'll definitely try that.
> One question, though...
> Are those "tagged names" and "free indicators" some kind of special class variables provided by the MultipleOutputs class?
>
> Ed
>
> From mp2893's iPhone
>
> On 2010. 12. 10., at 5:30 PM, Harsh J wrote:
>
>> Hi,
>>
>> You can use the MultipleOutputs class to achieve this, with tagged
>> names, and free indicators of whether the output came from a map or
>> a reduce as well.
>>
>> On Fri, Dec 10, 2010 at 12:57 PM, edward choi wrote:
>>> Hi,
>>>
>>> I'm trying to crawl numerous news sites.
>>> My plan is to make a file containing a list of all the news RSS feed URLs
>>> and the path to save each crawled news article under.
>>> So it would look like this:
>>>
>>> nytimes_nation, /user/hadoop/nytimes
>>> nytimes_sports, /user/hadoop/nytimes
>>> latimes_world, /user/hadoop/latimes
>>> latimes_nation, /user/hadoop/latimes
>>> ...
>>>
>>> Each mapper would get a single line, crawl the assigned URL, process the
>>> text, and save the result.
>>> So this job does not need any reduce step.
>>>
>>> But what I'd also like to do is create a dictionary at the same time.
>>> This could definitely take advantage of the reduce phase. Each mapper can
>>> generate output as "KEY: term, VALUE: term_frequency".
>>> Then the reducers can merge them all together and create a dictionary. (Of course
>>> I would be using many reducers, so the dictionary would be partitioned.)
>>>
>>> I know that I can do this by creating two separate jobs (one for crawling,
>>> the other for building the dictionary), but I'd like to do it in one pass.
>>>
>>> So my design is:
>>>
>>> Map phase ==> crawl news articles, process text, write the result to a file
>>>   II
>>>   II  pass (term, term_frequency) pairs to the reducers
>>>   II
>>>   V
>>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>>> dictionary
>>>
>>> Is this at all possible? Or is it inherently impossible due to the structure
>>> of Hadoop?
>>> If it's possible, could anyone tell me how to do it?
>>>
>>> Ed.
>>
>> --
>> Harsh J
>> www.harshj.com
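The one-pass design Ed describes — mappers writing article files as a side effect (roughly what MultipleOutputs provides in Hadoop) while emitting (term, frequency) pairs for the reducers — can be sketched as a toy simulation. This is plain Python, not the Hadoop API; the source names, sample text, and in-memory "side outputs" are illustrative stand-ins only:

```python
# Toy simulation of the proposed one-pass job: each "mapper" writes its
# crawled article text to a side output (a stand-in for a MultipleOutputs
# named output / HDFS file) AND emits (term, count) pairs; the "reducer"
# then merges the pairs into a dictionary. All inputs are made up.
from collections import Counter, defaultdict

def map_article(source, text, side_outputs):
    """Write the article text to its side output, then emit term counts."""
    side_outputs[source].append(text)          # side-effect write, map phase only
    for term, freq in Counter(text.split()).items():
        yield term, freq                       # normal map output -> reducers

def reduce_terms(pairs):
    """Merge (term, freq) pairs into a single term-frequency dictionary."""
    totals = defaultdict(int)
    for term, freq in pairs:
        totals[term] += freq
    return dict(totals)

side_outputs = defaultdict(list)               # simulates /user/hadoop/<site> files
pairs = []
pairs += map_article("nytimes", "stocks fall as stocks slide", side_outputs)
pairs += map_article("latimes", "stocks rally in late trading", side_outputs)
dictionary = reduce_terms(pairs)

print(side_outputs["nytimes"])  # article text, written during the map phase
print(dictionary["stocks"])     # 3: counts merged across both mappers
```

The point of the sketch is only that the two outputs don't conflict: the article file is a map-side side effect, while the dictionary flows through the normal shuffle-and-reduce path. In a real job the side write would go through MultipleOutputs (or a direct HDFS write) rather than an in-memory list.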