Subject: Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
From: Edward Choi <mp2893@gmail.com>
Date: Sat, 11 Dec 2010 03:15:28 +0900
To: common-user@hadoop.apache.org

God, I never knew they had a project like this.
I should definitely check it out. I may even be able to use it at my workplace. Thanks for the tip!!

From mp2893's iPhone

On 2010. 12. 10., at 10:36 PM, "Jones, Nick" wrote:

> It might be worth looking into Nutch; it can probably be configured to do the type of crawling you need.
>
> Nick Jones
>
> -----Original Message-----
> From: Edward Choi [mailto:mp2893@gmail.com]
> Sent: Friday, December 10, 2010 6:24 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
>
> Wow, thanks for the info. I'll definitely try that.
> One question, though...
> Are those "tagged names" and "free indicators" some kind of special class variables provided by the MultipleOutputs class?
>
> Ed
>
> From mp2893's iPhone
>
> On 2010. 12. 10., at 5:30 PM, Harsh J wrote:
>
>> Hi,
>>
>> You can use the MultipleOutputs class to achieve this, with tagged
>> names, and free indicators of whether the output came from a map or
>> a reduce as well.
>>
>> On Fri, Dec 10, 2010 at 12:57 PM, edward choi wrote:
>>> Hi,
>>>
>>> I'm trying to crawl numerous news sites.
>>> My plan is to make a file containing a list of all the news RSS feed URLs
>>> and the path to save each crawled news article under.
>>> So it would look like this:
>>>
>>> nytimes_nation, /user/hadoop/nytimes
>>> nytimes_sports, /user/hadoop/nytimes
>>> latimes_world, /user/hadoop/latimes
>>> latimes_nation, /user/hadoop/latimes
>>> ...
>>>
>>> Each mapper would get a single line, crawl the assigned URL, process the
>>> text, and save the result.
>>> So this job does not need any reduce step.
>>>
>>> But what I'd also like to do is create a dictionary at the same time.
>>> This could definitely take advantage of the reduce phase. Each mapper can
>>> generate output as "KEY: term, VALUE: term_frequency".
>>> Then the reducers can merge them all together and create a dictionary. (Of course
>>> I would be using many reducers, so the dictionary would be partitioned.)
>>>
>>> I know that I can do this by creating two separate jobs (one for crawling,
>>> the other for building the dictionary), but I'd like to do it in one pass.
>>>
>>> So my design is:
>>>
>>> Map phase ==> crawl news articles, process text, write the result to a file
>>>   II
>>>   II  pass (term, term_frequency) pairs to the reducers
>>>   II
>>>   V
>>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>>> dictionary
>>>
>>> Is this at all possible? Or is it inherently impossible due to the structure
>>> of Hadoop?
>>> If it's possible, could anyone tell me how to do it?
>>>
>>> Ed.
>>
>> --
>> Harsh J
>> www.harshj.com
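The one-pass design Ed describes — mappers writing article files as a side effect (roughly what MultipleOutputs provides in Hadoop) while emitting (term, frequency) pairs for the reducers — can be sketched as a toy simulation. This is plain Python, not the Hadoop API; the source names, sample text, and in-memory "side outputs" are illustrative stand-ins only:

```python
# Toy simulation of the proposed one-pass job: each "mapper" writes its
# crawled article text to a side output (a stand-in for a MultipleOutputs
# named output / HDFS file) AND emits (term, count) pairs; the "reducer"
# then merges the pairs into a dictionary. All inputs are made up.
from collections import Counter, defaultdict

def map_article(source, text, side_outputs):
    """Write the article text to its side output, then emit term counts."""
    side_outputs[source].append(text)          # side-effect write, map phase only
    for term, freq in Counter(text.split()).items():
        yield term, freq                       # normal map output -> reducers

def reduce_terms(pairs):
    """Merge (term, freq) pairs into a single term-frequency dictionary."""
    totals = defaultdict(int)
    for term, freq in pairs:
        totals[term] += freq
    return dict(totals)

side_outputs = defaultdict(list)               # simulates /user/hadoop/<site> files
pairs = []
pairs += map_article("nytimes", "stocks fall as stocks slide", side_outputs)
pairs += map_article("latimes", "stocks rally in late trading", side_outputs)
dictionary = reduce_terms(pairs)

print(side_outputs["nytimes"])  # article text, written during the map phase
print(dictionary["stocks"])     # 3: counts merged across both mappers
```

The point of the sketch is only that the two outputs don't conflict: the article file is a map-side side effect, while the dictionary flows through the normal shuffle-and-reduce path. In a real job the side write would go through MultipleOutputs (or a direct HDFS write) rather than an in-memory list.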