hadoop-common-user mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: Hadoop project - help needed
Date Tue, 31 May 2011 15:54:03 GMT
Parismav,

So you are more or less trying to scrape some data in a distributed way.  There are several
things that you could do.  Just be careful: I am not sure what the terms of service for the Flickr
APIs are, so make sure that you are not violating them by downloading too much data.  You probably
want to use the map input data as command/control for what the mappers do.  I would probably
put it in a format like

ACCOUNT INFO\tGROUP INFO\n

Then you could use the N-line input format (NLineInputFormat) so that each mapper will process
one line of the file.  Something like this (this is just pseudocode):

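Mapper<LongWritable, Text, ?, ?> {
  // Called once per input line; with one line per split, once per mapper.
  map(LongWritable offset, Text line, ...) {
    String[] parts = line.toString().split("\t");
    openConnection(parts[0]);                    // ACCOUNT INFO
    GroupData gd = getDataAboutGroup(parts[1]);  // GROUP INFO
    ...
  }
}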
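To make the control-file idea a bit more concrete, here is a minimal sketch in plain Java (no Hadoop or Flickr dependencies) of building and parsing lines in that format. The class name ControlLine and the sample values such as "apiKeyA" and "group-1" are made up for illustration; in a real job the parsing would sit inside map() and the fields would hold whatever account and group details you actually need.

```java
import java.util.Arrays;
import java.util.List;

/** One line of the control file: account info, a tab, then group info. */
final class ControlLine {
    final String accountInfo;
    final String groupInfo;

    ControlLine(String accountInfo, String groupInfo) {
        this.accountInfo = accountInfo;
        this.groupInfo = groupInfo;
    }

    /** Parses "ACCOUNT INFO\tGROUP INFO" (without the trailing newline). */
    static ControlLine parse(String line) {
        // Limit -1 keeps trailing empty fields so they can be rejected below.
        String[] parts = line.split("\t", -1);
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("Expected ACCOUNT<TAB>GROUP, got: " + line);
        }
        return new ControlLine(parts[0], parts[1]);
    }

    /** Builds the control file contents, one newline-terminated line per entry. */
    static String buildControlFile(List<ControlLine> entries) {
        StringBuilder sb = new StringBuilder();
        for (ControlLine e : entries) {
            sb.append(e.accountInfo).append('\t').append(e.groupInfo).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical values: three accounts, each assigned one group.
        String file = buildControlFile(Arrays.asList(
                new ControlLine("apiKeyA", "group-1"),
                new ControlLine("apiKeyB", "group-2"),
                new ControlLine("apiKeyC", "group-3")));
        for (String line : file.split("\n")) {
            ControlLine c = parse(line);
            System.out.println("account=" + c.accountInfo + " group=" + c.groupInfo);
        }
    }
}
```

With three lines in the file and one line per split, you would get three map tasks, one per account/group pair.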

I would probably not bother with a reducer if all you are doing is pulling down data.  Also,
the output format you choose really depends on the type of data you are downloading and how
you want to use that data later.  For example, if you want to download the actual pictures, then
you probably want to use a sequence file format or some other binary format, because converting
a picture to text can be very costly.
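As a rough illustration of that cost, here is a small sketch that assumes Base64 as the binary-to-text encoding (my assumption; other encodings have different overheads). Base64 turns every 3 raw bytes into 4 characters, so the text form is about a third larger before you even count the CPU time to encode and decode. The class name TextEncodingCost and the 3 MB picture size are just placeholders.

```java
import java.util.Base64;

/** Estimates how much larger binary data gets when stored as Base64 text. */
final class TextEncodingCost {

    /** Length of the padded Base64 encoding of rawBytes bytes. */
    static int base64Length(int rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    /** Encodes raw bytes with the standard (padded) Base64 encoder. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // Placeholder for a downloaded picture: 3,000,000 zero bytes.
        byte[] picture = new byte[3_000_000];
        String asText = encode(picture);
        System.out.println("raw bytes:    " + picture.length);
        System.out.println("Base64 chars: " + asText.length());
        System.out.println("predicted:    " + base64Length(picture.length));
    }
}
```

A binary container such as a sequence file stores the raw bytes as they are, so it avoids this expansion.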

--Bobby Evans

On 5/31/11 10:35 AM, "parismav" <paok_gate_4_@hotmail.com> wrote:



Hello dear forum,
I am working on a project with Apache Hadoop. I am totally new to this
software and I need some help understanding the basic features!

To sum up, for my project I have configured Hadoop so that it runs 3
datanodes on one machine.
The project's main goal is to use both the Flickr API (flickr.com) libraries
and the Hadoop libraries in Java, so that each one of the 3 datanodes chooses a
Flickr group and returns photos' info from that group.

In order to do that, I have 3 Flickr accounts, each one with a different API
key.

I don't need any help on the Flickr side of the code, of course. But what I
don't understand is how to use the Mapper and Reducer part of the code.
What input do I have to give the map() function?
Do I have to put this whole "info downloading" process in the map()
function?

In a few words, how do I convert my code so that it runs in a distributed way on
Hadoop?
Thank you!
--
View this message in context: http://old.nabble.com/Hadoop-project---help-needed-tp31741968p31741968.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


