Subject: Re: Is Hadoop applicable to this problem.
From: Eric Sammer
To: common-user@hadoop.apache.org
Date: Fri, 28 May 2010 14:07:59 -0400

Kevin:

This is certainly something Hadoop can do well. The easiest way to do
this is with multiple MapReduce jobs.

Job 1: Group all files by folder. You can simply use Hadoop's grouping
by key. In pseudo code, you would do:

  // key is the byte offset in the text file, value is the line of text
  def map(key, value, collector):
    list parts = value.split("\s")
    collector.collect(parts[0], parts[1])

  // key is a folder such as list1, values is all files under that folder
  def reduce(key, values, collector):
    buffer = ""
    int counter = 0
    for (value in values):
      if (counter > 0):
        buffer = buffer + ","
      buffer = buffer + value
      counter = counter + 1
    collector.collect(key, buffer)

This gets you your example data1.txt. For data2.txt, you would run
another MR job over the original input file and simply make parts[1]
the key and parts[0] the value in the map() method. You can then do
what you need with the files from there. Selecting the top N is really
just another MR job.

Tools like Pig and Hive can do all of these operations for you; they
give you higher level languages that save you some coding, but it's
probably a nice project to learn Hadoop by writing the code yourself.
You should be able to start with the word count code and modify it to
do this (a minimal Java sketch of Job 1 is appended at the end of this
message). Take a look at the Cloudera training videos to learn more:

http://www.cloudera.com/resources/?type=Training

Hope this helps.

On Fri, May 28, 2010 at 6:34 AM, Kevin Tse wrote:
> Hi, all.
> I have encountered a problem that cannot be solved with simple
> computation, and I don't know whether Hadoop is applicable to it; I am
> completely new to Hadoop and MapReduce.
> I have the raw data stored in a txt file weighing 700MB in size
> (100 million lines).
> The file is in the following format:
>
> list1 111
> list1 222
> list1 333
> list2 111
> list2 222
> list2 333
> list2 444
> list3 111
> list3 888
>
> The first field of each line is something like a folder, and the second
> is like a file. A file (the same file) can be saved under any number of
> different folders.
> From this raw data, for the file "111", I want to collect the files
> that are saved under the folders that contain the file "111" (the file
> "111" itself excluded), and extract the top 20 of those files, sorted
> by their appearance frequency in descending order.
>
> I was trying to solve this problem using AWK, but the script consumed
> too much memory, and it was not fast enough.
> Later I heard about Hadoop, and from some tutorials on the web I
> learned a little about how it works. I want to use its "distributed
> computing" ability.
> I have already read the word counting tutorial, but I still don't have
> an idea of how to write my own map function and reduce function for
> this problem.
>
> By the way, I can generate the intermediate data and save it to 2 files
> at an acceptable speed using AWK, in the following format:
>
> data1.txt:
> list1 111,222,333
> list2 111,222,333,444
> list3 111,888
>
> data2.txt:
> 111 list1,list2,list3
> 222 list1,list2
> 333 list1,list3
> 444 list2
> 888 list3
>
> My question is: is Hadoop applicable to this problem? If so, would you
> please give me a clue on how to implement the map function and the
> reduce function.
> Thank you in advance.
>
> - Kevin Tse

--
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
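
For concreteness, here is a minimal Java sketch of Job 1 along the lines
of the pseudocode above, assuming the Hadoop 0.20-era
org.apache.hadoop.mapreduce ("new") API; the class name GroupByFolder
and the command-line input/output paths are just placeholders.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class GroupByFolder {

    // Emits (folder, file) for every "folder file" line of the input.
    public static class FolderMapper
        extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] parts = value.toString().split("\\s+");
        if (parts.length == 2) {
          context.write(new Text(parts[0]), new Text(parts[1]));
        }
      }
    }

    // Concatenates all files seen under one folder into a
    // comma-separated list, e.g. "list1  111,222,333".
    public static class JoinReducer
        extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        StringBuilder buffer = new StringBuilder();
        for (Text value : values) {
          if (buffer.length() > 0) {
            buffer.append(",");
          }
          buffer.append(value.toString());
        }
        context.write(key, new Text(buffer.toString()));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "group by folder");
      job.setJarByClass(GroupByFolder.class);
      job.setMapperClass(FolderMapper.class);
      job.setReducerClass(JoinReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The job producing data2.txt would differ only in the mapper, writing
parts[1] as the key and parts[0] as the value; selecting the top 20
co-occurring files is then a further MR job over that output.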