hadoop-common-user mailing list archives

From Aleksandar Stupar <stupar.aleksan...@yahoo.com>
Subject Re: Is Hadoop applicable to this problem.
Date Mon, 31 May 2010 06:54:23 GMT
Hi guys,

this looks to me like a set self-join problem. As I see it, the easiest way
to implement it using MR would be to use data1.txt as the input
(if you already have it, or generate it with an MR job first):

data1.txt:
list1 111,222,333
list2 111,222,333,444
list3 111,888


In the map function, output all pairs of distinct files per folder:
map(String folder, String[] files) {
    for (String file1 : files) {
        for (String file2 : files) {
            // compare file names by value, not by reference
            if (!file1.equals(file2))
                output(file1, file2);
        }
    }
}


Now in the reduce function you have everything you need:
reduce(String file, String[] otherFiles) {
    // count how many folders each other file shares with this file
    HashMap<String, Long> mergedResults = new HashMap<String, Long>();
    for (String otherFile : otherFiles) {
        Long count = mergedResults.get(otherFile);
        mergedResults.put(otherFile, count == null ? 1L : count + 1);
    }
    // emit the top K entries by count value
}
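
To make the "emit top K" step concrete, here is a minimal self-contained
Java sketch, assuming you keep plain counts in a map as above (the class
and method names are only illustrative, not from this thread):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopKByCount {

    // Returns the k map keys with the largest counts, highest count first.
    static List<String> topK(Map<String, Long> counts, int k) {
        List<Map.Entry<String, Long>> entries =
                new ArrayList<Map.Entry<String, Long>>(counts.entrySet());
        // Sort entries by count, descending.
        Collections.sort(entries, (a, b) -> b.getValue().compareTo(a.getValue()));
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Counts for file "111" from the example data: how many folders
        // each other file shares with it.
        Map<String, Long> counts = new HashMap<String, Long>();
        counts.put("222", 2L);
        counts.put("333", 2L);
        counts.put("444", 1L);
        counts.put("888", 1L);
        // Prints the other files, most shared folders first.
        System.out.println(topK(counts, 20));
    }
}

In a real Hadoop reducer you would of course emit the joined top-K list
with the output collector instead of printing it.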


Hope this helps,
Aleksandar. 




________________________________
From: Kevin Tse <kevintse.onjee@gmail.com>
To: common-user@hadoop.apache.org
Sent: Sat, May 29, 2010 11:47:10 AM
Subject: Re: Is Hadoop applicable to this problem.

Hi, Eric
Thank you for your reply.
With your tip, I was able to write Map and Reduce functions to generate the
intermediate data and save it to two files as data1.txt and data2.txt,
which I could also achieve easily using AWK as I mentioned in the previous
mail. Generating the intermediate data in those formats is not really a
tough task for the moment; the MR jobs to generate it are as simple as the
word counting one. But when I think of writing further MR functions to
generate the final output (I will show the format of the final output
below), I don't know how to start coding. So the problem now is how to
write the Map function and Reduce function that generate the final result,
with either the original data or the intermediate data as the input (I am
not sure whether the intermediate data in those formats helps here).

The MapReduce paradigm seems quite simple: the Map function just collects
the input as key-value pairs, while the Reduce function merges the values
with the same key, and it seems to me that it can do nothing more than
that. But my requirement is not that simple; it involves grouping and
sorting. I hope I am misunderstanding something and that you can point it
out.

Now I don't want to get too many tools involved to accomplish the task;
as you suggested, I may code it myself using Hadoop common. If I am not
reading your words wrong, Hadoop alone (without Pig and Hive) will meet
my requirement.

The format of the final result I wish to get from the original data is like
the following:
111 222,333,444,888 .... (if there are more than 20 items here, I just want
the top 20)
222 111,333,444,888
333 111,222,444
444 111,222,333
888 111

Your help will be greatly appreciated.
- Kevin Tse

On Sat, May 29, 2010 at 2:07 AM, Eric Sammer <esammer@cloudera.com> wrote:

> Kevin:
>
> This is certainly something Hadoop can do well. The easiest way to do
> this is in multiple map reduce jobs.
>
> Job 1: Group all files by folder.
>
> You can simply use Hadoop's grouping by key. In pseudo code, you would do:
>
> // key is byte offset in the text file, value is the line of text
> def map(key, value, collector):
>  list parts = value.split("\s")
>
>  collector.collect(parts[0], parts[1])
>
> def reduce(key, values):
>  // key is list1, values is all files for list1
>  buffer = ""
>
>  int counter = 0
>
>  for (value in values):
>    if (counter > 0):
>      buffer = buffer + ","
>
>    buffer = buffer + value
>    counter = counter + 1
>
>  collector.collect(key, buffer)
>
> This gets you your example data1.txt.
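
For reference, a runnable sketch of this Job 1 in plain Java MapReduce
might look roughly like the following. It assumes the newer
org.apache.hadoop.mapreduce API, and the class names (GroupFilesByFolder,
FolderMapper, JoinReducer) are illustrative, not from this thread:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupFilesByFolder {

    // Emits (folder, file) for every "folder file" line of the raw input.
    public static class FolderMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\\s+");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }

    // Joins all files of one folder into a comma-separated list,
    // producing the data1.txt layout shown earlier.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder buffer = new StringBuilder();
            for (Text value : values) {
                if (buffer.length() > 0) {
                    buffer.append(",");
                }
                buffer.append(value.toString());
            }
            context.write(key, new Text(buffer.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "group files by folder");
        job.setJarByClass(GroupFilesByFolder.class);
        job.setMapperClass(FolderMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
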
>
> For data2.txt, you would do another MR job over the original input
> file and simply make parts[1] the key and parts[0] the value in the
> map() method. You can then do what you need with the files from there.
> Selecting the top N is really just another MR job. Tools like Pig and
> Hive can do all of these operations for you, giving you higher-level
> languages that save you some coding, but it's probably a nice project to
> learn Hadoop by writing the code yourself. You should be able to start
> with the word count code and modify it to do this. Take a look at the
> Cloudera training videos to learn more.
> http://www.cloudera.com/resources/?type=Training
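
The data2.txt job described above only changes which field becomes the
key. A minimal mapper sketch along the same lines (again with illustrative
names, and reusing the JoinReducer from the previous sketch) might be:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (file, folder) so the reducer can join all folders per file,
// producing the data2.txt layout.
public class FileMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\\s+");
        if (parts.length == 2) {
            // parts[1] (the file) becomes the key, parts[0] (the folder) the value.
            context.write(new Text(parts[1]), new Text(parts[0]));
        }
    }
}
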
>
> Hope this helps.
>
> On Fri, May 28, 2010 at 6:34 AM, Kevin Tse <kevintse.onjee@gmail.com>
> wrote:
> > Hi, all.
> > I have encountered a problem that cannot be solved with simple
> > computation. I don't know whether hadoop is applicable to it; I am
> > completely new to hadoop and MapReduce.
> > I have the raw data stored in a txt file weighing 700MB in size (100
> > million lines). The file is in the following format:
> > list1 111
> > list1 222
> > list1 333
> > list2 111
> > list2 222
> > list2 333
> > list2 444
> > list3 111
> > list3 888
> >
> > The first field of each line is something like a folder, the second one
> > is like a file. A file (the same file) can be saved under an arbitrary
> > number of different folders.
> > From this raw data, for the file "111", I want to collect the files that
> > are saved under the folders that contain the file "111" (the file "111"
> > itself excluded), and extract the top 20 of these files, sorted by their
> > appearance frequency in descending order.
> >
> > I was trying to solve this problem using AWK, but the script consumed too
> > much memory, and it was not fast enough.
> > Later I heard about hadoop, and from some tutorials on the web I learned
> > a little about how it works. I want to use its "Distributed Computing"
> > ability.
> > I have already read the word counting tutorial, but I still don't have an
> > idea about how to write my own Map function and Reduce function for the
> > problem.
> >
> > By the way, I can generate the intermediate data and save it to 2 files
> > at an acceptable speed using AWK, in the following formats:
> > data1.txt:
> > list1 111,222,333
> > list2 111,222,333,444
> > list3 111,888
> >
> > data2.txt:
> > 111 list1,list2,list3
> > 222 list1,list2
> > 333 list1,list3
> > 444 list2
> > 888 list3
> >
> > My question is:
> > Is hadoop applicable to this problem? If so, would you please give me a
> > clue on how to implement the Map function and the Reduce function.
> > Thank you in advance.
> >
> > - Kevin Tse
> >
>
>
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
>



      