hadoop-common-user mailing list archives

From Kevin Tse <kevintse.on...@gmail.com>
Subject Re: Is Hadoop applicable to this problem.
Date Mon, 31 May 2010 09:13:03 GMT
Aleksandar,
Thank you so much. Now I think I have all I need to give Hadoop a run in a
cluster environment.

- Kevin Tse

On Mon, May 31, 2010 at 2:54 PM, Aleksandar Stupar <
stupar.aleksandar@yahoo.com> wrote:

> Hi guys,
>
> this looks to me like a set self-join problem. As I see it, the easiest way
> to implement it with MR would be to use data1.txt as the input
> (if you already have it, or generate it with an MR job first):
>
> data1.txt:
> list1 111,222,333
> list2 111,222,333,444
> list3 111,888
>
>
> In the map function, output all pairs of distinct files per folder:
> map(String folder, String[] files){
>    for(String file1 : files){
>        for(String file2 : files){
>            if(!file1.equals(file2))
>                output(file1, file2);
>        }
>    }
> }
>
>
> Now in the reduce function you have everything you need:
> reduce(String file, String[] otherFiles){
>    HashMap<String, Long> mergedResults = new HashMap<String, Long>();
>    for(String otherFile : otherFiles){
>        Long count = mergedResults.get(otherFile);
>        mergedResults.put(otherFile, count == null ? 1L : count + 1L);
>    }
>    // emit the top K entries of mergedResults by count, in descending order
> }
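>
> If it helps, a more concrete (untested) sketch of the same idea against the
> Hadoop 0.20 Java API might look like the code below. It assumes the input is
> your data1.txt with lines of the form "folder file1,file2,...", and the class
> names and the constant K are just placeholders:
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
>
> public class FilePairJob {
>
>     // Emits every ordered pair (file1, file2) of distinct files sharing a folder.
>     public static class PairMapper extends Mapper<LongWritable, Text, Text, Text> {
>         public void map(LongWritable offset, Text line, Context ctx)
>                 throws IOException, InterruptedException {
>             String[] parts = line.toString().split("\\s+");
>             if (parts.length < 2) return;
>             String[] files = parts[1].split(",");
>             for (String f1 : files)
>                 for (String f2 : files)
>                     if (!f1.equals(f2))
>                         ctx.write(new Text(f1), new Text(f2));
>         }
>     }
>
>     // Counts how often each other file co-occurs with the key and keeps the top K.
>     public static class TopKReducer extends Reducer<Text, Text, Text, Text> {
>         private static final int K = 20; // "top 20" from the original question
>
>         public void reduce(Text file, Iterable<Text> others, Context ctx)
>                 throws IOException, InterruptedException {
>             Map<String, Long> counts = new HashMap<String, Long>();
>             for (Text other : others) {
>                 String name = other.toString();
>                 Long c = counts.get(name);
>                 counts.put(name, c == null ? 1L : c + 1L);
>             }
>             List<Map.Entry<String, Long>> sorted =
>                     new ArrayList<Map.Entry<String, Long>>(counts.entrySet());
>             Collections.sort(sorted, new Comparator<Map.Entry<String, Long>>() {
>                 public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
>                     return b.getValue().compareTo(a.getValue());
>                 }
>             });
>             StringBuilder out = new StringBuilder();
>             for (int i = 0; i < Math.min(K, sorted.size()); i++) {
>                 if (i > 0) out.append(",");
>                 out.append(sorted.get(i).getKey());
>             }
>             ctx.write(file, new Text(out.toString()));
>         }
>     }
> }
>
> One thing to keep in mind: the map emits a quadratic number of pairs per
> folder, so for folders with very many files you may want a combiner or a
> different key design.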
>
>
> Hope this helps,
> Aleksandar.
>
>
>
>
> ________________________________
> From: Kevin Tse <kevintse.onjee@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: Sat, May 29, 2010 11:47:10 AM
> Subject: Re: Is Hadoop applicable to this problem.
>
> Hi, Eric
> Thank you for your reply.
> With your tip, I was able to write the Map function and Reduce function to
> generate the intermediate data and save it to two files, data1.txt and
> data2.txt, which I could also have achieved easily using AWK as I mentioned
> in my previous mail. Generating the intermediate data in those formats is
> not really a tough task at the moment; the MR jobs that produce it are as
> simple as the word counting one. But when I think about writing another
> pair of Map and Reduce functions to generate the final output (I will show
> the format of the final output below), I don't know how to start coding.
> The problem now is how to write the Map function and Reduce function that
> produce the final result, taking either the original data or the
> intermediate data as input (I am not sure whether the intermediate data in
> those formats helps here).
>
> The MapReduce paradigm seems quite simple: the Map function just collects
> the input as key-value pairs, while the Reduce function merges the values
> that share the same key, and it seems to me that it can do nothing more
> than that. But my requirement is not that simple; it involves grouping and
> sorting. I hope I am misunderstanding something and that you will point it
> out.
>
> I don't want to get too many tools involved to accomplish the task, so, as
> you suggested, I may code it myself using plain Hadoop. If I am not
> misreading your words, Hadoop alone (without Pig and Hive) will meet my
> requirement.
>
> The format of the final result I wish to get from the original data is like
> the following (if there are more than 20 items on a line, I just want the
> top 20):
> 111 222,333,444,888 ...
> 222 111,333,444,888
> 333 111,222,444
> 444 111,222,333
> 888 111
>
> Your help will be greatly appreciated.
> - Kevin Tse
>
> On Sat, May 29, 2010 at 2:07 AM, Eric Sammer <esammer@cloudera.com> wrote:
>
> > Kevin:
> >
> > This is certainly something Hadoop can do well. The easiest way to do
> > this is in multiple map reduce jobs.
> >
> > Job 1: Group all files by folder.
> >
> > You can simply use Hadoop's grouping by key. In pseudo code, you would
> > do:
> >
> > // key is byte offset in the text file, value is the line of text
> > def map(key, value, collector):
> >  list parts = value.split("\s")
> >
> >  collector.collect(parts[0], parts[1])
> >
> > def reduce(key, values, collector):
> >  // key is list1, values is all files for list1
> >  buffer = ""
> >
> >  int counter = 0
> >
> >  for (value in values):
> >    if (counter > 0):
> >      buffer = buffer + ", "
> >
> >    buffer = buffer + value
> >    counter = counter + 1
> >
> >  collector.collect(key, buffer)
> >
> > This gets you your example data1.txt.
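> >
> > If you want something closer to compilable Java than the pseudo code, a
> > rough (untested) sketch of that first job against the 0.20 API could be
> > the following; the class names are just placeholders:
> >
> > import java.io.IOException;
> >
> > import org.apache.hadoop.io.LongWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapreduce.Mapper;
> > import org.apache.hadoop.mapreduce.Reducer;
> >
> > public class GroupByFolder {
> >
> >     // Input line "list1 111" -> emit (list1, 111).
> >     public static class FolderMapper extends Mapper<LongWritable, Text, Text, Text> {
> >         public void map(LongWritable offset, Text line, Context ctx)
> >                 throws IOException, InterruptedException {
> >             String[] parts = line.toString().split("\\s+");
> >             if (parts.length == 2)
> >                 ctx.write(new Text(parts[0]), new Text(parts[1]));
> >         }
> >     }
> >
> >     // All files for one folder arrive together; join them with commas.
> >     public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
> >         public void reduce(Text folder, Iterable<Text> files, Context ctx)
> >                 throws IOException, InterruptedException {
> >             StringBuilder buffer = new StringBuilder();
> >             for (Text file : files) {
> >                 if (buffer.length() > 0)
> >                     buffer.append(",");
> >                 buffer.append(file.toString());
> >             }
> >             ctx.write(folder, new Text(buffer.toString()));
> >         }
> >     }
> > }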
> >
> > For data2.txt, you would do another MR job over the original input
> > file and simply make part[1] the key and part[0] the value in the
> > map() method. You can then do what you need with the files from there.
> > Selecting the top N is really just another MR job. Tools like Pig and
> > Hive can do all of these operations for you, giving you higher level
> > languages and saving you some coding, but it's probably a nice project
> > to learn Hadoop by writing the code yourself. You should be able to start
> > with the word count code and modify it to do this. Take a look at the
> > Cloudera training videos to learn more.
> > http://www.cloudera.com/resources/?type=Training
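> >
> > Since this is a chain of jobs, the driver would just run them back to
> > back, feeding the output path of the grouping job to the next one. A
> > rough (untested) sketch, where the GroupByFolder classes are from the
> > snippet above and the job 2 mapper/reducer are the pair-emitting and
> > top-20 classes you still have to write:
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapreduce.Job;
> > import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> > import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> >
> > public class Driver {
> >     public static void main(String[] args) throws Exception {
> >         Configuration conf = new Configuration();
> >         Path raw = new Path(args[0]);      // original "folder file" lines
> >         Path grouped = new Path(args[1]);  // intermediate data1.txt-style output
> >         Path topN = new Path(args[2]);     // final "file top-20-others" output
> >
> >         // Job 1: group files by folder.
> >         Job job1 = new Job(conf, "group by folder");
> >         job1.setJarByClass(Driver.class);
> >         job1.setMapperClass(GroupByFolder.FolderMapper.class);
> >         job1.setReducerClass(GroupByFolder.ConcatReducer.class);
> >         job1.setOutputKeyClass(Text.class);
> >         job1.setOutputValueClass(Text.class);
> >         FileInputFormat.addInputPath(job1, raw);
> >         FileOutputFormat.setOutputPath(job1, grouped);
> >         if (!job1.waitForCompletion(true)) System.exit(1);
> >
> >         // Job 2: emit co-occurring file pairs and keep the top 20 per file.
> >         Job job2 = new Job(conf, "top co-occurring files");
> >         job2.setJarByClass(Driver.class);
> >         job2.setMapperClass(PairMapper.class);        // placeholder: your pair-emitting mapper
> >         job2.setReducerClass(TopTwentyReducer.class); // placeholder: your top-20 reducer
> >         job2.setOutputKeyClass(Text.class);
> >         job2.setOutputValueClass(Text.class);
> >         FileInputFormat.addInputPath(job2, grouped);
> >         FileOutputFormat.setOutputPath(job2, topN);
> >         System.exit(job2.waitForCompletion(true) ? 0 : 1);
> >     }
> > }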
> >
> > Hope this helps.
> >
> > On Fri, May 28, 2010 at 6:34 AM, Kevin Tse <kevintse.onjee@gmail.com>
> > wrote:
> > > Hi, all.
> > > I have encountered a problem that cannot be solved with simple
> > > computation, and I don't know whether Hadoop is applicable to it; I am
> > > completely new to Hadoop and MapReduce.
> > > I have the raw data stored in a txt file weighing 700MB (about 100
> > > million lines). The file is in the following format:
> > > list1 111
> > > list1 222
> > > list1 333
> > > list2 111
> > > list2 222
> > > list2 333
> > > list2 444
> > > list3 111
> > > list3 888
> > >
> > > The first field of each line is something like a folder, the second one
> > > is like a file. A file (the same file) can be saved under an arbitrary
> > > number of different folders.
> > > From this raw data, for the file "111", I want to collect the files that
> > > are saved under the folders containing the file "111" (excluding "111"
> > > itself), and then extract the top 20 of these files, sorted by their
> > > appearance frequency in descending order.
> > >
> > > I was trying to solve this problem using AWK, but the script consumed
> > > too much memory and was not fast enough.
> > > Later I heard about Hadoop, and from some tutorials on the web I learned
> > > a little about how it works; I want to use its distributed computing
> > > ability.
> > > I have already read the word counting tutorial, but I still don't have
> > > an idea of how to write my own Map function and Reduce function for this
> > > problem.
> > >
> > > By the way, I can generate the intermediate data and save it to 2 files
> > > at an acceptable speed using AWK, in the following format:
> > > data1.txt:
> > > list1 111,222,333
> > > list2 111,222,333,444
> > > list3 111,888
> > >
> > > data2.txt:
> > > 111 list1,list2,list3
> > > 222 list1,list2
> > > 333 list1,list3
> > > 444 list2
> > > 888 list3
> > >
> > > My question is:
> > > Is Hadoop applicable to this problem? If so, would you please give me a
> > > clue about how to implement the Map function and the Reduce function.
> > > Thank you in advance.
> > >
> > > - Kevin Tse
> > >
> >
> >
> >
> > --
> > Eric Sammer
> > phone: +1-917-287-2675
> > twitter: esammer
> > data: www.cloudera.com
> >
>
>
>
>
>
