Subject: Re: Is Hadoop applicable to this problem.
From: Eric Sammer
To: common-user@hadoop.apache.org
Date: Fri, 28 May 2010 14:07:59 -0400

Kevin:

This is certainly something Hadoop can do well. The easiest way to do
this is with multiple MapReduce jobs.

Job 1: Group all files by folder. You can simply use Hadoop's grouping
by key. In pseudo code, you would do:

  // key is the byte offset in the text file, value is the line of text
  def map(key, value, collector):
    list parts = value.split("\s")
    collector.collect(parts[0], parts[1])

  // key is a folder such as list1, values is all files under that folder
  def reduce(key, values, collector):
    buffer = ""
    int counter = 0
    for (value in values):
      if (counter > 0):
        buffer = buffer + ","
      buffer = buffer + value
      counter = counter + 1
    collector.collect(key, buffer)

This gets you your example data1.txt. For data2.txt, you would run
another MR job over the original input file and simply make parts[1]
the key and parts[0] the value in the map() method. You can then do
what you need with the files from there. Selecting the top N is really
just another MR job.

Tools like Pig and Hive can do all of these operations for you; they
give you higher level languages that save you some coding, but it's
probably a nice project to learn Hadoop by writing the code yourself.
You should be able to start with the word count code and modify it to
do this (a minimal Java sketch of Job 1 is appended at the end of this
message). Take a look at the Cloudera training videos to learn more:

http://www.cloudera.com/resources/?type=Training

Hope this helps.

On Fri, May 28, 2010 at 6:34 AM, Kevin Tse wrote:
> Hi, all.
> I have encountered a problem that cannot be solved with simple
> computation, and I don't know whether Hadoop is applicable to it; I am
> completely new to Hadoop and MapReduce.
> I have the raw data stored in a txt file weighing 700MB in size
> (100 million lines).
> The file is in the following format:
>
> list1 111
> list1 222
> list1 333
> list2 111
> list2 222
> list2 333
> list2 444
> list3 111
> list3 888
>
> The first field of each line is something like a folder, and the second
> is like a file. A file (the same file) can be saved under any number of
> different folders.
> From this raw data, for the file "111", I want to collect the files
> that are saved under the folders that contain the file "111" (the file
> "111" itself excluded), and extract the top 20 of those files, sorted
> by their appearance frequency in descending order.
>
> I was trying to solve this problem using AWK, but the script consumed
> too much memory, and it was not fast enough.
> Later I heard about Hadoop, and from some tutorials on the web I
> learned a little about how it works. I want to use its "distributed
> computing" ability.
> I have already read the word counting tutorial, but I still don't have
> an idea of how to write my own map function and reduce function for
> this problem.
>
> By the way, I can generate the intermediate data and save it to 2 files
> at an acceptable speed using AWK, in the following format:
>
> data1.txt:
> list1 111,222,333
> list2 111,222,333,444
> list3 111,888
>
> data2.txt:
> 111 list1,list2,list3
> 222 list1,list2
> 333 list1,list3
> 444 list2
> 888 list3
>
> My question is: is Hadoop applicable to this problem? If so, would you
> please give me a clue on how to implement the map function and the
> reduce function.
> Thank you in advance.
>
> - Kevin Tse

--
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
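
For concreteness, here is a minimal Java sketch of Job 1 along the lines
of the pseudocode above, assuming the Hadoop 0.20-era
org.apache.hadoop.mapreduce ("new") API; the class name GroupByFolder
and the command-line input/output paths are just placeholders.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class GroupByFolder {

    // Emits (folder, file) for every "folder file" line of the input.
    public static class FolderMapper
        extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] parts = value.toString().split("\\s+");
        if (parts.length == 2) {
          context.write(new Text(parts[0]), new Text(parts[1]));
        }
      }
    }

    // Concatenates all files seen under one folder into a
    // comma-separated list, e.g. "list1  111,222,333".
    public static class JoinReducer
        extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        StringBuilder buffer = new StringBuilder();
        for (Text value : values) {
          if (buffer.length() > 0) {
            buffer.append(",");
          }
          buffer.append(value.toString());
        }
        context.write(key, new Text(buffer.toString()));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "group by folder");
      job.setJarByClass(GroupByFolder.class);
      job.setMapperClass(FolderMapper.class);
      job.setReducerClass(JoinReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The job producing data2.txt would differ only in the mapper, writing
parts[1] as the key and parts[0] as the value; selecting the top 20
co-occurring files is then a further MR job over that output.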