From: Alan Gates
To: pig-user@hadoop.apache.org
Subject: Re: Initial Benchmark Results
Date: Tue, 19 Jan 2010 15:20:43 -0800
In-Reply-To: <357a70951001191517v535d2cc9s2b728062ed49489e@mail.gmail.com>
Mailing-List: contact pig-user-help@hadoop.apache.org; run by ezmlm

On Jan 19, 2010, at 3:17 PM, Dmitriy Ryaboy wrote:

> Zaki,
> It's PIG-960. But Alan's right, of course, this doesn't help with
> TextLoader (though it could, with some elbow grease).

Elbow grease provided. See PIG-1197.

Alan

> -D
>
> On Tue, Jan 19, 2010 at 8:19 AM, zaki rahaman wrote:
>> @Rob,
>>
>> Great work so far, and really useful for all those out there looking
>> to test-drive and compare different MapReduce abstractions. Looking
>> forward to seeing what other kinds of benchmarks you're thinking of
>> running (maybe a list would be nice?).
>>
>> @Dmitriy,
>>
>> Can you point me to the JIRA for the PigStorage improvements? Maybe
>> this would be better answered on pig-dev, but what's the ETA on 0.6?
>> Are there still commits/changes being made to trunk? We'd like to
>> start moving over to a 0.5/0.6 and Hadoop 0.20 setup.
>>
>> On Tue, Jan 19, 2010 at 5:33 AM, Rob Stewart wrote:
>>
>>> Dmitriy, apologies, the full Hive script is:
>>> ---------------
>>> CREATE EXTERNAL TABLE Text(words STRING) STORED AS TEXTFILE
>>> LOCATION '/user/rs46/Inputs/WordCount/wordsx1_skewed.dat';
>>> FROM Text
>>> INSERT OVERWRITE DIRECTORY
>>> '/user/rs46/Outputs/WordCount/outputHiveWordCount'
>>> SELECT words, COUNT(words) AS totals
>>> GROUP BY words;
>>> ---------------
>>>
>>> And you have pointed out a valid issue with regard to JAQL. Perhaps
>>> it is fairer to include a tokenizer to produce the Text output.
>>> So I have amended all the JAQL scripts to output plain text instead
>>> of sequence files.
>>>
>>> I will find Pig 0.6 and rerun the Pig scripts to incorporate the IO
>>> speed improvements, and the JAQL scripts to write to text files.
>>>
>>> The complete script takes about 24 hours to run (I run each test
>>> multiple times for a fair average). I imagine I'll be able to post
>>> the updated stats by the weekend. Thanks for the guidance, Dmitriy;
>>> I'm keen to get some useful/interesting data.
>>>
>>> Regards,
>>>
>>> Rob Stewart
>>>
>>> 2010/1/19 Dmitriy Ryaboy
>>>
>>>> Thanks Rob.
>>>> Can you point me to where the tokenization is happening in the
>>>> Hive and Jaql scripts? I.e., how is "Text" constructed?
>>>>
>>>> -D
>>>>
>>>> On Mon, Jan 18, 2010 at 5:26 PM, Rob Stewart wrote:
>>>>> Hi Dmitriy,
>>>>>
>>>>> Thanks for the info on Pig 0.6. I will definitely give that a go,
>>>>> and it would perhaps be interesting to annotate Pig 0.5 against
>>>>> 0.6 on the original charts. I did have my suspicions that disk IO
>>>>> had a part to play in the slow performance.
>>>>>
>>>>> With regard to the output results, I can confirm that the numbers
>>>>> are correct, i.e. each application is correctly counting the
>>>>> frequency of each word. I am less concerned with the output
>>>>> format (e.g. outputting to a text file might give an advantage to
>>>>> Pig over Hive, or outputting to a sequence file may give an
>>>>> advantage to JAQL, etc.). It seemed to make sense to allow Hive
>>>>> to output to a Hive "table", Pig to a plain text file, and JAQL
>>>>> to a JAQL sequence file. Once again, though, the wordcount
>>>>> numbers are certainly correct.
>>>>>
>>>>> I will test Pig 0.6 over the next few days, and would be more
>>>>> than happy to provide you with the updated execution time
>>>>> statistics.
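A quick way to make the "numbers are correct" check mechanical is to diff the outputs of two frameworks directly. Here is a minimal Python sketch; the tab-separated word/count output format is an assumption for illustration, not the verification Rob actually ran:

```python
from collections import Counter

def load_counts(lines):
    """Parse 'word<TAB>count' output lines into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.split("\t")
        counts[word] += int(count)
    return counts

def outputs_agree(lines_a, lines_b):
    """True when two wordcount outputs hold identical word
    frequencies, regardless of row order."""
    return load_counts(lines_a) == load_counts(lines_b)

# Same counts in a different row order still agree.
pig_out = ["apple\t3", "banana\t2"]
hive_out = ["banana\t2", "apple\t3"]
print(outputs_agree(pig_out, hive_out))      # True
print(outputs_agree(pig_out, ["apple\t4"]))  # False
```

Streaming each framework's part files out of HDFS (e.g. with `hadoop fs -cat`) and feeding the lines into a comparison like this would catch any divergence an eyeball check misses.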
>>>>>
>>>>> Rob Stewart
>>>>>
>>>>> 2010/1/19 Dmitriy Ryaboy
>>>>>
>>>>>> Rob,
>>>>>> Pig 0.6 (on the verge of being released) replaces the way Pig
>>>>>> reads data off the disk for PigStorage, which gives it a
>>>>>> significant speed-up. Prior to that, I was seeing Pig lag behind
>>>>>> Hive when simply reading data and doing nothing else -- now it's
>>>>>> equal. So definitely try that.
>>>>>>
>>>>>> Did you check that you get the same output from all three
>>>>>> frameworks? I wonder if they tokenize in the same manner. It
>>>>>> might be better to write a UDF you know to be equivalent for
>>>>>> each of the frameworks, and use that instead of something
>>>>>> built-in.
>>>>>>
>>>>>> -D
>>>>>>
>>>>>> On Mon, Jan 18, 2010 at 5:01 PM, Rob Stewart wrote:
>>>>>>> And for your information, the JAQL script looks like:
>>>>>>>
>>>>>>> $input = read(lines("Inputs/WordCount/wordsx1_skewed.dat"));
>>>>>>> $input -> group by $word = $
>>>>>>>   into { $word, num: count($) }
>>>>>>> -> write(hdfs('Outputs/WordCount/wordCountOutputx1_skewed.jaql'));
>>>>>>>
>>>>>>> (Note, this is from a JAQL svn snapshot.)
>>>>>>>
>>>>>>> Rob
>>>>>>>
>>>>>>> 2010/1/19 Rob Stewart
>>>>>>>
>>>>>>>> Sure, the Hive script is as follows:
>>>>>>>>
>>>>>>>> FROM Text
>>>>>>>> INSERT OVERWRITE DIRECTORY
>>>>>>>> '/user/rs46/Outputs/WordCount/outputHiveWordCount'
>>>>>>>> SELECT words, COUNT(words) AS totals
>>>>>>>> GROUP BY words;
>>>>>>>>
>>>>>>>> Notice I have been careful not to include the *loading times*
>>>>>>>> of the data, as this is handled by Hadoop FileInput and is the
>>>>>>>> same for each. For each of the languages, I have only timed:
>>>>>>>> 1. Reading the input file
>>>>>>>> 2. Doing the word count
>>>>>>>> 3. Writing results out to the HDFS (for Hive, to a Hive
>>>>>>>>    "database table", and for JAQL, to a JAQL sequence file),
>>>>>>>>    though this only takes a matter of seconds.
>>>>>>>>
>>>>>>>> I've been careful in each of the word counts *not* to order by
>>>>>>>> the frequency of each word, as this would cloud the issue with
>>>>>>>> an "ORDER BY" query. So all the results give me a list of
>>>>>>>> words and their frequency, in alphabetical order of the words
>>>>>>>> (done automatically by the MapReduce model).
>>>>>>>>
>>>>>>>> I am using Pig 0.5.0 with Hadoop 0.20.0.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Rob Stewart
>>>>>>>>
>>>>>>>> 2010/1/19 Dmitriy Ryaboy
>>>>>>>>
>>>>>>>>> Oh, and which version of Pig are you using?
>>>>>>>>>
>>>>>>>>> On Mon, Jan 18, 2010 at 4:47 PM, Dmitriy Ryaboy
>>>>>>>>> <dvryaboy@gmail.com> wrote:
>>>>>>>>>> Rob,
>>>>>>>>>> Can you show the Hive script you used, as well?
>>>>>>>>>>
>>>>>>>>>> -D
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 18, 2010 at 4:34 PM, Rob Stewart wrote:
>>>>>>>>>>> Hi folks,
>>>>>>>>>>>
>>>>>>>>>>> I have some initial results to run through with you. I have
>>>>>>>>>>> a number of implementations ready to push onto the Hadoop
>>>>>>>>>>> cluster, but I have finalized the tests for Hive, JAQL and
>>>>>>>>>>> Pig for the simple WordCount application.
>>>>>>>>>>>
>>>>>>>>>>> The experiments were performed over a 32-node cluster (31
>>>>>>>>>>> datanodes). The input data was generated from the Pig
>>>>>>>>>>> DataGenerator (thanks goes to Alan Gates).
>>>>>>>>>>>
>>>>>>>>>>> The reason for asking you all at this early stage is that
>>>>>>>>>>> Pig seems to fall some way behind when it comes to overall
>>>>>>>>>>> execution time.
>>>>>>>>>>> The experiment is a scale-up scenario, where I have fixed
>>>>>>>>>>> the nodes at 31 and increased the data to be processed. I
>>>>>>>>>>> have not yet done a scale-out experiment, but that is my
>>>>>>>>>>> next task (i.e. 10 million data rows on 10 DataNodes should
>>>>>>>>>>> execute only slightly quicker than 20 million rows on 20
>>>>>>>>>>> DataNodes).
>>>>>>>>>>>
>>>>>>>>>>> Anyway, let me point you to the test results:
>>>>>>>>>>> http://www.macs.hw.ac.uk/~rs46/WordCount_Scale_Up_Execution.pdf
>>>>>>>>>>>
>>>>>>>>>>> On that PDF I include the Pig script I've used, and why
>>>>>>>>>>> I've stated the parallelism used. I have not yet executed
>>>>>>>>>>> the Java MapReduce wordcount, but I imagine it will
>>>>>>>>>>> undercut the Hive performance.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have any comments on the results, a reason why
>>>>>>>>>>> Pig performs poorly, or any obvious error I have made? That
>>>>>>>>>>> said, this is the first of a number of experiments, and Pig
>>>>>>>>>>> may shine (for instance, I am planning a skewed join
>>>>>>>>>>> benchmark, which will be interesting as all the languages
>>>>>>>>>>> have an implementation for skewed data).
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Rob Stewart
>>
>> --
>> Zaki Rahaman
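Dmitriy's point about equivalent tokenization is easy to demonstrate outside any of the three frameworks. The sketch below is illustrative Python, not any framework's actual tokenizer: two plausible tokenizers produce different counts from the same line, which is exactly the kind of divergence that would skew a cross-framework wordcount comparison.

```python
import re
from collections import Counter

line = "Hello, world -- hello world!"

# Tokenizer 1: split on whitespace. Punctuation stays glued to words
# and case is preserved, so "Hello," and "hello" are distinct tokens.
ws_counts = Counter(line.split())

# Tokenizer 2: lowercase, then keep only alphanumeric runs.
re_counts = Counter(re.findall(r"\w+", line.lower()))

print(ws_counts["hello"])      # 1
print(re_counts["hello"])      # 2
print(ws_counts == re_counts)  # False
```

Writing one UDF per framework around the same shared rule, as suggested above, removes this variable from the timings, so any remaining difference is genuinely execution speed.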