From: Alan Gates
To: pig-user@hadoop.apache.org
Subject: Re: Initial Benchmark Results
Date: Tue, 19 Jan 2010 15:20:43 -0800
In-Reply-To: <357a70951001191517v535d2cc9s2b728062ed49489e@mail.gmail.com>
Mailing-List: contact pig-user-help@hadoop.apache.org; run by ezmlm

On Jan 19, 2010, at 3:17 PM, Dmitriy Ryaboy wrote:

> Zaki,
> It's PIG-960. But Alan's right, of course, this doesn't help with
> TextLoader (though it could, with some elbow grease).

Elbow grease provided. See PIG-1197.

Alan

> -D
>
> On Tue, Jan 19, 2010 at 8:19 AM, zaki rahaman wrote:
>> @Rob,
>>
>> Great work so far, and really useful for all those out there looking
>> to test-drive and compare different MapReduce abstractions. Looking
>> forward to seeing what other kinds of benchmarks you're thinking of
>> running (maybe a list would be nice?).
>>
>> @Dmitriy,
>>
>> Can you point me to the JIRA for the PigStorage improvements? Maybe
>> this would be better answered on pig-dev, but what's the ETA on 0.6?
>> Are there still commits/changes being made to trunk? We'd like to
>> start moving over to a 0.5/0.6 and Hadoop 0.20 setup.
>>
>> On Tue, Jan 19, 2010 at 5:33 AM, Rob Stewart wrote:
>>
>>> Dmitriy, apologies, the full Hive script is:
>>> ---------------
>>> CREATE EXTERNAL TABLE Text(words STRING) STORED AS TEXTFILE
>>> LOCATION '/user/rs46/Inputs/WordCount/wordsx1_skewed.dat';
>>> FROM Text
>>> INSERT OVERWRITE DIRECTORY
>>> '/user/rs46/Outputs/WordCount/outputHiveWordCount'
>>> SELECT words, COUNT(words) AS totals
>>> GROUP BY words;
>>> ---------------
>>>
>>> And you have pointed out a valid issue with regard to JAQL. Perhaps
>>> it is fairer to include a tokenizer to produce the Text output.
>>> So I have amended all the JAQL scripts to output plain text instead
>>> of sequence files.
>>>
>>> I will find Pig 0.6 and rerun the Pig scripts to incorporate the IO
>>> speed improvements, and the JAQL scripts to write to text files.
>>>
>>> The complete script takes about 24 hours to run (I run each test
>>> multiple times for a fair average). I imagine I'll be able to post
>>> the updated stats by the weekend. Thanks for the guidance, Dmitriy;
>>> I'm keen to get some useful/interesting data.
>>>
>>> Regards,
>>>
>>> Rob Stewart
>>>
>>> 2010/1/19 Dmitriy Ryaboy
>>>
>>>> Thanks Rob.
>>>> Can you point me to where the tokenization is happening in the
>>>> Hive and Jaql scripts? I.e., how is "Text" constructed?
>>>>
>>>> -D
>>>>
>>>> On Mon, Jan 18, 2010 at 5:26 PM, Rob Stewart wrote:
>>>>> Hi Dmitriy,
>>>>>
>>>>> Thanks for the info on Pig 0.6. I will definitely give that a go,
>>>>> and it would perhaps be interesting to annotate Pig 0.5 against
>>>>> 0.6 on the original charts. I did have my suspicions that disk IO
>>>>> had a part to play in the slow performance.
>>>>>
>>>>> With regard to the output results, I can confirm that the numbers
>>>>> are correct, i.e. each application is correctly counting the
>>>>> frequency of each word. I am less concerned with the output
>>>>> format (e.g. outputting to a text file might give an advantage to
>>>>> Pig over Hive, or outputting to a sequence file may give an
>>>>> advantage to JAQL, etc.). It seemed to make sense to allow Hive
>>>>> to output to a Hive "table", Pig to a plain text file, and JAQL
>>>>> to a JAQL sequence file. Once again, though, the wordcount
>>>>> numbers are certainly correct.
>>>>>
>>>>> I will test Pig 0.6 over the next few days, and would be more
>>>>> than happy to provide you with the updated execution time
>>>>> statistics.
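A quick way to make the "numbers are correct" check mechanical is to diff the outputs of two frameworks directly. Here is a minimal Python sketch; the tab-separated word/count output format is an assumption for illustration, not the verification Rob actually ran:

```python
from collections import Counter

def load_counts(lines):
    """Parse 'word<TAB>count' output lines into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.split("\t")
        counts[word] += int(count)
    return counts

def outputs_agree(lines_a, lines_b):
    """True when two wordcount outputs hold identical word
    frequencies, regardless of row order."""
    return load_counts(lines_a) == load_counts(lines_b)

# Same counts in a different row order still agree.
pig_out = ["apple\t3", "banana\t2"]
hive_out = ["banana\t2", "apple\t3"]
print(outputs_agree(pig_out, hive_out))      # True
print(outputs_agree(pig_out, ["apple\t4"]))  # False
```

Streaming each framework's part files out of HDFS (e.g. with `hadoop fs -cat`) and feeding the lines into a comparison like this would catch any divergence an eyeball check misses.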
>>>>>
>>>>> Rob Stewart
>>>>>
>>>>> 2010/1/19 Dmitriy Ryaboy
>>>>>
>>>>>> Rob,
>>>>>> Pig 0.6 (on the verge of being released) replaces the way Pig
>>>>>> reads data off the disk for PigStorage, which gives it a
>>>>>> significant speed-up. Prior to that, I was seeing Pig lag behind
>>>>>> Hive when simply reading data and doing nothing else -- now it's
>>>>>> equal. So definitely try that.
>>>>>>
>>>>>> Did you check that you get the same output from all three
>>>>>> frameworks? I wonder if they tokenize in the same manner. It
>>>>>> might be better to write a UDF you know to be equivalent for
>>>>>> each of the frameworks, and use that instead of something
>>>>>> built-in.
>>>>>>
>>>>>> -D
>>>>>>
>>>>>> On Mon, Jan 18, 2010 at 5:01 PM, Rob Stewart wrote:
>>>>>>> And for your information, the JAQL script looks like:
>>>>>>>
>>>>>>> $input = read(lines("Inputs/WordCount/wordsx1_skewed.dat"));
>>>>>>> $input -> group by $word = $
>>>>>>>   into { $word, num: count($) }
>>>>>>> -> write(hdfs('Outputs/WordCount/wordCountOutputx1_skewed.jaql'));
>>>>>>>
>>>>>>> (Note, this is from a JAQL svn snapshot.)
>>>>>>>
>>>>>>> Rob
>>>>>>>
>>>>>>> 2010/1/19 Rob Stewart
>>>>>>>
>>>>>>>> Sure, the Hive script is as follows:
>>>>>>>>
>>>>>>>> FROM Text
>>>>>>>> INSERT OVERWRITE DIRECTORY
>>>>>>>> '/user/rs46/Outputs/WordCount/outputHiveWordCount'
>>>>>>>> SELECT words, COUNT(words) AS totals
>>>>>>>> GROUP BY words;
>>>>>>>>
>>>>>>>> Notice I have been careful not to include the *loading times*
>>>>>>>> of the data, as this is handled by Hadoop FileInput and is the
>>>>>>>> same for each. For each of the languages, I have only timed:
>>>>>>>> 1. Reading the input file
>>>>>>>> 2. Doing the word count
>>>>>>>> 3. Writing results out to the HDFS (for Hive, to a Hive
>>>>>>>>    "database table", and for JAQL, to a JAQL sequence file),
>>>>>>>>    though this only takes a matter of seconds.
>>>>>>>>
>>>>>>>> I've been careful in each of the word counts *not* to order by
>>>>>>>> the frequency of each word, as this would cloud the issue with
>>>>>>>> an "ORDER BY" query. So all the results give me a list of
>>>>>>>> words and their frequency, in alphabetical order of the words
>>>>>>>> (done automatically by the MapReduce model).
>>>>>>>>
>>>>>>>> I am using Pig 0.5.0 with Hadoop 0.20.0.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Rob Stewart
>>>>>>>>
>>>>>>>> 2010/1/19 Dmitriy Ryaboy
>>>>>>>>
>>>>>>>>> Oh, and which version of Pig are you using?
>>>>>>>>>
>>>>>>>>> On Mon, Jan 18, 2010 at 4:47 PM, Dmitriy Ryaboy
>>>>>>>>> <dvryaboy@gmail.com> wrote:
>>>>>>>>>> Rob,
>>>>>>>>>> Can you show the Hive script you used, as well?
>>>>>>>>>>
>>>>>>>>>> -D
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 18, 2010 at 4:34 PM, Rob Stewart wrote:
>>>>>>>>>>> Hi folks,
>>>>>>>>>>>
>>>>>>>>>>> I have some initial results to run through with you. I have
>>>>>>>>>>> a number of implementations ready to push onto the Hadoop
>>>>>>>>>>> cluster, but I have finalized the tests for Hive, JAQL and
>>>>>>>>>>> Pig for the simple WordCount application.
>>>>>>>>>>>
>>>>>>>>>>> The experiments were performed over a 32-node cluster (31
>>>>>>>>>>> datanodes). The input data was generated from the Pig
>>>>>>>>>>> DataGenerator (thanks goes to Alan Gates).
>>>>>>>>>>>
>>>>>>>>>>> The reason for asking you all at this early stage is that
>>>>>>>>>>> Pig seems to fall some way behind when it comes to overall
>>>>>>>>>>> execution time.
>>>>>>>>>>> The experiment is a scale-up scenario, where I have fixed
>>>>>>>>>>> the nodes at 31 and increased the data to be processed. I
>>>>>>>>>>> have not yet done a scale-out experiment, but that is my
>>>>>>>>>>> next task (i.e. 10 million data rows on 10 DataNodes should
>>>>>>>>>>> execute only slightly quicker than 20 million rows on 20
>>>>>>>>>>> DataNodes).
>>>>>>>>>>>
>>>>>>>>>>> Anyway, let me point you to the test results:
>>>>>>>>>>> http://www.macs.hw.ac.uk/~rs46/WordCount_Scale_Up_Execution.pdf
>>>>>>>>>>>
>>>>>>>>>>> On that PDF I include the Pig script I've used, and why
>>>>>>>>>>> I've stated the parallelism used. I have not yet executed
>>>>>>>>>>> the Java MapReduce wordcount, but I imagine it will
>>>>>>>>>>> undercut the Hive performance.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have any comments on the results, a reason why
>>>>>>>>>>> Pig performs poorly, or any obvious error I have made? That
>>>>>>>>>>> said, this is the first of a number of experiments, and Pig
>>>>>>>>>>> may shine (for instance, I am planning a skewed join
>>>>>>>>>>> benchmark, which will be interesting as all the languages
>>>>>>>>>>> have an implementation for skewed data).
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Rob Stewart
>>
>> --
>> Zaki Rahaman
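Dmitriy's point about equivalent tokenization is easy to demonstrate outside any of the three frameworks. The sketch below is illustrative Python, not any framework's actual tokenizer: two plausible tokenizers produce different counts from the same line, which is exactly the kind of divergence that would skew a cross-framework wordcount comparison.

```python
import re
from collections import Counter

line = "Hello, world -- hello world!"

# Tokenizer 1: split on whitespace. Punctuation stays glued to words
# and case is preserved, so "Hello," and "hello" are distinct tokens.
ws_counts = Counter(line.split())

# Tokenizer 2: lowercase, then keep only alphanumeric runs.
re_counts = Counter(re.findall(r"\w+", line.lower()))

print(ws_counts["hello"])      # 1
print(re_counts["hello"])      # 2
print(ws_counts == re_counts)  # False
```

Writing one UDF per framework around the same shared rule, as suggested above, removes this variable from the timings, so any remaining difference is genuinely execution speed.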