pig-user mailing list archives

From Jeff Zhang <zjf...@gmail.com>
Subject Re: Is there anything in pig that supports external client to stream out a content of alias? a bit like Hive Thrift server...
Date Wed, 08 Dec 2010 01:19:44 GMT
Hi Jay,

I believe that even if you use Pig, the performance of fetching from HDFS
won't be better than Hive's, because both Pig and Hive store result data
in HDFS and fetch it on the client side. In most cases the result
data won't be very large, so performance won't be a problem. But I
guess your result data is very large, since you mention that it is
network bound; in that case I suggest running another Pig script or
native MapReduce jobs on your result data.
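For reference, that "STORE the result, then pull it out of HDFS (or feed it to another script)" pattern might look roughly like this; the paths and the filter expression here are made up for illustration, and would need a running Hadoop cluster:

```shell
# Hypothetical sketch: run a Pig script that STOREs the result alias,
# then merge the part files out of HDFS into one local file.
pig -e "A = load 'mydata'; B = filter A by \$0 > 10; store B into '/tmp/pig_result';"
hadoop fs -getmerge /tmp/pig_result result.tsv
```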


On Wed, Dec 8, 2010 at 2:26 AM, Jae Lee <Jae.Lee@forward.co.uk> wrote:
> yeah I came across the openIterator(alias) on PigServer.
>
> basically that's what I'd like to get (a dump of the alias and nothing else) when I execute a pig script.
>
> I'm currently writing a ruby wrapper that will STORE the alias into a temporary location in hdfs and then do a Hadoop file fetch.
> any better idea?
>
> J
> On 7 Dec 2010, at 18:16, Ashutosh Chauhan wrote:
>
>> I am not sure if I understood your requirements clearly, but if you
>> are not looking for a pure PigLatin solution and can work through
>> Pig's java api, then you may want to look at PigServer.
>> http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/PigServer.html
>> Something along the following lines:
>>
>> PigServer pig = new PigServer(pc, true);
>> pig.registerQuery("A = load 'mydata';");
>> pig.registerQuery("B = filter A by $0 > 10;");
>> Iterator<Tuple> itr = pig.openIterator("B");
>> while (itr.hasNext()) {
>>   // Tuple.get(0) returns an Object, so compare with equals(), not ==
>>   if (Integer.valueOf(25).equals(itr.next().get(0))) {
>>     // trigger further processing.
>>   }
>> }
>>
>> It's obviously not directly useful, but it conveys the general idea. Hope it helps.
>>
>> Ashutosh
>> On Tue, Dec 7, 2010 at 06:40, Jae Lee <Jae.Lee@forward.co.uk> wrote:
>>> Hi,
>>>
>>> In our application Hive is used as a database, i.e. a result set from a select query is consumed outside of the hadoop cluster.
>>>
>>> The consumption process is not Hadoop friendly, in that it is network bound, not cpu/disk bound.
>>>
>>> I'm in the process of converting the hive query into a pig query to see if it reads better.
>>>
>>> What I'm stuck on is isolating the dump of a specific alias from all the other stuff being logged, so that I can trigger a further process.
>>>
>>> STREAM <alias> THROUGH <cmd> seems to be one way to trigger a process; it's just that it doesn't seem suitable for the kind of process we are looking at, because the <cmd> gets run in the hadoop cluster.
>>>
>>> any thoughts?
>>>
>>> J
>>
>
>



-- 
Best Regards

Jeff Zhang
