pig-user mailing list archives

From Jae Lee <Jae....@forward.co.uk>
Subject Re: Is there anything in pig that supports external client to stream out a content of alias? a bit like Hive Thrift server...
Date Wed, 08 Dec 2010 10:20:02 GMT
Hi Jeff,

It's the processing we do on the result data from Hive (or equally from Pig) that is network
bound. At the moment Pig only allows "DUMP" and "STORE" to get at that result data, which makes
it a bit inconvenient.
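
As a rough illustration of the STORE-then-fetch workaround discussed in this thread, here is a minimal sketch in Java. It assumes the alias was stored with Pig's default tab-delimited storage and that the output directory has already been copied out of HDFS to the local disk (for example with `hadoop fs -copyToLocal`). The class and method names are hypothetical, and it reads a local directory rather than talking to HDFS directly.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: reads the part files of a stored alias from a local copy.
final class StoredAliasReader {

    // Returns one String[] per stored tuple, fields split on tabs.
    // Assumption: the data was written tab-delimited, one tuple per line.
    static List<String[]> readRows(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        // Only part-* files hold data; other entries such as _logs are skipped.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // keep part-00000, part-00001, ... in order

        List<String[]> rows = new ArrayList<>();
        for (Path part : parts) {
            for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                rows.add(line.split("\t", -1)); // -1 keeps trailing empty fields
            }
        }
        return rows;
    }
}
```

A client would call `StoredAliasReader.readRows(localDir)` and hand the rows to the network-bound consumer. This does not remove the extra write to HDFS; it only makes the fetch side explicit.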


On 8 Dec 2010, at 01:19, Jeff Zhang wrote:

> Hi Jae,
> I believe that even if you use Pig, fetching from HDFS won't perform
> any better than with Hive, because Pig and Hive both store the result
> data in HDFS and the client fetches it from there. In most cases the
> result data isn't very large, so performance isn't a problem. But I
> guess your result data is very large, since you mention that it is
> network bound, so I suggest running another Pig script or a native
> MapReduce job on your result data.
> On Wed, Dec 8, 2010 at 2:26 AM, Jae Lee <Jae.Lee@forward.co.uk> wrote:
>> Yeah, I came across openIterator(alias) on PigServer.
>> Basically that's what I'd like to get (a dump of the alias and nothing else) when I execute a Pig script.
>> I'm currently writing a Ruby wrapper that will STORE the alias into a temporary location in HDFS and then do a Hadoop file fetch.
>> Any better idea?
>> J
>> On 7 Dec 2010, at 18:16, Ashutosh Chauhan wrote:
>>> I am not sure if I understood your requirements clearly, but if you
>>> are not looking for a pure PigLatin solution and can work through
>>> Pig's java api, then you may want to look at PigServer.
>>> http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/PigServer.html
>>> Something along the following lines:
>>> PigServer pig = new PigServer(pc, true);
>>> pig.registerQuery("A = load 'mydata'; ");
>>> pig.registerQuery("B = filter A by $0 > 10;");
>>> // openIterator runs the plan for B and iterates over its tuples
>>> Iterator<Tuple> itr = pig.openIterator("B");
>>> while (itr.hasNext()) {
>>>   // get(0) returns an Object, so compare by value rather than with ==
>>>   if ("25".equals(String.valueOf(itr.next().get(0)))) {
>>>     // trigger further processing.
>>>   }
>>> }
>>> It's obviously not directly useful, but conveys the general idea. Hope it helps.
>>> Ashutosh
>>> On Tue, Dec 7, 2010 at 06:40, Jae Lee <Jae.Lee@forward.co.uk> wrote:
>>>> Hi,
>>>> In our application Hive is used as a database, i.e. the result set from a select query is consumed outside of the Hadoop cluster.
>>>> The consumption process is not Hadoop friendly, in that it is network bound rather than CPU/disk bound.
>>>> I'm in the process of converting a Hive query into a Pig query to see if it reads
>>>> What I'm stuck at is picking out the content of a specific alias dump from all the other stuff being logged, so that I can trigger further processing.
>>>> STREAM <alias> THROUGH <cmd> seems to be one way to trigger a process, but it doesn't seem suitable for the kind of process we are looking at, because the <cmd> gets run in the Hadoop cluster.
>>>> Any thoughts?
>>>> J
> -- 
> Best Regards
> Jeff Zhang
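
For the original question of separating the dumped tuples from everything else being logged, one crude option is to filter the captured output by shape. The sketch below assumes each dumped tuple appears on its own line wrapped in parentheses and that log lines do not start with "(". The class and method names are hypothetical, and this is a text filter only, not part of Pig's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps only the lines that look like dumped tuples.
final class DumpFilter {

    // Assumption: a dumped tuple is a single line of the form "(field1,field2,...)".
    static List<String> extractTuples(List<String> lines) {
        List<String> tuples = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.length() >= 2 && trimmed.startsWith("(") && trimmed.endsWith(")")) {
                tuples.add(trimmed);
            }
        }
        return tuples;
    }
}
```

This breaks down if a log message happens to be wrapped in parentheses or if a field value contains a newline, so iterating over the alias through PigServer, as suggested earlier in the thread, is likely the more robust route when the Java API is an option.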
