incubator-cassandra-user mailing list archives

From Chad Johnston <cjohns...@megatome.com>
Subject Re: CqlStorage creates wrong schema for Pig
Date Sat, 31 Aug 2013 02:32:47 GMT
I threw together a quick UDF to work around this issue. It extracts the
value portion of each (name, value) tuple and relies on the
CqlStorage-generated schema to keep the declared type correct.

You can get it here: https://github.com/iamthechad/cqlstorage-udf
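
To illustrate the idea only, here is a rough sketch in plain Python. It is not the code from the repository above and it is not a Pig UDF; the function names extract_value and extract_row are made up for this example. It assumes each column arrives as a (name, value) pair, as the DUMP output quoted further down shows, and it passes anything else through unchanged.

```python
# Sketch only: models the (name, value) pairs that DUMP shows for each column.
# extract_value and extract_row are hypothetical names, not part of Pig or CqlStorage.

def extract_value(column):
    """Return the value half of a (name, value) pair; pass other inputs through."""
    if isinstance(column, tuple) and len(column) == 2:
        return column[1]
    return column


def extract_row(row):
    """Apply extract_value to every column of a row."""
    return tuple(extract_value(column) for column in row)


# Row shaped like the DUMP output from the bookdata/books example.
row = (
    ("isbn", "0425093387"),
    ("bookauthor", "Georgette Heyer"),
    ("booktitle", "Death in the Stocks"),
    ("publisher", "Berkley Pub Group"),
    ("yearofpublication", 1986),
)

print(extract_row(row))
```

The value keeps whatever type it already had (the year stays an int), which is the point of leaning on the declared schema instead of casting.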

I'll see if I can find more useful information and open a defect, since
that's what this seems to be.

Chad


On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
mianmarjun.mailinglist@gmail.com> wrote:

> I tried this:
>
> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
> dump rows;
> ILLUSTRATE rows;
> describe rows;
>
> values2 = FOREACH rows GENERATE TOTUPLE (id) as (mycolumn:tuple(name,value));
> dump values2;
> describe values2;
>
> But I get these results:
>
>
>
> -------------------------------------------------------------
> | rows     | id:chararray   | age:int   | title:chararray   |
> -------------------------------------------------------------
> |          | (id, 6)        | (age, 30) | (title, QA)       |
> -------------------------------------------------------------
>
> rows: {id: chararray,age: int,title: chararray}
> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1031: Incompatable field schema: left is
> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>
>
> or
>
> ....
>
> values2 = FOREACH rows GENERATE TOTUPLE (id);
> dump values2;
> describe values2;
>
> and the results are:
>
>
> ...
> (((id,6)))
> (((id,5)))
> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>
>
>
> Aggg!!!!!
>
>
>
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.martin@brainsins.com
>
>
>
> 2013/8/26 Miguel Angel Martin junquera <mianmarjun.mailinglist@gmail.com>
>
>> Hi Chad,
>>
>> I have the same issue.
>>
>> I sent a mail to the pig-user list, but I still cannot resolve it, and I
>> cannot access the column values.
>> In that mail I describe some things I tried without success, along with
>> more information about this issue:
>>
>> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3CCAJeG_hQ9S2Po3_XytZX5Xki4J1maO8q26jYdG2Wndy_KYiv9CQ@mail.gmail.com%3E
>>
>> I hope someone replies with a comment, idea, or solution for this issue
>> or bug.
>>
>> I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I do
>> not have an environment configured to debug and trace this issue.
>>
>> I only found comments like the following, which I do not fully understand:
>>
>>
>> /**
>>  * A LoadStoreFunc for retrieving data from and storing data to Cassandra
>>  *
>>  * A row from a standard CF will be returned as nested tuples:
>>  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
>>  */
>>
>>
>> If you find an idea or a solution, please post it.
>>
>> Thanks
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2013/8/23 Chad Johnston <cjohnston@megatome.com>
>>
>>> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>>>
>>> I'm loading some simple data from Cassandra into Pig using CqlStorage.
>>> The CqlStorage loader defines a Pig schema based on the Cassandra schema,
>>> but it seems to be wrong.
>>>
>>> If I do:
>>>
>>> data = LOAD 'cql://bookdata/books' USING CqlStorage();
>>> DESCRIBE data;
>>>
>>> I get this:
>>>
>>> data: {isbn: chararray,bookauthor: chararray,booktitle:
>>> chararray,publisher: chararray,yearofpublication: int}
>>>
>>> However, if I DUMP data, I get results like these:
>>>
>>> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the
>>> Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>>>
>>> Clearly the results from Cassandra are key/value pairs, as would be
>>> expected. I don't know why the schema generated by CqlStorage() would be so
>>> different.
>>>
>>> This is really causing me problems trying to access the column values. I
>>> tried a naive approach of FLATTENing each tuple, then trying to access the
>>> values that way:
>>>
>>> flattened = FOREACH data GENERATE
>>>   FLATTEN(isbn),
>>>   FLATTEN(booktitle),
>>>   ...
>>> values = FOREACH flattened GENERATE
>>>   $1 AS ISBN,
>>>   $3 AS BookTitle,
>>>   ...
>>>
>>> As soon as I try to access field $5, Pig complains about the index being
>>> out of bounds.
>>>
>>> Is there a way to solve the schema/reality mismatch? Am I doing
>>> something wrong, or have I stumbled across a defect?
>>>
>>> Thanks,
>>> Chad
>>>
>>
>>
>
