cassandra-commits mailing list archives

From "Brandon Williams (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3371) Cassandra inferred schema and actual data don't match
Date Tue, 31 Jan 2012 20:58:11 GMT


Brandon Williams commented on CASSANDRA-3371:

I resolved PIG-2485 as invalid.  You can read the explanation there, but I'll go ahead and
summarize: a bag's schema can only contain one tuple because it is assumed that all tuples
in the bag have the same schema.  Obviously this won't be true in Cassandra since we allow
any column to have any schema that you like.  However, after talking with Dmitriy Ryaboy,
I have a plan.  We got good results out of tuple-of-tuples, but this won't work with wide
rows.  Another thing it won't work with is small rows where some columns have metadata, and
some do not, because when you define a tuple-of-tuples that is a hard constraint; you can't
define 4 and then return 20.  So what I propose is that we change the output format to be
a tuple-of-tuples for all columns that have metadata, and then a bag with the rest of the
columns with a single schema (the default comparator/validator.)  This will work for both
static and wide rows, unless you manage to define metadata on so many columns in a wide row
that they themselves qualify as wide.
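
To make the proposed shape concrete, here is a minimal Python sketch. It is not the CassandraStorage code (which is Java). The function name `shape_row`, the use of plain Python tuples and lists in place of Pig tuples and bags, the sorted slot order, and the `None` placeholder for a metadata column that is missing from a row are all assumptions made for illustration.

```python
def shape_row(key, columns, metadata_names):
    """Arrange one row as (key, <one tuple per metadata column>..., bag of the rest).

    columns: list of (name, value) pairs as read from the row.
    metadata_names: set of column names that have metadata defined.
    """
    by_name = dict(columns)
    # One fixed slot per metadata column, in sorted order, so every row has the
    # same tuple-of-tuples prefix. A missing column leaves None in its slot
    # (an assumption of this sketch, not something stated in the ticket).
    fixed = [(name, by_name[name]) if name in by_name else None
             for name in sorted(metadata_names)]
    # Everything else goes into a single bag whose tuples share one schema.
    bag = [(name, value) for name, value in columns if name not in metadata_names]
    return (key, *fixed, bag)


row = shape_row(
    "k1",
    [("voter", "691831038"), ("vote_type", "album_dislike"), ("extra1", b"\x01")],
    {"voter", "vote_type", "pid"},
)
```

The fixed prefix has the same length for every row, while the trailing bag can hold any number of columns, which is why the same layout should cover both static and wide rows.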

To give an example, let's continue with what Pete started with a slight modification:
create column family PhotoVotes with
comparator = UTF8Type and
column_metadata = [
{column_name: voter, validation_class: UTF8Type, index_type: KEYS},
{column_name: vote_type, validation_class: UTF8Type},
{column_name: photo_owner, validation_class: UTF8Type, index_type: KEYS},
{column_name: src_big, validation_class: UTF8Type},
{column_name: pid, validation_class: UTF8Type, index_type: KEYS},
{column_name: matched_string, validation_class: UTF8Type},
{column_name: time, validation_class: LongType}
];

Loading this from pig produces a schema like:
(key: bytearray,matched_string: (name: chararray,value: chararray),photo_owner: (name: chararray,value: chararray),pid: (name: chararray,value: chararray),src_big: (name: chararray,value: chararray),time: (name: chararray,value: chararray),vote_type: (name: chararray,value: chararray),voter: (name: chararray,value: chararray),columns: {(name: chararray,value: bytearray)})

This should allow you to do things like:

FILTER rows by vote_type.value eq 'album_like'
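
In plain Python terms, a filter like the one above amounts to roughly the following sketch. Representing each row as a dict and the helper name `value_of` are assumptions for illustration only; Pig evaluates the expression against its own tuple types.

```python
rows = [
    {"key": "a", "vote_type": ("vote_type", "album_like"), "columns": []},
    {"key": "b", "vote_type": ("vote_type", "album_dislike"), "columns": []},
    {"key": "c", "columns": []},  # row without a vote_type column
]


def value_of(row, column):
    """Return the value half of a named (name, value) tuple, or None if absent."""
    pair = row.get(column)
    return pair[1] if pair is not None else None


# Keep only the rows whose vote_type.value equals 'album_like'.
liked = [r for r in rows if value_of(r, "vote_type") == "album_like"]
```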

Note that the *tuple* is named after the index, and inside the tuple we still have 'name'
and 'value'.  This is because if we don't have the name accessible, this is going to be hard
to store later (and schema introspection is a bit more magic than I'd care to use.)
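
A rough Python sketch of why keeping 'name' inside each tuple helps when storing: the row can be flattened back into (name, value) columns without consulting the schema. The function name `to_columns` and the row layout (key first, bag last, `None` for an absent column) are assumptions carried over for illustration, not the actual store path.

```python
def to_columns(shaped_row):
    """Flatten (key, named tuples..., bag) back into (key, [(name, value), ...])."""
    key, *fixed, bag = shaped_row
    # Each named tuple still carries its own column name, so no schema
    # lookup is needed to know where the value belongs.
    columns = [pair for pair in fixed if pair is not None]
    columns.extend(bag)
    return key, columns


key, columns = to_columns(
    ("k1", ("vote_type", "album_like"), ("voter", "691831038"), [("extra1", b"\x01")])
)
```
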
> Cassandra inferred schema and actual data don't match
> -----------------------------------------------------
>                 Key: CASSANDRA-3371
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.8.7
>            Reporter: Pete Warden
>            Assignee: Brandon Williams
>         Attachments: 3371-v2.txt, 3371-v3.txt, 3371-v4.txt, pig.diff
> It's looking like there may be a mismatch between the schema that's being reported by
> the latest, and the data that's actually returned. Here's an example:
> rows = LOAD 'cassandra://Frap/PhotoVotes' USING CassandraStorage();
> DESCRIBE rows;
> rows: {key: chararray,columns: {(name: chararray,value: bytearray,photo_owner: chararray,value_photo_owner: bytearray,pid: chararray,value_pid: bytearray,matched_string: chararray,value_matched_string: bytearray,src_big: chararray,value_src_big: bytearray,time: chararray,value_time: bytearray,vote_type: chararray,value_vote_type: bytearray,voter: chararray,value_voter: bytearray)}}
> DUMP rows;
> (691831038_1317937188.48955,{(photo_owner,1596090180),(pid,6855155124568798560),(matched_string,),(src_big,),(time,Thu Oct 06 14:39:48 -0700 2011),(vote_type,album_dislike),(voter,691831038)})
> getSchema() is reporting the columns as an inner bag of tuples, each of which contains
> 16 values. In fact, getNext() seems to return an inner bag containing 7 tuples, each of which
> contains two values.
> It appears that things got out of sync with this change:
> See more discussion at:

This message is automatically generated by JIRA.