incubator-cassandra-user mailing list archives

From "Hiller, Dean" <Dean.Hil...@nrel.gov>
Subject Re: 1000's of column families
Date Tue, 02 Oct 2012 19:26:24 GMT
So you're saying that you can access the primary index with a key range, but to access the
secondary index, you first need to get all keys and follow up with a multiget, which would
use the secondary index to speed the lookup of the matching rows?

Yes, that is how I "believe" it works.  I am by no means an expert.
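
To illustrate the distinction being discussed, here is a minimal toy sketch in Python. It is not Cassandra's storage engine or client API: the sorted list, the "cf2:" key prefix, the "state" column, and the helpers key_range, build_index, and multiget are all hypothetical names made up for illustration. It only shows why a key-range read over a prefix is one contiguous slice, while a secondary-index lookup yields scattered keys that then need a multiget.

```python
from bisect import bisect_left, bisect_right

# Toy model only: a key-sorted list of (row_key, columns) pairs stands in
# for a key-ordered column family. All names and data are hypothetical.
ROWS = sorted({
    "cf1:001": {"state": "CO"},
    "cf1:002": {"state": "NY"},
    "cf2:001": {"state": "CO"},
    "cf2:002": {"state": "CO"},
}.items())
KEYS = [key for key, _ in ROWS]


def key_range(start, end):
    """Return rows with start <= key <= end as one contiguous slice."""
    lo = bisect_left(KEYS, start)
    hi = bisect_right(KEYS, end)
    return ROWS[lo:hi]


def build_index(column):
    """Map each value of `column` to the (possibly scattered) keys holding it."""
    index = {}
    for key, columns in ROWS:
        index.setdefault(columns[column], []).append(key)
    return index


def multiget(keys):
    """Fetch each key individually; the keys need not be adjacent."""
    table = dict(ROWS)
    return [(key, table[key]) for key in keys]


STATE_INDEX = build_index("state")

# Primary access: every row of the "virtual" CF with prefix "cf2:" is adjacent.
prefix_rows = key_range("cf2:", "cf2:\uffff")

# Secondary access: the index gives keys spread across prefixes, then a multiget.
indexed_rows = multiget(STATE_INDEX["CO"])
```

In this toy model the prefix read touches one slice, whereas the indexed read touches rows under both "cf1:" and "cf2:".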

I also wanted to fire off an MR job to process matching rows in the "virtual" CF, ideally running
on the nodes where the data is read.  In 0.7, I thought the M/R jobs did not run locally with
the data the way Hadoop does.  Does anyone know if that is still true, or do they run locally to
the data now?

Thanks,
Dean

From: Ben Hood <0x6e6562@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, October 2, 2012 1:01 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: 1000's of column families

Dean,

On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:

Because the data for an index is not all together (i.e., you need a multiget to get the data). It
is not contiguous.

With a prefix in a partition, from what I understand, all the data for that prefix is kept
together, so it is contiguous.

QUESTION: What I don't get in the comment is this: I assume you are referring to CQL, in which case
we would need to specify the partition (in addition to the index), which means all that data
is on one node, correct? Or did I miss something there?

Maybe my question was just silly - I wasn't referring to CQL.

As for the locality of the data, I was hoping to be able to fire off an MR job to process
all matching rows in the CF - I was assuming that this job would get executed on the
same node as the data.

But I think the real confusion in my question has to do with the way the ColumnFamilyInputFormat
has been implemented, since it would appear that it ingests the entire (non-OPP) CF into Hadoop,
such that the predicate needs to be applied in the job rather than up front in the Cassandra
query.
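
As a rough illustration of that concern, here is a toy Python sketch. It does not use Hadoop or the real ColumnFamilyInputFormat: full_scan, map_with_predicate, and the sample data are hypothetical stand-ins. It assumes the behaviour described above, where every row is handed to the job and the predicate is applied in the map step, so the number of rows read is the size of the whole CF rather than the number of matches.

```python
def full_scan(column_family):
    """Hypothetical stand-in for an input format that feeds every row to the job."""
    yield from column_family.items()


def map_with_predicate(rows, predicate):
    """Apply the filter inside the map step, after each row has already been read."""
    rows_read = 0
    matches = []
    for key, columns in rows:
        rows_read += 1
        if predicate(key, columns):
            matches.append((key, columns))
    return matches, rows_read


# Hypothetical data: two "virtual" CFs sharing one physical CF via key prefixes.
cf = {
    "cf1:001": {"state": "CO"},
    "cf1:002": {"state": "NY"},
    "cf2:001": {"state": "CO"},
    "cf2:002": {"state": "CO"},
}

matches, rows_read = map_with_predicate(
    full_scan(cf),
    lambda key, columns: key.startswith("cf2:"),
)
```

Here only two rows match, but all four are read before the predicate can discard the rest.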

Cheers,

Ben

