cassandra-user mailing list archives

From Jeremy Hanna <>
Subject Re: 1000's of column families
Date Tue, 02 Oct 2012 22:03:58 GMT
It's always had data locality (since Hadoop support was added in 0.6).

You don't need to specify a partition; you specify the input predicate with ConfigHelper or
the cassandra.input.predicate property.

On Oct 2, 2012, at 2:26 PM, "Hiller, Dean" <> wrote:

> So you're saying that you can access the primary index with a key range, but to access
> the secondary index, you first need to get all keys and follow up with a multiget, which would
> use the secondary index to speed the lookup of the matching rows?
> Yes, that is how I "believe" it works.  I am by no means an expert.
> I also wanted to fire off an MR job to process matching rows in the "virtual" CF, ideally running
> on the nodes where it reads data in.  In 0.7, I thought the M/R jobs did not run locally with
> the data like Hadoop does?  Anyone know if that is still true, or does it run locally to
> the data now?
> Thanks,
> Dean
> From: Ben Hood
> Date: Tuesday, October 2, 2012 1:01 PM
> Subject: Re: 1000's of column families
> Dean,
> On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:
> Because the data for an index is not all together (i.e., you need a multiget to get the data).
> It is not contiguous.
> With a prefix in a partition, they keep the data together, so all data for a prefix is, from
> what I understand, contiguous.
> QUESTION: What I don't get in the comment is this: I assume you are referring to CQL, in which
> case we would need to specify the partition (in addition to the index), which means all that
> data is on one node, correct? Or did I miss something there?
> Maybe my question was just silly - I wasn't referring to CQL.
> As for the locality of the data, I was hoping to be able to fire off an MR job to process
> all matching rows in the CF - I was assuming that this job would get executed on the
> same node as the data.
> But I think the real confusion in my question has to do with the way the ColumnFamilyInputFormat
> has been implemented, since it would appear that it ingests the entire (non-OPP) CF into Hadoop,
> such that the predicate needs to be applied in the job rather than up front in the Cassandra
> Cheers,
> Ben
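
The distinction discussed in the thread, a contiguous key-range read on the primary index versus a secondary-index lookup followed by a multiget, can be sketched with a toy in-memory model. This is only an illustration of the idea as described above, not how Cassandra stores or serves data: the class IndexLookupSketch and its methods put, rangeScan, multiget and findByIndexedValue are hypothetical names, and a sorted TreeMap stands in for an order-preserving partitioner.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy in-memory model; none of these names are Cassandra APIs.
class IndexLookupSketch {
    // Rows kept sorted by key, standing in for an order-preserving primary index.
    private final TreeMap<String, String> rows = new TreeMap<>();
    // Hypothetical secondary index: indexed column value -> keys of matching rows.
    private final Map<String, TreeSet<String>> index = new HashMap<>();

    void put(String key, String indexedValue) {
        String old = rows.put(key, indexedValue);
        if (old != null) {
            index.get(old).remove(key); // drop the stale index entry on overwrite
        }
        index.computeIfAbsent(indexedValue, v -> new TreeSet<>()).add(key);
    }

    // Primary-index access: one contiguous slice of the sorted key space.
    List<String> rangeScan(String fromKey, String toKey) {
        return new ArrayList<>(rows.subMap(fromKey, true, toKey, true).keySet());
    }

    // Fetch rows by explicit key; the keys need not be adjacent.
    Map<String, String> multiget(Collection<String> keys) {
        Map<String, String> found = new LinkedHashMap<>();
        for (String key : keys) {
            String value = rows.get(key);
            if (value != null) {
                found.put(key, value);
            }
        }
        return found;
    }

    // Secondary-index access: look up the matching keys, then multiget them.
    Map<String, String> findByIndexedValue(String value) {
        return multiget(index.getOrDefault(value, new TreeSet<>()));
    }

    public static void main(String[] args) {
        IndexLookupSketch cf = new IndexLookupSketch();
        cf.put("user:001", "Oslo");
        cf.put("user:002", "Lima");
        cf.put("user:003", "Oslo");
        System.out.println(cf.rangeScan("user:001", "user:002"));
        System.out.println(cf.findByIndexedValue("Oslo"));
    }
}
```

In this model the range scan touches neighbouring keys only, while the index lookup returns keys scattered across the key space, which is why the second step has to be a multiget.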
