cassandra-dev mailing list archives

From Shao-Chuan Wang <shaochuan.w...@bloomreach.com>
Subject Re: Replacing thrift calls in Hadoop input-split calculation with Java driver calls.
Date Mon, 24 Mar 2014 23:13:49 GMT
Tyler mentioned that client.describe_ring(myKeyspace); can be replaced by a
query of the system.peers table, which has the ring information. The
challenge here is describe_splits_ex, which needs to estimate the number of
rows in each sub token range (as you mentioned).
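
To make the describe_ring replacement concrete, here is a minimal sketch of
rebuilding token ranges on the client. It assumes the tokens have already been
fetched with something like "SELECT peer, tokens FROM system.peers" plus
"SELECT tokens FROM system.local" (system.peers does not list the node you are
connected to), and that the cluster uses Murmur3Partitioner so tokens fit in a
long. The RingRanges class and its method names are made up for illustration.
Unlike describe_ring, it reports only the owner of each range, not the
replicas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Cassandra or the Java driver.
final class RingRanges {

    /** A (start, end] token range and the host that owns the end token. */
    static final class Range {
        final long start;
        final long end;
        final String owner;

        Range(long start, long end, String owner) {
            this.start = start;
            this.end = end;
            this.owner = owner;
        }

        @Override
        public String toString() {
            return "(" + start + ", " + end + "] -> " + owner;
        }
    }

    /**
     * Builds one range per token. Each range runs from the previous token
     * on the ring (exclusive) to this token (inclusive). The first range
     * wraps around from the highest token.
     */
    static List<Range> primaryRanges(Map<String, List<Long>> tokensByHost) {
        // Sort all tokens and remember which host owns each one.
        TreeMap<Long, String> ring = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : tokensByHost.entrySet()) {
            for (long token : entry.getValue()) {
                ring.put(token, entry.getKey());
            }
        }

        List<Range> ranges = new ArrayList<>();
        if (ring.isEmpty()) {
            return ranges;
        }

        long previous = ring.lastKey(); // wrap-around start for the first range
        for (Map.Entry<Long, String> entry : ring.entrySet()) {
            ranges.add(new Range(previous, entry.getKey(), entry.getValue()));
            previous = entry.getKey();
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Made-up tokens standing in for the rows of system.peers/system.local.
        Map<String, List<Long>> tokensByHost = new HashMap<>();
        tokensByHost.put("10.0.0.1", Arrays.asList(-100L, 50L));
        tokensByHost.put("10.0.0.2", Arrays.asList(0L));
        for (Range range : primaryRanges(tokensByHost)) {
            System.out.println(range);
        }
    }
}
```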

From what I understand, and from trial and error so far, I don't think the
DataStax Java driver can do describe_splits_ex via a simple API call. If you
look at the implementation of CassandraServer.describe_splits_ex() and
StorageService.instance.getSplits(), it splits a token range into several
sub token ranges, with an estimated row count for each sub token range.
StorageService.instance.getSplits() also adjusts the split count based on
an estimated row count. StorageService.instance.getSplits() is exposed to
clients only through thrift, and it would be non-trivial to rebuild the
logic that lives inside it.

That said, it looks like we could implement the split logic in
AbstractColumnFamilyInputFormat.getSubSplits by querying
system.schema_columnfamilies and using CFMetaData.fromSchema to construct
the CFMetaData. CFMetaData has the indexInterval, which can be used to
estimate the row count. The next step is to mimic the logic in
StorageService.instance.getSplits() to divide a token range into several sub
token ranges, using the TokenFactory (obtained from the partitioner) to
construct them in AbstractColumnFamilyInputFormat.getSubSplits.
Basically, this moves the splitting code from the server side to the
client side.
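
A rough sketch of what the client-side splitting could look like, under
several assumptions: tokens are Murmur3Partitioner longs, the row estimate is
taken as (number of sampled index keys) x indexInterval, and each range is cut
into equal-width pieces by token arithmetic. This is a simplification and not
the server's actual algorithm. The ClientSideSplitter class, its method names,
and the sampledKeys input are hypothetical; how the client would obtain a
sample count is left open.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Cassandra or the Java driver.
final class ClientSideSplitter {

    // Number of distinct tokens on a ring of signed 64-bit tokens (assumption).
    private static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(64);

    /** Assumed estimate: every sampled index key stands for indexInterval rows. */
    static long estimateRows(long sampledKeys, int indexInterval) {
        return Math.multiplyExact(sampledKeys, (long) indexInterval);
    }

    /**
     * Cuts the range (start, end] into equal-width sub ranges. The number of
     * sub ranges is estimatedRows / rowsPerSplit, clamped to [1, maxSplits].
     * Each element of the result is a {start, end} pair.
     */
    static List<long[]> split(long start, long end, long estimatedRows,
                              long rowsPerSplit, int maxSplits) {
        if (rowsPerSplit <= 0 || maxSplits <= 0) {
            throw new IllegalArgumentException("rowsPerSplit and maxSplits must be positive");
        }
        int splitCount = (int) Math.max(1L, Math.min((long) maxSplits, estimatedRows / rowsPerSplit));

        // Width of the range, adding the ring size when it wraps around.
        BigInteger width = BigInteger.valueOf(end).subtract(BigInteger.valueOf(start));
        if (width.signum() <= 0) {
            width = width.add(RING_SIZE);
        }
        // Never create more sub ranges than there are tokens in the range.
        if (width.compareTo(BigInteger.valueOf(splitCount)) < 0) {
            splitCount = width.intValueExact();
        }

        List<long[]> subRanges = new ArrayList<>();
        long left = start;
        for (int i = 1; i <= splitCount; i++) {
            long right;
            if (i == splitCount) {
                right = end; // last piece ends exactly at the original end
            } else {
                BigInteger offset = width.multiply(BigInteger.valueOf(i))
                                         .divide(BigInteger.valueOf(splitCount));
                // longValue() keeps the low 64 bits, which wraps around the ring.
                right = BigInteger.valueOf(start).add(offset).longValue();
            }
            subRanges.add(new long[] {left, right});
            left = right;
        }
        return subRanges;
    }

    public static void main(String[] args) {
        long rows = estimateRows(8, 128); // made-up sample count and interval
        for (long[] range : split(0L, 100L, rows, 256L, 10)) {
            System.out.println(Arrays.toString(range));
        }
    }
}
```

In getSubSplits, each returned pair would still have to be turned back into
token strings through the partitioner's TokenFactory, and a real version would
need to handle partitioners whose tokens are not longs.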

Any thoughts?

Shao-Chuan


On Mon, Mar 24, 2014 at 11:54 AM, Clint Kelly <clint.kelly@gmail.com> wrote:

> I just saw this question about thrift in the Hadoop / Cassandra integration
> in the discussion on the user list about freezing thrift.  I have been
> working on a project to integrate Hadoop 2 and Cassandra 2 and have been
> trying to move all of the way over to the Java driver and away from thrift.
>
> I have finished most of the driver.  It is still pretty rough, but I have
> been using it for testing a prototype of the Kiji platform (www.kiji.org)
> that uses Cassandra instead of HBase.
>
> One thing I have not been able to figure out is how to calculate input
> splits without thrift.  I am currently doing the following:
>
>       map = client.describe_ring(myKeyspace);
>
> (where client is of type Cassandra.Client).
>
> This call returns a list of token ranges (max and min token values) for
> different nodes in the cluster.  We then use this information, along with
> another thrift call,
>
>     client.describe_splits_ex(cfName, range.start_token, range.end_token,
> splitSize);
>
> to estimate the number of rows in each token range, etc.
>
> I have looked all over the Java driver documentation and pinged the user
> list and have not gotten any proposals that work for the Java driver.  Does
> anyone here have any suggestions?
>
> Thanks!
>
> Best regards,
> Clint
>
>
> On Tue, Mar 11, 2014 at 12:41 PM, Shao-Chuan Wang <
> shaochuan.wang@bloomreach.com> wrote:
>
> > Hi,
> >
> > I just received this email from Jonathan on the dev mailing list
> > regarding the deprecation of thrift in 2.1.
> >
> > In fact, we migrated from thrift client to native one several months ago;
> > however, in the Cassandra.hadoop, there are still a lot of dependencies
> > on thrift interface, for example describe_splits_ex in
> > org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.
> >
> > Therefore, we had to keep both thrift and native enabled on our servers,
> > although most CRUD queries go through the native protocol.
> > However, Jonathan says "*I don't know of any use cases for Thrift that
> > can't be **done in CQL"*. This statement makes me wonder maybe there is
> > something I don't know about native protocol yet.
> >
> > So, does anyone know how to do "describing the splits" and "describing
> > the local rings" using native protocol?
> >
> > Also, cqlsh uses a python client, which talks via the thrift protocol too.
> > Does it mean that it will be migrated to native protocol soon as well?
> >
> > Comments, pointers, suggestions are much appreciated.
> >
> > Many thanks,
> >
> > Shao-Chuan
> >
>
