incubator-cassandra-dev mailing list archives

From Jeremy Hanna <>
Subject range ghosts and more with hadoop support (with proposed solution)
Date Sat, 02 Jul 2011 02:09:10 GMT
We think we're running into a situation where we've deleted all the columns on several thousand
rows, but those rows still show up in the results of our pig scripts.  We believe these are
range ghosts, which appear because ColumnFamilyRecordReader uses getRangeSlices.  That is
likely a problem for other people too, and I think we have something that might address it.

What if we had a hadoop job-specific option to have the CFRR filter out returned rows that
don't contain any columns?  It's true that core Cassandra used to do that and the feature was
removed because of the performance penalty.  However, with hadoop-type loads, latency isn't
as big a deal, and it would be a job-specific option anyway.  Also, CFRR has the option of a
SlicePredicate.  In addition to suppressing range ghosts, the option could also skip rows
that have no data for that SlicePredicate, which would be a nice feature since such rows can
have similar undesirable consequences.  True, the person writing the MapReduce job or the pig
script or whatever could deal with it at that level.  However, this is core enough, and could
be optional, so that people wouldn't have to check all over the place for keys without any
columns.
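
To make the idea concrete, here is a rough sketch of the filtering step in isolation.  It
does not use the real CFRR or Thrift types: rows are modeled as a plain map from row key to
column names, and the class name, method name, and skipEmptyRows flag are all hypothetical
stand-ins for whatever the job-level option would end up being called.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a simplified model of rows coming back from a range slice.
// None of these names exist in Cassandra; they are assumptions for this sketch.
public class RangeGhostFilter {

    /**
     * Returns the rows in their original order. When skipEmptyRows is true,
     * rows whose column list is empty (range ghosts, or rows with no data for
     * the predicate) are dropped; when false, every row is passed through.
     */
    public static Map<String, List<String>> filterRows(
            Map<String, List<String>> rows, boolean skipEmptyRows) {
        Map<String, List<String>> result = new LinkedHashMap<String, List<String>>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            if (skipEmptyRows && row.getValue().isEmpty()) {
                continue; // key came back with no columns: skip it
            }
            result.put(row.getKey(), row.getValue());
        }
        return result;
    }
}
```

With the flag off, the behavior would be unchanged from today, so only jobs that opt in
would pay for the extra check.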

Would such an option be okay to add to the hadoop config and to the CFRR?
