incubator-cassandra-user mailing list archives

From Jake Luciani <jak...@gmail.com>
Subject Re: BinaryMemtable and collisions
Date Sat, 08 May 2010 05:09:12 GMT
Any reason why you aren't using Lucandra directly?

On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen <tobias.jungen@gmail.com> wrote:

> Greetings,
>
> Started getting my feet wet with Cassandra in earnest this week. I'm
> building a custom inverted index of sorts on top of Cassandra, in part
> inspired by the work of Jake Luciani in Lucandra. I've successfully loaded
> nearly a million documents over a 3-node cluster, and initial query tests
> look promising.
>
> The problem is that our target use case has hundreds of millions of
> documents (each document is very small, however). Loading time will be an
> important factor. I've investigated using the BinaryMemtable interface (as
> found in contrib/bmt_example) to speed up bulk insertion. I have a prototype
> up that successfully inserts data using BMT, but there is a problem.
>
> If I perform multiple writes for the same row key & column family, the row
> ends up containing only one of the writes. I'm guessing this is because with
> BMT I need to group all writes for a given row key & column family into one
> operation, rather than doing it incrementally as is possible with the thrift
> interface. Hadoop obviously is the solution for doing such a grouping.
> Unfortunately, we can't perform such a process over our entire dataset; we
> will need to do it in increments.
>
> So my question is: If I properly flush every node after performing a larger
> bulk insert, can Cassandra merge multiple writes on a single row & column
> family when using the BMT interface? Or is using BMT only feasible for
> loading data on rows that don't exist yet?
>
> Thanks in advance,
> Toby Jungen
>
>
>
>
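
The grouping step described in the quoted message (collecting every write for a given row key and column family into a single operation before it is handed to the bulk loader) can be sketched roughly as follows. This is only an illustration in plain Python, not the BinaryMemtable API: the function name `group_writes`, the tuple layout, and the sample keys are all hypothetical, and the sketch says nothing about whether Cassandra merges BMT writes across flushes.

```python
from collections import defaultdict


def group_writes(writes):
    """Group (row_key, column_family, column, value) tuples.

    Returns a dict keyed by (row_key, column_family); each value is a
    dict of column -> value, i.e. one combined write per row and
    column family. Hypothetical helper, not part of any Cassandra API.
    """
    grouped = defaultdict(dict)
    for row_key, column_family, column, value in writes:
        # In this sketch a later write to the same column replaces the earlier one.
        grouped[(row_key, column_family)][column] = value
    return dict(grouped)


# Hypothetical inverted-index postings: term row -> document columns.
writes = [
    ("term:cassandra", "Index", "doc1", b"\x01"),
    ("term:cassandra", "Index", "doc2", b"\x01"),
    ("term:lucene", "Index", "doc1", b"\x01"),
]
batches = group_writes(writes)
```

Each entry in `batches` would then correspond to one bulk write for that row key and column family, rather than several incremental writes to the same row.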
