incubator-cassandra-user mailing list archives

From Tobias Jungen <tobias.jun...@gmail.com>
Subject Re: BinaryMemtable and collisions
Date Sat, 08 May 2010 05:17:20 GMT
Without going into too much depth: Our retrieval model is a bit more
structured than standard Lucene retrieval, and I'm trying to leverage that
structure. Some of the terms we're going to retrieve against have high
occurrence, and because of that I'm worried about getting killed by
processing large term vectors. Instead I'm trying to index on term
relationships, if that makes sense.

On Sat, May 8, 2010 at 12:09 AM, Jake Luciani <jakers@gmail.com> wrote:

> Any reason why you aren't using Lucandra directly?
>
>
> On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen <tobias.jungen@gmail.com> wrote:
>
>> Greetings,
>>
>> Started getting my feet wet with Cassandra in earnest this week. I'm
>> building a custom inverted index of sorts on top of Cassandra, in part
>> inspired by the work of Jake Luciani in Lucandra. I've successfully loaded
>> nearly a million documents over a 3-node cluster, and initial query tests
>> look promising.
>>
>> The problem is that our target use case has hundreds of millions of
>> documents (each document is very small, however). Loading time will be an
>> important factor. I've investigated using the BinaryMemtable interface (as
>> found in contrib/bmt_example) to speed up bulk insertion. I have a prototype
>> up that successfully inserts data using BMT, but there is a problem.
>>
>> If I perform multiple writes for the same row key & column family, the row
>> ends up containing only one of the writes. I'm guessing this is because with
>> BMT I need to group all writes for a given row key & column family into one
>> operation, rather than doing it incrementally as is possible with the Thrift
>> interface. Hadoop obviously is the solution for doing such a grouping.
>> Unfortunately, we can't perform such a process over our entire dataset; we
>> will need to do it in increments.
>>
>> So my question is: If I properly flush every node after performing a
>> larger bulk insert, can Cassandra merge multiple writes on a single row &
>> column family when using the BMT interface? Or is using BMT only feasible
>> for loading data on rows that don't exist yet?
>>
>> Thanks in advance,
>> Toby Jungen
>>
>>
>>
>>
>
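
The grouping step described in the quoted message (collecting every write for a given row key and column family into one operation before handing it to BMT) could look roughly like the sketch below. It is only an illustration of the grouping itself: it does not call any Cassandra or BMT API, and the names group_writes, "Postings", and the "term:..." row keys are hypothetical. It also says nothing about whether Cassandra merges rows written by separate BMT loads.

```python
from collections import defaultdict


def group_writes(writes):
    """Collect columns per (row_key, column_family) pair.

    `writes` is an iterable of (row_key, column_family, column_name, value)
    tuples. The result maps each (row_key, column_family) pair to a dict of
    its columns, so each pair can be submitted as one operation.
    """
    grouped = defaultdict(dict)
    for row_key, column_family, column_name, value in writes:
        # A later write to the same column name replaces the earlier one.
        grouped[(row_key, column_family)][column_name] = value
    return dict(grouped)


# Hypothetical increment of writes: two of them target the same row.
writes = [
    ("term:apple", "Postings", "doc1", b"\x01"),
    ("term:pear", "Postings", "doc2", b"\x01"),
    ("term:apple", "Postings", "doc3", b"\x01"),
]

batches = group_writes(writes)
for (row_key, column_family), columns in batches.items():
    # In a real loader, this is where one grouped row would be serialized
    # and sent; here the grouped result is only printed.
    print(row_key, column_family, sorted(columns))
```

Within a single increment, this guarantees that each row key and column family pair is emitted once. Rows that recur across increments remain subject to the question asked in the message.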
