incubator-cassandra-user mailing list archives

From Jake Luciani <jak...@gmail.com>
Subject Re: BinaryMemtable and collisions
Date Sat, 08 May 2010 05:22:27 GMT
Got it.  I'm working on making term vectors optional and storing just the
frequency in that case.  Just FYI.

On Sat, May 8, 2010 at 1:17 AM, Tobias Jungen <tobias.jungen@gmail.com> wrote:

> Without going into too much depth: Our retrieval model is a bit more
> structured than standard Lucene retrieval, and I'm trying to leverage that
> structure. Some of the terms we're going to retrieve against have high
> occurrence, and because of that I'm worried about getting killed by
> processing large term vectors. Instead I'm trying to index on term
> relationships, if that makes sense.
>
>
> On Sat, May 8, 2010 at 12:09 AM, Jake Luciani <jakers@gmail.com> wrote:
>
>> Any reason why you aren't using Lucandra directly?
>>
>>
>> On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen <tobias.jungen@gmail.com> wrote:
>>
>>> Greetings,
>>>
>>> Started getting my feet wet with Cassandra in earnest this week. I'm
>>> building a custom inverted index of sorts on top of Cassandra, in part
>>> inspired by the work of Jake Luciani in Lucandra. I've successfully loaded
>>> nearly a million documents over a 3-node cluster, and initial query tests
>>> look promising.
>>>
>>> The problem is that our target use case has hundreds of millions of
>>> documents (each document is very small, however). Loading time will be an
>>> important factor. I've investigated using the BinaryMemtable interface (as
>>> found in contrib/bmt_example) to speed up bulk insertion. I have a prototype
>>> up that successfully inserts data using BMT, but there is a problem.
>>>
>>> If I perform multiple writes for the same row key & column family, the
>>> row ends up containing only one of the writes. I'm guessing this is because
>>> with BMT I need to group all writes for a given row key & column family into
>>> one operation, rather than doing it incrementally as is possible with the
>>> Thrift interface. Hadoop is the obvious solution for doing such a
>>> grouping. Unfortunately, we can't run such a process over our entire
>>> dataset; we will need to do it in increments.
>>>
>>> So my question is: If I properly flush every node after performing a
>>> larger bulk insert, can Cassandra merge multiple writes on a single row &
>>> column family when using the BMT interface? Or is using BMT only feasible
>>> for loading data on rows that don't exist yet?
>>>
>>> Thanks in advance,
>>> Toby Jungen
>>>
>>>
>>>
>>>
>>
>
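
The grouping step described in the original question (collecting all writes
for a given row key and column family into one operation before handing them
to BMT) could be sketched roughly as below. This is only an illustration in
plain Python, not code from the thread or from contrib/bmt_example: the
function name, the tuple layout, and the sample keys are all hypothetical,
and no Cassandra API is called. It covers grouping within a single increment
only; whether rows written by separate BMT loads get merged is the open
question in the thread.

```python
from collections import defaultdict


def group_writes(writes):
    """Merge single-column writes into one column map per (row key, column family).

    `writes` is an iterable of (row_key, column_family, column, value) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    grouped = defaultdict(dict)
    for row_key, column_family, column, value in writes:
        # Within one batch, a later write to the same column replaces the earlier one.
        grouped[(row_key, column_family)][column] = value
    return dict(grouped)


# Hypothetical inverted-index writes: term rows, one column per document.
writes = [
    ("term:cassandra", "Terms", "doc1", b"\x01"),
    ("term:cassandra", "Terms", "doc2", b"\x01"),
    ("term:lucene", "Terms", "doc1", b"\x02"),
]

batch = group_writes(writes)

# Each (row key, column family) pair now appears once, ready to be
# serialized and sent as a single operation by whatever loader is used.
for (row_key, column_family), columns in batch.items():
    print(row_key, column_family, sorted(columns))
```

With this shape, each row key and column family pair is emitted exactly once
per increment, which is the property the question assumes BMT needs.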
