Subject: Re: BinaryMemtable and collisions
From: Jake Luciani <jakers@gmail.com>
To: user@cassandra.apache.org
Date: Sat, 8 May 2010 01:09:12 -0400

Any reason why you aren't using Lucandra directly?

On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen <tobias.jungen@gmail.com> wrote:

> Greetings,
>
> I started getting my feet wet with Cassandra in earnest this week. I'm
> building a custom inverted index of sorts on top of Cassandra, in part
> inspired by Jake Luciani's work on Lucandra. I've successfully loaded
> nearly a million documents over a 3-node cluster, and initial query tests
> look promising.
>
> The problem is that our target use case has hundreds of millions of
> documents (each document is very small, however). Loading time will be an
> important factor. I've investigated using the BinaryMemtable interface
> (as found in contrib/bmt_example) to speed up bulk insertion. I have a
> prototype up that successfully inserts data using BMT, but there is a
> problem.
>
> If I perform multiple writes for the same row key & column family, the
> row ends up containing only one of the writes. I'm guessing this is
> because with BMT I need to group all writes for a given row key & column
> family into one operation, rather than doing them incrementally as is
> possible with the Thrift interface. Hadoop is the obvious tool for doing
> such a grouping. Unfortunately, we can't run such a job over our entire
> dataset at once; we will need to do it in increments.
>
> So my question is: if I properly flush every node after performing a
> larger bulk insert, can Cassandra merge multiple writes to a single row
> & column family when using the BMT interface? Or is BMT only feasible
> for loading data into rows that don't exist yet?
>
> Thanks in advance,
> Toby Jungen
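To make the grouping requirement concrete, here is a minimal Java sketch of the kind of batching Toby describes: collect every column destined for a given row key into one bucket, then issue a single binary mutation per row. The sendBinaryMutation helper is hypothetical, standing in for the RowMutation/message-building code in contrib/bmt_example; everything else uses only standard java.util collections.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BmtGroupingSketch {

    // One column write headed for the BinaryMemtable path.
    static final class ColumnWrite {
        final String rowKey;      // e.g. a term in the inverted index
        final String columnName;  // e.g. a document id
        final byte[] value;

        ColumnWrite(String rowKey, String columnName, byte[] value) {
            this.rowKey = rowKey;
            this.columnName = columnName;
            this.value = value;
        }
    }

    // BMT replaces rather than merges: a row split across two binary
    // mutations keeps only one of them. So bucket all writes by row key
    // and emit exactly one mutation per row.
    static Map<String, List<ColumnWrite>> groupByRow(List<ColumnWrite> writes) {
        Map<String, List<ColumnWrite>> byRow = new HashMap<>();
        for (ColumnWrite w : writes) {
            byRow.computeIfAbsent(w.rowKey, k -> new ArrayList<>()).add(w);
        }
        return byRow;
    }

    // Hypothetical stand-in for the contrib/bmt_example code that
    // serializes a ColumnFamily and ships it to each replica.
    static void sendBinaryMutation(String rowKey, List<ColumnWrite> columns) {
        System.out.println("row " + rowKey + ": " + columns.size()
                + " columns in one mutation");
    }

    public static void main(String[] args) {
        List<ColumnWrite> writes = new ArrayList<>();
        writes.add(new ColumnWrite("cassandra", "doc-17", new byte[0]));
        writes.add(new ColumnWrite("cassandra", "doc-42", new byte[0]));
        writes.add(new ColumnWrite("lucandra", "doc-17", new byte[0]));

        for (Map.Entry<String, List<ColumnWrite>> e : groupByRow(writes).entrySet()) {
            sendBinaryMutation(e.getKey(), e.getValue());
        }
    }
}

Run within a Hadoop job, the reduce phase performs exactly this bucketing: the row key serves as the reduce key, so every column for a row arrives at one reducer and can be sent as a single mutation.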