Date: Sat, 8 May 2010 00:17:20 -0500
Subject: Re: BinaryMemtable and collisions
From: Tobias Jungen <tobias.jungen@gmail.com>
To: user@cassandra.apache.org

Without going into too much depth: our retrieval model is more structured than standard Lucene retrieval, and I'm trying to leverage that structure. Some of the terms we're going to retrieve against occur very frequently, and because of that I'm worried about getting killed by processing large term vectors. Instead I'm trying to index on term relationships, if that makes sense.

On Sat, May 8, 2010 at 12:09 AM, Jake Luciani <jakers@gmail.com> wrote:
> Any reason why you aren't using Lucandra directly?
>
> On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen <tobias.jungen@gmail.com> wrote:
>> Greetings,
>>
>> I started getting my feet wet with Cassandra in earnest this week. I'm
>> building a custom inverted index of sorts on top of Cassandra, in part
>> inspired by Jake Luciani's work on Lucandra. I've successfully loaded
>> nearly a million documents onto a 3-node cluster, and initial query tests
>> look promising.
>>
>> The problem is that our target use case has hundreds of millions of
>> documents (though each document is very small), so loading time will be
>> an important factor. I've investigated the BinaryMemtable interface (as
>> found in contrib/bmt_example) to speed up bulk insertion, and I have a
>> prototype that successfully inserts data using BMT, but there is a
>> problem.
>>
>> If I perform multiple writes for the same row key & column family, the
>> row ends up containing only one of the writes. I'm guessing this is
>> because with BMT I need to group all writes for a given row key & column
>> family into one operation, rather than applying them incrementally as is
>> possible with the Thrift interface. Hadoop is the obvious tool for doing
>> such a grouping. Unfortunately, we can't run that process over our
>> entire dataset at once; we will need to load in increments.
>>
>> So my question is: if I properly flush every node after performing a
>> larger bulk insert, can Cassandra merge multiple writes to a single row &
>> column family when using the BMT interface? Or is BMT only feasible for
>> loading rows that don't exist yet?
>>
>> Thanks in advance,
>> Toby Jungen
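[Editor's note: the grouping constraint described above — collapsing incremental writes into a single operation per row key before submission — can be sketched as below. This is a minimal illustration with hypothetical names, not Cassandra's actual BMT API; only the group-by-row-key logic is shown.]

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: BMT expects all columns for a (row key, column family) to arrive
// in one operation, so a stream of incremental writes must first be
// collapsed into one column map per row. Names here are illustrative.
public class BmtRowGrouper {

    // One pending write: a column name/value destined for some row.
    record Write(String rowKey, String columnName, byte[] value) {}

    // Collapse incremental writes into a single column map per row key.
    // For duplicate column names, the last write wins.
    static Map<String, Map<String, byte[]>> groupByRow(List<Write> writes) {
        Map<String, Map<String, byte[]>> rows = new LinkedHashMap<>();
        for (Write w : writes) {
            rows.computeIfAbsent(w.rowKey(), k -> new LinkedHashMap<>())
                .put(w.columnName(), w.value());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Write> writes = List.of(
            new Write("doc1", "term:cassandra", "1".getBytes()),
            new Write("doc1", "term:lucene", "1".getBytes()),
            new Write("doc2", "term:index", "1".getBytes()));
        Map<String, Map<String, byte[]>> rows = groupByRow(writes);
        // Each map entry now represents one submission for that row,
        // rather than three separate incremental writes.
        System.out.println(rows.keySet());           // prints [doc1, doc2]
        System.out.println(rows.get("doc1").size()); // prints 2
    }
}
```

In a real pipeline this grouping is what a Hadoop reduce phase provides for free (shuffle by row key); the open question in the thread is whether flushing between incremental batches lets Cassandra itself merge rows written in separate BMT operations.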
