Message-ID: <4BD473DF.1020406@gmail.com>
Date: Sun, 25 Apr 2010 17:54:55 +0100
From: TuX RaceR
To: user@cassandra.apache.org
Subject: newbie question on how columns names are indexed/lucene limitations?

Hello Cassandra Users,

When using the RandomPartitioner and a simple ColumnFamily/Columns layout (i.e. no SuperColumns), my understanding is that a single row can store millions of columns.

Looking at http://wiki.apache.org/cassandra/API, I understand that I can get a subset of those millions of columns using:
SlicePredicate->ColumnNames
or
SlicePredicate->SliceRange

My question is about how this column 'selection' is implemented. I vaguely remember reading somewhere (but I cannot find the link again) that it is implemented using a Lucene index over the column names of each row. Is that true? Is there a small Lucene index per row?

We also know that Lucene has some limitations (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations): you cannot index more than about 2.1 billion documents, because a document ID is mapped to a 32-bit int.
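
To make the question concrete, here is a toy sketch in plain Java. It is not Cassandra or Lucene code, and the class and method names (RowSliceSketch, byNames, byRange) are made up for illustration. It models one row as a sorted map of column names and shows the two kinds of selection mentioned above, next to the 32-bit bound behind the 2.1 billion figure. It assumes column names are 64-bit document IDs, which is only one possible encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: one row = one sorted map of column name -> value.
// This is NOT how Cassandra stores data and says nothing about Lucene.
class RowSliceSketch {
    // Hypothetical stand-in for the columns of a single row, ordered by name.
    final NavigableMap<Long, String> columns = new TreeMap<>();

    // Selection by explicit names (the SlicePredicate->ColumnNames case).
    List<String> byNames(List<Long> names) {
        List<String> out = new ArrayList<>();
        for (Long name : names) {
            String value = columns.get(name);
            if (value != null) {
                out.add(value);
            }
        }
        return out;
    }

    // Selection by range (the SlicePredicate->SliceRange case):
    // names in [start, finish], returning at most `count` values.
    List<String> byRange(long start, long finish, int count) {
        List<String> out = new ArrayList<>();
        for (String value : columns.subMap(start, true, finish, true).values()) {
            if (out.size() >= count) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        RowSliceSketch row = new RowSliceSketch();
        // Document IDs used as column names; deliberately above the 32-bit range.
        long big = 3_000_000_000L;
        row.columns.put(big, "doc-a");
        row.columns.put(big + 1, "doc-b");
        row.columns.put(big + 2, "doc-c");

        // Largest signed 32-bit value, i.e. the "2.1 billion" bound.
        System.out.println(Integer.MAX_VALUE);
        System.out.println(row.byNames(Arrays.asList(big + 1)));
        System.out.println(row.byRange(big, big + 2, 2));
    }
}
```

In this model the value of a column name can be far above 2,147,483,647 while the number of columns in the row stays small, which is the distinction the question below is about.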

Since I plan to store the IDs of my Cassandra documents in column names (the global number of documents can go well beyond 2.1 billion), will I be hit by the Lucene limitation? That is, can I store Cassandra document IDs (i.e. keys) in column names if each individual row holds no more than a few million of those IDs?

I guess the answer is "yes, I can", because Lucandra uses a similar schema, but it is not clear to me why. Is it because the Lucene index is built per row, so that what really matters is the number of columns in a single row and not the number of distinct column names (globally, over all rows)?

Thanks in advance,
TuX