From: Utku Can Topçu
Date: Tue, 11 May 2010 20:39:29 +0200
Subject: Inverted Indexing a ColumnFamily
To: user@cassandra.apache.org

Hello All,

I guess the subject speaks for itself.
I'm currently developing a document analysis engine that uses Cassandra as its scalable storage.

I'd like to give a brief overview of the data model I'm using for this purpose.

The row key has the format timestamp.random(), so rows are sorted in chronological order and I get range queries on timestamps out of the box.

But I still need to index some values.

I started testing with three types of fields in the Document ColumnFamily:

- fields containing text (several words): every word is an index term
- fields containing positive integers: the zero-padded integer is the index term
- fields containing an enumeration: the value itself is the index term

For indexing I used another ColumnFamily called IndexCF. Its key has the format "field_name||index_term", and its values are references to the keys in the Documents ColumnFamily.

While searching for projects related to indexing in Cassandra, I came across Lucandra (http://github.com/tjake/Lucandra). I've been running tests with it since then for indexing these types of columns; it basically uses a similar approach.

Lucandra works fine for indexing columns that contain text values and zero-padded integers, and range queries on integers work fine as well.

However, indexing enumerations is a really big problem. Say we have 1M documents with a type field that can take 4 values (book, magazine, newspaper, other). Assuming the values are distributed equally, each "field_name||index_term" pair would have 250K related documents. When we index with this distribution, we end up with only 4 index keys, each containing 250K columns. This basically means it is not reasonable to index and search on enumeration fields this way.

I wrote all of this in a hurry; I hope I was able to express what I'm opening for discussion. Can you think of a better implementation for indexing enumerations in Cassandra?

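To make the layout above concrete, here is a minimal Python sketch of the key scheme and of the row-width issue with enumeration fields. It is only an illustration under stated assumptions: plain dictionaries stand in for the Document ColumnFamily and IndexCF, no Cassandra client is involved, and the names make_doc_key, index_terms, put_document, PAD_WIDTH and the sample schema are hypothetical. It uses 1,000 documents instead of 1M so that it runs instantly.

```python
from collections import defaultdict
import random

# In-memory stand-ins for the two ColumnFamilies (assumption: not a real Cassandra client).
documents = {}                # document key -> {field name: value}
index_cf = defaultdict(dict)  # "field_name||index_term" -> {document key: ""}

PAD_WIDTH = 10  # assumed padding width for positive integers


def make_doc_key(ts_millis):
    """Build a key of the form timestamp.random(); fixed widths keep string order chronological."""
    return f"{ts_millis:013d}.{random.randrange(10**6):06d}"


def index_terms(field_type, value):
    """Return the index terms for one field value, following the three field types in the message."""
    if field_type == "text":
        return value.split()                   # every word is an index term
    if field_type == "int":
        return [str(value).zfill(PAD_WIDTH)]   # zero-padded integer is the index term
    if field_type == "enum":
        return [value]                         # the value itself is the index term
    raise ValueError(f"unknown field type: {field_type}")


def put_document(doc, schema, ts_millis):
    """Store a document and add one index column per (field, term) pair."""
    key = make_doc_key(ts_millis)
    documents[key] = doc
    for field, value in doc.items():
        for term in index_terms(schema[field], value):
            index_cf[f"{field}||{term}"][key] = ""
    return key


# Hypothetical sample data: 1,000 documents spread evenly over the four enumeration values.
SCHEMA = {"title": "text", "pages": "int", "type": "enum"}
TYPES = ["book", "magazine", "newspaper", "other"]
for i in range(1000):
    put_document(
        {"title": f"doc number {i}", "pages": i, "type": TYPES[i % 4]},
        SCHEMA,
        ts_millis=1273603189000 + i,
    )

# Width of each index row that belongs to the enumeration field.
enum_rows = {k: len(v) for k, v in index_cf.items() if k.startswith("type||")}
print(enum_rows)
```

With an even distribution, the enumeration field produces exactly four index rows, and each one holds a quarter of all document keys. Scaled to 1M documents, that is the 4 rows of 250K columns described above, whereas each zero-padded integer row in this sample holds a single column.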
Best Regards,
Utku