Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
From: Utku Can Topçu <utku@topcu.gen.tr>
Date: Tue, 11 May 2010 20:39:29 +0200
Message-ID: <AANLkTinY2zR4NLcYqnKSBuTGDF_3hD2x_i0rOW4_W5mG@mail.gmail.com>
Subject: Inverted Indexing a ColumnFamily
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0016e6db2b39b0f079048655d9bf

--0016e6db2b39b0f079048655d9bf
Content-Type: text/plain; charset=UTF-8

Hello All,

I guess the subject speaks for itself.
I'm currently developing a document analysis engine that uses Cassandra as
its scalable storage.

I'd like to give a brief overview of the data model I'm using for this
purpose.

"The key" has the format timestamp.random(), so the keys are sorted in
chronological order and I get range queries on timestamps out of the box.
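
To make the key scheme concrete, here is a minimal sketch in Python. The
fixed widths, the millisecond resolution and the function names are my own
assumptions for illustration, not part of any Cassandra API, and key-ordered
range scans additionally assume an order-preserving partitioner.

```python
import random
import time


def make_document_key(timestamp_ms=None, rng=random):
    """Build a key of the form '<timestamp>.<random>' (hypothetical layout)."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # Zero-pad both parts to a fixed width so that string order matches
    # numeric (and therefore chronological) order.
    suffix = rng.randrange(10**6)
    return "%013d.%06d" % (timestamp_ms, suffix)


def keys_in_range(sorted_keys, start_ms, end_ms):
    """Return the keys whose timestamp part lies in [start_ms, end_ms)."""
    lo = "%013d" % start_ms
    hi = "%013d" % end_ms
    return [k for k in sorted_keys if lo <= k < hi]
```

The random suffix only serves to keep two documents written in the same
millisecond from colliding; the timestamp prefix alone decides the order.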

But I still need to index some values:

I started testing with three types of fields in the Document ColumnFamily:

- fields containing text (several words): every word is an index term
- fields containing positive integers: the zero-padded integer is the index
term
- fields containing an enumeration: the value itself is the index term
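
The three rules above can be sketched roughly as follows. The type labels
'text', 'int' and 'enum', the padding width of 10 and the lowercasing of
words are assumptions made for this example only.

```python
import re


def index_terms(field_type, value, width=10):
    """Return the index terms for one field value.

    field_type is one of 'text', 'int', 'enum' (labels chosen for this sketch).
    """
    if field_type == "text":
        # Every word is an index term; lowercase and de-duplicate (assumption).
        return sorted(set(re.findall(r"\w+", value.lower())))
    if field_type == "int":
        if value < 0:
            raise ValueError("only non-negative integers are indexed")
        # Zero-padded so that string order matches numeric order.
        return [str(value).zfill(width)]
    if field_type == "enum":
        # The value itself is the index term.
        return [value]
    raise ValueError("unknown field type: %r" % field_type)
```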

For indexing purposes I used another ColumnFamily called IndexCF; its key
has the format "field_name||index_term", and the values are references to
the keys in the Documents ColumnFamily.
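
As an illustration of this layout, here is a small in-memory stand-in for
IndexCF. The class and method names are hypothetical, and a plain dict of
sets replaces the real storage, so this only shows the shape of the data.

```python
from collections import defaultdict

SEPARATOR = "||"


def index_key(field_name, index_term):
    """Key of an index row: 'field_name||index_term'."""
    return field_name + SEPARATOR + index_term


class InMemoryIndex:
    """Stand-in for IndexCF: index key -> set of referenced document keys."""

    def __init__(self):
        self.rows = defaultdict(set)

    def add(self, doc_key, field_name, terms):
        # One reference per (term, document) pair.
        for term in terms:
            self.rows[index_key(field_name, term)].add(doc_key)

    def lookup(self, field_name, term):
        # Returned sorted, so timestamp-prefixed keys come back in time order.
        return sorted(self.rows.get(index_key(field_name, term), ()))
```

For example, after idx.add("0001.a", "title", ["cassandra"]), calling
idx.lookup("title", "cassandra") returns ["0001.a"].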

After searching for projects related to indexing in Cassandra, I came across
Lucandra.

Since then I've been running tests with Lucandra (
http://github.com/tjake/Lucandra) for indexing those types of columns; it
basically uses a similar approach.
Lucandra works fine for indexing columns containing text values and
zero-padded integers, and range queries on integers work fine too.
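
The reason zero padding makes integer range queries possible is that it
makes string order agree with numeric order. A minimal sketch, independent of
Lucandra, with a padding width of 10 and helper names that are assumptions of
this example:

```python
import bisect


def pad(n, width=10):
    """Zero-pad a non-negative integer to a fixed width."""
    return str(n).zfill(width)


def range_query(sorted_terms, low, high, width=10):
    """Return the terms t with low <= int(t) <= high, comparing strings only."""
    start = bisect.bisect_left(sorted_terms, pad(low, width))
    stop = bisect.bisect_right(sorted_terms, pad(high, width))
    return sorted_terms[start:stop]
```

Without padding, "1000" sorts before "40" as a string, so a scan between two
terms would return the wrong values.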

However, indexing the enumeration fields is a really big problem.
Say we have 1M documents with a type field that can take 4 values (book,
magazine, newspaper, other). Assuming the values are distributed equally,
each "field_name||index_term" pair would have 250K related documents. If we
index with respect to this distribution, we'll end up with only 4 index
keys, each of them containing 250K columns. This basically means it's not
reasonable to index and search on the enumeration fields.
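
The arithmetic can be checked with a few lines of Python. This only counts
how many references each index row would hold for the example above; the
function name and the "type||value" key strings are illustrative.

```python
from collections import Counter


def row_widths(type_values):
    """Count how many document references each 'type||value' row would hold."""
    return Counter("type||" + v for v in type_values)


# 1M documents spread evenly over the four enumeration values.
values = ["book", "magazine", "newspaper", "other"]
docs = (values[i % len(values)] for i in range(1000000))
widths = row_widths(docs)
print(len(widths), widths["type||book"])  # prints: 4 250000
```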

I wrote all this in a hurry; I hope I managed to express what I'm opening
up for discussion. Can you think of a better implementation for indexing
enumerations in Cassandra?

Best Regards,
Utku

--0016e6db2b39b0f079048655d9bf--