cassandra-user mailing list archives

From Sam Overton <...@acunu.com>
Subject Re: wildcards as both ends
Date Mon, 25 Jun 2012 16:36:48 GMT
Hi Sam,

On 20 June 2012 15:20, Sam Z J <sammyjiang721@gmail.com> wrote:
>
> - for each string I have, index all the prefixes in a column family, e.g.
> for string 'string', I'd have rows string, strin, stri, str, st, s, with
> column values somehow pointing back as row keys. This almost blows up the
> storage needed =/ (also, what do I do if I hit the 2billion row width limit?
> is there a way to say 'insert into another row if the current one is full'?)

It's actually not the prefixes that you want to store, it's the suffixes.

Searching for prefixes is naturally possible in Cassandra with normal
range queries, e.g. if you want to search for "str*" then a range query
from "str" to "sts" would find all values starting with "str". Range
queries are inclusive at both ends, so you would need to strip off any
exact match for "sts" at the end of the results.
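
As a rough illustration of the range bounds described above, here is a small
Python sketch that stands in for Cassandra with a sorted in-memory list. The
names prefix_upper_bound and prefix_search are hypothetical, and it assumes a
non-empty prefix whose last character can be incremented, and that the store
orders values the same way Python compares strings. It uses a half-open range,
so the exact match for "sts" is excluded rather than stripped afterwards.

```python
from bisect import bisect_left


def prefix_upper_bound(prefix):
    """Return the exclusive end of the range: the prefix with its last character bumped."""
    # Assumption: prefix is non-empty and its last character is not the maximum code point.
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def prefix_search(sorted_values, prefix):
    """Return every value in the sorted list that starts with prefix."""
    lo = bisect_left(sorted_values, prefix)
    # bisect_left on the upper bound makes the range half-open, so an exact
    # match for the bound itself (e.g. "sts") is never returned.
    hi = bisect_left(sorted_values, prefix_upper_bound(prefix))
    return sorted_values[lo:hi]


values = sorted(["stone", "str", "strap", "string", "sts", "sty"])
matches = prefix_search(values, "str")
```

Here matches contains "str", "strap" and "string", while "sts" falls on the
excluded end of the range.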

To implement the wildcard at the other end you could store all
suffixes for a word. Suppose you stored these as a composite column in
the format "suffix:prefix", e.g. "string" would map to columns with the
names "string:", "tring:s", "ring:st", "ing:str" ... etc.

To search for "*tri*" you perform a range query from "tri:" to "trj:"
which would match "tring:s". It would also match "trium:a",
"trict:dis", "tric:me" (from "atrium", "district" and "metric"). Since
the prefix is stored as the second part of the column name it's easy to
map back to the original word.
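
A minimal sketch of the suffix scheme above, again using a sorted Python list
in place of a Cassandra row. The helper names (suffix_columns, build_index,
infix_search) are made up for illustration, and the column names are plain
"suffix:prefix" strings rather than real composite columns. It assumes words
contain only letters, so that they never include ":" and always sort after it.

```python
from bisect import bisect_left


def suffix_columns(word):
    """Return one "suffix:prefix" column name per suffix of the word."""
    return [f"{word[i:]}:{word[:i]}" for i in range(len(word))]


def build_index(words):
    """Return all column names for all words, sorted like column names in a row."""
    return sorted(col for w in words for col in suffix_columns(w))


def infix_search(index, term):
    """Return the words containing term, using a range scan from "term:" to the bumped "term:"."""
    start = term + ":"
    # Assumption: term is non-empty and its last character can be incremented.
    end = term[:-1] + chr(ord(term[-1]) + 1) + ":"
    lo = bisect_left(index, start)
    hi = bisect_left(index, end)  # half-open: the end bound itself is excluded
    words = set()
    for col in index[lo:hi]:
        suffix, _, prefix = col.partition(":")
        words.add(prefix + suffix)  # prefix + suffix rebuilds the original word
    return sorted(words)


index = build_index(["string", "atrium", "district", "metric", "stone"])
found = infix_search(index, "tri")
```

With these five words, found is the four words containing "tri"; "stone" is
not in the scanned range. Collecting results in a set also removes duplicates
when a word contains the search term more than once.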

Your data blow-up will be of the order of the average word length,
since each word is stored once per suffix. Increased storage and write
load is a standard trade-off when denormalising to gain some
advantage/flexibility at query time.

Regards,


--
Sam Overton
Acunu | http://www.acunu.com | @acunu
