cassandra-user mailing list archives

From Robert Wille <>
Subject Re: Large primary keys
Date Mon, 11 Apr 2016 23:05:16 GMT
I do realize it's kind of a weird use case, but it is legitimate. I have a collection of documents
that I need to index, and I want to perform entity extraction on them and give the extracted
entities special treatment in my full-text index. Because entity extraction costs money, and
each document will end up being indexed multiple times, I want to cache them in Cassandra.
The document text is the obvious key to retrieve entities from the cache. If I use the document
ID, then I have to track timestamps. I know that sounds like a simple workaround, but I’m
presenting a much-simplified view of my actual data model.

The reason for needing the text in the table, and not just a digest, is that sometimes entity
extraction has to be deferred due to license limitations. In those cases, the entity extraction
occurs on a background process, and the entities will be included in the index the next time
the document is indexed.
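
As a rough sketch of the digest-keyed cache that still keeps the full text, something like the following could work. The SHA-256 choice, the function names, and the in-memory dict standing in for the Cassandra table are all assumptions for illustration, not my actual schema. A row with no entities represents a document whose extraction was deferred.

```python
import hashlib

def digest_key(text):
    """Return a fixed-size (32-byte) SHA-256 digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).digest()

# Hypothetical in-memory stand-in for the cache table:
# digest -> (full text, entities or None if extraction is deferred)
cache = {}

def put(text, entities=None):
    """Store the text under its digest; entities=None marks deferred extraction."""
    cache[digest_key(text)] = (text, entities)

def get(text):
    """Return cached entities, or None on a miss or a deferred extraction."""
    row = cache.get(digest_key(text))
    if row is None or row[0] != text:  # compare stored text to guard against collisions
        return None
    return row[1]

def pending_texts():
    """Full scan for the background process: texts still awaiting extraction."""
    return [text for text, entities in cache.values() if entities is None]
```

The key stays 32 bytes no matter how large the document is, and the background process can still recover every text from the stored column.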

I will use a digest as the key. I suspected that would be the answer, but it's good to get


On Apr 11, 2016, at 4:36 PM, Jan Kesten <> wrote:

> Hi Robert,
> why do you need the actual text as a key? It sounds a bit unnatural, at least to me. Keep
in mind that you cannot do "like" queries on keys in Cassandra. For performance and to keep
things more readable, I would prefer hashing your text and using the hash as the key.
> You should also consider storing the keys (hashes) in a separate table per
day / hour or something like that, so you can quickly get all keys for a time range. A query
without the partition key may be very slow.
> Jan
> On 11.04.2016 at 23:43, Robert Wille wrote:
>> I have a need to be able to use the text of a document as the primary key in a table.
These texts are usually less than 1 KB, but can sometimes be tens of kilobytes in size. Would
it be better to use a digest of the text as the key? I have a background process that will
occasionally need to do a full table scan and retrieve all of the texts, so using the digest
doesn’t eliminate the need to store the text. Anyway, is it better to keep primary keys
small, or is C* okay with large primary keys?
>> Robert
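
The suggestion in the quoted reply above, keeping the hashes in a separate table bucketed by day or hour, can be sketched roughly as follows. The bucket format, the function names, and the dict standing in for the second table are assumptions made for illustration; timestamps are assumed to be timezone-aware.

```python
from collections import defaultdict
from datetime import datetime, timezone

def hour_bucket(ts):
    """Map a timezone-aware timestamp to an hourly bucket string in UTC."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d-%H")

# Hypothetical stand-in for the per-bucket key table:
# bucket (acts as the partition key) -> set of document digests
keys_by_bucket = defaultdict(set)

def record_key(digest, ts):
    """Register a digest under the bucket for the time it was written."""
    keys_by_bucket[hour_bucket(ts)].add(digest)

def keys_for(bucket):
    """Fetch all digests for one bucket without scanning the main table."""
    return set(keys_by_bucket.get(bucket, set()))
```

Reading a time range then means reading a known list of buckets, each of which is a lookup by partition key rather than a scan.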
