cassandra-user mailing list archives

From James Carman <>
Subject Re: Large primary keys
Date Mon, 11 Apr 2016 23:12:11 GMT
S3 maybe?
On Mon, Apr 11, 2016 at 7:05 PM Robert Wille <> wrote:

> I do realize it's kind of a weird use case, but it is legitimate. I have a
> collection of documents that I need to index, and I want to perform entity
> extraction on them and give the extracted entities special treatment in my
> full-text index. Because entity extraction costs money, and each document
> will end up being indexed multiple times, I want to cache the extracted
> entities in Cassandra. The document text is the obvious key to retrieve
> entities from the cache. If I use the document ID, then I have to track
> timestamps. I know that sounds like a simple workaround, but I’m presenting
> a much-simplified view of my actual data model.
>
> The reason for needing the text in the table, and not just a digest, is
> that sometimes entity extraction has to be deferred due to license
> limitations. In those cases, the entity extraction occurs in a background
> process, and the entities will be included in the index the next time the
> document is indexed.
>
> I will use a digest as the key. I suspected that would be the answer, but
> it's good to get confirmation.
>
> Robert
> On Apr 11, 2016, at 4:36 PM, Jan Kesten <> wrote:
> > Hi Robert,
> >
> > why do you need the actual text as a key? It sounds a bit unnatural, at
> > least to me. Keep in mind that you cannot do "like" queries on keys in
> > Cassandra. For performance and readability I would prefer hashing your
> > text and using the hash as the key.
> >
> > You should also consider storing the keys (hashes) in a separate table
> > per day / hour or something like that, so you can quickly get all keys
> > for a time range. A query without the partition key may be very slow.
> >
> > Jan
> >
> > On 11.04.2016 at 23:43, Robert Wille wrote:
> >> I have a need to be able to use the text of a document as the primary
> >> key in a table. These texts are usually less than 1 KB, but can
> >> sometimes be tens of kilobytes in size. Would it be better to use a
> >> digest of the text as the key? I have a background process that will
> >> occasionally need to do a full table scan and retrieve all of the texts,
> >> so using the digest doesn’t eliminate the need to store the text.
> >> Anyway, is it better to keep primary keys small, or is C* okay with
> >> large primary keys?
> >>
> >> Robert
> >>
> >
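For readers of the archive, the approach the thread settles on (a digest of the text as the key, the full text kept as a regular column, and a time-bucketed table of keys) might look roughly like the sketch below. Nothing in it comes from the original posters: the choice of SHA-256, the table names `entity_cache` and `digests_by_day`, the column names, and the helper functions are all assumptions made for illustration, and a single table partitioned by day is only one reading of "a separate table per day / hour".

```python
import hashlib
from datetime import datetime, timezone


def text_digest(text: str) -> str:
    """Return a fixed-size (64 hex chars) SHA-256 digest of the document text.

    The digest stays small no matter how large the text is, so it can serve
    as the primary key while the text itself is stored as a normal column.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def day_bucket(ts: datetime) -> str:
    """Return a day-granularity bucket (UTC) for a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


# Hypothetical CQL schema (all names are assumptions, not from the thread).
# entity_cache keeps the text as a column so a background job can still read
# it; digests_by_day lets a time-range lookup hit one partition per day.
SCHEMA = """
CREATE TABLE entity_cache (
    digest   text PRIMARY KEY,
    doc_text text,
    entities text
);

CREATE TABLE digests_by_day (
    day    text,
    digest text,
    PRIMARY KEY (day, digest)
);
"""

if __name__ == "__main__":
    doc = "Some document text to be indexed."
    now = datetime.now(timezone.utc)
    print(text_digest(doc), day_bucket(now))
```

The text column remains necessary because, as noted in the thread, deferred entity extraction and the occasional full table scan both need the original text, which a digest cannot reproduce.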
