cassandra-user mailing list archives

From "James A. Robinson" <>
Subject Cassandra and UTF-8 BOM?
Date Tue, 29 Oct 2019 20:13:02 GMT
Hi folks,

I'm looking at a table that has a primary key defined as "publisher_id
text".  I've noticed some of the entries have what appears to me to be
a UTF-8 BOM marker and some do not.
The CQL documentation says text is a UTF-8 encoded string.  If I look
at the first 3 bytes of one of these columns:

$ dd if=~/tmp/ of=/dev/stdout bs=1 count=3 2>/dev/null | hexdump
0000000 bbef 00bf

When I swap the byte order:

$ dd if=~/tmp/ of=/dev/stdout bs=1 count=3 conv=swab 2>/dev/null | hexdump
0000000 efbb 00bf

And I think this matches the UTF-8 BOM.
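For what it's worth, a quick sanity check in Python (the byte values here are the same ones shown above; the sample value is a made-up stand-in for a real column) confirms that EF BB BF is the UTF-8 BOM, and that hexdump's default 16-bit little-endian word format is what makes it print as "bbef" before the swap:

```python
import codecs

# hexdump's default output groups bytes into little-endian 16-bit words,
# which is why the bytes EF BB BF print as "bbef 00bf" until conv=swab.
data = b"\xef\xbb\xbfabc"  # hypothetical column value with the prefix

assert data[:3] == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Decoding with utf-8-sig strips a leading BOM; plain utf-8 keeps it as
# U+FEFF, which is why the two keys also compare unequal as text.
print(repr(data.decode("utf-8-sig")))  # -> 'abc'
print(repr(data.decode("utf-8")))      # -> '\ufeffabc'
```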

However, not all the rows have this prefix, and I'm wondering if this
is a client issue (a client being inconsistent about how it handles
strings) or if Cassandra is doing something special on its own.  The
rest of each column value falls within the US-ASCII-compatible
codepoint range of UTF-8, e.g., something as simple as 'abc', but in
some cases it has this marker in front of it.

Cassandra is treating '<BOM>abc' as a distinct value from 'abc',
which certainly makes sense; for the sake of efficiency I assume it
just compares the byte-for-byte values without layering meaning on
top.  But that means I'll need to clean the data up to be consistent,
and I need to figure out how to prevent the BOM from being
reintroduced in the future.
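On the prevention side, one sketch, assuming the clients are Python and values arrive as str (the helper name is hypothetical), is to normalize keys before every write so a BOM never reaches the table:

```python
BOM = "\ufeff"  # a UTF-8 BOM decodes to U+FEFF when read as plain UTF-8

def normalize_publisher_id(value: str) -> str:
    """Strip a single leading BOM before using the value as a key.

    Hypothetical helper: call it on publisher_id before every
    INSERT/UPDATE so '<BOM>abc' and 'abc' collapse to one key.
    """
    return value[len(BOM):] if value.startswith(BOM) else value

assert normalize_publisher_id("\ufeffabc") == "abc"
assert normalize_publisher_id("abc") == "abc"
```

The same function can drive the one-time cleanup: read each key, and where the normalized form differs, write the row back under the stripped key and delete the old one.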

