incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject NULL values in sort cache files
Date Thu, 30 Apr 2009 22:49:33 GMT

We're currently planning on dividing up sort caches for string values using
three files.  Here's how I have things set up in the KS prototype:

  * seg_XXX/sort-YYY.ord => ordinals
  * seg_XXX/sort-YYY.ix  => offsets
  * seg_XXX/sort-YYY.dat => character data

The offsets file is written as a stack of big-endian 64-bit integer file
pointers -- one more than the total number of unique values, with the last
file pointer being the end of the file.  That allows us to determine the byte
length of a value's character data by looking forward to the next offset.  

  i64_t offset      = Math_decode_bigend_u64(self->offsets + ord);
  i64_t next_offset = Math_decode_bigend_u64(self->offsets + ord + 1);
  i64_t len         = next_offset - offset;

However, we have a problem: there is no natural way to indicate NULL values
within this format.  Once we allow custom sort orders, we have to be ready for
NULLs that occur at the beginning, the end, or anywhere in the middle of the
sorted array of unique values. 

I propose that we indicate a NULL using a sentinel value of -1 in the offsets
file -- "all bits on", in two's complement binary.

  i64_t offset = Math_decode_bigend_u64(self->offsets + ord);
  if (offset == -1) { return NULL; }

However, this sentinel will cause an incorrect length calculation if it
appears a "next_offset" without further intervention. I think the way to
handle this is to scan forward looking for "next_offset" until we find a valid

    u32_t next_ord = ord + 1;
    i64_t next_offset;
    while (NULL_SENTINEL == (next_offset = SI_fetch_offset(self, next_ord))) {

Since the last value in the offsets file is the length of the character data
file, we're guaranteed to find a valid next_offset.

Full code for retrieving a string value from a sort cache is below.

Marvin Humphrey


static INLINE i64_t
SI_fetch_offset(SortCache *self, u32_t ord)
    i64_t *const offsets = self->offsets + ord;
    if (offsets >= self->offsets_limit) {
        THROW("Ordinal %u32 out of bounds (num unique: %i32)", ord,
    return (i64_t)Math_decode_bigend_u64(offsets);

#define NULL_SENTINEL -1 

SortCache_value(SortCache *self, i32_t doc_num, ViewCharBuf *value)
    u32_t  ord    = SortCache_Ordinal(self, doc_num);
    i64_t  offset = SI_fetch_offset(self, ord);
    if (offset == NULL_SENTINEL) { 
        return NULL; 
    else {
        u32_t next_ord = ord + 1;
        i64_t next_offset;
        while (NULL_SENTINEL == (next_offset = SI_fetch_offset(self, next_ord))) {
            i64_t  len = next_offset - offset;
            char  *end = self->char_data + next_offset;
            if (end > self->char_data_limit) {
                i64_t over = end - self->char_data_limit;
                THROW("Read %i64 beyond char data limit", over);
            ViewCB_Assign_Str(value, self->char_data + offset, len);
    return value;

View raw message