incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: Sort cache file format
Date Wed, 08 Apr 2009 17:43:04 GMT
On Wed, Apr 01, 2009 at 07:52:11AM -0400, Michael McCandless wrote:

> Can the same field name have more than one FieldSpec?  

Absolutely not.  I have always regarded that feature of Lucene as insane.
It's like an SQL engine where any INSERT can silently ALTER TABLE.

Unsurprisingly, the back-end code to support that insane feature is ginormous
and convoluted and has been the source of many bugs over the years.
LUCENE-1590 is only the latest.

> What's otherwise the difference?  

[regarding per-field files vs. per-FieldSpec files]

Lexicon objects in KinoSearch only iterate over one field's terms.
PostingLists are single-field, too -- you can't seek to a term in another

In the Lucene file format, each term in the term dictionary has to list a
field number.  That wastes space compared to the KinoSearch format, since in
KS there is only one field number per file and it is encoded within the file's

Furthermore, per-field lexicon and postings files are easier to troubleshoot.
You just look at the directory listings and you get a pretty good idea of
what's going on: files that are missing, that have zero length, etc, are big
red flags.

If you're using the compound file format, you don't have to worry about
running out of file descriptors, no matter how many virtual files you use --
because a single descriptor is shared by all of them.  Thus, the only
consequence of dividing up files per-FieldSpec rather than per-field is... to
make the index more opaque and harder to debug. :(  It's a step backwards.

However, I think Lucy will have to accept that loss of clarity in order to
support the non-compound format. :(

> > We'd like to be able to use the field's lexicon to obtain the field values.
> > However, since lexicons use prefix compression, we can't jump to a random
> > entry.  Therefore, we either need to change the lexicon format for sortable
> > fields so that it doesn't use prefix compression, or write redundant data.
> The requirement is that each doc has a single value, right?


> For numeric fields, can't you simply store the values instead of
> separate ord + values?  EG sorting float/long/doubles is very easy.
> Also for smaller types (byte, short) you can be much more compact with
> only the values.

Right now there are three core FieldSpec classes in KS:


Only "fulltext" fields are associated with Analyzers (and they are *always*
analyzed).  Only "string" fields -- which are always single-value since they
aren't analyzed -- allow sorting.

The plan in the near future is to add additional field types such as these:


I think the sort cache format for a given field would be dependent on the
FieldSpec class.  So would the encode/decode for the full document storage,
for that matter.

> Does/will Lucy have column-stride field storage?  

I think these sort caches should be either column-stride or variable width.
Which one would be most appropriate would be data dependent.  I don't know how
important it is to expose an API to support multiple sort cache types.  I
suppose that someone will need it eventually for maximum efficiency.

As I mentioned in this JIRA comment for Lucene column-stride fields this

... I think the only purpose for column-stride fields should be to aid search.

> > For pure Unicode character data, a two-file format would work well.
> >
> >   * A stack of 64-bit file offsets into the character data file.
> >   * Pure character data, with string lengths implied by the offset file.

> > Now it turns out that our offsets have a funny property: they sort in the same
> > order as the string data that they point to.  Because of that, we can
> > theoretically use them as our sort-ords, and do away with the stack of 32-bit
> > sort ords.
> Except, they take 2X the storage as int ords.

True.  For a field with a small number of unique values, the 32-bit ords are a win.

  // Less space with a small number of unique values:
  32-bit ord => 64-bit file pointer => character data

  // Less space with many unique values:
  64-bit file pointer => character data

However, I think that we should accomodate fields with few values using
dedicated enum field types, e.g. a "StringEnumField" that requires you to
declare all possible values in advance.

There's still an unoptimized case where a field has few values, but they
aren't all known in advance.  It would be nice if we could come up with a
field type for that.

Nevertheless, I think we cover a lot of cases if we support both an enum type
and a one-value-per-doc type.

Marvin Humphrey

View raw message