incubator-lucy-dev mailing list archives

From: Marvin Humphrey <>
Subject: Re: Sort cache file format
Date: Thu, 09 Apr 2009 17:38:46 GMT
On Thu, Apr 09, 2009 at 06:52:11AM -0400, Michael McCandless wrote:
> Ahhhhh... OK this finally clarifies my [silly] confusion: a FieldSpec
> corresponds to more than one field.  It's like FieldInfos in Lucene?

I wouldn't say so.  It's more like a single FieldInfo on steroids.

Here's the Schema API from KS svn trunk, translated to Java:

   Schema        schema     = new Schema();
   PolyAnalyzer  analyzer   = new PolyAnalyzer("en");
   FullTextField fulltext   = new FullTextField(analyzer);
   StringField   notIndexed = new StringField();
   schema.specField("title",    fulltext);
   schema.specField("content",  fulltext);
   schema.specField("url",      notIndexed);
   schema.specField("category", new StringField());

So, a single FieldSpec can be used to spec multiple fields -- as in this case
with "title" and "content".

Note that unlike Lucene, there is no default Analyzer.  There is no default
FieldSpec, either.  It's up to you to assign fields to FieldSpecs.

In some sense, it would actually be cleaner to require that each field name be
associated with a unique FieldSpec object.  That would make the process of
defining a Schema closely analogous to SQL's CREATE TABLE idiom.  
 CREATE SCHEMA my_schema (
   STRING category

However, that would require a unique Analyzer instance for each field, which
would be very expensive if you had a lot of small fields.

BTW, in KS svn trunk, Schemas are now fully serialized and written to the
index as "schema_NNN.json".  Including Analyzers.  :)

> So for Lucy you'll move away from KS's required "compound file
> format", and allow either compound or not.

I think so.  It would be nice not to foreclose on the option of a discrete
file format.  In some sense discrete files are actually easier to troubleshoot
because you can inspect the individual binary files directly.  And of course,
that extra file copy at index time goes away.

> But you didn't answer the original question: will you simply use the
> numeric value as the ord?

Yes, that was what I was thinking.  For example, if the FieldSpec for a given
field is an Int32Field, then we'd build a stack of the numeric field values.
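
To sketch that in Java (nothing here is actual KS/Lucy code -- the names are
made up for illustration), the sort cache for an Int32Field could just be the
raw values, one slot per doc, since numeric values order themselves:

   // Hypothetical sketch: for a numeric field, the value itself serves as
   // the ord, so the per-segment sort cache is a flat array indexed by doc id.
   interface Int32Values { int get(int docId); }  // stand-in for the inverter's output

   static int[] buildInt32SortCache(int maxDoc, Int32Values values) {
      int[] cache = new int[maxDoc];
      for (int docId = 0; docId < maxDoc; docId++) {
         cache[docId] = values.get(docId);  // the numeric value doubles as the ord
      }
      return cache;
   }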

> Also, for enumerated fields (small universe of possible values, eg
> "yes" or "no" in the binary case) will you bit-pack the ords?

I imagine so.

> I think variable/fixed width storage vs column/row stride are orthogonal.

OK, fair enough.  

> > ... I think the only purpose for column-stride fields should be to aid search.
> I don't really understand what "to aid search" means -- that's very
> generic.

Meaning, we'd use the column-stride fields during matching/scoring, but
IndexReader.fetchDoc() would never use them.

> >> Except, they take 2X the storage as int ords.
> >
> > True.  For a field with a small number of unique values, the 32-bit ords are a win.
> Well, you can do better by bit packing.  For enumerated value fields
> you usually can use far fewer than 32 bits per value.

Hmm... I suppose you could think of any field as enumerated.  :)  Once you
invert all the docs in a segment, you know how many terms there are.

Can we take advantage of that to simplify our implementation and only provide
simple variants on the bit-packed-array-of-ords data structure?

Say that after we finish indexing a segment, we discover that the "category"
field, which is marked as sortable, has only 237 unique values.  We can then
build our ords cache as a stack of bytes.  Similarly, we discover that the
"enabled" field has only two values, "y" and "n".  We can build our ords cache
as a bit map.
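
To make the bit-packing concrete, here's a rough Java sketch (again, made-up
names, not actual code).  Once the segment is flushed we know the unique value
count, so each doc's ord gets ceil(log2(n)) bits -- 8 bits for 237 categories,
1 bit for "y"/"n":

   // Hypothetical sketch: pack each doc's ord into just enough bits, given
   // the number of unique values discovered while inverting the segment.
   static long[] packOrds(int[] ords, int numUniqueValues) {
      int bitsPerOrd = Math.max(1,
         32 - Integer.numberOfLeadingZeros(numUniqueValues - 1));
      long[] packed = new long[(int) (((long) ords.length * bitsPerOrd + 63) / 64)];
      for (int docId = 0; docId < ords.length; docId++) {
         long bitPos = (long) docId * bitsPerOrd;
         int  slot   = (int) (bitPos >>> 6);
         int  shift  = (int) (bitPos & 63);
         packed[slot] |= ((long) ords[docId]) << shift;
         if (shift + bitsPerOrd > 64) {               // ord straddles two longs
            packed[slot + 1] |= ((long) ords[docId]) >>> (64 - shift);
         }
      }
      return packed;
   }

Reading an ord back is the mirror image: slice bitsPerOrd bits starting at
docId * bitsPerOrd.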

In both cases, we avoid requiring the user to specify the values in advance --
but because we proceed a segment at a time, we can pretend that they did.

That doesn't solve all our problems, because we still need the actual values
for resolving inter-segment comparisons...  For fixed-width fields, we can
just build a raw stack of values.  For variable-width fields, we can either do
something block-based, or we can use the two-file model with one pointer file
and one data file.  

I prefer the two-file model over the block model because comparison of long
values which cross block boundaries gets messy unless you build a contiguous
value by copying.  With the two-file model, you just pass around pointers into
the mmap'd data.  Locality of reference isn't a big deal if those two files
are jammed right next to each other in the compound file.
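
Here's a rough Java sketch of what the reader side of that two-file model could
look like (the layout and the file names are assumptions for illustration, not
the actual KS format): a fixed-width pointer file and a variable-width data
file, both mapped into memory, with each lookup returning a window onto the
data rather than a copy.

   import java.nio.ByteBuffer;

   // Hypothetical sketch: "category.ix" holds maxDoc + 1 i64 offsets;
   // "category.dat" holds the concatenated variable-width values.  With both
   // buffers mmap'd, fetching a value is just slicing -- no copying, and no
   // value ever straddles a block boundary.
   class TwoFileSortCache {
      private final ByteBuffer offsets;  // fixed-width pointer file
      private final ByteBuffer data;     // variable-width data file

      TwoFileSortCache(ByteBuffer offsets, ByteBuffer data) {
         this.offsets = offsets;
         this.data    = data;
      }

      // Return the raw bytes for one entry as a view into the mapped data.
      ByteBuffer value(int ord) {
         int start = (int) offsets.getLong(ord * 8);
         int end   = (int) offsets.getLong((ord + 1) * 8);
         ByteBuffer window = data.duplicate();
         window.position(start);
         window.limit(end);
         return window.slice();
      }
   }

A sort comparison then just compares two such windows byte by byte, without
ever assembling a contiguous copy.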

Marvin Humphrey
