incubator-lucy-dev mailing list archives

From: Michael McCandless <>
Subject: Re: Sort cache file format
Date: Fri, 10 Apr 2009 13:52:01 GMT
On Thu, Apr 9, 2009 at 1:38 PM, Marvin Humphrey <> wrote:

>> Ahhhhh... OK this finally clarifies my [silly] confusion: a
>> FieldSpec corresponds to more than one field.  It's like FieldInfos
>> in Lucene?
> I wouldn't say so.  It's more like a single FieldInfo on steroids.
> Here's the Schema API from KS svn trunk, translated to Java:
>   Schema        schema     = new Schema();
>   PolyAnalyzer  analyzer   = new PolyAnalyzer("en");
>   FullTextField fulltext   = new FullTextField(analyzer);
>   StringField   notIndexed = new StringField();
>   notIndexed.setIndexed(false);
>   schema.specField("title",    fulltext);
>   schema.specField("content",  fulltext);
>   schema.specField("url",      notIndexed);
>   schema.specField("category", new StringField());
> So, a single FieldSpec can be used to spec multiple fields -- as in this case
> with "title" and "content".

So in this code, FullTextField is an instance (subclass?) of FieldSpec?

It seems like a FieldSpec represents the "extended type" of a field
(ie, includes all juicy details about how it should be indexed
(including an analyzer instance), stored, etc), and then you're free
to have more than one field in your doc share that "extended type".

So things like "I intend to do range searching", "I intend to sort
this field", and "I want to store term vectors, with positions but
not offsets", etc., belong in FieldSpec?

Hmmm... actually I sort of talked about the beginnings of this, for
Lucene, in the last paragraph here:

Ie, maybe we should instantiate Field.Index and tweak its options
(norms, omitTFAP, etc.) and that instance becomes the type of your
field (at least wrt indexing).
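
Roughly something like this hypothetical sketch -- every name below
is invented, none of this is Lucene's actual API:

  // An indexing "type" as a configurable instance, shared by any
  // fields that should be indexed the same way:
  IndexType keyword = new IndexType();
  keyword.setOmitNorms(true);
  keyword.setOmitTermFreqAndPositions(true);            // "omitTFAP"
  keyword.setTermVectors(true, /*positions*/ true, /*offsets*/ false);

  doc.add(new Field("category", "electronics", keyword));
  doc.add(new Field("status",   "active",      keyword));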

This is also sort of like the crazy static types one can create with
generics, ie, a "type" used to be something nice and simple (int, float,
your own class, etc.) but now can be a rich object (instance) in
itself.  [A class and an instance really should not be different,
anyway (prototype languages like Self don't differentiate).]

> Note that unlike Lucene, there is no default Analyzer.  There is no
> default FieldSpec, either.  It's up to you to assign fields to
> FieldSpecs.


> In some sense, it would actually be cleaner to require that each
> field name be associated with a unique FieldSpec object.  That would
> make the process of defining a Schema closely analogous to SQL's
>  CREATE SCHEMA my_schema (
>   STRING category
>  );
> However, that would require a unique Analyzer instance for each
> field, which would be very expensive if you had a lot of small
> fields.

Now that I understand FieldSpec (I think!), allowing sharing seems
fine.  Different variables in my source code can share the same
type, after all.

> BTW, in KS svn trunk, Schemas are now fully serialized and written
> to the index as "schema_NNN.json".  Including Analyzers.  :)

How do you serialize Analyzers, again?  Maybe it's possible because
the only allowed "custom" analyzer is one specified via your own
regexp (and not custom host or core source code)?

>> So for Lucy you'll move away from KS's required "compound file
>> format", and allow either compound or not.
> I think so.  It would be nice not to foreclose on the option of a
> discrete file format.  In some sense discrete files are actually
> easier to troubleshoot because you can inspect the individual binary
> files directly.  And of course, that extra file copy at index time
> goes away.

I'm still torn on how crucial "live" transparency is, vs running a
tool (CheckIndex, Luke) to see things, but yes it's a plus.

But: I've often wondered whether that extra copy gains performance.
Lucene now sets the file size of the CFS before writing, which in
theory at least means an FS can optimize placement of the file to
reduce fragmentation.  So that copy may be good.  I suppose we could
do a similar trick for the discrete files, too.
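
In Java the preallocation trick is just this (a sketch -- the real
Lucene code path differs):

  import java.io.IOException;
  import java.io.RandomAccessFile;

  // Reserve the final size up front so the filesystem can try to
  // place the file contiguously, then fill the region in.
  static void writeWithPreallocation(String path, long totalSize)
      throws IOException {
    RandomAccessFile out = new RandomAccessFile(path, "rw");
    out.setLength(totalSize);   // totalSize computed beforehand
    out.seek(0);
    // ... write each sub-file's bytes sequentially here ...
    out.close();
  }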

>> But you didn't answer the original question: will you simply use
>> the numeric value as the ord?
> Yes, that was what I was thinking.  For example, if the FieldSpec
> for a given field is an Int32Field, then we'd build a stack of the
> numeric field values.
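
Right -- so for an Int32Field the sort cache degenerates to a plain
array; a sketch (int32Value is an invented accessor):

  // The numeric value is its own ord: one int per doc, and sort
  // comparisons just compare the values directly.
  int[] sortCache = new int[maxDoc];
  for (int docID = 0; docID < maxDoc; docID++) {
    sortCache[docID] = int32Value(docID);
  }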


>> Also, for enumerated fields (small universe of possible values, eg
>> "yes" or "no" in the binary case) will you bit-pack the ords?
> I imagine so.
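
The packing itself is simple enough -- a sketch, with invented names
and write-once semantics:

  // Ords packed bitsPerOrd bits apiece into 64-bit words; a binary
  // yes/no field needs 1 bit per doc instead of 32.
  int bitsPerOrd = numUnique <= 1
      ? 1 : 32 - Integer.numberOfLeadingZeros(numUnique - 1);
  long[] blocks = new long[(int) (((long) maxDoc * bitsPerOrd + 63) / 64)];

  void setOrd(int docID, int ord) {        // assumes slot is still zero
    long bitPos = (long) docID * bitsPerOrd;
    int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
    blocks[word] |= (long) ord << shift;
    if (shift + bitsPerOrd > 64)           // ord spans two words
      blocks[word + 1] |= (long) ord >>> (64 - shift);
  }

  int getOrd(int docID) {
    long bitPos = (long) docID * bitsPerOrd;
    int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
    long v = blocks[word] >>> shift;
    if (shift + bitsPerOrd > 64)
      v |= blocks[word + 1] << (64 - shift);
    return (int) (v & ((1L << bitsPerOrd) - 1));
  }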


>> I don't really understand what "to aid search" means -- that's very
>> generic.
> Meaning, we'd use the column-stride fields during matching/scoring,
> but IndexReader.fetchDoc() would never use them.

I'm still unsure.  I think how CSFs get used will be app-dependent; I
guess we throw them out there and see how they're used.

EG maybe the app stores price & manufacturer only in CSF and not in
the "normal" stored fields... the additional cost to retrieve stored
fields / excerpts for each of 10 hits on the page can be quite high.
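
Ie, render the hits page straight from the column arrays (sketch;
the accessors are invented):

  // No stored-field seek per hit -- just array lookups:
  for (int i = 0; i < 10; i++) {
    int doc = hits[i];
    render(priceCSF.get(doc), manufacturerCSF.get(doc));
  }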

Of course it's really only a win if you never need the stored fields
(ie, if you're gonna pay a seek cost, the added cost to retrieve 1 vs
10 fields is negligible), which, if you're doing excerpts, won't be
the case.

>> >> Except, they take 2X the storage as int ords.
>> >
>> > True.  For a field with a small number of unique values, the
>> > 32-bit ords are a win.
>> Well, you can do better by bit packing.  For enumerated value
>> fields you usually can use far fewer than 32 bits per value.
> Hmm... I suppose you could think of any field as enumerated.  :)
> Once you invert all the docs in a segment, you know how many terms
> there are.
> Can we take advantage of that to simplify our implementation and
> only provide simple variants on the bit-packed-array-of-ords data
> structure?
> Say that after we finish indexing a segment, we discover that the
> "category" field, which is marked as sortable, has only 237 unique
> values.  We can then build our ords cache as a stack of bytes.
> Similarly, we discover that the "enabled" field has only two values,
> "y" and "n".  We can build our ords cache as a bit map.
> In both cases, we avoid requiring the user to specify values in
> advance -- but because we proceed a segment at a time, we can
> pretend that they did.
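
That choice could be as simple as this sketch (the writer classes
are invented):

  // At segment flush, once the true cardinality is known, pick the
  // narrowest ord representation that fits:
  static OrdsWriter pickOrdsWriter(int numUnique) {
    if (numUnique <= 2)     return new BitMapOrdsWriter();  //  1 bit/doc
    if (numUnique <= 256)   return new ByteOrdsWriter();    //  8 bits/doc
    if (numUnique <= 65536) return new ShortOrdsWriter();   // 16 bits/doc
    return new IntOrdsWriter();                             // 32 bits/doc
  }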

Well if the terms are "largely" unique (eg a "title" field), you lose
by treating them as enumerated (as Lucene does today) since the extra
deref only adds waste.

But then, if you are going to sort by that field, you need the ord
[deref] anyway, so... yeah maybe we only implement "enumerated"
values, at least for the sort cache.

If I just want to retrieve values, and the values are mostly unique,
it's more efficient to inline the values.

And yes at indexing time we know precisely all stats on the field so
we could pick and choose between the "hey it's obviously enumerated"
vs the "hey most values are unique" cases, if we want to do both.

> That doesn't solve all our problems, because we still need the
> actual values for resolving inter-segment comparisons...  For
> fixed-width fields, we can just build a raw stack of values.  For
> variable-width fields, we can either do something block-based, or we
> can use the two-file model with one pointer file and one data file.
> I prefer the two-file model over the block model because comparison
> of long values which cross block boundaries gets messy unless you
> build a contiguous value by copying.  With the two-file model, you
> just pass around pointers into the mmap'd data.  Locality of
> reference isn't a big deal if those two files are jammed right next
> to each other in the compound file.
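
For the two-file model, a cross-ord comparison never needs a
contiguous copy -- a Java approximation of the mmap'd version
(invented layout, 4-byte pointers; the pointer file needs one extra
trailing entry so ix[N+1] marks the end of the last value):

  import java.nio.MappedByteBuffer;

  // "ix" holds one fixed-width pointer per value, "dat" the raw
  // bytes; the value for ord N occupies dat[ix[N], ix[N+1]).
  static int compareOrds(MappedByteBuffer ix, MappedByteBuffer dat,
                         int ordA, int ordB) {
    int startA = ix.getInt(ordA * 4), endA = ix.getInt((ordA + 1) * 4);
    int startB = ix.getInt(ordB * 4), endB = ix.getInt((ordB + 1) * 4);
    int lenA = endA - startA, lenB = endB - startB;
    int n = Math.min(lenA, lenB);
    for (int i = 0; i < n; i++) {
      int diff = (dat.get(startA + i) & 0xff) - (dat.get(startB + i) & 0xff);
      if (diff != 0) return diff;
    }
    return lenA - lenB;   // shorter value sorts first on a tie
  }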

In the block-based approach, you'd still need to store the pointers
(block + offset) for each value somewhere?
