incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: Types and Schemas
Date Tue, 14 Apr 2009 10:38:24 GMT
On Mon, Apr 13, 2009 at 09:43:06AM -0400, Michael McCandless wrote:

> > Because Lucy's Doc objects will be hash based, there will *never* be a case
> > where the same field has two "values" per se within the same doc.
> >
> > However, it's fine if we support compound types via specific FieldType
> > subclasses, e.g. Float32ArrayType, or StringArrayType.
> I see -- does KS support multi-valued (compound) types today?  

No.  There hasn't been a pressing need for them.  (IMO, the fact that Lucene
allows multiple "values" per field is a misfeature.  Effectively, *all* fields
in Lucene are compound types, which is limiting.)

Nevertheless, I'm up for supplementing scalar types with compound types in
Lucy.  The "tags" use case in particular might be more elegantly handled with
a StringArrayType.  Or maybe a FullTextArrayType, if it was important for the
field to be analyzed.

Right now, you can fake up a "tags" field in KS using a dedicated Tokenizer
pretty easily, but the scoring is kind of messed up because of length

> For which "types"?  And I imagine for such types, "sortable" is not allowed

There's an inherent confusion in how fields that can "hit" in multiple ways
should sort.  On one hand, you might want to sort by the value that "hit".  On
the other hand, you might want to sort by the first value in the field.

In the face of that confusion, I think it makes sense to just disable sorting
for compound types.

> (yet "sortable" is set at the top FieldSpec, right?)?

Sure, but subclasses of FieldType can override Set_Sortable() to throw an
error, and leave "sortable" out of their constructor args.
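
A minimal sketch of that override, assuming a simplified struct with a single
function pointer in place of the real object model.  Everything here other
than the Set_Sortable idea is hypothetical, and a false return value stands in
for throwing an error:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for FieldType with one overridable
 * method.  Not the actual KS or Lucy object model. */
typedef struct FieldType FieldType;
struct FieldType {
    bool sortable;
    /* Returns true on success, false if the type rejects the setting. */
    bool (*set_sortable)(FieldType *self, bool sortable);
};

/* Default behaviour for scalar types: just record the flag. */
static bool
FieldType_set_sortable(FieldType *self, bool sortable) {
    self->sortable = sortable;
    return true;
}

/* Override for a compound type: refuse to enable sorting. */
static bool
StringArrayType_set_sortable(FieldType *self, bool sortable) {
    if (sortable) {
        fprintf(stderr, "StringArrayType fields can't be sortable\n");
        return false;
    }
    self->sortable = false;
    return true;
}

/* Scalar type: takes the default setter. */
static FieldType
StringType_new(void) {
    FieldType type = { false, FieldType_set_sortable };
    return type;
}

/* Compound type: no "sortable" constructor arg, and the setter is overridden. */
static FieldType
StringArrayType_new(void) {
    FieldType type = { false, StringArrayType_set_sortable };
    return type;
}
```

The base-class test on the sortable flag keeps working unchanged, since a
compound type can never end up with the flag set.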

> > It's also important to distinguish between "multi-valued" and the
> > "multi-token" FullTextType.  FullTextType fields are tokenized within the
> > index, but in the context of the doc reader, they only have one string
> > "value".  Note, however, that you cannot sort on a FullTextType field in KS.
> So if I want to index & sort by "title" field, I make 2 separate fields?

Hmm.  Good point, that's a waste and shouldn't be necessary.

That behavior is an artifact of using Lexicon data to build the sort cache.
Once we move to a dedicated SortWriter/SortReader, though, we'll be building
the sort cache at index time from the full field value, and that problem goes

So, I think it makes sense to allow sorting on FullTextType fields after all.

> >>   * Open vs closed (known set of values) enums
> >
> > It would be nice to add this later.  I don't think it's a high priority, since
> > it's an optimization.
> You mean you'd start with "open" enums?

I meant no enum type right now.

> >>   * Sortable
> >
> > I think this belongs in the base class -- that's where KS has it now.  That
> > way, we can perform the following test, regardless of what the type is.
> >
> >   if (FieldType_Sortable(field_type)) {
> >        /* Build sort cache. */
> >        ...
> >   }
> Yeah... except multi-valued (compound) types would disable this, I
> guess.  Though Lucene users seem to hit this limitation enough to make
> it relaxable... and customize how SortCache gets created.

In the abstract, that sounds like a can of worms, but we can revisit after the
sort cache writer (SortWriter?) gets a provisional implementation.

> >>   * nulls sort on top or bottom
> >
> > This would be individual to each sort comparator.  Note that we might want to
> > use a different sort comparator for NOT NULL fields for efficiency's sake,
> > which complicates making the comparator a method on FieldSpec.
> Yes, we're iterating on this now in LUCENE-831.  Though I wonder if
> this ought to be the realm of source code specialization...
> multiplying out all the combinations of "single comparator or not",
> "scoring or not", "track max score or not", "string index may have
> nulls or not", in Lucene's "true" sources (vs generated sources)
> starts to get crazy.  Soon we'll also multiply in "docIDs guaranteed
> to arrive in order to the collector, or not" as well.

Actually, you know what?  The vast majority of our sort costs come at
index-time, when we build the ords array.  At search time, the only time we
have to worry about the cost of the comparator is when comparing values across
segments.  So: we can afford to have NULL checks in the default comparator
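
A rough sketch of a comparator with the NULL checks built in, assuming string
values and NULLs sorting towards the end.  The function name is made up for
illustration:

```c
#include <stddef.h>
#include <string.h>

/* Compare two field values, treating a NULL pointer as "no value".
 * NULLs sort after every non-NULL value; two NULLs compare equal. */
static int
compare_nulls_last(const char *a, const char *b) {
    if (a == NULL) { return b == NULL ? 0 : 1; }
    if (b == NULL) { return -1; }
    return strcmp(a, b);
}
```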

> > My general inclination is to have NULLs sort towards the end of the array.
> >
> >>   * Omit norms, omit TFAP
> >
> > I'm putting this off for now.  It will be addressed when we refactor for
> > flexible indexing.
> OK.  These would seem to live nicely under FullTextType... oh actually
> maybe not, because presumably I can index single-valued fields (the
> equivalent of NOT_ANALYZED in Lucene).  

Yes.  Right now in KS, StringType fields -- which are single-valued -- can be

> EG an Int32Type may in fact be indexed, and I would at that point want to
> put omit norms/TFAP there.  Hmmm, cross cutting concerns.  Maybe sub-typing
> is needed...

Right now in KS, norms are stored in the postings files, a la the original
"flexible indexing" design that Doug, Grant and I hashed out a while back.
It's inefficient and needs refactoring.

However, I plan to wait on that until after the next dev release.

> >>   * Term vectors or not, positions, offsets
> >
> > Term vectors are unique to FullTextType, since it is the only multi-token
> > field.  Right now in KS, it's a boolean member var in FullTextType.
> Single-token indexed fields might want term vectors too?

I dunno, is that necessary?  I guess it's not a big deal to move it down into

Right now in KS, there's only one flag, "vectorized", and start offsets and
end offsets are always included.  That's because the only significant use case
is highlighting.  (I've always regarded MoreLikeThis queries based on term
vectors as fatally flawed.)

I've often wondered whether or not to call that flag "highlightable" rather
than the obtuse "vectorized".  IMO, it's important to have a high quality
highlighter/excerpter in core, and perhaps the API should be adjusted to
reflect that priority.  If you really need "term vectors" per se, you can
either go with a dedicated plugin or specify "highlightable" and exploit the
fact that it's a term-vector based implementation.

> >>   * CSF'd or not
> >
> > Right now, I'd say keep this out of core.
> OK, and, merge with sort cache somehow.  For most types they are one
> and the same.

Yeah, I think that's right.  The only difference is the extra deref in cases
where high levels of uniqueness suggest a pure array would be ideal.

> >>   * I will use RangeFilter on this field
> >
> > The "sortable" boolean member var fills this need, no?
> They are different?  Eg you'll add aggregates (Trie*) to your index
> for fast range constraints, but for sorting you just need a sort cache
> computed.

I haven't really looked at the TrieRange stuff yet...

In KS, a range query looks up the term numbers of the lower and upper terms in
the Lexicon before scoring commences, then checks for each document whether
the ord value from the sort cache falls between them.
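
Roughly, and assuming a sorted array of strings standing in for the Lexicon
and a plain int32_t array standing in for the sort cache (all names below are
hypothetical, and the range is treated as inclusive at both ends):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    int32_t lower_ord;   /* first term number >= the lower term */
    int32_t upper_ord;   /* last term number <= the upper term, or -1 */
} OrdRange;

/* Binary search over a sorted term list.  Counts the terms that sort before
 * target, or before-or-equal-to target when inclusive is true. */
static int32_t
count_terms_below(const char *const *terms, int32_t num_terms,
                  const char *target, bool inclusive) {
    int32_t lo = 0;
    int32_t hi = num_terms;
    while (lo < hi) {
        int32_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(terms[mid], target);
        if (cmp < 0 || (inclusive && cmp == 0)) { lo = mid + 1; }
        else                                    { hi = mid; }
    }
    return lo;
}

/* Done once, before scoring: turn both range terms into term numbers. */
static OrdRange
OrdRange_resolve(const char *const *terms, int32_t num_terms,
                 const char *lower, const char *upper) {
    OrdRange range;
    range.lower_ord = count_terms_below(terms, num_terms, lower, false);
    range.upper_ord = count_terms_below(terms, num_terms, upper, true) - 1;
    return range;
}

/* Done per document: an integer comparison against the cached ord. */
static bool
OrdRange_matches(const OrdRange *range, const int32_t *ords, int32_t doc_id) {
    int32_t ord = ords[doc_id];
    return ord >= range->lower_ord && ord <= range->upper_ord;
}
```

The string comparisons all happen up front; the per-document work is two
integer comparisons.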

> > FWIW, the current implementation of Boilerplater only supports two level
> > namespacing (with nicknames).  Outside of core, fully qualified code would
> > look like this:
> >
> >  lucy_StandardAnalyzer *analyzer = lucy_StdAnalyzer_new();
> >  lucy_Inversion *inversion = Lucy_StdAnalyzer_Transform_Text(charbuf);
> What are the two levels here?  Level 1 is "StdAnalyzer", and Level 2
> is "new" and "Transform_Text"?

Level 1 is "lucy_".  Level 2 is "StdAnalyzer_".

> > If we readonly that Hash, we can't add subclasses to it -- and therefore we
> > won't be able to retrieve their deserializers.
> I guess it's only subclasses implemented in C where this is important?
> Because a hosty subclass's deserializer is using/relying on the host's
> namespace to find classes by name.

Within Schema_deserialize, Lucy will have to be able to track down
deserializers for custom subclasses of Analyzer and FieldType.  Same thing
with custom Query subclasses and remote searching.

We either deal with that need in the Lucy core or punt back to the host.
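
If it ends up in core, the lookup might be sketched along these lines.  This
is a guess at the shape rather than a design: the fixed-size table, the
read_only flag and all of the names are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical deserializer signature: returns nonzero on success. */
typedef int (*deserialize_t)(const char *dump);

#define MAX_CLASSES 16

/* Maps class names to deserializers.  A fixed-size table stands in for
 * the Hash discussed above. */
typedef struct {
    const char   *names[MAX_CLASSES];
    deserialize_t funcs[MAX_CLASSES];
    size_t        count;
    int           read_only;
} ClassRegistry;

static void
ClassRegistry_init(ClassRegistry *reg) {
    memset(reg, 0, sizeof(*reg));
}

/* Returns 1 if the class was added, 0 if the registry is read-only or full.
 * This is the failure a custom subclass would hit if the Hash were frozen. */
static int
ClassRegistry_add(ClassRegistry *reg, const char *name, deserialize_t func) {
    if (reg->read_only || reg->count >= MAX_CLASSES) { return 0; }
    reg->names[reg->count] = name;
    reg->funcs[reg->count] = func;
    reg->count++;
    return 1;
}

/* Returns the deserializer registered under name, or NULL if there is none. */
static deserialize_t
ClassRegistry_fetch(const ClassRegistry *reg, const char *name) {
    size_t i;
    for (i = 0; i < reg->count; i++) {
        if (strcmp(reg->names[i], name) == 0) { return reg->funcs[i]; }
    }
    return NULL;
}

/* Example deserializer for a made-up custom Analyzer subclass. */
static int
MyAnalyzer_deserialize(const char *dump) {
    return dump != NULL && dump[0] != '\0';
}
```

Something like Schema_deserialize would then fetch by class name and fail
cleanly when the lookup comes back NULL.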

Marvin Humphrey
