incubator-lucy-dev mailing list archives

From Michael McCandless <>
Subject Re: Types and Schemas
Date Tue, 14 Apr 2009 12:40:15 GMT
On Tue, Apr 14, 2009 at 6:38 AM, Marvin Humphrey <> wrote:
> On Mon, Apr 13, 2009 at 09:43:06AM -0400, Michael McCandless wrote:
>> > Because Lucy's Doc objects will be hash based, there will *never* be a case
>> > where the same field has two "values" per se within the same doc.
>> >
>> > However, it's fine if we support compound types via specific FieldType
>> > subclasses, e.g. Float32ArrayType, or StringArrayType.
>> I see -- does KS support multi-valued (compound) types today?
> No.  There hasn't been a pressing need for them.  (IMO, the fact that Lucene
> allows multiple "values" per field is a misfeature.  Effectively, *all* fields
> in Lucene are compound types, which is limiting.)

I think compound types are important (eg "author"), though "compound"
is a bit too powerful sounding (eg a "struct" is a compound type, but
we're not going there, I hope).  Maybe we can call them "arrays" or
"lists" or "multi-valued".

Maybe you mean Lucene's weak typing (of multi-valued types, in
particular) is the misfeature here?

> Nevertheless, I'm up for supplementing scalar types with compound types in
> Lucy.  The "tags" use case in particular might be more elegantly handled with
> a StringArrayType.  Or maybe a FullTextArrayType, if it was important for the
> field to be analyzed.
> Right now, you can fake up a "tags" field in KS using a dedicated Tokenizer
> pretty easily, but the scoring is kind of messed up because of length
> normalization.

Another EG might be a product that comes in three sizes (S, M, L)
where you want to filter by "size == S"; that's hard to emulate well
w/o compound types (you could do substring search, but that scales
poorly).
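To make the trade-off concrete, here's a minimal sketch (names are mine, not KS/Lucy API): with a multi-valued field, "size == S" is an exact match against any one value, whereas substring search over a concatenated string would also falsely match "S" inside a value like "XS".

```c
#include <string.h>

/* Hypothetical sketch: exact-match filtering over a multi-valued
 * field.  Each value is compared whole, so "S" does not match "XS". */
static int
field_matches_any(const char **values, int num_values, const char *target) {
    for (int i = 0; i < num_values; i++) {
        if (strcmp(values[i], target) == 0) { return 1; }
    }
    return 0;
}
```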

>> For which "types"?  And I imagine for such types, "sortable" is not allowed
> There's an inherent confusion in how fields that can "hit" in multiple ways
> should sort.  On one hand, you might want to sort by the value that "hit".  On
> the other hand, you might want to sort by the first value in the field.
> In the face of that confusion, I think it makes sense to just disable sorting
> for compound types.

Or allow custom comparator, or custom "ValueSource".  Hmm, I wonder
whether ValueSource should make it possible to eg return multiple ints
for a single doc.

>> (yet "sortable" is set at the top FieldSpec, right?)?
> Sure, but subclasses of FieldType can override Set_Sortable() to throw an
> error and avoid it as a constructor arg.

Ahh, right, you can "subtract" functionality from the base.  OK.
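The "subtracting functionality" idea could be sketched roughly like this in C (illustrative struct and names, not Boilerplater output): the base type carries a settable function pointer, and a compound-type subclass swaps in an implementation that refuses.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of a subclass "subtracting" a base-class
 * capability: a compound FieldType overrides set_sortable with a
 * version that errors out instead of setting the flag. */
typedef struct FieldType {
    int  sortable;
    void (*set_sortable)(struct FieldType *self, int sortable);
} FieldType;

static void
FType_set_sortable(FieldType *self, int sortable) {
    self->sortable = sortable;
}

static void
ArrayType_set_sortable(FieldType *self, int sortable) {
    (void)self; (void)sortable;
    fprintf(stderr, "compound types cannot be sortable\n");
    exit(EXIT_FAILURE);  /* stand-in for throwing an error */
}
```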

>> > It's also important to distinguish between "multi-valued" and the
>> > "multi-token" FullTextType.  FullTextType fields are tokenized within the
>> > index, but in the context of the doc reader, they only have one string
>> > "value".  Note, however, that you cannot sort on a FullTextType field in KS.
>> So if I want to index & sort by "title" field, I make 2 separate fields?
> Hmm.  Good point, that's a waste and shouldn't be necessary.
> That behavior is an artifact of using Lexicon data to build the sort cache.
> Once we move to a dedicated SortWriter/SortReader, though, we'll be building
> the sort cache at index time from the full field value, and that problem goes
> away.
> So, I think it makes sense to allow sorting on FullTextType fields after all.

OK that sounds right.  Lucene won't be able to do this until we have
CSF, or until we also write sort caches at index time, which does make
sense.

>> >>   * Open vs closed (known set of values) enums
>> >
>> > It would be nice to add this later.  I don't think it's a high priority, since
>> > it's an optimization.
>> You mean you'd start with "open" enums?
> I meant no enum type right now.

OK, since at index time we can basically deduce ourselves whether it's
"relatively" enumerated and act accordingly (or simply treat all
fields as enums for now, as you've suggested).
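One way to "deduce it ourselves" would be a heuristic like the following sketch (the threshold and function name are my own illustration, not anything in KS or Lucene): treat a field as enumerated when its unique term count is a small fraction of the doc count, and pick an ord-based encoding accordingly.

```c
/* Hypothetical heuristic for index-time enum detection: a field
 * "looks enumerated" when unique terms are under 1% of doc count.
 * The 1% threshold is an arbitrary illustration. */
static int
looks_enumerated(int num_unique_terms, int num_docs) {
    if (num_docs == 0) { return 0; }
    return num_unique_terms * 100 < num_docs;
}
```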

>> >>   * Sortable
>> >
>> > I think this belongs in the base class -- that's where KS has it now.  That
>> > way, we can perform the following test, regardless of what the type is.
>> >
>> >   if (FieldType_Sortable(field_type)) {
>> >        /* Build sort cache. */
>> >        ...
>> >   }
>> Yeah... except multi-valued (compound) types would disable this, I
>> guess.  Though Lucene users seem to hit this limitation enough to make
>> it relaxable... and customize how SortCache gets created.
> In the abstract, that sounds like a can of worms, but we can revisit after the
> sort cache writer (SortWriter?) gets a provisional implementation.


>> >>   * nulls sort on top or bottom
>> >
>> > This would be individual to each sort comparator.  Note that we might want to
>> > use a different sort comparator for NOT NULL fields for efficiency's sake,
>> > which complicates making the comparator a method on FieldSpec.
>> Yes, we're iterating on this now in LUCENE-831.  Though I wonder if
>> this ought to be the realm of source code specialization...
>> multiplying out all the combinations of "single comparator or not",
>> "scoring or not", "track max score or not", "string index may have
>> nulls or not", in Lucene's "true" sources (vs generated sources)
>> starts to get crazy.  Soon we'll also multiply in "docIDs guaranteed
>> to arrive in order to the collector, or not" as well.
> Actually, you know what?  The vast majority of our sort costs come at
> index-time, when we build the ords array.  At search time, the only time we
> have to worry about the cost of the comparator is when comparing values across
> segments.  So: we can afford to have NULL checks in the default comparator
> routines.

I think the search time optimizations add up... not having to break
ties on docID is a good gain, for example, if the sort has ties.

I'm seeing sizable gains by specializing the source code (in Java, at
least).  Though, a good chunk of that is pushing random-access filters
down low, so that's low-hanging fruit for the true source code.
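The NULL-check question above boils down to something like this sketch (my own function names, not KS's comparator API): a default comparator that branches on NULL, with NULLs sorting toward the end per Marvin's inclination, versus a specialized NOT NULL variant that skips the branches entirely.

```c
#include <string.h>
#include <stddef.h>

/* Sketch of the comparator trade-off under discussion.  Default
 * version handles NULLs, sorting them last; specialized version
 * assumes a NOT NULL field and avoids the extra branches. */
static int
compare_with_nulls(const char *a, const char *b) {
    if (a == NULL) { return (b == NULL) ? 0 : 1; }  /* NULLs sort last */
    if (b == NULL) { return -1; }
    return strcmp(a, b);
}

static int
compare_not_null(const char *a, const char *b) {
    return strcmp(a, b);
}
```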

>> > My general inclination is to have NULLs sort towards the end of the array.
>> >
>> >>   * Omit norms, omit TFAP
>> >
>> > I'm putting this off for now.  It will be addressed when we refactor for
>> > flexible indexing.
>> OK.  These would seem to live nicely under FullTextType... oh actually
>> maybe not, because presumably I can index single-valued fields (the
>> equivalent of NOT_ANALYZED in Lucene).
> Yes.  Right now in KS, StringType fields -- which are single-valued -- can be
> indexed.


>> EG an Int32Type may in fact be indexed, and I would at that point want to
>> put omit norms/TFAP there.  Hmmm, cross cutting concerns.  Maybe sub-typing
>> is needed...
> Right now in KS, norms are stored in the postings files, a la the original
> "flexible indexing" design that Doug, Grant and I hashed out a while back.
> It's inefficient and needs refactoring.
> However, I plan to wait on that until after the next dev release.


>> >>   * Term vectors or not, positions, offsets
>> >
>> > Term vectors are unique to FullTextType, since it is the only multi-token
>> > field.  Right now in KS, it's a boolean member var in FullTextType.
>> Single-token indexed fields might want term vectors too?
> I dunno, is that necessary?  I guess it's not a big deal to move it down into
> FieldType.
> Right now in KS, there's only one flag, "vectorized", and start offsets and
> end offsets are always included.  That's because the only significant use case
> is highlighting.  (I've always regarded MoreLikeThis queries based on term
> vectors as fatally flawed.)

Why flawed?

> I've often wondered whether or not to call that flag "highlightable" rather
> than the obtuse "vectorized".

> IMO, it's important to have a high quality
> highlighter/excerpter in core, and perhaps the API should be adjusted to
> reflect that priority.


> If you really need "term vectors" per se, you can
> either go with a dedicated plugin or specify "highlightable" and exploit the
> fact that it's a term-vector based implementation.

Yeah, maybe.  Besides highlighting, MoreLikeThis and maybe
clustering/categorizing, I don't have a good sense of what
else term vectors are "typically" used for.

>> >>   * CSF'd or not
>> >
>> > Right now, I'd say keep this out of core.
>> OK, and, merge with sort cache somehow.  For most types they are one
>> and the same.
> Yeah, I think that's right.  The only difference is the extra deref in cases
> where high levels of uniqueness suggest a pure array would be ideal.


>> >>   * I will use RangeFilter on this field
>> >
>> > The "sortable" boolean member var fills this need, no?
>> They are different?  Eg you'll add aggregates (Trie*) to your index
>> for fast range constraints, but for sorting you just need a sort cache
>> computed.
> I haven't really looked at the TrieRange stuff yet...
> In KS, range queries are implemented to just look up a term number in the
> Lexicon for both the lower and upper terms before scoring commences, then see
> if the ord value from the sort cache falls between them for each document.

Ahh got it.  Lucene recently added that approach (using our FieldCache
to check inclusion in the range).  We now have too many RangeQueries.

Trie simply aggregates big ranges at indexing time (logically
equivalent to 0-10, 10-20, then at the next trie level 0-100, 100-200,
etc.), ie each range is a new term on the doc, and then at search time
you can pick a much smaller set of terms to iterate.
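A rough sketch of that indexing side (the term format and the divide-by-10 precision step are my own illustration, not Lucene's actual trie encoding): each numeric value also gets coarser "bucket" terms at successive precision levels, so a range query can cover most of a range with a few coarse terms and only use fine terms at the edges.

```c
#include <stdio.h>
#include <string.h>

/* Sketch of trie-style term generation: emit one term per precision
 * level, dropping a decimal digit per level.  Real TrieRange uses a
 * binary shift-based encoding; base 10 is just easier to read. */
static int
trie_bucket_terms(int value, int num_levels, char out[][32]) {
    int divisor = 1;
    for (int level = 0; level < num_levels; level++) {
        snprintf(out[level], 32, "L%d:%d", level, value / divisor);
        divisor *= 10;
    }
    return num_levels;
}
```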

>> > FWIW, the current implementation of Boilerplater only supports two level
>> > namespacing (with nicknames).  Outside of core, fully qualified code would
>> > look like this:
>> >
>> >  lucy_StandardAnalyzer *analyzer = lucy_StdAnalyzer_new();
>> >  lucy_Inversion *inversion = Lucy_StdAnalyzer_Transform_Text(charbuf);
>> What are the two levels here?  Level 1 is "StdAnalyzer", and Level 2
>> is "new" and "Transform_Text"?
> Level 1 is "lucy_".  Level 2 is "StdAnalyzer_".


>> > If we readonly that Hash, we can't add subclasses to it -- and therefore we
>> > won't be able to retrieve their deserializers.
>> I guess it's only subclasses implemented in C where this is important?
>> Because a hosty subclass's deserializer is using/relying on the host's
>> namespace to find classes by name.
> Within Schema_deserialize, Lucy will have to be able to track down
> deserializers for custom subclasses of Analyzer and FieldType.  Same thing
> with custom Query subclasses and remote searching.
> We either deal with that need in the Lucy core or punt back to the host.

Seems like if the subclass is in the host, the host's namespace should
locate it (and its deserializer method).  The global hash that expands
at runtime to include all known named things in the universe still
doesn't quite sit right w/ me... but I agree that, given the
deserializer's needs and the lack of namespaces in C, it seems to
solve those needs.
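For concreteness, the registry under discussion could be sketched like this (all names and the fixed-size array are my own stand-ins for Lucy's expandable Hash, not its real VTable machinery): Schema_deserialize would look up a deserializer callback by class name for custom FieldType/Analyzer/Query subclasses.

```c
#include <string.h>
#include <stddef.h>

/* Sketch of a class-name -> deserializer registry.  A fixed array
 * stands in for Lucy's runtime-expandable Hash. */
typedef void* (*deserialize_t)(const void *frozen);

typedef struct { const char *class_name; deserialize_t callback; } RegEntry;

#define MAX_CLASSES 64
static RegEntry registry[MAX_CLASSES];
static int num_registered = 0;

static void*
dummy_deserialize(const void *frozen) {
    return (void*)frozen;  /* placeholder subclass deserializer */
}

static int
register_class(const char *class_name, deserialize_t callback) {
    if (num_registered >= MAX_CLASSES) { return 0; }
    registry[num_registered].class_name = class_name;
    registry[num_registered].callback   = callback;
    num_registered++;
    return 1;
}

static deserialize_t
find_deserializer(const char *class_name) {
    for (int i = 0; i < num_registered; i++) {
        if (strcmp(registry[i].class_name, class_name) == 0) {
            return registry[i].callback;
        }
    }
    return NULL;
}
```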

