incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: Types and Schemas (was "Sort cache file format")
Date Sun, 12 Apr 2009 21:04:24 GMT
> I think Lucene could continue to merge yet isolate information
> (subdivision, subclassing).  At least I sure hope so :)
> > I see why subdividing options might be useful in Lucene, but I'm not
> > sure it's necessary for Lucy.
> It's all still hazy to me :) Hopefully once we talk about it enough
> I'll get some clarity... 

Actually, what we probably need are Python bindings so that you can start
playing around.  :)

I've been trying to clean up Boilerplater enough so that porting
Boilerplater::Binding::Perl to Boilerplater::Binding::Python would be a
reasonable undertaking.  Perl's C API and object model are so complicated that
other languages will probably be a lot easier -- but right now, it's not
apparent from Boilerplater's API how you would get started.

> it is sort of scary that we're inventing a type system.

What's scary is that Java Lucene *has* a type system but won't admit it.

> EG there are many things the FieldType should somehow tell us:
>   * How does FieldSpec model "multi-valued" fields? Is there a
>     boolean in the base class?

Because Lucy's Doc objects will be hash based, there will *never* be a case
where the same field has two "values" per se within the same doc.

However, it's fine if we support compound types via specific FieldType
subclasses, e.g. Float32ArrayType, or StringArrayType.

It's also important to distinguish between "multi-valued" and the
"multi-token" FullTextType.  FullTextType fields are tokenized within the
index, but in the context of the doc reader, they only have one string
"value".  Note, however, that you cannot sort on a FullTextType field in KS.

>   * Must not be null -- base class?

Yes, I think that makes sense.

>   * "Has only one token" -- I guess this is implied by the class (ie
>     only FullTextType may have > 1 token)

For the near-to-middle-term future, yes -- FullTextType is the only
multi-token, single-valued type.

Looking down the road, I suppose other types like Int32ArrayType could have
more than one "token", but it wouldn't be an ordinary string "token".

>   * Open vs closed (known set of values) enums

It would be nice to add this later.  I don't think it's a high priority, since
it's an optimization.

>   * Sortable

I think this belongs in the base class -- that's where KS has it now.  That
way, we can perform the following test, regardless of what the type is.

   if (FieldType_Sortable(field_type)) {
       /* Build sort cache. */
   }
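
To make the base-class idea concrete, here is a minimal sketch in plain C. It is not KS or Lucy code: the struct layouts, the `term_vectors` member placement, and the `count_sortable` helper are assumptions made up for illustration. The only point is that a flag living in the base struct can be tested without knowing the concrete type.

```c
#include <stdbool.h>

/* Hypothetical base "class": every field type embeds this struct first. */
typedef struct FieldType {
    bool sortable;   /* base-class flag, available on every type */
    bool stored;     /* likewise a top-level boolean member */
} FieldType;

/* Hypothetical subclass: adds a member of its own after the base. */
typedef struct FullTextType {
    FieldType base;
    bool      term_vectors;
} FullTextType;

/* Accessor works for any subclass, because the base struct comes first. */
static bool
FieldType_Sortable(const FieldType *self) {
    return self->sortable;
}

/* Count how many fields in a schema would need a sort cache
 * (illustrative helper, not part of any real API). */
static int
count_sortable(const FieldType *const *types, int num_types) {
    int count = 0;
    for (int i = 0; i < num_types; i++) {
        if (FieldType_Sortable(types[i])) {
            count++;
        }
    }
    return count;
}
```

A caller holding a mixed array of field types can loop over it and call FieldType_Sortable() on each element without any per-type switch.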

>   * nulls sort on top or bottom

This would be individual to each sort comparator.  Note that we might want to
use a different sort comparator for NOT NULL fields for efficiency's sake,
which complicates making the comparator a method on FieldSpec.

My general inclination is to have NULLs sort towards the end of the array.  
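
As a rough illustration of "NULLs towards the end", here is a standalone comparator for the C library's qsort(). It assumes the values are plain C strings and that a NULL pointer stands for a missing value; the function names are invented for this sketch and say nothing about how the real sort caches would store things.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for an array of C strings in which NULL means
 * "no value".  NULL compares greater than any real string, so NULLs
 * collect at the end of the sorted array. */
static int
compare_nulls_last(const void *va, const void *vb) {
    const char *a = *(const char *const *)va;
    const char *b = *(const char *const *)vb;
    if (a == NULL && b == NULL) { return 0; }
    if (a == NULL)              { return 1; }
    if (b == NULL)              { return -1; }
    return strcmp(a, b);
}

/* Convenience wrapper (hypothetical name). */
static void
sort_values_nulls_last(const char **values, size_t num_values) {
    qsort(values, num_values, sizeof(const char *), compare_nulls_last);
}
```

A comparator for a NOT NULL field could skip the three pointer checks and call strcmp() directly, which is the efficiency argument for keeping the two comparators separate.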

>   * Omit norms, omit TFAP

I'm putting this off for now.  It will be addressed when we refactor for
flexible indexing.

>   * Binary or not (I guess BlobType <-> binary)

BlobType is one binary type, but I propose adding others, e.g. Int32Type.  

Binary() should be an abstract method on the base class.  It shouldn't be a
boolean flag member, because it's not something that can be switched up within
a class.
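
One way to picture the difference between a per-class method and a per-object flag is a hand-rolled method table in C. This is only a sketch under assumptions: the vtable layout is invented, and "StringType" is a hypothetical non-binary class used for contrast. Boilerplater generates its own dispatch machinery, which this does not try to reproduce.

```c
#include <stdbool.h>

typedef struct FieldType FieldType;

/* Per-class method table: one shared instance per class, so the answer
 * to Binary() cannot differ between two objects of the same class. */
typedef struct FieldTypeVTable {
    const char *class_name;
    bool (*binary)(const FieldType *self);  /* "abstract": no default */
} FieldTypeVTable;

struct FieldType {
    const FieldTypeVTable *vtable;
    bool stored;   /* varies per object, so a member var is appropriate */
};

static bool
BlobType_binary(const FieldType *self) {
    (void)self;
    return true;
}

static bool
StringType_binary(const FieldType *self) {
    (void)self;
    return false;
}

static const FieldTypeVTable BLOBTYPE   = { "BlobType",   BlobType_binary };
static const FieldTypeVTable STRINGTYPE = { "StringType", StringType_binary };

/* Dispatch through the class's method table. */
static bool
FieldType_Binary(const FieldType *self) {
    return self->vtable->binary(self);
}
```

Two BlobType objects can disagree about `stored`, but neither can claim to be non-binary without being a different class.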

>   * Term vectors or not, positions, offsets

Term vectors are unique to FullTextType, since it is the only multi-token
field.  Right now in KS, it's a boolean member var in FullTextType.

>   * Stored or not -- toplevel?

Yes.  As a boolean member.

>   * CSF'd or not

Right now, I'd say keep this out of core.

>   * ValueSource is XYZ for this field

I'd like to avoid ValueSource if we can.  I think it's better to add real
binary types like Int32Type, DateStamp32, and so on -- instead of faking them
with strings.

>   * I will use RangeFilter on this field

The "sortable" boolean member var fills this need, no?

>   * Analyzer to use (exposed only FullTextType)

Analyzer should be a required constructor arg to FullTextType.

>   * Extensibility -- so app can enroll new attrs / make new type
>     subclasses

So long as the core performs inheritance checks rather than absolute class
membership checks, subclasses will work fine.
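
The distinction between the two kinds of check can be sketched in a few lines of C. Everything here is assumed for illustration: the ClassInfo struct, the parent-pointer chain, and "MyTextType" as a made-up application subclass of FullTextType.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal class descriptor with a pointer to its parent class. */
typedef struct ClassInfo {
    const char             *name;
    const struct ClassInfo *parent;   /* NULL for the root class */
} ClassInfo;

static const ClassInfo FIELDTYPE    = { "FieldType",    NULL };
static const ClassInfo FULLTEXTTYPE = { "FullTextType", &FIELDTYPE };
static const ClassInfo BLOBTYPE     = { "BlobType",     &FIELDTYPE };
/* Hypothetical user-defined subclass. */
static const ClassInfo MYTEXTTYPE   = { "MyTextType",   &FULLTEXTTYPE };

/* Absolute class membership: fails for any subclass. */
static bool
is_exactly(const ClassInfo *klass, const ClassInfo *target) {
    return klass == target;
}

/* Inheritance check: walk up the parent chain looking for the ancestor. */
static bool
is_a(const ClassInfo *klass, const ClassInfo *ancestor) {
    for ( ; klass != NULL; klass = klass->parent) {
        if (klass == ancestor) {
            return true;
        }
    }
    return false;
}
```

Core code that asks is_a(type, &FULLTEXTTYPE) keeps working when handed a MyTextType, whereas code that asks is_exactly() would silently treat the subclass as something foreign.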

> Remind me again: do custom subclasses get enrolled into the global
> hash in Lucy's core?  I know you had said it's a thread risk, ie, not
> read only...


> I'm still confused.  Say StandardAnalyzer is implemented in C; maybe
> you'd name it Lucy_Analysis_StandardAnalyzer (since C doesn't support
> namespaces you put prefixes in front).

FWIW, the current implementation of Boilerplater only supports two level
namespacing (with nicknames).  Outside of core, fully qualified code would
look like this:

  lucy_StandardAnalyzer *analyzer = lucy_StdAnalyzer_new();
  lucy_Inversion *inversion = Lucy_StdAnalyzer_Transform_Text(analyzer, charbuf);

One of the constraints the two-level limitation imposes is that the last part
of every core class name must be unique.  However, it makes for fully
qualified C names that are just cumbersome rather than unworkably long.

> Any time something in core wants to use that class, it refers to it by
> name (and the C compiler/linker maps it), not via the global hash?

For the most part.  A quick once-over of the KS code seems to indicate that
the exceptions to that rule are all related to Deserialize() and Load().

> But for deserializing a core object, when the deserializer is
> implemented in C, I agree you'd need a global lookup; basically
> because you can't consult the OBJ's symbol table dynamically.  (If you
> have a hosty deserializer, then it would "import lucy; lucy.XXX" to
> find its classes).
> (But it seems like that global hash should be readonly-able).

If we readonly that Hash, we can't add subclasses to it -- and therefore we
won't be able to retrieve their deserializers.
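
The tradeoff can be shown with a toy registry in C. This is a sketch under loud assumptions: a small linear array stands in for the real Hash, the deserializer signature is invented, the two stub deserializers do nothing, "MyAnalyzer" is a hypothetical application subclass, and there is no locking at all, so it says nothing about the thread-safety question.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Invented signature: rebuild an object from serialized bytes. */
typedef void *(*deserialize_fn)(const char *bytes, size_t len);

typedef struct RegistryEntry {
    const char     *class_name;
    deserialize_fn  deserialize;
} RegistryEntry;

#define REGISTRY_MAX 64

static RegistryEntry registry[REGISTRY_MAX];
static size_t        registry_size   = 0;
static bool          registry_frozen = false;

/* Enroll a class.  Fails once the registry is read-only or full. */
static bool
registry_add(const char *class_name, deserialize_fn fn) {
    if (registry_frozen || registry_size >= REGISTRY_MAX) {
        return false;
    }
    registry[registry_size].class_name  = class_name;
    registry[registry_size].deserialize = fn;
    registry_size++;
    return true;
}

/* Look up a deserializer by class name; NULL if the class is unknown. */
static deserialize_fn
registry_fetch(const char *class_name) {
    for (size_t i = 0; i < registry_size; i++) {
        if (strcmp(registry[i].class_name, class_name) == 0) {
            return registry[i].deserialize;
        }
    }
    return NULL;
}

/* Make the registry read-only. */
static void
registry_freeze(void) {
    registry_frozen = true;
}

/* Stand-in deserializers; real ones would reconstruct an object. */
static void *
StdAnalyzer_deserialize(const char *bytes, size_t len) {
    (void)bytes; (void)len;
    return NULL;
}

static void *
MyAnalyzer_deserialize(const char *bytes, size_t len) {
    (void)bytes; (void)len;
    return NULL;
}
```

If core classes are enrolled and the registry is then frozen, a later registry_add() for a subclass is refused and registry_fetch() for that name returns NULL, which is exactly the failure described above.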

Marvin Humphrey
