incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Types and Schemas (was "Sort cache file format")
Date Sun, 12 Apr 2009 13:08:50 GMT
On Sat, Apr 11, 2009 at 10:58:44AM -0400, Michael McCandless wrote:

> Does FieldSpec subdivide the options?  E.g., options about indexing
> could live in their own class, with commonly used constants like "NO".
> This was the motivation of that comment in Lucene (the fact that we
> don't subdivide means suddenly stored-only fields have to figure out
> what to do with omitNorms, omitTFAP booleans; if we had Field.Index.NO
> that'd be better).

Right now, FieldSpec doesn't subdivide, but it's not a least common
denominator, either.  To illustrate: FieldSpec has boolean members for
"indexed", "stored", and "sortable", but knows nothing about Analyzers.
Analyzers are the exclusive province of the FullTextField subclass.
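
A minimal sketch of that split, written in Python purely for illustration
(Lucy's core is C; the default values and the whitespace analyzer below are
assumptions, not Lucy's actual API):

```python
class FieldSpec:
    """Parent type: holds only the options common to every field."""

    def __init__(self, indexed=True, stored=True, sortable=False):
        # Default values are assumptions chosen for this sketch.
        self.indexed = indexed
        self.stored = stored
        self.sortable = sortable


class FullTextField(FieldSpec):
    """Subclass: the only place that knows about Analyzers."""

    def __init__(self, analyzer, **options):
        super().__init__(**options)
        self.analyzer = analyzer


def whitespace_analyzer(text):
    # Hypothetical stand-in for a real Analyzer.
    return text.split()


stored_only = FieldSpec(indexed=False, stored=True)
body = FullTextField(analyzer=whitespace_analyzer)
```

A stored-only field never has to decide what to do with an analyzer, because
the parent class has no such member.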

If you don't permit automatic merging of field types, then there isn't a need
for FieldSpec to know everything about all its subclasses.  I see why
subdividing options might be useful in Lucene, but I'm not sure it's necessary
for Lucy.  
I think it's better OO design for the parent class to be simple rather than

> Well, in Lucene we could better decouple a Field's value from its
> "extended type".  The type would still be attached to the Field's
> value (not to the global schema as in KS), but strongly decoupled &
> shared across Field instances.

That makes sense.  The "extended type" class could look almost identical, but
in Lucene the user would make the connection directly, while in Lucy it
would be made indirectly via the field name.
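
Roughly, the two ways of making the connection could be sketched like this
(Python for illustration only; ExtendedType, Field, and type_of are
hypothetical names, not classes from Lucene or Lucy):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtendedType:
    # Hypothetical shared type object; immutable so it can be reused safely.
    indexed: bool = True
    stored: bool = True


@dataclass
class Field:
    # Direct style: each field instance points at a shared type itself.
    name: str
    value: str
    type: ExtendedType


def type_of(schema, field_name):
    # Indirect style: the type is looked up through the field name.
    return schema[field_name]


TITLE_TYPE = ExtendedType()

# Direct: the user attaches the type when building the field.
direct_field = Field("title", "Types and Schemas", TITLE_TYPE)

# Indirect: a global schema maps names to types; a doc is just name -> value.
schema = {"title": TITLE_TYPE}
doc = {"title": "Types and Schemas"}
```

Either way the same type object ends up governing the "title" field; only the
route to it differs.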

> [A fun aside: Wow I just did a Google search for "javascript self" and
> it offered up respelling to "javascript this" -- they've got one smart
> respeller!]

Haha, awesome. :)

> Lucene in fact implicitly has a global schema in that when segments
> are merged, or when docs are added into a single segment, the schemas
> for each document or segment are "merged" according to certain rules.
> When your index is optimized then you have your global schema.

That's a good way of putting it.

> > Dump them to a JSON-izable data structure.  Include the class name so that you
> > can pick a deserialization routine at load time.
> You rely on the same namespace -> obj mapping being present at
> deserialize time?  I.e., it's the caller's responsibility to import the
> same modules, ensure the names "map" to the same objs (or at least
> compatible ones) as were used during serialization, etc.

If the user has implemented custom subclasses, then yes, the subclasses must be
loaded or you'll get a "class not found" error.
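
As a rough sketch of the idea (Python for illustration; the REGISTRY dict,
the "_class" key, and RegexTokenizer here are assumptions made up for this
example, not Lucy's real serialization format):

```python
import json

# Hypothetical global map from class name to class.
REGISTRY = {}


def register(cls):
    """Record a class under its name so load_any() can find it later."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class RegexTokenizer:
    def __init__(self, pattern):
        self.pattern = pattern

    def dump(self):
        # JSON-izable structure that includes the class name.
        return {"_class": type(self).__name__, "pattern": self.pattern}

    @classmethod
    def load(cls, data):
        return cls(data["pattern"])


def load_any(data):
    """Pick a deserialization routine based on the stored class name."""
    name = data["_class"]
    try:
        cls = REGISTRY[name]
    except KeyError:
        raise LookupError(f"class not found: {name}") from None
    return cls.load(data)


text = json.dumps(RegexTokenizer(r"\w+").dump())
restored = load_any(json.loads(text))
```

If a custom subclass was never registered in the loading process, load_any()
fails with the "class not found" error instead of guessing.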

> Though, for core objects, you would use the global name -> vtable
> mapping that Lucy core maintains?  

Yes.  Any core class would already be loaded.

> (I still don't fully understand why Lucy needs that global hash -- this is
> what namespaces are for).

If we didn't implement it internally, we'd need to implement it in the
bindings for e.g. looking up deserialization routines.  Furthermore, we need
some mechanism for C-level subclassing, since that's not part of the C
language.  No namespaces there.  :)

> OK, so if I've made a custom Tokenizer doing some funky Python code
> instead of a regexp, I could simply implement dump/load to do the
> right thing.


BTW, I saw that Earwin Burrfoot calls his type class "FieldType".  

"FieldType" is probably a better name than "FieldSpec", as it implies
subclasses with "Type" as a suffix: FullTextType, StringType, BlobType,
Int32Type, etc.

Marvin Humphrey
