lucene-dev mailing list archives

From Chris Hostetter
Subject Re: Global field semantics
Date Mon, 10 Jul 2006 09:06:56 GMT

: > Are there good reasons this path has not been followed?
: Hoss, that's your cue.

I must admit, I haven't been able to fully follow this thread, perhaps
it's just because it's late (no, that can't be it ... I started reading it
at 3:30 this afternoon and then stopped because it was making my head
hurt).  In all honesty, I probably would have skimmed the whole thing
without commenting if Marvin hadn't called me out onto the mat -- so I'll
do my best to make sense of it.

As near as I can tell, the large issue can be summarized as follows:

	Performance gains could be realized if Field
	properties were made fixed and homogeneous for
	all Documents in an index.

...I've left this sentiment vague, and I'll ignore the implementation
specifics since I don't understand them -- but there seem to be two high
level approaches involved, which are advocated to varying degrees
by varying folks...

  1) all Fields and their properties must be predeclared before any
     document is ever added to the index, and any Field not declared is
  2) a Field springs into existence the first time a Document is added
     with a value for it -- but after that all newly added Documents with
     a value for that field must conform to the Field properties initially

(have I missed any general approaches?)
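
To make the two approaches concrete, here is a rough sketch in plain Java.
It is not Lucene's API: FieldProps and FieldRegistry are hypothetical names,
and rejecting an undeclared field under approach 1 is an assumption made
only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model (not Lucene's API): the properties a field is locked to.
final class FieldProps {
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldProps(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    boolean sameAs(FieldProps other) {
        return stored == other.stored
            && indexed == other.indexed
            && tokenized == other.tokenized;
    }
}

// Hypothetical registry that can behave like approach 1 or approach 2.
final class FieldRegistry {
    private final Map<String, FieldProps> known = new HashMap<>();
    private final boolean allowFirstUse; // false = approach 1, true = approach 2

    FieldRegistry(boolean allowFirstUse) {
        this.allowFirstUse = allowFirstUse;
    }

    // Approach 1: fields are declared before any document is added.
    void declare(String name, FieldProps props) {
        known.put(name, props);
    }

    // Returns true if a field instance with these properties would be accepted.
    boolean accept(String name, FieldProps props) {
        FieldProps fixed = known.get(name);
        if (fixed == null) {
            if (!allowFirstUse) {
                return false; // ASSUMPTION: an undeclared field is rejected
            }
            known.put(name, props); // approach 2: first use fixes the properties
            return true;
        }
        return fixed.sameAs(props); // later uses must match what was fixed
    }
}
```

Under either mode, once "cover_sheet" has been fixed as indexed-only, a later
document that tries to add it as stored is turned away.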

The questions (in my mind at least) are:

  a) How much performance gain can be realized by these limitations?
  b) Would it be possible to implement these limitations in such a way
     that they are "optional" for people willing to accept the trade off?
  c) if (b) is false, then is (a) great enough to warrant changing Lucene
     anyway?  What exactly is sacrificed?

I can't speak to (a) or (b) ... but I'll throw out some examples for (c)

Regarding #1...

If Fields must be predeclared, Lucene would lose two of the biggest
advantages it has in my opinion:

 * The ability to evolve an index.  To have an extremely large index, and
to add a field to this index that is only used by "new" documents.  This
is not only useful when the nature of your data changes (TPS Reports
didn't use to have a "cover_sheet" field, and now they do) but also when
the usage of an existing field changes and you don't want to rebuild from
scratch (you've always had an indexed "cover_sheet" field, and now you want
it to be stored too ... so you change your index building code, and let it
run for a little while, and then go back and reindex the old stuff later)

 * The ability to have dynamically named fields.  At CNET we have
"attributes" for products, those attributes are defined in a database, and
the list of valid attributes is different based on the type of product.  I
don't know what they all are, and that list could change tomorrow -- and I
don't want to have to rebuild my index from scratch just because someone
decided that laptops need a new attribute called "heat dissipation factor"

Regarding #2...

This approach wouldn't necessarily conflict with the dynamically named
fields example above, but it would suffer the same "evolving index"

Last but not least is the high level issue of "homogeneous" Fields and
Field properties for all documents.  As has been pointed out, in many
cases this is not that big of a deal, because even if you want
heterogeneous documents stored in a single index, you can construct a list
of Fields which is the union of the Fields from your heterogeneous
Documents and use it -- hopefully no new requirement is added that all
Documents must have a value for all fields.  But what about complex
interactions between multi-valued, stored, indexed fields?

How would something like this work?

  docA.add(new Field(f, "bar", Store.YES, Index.UN_TOKENIZED));
  docA.add(new Field(f, "foo", Store.NO,  Index.TOKENIZED));

  docB.add(new Field(f, "x y", Store.YES, Index.TOKENIZED));
  docB.add(new Field(f, "z",   Store.NO,  Index.UN_TOKENIZED));

...both docs have two "Fields" for field name "f", both have a stored
value for f, both have some indexed terms for f, both have
some tokenized terms and one untokenized term for f ... but do these two
docs both conform to the same "Global field semantics" ?
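
One way to see why the answer is not obvious is to compare the two docs
under two possible readings of "conform". The sketch below is plain Java,
not Lucene's API: FieldInstance and Conformance are hypothetical names, and
both readings are assumptions about what a global rule might check. All four
instances above are indexed, so only the stored and tokenized flags are
modeled.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for one Field instance; not Lucene's Field class.
final class FieldInstance {
    final String name;
    final boolean stored;
    final boolean tokenized;

    FieldInstance(String name, boolean stored, boolean tokenized) {
        this.name = name;
        this.stored = stored;
        this.tokenized = tokenized;
    }

    // Exact combination of properties on this one instance.
    String signature() {
        return (stored ? "stored" : "unstored")
            + "+" + (tokenized ? "tokenized" : "untokenized");
    }
}

final class Conformance {
    // Loose reading: which properties appear on ANY instance of the field.
    static Set<String> unionOfTraits(List<FieldInstance> doc, String name) {
        Set<String> traits = new HashSet<>();
        for (FieldInstance f : doc) {
            if (!f.name.equals(name)) {
                continue;
            }
            traits.add(f.stored ? "stored" : "unstored");
            traits.add(f.tokenized ? "tokenized" : "untokenized");
        }
        return traits;
    }

    // Strict reading: the set of exact per-instance property combinations.
    static Set<String> signatures(List<FieldInstance> doc, String name) {
        Set<String> sigs = new HashSet<>();
        for (FieldInstance f : doc) {
            if (f.name.equals(name)) {
                sigs.add(f.signature());
            }
        }
        return sigs;
    }
}
```

With docA and docB modeled this way, the loose reading treats them as the
same (each has stored, unstored, tokenized and untokenized instances of f),
while the strict reading does not (docA pairs stored with untokenized, docB
pairs stored with tokenized).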

