Marvin Humphrey
Re: [lucy-dev] Schema for searching IRC logs
Mon, 21 Feb 2011 01:18:44 GMT
On Sun, Feb 20, 2011 at 10:46:33PM +0100, Moritz Lenz wrote:
> Is the kinosearch list still in use then? If yes, what for?

Once we get an initial release of Lucy out, none of this will matter.  When
Lucy entered the Incubator, we'd hoped to make a release quickly enough that
it wouldn't be necessary to plan for the closure of KS resources prior to the
existence of an official Lucy tarball -- we'd release Lucy, then deal with
deprecating KS after that.  To be frank, I didn't think it was going to take
this long to resolve all of the dependency licensing issues that are currently
holding us up, but lots of commits are going in and we'll get there.

> > As for your issues below, why not aggregate all lines with a particular
> > user (and set of timestamps) into a single Document with multi-valued
> > fields for timestamp and for line? Would that help?
> I haven't come across multi-valued fields yet. Where are they documented?

We don't have "multi-valued fields" -- that's a Lucene-only thing.  I strongly
dislike that quirk of Lucene and consider it a deceptive misfeature.  For
example, it takes a while for users to realize that you don't get multi-valued
sorting with Lucene's multi-valued fields (the first term is used, even if it
wasn't the term that matched).

You can fake up something like a multi-valued field in Lucy/KS using a custom
analyzer that splits the field value on a delimiter:

  my $pipe_splitter = Lucy::Analysis::Tokenizer->new(pattern => '[^|]+');
  my $field_type = Lucy::Plan::FullTextType->new(analyzer => $pipe_splitter);
  $schema->spec_field(name => 'nick', type => $field_type);
  $doc->{nick} = join('|', @nicks);

Note that if you make that "nick" field sortable, Lucy/KS will use the
*entire* string field value to determine sort order.  In other words, if the
value for a document is "chromatic|moritz" and the term "moritz" matches, it
will still be sorted by chromatic's nick first.
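
To make the sort behavior concrete, here is a rough sketch in plain Perl. It
does not use Lucy/KS at all: the document list, the ids, and the
matches_term() helper are made up for illustration. It only mimics the idea of
tokenizing on '[^|]+' for matching while sorting on the whole joined string.

```perl
use strict;
use warnings;

# Hypothetical documents; each "nick" value is a pipe-joined list of nicks.
my @docs = (
    { id => 1, nick => join('|', qw(chromatic moritz)) },
    { id => 2, nick => join('|', qw(marvin)) },
    { id => 3, nick => join('|', qw(moritz)) },
);

# Stand-in for matching: split the value with the same pattern the
# tokenizer uses ('[^|]+') and check whether any token equals the term.
sub matches_term {
    my ($doc, $term) = @_;
    my @tokens = $doc->{nick} =~ /[^|]+/g;
    return scalar grep { $_ eq $term } @tokens;
}

# Docs 1 and 3 both contain the term "moritz".
my @hits = grep { matches_term($_, 'moritz') } @docs;

# Stand-in for sorting: compare the *entire* field value, not the
# token that matched.
my @sorted     = sort { $a->{nick} cmp $b->{nick} } @hits;
my @sorted_ids = map  { $_->{id} } @sorted;

print "@sorted_ids\n";    # prints "1 3"
```

Document 1 sorts ahead of document 3 because "chromatic|moritz" compares lower
than "moritz" as a whole string, even though both matched on "moritz".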

Marvin Humphrey

