incubator-lucy-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: [lucy-dev] Schema for searching IRC logs
Date Sun, 20 Feb 2011 18:33:20 GMT
Hi Moritz,

Thanks for your email. I would suggest in general that Lucy *is* the place that you should
come to for KinoSearch support since Apache Lucy is now where the developers of KinoSearch
are, and since Apache Lucy is what KinoSearch has evolved into.

As for your issues below, why not aggregate all lines from a particular user (and set of timestamps)
into a single Document with multi-valued fields for timestamp and for line? Would that help?
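
A minimal sketch of that aggregation step in plain Perl, in case it helps to make the idea concrete. It only groups rows before indexing and makes no KinoSearch calls. The sample rows, the aggregate_rows helper, and the choice of (channel, day, nick) as the grouping key are assumptions for illustration. How the resulting arrays map onto index fields is left open, since that depends on whether the index supports multi-valued fields.

```perl
use strict;
use warnings;

# Hypothetical sample rows; in practice these would come from the log database.
my @rows = (
    { id => 1, channel => 'perl6', day => '2011-02-19', timestamp => 1298073600,
      nick => 'alice', line => 'hello' },
    { id => 2, channel => 'perl6', day => '2011-02-19', timestamp => 1298073660,
      nick => 'bob',   line => 'hi alice' },
    { id => 3, channel => 'perl6', day => '2011-02-19', timestamp => 1298073720,
      nick => 'alice', line => 'how are you?' },
);

# Group rows into one aggregated document per (channel, day, nick).
# Each document keeps parallel arrays of timestamps and lines.
sub aggregate_rows {
    my @input = @_;
    my %docs;
    for my $row (@input) {
        my $key = join '/', @{$row}{qw(channel day nick)};
        my $doc = $docs{$key} //= {
            channel   => $row->{channel},
            day       => $row->{day},
            nick      => $row->{nick},
            timestamp => [],
            line      => [],
        };
        push @{ $doc->{timestamp} }, $row->{timestamp};
        push @{ $doc->{line} },      $row->{line};
    }
    return \%docs;
}

my $docs = aggregate_rows(@rows);
for my $key ( sort keys %$docs ) {
    printf "%s: %d line(s)\n", $key, scalar @{ $docs->{$key}{line} };
}
```

Grouping by (channel, day) alone instead would give one document per log page, which is closer to the page-level grouping asked about below.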

Cheers,
Chris

On Feb 20, 2011, at 10:01 AM, Moritz Lenz wrote:

> (originally I sent this mail to the kinosearch mailing list, but since
> it's temporarily down Marvin suggested I send this to lucy-dev instead.
> Please excuse me if it's not quite on topic here).
> 
> Hi all,
> 
> I've been running public IRC logs for a few years now, and have decided
> to replace the crappy search with something decent. So, KinoSearch it is :-)
> 
> One page of these logs contains the conversation from one channel at one
> particular day, and each such page contains many rows consisting of an
> ID, a timestamp, a nickname, and the line that was being uttered.
> Example: http://irclog.perlgeek.de/perl6/2011-02-19. (Currently I have
> about 20 channels, a few years' worth of logs and 4 million rows; I want
> to be able to scale up to maybe 20 million rows.)
> 
> I want my search results to be grouped similarly, so my current schema
> looks like this:
> 
> my $schema      = KinoSearch::Plan::Schema->new;
> my $poly_an     = KinoSearch::Analysis::PolyAnalyzer->new(language => 'en');
> my $full_text   = KinoSearch::Plan::FullTextType->new(
>                    analyzer => $poly_an,
>                    stored   => 0,
>                  );
> my $string      = KinoSearch::Plan::StringType->new( stored => 0 );
> my $kept_string = KinoSearch::Plan::StringType->new( stored => 1, sortable => 1 );
> my $sort_string = KinoSearch::Plan::StringType->new( stored => 0, sortable => 1 );
> 
> $schema->spec_field(name => 'line',     type => $full_text);
> $schema->spec_field(name => 'nick',     type => $string);
> $schema->spec_field(name => 'channel',  type => $kept_string);
> $schema->spec_field(name => 'day',      type => $kept_string);
> $schema->spec_field(name => 'timestamp',type => $sort_string);
> $schema->spec_field(name => 'id',       type => $kept_string);
> 
> Having each line as a separate document has three disadvantages:
> 
> 1) when displaying the results, I have to construct the context manually
> (so I need to hit the DB to get the rows before and after, which is why
> I don't store the line in the index)
> 
> 2) when paging the search results, I rip apart the last page, because
> the num_wanted option works with rows, not pages.
> 
> 3) not sure about this one, but it feels like this solution doesn't
> scale well. I've waited more than half a minute for a query that was
> limited to 100 rows. (Maybe my three sort_specs hurt here?)
> 
> Is there a way to construct my schema that avoids these problems
> (and still allows searching by field)? Something like sub-documents,
> where I have pages as top-level documents, and each page can have
> multiple rows?
> 
> Cheers,
> Moritz


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

