incubator-lucy-dev mailing list archives

From Moritz Lenz <mor...@faui2k3.org>
Subject Re: [lucy-dev] Schema for searching IRC logs
Date Sun, 20 Feb 2011 21:46:33 GMT
Hello Chris,

thanks for your swift reply.

On 02/20/2011 07:33 PM, Mattmann, Chris A (388J) wrote:

> Thanks for your email. I would suggest in general that Lucy *is* the place that you should
> come to for KinoSearch support since Apache Lucy is now where the developers of KinoSearch
> are, and since Apache Lucy is what KinoSearch has evolved into.

Works for me.
Is the kinosearch list still in use then? If yes, what for?

> As for your issues below, why not aggregate all lines with a particular user (and set
> of timestamps) into a single Document with multi-valued fields for timestamp and for line?
> Would that help?

I haven't come across multi-valued fields yet. Where are they documented?

Also, if I put all lines from one user into a Document, I still have to
manually reconstruct the context (that's not too bad, but not optimal
either). And will I be able to retrieve the ID of a found line somehow?
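
A minimal sketch of that manual step, in plain Perl with no KinoSearch calls. It assumes an aggregated document is represented as a hash with parallel arrays (one entry per row), so that a row's position gives both its ID and its neighbours. The record layout, the name matching_rows, and the simple substring match are all hypothetical and only stand in for whatever the index would actually return; the sketch groups rows by page (channel and day) purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical aggregated record: one channel on one day, with parallel
# arrays so that $page->{ids}[$i] belongs to $page->{lines}[$i].
my $page = {
    channel => 'perl6',
    day     => '2011-02-19',
    ids     => [ 101, 102, 103, 104 ],
    nicks   => [ 'alice', 'bob', 'alice', 'carol' ],
    lines   => [
        'good morning',
        'has anyone tried the new search?',
        'not yet',
        'the search index is rebuilding',
    ],
};

# Return one hash per row whose text contains $term (case-insensitive),
# holding that row's ID and the IDs of up to $radius rows on either side.
sub matching_rows {
    my ( $page, $term, $radius ) = @_;
    my $last = $#{ $page->{lines} };
    my @found;
    for my $i ( 0 .. $last ) {
        next unless index( lc( $page->{lines}[$i] ), lc($term) ) >= 0;
        my $from = $i - $radius < 0     ? 0     : $i - $radius;
        my $to   = $i + $radius > $last ? $last : $i + $radius;
        push @found, {
            id      => $page->{ids}[$i],
            context => [ @{ $page->{ids} }[ $from .. $to ] ],
        };
    }
    return @found;
}

my @hits = matching_rows( $page, 'search', 1 );
printf "%s: context rows %s\n", $_->{id}, join( ',', @{ $_->{context} } )
    for @hits;
```

The point of the parallel arrays is that the context comes from the same record as the hit, so no second database lookup is needed. The cost is a rescan of the matched record to find which rows actually matched.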

Cheers,
Moritz

> Cheers,
> Chris
> 
> On Feb 20, 2011, at 10:01 AM, Moritz Lenz wrote:
> 
>> (originally I sent this mail to the kinosearch mailing list, but since
>> it's temporarily down Marvin suggested I send this to lucy-dev instead.
>> Please excuse me if it's not quite on topic here).
>> 
>> Hi all,
>> 
>> I've been running public IRC logs for a few years now, and have decided
>> to replace the crappy search with something decent. So, KinoSearch it is :-)
>> 
>> One page of these logs contains the conversation from one channel at one
>> particular day, and each such page contains many rows consisting of an
>> ID, a timestamp, a nickname, and the line that was being uttered.
>> Example: http://irclog.perlgeek.de/perl6/2011-02-19. (Currently I have
>> about 20 channels, a few years' worth of logs and 4 million rows; I want
>> to be able to scale up to maybe 20 million rows.)
>> 
>> I want my search results to be grouped similarly, so my current schema
>> looks like this:
>> 
>> my $schema      = KinoSearch::Plan::Schema->new;
>> my $poly_an     = KinoSearch::Analysis::PolyAnalyzer->new(language => 'en');
>> my $full_text   = KinoSearch::Plan::FullTextType->new(
>>                    analyzer => $poly_an,
>>                    stored   => 0,
>>                  );
>> my $string      = KinoSearch::Plan::StringType->new( stored => 0 );
>> my $kept_string = KinoSearch::Plan::StringType->new(
>>                    stored   => 1,
>>                    sortable => 1,
>>                  );
>> my $sort_string = KinoSearch::Plan::StringType->new(
>>                    stored   => 0,
>>                    sortable => 1,
>>                  );
>> 
>> $schema->spec_field(name => 'line',     type => $full_text);
>> $schema->spec_field(name => 'nick',     type => $string);
>> $schema->spec_field(name => 'channel',  type => $kept_string);
>> $schema->spec_field(name => 'day',      type => $kept_string);
>> $schema->spec_field(name => 'timestamp',type => $sort_string);
>> $schema->spec_field(name => 'id',       type => $kept_string);
>> 
>> Having each line as a separate document has three disadvantages:
>> 
>> 1) when displaying the results, I have to construct the context manually
>> (so I need to hit the DB to get the rows before and after, which is why
>> I don't store the line in the index)
>> 
>> 2) when paging the search results, I rip apart the last page, because
>> the num_wanted option works with rows, not pages.
>> 
>> 3) not sure about this one, but it feels like this solution doesn't
>> scale well. I've waited more than half a minute for a query that was
>> limited to 100 rows. (Maybe my three sort_specs hurt here?)
>> 
>> Is there a way to construct my schema that avoids these problems
>> (and still allows searching by field)? Something like sub-documents,
>> where I have pages as top-level documents, and each page can have
>> multiple rows?
>> 
>> Cheers,
>> Moritz
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
