lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Schema for searching IRC logs
Date Sun, 20 Feb 2011 23:53:30 GMT
On Sun, Feb 20, 2011 at 07:01:46PM +0100, Moritz Lenz wrote:
> Having each line as a separate document has three disadvantages:
> 
> 1) when displaying the results, I have to construct the context manually
> (so I need to hit the DB to get the rows before and after, which is why
> I don't store the line in the index)

You could theoretically store the entire page with each line.  However, that
would waste space thanks to the redundancy, and so it's probably better to
store the pages in a separate data structure (RDBMS, Berkeley DB, etc), and
retrieve them after getting the results back from the index.

Alternately, consider providing less context: just the lines before and after.
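To make the "lines before and after" idea concrete, here is a minimal sketch. It assumes the log lines live in some ordered external store; the in-memory array and the helper name context_for are stand-ins I made up for illustration, not part of Lucy.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the external store (RDBMS, Berkeley DB).
my @log_lines = (
    'moritz: hello',
    'chromatic: hi there',
    'moritz: any news on the index?',
    'chromatic: still building',
    'moritz: ok, thanks',
);

# Return the hit line plus up to $radius lines on either side,
# clamped to the bounds of the log.
sub context_for {
    my ( $lines, $hit_index, $radius ) = @_;
    my $first = $hit_index - $radius;
    $first = 0 if $first < 0;
    my $last = $hit_index + $radius;
    $last = $#$lines if $last > $#$lines;
    return @{$lines}[ $first .. $last ];
}

# Show the hit at index 2 with one line of context on each side.
print "$_\n" for context_for( \@log_lines, 2, 1 );
```

With a real database the slice would become a range query on whatever line id you store in the index.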

> 2) when paging the search results, I rip apart the last page, because
> the num_wanted option works with rows, not pages.

I don't quite grok what you mean.  However, I can certainly see how there
would be difficulties if you want to display results broken up by "page", but
your engine returns results broken up by "line": you'll have to post-process
the line-based hits to collate the page-based results.  That gets very
complicated as soon as you go past the first SERP.  (SERP = "Search Engine
Results Page", distinct from how you're using the word "page" to describe IRC
log content).
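A rough sketch of that post-processing step, assuming each line-level hit carries a stored 'page_id' field (a hypothetical field name; the hit hashes below are hand-made sample data rather than real Lucy hit objects):

```perl
use strict;
use warnings;

# Hypothetical line-level hits, already in rank order.
my @hits = (
    { page_id => 'p7', line => 'first match' },
    { page_id => 'p3', line => 'second match' },
    { page_id => 'p7', line => 'third match' },
    { page_id => 'p9', line => 'fourth match' },
    { page_id => 'p3', line => 'fifth match' },
);

# Group hits by page, keeping pages in the order of their best-ranked line.
sub collate_by_page {
    my ($hits) = @_;
    my ( @pages, %seen );
    for my $hit (@$hits) {
        my $id = $hit->{page_id};
        if ( !exists $seen{$id} ) {
            $seen{$id} = { page_id => $id, lines => [] };
            push @pages, $seen{$id};
        }
        push @{ $seen{$id}{lines} }, $hit->{line};
    }
    return \@pages;
}

my $pages = collate_by_page( \@hits );
printf "%s: %d line(s)\n", $_->{page_id}, scalar @{ $_->{lines} } for @$pages;
```

The trouble past the first SERP is visible here: five line hits collapsed into three pages, so you can't know in advance how many line hits to request in order to fill a SERP with a fixed number of pages.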

> 3) not sure about this one, but it feels that this solution doesn't
> scale well. I've waited more than half a minute for a query that was
> limited to 100 rows. 

The two fundamentals when optimizing for search speed are RAM and posting list
size.

First, you need enough RAM on the box to fit all of the important index
components (lexicons, posting lists, and sort caches) into the OS cache.  With
millions of records, you are really going to feel it if you are hitting the hard
disk.

Second, slow queries are almost always slow because some part of the query
matches a very large number of documents -- or to put it into our native
terminology, at least one term has a very large posting list.  Even if the
complete query doesn't match very many documents, it's possible that a
sub-section of the query is slowing things down.  Thus, the process of query
optimization generally involves finding ways to match fewer documents.
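As a toy illustration of why one common term can dominate the cost, here is a sketch using a hand-built inverted index. The hash layout and the helper posting_sizes are invented for this example and say nothing about how Lucy stores its posting lists internally.

```perl
use strict;
use warnings;

# Toy inverted index: term => sorted list of doc ids (a "posting list").
my %index = (
    the  => [ 1 .. 100_000 ],
    lucy => [ 42, 4_711, 99_999 ],
);

# Report the posting list size for each query term, largest first.
sub posting_sizes {
    my ( $index, @terms ) = @_;
    my %size;
    for my $term (@terms) {
        my $postings = $index->{$term} || [];
        $size{$term} = scalar @$postings;
    }
    my @sorted = sort { $size{$b} <=> $size{$a} || $a cmp $b } @terms;
    return map { [ $_, $size{$_} ] } @sorted;
}

printf "%-6s %d\n", @$_ for posting_sizes( \%index, qw( lucy the ) );
```

A query for "the lucy" matches at most three documents, yet a naive matcher still has to walk a 100,000-entry list for "the".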

Try running this code and see if anything stands out as likely to produce a
large result set:

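    use Data::Dumper;
    # $query_parser and $query_string are whatever your search code already uses.
    my $query = $query_parser->parse($query_string);
    # Dump the parsed query tree to STDERR so each sub-query is visible.
    warn Dumper($query->dump);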

> (Maybe my three sort_specs hurt here?)

Almost certainly not.  Search-time sorting in Lucy/KinoSearch is very fast; we
spend a fair amount of effort building up optimized data structures to support
sorting at index-time.  Thanks to that approach, if anything, the costs of
making additional fields sortable are felt at index-time, not search-time.

> Is there a way to construct my schema in a way to avoid these problems
> (and still allows searching by field)? Something like sub-documents,
> where I have pages as top level documents, and each page can have
> multiple rows?

If I understand correctly, there seem to be inherent difficulties with the
one-to-many relationships in that approach.

If you organize documents by "page", and each "page" has multiple values for
the 'nick' field, you are going to get false positives when filtering by
'nick'.  For instance, if both "chromatic" and "moritz" have authored lines on
a given page, then a filter on "moritz" will fail to exclude nearby content
authored by "chromatic".
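Here is a small simulation of that false positive in plain Perl. The data structures and the two match helpers are made up for illustration; they mimic a page-level document with a multi-valued 'nick' field and do not use the Lucy API.

```perl
use strict;
use warnings;

# Toy data: two lines from different authors that end up on the same page.
my @lines = (
    { nick => 'chromatic', text => 'the parser is broken' },
    { nick => 'moritz',    text => 'I will look at it' },
);

# Page-level document: 'nicks' is multi-valued, 'text' is all lines joined.
my %page = (
    nicks => { map { ( $_->{nick} => 1 ) } @lines },
    text  => join( ' ', map { $_->{text} } @lines ),
);

# Page-level match: the nick and the word only need to occur somewhere on the page.
sub page_matches {
    my ( $page, $nick, $word ) = @_;
    return $page->{nicks}{$nick} && $page->{text} =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# Line-level match: the nick and the word must occur on the same line.
sub line_matches {
    my ( $line, $nick, $word ) = @_;
    return $line->{nick} eq $nick && $line->{text} =~ /\b\Q$word\E\b/ ? 1 : 0;
}

my $page_hit  = page_matches( \%page, 'moritz', 'parser' );
my $line_hits = grep { line_matches( $_, 'moritz', 'parser' ) } @lines;
print "page-level hit: $page_hit, line-level hits: $line_hits\n";
```

The page matches a search for "parser" filtered on "moritz", even though only "chromatic" used that word.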

Similarly, if you organize documents by page, each now has multiple
'timestamp' values.  How do you know which line within the page caused the
hit, and thus which associated timestamp the result should sort by?

I think the only way to achieve the ideal logical result you've described is
to organize the index with "line" as the top-level document.  However, there
is no question that organizing by "page" would drastically cut down the size
of the posting lists which are being iterated, improving search speed.

Would it be acceptable to modify the spec?

  * Pages are top-level documents.
  * Each page is associated with one timestamp -- the time of the first line.
  * No page can cross multiple days.
  * Pages can have multiple values for 'nick', so filtering on 'nick' limits
    results to pages that an author has participated in, rather than lines
    they've written.

Marvin Humphrey

