lucy-dev mailing list archives

From Moritz Lenz <mor...@faui2k3.org>
Subject Re: [lucy-dev] Schema for searching IRC logs
Date Sun, 27 Feb 2011 10:06:31 GMT
On 02/21/2011 12:53 AM, Marvin Humphrey wrote:
> On Sun, Feb 20, 2011 at 07:01:46PM +0100, Moritz Lenz wrote:
>> Having each line as a separate document has three disadvantages:
>> 
>> 1) when displaying the results, I have to construct the context manually
>> (so I need to hit the DB to get the rows before and after, which is why
>> I don't store the line in the index)
> 
> You could theoretically store the entire page with each line.  However, that
> would waste space thanks to the redundancy, and so it's probably better to
> store the pages in a separate data structure (RDBMS, Berkeley DB, etc), and
> retrieve them after getting the results back from the index.
> 
> Alternately, consider providing less context: just the lines before and after.

Makes sense, thank you.
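The side-store approach suggested above can be sketched roughly like this, assuming a SQLite table of log lines keyed by page and line number (the table layout and function names here are hypothetical, not part of Lucy):

```python
import sqlite3

def make_store():
    # Hypothetical side store: one row per IRC log line, keyed by page
    # and line number.  The search index would store only the keys; the
    # display text comes from this table.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE lines (page_id INTEGER, line_no INTEGER, "
               "nick TEXT, text TEXT, PRIMARY KEY (page_id, line_no))")
    return db

def context(db, page_id, line_no, radius=1):
    """Fetch a hit line plus `radius` lines before and after it."""
    return db.execute(
        "SELECT line_no, nick, text FROM lines "
        "WHERE page_id = ? AND line_no BETWEEN ? AND ? ORDER BY line_no",
        (page_id, line_no - radius, line_no + radius)).fetchall()
```

With radius=1 this yields exactly the reduced context mentioned above: just the line before and the line after the hit.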

>> 2) when paging the search results, I rip apart the last page, because
>> the num_wanted option works with rows, not pages.
> 
> I don't quite grok what you mean.  However, I can certainly see how there
> would be difficulties if you want to display results broken up by "page", but
> your engine returns results broken up by "line": you'll have to post-process
> the line-based hits to collate the page-based results.  That gets very
> complicated as soon as you go past the first SERP.  (SERP = "Search Engine
> Results Page", distinct from how you're using the word "page" to describe IRC
> log content).

That's exactly what I meant; I just didn't describe it well enough.
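The post-processing described above -- collating line-level hits into page-level results -- might look like this minimal sketch, with plain tuples standing in for real Lucy hit objects:

```python
def collate_pages(line_hits, page_offset, pages_wanted):
    """Group line-level hits into page-level results, preserving the
    order in which pages first appear, then slice out one SERP's worth.

    line_hits: iterable of (page_id, line_no) tuples in score order.
    """
    pages = []       # page_ids in order of first appearance
    by_page = {}     # page_id -> list of matching line numbers
    for page_id, line_no in line_hits:
        if page_id not in by_page:
            pages.append(page_id)
            by_page[page_id] = []
        by_page[page_id].append(line_no)
    serp = pages[page_offset:page_offset + pages_wanted]
    return [(p, by_page[p]) for p in serp]
```

The complication past the first SERP is visible here: to render page-offset n you must first fetch enough line hits to cover n full pages of results, and how many line hits that takes cannot be known in advance.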

>> 3) not sure about this one, but it feels like this solution doesn't
>> scale well. I've waited more than half a minute for a query that was
>> limited to 100 rows.
> 
> The two fundamentals when optimizing for search speed are RAM and posting list
> size.
> 
> First, you need enough RAM on the box to fit all of the important index
> components (lexicons, posting lists, and sort caches) into the OS cache.  With
> millions of records, you are really going to feel it if you are hitting the
> hard disk.
> 
> Second, slow queries are almost always slow because some part of the query
> matches a very large number of documents -- or to put it into our native
> terminology, at least one term has a very large posting list. 

That is certainly the case with my slow queries. In fact, one example I
remember was two AND-connected terms that each produced a large number of
results when run separately.
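The cost is easy to see in a toy model: an AND query has to walk both posting lists, so its running time grows with the list sizes even when the intersection is tiny. A minimal merge-style intersection over sorted doc-id lists (a sketch of the general technique, not Lucy's actual implementation):

```python
def intersect(postings_a, postings_b):
    """Merge-intersect two sorted posting lists of doc ids.

    Every doc id in both lists is visited once, so the work is
    O(len(a) + len(b)) no matter how small the result is.
    """
    out, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        a, b = postings_a[i], postings_b[j]
        if a == b:
            out.append(a)
            i += 1
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
    return out
```

Two common terms AND-ed together therefore cost roughly the sum of their posting list lengths, which matches the slow queries described above.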

>> (Maybe my three sort_specs hurt here?)
> 
> Almost certainly not.  Search-time sorting in Lucy/KinoSearch is very fast; we
> spend a fair amount of effort building up optimized data structures to support
> sorting at index-time.  Thanks to that approach, if anything, the costs of
> making additional fields sortable are felt at index-time, not search-time.

Great.

>> Is there a way to construct my schema in a way to avoid these problems
>> (and still allows searching by field)? Something like sub-documents,
>> where I have pages as top level documents, and each page can have
>> multiple rows?
> 
> If I understand correctly, there seem to be inherent difficulties with the
> one-to-many relationships in that approach.
> 
> If you organize documents by "page", and each "page" has multiple values for
> the 'nick' field, you are going to get false positives when filtering by
> 'nick'.  For instance, if both "chromatic" and "moritz" have authored lines on
> a given page, then a filter on "moritz" will fail to exclude nearby content
> authored by "chromatic".
> 
> Similarly, if you organize documents by page, each now has multiple
> 'timestamp' values.  How do you know which line within the page caused the
> hit, and thus which associated timestamp the result should sort by?
> 
> I think the only way to achieve the ideal logical result you've described is
> to organize the index with "line" as the top-level document.  However, there
> is no question that organizing by "page" would drastically cut down the size
> of the posting lists which are being iterated, improving search speed.
> 
> Would it be acceptable to modify the spec?
> 
>   * Pages are top-level documents.
>   * Each page is associated with one timestamp -- the time of the first line.
>   * No page can cross multiple days.
>   * Pages can have multiple values for 'nick', so filtering on 'nick' limits
>     results to pages that an author has participated in, rather than lines
>     they've written.
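Under that modified spec, collapsing a page's lines into one indexable document might look like this sketch (plain dicts standing in for Lucy documents; the field names are hypothetical):

```python
def page_document(page_id, lines):
    """Build one page-level document from a page's log lines.

    lines: list of (timestamp, nick, text) tuples in log order.
    All lines are assumed to fall on the same day, per the spec.
    """
    timestamps = [ts for ts, _, _ in lines]
    return {
        "page_id": page_id,
        "timestamp": min(timestamps),  # time of the page's first line
        "nicks": sorted({nick for _, nick, _ in lines}),  # multi-valued
        "content": "\n".join(text for _, _, text in lines),
    }
```

Note the trade-off captured in the 'nicks' field: it records who participated in the page, so a filter on it matches pages an author took part in, not the individual lines they wrote.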

I fear that this makes the search too unspecific. I often use the search
when I have a vague memory like "didn't TimToday say something about the
problem with submethod BUILD?", but since he's active almost every day,
narrowing down the search by nickname would lose nearly all of its value.

I'll continue pondering the problem and trying out things, maybe I'll
find a better solution.

In the meantime, I'd like to thank everybody for their helpful input and
for the work on KinoSearch/Lucy.

Cheers,
Moritz
