lucy-dev mailing list archives

From Peter Karman
Subject Re: [lucy-dev] Schema for searching IRC logs
Date Mon, 21 Feb 2011 04:29:07 GMT
Moritz Lenz wrote on 2/20/11 12:01 PM:

> One page of these logs contains the conversation from one channel at one
> particular day, and each such page contains many rows consisting of an
> ID, a timestamp, a nickname, and the line that was being uttered.
> Example: (Currently i have
> about 20 channels, a few years worth of logs and 4 million rows; I want
> to be able to scale up to maybe 20 million rows)
> I want my search results to be grouped similarly, so my current schema
> looks like this:

When I've done similar projects, I eventually ask myself: what is the smallest
unit I want to represent as a "result"? In this case, is it actually the row, or
the page of rows? I.e., start from the visual idea you want and work backwards.
It seems like you are doing that (you want to group your results similarly) --
but what does "similar" mean? Same page? Same channel? Etc.

One approach I have taken is to build multiple indexes, each with a different
unit of granularity. E.g., page-level index and a row-level index. Then my
search code first executes on the row-level index for its field-specificity, and
then pulls out displayed results from the page-level index, in order to get the
context. It's like hitting the db (as you mention) but usually faster because
the de-normalizing of the rows has already taken place at index build-time.
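
A minimal sketch of that two-index idea, using plain in-memory Python
structures rather than any real Lucy or KinoSearch API. The sample rows, the
field names ("nick", "text") and the helper names are all hypothetical; the
point is only to show matching at row granularity and displaying at page
granularity.

```python
from collections import defaultdict

# Hypothetical sample rows: (row_id, page_id, nick, text).
# A page id stands for one channel on one day.
ROWS = [
    (1, "chan-a/2011-02-20", "alice", "how do I index irc logs"),
    (2, "chan-a/2011-02-20", "bob", "build two indexes"),
    (3, "chan-b/2011-02-21", "alice", "the schema works now"),
]

def build_row_index(rows):
    """Row-level inverted index: (field, term) -> set of row ids."""
    index = defaultdict(set)
    for row_id, _page_id, nick, text in rows:
        index[("nick", nick)].add(row_id)
        for term in text.lower().split():
            index[("text", term)].add(row_id)
    return index

def build_page_index(rows):
    """Page-level store: page id -> every row on that page.

    The grouping (de-normalizing) happens once, at build time.
    """
    pages = defaultdict(list)
    for row_id, page_id, nick, text in rows:
        pages[page_id].append((row_id, nick, text))
    return pages

def search(row_index, page_index, row_to_page, field, term):
    """Match against the row index, then pull whole pages for context."""
    hits = row_index.get((field, term), set())
    results = {}
    for row_id in sorted(hits):
        page_id = row_to_page[row_id]
        entry = results.setdefault(
            page_id, {"matched": [], "rows": page_index[page_id]}
        )
        entry["matched"].append(row_id)
    return results

row_index = build_row_index(ROWS)
page_index = build_page_index(ROWS)
row_to_page = {row_id: page_id for row_id, page_id, _nick, _text in ROWS}

# Field-specific query on the row index; results come back grouped by page.
grouped = search(row_index, page_index, row_to_page, "nick", "alice")
```

Each entry in `grouped` carries the ids of the rows that matched plus the
full page they belong to, so the display code never has to go back to the
database for context.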

> Is there a way to construct my schema in a way to avoid these problems
> (and still allows searching by field)? Something like sub-documents,
> where I have pages as top level documents, and each page can have
> multiple rows?

I de-normalize my db to XML files, using XInclude (<xi:include>) tags to
represent the one-to-many relationships. So in this case, I would create a
page.xml:

  <xi:include href="path/to/row1.xml" />

and then create 2 indexes, one pointed at the page xml and one at the row xml.
(This is with SWISH::Prog::KSx and swish3.)

You wouldn't even have to have two indexes, if you didn't ever want to return
row-level results specifically.
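
As a rough illustration of generating such a page.xml, here is a sketch using
only Python's standard library. The <page> element, its channel/date
attributes, and the row file paths are assumptions made for the example; they
are not a layout that swish3 or SWISH::Prog::KSx is known to require.

```python
import xml.etree.ElementTree as ET

# Standard XInclude namespace; register it so output uses the "xi:" prefix.
XI_NS = "http://www.w3.org/2001/XInclude"
ET.register_namespace("xi", XI_NS)

def build_page_xml(channel, date, row_paths):
    """Return a page document that references each row file via xi:include.

    The element and attribute names are hypothetical.
    """
    page = ET.Element("page", {"channel": channel, "date": date})
    for path in row_paths:
        ET.SubElement(page, f"{{{XI_NS}}}include", {"href": path})
    return ET.tostring(page, encoding="unicode")

page_xml = build_page_xml(
    "chan-a", "2011-02-20", ["rows/1.xml", "rows/2.xml"]
)
```

The row files stay the single source for row content; the page document only
points at them, so the one-to-many relationship is expressed once.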

Peter Karman
