lucene-solr-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Indexing books, chapters and pages
Date Tue, 01 Mar 2016 16:44:32 GMT
The chapter seems like the optimal unit for initial searches - just combine
the page text with a line break between them or index as a multivalued
field and set the position increment gap to be 1 so that phrases work.
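A minimal sketch of the combined-chapter idea, in plain Python rather than any Solr API. It joins page texts with a line break and records where each page starts, so that a match found in the chapter text can be mapped back to the page or pages it touches. The helper names (`build_chapter`, `find_phrase`, `pages_for_match`) are hypothetical, and it assumes character offsets of a match are available from somewhere; here a local regex search stands in for that.

```python
import re
from bisect import bisect_right


def build_chapter(pages):
    """Join page texts with a line break; return (chapter_text, page_start_offsets)."""
    starts = []
    offset = 0
    for text in pages:
        starts.append(offset)
        offset += len(text) + 1  # +1 for the joining line break
    return "\n".join(pages), starts


def find_phrase(chapter, phrase):
    """Return (begin, end) of the first occurrence, treating any whitespace
    (including the page-joining line break) as a word separator; None if absent."""
    pattern = r"\s+".join(re.escape(word) for word in phrase.split())
    match = re.search(pattern, chapter, re.IGNORECASE)
    return match.span() if match else None


def pages_for_match(starts, begin, end):
    """Return the 1-based page numbers touched by the character span [begin, end)."""
    first = bisect_right(starts, begin) - 1
    last = bisect_right(starts, end - 1) - 1
    return list(range(first + 1, last + 2))
```

With pages `["the quick brown", "fox jumps over", "the lazy dog"]`, a search for "brown fox" spans the first line break, so the lookup reports both page 1 and page 2.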

You could have a separate collection for pages, with each page as a Solr
document, but include the last line of text from the previous page and the
first line of text from the next page so that phrases will match across
page boundaries. Unfortunately, that may also result in false hits if the
full phrase is found entirely within the borrowed lines. That would require
some special filtering to eliminate those false positives.
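The page-per-document variant and its filtering step could be sketched roughly as follows. This is plain Python, not Solr code: the dictionary keys (`borrowed_before`, `own_text`, `borrowed_after`) and the function names are made up for illustration, and the filter is assumed to run on the client side after the search returns candidate pages. A hit counts only if some occurrence of the phrase overlaps the page's own text.

```python
import re


def _non_empty_lines(text):
    return [line for line in text.splitlines() if line.strip()]


def build_page_docs(pages):
    """One dict per page, carrying the last line of the previous page and
    the first line of the next page alongside the page's own text."""
    docs = []
    for i, text in enumerate(pages):
        prev_tail = _non_empty_lines(pages[i - 1])[-1:] if i > 0 else []
        next_head = _non_empty_lines(pages[i + 1])[:1] if i + 1 < len(pages) else []
        docs.append({
            "page": i + 1,
            "borrowed_before": " ".join(prev_tail),
            "own_text": text,
            "borrowed_after": " ".join(next_head),
        })
    return docs


def is_true_hit(doc, phrase):
    """True if an occurrence of the phrase overlaps the page's own text;
    False if it lies entirely inside a borrowed line (a false positive)."""
    before, own, after = doc["borrowed_before"], doc["own_text"], doc["borrowed_after"]
    full = "\n".join([before, own, after])
    own_start = len(before) + 1          # skip the borrowed line and its line break
    own_end = own_start + len(own)
    pattern = r"\s+".join(re.escape(word) for word in phrase.split())
    for match in re.finditer(pattern, full, re.IGNORECASE):
        if match.start() < own_end and match.end() > own_start:
            return True
    return False
```

Note that a phrase crossing a page boundary passes the filter on both adjacent pages, so the caller may still want to merge those two hits into one result.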

There is also the question of maximum phrase size - most phrases tend to be
reasonably short, but sometimes people may want to search for an entire
paragraph (e.g., a quote) that may span multiple lines on two adjacent
pages.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 11:30 AM, Emir Arnautovic <emir.arnautovic@sematext.com> wrote:

> Hi,
> Off the top of my head - this probably does not solve the problem completely,
> but may trigger brainstorming: index chapters and include page break tokens.
> Use highlighting to return matches and make sure the fragment size is large
> enough to include a page break token. In such a scenario you should use slop
> for phrase searches...
>
> The more I write it, the less I like it, but I will not delete it...
>
> Regards,
> Emir
>
>
> On 01.03.2016 12:56, Zaccheo Bagnati wrote:
>
>> Hi all,
>> I'm looking for ideas on how to define the schema and how to perform queries
>> in this use case: we have to index books, each book is split into chapters
>> and chapters are split into pages (pages represent the original page breaks
>> in the printed version). We should show the results grouped by books and
>> chapters (for the same book) and pages (for the same chapter). As far as I
>> know, we have 2 options:
>>
>> 1. index pages as SOLR documents. In this way we could theoretically
>> retrieve chapters (and books?) using grouping but
>>      a. we will miss matches across two contiguous pages (page cutting is
>> only due to typographical needs so concepts could be split... as in
>> printed books)
>>      b. I don't know if it is possible in SOLR to group results on two
>> different levels (books and chapters)
>>
>> 2. index chapters as SOLR documents. In this case we will have the right
>> matches but how to obtain the matching pages? (we need pages because the
>> client can only display pages)
>>
>> We have been struggling with this problem for a long time and we're not
>> able to find a suitable solution, so I'm asking whether anyone has ideas or
>> has already solved a similar issue.
>> Thanks
>>
>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
