lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Indexing books, chapters and pages
Date Tue, 01 Mar 2016 16:05:17 GMT
You could index both pages and chapters, with a type field.
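
A minimal sketch of this first option, assuming the book has already been parsed into chapters and pages. The field names (`id`, `type`, `book_id`, `chapter`, `page`, `text`) and the input structure are hypothetical, not taken from the thread:

```python
def build_docs(book_id, chapters):
    """Build page-level and chapter-level documents for one book.

    chapters: list of (chapter_no, [(page_no, page_text), ...]).
    All field names below are illustrative assumptions.
    """
    docs = []
    for chapter_no, pages in chapters:
        # Chapter document: page texts are joined, so a phrase that
        # straddles a page break is contiguous in this document.
        docs.append({
            "id": f"{book_id}-c{chapter_no}",
            "type": "chapter",
            "book_id": book_id,
            "chapter": chapter_no,
            "text": " ".join(text for _, text in pages),
        })
        # Page documents: one per printed page, for page-level display.
        for page_no, text in pages:
            docs.append({
                "id": f"{book_id}-c{chapter_no}-p{page_no}",
                "type": "page",
                "book_id": book_id,
                "chapter": chapter_no,
                "page": page_no,
                "text": text,
            })
    return docs
```

A query could then restrict itself to one level with a filter such as `fq=type:chapter` or `fq=type:page`.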

You could index by chapter with the page number as a payload for each token.
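
A rough sketch of how the chapter text could be prepared for this second option. It assumes the field is analyzed with a whitespace tokenizer followed by Solr's `DelimitedPayloadTokenFilterFactory` using the `|` delimiter and an integer encoder; that schema setup and the function name are assumptions, not something stated in the thread:

```python
def chapter_with_page_payloads(pages, delimiter="|"):
    """Turn [(page_no, page_text), ...] into 'token|page_no' text.

    Each whitespace-separated word is suffixed with the number of the
    page it came from, so the page can travel with the token as a payload.
    """
    tokens = []
    for page_no, text in pages:
        for word in text.split():
            # Drop the delimiter from the word itself so it cannot be
            # mistaken for the start of the payload.
            cleaned = word.replace(delimiter, "")
            if cleaned:
                tokens.append(f"{cleaned}{delimiter}{page_no}")
    return " ".join(tokens)
```

Reading the payloads of the matching tokens back at query time is not covered here and would likely need custom code.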

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 1, 2016, at 5:50 AM, Zaccheo Bagnati <zaccheob@gmail.com> wrote:
> 
> Thank you, Jack, for your answer.
> There are 2 reasons:
> 1. The requirement is to show both books and chapters, grouped, in the
> result list. So I would have to execute the query grouping by book, retrieve
> the first, let's say, 10 books (sorted by relevance), and then for each book
> repeat the query grouping by chapter (again ordered by relevance) to
> obtain what we need (unfortunately it is not up to me to define the
> requirements... but they do make sense). Unless there is some Solr
> feature to do this in a single call (and that would be great!).
> 2. Searching on pages will not match phrases that span 2 pages
> (e.g. if the last word of page 1 is "broken" and the first word of page 2 is
> "sentence", searching for "broken sentence" will not match).
> However, if we do not find a better solution, I think that your proposal is
> not so bad... I hope that reason #2 is negligible and that #1
> performs reasonably fast even though we are multiplying queries.
> 
> On Tue, Mar 1, 2016 at 14:28, Jack Krupansky <jack.krupansky@gmail.com>
> wrote:
> 
>> Any reason not to use the simplest structure - each page is one Solr
>> document with a book field, a chapter field, and a page text field? You can
>> then use grouping to group results by book (title text) or even chapter
>> (title text and/or number). Maybe initially group by book and then if the
>> user selects a book group you can re-query with the specific book and then
>> group by chapter.
>> 
>> 
>> -- Jack Krupansky
>> 
>> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati <zaccheob@gmail.com>
>> wrote:
>> 
>>> Original data is quite well structured: it comes in XML with chapters and
>>> tags to mark the original page breaks on the paper version. In this way we
>>> have the possibility to restructure it almost as we want before creating
>>> the Solr index.
>>> 
>>> On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <jack.krupansky@gmail.com>
>>> wrote:
>>> 
>>>> To start, what is the form of your input data - is it already divided into
>>>> chapters and pages? Or... are you starting with raw PDF files?
>>>> 
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati <zaccheob@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi all,
>>>>> I'm looking for ideas on how to define the schema and how to perform
>>>>> queries for this use case: we have to index books; each book is split
>>>>> into chapters, and chapters are split into pages (pages represent the
>>>>> original page breaks in the printed version). We should show the results
>>>>> grouped by book, then by chapter (within the same book), then by page
>>>>> (within the same chapter). As far as I know, we have 2 options:
>>>>> 
>>>>> 1. Index pages as Solr documents. In this way we could theoretically
>>>>> retrieve chapters (and books?) using grouping, but
>>>>>    a. we will miss matches across two contiguous pages (page breaks are
>>>>> only due to typographical needs, so concepts could be split... as in
>>>>> printed books)
>>>>>    b. I don't know if it is possible in Solr to group results on two
>>>>> different levels (books and chapters)
>>>>> 
>>>>> 2. Index chapters as Solr documents. In this case we will have the right
>>>>> matches, but how do we obtain the matching pages? (We need pages because
>>>>> the client can only display pages.)
>>>>> 
>>>>> We have been struggling with this problem for a long time and have not
>>>>> been able to find a suitable solution, so I'm asking whether someone has
>>>>> ideas or has already solved a similar issue.
>>>>> Thanks
>>>>> 
>>>> 
>>> 
>> 
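
The two-step approach discussed in the thread (group by book first, then re-query one book grouped by chapter) could be sketched as follows. The parameters `group`, `group.field`, `group.limit`, `fq` and `rows` are standard Solr result-grouping and query parameters; the field names `book_id` and `chapter_id`, the function names, and the limits are assumptions made for illustration:

```python
from urllib.parse import urlencode


def books_params(user_query, max_books=10):
    """First call: the top books for the query, one group per book."""
    return {
        "q": user_query,
        "group": "true",
        "group.field": "book_id",   # hypothetical field name
        "group.limit": 1,           # documents returned per group
        "rows": max_books,          # with grouping, rows counts groups
    }


def chapters_params(user_query, book_id, max_chapters=10):
    """Second call: same query restricted to one book, grouped by chapter."""
    return {
        "q": user_query,
        "fq": f'book_id:"{book_id}"',
        "group": "true",
        "group.field": "chapter_id",  # hypothetical field name
        "group.limit": 3,
        "rows": max_chapters,
    }


def to_query_string(params):
    """Encode the parameters for a request to a select handler."""
    return urlencode(params)
```

This issues one request for the book list plus one request per displayed book, which is the query multiplication mentioned above.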

