lucene-solr-user mailing list archives

From Zaccheo Bagnati <zacch...@gmail.com>
Subject Re: Indexing books, chapters and pages
Date Wed, 02 Mar 2016 08:09:04 GMT
Thanks Walter,
the payload idea is something I had never heard of. It seems interesting
but quite complex to implement. I think we would have to write a custom filter
to add page numbers, and it's not clear to me how to retrieve payloads in
the query results. However, I'll try to dig deeper into this.
Do you have any further details on how to use payloads?
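
To make the idea concrete, here is a minimal sketch of how the page number could be attached to each token before indexing. It assumes the chapter field is analyzed with Solr's DelimitedPayloadTokenFilterFactory (delimiter "|", integer encoder), which might avoid the need for a custom filter. The function name and the sample data are hypothetical, and reading the payloads back at query time is not covered.

```python
def to_payload_text(pages, delimiter="|"):
    """Turn [(page_number, page_text), ...] into 'token|page' delimited text."""
    out = []
    for page_number, text in pages:
        for token in text.split():
            # Drop any stray delimiter so it is not mistaken for a payload marker.
            token = token.replace(delimiter, "")
            if token:
                out.append(f"{token}{delimiter}{page_number}")
    return " ".join(out)


# Hypothetical chapter made of two pages.
chapter_pages = [(1, "this is a broken"), (2, "sentence across pages")]
payload_text = to_payload_text(chapter_pages)
print(payload_text)
```

The resulting string would be sent as the value of the chapter text field, one document per chapter.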

On Tue, Mar 1, 2016 at 17:05, Walter Underwood <wunder@wunderwood.org> wrote:

> You could index both pages and chapters, with a type field.
>
> You could index by chapter with the page number as a payload for each
> token.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Mar 1, 2016, at 5:50 AM, Zaccheo Bagnati <zaccheob@gmail.com> wrote:
> >
> > Thank you, Jack, for your answer.
> > There are 2 reasons:
> > 1. The requirement is to show both books and chapters, grouped, in the
> > result list. So I would have to execute the query grouping by book,
> > retrieve the first, let's say, 10 books (sorted by relevance), and then for
> > each book repeat the query grouping by chapter (again ordered by relevance)
> > to obtain what we need (unfortunately it is not up to me to define the
> > requirements... but they do make sense). Unless there is some Solr
> > feature to do this in only one call (and that would be great!).
> > 2. Searching on pages will not match phrases that span 2 pages
> > (e.g. if the last word of page 1 is "broken" and the first word of page 2 is
> > "sentence", searching for "broken sentence" will not match).
> > However, if we do not find a better solution, I think that your proposal is
> > not so bad... I hope that reason #2 is negligible and that #1
> > performs reasonably fast even though we are multiplying queries.
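
Reason #2 can be illustrated without Solr at all. The sketch below uses a naive whitespace tokenizer as a stand-in for a real analyzer (an assumption, as are the sample pages): the phrase is found when the pages are joined into one chapter, but not in either page on its own.

```python
def contains_phrase(text, phrase):
    """Return True if the words of `phrase` occur consecutively in `text`."""
    tokens = text.lower().split()
    words = phrase.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


# Hypothetical pages: the phrase is split by the page break.
pages = ["the story of a broken", "sentence that continues"]

per_page = [contains_phrase(p, "broken sentence") for p in pages]
per_chapter = contains_phrase(" ".join(pages), "broken sentence")

print(per_page)     # one result per page
print(per_chapter)  # result for the whole chapter
```
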
> >
> > On Tue, Mar 1, 2016 at 14:28, Jack Krupansky <jack.krupansky@gmail.com> wrote:
> >
> >> Any reason not to use the simplest structure - each page is one Solr
> >> document with a book field, a chapter field, and a page text field? You can
> >> then use grouping to group results by book (title text) or even chapter
> >> (title text and/or number). Maybe initially group by book and then if the
> >> user selects a book group you can re-query with the specific book and then
> >> group by chapter.
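
The two-step approach above could be sketched roughly as follows. The `group`, `group.field`, `group.limit`, `fq` and `rows` parameters are standard Solr request parameters; the field names `book`, `chapter` and `page_text` and the helper function are hypothetical, and the user's text is not escaped for Solr query syntax here.

```python
from urllib.parse import urlencode


def grouped_query(text, group_field, book=None, rows=10):
    """Build the query string for a Solr select request grouped on one field."""
    params = [
        ("q", f"page_text:({text})"),
        ("group", "true"),
        ("group.field", group_field),
        ("group.limit", "3"),  # top pages to return inside each group
        ("rows", str(rows)),   # number of groups to return
    ]
    if book is not None:
        # Restrict the second-level query to the book the user selected.
        params.append(("fq", f'book:"{book}"'))
    return urlencode(params)


# Step 1: group by book. Step 2: re-query one book, grouped by chapter.
print(grouped_query("broken sentence", "book"))
print(grouped_query("broken sentence", "chapter", book="Moby Dick"))
```

Each string would be appended to the collection's select URL. This still costs one extra request per selected book, as noted in the reply.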
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati <zaccheob@gmail.com>
> >> wrote:
> >>
> >>> The original data is quite well structured: it comes in XML with chapters and
> >>> tags that mark the original page breaks of the paper version. This gives us
> >>> the possibility to restructure it almost as we want before creating the
> >>> Solr index.
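
As an illustration of that restructuring, here is a minimal sketch that turns such XML into both chapter and page documents distinguished by a type field. The element and attribute names (`book`, `chapter`, `pb`, `n`, `title`) are assumptions, not the real format, and the code assumes flat chapter text with no nested markup and no text before the first page break.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: <pb n="..."/> marks the start of each printed page.
SAMPLE = """<book title="Example">
  <chapter n="1"><pb n="1"/>this is a broken<pb n="2"/>sentence across pages</chapter>
</book>"""


def split_chapter(chapter):
    """Return [(page_number, text), ...] for one <chapter> element."""
    pages = []
    for pb in chapter.iter("pb"):
        # The text following a page-break milestone is stored in its .tail.
        pages.append((int(pb.get("n")), (pb.tail or "").strip()))
    return pages


def to_docs(xml_text):
    """Build one 'chapter' document plus one 'page' document per page."""
    root = ET.fromstring(xml_text)
    docs = []
    for chapter in root.iter("chapter"):
        pages = split_chapter(chapter)
        base = {"book": root.get("title"), "chapter": chapter.get("n")}
        docs.append({**base, "type": "chapter",
                     "text": " ".join(text for _, text in pages)})
        for number, text in pages:
            docs.append({**base, "type": "page", "page": number, "text": text})
    return docs


docs = to_docs(SAMPLE)
for doc in docs:
    print(doc)
```
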
> >>>
> >>> On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <jack.krupansky@gmail.com> wrote:
> >>>
> >>>> To start, what is the form of your input data - is it already divided into
> >>>> chapters and pages? Or... are you starting with raw PDF files?
> >>>>
> >>>>
> >>>> -- Jack Krupansky
> >>>>
> >>>> On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati <zaccheob@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi all,
> >>>>> I'm looking for ideas on how to define the schema and how to perform queries
> >>>>> for this use case: we have to index books; each book is split into chapters,
> >>>>> and chapters are split into pages (pages represent the original page breaks
> >>>>> in the printed version). We should show the results grouped by book, then by
> >>>>> chapter (within the same book) and by page (within the same chapter). As far
> >>>>> as I know, we have 2 options:
> >>>>>
> >>>>> 1. Index pages as Solr documents. In this way we could theoretically
> >>>>> retrieve chapters (and books?) using grouping, but
> >>>>>    a. we will miss matches that span two contiguous pages (page breaks are
> >>>>> only due to typographical needs, so concepts can be split... as in printed
> >>>>> books)
> >>>>>    b. I don't know if it is possible in Solr to group results on two
> >>>>> different levels (books and chapters)
> >>>>>
> >>>>> 2. Index chapters as Solr documents. In this case we will have the right
> >>>>> matches, but how do we obtain the matching pages? (We need pages because the
> >>>>> client can only display pages.)
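
One possible way to recover pages in option 2, sketched under assumptions: store the character offset at which each page starts inside the chapter text, then map a match's offsets back to page numbers. How the match offsets are obtained from the search engine is left open; `str.find` stands in for it here, and all names and sample data are hypothetical.

```python
from bisect import bisect_right


def build_chapter(pages):
    """Join page texts into one chapter string and record where each page starts."""
    starts, numbers, parts, offset = [], [], [], 0
    for number, text in pages:
        starts.append(offset)
        numbers.append(number)
        parts.append(text)
        offset += len(text) + 1  # +1 for the joining space
    return " ".join(parts), starts, numbers


def pages_for_match(start, end, starts, numbers):
    """Return the page numbers covered by the character span [start, end)."""
    first = bisect_right(starts, start) - 1
    last = bisect_right(starts, end - 1) - 1
    return numbers[first:last + 1]


text, starts, numbers = build_chapter([(12, "a broken"), (13, "sentence here")])
hit = text.find("broken sentence")  # stand-in for a real match offset
matched_pages = pages_for_match(hit, hit + len("broken sentence"), starts, numbers)
print(matched_pages)
```

A match that crosses a page break maps to both pages, so the client could display either or both.
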
> >>>>>
> >>>>> We have been struggling with this problem for a long time and have not been
> >>>>> able to find a suitable solution, so I'm asking whether someone has ideas or
> >>>>> has already solved a similar issue.
> >>>>> Thanks
> >>>>>
> >>>>
> >>>
> >>
>
>
