lucene-solr-user mailing list archives

From Zaccheo Bagnati <zacch...@gmail.com>
Subject Re: Indexing books, chapters and pages
Date Tue, 01 Mar 2016 13:08:38 GMT
The original data is quite well structured: it comes as XML, with chapters
and with tags that mark the original page breaks of the paper version. This
means we can restructure it almost any way we want before creating the
SOLR index.
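
To make the restructuring step concrete, here is a minimal sketch, not the actual pipeline. The real XML schema is not shown in this thread, so the element and attribute names (`book`, `chapter`, `pb`, `id`, `n`) are hypothetical, and the sketch assumes each chapter contains only text and page-break tags as direct children. It builds both shapes discussed in the thread: one record per page, and one record per chapter that also keeps the character offset at which each page starts.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: <pb n="..."/> marks where a printed page begins.
SAMPLE = (
    '<book id="b1">'
    '<chapter id="c1"><pb n="1"/>A concept that starts here '
    '<pb n="2"/>and ends on the next page.</chapter>'
    '<chapter id="c2"><pb n="3"/>Another chapter.</chapter>'
    '</book>'
)


def split_chapter(chapter):
    """Return (page_number, text) pairs for one chapter element."""
    pages = []
    # Text before the first <pb/> has no page number in this sketch.
    lead = (chapter.text or "").strip()
    if lead:
        pages.append((None, lead))
    # The text following a <pb/> is stored in that element's .tail.
    for pb in chapter.findall("pb"):
        pages.append((pb.get("n"), (pb.tail or "").strip()))
    return pages


def restructure(xml_text):
    """Build page-level and chapter-level records from one book."""
    root = ET.fromstring(xml_text)
    book_id = root.get("id")
    page_docs, chapter_docs = [], []
    for chapter in root.findall("chapter"):
        chapter_id = chapter.get("id")
        pages = split_chapter(chapter)
        parts, page_starts, pos = [], [], 0
        for number, text in pages:
            page_docs.append({
                "id": f"{book_id}/{chapter_id}/{number}",
                "book_id": book_id,
                "chapter_id": chapter_id,
                "page": number,
                "text": text,
            })
            # Offset of this page inside the joined chapter text.
            page_starts.append({"page": number, "start": pos})
            parts.append(text)
            pos += len(text) + 1  # +1 for the joining space
        chapter_docs.append({
            "id": f"{book_id}/{chapter_id}",
            "book_id": book_id,
            "text": " ".join(parts),
            "page_starts": page_starts,
        })
    return page_docs, chapter_docs
```

Calling `restructure(SAMPLE)` returns the two lists. Which of them (if either) would be sent to SOLR, and under which field names, is left open here; the sketch only shows that both layouts can be derived from the same source.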

On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> To start, what is the form of your input data - is it already divided into
> chapters and pages? Or... are you starting with raw PDF files?
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati <zaccheob@gmail.com>
> wrote:
>
> > Hi all,
> > I'm looking for ideas on how to define the schema and how to run queries
> > for this use case: we have to index books; each book is split into
> > chapters, and chapters are split into pages (pages correspond to the page
> > breaks of the printed version). We need to show results grouped by book,
> > by chapter (within the same book), and by page (within the same chapter).
> > As far as I know, we have 2 options:
> >
> > 1. Index pages as SOLR documents. This way we could theoretically
> > retrieve chapters (and books?) using grouping, but
> >     a. we will miss matches that span two contiguous pages (page breaks
> > are purely typographical, so a concept can be split across pages, as in
> > printed books)
> >     b. I don't know if it is possible in SOLR to group results on two
> > different levels (books and chapters)
> >
> > 2. Index chapters as SOLR documents. In this case we will get the right
> > matches, but how do we obtain the matching pages? (We need pages because
> > the client can only display pages.)
> >
> > We have been struggling with this problem for a long time and haven't
> > been able to find a suitable solution, so I'm asking whether someone has
> > ideas or has already solved a similar issue.
> > Thanks
> >
>
