lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Indexing books, chapters and pages
Date Tue, 01 Mar 2016 13:28:11 GMT
Any reason not to use the simplest structure - each page is one Solr
document with a book field, a chapter field, and a page text field? You can
then use grouping to group results by book (title text) or even chapter
(title text and/or number). Maybe initially group by book and then if the
user selects a book group you can re-query with the specific book and then
group by chapter.


-- Jack Krupansky

On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati <zaccheob@gmail.com> wrote:

> Original data is quite well structured: it comes in XML with chapters and
> tags to mark the original page breaks on the paper version. In this way we
> have the possibility to restructure it almost as we want before creating
> SOLR index.
>
> Il giorno mar 1 mar 2016 alle ore 14:04 Jack Krupansky <
> jack.krupansky@gmail.com> ha scritto:
>
> > To start, what is the form of your input data - is it already divided
> into
> > chapters and pages? Or... are you starting with raw PDF files?
> >
> >
> > -- Jack Krupansky
> >
> > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati <zaccheob@gmail.com>
> > wrote:
> >
> > > Hi all,
> > > I'm searching for ideas on how to define schema and how to perform
> > queries
> > > in this use case: we have to index books, each book is split into
> > chapters
> > > and chapters are split into pages (pages represent original page
> cutting
> > in
> > > printed version). We should show the result grouped by books and
> chapters
> > > (for the same book) and pages (for the same chapter). As far as I know,
> > we
> > > have 2 options:
> > >
> > > 1. index pages as SOLR documents. In this way we could theoretically
> > > retrieve chapters (and books?)  using grouping but
> > >     a. we will miss matches across two contiguous pages (page cutting
> is
> > > only due to typographical needs so concepts could be split... as in
> > printed
> > > books)
> > >     b. I don't know if it is possible in SOLR to group results on two
> > > different levels (books and chapters)
> > >
> > > 2. index chapters as SOLR documents. In this case we will have the
> right
> > > matches but how to obtain the matching pages? (we need pages because
> the
> > > client can only display pages)
> > >
> > > we have been struggling on this problem for a lot of time and we're
> not
> > > able to find a suitable solution so I'm looking if someone has ideas or
> > has
> > > already solved a similar issue.
> > > Thanks
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message