lucene-solr-user mailing list archives

From Zaccheo Bagnati <zacch...@gmail.com>
Subject Re: Indexing books, chapters and pages
Date Wed, 02 Mar 2016 08:42:10 GMT
Thanks Alexandre,
your solution looks very good; I'll certainly try it and let you know. I like
the idea of mixing block joins and grouping!

On Wed, 2 Mar 2016 at 04:46, Alexandre Rafalovitch <arafalov@gmail.com>
wrote:

> Here is a possible (untested) approach. I might be missing something
> by combining these things in too many layers, but here goes:
>
> 1) Have chapters as parent documents and pages as children within them.
> Block index them together.
> 2) On pages, include the page text (probably not stored) as one field.
> Also include a second field that holds the last paragraph of that page
> as well as the first paragraph of the next page. This gives you phrase
> matches across page boundaries. Also include pageId, etc.
> 3) On chapters, include book id as a string field.
> 4) Use block join query to search against pages, but return (parent)
> chapters
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
> 5) Use grouping or collapsing+expanding by book id to group chapters
> within a book:
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> or
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> 6) Use [child] DocumentTransformer to get pages back with childFilter
> to re-limit them by your query:
>
> https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory
>
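Steps 1) to 3) could be sketched roughly as follows. This is an untested illustration, not code from the thread: the field names (doc_type, page_text, page_boundary_text, book_id, chapter_title) and the helper names are hypothetical, and paragraphs are assumed to be separated by blank lines. The only Solr-specific name is `_childDocuments_`, the key Solr's JSON update format uses for nested child documents.

```python
def first_paragraph(text):
    """Return the first paragraph, assuming blank-line separators."""
    return text.split("\n\n")[0]


def last_paragraph(text):
    """Return the last paragraph, assuming blank-line separators."""
    return text.split("\n\n")[-1]


def build_chapter_doc(book_id, chapter_id, title, pages):
    """Build one parent (chapter) document with its pages as children.

    `pages` is a list of page texts in reading order.
    """
    children = []
    for i, text in enumerate(pages):
        # Step 2: last paragraph of this page plus first paragraph of the
        # next page, so a phrase spanning the page break can still match.
        boundary = last_paragraph(text)
        if i + 1 < len(pages):
            boundary += " " + first_paragraph(pages[i + 1])
        children.append({
            "id": f"{chapter_id}-p{i + 1}",
            "doc_type": "page",
            "page_text": text,
            "page_boundary_text": boundary,
        })
    return {
        "id": chapter_id,
        "doc_type": "chapter",
        "book_id": book_id,            # step 3: string field on the chapter
        "chapter_title": title,
        "_childDocuments_": children,  # step 1: indexed together as a block
    }


pages = ["Intro.\n\nThe line was broken", "sentence in two.\n\nEnd."]
doc = build_chapter_doc("b1", "b1-c1", "One", pages)
```

With this layout, "broken sentence" appears contiguously in the boundary field of page 1 even though the two words sit on different pages.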
> The main question is whether 6) will be able to piggyback on the
> output of 5). And, of course, how it performs.
>
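A request combining steps 4) to 6) might look like the sketch below. It is as untested as the approach itself: the field names are the same hypothetical ones as in the indexing sketch, `pq` is an arbitrary parameter name, and whether the [child] transformer cooperates with grouped results is exactly the open question raised above. The `{!parent which=...}` parser, the `group.*` parameters, and the `[child parentFilter=... childFilter=... limit=...]` transformer are the features described on the linked pages.

```python
from urllib.parse import urlencode


def build_search_params(phrase, rows=10, pages_per_chapter=10):
    """Build Solr request parameters for a phrase search over pages.

    Assumes `phrase` is plain words with no quotes or query syntax.
    """
    # Match either the page itself or the page-boundary field.
    page_query = f'page_text:"{phrase}" OR page_boundary_text:"{phrase}"'
    # Step 6: return matching pages under each chapter. The single-quoted
    # childFilter value is an assumption about how the transformer
    # arguments are parsed.
    child_tf = (
        "[child parentFilter=doc_type:chapter "
        f"childFilter='{page_query}' limit={pages_per_chapter}]"
    )
    return {
        # Step 4: search pages, return their parent chapters.
        "q": '{!parent which="doc_type:chapter" v=$pq}',
        "pq": page_query,
        # Step 5: group the returned chapters by book.
        "group": "true",
        "group.field": "book_id",
        "group.limit": "5",
        "rows": str(rows),
        "fl": f"id,book_id,chapter_title,{child_tf}",
    }


params = build_search_params("broken sentence")
query_string = urlencode(params)
```

The resulting `query_string` can be appended to a collection's /select URL. Collapse and expand (`fq={!collapse field=book_id}` with `expand=true`) could replace the `group.*` parameters if grouping turns out to be too slow.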
> I would love to know if this works, even partially. Either on the
> mailing list or directly.
>
> Regards,
>    Alex.
>
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 2 March 2016 at 00:50, Zaccheo Bagnati <zaccheob@gmail.com> wrote:
> > Thank you, Jack, for your answer.
> > There are 2 reasons:
> > 1. The requirement is to show both books and chapters, grouped, in the
> > result list, so I would have to execute the query grouping by book,
> > retrieve the first, let's say, 10 books (sorted by relevance), and then
> > for each book repeat the query grouping by chapter (again ordered by
> > relevance) to obtain what we need (unfortunately it is not up to me to
> > define the requirements... but it does make sense). Unless there exists
> > some Solr feature to do this in only one call (and that would be great!).
> > 2. Searching on pages will not match phrases that span 2 pages (e.g. if
> > the last word of page 1 is "broken" and the first word of page 2 is
> > "sentence", searching for "broken sentence" will not match).
> > However, if we do not find a better solution I think that your proposal
> > is not so bad... I hope that reason #2 could be negligible and that #1
> > performs quite fast even though we are multiplying queries.
> >
> > On Tue, 1 Mar 2016 at 14:28, Jack Krupansky <jack.krupansky@gmail.com>
> > wrote:
> >
> >> Any reason not to use the simplest structure - each page is one Solr
> >> document with a book field, a chapter field, and a page text field?
> >> You can then use grouping to group results by book (title text) or
> >> even chapter (title text and/or number). Maybe initially group by book
> >> and then if the user selects a book group you can re-query with the
> >> specific book and then group by chapter.
> >>
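The two-request flow described here could be sketched as below. This is a minimal illustration, not code from the thread: the field names book_id and chapter_id are hypothetical single-valued string fields standing in for the book and chapter fields, while `group`, `group.field`, and `group.limit` are the standard result grouping parameters.

```python
def group_by_book(query, rows=10):
    """First request: top books for the query, a few pages per book."""
    return {
        "q": query,
        "group": "true",
        "group.field": "book_id",
        "group.limit": "3",
        "rows": str(rows),
    }


def group_by_chapter(query, book_id):
    """Follow-up request: same query restricted to one selected book."""
    return {
        "q": query,
        "fq": f'book_id:"{book_id}"',  # filter to the selected book
        "group": "true",
        "group.field": "chapter_id",
        "group.limit": "10",
    }


first = group_by_book("page_text:whale")
second = group_by_chapter("page_text:whale", "b1")
```

Showing chapters for every book on the first screen would need one follow-up request per book, which is the query multiplication discussed elsewhere in the thread.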
> >>
> >> -- Jack Krupansky
> >>
> >> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati <zaccheob@gmail.com>
> >> wrote:
> >>
> >> > Original data is quite well structured: it comes in XML with
> >> > chapters and tags to mark the original page breaks on the paper
> >> > version. In this way we have the possibility to restructure it
> >> > almost as we want before creating the Solr index.
> >> >
> >> > On Tue, 1 Mar 2016 at 14:04, Jack Krupansky
> >> > <jack.krupansky@gmail.com> wrote:
> >> >
> >> > > To start, what is the form of your input data - is it already
> >> > > divided into chapters and pages? Or... are you starting with raw
> >> > > PDF files?
> >> > >
> >> > >
> >> > > -- Jack Krupansky
> >> > >
> >> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati
> >> > > <zaccheob@gmail.com> wrote:
> >> > >
> >> > > > Hi all,
> >> > > > I'm looking for ideas on how to define the schema and how to
> >> > > > perform queries for this use case: we have to index books; each
> >> > > > book is split into chapters, and chapters are split into pages
> >> > > > (pages represent the original page breaks in the printed
> >> > > > version). We should show the results grouped by book, by
> >> > > > chapter (within the same book), and by page (within the same
> >> > > > chapter). As far as I know, we have 2 options:
> >> > > >
> >> > > > 1. Index pages as Solr documents. This way we could
> >> > > > theoretically retrieve chapters (and books?) using grouping, but
> >> > > >     a. we will miss matches across two contiguous pages (page
> >> > > > breaks are only due to typographical needs, so concepts could be
> >> > > > split... as in printed books)
> >> > > >     b. I don't know if it is possible in Solr to group results
> >> > > > on two different levels (books and chapters)
> >> > > >
> >> > > > 2. Index chapters as Solr documents. In this case we will have
> >> > > > the right matches, but how do we obtain the matching pages? (We
> >> > > > need pages because the client can only display pages.)
> >> > > >
> >> > > > We have been struggling with this problem for a long time and
> >> > > > have not been able to find a suitable solution, so I'm asking
> >> > > > whether someone has ideas or has already solved a similar issue.
> >> > > > Thanks
> >> > > >
> >> > >
> >> >
> >>
>
