lucene-solr-user mailing list archives

From Zaccheo Bagnati <zacch...@gmail.com>
Subject Re: Indexing books, chapters and pages
Date Wed, 02 Mar 2016 08:45:12 GMT
If any of you cares about Stack Overflow reputation and has time to
answer, I have also opened a question there:
http://stackoverflow.com/questions/35722672/solr-schema-to-model-books-chapters-and-pages
Thanks again to everybody.

On Wed, 2 Mar 2016 at 09:42, Zaccheo Bagnati <zaccheob@gmail.com> wrote:

> Thanks Alexandre,
> your solution seems very good: I'll certainly try it and let you know. I
> like the idea of mixing block joins and grouping!
>
>
> On Wed, 2 Mar 2016 at 04:46, Alexandre Rafalovitch <arafalov@gmail.com>
> wrote:
>
>> Here is an - untested - possible approach. I might be missing
>> something by combining these things in too many layers, but.....
>>
>> 1) Have chapters as parent documents and pages as children within them.
>> Block index them together.
>> 2) On pages, include the page text (probably not stored) as one field.
>> Also include a second field that has the last paragraph of that page as
>> well as the first paragraph of the next page. This gives you phrase
>> matches across page boundaries. Also include pageId, etc.
>> 3) On chapters, include book id as a string field.
>> 4) Use block join query to search against pages, but return (parent)
>> chapters
>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
>> 5) Use grouping or collapsing+expanding by book id to group chapters
>> within a book:
>> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>> or
>> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
>> 6) Use [child] DocumentTransformer to get pages back with childFilter
>> to re-limit them by your query:
>>
>> https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory
>>
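A minimal sketch of the index side of this outline (steps 1 to 3), assuming chapters are posted through Solr's JSON update format with pages nested under `_childDocuments_`. All field names (`doc_type`, `page_text`, `page_boundary_text`, `book_id`, `chapter_title`, `page_number`) and the helper `build_chapter_block` are hypothetical and would need matching schema definitions; nothing here has been run against a real index.

```python
import json


def build_chapter_block(book_id, chapter_id, title, pages):
    """Return one chapter (parent) document with its pages nested as children.

    `pages` is a list of pages, each page being a list of paragraph strings.
    Field names are illustrative only, not taken from a real schema.
    """
    children = []
    for i, paragraphs in enumerate(pages):
        # Boundary field: last paragraph of this page plus the first
        # paragraph of the next page, so a phrase split by a page break
        # can still match on the page where it starts.
        boundary = paragraphs[-1] if paragraphs else ""
        if i + 1 < len(pages) and pages[i + 1]:
            boundary = (boundary + " " + pages[i + 1][0]).strip()
        children.append({
            "id": f"{chapter_id}-p{i + 1}",
            "doc_type": "page",
            "page_number": i + 1,  # position within the chapter
            "page_text": " ".join(paragraphs),
            "page_boundary_text": boundary,
        })
    return {
        "id": chapter_id,
        "doc_type": "chapter",
        "book_id": book_id,  # step 3: book id as a plain string field
        "chapter_title": title,
        "_childDocuments_": children,  # nested pages, indexed as one block
    }


if __name__ == "__main__":
    block = build_chapter_block(
        "b1", "b1-c1", "Intro",
        [["First paragraph.", "It was a broken"], ["sentence indeed.", "More."]],
    )
    # The printed JSON array is what would be sent to the update handler.
    print(json.dumps([block], indent=2))
```

With the example data, the first page's boundary field contains "It was a broken sentence indeed.", which is what lets a phrase search for "broken sentence" hit that page.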
>> The main question is whether 6) will be able to piggyback on the
>> output of 5)...... And, of course, the performance...
>>
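For the query side (steps 4 to 6), the request might be assembled roughly as below. This is only a sketch under assumptions: the field names (`doc_type`, `page_text`, `page_boundary_text`, `book_id`, `chapter_title`), the core URL, and the helpers `build_params` and `build_url` are all hypothetical. It also assumes that `$param` dereferencing works inside the `[child]` transformer arguments, which should be verified, and it does not settle the open question of whether the transformer cooperates with grouped results.

```python
from urllib.parse import urlencode


def build_params(phrase, rows=10):
    """Build Solr request parameters for a phrase search over pages."""
    # Escape backslashes and double quotes for use inside a quoted phrase.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    page_query = f'page_text:"{escaped}" OR page_boundary_text:"{escaped}"'
    return {
        # Step 4: match child pages, return their parent chapters.
        "q": '{!parent which="doc_type:chapter" v=$pageq}',
        "pageq": page_query,
        # Step 5: group the returned chapters by book.
        "group": "true",
        "group.field": "book_id",
        "group.limit": "5",
        # Step 6: attach matching pages to each chapter.
        # ASSUMPTION: childFilter=$pageq relies on parameter dereferencing.
        "fl": "id,book_id,chapter_title,"
              "[child parentFilter=doc_type:chapter childFilter=$pageq limit=20]",
        "rows": str(rows),
        "wt": "json",
    }


def build_url(base, phrase):
    """Return the full select URL; `base` is a hypothetical core endpoint."""
    return base + "?" + urlencode(build_params(phrase))


if __name__ == "__main__":
    print(build_url("http://localhost:8983/solr/books/select", "broken sentence"))
```

Collapse and expand could replace the three `group.*` parameters if grouping turns out to be too slow; the sketch uses grouping only because it needs the fewest extra parameters.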
>> I would love to know if this works, even partially. Either on the
>> mailing list or directly.
>>
>> Regards,
>>    Alex.
>>
>> ----
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>>
>> On 2 March 2016 at 00:50, Zaccheo Bagnati <zaccheob@gmail.com> wrote:
>> > Thank you, Jack, for your answer.
>> > There are 2 reasons:
>> > 1. The requirement is to show both books and chapters, grouped, in the
>> > result list. So I would have to execute the query grouping by book,
>> > retrieve the first, let's say, 10 books (sorted by relevance), and then
>> > for each book repeat the query grouping by chapter (again ordering by
>> > relevance) in order to obtain what we need (unfortunately it is not up
>> > to me to define the requirements... but they do make sense). Unless
>> > there is some SOLR feature to do this in only one call (and that would
>> > be great!).
>> > 2. Searching on pages will not match phrases that span 2 pages
>> > (e.g. if the last word of page 1 is "broken" and the first word of
>> > page 2 is "sentence", searching for "broken sentence" will not match).
>> > However, if we do not find a better solution, I think that your
>> > proposal is not so bad... I hope that reason #2 turns out to be
>> > negligible and that #1 performs reasonably fast even though we are
>> > multiplying queries.
>> >
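The two-pass flow from point 1 could be sketched as follows, assuming the flat layout Jack suggests (one Solr document per page). The field names `book_id` and `chapter_id` and the helper functions are hypothetical; the grouping parameters and the `{!term}` filter are standard Solr request syntax.

```python
def books_pass(user_query, books=10):
    """First pass: top books by relevance, one group per book."""
    return {
        "q": user_query,
        "group": "true",
        "group.field": "book_id",
        "group.limit": "1",
        "rows": str(books),  # with grouping on, rows counts groups
    }


def chapters_pass(user_query, book_id, chapters=10):
    """Second pass: same query restricted to one book, grouped by chapter."""
    return {
        "q": user_query,
        # {!term} avoids having to escape special characters in the id.
        "fq": "{!term f=book_id}" + book_id,
        "group": "true",
        "group.field": "chapter_id",
        "group.limit": "3",  # a few best pages per chapter
        "rows": str(chapters),
    }


def plan_queries(user_query, top_book_ids):
    """Return the parameter sets for all 1 + N requests."""
    return [books_pass(user_query)] + [
        chapters_pass(user_query, book_id) for book_id in top_book_ids
    ]
```

For 10 books per result page this means 11 requests per search, which is the "multiplying queries" cost mentioned above.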
>> > On Tue, 1 Mar 2016 at 14:28, Jack Krupansky <jack.krupansky@gmail.com>
>> > wrote:
>> >
>> >> Any reason not to use the simplest structure - each page is one Solr
>> >> document with a book field, a chapter field, and a page text field?
>> >> You can then use grouping to group results by book (title text) or
>> >> even chapter (title text and/or number). Maybe initially group by book
>> >> and then if the user selects a book group you can re-query with the
>> >> specific book and then group by chapter.
>> >>
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati <zaccheob@gmail.com>
>> >> wrote:
>> >>
>> >> > Original data is quite well structured: it comes in XML with
>> >> > chapters and tags to mark the original page breaks on the paper
>> >> > version. In this way we have the possibility to restructure it
>> >> > almost as we want before creating SOLR index.
>> >> >
>> >> > On Tue, 1 Mar 2016 at 14:04, Jack Krupansky
>> >> > <jack.krupansky@gmail.com> wrote:
>> >> >
>> >> > > To start, what is the form of your input data - is it already
>> >> > > divided into chapters and pages? Or... are you starting with raw
>> >> > > PDF files?
>> >> > >
>> >> > >
>> >> > > -- Jack Krupansky
>> >> > >
>> >> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati
>> >> > > <zaccheob@gmail.com> wrote:
>> >> > >
>> >> > > > Hi all,
>> >> > > > I'm looking for ideas on how to define the schema and how to
>> >> > > > perform queries in this use case: we have to index books, each
>> >> > > > book is split into chapters, and chapters are split into pages
>> >> > > > (pages represent the original page breaks in the printed
>> >> > > > version). We should show the results grouped by books, chapters
>> >> > > > (for the same book), and pages (for the same chapter). As far as
>> >> > > > I know, we have 2 options:
>> >> > > >
>> >> > > > 1. Index pages as SOLR documents. In this way we could
>> >> > > > theoretically retrieve chapters (and books?) using grouping, but
>> >> > > >     a. we will miss matches across two contiguous pages (page
>> >> > > > breaks are only due to typographical needs, so concepts could be
>> >> > > > split... as in printed books)
>> >> > > >     b. I don't know if it is possible in SOLR to group results
>> >> > > > on two different levels (books and chapters)
>> >> > > >
>> >> > > > 2. Index chapters as SOLR documents. In this case we will have
>> >> > > > the right matches, but how do we obtain the matching pages? (We
>> >> > > > need pages because the client can only display pages.)
>> >> > > >
>> >> > > > We have been struggling with this problem for a long time and we
>> >> > > > have not been able to find a suitable solution, so I'm asking
>> >> > > > whether someone has ideas or has already solved a similar issue.
>> >> > > > Thanks
>> >> > > >
>> >> > >
>> >> >
>> >>
>>
>
