From: Alexandre Rafalovitch
Date: Wed, 2 Mar 2016 14:40:57 +1100
Subject: Re: Indexing books, chapters and pages
To: solr-user

Here is an - untested - possible approach. I might be missing something
by combining these things in too many layers, but.....

1) Have chapters as parent documents and pages as children within that.
Block index them together.

2) On pages, include the page text (probably not stored) as one field.
Also include a second field that holds the last paragraph of that page
plus the first paragraph of the next page. This gives you phrase matches
across page boundaries. Also include pageId, etc.

3) On chapters, include the book id as a string field.

4) Use a block join query to search against pages, but return (parent)
chapters:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers

5) Use grouping or collapsing+expanding by book id to group chapters
within a book:
https://cwiki.apache.org/confluence/display/solr/Result+Grouping
or
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

6) Use the [child] DocumentTransformer to get pages back, with a
childFilter to re-limit them by your query:
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory

The main question is whether 6) will be able to piggyback on the output
of 5)...... And, of course, the performance...

I would love to know if this works, even partially. Either on the
mailing list or directly.

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 2 March 2016 at 00:50, Zaccheo Bagnati wrote:
> Thank you, Jack, for your answer.
> There are 2 reasons:
> 1. The requirement is to show both books and chapters in the result
> list, grouped, so I would have to execute the query grouping by book,
> retrieve the first, let's say, 10 books (sorted by relevance) and then,
> for each book, repeat the query grouping by chapter (again ordering by
> relevance) in order to obtain what we need (unfortunately it is not up
> to me to define the requirements... but it makes sense nevertheless).
> Unless there exists some SOLR feature to do this in only one call (and
> that would be great!).
> 2.
> Searching on pages will not match phrases that span across 2 pages
> (e.g. if the last word of page 1 is "broken" and the first word of
> page 2 is "sentence", searching for "broken sentence" will not match).
>
> However, if we do not find a better solution, I think your proposal is
> not so bad... I hope that reason #2 is negligible and that #1 performs
> quite fast even though we are multiplying queries.
>
> On Tue, 1 Mar 2016 at 14:28, Jack Krupansky <jack.krupansky@gmail.com>
> wrote:
>
>> Any reason not to use the simplest structure - each page is one Solr
>> document with a book field, a chapter field, and a page text field?
>> You can then use grouping to group results by book (title text) or
>> even chapter (title text and/or number). Maybe initially group by
>> book, and then if the user selects a book group you can re-query with
>> the specific book and group by chapter.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati wrote:
>>
>> > The original data is quite well structured: it comes in XML, with
>> > chapters and with tags that mark the original page breaks in the
>> > paper version. So we have the possibility to restructure it almost
>> > however we want before creating the SOLR index.
>> >
>> > On Tue, 1 Mar 2016 at 14:04, Jack Krupansky
>> > <jack.krupansky@gmail.com> wrote:
>> >
>> > > To start, what is the form of your input data - is it already
>> > > divided into chapters and pages? Or... are you starting with raw
>> > > PDF files?
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati wrote:
>> > >
>> > > > Hi all,
>> > > > I'm searching for ideas on how to define the schema and how to
>> > > > perform queries in this use case: we have to index books; each
>> > > > book is split into chapters, and chapters are split into pages
>> > > > (pages represent the original page cutting in the printed
>> > > > version).
>> > > > We should show the results grouped by books and chapters (for
>> > > > the same book) and pages (for the same chapter). As far as I
>> > > > know, we have 2 options:
>> > > >
>> > > > 1. Index pages as SOLR documents. In this way we could
>> > > > theoretically retrieve chapters (and books?) using grouping, but
>> > > >    a. we will miss matches across two contiguous pages (page
>> > > > cutting is only due to typographical needs, so concepts can be
>> > > > split... as in printed books)
>> > > >    b. I don't know if it is possible in SOLR to group results on
>> > > > two different levels (books and chapters)
>> > > >
>> > > > 2. Index chapters as SOLR documents. In this case we will have
>> > > > the right matches, but how do we obtain the matching pages? (We
>> > > > need pages because the client can only display pages.)
>> > > >
>> > > > We have been struggling with this problem for a long time and
>> > > > have not been able to find a suitable solution, so I'm asking
>> > > > whether someone has ideas or has already solved a similar issue.
>> > > > Thanks
>> > > >
>> > >
>> >
>>
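
[Editor's note] For concreteness, steps 4)-6) of the approach at the top of
this message might translate into Solr request parameters like the minimal
Python sketch below. It is untested against a live Solr, and the field names
(doc_type, bookId_s, page_text) are hypothetical placeholders for whatever
the real schema uses.

```python
# Sketch: block-join from pages up to chapters, collapse chapters by book,
# and attach the matching child pages via the [child] transformer.
# Field names (doc_type, bookId_s, page_text) are assumptions, not a real schema.
from urllib.parse import urlencode


def build_params(phrase):
    """Build hypothetical Solr query params for the chapter/page approach."""
    page_q = f'page_text:"{phrase}"'
    return {
        # 4) block join: query children (pages), return parents (chapters)
        "q": "{!parent which='doc_type:chapter'}" + page_q,
        # 5) collapse chapters so each book contributes one top result...
        "fq": "{!collapse field=bookId_s}",
        # ...and expand to recover the other matching chapters per book
        "expand": "true",
        # 6) [child] transformer, re-limited to the pages that matched
        "fl": "id,bookId_s,[child parentFilter='doc_type:chapter' "
              f"childFilter='{page_q}']",
    }


params = build_params("broken sentence")
print(urlencode(params))
```

Whether the expanded section and the [child] transformer compose cleanly is
exactly the open question raised in step 6) above, so treat this as a starting
point for experimentation rather than a working recipe.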