From: Zaccheo Bagnati
Date: Tue, 01 Mar 2016 13:08:38 +0000
Subject: Re: Indexing books, chapters and pages
To: solr-user@lucene.apache.org

Original data is quite well structured: it comes in XML with chapters and tags to mark the original
page breaks on the paper version. This lets us restructure it almost any way we want before creating the SOLR index.

On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <jack.krupansky@gmail.com> wrote:

> To start, what is the form of your input data? Is it already divided into
> chapters and pages? Or are you starting with raw PDF files?
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati wrote:
>
> > Hi all,
> > I'm looking for ideas on how to define the schema and how to perform
> > queries in this use case: we have to index books, each book is split into
> > chapters, and chapters are split into pages (pages represent the original
> > page breaks of the printed version). We should show the results grouped by
> > book, by chapter (within the same book) and by page (within the same
> > chapter). As far as I know, we have 2 options:
> >
> > 1. Index pages as SOLR documents. This way we could theoretically
> >    retrieve chapters (and books?) using grouping, but
> >    a. we will miss matches that span two contiguous pages (page breaks
> >       are only due to typographical needs, so concepts can be split
> >       across them, as in printed books)
> >    b. I don't know whether SOLR can group results on two different
> >       levels (books and chapters)
> >
> > 2. Index chapters as SOLR documents. In this case we get the right
> >    matches, but how do we obtain the matching pages? (We need pages
> >    because the client can only display pages.)
> >
> > We have been struggling with this problem for a long time and have not
> > been able to find a suitable solution, so I'm asking whether someone has
> > ideas or has already solved a similar issue.
> > Thanks
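
To illustrate the restructuring mentioned at the top of this message, here is a minimal Python sketch that turns chapter XML with page-break tags into one record per (book, chapter, page). The element and attribute names (`book`, `chapter`, `pb`, `id`, `n`) and the output field names are assumptions made for the example, not the actual format of our data, and the sketch assumes that a chapter contains only plain text and page-break tags as direct children.

```python
import xml.etree.ElementTree as ET


def split_into_pages(xml_text):
    """Return one dict per (book, chapter, page) chunk of text.

    Assumed (hypothetical) layout: <book id> contains <chapter id> elements,
    and each chapter holds plain text interleaved with empty <pb n="..."/>
    tags that mark where a printed page starts.
    """
    book = ET.fromstring(xml_text)
    docs = []
    current_page = None  # unknown until the first <pb/> is seen
    for chapter in book.iter("chapter"):
        # Text before the first <pb/> still belongs to the page that was
        # open at the end of the previous chapter.
        chunks = [(current_page, chapter.text or "")]
        for pb in chapter.findall("pb"):
            current_page = pb.get("n")
            # In ElementTree, the text following an empty tag is its tail.
            chunks.append((current_page, pb.tail or ""))
        for page, text in chunks:
            text = " ".join(text.split())  # normalise whitespace
            if text:
                docs.append({
                    "id": f"{book.get('id')}/{chapter.get('id')}/{page}",
                    "book_id": book.get("id"),
                    "chapter_id": chapter.get("id"),
                    "page": page,
                    "text": text,
                })
    return docs


sample = """<book id="b1">
  <chapter id="c1"><pb n="1"/>First page text <pb n="2"/>second page text</chapter>
  <chapter id="c2">still page two <pb n="3"/>third page</chapter>
</book>"""

for doc in split_into_pages(sample):
    print(doc["id"], "->", doc["text"])
```

Each record keeps both `book_id` and `chapter_id`, so the same data could be regrouped into chapter-level documents (option 2 in the quoted message) or kept as page-level documents (option 1). Neither choice by itself addresses matches that span a page break.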