From: Erick Erickson
Date: Sat, 5 Nov 2016 21:32:18 -0700
Subject: Re: Parallelize Cursor approach
To: solr-user
Hmmm, /export is supposed to handle result sets in the tens of millions. I
know of a situation where the Streaming Aggregation functionality,
back-ported to Solr 4.10, processes at that scale. So do you have any clue
what exactly is failing? Is there anything in the Solr logs?

_How_ are you using /export: through Streaming Aggregation (SolrJ), or just
the raw /export handler? If you're not already using SolrJ, it might be
worth trying just to test; it should be a very quick program to write,
100 lines max.

You could always roll your own cursorMark scheme by partitioning the data
among N threads/processes, if you have any reasonable expectation that you
can form filter queries that partition the result set anywhere near evenly.

For example, let's say you have a field with random numbers between 0 and
100. You could spin off 10 cursorMark-aware processes, each with its own fq
clause:

fq=partition_field:[0 TO 10}
fq=partition_field:[10 TO 20}
....
fq=partition_field:[90 TO 100]

Note the use of inclusive/exclusive end points. Each process would be
totally independent of all the others, with no overlapping documents. And
since the fq's would presumably be cached, you should be able to go as fast
as you can drive your cluster. Of course you lose query-wide sorting and
the like; if that's important, you'd need to figure something out there.

Do be aware of a potential issue: when regular doc fields are returned, a
16K block of data must be decompressed for each returned document to get
the stored field data. Streaming Aggregation (/export) reads docValues
entries, which are held in MMapDirectory space, so it will be much, much
faster. As of Solr 5.5 you can skip the decompression for fields that are
both stored and docValues, see:
https://issues.apache.org/jira/browse/SOLR-8220

Best,
Erick

On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi wrote:
> Thanks Yonik for the explanation.
>
> Hi Erick,
> I was using the /export functionality.
> But it hasn't been stable (Solr 5.5.0). I started running into runtime
> exceptions (JSON parsing exceptions) while reading the stream of Tuples.
> This started happening as the size of my collection increased 3 times and
> I started running queries that return millions of documents (>10MM). I
> don't know if it is the query result size or the actual data size (total
> number of docs in the collection) that is causing the instability.
>
> org.noggit.JSONParser$ParseException: Expected ',' or '}':
> char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
> 0lG99sHT8P5e'
>
> I won't be able to move to Solr 6.0 due to some constraints in our
> production environment, so I am moving back to the cursor approach. Do
> you have any other suggestions for me?
>
> Thanks,
> Chetas.
>
> On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson wrote:
>
>> Have you considered the /export functionality?
>>
>> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley wrote:
>> > No, you can't get cursor marks ahead of time.
>> > They are the serialized representation of the last sort values
>> > encountered (hence not known ahead of time).
>> >
>> > -Yonik
>> >
>> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi wrote:
>> >> Hi,
>> >>
>> >> I am using the cursor approach to fetch results from Solr (5.5.0).
>> >> Most of my queries return millions of results. Is there a way I can
>> >> read the pages in parallel? Is there a way I can get all the cursors
>> >> well in advance?
>> >>
>> >> Let's say my query returns 2M documents and I have set rows=100,000.
>> >> Can I have multiple threads iterating over different pages, like
>> >> Thread1 -> docs 1 to 100K
>> >> Thread2 -> docs 101K to 200K
>> >> ......
>> >> ......
>> >>
>> >> For this to happen, can I get all the cursorMarks for a given query
>> >> so that I can leverage the following code in parallel?
>> >>
>> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> val rsp: QueryResponse = c.query(cursorQ)
>> >>
>> >> Thank you,
>> >> Chetas.
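[Editor's note] Since cursorMarks cannot be precomputed, the workable pattern from this thread is Erick's: one fully independent cursor per non-overlapping fq range, each run in its own thread. The sketch below is a minimal Python illustration of that pattern, not real Solr client code: `fetch_page` is a mock standing in for an actual SolrJ/HTTP cursorMark round trip, and `partition_field` with the 0-100 range are just the example values from the thread.

```python
import threading

def partition_fqs(field, lo, hi, n):
    """Split [lo, hi] into n non-overlapping Solr range fq clauses.

    Every partition is inclusive-start / exclusive-end ('[x TO y}')
    except the last, which closes with ']' so the upper bound itself is
    covered exactly once -- the inclusive/exclusive endpoints Erick notes.
    """
    step = (hi - lo) // n
    fqs = []
    for i in range(n):
        start, end = lo + i * step, lo + (i + 1) * step
        if i < n - 1:
            fqs.append(f"{field}:[{start} TO {end}}}")
        else:
            fqs.append(f"{field}:[{start} TO {hi}]")
    return fqs

# --- Mock corpus; in reality each partition is served by Solr. --------
DOCS = [{"id": i, "partition_field": i % 100} for i in range(1000)]
PAGE_SIZE = 50

def fetch_page(lo, hi, cursor):
    """Stand-in for one cursor query with fq=partition_field:[lo TO hi}.

    Returns one page of matches, the next cursor position, and a
    done flag (analogous to cursorMark no longer advancing).
    """
    matches = [d for d in DOCS if lo <= d["partition_field"] < hi]
    page = matches[cursor:cursor + PAGE_SIZE]
    nxt = cursor + len(page)
    return page, nxt, nxt >= len(matches)

def drain_partition(lo, hi, out):
    """One independent cursor loop over a single fq partition."""
    cursor, done = 0, False
    while not done:
        page, cursor, done = fetch_page(lo, hi, cursor)
        out.extend(page)

print(partition_fqs("partition_field", 0, 100, 10)[0])  # partition_field:[0 TO 10}

results = [[] for _ in range(10)]
threads = [threading.Thread(target=drain_partition, args=(lo, lo + 10, results[i]))
           for i, lo in enumerate(range(0, 100, 10))]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every document is seen exactly once across all partitions.
assert sorted(d["id"] for part in results for d in part) == list(range(1000))
```

As Erick points out, each thread's results arrive only in that partition's own sort order; there is no query-wide ordering across threads unless you merge afterwards.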