From: Erick Erickson
Date: Sat, 5 Nov 2016 21:32:18 -0700
Subject: Re: Parallelize Cursor approach
To: solr-user
Hmmm, /export is supposed to handle result sets in the tens of millions. I
know of a situation where the Streaming Aggregation functionality,
back-ported to Solr 4.10, processes at that scale. So do you have any clue
what exactly is failing? Is there anything in the Solr logs?

_How_ are you using /export: through Streaming Aggregation (SolrJ), or just
the raw /export handler? If you're not already using SolrJ, it might be
worth trying just to test; it should be a very quick program to write,
100 lines max.

You could always roll your own cursorMark scheme by partitioning the data
among N threads/processes, if you have any reasonable expectation that you
can form filter queries that partition the result set anywhere near evenly.

For example, let's say you have a field with random numbers between 0 and
100. You could spin off 10 cursorMark-aware processes, each with its own fq
clause:

fq=partition_field:[0 TO 10}
fq=partition_field:[10 TO 20}
....
fq=partition_field:[90 TO 100]

Note the use of inclusive/exclusive end points. Each process would be
totally independent of all the others, with no overlapping documents. And
since the fq's would presumably be cached, you should be able to go as fast
as you can drive your cluster. Of course you lose query-wide sorting and
the like; if that's important, you'd need to figure something out there.

Do be aware of a potential issue: when regular doc fields are returned, a
16K block of data must be decompressed for each returned document to get
the stored field data. Streaming Aggregation (/export) reads docValues
entries, which are held in MMapDirectory space, so it will be much, much
faster. As of Solr 5.5 you can skip the decompression for fields that are
both stored and docValues, see:
https://issues.apache.org/jira/browse/SOLR-8220

Best,
Erick

On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi wrote:
> Thanks Yonik for the explanation.
>
> Hi Erick,
> I was using the /export functionality.
> But it hasn't been stable (Solr 5.5.0). I started running into runtime
> exceptions (JSON parsing exceptions) while reading the stream of Tuples.
> This started happening as the size of my collection increased 3 times and
> I started running queries that return millions of documents (>10MM). I
> don't know if it is the query result size or the actual data size (total
> number of docs in the collection) that is causing the instability.
>
> org.noggit.JSONParser$ParseException: Expected ',' or '}':
> char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
> 0lG99sHT8P5e'
>
> I won't be able to move to Solr 6.0 due to some constraints in our
> production environment, so I am moving back to the cursor approach. Do
> you have any other suggestions for me?
>
> Thanks,
> Chetas.
>
> On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson wrote:
>
>> Have you considered the /export functionality?
>>
>> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley wrote:
>> > No, you can't get cursor marks ahead of time.
>> > They are the serialized representation of the last sort values
>> > encountered (hence not known ahead of time).
>> >
>> > -Yonik
>> >
>> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi wrote:
>> >> Hi,
>> >>
>> >> I am using the cursor approach to fetch results from Solr (5.5.0).
>> >> Most of my queries return millions of results. Is there a way I can
>> >> read the pages in parallel? Is there a way I can get all the cursors
>> >> well in advance?
>> >>
>> >> Let's say my query returns 2M documents and I have set rows=100,000.
>> >> Can I have multiple threads iterating over different pages, like
>> >> Thread1 -> docs 1 to 100K
>> >> Thread2 -> docs 101K to 200K
>> >> ......
>> >> ......
>> >>
>> >> For this to happen, can I get all the cursorMarks for a given query
>> >> so that I can leverage the following code in parallel?
>> >>
>> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> val rsp: QueryResponse = c.query(cursorQ)
>> >>
>> >> Thank you,
>> >> Chetas.
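[Editor's note] Since cursorMarks cannot be precomputed, the workable pattern from this thread is Erick's: one fully independent cursor per non-overlapping fq range, each run in its own thread. The sketch below is a minimal Python illustration of that pattern, not real Solr client code: `fetch_page` is a mock standing in for an actual SolrJ/HTTP cursorMark round trip, and `partition_field` with the 0-100 range are just the example values from the thread.

```python
import threading

def partition_fqs(field, lo, hi, n):
    """Split [lo, hi] into n non-overlapping Solr range fq clauses.

    Every partition is inclusive-start / exclusive-end ('[x TO y}')
    except the last, which closes with ']' so the upper bound itself is
    covered exactly once -- the inclusive/exclusive endpoints Erick notes.
    """
    step = (hi - lo) // n
    fqs = []
    for i in range(n):
        start, end = lo + i * step, lo + (i + 1) * step
        if i < n - 1:
            fqs.append(f"{field}:[{start} TO {end}}}")
        else:
            fqs.append(f"{field}:[{start} TO {hi}]")
    return fqs

# --- Mock corpus; in reality each partition is served by Solr. --------
DOCS = [{"id": i, "partition_field": i % 100} for i in range(1000)]
PAGE_SIZE = 50

def fetch_page(lo, hi, cursor):
    """Stand-in for one cursor query with fq=partition_field:[lo TO hi}.

    Returns one page of matches, the next cursor position, and a
    done flag (analogous to cursorMark no longer advancing).
    """
    matches = [d for d in DOCS if lo <= d["partition_field"] < hi]
    page = matches[cursor:cursor + PAGE_SIZE]
    nxt = cursor + len(page)
    return page, nxt, nxt >= len(matches)

def drain_partition(lo, hi, out):
    """One independent cursor loop over a single fq partition."""
    cursor, done = 0, False
    while not done:
        page, cursor, done = fetch_page(lo, hi, cursor)
        out.extend(page)

print(partition_fqs("partition_field", 0, 100, 10)[0])  # partition_field:[0 TO 10}

results = [[] for _ in range(10)]
threads = [threading.Thread(target=drain_partition, args=(lo, lo + 10, results[i]))
           for i, lo in enumerate(range(0, 100, 10))]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every document is seen exactly once across all partitions.
assert sorted(d["id"] for part in results for d in part) == list(range(1000))
```

As Erick points out, each thread's results arrive only in that partition's own sort order; there is no query-wide ordering across threads unless you merge afterwards.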