lucene-dev mailing list archives

From "Joel Bernstein (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-9636) Add support for javabin for /stream, /sql internode communication
Date Tue, 27 Dec 2016 23:39:58 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781496#comment-15781496
] 

Joel Bernstein edited comment on SOLR-9636 at 12/27/16 11:39 PM:
-----------------------------------------------------------------

I finally had the time to test out the javabin writer with the /export handler and streaming
stack. My initial findings are really good. Here is a summary:

1) Currently testing must be done on branch_6x. There is a bug in master which breaks the
/export handler. I haven't gotten to the bottom of it yet but I'm pretty sure it was introduced
with the new docValues iterator API which is only in master. I will open a ticket for this
bug shortly and see if I can fix the problem.

But testing in branch_6x is better anyway, as it isolates the javabin /export performance
from the performance of the docValues iterator API.

2) For my test I worked on a single Solr instance with a single data shard (collection1) loaded
with 10,000,000 small documents. I also created a worker collection with 5 shards (collection2).
Then I ran the following expression with and without the javabin writer.
{code}
parallel(collection2, workers=5, sort="test_s desc", 
         rollup(over="test_s", sum(price_f),
                search(collection1, 
                       q=*:*,
                       fl="test_s, price_f", 
                       sort="test_s desc", 
                       qt="/export", 
                       wt="javabin", 
                       partitionKeys=test_s)))
{code}

Notice that there are five parallel workers (collection2) partitioning the stream from a single
data shard (collection1). This is how you achieve maximum throughput from a single node.
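
To illustrate why partitioning on the rollup key keeps the parallel result correct, here is a
minimal Python sketch. It is not Solr's implementation: the hash function (crc32), the helper
names (worker_for, partition, rollup_sum, parallel_rollup) and the in-memory lists are all
assumptions made for illustration, and it ignores the sort order that the real rollup relies on.
The only point it shows is that when every tuple with the same test_s value lands on the same
worker, each worker can sum its own groups independently.

```python
import zlib
from collections import defaultdict

NUM_WORKERS = 5  # mirrors workers=5 in the expression; everything else is hypothetical


def worker_for(key, num_workers=NUM_WORKERS):
    """Map a partition key to a worker index (illustrative hash, not Solr's)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partition(tuples, key_field, num_workers=NUM_WORKERS):
    """Split a stream of tuples into one list per worker, keyed by hash of key_field."""
    per_worker = defaultdict(list)
    for t in tuples:
        per_worker[worker_for(t[key_field], num_workers)].append(t)
    return per_worker


def rollup_sum(tuples, over, field):
    """Sum `field` for each distinct value of `over`."""
    totals = defaultdict(float)
    for t in tuples:
        totals[t[over]] += t[field]
    return dict(totals)


def parallel_rollup(tuples, over, field, num_workers=NUM_WORKERS):
    """Roll up each worker's partition separately, then merge the disjoint results."""
    merged = {}
    for worker_tuples in partition(tuples, over, num_workers).values():
        # Keys never span workers, so a plain update cannot overwrite a partial sum.
        merged.update(rollup_sum(worker_tuples, over, field))
    return merged
```

Calling parallel_rollup(docs, "test_s", "price_f") on a list of dicts should give the same
totals as a single rollup_sum over the whole list, because no group is split across workers.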

3) Throughput numbers were fairly impressive with this test expression:

* With json writer: 900,000 Tuples per second.
* With javabin writer: 1,100,000 Tuples per second.

So javabin gives a significant throughput boost, roughly 22% over json in this test. It's also
nice to have an example of 1 million+ documents per second from a single node.
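
For reference, the arithmetic behind those two numbers can be sketched as below. The rates and
the 10,000,000 document count come from this test; the function names are made up for
illustration, and the streaming times assume the rate is sustained over the whole export, which
was not measured separately.

```python
DOCS = 10_000_000        # documents loaded into collection1
JSON_TPS = 900_000       # tuples per second with the json writer
JAVABIN_TPS = 1_100_000  # tuples per second with the javabin writer


def relative_gain(baseline_tps, improved_tps):
    """Fractional throughput gain of the improved rate over the baseline."""
    return improved_tps / baseline_tps - 1.0


def seconds_to_stream(docs, tuples_per_second):
    """Time to stream all docs, assuming a constant rate (an assumption)."""
    return docs / tuples_per_second


gain = relative_gain(JSON_TPS, JAVABIN_TPS)         # about 0.22
json_secs = seconds_to_stream(DOCS, JSON_TPS)        # about 11.1 seconds
javabin_secs = seconds_to_stream(DOCS, JAVABIN_TPS)  # about 9.1 seconds
```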

4) Javabin also produced a much smaller output, roughly 50% smaller than json.

5) I also reviewed the code and it looks really nice. Big improvement as far as cleaning up the
integration with Solr goes.

6) The core export/sort algorithm looked to be untouched, which was nice because there was
a lot of hardening on that in the past. My biggest concern going into this ticket was that
refactoring would cause a change in the export/sort algorithm and we'd have to go through the
hardening all over again. But that was not the case.

Very nice work [~noble.paul]! Big improvements and so far I haven't found any functional problems.
I will continue testing.









> Add support for javabin for /stream, /sql internode communication
> -----------------------------------------------------------------
>
>                 Key: SOLR-9636
>                 URL: https://issues.apache.org/jira/browse/SOLR-9636
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Noble Paul
>            Assignee: Noble Paul
>             Fix For: master (7.0), 6.4
>
>         Attachments: SOLR-9636.patch
>
>
> currently it uses json, which is verbose and slow



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

