lucene-dev mailing list archives

From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-8406) Make ByteBufferIndexInput public
Date Tue, 17 Jul 2018 13:05:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-8406:
--------------------------------
    Description: 
The logic for handling byte buffer splits, closing them properly (via the cleaner) and all the
trickery involved in slicing, cloning and proper exception handling is quite daunting.

While ByteBufferIndexInput.newInstance(..) is public, the parent class ByteBufferIndexInput
is not. I think we should make the parent class public to allow advanced users to make use
of this (complex) piece of code to create IndexInput based on a sequence of ByteBuffers.
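
To illustrate what "an IndexInput based on a sequence of ByteBuffers" has to do at its core,
here is a minimal sketch in plain Java. It is not Lucene code: the class ChunkedBytes and its
members are hypothetical names made up for this example, and it assumes all chunks except the
last have the same power-of-two size, so that a logical position can be mapped to a chunk with
a shift and to an offset inside that chunk with a mask.

```java
import java.nio.ByteBuffer;

// Hypothetical helper (not part of Lucene): exposes several fixed-size
// ByteBuffer chunks as one logical, randomly addressable byte sequence.
final class ChunkedBytes {
    private final ByteBuffer[] chunks;
    private final int chunkSizePower;   // each full chunk holds 2^chunkSizePower bytes
    private final long chunkSizeMask;   // low bits select the offset inside a chunk
    private final long length;

    ChunkedBytes(byte[] data, int chunkSizePower) {
        this.chunkSizePower = chunkSizePower;
        this.chunkSizeMask = (1L << chunkSizePower) - 1;
        this.length = data.length;

        int chunkSize = 1 << chunkSizePower;
        int count = (int) (((long) data.length + chunkSize - 1) >>> chunkSizePower);
        chunks = new ByteBuffer[count];
        for (int i = 0; i < count; i++) {
            int offset = i << chunkSizePower;
            int len = Math.min(chunkSize, data.length - offset);
            // slice() makes every chunk start at index 0, whatever its offset in data.
            chunks[i] = ByteBuffer.wrap(data, offset, len).slice();
        }
    }

    long length() {
        return length;
    }

    // Reads one byte at a logical position that may fall into any chunk.
    byte readByte(long pos) {
        if (pos < 0 || pos >= length) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        ByteBuffer chunk = chunks[(int) (pos >>> chunkSizePower)];
        // Absolute get: does not touch the chunk's position, so no shared state is mutated.
        return chunk.get((int) (pos & chunkSizeMask));
    }
}
```

With chunkSizePower = 2, a 10-byte array is split into chunks of 4, 4 and 2 bytes, and
readByte(5) is served from index 1 of the second chunk. Multi-byte reads that straddle a chunk
boundary, cloning, slicing and unmapping are deliberately left out here.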

One particular example here is RAMDirectory, which currently uses a custom IndexInput implementation,
which in turn calls into RAMFile's synchronized methods. This causes quite dramatic lock
contention on multithreaded systems. While we clearly discourage RAMDirectory from being
used in production environments, there really is no need for it to be slow. If modified only
slightly (to use ByteBuffer-based input), the performance is on par with FSDirectory. Here's
a sample log comparing FSDirectory with RAMDirectory and the "modified" RAMDirectory making
use of the ByteBuffer input:

{code}
14:26:40 INFO  console: FSDirectory index.
14:26:41 INFO  console: Opened with 299943 documents.
14:26:50 INFO  console: Finished: 8.820 s, 240000 matches.

14:26:50 INFO  console: RAMDirectory index.
14:26:50 INFO  console: Opened with 299943 documents.
14:28:50 INFO  console: Finished: 2.012 min, 240000 matches.

14:28:50 INFO  console: RAMDirectory2 index (wrapped byte[] buffers).
14:28:50 INFO  console: Opened with 299943 documents.
14:29:00 INFO  console: Finished: 9.215 s, 240000 matches.

14:29:00 INFO  console: RAMDirectory2 index (direct memory buffers).
14:29:00 INFO  console: Opened with 299943 documents.
14:29:08 INFO  console: Finished: 8.817 s, 240000 matches.
{code}

Note the performance difference is more than an order of magnitude on this 32-CPU system (about
2 minutes vs. 9 seconds). The tiny performance difference between the implementation based on
direct memory buffers and the one based on buffers acquired via ByteBuffer.wrap(byte[]) is
probably (my best guess) due to the fact that direct buffers access their data via Unsafe,
while the wrapped counterpart uses regular Java array access.
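
For readers unfamiliar with the two buffer flavours compared above, here is a small sketch
in plain Java. It is not the RAMDirectory2 code from the benchmark: the class BufferViews and
its methods are hypothetical names for this example. It shows that heap-wrapped and direct
buffers are read through the same ByteBuffer API, and that each reader can work on its own
duplicate() view, which has a private position and limit but shares the content. The assumption
is that nobody modifies the shared buffer while readers create or use their views.

```java
import java.nio.ByteBuffer;

// Hypothetical helper (not part of Lucene) contrasting wrapped and direct buffers.
final class BufferViews {

    // Heap buffer backed directly by the given array (no copy is made).
    static ByteBuffer wrapped(byte[] data) {
        return ByteBuffer.wrap(data);
    }

    // Off-heap buffer holding a copy of the given array.
    static ByteBuffer direct(byte[] data) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(data.length);
        buffer.put(data);
        buffer.flip(); // make the written bytes readable from position 0
        return buffer;
    }

    // Sums all readable bytes through a private view, leaving the shared
    // buffer's position and limit untouched.
    static long sum(ByteBuffer shared) {
        ByteBuffer view = shared.duplicate();
        long total = 0;
        while (view.hasRemaining()) {
            total += view.get();
        }
        return total;
    }
}
```

Because sum(..) only advances the position of its own view, several threads can call it on
the same read-only source buffer without a lock, which is the property that synchronized
accessors give up.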


  was:
The logic of handling byte buffers splits, their proper closing (cleaner) and all the trickery
involved in slicing, cloning and proper exception handling is quite daunting. 

While ByteBufferIndexInput.newInstance(..) is public, the parent class ByteBufferIndexInput
is not. I think we should make the parent class public to allow advanced users to make use
of this (complex) piece of code to create IndexInput based on a sequence of ByteBuffers.

The specific rationale I'm aiming at here is RAMDirectory, which currently uses a custom IndexInput
implementation, which in turn reaches to RAMFile's synchronized methods. This is the cause
of quite dramatic congestions on multithreaded systems. While we clearly discourage RAMDirectory
from being used in production environments, there really is no need for it to be slow. If
modified only slightly (to use ByteBuffer-based input), the performance is on par with FSDirectory.
Here's a sample log comparing FSDirectory with RAMDirectory and the "modified" RAMDirectory
making use of the ByteBuffer input:

{code}
14:26:40 INFO  console: FSDirectory index.
14:26:41 INFO  console: Opened with 299943 documents.
14:26:50 INFO  console: Finished: 8.820 s, 240000 matches.

14:26:50 INFO  console: RAMDirectory index.
14:26:50 INFO  console: Opened with 299943 documents.
14:28:50 INFO  console: Finished: 2.012 min, 240000 matches.

14:28:50 INFO  console: RAMDirectory2 index (wrapped byte[] buffers).
14:28:50 INFO  console: Opened with 299943 documents.
14:29:00 INFO  console: Finished: 9.215 s, 240000 matches.

14:29:00 INFO  console: RAMDirectory2 index (direct memory buffers).
14:29:00 INFO  console: Opened with 299943 documents.
14:29:08 INFO  console: Finished: 8.817 s, 240000 matches.
{code}

Note the performance difference is an order of magnitude on this 32-CPU system (2 minutes
vs. 9 seconds). The tiny performance difference between the implementation based on direct
memory buffers vs. those acquired via ByteBuffer.wrap(byte[]) is due to the fact that direct
buffers access their data via unsafe and the wrapped counterpart uses regular java array access
(my best guess).


> Make ByteBufferIndexInput public
> --------------------------------
>
>                 Key: LUCENE-8406
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8406
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 6.7
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

