lucene-dev mailing list archives

From "Jason Rutherglen (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations
Date Sat, 11 Sep 2010 22:49:32 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2575:
-------------------------------------

    Attachment: LUCENE-2575.patch

Here's a start at concurrency, the terms dictionary, and
iterating over doc ids. 

* It needs concurrency unit tests

* At an as-yet-undetermined interval, we need to consolidate
the existing terms into a sorted int[] of term ids rather than
continue to use the ConcurrentSkipListMap, which consumes far
more RAM. The tradeoff, and the reason for using the CSLM, is
that it gives a higher level of concurrency at the cost of
greater memory consumption than the sorted int[] of term ids.

* An int[] based term enum needs to be implemented, as does a
multi term enum. Maybe there's an existing one we can use; I'm
not familiar enough with the new flex code base.

* Copy on write is used to obtain a read-only version of the
ByteBlockPool and IntBlockPool. In the case of the byte blocks,
a boolean[] marks which elements must be copied before the
DocumentsWriterPerThread writes to them when rewriting a byte
slice forwarding address.

* A write lock on each DWPT guarantees that as reference copies
are made, arrays being copied will not be altered in flight.
To get a complete IndexReader[] we need to wait for each
document to finish flushing, but that shouldn't be an issue:
we're not blocking indexing, only the obtaining of the IRs. I
can't see this being a problem for most use cases.

* Similarly, a reference to the ParallelPostingsArray is copied
(rather than a full copy being made) for use by the RAM Buffer
based IndexReader. It is OK for the PPA to be changed during
future doc adds, as only the elements greater than the IR's max
term id will be altered, i.e., we're not going to run into JMM
thread issues because the writing and the read-only array
reference copies occur under a reentrant lock.

* Recycling of byte[]s becomes a bit more complex as RAM IRs will
likely hold references to them. When the RAM IR is closed, however,
the byte[]s can be recycled. The user could experience unusual
RAM usage spikes if IRs are not closed properly.
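
The copy-on-write and locking points above can be sketched roughly
as follows. This is only an illustration, not the code in the
attached patch: the class CowBlockPoolSketch, its methods, and the
tiny BLOCK_SIZE are all hypothetical, and a single ReentrantLock
stands in for the per-DWPT write lock. A snapshot copies only the
array of block references and marks every existing block in a
boolean[]; the writer clones a marked block before its next write,
so a reader's view never changes underneath it.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; names and sizes are assumptions, not Lucene's API.
final class CowBlockPoolSketch {
    static final int BLOCK_SIZE = 8; // assumed tiny size for the demo

    private byte[][] blocks = new byte[2][];
    // shared[i] == true: a snapshot may still reference blocks[i],
    // so the writer must copy it before modifying it.
    private boolean[] shared = new boolean[2];
    private int blockCount = 0;
    private final ReentrantLock lock = new ReentrantLock();

    /** Writer side: append a fresh, unshared block and return its index. */
    int addBlock() {
        lock.lock();
        try {
            if (blockCount == blocks.length) {
                blocks = Arrays.copyOf(blocks, blockCount * 2);
                shared = Arrays.copyOf(shared, blockCount * 2);
            }
            blocks[blockCount] = new byte[BLOCK_SIZE];
            return blockCount++;
        } finally {
            lock.unlock();
        }
    }

    /** Writer side: copy the block first if a snapshot may reference it. */
    void write(int block, int offset, byte value) {
        lock.lock();
        try {
            if (block >= blockCount) {
                throw new IndexOutOfBoundsException("no such block: " + block);
            }
            if (shared[block]) {
                blocks[block] = blocks[block].clone();
                shared[block] = false;
            }
            blocks[block][offset] = value;
        } finally {
            lock.unlock();
        }
    }

    /** Reader side: copy references only, and mark all blocks as shared. */
    byte[][] snapshot() {
        lock.lock();
        try {
            Arrays.fill(shared, 0, blockCount, true);
            return Arrays.copyOf(blocks, blockCount);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        CowBlockPoolSketch pool = new CowBlockPoolSketch();
        int b = pool.addBlock();
        pool.write(b, 0, (byte) 1);
        byte[][] snap = pool.snapshot();
        pool.write(b, 0, (byte) 2);                 // triggers the copy
        System.out.println(snap[0][0]);             // prints 1
        System.out.println(pool.snapshot()[0][0]);  // prints 2
    }
}
```

Blocks added after a snapshot are simply absent from it, which is
why later writes cannot disturb an existing reader in this sketch.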



> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?
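
For illustration, the seekable, random access file like API
suggested in the description might look roughly like the sketch
below. Everything here is an assumption made for the example: the
class SeekableBlocksSketch, its method names, and the tiny
BLOCK_SIZE do not come from Lucene or from the patch. It is
single-threaded and leaves out the locking flush the description
asks for; it only shows how a position can hide the underlying
byte arrays.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not an existing Lucene class.
final class SeekableBlocksSketch {
    private static final int BLOCK_SIZE = 4; // tiny, to exercise block boundaries

    private final List<byte[]> blocks = new ArrayList<>();
    private long position = 0;

    /** Move the read/write position to an absolute offset. */
    void seek(long pos) {
        if (pos < 0) {
            throw new IllegalArgumentException("negative position: " + pos);
        }
        position = pos;
    }

    long position() {
        return position;
    }

    /** Write one byte at the current position, allocating blocks as needed. */
    void writeByte(byte b) {
        int blockIndex = (int) (position / BLOCK_SIZE);
        while (blocks.size() <= blockIndex) {
            blocks.add(new byte[BLOCK_SIZE]);
        }
        blocks.get(blockIndex)[(int) (position % BLOCK_SIZE)] = b;
        position++;
    }

    /** Read one byte at the current position; unwritten bytes read as 0. */
    byte readByte() {
        int blockIndex = (int) (position / BLOCK_SIZE);
        if (blockIndex >= blocks.size()) {
            throw new IndexOutOfBoundsException("read past end: " + position);
        }
        byte b = blocks.get(blockIndex)[(int) (position % BLOCK_SIZE)];
        position++;
        return b;
    }

    public static void main(String[] args) {
        SeekableBlocksSketch buf = new SeekableBlocksSketch();
        for (byte i = 0; i < 6; i++) {
            buf.writeByte(i);           // spans two blocks
        }
        buf.seek(3);
        System.out.println(buf.readByte()); // prints 3
        System.out.println(buf.readByte()); // prints 4 (next block)
    }
}
```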

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

