arrow-user mailing list archives

From "Sam Shleifer" <sshlei...@gmail.com>
Subject Re: Python Plasma Store Best Practices
Date Tue, 02 Mar 2021 17:49:41 GMT
Thanks, had no idea!

On Tue, Mar 02, 2021 at 12:00 PM, Micah Kornfield <emkornfield@gmail.com> wrote:

> 
> Hi Sam,
> I think the lack of responses might be because Plasma is not being
> actively maintained.  The original authors have forked it into the Ray
> project.
> 
> 
> I'm sorry I don't have the expertise to answer your questions.
> 
> 
> -Micah
> 
> On Mon, Mar 1, 2021 at 6:48 PM Sam Shleifer <sshleifer@gmail.com> wrote:
> 
> 
>> Partial answers are super helpful!
>> 
>> I'm happy to break this up if it's too much for 1 question @moderators
>> 
>> Sam
>> 
>> 
>> 
>> 
>> 
>> 
>> On Sat, Feb 27, 2021 at 1:27 PM, Sam Shleifer <sshleifer@gmail.com> wrote:
>> 
>>> Hi!
>>> 
>>> I am trying to use the plasma store to reduce the memory usage of a pytorch
>>> dataset/dataloader combination, and have 4 questions. I don’t think any of
>>> them require pytorch knowledge. If you prefer to comment inline, there is a
>>> quip with identical content and prettier formatting here:
>>> https://quip.com/3mwGAJ9KR2HT
>>> 
>>> 
>>> 
>>> *1)* My script starts the plasma-store from python with 200 GB:
>>> 
>>> 
>>> 
>>> import subprocess
>>> 
>>> nbytes = (1024 ** 3) * 200
>>> # path is the Unix socket the store listens on (clients connect to it later)
>>> _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])
>>> 
>>> where nbytes is chosen arbitrarily. From my experiments it seems that one
>>> should start the store as large as possible within the limits of /dev/shm.
>>> I wanted to verify whether this is actually the best practice (it would be
>>> hard for my app to know its storage needs up front) and also whether there
>>> is an automated way to figure out how much storage to allocate.
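>>> 
>>> (For illustration, the kind of automated sizing I have in mind is the rough
>>> sketch below. It assumes the store is backed by /dev/shm and that leaving a
>>> 5% safety margin is enough; both choices and the socket path are made up.)
>>> 
>>> import shutil
>>> import subprocess
>>> 
>>> path = "/tmp/plasma"                           # socket path, made up
>>> free_bytes = shutil.disk_usage("/dev/shm").free
>>> nbytes = int(free_bytes * 0.95)                # leave a 5% safety margin
>>> _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])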
>>> 
>>> 
>>> 
>>> *2)* Does the plasma store support simultaneous reads? My code, which has
>>> multiple clients all asking for the 6 arrays from the plasma-store
>>> thousands of times, was segfaulting with different errors, e.g.
>>> 
>>> Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1
>>> 
>>> until I added a lock around my client.get calls:
>>> 
>>> 
>>> 
>>> from filelock import FileLock
>>> 
>>> if self.use_lock:  # Fix segfault
>>>     with FileLock("/tmp/plasma_lock"):
>>>         ret = self.client.get(self.object_id)
>>> else:
>>>     ret = self.client.get(self.object_id)
>>> 
>>> 
>>> 
>>> which fixes the segfault.
>>> 
>>> 
>>> 
>>> Here is a full traceback of the failure without the lock:
>>> https://gist.github.com/sshleifer/75145ba828fcb4e998d5e34c46ce13fc
>>> 
>>> Is this expected behavior?
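>>> 
>>> (For context, the access pattern is roughly the sketch below: several worker
>>> processes each hold their own client and repeatedly get the same object. The
>>> socket path, array size, and worker count are made up; the real code is the
>>> fairseq PR linked under "Replication".)
>>> 
>>> import multiprocessing as mp
>>> import numpy as np
>>> import pyarrow.plasma as plasma
>>> 
>>> def reader(path, oid_bytes, n_reads=1000):
>>>     client = plasma.connect(path)            # one client per worker process
>>>     object_id = plasma.ObjectID(oid_bytes)
>>>     for _ in range(n_reads):
>>>         arr = client.get(object_id)          # simultaneous reads across workers
>>> 
>>> if __name__ == "__main__":
>>>     path = "/tmp/plasma"                     # socket passed to plasma_store -s
>>>     client = plasma.connect(path)
>>>     object_id = client.put(np.random.rand(1024, 1024))
>>>     workers = [mp.Process(target=reader, args=(path, object_id.binary()))
>>>                for _ in range(4)]
>>>     for w in workers:
>>>         w.start()
>>>     for w in workers:
>>>         w.join()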
>>> 
>>> 
>>> 
>>> *3)* Is there a simple way to add many objects to the plasma store at
>>> once? Right now, we are considering changing,
>>> 
>>> 
>>> 
>>> oid = client.put(array)
>>> 
>>> to
>>> 
>>> oids = [client.put(x) for x in array]
>>> 
>>> 
>>> 
>>> so that we can fetch one entry at a time, but the writes are much slower.
>>> 
>>> 
>>> 
>>> * 3a) Is there a lower level interface for bulk writes? (see the sketch below)
>>> 
>>> * 3b) Or is it recommended to chunk the array and have different python
>>> processes write simultaneously to make this faster?
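>>> 
>>> (For 3a, the kind of lower-level interface I mean is the create/seal pattern
>>> sketched below, if that is the right building block; the socket path and
>>> array sizes are made up.)
>>> 
>>> import numpy as np
>>> import pyarrow.plasma as plasma
>>> 
>>> client = plasma.connect("/tmp/plasma")              # socket path, made up
>>> arrays = [np.random.rand(4096) for _ in range(6)]
>>> 
>>> oids = []
>>> for arr in arrays:
>>>     oid = plasma.ObjectID(np.random.bytes(20))      # random 20-byte object id
>>>     buf = memoryview(client.create(oid, arr.nbytes))   # allocate shared memory
>>>     np.frombuffer(buf, dtype=arr.dtype)[:] = arr       # copy straight into it
>>>     client.seal(oid)                                   # make visible to readers
>>>     oids.append(oid)
>>> 
>>> (Reading these back would presumably need client.get_buffers plus
>>> np.frombuffer rather than client.get, since the bytes are raw rather than a
>>> serialized object.)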
>>> 
>>> 
>>> 
>>> *4)* Is there a way to save/load the contents of the plasma-store to disk
>>> without loading everything into memory and then saving it to some other
>>> format?
>>> 
>>> 
>>> 
>>> Replication
>>> 
>>> 
>>> 
>>> Setup instructions for fairseq + replicating the segfault:
>>> https://gist.github.com/sshleifer/bd6982b3f632f1d4bcefc9feceb30b1a
>>> 
>>> My code is here: https://github.com/pytorch/fairseq/pull/3287
>>> 
>>> 
>>> 
>>> Thanks!
>>> 
>>> Sam
>>> 
>> 
>> 
> 
>