arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Python Plasma Store Best Practices
Date Tue, 02 Mar 2021 17:00:18 GMT
Hi Sam,
I think the lack of responses might be because Plasma is not being actively
maintained.  The original authors have forked it into the Ray project.

I'm sorry I don't have the expertise to answer your questions.

-Micah

On Mon, Mar 1, 2021 at 6:48 PM Sam Shleifer <sshleifer@gmail.com> wrote:

> Partial answers are super helpful!
> I'm happy to break this up if it's too much for 1 question @moderators
> Sam
>
>
>
> On Sat, Feb 27, 2021 at 1:27 PM, Sam Shleifer <sshleifer@gmail.com> wrote:
>
>> Hi!
>> I am trying to use the plasma store to reduce the memory usage of a pytorch
>> dataset/dataloader combination, and I have 4 questions. I don’t think any of
>> them require pytorch knowledge. If you prefer to comment inline, there is a
>> quip with identical content and prettier formatting here:
>> https://quip.com/3mwGAJ9KR2HT
>>
>> *1)* My script starts the plasma-store from python with 200 GB:
>>
>> import subprocess
>>
>> nbytes = (1024 ** 3) * 200
>> _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])
>> where nbytes is chosen arbitrarily. From my experiments it seems that one
>> should start the store as large as possible within the limits of /dev/shm.
>> I wanted to verify whether this is actually the best practice (it would be
>> hard for my app to know its storage needs up front), and also whether there
>> is an automated way to figure out how much storage to allocate.
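>>
>> (One automated approach I have been considering, sketched below; this
>> assumes it is reasonable to size the store to a fixed fraction of the free
>> space in /dev/shm, and the 0.9 margin is arbitrary:)
>>
>> import shutil
>> import subprocess
>>
>> # Size the store to 90% of the free space in /dev/shm (arbitrary margin).
>> free_bytes = shutil.disk_usage("/dev/shm").free
>> nbytes = int(free_bytes * 0.9)
>> _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])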
>>
>> *2)* Does the plasma store support simultaneous reads? My code, which has
>> multiple clients all asking for the 6 arrays from the plasma-store
>> thousands of times, was segfaulting with different errors, e.g.
>>
>> Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1
>>
>> until I added a lock around my client.get call:
>>
>> from filelock import FileLock  # the filelock pip package
>>
>> if self.use_lock:  # fix segfault
>>     with FileLock("/tmp/plasma_lock"):
>>         ret = self.client.get(self.object_id)
>> else:
>>     ret = self.client.get(self.object_id)
>>
>> This fixes the crashes.
>>
>> Here is a full traceback of the failure without the lock
>> https://gist.github.com/sshleifer/75145ba828fcb4e998d5e34c46ce13fc
>> Is this expected behavior?
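>>
>> (One alternative workaround I am considering, sketched below: give each
>> worker its own client rather than sharing one. This assumes the crashes come
>> from concurrent use of a single PlasmaClient, which I have not confirmed.)
>>
>> import pyarrow.plasma as plasma
>>
>> def worker(socket_path, object_id):
>>     # Each worker process opens its own connection to the store,
>>     # so no PlasmaClient instance is ever used concurrently.
>>     client = plasma.connect(socket_path)
>>     return client.get(object_id)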
>>
>> *3)* Is there a simple way to add many objects to the plasma store at
>> once? Right now, we are considering changing
>>
>> oid = client.put(array)
>>
>> to
>>
>> oids = [client.put(x) for x in array]
>>
>> so that we can fetch one entry at a time, but the writes are much slower.
>>
>> * 3a) Is there a lower-level interface for bulk writes? (see the sketch
>> just below)
>> * 3b) Or is it recommended to chunk the array and have different python
>> processes write simultaneously to make this faster?
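>>
>> (Re 3a: the kind of thing I mean, sketched with the documented create/seal
>> calls. The random object IDs and raw-bytes encoding are my own choices;
>> this skips the serialization that client.put does, so the reader would
>> have to decode the bytes itself:)
>>
>> import numpy as np
>> import pyarrow.plasma as plasma
>>
>> client = plasma.connect(path)  # path: the store's socket, as above
>> oids = []
>> for x in array:  # array: a list of numpy arrays, as in our code
>>     data = x.tobytes()
>>     oid = plasma.ObjectID(np.random.bytes(20))  # random 20-byte ID
>>     buf = client.create(oid, len(data))  # allocate in shared memory
>>     memoryview(buf)[:] = data            # copy the bytes in
>>     client.seal(oid)                     # make the object readable
>>     oids.append(oid)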
>>
>> *4)* Is there a way to save/load the contents of the plasma-store to disk
>> without loading everything into memory and then saving it to some other
>> format?
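>>
>> (The closest thing I can see in the API, sketched below: walk client.list()
>> and stream each sealed object's buffer straight to disk. The dump/ layout
>> is made up, and I don't know how list() behaves while writers are active:)
>>
>> import os
>> import pyarrow.plasma as plasma
>>
>> client = plasma.connect(path)
>> os.makedirs("dump", exist_ok=True)
>> for oid, info in client.list().items():
>>     if info["state"] != "sealed":
>>         continue  # skip objects still being written
>>     [buf] = client.get_buffers([oid])  # view into shared memory, no copy
>>     with open(os.path.join("dump", oid.binary().hex()), "wb") as f:
>>         f.write(buf)  # pyarrow Buffer supports the buffer protocol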
>>
>> Replication
>>
>> Setup instructions for fairseq+replicating the segfault:
>> https://gist.github.com/sshleifer/bd6982b3f632f1d4bcefc9feceb30b1a
>> My code is here: https://github.com/pytorch/fairseq/pull/3287
>>
>> Thanks!
>> Sam
>>
>
>
