arrow-user mailing list archives

From "Sam Shleifer" <sshlei...@gmail.com>
Subject Re: Python Plasma Store Best Practices
Date Tue, 02 Mar 2021 02:48:48 GMT
Partial answers are super helpful!

I'm happy to break this up if it's too much for 1 question @moderators

Sam

On Sat, Feb 27, 2021 at 1:27 PM, Sam Shleifer < sshleifer@gmail.com > wrote:

> 
> Hi!
> 
> I am trying to use plasma store to reduce the memory usage of a pytorch
> dataset/dataloader combination, and had 4 questions. I don’t think any of
> them require pytorch knowledge. If you prefer to comment inline, there is a
> quip with identical content and prettier formatting here:
> https://quip.com/3mwGAJ9KR2HT
> 
> 
> 
> *1)* My script starts the plasma-store from python with 200 GB:
> 
> 
> 
> nbytes = (1024 ** 3) * 200
> 
> _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])
> 
> where nbytes is chosen arbitrarily. From my experiments it seems that one
> should start the store as large as possible within the limits of /dev/shm.
> I wanted to verify whether this is actually the best practice (it would be
> hard for my app to know the storage needs up front) and also whether there
> is an automated way to figure out how much storage to allocate.
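To frame the "automated way" part of the question: one heuristic I have considered (just a sketch, assuming a Linux host where /dev/shm is the shared-memory mount — the function name and the 0.9 headroom factor are my own guesses, not a Plasma recommendation) is to derive the `-m` value from the free space on that mount:

```python
import shutil

def pick_store_size(path="/dev/shm", headroom=0.9):
    """Return a byte count to pass to plasma_store -m: a fraction of the
    free space on the shared-memory mount. headroom is an arbitrary guess
    to leave room for other users of the mount."""
    free = shutil.disk_usage(path).free
    return int(free * headroom)
```

This still does not know the app's actual storage needs up front; it only avoids asking for more than the mount can hold.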
> 
> 
> 
> *2)* Does plasma store support simultaneous reads? My code, which has
> multiple clients all asking for the 6 arrays from the plasma-store
> thousands of times, was segfaulting with different errors, e.g.
> 
> Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1
> 
> until I added a lock around my client.get
> 
> 
> 
> if self.use_lock:  # Fix segfault
>     with FileLock("/tmp/plasma_lock"):
>         ret = self.client.get(self.object_id)
> else:
>     ret = self.client.get(self.object_id)
> 
> which fixes the segfault.
> 
> 
> 
> Here is a full traceback of the failure without the lock:
> https://gist.github.com/sshleifer/75145ba828fcb4e998d5e34c46ce13fc
> 
> Is this expected behavior?
> 
> 
> 
> *3)* Is there a simple way to add many objects to the plasma store at
> once? Right now, we are considering changing
> 
> 
> 
> oid = client.put(array)
> 
> to
> 
> oids = [client.put(x) for x in array]
> 
> 
> 
> so that we can fetch one entry at a time, but the writes are much slower.
> 
> 
> 
> * 3a) Is there a lower level interface for bulk writes?
> 
> * 3b) Or is it recommended to chunk the array and have different python
> processes write simultaneously to make this faster?
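For context on 3b, the chunking I have in mind is just splitting along the first axis before the puts (a sketch; the chunk count is arbitrary, and the per-chunk `client.put(chunk)` — possibly one call per worker process — is an assumption about how it would be wired up, so it is left as a comment here):

```python
import numpy as np

def chunk_for_puts(array, n_chunks=8):
    # Split along the first axis; each chunk would get its own
    # client.put(chunk) call, possibly from a separate process.
    return np.array_split(array, n_chunks)

def reassemble(chunks):
    # Inverse operation after fetching the chunks back from the store.
    return np.concatenate(chunks)
```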
> 
> 
> 
> *4)* Is there a way to save/load the contents of the plasma-store to disk
> without loading everything into memory and then saving it to some other
> format?
> 
> 
> 
> Replication
> 
> 
> 
> Setup instructions for fairseq + replicating the segfault:
> https://gist.github.com/sshleifer/bd6982b3f632f1d4bcefc9feceb30b1a
> 
> My code is here: https://github.com/pytorch/fairseq/pull/3287
> 
> 
> 
> Thanks!
> 
> Sam
>