Subject: Re: Python Plasma Store Best Practices
From: Sam Shleifer <sshleifer@gmail.com>
Date: Tue, 02 Mar 2021 02:48:48 +0000
To: user@arrow.apache.org

Partial answers are super helpful! I'm happy to break this up if it's too much for 1 question @moderators

Sam

On Sat, Feb 27, 2021 at 1:27 PM, Sam Shleifer <sshleifer@gmail.com> wrote:

> Hi!
>
> I am trying to use the plasma store to reduce the memory usage of a PyTorch
> dataset/dataloader combination, and I had 4 questions. I don't think any of
> them require PyTorch knowledge. If you prefer to comment inline, there is a
> Quip with identical content and prettier formatting here:
> https://quip.com/3mwGAJ9KR2HT
>
> *1)* My script starts the plasma-store from Python with 200 GB:
>
> nbytes = (1024 ** 3) * 200
> _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])
>
> where nbytes is chosen arbitrarily. From my experiments it seems that one
> should start the store as large as possible within the limits of /dev/shm.
> I wanted to verify whether this is actually the best practice (it would be
> hard for my app to know its storage needs up front) and also whether there
> is an automated way to figure out how much storage to allocate.
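>
> To make "automated" concrete, here is the kind of sketch I had in mind,
> sizing the store from /dev/shm capacity (the 0.9 headroom factor is an
> arbitrary guess on my part):
>
> import shutil
> import subprocess
>
> path = "/tmp/plasma"  # hypothetical socket path
> # Size the store from the shm filesystem instead of hard-coding 200 GB,
> # leaving ~10% headroom for anything else using shared memory.
> nbytes = int(shutil.disk_usage("/dev/shm").total * 0.9)
> _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])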
>
> *2)* Does the plasma store support simultaneous reads? My code, which has
> multiple clients all asking for the 6 arrays from the plasma-store
> thousands of times, was segfaulting with different errors, e.g.
>
> Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1
>
> until I added a lock around my client.get, which fixes it:
>
> if self.use_lock:  # Fix segfault
>     with FileLock("/tmp/plasma_lock"):
>         ret = self.client.get(self.object_id)
> else:
>     ret = self.client.get(self.object_id)
>
> Here is a full traceback of the failure without the lock:
> https://gist.github.com/sshleifer/75145ba828fcb4e998d5e34c46ce13fc
>
> Is this expected behavior?
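>
> One alternative I have not tried: giving every dataloader worker its own
> connection instead of sharing one client across processes. A sketch
> (get_client is my own helper, not a pyarrow API):
>
> import os
> import pyarrow.plasma as plasma
>
> _client = None
> _client_pid = None
>
> def get_client(path="/tmp/plasma"):
>     # Assumption: a client created in the parent process should not be
>     # reused from a forked child worker, so reconnect per pid.
>     global _client, _client_pid
>     if _client is None or _client_pid != os.getpid():
>         _client = plasma.connect(path)
>         _client_pid = os.getpid()
>     return _client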
>
> *3)* Is there a simple way to add many objects to the plasma store at
> once? Right now, we are considering changing
>
> oid = client.put(array)
>
> to
>
> oids = [client.put(x) for x in array]
>
> so that we can fetch one entry at a time, but the writes are much slower.
>
> * 3a) Is there a lower-level interface for bulk writes?
> * 3b) Or is it recommended to chunk the array and have different Python
> processes write simultaneously to make this faster?
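>
> By "lower level" I mean something like the create/seal API. Here is a
> sketch of copying one contiguous numpy array per object without going
> through client.put (put_raw is my name for it; note it drops dtype and
> shape metadata, which the reader would have to know):
>
> import numpy as np
> import pyarrow.plasma as plasma
>
> def put_raw(client, arr):
>     # Copy one contiguous array into a freshly created plasma object,
>     # then seal it so other clients can get() it.
>     oid = plasma.ObjectID(np.random.bytes(20))
>     buf = client.create(oid, arr.nbytes)
>     np.frombuffer(buf, dtype=arr.dtype)[:] = arr.ravel()
>     client.seal(oid)
>     return oid
>
> oids = [put_raw(client, x) for x in array]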
>
> *4)* Is there a way to save/load the contents of the plasma-store to disk
> without loading everything into memory and then saving it to some other
> format?
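>
> The closest thing I can come up with myself is walking the store and
> dumping raw buffers one at a time, a sketch assuming client.list() is
> available in my pyarrow version and that a directory like /backup exists:
>
> for oid in client.list():
>     # One object at a time, so peak memory stays at one buffer; only the
>     # raw bytes are preserved, not dtype or shape metadata.
>     [buf] = client.get_buffers([oid])
>     with open(f"/backup/{oid.binary().hex()}.bin", "wb") as f:
>         f.write(buf)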
>
> Replication
>
> Setup instructions for fairseq + replicating the segfault:
> https://gist.github.com/sshleifer/bd6982b3f632f1d4bcefc9feceb30b1a
>
> My code is here: https://github.com/pytorch/fairseq/pull/3287
>
> Thanks!
>
> Sam