From: Frank McQuillan <fmcquillan@pivotal.io>
Date: Thu, 4 Jan 2018 20:14:41 -0800
Subject: Re: Multiplying a large sparse matrix by a vector
To: user@madlib.apache.org

I like Orhan's suggestion; it is less work.

Slight correction to my comment above:

"For each of the n chunks, if there is no non-zero value in the 100th
column, you will get an error that looks like this..."

I meant:

"For each of the n chunks, if there is no value of any kind (0 or
otherwise) in the 100th column, you will get an error that looks like
this..."

Frank

On Thu, Jan 4, 2018 at 5:26 PM, Orhan Kislal <okislal@pivotal.io> wrote:

> Hello Anthony,
>
> I agree with Frank's suggestion; operating on chunks of the matrix should
> work. An alternate workaround for the 100th-column issue you might
> encounter could be this:
>
> Check if there exists a value for the first (or last, or any other) row
> in the last column. If there is one, then you can use the chunk as is. If
> not, put 0 as the value of that particular row/column. This will ensure
> the matrix size is calculated correctly, will not affect the output, and
> will not require any additional operation for the assembly of the final
> vector.
>
> Please let us know if you have any questions.
>
> Thanks,
>
> Orhan Kislal
>
> On Thu, Jan 4, 2018 at 12:12 PM, Frank McQuillan <fmcquillan@pivotal.io>
> wrote:
>
>> Anthony,
>>
>> In that case, I think you are hitting the 1 GB PostgreSQL limit.
>>
>> Operations on the sparse matrix format require loading into memory two
>> INTEGERs for row/col plus the value (INTEGER, DOUBLE PRECISION, whatever
>> size it is).
>>
>> That means for your matrix the two INTEGERs alone are ~1.00E+09 bytes,
>> which is already at the limit without even considering the values yet.
>>
>> So I would suggest you do the computation in blocks.
>> One approach to this:
>>
>> * Chunk your long matrix into n smaller VIEWs, say n=10 (MADlib matrix
>>   operations do work on VIEWs).
>> * Call matrix*vector for each chunk.
>> * Reassemble the n result vectors into the final vector.
>>
>> You could do this in a PL/pgSQL or PL/Python function.
>>
>> There is one subtlety to be aware of, though, because you are working
>> with sparse matrices. For each of the n chunks, if there is no non-zero
>> value in the 100th column, you will get an error that looks like this:
>>
>> madlib=# SELECT madlib.matrix_vec_mult('mat_a_view',
>>                                        NULL,
>>                                        array[1,2,3,4,5,6,7,8,9,10]
>>                                        );
>> ERROR:  plpy.Error: Matrix error: Dimension mismatch between matrix
>> (1 x 9) and vector (10 x 1)
>> CONTEXT:  Traceback (most recent call last):
>>   PL/Python function "matrix_vec_mult", line 24, in <module>
>>     matrix_in, in_args, vector)
>>   PL/Python function "matrix_vec_mult", line 2031, in matrix_vec_mult
>>   PL/Python function "matrix_vec_mult", line 77, in _assert
>> PL/Python function "matrix_vec_mult"
>>
>> See the explanation at the top of
>> http://madlib.apache.org/docs/latest/group__grp__matrix.html
>> regarding the dimensionality of sparse matrices.
>>
>> One way around this is to add a (fake) row to the bottom of your VIEW
>> with a 0 in the 100th column. But if you do this, be sure to drop the
>> last (fake) entry of each of the n intermediate vectors before you
>> assemble the final vector.
>>
>> Frank
>>
>> On Wed, Jan 3, 2018 at 8:15 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
>> wrote:
>>
>>> Thanks Frank - the answer to both your questions is "yes".
>>>
>>> Best,
>>>
>>> Anthony
>>>
>>> On Wed, Jan 3, 2018 at 3:13 PM, Frank McQuillan <fmcquillan@pivotal.io>
>>> wrote:
>>>
>>>> Anthony,
>>>>
>>>> Correct, the install-check error you are seeing is not related.
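The chunk/multiply/reassemble recipe above, including the fake-row workaround for the dimension-mismatch error, can be sketched in plain Python. This is illustrative only, not MADlib code: `spmv` and `chunked_spmv` are hypothetical helper names, and the sketch simply models MADlib's behavior of inferring a sparse matrix's dimensions from the largest row/column indices present, which is exactly what goes wrong when a chunk's last column is empty.

```python
# Plain-Python model of the thread's recipe (illustrative, not MADlib).

def spmv(triples, vec):
    """Multiply sparse COO triples (row, col, value) by a dense vector.
    Dimensions are inferred from the largest indices, as in MADlib's
    sparse format, so an empty trailing column mis-sizes the matrix."""
    n_rows = max(r for r, _, _ in triples) + 1
    n_cols = max(c for _, c, _ in triples) + 1
    if n_cols != len(vec):
        raise ValueError("Dimension mismatch between matrix (%d x %d) "
                         "and vector (%d x 1)" % (n_rows, n_cols, len(vec)))
    out = [0.0] * n_rows
    for r, c, v in triples:
        out[r] += v * vec[c]
    return out

def chunked_spmv(triples, vec, n_rows, n_chunks):
    """Chunk by rows, multiply each chunk, concatenate the partial results.
    Each chunk gets a fake bottom row holding a 0 in the last column so its
    dimensions are inferred correctly; the fake entry is dropped after."""
    per = -(-n_rows // n_chunks)                # ceil(n_rows / n_chunks)
    result = []
    for start in range(0, n_rows, per):
        stop = min(start + per, n_rows)
        chunk = [(r - start, c, v) for r, c, v in triples
                 if start <= r < stop]
        chunk.append((stop - start, len(vec) - 1, 0.0))   # fake row
        result.extend(spmv(chunk, vec)[:-1])    # drop the fake entry
    return result

mat = [(0, 0, 9.0), (0, 3, 6.0), (2, 1, 8.0),
       (3, 0, 3.0), (4, 2, 5.0), (5, 3, 1.0)]   # 6 x 4, sparse
vec = [1.0, 2.0, 3.0, 4.0]

# Without the fake row, a chunk with nothing in the last column has its
# width mis-inferred, just like the MADlib error in the thread:
try:
    spmv([(0, 1, 8.0), (1, 0, 3.0)], vec)
except ValueError as e:
    print(e)   # Dimension mismatch between matrix (2 x 2) and vector (4 x 1)

# Chunked result matches the unchunked product.
assert chunked_spmv(mat, vec, 6, 3) == spmv(mat, vec)
```

In practice the same logic would live in the PL/pgSQL or PL/Python driver Frank mentions, with each chunk defined as a VIEW over the sparse table.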
>>>>
>>>> A couple of questions:
>>>>
>>>> (1)
>>>> Are you using:
>>>>
>>>> -- Multiply matrix with vector
>>>>   matrix_vec_mult( matrix_in, in_args, vector)
>>>>
>>>> (2)
>>>> Is matrix_in encoded in sparse format like at the top of
>>>> http://madlib.apache.org/docs/latest/group__grp__matrix.html
>>>>
>>>> e.g., like this?
>>>>
>>>>  row_id | col_id | value
>>>> --------+--------+-------
>>>>       1 |      1 |     9
>>>>       1 |      5 |     6
>>>>       1 |      6 |     6
>>>>       2 |      1 |     8
>>>>       3 |      1 |     3
>>>>       3 |      2 |     9
>>>>       4 |      7 |     0
>>>>
>>>> Frank
>>>>
>>>> On Wed, Jan 3, 2018 at 2:52 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
>>>> wrote:
>>>>
>>>>> Okay - thanks Ivan, and good to know about support for Ubuntu from
>>>>> Greenplum!
>>>>>
>>>>> Best,
>>>>>
>>>>> Anthony
>>>>>
>>>>> On Wed, Jan 3, 2018 at 2:38 PM, Ivan Novick <inovick@pivotal.io>
>>>>> wrote:
>>>>>
>>>>>> Hi Anthony, this does NOT look like an Ubuntu problem; in fact,
>>>>>> there is OSS Greenplum officially on Ubuntu, which you can see here:
>>>>>> http://greenplum.org/install-greenplum-oss-on-ubuntu/
>>>>>>
>>>>>> Greenplum and PostgreSQL do limit each field (row/col combination)
>>>>>> to 1 GB, but there are techniques to manage data sets working within
>>>>>> these constraints. I will let someone else who has more experience
>>>>>> than me working with matrices answer what the best way is to do so
>>>>>> in a case like the one you have provided.
>>>>>>
>>>>>> Cheers,
>>>>>> Ivan
>>>>>>
>>>>>> On Wed, Jan 3, 2018 at 2:22 PM, Anthony Thomas
>>>>>> <ahthomas@eng.ucsd.edu> wrote:
>>>>>>
>>>>>>> Hi MADlib folks,
>>>>>>>
>>>>>>> I have a large, tall-and-skinny sparse matrix which I'm trying to
>>>>>>> multiply by a dense vector. The matrix is 1.25e8 by 100 with
>>>>>>> approximately 1% non-zero values.
>>>>>>> This operation always triggers an error from Greenplum:
>>>>>>>
>>>>>>> plpy.SPIError: invalid memory alloc request size 1073741824
>>>>>>> (context 'accumArrayResult') (mcxt.c:1254) (plpython.c:4957)
>>>>>>> CONTEXT:  Traceback (most recent call last):
>>>>>>>   PL/Python function "matrix_vec_mult", line 24, in <module>
>>>>>>>     matrix_in, in_args, vector)
>>>>>>>   PL/Python function "matrix_vec_mult", line 2044, in
>>>>>>> matrix_vec_mult
>>>>>>>   PL/Python function "matrix_vec_mult", line 2001, in
>>>>>>> _matrix_vec_mult_dense
>>>>>>> PL/Python function "matrix_vec_mult"
>>>>>>>
>>>>>>> Some Googling suggests this error is caused by a hard limit in
>>>>>>> Postgres which restricts the maximum size of an array to 1 GB. If
>>>>>>> this is indeed the cause of the error I'm seeing, does anyone have
>>>>>>> suggestions for how to circumvent this issue? It comes up in other
>>>>>>> cases as well, such as transposing a tall-and-skinny matrix. MVM
>>>>>>> with smaller matrices works fine.
>>>>>>>
>>>>>>> Here is the relevant version information:
>>>>>>>
>>>>>>> SELECT VERSION();
>>>>>>> PostgreSQL 8.3.23 (Greenplum Database 5.1.0 build dev) on
>>>>>>> x86_64-pc-linux-gnu, compiled by GCC gcc
>>>>>>> (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609 compiled on
>>>>>>> Dec 21 2017 09:09:46
>>>>>>>
>>>>>>> SELECT madlib.version();
>>>>>>> MADlib version: 1.12, git revision: unknown, cmake configuration
>>>>>>> time: Thu Dec 21 18:04:47 UTC 2017, build type: RelWithDebInfo,
>>>>>>> build system: Linux-4.4.0-103-generic, C compiler: gcc 4.9.3,
>>>>>>> C++ compiler: g++ 4.9.3
>>>>>>>
>>>>>>> MADlib install-check reported one error in the "convex" module
>>>>>>> related to "loss too high", which seems unrelated to the issue
>>>>>>> described above. I know Ubuntu isn't officially supported by
>>>>>>> Greenplum, so I'd like to be confident this issue isn't just the
>>>>>>> result of using an unsupported OS. Please let me know if any other
>>>>>>> information would be helpful.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Anthony
>>>>>>
>>>>>> --
>>>>>> Ivan Novick, Product Manager, Pivotal Greenplum
>>>>>> inovick@pivotal.io -- (Mobile) 408-230-6491
>>>>>> https://www.youtube.com/GreenplumDatabase
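As a back-of-the-envelope check of the 1 GB diagnosis discussed in the thread, the arithmetic works out as Frank states: the row/column indices alone for this matrix come to about 1.00E+09 bytes, essentially at PostgreSQL's allocation limit before a single value is counted. A small plain-Python sketch using only figures given in the thread:

```python
# Figures from the thread: a 1.25e8 x 100 matrix, ~1% non-zero, with two
# 4-byte INTEGER indices stored per non-zero entry in the sparse format.
n_rows, n_cols, density = 125_000_000, 100, 0.01
nnz = round(n_rows * n_cols * density)   # stored (row, col, value) entries

index_bytes = nnz * 2 * 4                # row/col INTEGERs only, no values
pg_limit = 1073741824                    # the alloc request size in the error

print(nnz)           # 125000000
print(index_bytes)   # 1000000000 -- Frank's ~1.00E+09 bytes
print(index_bytes >= 0.9 * pg_limit)   # True: near the limit before values
```

This is why chunking by rows works: each of n chunks carries roughly 1/n of the non-zero entries, keeping every allocation comfortably under the limit.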