madlib-user mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: Multiplying a large sparse matrix by a vector
Date Fri, 05 Jan 2018 04:14:41 GMT
I like Orhan's suggestion; it is less work.

Slight correction to my comment above:

"For each of the n chunks, if there is no non-zero value in the 100th
column, you will get an error that looks like this..."

I meant:

"For each of the n chunks, if there is no value of any kind (0 or
otherwise) in the 100th column, you will get an error that looks like
this..."

Frank

On Thu, Jan 4, 2018 at 5:26 PM, Orhan Kislal <okislal@pivotal.io> wrote:

> Hello Anthony,
>
> I agree with Frank's suggestion; operating on chunks of the matrix should
> work. An alternative workaround for the 100th-column issue you might
> encounter is this:
>
> Check whether a value exists for the first (or last, or any other) row
> in the last column. If there is one, then you can use the chunk as is.
> If not, put a 0 as the value of that particular row/column. This will
> ensure the matrix size is calculated correctly, will not affect the
> output, and will not require any additional operation for the assembly
> of the final vector.
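>
> For example, here is a minimal sketch of that check, assuming each chunk
> is exposed as a view mat_chunk(row_id, col_id, value) over a 100-column
> matrix with chunk-local row ids starting at 1 (all names here are
> illustrative, and this is untested):
>
> CREATE OR REPLACE VIEW mat_chunk_padded AS
> SELECT row_id, col_id, value
> FROM mat_chunk
> UNION ALL
> -- put a 0 at (row 1, col 100) only if nothing is stored there yet
> SELECT 1, 100, 0::DOUBLE PRECISION
> WHERE NOT EXISTS (SELECT 1 FROM mat_chunk
>                   WHERE row_id = 1 AND col_id = 100);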
>
> Please let us know if you have any questions.
>
> Thanks,
>
> Orhan Kislal
>
> On Thu, Jan 4, 2018 at 12:12 PM, Frank McQuillan <fmcquillan@pivotal.io>
> wrote:
>
>> Anthony,
>>
>> In that case, I think you are hitting the 1GB PostgreSQL limit.
>>
>> Operations on the sparse matrix format require loading into memory two
>> INTEGERs for row/col plus the value (INTEGER, DOUBLE PRECISION, or
>> whatever size it is).
>>
>> That means for your matrix the two INTEGERs alone come to ~1.0e9 bytes,
>> which is already at the limit without even considering the values.
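>>
>> To spell out the arithmetic with the dimensions from your first message:
>> 1.25e8 rows x 100 columns x 1% nonzero = 1.25e8 stored entries, and
>> 1.25e8 entries x 2 INTEGERs x 4 bytes = 1.0e9 bytes of row/col indexes
>> alone, before any values.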
>>
>> So I would suggest you do the computation in blocks.  One approach to
>> this:
>>
>> * chunk your long matrix into n smaller VIEWS, say n=10 (i.e., MADlib
>> matrix operations do work on VIEWS)
>> * call matrix*vector for each chunk
>> * reassemble the n result vectors into the final vector
>>
>> You could do this in a PL/pgSQL or PL/Python function.
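>>
>> Here is a rough PL/pgSQL sketch of that flow, assuming the matrix lives
>> in mat_a(row_id, col_id, value) with row_id in 1..n_rows; the function
>> name, the chunking arithmetic, and the in_args string are illustrative,
>> not tested code (and note the caveat below about sparse chunks):
>>
>> CREATE OR REPLACE FUNCTION chunked_vec_mult(n_chunks INTEGER,
>>                                             n_rows   BIGINT,
>>                                             vec      DOUBLE PRECISION[])
>> RETURNS DOUBLE PRECISION[] AS $$
>> DECLARE
>>     chunk_rows BIGINT := ceil(n_rows / n_chunks::NUMERIC);
>>     lo         BIGINT;
>>     result     DOUBLE PRECISION[] := '{}';
>>     part       DOUBLE PRECISION[];
>> BEGIN
>>     FOR i IN 0 .. n_chunks - 1 LOOP
>>         lo := i * chunk_rows;
>>         -- view over one horizontal slice, row ids rebased to start at 1
>>         EXECUTE 'CREATE OR REPLACE VIEW mat_chunk AS '
>>              || 'SELECT row_id - ' || lo || ' AS row_id, col_id, value '
>>              || 'FROM mat_a WHERE row_id > ' || lo
>>              || ' AND row_id <= ' || (lo + chunk_rows);
>>         -- multiply this chunk and append its slice of the output vector
>>         SELECT madlib.matrix_vec_mult('mat_chunk',
>>                                       'row=row_id, col=col_id, val=value',
>>                                       vec) INTO part;
>>         result := result || part;
>>     END LOOP;
>>     RETURN result;
>> END;
>> $$ LANGUAGE plpgsql;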
>>
>> There is one subtlety to be aware of, though, because you are working with
>> sparse matrices. For each of the n chunks, if there is no non-zero value in
>> the 100th column, you will get an error that looks like this:
>>
>> madlib=# SELECT madlib.matrix_vec_mult('mat_a_view',
>> NULL,
>>                               array[1,2,3,4,5,6,7,8,9,10]
>>                               );
>> ERROR:  plpy.Error: Matrix error: Dimension mismatch between matrix
>> (1 x 9) and vector (10 x 1)
>> CONTEXT:  Traceback (most recent call last):
>>   PL/Python function "matrix_vec_mult", line 24, in <module>
>>     matrix_in, in_args, vector)
>>   PL/Python function "matrix_vec_mult", line 2031, in matrix_vec_mult
>>   PL/Python function "matrix_vec_mult", line 77, in _assert
>> PL/Python function "matrix_vec_mult"
>>
>> See the explanation at the top of
>> http://madlib.apache.org/docs/latest/group__grp__matrix.html
>> regarding dimensionality of sparse matrices.
>>
>> One way around this is to add a (fake) row to the bottom of your VIEW
>> with a 0 in the 100th column.  But if you do this, be sure to drop the last
>> (fake) entry of each of the n intermediate vectors before you assemble into
>> the final vector.
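>>
>> As a concrete sketch of that workaround (using the mat_chunk view from
>> the sketch above; again illustrative and untested):
>>
>> CREATE OR REPLACE VIEW mat_chunk_fakerow AS
>> SELECT row_id, col_id, value
>> FROM mat_chunk
>> UNION ALL
>> -- fake bottom row carrying a 0 in the 100th column
>> SELECT max(row_id) + 1, 100, 0::DOUBLE PRECISION
>> FROM mat_chunk;
>>
>> -- Each intermediate result then has one extra trailing element; slice
>> -- it off before assembly, e.g. result[1:array_upper(result, 1) - 1].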
>>
>> Frank
>>
>>
>>
>>
>>
>> On Wed, Jan 3, 2018 at 8:15 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
>> wrote:
>>
>>> Thanks Frank - the answer to both your questions is "yes"
>>>
>>> Best,
>>>
>>> Anthony
>>>
>>> On Wed, Jan 3, 2018 at 3:13 PM, Frank McQuillan <fmcquillan@pivotal.io>
>>> wrote:
>>>
>>>>
>>>> Anthony,
>>>>
>>>> Correct, the install-check error you are seeing is not related.
>>>>
>>>> A couple of questions:
>>>>
>>>> (1)
>>>> Are you using:
>>>>
>>>> -- Multiply matrix with vector
>>>>   matrix_vec_mult( matrix_in, in_args, vector)
>>>>
>>>> (2)
>>>> Is matrix_in encoded in sparse format like at the top of
>>>> http://madlib.apache.org/docs/latest/group__grp__matrix.html
>>>>
>>>> e.g., like this?
>>>>
>>>> row_id | col_id | value
>>>> --------+--------+-------
>>>>       1 |      1 |     9
>>>>       1 |      5 |     6
>>>>       1 |      6 |     6
>>>>       2 |      1 |     8
>>>>       3 |      1 |     3
>>>>       3 |      2 |     9
>>>>       4 |      7 |     0
>>>>
>>>>
>>>> Frank
>>>>
>>>>
>>>> On Wed, Jan 3, 2018 at 2:52 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
>>>> wrote:
>>>>
>>>>> Okay - thanks Ivan, and good to know about support for Ubuntu from
>>>>> Greenplum!
>>>>>
>>>>> Best,
>>>>>
>>>>> Anthony
>>>>>
>>>>> On Wed, Jan 3, 2018 at 2:38 PM, Ivan Novick <inovick@pivotal.io>
>>>>> wrote:
>>>>>
>>>>>> Hi Anthony, this does NOT look like an Ubuntu problem; in fact,
>>>>>> there is OSS Greenplum officially available on Ubuntu, which you can
>>>>>> see here: http://greenplum.org/install-greenplum-oss-on-ubuntu/
>>>>>>
>>>>>> Greenplum and PostgreSQL do limit each field (row/col combination)
>>>>>> to 1 GB, but there are techniques to manage data sets within these
>>>>>> constraints. I will let someone else who has more experience than I
>>>>>> do working with matrices answer what the best way to do so is in a
>>>>>> case like the one you have provided.
>>>>>>
>>>>>> Cheers,
>>>>>> Ivan
>>>>>>
>>>>>> On Wed, Jan 3, 2018 at 2:22 PM, Anthony Thomas
>>>>>> <ahthomas@eng.ucsd.edu> wrote:
>>>>>>
>>>>>>> Hi Madlib folks,
>>>>>>>
>>>>>>> I have a large, tall-and-skinny sparse matrix which I'm trying to
>>>>>>> multiply by a dense vector. The matrix is 1.25e8 by 100 with
>>>>>>> approximately 1% nonzero values. This operation always triggers an
>>>>>>> error from Greenplum:
>>>>>>>
>>>>>>> plpy.SPIError: invalid memory alloc request size 1073741824
>>>>>>> (context 'accumArrayResult') (mcxt.c:1254) (plpython.c:4957)
>>>>>>> CONTEXT:  Traceback (most recent call last):
>>>>>>>   PL/Python function "matrix_vec_mult", line 24, in <module>
>>>>>>>     matrix_in, in_args, vector)
>>>>>>>   PL/Python function "matrix_vec_mult", line 2044, in matrix_vec_mult
>>>>>>>   PL/Python function "matrix_vec_mult", line 2001, in
>>>>>>> _matrix_vec_mult_dense
>>>>>>> PL/Python function "matrix_vec_mult"
>>>>>>>
>>>>>>> Some Googling suggests this error is caused by a hard limit in
>>>>>>> Postgres which restricts the maximum size of an array to 1GB. If
>>>>>>> this is indeed the cause of the error I'm seeing, does anyone have
>>>>>>> suggestions about how to circumvent it? This comes up in other
>>>>>>> cases as well, like transposing a tall and skinny matrix. MVM with
>>>>>>> smaller matrices works fine.
>>>>>>>
>>>>>>> Here is relevant version information:
>>>>>>>
>>>>>>> SELECT VERSION();
>>>>>>> PostgreSQL 8.3.23 (Greenplum Database 5.1.0 build dev) on
>>>>>>> x86_64-pc-linux-gnu, compiled by GCC gcc
>>>>>>> (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609 compiled on
>>>>>>> Dec 21 2017 09:09:46
>>>>>>>
>>>>>>> SELECT madlib.version();
>>>>>>> MADlib version: 1.12, git revision: unknown, cmake configuration
>>>>>>> time: Thu Dec 21 18:04:47 UTC 2017, build type: RelWithDebInfo,
>>>>>>> build system: Linux-4.4.0-103-generic, C compiler: gcc 4.9.3,
>>>>>>> C++ compiler: g++ 4.9.3
>>>>>>>
>>>>>>> MADlib install-check reported one error in the "convex" module
>>>>>>> related to "loss too high", which seems unrelated to the issue
>>>>>>> described above. I know Ubuntu isn't officially supported by
>>>>>>> Greenplum, so I'd like to be confident this issue isn't just the
>>>>>>> result of using an unsupported OS. Please let me know if any other
>>>>>>> information would be helpful.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Anthony
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ivan Novick, Product Manager Pivotal Greenplum
>>>>>> inovick@pivotal.io -- (Mobile) 408-230-6491
>>>>>> https://www.youtube.com/GreenplumDatabase
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
