madlib-user mailing list archives

From Anthony Thomas <ahtho...@eng.ucsd.edu>
Subject Re: Multiplying a large sparse matrix by a vector
Date Fri, 05 Jan 2018 05:33:13 GMT
Thanks for the suggestions Frank and Orhan - I'll give chunking the matrix
a try.

Best,

Anthony

On Thu, Jan 4, 2018 at 8:14 PM, Frank McQuillan <fmcquillan@pivotal.io>
wrote:

> I like Orhan's suggestion; it is less work.
>
> Slight correction to my comment above:
>
> "For each of the n chunks, if there is no non-zero value in the 100th
> column, you will get an error that looks like this..."
>
> I meant
>
> "For each of the n chunks, if there is no value of any kind (0 or
> otherwise) in the 100th column, you will get an error that looks like
> this..."
>
> Frank
>
> On Thu, Jan 4, 2018 at 5:26 PM, Orhan Kislal <okislal@pivotal.io> wrote:
>
>> Hello Anthony,
>>
>> I agree with Frank's suggestion; operating on chunks of the matrix should
>> work. An alternate workaround for the 100th-column issue you might
>> encounter could be this:
>>
>> Check if there exists a value for the first (or last, or any other) row
>> in the last column. If there is one, then you can use the chunk as is. If
>> not, put 0 as the value of that particular row/column. This will ensure
>> the matrix size is calculated correctly, will not affect the output, and
>> will not require any additional operation when assembling the final
>> vector.
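>>
>> As a rough sketch (untested; it assumes each chunk is exposed as a view
>> mat_chunk with row_id/col_id/value columns, row ids starting at 1, and
>> 100 columns total -- all names here are placeholders):
>>
>> CREATE OR REPLACE VIEW mat_chunk_padded AS
>> SELECT row_id, col_id, value FROM mat_chunk
>> UNION ALL
>> -- add a 0 at (first row, last column) only when that cell is absent
>> SELECT 1, 100, 0::FLOAT8
>> WHERE NOT EXISTS (
>>     SELECT 1 FROM mat_chunk WHERE row_id = 1 AND col_id = 100);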
>>
>> Please let us know if you have any questions.
>>
>> Thanks,
>>
>> Orhan Kislal
>>
>> On Thu, Jan 4, 2018 at 12:12 PM, Frank McQuillan <fmcquillan@pivotal.io>
>> wrote:
>>
>>> Anthony,
>>>
>>> In that case, I think you are hitting the 1GB PostgreSQL limit.
>>>
>>> Operations on the sparse matrix format require loading into memory 2
>>> INTEGERs for row/col plus the value (INTEGER, DOUBLE PRECISION, whatever
>>> size it is).
>>>
>>> For your matrix that means ~1.25e8 non-zero entries (1.25e8 rows x 100
>>> cols x 1% density), so the 2 INTEGERs alone are ~1.0e9 bytes, which is
>>> already at the limit without even considering the values yet.
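>>>
>>> You can sanity-check that arithmetic directly (purely illustrative):
>>>
>>> SELECT 1.25e8 * 100 * 0.01 * (4 + 4) AS index_bytes;
>>> -- ~1.0e9 bytes of row/col INTEGERs alone, right at the 1GB alloc cap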
>>>
>>> So I would suggest you do the computation in blocks.  One approach to
>>> this:
>>>
>>> * chunk your long matrix into n smaller VIEWs, say n=10 (MADlib matrix
>>> operations do work on VIEWs)
>>> * call matrix*vector for each chunk
>>> * reassemble the n result vectors into the final vector
>>>
>>> You could do this in a PL/pgSQL or PL/Python function.
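>>>
>>> A minimal sketch of that in PL/pgSQL (untested; it assumes the sparse
>>> matrix lives in a table mat_a(row_id, col_id, value) with row ids
>>> running from 1 to n_rows -- all names are placeholders):
>>>
>>> CREATE OR REPLACE FUNCTION chunked_vec_mult(n_chunks INT, n_rows BIGINT,
>>>                                             vec FLOAT8[])
>>> RETURNS SETOF FLOAT8 AS $$
>>> DECLARE
>>>     chunk_rows BIGINT := ceil(n_rows / n_chunks::numeric);
>>>     lo BIGINT;
>>>     partial FLOAT8[];
>>> BEGIN
>>>     FOR i IN 0 .. n_chunks - 1 LOOP
>>>         lo := i * chunk_rows;
>>>         -- view over one horizontal slice, with row ids re-based to 1
>>>         EXECUTE 'CREATE OR REPLACE TEMP VIEW mat_chunk AS '
>>>              || 'SELECT row_id - ' || lo || ' AS row_id, col_id, value '
>>>              || 'FROM mat_a WHERE row_id > ' || lo
>>>              || ' AND row_id <= ' || (lo + chunk_rows);
>>>         SELECT madlib.matrix_vec_mult('mat_chunk', NULL, vec) INTO partial;
>>>         -- emit this chunk's partial result, reassembling the final vector
>>>         FOR j IN array_lower(partial, 1) .. array_upper(partial, 1) LOOP
>>>             RETURN NEXT partial[j];
>>>         END LOOP;
>>>     END LOOP;
>>> END;
>>> $$ LANGUAGE plpgsql;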
>>>
>>> There is one subtlety to be aware of, though, because you are working
>>> with sparse matrices. For each of the n chunks, if there is no non-zero
>>> value in the 100th column, you will get an error that looks like this:
>>>
>>> madlib=# SELECT madlib.matrix_vec_mult('mat_a_view',
>>> NULL,
>>>                               array[1,2,3,4,5,6,7,8,9,10]
>>>                               );
>>> ERROR:  plpy.Error: Matrix error: Dimension mismatch between matrix (1 x
>>> 9) and vector (10 x 1)
>>> CONTEXT:  Traceback (most recent call last):
>>>   PL/Python function "matrix_vec_mult", line 24, in <module>
>>>     matrix_in, in_args, vector)
>>>   PL/Python function "matrix_vec_mult", line 2031, in matrix_vec_mult
>>>   PL/Python function "matrix_vec_mult", line 77, in _assert
>>> PL/Python function "matrix_vec_mult"
>>>
>>> See the explanation at the top of
>>> http://madlib.apache.org/docs/latest/group__grp__matrix.html
>>> regarding dimensionality of sparse matrices.
>>>
>>> One way around this is to add a (fake) row to the bottom of your VIEW
>>> with a 0 in the 100th column. But if you do this, be sure to drop the
>>> last (fake) entry of each of the n intermediate vectors before you
>>> assemble them into the final vector.
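>>>
>>> As a sketch (mat_chunk and the 100-column width are placeholders), the
>>> fake row can live in the view definition itself:
>>>
>>> CREATE OR REPLACE VIEW mat_chunk_padded AS
>>> SELECT row_id, col_id, value FROM mat_chunk
>>> UNION ALL
>>> -- unconditional fake bottom row holding a 0 in the 100th column
>>> SELECT (SELECT max(row_id) + 1 FROM mat_chunk), 100, 0::FLOAT8;
>>>
>>> Each intermediate result then carries one extra trailing element, which
>>> you can drop with result[1:array_upper(result, 1) - 1] before assembling.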
>>>
>>> Frank
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jan 3, 2018 at 8:15 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
>>> wrote:
>>>
>>>> Thanks Frank - the answer to both your questions is "yes".
>>>>
>>>> Best,
>>>>
>>>> Anthony
>>>>
>>>> On Wed, Jan 3, 2018 at 3:13 PM, Frank McQuillan <fmcquillan@pivotal.io>
>>>> wrote:
>>>>
>>>>>
>>>>> Anthony,
>>>>>
>>>>> Correct, the install check error you are seeing is not related.
>>>>>
>>>>> A couple of questions:
>>>>>
>>>>> (1)
>>>>> Are you using:
>>>>>
>>>>> -- Multiply matrix with vector
>>>>>   matrix_vec_mult( matrix_in, in_args, vector)
>>>>>
>>>>> (2)
>>>>> Is matrix_in encoded in sparse format like at the top of
>>>>> http://madlib.apache.org/docs/latest/group__grp__matrix.html
>>>>>
>>>>> e.g., like this?
>>>>>
>>>>> row_id | col_id | value
>>>>> --------+--------+-------
>>>>>       1 |      1 |     9
>>>>>       1 |      5 |     6
>>>>>       1 |      6 |     6
>>>>>       2 |      1 |     8
>>>>>       3 |      1 |     3
>>>>>       3 |      2 |     9
>>>>>       4 |      7 |     0
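>>>>>
>>>>> For reference, such a sparse matrix is just a table (or view), e.g.
>>>>> (assuming the default column names shown above, in which case in_args
>>>>> can be left NULL):
>>>>>
>>>>> CREATE TABLE mat_a (
>>>>>     row_id INTEGER,
>>>>>     col_id INTEGER,
>>>>>     value  DOUBLE PRECISION
>>>>> );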
>>>>>
>>>>>
>>>>> Frank
>>>>>
>>>>>
>>>>> On Wed, Jan 3, 2018 at 2:52 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
>>>>> wrote:
>>>>>
>>>>>> Okay - thanks Ivan, and good to know about support for Ubuntu from
>>>>>> Greenplum!
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Anthony
>>>>>>
>>>>>> On Wed, Jan 3, 2018 at 2:38 PM, Ivan Novick <inovick@pivotal.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Anthony, this does NOT look like an Ubuntu problem; in fact,
>>>>>>> there is OSS Greenplum officially on Ubuntu, which you can see here:
>>>>>>> http://greenplum.org/install-greenplum-oss-on-ubuntu/
>>>>>>>
>>>>>>> Greenplum and PostgreSQL do limit each field (row/col combination)
>>>>>>> to 1 GB, but there are techniques to manage data sets within these
>>>>>>> constraints. I will let someone else who has more experience than me
>>>>>>> working with matrices answer what the best way to do so is in a case
>>>>>>> like the one you have provided.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ivan
>>>>>>>
>>>>>>> On Wed, Jan 3, 2018 at 2:22 PM, Anthony Thomas <
>>>>>>> ahthomas@eng.ucsd.edu> wrote:
>>>>>>>
>>>>>>>> Hi Madlib folks,
>>>>>>>>
>>>>>>>> I have a large tall-and-skinny sparse matrix which I'm trying to
>>>>>>>> multiply by a dense vector. The matrix is 1.25e8 by 100 with
>>>>>>>> approximately 1% nonzero values. This operation always triggers an
>>>>>>>> error from Greenplum:
>>>>>>>>
>>>>>>>> plpy.SPIError: invalid memory alloc request size 1073741824
>>>>>>>> (context 'accumArrayResult') (mcxt.c:1254) (plpython.c:4957)
>>>>>>>> CONTEXT:  Traceback (most recent call last):
>>>>>>>>   PL/Python function "matrix_vec_mult", line 24, in <module>
>>>>>>>>     matrix_in, in_args, vector)
>>>>>>>>   PL/Python function "matrix_vec_mult", line 2044, in
>>>>>>>> matrix_vec_mult
>>>>>>>>   PL/Python function "matrix_vec_mult", line 2001, in
>>>>>>>> _matrix_vec_mult_dense
>>>>>>>> PL/Python function "matrix_vec_mult"
>>>>>>>>
>>>>>>>> Some Googling suggests this error is caused by a hard limit from
>>>>>>>> Postgres which restricts the maximum size of an array to 1GB. If
>>>>>>>> this is indeed the cause of the error I'm seeing, does anyone have
>>>>>>>> any suggestions about how to circumvent this issue? This comes up
>>>>>>>> in other cases as well, like transposing a tall-and-skinny matrix.
>>>>>>>> MVM with smaller matrices works fine.
>>>>>>>>
>>>>>>>> Here is relevant version information:
>>>>>>>>
>>>>>>>> SELECT VERSION();
>>>>>>>> PostgreSQL 8.3.23 (Greenplum Database 5.1.0 build dev) on
>>>>>>>> x86_64-pc-linux-gnu, compiled by GCC gcc
>>>>>>>> (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609, compiled on Dec 21
>>>>>>>> 2017 09:09:46
>>>>>>>>
>>>>>>>> SELECT madlib.version();
>>>>>>>> MADlib version: 1.12, git revision: unknown, cmake configuration
>>>>>>>> time: Thu Dec 21 18:04:47 UTC 2017, build type: RelWithDebInfo,
>>>>>>>> build system: Linux-4.4.0-103-generic, C compiler: gcc 4.9.3,
>>>>>>>> C++ compiler: g++ 4.9.3
>>>>>>>>
>>>>>>>> Madlib install-check reported one error in the "convex" module
>>>>>>>> related to "loss too high", which seems unrelated to the issue
>>>>>>>> described above. I know Ubuntu isn't officially supported by
>>>>>>>> Greenplum, so I'd like to be confident this issue isn't just the
>>>>>>>> result of using an unsupported OS. Please let me know if any other
>>>>>>>> information would be helpful.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Anthony
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ivan Novick, Product Manager Pivotal Greenplum
>>>>>>> inovick@pivotal.io --  (Mobile) 408-230-6491
>>>>>>> https://www.youtube.com/GreenplumDatabase
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
