spark-issues mailing list archives

From "Yanbo Liang (JIRA)" <>
Subject [jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator
Date Fri, 23 Jun 2017 09:04:00 GMT


Yanbo Liang commented on SPARK-21152:

[~sethah] This is an interesting topic, thanks for working on it. Could you show a performance
comparison with respect to the size and type of the data? AFAIK, the most common use cases for
MLlib LR are training on {{low dimensional dense/sparse or high dimensional sparse}} data.
If the blocked gradient update yields significant performance improvements for these cases, I
think it's worth the investment. Thanks.

> Use level 3 BLAS operations in LogisticAggregator
> -------------------------------------------------
>                 Key: SPARK-21152
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Seth Hendrickson
> In the logistic regression gradient update, we currently compute the gradient one row at a time.
If we block rows together, we can perform a blocked gradient update that leverages the BLAS
GEMM operation.
> On high dimensional dense datasets, I've observed ~10x speedups. The problem, though,
is that this likely won't improve the sparse case, so we need to keep both implementations around,
and the blocked algorithm will require caching a new dataset of type:
> {code}
> BlockInstance(label: Vector, weight: Vector, features: Matrix)
> {code}
> We have avoided caching anything beside the original dataset passed to train in the past
because it adds memory overhead if the user has cached this original dataset for other reasons.
Here, I'd like to discuss whether we think this patch would be worth the investment, given
that it only improves a subset of the use cases.
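To make the row-wise vs. blocked distinction concrete, here is a minimal numpy sketch (not Spark's actual {{LogisticAggregator}} code; function names and the block size are illustrative). The per-row loop accumulates the binary logistic-loss gradient one dot product at a time, while the blocked version computes margins and the gradient contribution for a whole block of rows with matrix products, which numpy dispatches to optimized BLAS kernels. For binary LR the block products are matrix-vector (GEMV); with a multinomial coefficient matrix of shape (d, k) the same structure becomes a true level 3 GEMM.

```python
import numpy as np

def gradient_per_row(X, y, w):
    """Accumulate the logistic-loss gradient one row at a time (level 1 BLAS dots)."""
    grad = np.zeros_like(w)
    for xi, yi in zip(X, y):
        margin = xi @ w                       # per-row dot product
        prob = 1.0 / (1.0 + np.exp(-margin))
        grad += (prob - yi) * xi
    return grad

def gradient_blocked(X, y, w, block_size=4):
    """Accumulate the same gradient over blocks of rows via matrix products."""
    grad = np.zeros_like(w)
    for start in range(0, X.shape[0], block_size):
        Xb = X[start:start + block_size]      # (b, d) block of stacked rows
        yb = y[start:start + block_size]
        margins = Xb @ w                      # one matrix-vector product per block
        probs = 1.0 / (1.0 + np.exp(-margins))
        grad += Xb.T @ (probs - yb)           # blocked gradient contribution
    return grad
```

Both functions compute the identical gradient; the speedup in the dense case comes purely from replacing many small dot products with fewer, larger BLAS calls.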

This message was sent by Atlassian JIRA
