mahout-dev mailing list archives

From "Dmitriy Lyubimov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAHOUT-1884) Allow specification of dimensions of a DRM
Date Tue, 04 Oct 2016 21:10:22 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546663#comment-15546663
] 

Dmitriy Lyubimov edited comment on MAHOUT-1884 at 10/4/16 9:09 PM:
-------------------------------------------------------------------

drmWrap is not internal in the least (which is why it is not package-private). It is public
and intended for plugging external general sources into the input barrier of the optimizer.

Loading into memory would happen anyway. Caching would not necessarily happen, but it is not
guaranteed not to happen either; there's no such contract.

Materially, it only makes any difference if the input is larger than available cluster capacity,
which I have yet to encounter, as algebraic tasks are CPU- and IO-bound, not memory-bound. Usually
we run out of IO and CPU much sooner than we run out of memory, which makes this situation
pragmatically unrealistic.

Note that the optimizer should, and will, retain control over caching. We don't have an explicit
caching API except for checkpoint "hints", and even that is only a hint, not a guarantee. Giving
it some heuristics about the dataset doesn't guarantee that it won't compute others, or won't
cache or sample for some other reason, now or in the future.

This situation is fine, as it is one of the functions of the optimizer, as much as choosing degrees
of parallelism, product task sizes, or operators to execute. Making those choices automatically
is, actually, the point. As long as the optimizer does right enough things, that should be ok.


Bottom line, I don't see harm in adding _optional_ ncol and nrow to drmDfsRead specifically.
But I do not see a tangible benefit either. There's possibly only a slight benefit right now
(with no no-cache or no-sample guarantee), which will likely only decrease in the future. I am
fine with it as long as it is understood that there's no "no-cache" contract anywhere.
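
For illustration only, here is a toy sketch of what optional dimension hints with a lazy fallback
could look like. ToyDrm, nrowHint, ncolHint and scans are hypothetical names and are not Mahout's
API; the sketch does not model the optimizer, which may still cache or sample for other reasons.

```scala
// Toy model only: these names are hypothetical and are NOT Mahout's API.
// It illustrates optional dimension hints with a lazy fallback that scans the data.
final class ToyDrm(
    rows: () => Iterator[Array[Double]], // stands in for a distributed source
    nrowHint: Option[Long] = None,
    ncolHint: Option[Int] = None) {

  // Counts how many full passes over the source were needed, for illustration.
  var scans: Int = 0

  // Use the hint if present; otherwise fall back to one pass over the data.
  lazy val nrow: Long = nrowHint.getOrElse {
    scans += 1
    rows().size.toLong
  }

  // Same idea for columns: without a hint, take the widest row seen in one pass.
  lazy val ncol: Int = ncolHint.getOrElse {
    scans += 1
    rows().foldLeft(0)((widest, row) => math.max(widest, row.length))
  }
}

object ToyDrmDemo {
  def main(args: Array[String]): Unit = {
    val source = () => Iterator(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))

    // With hints, asking for the dimensions never touches the source.
    val hinted = new ToyDrm(source, nrowHint = Some(2L), ncolHint = Some(3))
    println(s"hinted: ${hinted.nrow} x ${hinted.ncol}, scans = ${hinted.scans}")

    // Without hints, each dimension costs a pass over the source.
    val unhinted = new ToyDrm(source)
    println(s"unhinted: ${unhinted.nrow} x ${unhinted.ncol}, scans = ${unhinted.scans}")
  }
}
```

In this toy the hints only avoid the passes needed to learn the dimensions; nothing here amounts
to a "no-cache" guarantee, which matches the point above that no such contract exists.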




> Allow specification of dimensions of a DRM
> ------------------------------------------
>
>                 Key: MAHOUT-1884
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1884
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.12.2
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>            Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a user calls
> nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the matrices
> are synthetically generated, or when some metadata about them is known). In such cases, the
> user should be able to specify the dimensions upon creating the DRM and the caching should
> be avoided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
