mahout-dev mailing list archives

From "Jonathan Traupman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-666) DistributedSparseMatrix should clean up after itself when doing times(Vector) and timesSquared(Vector)
Date Tue, 12 Apr 2011 16:22:06 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018890#comment-13018890 ]

Jonathan Traupman commented on MAHOUT-666:
------------------------------------------

Wow, that was fast. 

I forgot to mention in my upload that it might be reasonable to just make this cleanup code
the default (i.e. get rid of the conf parameter), since the directories that get created have
random names and are thus hard for any external computation to access. I'm not familiar enough
with the users of this code to make that call, though, so I took the "do no harm" approach.

If you think it's worthwhile to just always clean up, I'll make the requisite changes and
send another patch.

> DistributedSparseMatrix should clean up after itself when doing times(Vector) and timesSquared(Vector)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-666
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-666
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.5
>         Environment: Linux x86_64 2.6.18, Mac OS 10.6 64-bit, Hadoop 0.20.2, Java 1.6
>            Reporter: Jonathan Traupman
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: mahout-666.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> The directories created during the times() and timesSquared() methods in DistributedSparseMatrix
leave behind a lot of cruft. The individual files are tagged with deleteOnExit, but
the directories are not. Also, by not deleting them until JVM exit, a job that does repeated
matrix/vector multiplies, like DistributedLanczosSolver, creates a lot of temp files that
stick around for the whole run, even though the results they contain are read once and then
never again. 
> Our cluster admins enforce both file count and size quotas, so since 5 temp files/directories
are created on each iteration of DistributedLanczosSolver, we're constantly bumping into the
quota with large SVDs. 
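
To illustrate the difference between deferring cleanup to JVM exit and cleaning up
right after the result has been read, here is a minimal sketch. It is not the code
from mahout-666.patch: it uses java.nio on the local filesystem instead of Hadoop's
FileSystem, a dot product stands in for the matrix/vector multiply, and the class,
the method names, and the keepTempFiles flag are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.UUID;
import java.util.stream.Stream;

// Hypothetical illustration only; none of these names come from Mahout.
class TempDirCleanupSketch {

    // Stand-in for one multiply: the result is written under a randomly
    // named directory, read back once, and the directory is then removed
    // unless the caller asks to keep it.
    static double dotViaTempDir(Path workRoot, double[] a, double[] b, boolean keepTempFiles) {
        Path tmp = workRoot.resolve("tmp-" + UUID.randomUUID());
        try {
            Files.createDirectories(tmp);
            Path out = tmp.resolve("part-00000");
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * b[i];
            }
            Files.write(out, Double.toString(sum).getBytes(StandardCharsets.UTF_8));
            // The result is read exactly once; nothing needs it afterwards.
            return Double.parseDouble(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (!keepTempFiles) {
                deleteRecursively(tmp); // removes the directory as well as its files
            }
        }
    }

    // Deletes a directory tree, children before parents.
    static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts the direct children of a directory.
    static long countEntries(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a fresh scratch directory to act as the job's working root.
    static Path newWorkRoot() {
        try {
            return Files.createTempDirectory("cleanup-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = newWorkRoot();
        double[] v = {1.0, 2.0};
        for (int i = 0; i < 5; i++) {
            dotViaTempDir(root, v, v, false);
        }
        System.out.println("entries left with eager cleanup: " + countEntries(root));
        for (int i = 0; i < 5; i++) {
            dotViaTempDir(root, v, v, true);
        }
        System.out.println("entries left when keeping temp dirs: " + countEntries(root));
        deleteRecursively(root);
    }
}
```

With eager cleanup the number of leftover entries stays flat across iterations, which
is what matters under a file count quota. Keeping the temp directories makes the count
grow with every multiply until the process ends.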

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
