Date: Tue, 1 Apr 2014 11:32:26 -0700
Subject: Re: [jira] [Commented] (MAHOUT-1500) H2O integration
From: Dmitriy Lyubimov
To: "dev@mahout.apache.org"

On Tue, Apr 1, 2014 at 3:09 AM, Ted Dunning wrote:

> I would rather see a matrix that looks local but acts global so that
> coders can produce very simple code that is still parallelized.

And that's exactly how it is done in the Bindings. This discussion is not about that, though; it is about why doing that on the Matrix and Vector hierarchy is a bad idea. Let me try to explain why.

The Matrix and Vector APIs historically mix in a lot of concerns beyond the linear-algebra operators. They also include element data access views and patterns (getQuick, getRow, iterateNonZero) and in-core-specific optimizer hints such as double getLookupCost() and double getIteratorAdvanceCost(), etc. Normally that would be addressed via mix-ins, but it wasn't (and mix-ins are hard in Java in general).
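To make the mixing of concerns concrete, here is a deliberately condensed, illustrative sketch, not the actual Mahout Vector API (whose surface is much larger). The method names are the ones cited above, but the grouping and signatures are my own approximation:

```java
import java.util.Iterator;

// Illustrative sketch only: roughly the kinds of concerns that end up on one
// in-core type. Signatures are approximations, not the real Mahout interface.
interface InCoreVector {

  // A nonzero element handle, in the spirit of Vector.Element.
  interface Element {
    int index();
    double get();
  }

  // Linear-algebra concern.
  InCoreVector plus(InCoreVector other);
  double dot(InCoreVector other);

  // Element data access views and patterns.
  double getQuick(int index);
  Iterator<Element> iterateNonZero();

  // In-core-specific optimizer hints.
  double getLookupCost();
  double getIteratorAdvanceCost();
}
```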
Corollary to that is the simple fact that 95% of Mahout code (and, more importantly, outside code) is something like

for (el : v.iterateNonZero()) { ... do something with element }

which is not parallelizable at all and would require major refactoring of the APIs and of all user code to make it so (see the sketch at the end of this message).

From that follow two arguments:

(1) Doing what you say on the AbstractMatrix or AbstractVector hierarchy is not possible without a "nuclear option" on the API, which would send a ripple effect inside and outside Mahout (my outside code in particular, too).

(2) Even if we invoked the "nuclear option", doing so has no benefit compared to introducing a parallel type hierarchy for distributed matrices, since write-once-run-everywhere works there too.

The idea of write-once-run either in-core or out-of-core is very noble, but in practice it is neither quite feasible (mostly because of component lifecycle and optimization checkpointing concerns), nor does it have significant value. That is, if one can have ssvd and dssvd in 29 lines (assuming the algorithm even has a parallelization strategy), then there's no harm in having two separate things for in-core and out-of-core -- dssvd() and ssvd().
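For concreteness, here is a minimal, compilable version of the loop idiom cited at the top of this message, written against hypothetical stand-in types rather than Mahout's actual classes. The point is structural: the caller pulls elements one at a time through a sequential iterator, so nothing about the loop can be distributed without rewriting the caller.

```java
import java.util.Iterator;

// Hedged sketch: stand-in types, not Mahout's real Vector/Element classes.
final class SequentialTraversalSketch {

  interface Element {
    int index();
    double get();
  }

  interface SparseVector {
    // Mirrors the shape of the iterateNonZero() pattern cited above.
    Iterator<Element> iterateNonZero();
  }

  // Typical caller-driven loop: element-at-a-time, inherently single-threaded.
  static double sumOfSquares(SparseVector v) {
    double sum = 0.0;
    for (Iterator<Element> it = v.iterateNonZero(); it.hasNext(); ) {
      Element el = it.next();
      sum += el.get() * el.get(); // "do something with element"
    }
    return sum;
  }

  private SequentialTraversalSketch() {}
}
```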