From: Phil Steitz
Date: Mon, 20 Apr 2015 15:04:41 -0700
To: Commons Developers List <dev@commons.apache.org>
Subject: Re: [math] threading redux

On 4/19/15 6:08 AM, Gilles wrote:
> Hello.
>
> On Sat, 18 Apr 2015 22:25:20 -0400, James Carman wrote:
>> I think I got sidetracked when typing that email.  I was trying to
>> say that we need an abstraction layer above raw threads in order to
>> allow for different types of parallelism.
>> The Future abstraction is there in order to support remote execution
>> where side effects aren't good enough.
>
> I don't know what
>   "remote execution where side effects aren't good enough"
> means.
>
> I'll describe my example of a "prototype" (see quoted message
> below[1]) and what *I* mean when I suggest that (some of) the CM code
> should allow taking advantage of multi-threading.
>
> I committed the first set of classes in "o.a.c.m.ml.neuralnet".[2]
> Here is the idea of "parallelism" that drove the design of those
> classes: the training of an artificial neural network (ANN) is
> performed by almost[3] independent updates of each of the ANN's
> cells.  You _cannot_[4], however, chop the network into independent
> parts to be sent off for remote processing: each update must be
> visible ASAP to all the training tasks.[5]

There are lots of ways to allow distributed processes to share common
data.  Spark has a very nice construct called a Resilient Distributed
Dataset (RDD) designed for exactly this purpose.

> "Future" instances do not appear in the "main" code, but the idea
> was, indeed, to be able to use that JDK abstraction: see the unit
> test[6]
>   testTravellerSalesmanSquareTourParallelSolver()
> defined in class
>   org.apache.commons.math4.ml.neuralnet.sofm.KohonenTrainingTaskTest
> in the "test" part of the repository.

This is a good concrete example.  The question is: is there a way we
could set up, say, KohonenTrainingTask so that it does not directly
implement Runnable, enabling it to be executed by something other than
an in-process, thread-spawning Executor?  You're right that however we
set it up, we would have to allow each task to access the shared net.

>> As for a concrete example, you can try Phil's idea of the genetic
>> algorithm stuff, I suppose.
>
> I hope that with the above I made myself clear that I was not asking
> for a pointer to code that could be parallelized[7], but rather that
> people make explicit what _they_ mean by parallelism[8].  What I mean
> is multithread-safe code that can take advantage of multi-core
> machines through the readily available classes in the JDK: namely the
> "Executor" framework, which you also mentioned.

That is one way to achieve parallelism.  The Executor is one way to
manage concurrently executing threads in a single process.  There are
other ways to do this.  My challenge is to find a way to make it
possible for users to plug in alternatives.

> Of course, I do not preclude other approaches (I don't know them, as
> mentioned previously) that may (or may not) be more appropriate for
> the example I gave or for other algorithms; but I truly believe that
> this discussion should be more precise, or we will only deepen the
> misunderstanding of what we think we are talking about.

Agreed.  The above example is also a good one to look at.

> Regards,
> Gilles
>
> [1] As a side note: shall we agree that top-posting is bad? ;-)

Yes!

> [2] With the purpose of implementing a version of a specific
>     algorithm (SOFM), so the data structures might not be generally
>     useful for any kind of artificial neural network.
> [3] The update should of course be thread-safe: two parallel tasks
>     might try to update the same cell at the same time.

Right, this is partly a function of what data structure and protocols
you use to protect the shared data.

> [4] At least, it's instinctively obvious that for a SOFM network of
>     "relatively small" size, you'd _lose_ performance through I/O.

Yes, just like it does not make sense to do spreadsheet math on Hadoop
clusters.  The (perhaps impossible) idea is to set things up so that
thread location and management is pluggable.
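To make that concrete, here is a rough, untested sketch of what such a
pluggable layer could look like.  All names are hypothetical, not
existing [math] API:

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;

    /** Abstraction over where and how tasks run (hypothetical name). */
    public interface TaskRunner {
        /** Submits all tasks, returning handles to their results. */
        <T> List<Future<T>> submitAll(List<? extends Callable<T>> tasks)
            throws InterruptedException;
    }

    /** Default implementation backed by a plain JDK ExecutorService. */
    public class ExecutorTaskRunner implements TaskRunner {
        private final ExecutorService executor;

        public ExecutorTaskRunner(ExecutorService executor) {
            this.executor = executor;
        }

        @Override
        public <T> List<Future<T>> submitAll(
                List<? extends Callable<T>> tasks)
                throws InterruptedException {
            // invokeAll blocks until every task has completed.
            return executor.invokeAll(tasks);
        }
    }

An algorithm written against such an interface would not care whether
its tasks run on an in-process pool, a ForkJoinPool, or some remote
fabric; only the shared-state question (the net, in the SOFM case)
would remain.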
Phil

> [5] In later phases of the training, "regions" will have formed in
>     the ANN, so at that point it might be possible to continue the
>     updates of those regions on different computation nodes (with the
>     necessary synchronization of the regions' boundaries).
> [6] It's more of an example usage that could probably go into the
>     "user guide".
> [7] The GA lends itself perfectly to the same kind of "readiness for
>     parallelism" code which I implemented for the SOFM.
> [8] As applied concretely to a specific algorithm in CM.
>
>> On Saturday, April 18, 2015, Gilles wrote:
>>
>>> On Fri, 17 Apr 2015 16:53:56 -0500, James Carman wrote:
>>>
>>>> Do you have any pointers to code for this ForkJoin mechanism?  I'm
>>>> curious to see it.
>>>>
>>>> The key thing you will need in order to support parallelization
>>>> in a generic way
>>>
>>> What do you mean by "generic way"?
>>>
>>> I'm afraid that we may be trying to compare apples and oranges;
>>> each of us probably has in mind a "prototype" algorithm and an
>>> idea of how to implement it to make it run in parallel.
>>>
>>> I think that it would focus the discussion if we could
>>>   1. tell what the "prototype" is,
>>>   2. show a sort of pseudo-code of the difference between a
>>>      sequential and a parallel run of this "prototype" (i.e. what
>>>      the data is, and how the (sub)tasks operate on it).
>>>
>>> Regards,
>>> Gilles
>>>
>>>> is to not tie it directly to threads, but use some abstraction
>>>> layer above threads, since that may not be the "worker" method
>>>> you're using at the time.
>>>>
>>>> On Fri, Apr 17, 2015 at 2:57 PM, Thomas Neidhart wrote:
>>>>
>>>>> On 04/17/2015 05:35 PM, Phil Steitz wrote:
>>>>>
>>>>>> On 4/17/15 3:14 AM, Gilles wrote:
>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>> On Thu, 16 Apr 2015 17:06:21 -0500, James Carman wrote:
>>>>>>>
>>>>>>>> Consider me poked!
>>>>>>>>
>>>>>>>> So, the Java answer to "how do I run things in multiple
>>>>>>>> threads" is to use an Executor (java.util.concurrent).  This
>>>>>>>> doesn't necessarily mean that you *have* to use a separate
>>>>>>>> thread (the implementation could execute inline).  However,
>>>>>>>> in order to accommodate the separate-thread case, you would
>>>>>>>> need to code to a Future-like API.  Now, I'm not saying to
>>>>>>>> use Executors directly, but I'd provide some abstraction
>>>>>>>> layer above them or in lieu of them, something like:
>>>>>>>>
>>>>>>>>     public interface ExecutorThingy {
>>>>>>>>         Future execute(Function fn);
>>>>>>>>     }
>>>>>>>>
>>>>>>>> One could imagine implementing different ExecutorThingy
>>>>>>>> implementations which allow you to parallelize things in
>>>>>>>> different ways (simple threads, JMS, Akka, etc., etc.)
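For illustration, one way that sketch might be fleshed out with
generics -- untested, and the names are hypothetical:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;
    import java.util.function.Supplier;

    // Generic variant of the ExecutorThingy idea sketched above.
    public interface ExecutorThingy {
        <T> Future<T> execute(Supplier<T> fn);
    }

    // Runs the function inline, on the caller's own thread.
    class InlineThingy implements ExecutorThingy {
        @Override
        public <T> Future<T> execute(Supplier<T> fn) {
            return CompletableFuture.completedFuture(fn.get());
        }
    }

    // Delegates to a JDK thread pool; JMS- or Akka-backed versions
    // would implement the same interface.
    class ThreadPoolThingy implements ExecutorThingy {
        private final ExecutorService pool;

        ThreadPoolThingy(ExecutorService pool) {
            this.pool = pool;
        }

        @Override
        public <T> Future<T> execute(Supplier<T> fn) {
            return pool.submit(fn::get);
        }
    }

The inline variant shows that "execute" need not imply a new thread,
which is exactly the point of the abstraction.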
>>>>>>>
>>>>>>> I did not understand what is being suggested: parallelization
>>>>>>> of a single algorithm, or concurrent calls to multiple
>>>>>>> instances of an algorithm?
>>>>>>
>>>>>> Really both.  It's probably best to look at some concrete
>>>>>> examples.  The two I mentioned in my ApacheCon talk are:
>>>>>>
>>>>>>   1. Threads managed by some external process / application
>>>>>>      gathering statistics to be aggregated.
>>>>>>
>>>>>>   2. Allowing multiple threads to concurrently execute GA
>>>>>>      transformations within the GeneticAlgorithm "evolve"
>>>>>>      method.
>>>>>>
>>>>>> It would be instructive to think about how to handle both of
>>>>>> these use cases using something like what James is suggesting.
>>>>>> What is nice about his idea is that it could give us a way to
>>>>>> let users / systems decide whether they want to have [math]
>>>>>> algorithms spawn threads to execute concurrently or to allow an
>>>>>> external execution framework to handle task distribution across
>>>>>> threads.
>>>>>
>>>>> I think a more viable option is to take advantage of the
>>>>> ForkJoin mechanism that we can use now in math 4.
>>>>>
>>>>> For example, the GeneticAlgorithm could quite easily be changed
>>>>> to use a ForkJoinTask to perform each evolution.  I will try to
>>>>> come up with an example soon, as I plan to work on the genetics
>>>>> package anyway.
>>>>>
>>>>> The idea outlined above sounds nice, but it is very unclear how
>>>>> an algorithm or function would perform its parallelization in
>>>>> such a way, and whether it would still be efficient.
>>>>>
>>>>> Thomas
>>>>>
>>>>>> Since 2. above is a good example of "internal" parallelism and
>>>>>> it also has data sharing / transfer challenges, maybe it's best
>>>>>> to start with that one.  I have just started thinking about
>>>>>> this and would love to get better ideas than my own hacking
>>>>>> about how to do it:
>>>>>>
>>>>>>   a) Using Spark with RDDs to maintain population state data
>>>>>>   b) Hadoop with HDFS (or something else?)
>>>>>>
>>>>>> Phil
>>>>>>
>>>>>>> Gilles
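For reference, a rough, untested sketch of the ForkJoin idea Thomas
describes, applied to a GA-style fitness evaluation.  The class name,
the population encoding, and the objective function here are
hypothetical stand-ins, not actual [math] code:

    import java.util.concurrent.RecursiveAction;

    // Evaluates fitness values for a slice of the population in
    // parallel, splitting the index range until slices are small.
    public class FitnessTask extends RecursiveAction {
        private static final int THRESHOLD = 64;
        private final double[][] population; // stand-in encoding
        private final double[] fitness;
        private final int from;
        private final int to;

        public FitnessTask(double[][] population, double[] fitness,
                           int from, int to) {
            this.population = population;
            this.fitness = fitness;
            this.from = from;
            this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from <= THRESHOLD) {
                // Small enough: evaluate this slice directly.
                for (int i = from; i < to; i++) {
                    fitness[i] = evaluate(population[i]);
                }
            } else {
                // Split the range and let the pool work-steal.
                int mid = (from + to) >>> 1;
                invokeAll(new FitnessTask(population, fitness, from, mid),
                          new FitnessTask(population, fitness, mid, to));
            }
        }

        private double evaluate(double[] chromosome) {
            double sum = 0;  // stand-in objective function
            for (double g : chromosome) {
                sum += g * g;
            }
            return sum;
        }
    }

A pool would drive it with something like

    new ForkJoinPool().invoke(
        new FitnessTask(population, fitness, 0, population.length));

and the evolve step would fork one such task per generation.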