From: Phil Steitz
Date: Mon, 20 Apr 2015 15:04:41 -0700
To: Commons Developers List <dev@commons.apache.org>
Subject: Re: [math] threading redux

On 4/19/15 6:08 AM, Gilles wrote:
> Hello.
>
> On Sat, 18 Apr 2015 22:25:20 -0400, James Carman wrote:
>> I think I got sidetracked when typing that email.  I was trying to
>> say that we need an abstraction layer above raw threads in order to
>> allow for different types of parallelism.
>> The Future abstraction is there in order to support remote execution
>> where side effects aren't good enough.
>
> I don't know what
>   "remote execution where side effects aren't good enough"
> means.
>
> I'll describe my example of a "prototype" (see quoted message
> below[1]) and what *I* mean when I suggest that (some of) the CM code
> should allow taking advantage of multi-threading.
>
> I committed the first set of classes in "o.a.c.m.ml.neuralnet".[2]
> Here is the idea of "parallelism" that drove the design of those
> classes: the training of an artificial neural network (ANN) is
> performed by almost[3] independent updates of each of the ANN's
> cells.  You _cannot_[4], however, chop the network into independent
> parts to be sent off for remote processing: each update must be
> visible ASAP to all the training tasks.[5]

There are lots of ways to allow distributed processes to share common
data.  Spark has a very nice construct called a Resilient Distributed
Dataset (RDD) designed for exactly this purpose.

> "Future" instances do not appear in the "main" code, but the idea
> was, indeed, to be able to use that JDK abstraction: see the unit
> test[6]
>   testTravellerSalesmanSquareTourParallelSolver()
> defined in class
>   org.apache.commons.math4.ml.neuralnet.sofm.KohonenTrainingTaskTest
> in the "test" part of the repository.

This is a good concrete example.  The question is: is there a way we
could set up, say, KohonenTrainingTask so that it does not directly
implement Runnable, enabling it to be executed by something other than
an in-process, thread-spawning Executor?  You're right that however we
set it up, we would have to allow each task to access the shared net.

>> As for a concrete example, you can try Phil's idea of the genetic
>> algorithm stuff, I suppose.
>
> I hope that with the above I made myself clear that I was not asking
> for a pointer to code that could be parallelized[7], but rather that
> people make explicit what _they_ mean by parallelism[8].  What I mean
> is multithread-safe code that can take advantage of multi-core
> machines through the readily available classes in the JDK: namely the
> "Executor" framework, which you also mentioned.

That is one way to achieve parallelism.  The Executor is one way to
manage concurrently executing threads in a single process.  There are
other ways to do this.  My challenge is to find a way to make it
possible for users to plug in alternatives.

> Of course, I do not preclude other approaches (I don't know them, as
> mentioned previously) that may (or may not) be more appropriate for
> the example I gave or for other algorithms; but I truly believe that
> this discussion should be more precise, or we will only deepen the
> misunderstanding of what we think we are talking about.

Agreed.  The above example is also a good one to look at.

> Regards,
> Gilles
>
> [1] As a side note: shall we agree that top-posting is bad? ;-)

Yes!

> [2] With the purpose of implementing a version of a specific
>     algorithm (SOFM), so the data structures might not be generally
>     useful for any kind of artificial neural network.
> [3] The update should of course be thread-safe: two parallel tasks
>     might try to update the same cell at the same time.

Right, this is partly a function of what data structure and protocols
you use to protect the shared data.

> [4] At least, it's instinctively obvious that for a SOFM network of
>     "relatively small" size, you'd _lose_ performance through I/O.

Yes, just like it does not make sense to do spreadsheet math on Hadoop
clusters.  The (perhaps impossible) idea is to set things up so that
thread location and management is pluggable.
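To make that concrete, here is a rough, untested sketch of what such a
pluggable layer could look like.  All names are hypothetical, not
existing [math] API:

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;

    /** Abstraction over where and how tasks run (hypothetical name). */
    public interface TaskRunner {
        /** Submits all tasks, returning handles to their results. */
        <T> List<Future<T>> submitAll(List<? extends Callable<T>> tasks)
            throws InterruptedException;
    }

    /** Default implementation backed by a plain JDK ExecutorService. */
    public class ExecutorTaskRunner implements TaskRunner {
        private final ExecutorService executor;

        public ExecutorTaskRunner(ExecutorService executor) {
            this.executor = executor;
        }

        @Override
        public <T> List<Future<T>> submitAll(
                List<? extends Callable<T>> tasks)
                throws InterruptedException {
            // invokeAll blocks until every task has completed.
            return executor.invokeAll(tasks);
        }
    }

An algorithm written against such an interface would not care whether
its tasks run on an in-process pool, a ForkJoinPool, or some remote
fabric; only the shared-state question (the net, in the SOFM case)
would remain.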
Phil

> [5] In later phases of the training, "regions" will have formed in
>     the ANN, so at that point it might be possible to continue the
>     updates of those regions on different computation nodes (with the
>     necessary synchronization of the regions' boundaries).
> [6] It's more of an example usage that could probably go into the
>     "user guide".
> [7] The GA lends itself perfectly to the same kind of "readiness for
>     parallelism" code which I implemented for the SOFM.
> [8] As applied concretely to a specific algorithm in CM.
>
>> On Saturday, April 18, 2015, Gilles wrote:
>>
>>> On Fri, 17 Apr 2015 16:53:56 -0500, James Carman wrote:
>>>
>>>> Do you have any pointers to code for this ForkJoin mechanism?  I'm
>>>> curious to see it.
>>>>
>>>> The key thing you will need in order to support parallelization
>>>> in a generic way
>>>
>>> What do you mean by "generic way"?
>>>
>>> I'm afraid that we may be trying to compare apples and oranges;
>>> each of us probably has in mind a "prototype" algorithm and an
>>> idea of how to implement it to make it run in parallel.
>>>
>>> I think that it would focus the discussion if we could
>>>   1. tell what the "prototype" is,
>>>   2. show a sort of pseudo-code of the difference between a
>>>      sequential and a parallel run of this "prototype" (i.e. what
>>>      the data is, and how the (sub)tasks operate on it).
>>>
>>> Regards,
>>> Gilles
>>>
>>>> is to not tie it directly to threads, but use some abstraction
>>>> layer above threads, since that may not be the "worker" method
>>>> you're using at the time.
>>>>
>>>> On Fri, Apr 17, 2015 at 2:57 PM, Thomas Neidhart wrote:
>>>>
>>>>> On 04/17/2015 05:35 PM, Phil Steitz wrote:
>>>>>
>>>>>> On 4/17/15 3:14 AM, Gilles wrote:
>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>> On Thu, 16 Apr 2015 17:06:21 -0500, James Carman wrote:
>>>>>>>
>>>>>>>> Consider me poked!
>>>>>>>>
>>>>>>>> So, the Java answer to "how do I run things in multiple
>>>>>>>> threads" is to use an Executor (java.util.concurrent).  This
>>>>>>>> doesn't necessarily mean that you *have* to use a separate
>>>>>>>> thread (the implementation could execute inline).  However,
>>>>>>>> in order to accommodate the separate-thread case, you would
>>>>>>>> need to code to a Future-like API.  Now, I'm not saying to
>>>>>>>> use Executors directly, but I'd provide some abstraction
>>>>>>>> layer above them or in lieu of them, something like:
>>>>>>>>
>>>>>>>>     public interface ExecutorThingy {
>>>>>>>>         Future execute(Function fn);
>>>>>>>>     }
>>>>>>>>
>>>>>>>> One could imagine implementing different ExecutorThingy
>>>>>>>> implementations which allow you to parallelize things in
>>>>>>>> different ways (simple threads, JMS, Akka, etc., etc.)
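For illustration, one way that sketch might be fleshed out with
generics -- untested, and the names are hypothetical:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;
    import java.util.function.Supplier;

    // Generic variant of the ExecutorThingy idea sketched above.
    public interface ExecutorThingy {
        <T> Future<T> execute(Supplier<T> fn);
    }

    // Runs the function inline, on the caller's own thread.
    class InlineThingy implements ExecutorThingy {
        @Override
        public <T> Future<T> execute(Supplier<T> fn) {
            return CompletableFuture.completedFuture(fn.get());
        }
    }

    // Delegates to a JDK thread pool; JMS- or Akka-backed versions
    // would implement the same interface.
    class ThreadPoolThingy implements ExecutorThingy {
        private final ExecutorService pool;

        ThreadPoolThingy(ExecutorService pool) {
            this.pool = pool;
        }

        @Override
        public <T> Future<T> execute(Supplier<T> fn) {
            return pool.submit(fn::get);
        }
    }

The inline variant shows that "execute" need not imply a new thread,
which is exactly the point of the abstraction.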
>>>>>>>
>>>>>>> I did not understand what is being suggested: parallelization
>>>>>>> of a single algorithm, or concurrent calls to multiple
>>>>>>> instances of an algorithm?
>>>>>>
>>>>>> Really both.  It's probably best to look at some concrete
>>>>>> examples.  The two I mentioned in my ApacheCon talk are:
>>>>>>
>>>>>>   1. Threads managed by some external process / application
>>>>>>      gathering statistics to be aggregated.
>>>>>>
>>>>>>   2. Allowing multiple threads to concurrently execute GA
>>>>>>      transformations within the GeneticAlgorithm "evolve"
>>>>>>      method.
>>>>>>
>>>>>> It would be instructive to think about how to handle both of
>>>>>> these use cases using something like what James is suggesting.
>>>>>> What is nice about his idea is that it could give us a way to
>>>>>> let users / systems decide whether they want to have [math]
>>>>>> algorithms spawn threads to execute concurrently or to allow an
>>>>>> external execution framework to handle task distribution across
>>>>>> threads.
>>>>>
>>>>> I think a more viable option is to take advantage of the
>>>>> ForkJoin mechanism that we can use now in math 4.
>>>>>
>>>>> For example, the GeneticAlgorithm could quite easily be changed
>>>>> to use a ForkJoinTask to perform each evolution.  I will try to
>>>>> come up with an example soon, as I plan to work on the genetics
>>>>> package anyway.
>>>>>
>>>>> The idea outlined above sounds nice, but it is very unclear how
>>>>> an algorithm or function would perform its parallelization in
>>>>> such a way, and whether it would still be efficient.
>>>>>
>>>>> Thomas
>>>>>
>>>>>> Since 2. above is a good example of "internal" parallelism and
>>>>>> it also has data sharing / transfer challenges, maybe it's best
>>>>>> to start with that one.  I have just started thinking about
>>>>>> this and would love to get better ideas than my own hacking
>>>>>> about how to do it:
>>>>>>
>>>>>>   a) Using Spark with RDDs to maintain population state data
>>>>>>   b) Hadoop with HDFS (or something else?)
>>>>>>
>>>>>> Phil
>>>>>>
>>>>>>> Gilles
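For reference, a rough, untested sketch of the ForkJoin idea Thomas
describes, applied to a GA-style fitness evaluation.  The class name,
the population encoding, and the objective function here are
hypothetical stand-ins, not actual [math] code:

    import java.util.concurrent.RecursiveAction;

    // Evaluates fitness values for a slice of the population in
    // parallel, splitting the index range until slices are small.
    public class FitnessTask extends RecursiveAction {
        private static final int THRESHOLD = 64;
        private final double[][] population; // stand-in encoding
        private final double[] fitness;
        private final int from;
        private final int to;

        public FitnessTask(double[][] population, double[] fitness,
                           int from, int to) {
            this.population = population;
            this.fitness = fitness;
            this.from = from;
            this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from <= THRESHOLD) {
                // Small enough: evaluate this slice directly.
                for (int i = from; i < to; i++) {
                    fitness[i] = evaluate(population[i]);
                }
            } else {
                // Split the range and let the pool work-steal.
                int mid = (from + to) >>> 1;
                invokeAll(new FitnessTask(population, fitness, from, mid),
                          new FitnessTask(population, fitness, mid, to));
            }
        }

        private double evaluate(double[] chromosome) {
            double sum = 0;  // stand-in objective function
            for (double g : chromosome) {
                sum += g * g;
            }
            return sum;
        }
    }

A pool would drive it with something like

    new ForkJoinPool().invoke(
        new FitnessTask(population, fitness, 0, population.length));

and the evolve step would fork one such task per generation.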