commons-dev mailing list archives

From: Gilles <gil...@harfang.homelinux.org>
Subject: Re: [Math] Moving on or not?
Date: Thu, 07 Feb 2013 22:06:08 GMT

On Thu, 07 Feb 2013 08:32:46 -0800, Phil Steitz wrote:
> On 2/7/13 8:04 AM, Gilles wrote:
>> On Thu, 07 Feb 2013 07:01:42 -0800, Phil Steitz wrote:
>>> On 2/7/13 4:58 AM, Gilles wrote:
>>>> On Wed, 06 Feb 2013 09:46:55 -0800, Phil Steitz wrote:
>>>>> On 2/6/13 9:03 AM, Gilles wrote:
>>>>>> On Wed, 06 Feb 2013 07:19:47 -0800, Phil Steitz wrote:
>>>>>>> On 2/5/13 6:08 AM, Gilles wrote:
>>>>>>>> Hi.
>>>>>>>>
>>>>>>>> In the thread about "static import", Stephen noted that decisions
>>>>>>>> on a component's evolution are dependent on whether the future of
>>>>>>>> the Java language is taken into account, or not.
>>>>>>>> A question on the same theme also arose after the presentation of
>>>>>>>> Commons Math in FOSDEM 2013.
>>>>>>>>
>>>>>>>> If we assume that efficiency is among the important qualities for
>>>>>>>> Commons Math, the future is to allow usage of the tools provided by
>>>>>>>> the standard Java library in order to ease the development of
>>>>>>>> multi-threaded algorithms.
>>>>>>>>
>>>>>>>> Maintaining Java 1.5 source compatibility for the reason that we
>>>>>>>> may need to support legacy applications will turn out to be
>>>>>>>> self-defeating:
>>>>>>>> 1. New users will not consider Commons Math's features that are
>>>>>>>>    notably apt to parallel processing.
>>>>>>>> 2. Current users might at some point simply switch to another
>>>>>>>>    library if it proves more efficient (because it actually uses
>>>>>>>>    multi-threading).
>>>>>>>> 3. New Java developers will be turned away because they will want
>>>>>>>>    to use the more convenient features of the language in order to
>>>>>>>>    provide potential contributions.
>>>>>>>>
>>>>>>>> If maintaining 1.5 source compatibility is kept as a requirement,
>>>>>>>> the consequence is that Commons Math will _become_ a legacy
>>>>>>>> library.
>>>>>>>> In that perspective, implementing/improving algorithms for which a
>>>>>>>> parallel version is known to be more efficient is plainly a waste
>>>>>>>> of development and maintenance time.
>>>>>>>>
>>>>>>>> In order to mitigate the risks (both of upgrading and of not
>>>>>>>> upgrading the source compatibility requirement), I would propose to
>>>>>>>> create a new project (say, "Commons Math MT") where we could
>>>>>>>> implement new features[1] without being encumbered with the 1.5
>>>>>>>> requirement.[2]
>>>>>>>> The "Commons Math MT" would depend on "Commons Math" where we would
>>>>>>>> continue developing single-thread (and thread-safe) "tasks", i.e.
>>>>>>>> independent units of processing that could be used in algorithms
>>>>>>>> located in "Commons Math MT".
>>>>>>>>
>>>>>>>> In summary:
>>>>>>>> - Commons Math (as usual):
>>>>>>>>   * single-thread (sequential) algorithms,
>>>>>>>>   * (pure) Java 5,
>>>>>>>>   * no dependencies.
>>>>>>>> - Commons Math MT:
>>>>>>>>   * multi-thread (parallel) algorithms,
>>>>>>>>   * Java 7 and beyond,
>>>>>>>>   * JNI allowed,
>>>>>>>>   * dependencies allowed (jCuda).
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>
>>>>>>> There are several other possibilities to consider:
>>>>>>>
>>>>>>> 0) Implement multithreading using JDK 1.5 primitives
>>>>>>> 1) Set things up within [math] to support parallel execution in
>>>>>>>    JDK 1.7, Hadoop or other frameworks
>>>>>>> 2) Instead of a new project, start a 4.x branch targeting JDK 1.7
>>>>>>>
>>>>>>> I think we should maintain a version that has no dependencies
>>>>>>> and no
>>>>>>> JNI in any case.
>>>>>>>
>>>>>>> Starting a branch and getting concrete about how to parallelize
>>>>>>> some
>>>>>>> algorithms would be a good way to start.  One thing I have not
>>>>>>> really investigated and would be interested in details on is
>>>>>>> what
>>>>>>> you actually get in efficiency gain (or loss?) using fork /
>>>>>>> join vs
>>>>>>> just using 1.5+ concurrency for the kinds of problems we
>>>>>>> would end
>>>>>>> up using this stuff for.
>>>>>>>
>>>>>>> Thinking about specific parallelization problem instances would
>>>>>>> also
>>>>>>> help decide whether 1) makes sense (i.e., whether it makes
>>>>>>> sense as
>>>>>>> you mention above to maintain a single-threaded library that
>>>>>>> provides task execution for a multithreaded version or
>>>>>>> multithreaded
>>>>>>> frameworks).
>>>>>>>
>>>>>>> One more thing to consider is that for at least some users of
>>>>>>> [math], having the library internally spawn threads and/or peg
>>>>>>> multiple processors may not be desirable.  It is a little misleading
>>>>>>> to say that multithreading is the way to get "efficiency."  It is
>>>>>>> really the way to *use* more compute resources and unless there are
>>>>>>> real algorithmic improvements, the overall efficiency may actually
>>>>>>> be less, due to task coordination overhead.  What you get is faster
>>>>>>> execution due to more greedy utilization of available cores.  Actual
>>>>>>> efficiency (how much overall compute resource it takes to complete a
>>>>>>> job) partly depends on how efficiently the coordination itself is
>>>>>>> done (which JDK 1.7 claims to do very well - I have just not seen
>>>>>>> substantiation or any benchmarks demonstrating this) and how the
>>>>>>> parallelization affects overall compute requirements.  In any case,
>>>>>>> for environments where library thread-spawning is not desirable, I
>>>>>>> think we should maintain a single-threaded version.
>>>>>>>
>>>>>>
>>>>>> Unless I missed the point, those reasons are exactly why I
>>>>>> propose to
>>>>>> have 2 projects/components. One, "Commons-Math", does not fiddle
>>>>>> with
>>>>>> resources, while the other would provide a 
>>>>>> "parallelizationLevel"
>>>>>> setting for the algorithms written to possibly take advantage of
>>>>>> the
>>>>>> Java 5+ "task framework".
>>>>>
>>>>> OK, what about the 4.x option?
>>>>>>
>>>>>> Yes, we could still be good by using only Java 5's concurrency
>>>>>> features
>>>>>> but the issue I raise is not only about concurrency but about
>>>>>> evolution/progress/maintenance, all things that require raising
>>>>>> interest
>>>>>> from new contributors (unless it's fine that Commons Math be
>>>>>> tagged as a
>>>>>> "library of the past"...).
>>>>>
>>>>> +1 for experimenting with parallelization.  I would just like to
>>>>> understand if the JDK 7 stuff really adds much - in particular,
>>>>> does
>>>>> it handle coordination / cpu allocation better than you could
>>>>> easily
>>>>> do it with 1.5.  More supported JDKs == more potential users, so 
>>>>> I
>>>>> like to see a real reason to bump the JDK level.
>>>>>>
>>>>>> But using concurrency features in "Commons Math" would also
>>>>>> contradict
>>>>>> your own point ("we should maintain a single-threaded
>>>>>> version"): I
>>>>>> agree,
>>>>>> and that's why I proposed this other project...
>>>>>>
>>>>>> As for efficiency (or faster execution, if you want), I don't
>>>>>> see the
>>>>>> point in doubting that tasks like global search (e.g. in a
>>>>>> genetic
>>>>>> algorithm) will complete in less time when run in parallel...
>>>>>>
>>>>>> As I summarized previously, having a "Commons Math MT" would bring
>>>>>> no inconvenience, contrary to any of your points 0, 1, or 2. (No
>>>>>> inconvenience, that is, not to me but to people with requirements
>>>>>> like "Java 5 compatible" or "no multi-threading".)
>>>>>> As I indicated, the basic "task" could be defined in "Commons
>>>>>> Math" and
>>>>>> "Commons Math MT" would provide the parallelization "glue" (e.g.
>>>>>> to divide
>>>>>> the search space of the GA).
>>>>>
>>>>> I think it is best at this point to cut a branch and actually
>>>>> start
>>>>> working on specific algorithms.  Having a set of candidate
>>>>> algorithms for parallelization will help us decide what we
>>>>> actually
>>>>> need and how it might work.  I would personally favor the 4.x
>>>>> approach, with thread-spawning behavior configurable.
>>>>
>>>> It seems fair to wait until parallel algorithms are actually
>>>> implemented.
>>>>
>>>> However it is not clear what you mean with "the 4.x approach": if
>>>> it is
>>>> actually allowing Java 7, that would mean that, starting from 4.0,
>>>> we'll
>>>> indeed drop support of earlier JVMs!
>>>> Why would this be preferred to having 2 projects? Of course, if
>>>> everyone
>>>> agrees to that move to Java 7, that's fine. :-)
>>>
>>> What I meant was that instead of creating a new component, we would
>>> just create a new release line.  Like what tomcat does for servlet
>>> spec versions.  I guess this does mean that we end up having to
>>> stabilize the 3.x APIs because no additional "major" release would
>>> be allowed in that line.  That would be a *good thing* IMO as long
>>> as we can do it cleanly.  If not, maybe we end up having to use 5.x
>>> for the JDK 1.7+ version, using 4.0 to get to a stable API for the
>>> current trunk code.
>>
>> There's still the human resource problem: we don't have it to
>> maintain a single branch; having two will only make it worse.
>
> Yes, but the "new project" approach has the same problem.

Yes.
However, I meant it as a way to separate concerns, as shown
by diverging opinions, even among the few people who take part
in this discussion or in previous ones about the same subject.

A sibling (not separate!) project could allow interested
people to experiment while not adding yet another "distraction"
to the main project, where people more focused on the
mathematical (for lack of a better word) side can continue
their own improvements.
A healthy interaction could even come out of having a "public"
use-case in the form of a project that needs certain facilities
(algorithms as tasks) in order to provide multi-thread
utilities to users (who might prefer not to have to implement
them themselves at a higher level).

>>>> On the other hand, if we keep Java 5, at least until we get use
>>>> cases or
>>>> contributions that would benefit from features in JDKs newer than
>>>> 1.5,
>>>> there is no need to create a branch; we can just go on with adding
>>>> multi-thread codes to the trunk (to become part[1] of the upcoming
>>>> 3.x
>>>> releases).
>>>
>>> That is why I wanted to get a feel for what the JDK 1.7 stuff 
>>> really
>>> buys you.   Has anyone seen benchmarks showing better performance
>>> using 1.7 than can be obtained just using 1.5 concurrency
>>> primitives?
>>
>> Again, there are separate issues:
>>  1. Coding in Java 7
>>  2. Running with the JVM shipped with JDK 1.7
>>
>> The newer JVMs are faster, independently of whether new features
>> of the
>> language are used.
>> But it could well be that some of the new features allow even better
>> performance (as is foreseen for Java 8).
>
> Agreed.  I am interested in understanding better both how much
> easier it actually is to code and whether the 1.7 framework
> materially improves scheduling / allocation over what you could do
> just using 1.5 primitives.

I cannot provide proof, but nor is anyone on this list
eager to prove the contrary; hence the proposal to set
up a "playground".
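
As a starting point for such a playground, here is a minimal sketch that
computes the same sum in two ways: with the Java 5 executor framework and
with the Java 7 fork/join framework. The class name "SumPlayground", the
method names and the splitting threshold are illustrative assumptions, not
existing Commons Math code, and the sketch says nothing about which variant
is faster; that would have to be measured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.concurrent.RecursiveTask;

// Hypothetical playground class; nothing here is Commons Math API.
class SumPlayground {

    // Java 5 style: one Callable per fixed-size chunk, results combined by hand.
    static double sumWithExecutor(final double[] data, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<Double>> parts = new ArrayList<Future<Double>>();
            int chunk = (data.length + nThreads - 1) / nThreads;
            for (int i = 0; i < nThreads; i++) {
                final int from = Math.min(data.length, i * chunk);
                final int to = Math.min(data.length, from + chunk);
                parts.add(pool.submit(new Callable<Double>() {
                    public Double call() {
                        double s = 0;
                        for (int k = from; k < to; k++) {
                            s += data[k];
                        }
                        return s;
                    }
                }));
            }
            double total = 0;
            for (Future<Double> f : parts) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Java 7 style: recursive splitting; the pool schedules the sub-tasks.
    static class SumTask extends RecursiveTask<Double> {
        private static final int THRESHOLD = 1000; // arbitrary cut-off (assumption)
        private final double[] data;
        private final int from;
        private final int to;

        SumTask(double[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Double compute() {
            if (to - from <= THRESHOLD) {
                double s = 0;
                for (int k = from; k < to; k++) {
                    s += data[k];
                }
                return s;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            left.fork();                                        // run left half asynchronously
            double right = new SumTask(data, mid, to).compute(); // right half in this thread
            return right + left.join();
        }
    }

    static double sumWithForkJoin(double[] data, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new SumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }
}
```

Both methods add the elements in a different order than a plain loop, so
for general floating-point data the results may differ in the last bits.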

>>> Has anyone used 1.7 to parallelize numerical algorithms
>>> and found it really easier / more performant?
>>
>> Where are those people who could answer?
>
> This is a public list :)
>> That is one of the points I raised. If we maintain source
>> compatibility
>> with a language version that is 9 years old, not many contributors
>> are
>> going to be interested. Thus reducing the chance to get answers...
>>
>>> Any opinions /
>>> responses to Konstantin's comment on where parallelization should 
>>> be
>>> implemented - i.e. in the library vs somewhere up the stack?
>>
>> What was the _question_?  ...
>
> The question he implicitly raised was whether or not it makes sense
> for a low-level library to parallelize tasks / run across cores.

In several areas, CM is not a low-level library (GA, multi-start
optimizers for example). In other areas like FFT, a user can
legitimately expect top performance without having to handle
parallelization by himself.

> This is a legitimate question.  It may be better actually to set
> things up so that higher-level frameworks or applications can
> arrange parallel execution rather than embedding it in the low-level
> library itself.  This is also what I was referring to when I said
> that in some contexts, thread-spawning / cpu hogging may not be
> desirable.

For several cases (GA, FFT, multi-start optimizers), I have the
opposite viewpoint: multi-threading is an implementation detail,
that could be handled at a _lower_ level. Of course, the user can
decide whether to enable more than one thread.

>>>  Any
>>> ideas how to set things up so that [math] code can play nicely with
>>> concurrency frameworks?
>>
>> That's a strange question in the context of a project that tries 
>> hard
>> not to have any dependency.
>
> I did not mean necessarily to bring in dependencies; but rather to
> make it easy for computational tasks executed by [math] code to be
> managed by external concurrency frameworks, e.g. Hadoop.

In the context of Commons Math, we often heard that "no dependency"
is good. Then, it is also good to not impose _implicit_ dependencies
(like: "If you use Hadoop, you could have better performance"). In a
way, the CM development "model" is: "We provide a toolkit of efficient
procedures, and you, the user, get top performance (on a best effort
basis of course)."
If we can provide better performance through multi-threading, why not?
Nobody will be forced to use it: they will use the "basic" (sequential)
tasks, or set the "parallelizationLevel" setting to 1.
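
To make the idea concrete, here is a rough sketch of how such a setting
could behave, using only java.util.concurrent from Java 5. Everything in it
is hypothetical: the class "MultiStartSketch", the method "best" and the toy
task are assumptions for illustration, and "parallelizationLevel" is only
the name used in this thread, not an existing Commons Math option. With a
level of 1 the tasks run in the calling thread and no thread is created.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; not an existing Commons Math class.
class MultiStartSketch {
    private final int parallelizationLevel;

    MultiStartSketch(int parallelizationLevel) {
        if (parallelizationLevel < 1) {
            throw new IllegalArgumentException("level must be >= 1");
        }
        this.parallelizationLevel = parallelizationLevel;
    }

    // Toy "basic" sequential task: evaluates (x - 2)^2 at one start point.
    // It stands in for one run of a local optimizer.
    static Callable<Double> task(final double start) {
        return new Callable<Double>() {
            public Double call() {
                return (start - 2) * (start - 2);
            }
        };
    }

    // Runs all tasks and returns the smallest result
    // (positive infinity if the list is empty).
    double best(List<Callable<Double>> tasks) {
        List<Double> results = new ArrayList<Double>();
        try {
            if (parallelizationLevel == 1) {
                // Sequential: everything happens in the caller's thread.
                for (Callable<Double> t : tasks) {
                    results.add(t.call());
                }
            } else {
                ExecutorService pool = Executors.newFixedThreadPool(parallelizationLevel);
                try {
                    for (Future<Double> f : pool.invokeAll(tasks)) {
                        results.add(f.get());
                    }
                } finally {
                    pool.shutdown();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        double min = Double.POSITIVE_INFINITY;
        for (double r : results) {
            min = Math.min(min, r);
        }
        return min;
    }
}
```

The tasks themselves contain no concurrency code, so they could live in the
sequential library while only the "glue" in "best" knows about threads.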

Gilles

> Phil
>> If the requirement is to only depend on the standard JDK: the
>> framework
>> is in
>>  java.util.concurrent
>> and all we need to do is to define "tasks" that can be submitted to
>> an executor:
>>
>> 
>> http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/AbstractExecutorService.html#submit(java.util.concurrent.Callable)
>>
>>
>> Regards,
>> Gilles
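
A minimal sketch of that idea, assuming a hypothetical task type "MeanTask"
(the name and the computation are illustrative only): the task is plain
sequential code implementing Callable, and the caller supplies the
ExecutorService, so the library side never creates a thread by itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical task type; not an existing Commons Math class.
class MeanTask implements Callable<Double> {
    private final double[] values;

    MeanTask(double[] values) {
        this.values = values.clone(); // defensive copy keeps the task thread-safe
    }

    // Plain sequential code: no threads are created here.
    public Double call() {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return values.length == 0 ? Double.NaN : sum / values.length;
    }

    // The caller owns the executor and decides how many threads it uses.
    static double runOn(ExecutorService executor, MeanTask task) {
        Future<Double> result = executor.submit(task);
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

The same task object can be run directly with call() by single-threaded
users, or handed to any executor an application already manages.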



