Mailing-List: contact dev-help@commons.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Commons Developers List" <dev@commons.apache.org>
Received-SPF: pass (athena.apache.org: domain of phil.steitz@gmail.com
 designates 209.85.223.177 as permitted sender)
Message-ID: <511509C8.9070503@gmail.com>
Date: Fri, 08 Feb 2013 06:20:56 -0800
From: Phil Steitz <phil.steitz@gmail.com>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
 rv:17.0) Gecko/20130107 Thunderbird/17.0.2
MIME-Version: 1.0
To: Commons Developers List <dev@commons.apache.org>
Subject: Re: [Math] Moving on or not?
References: <b4c234d348388d6747a9c750d0ff6839@scarlet.be>
 <51127493.1020106@gmail.com> <db8b90eeacdf4082fd63b72eb81a55b2@scarlet.be>
 <5112970F.9030705@gmail.com> <c71a2b95b11e9a98198963ef5d0e8722@scarlet.be>
 <5113C1D6.3040206@gmail.com> <48278b4e601bd8eb89c889a46161884e@scarlet.be>
 <5113D72E.4070004@gmail.com> <7053ecaf56f408359e4e58f31974523e@scarlet.be>
 <-4789801613352334632@unknownmsgid> <5114B1A3.6000607@free.fr>
In-Reply-To: <5114B1A3.6000607@free.fr>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit

On 2/8/13 12:04 AM, Luc Maisonobe wrote:
> On 08/02/2013 03:21, Konstantin Berlin wrote:
>> Sorry, but none of this is making sense to me. We had a long discussion
>> about how the library doesn't test for large scale problem
>> performance. A lot of algorithms probably do not scale well as a
>> result. There was talk of dropping sparse support in linear algebra.
>> So instead of fixing that, you jump to parallelization, which is
>> needed only for large scale problems, which this library does not
>> handle well even in single thread right now.
>>
>> The most significant impact you can have is fixing the linear algebra
>> component.
> I agree with this. Also in order to avoid spreading our attention too
> much on keeping several branches in sync, I would suggest to not create
> a new component but directly decide we will not support Java 5 anymore
> as of Apache Commons Math 4.0, so people can progressively use the new
> features of the language and experiment directly on the trunk.

Actually, to get anything, you would need to bump to 1.7, abandoning
1.6 as well.  That would effectively mean abandoning a large segment
(likely the majority) of the user base.  I would not like to do
that.  So if we don't have the energy to maintain two lines, I would
say hold off requiring Java 7 until we have stabilized the API and
fixed things like above.  
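
For concreteness, the main concurrency addition in 1.7 is the fork/join
framework (java.util.concurrent.ForkJoinPool and RecursiveTask), which
does not exist in 1.5 or 1.6.  Below is a minimal sketch of what code
written against it looks like.  The class names, the threshold and the
summation problem are made up for illustration; this is not [math] code
and it says nothing about how it performs compared with plain 1.5
executors.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch (not Commons Math code): divide-and-conquer sum
// using the fork/join classes that first shipped with JDK 1.7.
final class ForkJoinSumSketch {

    /** Recursively splits the index range [from, to) until it is small enough. */
    static final class SumTask extends RecursiveTask<Double> {
        // Arbitrary cut-off below which the range is summed sequentially.
        private static final int THRESHOLD = 1000;

        private final double[] data;
        private final int from;
        private final int to;

        SumTask(double[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Double compute() {
            if (to - from <= THRESHOLD) {
                double s = 0.0;
                for (int i = from; i < to; i++) {
                    s += data[i];
                }
                return s;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                      // schedule the left half asynchronously
            double rightSum = right.compute(); // work on the right half in this thread
            return left.join() + rightSum;    // wait for the left half and combine
        }
    }

    /** Sums the array using a pool with the requested number of worker threads. */
    static double sum(double[] data, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new SumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] data = new double[10000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data, 4));
    }
}
```

Code like this cannot be compiled for a 1.6 target, which is the
compatibility cost I am talking about.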

Phil
>
> best regards,
> Luc
>
>> On Feb 7, 2013, at 5:06 PM, Gilles <gilles@harfang.homelinux.org> wrote:
>>
>>> On Thu, 07 Feb 2013 08:32:46 -0800, Phil Steitz wrote:
>>>> On 2/7/13 8:04 AM, Gilles wrote:
>>>>> On Thu, 07 Feb 2013 07:01:42 -0800, Phil Steitz wrote:
>>>>>> On 2/7/13 4:58 AM, Gilles wrote:
>>>>>>> On Wed, 06 Feb 2013 09:46:55 -0800, Phil Steitz wrote:
>>>>>>>> On 2/6/13 9:03 AM, Gilles wrote:
>>>>>>>>> On Wed, 06 Feb 2013 07:19:47 -0800, Phil Steitz wrote:
>>>>>>>>>> On 2/5/13 6:08 AM, Gilles wrote:
>>>>>>>>>>> Hi.
>>>>>>>>>>>
>>>>>>>>>>> In the thread about "static import", Stephen noted that decisions
>>>>>>>>>>> on a component's evolution are dependent on whether the future of
>>>>>>>>>>> the Java language is taken into account, or not.
>>>>>>>>>>> A question on the same theme also arose after the presentation of
>>>>>>>>>>> Commons Math in FOSDEM 2013.
>>>>>>>>>>>
>>>>>>>>>>> If we assume that efficiency is among the important qualities for
>>>>>>>>>>> Commons Math, the future is to allow usage of the tools provided by
>>>>>>>>>>> the standard Java library in order to ease the development of
>>>>>>>>>>> multi-threaded algorithms.
>>>>>>>>>>>
>>>>>>>>>>> Maintaining Java 1.5 source compatibility for the reason that we
>>>>>>>>>>> may need to support legacy applications will turn out to be
>>>>>>>>>>> self-defeating:
>>>>>>>>>>> 1. New users will not consider Commons Math's features that are
>>>>>>>>>>>    notably apt to parallel processing.
>>>>>>>>>>> 2. Current users might at some point simply switch to another
>>>>>>>>>>>    library if it proves more efficient (because it actually uses
>>>>>>>>>>>    multi-threading).
>>>>>>>>>>> 3. New Java developers will be turned away because they will want
>>>>>>>>>>>    to use the more convenient features of the language in order to
>>>>>>>>>>>    provide potential contributions.
>>>>>>>>>>>
>>>>>>>>>>> If maintaining 1.5 source compatibility is kept as a requirement,
>>>>>>>>>>> the consequence is that Commons Math will _become_ a legacy library.
>>>>>>>>>>> In that perspective, implementing/improving algorithms for which a
>>>>>>>>>>> parallel version is known to be more efficient is plainly a waste of
>>>>>>>>>>> development and maintenance time.
>>>>>>>>>>>
>>>>>>>>>>> In order to mitigate the risks (both of upgrading and of not
>>>>>>>>>>> upgrading the source compatibility requirement), I would propose to
>>>>>>>>>>> create a new project (say, "Commons Math MT") where we could
>>>>>>>>>>> implement new features[1] without being encumbered with the 1.5
>>>>>>>>>>> requirement.[2]
>>>>>>>>>>> The "Commons Math MT" would depend on "Commons Math" where we would
>>>>>>>>>>> continue developing single-thread (and thread-safe) "tasks", i.e.
>>>>>>>>>>> independent units of processing that could be used in algorithms
>>>>>>>>>>> located in "Commons Math MT".
>>>>>>>>>>>
>>>>>>>>>>> In summary:
>>>>>>>>>>> - Commons Math (as usual):
>>>>>>>>>>>  * single-thread (sequential) algorithms,
>>>>>>>>>>>  * (pure) Java 5,
>>>>>>>>>>>  * no dependencies.
>>>>>>>>>>> - Commons Math MT:
>>>>>>>>>>>  * multi-thread (parallel) algorithms,
>>>>>>>>>>>  * Java 7 and beyond,
>>>>>>>>>>>  * JNI allowed,
>>>>>>>>>>>  * dependencies allowed (jCuda).
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>> There are several other possibilities to consider:
>>>>>>>>>>
>>>>>>>>>> 0) Implement multithreading using JDK 1.5 primitives
>>>>>>>>>> 1) Set things up within [math] to support parallel execution in
>>>>>>>>>>    JDK 1.7, Hadoop or other frameworks
>>>>>>>>>> 2) Instead of a new project, start a 4.x branch targeting JDK 1.7
>>>>>>>>>>
>>>>>>>>>> I think we should maintain a version that has no dependencies and
>>>>>>>>>> no JNI in any case.
>>>>>>>>>>
>>>>>>>>>> Starting a branch and getting concrete about how to parallelize some
>>>>>>>>>> algorithms would be a good way to start.  One thing I have not
>>>>>>>>>> really investigated and would be interested in details on is what
>>>>>>>>>> you actually get in efficiency gain (or loss?) using fork / join vs
>>>>>>>>>> just using 1.5+ concurrency for the kinds of problems we would end
>>>>>>>>>> up using this stuff for.
>>>>>>>>>>
>>>>>>>>>> Thinking about specific parallelization problem instances would also
>>>>>>>>>> help decide whether 1) makes sense (i.e., whether it makes sense as
>>>>>>>>>> you mention above to maintain a single-threaded library that
>>>>>>>>>> provides task execution for a multithreaded version or multithreaded
>>>>>>>>>> frameworks).
>>>>>>>>>>
>>>>>>>>>> One more thing to consider is that for at least some users of
>>>>>>>>>> [math], having the library internally spawn threads and/or peg
>>>>>>>>>> multiple processors may not be desirable.  It is a little misleading
>>>>>>>>>> to say that multithreading is the way to get "efficiency."  It is
>>>>>>>>>> really the way to *use* more compute resources and unless there are
>>>>>>>>>> real algorithmic improvements, the overall efficiency may actually
>>>>>>>>>> be less, due to task coordination overhead.  What you get is faster
>>>>>>>>>> execution due to more greedy utilization of available cores.  Actual
>>>>>>>>>> efficiency (how much overall compute resource it takes to complete a
>>>>>>>>>> job) partly depends on how efficiently the coordination itself is
>>>>>>>>>> done (which JDK 1.7 claims to do very well - I have just not seen
>>>>>>>>>> substantiation or any benchmarks demonstrating this) and how the
>>>>>>>>>> parallelization affects overall compute requirements.  In any case,
>>>>>>>>>> for environments where library thread-spawning is not desirable, I
>>>>>>>>>> think we should maintain a single-threaded version.
>>>>>>>>> Unless I missed the point, those reasons are exactly why I propose
>>>>>>>>> to have 2 projects/components. One, "Commons-Math", does not fiddle
>>>>>>>>> with resources, while the other would provide a
>>>>>>>>> "parallelizationLevel" setting for the algorithms written to
>>>>>>>>> possibly take advantage of the Java 5+ "task framework".
>>>>>>>> OK, what about the 4.x option?
>>>>>>>>> Yes, we could still be good by using only Java 5's concurrency
>>>>>>>>> features but the issue I raise is not only about concurrency but
>>>>>>>>> about evolution/progress/maintenance, all things that require
>>>>>>>>> raising interest from new contributors (unless it's fine that
>>>>>>>>> Commons Math be tagged as a "library of the past"...).
>>>>>>>> +1 for experimenting with parallelization.  I would just like to
>>>>>>>> understand if the JDK 7 stuff really adds much - in particular, does
>>>>>>>> it handle coordination / cpu allocation better than you could easily
>>>>>>>> do it with 1.5?  More supported JDKs == more potential users, so I
>>>>>>>> would like to see a real reason to bump the JDK level.
>>>>>>>>> But using concurrency features in "Commons Math" would also
>>>>>>>>> contradict your own point ("we should maintain a single-threaded
>>>>>>>>> version"): I agree, and that's why I proposed this other project...
>>>>>>>>>
>>>>>>>>> As for efficiency (or faster execution, if you want), I don't see
>>>>>>>>> the point in doubting that tasks like global search (e.g. in a
>>>>>>>>> genetic algorithm) will complete in less time when run in parallel...
>>>>>>>>>
>>>>>>>>> As I summarized previously, having a "Commons Math MT" would bring
>>>>>>>>> no inconvenience, contrary to either your points 0, 1, or 2. [Not
>>>>>>>>> an inconvenience to me, that is, but to people with requirements
>>>>>>>>> like "Java 5 compatible" or "no multi-threading".]
>>>>>>>>> As I indicated, the basic "task" could be defined in "Commons Math"
>>>>>>>>> and "Commons Math MT" would provide the parallelization "glue"
>>>>>>>>> (e.g. to divide the search space of the GA).
>>>>>>>> I think it is best at this point to cut a branch and actually start
>>>>>>>> working on specific algorithms.  Having a set of candidate
>>>>>>>> algorithms for parallelization will help us decide what we actually
>>>>>>>> need and how it might work.  I would personally favor the 4.x
>>>>>>>> approach, with thread-spawning behavior configurable.
>>>>>>> It seems fair to wait until parallel algorithms are actually
>>>>>>> implemented.
>>>>>>>
>>>>>>> However it is not clear what you mean with "the 4.x approach": if it
>>>>>>> is actually allowing Java 7, that would mean that, starting from 4.0,
>>>>>>> we'll indeed drop support of earlier JVMs!
>>>>>>> Why would this be preferred to having 2 projects? Of course, if
>>>>>>> everyone agrees to that move to Java 7, that's fine. :-)
>>>>>> What I meant was that instead of creating a new component, we would
>>>>>> just create a new release line.  Like what tomcat does for servlet
>>>>>> spec versions.  I guess this does mean that we end up having to
>>>>>> stabilize the 3.x APIs because no additional "major" release would
>>>>>> be allowed in that line.  That would be a *good thing* IMO as long
>>>>>> as we can do it cleanly.  If not, maybe we end up having to use 5.x
>>>>>> for the JDK 1.7+ version, using 4.0 to get to a stable API for the
>>>>>> current trunk code.
>>>>> There's still the human resource problem: we don't have the people to
>>>>> maintain even a single branch; having two will only make it worse.
>>>> Yes, but the "new project" approach has the same problem.
>>> Yes.
>>> However, I meant it as a way to separate concerns, as shown
>>> by diverging opinions, even in the few people who take part
>>> in this discussion or in previous ones about the same subject.
>>>
>>> A sibling (not separate!) project could allow interested
>>> people to experiment while not adding yet another "distraction"
>>> to the main project, where people more focused on the
>>> mathematical (for lack of a better word) side can continue
>>> their own improvements.
>>> A healthy interaction could even come out of having a "public"
>>> use-case in the form of a project that needs certain facilities
>>> (algorithms as tasks) in order to provide multi-thread
>>> utilities to users (who might prefer not to have to implement
>>> them themselves at a higher level).
>>>
>>>>>>> On the other hand, if we keep Java 5, at least until we get use
>>>>>>> cases or contributions that would benefit from features in JDKs
>>>>>>> newer than 1.5, there is no need to create a branch; we can just go
>>>>>>> on with adding multi-thread codes to the trunk (to become part[1] of
>>>>>>> the upcoming 3.x releases).
>>>>>> That is why I wanted to get a feel for what the JDK 1.7 stuff really
>>>>>> buys you.   Has anyone seen benchmarks showing better performance
>>>>>> using 1.7 than can be obtained just using 1.5 concurrency
>>>>>> primitives?
>>>>> Again, there are separate issues:
>>>>> 1. Coding in Java 7
>>>>> 2. Running with the JVM shipped with JDK 1.7
>>>>>
>>>>> The newer JVMs are faster, independently of whether new features of
>>>>> the language are used.
>>>>> But it could well be that some of the new features allow even better
>>>>> performance (as is foreseen for Java 8).
>>>> Agreed.  I am interested in understanding better both how much
>>>> easier it actually is to code and whether the 1.7 framework
>>>> materially improves scheduling / allocation over what you could do
>>>> just using 1.5 primitives.
>>> I cannot provide proof, but nor is anyone on this list
>>> eager to prove the contrary; hence the proposal to set
>>> up a "playground".
>>>
>>>>>> Has anyone used 1.7 to parallelize numerical algorithms
>>>>>> and found it really easier / more performant?
>>>>> Where are those people who could answer?
>>>> This is a public list :)
>>>>> That is one of the points I raised. If we maintain source
>>>>> compatibility with a language version that is 9 years old, not many
>>>>> contributors are going to be interested. Thus reducing the chance to
>>>>> get answers...
>>>>>
>>>>>> Any opinions /
>>>>>> responses to Konstantin's comment on where parallelization should be
>>>>>> implemented - i.e. in the library vs somewhere up the stack?
>>>>> What was the _question_?  ...
>>>> The question he implicitly raised was whether or not it makes sense
>>>> for a low-level library to parallelize tasks / run across cores.
>>> In several areas, CM is not a low-level library (GA, multi-start
>>> optimizers for example). In other areas like FFT, a user can
>>> legitimately expect top performance without having to handle
>>> parallelization by himself.
>>>
>>>> This is a legitimate question.  It may be better actually to set
>>>> things up so that higher-level frameworks or applications can
>>>> arrange parallel execution rather than embedding it in the low-level
>>>> library itself.  This is also what I was referring to when I said
>>>> that in some contexts, thread-spawning / cpu hogging may not be
>>>> desirable.
>>> For several cases (GA, FFT, multi-start optimizers), I have the
>>> opposite viewpoint: multi-threading is an implementation detail,
>>> that could be handled at a _lower_ level. Of course, the user can
>>> decide whether to enable more than one thread.
>>>
>>>>>> Any
>>>>>> ideas how to set things up so that [math] code can play nicely with
>>>>>> concurrency frameworks?
>>>>> That's a strange question in the context of a project that tries hard
>>>>> not to have any dependency.
>>>> I did not mean necessarily to bring in dependencies; but rather to
>>>> make it easy for computational tasks executed by [math] code to be
>>>> managed by external concurrency frameworks, e.g. Hadoop.
>>> In the context of Commons Math, we often heard that "no dependency"
>>> is good. Then, it is also good to not impose _implicit_ dependencies
>>> (like: "If you use Hadoop, you could have better performance"). In a
>>> way, the CM development "model" is: "We provide a toolkit of efficient
>>> procedures, and you, the user, get top performance (on a best effort
>>> basis of course)."
>>> If we can provide better performance through multi-threading, why not?
>>> Nobody will be forced to use it: they will use the "basic" (sequential)
>>> tasks, or set the "parallelizationLevel" setting to 1.
>>>
>>> Gilles
>>>
>>>> Phil
>>>>> If the requirement is to only depend on the standard JDK: the
>>>>> framework is in java.util.concurrent and all we need to do is to
>>>>> define "tasks" that can be "submitted" to an executor:
>>>>>
>>>>> http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/AbstractExecutorService.html#submit(java.util.concurrent.Callable)
>>>>>
>>>>>
>>>>> Regards,
>>>>> Gilles
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>