spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joseph K. Bradley (JIRA)" <>
Subject [jira] [Commented] (SPARK-20179) Major improvements to Spark's Prefix span
Date Fri, 31 Mar 2017 16:52:41 GMT


Joseph K. Bradley commented on SPARK-20179:

Thanks for the thoughts & work.  Sean's right that these practices are described in the
contributing guide, as well as a lot of other helpful info.  I'd recommend a few things:
* Split the proposals up into smaller pieces.  Putting everything into 1 JIRA and/or PR makes
it hard for reviewers to understand what is being proposed and how the changes interact.
* Make JIRA titles and descriptions very clear in terms of what the key change is.  If it's
multiple changes, can these be broken into separate parts and added incrementally?  If the
changes are related, it can be OK to create an umbrella JIRA which gives a holistic view;
you can put the actual changes and PRs under subtasks.
* Start with the smallest incremental changes you're interested in to get familiar with the
contribution process.
* Keep the perspective of reviewers in mind: If the code is long or complex to describe, it's
going to be overwhelming to reviewers who have never seen it.


> Major improvements to Spark's Prefix span
> -----------------------------------------
>                 Key: SPARK-20179
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.0
>         Environment: All
>            Reporter: Cyril de Vogelaere
>   Original Estimate: 0h
>  Remaining Estimate: 0h
> The code I would like to push allows major performances improvement for single-item patterns
through the use of a CP based solver (Namely OscaR, an open-source solver). And slight perfomance
improved for multi-item patterns. As you can see in the log scale graph reachable from the
link below, the performance are at worse roughly the same but at best, up to 50x faster (FIFA
> Link to graph :
> Link to implementation :
> Link for datasets :
(the two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G memory (more
than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new fonctionnalities
>   - Unlimited Max pattern length (with input 0) : No improvement in perfomance here,
simply allows the use of unlimited max pattern length.
>   - Min pattern length : Any item below that length won't be outputted. No improvement
in performance, just a new functionnality.
>   - Max Item per itemset : An itemset won't be grown further than the inputed number,
thus reducing the search space. 
>   - Head start : During the initial dataset cleaning, the frequent item were found then
discarded. Which resulted in a inefficient first iteration of the genFreqPattern method. The
algorithm new uses them if they are provided, and uses the empty pattern in case they're not.
Slight improvement of performances were found.
>   - Sub-problem limit : When resulting item sequence can be very long and the user disposes
of a small number of very powerfull machine, this parameter will allow a quick switch to local
execution. Tremendously improving performances. Outside of those conditions, the performance
may be the negatively affected.
>   - Item constraints : Allow the user to specify constraint on the occurences of an item
(=, >, <, >=, <=, !=). For user who are looking for some specific results, the
search space can be greatly reduced. Which also improve performances.
> Of course, all of theses added feature where tested for correctness. As you can see on
the github link.
> Please take note that the afformentionned fonctionnalities didn't come into play when
testing the performance. The performance shown on the graph are 'merely' the result of the
remplacement of the local execution by a CP based algorithm. (maxLocalProjDBSize was also
kept to it's default value (32000000L), to make sure the performance couldn't be artificially
> Trade offs :
>   - The algorithm does have a slight trade off. Since the algorithm needs to detect whether
sequences can use the specialised single-item patterns CP algorithm. It may be a bit slower
when it is not needed. This trade of was mitigated by introducing a round sequence cleaning
before the local execution. Thus improving the performance of multi-item local executions
when the cleaning is effective. In case the no item can be cleaned, the check will appear
in the performence, creating a slight drop in them.
>   - All other change provided shouldn't have any effect on efficiency or complexity if
left to their default value. (Where they are basically desactivated). When activated, they
may however reduce the search space and thus improve performance.
> Additionnal note :
>  - The performance improvement are mostly seen far datasets where all itemset are of
size one, since it is a necessary condition for the use of the CP based algorithm. But as
you can see in the two Slen datasets, performances were also slightly improved for algorithms
which have multiple items per itemset. The algorithm was built to detect when CP can be used,
and to use it, given the opportunity.
>  - The performance displayed here are the results of six months of work. Various other
things were tested to improve performance, without as much success. I can thus say with a
bit of confidence that the performances here attained will be very hard to improve further.
>  - In case you want me to test the performance on a specific dataset or to provide additionnal
informations, it would be my pleasure to do so :). Just it me up with my email below :
> Email :
>  - I am a newbie contributor to Spark, and am not familliar with the whole procedure
at all. In case, I did something incorrectly, I will fix it as soon as possible.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message