spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <>
Subject [jira] [Resolved] (SPARK-20114) parity for sequential pattern mining - PrefixSpan
Date Mon, 07 May 2018 21:58:00 GMT


Joseph K. Bradley resolved SPARK-20114.
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 20973

> parity for sequential pattern mining - PrefixSpan
> ----------------------------------------------------------
>                 Key: SPARK-20114
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: yuhao yang
>            Assignee: Weichen Xu
>            Priority: Major
>             Fix For: 2.4.0
> Creating this JIRA to track feature parity for PrefixSpan and sequential pattern
mining in Spark ML with the DataFrame API.
> First, a few design issues are listed for discussion; subtasks for the Scala, Python and
R APIs will be created afterwards.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be straightforward.
However, PrefixSpan only extracts frequent sequential patterns, which are not directly usable
for predicting on new records. Please read
for some background knowledge. Thanks to Philippe Fournier-Viger for providing insights. If we
want to keep using the Estimator/Transformer pattern, the options are:
>      #*  Implement a dummy transform for PrefixSpanModel that does not add a new column
to the input Dataset. The PrefixSpanModel is then only used to provide access to the frequent
sequential patterns.
>      #*  Add a feature to extract sequential rules from the sequential patterns, then
use those rules in transform, as FPGrowthModel does. The extracted rules have the
form X -> Y, where X and Y are sequential patterns. In practice, however, these rules are not
very good: they are too precise and thus not noise tolerant.
> # Unlike the case of association rules and frequent itemsets, sequential rules can be extracted
directly from the original dataset more efficiently, using algorithms like RuleGrowth and ERMiner.
These rules have the form X -> Y, where the items within X and within Y are unordered but X must
appear before Y; this is more general and can work better in practice for prediction.
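
The option of deriving rules X -> Y from already-mined sequential patterns can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the Spark implementation: each sequence element is a single item rather than an itemset, supports are counted by brute force on a toy database, and the names `is_subsequence`, `support` and `rules_from_patterns` are hypothetical. Confidence is taken as support(X followed by Y) / support(X).

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so later items can only match later positions.
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def rules_from_patterns(patterns, min_confidence):
    """Split each pattern into prefix X and suffix Y and keep confident rules.

    `patterns` maps a tuple of items to its support count (assumed input shape).
    """
    rules = []
    for pattern, sup in patterns.items():
        for cut in range(1, len(pattern)):
            antecedent, consequent = pattern[:cut], pattern[cut:]
            sup_x = patterns.get(antecedent)
            if not sup_x:
                continue  # prefix was not mined (or has zero support); skip
            confidence = sup / sup_x
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


# Toy database of sequences (made up for illustration).
database = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
candidates = [("a",), ("b",), ("c",), ("a", "b"), ("a", "c"), ("b", "c"),
              ("a", "b", "c")]
patterns = {p: support(p, database) for p in candidates}
rules = rules_from_patterns(patterns, min_confidence=0.6)

for x, y, conf in rules:
    print(f"{x} -> {y}  confidence={conf:.2f}")
```

Because both sides of such a rule are themselves ordered patterns, a sequence matches only if it follows the exact item order, which is the precision problem mentioned in the description.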
> I'd like to hear more from users about which kind of sequential rules is more practical.
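
To make the comparison concrete, here is a minimal sketch of how a partially-ordered rule X -> Y (items within X and within Y unordered, X before Y) could be evaluated against a sequence database. It assumes the usual definition from the RuleGrowth line of work as I understand it: the rule occurs in a sequence if there is a split point with all of X before it and all of Y after it, and confidence is the number of such sequences divided by the number of sequences containing all items of X. The function names and the toy data are hypothetical, and sequence elements are single items for simplicity.

```python
def rule_occurs(x, y, sequence):
    """True if all items of x appear before all items of y in `sequence`."""
    x, y = set(x), set(y)
    seen = set()
    for k, item in enumerate(sequence):
        seen.add(item)
        if x <= seen:
            # Earliest position where X is complete leaves the largest
            # remainder, so checking Y against it is sufficient.
            return y <= set(sequence[k + 1:])
    return False


def rule_stats(x, y, database):
    """Return (support, confidence) of the partially-ordered rule x -> y."""
    occurs = sum(rule_occurs(x, y, seq) for seq in database)
    contains_x = sum(set(x) <= set(seq) for seq in database)
    support = occurs / len(database)
    confidence = occurs / contains_x if contains_x else 0.0
    return support, confidence


def is_subsequence(pattern, sequence):
    """Strict, fully ordered containment, used here only for contrast."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


# Toy database (made up for illustration).
database = [
    ("a", "b", "c", "d"),
    ("b", "a", "d", "c"),  # same items, different order inside X and Y
    ("a", "b", "d"),
    ("c", "a", "b"),
]

stats = rule_stats({"a", "b"}, {"c", "d"}, database)
strict_matches = sum(is_subsequence(("a", "b", "c", "d"), s) for s in database)

print("partially-ordered rule (support, confidence):", stats)
print("sequences matching the strict pattern a,b,c,d:", strict_matches)
```

In this toy data the second sequence matches the partially-ordered rule but not the strictly ordered pattern, which illustrates the noise tolerance argued for above.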

This message was sent by Atlassian JIRA
