systemml-issues mailing list archives

From "Glenn Weidner (JIRA)" <>
Subject [jira] [Updated] (SYSTEMML-1813) Preprocessing simplification and cleanup
Date Sat, 09 Sep 2017 05:09:00 GMT


Glenn Weidner updated SYSTEMML-1813:
    Fix Version/s:     (was: SystemML 1.0)
                   SystemML 0.15

> Preprocessing simplification and cleanup
> ----------------------------------------
>                 Key: SYSTEMML-1813
>                 URL:
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>             Fix For: SystemML 0.15
> In anticipation of near-future algorithmic improvements to the
> preprocessing to improve model training, this simplifies and cleans up
> the preprocessing code as follows.
> - Previously, we were processing all slides into one large saved
> DataFrame, and then splitting that DataFrame into train and validation
> DataFrames.  We should simplify this by splitting the slide numbers
> into train and validation sets, and then processing those slides
> separately.  This will effectively skip the creation of the large DataFrame,
> and remove the need to split that large DataFrame into train/val ones,
> which should provide a large performance benefit.  The DataFrame `union`
> method can be used to combine two DataFrames row-wise.
> - Previously, we maintained a list of "broken" slides that were manually
> removed.  We should remove that manual list, and instead add a
> try/except filtering step to automatically remove problematic slides.
> - We should move ad-hoc sampling code into a new `sample` function.
> - We should move code to add row indices to a DataFrame into a new
> `add_row_indices` function.
> The benefit is that near-future algorithmic improvements to the
> preprocessing code will be much easier to incorporate.
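
The first two items in the description (splitting slide numbers before processing, and filtering broken slides with try/except) can be illustrated with a minimal pure-Python sketch. This is not the project's actual code: `split_slide_nums`, `process_slides`, and `fake_process_slide` are hypothetical names, plain lists stand in for DataFrames, and the 80/20 split and seed are assumed values.

```python
import random
from typing import Callable, Iterable, List, Tuple


def split_slide_nums(slide_nums: Iterable[int], train_frac: float = 0.8,
                     seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle slide numbers and split them into train and validation lists."""
    nums = sorted(slide_nums)
    random.Random(seed).shuffle(nums)  # seeded so the split is reproducible
    cut = int(len(nums) * train_frac)
    return nums[:cut], nums[cut:]


def process_slides(slide_nums: Iterable[int],
                   process_slide: Callable[[int], list]) -> Tuple[list, list]:
    """Process each slide, skipping any slide whose processing raises.

    This replaces a manually maintained list of broken slides.
    """
    rows, skipped = [], []
    for n in slide_nums:
        try:
            rows.extend(process_slide(n))
        except Exception as err:  # broad on purpose: any failure marks the slide as broken
            skipped.append((n, repr(err)))
    return rows, skipped


def fake_process_slide(slide_num: int) -> list:
    """Hypothetical stand-in for real slide processing; slide 3 is 'broken'."""
    if slide_num == 3:
        raise IOError("cannot open slide 3")
    return [(slide_num, tile) for tile in range(2)]


# Split the slide numbers first, then process each set separately,
# so no combined dataset is ever built and re-split.
train_nums, val_nums = split_slide_nums(range(1, 11), train_frac=0.8)
train_rows, train_skipped = process_slides(train_nums, fake_process_slide)
val_rows, val_skipped = process_slides(val_nums, fake_process_slide)
```

In a Spark setting, the per-slide results within each set would presumably be DataFrames combined row-wise with `union`, as the description suggests, rather than extended Python lists.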

