ignite-issues mailing list archives

From "Anton Dmitriev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IGNITE-7437) Partition based dataset implementation
Date Sun, 21 Jan 2018 17:20:00 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-7437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Dmitriev updated IGNITE-7437:
-----------------------------------
    Description: 
We want to implement our dataset based on entire partitions instead of key sets.

 

*The main idea behind the partition-based datasets is the classic [MapReduce|https://en.wikipedia.org/wiki/MapReduce].*

The most important advantage of MapReduce is the ability to perform computations on data
distributed across the cluster without significant data transmission over the network.
This idea is adopted in the partition-based datasets in the following way:

1. Every dataset or learning context consists of partitions.
2. Partitions are built on top of the Apache Ignite Cache partitions (as the primary storage).
3. Computations that need to be performed on a dataset or learning context are split into Map
operations, which execute on every partition, and Reduce operations, which combine the results
of the Map operations into one final result.
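The Map/Reduce split in step 3 can be sketched as follows. The `DatasetSketch` type, its `compute` method, and the in-memory list of partitions are illustrative placeholders, not the actual Ignite ML API; in the real system the map function would run locally on the node holding each partition:

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical sketch of a partition-based dataset: a collection of
// partitions plus a MapReduce-style compute method.
public class DatasetSketch<P> {
    private final List<P> partitions;

    public DatasetSketch(List<P> partitions) {
        this.partitions = partitions;
    }

    // Map runs on every partition (locally, next to the data, in the
    // real system); Reduce folds the per-partition results into one
    // final result.
    public <R> R compute(Function<P, R> map, BinaryOperator<R> reduce, R identity) {
        R result = identity;
        for (P part : partitions)
            result = reduce.apply(result, map.apply(part));
        return result;
    }

    public static void main(String[] args) {
        // Two "partitions" of doubles; compute the global sum.
        DatasetSketch<double[]> ds = new DatasetSketch<>(List.of(
            new double[]{1, 2, 3}, new double[]{4, 5}));
        double sum = ds.compute(
            p -> java.util.Arrays.stream(p).sum(), // map: local per-partition sum
            Double::sum,                           // reduce: combine partial sums
            0.0);
        System.out.println(sum); // prints 15.0
    }
}
```

Only the small per-partition results cross the network, not the partition data itself, which is the point of adopting MapReduce here.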

_Why have partitions been selected as the building block of datasets and learning contexts
instead of cluster nodes?_

One of the fundamental ideas of Apache Ignite Cache is that partitions are atomic, which means
that they cannot be split between multiple nodes. As a result, in case of rebalancing or
node failure, a partition will be recovered on another node with the same data it contained on
the previous node.

For machine learning algorithms this is very important, because most ML algorithms are
iterative and require some context to be maintained between iterations. This context cannot
be split or merged and should be kept in a consistent state during the whole learning
process.

*Another idea behind the partition-based datasets is that we need to keep the data (in every
partition) in a [BLAS|https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms]-like format
as much as possible.*

[BLAS|https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms] and [CUDA|https://en.wikipedia.org/wiki/CUDA]
make machine learning algorithms up to 100x faster and more reliable than algorithms based on
self-written linear algebra subroutines, which means that not using BLAS is a recipe for
disaster. In other words, we need to keep data in a BLAS-like format at any price.
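A sketch of what "BLAS-like" means in practice: a partition's data stored as one flat, dense array (column-major here, the Fortran/BLAS convention) rather than as a collection of row objects. The class and method names are illustrative only:

```java
// Illustrative sketch: partition data laid out as one flat dense array,
// the layout BLAS routines (e.g. dgemv) operate on directly. Storing each
// row as a separate object would force a copy before every BLAS call.
public class BlasLayoutSketch {
    // Hand-rolled y = A * x over a column-major array; with a real BLAS
    // binding this would be a single dgemv call on the same flat array.
    public static double[] gemvColMajor(double[] a, int rows, int cols, double[] x) {
        double[] y = new double[rows];
        // Column-major convention: element (i, j) lives at index i + j * rows.
        for (int j = 0; j < cols; j++)
            for (int i = 0; i < rows; i++)
                y[i] += a[i + j * rows] * x[j];
        return y;
    }

    public static void main(String[] args) {
        // Matrix [[1, 2, 3], [4, 5, 6]] stored column by column.
        double[] a = {1, 4,   2, 5,   3, 6};
        double[] y = gemvColMajor(a, 2, 3, new double[]{1, 1, 1});
        System.out.println(y[0] + " " + y[1]); // prints 6.0 15.0
    }
}
```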

  was:We want to implement our dataset based on entire partition instead of key sets.


> Partition based dataset implementation
> --------------------------------------
>
>                 Key: IGNITE-7437
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7437
>             Project: Ignite
>          Issue Type: New Feature
>          Components: ml
>            Reporter: Yury Babak
>            Assignee: Anton Dmitriev
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
