Hi Stephan, hi Aljoscha,
thank you for your responses!
I am afraid I did not explain the idea of the algorithm well enough.
The workset stays constant until we find leaves in the tree. This is
because we will not split leaves, so we can remove the training data
tuples that are assigned to them. The tuples that are assigned to
non-leaf nodes are emitted again, but with another nodeID property,
i.e. they are replaced. No append to the set is performed (see the
map function below).
Aljoscha's question raises a good point. We did not have an
implementation of the workset modification function until yesterday
evening. Stephan is also right: we join the workset (training data)
with the solution set (nodes in the tree). Within the function that
receives the joined tuples we reassign the training data to the new
child nodes (k*2 or k*2+1, with k being the currently assigned node).
Training data that is assigned to leaves is removed from the workset.
The join looks like this:
val newWorkSet = data.join(nodes.filter { node => !node.isLeaf() })
  .where { data => data.nodeID }.isEqualTo { node => node.nodeID }
  .map { (data, node) =>
    new Data(
      if (data.point.features.elementAt(node.split.dim) <= node.split.value)
        data.nodeID * 2
      else
        data.nodeID * 2 + 1,
      data.point) }
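To double-check the reassignment logic itself, here is the same idea
modeled on plain Scala collections, outside of Flink. The Data, Node
and Split classes and their fields are simplified stand-ins for our
actual job classes, not the real ones:

```scala
// Plain-Scala model of the workset modification step (no Flink).
// All class and field names here are simplified assumptions.
case class Split(dim: Int, value: Double)
case class Node(nodeID: Int, isLeaf: Boolean, split: Split)
case class Data(nodeID: Int, features: Vector[Double])

def reassign(data: Seq[Data], nodes: Seq[Node]): Seq[Data] = {
  // Index only the non-leaf nodes; tuples on leaves find no match
  // and are dropped, which shrinks the workset.
  val inner = nodes.filterNot(_.isLeaf).map(n => n.nodeID -> n).toMap
  for {
    d <- data
    n <- inner.get(d.nodeID) // leaf-assigned tuples are removed here
  } yield {
    val child =
      if (d.features(n.split.dim) <= n.split.value) d.nodeID * 2
      else d.nodeID * 2 + 1
    d.copy(nodeID = child)
  }
}
```

Tuples assigned to node 1 move to child 2 or 3 depending on the
split, and tuples already sitting on a leaf disappear from the
result.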
Unfortunately the exception is still raised.
Regarding the solution set growing: we generate new nodes with IDs
2*k and 2*k+1 for each non-leaf node with ID k, so new keys are
generated. We have removed the union now.
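Since the tree is numbered from the root as node 1, the children 2*k
and 2*k+1 of any node k always get IDs that did not exist before, so
the delta never collides with an existing solution-set key. A tiny
plain-Scala sketch of this key scheme (childIDs is a hypothetical
helper, not part of our job):

```scala
// Implicit binary-tree numbering: root has ID 1, node k has
// children 2*k and 2*k+1, so child IDs are always fresh keys.
def childIDs(k: Int): Seq[Int] = Seq(2 * k, 2 * k + 1)
```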
But I am afraid we did not understand the API for emitting the new
workset. We do not have keys on the training data. Is it possible to
re-emit a completely new workset in each iteration, or can you only
replace existing tuples in the current workset?
Could you also please have a look at our step function invocation?
Currently it looks like this:
initializedTree.iterateWithDelta(data, {_.nodeID}, growTree, maxDepth)
with the following variables:
- initializedTree: Node with ID 1 and split point. (Initial solution set)
- data: Training data set. (Initial work set)
- {_.nodeID}: Attribute of the training data. We are not entirely sure
  what this parameter is used for.
- growTree: Step function returning (solutionSetDelta, newWorkSet)
- maxDepth: Maximum number of iterations.
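Our current mental model of the iteration driver, written as a
non-Flink sketch on plain collections (types, names and the merge
behavior are our assumptions about the semantics, not Flink code):

```scala
// Simplified model of one delta iteration: after each superstep the
// delta is merged into the solution set (set union on the key, where
// an existing key is replaced by the delta entry), the new workset
// feeds the next superstep, and the loop stops when the workset is
// empty or maxIterations is reached. All names are illustrative.
case class Node(nodeID: Int, isLeaf: Boolean)
case class Data(nodeID: Int)

def iterateWithDelta(
    solution: Map[Int, Node],
    workset: Seq[Data],
    step: (Map[Int, Node], Seq[Data]) => (Seq[Node], Seq[Data]),
    maxIterations: Int): Map[Int, Node] = {
  var sol = solution
  var ws = workset
  var i = 0
  while (ws.nonEmpty && i < maxIterations) {
    val (delta, newWs) = step(sol, ws)
    sol = sol ++ delta.map(n => n.nodeID -> n) // set union on nodeID
    ws = newWs
    i += 1
  }
  sol
}
```

With a step function that splits every referenced node into two leaf
children and emits an empty workset, the loop terminates after one
superstep and the solution set has grown by the two new keys.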
Thank you very much for your support!
Best regards
Christoph
> Stephan Ewen <sewen@apache.org> hat am 10. Juni 2014 um 16:31 geschrieben:
>
>
> Hey!
>
> The way you have written the algorithm strikes me actually as a case for
> bulk iterations. Delta iterations are typically cases where the work set
> size goes down (not strictly up, as here).
>
> In your specific code, the problem is that you can actually only
> join/cogroup with the solution set, and produce deltas for it (deltas are
> merged into the solution set). You do not reassign the solution set.
>
> In order to grow the solution set by a set of elements, make a delta where
> elements have new keys (do not replace any of the current elements). When
> adding the delta to what is already there, you get a union, with the
> special condition that it is a set union on the keys, and if a key existed
> before, it gets replaced by the delta entry with the same key. So the
> solution set delta is just the result of the "newWorkSet" variable. No need
> to union before.
>
> For the workset, you may have to rethink the algorithm. The way you wrote
> it, the workset will always grow. A delta iteration terminates when the
> workset is empty. Try and write your algorithm such that the workset really
> contains only the changes that are relevant to the next iteration. If the
> changes become zero, the work is done and the iteration terminates.
>
> Stephan
