flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aljoscha Krettek <aljos...@apache.org>
Subject Re: KeySelector when joining with the solution set
Date Sun, 31 Aug 2014 06:54:46 GMT
HI Guys,
sorry for being so late in this discussion but I was thinking about
how we could make iterations more intuitive. My Idea would be to make
access to the solution set explicit in Delta Iterations. It should be
usable as input for any operation and updates to it would be done
through a special construct, not by joining with it. Maybe we could
have a special UpdateSolution operator which would make sure that the
data is partitioned correctly for the distributed hash table in which
the solution is stored. Or we could have a function on the Iteration
Context that can update the solution from an arbitraty operator. In my
idea, Bulk Iteration would be the base operation and the other
Iterations would add special features to it.  I also would like to add
an iteration type where you can output an element in every iteration
and the result is the concatenation of those elements. This seems to
be required for some machine-learning-style algorithms.

What do you guys think about this? I'm not sure on the details here
since I'm busy with other low-level stuff (Generalizing Pair
Comparators and the Scala API) but I was planning to tackle this
afterwards. If someone else wanted to look into this I'd be happy to
help and discuss, though. :D

Cheers,
Aljoscha

On Sat, Aug 30, 2014 at 6:27 PM, Stephan Ewen <sewen@apache.org> wrote:
> Addendum: The issue to enforce partitioning of the data set is tracked
> here: https://issues.apache.org/jira/browse/FLINK-1060
>
>
> On Sat, Aug 30, 2014 at 6:01 PM, Stephan Ewen <sewen@apache.org> wrote:
>
>> Hey!
>>
>> Sorry for the late reply, but here are some thoughts in that direction:
>>
>>
>> Stateful operations:
>>
>> In order to keep state (in the form of a hash map or so) around in
>> iterations, you need not use the restricted means that the delta iterations
>> give you. All function instances stay around for all supersteps, so they
>> can actually just create a map inside that function and update it as you
>> like.
>>
>> There is some example code at the end of the mail (listing 1) or nicely
>> formatted) in this gist:
>> https://gist.github.com/StephanEwen/7815c1f269f1f79b8e09
>>
>>
>> State across functions:
>>
>> If you actually want to have state that you access from multiple
>> functions, that is also possible. Use a static broker to fetch the map,
>> like in listing 2 or this gist:
>> https://gist.github.com/StephanEwen/5cdfa628d0e05b99f328
>>
>>
>> Making sure data is partitioned:
>>
>> The more tricky part is making sure that the data is partitioned when it
>> enters the functions. Right now, this needs a hack: A GroupReduce function
>> that simply emits everything. This will cause an unnecessary and costly
>> sort, so we need to get that out of the way. Until then, it should allow
>> you to write a prototype.
>>
>> The good news is that we are planning to add hints to tell the system to
>> partition/rebalance/etc data sets for operations. This would solve the
>> issue nicely. It is not a terribly large change, so it should not take too
>> long. I'll keep you posted on the progress...
>>
>> Greetings,
>> Stephan
>>
>>
>> ---------------------------------------------------------
>> Sample 1: Flexible State in a Mapper
>> ---------------------------------------------------------
>>
>> public static class MyMapper extends RichMapPartitionFunction<Edge, Edge> {
>>
>>       private final Map<Long, Edge> state = new HashMap<Long, Edge>();
>>
>>       @Override
>>       public void open(Configuration conf) throws Exception {
>>                               // load the state
>>
>>               if (getIterationRuntimeContext().getSuperstepNumber() == 1) {
>>                                               String pathToFragment = "hdfs://...
or file://...";
>>                       CsvInputFormat<Edge> reader = new CsvInputFormat<>(new
Path(pathToFragment));
>>                       reader.configure(new Configuration());
>>                       reader.open(new FileInputSplit(0, new Path(pathToFragment),
0, 36584, null));
>>                                               while (!reader.reachedEnd()) {
>>                               Edge next = reader.nextRecord(new Edge());
>>                               state.put(next.f0, next);
>>                       }
>>               }
>>
>>       }
>>
>>       @Override
>>
>>       public void close() {
>>               // check whether to write the state out
>>               if (getIterationRuntimeContext().getSuperstepNumber() == 42) {
>>                       // write the state (similar code as in open() for the reader)
>>
>>                                       }
>>       }
>>
>>
>>       @Override
>>       public void mapPartition(Iterable<Edge> records, Collector<Edge>
out) throws Exception {
>>               // do something with the state
>>               for (Edge e : records) {
>>                       Edge other = state.get(e.f0);
>>                       // do something
>>
>>                       state.put(...);
>>
>>               }
>>       }
>> }
>>
>>
>> =======================================================
>>
>> ---------------------------------------------------------
>> Sample 2: Sharing state across functions
>> ---------------------------------------------------------
>>
>> public class StateBroker {
>>
>>   public static final ConcurrentHashMap<Integer, Map<Long, Edge>> BROKER
= new ConcurrentHashMap<>();
>>               public static Map<Long, Edge> getForSubtask(int subtask) {
>>                               Map<Long, Edge> entry = BROKER.get(subtask);
>>               if (entry == null) {
>>                       entry = new HashMap<>();
>>                       Map<Long, Edge> previous = BROKER.putIfAbsent(subtask,
entry);
>>                       entry = previous == null ? entry : previous;
>>               }
>>                               return entry;
>>       }
>> }
>>
>>
>> public static class MyMapper extends RichMapPartitionFunction<Edge, Edge> {
>>
>>       private Map<Long, Edge> state;
>>
>>       @Override
>>       public void open(Configuration conf) throws Exception {
>>                               // load the state
>>
>>               if (getIterationRuntimeContext().getSuperstepNumber() == 1) {
>>                       state = StateBroker.getForSubtask(getRuntimeContext().getIndexOfThisSubtask());
>>               }
>>       }
>>
>>               ...
>> }
>>
>>
>>
>> On Thu, Jul 31, 2014 at 4:15 PM, Vasiliki Kalavri <
>> vasilikikalavri@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> thanks for replying so fast :)
>>>
>>> I saw the discussion in a previous thread concerning changing the API to
>>> offer more explicit join functions.
>>> I think providing these special functions is a good way to disable any
>>> other kind of interaction with the solution set.
>>>
>>> However, as a user, I would like to have a more flexible way of
>>> interacting
>>> with the state of the iteration.
>>> In the program I'm describing above, I actually want to join on the value
>>> of the solution set, not the key.
>>> It would be nice to somehow have access to the solution set as any other
>>> normal DataSet.
>>>
>>> I'm not sure how this could be supported, but if you think this is a good
>>> idea, I could work on this!
>>>
>>> Cheers,
>>> V.
>>>
>>>
>>>
>>> On 31 July 2014 15:50, Stephan Ewen <sewen@apache.org> wrote:
>>>
>>> > Hi Vasia!
>>> >
>>> > There is no fundamental reason, we simply have not gotten around to
>>> > implementing it, yet. Any help along these lines is highly welcome.
>>> >
>>> > One reason that held us back is that we need to make sure that the key
>>> of
>>> > the solution set and the key of the join is the same.
>>> > That is hard to verify with general functions. One approach is to
>>> actually
>>> > change the delta iteration API to define the keys only at
>>> > one place (the definition of the iteration), and offer special
>>> > "joinWithSolution" and "coGroupWithSolution" functions, rather then
>>> using
>>> > the regular join syntax (which allows to create invalid constructs).
>>> >
>>> > What are your thoughts on this, from a DeltaIteration user perspective?
>>> >
>>> > Greetings,
>>> > Stephan
>>> >
>>>
>>
>>

Mime
View raw message