Hey,
I also have some questions about another algorithm (Item Based
Collaborative Filtering) that I am implementing on top of Giraph.
The algorithm consists of 5 supersteps, and I have implemented a separate
Computation class for each of them. I use the MasterCompute to switch the
Computation class for each superstep. I also set a Combiner class for each
superstep where messages can be combined in parallel.
My implementation of Item Based Collaborative Filtering runs fine for a
small dataset. But for very large ones, like 100 million users and 100
million items, I run into a lot of issues. In 4 of the 5 supersteps, a
combiner makes sure that each vertex receives only one message, so the
message traffic is reduced significantly. However, there is one remaining
superstep (the second superstep) where I compute the similarity between
items. I cannot use a combiner here, since I need the whole vector on the
other side to compute a similarity value using either cosine or Pearson
distance. Every item sends its vector to all of its second-degree vertices
(2 hops out) for the similarity computation. So the higher a vertex's
degree is, the larger its vector will be, and it sends this large vector to
all of its second-degree vertices (which are also more numerous for a
high-degree vertex). This results in OutOfMemoryErrors because too many or
too large messages are being sent. I am running it on a cluster of 100
machines, each with 32 GB of memory.
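To make the similarity step concrete, here is a minimal sketch of the cosine computation that the receiving item would perform once it has the whole vector. It is plain Java with no Giraph types; the class name SparseCosine and the choice of a Map from user id to rating as the vector representation are assumptions for illustration, not my actual message format.

```java
import java.util.Map;

/** Hypothetical helper: cosine similarity between two sparse item vectors. */
final class SparseCosine {
    private SparseCosine() {}

    /** Each map goes from user id to that user's rating of the item. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        // Walk the smaller vector and look up matches in the larger one.
        Map<Long, Double> small = a.size() <= b.size() ? a : b;
        Map<Long, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<Long, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // Define the similarity as 0 when either vector is empty or all zeros.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean norm of a sparse vector. */
    static double norm(Map<Long, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

The dot product needs every coordinate of both vectors, which is why the vectors cannot be reduced by a combiner before they arrive.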
I came up with three approaches to solve this:
1. I tried running it with out-of-core messages enabled. It ran to
completion for 100M users and 100M items, but the results were wrong
because the combiners did not run. So,
a. Can we use the following parameters together:
-D giraph.useOutOfCoreMessages=true and
-D giraph.useBigDataIOForMessages=true? That is, can we use the unsafe
serialization and the out-of-core messaging at the same time?
I have tried this and it seems to work, but please confirm.
b. Also, whenever I use -D giraph.useOutOfCoreMessages=true, the
combiners don't run while sending the messages. In my code, I know that
when the combiners run I will get only one message, so I read only the
first message using
messages.iterator().next()
and do the processing. But running out of core causes the combiners not to
run, so I end up reading only the first message sent to the vertex and
ignoring all the others.
I need the combiners to run in all the other supersteps (superstep numbers
0, 2, 3, 4) where I am using them, but not in the similarity computation
step (superstep number 1), where I cannot combine the messages being sent.
*Is there any way to run just my second superstep out of core and the rest
in core so the combiners run?*
*Is there a way to run combiners when out-of-core messages are enabled?*
2. My second solution is to split the job into 3 parts: run parts 1 and 3
in core and part 2 out of core. But that would involve disk IO at the end
and the beginning of each job.
3. Instead of having all item vertices send their vectors to all their
second-degree vertices in the second superstep, I spread the sends out over
a series of supersteps. I take the modulus of the vertex id and, only if it
is equal to the superstep, do I send messages. But again I run into
ZooKeeper timeout errors or other errors that I cannot fathom.
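The modulus idea in point 3 can be sketched roughly as below. This is plain Java without Giraph types; the class name StaggeredSend and the parameters firstSendSuperstep and numBatches are hypothetical names introduced for illustration, on the assumption that the sends are spread over a fixed number of consecutive supersteps.

```java
/** Hypothetical helper: decides in which superstep a vertex sends its vector. */
final class StaggeredSend {
    private StaggeredSend() {}

    /**
     * Returns true if the vertex should send in the given superstep.
     * Sends are spread over numBatches consecutive supersteps, starting at
     * firstSendSuperstep; each vertex sends in exactly one of them.
     */
    static boolean sendsInSuperstep(long vertexId, long superstep,
                                    long firstSendSuperstep, int numBatches) {
        if (numBatches <= 0) {
            throw new IllegalArgumentException("numBatches must be positive");
        }
        long offset = superstep - firstSendSuperstep;
        if (offset < 0 || offset >= numBatches) {
            return false; // outside the window reserved for sending
        }
        // floorMod keeps the batch index non-negative for negative ids.
        return Math.floorMod(vertexId, (long) numBatches) == offset;
    }
}
```

With numBatches = 4 and firstSendSuperstep = 1, vertex 5 would send only in superstep 2. The receiving side then has to accumulate similarities across all of the send supersteps rather than in a single one.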
Can you please suggest a way to make the combiners run out of core, or a
better way of implementing my algorithm?
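One workaround I can think of for the messages.iterator().next() problem is to stop assuming that the combiner has run, and instead fold every received message with the same operation the combiner would apply. Below is a minimal sketch in plain Java; the class name ReceiveSideCombine and the method foldAll are hypothetical, and the sketch assumes the combine operation is associative and commutative. If the framework reuses message objects while iterating, a real version would also need to copy them.

```java
import java.util.function.BinaryOperator;

/** Hypothetical helper: combines messages on the receiving side. */
final class ReceiveSideCombine {
    private ReceiveSideCombine() {}

    /**
     * Folds all messages into one value using the given combine operation.
     * Returns null when no messages were received. If a combiner already ran,
     * there is a single message and it is returned unchanged.
     */
    static <M> M foldAll(Iterable<M> messages, BinaryOperator<M> combine) {
        M accumulated = null;
        for (M message : messages) {
            accumulated = (accumulated == null)
                    ? message
                    : combine.apply(accumulated, message);
        }
        return accumulated;
    }
}
```

For a sum combiner over doubles, this would be called as ReceiveSideCombine.foldAll(messages, Double::sum), and it gives the same result whether or not the combiner ran on the sending side.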
Thanks,
Ameya
Zynga Inc.
On Thu, Nov 21, 2013 at 3:49 PM, Ameya Vilankar <ameya.vilankar@gmail.com> wrote:
> Nice suggestion. I will try that out. Thanks a ton.
>
>
> On Thu, Nov 21, 2013 at 3:43 PM, Claudio Martella <
> claudio.martella@gmail.com> wrote:
>
>> The simplest thing is to keep a flag for each vertex to signal
>> whether it is really active. If it is not, it returns. This means that
>> vertices never really vote to halt. Computationally, it does not cost you
>> much more than this check. You can handle the rest of the logic with some
>> aggregators and the master compute.
>>
>>
>> On Thu, Nov 21, 2013 at 11:57 PM, Ameya Vilankar <
>> ameya.vilankar@gmail.com> wrote:
>>
>>> Hi,
>>> I have implemented Alternating Least Squares on top of Apache Giraph. On
>>> the edge, I store the type of the edge. Edges can be either a training edge
>>> or testing edge. When I run the algorithm, I use only the ratings on the
>>> training edge to tune the vectors on the vertices.
>>> The algorithm ends in one of the two scenarios:
>>> 1. All the vertices have tuned their vector within the tolerable error.
>>> At this point there are no active vertices since everyone has called vote
>>> to halt.
>>> 2. We reached the maximum number of supersteps. At this point, some
>>> vertices are active since they received messages from the last superstep.
>>>
>>> I have written an Aggregator that counts the training error along this
>>> process. But now, I want to calculate the prediction/testing error which is
>>> along the testing labelled edges. But there are either no active vertices
>>> or few active vertices at this point in my algorithm. I need all the
>>> vertices to send their vectors along all of their testing edges to compute
>>> the testing error and send it to an error sum aggregator. For this I need to
>>> activate all the vertices.
>>> Hope it is clear to you now.
>>>
>>> Thanks,
>>> Ameya.
>>> Zynga
>>>
>>>
>>> On Thu, Nov 21, 2013 at 2:45 PM, Claudio Martella <
>>> claudio.martella@gmail.com> wrote:
>>>
>>>> Hi Ameya,
>>>>
>>>> I'm not sure I understand the problem correctly. The maximum number of
>>>> supersteps allows you to halt the computation when that threshold is
>>>> reached. The RMSE can be computed within the master compute.
>>>>
>>>> What do you want to achieve exactly?
>>>>
>>>>
>>>> On Thu, Nov 21, 2013 at 10:47 PM, Ameya Vilankar <
>>>> ameya.vilankar@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I am implementing a machine learning algorithm on top of Giraph. The
>>>>> algorithm converges when all the vertices call voteToHalt or some max
>>>>> number of supersteps have completed.
>>>>> I want to calculate the RMSE error after the algorithm has converged.
>>>>> But the problem is either all the vertices have called vote to halt or
>>>>> (in the case where we reach max supersteps) only some of them are still
>>>>> active.
>>>>> I need to reactivate or wake up all the vertices. Is there any way in
>>>>> giraph that I could do this?
>>>>>
>>>>> Thanks,
>>>>> Ameya Vilankar
>>>>> Zynga
>>>>>
>>>>
>>>>
>>>>
>>>> 
>>>> Claudio Martella
>>>> claudio.martella@gmail.com
>>>>
>>>
>>>
>>
>>
>> 
>> Claudio Martella
>> claudio.martella@gmail.com
>>
>
>
