giraph-user mailing list archives

From Young Han <young....@uwaterloo.ca>
Subject Re: Input format problems running Giraph 1.1.0 on Twitter dataset
Date Mon, 04 May 2015 19:35:58 GMT
Hmm.. you might have an off-by-one error in your MasterCompute. The
superstep counter is -1 during input loading and starts at 0 for the first
iteration of computation. Assuming things haven't changed since 1.1.0-RC0,
MasterCompute executes after the end of a superstep (after the global
barrier) but before the start of a new superstep. However, the tricky bit
is that it also runs after the input superstep (superstep -1). So what you
might be seeing is the # of vertices after SS -1 (incorrect), after SS 0
(still incorrect), and after SS 1 (now correct).

What Steven said regarding vertex addition is correct. Internally, when
there is a message for a vertex that doesn't exist, Giraph will (by
default) add that vertex via a vertex mutation. These mutations are all
performed during a global barrier (i.e., between SS 0 and SS 1). So for
SimplePageRankComputation, you have all vertices broadcasting to their
out-edge neighbours during SS 0. This means all missing vertices receive
messages and so they get added after SS 0 but before SS 1. In SS 1, you
will observe that all vertices without out-edge neighbours are now added.
The VertexValueFactory solution works because it is called by Giraph when
creating/adding these missing vertices.
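
As a rough illustration of the paragraph above, here is a plain-Java sketch
(no Giraph dependency) of why the vertex count jumps after SS 0. The class
and method names are made up for this example, and it only models the
behaviour described here: an edge-list load creates the source vertices, and
a message to a missing target creates that target at the next barrier.

```java
import java.util.Set;
import java.util.TreeSet;

public class MissingVertexSketch {
    /** Vertices created at load time in this model: only the sources of edges. */
    static Set<Long> loadedFromEdgeList(long[][] edges) {
        Set<Long> loaded = new TreeSet<>();
        for (long[] e : edges) {
            loaded.add(e[0]);
        }
        return loaded;
    }

    /** Vertices present after the barrier that follows SS 0 in this model. */
    static Set<Long> afterFirstBarrier(long[][] edges) {
        Set<Long> present = loadedFromEdgeList(edges);
        for (long[] e : edges) {
            // Every source broadcasts along its out-edges in SS 0, so each
            // target receives a message and is created if it was missing.
            present.add(e[1]);
        }
        return present;
    }

    public static void main(String[] args) {
        // The three sample edges from the original question.
        long[][] edges = {
            {61578010L, 61147436L},
            {61578037L, 61147436L},
            {61578040L, 61147436L},
        };
        System.out.println("loaded: " + loadedFromEdgeList(edges).size());
        System.out.println("after barrier: " + afterFirstBarrier(edges).size());
    }
}
```

The difference between the two counts is the number of vertices with only
incoming edges, which is the quantity Steven suggests checking.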

Peering into the internals, I believe the order of execution is: end of
superstep reached -> workers flush all messages -> workers perform graph
mutations -> all workers arrive at the global barrier -> master compute
executes -> workers begin new superstep. (And input loading is a special
case: input loading/partition exchange -> all workers arrive at the global
barrier -> master compute executes -> workers begin superstep 0.)
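
To make that ordering easier to follow, here is a small plain-Java sketch
that just spells it out as a timeline. It is only my description above
turned into code, not something checked against the Giraph source, and the
class name, method name, and event labels are invented for illustration.
"SS n:" on an event means the step belongs to the end of superstep n.

```java
import java.util.ArrayList;
import java.util.List;

public class SuperstepOrderSketch {
    /** Builds the believed event order for input loading plus N supersteps. */
    static List<String> timeline(int computeSupersteps) {
        List<String> events = new ArrayList<>();
        // Input loading is the special case: superstep -1.
        events.add("SS -1: input loading / partition exchange");
        events.add("SS -1: global barrier");
        events.add("SS -1: master compute");
        for (int ss = 0; ss < computeSupersteps; ss++) {
            events.add("SS " + ss + ": workers compute");
            events.add("SS " + ss + ": workers flush messages");
            events.add("SS " + ss + ": workers perform graph mutations");
            events.add("SS " + ss + ": global barrier");
            events.add("SS " + ss + ": master compute");
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : timeline(2)) {
            System.out.println(event);
        }
    }
}
```

Note that master compute appears three times for two computation
supersteps, which is where an off-by-one in a logging MasterCompute can
creep in.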

Young

On Mon, May 4, 2015 at 1:16 PM, Steven Harenberg <sdharenb@ncsu.edu> wrote:

> My understanding is that a vertex with only incoming edges will not be
> active until it receives a message, which is why you don't see all of the
> vertices initially. The easiest way to test this is to write a script that
> parses your input and creates a new data file where every vertex is
> specified on a line of its own. Even if it has no outgoing neighbors, just
> leave the neighbor empty. Or, first just check if you have
> 40383589-40103281=280308 vertices with only incoming edges.
>
> Young provided another solution for fixing the initialization problem, and
> it looks like that wasn't specified in the code you ran, which would
> explain why it still has the problem.
>
> Either transform the input (seems like the easiest thing to do), or try
> the fix Young said. I would bet either of those would fix the issue. Young
> may have better ideas since he seems more experienced with Giraph than I am.
>
> --Steve
>
> On Sat, May 2, 2015 at 2:19 PM, Kenrick Fernandes <kenrick.f15@gmail.com>
> wrote:
>
>> Thank you both for your responses.
>>
>> Steve, I faced the same problem when I created the Long input format
>> files.
>> I tried running the code linked by Young above, using the
>> *SimplePageRankInputFormat.java*
>> as well as the *SimplePageRankVertex.java* in the repo.
>>
>> For the Twitter dataset, I added some *MasterCompute* code to log the
>> number of vertices
>> that existed at each superstep. The results, however, look pretty similar
>> to the previous iteration:
>>
>> Current step is 1 - 40103281 existed in the previous superstep 0
>>
>> Current step is 2 - 40103281 existed in the previous superstep 1
>>
>> Current step is 3 - 40383589 existed in the previous superstep 2
>>
>> Current step is 31 - 40383589 existed in the previous superstep 30
>>
>> It seems that a subset of vertices still only becomes active after the
>> first superstep, despite all vertices being initialized in superstep 0.
>> I can't think of a reason why. Thoughts?
>>
>> Thanks,
>> Kenrick
>>
>>
>>
>> On Wed, Apr 29, 2015 at 2:33 PM, Young Han <young.han@uwaterloo.ca>
>> wrote:
>>
>>> For the initialization issue, you can define a (nested) class that
>>> extends DefaultVertexValueFactory (from org.apache.giraph.factories) and
>>> add
>>> "-Dgiraph.vertexValueFactoryClass=org.apache.giraph.examples.AlgClass\$AlgVertexValueFactory"
>>> after "org.apache.giraph.GiraphRunner" in your hadoop jar command.
>>>
>>> Also, the reason those input formats don't work is because PageRank is
>>> using LongWritable for vertex id and DoubleWritable for vertex value. As
>>> Roman pointed out, you have to have an input class that matches it (even if
>>> the input dataset has no "double" vertex values). An example (for Giraph
>>> 1.0.0) can be found here:
>>> https://github.com/xvz/graph-processing/blob/master/giraph-1.0.0/giraph-examples/src/main/java/org/apache/giraph/examples/SimplePageRankInputFormat.java
>>> and an example command that uses it here:
>>> https://github.com/xvz/graph-processing/blob/master/benchmark/giraph/pagerank.sh#L50
>>>
>>> Young
>>>
>>> On Wed, Apr 29, 2015 at 11:24 AM, Steven Harenberg <sdharenb@ncsu.edu>
>>> wrote:
>>>
>>>> Hey Kenrick,
>>>>
>>>> First, your commands above are wrong because -vif specifies a vertex
>>>> input format, and I believe *LongLongNullTextInputFormat* expects
>>>> adjacency list format, whereas your data is an edge list. However, even
>>>> with the right commands there will be issues and more things you need to do.
>>>>
>>>> I did get the edgelist input format to work by creating a
>>>> LongNullTextEdgeInputFormat.java file just like the
>>>> giraph-core/src/main/java/org/apache/giraph/io/formats/IntNullTextEdgeInputFormat.java
>>>> file, but with longs instead of ints (this also required creating a
>>>> LongPair class).
>>>>
>>>> However, I would advise against using an edgelist input format in
>>>> Giraph as there are major underlying issues that I never figured out how
>>>> to resolve. Namely, for an edgelist format, Giraph only considers a vertex
>>>> active in the first superstep if it has an outgoing edge. This means that
>>>> vertices with only incoming edges won't be initialized with correct values
>>>> during things like PageRank, SSSP, or WCC and hence will output incorrect
>>>> results. (You can see my previous thread here:
>>>> http://mail-archives.apache.org/mod_mbox/giraph-user/201502.mbox/%3CCAHv2Baw7zFJ-s7dtNMv5dkNxz_zE436krE%2B6G4r3tp-HVgjW2g%40mail.gmail.com%3E
>>>> )
>>>>
>>>> The above issue can be avoided with adjacency list format by specifying
>>>> the vertex with no neighbors. For example, if vertex v has only incoming
>>>> edges, then you make sure there is a line with just v and no neighbors
>>>> listed (
>>>> http://mail-archives.apache.org/mod_mbox/giraph-user/201408.mbox/%3C1409255770206.93691@uiowa.edu%3E
>>>> ).
>>>>
>>>> If you figure out how to resolve the edgelist input issue please let me
>>>> know.
>>>>
>>>> Regards,
>>>> Steve
>>>>
>>>>
>>>> On Sat, Apr 25, 2015 at 9:54 PM, Kenrick Fernandes <
>>>> kenrick.f15@gmail.com> wrote:
>>>>
>>>>> Hi Roman,
>>>>>
>>>>> Thanks for the quick response. There is no vertex data in this
>>>>> dataset though, and the vertex IDs posted above would fit in a
>>>>> Long. Would you advise changing the PageRankComputation
>>>>> formats, or working on a new input format ?
>>>>>
>>>>> Thanks,
>>>>> Kenrick
>>>>>
>>>>> On Sat, Apr 25, 2015 at 7:40 PM, Roman Shaposhnik <
>>>>> roman@shaposhnik.org> wrote:
>>>>>
>>>>>> One of the slightly annoying things in Giraph is that you have
>>>>>> to manually match your input format to your computation. In
>>>>>> your case, PageRankComputation requires LongWritable for
>>>>>> vertex ID and DoubleWritable for vertex Data. You may need
>>>>>> to hack one of the existing formats slightly.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Roman.
>>>>>>
>>>>>> On Sat, Apr 25, 2015 at 2:58 PM, Kenrick Fernandes
>>>>>> <kenrick.f15@gmail.com> wrote:
>>>>>> > Hello,
>>>>>> >
>>>>>> > I'm trying to get Giraph to read the Twitter dataset as input for the
>>>>>> > SimplePageRankComputation program. The dataset format looks like this:
>>>>>> > 61578010 61147436
>>>>>> > 61578037 61147436
>>>>>> > 61578040 61147436
>>>>>> > (vertex IDs, with pairs representing edges)
>>>>>> >
>>>>>> > When I run the command with
>>>>>> > -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat, I get this
>>>>>> > error:
>>>>>> > java.lang.IllegalArgumentException: checkClassTypes: vertex index types not
>>>>>> > assignable, computation - class org.apache.hadoop.io.LongWritable,
>>>>>> > VertexInputFormat - class org.apache.hadoop.io.IntWritable
>>>>>> >
>>>>>> > So I tried running the command with
>>>>>> > -vif org.apache.giraph.io.formats.LongLongNullTextInputFormat and I get a
>>>>>> > different one:
>>>>>> > java.lang.IllegalArgumentException: checkClassTypes: vertex value types not
>>>>>> > assignable, computation - class org.apache.hadoop.io.DoubleWritable,
>>>>>> > VertexInputFormat - class org.apache.hadoop.io.LongWritable
>>>>>> >
>>>>>> > I don't understand why the types in the input show up as different formats in
>>>>>> > each error. Also, as far as I could tell, there is no input format for
>>>>>> > DoubleDouble. Is there a different way to get the graph into Giraph without
>>>>>> > having to write custom input code? Thoughts would be much appreciated.
>>>>>> >
>>>>>> > -----
>>>>>> > Reference Command:
>>>>>> > hadoop jar giraph-examples-1.1.0-for-hadoop-1.1.2-jar-with-dependencies.jar
>>>>>> > org.apache.giraph.GiraphRunner
>>>>>> > org.apache.giraph.examples.PageRankComputation -vif
>>>>>> > org.apache.giraph.io.formats.LongLongNullTextInputFormat -vip
>>>>>> > /user/kenrick/twitter/input -op /user/kenrick/twitter/output -w 30
>>>>>> > -----
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Kenrick
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
