giraph-user mailing list archives

From Kenrick Fernandes <kenrick....@gmail.com>
Subject Re: Input format problems running Giraph 1.1.0 on Twitter dataset
Date Sat, 02 May 2015 18:19:13 GMT
Thank you both for your responses.

Steve, I faced the same problem when I created the Long input format files.
I tried running the code linked by Young above, using the
*SimplePageRankInputFormat.java* as well as the *SimplePageRankVertex.java*
in the repo.

For the Twitter dataset, I added some *MasterCompute* code to log the number
of vertices that existed at each superstep. The results, however, look pretty
similar to the previous iteration:

Current step is 1 - 40103281 existed in the previous superstep 0

Current step is 2 - 40103281 existed in the previous superstep 1

Current step is 3 - 40383589 existed in the previous superstep 2

Current step is 31 - 40383589 existed in the previous superstep 30

It seems that a subset of vertices still becomes active only after the first
superstep, despite all vertices being initialized in superstep 0. I can't
think of a reason why. Thoughts?

Thanks,
Kenrick



On Wed, Apr 29, 2015 at 2:33 PM, Young Han <young.han@uwaterloo.ca> wrote:

> For the initialization issue, you can define a (nested) class that extends
> DefaultVertexValueFactory (from org.apache.giraph.factories) and add
> "-Dgiraph.vertexValueFactoryClass=org.apache.giraph.examples.AlgClass\$AlgVertexValueFactory"
> after "org.apache.giraph.GiraphRunner" in your hadoop jar command.
>
> Also, those input formats don't work because PageRank uses LongWritable
> for the vertex id and DoubleWritable for the vertex value. As
> Roman pointed out, you have to have an input class that matches it (even if
> the input dataset has no "double" vertex values). An example (for Giraph
> 1.0.0) can be found here:
> https://github.com/xvz/graph-processing/blob/master/giraph-1.0.0/giraph-examples/src/main/java/org/apache/giraph/examples/SimplePageRankInputFormat.java
> and an example command that uses it here:
> https://github.com/xvz/graph-processing/blob/master/benchmark/giraph/pagerank.sh#L50
>
> Young
>
> On Wed, Apr 29, 2015 at 11:24 AM, Steven Harenberg <sdharenb@ncsu.edu>
> wrote:
>
>> Hey Kenrick,
>>
>> First, your commands above are wrong: your data is an edge list, but the
>> -vif argument specifies a vertex input format, and I believe
>> *LongLongNullTextInputFormat* expects adjacency list format. However, even
>> with the right commands there will be issues and more things you need to do.
>>
>> I did get the edgelist input format to work by creating a
>> LongNullTextEdgeInputFormat.java file just like the
>> giraph-core/src/main/java/org/apache/giraph/io/formats/IntNullTextEdgeInputFormat.java
>> file, but with longs instead of ints (this also required creating a
>> LongPair class).
>>
>> However, I would advise against using an edgelist input format in Giraph
>> as there are major underlying issues that I never figured out how to
>> resolve. Namely, for an edgelist format, Giraph only considers a vertex
>> active in the first superstep if it has an outgoing edge. This means that
>> vertices with only incoming edges won't be initialized with correct values
>> during things like PageRank, SSSP, or WCC and hence will output incorrect
>> results. (You can see my previous thread here:
>> http://mail-archives.apache.org/mod_mbox/giraph-user/201502.mbox/%3CCAHv2Baw7zFJ-s7dtNMv5dkNxz_zE436krE%2B6G4r3tp-HVgjW2g%40mail.gmail.com%3E
>> )
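
The effect Steve describes can be reproduced outside Giraph. The following is a minimal, Giraph-independent sketch: the class and method names are hypothetical, and it only assumes that an edge-list loader creates a vertex for each edge source. It splits the vertex IDs from the sample lines in this thread into those that appear as a source and those that appear only as a target.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustration only; this is not Giraph code and the names are hypothetical. */
final class EdgeListSinks {

    /** IDs that appear as the source of at least one edge. */
    static Set<Long> sources(List<long[]> edges) {
        Set<Long> result = new TreeSet<>();
        for (long[] edge : edges) {
            result.add(edge[0]);
        }
        return result;
    }

    /** IDs that appear only as a target, i.e. vertices with only incoming edges. */
    static Set<Long> sinkOnly(List<long[]> edges) {
        Set<Long> result = new TreeSet<>();
        for (long[] edge : edges) {
            result.add(edge[1]);
        }
        result.removeAll(sources(edges));
        return result;
    }

    public static void main(String[] args) {
        // The three sample edges quoted from the Twitter dataset in this thread.
        List<long[]> edges = Arrays.asList(
            new long[] {61578010L, 61147436L},
            new long[] {61578037L, 61147436L},
            new long[] {61578040L, 61147436L});

        System.out.println("has an outgoing edge: " + sources(edges));
        System.out.println("only incoming edges:  " + sinkOnly(edges));
    }
}
```

In the sample, 61147436 only ever appears as a target, so it is the kind of vertex that would not be active in the first superstep under the behaviour described above.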
>>
>> The above issue can be avoided with adjacency list format by specifying
>> the vertex with no neighbors. For example, if vertex v has only incoming
>> edges, then you make sure there is a line with just v and no neighbors
>> listed (
>> http://mail-archives.apache.org/mod_mbox/giraph-user/201408.mbox/%3C1409255770206.93691@uiowa.edu%3E
>> ).
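
One way to prepare such a file is to convert the edge list before loading it. This is a rough sketch under stated assumptions, not a tested Giraph workflow: the class name is hypothetical, the output layout ("id neighbor neighbor ..." separated by spaces) is an assumption that must be adjusted to whatever the chosen input format actually parses, and everything is held in memory, which would not scale to the full Twitter graph without a sort-based or MapReduce variant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only; the name and the output layout are assumptions. */
final class EdgeListToAdjacency {

    /** Turns "source target" lines into one "id neighbor neighbor ..." line per vertex. */
    static List<String> convert(List<String> edgeLines) {
        Map<Long, List<Long>> adjacency = new TreeMap<>();
        for (String line : edgeLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            long source = Long.parseLong(parts[0]);
            long target = Long.parseLong(parts[1]);
            adjacency.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
            // Register the target too, so a vertex with only incoming
            // edges still gets its own line with no neighbors listed.
            adjacency.computeIfAbsent(target, k -> new ArrayList<>());
        }

        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, List<Long>> entry : adjacency.entrySet()) {
            StringBuilder sb = new StringBuilder(Long.toString(entry.getKey()));
            for (long neighbor : entry.getValue()) {
                sb.append(' ').append(neighbor);
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "61578010 61147436",
            "61578037 61147436",
            "61578040 61147436");
        for (String line : convert(sample)) {
            System.out.println(line);
        }
    }
}
```

For the sample edges, the vertex 61147436 ends up on a line by itself, which is the "just v and no neighbors" case mentioned above.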
>>
>> If you figure out how to resolve the edgelist input issue please let me
>> know.
>>
>> Regards,
>> Steve
>>
>>
>> On Sat, Apr 25, 2015 at 9:54 PM, Kenrick Fernandes <kenrick.f15@gmail.com>
>> wrote:
>>
>>> Hi Roman,
>>>
>>> Thanks for the quick response. There is no vertex data in this
>>> dataset though, and the vertex IDs posted above would fit in a
>>> Long. Would you advise changing the PageRankComputation
>>> types, or working on a new input format?
>>>
>>> Thanks,
>>> Kenrick
>>>
>>> On Sat, Apr 25, 2015 at 7:40 PM, Roman Shaposhnik <roman@shaposhnik.org>
>>> wrote:
>>>
>>>> One of the slightly annoying things in Giraph is that you have
>>>> to manually match your input format to your computation. In
>>>> your case, PageRankComputation requires LongWritable for
>>>> vertex ID and DoubleWritable for vertex Data. You may need
>>>> to hack one of the existing formats slightly.
>>>>
>>>>
>>>> Thanks,
>>>> Roman.
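
The two error messages in the original question are consistent with a check that compares the vertex ID type first and the vertex value type second. The sketch below is not Giraph's validator: the class name is hypothetical, the check order is inferred from the errors quoted in this thread, the direction of the assignability test is an assumption, and plain `Long`, `Integer` and `Double` stand in for the Hadoop Writable classes.

```java
/** Illustration only; a stand-in for the kind of check behind "checkClassTypes". */
final class TypeCheckSketch {

    /** Returns null when the types line up, otherwise a description of the first mismatch. */
    static String firstMismatch(Class<?> computationId, Class<?> computationValue,
                                Class<?> formatId, Class<?> formatValue) {
        // Assumed order: the vertex ID is checked before the vertex value.
        if (!computationId.isAssignableFrom(formatId)) {
            return "vertex index types not assignable, computation - " + computationId
                + ", VertexInputFormat - " + formatId;
        }
        if (!computationValue.isAssignableFrom(formatValue)) {
            return "vertex value types not assignable, computation - " + computationValue
                + ", VertexInputFormat - " + formatValue;
        }
        return null;
    }

    public static void main(String[] args) {
        // Computation wants (Long id, Double value), like PageRank in this thread.
        // An Int/Int format fails on the ID before the value is ever looked at.
        System.out.println(firstMismatch(Long.class, Double.class, Integer.class, Integer.class));
        // A Long/Long format passes the ID check and then fails on the value.
        System.out.println(firstMismatch(Long.class, Double.class, Long.class, Long.class));
        // A Long/Double format matches, so nothing is reported.
        System.out.println(firstMismatch(Long.class, Double.class, Long.class, Double.class));
    }
}
```

Under these assumptions, switching from an Int/Int format to a Long/Long format fixes the ID mismatch and exposes the value mismatch, which is why a different type shows up in each error.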
>>>>
>>>> On Sat, Apr 25, 2015 at 2:58 PM, Kenrick Fernandes
>>>> <kenrick.f15@gmail.com> wrote:
>>>> > Hello,
>>>> >
>>>> > I'm trying to get Giraph to read the Twitter dataset as input for the
>>>> > SimplePageRankComputation program. The dataset format looks like this:
>>>> > 61578010 61147436
>>>> > 61578037 61147436
>>>> > 61578040 61147436
>>>> > (vertex IDs, with each pair representing an edge)
>>>> >
>>>> > When I run the command with
>>>> > -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat, I get this
>>>> > error:
>>>> > java.lang.IllegalArgumentException: checkClassTypes: vertex index types not
>>>> > assignable, computation - class org.apache.hadoop.io.LongWritable,
>>>> > VertexInputFormat - class org.apache.hadoop.io.IntWritable
>>>> >
>>>> > So I tried running the command with
>>>> > -vif org.apache.giraph.io.formats.LongLongNullTextInputFormat and I get a
>>>> > different one:
>>>> > java.lang.IllegalArgumentException: checkClassTypes: vertex value types not
>>>> > assignable, computation - class org.apache.hadoop.io.DoubleWritable,
>>>> > VertexInputFormat - class org.apache.hadoop.io.LongWritable
>>>> >
>>>> > I don't understand why the types in the input show up as different formats in
>>>> > each error. Also, as far as I could tell, there is no input format for
>>>> > DoubleDouble. Is there a different way to get the graph into Giraph without
>>>> > having to write custom input code? Thoughts would be much appreciated.
>>>> >
>>>> > -----
>>>> > Reference Command:
>>>> > hadoop jar giraph-examples-1.1.0-for-hadoop-1.1.2-jar-with-dependencies.jar
>>>> > org.apache.giraph.GiraphRunner
>>>> > org.apache.giraph.examples.PageRankComputation -vif
>>>> > org.apache.giraph.io.formats.LongLongNullTextInputFormat -vip
>>>> > /user/kenrick/twitter/input -op /user/kenrick/twitter/output -w 30
>>>> > -----
>>>> >
>>>> > Thanks,
>>>> > Kenrick
>>>>
>>>
>>>
>>
>
