flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vasiliki Kalavri <vasilikikala...@gmail.com>
Subject Re: Apache Flink 1.1.4 - Gelly - LocalClusteringCoefficient - Returning values above 1?
Date Mon, 23 Jan 2017 10:40:34 GMT
Hi Miguel,

I don't think you're doing anything wrong. The NaN values you are getting
are there because of your data. The LCC value is computed as
#number_of_triangles / #number_of_triples, where #number_of_triples is
[n*(n-1)]/2 for a vertex with n neighbors. It looks like there are no
triangles in your graph, and only one vertex has more than one neighbor.

Cheers,
-Vasia.

On 21 January 2017 at 16:46, Miguel Coimbra <miguel.e.coimbra@gmail.com>
wrote:

> Hello Vasia and Greg,
>
> Thank you for the feedback!
>
> I am probably misusing the Gelly API in some way, but I thought I could
> run the undirected version after calling getUndirected()?
> While not going into the concept of local clustering coefficients, I
> thought that from a Gelly API point-of-view, both my code and data set were
> properly established.
> However:
> - I believe that the graph was already undirected;
> - I am getting NaN results after executing the algorithm.
>
> This is the code I am using to obtain an (undirected) graph instance upon
> which I call LocalClusteringCoefficient:
>
>
> import org.apache.flink.graph.library.clustering.undirected.
> LocalClusteringCoefficient.Result;
> import org.apache.flink.graph.library.clustering.undirected.
> LocalClusteringCoefficient;
> /** other imports and method definitions **/
>
> // Generate edge tuples from the input file.
> final DataSet<Tuple2<LongValue, LongValue>> edgeTuples =
> env.readCsvFile(inputPath)
>     .fieldDelimiter("\t") // node IDs are separated by spaces
>     .ignoreComments("#")  // comments start with "%"
>     .types(LongValue.class, LongValue.class);
>
> // Generate actual Edge<Long, Double> instances.
> @SuppressWarnings("serial")
> final DataSet<Edge<LongValue, Double>> edges = edgeTuples.map(
>     new MapFunction<Tuple2<LongValue, LongValue>, Edge<LongValue,
> Double>>() {
>         @Override
>         public Edge<LongValue, Double> map(Tuple2<LongValue, LongValue>
> arg0) throws Exception {
>             return new Edge<LongValue, Double>(arg0.f0,  arg0.f1, 1.0d);
>         }
>     });
>
> // Generate the basic graph.
> @SuppressWarnings("serial")
> final Graph<LongValue, Double, Double> graph = Graph.fromDataSet(
>     edges,
>     new MapFunction<LongValue, Double>() {
>         @Override
>         public Double map(LongValue arg0) throws Exception {
>             // For testing purposes, just setting each vertex value to 1.0.
>             return 1.0;
>         }
>     },
>     env).*getUndirected(*);
>
> // Execute the LocalClusteringCoefficient algorithm.
> final DataSet<Result<LongValue>> localClusteringCoefficients =
> graph.run(new LocalClusteringCoefficient<LongValue, Double, Double>());
>
> // Get the values as per Vasia's help:
> @SuppressWarnings("serial")
> DataSet<Double> *CLUSTERING_COEFFICIENTS* = localClusteringCoefficients.map(new
> MapFunction<Result<LongValue>, Double>() {
>     @Override
>     public Double map(Result<LongValue> arg0) throws Exception {
>         return arg0.getLocalClusteringCoefficientScore();
>     }
> });
>
> I believe this is the correct way to get a DataSet<Double> of
> coefficients from a DataSet<Result<LongValue>> ?
> Among the coefficients are a lot of NaN values:
>
> *CLUSTERING_COEFFICIENTS*.print();
>
> NaN
> 0.0
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
>
> Apologies for the verbosity in advance, but just to provide detail,
> printing the graph edges yields this (notice that for each pair or vertices
> there are two links, which are original and the reverse version derived
> from getUndirected()).
>
> *Greg:* I therefore believe the *graph is undirected*:
>
> graph.getEdgesAsTuple3().print();
> (5113,6008,1.0)
> (6008,5113,1.0)
> (5113,6774,1.0)
> (6774,5113,1.0)
> (5113,32938,1.0)
> (32938,5113,1.0)
> (5113,6545,1.0)
> (6545,5113,1.0)
> (5113,7088,1.0)
> (7088,5113,1.0)
> (5113,37929,1.0)
> (37929,5113,1.0)
> (5113,26562,1.0)
> (26562,5113,1.0)
> (5113,6107,1.0)
> (6107,5113,1.0)
> (5113,7171,1.0)
> (7171,5113,1.0)
> (5113,6192,1.0)
> (6192,5113,1.0)
> (5113,7763,1.0)
> (7763,5113,1.0)
> (9748,5113,1.0)
> (5113,9748,1.0)
> (10191,5113,1.0)
> (5113,10191,1.0)
> (6064,5113,1.0)
> (5113,6064,1.0)
> (6065,5113,1.0)
> (5113,6065,1.0)
> (6279,5113,1.0)
> (5113,6279,1.0)
> (4907,5113,1.0)
> (5113,4907,1.0)
> (6465,5113,1.0)
> (5113,6465,1.0)
> (6707,5113,1.0)
> (5113,6707,1.0)
> (7089,5113,1.0)
> (5113,7089,1.0)
> (7172,5113,1.0)
> (5113,7172,1.0)
> (14310,5113,1.0)
> (5113,14310,1.0)
> (6252,5113,1.0)
> (5113,6252,1.0)
> (33855,5113,1.0)
> (5113,33855,1.0)
> (7976,5113,1.0)
> (5113,7976,1.0)
> (26284,5113,1.0)
> (5113,26284,1.0)
> (8056,5113,1.0)
> (5113,8056,1.0)
> (10371,5113,1.0)
> (5113,10371,1.0)
> (16785,5113,1.0)
> (5113,16785,1.0)
> (19801,5113,1.0)
> (5113,19801,1.0)
> (6715,5113,1.0)
> (5113,6715,1.0)
> (31724,5113,1.0)
> (5113,31724,1.0)
> (32443,5113,1.0)
> (5113,32443,1.0)
> (10370,5113,1.0)
> (5113,10370,1.0)
>
> Any insight into what I may be doing wrong would be greatly appreciated.
>
> Thanks for your time,
>
> Kind regards,
>
> Miguel E. Coimbra
> Email: miguel.e.coimbra@gmail.com <miguel.e.coimbra@ist.utl.pt>
> Skype: miguel.e.coimbra
>
> On 20 January 2017 at 19:31, Greg Hogan <code@greghogan.com> wrote:
>
>> Hi Miguel,
>>
>> The '--output print' option describes the values and also displays the
>> local clustering coefficient value.
>>
>> You're running the undirected algorithm on a directed graph. In 1.2 there
>> is an option '--simplify true' that will add reverse edges and remove
>> duplicate edges and self-loops. Alternatively, it looks like you could
>> simply add reverse edges to your input file (with an optional ' | sort |
>> uniq' following):
>>
>> $ cat edges.txt | awk ' { print $1, $2; print $2, $1 } '
>>
>> The drivers are being reworked for 1.3 to better reuse code and options
>> which will better support additional drivers and algorithms and make
>> documentation simpler.
>>
>> Greg
>>
>> On Fri, Jan 20, 2017 at 2:06 PM, Vasiliki Kalavri <
>> vasilikikalavri@gmail.com> wrote:
>>
>>> Hi Miguel,
>>>
>>> the LocalClusteringCoefficient algorithm returns a DataSet of type
>>> Result, which basically wraps a vertex id, its degree, and the number
>>> of triangles containing this vertex. The number 11 you see is indeed the
>>> degree of vertex 5113. The Result type contains the method
>>> getLocalClusteringCoefficientScore() which allows you to retrieve the
>>> clustering coefficient score for a vertex. The method simply divides the
>>> numbers of triangles by the number of potential edges between neighbors.
>>>
>>> I'm sorry that you this is not clear in the docs. We should definitely
>>> improve them to explain what is the output and how to retrieve the actual
>>> clustering coefficient values. I have opened a JIRA for this [1].
>>>
>>> Cheers,
>>> -Vasia.
>>>
>>> [1]: https://issues.apache.org/jira/browse/FLINK-5597
>>>
>>> On 20 January 2017 at 19:31, Miguel Coimbra <miguel.e.coimbra@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> In the documentation of the LocalClusteringCoefficient algorithm, it
>>>> is said:
>>>>
>>>>
>>>> *The local clustering coefficient measures the connectedness of each
>>>> vertex’s neighborhood.Scores range from 0.0 (no edges between neighbors)
to
>>>> 1.0 (neighborhood is a clique).*
>>>>
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.1/
>>>> apis/batch/libs/gelly.html#local-clustering-coefficient
>>>> <https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/gelly/library_methods.html#local-clustering-coefficient>
>>>>
>>>> However, upon running the algorithm (undirected version), I obtained
>>>> values above 1.
>>>>
>>>> The result I got was this. As you can see, vertex 5113 has a score of
>>>> 11:
>>>> (the input edges for the graph are shown further below - around *35
>>>> edges*):
>>>>
>>>> (4907,(1,0))
>>>> *(5113,(11,0))*
>>>> (6008,(0,0))
>>>> (6064,(1,0))
>>>> (6065,(1,0))
>>>> (6107,(0,0))
>>>> (6192,(0,0))
>>>> (6252,(1,0))
>>>> (6279,(1,0))
>>>> (6465,(1,0))
>>>> (6545,(0,0))
>>>> (6707,(1,0))
>>>> (6715,(1,0))
>>>> (6774,(0,0))
>>>> (7088,(0,0))
>>>> (7089,(1,0))
>>>> (7171,(0,0))
>>>> (7172,(1,0))
>>>> (7763,(0,0))
>>>> (7976,(1,0))
>>>> (8056,(1,0))
>>>> (9748,(1,0))
>>>> (10191,(1,0))
>>>> (10370,(1,0))
>>>> (10371,(1,0))
>>>> (14310,(1,0))
>>>> (16785,(1,0))
>>>> (19801,(1,0))
>>>> (26284,(1,0))
>>>> (26562,(0,0))
>>>> (31724,(1,0))
>>>> (32443,(1,0))
>>>> (32938,(0,0))
>>>> (33855,(1,0))
>>>> (37929,(0,0))
>>>>
>>>> This was from a small isolated test with these edges:
>>>>
>>>> 5113    6008
>>>> 5113    6774
>>>> 5113    32938
>>>> 5113    6545
>>>> 5113    7088
>>>> 5113    37929
>>>> 5113    26562
>>>> 5113    6107
>>>> 5113    7171
>>>> 5113    6192
>>>> 5113    7763
>>>> 9748    5113
>>>> 10191    5113
>>>> 6064    5113
>>>> 6065    5113
>>>> 6279    5113
>>>> 4907    5113
>>>> 6465    5113
>>>> 6707    5113
>>>> 7089    5113
>>>> 7172    5113
>>>> 14310    5113
>>>> 6252    5113
>>>> 33855    5113
>>>> 7976    5113
>>>> 26284    5113 <262%20845%20113>
>>>> 8056    5113
>>>> 10371    5113
>>>> 16785    5113
>>>> 19801    5113
>>>> 6715    5113
>>>> 31724    5113
>>>> 32443    5113
>>>> 10370    5113
>>>>
>>>> I am not sure what I may be doing wrong, but is there perhaps some form
>>>> of normalization lacking in my execution of:
>>>>
>>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>>> usteringCoefficient.Result;
>>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>>> usteringCoefficient;
>>>>
>>>> Am I supposed to divide all scores by the greatest score obtained by
>>>> the algorithm?
>>>>
>>>> Thank you very much!
>>>>
>>>> Miguel E. Coimbra
>>>> Email: miguel.e.coimbra@gmail.com <miguel.e.coimbra@ist.utl.pt>
>>>> Skype: miguel.e.coimbra
>>>>
>>>
>>>
>>
>

Mime
View raw message