# Edge DataFrame
e = spark.createDataFrame([
("a", "b", "friend"),
("a", "d", "friend"),
("k", "a", "family"),
("k", "h", "friend"),
("d", "g", "family"),
("d", "h", "family"),
("d", "g", "family"),
("e", "d", "family"),
("e", "h", "friend"),
("g", "h", "friend"),
("h", "e", "friend")], ["src", "dst", "relationship"])
Social Network Analysis using GraphFrames#
Graph analytics has a wide range of applications, such as network-flow optimization, modeling information propagation, and fraud and anomaly detection. Due to the advent of social networks and the Internet of Things, we now have massive web-scale graphs with millions to billions of nodes and edges, and we need tools that can analyze such large graphs efficiently.
Databricks launched GraphFrames, which implements graph queries and pattern matching on top of Spark SQL to ease graph analytics. GraphFrames is a graph library built on DataFrames. It benefits from the scalability and high performance of DataFrames and provides high-level graph-processing APIs in Scala, Java, and Python.
Creating GraphFrames#
We can create a GraphFrame from a vertex DataFrame and an edge DataFrame. The vertex DataFrame should contain a special column named “id” that holds a unique ID for each node in the graph. The edge DataFrame should contain two special columns: “src” (the source node ID of the edge) and “dst” (the destination node ID of the edge). Both DataFrames can have arbitrary additional columns that represent node and edge attributes, for example a name and an age for each node and a relationship type for each edge.
Example 1#
To illustrate, let us consider the sample social network in Fig. 24.
You may also import data from a CSV file or a Parquet file into a DataFrame.
First, we create the nodes and edges as DataFrames.
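The construction step can be sketched as follows. This is a minimal sketch, not the original notebook cell: it assumes an active `SparkSession` named `spark` and an installed `graphframes` package, and the helper names `vertex_ids` and `build_graph` are hypothetical. Because the node attributes of the original example are not reproduced here, the vertex DataFrame carries only the required `id` column, derived from the edge list.

```python
# Edge list of the sample network: (src, dst, relationship).
EDGES = [
    ("a", "b", "friend"), ("a", "d", "friend"), ("k", "a", "family"),
    ("k", "h", "friend"), ("d", "g", "family"), ("d", "h", "family"),
    ("d", "g", "family"), ("e", "d", "family"), ("e", "h", "friend"),
    ("g", "h", "friend"), ("h", "e", "friend"),
]


def vertex_ids(edges):
    """Collect the distinct node IDs that appear as a source or destination."""
    return sorted({node for src, dst, _ in edges for node in (src, dst)})


def build_graph(spark, edges=EDGES):
    """Build a GraphFrame from the edge list (hypothetical helper).

    Assumes `spark` is an active SparkSession and that the graphframes
    package is available on the cluster.
    """
    from graphframes import GraphFrame  # imported lazily; needs Spark

    # Vertex DataFrame: only the mandatory "id" column.
    v = spark.createDataFrame([(i,) for i in vertex_ids(edges)], ["id"])
    # Edge DataFrame: mandatory "src" and "dst" plus one attribute column.
    e = spark.createDataFrame(edges, ["src", "dst", "relationship"])
    return GraphFrame(v, e)
```

With a running session, `g = build_graph(spark)` returns the graph, and `g.vertices.show()` and `g.edges.show()` display its two underlying DataFrames.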
We show here the nodes of our graph and their attributes.
The edges signify the relationship between the nodes.
We can also determine different network metrics, such as the degree, in-degree, and out-degree of each node.
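A sketch of the degree metrics is given below. The GraphFrames properties `degrees`, `inDegrees`, and `outDegrees` each return a DataFrame; the helper names `degree_tables` and `show_degrees` are hypothetical, and the pure-Python function is included only as a small cross-check that can be run without Spark. Both approaches count every row of the edge list, so the repeated edge from `d` to `g` contributes twice.

```python
from collections import Counter

# Edge list of the sample network: (src, dst, relationship).
EDGES = [
    ("a", "b", "friend"), ("a", "d", "friend"), ("k", "a", "family"),
    ("k", "h", "friend"), ("d", "g", "family"), ("d", "h", "family"),
    ("d", "g", "family"), ("e", "d", "family"), ("e", "h", "friend"),
    ("g", "h", "friend"), ("h", "e", "friend"),
]


def degree_tables(edges):
    """Return (in-degree, out-degree, total degree) counters per node."""
    out_deg = Counter(src for src, _, _ in edges)
    in_deg = Counter(dst for _, dst, _ in edges)
    return in_deg, out_deg, in_deg + out_deg


def show_degrees(g):
    """Print the degree tables of a GraphFrame `g` (needs Spark)."""
    g.degrees.show()     # columns: id, degree
    g.inDegrees.show()   # columns: id, inDegree
    g.outDegrees.show()  # columns: id, outDegree


in_deg, out_deg, deg = degree_tables(EDGES)
```

Note that `inDegrees` and `outDegrees` omit nodes whose count is zero, so node `k` (no incoming edges) and node `b` (no outgoing edges) would be missing from the respective tables.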
PageRank#
PageRank is a metric for determining the centrality of nodes in a network. It ranks nodes according to their position in the network. The approach rests on a recursive definition of significance, or centrality: a node is significant if many significant nodes point to it. PageRank was first established for directed networks, since it was used to rank websites based on their hyperlinks; however, it naturally generalizes to undirected and even weighted networks via a random-walk formulation.
PageRank was discussed in the Link Analysis Chapter. Here we just show how to implement it in GraphFrames.
Example 2#
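A possible implementation is sketched below. The GraphFrames method `pageRank` takes a `resetProbability` and either `maxIter` or `tol`, and returns a new GraphFrame whose vertices carry a `pagerank` column. The helper names `pagerank_graphframes` and `pagerank_reference` are hypothetical, and the parameter values are assumptions chosen for illustration. The pure-Python power iteration is included only to make the recursive definition concrete on the sample edge list; it normalizes the scores to sum to 1 and spreads the rank of nodes without outgoing edges uniformly, so its absolute values may differ from those reported by GraphFrames.

```python
from collections import Counter

# Edge list of the sample network: (src, dst, relationship).
EDGES = [
    ("a", "b", "friend"), ("a", "d", "friend"), ("k", "a", "family"),
    ("k", "h", "friend"), ("d", "g", "family"), ("d", "h", "family"),
    ("d", "g", "family"), ("e", "d", "family"), ("e", "h", "friend"),
    ("g", "h", "friend"), ("h", "e", "friend"),
]


def pagerank_graphframes(g, reset_probability=0.15, max_iter=10):
    """Run PageRank on a GraphFrame `g` and rank the nodes (needs Spark)."""
    results = g.pageRank(resetProbability=reset_probability, maxIter=max_iter)
    return results.vertices.select("id", "pagerank").orderBy(
        "pagerank", ascending=False
    )


def pagerank_reference(edges, damping=0.85, iterations=100):
    """Plain power iteration; repeated edges act as edge weights."""
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    out_deg = Counter(src for src, _, _ in edges)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_deg[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each edge passes on an equal share of its source's rank.
        for src, dst, _ in edges:
            new_rank[dst] += damping * rank[src] / out_deg[src]
        rank = new_rank
    return rank


ranks = pagerank_reference(EDGES)
```

With a GraphFrame `g` built from the same edges, `pagerank_graphframes(g).show()` lists the nodes from most to least central.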
Triangle Counting#
Triangle counting is a critical aspect of graph mining. It is needed to calculate two frequently used metrics in complex network analysis: the graph’s transitivity ratio and its clustering coefficient. Triangles have been used effectively in a variety of real-world applications, such as community detection in social networks, detecting spamming behavior, uncovering the hidden thematic structure of the web, and recommending links in online social networks. Additionally, the triangle count is a frequently used network statistic in exponential random graph models. In this section, we will demonstrate how to count triangles in GraphFrames.
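Triangle counting can be sketched as follows. The GraphFrames method `triangleCount` returns a DataFrame of the vertices with an added `count` column giving the number of triangles passing through each node. The helper names `triangle_counts` and `triangles_graphframes` are hypothetical. The pure-Python function is a small cross-check that assumes edge direction, repeated edges, and self-loops are ignored; whether this matches a given GraphFrames version exactly should be verified against its documentation.

```python
from itertools import combinations

# Edge list of the sample network: (src, dst, relationship).
EDGES = [
    ("a", "b", "friend"), ("a", "d", "friend"), ("k", "a", "family"),
    ("k", "h", "friend"), ("d", "g", "family"), ("d", "h", "family"),
    ("d", "g", "family"), ("e", "d", "family"), ("e", "h", "friend"),
    ("g", "h", "friend"), ("h", "e", "friend"),
]


def triangles_graphframes(g):
    """Count triangles per node of a GraphFrame `g` (needs Spark)."""
    return g.triangleCount().select("id", "count")


def triangle_counts(edges):
    """Count the triangles through each node of the undirected simple graph."""
    neighbors = {}
    for src, dst, _ in edges:
        if src != dst:  # ignore self-loops
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)
    counts = {}
    for node, nbrs in neighbors.items():
        # A triangle through `node` is a pair of its neighbors that are
        # themselves connected.
        counts[node] = sum(
            1 for u, v in combinations(sorted(nbrs), 2) if v in neighbors[u]
        )
    return counts


counts = triangle_counts(EDGES)
total_triangles = sum(counts.values()) // 3  # each triangle has three corners
```

On the sample network the cross-check finds two triangles, one through `d`, `g`, `h` and one through `d`, `e`, `h`.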
Exercise#
Apply the code discussed here to analyze a real-world dataset.