{ "cells": [ { "cell_type": "markdown", "id": "57c2064a", "metadata": {}, "source": [ "# Social Network Analysis using GraphFrames " ] }, { "cell_type": "markdown", "id": "c50c32c7", "metadata": {}, "source": [ "Graph analytics offers a wide range of applications such as optimization of network flow and information propagation and fraud and anomaly detection. Because to the advent of social networks and the Internet of Things, we now have massive web-scale graphs with millions to billions of nodes and edges. We need tools to efficiently analyze such large graphs.\n", "\n", "Databricks launched **GraphFrames** which implements graph queries and pattern matching on top of Spark SQL to ease graph analytics. GraphFrames is a graph library built based on DataFrames. It benefits from the scalability and high performance of DataFrames, and provides high-level APIs for graph processing available from Scala, Java, and Python." ] }, { "cell_type": "markdown", "id": "0a405720", "metadata": {}, "source": [ "## Creating GraphFrames\n", "\n", "We can create GraphFrames from vertex and edge DataFrames. A **vertex DataFrame** should contain a special column named \"id\" which enumerates unique IDs for each node in the graph. An **edge DataFrame** should contain two special columns: \"src\" (source node ID of edge) and \"dst\" (destination node ID of edge). Both the vertex and edge DataFrames can have arbitrary other columns which may represent node and edge attributes. These can be the name and age for the node attributes and relationship of the nodes as edge attribute. " ] }, { "cell_type": "code", "execution_count": 1, "id": "1f0ff188", "metadata": {}, "outputs": [], "source": [ "from pyspark import SparkConf, SparkContext\n", "from pyspark.sql import SparkSession" ] }, { "cell_type": "code", "execution_count": 2, "id": "a83ab2b1", "metadata": {}, "outputs": [], "source": [ "conf = (SparkConf()\n", " .setMaster('local[*]')\n", " .set('spark.jars.packages',\n", " 'graphframes:graphframes:0.8.2-spark3.1-s_2.12'))" ] }, { "cell_type": "code", "execution_count": 3, "id": "c2137888", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ ":: loading settings :: url = jar:file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Ivy Default Cache set to: /home/phd/rroxasvillanueva/.ivy2/cache\n", "The jars for the packages stored in: /home/phd/rroxasvillanueva/.ivy2/jars\n", "graphframes#graphframes added as a dependency\n", ":: resolving dependencies :: org.apache.spark#spark-submit-parent-7c5d398e-cdef-40e9-b390-2e134244e177;1.0\n", "\tconfs: [default]\n", "\tfound graphframes#graphframes;0.8.2-spark3.1-s_2.12 in spark-packages\n", "\tfound org.slf4j#slf4j-api;1.7.16 in central\n", ":: resolution report :: resolve 179ms :: artifacts dl 7ms\n", "\t:: modules in use:\n", "\tgraphframes#graphframes;0.8.2-spark3.1-s_2.12 from spark-packages in [default]\n", "\torg.slf4j#slf4j-api;1.7.16 from central in [default]\n", "\t---------------------------------------------------------------------\n", "\t| | modules || artifacts |\n", "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", "\t---------------------------------------------------------------------\n", "\t| default | 2 | 0 | 0 | 0 || 2 | 0 |\n", "\t---------------------------------------------------------------------\n", ":: retrieving :: org.apache.spark#spark-submit-parent-7c5d398e-cdef-40e9-b390-2e134244e177\n", "\tconfs: [default]\n", "\t0 artifacts copied, 2 already retrieved (0kB/7ms)\n", "22/03/30 21:41:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "sc = SparkContext(conf=conf)" ] }, { "cell_type": "code", "execution_count": 4, "id": "155a19a3", "metadata": {}, "outputs": [], "source": [ "spark = SparkSession(sc)" ] }, { "cell_type": "markdown", "id": "08874cc5", "metadata": {}, "source": [ "## Example 1\n", "\n", "To illustrate, let us consider the sample social network in {numref}`socialnet`.\n", "\n", "\n", "```{figure} ./images/socialnet.png\n", ":name: socialnet\n", ":width: 500px\n", "\n", "Sample directed social network\n", "```\n", "You may also import data from a csv-file or a Parquet-file into a DataFrame." ] }, { "cell_type": "markdown", "id": "d9b0ce59", "metadata": {}, "source": [ "First, we create the nodes and edges via dataframes." ] }, { "cell_type": "code", "execution_count": 5, "id": "20ef3b48", "metadata": {}, "outputs": [], "source": [ "# Vertex DataFrame\n", "v = spark.createDataFrame([\n", " (\"a\", \"Amihan\", 19),\n", " (\"b\", \"Bagwis\", 25),\n", " (\"k\", \"Kidlat\", 39),\n", " (\"d\", \"Danaya\", 18),\n", " (\"e\", \"Elias\", 50),\n", " (\"g\", \"Gloria\", 38),\n", " (\"h\", \"Hiraya\", 35)], [\"id\", \"name\", \"age\"])\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "b13ce239", "metadata": {}, "outputs": [], "source": [ "# Edge DataFrame\n", "e = spark.createDataFrame([\n", " (\"a\", \"b\", \"friend\"),\n", " (\"a\", \"d\", \"friend\"),\n", " (\"k\", \"a\", \"family\"),\n", " (\"k\", \"h\", \"friend\"),\n", " (\"d\", \"g\", \"family\"),\n", " (\"d\", \"h\", \"family\"),\n", " (\"d\", \"g\", \"family\"),\n", " (\"e\", \"d\", \"family\"),\n", " (\"e\", \"h\", \"friend\"),\n", " (\"g\", \"h\", \"friend\"),\n", " (\"h\", \"e\", \"friend\")], [\"src\", \"dst\", \"relationship\"])" ] }, { "cell_type": "code", "execution_count": 7, "id": "146263e3", "metadata": {}, "outputs": [], "source": [ "from graphframes import *\n", "\n", "# Create a GraphFrame\n", "g = GraphFrame(v, e)" ] }, { "cell_type": "markdown", "id": "36554e89", "metadata": {}, "source": [ "We show here the nodes of our graph and its attributes." ] }, { "cell_type": "code", "execution_count": 8, "id": "8474de34", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 0:> (0 + 1) / 1]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+---+------+---+\n", "| id| name|age|\n", "+---+------+---+\n", "| a|Amihan| 19|\n", "| b|Bagwis| 25|\n", "| k|Kidlat| 39|\n", "| d|Danaya| 18|\n", "| e| Elias| 50|\n", "| g|Gloria| 38|\n", "| h|Hiraya| 35|\n", "+---+------+---+\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", " \r" ] } ], "source": [ "g.vertices.show()" ] }, { "cell_type": "markdown", "id": "de820001", "metadata": {}, "source": [ "The edges signify the relationship between the nodes." ] }, { "cell_type": "code", "execution_count": 9, "id": "39c15614", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---+---+------------+\n", "|src|dst|relationship|\n", "+---+---+------------+\n", "| a| b| friend|\n", "| a| d| friend|\n", "| k| a| family|\n", "| k| h| friend|\n", "| d| g| family|\n", "| d| h| family|\n", "| d| g| family|\n", "| e| d| family|\n", "| e| h| friend|\n", "| g| h| friend|\n", "| h| e| friend|\n", "+---+---+------------+\n", "\n" ] } ], "source": [ "g.edges.show()" ] }, { "cell_type": "markdown", "id": "f0e84bcc", "metadata": {}, "source": [ "We can also determine different network metrics, such as degree, in- and out-degree. " ] }, { "cell_type": "code", "execution_count": 10, "id": "305cc8a6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+---+------+\n", "| id|degree|\n", "+---+------+\n", "| g| 3|\n", "| k| 2|\n", "| e| 3|\n", "| h| 5|\n", "| d| 5|\n", "| b| 1|\n", "| a| 3|\n", "+---+------+\n", "\n" ] } ], "source": [ "g.degrees.show()" ] }, { "cell_type": "code", "execution_count": 11, "id": "016fec13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---+--------+\n", "| id|inDegree|\n", "+---+--------+\n", "| g| 2|\n", "| e| 1|\n", "| h| 4|\n", "| d| 2|\n", "| b| 1|\n", "| a| 1|\n", "+---+--------+\n", "\n" ] } ], "source": [ "g.inDegrees.show()" ] }, { "cell_type": "code", "execution_count": 12, "id": "0c610bf7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---+---------+\n", "| id|outDegree|\n", "+---+---------+\n", "| g| 1|\n", "| k| 2|\n", "| e| 2|\n", "| h| 1|\n", "| d| 3|\n", "| a| 2|\n", "+---+---------+\n", "\n" ] } ], "source": [ "g.outDegrees.show()" ] }, { "cell_type": "markdown", "id": "92010d0b", "metadata": {}, "source": [ "## Page Rank\n", "\n", "PageRank is a metric for determining the centrality of nodes in a network. It ranks nodes according to their network placements. The strategy presupposes a recursive definition of significance or centrality: Numerous significant nodes point to nodes that are themselves significant. PageRank was first established for directed networks since it was used to rank websites based on their hyperlinks; however, it naturally generalizes to undirected and even weighted networks via a random-walk formulation. \n", "\n", "PageRank was discussed in the Link Analysis Chapter. Here we just show how to implement it in GraphFrames. " ] }, { "cell_type": "markdown", "id": "4f9ae37b", "metadata": {}, "source": [ "## Example 2\n", "\n", "PageRank was discussed in the Link Analysis Chapter. Here we just show how to implement it in GraphFrames. " ] }, { "cell_type": "code", "execution_count": 13, "id": "1b335f76", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+---+------+---+------------------+\n", "| id| name|age| pagerank|\n", "+---+------+---+------------------+\n", "| g|Gloria| 38| 0.854458432231672|\n", "| h|Hiraya| 35|2.1591976714911048|\n", "| b|Bagwis| 25|0.3167021972103069|\n", "| e| Elias| 50| 2.017276721802774|\n", "| a|Amihan| 19|0.2810747410040871|\n", "| k|Kidlat| 39|0.1972454322835699|\n", "| d|Danaya| 18|1.1740448039764855|\n", "+---+------+---+------------------+\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+---+---+------------+------------------+\n", "|src|dst|relationship| weight|\n", "+---+---+------------+------------------+\n", "| a| b| friend| 0.5|\n", "| g| h| friend| 1.0|\n", "| d| g| family|0.3333333333333333|\n", "| d| g| family|0.3333333333333333|\n", "| d| g| family|0.3333333333333333|\n", "| d| g| family|0.3333333333333333|\n", "| k| a| family| 0.5|\n", "| e| h| friend| 0.5|\n", "| e| d| family| 0.5|\n", "| a| d| friend| 0.5|\n", "| d| h| family|0.3333333333333333|\n", "| k| h| friend| 0.5|\n", "| h| e| friend| 1.0|\n", "+---+---+------------+------------------+\n", "\n" ] } ], "source": [ "pr = g.pageRank(resetProbability=0.15, tol=0.01)\n", "## look at the pagerank score for every vertex\n", "pr.vertices.show()\n", "## look at the weight of every edge\n", "pr.edges.show()" ] }, { "cell_type": "markdown", "id": "7f0a8f66", "metadata": {}, "source": [ "## Triangle Counting\n", "\n", "Triangle counting is a critical aspect of graph mining. It is needed in calculating two frequently used metrics in complex network analysis, the graph's transitivity ratio and clustering coefficient. Triangles have been effectively used in a variety of real-world applications, like community detection in social networks, detecting spamming behavior, revealing the web's hidden theme organization, and recommending links in online social networks. Additionally, the triangle count is a frequently used network statistic in models of exponential random graphs. In this section, we will demonstrate how to count triangles in GraphFrames." ] }, { "cell_type": "code", "execution_count": 14, "id": "e297131c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+-----+---+------+---+\n", "|count| id| name|age|\n", "+-----+---+------+---+\n", "| 1| g|Gloria| 38|\n", "| 0| k|Kidlat| 39|\n", "| 1| e| Elias| 50|\n", "| 2| h|Hiraya| 35|\n", "| 2| d|Danaya| 18|\n", "| 0| b|Bagwis| 25|\n", "| 0| a|Amihan| 19|\n", "+-----+---+------+---+\n", "\n" ] } ], "source": [ "g.triangleCount().show()" ] }, { "cell_type": "markdown", "id": "3b9e9cc7", "metadata": {}, "source": [ "## Exercise\n", "\n", "Implement the codes discussed here to analyze real-world data." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }