Community Detection in Graphs, by Santo Fortunato. Presented by: Yiye Ruan, Monadhika Sharma, Yu-Keng Shih. Outline: Sec. 1~5, 9: Yiye; Sec. 6~8: Monadhika; Sec. 11~13, 15: Yu-Keng; Sec. 17: All (17.1: Yu-Keng; 17.2: Yiye and Monadhika)

Graphs from the Real World Königsberg's Bridges Ref: http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg Graphs from the Real World Zachary's Karate Club; Lusseau's network of bottlenose dolphins Graphs from the Real World Webpage Hyperlink Graph; Directed Communities; Network of Word Associations

Overlapping Communities Real Networks Are Not Random Degree distribution is broad, and often has a heavy tail following a power-law distribution Ref: Plot of power-law degree distribution on log-log scale. From Math Insight. http://mathinsight.org/image/power_law_degree_distribution_scatter Real Networks Are Not Random Edge distribution is locally inhomogeneous: Community Structure!

Applications of Community Detection Website mirror server assignment; recommendation systems; social network role detection; functional modules in biological networks; graph coarsening and summarization; network hierarchy inference General Challenges

Structural clusters can only be identified if graphs are sparse (i.e., the number of edges m is of the order of the number of vertices n). Many clustering problems are NP-hard; even polynomial-time approaches may be too expensive. Motivation for graph sampling/sparsification; call for scalable solutions. Concepts of cluster, community are not

quantitatively well defined. Discussed in more detail below. Defining Communities (Sec. 3) Intuition: there are more edges inside a community than edges connecting it with the rest of the graph. Terminology: graph G has n vertices and m edges; subgraph C has n_c vertices and m_c edges. k_v^int, k_v^ext: internal and external degrees of vertex v. k_C^int, k_C^ext: internal and external degrees of C. delta_int(C) = m_c / [n_c (n_c - 1) / 2]: intra-cluster density

delta_ext(C) = (number of inter-cluster edges) / [n_c (n - n_c)]: inter-cluster density. Defining Communities (Sec. 3) Local definitions: focus on the subgraph only. Clique: vertices are all adjacent to each other; a strict definition, and finding the maximum clique is NP-complete. Relaxations: n-clique, n-clan, n-club, k-plex. k-core: maximal subgraph in which each vertex is adjacent to at least k other vertices of the subgraph
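The k-core definition is directly computable by iterative peeling: repeatedly remove vertices whose degree within the surviving subgraph is below k. A minimal sketch in pure Python (the function name and example graph are illustrative, not from the survey):

```python
def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph.

    adj: dict mapping each vertex to a set of neighbors.
    Peel off vertices with fewer than k surviving neighbors until
    no more removals are possible.
    """
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # degree counted only inside the surviving subgraph
            if len(adj[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive

# Example: a triangle with a pendant path attached.
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5},
    5: {4},
}
print(k_core(graph, 2))  # only the triangle {1, 2, 3} survives
```

Removing vertex 5 drops vertex 4 below degree 2, so the peeling cascades until only the triangle remains; the 3-core is empty, since no vertex has three neighbors inside the triangle.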

LS-set (strong community): every vertex has more internal than external connections, k_v^int(C) > k_v^ext(C) for all v in C. Weak community: the total internal degree exceeds the total external degree, k_C^int > k_C^ext. Fitness measures: intra-cluster density, cut size, etc. Image ref: Zahoránszky, László A., et al. "Breaking the hierarchy - a new cluster selection mechanism for hierarchical clustering methods." Algorithms for Molecular Biology 4. Zhao, Jing, et al. "Insights into the pathogenesis of axial spondyloarthropathy from network and pathway analysis." BMC Systems Biology 6, Suppl 1 (2012): S4. Defining Communities (Sec. 3) Global definition: with respect to the whole graph. Null

model: a random graph in which some structural properties match the original graph. Intuition: a subgraph is a community if the number of internal edges exceeds the expectation over all realizations of the null model (this leads to Modularity). Defining Communities (Sec. 3) Vertex similarity-based: embed vertices into an n-dimensional space. Euclidean distance: d_ij = sqrt(sum_k (x_ik - x_jk)^2)

Cosine similarity: rho_ij = (x_i . x_j) / (|x_i| |x_j|). Similarity from adjacency relationships: distance between neighbor lists, d_ij = sqrt(sum_{k != i,j} (A_ik - A_jk)^2); neighborhood overlap, omega_ij = |Gamma(i) ∩ Gamma(j)| / |Gamma(i) ∪ Gamma(j)|; Pearson correlation coefficient between rows of the adjacency matrix. Evaluating Community Quality (Sec. 3)

Quality measures let us compare the goodness of extracted communities, whether extracted by different algorithms or the same one. Examples: performance, coverage. Normalized cut (n-cut): n-cut(C) = c_C / (2 m_c + c_C), where c_C is the cut size and m_c the number of internal edges of C. Conductance: Phi(C) = c_C / min(vol(C), vol(V \ C)). Evaluating Community Quality (Sec. 3) Modularity Intuition: a subgraph is a community if the number of internal edges exceeds the expectation over all realizations of the null model. Definition: Q = (1 / 2m) sum_ij (A_ij - P_ij) delta(C_i, C_j)

P_ij: expected number of edges between i and j in the null model. Bernoulli random graph: P_ij = p = 2m / [n (n - 1)]. Evaluating Community Quality (Sec. 3) Modularity Null model whose degree distribution matches the original degrees (configuration model): P_ij = k_i k_j / (2m)
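With the degree-matching null model, P_ij = k_i k_j / (2m), Q can be computed directly from an edge list. A minimal pure-Python sketch (function and variable names are ours):

```python
from collections import defaultdict

def modularity(edges, membership):
    """Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) * delta(C_i, C_j)
    for an undirected, unweighted graph given as an edge list.
    Uses the equivalent per-community form: sum_c [m_c/m - (d_c/2m)^2],
    where m_c = internal edges and d_c = total degree of community c.
    """
    m = len(edges)
    deg = defaultdict(int)
    internal = defaultdict(int)          # m_c per community
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    deg_sum = defaultdict(int)           # d_c per community
    for node, k in deg.items():
        deg_sum[membership[node]] += k
    return sum(internal[c] / m - (deg_sum[c] / (2 * m)) ** 2
               for c in deg_sum)

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
q_split = modularity(edges, part)  # 5/14, about 0.357
```

Splitting at the bridge scores Q = 5/14, while treating the whole graph as one community scores 0, in line with the range of Q discussed in the slides.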

Evaluating Community Quality (Sec. 3) Modularity Range: Q = 0 if we treat the whole graph as one community; Q < 0 if each vertex is its own community (Q is always smaller than 1 and can be negative). Traditional Methods (Sec. 4) Graph Partitioning Dividing

vertices into groups of predefined size. Kernighan-Lin algorithm: create an initial bisection; iteratively swap subsets containing equal numbers of vertices; select the partition that maximizes (number of edges inside modules) - (cut size). Traditional Methods (Sec. 4) Graph Partitioning METIS

(Karypis and Kumar) Multi-level approach: coarsen the graph into a skeleton; perform K-L and other heuristics on the skeleton; project back with local refinement. Traditional Methods (Sec. 4)

Hierarchical Clustering Graphs may have a hierarchical structure. Traditional Methods (Sec. 4) Hierarchical Clustering Find clusters using a similarity matrix. Agglomerative: clusters are iteratively merged if their similarity is sufficiently high. Divisive: clusters are iteratively split by removing

edges with low similarity. Defining similarity between clusters: single linkage (minimum element), complete linkage (maximum element), average linkage. Drawback: dependent on the similarity threshold. Traditional Methods (Sec. 4) Partitional Clustering

Embed vertices in a metric space, and find the clustering that optimizes the cost function: minimum k-clustering, k-clustering sum, k-center, k-median, k-means, fuzzy k-means. Traditional Methods (Sec. 4) Spectral Clustering Un-normalized

Laplacian: L = D - A. The number of connected components equals the number of zero eigenvalues. Normalized variants: L_sym = D^{-1/2} L D^{-1/2}, L_rw = D^{-1} L. Traditional Methods (Sec. 4) Spectral Clustering Compute the Laplacian matrix; transform graph vertices into points whose coordinates are elements of eigenvectors

Cluster properties become more evident; cluster vertices in the new metric space. Complexity: approximate algorithms exist for computing a small number of eigenvectors; performance depends on the size of the eigengap. Traditional Methods (Sec. 4) Graph Partitioning Spectral

bisection: minimize the cut size R = (1/4) s^T L s, where L is the graph Laplacian matrix and s is the indicator vector (s_i = +1 or -1). Approximate solution using the eigenvector of the second-smallest eigenvalue of L (Fiedler vector). Drawback: have to specify the number of groups or the group size. Ref: http://www.cs.berkeley.edu/~demmel/cs267/lecture20/lecture20.html Divisive Algorithms (Sec. 5)

Girvan and Newman's edge centrality algorithm: iteratively remove edges with high centrality and recompute the values. Definitions of edge centrality: edge betweenness: number of all-pair shortest paths that run along an edge; random-walk betweenness: probability of a random walker passing the edge; current-flow betweenness: current passing the edge in a unit-resistance network
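For unweighted graphs, edge betweenness can be computed with a Brandes-style accumulation over BFS shortest-path trees. A pure-Python sketch (function and variable names are ours, not from the survey):

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph.
    adj: dict vertex -> iterable of neighbors.
    Returns {frozenset({u, v}): score}, counting each (s, t) pair once.
    """
    eb = defaultdict(float)
    for s in adj:
        # BFS from s, recording shortest-path counts and predecessors
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate edge dependencies in reverse BFS order
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                eb[frozenset((v, w))] += c
                delta[v] += c
    for e in eb:          # each pair was counted from both endpoints
        eb[e] /= 2
    return dict(eb)

# Two triangles joined by a bridge: the bridge carries all cross traffic.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = edge_betweenness(adj)
# the bridge edge {2, 3} lies on every cross-community shortest path
```

All 9 vertex pairs spanning the two triangles route through the bridge, so it scores 9 and would be removed first by the Girvan-Newman procedure.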

Drawback: at least O(m^2 n) complexity. Statistical Inference (Sec. 9) Generative Models Observation: graph structure (A). Parameters: assumptions of the model (theta). Hidden information: community assignment (g). Maximize the likelihood P(A | theta, g). Statistical Inference (Sec. 9) Generative Models Hastings:

Given the planted partition model with p_in (intra-group link probability) and p_out (inter-group link probability), find the most likely assignment. Statistical Inference (Sec. 9) Generative Models Newman and Leicht: mixture model for a directed graph; given the graph, infer

pi_r (fraction of vertices belonging to group r), theta_rj (probability of a directed edge from a vertex of group r to vertex j), and q_ir (probability that vertex i is assigned to group r). Iterative EM updates (k_i^out is the out-degree of vertex i). Can find overlapping communities
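These EM updates can be sketched in pure Python. The following is an illustrative implementation of the mixture model, with the E-step done in log space; the function name, defaults, and example graph are our own:

```python
import math, random

def newman_leicht_em(A, c, iters=50, seed=0):
    """EM for a Newman-Leicht-style mixture model on a directed graph.
    A: 0/1 adjacency matrix (list of lists); c: number of groups.
    M-step: pi_r = sum_i q_ir / n; theta_rj = sum_i q_ir A_ij / sum_i q_ir k_i.
    E-step: q_ir proportional to pi_r * prod_j theta_rj^{A_ij}.
    Returns (pi, theta, q, log-likelihood trace).
    """
    rng = random.Random(seed)
    n = len(A)
    kout = [sum(row) for row in A]
    q = [[rng.random() for _ in range(c)] for _ in range(n)]
    for row in q:                      # normalize the random soft start
        s = sum(row)
        for r in range(c):
            row[r] /= s
    loglikes = []
    for _ in range(iters):
        # M-step
        pi = [sum(q[i][r] for i in range(n)) / n for r in range(c)]
        theta = [[sum(q[i][r] * A[i][j] for i in range(n)) /
                  sum(q[i][r] * kout[i] for i in range(n))
                  for j in range(n)] for r in range(c)]
        # E-step, in log space for numerical safety
        ll = 0.0
        for i in range(n):
            logs = [math.log(pi[r]) +
                    sum(math.log(theta[r][j])
                        for j in range(n) if A[i][j])
                    for r in range(c)]
            mx = max(logs)
            weights = [math.exp(x - mx) for x in logs]
            z = sum(weights)
            q[i] = [w / z for w in weights]
            ll += mx + math.log(z)
        loglikes.append(ll)
    return pi, theta, q, loglikes

# Two triangles joined by a bridge, as a bidirectional adjacency matrix.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
pi, theta, q, lls = newman_leicht_em(A, c=2)
```

The EM guarantee shows up directly: the log-likelihood trace is non-decreasing, and the mixture weights pi and each row of theta sum to 1 by construction.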

Statistical Inference (Sec. 9) Generative Models Hofman and Wiggins: Bayesian planted partition model. Assume p_in and p_out have Beta priors and the group sizes a Dirichlet prior. Maximize the conditional probability. No

need to specify the number of clusters. Signed Networks Edges represent both positive and negative relations/interactions between vertices. Examples: like/dislike functions, member voting, etc. Theories Structural balance: triangles with three positive edges, or with exactly one positive edge, are the more likely configurations. Social

status: the creator of a positive link considers the recipient as having higher status. Signed Networks Leskovec, Huttenlocher, Kleinberg: compare the actual counts of triangles of different configurations with their expectations. Findings: when networks are viewed as undirected, there is strong support for a weaker version of balance theory. Fewer-than-expected

triangles with two positive edges; over-represented triangles with three positive edges. When networks are viewed as directed, the results follow status theory better. Modularity-based Methods - by Monadhika Sharma What is Modularity A quality function to assess the goodness of a certain partition. Based on the paper by Newman and Girvan. It builds on the idea that a random graph is not expected to have cluster structure, and measures the strength of the division of a

network into modules. Modularity is the fraction of edges that fall within the given groups minus the expected such fraction if edges were distributed at random. Modularity-based Methods Try to maximize modularity: finding the best value of Q is NP-hard, hence we use heuristics. 1. Greedy technique: an agglomerative hierarchical clustering method; groups of vertices are successively joined to form larger communities

such that modularity increases after the merging. 2. Simulated annealing: a probabilistic procedure for global optimization; an exploration of the space of possible states, looking for the global optimum of a function F (say, its maximum). Accept a transition with probability 1 if F increases, else with probability exp(beta * delta_F). 3. Extremal optimization: evolves a single solution and makes local modifications to the worst components. Uses a fitness value, as in genetic algorithms

At each iteration, the vertex with the lowest fitness is shifted to the other cluster; the partition changes, so fitness values are recalculated, until we reach an optimal Q value. SPECTRAL ALGORITHMS Spectral properties of graph matrices are frequently used to find partitions: properties of a graph in relation to the characteristic polynomial, eigenvalues, and eigenvectors of matrices associated with the graph, such as its adjacency matrix or Laplacian matrix

1. Spin models A system of spins that can be in q different states. The interaction is ferromagnetic, i.e. it favors spin alignment; interactions are between neighboring spins. Potts spin variables are assigned to the vertices of a graph with community structure. The Hamiltonian of the model, i.e. its

energy: H = -J sum_{i<j} A_ij delta(sigma_i, sigma_j). 2. Random walk A random walker spends a long time inside a community due to the high density of internal edges. E.g.: Zhou used random walks to define a distance between pairs of vertices: the distance between i and j is the average number of edges that a random walker has to cross to reach j starting from i. 3. Synchronization In a synchronized state, the units of the

system are in the same or similar state(s) at every time. Oscillators in the same community synchronize first, whereas full synchronization requires a longer time. First studies used Kuramoto oscillators, which are coupled two-dimensional vectors, each with a proper frequency of oscillation. 3. Synchronization d(theta_i)/dt = omega_i + (K/N) sum_j sin(theta_j - theta_i), where theta_i is the phase of oscillator i, omega_i its natural frequency, K the coupling coefficient, and the sum runs over all oscillators
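A minimal Euler-integration sketch of the all-to-all Kuramoto dynamics, with the order parameter r measuring phase synchrony (r = 1 means full sync); the function names, step sizes, and example frequencies are our own choices:

```python
import math, random

def kuramoto(omega, K, dt=0.01, steps=2000, seed=1):
    """Euler integration of the all-to-all Kuramoto model
    d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i).
    Starts from random phases; returns the final phases.
    """
    rng = random.Random(seed)
    n = len(omega)
    theta = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    for _ in range(steps):
        dtheta = [omega[i] + (K / n) * sum(math.sin(theta[j] - theta[i])
                                           for j in range(n))
                  for i in range(n)]
        theta = [t + dt * d for t, d in zip(theta, dtheta)]
    return theta

def order_parameter(theta):
    """r = |(1/N) sum_j exp(i*theta_j)|: 1 means full phase synchrony."""
    n = len(theta)
    re = sum(math.cos(t) for t in theta) / n
    im = sum(math.sin(t) for t in theta) / n
    return math.hypot(re, im)

# Strong coupling, narrow frequency spread: the population phase-locks.
omega = [-0.1, -0.05, 0.0, 0.05, 0.1]
r = order_parameter(kuramoto(omega, K=2.0))
```

With coupling well above the critical value the oscillators lock and r approaches 1; restricting the coupling sum to graph neighbors gives the community-detection variant, where densely connected modules synchronize first.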

Overlapping community detection Most previous methods can only generate non-overlapping clusters, where a node belongs to exactly one community. This is not realistic in many scenarios: a person usually belongs to multiple communities.

Most current overlapping community detection algorithms can be categorized into three groups, mainly built on non-overlapping community algorithms. Overlapping community detection 1. Identifying bridge nodes First, identify bridge nodes and remove or duplicate these nodes.

Duplicated nodes keep a connection between them. Then apply a hard clustering algorithm. If bridge nodes were removed, add them back. E.g. DECAFF [Li2007], Peacock [Gregory2009]. Cons: only a small fraction of nodes can be identified as bridge nodes.

Overlapping community detection 2. Line graph transformation Edges become nodes. New

nodes are connected if their original edges share an endpoint. Then apply a hard clustering algorithm on the line graph. E.g. LinkCommunity [Ahn2010]. Cons: an edge can only belong to one cluster
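The transformation itself is only a few lines; a pure-Python sketch (names ours), after which any hard clustering algorithm can run on the returned adjacency:

```python
from itertools import combinations

def line_graph(edges):
    """Build the line graph of an undirected graph: each original edge
    becomes a node; two edge-nodes are adjacent iff the original edges
    share an endpoint."""
    nodes = [frozenset(e) for e in edges]
    adjacency = {e: set() for e in nodes}
    for e1, e2 in combinations(nodes, 2):
        if e1 & e2:  # the two edges share an endpoint
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency

# A triangle's line graph is again a triangle.
tri = [(0, 1), (1, 2), (0, 2)]
lg = line_graph(tri)
```

Clustering the edge-nodes and mapping each cluster back to the endpoints of its edges yields overlapping vertex communities: a vertex whose edges land in different clusters belongs to several communities.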


Overlapping community detection 3. Local clustering (Optionally) select seed nodes; expand each seed node according to some criterion.

E.g. ClusterOne [Nepusz2012], MCODE [Bader2003], CPM [Adamcsek2006], RRW [Macropol2009]. Cons: does not globally consider the topology.

Multi-resolution methods Many graphs have a hierarchical cluster structure. Multi-resolution / Hierarchical methods

Most previous methods can only generate a clustering at a fixed resolution (average cluster size). Multi-resolution methods Clusters might be hierarchical, or users might be interested in different resolutions; produce clusterings with different average cluster sizes. Hierarchical Clustering

Produce a dendrogram showing the hierarchical clusters. Multi-resolution methods Have a parameter to change the average cluster size. Pons (2006) and Arenas et al. (2008) introduce a new parameter into the modularity objective function.

Lancichinetti et al. (2009) designed a fitness function to detect overlapping clusters at multiple resolutions. Pros: good for clusters w/o hierarchy. Cons: need to rerun the algorithm to generate each resolution. Hierarchical Methods

Sales-Pardo et al. (2007) propose a top-down approach that can iteratively determine whether a graph has 0/1/2+ communities. Some nodes can belong to no cluster, corresponding to the real situation. Pros: helps understand the hierarchy among clusters.

Cons: hard to evaluate the dendrogram. Dynamic community Cluster each snapshot independently, then map clusters across clusterings. If two clusters in consecutive snapshots share most of

their nodes, then the later one evolves from the earlier one. Detect the evolution of communities in a dynamic graph: birth, death, growth, contraction, merge, split.
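The snapshot-matching rule (two clusters in consecutive snapshots that share most of their nodes are linked) can be sketched with Jaccard overlap; the function name and the 0.5 threshold are our own choices:

```python
def match_clusters(prev, curr, threshold=0.5):
    """Map clusters across consecutive snapshots by node overlap.
    prev, curr: dicts cluster_id -> set of nodes. A current cluster is
    treated as the evolution of the previous cluster with the highest
    Jaccard similarity, if that similarity reaches `threshold`."""
    evolution = {}
    for cid, nodes in curr.items():
        best, best_sim = None, 0.0
        for pid, pnodes in prev.items():
            sim = len(nodes & pnodes) / len(nodes | pnodes)
            if sim > best_sim:
                best, best_sim = pid, sim
        if best is not None and best_sim >= threshold:
            evolution[cid] = best   # continuation / growth / contraction
        else:
            evolution[cid] = None   # birth of a new community
    return evolution

prev = {'A': {1, 2, 3, 4}, 'B': {5, 6, 7}}
curr = {'X': {1, 2, 3, 4, 8}, 'Y': {9, 10}}
print(match_clusters(prev, curr))  # {'X': 'A', 'Y': None}
```

Here X grows out of A (Jaccard 4/5), while Y matches nothing and is recorded as a birth; merges and splits show up as several clusters mapping to, or from, the same partner.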

Asur et al. (2007) further detect events involving nodes, e.g. join and leave, and measure node behavior. Sociability: how frequently a node joins and leaves communities.

Influence: how much a node can influence other nodes' activities. Usage: understand community behavior (e.g. a community's age is positively correlated with its size); predict the evolution of a community. Dynamic community detection

Hypothesis: communities in dynamic graphs are smooth, so detect communities by also considering the previous snapshots. Chakrabarti et al. (2006) introduce a history cost, which measures the dissimilarity between clusterings at consecutive timestamps.

A smooth clustering has a lower history cost; add this cost to the objective function. Testing algorithms 1. Real data w/o gold standards 2. Real data w/ gold standard

3. Synthetic data. It is hard to say which algorithm is the best; in different scenarios, different algorithms might be the best choice. Options 1 and 2 are practical, but it is hard to determine which kinds of graphs/clusters an algorithm is suited to:

sparse/dense, power-law degree distributions, overlapping communities, etc. Real data w/o gold standards Almeida et al. (2011) discuss many metrics: modularity, normalized cut, Silhouette Index, conductance, etc. Each metric has its own bias.

Modularity and conductance are biased toward a small number of clusters. One should not choose an algorithm designed to optimize the evaluation metric itself, e.g. a modularity-based method evaluated by modularity. Real data w/ gold standard Examples of gold-standard clusters: network tags in Facebook,

article tags in Wikipedia, protein annotations. Evaluate how closely the clusters match the gold standard. Cons: overfitting; biased towards clusterings with similar cluster sizes.

Cons: the gold standard might be noisy or incomplete. Metrics F-measure: harmonic mean of precision and recall

Needs a matching-threshold parameter (usually 0.25). Accuracy: square root of PPV * Sn, computed from T_ij, the number of common nodes between community i and cluster j. Metrics

Normalized Mutual Information: I(X, Y) = H(X) - H(X|Y), where H(X) is the entropy of X and H(X|Y) is the conditional entropy. Some metrics need to be adjusted for overlapping clusterings. Synthetic data

Girvan and Newman (2002) benchmark: fixed 128 nodes and 4 communities; the noise level can be tuned. Cons: all nodes have the same expected degree, all communities have the same size, etc. Synthetic data

LFR (Lancichinetti 2009) generates power-law, weighted/unweighted, directed/undirected graphs with a gold standard. Pros: can generate various graphs: # nodes, average degree, power-law exponent,

average/min/max community size, # bridge nodes, noise level, etc. Cons: the number of communities each bridge node belongs to is fixed. Use the above metrics to evaluate the results. Biological Application

Protein-protein interaction (PPI) networks Node: protein; edge: interaction; edge weight: confidence level of an interaction. Interacting proteins are likely to have the same function. Community: protein complex or functional module

(Gene Ontology terms, etc.). Usage: predict the functions of each protein, since biologically examining each protein is expensive; improve drug design, etc. PPI networks

Usually thousands of nodes; each dataset and organism has a different network; average degree 5~10; power-law degree distribution. PPI sub-network example

Biological Application Clustering must be overlapping. Protein function is hierarchical, but a large function might not form a community. The gold standard is far from complete:

a protein has many functions, and yeast is the most annotated organism. PPIs are very noisy (false positives and false negatives); better to integrate more evidence, e.g. sequence or gene expression profiles.

Applications (Sec. 17) Social Networks The Belgian phone call network distinguishes the French- and Dutch-speaking populations. Applications (Sec. 17) Social Networks University students' Facebook network (left) and corresponding dorm affiliation (right)

Applications (Sec. 17) Other Networks Map of science derived from a citation network