Loading [MathJax]/jax/output/CommonHTML/jax.js
World Scientific
Skip main navigation

Cookies Notification

We use cookies on this site to enhance your user experience. By continuing to browse the site, you consent to the use of our cookies. Learn More
×

System Upgrade on Tue, May 28th, 2024 at 2am (EDT)

Existing users will be able to log into the site and access content. However, E-commerce and registration of new users may not be available for up to 12 hours.
For online purchase, please visit us again. Contact us at customercare@wspc.com for any enquiries.

Community Detection Methods Based on Exploiting Attributes and Interactions on Social Networks: A Survey and Future Directions

    https://doi.org/10.1142/S2196888824300011Cited by:0 (Source: Crossref)

    Abstract

    In recent years, community detection in social network analysis has been a hot research topic that has attracted much more attention in the scientific community and a large number of articles have already been published in the literature. Community detection problems focus on determining the subgraphs of the network that have dense edges and whose nodes are expected to have similar features. Community detection has significant applications in different domains of social network analysis such as sociology, business, criminal detection and recommendation systems. The computational complexity is hampered by the huge size and the dynamic nature of social networks. This paper introduces a comprehensive survey of community detection methods in social networks based on mining attributes and interactions of users. We divide the community detection approaches into three basic categories, namely (1) node content-based approaches, (2) edge content-based approaches, and (3) node- and edge content-based approaches. We also discuss in detail the main idea of each method that is suitable for social network data. Furthermore, to promote the study of community detection, we introduce the benchmark datasets from available sources. Last, we present further research directions as well as some open challenges for the community detection problem in social networks.

    1. Introduction

    A social network is a structure made up of actors, which are mainly human individuals, and interactions among them. Social networks facilitate different communication methods and represent a diversity of social interaction types. They create a favorable environment for users to interact and share news. In general, a social network is a set of actors interconnected via relationships created during online interactions.1 It can be presented as a graph in which the nodes represent the social actor and the edges represent their social relationship.1,2 In the networks, the community structure forms the groups of nodes having dense intra-connections, and sparse inter-connections. Community detection in social networks intensively has been studied in recent years.3,4,5

    In general, communities are categorized into two categories: disjoint and overlapping communities.1,6 In disjoint communities, the actor is assigned to a unique community. However, a user in social networks may belong to more than one group such as family, best friends and colleagues. The actors associate across different communities which are called overlapping communities.7,8 The community detection approach can be classified into two categories: structure-based community indicating the nodes in a cluster of a graph have more relationships with each other than other nodes and semantic-based community indicating a cluster of nodes having similar semantic context or a set of nodes in a cluster having the same interests. Semantic-based community detection mines both the context and relationship of the nodes therefore the discovered community could represent the coherence more efficiently.9,10,11,12 The heterogeneity of data, diversity structure, and social network size pose growing challenges for community detection problems. Dynamic data are also a problem that makes community detection problems even more complicated.13,14,15 These limitations have motivated growing interest in algorithms for large-scale networks.

    Community detection is a valuable problem for different domains of social network analysis such as sociology, business, criminal detection and recommendation systems.2 Identifying communities from social networks helps to understand the user’s behaviors and relationships between members of the community. A community is a group of nodes that have links with each other and limited associations with the remaining network. Detecting communities of the networks provides important knowledge about the structural properties of the networks. The structural properties have significant applications in different domains such as viral marketing, and modeling the epidemic spread.2 There exists a wide variety of community detection algorithms in the literature. Table 1 presents a summary of the existing surveys on community detection. The articles mainly focused on summarizing the methods and categorizing the algorithms. Some articles focused on comparing models and giving the pros and cons in various use cases. In the rest of this section, we will summarize the main contributions of the existing surveys on community detection in the literature.

    Table 1. Recently existing surveys on community detection problems.

    AuthorsYearFocus onAlgorithm categories
    Kim et al.162015Community detection algorithms in multi-layer graphs of interactionsCluster expansion, matrix factorization, unified distance, model-based, pattern mining, and graph merging
    Jonnalagadda et al.172016The game theory-based community detection algorithmsNon-cooperative game theory and cooperative game theory
    Fani et al.182017Community detection methods in social networksLink-based and content-based
    Zhang et al.192018Algorithms of community detection in complex networksModularity-based, clique percolation-based, label propagation-based, and hierarchical partitioning-based
    Javed et al.202018Popular community detection algorithmsDisjoint communities and overlapping communities
    Chen et al.212019Overlapping community detection of complex networksNode-based, network-based, and hierarchical clustering-based
    Azaouzi et al.22019Centralized community detection and distributed community detection in static and dynamic social networks
    Chunaev222020The community detection methods in node-attributed social networks
    Souravlas et al.232021The popular community detection methodsThe bottom-up approach, the top-down approach, and the data structure-based approach
    Jin et al.242021The popular community detection methodsProbabilistic graphical and deep learning model
    Moscato et al.252021The different community detection techniques applicable to social networksTopological analysis-based, game theory-based, AI models-based, fuzzy systems-based, and greedy methods-based
    Xing et al.262022Deep learning-based models upon deep neural networks, deep nonnegative matrix factorization, and deep sparse filtering.Convolutional networks, graph attention networks, generative adversarial networks, autoencoder, deep nonnegative matrix factorization, and deep sparse filtering
    Rezvani et al.52022The hierarchical community detection problemFitness-metric-based, hierarchical graph clustering, random walk models, and structured approaches
    Bui et al.42023Analyze and evaluate the computational complexity of hierarchical clustering algorithms for community detectionAgglomerative and divisive algorithms
    Our paper2024Community detection methods in social networks based on mining attributes and interactionsNode content-based, edge content-based, node and edge content combination-based

    The aspects of interactions can be modeled as a multi-layer graph composed of multiple interdependent graphs in which each graph represents an aspect. Kim et al. classified community detection algorithms in multi-layer graphs into six categories: cluster expansion, matrix factorization, unified distance, model-based, pattern mining, and graph merging.16 In Ref. 17, the authors presented a comprehensive survey of game theory-based community detection algorithms. The game theory-based community detection algorithms were discussed in two categories: non-cooperative game theory and cooperative game theory. They also discussed the interesting applications of game theory for social networks and provided further research directions as well as some open challenges. Fani et al. provided an overview of the definitions of an online community and reviewed different community detection methods in social networks.18 They classified the existing community detection approaches into two categories: topology-based and topic-based approaches. Meanwhile, Javed et al. divided the algorithms into disjoint communities and overlapping communities.20 Azaouzi et al. introduced a taxonomy of existing models based on the computational nature that were centralized community detection and distributed community detection in static and dynamic social networks.2 In Ref. 21, the authors reviewed the state-of-the-art of overlapping community detection in complex networks and briefly summarized the advantages and applications of each algorithm. The existing community detection methods were divided into three categories: node-based, network-based, and hierarchical clustering-based. In Ref. 25, the authors presented a comprehensive and comparative study of the different community detection techniques applicable to social networks. The community detection approaches were classified into five categories: topological analysis-based, game theory-based, AI models-based, fuzzy systems-based, and greedy methods-based in which, the most popular approaches based on game theory, AI models, and fuzzy strategies were introduced in detail and highlighted the pros and cons. Besides, their applicability to the different social network types was discussed, focusing on complex networks, starting from multi-layer to heterogeneous information networks. Jin et al. presented a unified architecture of community detection methods to characterize the state-of-the-art of community detection.24 They divided the existing community detection methods into two categories: the probabilistic graphical model and the deep learning model. Besides, several benchmark datasets from several problem domains were provided on the publicly accessible web. These datasets were separated into two groups: synthetic networks and real-world networks. Meanwhile, Rezvani et al. introduced a set of real-world benchmark datasets and experimented hierarchical community detection methods to evaluate the accuracy of each method.5 Souravlas et al. divided the community detection methods into three basic approaches: the bottom-up approach, the top-down approach, and the data structure-based approach.23 For each discussed algorithm, they described in detail the specific method to detect communities and the overall complexity. In Ref. 4, the authors analyzed and evaluated the computational complexity of hierarchical clustering algorithms for community detection. The survey mainly focused on analyzing the computational complexity of agglomerative and divisive methods for community detection in networks. They also discussed the fundamental tools commonly used for detecting and evaluating the quality of communities in different types of networks. Chunaev et al. reviewed the community detection methods in node-attributed networks.22 They proposed a classification of community detection based on when and how they use and fuse network structure and attributes. They provided general technical ideas behind each method in each class.

    Goals and contributions of the survey

    The existing surveys have shown the diversity of community detection methods. The rapid growth of network data and the abundance of user information have attracted the attention of researchers in exploiting the attribute information to detect communities. The object of this review is to give a comprehensive description of proposed methods based on mining attributes and interactions in the networks. In particular, the review will focus on community detection methods exploiting the user’s data and social interactions among users. The community detection approaches are classified into three categories: user’s data-based (also known as node content), social interaction-based (also known as edge content), and the combination of user’s data and social interactions. The main idea of each method is also presented in detail to reveal the suitability for social network data. The focus of the paper is

    (1)

    Describing the main contributions of existing surveys on community detection in recent years.

    (2)

    Providing the definitions of community detection problems and related concepts.

    (3)

    Categorizing the community detection algorithms based on mining attributes and interactions on social networks.

    (4)

    Describing the available datasets for community detection problems.

    (5)

    Further, we discuss the potential research issues and challenges for community detection on social networks.

    Structure of the survey

    The rest of the paper is organized as follows. Section 2 presents the definitions of community detection problems and related concepts. Section 3 presents the classification of community detection methods based on exploiting attributes and interactions on social networks. Section 4 describes the available datasets for community detection problems. Section 5 is about future directions and open challenges. Section 6 is the conclusion.

    2. Preliminaries

    Definition 2.1. A social network can be modeled as a graph G=(V,E), where V is the set of nodes or vertices representing users, and EV×V is the set of edges presenting the relationships or interactions between users.

    If two users are friends or have interactions then an edge is added between them.2,5 A social network can be either a directed or undirected graph according to the type of established relationships between nodes. Directed graphs represent relationships that one node has with another one. Meanwhile, undirected graphs present mutual relationships between two entities.25 Figures 1(a)–1(c) are, respectively, examples of directed, undirected, and mixed graphs presenting a social network. Figure 1(a) is an example of the Twitter network and Figs. 1(b) and 1(c) are examples of the Facebook network.

    Fig. 1.

    Fig. 1. The examples of graphs illustrate the relationships of users on social networks, (a) An example of a directed graph illustrating the relationships of five users on the Twitter network, (b) An example of an undirected graph illustrates the relationships of four users on the Facebook network in the case of considering the friend relation and (c) An example of a mixed graph illustrates the relationships of four users on the Facebook network in the case of considering the friend relation and following.

    There is no exact definition of what constitutes community and the definition of the community often depends on context. A community Ci describes a group of nodes, called a subgraph, that have common attributes. Concretely, a community is a group of nodes having dense intra-connections and sparse inter-connections.2

    Definition 2.2. A community structure can be defined as a partition of the network G into k subgraphs C=C1,,Ck in which Ci denotes the community ith and V=ki=1Ci.

    Thus, each node is a member of one and only one community. However, in real networks such as social networks, a user can join multiple social groups that have different topics such as colleagues, friends and interesting topics. Therefore, the community can be divided into two categories that are disjoint and overlapping community. In the disjoint community, each node only belongs to a community, i.e. CiCj=,ij. Meanwhile, the nodes of the overlapping community can belong to more than one community, i.e. ij,CiCj.2,24,25 Figure 2 is an illustrative graph of disjoint and overlapping communities.

    Fig. 2.

    Fig. 2. An illustrative of (a) disjoint communities and (b) overlapping communities.

    3. The Classification of Community Detection Methods on Attributed Networks

    Community detection is an emerging research topic in the area of social network analysis. The study of community detection methods has attracted much more attention in recent years. There are a significant amount of community detection approaches applied to social networks in the literature. A majority of these approaches tend to detect communities by mining the attributes of the networks. The rapid growth of network data and the abundance of user information have attracted the attention of researchers in exploiting the attribute information to detect communities. In this section, we describe the existing community detection methods on social networks in recent years. This review only focuses on presenting the methods that exploit the node and edge attributes of the network.

    3.1. Node content-based approaches

    A community is detected relying on a topology structure that might contain several topics. To better understand the fundamental community semantics, Wang et al. investigated the correlations of different topics in the communities.27 They modeled the formation of each edge based on assuming that users are more likely to communicate with each other when they are in the same community and their topics are closely correlated. A topic correlation model was designed to learn community structure and semantic interpretation of each community. A generative model for community detection was proposed consisting of three components: (1) generating all user content based on the community memberships and their topics, (2) generating all link contents based on two endpoint users’ community memberships and their topics, (3) generating each directed link with users’ community memberships and topic correlations together. Jiang et al. investigated the impact of individual-level topics on the formation of the network topology and network content.28 They pointed out that individual topics in social networks play a core role in community generation and if two individuals are in the same community and interested in similar topics, it is more likely that they interact with each other. Besides, the existence of relationships among individuals depends on the relationships between their communities and interesting topics. A generative community detection model was proposed to simulate the generation of the network topology and network content based on individual topics. In Ref. 29, the authors designed a method based on a generative model for networks with node attributes. This is an accurate and scalable algorithm for detecting overlapping communities in networks with node attributes. The method statistically models the interaction between the network structure and the node attributes that can improve the accuracy of community detection and increase reliability in the case of noise in the network structure. The method was designed relying on the following assumes:

    • Nodes in the same communities are likely to be connected.

    • Individual nodes may belong to multiple communities (i.e. overlapping communities).

    • If two nodes belong to multiple common communities, they are more likely to be connected than only sharing a single common community.

    • Nodes belonging to the same community are likely to share common attributes.

    The network topology potential analysis is an efficient method for overlapping community detection. However, most methods ignore the mass difference between nodes leading to inaccurate topological potential values of nodes. In fact, the node mass expresses its inherent properties such as importance and influence. Zhixiao et al. designed an overlapping community detection method based on node location analysis in the topology potential field.30 The method associates node mass with the node’s importance and influence to evaluate the node mass. The PageRank algorithm was utilized to calculate the node mass. The community affiliation of nodes was determined by relying on their positions in the inherent peak-valley structure of the topology potential field. Three types of node positions were defined in the topology potential field such as peak, valley, and slope in which peak nodes and slope nodes are the representative nodes and the internal nodes, respectively, and valley nodes are overlapping nodes among communities. In Ref. 31, they investigated the influence of nodes, information linking communities from nodes, and the similarity of community topologies. They proposed an effective node representation strategy and designed a community detection method based on combining local node embedding and global community embedding.

    3.2. Edge content-based approaches

    Most of the community detection methods mainly focus on either the network topology or the node attributes separately. However, both of these provide valuable information to characterize the communities. In Ref. 32, Liu et al. proposed a combination of the topological structure of a network and node attributes from the perspective of content propagation to detect the community in social networks. A network is considered as a dynamic graph and the interactions among nodes form the community structure. They considered a network as a dynamic system and a community as the stable status shared by the nodes in that community. The interactions among nodes were described in two ways, including a linear model to approximate influence propagation, and modeling the interactions directly with a random walk. In a group of social networks, a user can join the group without knowing its other members. The interactions inside a group are usually based on the membership concept. Guidi et al. proposed a dynamic community detection method of the user structure in Facebook Groups.33 A group is represented as a graph where an edge between two nodes means that these nodes have interacted in the group. To deal with the dynamic of the community, they consider that edges may be added as new pairs of nodes interacting and edges may be removed if corresponding nodes do not interact again within a fixed amount of time. The interactions among the members of a group are extracted to detect the dynamic community. The goal is to find dense communication groups of users and the evolution of the communication groups over time.

    The users’ interactions are more meaningful for reflecting the peer relationship of users. Luo et al. exploited users’ interactions based on cascading analysis to identify communities.34 Both direct and indirect users’ interactions such as reply comments, comments on retweets and retweets of retweets are extracted from a large set of social activities such as posting and sharing objects. They proposed a user interaction-oriented community detection method based on cascading analysis. Both the direct and indirect user interactions associated with each social object sharing are analyzed. A graph representation, called an event graph, captures cascading relations among all direct and indirect interactions. Small user groups, called sub-events, are extracted from each event graph which include the users from the same community who have actively interacted with each other in a given social object sharing. A super graph is then constructed with each node representing an extracted user group. The super graph was used to discover the community in which each community represents a group of users who have actively interacted with each other over multiple sharing objects. The method aims to detect active communities based on user interaction behaviors rather than one-time friendship establishments. The framework of the method is shown in Fig. 3. The inputs are the series of cascading events extracted from the users’ interaction data. The process is conducted in four main steps such as event graph generation, sub-events clustering, super graph construction, and community detection.

    Fig. 3.

    Fig. 3. The overall framework of the user interaction-oriented community detection based on cascading analysis.34

    In social networks, people who have common interests and habits, and are closely related usually communicate with each other and build a comfort zone for one’s life. The comfort zone is the group of people in which each person usually communicates with people that he/she knows and his/her daily activities are limited to this group. Each user usually interacts frequently with several users in the comfort zone and seldom interacts with other users outside it. In Ref. 35, they proposed a community detection method based on human social behavior. The model automatically identifies communities in the network by simulating human social behavior. The basic idea of the method is to identify the social comfort zone of human beings in the social process and simulate the evolution of social network structure in the information-spreading process. The model includes three phases: (1) finding the comfort zone for each node, (2) selecting the associated nodes from the comfort zone to reconstruct a simplified network, (3) adjusting and optimizing the structure of the simplified network by simulating the process of information spreading in human society.

    3.3. Edge and node content combination-based approaches

    Classical approaches for community detection usually deal only with the structure of the network and ignore features of the nodes. The majority of real-world social networks provide additional actors’ information such as age, gender and interests. The attributes of the network can enrich the knowledge about the actors and relationships and give sense to the detected communities. This is the motivation for developing community detection methods that use both the structure and the attributes of the network. Wang et al. raised two research questions,36 which are as follows: (1) For each link between two nodes, what is the process of its formation with respect to community structure? (2) How does each user make contributions to the formation of network semantics? For the second issue, two following aspects need to be considered that are two types of content in networks, i.e. edge and node content, and interest in multiple topics of a user. By analyzing a large number of networks, they found that the social behaviors of users are underlying factors to generate the community structure. Social behaviors include users’ interactions, users’ posting preferences, multiple interesting topics of a user, and temporal variation of interesting topics. These social behaviors accurately describe the internal relationships of community structure, the semantics of the community, network topology, and network content. Network topology and network content are formed by the social behaviors of users. In social networks, the users communicate with each other by sharing news and commenting on news generating the relationships between users. The news content from users constitutes the network content that reflects the users’ interesting topics and the semantic information of the network. The users’ social behaviors affect the community structure and community semantics. They investigated four types of user social behaviors to answer the aforementioned questions and proposed a unified community detection model by integrating network topology, node content, link content, and four types of user social behaviors.

    Most of the community detection algorithms only use a linkage structure. However, rich information is encoded in the content of social networks such as node content and edge content, which is essential to discover topically meaningful communities. Therefore, Wang et al. integrated linkage structure, node content, and edge content to detect both structurally and topically meaningful communities for dynamic networks.15 They proposed a transformation of the content-based network into a node-edge interaction network, in which linkage structure, node content, and edge content are embedded seamlessly. The content-based network is first transformed into the node–edge interaction network, which is a multi-mode network comprising two types of nodes and three types of edges. The two types of nodes correspond to the nodes and the edges of the original content-based network, which are called n-node and e-node, respectively. The three types of edges characterize structural similarity, node content similarity, and edge content similarity. Then, a random walk-based approach was used to discover structurally and topically meaningful communities on the multi-mode network.

    Social networks contain helpful information sources for analyzing network structure such as users’ interesting topics and users’ interactions. In Ref. 37, Akachar et al. posed two research questions: (1) How can we discover communities that reflect both the strengths of users’ interactions and their interests? (2) How can we update the community structure to detect new communities without re-computing them at each snapshot in the whole network? They designed a 2-phase framework to discover and update the community structure. In the first phase, a static community detection algorithm based on content and structure information is utilized to identify the initial community structures. The algorithm uses a hybrid approach based on statistical and semantic measures to extract the interesting topics of users. The network is partitioned into several groups in which each group represents a distinct topic. The relation in each group then is analyzed to verify the groups. In the second phase, an adaptive method for detecting and updating community structure is used to exploit the interesting topics of each changed node to identify the group to which it belongs. The algorithm is composed of four main steps as shown in Fig. 4. Liu et al. considered the community detection for the weighted networks such as user relationships on social networks.38 They proposed a graph clustering algorithm based on the concept of density and attractiveness for weighted networks. The user’s degree and users’ attractiveness were defined as node weight and edge weight, respectively.

    Fig. 4.

    Fig. 4. The four main steps of a community detection algorithm.37

    The community detection methods by analyzing the linkage of the network do not reflect the semantics such as the interesting topics shared by people. Zhao et al. proposed a topic-oriented community detection approach that combines both social object clustering and link analysis.39 First, a subspace clustering algorithm is used to cluster all the social objects into topics. Then, the members related to these social objects are divided into topical clusters in which each cluster corresponds to a distinct topic. Finally, in order to differentiate the strength of connections, a link analysis is performed on each topical cluster to detect the topical communities. In real networks such as social networks, the node attributes are correlated with the community structure. However, the majority of existing community detection methods do not consider additional information on the nodes (i.e. node attributes). Mining the node attributes can improve the performance of community detection. Zhang et al. proposed a joint community detection method that uses both the edge information and the node features.40 Since links of nodes in social networks generally contain semantic descriptions therefore communities of links can better characterize community behaviors than communities of nodes. The community detection based on network topologies and descriptive contents can be restricted to one topic per community, which may not be consistent with the community in social networks. Jin et al. exploited the relationships between communities and topics to discover communities of links and extract semantically meaningful community summaries at the same time.41 This method overcomes the limitation of restricting one topic per community.

    The entities in the real-world networks tend to have attributes that are useful for discovering communities. The community detection methods directly use the original network topology leading to low accuracy due to ignoring inherent community structures. Li et al. proposed a community structure embedding-based model for community detection by incorporating community structures and node attributes via underlying community memberships.42 The attributed community detection was formulated as a nonnegative matrix factorization optimization problem. The nodes in a social network typically contain a variety of attributes, such as age, gender, place of residence, preferences and interactions. Each of these attributes can be regarded as forming an attribute layer. In Ref. 43, Li et al. introduced a multi-layer local community detection model based on attributes and structure information. The method mines the effectiveness of the node attribute information and the similarity of social exchanges.

    4. Datasets

    For community detection problems, the available experimental datasets are popularly described in the literature. There are typically two types of network data that are real-world and synthetic datasets. Real-world datasets are collected from real-world networks. Meanwhile, synthetic datasets are generated by specific models on manually designed rules. Real-world datasets are more commonly used than synthetic datasets. Some articles experimented on both of them30,35,44,46 while most articles only experiment on real data. The common datasets for community detection are listed in Table 2. In this section, we introduce commonly used datasets in the experiments and describe the methods of representing social networks as graphs.

    Table 2. The statistics of the real-world network datasets in the literature.

    Sources#Communities#Nodes#EdgesReferences
    Aarhus employee6162043
    AGBlogs61,22233,42835
    Airlines4173,58843
    BlogCatalog3910,312333,98331
    Citeseer63,3124,73232, 35, 44, 42
    Cora72,7085,42939, 32, 44, 42, 31, 37
    Cornell519530444, 42, 45
    DBLP32,5549,96315, 27, 28, 36, 37
    Dolphins26215930, 35, 45, 46
    Enron139741,55737, 39, 44, 41
    Facebook104,08988,23429, 32, 42, 44
    Flickr543616,710716,06329, 42
    Football1211561335, 38, 46
    Google+437250,46930,230,90529
    Karate2347830, 35, 45, 46
    Last.fm381,89212,71244
    LiveJ3,98726,26834
    MoscowAthletics2013103,319144,59143
    Philosophers12201,2185,97229, 42
    Polbooks310588235, 46
    Political Blogs21,49019,09039
    PubMed Diabetes319,71744,33832
    QQ Zone Sales632,05610,239,86543
    Reddit323,37138,93915, 27, 28, 36, 41
    Texas518718744, 42
    Tieba8,44425,19034
    Twitter817179644
    Twitter3140125,1202,248,40629
    UmbcBlog24044,76435
    Washington523044642, 44
    Weibo6,21613,92434, 38
    Wiki192,40517,98131
    Wisconsin526553042, 44, 45

    4.1. Real-world datasets

    The real-world datasets are popularly published. The data are crawled from social networks and preprocessed to convert to a graph. In Ref. 15, Wang et al. used two real-world dynamic social networks which were Reddit and DBLP datasets. These datasets were also used for experiments in many research. The Reddit dataset contains the threads of three sub-forums (i.e. Movies, Politics, and Science) on https://www.reddit.com. The nodes represent the users and their contents are the corresponding post contents. The undirected edges are the interactions between the users submitting the post and the users replying to the post. The edge contents are the comment contents. The DBLP dataset includes three research fields treated as three communities (i.e. machine learning, image processing, and data mining).47 The authors were considered as nodes and the undirected edges were formed among nodes when they were co-authors. If the paper has an author then the paper title is the node content of the corresponding node, otherwise, the title of the paper is the edge content of collaborated authors. In Ref. 37, Akachar et al. used three real-world network datasets (i.e. Enron email, Cora, and DBLP) for community detection experiments. In the Enron email corpus, each node represents an email address and if two addresses send at least one email to each other then an edge between them is created. In the CORA dataset, each node represents a scientific paper and the edge represents a citation relation (i.e. if two papers cite each other, an edge between them is formed). The DBLP dataset includes three publication types: article, in collection, and in proceedings. The nodes represent authors and the edges represent co-authorship. In the Karate network, nodes represent the members of the karate club, and edges represent the interaction between the club’s members. In the Dolphins network, each node represents Dolphins and each edge represents frequent contact between the Dolphins. In the Football network, each node represents a team, while each edge represents the relationship between two teams.30,46

    4.2. Synthetic datasets

    LFR benchmark networks48 are popularly used in community detection experiments. LFR benchmark networks have a priori known communities and are usually used to experiment with different community detection methods. The advantage of the benchmark networks over other methods is that they account for the heterogeneity in the distributions of node degrees and community sizes. The synthetic networks generated by the LFR algorithm are closer to the features observed in real-world networks. LFR networks have power-law distributions for both the degree of node and size of communities. LFR can also control the topology of networks by adjusting the following parameters: the number of nodes, the average degree, the average fraction of neighboring nodes of a node, the minimum and maximum community sizes, and the number of overlapping nodes.49 Therefore, LFR networks are appropriate to evaluate the performance of community detection algorithms. In experiments, a common method is to use both real-world networks and synthetic networks.50,51,52,53,54

    5. Future Directions and Open Challenges

    Social networks have become popular and attracted large numbers of interactive users every day. The scale of networks increases rapidly leading to create a huge amount of network data. These networks typically have tens of thousands or billions of nodes and edges as well as complex network structures.55,56,57 Developing algorithms that can be applied on large-scale networks is one of the main challenges in community detection problems on social networks. Real-world networks are often large-scale, dynamic, heterogeneous, or incomplete networks. Most existing community detection algorithms have to deal with high computational complexity on such complex networks.4,21 A commonly used solution is to reduce the data scale by network reduction or network approximation, which may lose some important information and affect the performance of the model.24 In this section, we discuss the challenging problems and future research directions potentially worth pursuing to solve the above-mentioned problems.

    5.1. Dynamic networks

    The user’s relationships in social networks often change over time therefore the networks have a dynamic structure13 and the communities in such networks also change accordingly. Since the topology and attributes of dynamic networks change over time, most existing learning-based community detection methods have to retrain the models according to the evolved network structure.20 It is very time-consuming and not suitable for real-time processing. How to efficiently perform community detection on large networks is also a challenging task. Besides, exploiting temporal characteristics is crucial to have insights into community structures. We are in the era of big data and social networks. A large volume of data is created every day. Although multiple methods have been proposed to address the mentioned issues, in reality, there still remain many challenges that need to be addressed due to the dynamic evolution of social networks.

    5.2. Heterogeneous networks

    In this survey, we reviewed many proposed methods based on topological analysis but they mostly consider the users’ relationships without considering network heterogeneity. Heterogeneous networks contain different types of nodes and edges or different types of descriptions of nodes and edges.24 In heterogeneous networks, we have to have appropriate strategies for clustering different types of nodes or for merging the different layers to infer the relationships between different nodes and identify communities accurately. Besides, the majority of proposed approaches can be applied to undirected graphs, only a few of which can be applied to directed graphs and weighted graphs. As described in Fig. 1, directed graphs and weighted graphs accurately reflect the nature of social networks. Therefore, community detection of directed graphs is a potential research direction to discover communities that truly reflect the real communities of real-world networks.

    5.3. Multi-layered framework

    A social network can be presented as a multi-layer graph where each layer contains a unique set of edges over the same underlying nodes.16,43 Multi-layer networks provide a multi-layer framework to describe multiple types of interactions among entities. Multi-layer networks allow modeling the relationships among users from different perspectives. Therefore, edges in different layers have distinct semantics. Modeling the evolution and detecting the communities in multi-layer networks are potential research directions that should be considered. A possible solution is to incorporate information from multi-layer networks into single-layer networks and then use the existing community detection algorithms.

    6. Conclusions

    This survey provided an overview of community detection methods on social networks in recent years. We analyzed the characteristics of community detection models according to different approaches. In this work, we only focus on the methods of mining attributes and interactions of users. We proposed a novel classification of community detection approaches based on mining attributes and interactions of users: node content-based approaches, edge content-based approaches, and node and edge content-based approaches. Furthermore, we provided an overview of popularly used datasets for community detection tasks. Finally, we discuss some possible research directions and open challenges in community detection to stimulate further research in this area.

    Acknowledgments

    This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number DS2021-26-03.

    ORCID

    Hai Bang Truong  https://orcid.org/0000-0001-5094-3128

    Mirjana Ivanovic  https://orcid.org/0000-0003-1946-0384

    Van Cuong Tran  https://orcid.org/0000-0002-6291-8541