In recent years, community detection in social network analysis has been a hot research topic that has attracted much more attention in the scientific community and a large number of articles have already been published in the literature. Community detection problems focus on determining the subgraphs of the network that have dense edges and whose nodes are expected to have similar features. Community detection has significant applications in different domains of social network analysis such as sociology, business, criminal detection and recommendation systems. The computational complexity is hampered by the huge size and the dynamic nature of social networks. This paper introduces a comprehensive survey of community detection methods in social networks based on mining attributes and interactions of users. We divide the community detection approaches into three basic categories, namely (1) node content-based approaches, (2) edge content-based approaches, and (3) node- and edge content-based approaches. We also discuss in detail the main idea of each method that is suitable for social network data. Furthermore, to promote the study of community detection, we introduce the benchmark datasets from available sources. Last, we present further research directions as well as some open challenges for the community detection problem in social networks.

Keywords:

1. Introduction

A social network is a structure made up of actors, which are mainly human individuals, and interactions among them. Social networks facilitate different communication methods and represent a diversity of social interaction types. They create a favorable environment for users to interact and share news. In general, a social network is a set of actors interconnected via relationships created during online interactions.¹ It can be presented as a graph in which the nodes represent the social actor and the edges represent their social relationship.^1,2 In the networks, the community structure forms the groups of nodes having dense intra-connections, and sparse inter-connections. Community detection in social networks intensively has been studied in recent years.^3,4,5

In general, communities are categorized into two categories: disjoint and overlapping communities.^1,6 In disjoint communities, the actor is assigned to a unique community. However, a user in social networks may belong to more than one group such as family, best friends and colleagues. The actors associate across different communities which are called overlapping communities.^7,8 The community detection approach can be classified into two categories: structure-based community indicating the nodes in a cluster of a graph have more relationships with each other than other nodes and semantic-based community indicating a cluster of nodes having similar semantic context or a set of nodes in a cluster having the same interests. Semantic-based community detection mines both the context and relationship of the nodes therefore the discovered community could represent the coherence more efficiently.^9,10,11,12 The heterogeneity of data, diversity structure, and social network size pose growing challenges for community detection problems. Dynamic data are also a problem that makes community detection problems even more complicated.^13,14,15 These limitations have motivated growing interest in algorithms for large-scale networks.

Community detection is a valuable problem for different domains of social network analysis such as sociology, business, criminal detection and recommendation systems.² Identifying communities from social networks helps to understand the user’s behaviors and relationships between members of the community. A community is a group of nodes that have links with each other and limited associations with the remaining network. Detecting communities of the networks provides important knowledge about the structural properties of the networks. The structural properties have significant applications in different domains such as viral marketing, and modeling the epidemic spread.² There exists a wide variety of community detection algorithms in the literature. Table 1 presents a summary of the existing surveys on community detection. The articles mainly focused on summarizing the methods and categorizing the algorithms. Some articles focused on comparing models and giving the pros and cons in various use cases. In the rest of this section, we will summarize the main contributions of the existing surveys on community detection in the literature.

**Table 1. Recently existing surveys on community detection problems.**
Authors	Year	Focus on	Algorithm categories
Kim et al.¹⁶	2015	Community detection algorithms in multi-layer graphs of interactions	Cluster expansion, matrix factorization, unified distance, model-based, pattern mining, and graph merging
Jonnalagadda et al.¹⁷	2016	The game theory-based community detection algorithms	Non-cooperative game theory and cooperative game theory
Fani et al.¹⁸	2017	Community detection methods in social networks	Link-based and content-based
Zhang et al.¹⁹	2018	Algorithms of community detection in complex networks	Modularity-based, clique percolation-based, label propagation-based, and hierarchical partitioning-based
Javed et al.²⁰	2018	Popular community detection algorithms	Disjoint communities and overlapping communities
Chen et al.²¹	2019	Overlapping community detection of complex networks	Node-based, network-based, and hierarchical clustering-based
Azaouzi et al.²	2019	Centralized community detection and distributed community detection in static and dynamic social networks
Chunaev²²	2020	The community detection methods in node-attributed social networks
Souravlas et al.²³	2021	The popular community detection methods	The bottom-up approach, the top-down approach, and the data structure-based approach
Jin et al.²⁴	2021	The popular community detection methods	Probabilistic graphical and deep learning model
Moscato et al.²⁵	2021	The different community detection techniques applicable to social networks	Topological analysis-based, game theory-based, AI models-based, fuzzy systems-based, and greedy methods-based
Xing et al.²⁶	2022	Deep learning-based models upon deep neural networks, deep nonnegative matrix factorization, and deep sparse filtering.	Convolutional networks, graph attention networks, generative adversarial networks, autoencoder, deep nonnegative matrix factorization, and deep sparse filtering
Rezvani et al.⁵	2022	The hierarchical community detection problem	Fitness-metric-based, hierarchical graph clustering, random walk models, and structured approaches
Bui et al.⁴	2023	Analyze and evaluate the computational complexity of hierarchical clustering algorithms for community detection	Agglomerative and divisive algorithms
Our paper	2024	Community detection methods in social networks based on mining attributes and interactions	Node content-based, edge content-based, node and edge content combination-based

The aspects of interactions can be modeled as a multi-layer graph composed of multiple interdependent graphs in which each graph represents an aspect. Kim et al. classified community detection algorithms in multi-layer graphs into six categories: cluster expansion, matrix factorization, unified distance, model-based, pattern mining, and graph merging.¹⁶ In Ref. 17, the authors presented a comprehensive survey of game theory-based community detection algorithms. The game theory-based community detection algorithms were discussed in two categories: non-cooperative game theory and cooperative game theory. They also discussed the interesting applications of game theory for social networks and provided further research directions as well as some open challenges. Fani et al. provided an overview of the definitions of an online community and reviewed different community detection methods in social networks.¹⁸ They classified the existing community detection approaches into two categories: topology-based and topic-based approaches. Meanwhile, Javed et al. divided the algorithms into disjoint communities and overlapping communities.²⁰ Azaouzi et al. introduced a taxonomy of existing models based on the computational nature that were centralized community detection and distributed community detection in static and dynamic social networks.² In Ref. 21, the authors reviewed the state-of-the-art of overlapping community detection in complex networks and briefly summarized the advantages and applications of each algorithm. The existing community detection methods were divided into three categories: node-based, network-based, and hierarchical clustering-based. In Ref. 25, the authors presented a comprehensive and comparative study of the different community detection techniques applicable to social networks. The community detection approaches were classified into five categories: topological analysis-based, game theory-based, AI models-based, fuzzy systems-based, and greedy methods-based in which, the most popular approaches based on game theory, AI models, and fuzzy strategies were introduced in detail and highlighted the pros and cons. Besides, their applicability to the different social network types was discussed, focusing on complex networks, starting from multi-layer to heterogeneous information networks. Jin et al. presented a unified architecture of community detection methods to characterize the state-of-the-art of community detection.²⁴ They divided the existing community detection methods into two categories: the probabilistic graphical model and the deep learning model. Besides, several benchmark datasets from several problem domains were provided on the publicly accessible web. These datasets were separated into two groups: synthetic networks and real-world networks. Meanwhile, Rezvani et al. introduced a set of real-world benchmark datasets and experimented hierarchical community detection methods to evaluate the accuracy of each method.⁵ Souravlas et al. divided the community detection methods into three basic approaches: the bottom-up approach, the top-down approach, and the data structure-based approach.²³ For each discussed algorithm, they described in detail the specific method to detect communities and the overall complexity. In Ref. 4, the authors analyzed and evaluated the computational complexity of hierarchical clustering algorithms for community detection. The survey mainly focused on analyzing the computational complexity of agglomerative and divisive methods for community detection in networks. They also discussed the fundamental tools commonly used for detecting and evaluating the quality of communities in different types of networks. Chunaev et al. reviewed the community detection methods in node-attributed networks.²² They proposed a classification of community detection based on when and how they use and fuse network structure and attributes. They provided general technical ideas behind each method in each class.

Goals and contributions of the survey

The existing surveys have shown the diversity of community detection methods. The rapid growth of network data and the abundance of user information have attracted the attention of researchers in exploiting the attribute information to detect communities. The object of this review is to give a comprehensive description of proposed methods based on mining attributes and interactions in the networks. In particular, the review will focus on community detection methods exploiting the user’s data and social interactions among users. The community detection approaches are classified into three categories: user’s data-based (also known as node content), social interaction-based (also known as edge content), and the combination of user’s data and social interactions. The main idea of each method is also presented in detail to reveal the suitability for social network data. The focus of the paper is

(1)	Describing the main contributions of existing surveys on community detection in recent years.
(2)	Providing the definitions of community detection problems and related concepts.
(3)	Categorizing the community detection algorithms based on mining attributes and interactions on social networks.
(4)	Describing the available datasets for community detection problems.
(5)	Further, we discuss the potential research issues and challenges for community detection on social networks.

Structure of the survey

The rest of the paper is organized as follows. Section 2 presents the definitions of community detection problems and related concepts. Section 3 presents the classification of community detection methods based on exploiting attributes and interactions on social networks. Section 4 describes the available datasets for community detection problems. Section 5 is about future directions and open challenges. Section 6 is the conclusion.

2. Preliminaries

Definition 2.1. A social network can be modeled as a graph $G = (V, E)$ , where V is the set of nodes or vertices representing users, and $E \subseteq V \times V$ is the set of edges presenting the relationships or interactions between users.

If two users are friends or have interactions then an edge is added between them.^2,5 A social network can be either a directed or undirected graph according to the type of established relationships between nodes. Directed graphs represent relationships that one node has with another one. Meanwhile, undirected graphs present mutual relationships between two entities.²⁵ Figures 1(a)–1(c) are, respectively, examples of directed, undirected, and mixed graphs presenting a social network. Figure 1(a) is an example of the Twitter network and Figs. 1(b) and 1(c) are examples of the Facebook network.

Fig. 1. The examples of graphs illustrate the relationships of users on social networks, (a) An example of a directed graph illustrating the relationships of five users on the Twitter network, (b) An example of an undirected graph illustrates the relationships of four users on the Facebook network in the case of considering the friend relation and (c) An example of a mixed graph illustrates the relationships of four users on the Facebook network in the case of considering the friend relation and following.

There is no exact definition of what constitutes community and the definition of the community often depends on context. A community $C_{i}$ describes a group of nodes, called a subgraph, that have common attributes. Concretely, a community is a group of nodes having dense intra-connections and sparse inter-connections.²

Definition 2.2. A community structure can be defined as a partition of the network G into k subgraphs $C = C_{1}, \dots, C_{k}$ in which $C_{i}$ denotes the community ith and $V = ⋃_{i = 1}^{k} C_{i}$ .

Thus, each node is a member of one and only one community. However, in real networks such as social networks, a user can join multiple social groups that have different topics such as colleagues, friends and interesting topics. Therefore, the community can be divided into two categories that are disjoint and overlapping community. In the disjoint community, each node only belongs to a community, i.e. $C_{i} \cap C_{j} = \emptyset, \forall i \neq j$ . Meanwhile, the nodes of the overlapping community can belong to more than one community, i.e. $\exists i \neq j, C_{i} \cap C_{j} \neq \emptyset$ .^2,24,25 Figure 2 is an illustrative graph of disjoint and overlapping communities.

Fig. 2. An illustrative of (a) disjoint communities and (b) overlapping communities.

3. The Classification of Community Detection Methods on Attributed Networks

Community detection is an emerging research topic in the area of social network analysis. The study of community detection methods has attracted much more attention in recent years. There are a significant amount of community detection approaches applied to social networks in the literature. A majority of these approaches tend to detect communities by mining the attributes of the networks. The rapid growth of network data and the abundance of user information have attracted the attention of researchers in exploiting the attribute information to detect communities. In this section, we describe the existing community detection methods on social networks in recent years. This review only focuses on presenting the methods that exploit the node and edge attributes of the network.

3.1. Node content-based approaches

A community is detected relying on a topology structure that might contain several topics. To better understand the fundamental community semantics, Wang et al. investigated the correlations of different topics in the communities.²⁷ They modeled the formation of each edge based on assuming that users are more likely to communicate with each other when they are in the same community and their topics are closely correlated. A topic correlation model was designed to learn community structure and semantic interpretation of each community. A generative model for community detection was proposed consisting of three components: (1) generating all user content based on the community memberships and their topics, (2) generating all link contents based on two endpoint users’ community memberships and their topics, (3) generating each directed link with users’ community memberships and topic correlations together. Jiang et al. investigated the impact of individual-level topics on the formation of the network topology and network content.²⁸ They pointed out that individual topics in social networks play a core role in community generation and if two individuals are in the same community and interested in similar topics, it is more likely that they interact with each other. Besides, the existence of relationships among individuals depends on the relationships between their communities and interesting topics. A generative community detection model was proposed to simulate the generation of the network topology and network content based on individual topics. In Ref. 29, the authors designed a method based on a generative model for networks with node attributes. This is an accurate and scalable algorithm for detecting overlapping communities in networks with node attributes. The method statistically models the interaction between the network structure and the node attributes that can improve the accuracy of community detection and increase reliability in the case of noise in the network structure. The method was designed relying on the following assumes:

Nodes in the same communities are likely to be connected.
Individual nodes may belong to multiple communities (i.e. overlapping communities).
If two nodes belong to multiple common communities, they are more likely to be connected than only sharing a single common community.
Nodes belonging to the same community are likely to share common attributes.

The network topology potential analysis is an efficient method for overlapping community detection. However, most methods ignore the mass difference between nodes leading to inaccurate topological potential values of nodes. In fact, the node mass expresses its inherent properties such as importance and influence. Zhixiao et al. designed an overlapping community detection method based on node location analysis in the topology potential field.³⁰ The method associates node mass with the node’s importance and influence to evaluate the node mass. The PageRank algorithm was utilized to calculate the node mass. The community affiliation of nodes was determined by relying on their positions in the inherent peak-valley structure of the topology potential field. Three types of node positions were defined in the topology potential field such as peak, valley, and slope in which peak nodes and slope nodes are the representative nodes and the internal nodes, respectively, and valley nodes are overlapping nodes among communities. In Ref. 31, they investigated the influence of nodes, information linking communities from nodes, and the similarity of community topologies. They proposed an effective node representation strategy and designed a community detection method based on combining local node embedding and global community embedding.

3.2. Edge content-based approaches

Most of the community detection methods mainly focus on either the network topology or the node attributes separately. However, both of these provide valuable information to characterize the communities. In Ref. 32, Liu et al. proposed a combination of the topological structure of a network and node attributes from the perspective of content propagation to detect the community in social networks. A network is considered as a dynamic graph and the interactions among nodes form the community structure. They considered a network as a dynamic system and a community as the stable status shared by the nodes in that community. The interactions among nodes were described in two ways, including a linear model to approximate influence propagation, and modeling the interactions directly with a random walk. In a group of social networks, a user can join the group without knowing its other members. The interactions inside a group are usually based on the membership concept. Guidi et al. proposed a dynamic community detection method of the user structure in Facebook Groups.³³ A group is represented as a graph where an edge between two nodes means that these nodes have interacted in the group. To deal with the dynamic of the community, they consider that edges may be added as new pairs of nodes interacting and edges may be removed if corresponding nodes do not interact again within a fixed amount of time. The interactions among the members of a group are extracted to detect the dynamic community. The goal is to find dense communication groups of users and the evolution of the communication groups over time.

The users’ interactions are more meaningful for reflecting the peer relationship of users. Luo et al. exploited users’ interactions based on cascading analysis to identify communities.³⁴ Both direct and indirect users’ interactions such as reply comments, comments on retweets and retweets of retweets are extracted from a large set of social activities such as posting and sharing objects. They proposed a user interaction-oriented community detection method based on cascading analysis. Both the direct and indirect user interactions associated with each social object sharing are analyzed. A graph representation, called an event graph, captures cascading relations among all direct and indirect interactions. Small user groups, called sub-events, are extracted from each event graph which include the users from the same community who have actively interacted with each other in a given social object sharing. A super graph is then constructed with each node representing an extracted user group. The super graph was used to discover the community in which each community represents a group of users who have actively interacted with each other over multiple sharing objects. The method aims to detect active communities based on user interaction behaviors rather than one-time friendship establishments. The framework of the method is shown in Fig. 3. The inputs are the series of cascading events extracted from the users’ interaction data. The process is conducted in four main steps such as event graph generation, sub-events clustering, super graph construction, and community detection.

Fig. 3. The overall framework of the user interaction-oriented community detection based on cascading analysis.³⁴

In social networks, people who have common interests and habits, and are closely related usually communicate with each other and build a comfort zone for one’s life. The comfort zone is the group of people in which each person usually communicates with people that he/she knows and his/her daily activities are limited to this group. Each user usually interacts frequently with several users in the comfort zone and seldom interacts with other users outside it. In Ref. 35, they proposed a community detection method based on human social behavior. The model automatically identifies communities in the network by simulating human social behavior. The basic idea of the method is to identify the social comfort zone of human beings in the social process and simulate the evolution of social network structure in the information-spreading process. The model includes three phases: (1) finding the comfort zone for each node, (2) selecting the associated nodes from the comfort zone to reconstruct a simplified network, (3) adjusting and optimizing the structure of the simplified network by simulating the process of information spreading in human society.

3.3. Edge and node content combination-based approaches

Classical approaches for community detection usually deal only with the structure of the network and ignore features of the nodes. The majority of real-world social networks provide additional actors’ information such as age, gender and interests. The attributes of the network can enrich the knowledge about the actors and relationships and give sense to the detected communities. This is the motivation for developing community detection methods that use both the structure and the attributes of the network. Wang et al. raised two research questions,³⁶ which are as follows: (1) For each link between two nodes, what is the process of its formation with respect to community structure? (2) How does each user make contributions to the formation of network semantics? For the second issue, two following aspects need to be considered that are two types of content in networks, i.e. edge and node content, and interest in multiple topics of a user. By analyzing a large number of networks, they found that the social behaviors of users are underlying factors to generate the community structure. Social behaviors include users’ interactions, users’ posting preferences, multiple interesting topics of a user, and temporal variation of interesting topics. These social behaviors accurately describe the internal relationships of community structure, the semantics of the community, network topology, and network content. Network topology and network content are formed by the social behaviors of users. In social networks, the users communicate with each other by sharing news and commenting on news generating the relationships between users. The news content from users constitutes the network content that reflects the users’ interesting topics and the semantic information of the network. The users’ social behaviors affect the community structure and community semantics. They investigated four types of user social behaviors to answer the aforementioned questions and proposed a unified community detection model by integrating network topology, node content, link content, and four types of user social behaviors.

Most of the community detection algorithms only use a linkage structure. However, rich information is encoded in the content of social networks such as node content and edge content, which is essential to discover topically meaningful communities. Therefore, Wang et al. integrated linkage structure, node content, and edge content to detect both structurally and topically meaningful communities for dynamic networks.¹⁵ They proposed a transformation of the content-based network into a node-edge interaction network, in which linkage structure, node content, and edge content are embedded seamlessly. The content-based network is first transformed into the node–edge interaction network, which is a multi-mode network comprising two types of nodes and three types of edges. The two types of nodes correspond to the nodes and the edges of the original content-based network, which are called n-node and e-node, respectively. The three types of edges characterize structural similarity, node content similarity, and edge content similarity. Then, a random walk-based approach was used to discover structurally and topically meaningful communities on the multi-mode network.

Social networks contain helpful information sources for analyzing network structure such as users’ interesting topics and users’ interactions. In Ref. 37, Akachar et al. posed two research questions: (1) How can we discover communities that reflect both the strengths of users’ interactions and their interests? (2) How can we update the community structure to detect new communities without re-computing them at each snapshot in the whole network? They designed a 2-phase framework to discover and update the community structure. In the first phase, a static community detection algorithm based on content and structure information is utilized to identify the initial community structures. The algorithm uses a hybrid approach based on statistical and semantic measures to extract the interesting topics of users. The network is partitioned into several groups in which each group represents a distinct topic. The relation in each group then is analyzed to verify the groups. In the second phase, an adaptive method for detecting and updating community structure is used to exploit the interesting topics of each changed node to identify the group to which it belongs. The algorithm is composed of four main steps as shown in Fig. 4. Liu et al. considered the community detection for the weighted networks such as user relationships on social networks.³⁸ They proposed a graph clustering algorithm based on the concept of density and attractiveness for weighted networks. The user’s degree and users’ attractiveness were defined as node weight and edge weight, respectively.

Fig. 4. The four main steps of a community detection algorithm.³⁷

The community detection methods by analyzing the linkage of the network do not reflect the semantics such as the interesting topics shared by people. Zhao et al. proposed a topic-oriented community detection approach that combines both social object clustering and link analysis.³⁹ First, a subspace clustering algorithm is used to cluster all the social objects into topics. Then, the members related to these social objects are divided into topical clusters in which each cluster corresponds to a distinct topic. Finally, in order to differentiate the strength of connections, a link analysis is performed on each topical cluster to detect the topical communities. In real networks such as social networks, the node attributes are correlated with the community structure. However, the majority of existing community detection methods do not consider additional information on the nodes (i.e. node attributes). Mining the node attributes can improve the performance of community detection. Zhang et al. proposed a joint community detection method that uses both the edge information and the node features.⁴⁰ Since links of nodes in social networks generally contain semantic descriptions therefore communities of links can better characterize community behaviors than communities of nodes. The community detection based on network topologies and descriptive contents can be restricted to one topic per community, which may not be consistent with the community in social networks. Jin et al. exploited the relationships between communities and topics to discover communities of links and extract semantically meaningful community summaries at the same time.⁴¹ This method overcomes the limitation of restricting one topic per community.

The entities in the real-world networks tend to have attributes that are useful for discovering communities. The community detection methods directly use the original network topology leading to low accuracy due to ignoring inherent community structures. Li et al. proposed a community structure embedding-based model for community detection by incorporating community structures and node attributes via underlying community memberships.⁴² The attributed community detection was formulated as a nonnegative matrix factorization optimization problem. The nodes in a social network typically contain a variety of attributes, such as age, gender, place of residence, preferences and interactions. Each of these attributes can be regarded as forming an attribute layer. In Ref. 43, Li et al. introduced a multi-layer local community detection model based on attributes and structure information. The method mines the effectiveness of the node attribute information and the similarity of social exchanges.

4. Datasets

For community detection problems, the available experimental datasets are popularly described in the literature. There are typically two types of network data that are real-world and synthetic datasets. Real-world datasets are collected from real-world networks. Meanwhile, synthetic datasets are generated by specific models on manually designed rules. Real-world datasets are more commonly used than synthetic datasets. Some articles experimented on both of them^30,35,44,46 while most articles only experiment on real data. The common datasets for community detection are listed in Table 2. In this section, we introduce commonly used datasets in the experiments and describe the methods of representing social networks as graphs.

**Table 2. The statistics of the real-world network datasets in the literature.**
Sources	#Communities	#Nodes	#Edges	References
Aarhus employee		61	620	43
AGBlogs	6	1,222	33,428	35
Airlines		417	3,588	43
BlogCatalog	39	10,312	333,983	31
Citeseer	6	3,312	4,732	32, 35, 44, 42
Cora	7	2,708	5,429	39, 32, 44, 42, 31, 37
Cornell	5	195	304	44, 42, 45
DBLP	3	2,554	9,963	15, 27, 28, 36, 37
Dolphins	2	62	159	30, 35, 45, 46
Enron	13	974	1,557	37, 39, 44, 41
Facebook	10	4,089	88,234	29, 32, 42, 44
Flickr	5436	16,710	716,063	29, 42
Football	12	115	613	35, 38, 46
Google+	437	250,469	30,230,905	29
Karate	2	34	78	30, 35, 45, 46
Last.fm	38	1,892	12,712	44
LiveJ		3,987	26,268	34
MoscowAthletics2013		103,319	144,591	43
Philosophers	1220	1,218	5,972	29, 42
Polbooks	3	105	882	35, 46
Political Blogs	2	1,490	19,090	39
PubMed Diabetes	3	19,717	44,338	32
QQ Zone Sales		632,056	10,239,865	43
Reddit	3	23,371	38,939	15, 27, 28, 36, 41
Texas	5	187	187	44, 42
Tieba		8,444	25,190	34
Twitter	8	171	796	44
Twitter	3140	125,120	2,248,406	29
UmbcBlog	2	404	4,764	35
Washington	5	230	446	42, 44
Weibo		6,216	13,924	34, 38
Wiki	19	2,405	17,981	31
Wisconsin	5	265	530	42, 44, 45

4.1. Real-world datasets

The real-world datasets are popularly published. The data are crawled from social networks and preprocessed to convert to a graph. In Ref. 15, Wang et al. used two real-world dynamic social networks which were Reddit and DBLP datasets. These datasets were also used for experiments in many research. The Reddit dataset contains the threads of three sub-forums (i.e. Movies, Politics, and Science) on https://www.reddit.com. The nodes represent the users and their contents are the corresponding post contents. The undirected edges are the interactions between the users submitting the post and the users replying to the post. The edge contents are the comment contents. The DBLP dataset includes three research fields treated as three communities (i.e. machine learning, image processing, and data mining).⁴⁷ The authors were considered as nodes and the undirected edges were formed among nodes when they were co-authors. If the paper has an author then the paper title is the node content of the corresponding node, otherwise, the title of the paper is the edge content of collaborated authors. In Ref. 37, Akachar et al. used three real-world network datasets (i.e. Enron email, Cora, and DBLP) for community detection experiments. In the Enron email corpus, each node represents an email address and if two addresses send at least one email to each other then an edge between them is created. In the CORA dataset, each node represents a scientific paper and the edge represents a citation relation (i.e. if two papers cite each other, an edge between them is formed). The DBLP dataset includes three publication types: article, in collection, and in proceedings. The nodes represent authors and the edges represent co-authorship. In the Karate network, nodes represent the members of the karate club, and edges represent the interaction between the club’s members. In the Dolphins network, each node represents Dolphins and each edge represents frequent contact between the Dolphins. In the Football network, each node represents a team, while each edge represents the relationship between two teams.^30,46

4.2. Synthetic datasets

LFR benchmark networks⁴⁸ are popularly used in community detection experiments. LFR benchmark networks have a priori known communities and are usually used to experiment with different community detection methods. The advantage of the benchmark networks over other methods is that they account for the heterogeneity in the distributions of node degrees and community sizes. The synthetic networks generated by the LFR algorithm are closer to the features observed in real-world networks. LFR networks have power-law distributions for both the degree of node and size of communities. LFR can also control the topology of networks by adjusting the following parameters: the number of nodes, the average degree, the average fraction of neighboring nodes of a node, the minimum and maximum community sizes, and the number of overlapping nodes.⁴⁹ Therefore, LFR networks are appropriate to evaluate the performance of community detection algorithms. In experiments, a common method is to use both real-world networks and synthetic networks.^{50,51,52,53,54}

5. Future Directions and Open Challenges

Social networks have become popular and attracted large numbers of interactive users every day. The scale of networks increases rapidly leading to create a huge amount of network data. These networks typically have tens of thousands or billions of nodes and edges as well as complex network structures.^55,56,57 Developing algorithms that can be applied on large-scale networks is one of the main challenges in community detection problems on social networks. Real-world networks are often large-scale, dynamic, heterogeneous, or incomplete networks. Most existing community detection algorithms have to deal with high computational complexity on such complex networks.^4,21 A commonly used solution is to reduce the data scale by network reduction or network approximation, which may lose some important information and affect the performance of the model.²⁴ In this section, we discuss the challenging problems and future research directions potentially worth pursuing to solve the above-mentioned problems.

5.1. Dynamic networks

The user’s relationships in social networks often change over time therefore the networks have a dynamic structure¹³ and the communities in such networks also change accordingly. Since the topology and attributes of dynamic networks change over time, most existing learning-based community detection methods have to retrain the models according to the evolved network structure.²⁰ It is very time-consuming and not suitable for real-time processing. How to efficiently perform community detection on large networks is also a challenging task. Besides, exploiting temporal characteristics is crucial to have insights into community structures. We are in the era of big data and social networks. A large volume of data is created every day. Although multiple methods have been proposed to address the mentioned issues, in reality, there still remain many challenges that need to be addressed due to the dynamic evolution of social networks.

5.2. Heterogeneous networks

In this survey, we reviewed many proposed methods based on topological analysis but they mostly consider the users’ relationships without considering network heterogeneity. Heterogeneous networks contain different types of nodes and edges or different types of descriptions of nodes and edges.²⁴ In heterogeneous networks, we have to have appropriate strategies for clustering different types of nodes or for merging the different layers to infer the relationships between different nodes and identify communities accurately. Besides, the majority of proposed approaches can be applied to undirected graphs, only a few of which can be applied to directed graphs and weighted graphs. As described in Fig. 1, directed graphs and weighted graphs accurately reflect the nature of social networks. Therefore, community detection of directed graphs is a potential research direction to discover communities that truly reflect the real communities of real-world networks.

5.3. Multi-layered framework

A social network can be presented as a multi-layer graph where each layer contains a unique set of edges over the same underlying nodes.^16,43 Multi-layer networks provide a multi-layer framework to describe multiple types of interactions among entities. Multi-layer networks allow modeling the relationships among users from different perspectives. Therefore, edges in different layers have distinct semantics. Modeling the evolution and detecting the communities in multi-layer networks are potential research directions that should be considered. A possible solution is to incorporate information from multi-layer networks into single-layer networks and then use the existing community detection algorithms.

6. Conclusions

This survey provided an overview of community detection methods on social networks in recent years. We analyzed the characteristics of community detection models according to different approaches. In this work, we only focus on the methods of mining attributes and interactions of users. We proposed a novel classification of community detection approaches based on mining attributes and interactions of users: node content-based approaches, edge content-based approaches, and node and edge content-based approaches. Furthermore, we provided an overview of popularly used datasets for community detection tasks. Finally, we discuss some possible research directions and open challenges in community detection to stimulate further research in this area.

Acknowledgments

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number DS2021-26-03.

ORCID

Hai Bang Truong https://orcid.org/0000-0001-5094-3128

Mirjana Ivanovic https://orcid.org/0000-0003-1946-0384

Van Cuong Tran https://orcid.org/0000-0002-6291-8541

References

1. S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications (Cambridge University Press, 1994). Crossref, Google Scholar
2. M. Azaouzi, D. Rhouma and L. Ben Romdhane, Community detection in large-scale social networks: State-of-the-art and future directions, Soc. Netw. Anal. Min. 9 (2019) 1–32. Crossref, Google Scholar
3. S. Souravlas, S. D. Anastasiadou, T. Economides and S. Katsavounis, Probabilistic community detection in social networks, IEEE Access 11 (2023) 25629–25641. Crossref, Google Scholar
4. V. H. Bui and H. T. Phan, The computational complexity of hierarchical clustering algorithms for community detection: A review, Vietnam J. Comput. Sci. 10(4) (2023) 409–431. Link, Google Scholar
5. M. Rezvani and F. S. Kazemian, A survey on hierarchical community detection in large-scale complex networks, AUT J. Math. Comput. 3(2) (2022) 173–184. Google Scholar
6. S. Fortunato, Community detection in graphs, Phys. Rep. 486(3–5) (2010) 75–174. Crossref, Google Scholar
7. Y. Lei, Y. Zhou and J. Shi, Overlapping communities detection of social network based on hybrid c-means clustering algorithm, Sustain. Cities Soc. 47 (2019) 101436. Crossref, Google Scholar
8. M. Xu, Y. Li, R. Li, F. Zou and X. Gu, EADP: An extended adaptive density peaks clustering for overlapping community detection in social networks, Neurocomputing 337 (2019) 287–302. Crossref, Google Scholar
9. X. Yu, J. Yang and Z.-Q. Xie, A semantic overlapping community detection algorithm based on field sampling, Exp. Syst. Appl. 42(1) (2015) 366–375. Crossref, Google Scholar
10. H. Sun, H. Du, J. Huang, Z. Sun, L. He, X. Jia and Z. Zhao, Detecting semantic-based communities in node-attributed graphs, Comput. Intell. 34(4) (2018) 1199–1222. Crossref, Google Scholar
11. H. Yang, J. Zhang, X. Ding, C. Chen and L. Wang, GTIP: A gaming-based topic influence percolation model for semantic overlapping community detection, Entropy 24(9) (2022) 1274. Crossref, Google Scholar
12. X. Wang, D. Jin, X. Cao, L. Yang and W. Zhang, Semantic community identification in large attribute networks, Proc. AAAI Conf. Artif. Intell. 30(1) (2016) 265–271. Google Scholar
13. G. Rossetti and R. Cazabet, Community discovery in dynamic networks: A survey, ACM Comput. Surveys 51(2) (2018) 1–37. Crossref, Google Scholar
14. Z. Sun, B. Wang, J. Sheng, Z. Yu, R. Zhou and J. Shao, Community detection based on information dynamics, Neurocomputing 359 (2019) 341–352. Crossref, Google Scholar
15. C.-D. Wang, J.-H. Lai and S. Y. Philip, NEIWalk: Community discovery in dynamic content-based networks, IEEE Trans. Knowl. Data Eng. 26(7) (2013) 1734–1748. Crossref, Google Scholar
16. J. Kim and J.-G. Lee, Community detection in multi-layer graphs: A survey, ACM SIGMOD Rec. 44(3) (2015) 37–48. Crossref, Google Scholar
17. A. Jonnalagadda and L. Kuppusamy, A survey on game theoretic models for community detection in social networks, Social Netw. Anal. Min. 6 (2016) 1–24. Crossref, Google Scholar
18. H. Fani and E. Bagheri, Community detection in social networks, Encycl. Semantic Comput. Robotic Intell. 1(1) (2017) 1630001. Link, Google Scholar
19. J. Zhang, Z. Ma, Q. Sun and J. Yan, Research review on algorithms of community detection in complex networks, J. Phys.: Conf. Ser. 1069(1) (2018) 012124. Crossref, Google Scholar
20. M. A. Javed, M. S. Younis, S. Latif, J. Qadir and A. Baig, Community detection in networks: A multidisciplinary review, J. Netw. Comput. Appl. 108 (2018) 87–111. Crossref, Google Scholar
21. Q. Chen and L. Wei, Overlapping community detection of complex network: A survey, in 2019 20th Int. Conf. Parallel and Distributed Computing, Applications and Technologies (PDCAT) (IEEE, 2019), pp. 513–516. Crossref, Google Scholar
22. P. Chunaev, Community detection in node-attributed social networks: A survey, Comput. Sci. Rev. 37 (2020) 100286. Crossref, Google Scholar
23. S. Souravlas, A. Sifaleras, M. Tsintogianni and S. Katsavounis, A classification of community detection methods in social networks: A survey, Int. J. Gen. Syst. 50(1) (2021) 63–91. Crossref, Google Scholar
24. D. Jin, Z. Yu, P. Jiao, S. Pan, D. He, J. Wu, S. Y. Philip and W. Zhang, A survey of community detection approaches: From statistical modeling to deep learning, IEEE Trans. Knowl. Data Eng. 35(2) (2021) 1149–1170. Google Scholar
25. V. Moscato and G. Sperli, A survey about community detection over on-line social and heterogeneous information networks, Knowl.-Based Syst. 224 (2021) 107112. Crossref, Google Scholar
26. S. Xing et al., A comprehensive survey on community detection with deep learning, IEEE Trans. Neural Netw. Learn. Syst. (2022). https://doi.org/10.1109/TNNLS.2021.3137396 Google Scholar
27. Y. Wang, D. Jin, K. Musial and J. Dang, Community detection in social networks considering topic correlations, Proc. AAAI Conf. Artif. Intell. 33(1) (2019) 321–328. Google Scholar
28. H. Jiang, L. Sun, J. Ran, J. Bai and X. Yang, Community detection based on individual topics and network topology in social networks, IEEE Access 8 (2020) 124414–124423. Crossref, Google Scholar
29. J. Yang, J. McAuley and J. Leskovec, Community detection in networks with node attributes, in 2013 IEEE 13th Int. Conf. Data Mining (IEEE, 2013), pp. 1151–1156. Crossref, Google Scholar
30. W. Zhi-Xiao, L. Ze-Chao, D. Xiao-Fang and T. Jin-Hui, Overlapping community detection based on node location analysis, Knowl.-Based Syst. 105 (2016) 225–235. Crossref, Google Scholar
31. M. Li, S. Lu, L. Zhang, Y. Zhang and B. Zhang, A community detection method for social network based on community embedding, IEEE Trans. Comput. Soc. Syst. 8(2) (2021) 308–318. Crossref, Google Scholar
32. L. Liu, L. Xu, Z. Wangy and E. Chen, Community detection based on structure and content: A content propagation perspective, in 2015 IEEE Int. Conf. Data Mining (IEEE, 2015), pp. 271–280. Crossref, Google Scholar
33. B. Guidi and A. Michienzi, Dynamic community structure in online social groups, Information 12(3) (2021) 113. Crossref, Google Scholar
34. L. Luo, K. Liu, B. Guo and J. Ma, User interaction-oriented community detection based on cascading analysis, Inf. Sci. 510 (2020) 70–88. Crossref, Google Scholar
35. J. Sheng, J. Hu, Z. Sun, B. Wang, A. Ullah, K. Wang and J. Zhang, Community detection based on human social behavior, Physica A 531 (2019) 121765. Crossref, Google Scholar
36. Y. Wang, D. Jin, D. He, K. Musial and J. Dang, Community detection in social networks considering social behaviors, IEEE Access 10 (2022) 109969–109982. Crossref, Google Scholar
37. E. Akachar, B. Ouhbi and B. Frikh, ACSIMCD: A 2-phase framework for detecting meaningful communities in dynamic social networks, Future Gener. Comput. Syst. 125 (2021) 399–420. Crossref, Google Scholar
38. R. Liu, S. Feng, R. Shi and W. Guo, Weighted graph clustering for community detection of large social networks, Procedia Comput. Sci. 31 (2014) 85–94. Crossref, Google Scholar
39. Z. Zhao, S. Feng, Q. Wang, J. Z. Huang, G. J. Williams and J. Fan, Topic oriented community detection through social objects and link analysis in social networks, Knowl.-Based Syst. 26 (2012) 164–173. Crossref, Google Scholar
40. Y. Zhang, E. Levina and J. Zhu, Community detection in networks with node features, preprint (2015), arXiv:1509.01173. Google Scholar
41. D. Jin, X. Wang, D. He, J. Dang and W. Zhang, Robust detection of link communities with summary description in social networks, IEEE Trans. Knowl. Data Eng. 33(6) (2019) 2737–2749. Crossref, Google Scholar
42. Y. Li, C. Sha, X. Huang and Y. Zhang, Community detection in attributed graphs: An embedding approach, Proc. AAAI Conf. Artif. Intell. 32(1) (2018) 338–345. Google Scholar
43. X. Li, G. Xu, L. Jiao, Y. Zhou and W. Yu, Multi-layer network community detection model based on attributes and social interaction intensity, Comput. Electric. Eng. 77 (2019) 300–313. Crossref, Google Scholar
44. M. Qin, D. Jin, D. He, B. Gabrys and K. Musial, Adaptive community detection incorporating topology and content in social networks, in Proc. 2017 IEEE/ACM Int. Conf. Advances in Social Networks Analysis and Mining 2017 (IEEE, 2017), pp. 675–682. Crossref, Google Scholar
45. E. Alinezhad, B. Teimourpour, M. M. Sepehri and M. Kargari, Community detection in attributed networks considering both structural and attribute similarities: Two mathematical programming approaches, Neural Comput. Appl. 32 (2020) 3203–3220. Crossref, Google Scholar
46. X. You, Y. Ma and Z. Liu, A three-stage algorithm on community detection in social networks, Knowl.-Based Syst. 187 (2020) 104822. Crossref, Google Scholar
47. J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang and Z. Su, ArnetMiner: Extraction and mining of academic social networks, in Proc. 14th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (ACM, 2008), pp. 990–998. Crossref, Google Scholar
48. A. Lancichinetti, S. Fortunato and F. Radicchi, Benchmark graphs for testing community detection algorithms, Phys. Rev. E 78(4) (2008) 046110. Crossref, Google Scholar
49. Y. Zhang, Y. Liu, R. Jin, J. Tao, L. Chen and X. Wu, GLLPA: A graph layout based label propagation algorithm for community detection, Knowl.-Based Syst. 206 (2020) 106363. Crossref, Google Scholar
50. J. Zhu, B. Chen and Y. Zeng, Community detection based on modularity and k-plexes, Inf. Sci. 513 (2020) 127–142. Crossref, Google Scholar
51. Y. Zhang, Y. Liu, Q. Li, R. Jin and C. Wen, LILPA: A label importance based label propagation algorithm for community detection with application to core drug discovery, Neurocomputing 413 (2020) 107–133. Crossref, Google Scholar
52. H. Ma, Z. Liu, X. Zhang, L. Zhang and H. Jiang, Balancing topology structure and node attribute in evolutionary multi-objective community detection for attributed networks, Knowl.-Based Syst. 227 (2021) 107169. Crossref, Google Scholar
53. Z. Wu, X. Wang, W. Fang, L. Liu, S. Tang, H. Zheng and Z. Zheng, Community detection based on first passage probabilities, Phys. Lett. A 390 (2021) 127099. Crossref, Google Scholar
54. M. Magnani, O. Hanteer, R. Interdonato, L. Rossi and A. Tagarelli, Community detection in multiplex networks, ACM Comput. Surveys (CSUR) 54(3) (2021) 1–35. Crossref, Google Scholar
55. E. Pourabbasi, V. Majidnezhad, S. T. Afshord and Y. Jafari, A new single-chromosome evolutionary algorithm for community detection in complex networks by combining content and structural information, Exp. Syst. Appl. 186 (2021) 115854. Crossref, Google Scholar
56. K. Berahmand, S. Haghani, M. Rostami and Y. Li, A new attributed graph clustering by using label propagation in complex networks, J. King Saud Univ.-Comput. Inf. Sci. 34(5) (2022) 1869–1883. Google Scholar
57. F. Hu, J. Liu, L. Li and J. Liang, Community detection in complex networks using Node2vec with spectral clustering, Physica A 545 (2020) 123633. Crossref, Google Scholar

Vol. 12, No. 01

Metrics

Downloaded 444 times

History

Received 16 January 2024

Revised 5 February 2024

Accepted 9 February 2024

Published: 22 March 2024

Information

This is an Open Access article published by World Scientific Publishing Company. It is distributed under the terms of the Creative Commons Attribution 4.0 (CC BY) License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Keywords

PDF download