Microservice architecture is a new architecture pattern that aims to provide users with more reliable, maintainable, and extensible software design services. However, as microservice application systems continue to grow in scale, the proliferation of services and service interactions makes system fault detection difficult. Accurate and effective fault detection is key to ensuring system reliability and stability. From the perspective of microservice operation status and the dependencies between services, this paper proposes a space-aware bidirectional gated recurrent unit (BGRU) microservice fault detection algorithm, which uses deep learning to mine hidden information that causes failures and combines it with space-aware attention to establish long-distance spatial dependencies, improving the detection accuracy of the model. The paper also reports experiments that demonstrate the effectiveness of the algorithm in microservice fault detection.
In recent years, more and more developers have been building applications on the cloud-native architecture. Containers and microservices are two essential components of this architecture. Container technologies such as Docker and Kubernetes help developers achieve consistent and scalable delivery of complex software applications. Microservice technologies, in turn, facilitate the division of complex applications into multiple functionally independent and composable components, which further increases the flexibility of applications. With the support of cloud computing platforms, cloud-native applications become easier to manage and maintain, and more scalable. However, it is challenging to identify performance issues in microservices because of the complex runtime environments and the numerous monitoring metrics. To address this issue, this paper proposes a novel root cause analysis approach. The approach first constructs a service dependency graph based on metrics collected in real time. Next, the anomaly weight of each microservice is automatically updated by extending the mRank algorithm. Finally, a PageRank-based random walk is adopted to further rank the root causes, i.e., the potentially problematic services. Experiments conducted on Kubernetes clusters show that the proposed approach achieves good analysis results and outperforms several baseline methods.
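The last two steps of the approach described above (anomaly weights on a service dependency graph, followed by a PageRank-based random walk) can be illustrated with a minimal sketch. This is not the paper's algorithm: the mRank extension is not reproduced, and the function name `rank_root_causes`, the input format, and the per-service anomaly scores are all hypothetical. The sketch assumes a walker that moves from a caller to its callees in proportion to each callee's anomaly score and restarts at services in proportion to their own scores.

```python
def rank_root_causes(calls, anomaly, damping=0.85, iters=100):
    """Rank services by a personalized PageRank-style random walk.

    calls:   dict mapping a caller to the list of services it calls
             (hypothetical representation of the dependency graph).
    anomaly: dict mapping a service to a non-negative anomaly score
             (hypothetical; the paper derives its weights differently).
    """
    services = sorted(
        set(calls) | {c for cs in calls.values() for c in cs} | set(anomaly)
    )
    total = sum(anomaly.get(s, 0.0) for s in services)
    # Restart distribution: the walker restarts at anomalous services.
    if total > 0:
        restart = {s: anomaly.get(s, 0.0) / total for s in services}
    else:
        restart = {s: 1.0 / len(services) for s in services}

    score = dict(restart)
    for _ in range(iters):
        nxt = {s: (1.0 - damping) * restart[s] for s in services}
        for s in services:
            callees = calls.get(s, [])
            if callees:
                # Move from caller to callee, favoring anomalous callees.
                weights = [anomaly.get(c, 0.0) + 1e-6 for c in callees]
                wsum = sum(weights)
                for c, w in zip(callees, weights):
                    nxt[c] += damping * score[s] * w / wsum
            else:
                # A service with no callees hands its mass to the restart
                # distribution so that the total probability stays 1.
                for t in services:
                    nxt[t] += damping * score[s] * restart[t]
        score = nxt
    return sorted(services, key=lambda s: score[s], reverse=True)


# Toy graph: frontend calls cart and catalog, both of which call db.
calls = {"frontend": ["cart", "catalog"], "cart": ["db"], "catalog": ["db"]}
anomaly = {"frontend": 0.6, "cart": 0.7, "catalog": 0.1, "db": 0.9}
ranking = rank_root_causes(calls, anomaly)
print(ranking)
```

In this toy graph the walk concentrates on `db`, the most anomalous service that the other anomalous services depend on, which is the intuition behind ranking root causes with a random walk over call edges.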
With the spread of DevOps practices and container technologies, Microservice Architecture (MSA) has become a mainstream architecture style in recent years. Resilience is a key characteristic of MSA Systems: it denotes the ability to cope with the various kinds of system disturbances that cause service degradation. However, because the software field lacks a consensus definition of resilience, and despite the large body of work on resilience for MSA Systems, developers still do not have a clear idea of how resilient an MSA System should be or which resilience mechanisms are needed.
In this paper, drawing on existing systematic studies of resilience in other scientific areas, a definition of microservice resilience is provided and a Microservice Resilience Measurement Model is proposed to measure service resilience. A requirement model representing the resilience requirements of MSA Systems is also given. The requirement model uses elements of KAOS to represent notions in the measurement model and decomposes service resilience goals into system behaviors that can be executed by system components. As a proof of concept, a case study is conducted on an MSA System to illustrate how the proposed models are applied.
Service anomalies are difficult to locate accurately because they propagate through service dependencies in microservice systems. In addition, protection mechanisms are introduced into microservice systems to ensure the stable operation of services. However, existing approaches ignore the impact of protection mechanisms on the root cause localization of abnormal services. Specifically, circuit breaking and rate limiting mechanisms can refuse service requests and thus change the way anomalies propagate. Moreover, differing service request frequencies and latencies make service dependencies change dynamically, resulting in different probabilities of anomaly propagation among services. In this paper, we propose a novel framework named MicroGBPM to locate the root cause of abnormal services. When a failure occurs, we model the anomaly propagation among services as a dynamically constructed service attributed graph built from metrics and traces. To eliminate the impact of the protection mechanisms, we design a two-stage dynamic calibration strategy that adjusts the probability of anomaly propagation among services. We then propose a random walk approach that computes the root cause results using the PageRank algorithm. The experimental results show that MicroGBPM improves the accuracy of root cause localization compared with other approaches in microservice systems with protection mechanisms.
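To make concrete the idea that refused requests change propagation probabilities, the sketch below weights each call edge by the number of requests that were actually admitted and normalizes the weights per caller. It is only an illustrative assumption, not MicroGBPM's two-stage calibration strategy, whose details the abstract does not give. The function name `transition_probabilities` and the input format are hypothetical.

```python
def transition_probabilities(edge_stats):
    """Turn per-edge request counts into per-caller transition probabilities.

    edge_stats: dict mapping (caller, callee) to a tuple
                (total_requests, rejected_requests), where rejected requests
                are those refused by circuit breaking or rate limiting.
                This input format is a hypothetical choice for illustration.
    """
    admitted = {}
    for (caller, callee), (total, rejected) in edge_stats.items():
        if rejected < 0 or rejected > total:
            raise ValueError(f"invalid counts for edge {caller}->{callee}")
        # Only admitted requests can carry latency or errors back to the caller.
        admitted.setdefault(caller, {})[callee] = total - rejected

    probs = {}
    for caller, per_callee in admitted.items():
        volume = sum(per_callee.values())
        if volume == 0:
            # Every request was refused: fall back to a uniform choice.
            probs[caller] = {c: 1.0 / len(per_callee) for c in per_callee}
        else:
            probs[caller] = {c: n / volume for c, n in per_callee.items()}
    return probs


# The gateway sends 100 requests to each service, but a circuit breaker
# rejects 75 of the requests to payments.
stats = {("gateway", "orders"): (100, 0), ("gateway", "payments"): (100, 75)}
print(transition_probabilities(stats))
```

Under this assumption, an edge protected by an open circuit breaker receives a lower transition probability than an unprotected edge with the same request volume. The resulting probabilities could then serve as edge weights in a PageRank-style random walk.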