This paper proposed a pedestrian evacuation model combined with reinforcement learning in order to study how to better guide pedestrians to complete evacuation in specific indoor scenes. The model adopted the cellular-automaton approach to scene construction and formulated reward rules according to the characteristics of the scene. It modeled the psychological activity of pedestrians during an actual evacuation and trained a pedestrian strategy at the overall level with the Q-learning algorithm from reinforcement learning. A speed control mechanism combined with real statistical data was introduced to simulate speed attenuation. A simulation platform was built to compare evacuation under different scenarios and different total numbers of pedestrians. The research showed that the model could automatically realize the exit-selection behavior of pedestrians and partially reproduce conformity behavior. In the same evacuation scenario, the model adapted to different total numbers of pedestrians.
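As a rough illustration of the kind of tabular Q-learning update such a cellular-automaton evacuation model could use, the sketch below trains exit-seeking moves on a small grid; the grid size, exit cells, rewards, and hyperparameters are assumptions for illustration, not values from the paper.

```python
# Minimal sketch: tabular Q-learning for exit choice on a cellular-automaton grid.
import numpy as np

H, W = 10, 10                                   # assumed grid dimensions
exits = {(0, 4), (9, 4)}                        # assumed exit cells
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
Q = np.zeros((H, W, len(actions)))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(cell, a):
    r, c = cell[0] + actions[a][0], cell[1] + actions[a][1]
    r, c = min(max(r, 0), H - 1), min(max(c, 0), W - 1)   # stay inside the room
    nxt = (r, c)
    reward = 10.0 if nxt in exits else -0.1               # exit bonus vs. time penalty
    return nxt, reward, nxt in exits

for episode in range(2000):
    cell = (np.random.randint(H), np.random.randint(W))
    done = cell in exits
    for t in range(500):                                  # cap episode length
        if done:
            break
        a = np.random.randint(len(actions)) if np.random.rand() < eps \
            else int(np.argmax(Q[cell[0], cell[1]]))
        nxt, reward, done = step(cell, a)
        target = reward + (0.0 if done else gamma * Q[nxt[0], nxt[1]].max())
        Q[cell[0], cell[1], a] += alpha * (target - Q[cell[0], cell[1], a])
        cell = nxt
```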
Evolutionary game theory provides a platform to investigate the emergence of cooperation in populations of selfish agents. In this work, we study evolutionary games on networks in which agents cooperate or defect according to Q-learning algorithms with an extended state space. The extended state space provides agents with two types of information: local environment information based on the cooperation level in an agent's neighborhood, and personal information based on the agent's last action. Through numerical simulations, we find that richer local environment information tends to improve cooperation in the population regardless of whether personal information is present. Moreover, we show that, for the same local environment information, introducing personal information may improve cooperation, except in situations with a low amount of local environment information, where personal information deteriorates cooperation in a bad-condition environment. For the same total information, the absence of personal information promotes cooperation in a bad-condition environment, while its presence promotes cooperation in a good-condition environment. By investigating the distributions and temporal behaviors of Q-values, we explain the above observations. This work suggests an effective way of extending the state space in evolutionary games incorporating the Q-learning algorithm to enhance cooperation.
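A minimal sketch of the kind of extended state an agent could use here: a discretized local cooperation level, optionally combined with the agent's own last action. The bin count and parameters below are illustrative assumptions, not the paper's settings.

```python
# Sketch: Q-learning state built from local environment and personal information.
import numpy as np

N_BINS = 5          # resolution of the local-environment information

def state_index(neighbor_actions, last_action=None):
    """neighbor_actions: list of 0 (defect) / 1 (cooperate) for the neighbors."""
    coop_level = sum(neighbor_actions) / max(len(neighbor_actions), 1)
    env_bin = min(int(coop_level * N_BINS), N_BINS - 1)
    if last_action is None:                 # local environment information only
        return env_bin
    return env_bin * 2 + last_action        # environment + personal information

n_states = N_BINS * 2
Q = np.zeros((n_states, 2))                 # actions: 0 = defect, 1 = cooperate

def update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    # standard Q-learning update after receiving the game payoff r
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```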
We propose a new method to categorize continuous numeric percepts for Q-learning, where percept vectors are classified into categories on the basis of fuzzy ART and Q-learning uses the categories as states to acquire rules for agent behavior. For efficient learning, we modify fuzzy ART to reduce the number of categories without deteriorating the efficiency of reinforcement learning. In our modification, a vigilance parameter is defined for each category in order to control the size of that category, and it is updated during learning. The vigilance update method is based on category integration, which contributes to reducing the number of categories. We define a similarity for every category pair to judge whether category integration should be performed. When two categories are integrated into a new category, a vigilance parameter for the new category is calculated and the categories used for integration are discarded, so that the number of categories is reduced without imposing an explicit limit on it. Experimental results show that Q-learning with the modified fuzzy ART acquires good rules for agent behavior more efficiently than Q-learning with ordinary fuzzy ART, even though the number of categories generated by the modified fuzzy ART is much smaller than that generated by the ordinary one.
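The sketch below illustrates one way per-category vigilance and similarity-based category integration could look for complement-coded fuzzy ART categories; the similarity measure and vigilance formula are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch: per-category vigilance and category integration for fuzzy ART.
import numpy as np

def category_size(w, dim):
    # size of the hyper-rectangle encoded by a complement-coded weight vector
    # (w has length 2*dim; dim is the original percept dimension)
    return dim - float(np.sum(w))

def similarity(w_i, w_j):
    # overlap of the smallest box enclosing both categories, relative to the larger one
    merged = np.minimum(w_i, w_j)          # fuzzy AND = enclosing box
    return float(np.sum(merged)) / max(float(np.sum(w_i)), float(np.sum(w_j)))

def maybe_integrate(w_i, w_j, dim, sim_threshold=0.9):
    """Merge two categories if they are similar enough; return (weight, vigilance) or None."""
    if similarity(w_i, w_j) < sim_threshold:
        return None
    w_new = np.minimum(w_i, w_j)
    # per-category vigilance chosen so inputs inside the merged box still pass
    # the match test (|I| = dim for complement-coded inputs)
    rho_new = 1.0 - category_size(w_new, dim) / dim
    return w_new, rho_new
```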
A weighted densely connected convolutional network (W-DenseNet) is proposed for reinforcement learning in this work. W-DenseNet maximizes the information flow between all layers in the network through cross-layer connections, which reduces gradient vanishing and degradation and greatly improves the speed of training convergence. With the weight coefficients introduced in W-DenseNet, the current layer receives all previous layers' feature maps with different initial weights, so feature information from different layers can be extracted more effectively according to the task. Based on the weights adjusted during learning, cross-layer connections with smaller weights are pruned, reducing the number of cross-layer connections. In this work, the GridWorld and FlappyBird games are used for simulation. The simulation results of deep reinforcement learning based on W-DenseNet are compared with the traditional deep reinforcement learning algorithm and a reinforcement learning algorithm based on DenseNet. The results show that the proposed W-DenseNet method converges better, reduces training time, and obtains more stable results.
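A hedged PyTorch sketch of a dense block with learnable per-connection weights and small-weight pruning, in the spirit of the description above; the layer sizes, the weighted-sum aggregation, and the pruning threshold are assumptions and may differ from the paper's architecture.

```python
# Sketch: dense block with weighted cross-layer connections and pruning.
import torch
import torch.nn as nn

class WeightedDenseBlock(nn.Module):
    def __init__(self, channels, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(n_layers)
        )
        # one learnable scalar weight per cross-layer connection
        self.conn = nn.ParameterList(
            nn.Parameter(torch.ones(i + 1)) for i in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for i, layer in enumerate(self.layers):
            w = self.conn[i]
            # weighted combination of all previous feature maps
            agg = sum(w[j] * feats[j] for j in range(len(feats)))
            feats.append(torch.relu(layer(agg)))
        return feats[-1]

    def prune(self, threshold=0.05):
        # zero out cross-layer connections whose learned weight is small
        with torch.no_grad():
            for w in self.conn:
                w[w.abs() < threshold] = 0.0
```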
In this paper, we study the path-planning problem of emergency fire control robots in a nuclear environment. Given the high risk of the nuclear environment, the irregularity of its spatial shape, and the complex distribution of obstacles, a robot path-planning method is proposed based on the combination of Q-learning and the BCD raster map decomposition method. It realizes automated decontamination of the nuclear-contaminated environment and reduces the exposure risk of manual intervention. First, Q-learning, a reinforcement learning model, is used to establish the optimal path between the start and end points of the operation area. Second, the BCD raster map decomposition method is used to realize the global division of the operation area. Then, an improved partition merging method based on the Q-learning optimal path is proposed to complete the merging of job sub-regions and coverage path planning. Finally, simulation experiments show that the technique can quickly and stably achieve global path coverage in the specialized operating environment of the nuclear domain.
Reinforcement learning usually requires a trial-and-error process called exploration, and a uniform pseudorandom number generator is commonly considered effective for that process. Chaotic sources can also serve as exploration generators, producing random-like sequences much as stochastic sources do. In this research, we investigate the efficiency of a deterministic chaotic generator for exploration in learning a nonstationary shortcut maze problem. We find that a deterministic chaotic generator based on the logistic map performs better for exploration than the stochastic random generator. This is made clear by analyzing the difference in performance between the two generators in terms of the patterns of exploration occurrence. We also examine the tent map, which is topologically conjugate to the logistic map, and compare it with the other generators.
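A minimal sketch of the idea: the logistic map x_{n+1} = 4 x_n (1 - x_n) supplies the numbers that drive epsilon-greedy exploration in place of a uniform pseudorandom generator. The epsilon value and action-selection details are illustrative assumptions.

```python
# Sketch: chaotic (logistic-map) generator as the exploration source in epsilon-greedy.
import numpy as np

class LogisticGenerator:
    """x_{n+1} = 4 x_n (1 - x_n); values in (0, 1) used like uniform numbers."""
    def __init__(self, seed=0.3):
        self.x = seed
    def next(self):
        self.x = 4.0 * self.x * (1.0 - self.x)
        return self.x

def epsilon_greedy(q_row, gen, eps=0.1):
    # gen.next() replaces np.random.rand(); the exploration pattern now follows
    # the deterministic chaotic sequence instead of a stochastic one
    if gen.next() < eps:
        return int(gen.next() * len(q_row)) % len(q_row)
    return int(np.argmax(q_row))
```

The tent map mentioned above can be swapped in simply by replacing the update rule inside `next()`.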
This paper addresses the problem of efficiency in reinforcement learning for Single Robot Hose Transport (SRHT) by training an Extreme Learning Machine (ELM) from the state-action value Q-table, obtaining a large reduction in data space requirements because the number of ELM parameters is much smaller than the size of the Q-table. Moreover, the ELM implements a continuous map that can produce compact representations of the Q-table, along with generalizations to increased space resolution and unknown situations. In this paper we empirically evaluate three strategies to formulate ELM learning as an approximation to the Q-table, namely as classification, as multivariate regression, and as several independent regression problems.
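As a hedged illustration of the multivariate-regression formulation, the sketch below fits an ELM (random fixed input weights, closed-form output weights) to rows of a Q-table; the hidden-layer size and state encoding are assumptions, not SRHT specifics.

```python
# Sketch: approximating a Q-table with an Extreme Learning Machine (regression).
import numpy as np

def train_elm(states, q_targets, n_hidden=100, seed=0):
    """states: (N, d) state features; q_targets: (N, n_actions) rows of the Q-table."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(states.shape[1], n_hidden))    # random input weights (fixed)
    b = rng.normal(size=n_hidden)
    H = np.tanh(states @ W + b)                          # hidden-layer activations
    beta = np.linalg.pinv(H) @ q_targets                 # closed-form output weights
    return W, b, beta

def elm_q(state, W, b, beta):
    return np.tanh(state @ W + b) @ beta                 # approximate Q-values for all actions
```

The classification variant would instead predict the argmax action per state, and the "several independent regressions" variant would fit one such model per action.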
Traditional image enhancement algorithms do not account for the subjective evaluation of human operators. Every observer has a different opinion of an ideally enhanced image. Automated techniques for obtaining a subjectively ideal image enhancement are desirable, but do not currently exist. In this paper, we demonstrate that reinforcement learning is a potential method for solving this problem. We have developed an agent that uses the Q-learning algorithm. The agent modifies the contrast of an image with a simple linear point transformation based on the histogram of the image and the feedback it receives from human observers. The results of several testing sessions indicate that the agent performs well within a limited number of iterations.
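A minimal sketch of such a loop, assuming a coarse histogram-spread state, multiplicative contrast gains as actions, and a human rating as the reward; these choices are illustrative, not the paper's implementation.

```python
# Sketch: Q-learning for contrast enhancement with human feedback as reward.
import numpy as np

GAINS = [0.8, 0.9, 1.0, 1.1, 1.2]        # actions: contrast gains for the point transform
Q = np.zeros((10, len(GAINS)))           # states: coarse bins of histogram spread
alpha, gamma = 0.2, 0.9

def state_of(img):
    spread = img.std() / 128.0                        # crude contrast measure
    return min(int(spread * 10), 9)

def apply_gain(img, g):
    mean = img.mean()
    return np.clip((img - mean) * g + mean, 0, 255)   # linear point transformation

def learn_step(img, human_rating):
    """human_rating in [-1, 1], supplied by the observer after viewing the result."""
    s = state_of(img)
    a = int(np.argmax(Q[s])) if np.random.rand() > 0.1 else np.random.randint(len(GAINS))
    out = apply_gain(img, GAINS[a])
    s2 = state_of(out)
    Q[s, a] += alpha * (human_rating + gamma * Q[s2].max() - Q[s, a])
    return out
```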
This paper presents a unified overview of a new family of distributed algorithms for routing and load balancing in dynamic communication networks. These new algorithms are described as an extension of classical routing algorithms: they combine the ideas of online asynchronous distance-vector routing with adaptive link-state routing. Estimates of the current traffic conditions and link costs are measured by sending routing agents into the network; these agents mix with the regular information packets and keep track of the costs (e.g. delay) encountered during their journey. The routing tables are then regularly updated based on that information, without any central control or complete knowledge of the network topology. Two new algorithms are proposed here. The first is based on round-trip routing agents that update the routing tables by backtracking their way after having reached the destination. The second relies on forward agents that update the routing tables directly as they move toward their destination. An efficient cooperative scheme is proposed to deal with asymmetric connections. All these methods are compared on a simulated network under various traffic loads, and the robustness of the new algorithms to network changes is demonstrated in various dynamic scenarios.
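The sketch below shows one plausible form of the round-trip (backward) agent's update: walking back along its recorded path and nudging each node's delay estimate toward the measured value. The data structures and the moving-average rule are assumptions in the spirit of the description, not the paper's exact algorithm.

```python
# Sketch: backward routing-agent update of per-node routing tables.
# table[node][destination][next_hop] = estimated delay (each table[node] is a dict)

def backward_update(table, path, delays, destination, eta=0.3):
    """path: nodes visited on the forward journey; delays[i]: measured delay from path[i]
    to the destination, as recorded by the agent."""
    for i in range(len(path) - 1):
        node, next_hop = path[i], path[i + 1]
        est = table[node].setdefault(destination, {})
        old = est.get(next_hop, delays[i])
        # exponential moving average toward the freshly measured delay
        est[next_hop] = (1 - eta) * old + eta * delays[i]
```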
This study examines the complex and evolutionary nature of supply network configuration. Taking a bottom-up approach, we examine how supply network configuration at the macro-level evolves as a result of individual retailers' dynamic choice of procurement strategies at the micro-level. Employing agent-based modeling, we focus on the effects of switch cost and distributors' ordering policy on the evolution of supply network configuration. Our results show that (1) supply networks tend to evolve into a set of separate supply chains when switch cost is high and into an integrated network when switch cost is low, (2) a responsive ordering policy adopted by distributors is more conducive to the integrated network configuration than a non-responsive policy, and (3) lack of coordination among retailers in their dynamic choice of procurement strategies hurts not only the overall system performance, but also retailers themselves. More importantly, our study demonstrates the capabilities of agent-based modeling as a methodology for researching complex supply network issues.
In-stream big data processing is an important part of big data processing. Proactive decision support systems can predict future system states and execute actions to avoid unwanted states. In this paper, we propose a proactive decision support system for online event streams. Based on Complex Event Processing (CEP) technology, the method uses a structure-varying dynamic Bayesian network to predict future events and system states. Different Bayesian network structures are learned and used according to different event contexts. A networked distributed Markov decision process model with predicted states is proposed as the sequential decision-making model. A Q-learning method is investigated for this model to find the optimal joint policy. Experimental evaluations show that the method works well for congestion control in a transportation system.
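A minimal sketch of the proactive idea for a single agent: the Q-learning state is augmented with the event predicted by the CEP/Bayesian-network component, so decisions anticipate the predicted future. The state encoding, action set, and parameters are illustrative assumptions.

```python
# Sketch: Q-learning over states augmented with a predicted future event.
from collections import defaultdict
import random

Q = defaultdict(float)
ACTIONS = ["keep", "reroute", "throttle"]     # assumed congestion-control actions
alpha, gamma, eps = 0.1, 0.9, 0.1

def choose(obs, predicted_event):
    state = (obs, predicted_event)            # proactive: decision uses the prediction
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(obs, predicted_event, action, reward, next_obs, next_pred):
    s, s2 = (obs, predicted_event), (next_obs, next_pred)
    best_next = max(Q[(s2, a)] for a in ACTIONS)
    Q[(s, action)] += alpha * (reward + gamma * best_next - Q[(s, action)])
```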
The aim of this paper is to reduce the energy consumption of a humanoid by analyzing electrical power as input to the robot and mechanical power as output. The analysis considers motor dynamics during standing-up and sitting-down tasks. The motion tasks of the humanoid are described in terms of joint position, joint velocity, joint acceleration, joint torque, center of mass (CoM), and center of pressure (CoP). To reduce the complexity of the analysis, the humanoid is modeled as a planar robot with four links and three joints. The humanoid robot learns to reduce the overall motion torque by applying Q-Learning in a simulated model. The resulting motions are evaluated on a physical NAO humanoid robot during standing-up and sitting-down tasks and then contrasted with a pre-programmed task on the NAO. The stand-up and sit-down motions are analyzed for individual joint current usage, power demand, torque, angular velocity, acceleration, and CoM and CoP locations. The overall result is an improvement in energy efficiency of 25–30% compared to the pre-programmed NAO stand-up and sit-down motion task.
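One plausible way to express "reduce the overall motion torque" as a learning signal is a per-step reward that penalizes joint effort and rewards reaching the goal posture; the weighting and bonus below are assumptions, not the paper's values.

```python
# Sketch: torque-penalizing reward for learning low-energy stand-up/sit-down motions.
import numpy as np

def energy_reward(joint_torques, reached_goal_posture, w_torque=0.01, bonus=10.0):
    """joint_torques: torques of the three planar joints at this time step."""
    cost = w_torque * float(np.sum(np.abs(joint_torques)))   # effort penalty
    return (bonus if reached_goal_posture else 0.0) - cost

# Example Q-learning target using this reward:
#   Q[s, a] += alpha * (energy_reward(tau, done) + gamma * Q[s_next].max() - Q[s, a])
```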
Value function approximation plays an important role in reinforcement learning (RL) with continuous state spaces, which is widely used to build decision models in practice. Many traditional approaches require experienced designers to manually specify the formulation of the approximating function, leading to a rigid, non-adaptive representation of the value function. To address this problem, a novel Q-value function approximation method named Hierarchical fuzzy Adaptive Resonance Theory (HiART) is proposed in this paper. HiART is based on the Fuzzy ART method and is an adaptive classification network that learns to segment the state space by classifying the training input automatically. HiART begins with a highly generalized structure in which the number of category nodes is limited, which helps speed up the learning process at the early stage. The network is then refined gradually by creating attached sub-networks, forming a layered network structure in the process. Based on this adaptive structure, HiART alleviates the dependence on expert experience for designing the network parameters. The effectiveness and adaptivity of HiART are demonstrated on the Mountain Car benchmark problem, with both fast learning and low computation time. Finally, a simulated application example, the one-versus-one air combat decision problem, illustrates the applicability of HiART.
American options are important financial products traded in enormous volumes across the world. Therefore, accurate and efficient valuation is of paramount importance for global financial markets. Due to the early exercise feature, the pricing of American options is significantly more complicated than that of European options, and an analytical closed-form solution is unavailable even for simple dynamic models. Practitioners employ various valuation methods to strike a balance: accurate valuation is usually inefficient, while fast valuation tends to be inaccurate. In this paper, we provide an innovative solution that addresses both the accuracy and the efficiency of pricing American options by applying quantum reinforcement learning. Moreover, the quantum part of the new approach could potentially speed up the calculation dramatically.
This chapter introduces a new reinforcement learning method for solving a train marshaling problem for assembling an outgoing train. In the problem, the arrangement of incoming freight cars is assumed to be random. The cars are then rearranged into the desired layout in order to assemble an outgoing train. In the proposed method, each set of freight cars with the same destination forms a group, and the desirable group layout constitutes the best outgoing train. A rearrangement operation is conducted using several sub-tracks, and the outgoing train is assembled on the main track. When a rearrangement operation is conducted in the proposed method, several cars located on different sub-tracks are collected by a locomotive. In order to rearrange the cars into the desired order, cars are moved from one sub-track to another. Each marshaling plan, consisting of a series of removal and rearrangement operations, is generated by a reinforcement learning system based on the transfer distance of the locomotive. The total transfer distance of the locomotive required to assemble an outgoing train can thus be minimized by the proposed method.
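The sketch below shows one plausible way the locomotive transfer distance could drive the learning signal for marshaling plans: each car movement is rewarded with the negative distance it incurs. The state/action encoding and reward shaping are assumptions, not the chapter's exact formulation.

```python
# Sketch: Q-learning update for marshaling moves, rewarded by negative transfer distance.
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma = 0.1, 0.95

def update(layout, move, distance, next_layout, candidate_moves, finished):
    """layout: hashable sub-track arrangement; move: (car_group, from_track, to_track);
    distance: locomotive transfer distance incurred by this move."""
    reward = -distance + (100.0 if finished else 0.0)   # minimize total distance
    best_next = 0.0 if finished else max(
        (Q[(next_layout, m)] for m in candidate_moves), default=0.0)
    Q[(layout, move)] += alpha * (reward + gamma * best_next - Q[(layout, move)])
```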
QWalking uses Q-Learning to find optimal walking patterns for biped robots and has two novel features. First, QWalking does not require a precise dynamics model; thus a robot can learn to dynamically switch walking patterns without knowing exactly the position of the centre of mass or the angular momentum of each link. Second, inspired by the psychological reward system in the human brain, we introduce two rewards (positive and negative) and a punishment rate to adjust the weight of these two rewards. In this paper, our contributions include the exploration of learning robust and energy-efficient walking patterns using QWalking. We evaluate QWalking in the RoboCup 3D simulator SPARK on a NAO robot, and the robot successfully yields fast walking patterns that adapt to different torso masses. We also explore how to obtain energy-efficient walking patterns using QWalking. Experimental results show a clear decrease in the mechanical cost of transport (MCT).
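A minimal sketch of combining a positive and a negative reward through a punishment rate, as described above; the specific signals and the default rate are illustrative assumptions, not QWalking's actual definitions.

```python
# Sketch: weighting a positive and a negative reward with a punishment rate.
def combined_reward(forward_progress, fall_or_instability, punishment_rate=0.5):
    """forward_progress >= 0 rewards fast walking; fall_or_instability >= 0 penalizes it."""
    r_pos = forward_progress
    r_neg = fall_or_instability
    # the punishment rate shifts the balance between seeking reward and avoiding punishment
    return (1.0 - punishment_rate) * r_pos - punishment_rate * r_neg
```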
QWalking is a model-free biped locomotion method that uses Q-Learning to find optimal walking patterns for a humanoid robot. Having previously tested it in a simulated environment, here we investigate the implementation of QWalking on a real NAO robot to yield fast locomotion patterns. We compare related work on locomotion for the NAO robot and discuss the software infrastructure needed to use QWalking on it. Experimental results show that it develops fast walking patterns that are competitive with the fastest approach currently published for this robot, with the advantage of not requiring precise dynamic models or extensive state sensing.
In this chapter, an intelligent method for generating a marshaling plan for the freight cars in a train is introduced. Initially, freight cars are located in a freight yard in a random layout, and they are moved and lined up on a main track in a certain desired order to assemble an outbound train. Marshaling plans are obtained by a reinforcement learning system based on the processing time. To evaluate the processing time, the total transfer distance of the locomotive and the total number of freight car movements are considered simultaneously. Moreover, by grouping freight cars that have the same destination, the set of candidate desired arrangements of the outbound train is extended. This feature is incorporated into the learning algorithm so that the total processing time is reduced. The order of the freight car movements, the position of each removed car, the layout of the groups in the train, the arrangement of cars within each group, and the number of cars to be moved are then simultaneously optimized to minimize the total processing time required to obtain the desired arrangement of freight cars for the outbound train.