Nowadays, there is a significant need for maintenance-free modern Internet of Things (IoT) devices which can monitor an environment. IoT devices such as these are mobile embedded devices which provide data to the internet via a Low Power Wide Area Network (LPWAN). LPWAN is a promising communications technology which enables machine-to-machine (M2M) communication and is suitable for small mobile embedded devices. This paper presents a novel data-driven self-learning (DDSL) controller algorithm dedicated to controlling small, mobile, maintenance-free embedded IoT devices. The DDSL algorithm is based on a modified Q-learning algorithm which enables energy-efficient, data-driven behavior of mobile embedded IoT devices. The aim of the DDSL algorithm is to dynamically set operation duty cycles according to the estimation of future collected data values, leading to effective operation of power-aware systems. The presented solution was tested on a historical data set and compared with a fixed-duty-cycle reference algorithm. The root mean square error (RMSE) and the number of measurements of the DDSL algorithm were compared to the reference algorithm, and two independent criteria (the performance score and the normalized geometric distance) were used for overall evaluation and comparison. The experiments showed that the novel DDSL method reaches a significantly lower RMSE while the number of transmitted data points is less than or equal to that of the fixed-duty-cycle algorithm. The overall performance score is 40% higher than that of the reference algorithm based on static configuration settings.
The article deals with the design and application of a control algorithm for a prototype of an efficient Low-Cost, Low-Power, Low-Complexity (L-CPC) bidirectional communication system for the reading and configuration of embedded devices. Low Power Wide Area Networks (LPWANs) and the fifth-generation technology standard for broadband cellular networks (5G) are promising technologies for connecting compact mobile monitoring embedded devices to the internet using machine-to-machine (M2M) communications [
Generally, reinforcement learning (RL) methods are suitable for mobile IoT devices as an easy-to-implement machine learning approach with low computational demands [
Several research articles have used various implementations of RL principles, especially QL, for monitoring IoT devices at the network level (see
In this article, the application of a novel DDSL control approach for mobile monitoring IoT devices based on wake-up scheduling is presented (
| Author, source | Algorithm | Description | Advantages |
|---|---|---|---|
| Savaglio et al. [ | QL-MAC | Self-adjusting node duty cycle | + Low energy states |
| Wei et al. [ | QS-TDMA | Task scheduling algorithm | + Reliability and real-time performance of WSNs |
| Wang et al. [ | TTDD-QL | Two-tier data dissemination scheme | + Reduced energy consumption |
| Redhu et al. [ | QL | Joint mobile sink scheduling and dynamic buffer management | + Improved network lifetime |
| Kosunalp [ | QL-SEP | Prediction algorithm | + Useful for solar-powered devices |
| Mirhoseini et al. [ | QL | Hybrid energy supply system | + Improved system lifetime |
| Al Islam et al. [ | QRTT | Prediction algorithm | + Useful for wireless embedded devices |
| Zhang et al. [ | DQL-EES | Energy-efficient scheduling | + Energy efficient |
The remainder of the article is organized as follows: the background section describes power-aware challenges, the general Q-learning algorithm principle and future value estimation by polynomial approximation. The experimental section describes a designed controller, reference algorithm and the evaluation criteria. The experiment summary is elaborated in the results section, followed by a technical discussion. The final section concludes the article and discusses several research challenges as future work.
This section introduces the theoretical background for a general description of the Q-learning algorithm and mathematical formalization of the applied polynomial approximation.
QL belongs to a family of reinforcement learning methods which explore an optimal strategy for a given problem. This model-free algorithm was introduced by Watkins [
QL defines an agent which is responsible for the selection of an action
The QL approach also uses a memory-stored array called the Q-table, whose size is defined by the number of states
The learning strategy is also influenced by a constant ɛ (the epsilon-greedy policy), which causes the selection of a random action instead of the maximal-reward action. ɛ is selected from the interval 0 to 1 (e.g., ɛ = 0.95 means that 5% of actions are selected randomly) [
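The experiments in this paper were implemented in MATLAB; purely as an illustrative sketch, the ɛ-greedy selection described above can be written in Python/NumPy as follows (the Q-table layout and the function name are assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(Q, state, epsilon=0.95):
    """Epsilon-greedy selection: with probability epsilon the maximal-reward
    action is exploited; otherwise a random action is explored (epsilon = 0.95
    therefore corresponds to 5% random actions, as in the text)."""
    if rng.random() < epsilon:
        return int(np.argmax(Q[state]))   # exploit: action with maximal Q-value
    return int(rng.integers(Q.shape[1]))  # explore: uniformly random action
```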
The polynomial approximation interpolates values with a polynomial. A polynomial of degree $n$ is a function of the form:

$$p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$
An approximation is an inexact representation of a function. In this paper, the polynomial coefficients are calculated using the least-squares approximation method, which minimizes the sum of the squared deviations (see
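Formally, denoting the $m$ data points by $(x_i, y_i)$, the least-squares method selects the coefficients $a_0, \ldots, a_n$ of $p(x)$ so that the sum of squared deviations is minimal:

$$S(a_0, \ldots, a_n) = \sum_{i=1}^{m} \left( y_i - \sum_{k=0}^{n} a_k x_i^{k} \right)^{2} \rightarrow \min$$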
The experiment uses a dataset from an environmental data collection station. The data include values of incoming solar energy, used as simulated sensor input. The solar energy values were collected continuously for five years at the Fairview Agricultural Drought Monitoring (AGDM) station located in Alberta, Canada [
The aim of the experiment is to evaluate whether the DDSL controller is capable of finding an optimal strategy for dynamic configuration of the data collection period. A conventional QL algorithm was modified for application in wake-up embedded devices. The experiment was performed in MATLAB, and the complete solution is simple to implement on mobile monitoring devices.
The proposed DDSL controller dynamically sets the operation period according to the accuracy of the estimation of the collected data, thereby adjusting the operation duty cycle. The DDSL controller follows the RL model shown in
Action
Based on the current state and performed action, partial rewards (the state reward (
The index_of() function returns a one-based index of an element in the state vector (a higher index represents higher estimation accuracy). The
In this case, the index_of() function provides a higher value for a longer duty cycle. The total reward
The QL process is affected by a total reward
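The exact reward constants are given by the equations above; the following fragment is only a hedged sketch of how the two partial rewards could be combined. The weights w_s and w_a and the function name are illustrative placeholders, not the paper's values:

```python
def total_reward(state_index, action_index, w_s=1.0, w_a=1.0):
    """Hypothetical combination of the partial rewards: a higher state index
    means higher estimation accuracy, a higher action index means a longer
    duty cycle. The weights w_s and w_a are illustrative placeholders only."""
    return w_s * state_index + w_a * action_index
```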
The DDSL approach discounts the learning rate over time to achieve stability in the learning process. The discounting progress of the parameter
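The concrete discounting schedule is defined by the equation above; a common scheme, assumed here only for illustration, is multiplicative decay of the learning rate towards a lower bound:

```python
def discounted_learning_rate(alpha, decay=0.999, alpha_min=0.01):
    """Reduce the learning rate by a constant factor after each learning
    step, bounded from below so that learning never stops completely
    (decay and alpha_min are assumed values, not taken from the paper)."""
    return max(alpha * decay, alpha_min)
```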
The conventional QL algorithm presented in the literature [
The conventional Q-learning algorithm described in [
1: Initialize Q(s, a) arbitrarily
2: Repeat (for each step):
3: Choose action a from state s using the policy derived from Q (e.g., ε-greedy)
4: Take action a, observe reward r and new state s′
5: Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]
6: s ← s′
7: Until terminated
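Line 5 of this listing is the classic Watkins update. As an illustrative Python/NumPy sketch (the paper's implementation is in MATLAB):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.9):
    """One Watkins Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])  # best value reachable from s'
    Q[s, a] += alpha * (td_target - Q[s, a])   # move Q(s,a) towards the target
    return Q
```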
The modified Q-learning algorithm is composed of the following steps:
1: Initialize Q(s, a) arbitrarily
2: Repeat (for each wake-up cycle):
3: Wake up
4: Observe the current state s_t
5: Calculate reward r_t based on the previous state s_(t−1) and action a_(t−1)
6: Q(s_(t−1), a_(t−1)) ← Q(s_(t−1), a_(t−1)) + α [r_t + γ max_a Q(s_t, a) − Q(s_(t−1), a_(t−1))]
7: Choose action a_t from state s_t using the policy derived from Q (e.g., ε-greedy)
8: Start action a_t
9: Sleep
10: (time passes according to the duty cycle selected by a_t)
11: s_(t−1) ← s_t; a_(t−1) ← a_t
12: Until stopped
In the original QL algorithm, the action step is performed inside the QL loop, but from the monitoring device's point of view, the action itself is a period of standby or sleep mode. In the modified scenario, the algorithm therefore performs the action at a different stage than the original approach. The learning step is completed based on the past state and the current state, because the future action is not yet known.
In the conventional QL algorithm, an action is first selected according to the QL policy and the environment state. The action is performed, and a reward based on the previous state and the current action is calculated. In the next step, the Q-table is updated by the learning process and a new state is observed.
The modified QL algorithm is controlled by the following update equation:

$$Q(s_{t-1}, a_{t-1}) \leftarrow Q(s_{t-1}, a_{t-1}) + \alpha \left[ r_t + \gamma \max_{a} Q(s_t, a) - Q(s_{t-1}, a_{t-1}) \right]$$
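Combining this update rule with the listing above, the modified loop can be sketched as follows. Here observe_state, reward_fn, sleep_for and duty_cycles are hypothetical placeholders for device-specific code, and the learning-rate discounting is omitted for brevity:

```python
import numpy as np

def ddsl_loop(Q, duty_cycles, observe_state, reward_fn, sleep_for,
              alpha=0.2, gamma=0.9, epsilon=0.95):
    """Sketch of the modified QL loop: the Q-table update refers to the
    previous state-action pair, because the effect of an action (a sleep
    period) only becomes observable after the next wake-up."""
    rng = np.random.default_rng()
    s_prev, a_prev = observe_state(), 0            # bootstrap the first cycle
    while True:
        s = observe_state()                        # wake up, observe state
        r = reward_fn(s_prev, a_prev, s)           # reward for the past pair
        Q[s_prev, a_prev] += alpha * (r + gamma * np.max(Q[s])
                                      - Q[s_prev, a_prev])
        if rng.random() < epsilon:                 # epsilon-greedy choice
            a = int(np.argmax(Q[s]))
        else:
            a = int(rng.integers(len(duty_cycles)))
        sleep_for(duty_cycles[a])                  # the action is the sleep time
        s_prev, a_prev = s, a
```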
The polynomial approximation method is used to evaluate the next value
In the next step, MATLAB's polyval function is used to compute the prediction, which is protected against negative values. The polyval function evaluates the polynomial
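NumPy provides equivalents of MATLAB's polyfit and polyval with the same semantics, so the prediction step can be sketched as follows; clamping negative predictions to zero reflects our reading of the protection against negative values, since incoming solar energy cannot be negative:

```python
import numpy as np

def predict_next(t, y, t_next, degree=2):
    """Fit a least-squares polynomial to the recent samples (t, y) and
    evaluate it at the next sampling time t_next."""
    coeffs = np.polyfit(t, y, degree)    # least-squares coefficients
    y_hat = np.polyval(coeffs, t_next)   # evaluate the polynomial
    return max(y_hat, 0.0)               # guard against negative predictions
```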
To evaluate the DDSL controller approach, a reference algorithm based on linear interpolation is used. The original collected data have a 5-min collection interval. Therefore, the reference solution is based on the original data set, from which only 10-, 15-, 20-, 25- and 30-min intervals are extracted. To fill in the missing data between the extracted samples, linear interpolation was used.
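A minimal sketch of this reference construction, assuming NumPy arrays t5 and y5 holding the 5-min timestamps and values (variable names are hypothetical):

```python
import numpy as np

def reference_series(t5, y5, step):
    """Subsample the 5-min series by `step` (e.g., step = 6 gives a 30-min
    interval) and linearly interpolate back onto the original 5-min grid."""
    return np.interp(t5, t5[::step], y5[::step])
```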
To compare the accuracy of prediction between the individual settings of the expert constants and the reference solution, the root mean squared error (RMSE) was calculated by the following equation:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ are the measured values and $\hat{y}_i$ the estimated values.
The Number of Measurements (NoM) is the second evaluation parameter; it counts the performed operation periods. The algorithm policy is principally designed to minimize the NoM (
The performance score (PS) is then the overall evaluation parameter, which considers both above-mentioned parameters (RMSE and the NoM) and is calculated according to the following equation:
Generally, the overall evaluation considers two parameters (RMSE and NoM). These parameters oppose each other, and a trade-off between RMSE and NoM must be considered. To evaluate the DDSL approach, the Cartesian distance to the origin is used.
The RMSE and NoM parameters are normalized with respect to the worst case, i.e., the RMSE of the 30-min reference algorithm and the NoM of the 5-min reference algorithm:

$$\mathrm{RMSE}_{\mathrm{norm}} = \frac{\mathrm{RMSE}}{\mathrm{RMSE}_{30\,\mathrm{min}}}, \qquad \mathrm{NoM}_{\mathrm{norm}} = \frac{\mathrm{NoM}}{\mathrm{NoM}_{5\,\mathrm{min}}}$$
An overall Cartesian evaluation parameter $D$ is then the distance of the normalized point to the origin:

$$D = \sqrt{\mathrm{RMSE}_{\mathrm{norm}}^{2} + \mathrm{NoM}_{\mathrm{norm}}^{2}}$$
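Putting both criteria together, the normalized distance can be computed as in the following sketch, where rmse_30min and nom_5min are the worst-case reference values defined above:

```python
import numpy as np

def normalized_distance(rmse, nom, rmse_30min, nom_5min):
    """Cartesian distance to the origin in the normalized (RMSE, NoM)
    plane; a lower value indicates a better trade-off."""
    return float(np.hypot(rmse / rmse_30min, nom / nom_5min))
```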
This section provides the results of a comprehensive set of experiments which verify the designed controller with various QL parameter settings and degrees of the polynomial. Each experiment configuration was repeated ten times to suppress the random effect of the epsilon-greedy policy. Experiments were performed with the following settings: γ = {0, 0.1, 0.2, …, 1},
| Setting | α | γ | RMSE | NoM | PS |
|---|---|---|---|---|---|
| Reference algorithm | | | | | |
| 5 min | - | - | 0.00 | 507,745 | 96 |
| 10 min | - | - | 26.72 | 253,873 | 87 |
| 15 min | - | - | 35.92 | 169,249 | 77 |
| 20 min | - | - | 41.54 | 126,937 | 58 |
| 25 min | - | - | 45.99 | 101,549 | 29 |
| 30 min | - | - | 48.90 | 84,625 | 0 |
| Best PS settings of the DDSL algorithm | | | | | |
| - | 0.2 | 1.0 | 23.15 | 184,962 | 139 |
| - | 0.3 | 1.0 | 24.48 | 173,773 | 141 |
| - | 0.1 | 0.3 | 24.74 | 180,487 | 134 |
| - | 0.2 | 0.9 | 21.89 | 197,751 | 137 |
| - | 0.1 | 0.7 | 25.13 | 201,041 | 118 |
Top ten DDSL settings for the degree of the polynomial 1 (left half) and the degree of the polynomial 2 (right half); D denotes the normalized Cartesian distance:

| α | γ | RMSE | NoM | PS | D | α | γ | RMSE | NoM | PS | D |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2 | 1.0 | 23.15 | 184,962 | 139 | 0.597 | 0.3 | 1.0 | 24.48 | 173,773 | 141 | 0.606 |
| 0.1 | 0.6 | 25.76 | 171,190 | 135 | 0.625 | 0.2 | 1.0 | 20.24 | 207,092 | 138 | 0.581 |
| 0.2 | 0.9 | 24.90 | 178,212 | 135 | 0.618 | 0.2 | 0.3 | 29.81 | 140,577 | 136 | 0.670 |
| 0.3 | 1.0 | 28.26 | 154,979 | 133 | 0.654 | 0.1 | 0.1 | 27.69 | 157,967 | 134 | 0.646 |
| 0.1 | 0.3 | 27.99 | 158,891 | 132 | 0.652 | 0.1 | 0.4 | 26.36 | 170,567 | 132 | 0.635 |
| 0.3 | 0.9 | 29.91 | 148,714 | 128 | 0.678 | 0.2 | 0.0 | 31.72 | 132,334 | 130 | 0.699 |
| 0.3 | 0.8 | 31.74 | 135,589 | 127 | 0.702 | 0.1 | 0.0 | 29.07 | 153,865 | 129 | 0.667 |
| 0.1 | 0.8 | 25.70 | 184,075 | 126 | 0.638 | 0.1 | 0.6 | 25.97 | 178,273 | 129 | 0.637 |
| 0.2 | 0.8 | 29.87 | 151,770 | 125 | 0.680 | 0.1 | 0.3 | 28.08 | 162,190 | 128 | 0.657 |
| 0.1 | 0.7 | 26.96 | 175,725 | 125 | 0.651 | 0.2 | 0.4 | 31.24 | 137,799 | 128 | 0.694 |
The results raise several interesting areas for discussion. The first concerns the correct selection of the degree of the polynomial. The presented experiment used degrees of the polynomial from 1 to 5. On the input solar irradiance data, the DDSL approach performed best with the degree of the polynomial set to 2. The degree of the polynomial 1 also provided better performance than 3, 4 and 5 in this case. It must be highlighted that selection of the appropriate degree of the polynomial is directly linked to the type of data collected from the sensors. In our case, the best performing results were achieved by linear or quadratic approximation, represented by the degrees of the polynomial 1 and 2. For a different dataset, correct selection could lead to a higher degree of the polynomial. Regarding this key feature of the DDSL approach, exploratory studies of suitable degrees of the polynomial should be performed before mobile monitoring IoT devices are deployed in target application areas. Without custom adjustment of the degree of the polynomial to the character of the collected data, the capability of the self-learning approach is limited.
The configuration of the Q-learning parameters is the second area to discuss. Deployment of mobile monitoring devices should consider proper selection of the learning rate, the discount factor, and the epsilon-greedy policy. The article's results showed that the initial learning rate should be set conservatively, from 0.1 to 0.3. The DDSL controller then accepts new information slowly and preserves the knowledge already stored in the Q-table. In terms of the discount factor, however, there is no conclusive result. With the degree of the polynomial 1, the experiment showed that the best results are achieved with high cumulative discount factors (0.8–1). However, the results for the degree of the polynomial 2 showed that an instant-reward policy with a low discount factor (<0.4) can also lead to the best performing solutions. Therefore, the discount factor setting is not simply a function of the input dataset but is strongly connected to the degree of the polynomial. The epsilon-greedy policy is set to 5% of random actions, as is standard in such applications, but the question is whether this leads to the best performance in long-term deployments where the learning rate is significantly reduced by the learning discount coefficient. This idea should be evaluated by long-term field testing or extensive simulations on an extended dataset. For now, the study does not provide a general answer for setting the initial epsilon-greedy and discount policies.
The final discussion topic concerns the evaluation policy of the presented solution. Two basic approaches were designed: one uses a linear ratio between the RMSE and NoM, and the second is calculated as the geometric distance in the normalized Cartesian space. Both evaluation methodologies follow the same aim, which is to determine an evaluation coefficient that targets the trade-off between low RMSE and low NoM. Both methodologies provide similar results in an opposing manner, one maximizing the linear ratio and the other minimizing the normalized distance. In other implementation scenarios, the evaluation strategy may vary according to the specific optimization target.
| Author, source | Algorithm | Description | Advantages and limitations |
|---|---|---|---|
| Lork et al. [ | QL | Data-driven energy consumption control | – Large data pool required |
| Radac et al. [ | QL | Data-driven position control | + Superior control performance |
| Duan et al. [ | Deep QL | Data-driven voltage control | + Promising performance |
| Proposed DDSL controller | Modified QL | Data-driven operation duty cycle control | + Energy efficient |
The article proposed a modified QL-based algorithm which controls the operational cycle according to the acquired data. The general principle lies in observing the parameters of interest when the sensor data carry high information value. This leads to the minimization of operational cycles when the data change according to a predictable trend. The solution offers a unique paradigm in contrast to the classic scenario, in which an embedded device obtains data and then decides whether the data contain information which should be stored and transmitted to a cloud. The presented DDSL method principally avoids redundant data acquisition, which leads to more energy-efficient operation.
The proposed DDSL algorithm provides better results than the reference algorithm, which operates with a constant measurement period. The novel approach described in this paper achieved an approximately 40% higher PS than the reference algorithm. This means that the novel algorithm reached a lower RMSE at the same NoM as the reference algorithm, or a lower NoM at the same RMSE.
The presented solution opens several research opportunities. The first challenge is the application of the proposed method to other data domains. The next research challenge might be a modification of the learning model; it is also possible to use statistical parameters as a reward policy in place of the polynomial function. In this article, the authors examined the general principle of the DDSL approach, which performs well on the presented mobile monitoring embedded devices; however, future modifications of the DDSL approach could lead to more effective domain-customized solutions.