Yifei Xiao, Shijie Zhou*
School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, 610054, China
* Corresponding Author: Shijie Zhou. Email:
(This article belongs to the Special Issue: Internet of Things in Healthcare and Health: Security and Privacy)
Computer Modeling in Engineering & Sciences 2023, 135(1), 169-185. https://doi.org/10.32604/cmes.2022.021795
Received 05 February 2022; Accepted 19 May 2022; Issue published 29 September 2022
With the growth of the ageing population and the related rise in chronic illness (e.g., diabetes or Parkinson's disease), the Internet of Things (IoT) has been widely identified as a potential solution to alleviate the pressure on healthcare systems. For instance, Health Care Assistants (HCA) (e.g., Remote Patient Monitoring) generate a huge amount of data (called "health data" for brevity) in real time using IoT medical sensors and ambient sensors (Fig. 1). These massive amounts of health data are usually stored in Cloud Storage Systems (CSS), enabling different analytical techniques to extract medical knowledge, such as detecting patients' health status, innovating methods for diagnosing different diseases, and determining how to treat them. For example, medical images are often used to help healthcare providers predict diseases and make accurate clinical decisions. However, with the explosive growth of health data in CSS, conventional storage techniques face many needs and challenges, such as
1. The need to develop infrastructures that are capable of processing data in parallel.
3. The need to provide a fault-tolerant mechanism with high availability.
In this paper, we focus on the third one: providing a fault-tolerant mechanism with high availability.
Erasure codes (EC) are a leading technology to achieve strong fault tolerance in CSS. Roughly speaking, as all files in CSS are usually split into fixed-size data blocks, EC encode these data blocks to generate a small number of redundant blocks (also called parity blocks), such that a subset of the data and parity blocks still suffices to recover the original data blocks. Compared to conventional replication (e.g., 3-replication), EC maintain the same degree of fault tolerance with much less storage overhead and hence are preferable in practical storage systems. For example, the erasure-coded Quantcast File System saves 50% of storage space over the original HDFS, which uses 3-replication. Moreover, EC have been widely used in CSS, such as Microsoft Azure, Google Cloud, Facebook clusters and Alibaba Cloud.
However, EC bring two new problems, namely data repair (DR) and data update (DU). In DU, since each parity block is a linear combination of multiple data blocks, once a data block is updated, the relevant parity blocks must also be updated to maintain data consistency. Otherwise, node failures may cause permanent data loss (especially for precious health data). Obviously, the health data in CSS is "hot data", which means it is frequently generated or updated by various IoT devices. Thus, DU causes considerable network traffic, especially cross-rack traffic, which is often oversubscribed and much scarcer than inner-rack bandwidth. To provide a fault-tolerant mechanism with high availability, it is necessary to design an efficient and reliable DU scheme that addresses the data transmission problem in DU, especially cross-rack data transmission.
To alleviate the impact of network traffic, many works concentrate on the network tier, as shown in Fig. 2. We re-examine and group them into two classes: ① improving bandwidth utilization (e.g., PUM-P, PDN-P, and T-Update) and ② reducing network traffic (e.g., XORInc and CAU). Specifically, to improve bandwidth utilization, PUM-P used a dedicated node called Update Manager (UM) to collect the update info and the old parity values of the relevant parity nodes for DU. T-Update found that the traditional data transmission path is a star structure, which hinders full use of the network bandwidth; worse, it easily causes a single-point bottleneck. Hence, T-Update changed the transmission path to a tree structure, which spreads network traffic over otherwise unused links and increases network parallelism. To reduce network traffic, XORInc offloads computation operations onto programmable network devices (i.e., modern switches with XOR computation capability and sufficient buffers), which can help data nodes forward the delta info to the relevant parity nodes. To mitigate cross-rack traffic, Shen et al. proposed CAU, which groups storage nodes into racks and offers two optional update methods (data-delta-based update and parity-delta-based update) based on batch update and relay. However, despite the fruitful achievements of these works, we found there is still massive room for network optimization, especially for XOR-based DU.
By carefully summarizing the previous works, we identified four valuable techniques for network optimization: delta transmission, XOR, relay, and batch update. Delta transmission means that we only transmit the delta info, since the DU size is generally smaller than the whole block size. XOR means our scheme is XOR-based, as XOR-based DU yields better throughput than RS-based DU. Relay means we exploit relay nodes to forward data, which fully uses otherwise unused links to mitigate the update traffic. In a word, we propose a simple and efficient mechanism, Delta-XOR-Relay Data Update (DXR-DU), by using these techniques jointly. To summarize, our work makes the following contributions:
• We summarized previous works on network optimization and identified four valuable techniques: delta transmission, relay, XOR, and batch update.
• Based on the four techniques, we proposed a novel data update scheme called DXR-DU, which can significantly improve throughput for DU. In other words, it can help CSS to build a fault-tolerant mechanism with high data availability.
• We implemented the DXR-DU prototype in the Go programming language and showed analytically that it achieves the optimal cross-rack data update.
• We conducted extensive experiments on a local testbed based on our prototype.1 The experiments show that DXR-DU can significantly reduce the cross-rack traffic and improve the update throughput.
It is well known that modern data centers deploy thousands of storage nodes in one or multiple geographic regions to provide large-scale storage services. These storage nodes are grouped into racks and further interconnected via the network core, an abstraction of aggregation switches and core routers. Fig. 3 shows a typical CSS with three racks, where each rack comprises four nodes.
A leading technique to achieve strong fault tolerance in CSS is to utilize EC. As stated above, EC use the original data to generate additional encoded data, and thus tolerate a fixed number of component failures in the overall system. EC are usually configured by two parameters: the number of data symbols k to be encoded, and the number of coded symbols n to be produced. In computer systems, the data symbols and the coded symbols are usually assumed to lie in a finite field.
RS codes are a well-known erasure code construction and have been widely deployed in production [24–26]. RS codes are usually referred to as RS(n, k). For instance, Fig. 3 depicts a typical CSS with RS(5, 3), which encodes k = 3 data blocks into n − k = 2 parity blocks. These blocks form a stripe, scattered across different nodes.
It is known that EC can be divided into two classes: RS-based codes and XOR-based codes . Accordingly, we can classify DU into two types: RS-based DU and XOR-based DU.
Fig. 4 shows the typical encoding process of RS(5, 3), where the leftmost matrix (called the generator matrix) encodes the data blocks $(D_1, D_2, D_3)$ into a codeword $(D_1, D_2, D_3, P_1, P_2)$. After encoding, the data blocks $(D_1, D_2, D_3)$ are sent to the corresponding data nodes and the parity blocks $(P_1, P_2)$ are sent to the corresponding parity nodes. From Fig. 4 we can infer that, in an RS-based CSS, each parity block can be represented as a linear combination of the k data blocks with the following equation:

$$P_i = \sum_{j=1}^{k} \alpha_{i,j} D_j \tag{1}$$

where $\alpha_{i,j}$ denotes the encoding coefficient and all elements are numbers in $GF(2^w)$ for some value of $w$. Suppose that $D_j$ is updated to $D_j'$; then Eq. (1) can be called for DU.
RS-delta-based: On the other hand, we can simply utilize the delta info ($\Delta D_j = D_j' - D_j$) to renew the parity block with the following equation:

$$P_i' = P_i + \alpha_{i,j} \Delta D_j \tag{2}$$

where $P_i$ denotes the old value. In this way, we can simply transfer the delta of $D_j$ (also called $\Delta D_j$) to parity node $i$.
No matter whether Eq. (1) or Eq. (2) is selected for DU, a considerable number of multiplications are generated, which significantly impedes the performance of DU. To address this, as shown in Fig. 5, XOR-based encoding was proposed via the Binary Distribution Matrix (BDM), where each element $e$ in $GF(2^w)$ can be denoted by a $w \times w$ binary matrix $M(e)$ or a binary vector $V(e)$ of length $w$; thus, the generator matrix of size $n \times k$ can be converted to a new generator matrix of size $wn \times wk$ in $GF(2)$. In this light, we can encode with the smaller elements (single bits). According to Fig. 5, the parity blocks can be computed by per-bit equations of the form

$$p_i = \bigoplus_{j:\, B_{i,j} = 1} d_j$$
where the matrix multiplications are now converted to XORs of the data bits corresponding to the ones in the BDM $B$. Zhou et al. proved that it is more efficient to encode with XOR operations than with direct RS-based encoding. In other words, XOR-based DU can significantly reduce the computation overhead compared with RS-based DU.
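To make the XOR-based delta path concrete, the following minimal Go sketch (the block contents and variable names are hypothetical, not our prototype code) shows the two steps: the data node computes the delta as old XOR new, and the parity node renews its block by XORing the delta in.

```go
// xordelta.go: a minimal sketch of XOR-based delta update.
package main

import "fmt"

// xorInto XORs src into dst in place; both slices must have equal length.
func xorInto(dst, src []byte) {
	for i := range dst {
		dst[i] ^= src[i]
	}
}

func main() {
	oldData := []byte{0x0f, 0xa0} // hypothetical old block contents
	newData := []byte{0x1f, 0xa1} // hypothetical new block contents
	parity := []byte{0x33, 0x44}  // old parity covering this data block

	// Data node: delta = old XOR new; only the delta crosses the network.
	delta := append([]byte(nil), oldData...)
	xorInto(delta, newData)

	// Parity node: newParity = oldParity XOR delta.
	xorInto(parity, delta)
	fmt.Printf("delta=%x newParity=%x\n", delta, parity)
}
```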
As mentioned earlier, nodes are grouped into racks. We assume data racks are dedicated to data nodes and parity racks are dedicated to parity nodes. Without loss of generality, suppose that there are $m$ blocks denoted by $D_{j_1}, \ldots, D_{j_m}$ that are updated to $D_{j_1}', \ldots, D_{j_m}'$ in data rack $R_d$. Based on Eq. (2), we can calculate the new parity block $P_i'$ with the following equation:

$$P_i' = P_i + \Delta P_i, \qquad \Delta P_i = \sum_{t=1}^{m} \alpha_{i,j_t} \Delta D_{j_t} \tag{9}$$

where $\Delta P_i$ is the delta of parity block $i$. Suppose that parity block $i$ ($P_i$) is located in parity rack $R_p$; let $n_d$ denote the number of updated data blocks in $R_d$ and $n_p$ the number of parity blocks to update in $R_p$. As illustrated in Fig. 6, there are two options to renew the parity blocks: ① data-delta-based update, and ② parity-delta-based update.
Data-delta-based update: this updates the parity blocks of a rack in batch by transmitting the data delta blocks directly. As shown in Fig. 6a, the number of data updates in $R_d$ is less than the number of parity updates in $R_p$ (i.e., $n_d < n_p$). Thus, we separately send the delta info ($\Delta D_{j_t}$) to the relay node (a selected parity node); when it receives all the deltas, it calculates and forwards the new values for the other parity nodes via Eq. (9).
Parity-delta-based update: as shown in Fig. 6b, when $n_d \geq n_p$, to mitigate the cross-rack traffic, parity-delta-based update is selected, where we select a data node as the relay node to collect the deltas in the same rack. Similarly, the relay node is responsible for regenerating the parity deltas via Eq. (9) and transferring them to the relevant parity nodes.
In this section, we elaborate on the design of Delta-XOR-Relay Data Update (DXR-DU).
Our study of previous works identified four valuable techniques for network optimization: delta transmission, XOR, relay, and batch update.
Recall the two existing classes of network optimization: ① improving bandwidth utilization, and ② reducing network traffic. We found that the key technique to improve bandwidth utilization is using a relay. For example, PUM-P used a dedicated node called Update Manager (UM) as a relay node to compute the deltas of the relevant parity blocks, while PDN-P discarded it. CAU selected a data node or a parity node as a relay node, and RackCU selected a data rack or a parity rack as a relay rack. This resembles the triangle inequality: if the sum (network overhead) of the two sides (using a relay node) is greater than the third side (sending data directly), it is unnecessary to use the relay; otherwise, we should use the relay to fully exploit the unused links. Besides, the relay can be used for updating one block (e.g., T-Update) or a group of blocks (e.g., CAU and RackCU), and the latter must consider node grouping. For example, CAU groups nodes into racks and selects a relay node for each rack.
To reduce network traffic, we found two key factors: delta transmission and batch update. The block size in CSS normally ranges from 1 MB to 64 MB [16,24], but it is unnecessary to update the whole block, since DU is small (60% of updates are less than 4 KB). Thus, the better way is to transfer only the delta of the updated block. Another key point is batch update. For instance, CAU proved that batch update is powerful for saving network traffic by setting the threshold at 100 (i.e., flushing when 100 data requests arrive). However, batch update has the disadvantage of slightly sacrificing system reliability. Fortunately, we can utilize interim replication to maintain system reliability and data availability at the same level as the baseline EC approach.
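As an illustration of how such a threshold-based batch might be wired up, here is a minimal Go sketch (the types and threshold handling are our own simplification, not the CAU implementation):

```go
// batcher.go: a minimal sketch of threshold-based batch update.
package dxrdu

// UpdateReq describes one incoming write to an erasure-coded block.
type UpdateReq struct {
	BlockID int
	Offset  int
	Data    []byte
}

// Batcher buffers update requests and flushes them in one batch.
type Batcher struct {
	Threshold int               // e.g., 100 requests, as in CAU
	pending   []UpdateReq
	Flush     func([]UpdateReq) // hands the batch to the DU scheduler
}

// Add buffers a request; once the threshold is reached, the whole
// batch is flushed to the update path at once.
func (b *Batcher) Add(r UpdateReq) {
	b.pending = append(b.pending, r)
	if len(b.pending) >= b.Threshold {
		b.Flush(b.pending)
		b.pending = nil
	}
}
```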
As mentioned above, the fourth valuable technique is XOR. The experimental results in Section 4 reinforce our decision to use XOR. In the next section, we discuss how to use delta transmission, relay and batch update jointly on top of XOR-based DU.
As far as we know, the transmission path is either a star structure (e.g., the baseline method) or a tree structure (e.g., T-Update, CAU, XORInc and RackCU). As mentioned earlier, the conventional star-structured path easily causes a single-point bottleneck or even a single-point failure; the tree-structured path is clearly better. To build a tree-structured path, T-Update relies on the network distance (i.e., the hops) between nodes, while CAU groups nodes into racks and selects a relay node for each rack. Comparatively, we believe CAU's approach is simpler and easier to implement. Besides, T-Update builds a tree only for one block, while CAU builds a tree for a group of related blocks. For example, in parity-delta-based update (Fig. 6b), CAU collects the deltas within a rack and directly transfers the merged parity delta ($\Delta P_i$) to the related parity nodes. We argue that transferring the parity deltas is better than transferring the data deltas one by one. Therefore, we build the transmission path based on CAU.
It has long been recognized that transferring a data block in delta style saves substantially more network load than transferring the whole data block. However, few works actually transfer the delta in their implementations. Although it is just an implementation issue, we show in our evaluation that it is a significant performance differentiator. Thus, in this section, we elaborate on our way to transfer the delta.
Block merging for batch update: It is well known that a dirty (updated) data block may be modified in different places within a batch interval (as shown in Fig. 7), which means the delta info (the gray parts) is scattered. To handle this, we employ a very straightforward method (sketched below): we label the leftmost offset as rangeL and the rightmost offset as rangeR, and only transfer the [rangeL, rangeR] portion of the whole block. As mentioned earlier, DU is small (most updates are less than 4 KB); even though we pack these small and scattered updates into a larger piece, the optimization space is still huge.
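A minimal Go sketch of this bookkeeping follows (names are hypothetical): each dirty block tracks only its leftmost and rightmost modified offsets, and only that slice of the block is transferred.

```go
// merge.go: a minimal sketch of block merging via [rangeL, rangeR].
package dxrdu

// DirtyBlock tracks the dirty range of an updated block within a batch.
type DirtyBlock struct {
	rangeL, rangeR int
	dirty          bool
}

// Touch widens the dirty range to cover a new write at [off, off+size).
func (d *DirtyBlock) Touch(off, size int) {
	if !d.dirty || off < d.rangeL {
		d.rangeL = off
	}
	if !d.dirty || off+size > d.rangeR {
		d.rangeR = off + size
	}
	d.dirty = true
}

// Payload returns only the [rangeL, rangeR) slice of the block,
// which is all that needs to cross the network.
func (d *DirtyBlock) Payload(block []byte) []byte {
	if !d.dirty {
		return nil
	}
	return block[d.rangeL:d.rangeR]
}
```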
Delta Alignment: As mentioned above, we label the delta of an updated block as [rangeL, rangeR], but the ranges of distinct data blocks within a stripe are probably different, which prevents us from calculating the parity blocks directly. Therefore, before renewing the parity block at a relay node, we have to perform delta alignment.
A typical example is depicted in Fig. 8, where four blocks within a stripe are updated with distinct ranges. To compute the new value of the parity block, we use a dedicated node called the central controller to align these deltas based on the maximum range. Thus, the relay node receives three deltas with identical ranges and can easily renew the parity block.
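A minimal Go sketch of the alignment step (names hypothetical): every delta is zero-padded to the stripe-wide maximum range, which is safe because zero bytes are neutral under XOR.

```go
// align.go: a minimal sketch of delta alignment to the maximum range.
package dxrdu

// Delta carries the dirty slice of one updated block within a stripe.
type Delta struct {
	rangeL, rangeR int
	data           []byte // len(data) == rangeR - rangeL
}

// Align pads every delta with zeros so that all deltas share the
// stripe-wide maximum range [minL, maxR); the relay node can then
// XOR them position by position.
func Align(deltas []Delta) []Delta {
	if len(deltas) == 0 {
		return nil
	}
	minL, maxR := deltas[0].rangeL, deltas[0].rangeR
	for _, d := range deltas[1:] {
		if d.rangeL < minL {
			minL = d.rangeL
		}
		if d.rangeR > maxR {
			maxR = d.rangeR
		}
	}
	out := make([]Delta, len(deltas))
	for i, d := range deltas {
		buf := make([]byte, maxR-minL)
		copy(buf[d.rangeL-minL:], d.data)
		out[i] = Delta{rangeL: minL, rangeR: maxR, data: buf}
	}
	return out
}
```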
As mentioned above, we design DXR-DU based on the four valuable techniques: delta transmission, XOR, batch update and relay. As shown in Fig. 9, we first build the transmission path based on CAU, which offers two selective methods (data-delta-based update and parity-delta-based update). When the number of updated data blocks in the data rack is smaller than the number of updated parity nodes in the parity rack (i.e., $n_d < n_p$), as shown in Fig. 9a, data-delta-based update is selected, which means we choose a parity node as the relay node to collect the deltas and compute the new values for the parity nodes. Otherwise, parity-delta-based update is selected, where we choose a data node as the relay node to collect the deltas and compute the parity deltas for the parity nodes.
Based on CAU, we add two extra techniques: ① we choose XOR-based DU to improve the update throughput in the coding tier, unlike CAU, which relies on RS-based DU; ② we employ delta transmission to mitigate the network traffic, especially the cross-rack traffic.
Algorithm details: Algorithm 1 elaborates the main procedure to schedule the update requests in a batch (a sketch is given below). We first collect the data blocks (from user requests) in a batch and perform block merging (Line 1). Then, we group these blocks into stripes (Line 2). For each stripe, according to the number of updated data blocks in a data rack, we handle the data blocks of the stripe in this rack: 1) as mentioned above, if $n_d \geq n_p$, we use parity-xor-based update (Line 6); 2) otherwise, we use data-xor-based update (Line 7).
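The following Go sketch mirrors the control flow of Algorithm 1 (the types and helper functions are hypothetical placeholders, not our exact implementation):

```go
// schedule.go: a minimal sketch of the batch scheduling in Algorithm 1.
package dxrdu

// Stripe maps each data rack ID to its updated data blocks.
type Stripe struct {
	updatedByRack map[int][]int
}

// parityToUpdate returns how many parity nodes must be renewed for the
// given updated blocks (derived from the parity equations); stubbed here.
func parityToUpdate(blocks []int) int { return len(blocks) }

// ScheduleBatch picks the update method per (stripe, data rack) pair.
func ScheduleBatch(stripes []Stripe) {
	for _, s := range stripes {
		for rack, blocks := range s.updatedByRack {
			nd, np := len(blocks), parityToUpdate(blocks)
			if nd >= np {
				parityXORUpdate(rack, blocks) // relay inside the data rack
			} else {
				dataXORUpdate(rack, blocks) // relay inside the parity rack
			}
		}
	}
}

func parityXORUpdate(rack int, blocks []int) { /* collect deltas, send parity deltas */ }
func dataXORUpdate(rack int, blocks []int)   { /* forward data deltas to parity relay */ }
```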
Similar to the RS-based DU analysis, in this section we discuss the optimal cross-rack parity update in XOR-based DU. For ease of presentation, we take the example of a CSS with several data nodes and parity nodes, and only show the blocks in one stripe (as shown in Fig. 10). Besides, we suppose each parity block is the XOR of a subset of the data blocks, as given by the parity update equations Eqs. (10)–(14).
As we focus on network optimization, it is unnecessary to know the exact parity equations; we just need to make sure that every parity node receives what it needs, so we label $N_i$ as all the data that parity node $i$ needs. The key question is: how do we perform the update so as to minimize the cross-rack network load? For example, as shown in Fig. 10a, if several data blocks are changed in a batch, then according to Eqs. (10)–(14), we need to update all the parity blocks.
Consider first the case where the number of updated data blocks in the data rack is smaller than the number of updated parity nodes in the parity rack ($n_d < n_p$). Unlike the data-delta-based update in CAU, where at most one updated block in a stripe belongs to a node, here we must consider multiple updated blocks in one data node. To save cross-rack traffic, one might consider transferring the XOR result of multiple updated blocks. However, we found that we cannot do so, because 1) each delta may have a different update range, and 2) if we transmit the XOR result ($\Delta B_1 \oplus \Delta B_2$) of two updated blocks $B_1$ and $B_2$ to a parity node that only needs $\Delta B_1$, the parity node cannot extract $\Delta B_1$ from the XOR result without also receiving $\Delta B_2$. Therefore, we can only transfer the deltas one by one.
In data-delta-based update, where the number of updated data blocks ($n_d$) is smaller than the number of updated nodes in the parity rack ($n_p$), we should select a relay node for the parity rack. A small question is: how do we select the relay node in the parity rack? We tested two options: 1) random selection, and 2) load-balanced selection, where every round we select a different relay node for forwarding data. However, our tests showed that the latter is unnecessary. Thus, we choose the first option.
On the other hand, if $n_d \geq n_p$, parity-delta-based update is used, where we select a relay node for the data rack. Similar to data-delta-based update, we utilize the relay node to compute and transfer the deltas of the corresponding parity blocks. As illustrated in Fig. 10b, there are three updated data blocks and two parity nodes to be updated (i.e., $n_d = 3 > n_p = 2$); thus, we randomly select a data node as the relay node to compute and forward the deltas of the two parity blocks.
In a nutshell, data-delta-based update and parity-delta-based update are two methods to minimize the cross-rack network traffic. Compared to CAU, our experiments show that this design has advantages in update time, throughput and cross-rack traffic.
In this section, we conduct extensive performance evaluation via local testbed experiments comparing the proposed approach DXR-DU with two well-known counterparts: PDN-P and CAU. We summarize our major findings: compared to the state-of-the-art schemes, ① DXR-DU saves more than 44.9% of cross-rack traffic in most cases (Section 4.3); ② DXR-DU improves the update throughput by up to 53.6% (Section 4.4).
Traces: We assess the update performance via trace-driven evaluation. We utilize the MSR Cambridge Traces (MSR), which record the I/O patterns of 13 core servers in a data center. Each trace consists of successive read/write requests, each of which records the request type (read or write), the start position of the requested data, the request size, etc. According to the ranking of the average update sizes of the MSR Cambridge Traces, we select four traces with dramatically distinct update sizes (sorted from small to large): rsrch_2, hm_0, hm_1, proj_0.
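For reference, a minimal Go sketch of parsing one trace record is shown below; we assume the published CSV layout (Timestamp, Hostname, DiskNumber, Type, Offset, Size, ResponseTime), which should be verified against the actual trace files.

```go
// trace.go: a minimal sketch of parsing an MSR Cambridge trace record.
package dxrdu

import (
	"fmt"
	"strconv"
	"strings"
)

// TraceReq keeps the three fields the replayer needs.
type TraceReq struct {
	Write  bool
	Offset int64
	Size   int64
}

// ParseLine extracts the request type, start position, and size.
func ParseLine(line string) (TraceReq, error) {
	f := strings.Split(line, ",")
	if len(f) < 7 {
		return TraceReq{}, fmt.Errorf("short record: %q", line)
	}
	off, err := strconv.ParseInt(f[4], 10, 64)
	if err != nil {
		return TraceReq{}, err
	}
	size, err := strconv.ParseInt(f[5], 10, 64)
	if err != nil {
		return TraceReq{}, err
	}
	return TraceReq{
		Write:  strings.EqualFold(f[3], "Write"),
		Offset: off,
		Size:   size,
	}, nil
}
```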
PDN-P: When a data block is updated, PDN-P directly sends the delta to the relevant parity nodes, which means it builds a star-structured transmission path for each update.
CAU: As shown in Fig. 6, CAU updates parity blocks through selective parity update: if the updated data blocks of a data rack outnumber the parity blocks to update in a parity rack, CAU updates the parity blocks by transferring parity delta blocks; otherwise, it updates them by transferring the data delta blocks.
Since open-source implementations of PDN-P and CAU are not available, we design and implement prototypes of DXR-DU and its two counterparts (PDN-P and CAU) in the Go programming language on Ubuntu 18.04. These schemes rely on Cauchy RS code implementations; hence, we utilize the reedsolomon library, which is the Go counterpart of the Jerasure library 2.0.
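As a minimal sketch of how the coding tier can be initialized with this library (shown for the RS(12, 4) code of our testbed; the options in our actual prototype may differ in detail):

```go
// encode.go: a minimal sketch of Cauchy RS(12, 4) encoding with the
// klauspost/reedsolomon library.
package main

import (
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 12 data shards + 4 parity shards with a Cauchy generator matrix.
	enc, err := reedsolomon.New(12, 4, reedsolomon.WithCauchyMatrix())
	if err != nil {
		log.Fatal(err)
	}

	const blockSize = 1 << 20 // 1 MB blocks, the default in our experiments
	shards := make([][]byte, 16)
	for i := range shards {
		shards[i] = make([]byte, blockSize)
	}
	// Fill shards[0..11] with data, then compute shards[12..15] (parity).
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("parity consistent:", ok)
}
```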
The system architecture of our prototype is illustrated in Fig. 11, where we choose RS(12, 4) (deployed in Windows Azure Storage), i.e., 12 data nodes and 4 parity nodes. We utilize the Linux tool tc to group them into three racks via virtual Top-of-Rack (ToR) switches and set the cross-rack/inner-rack bandwidth at 40/200 Mbps. Such a configuration can tolerate any four node failures as well as any single rack failure. Besides, we have another node called the metadata server, which is used for metadata management. The metadata server includes two components: the client, which generates user requests, and the central controller, which is responsible for sending commands to the storage nodes and receiving ACKs from them. In addition, the agent on each storage node is responsible for performing tasks (e.g., computing and forwarding data) according to the received commands; when a task is finished, it returns an ACK to the command sender. All the nodes are virtual machines (VMs) generated from 3 Huawei H12M-03 servers via Proxmox VE. Each VM is equipped with a dual-core CPU, 2 GB memory and a 32 GB disk.
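To illustrate the command/ACK interaction between the central controller and the agents, here is a minimal Go sketch (the wire format and field names are hypothetical; the actual protocol of our prototype may differ):

```go
// agent.go: a minimal sketch of the storage-node agent loop.
package dxrdu

import (
	"encoding/gob"
	"net"
)

// Command is sent by the central controller to an agent.
type Command struct {
	Op      string   // e.g., "xor", "forward"
	Targets []string // nodes to forward results to
	Payload []byte
}

// ACK is returned to the command sender once the task is done.
type ACK struct{ OK bool }

// ServeAgent accepts commands, performs the requested task, and replies.
func ServeAgent(ln net.Listener) error {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return err
		}
		go func(c net.Conn) {
			defer c.Close()
			var cmd Command
			if err := gob.NewDecoder(c).Decode(&cmd); err != nil {
				return
			}
			// ... perform cmd.Op (compute XOR, forward deltas, etc.) ...
			_ = gob.NewEncoder(c).Encode(ACK{OK: true})
		}(conn)
	}
}
```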
As mentioned above, we believe that our proposed scheme has advantages in cross-rack traffic; thus, we first focus on the amount of induced cross-rack traffic.
Experiment A.1 (Impact of update size): We first study the impact of the update size by selecting four traces with distinct update sizes: rsrch_2, hm_0, hm_1, proj_0. We configure the block size as 1 MB. Table 1 shows the cross-rack traffic for each update. Compared to PDN-P and CAU, DXR-DU reduces the cross-rack traffic by up to 98.0% and 71.6%, respectively. The result exceeded our expectations, but it is consistent with the fact that DU is small.
Experiment A.2 (Impact of block size): To assess the impact of block size, we set the block size to 0.25/1/4 MB, respectively. Table 3 shows that DXR-DU remains efficient at saving cross-rack traffic across different block sizes: it reduces the cross-rack update traffic by 98.4% and 68.2% on average compared to PDN-P and CAU, respectively. The rationale is that DXR-DU utilizes delta transmission.
In a nutshell, with the help of delta transmission, DXR-DU can significantly mitigate the cross-rack update traffic by 44.9%–99.1%.
As health data is "hot data", it is important to achieve excellent DU throughput to maintain data availability at a high level. To compare the various schemes fairly, we assess their DU throughput while varying the update size and the block size.
Experiment B.1 (Impact of update size): We first evaluate the average update time of a single block by varying the update size across the four traces: rsrch_2, hm_0, hm_1, proj_0. Similarly, the default block size is 1 MB. Table 2 shows that DXR-DU remains efficient in DU throughput: its average update time for a single block is only 0.021 s. Compared to PDN-P and CAU, DXR-DU saves the average update time of a single block by up to 89.9% and 53.6%, respectively.
Experiment B.2 (Impact of block size): We further assess the DU throughput under different block sizes (0.25/1/4 MB). From Fig. 12 we observe that DXR-DU improves the update throughput by up to 13.8% and 88.8% when compared to PDN-P and CAU, respectively. Unsurprisingly, as DXR-DU wins the comparison on cross-rack traffic, it also has significant advantages in throughput.
According to our experiments, DXR-DU reduces the cross-rack update traffic by 44.9%–99.1% compared to PDN-P and CAU in most cases. Meanwhile, it dramatically improves the DU throughput, by up to 53.6% compared to CAU.
To achieve high availability of health and medical big data in erasure-coded cloud storage systems, the data update performance of erasure coding should be continuously optimized. We perform DU performance optimization by mitigating the update traffic, especially the cross-rack traffic. To this end, we propose a rack-aware update scheme called Delta-XOR-Relay Data Update (DXR-DU) based on four valuable techniques: delta transmission, XOR, relay, and batch update. Our proposed scheme offers two selective update options: (i) data-delta-based update, when the number of updated data blocks in a data rack is less than the number of parity blocks to update in a parity rack, where we select a parity node as the relay node for collecting the data deltas and renewing the parity blocks; and (ii) parity-delta-based update for the opposite case, where we select a relay node for each data rack to collect the local data deltas and send the parity deltas to the relevant parity nodes. Experiments on a local testbed show that DXR-DU can significantly reduce the cross-rack update traffic and improve the update throughput.
Funding Statement: We thank the anonymous reviewers for their insightful feedback. We also appreciate Wenhan Zhan for his sincere help. This work is supported by Major Special Project of Sichuan Science and Technology Department (2020YFG0460), Central University Project of China (ZYGX2020ZB020, ZYGX2020ZB019).
Conflicts of Interest: We declare that we have no conflicts of interest to report regarding the present study.
1The source code of DXR-DU is available for download at: http://email@example.com:xyf1989/cau.git.