As video compression is one of the core technologies required to enable seamless medical data streaming in mobile healthcare applications, there is a need for powerful media codecs that achieve minimal bitrates while maintaining high perceptual quality. Versatile Video Coding (VVC) is the latest video coding standard, providing significantly improved coding performance at similar visual quality compared with its predecessor, High Efficiency Video Coding (HEVC). To achieve this improved coding performance, VVC adopted various advanced coding tools, such as the flexible Multi-type Tree (MTT) block structure, which uses Binary Tree (BT) and Ternary Tree (TT) splits. However, the VVC encoder incurs heavy computational complexity due to the excessive Rate-Distortion Optimization (RDO) processes used to determine the optimal MTT block mode. In this paper, we propose a fast MTT decision method with two Lightweight Neural Networks (LNNs) based on Multi-layer Perceptrons (MLPs), which are applied to determine the early termination of the TT split within the encoding process. Experimental results show that the proposed method reduces encoding complexity by up to 26% with unnoticeable coding loss compared to the VVC Test Model (VTM).
Image or video compression is widely used to facilitate real-time medical data communication in mobile healthcare applications. This technology has several applications, including remote diagnostics and emergency incident response, as shown in
The state of the art video coding standard, Versatile Video Coding (VVC) [
One of the main differences between HEVC and VVC is the block partitioning scheme. Both HEVC and VVC specify the Coding Tree Unit (CTU) as the basic coding unit, with a size configurable by the encoder. In addition, to adapt to various block properties, a CTU can be split into four Coding Units (CUs) using a Quad-Tree (QT) structure. In HEVC, a leaf CU can be further partitioned into one, two, or four Prediction Units (PUs) according to the PU partitioning types. After obtaining the residual block derived at the PU level using either intra or inter prediction, a leaf CU can be further partitioned into multiple Transform Units (TUs) according to a Residual Quad-Tree (RQT) that is structurally similar to the CU split. Therefore, the block structure of HEVC involves multiple partitioning concepts, including CU, PU, and TU, as shown in
On the other hand, VVC replaces the multiple partition unit types (CU, PU, and TU) of HEVC with a single block structure, named the QT-based Multi-type Tree (QTMTT). In VVC, an MTT node can be partitioned using either a Binary Tree (BT) or a Ternary Tree (TT) split to support more flexible CU shapes. As shown in
Although the block structure of VVC is superior to that of HEVC in terms of the flexibility of CU shapes, it imposes a heavy computational burden on the VVC encoder due to the excessive Rate-Distortion Optimization (RDO) calculations required to search for the optimal QT, BT, and TT block modes. Furthermore, VVC adopted a dual-tree structure for intra frames, so that a CTU can have different block structures for each color component. This means that a luma CTU and a chroma CTU can have their own QTMTT structures. This dual-tree concept significantly improves coding performance in the chroma components but comes at the cost of increased computational complexity [
In this study, we investigated four input features to use as the input vector of the two proposed LNNs with MLP structures. With these four input features, the two LNNs were designed to provide high accuracy at minimal complexity. In addition, various ablation studies were performed to evaluate how these features affect the accuracy of the proposed LNNs. We then proposed a fast MTT decision method that uses the two LNNs to determine the early termination of the horizontal and vertical TT splits in the CU encoding process. Finally, we implemented the proposed method on top of the VTM and evaluated the trade-off between complexity reduction and coding loss on medical sequences as well as JVET test sequences.
The remainder of this paper is organized as follows. Section 2 reviews related fast encoding schemes used to reduce the complexity of video encoders. The proposed method is described in Section 3. Finally, experimental results and conclusions are given in Sections 4 and 5, respectively.
Several methods have been proposed to reduce the computational complexity of HEVC and VVC encoders. Shen et al. proposed a fast CU size decision algorithm for HEVC intra prediction, which exploited RD costs and intra mode correlations among previously encoded CUs [
While the aforementioned methods evaluated statistical coding correlations between a current CU and its neighboring CUs, recent studies have explored fast decision schemes using Convolutional Neural Networks (CNNs) or Multi-layer Perceptrons (MLPs) to avoid redundant block partitioning within the video encoding process. Xu et al. [
The RD-based encoding procedures for the current CU are illustrated in
As shown in
In this paper, we define the relationship between the parent CU and the current CUs used to extract the network input features. The parent CU can be either a QT node with a square shape or an MTT node with either a square or a rectangular shape covering the area of the current CU. For example, the divided QT, BT, and TT CUs can have the same parent CU, which is the QT node of 2N × 2N size.
Because CNNs generally require heavy convolution operations between input features and filter weights, the proposed LNNs were designed to use a small set of input features that can be easily extracted during the current CU encoding process. In this paper, we propose four input features, named Ratio of Block Size (RBS), Optimal BT Direction (OBD), Ratio of the Number of Direction (RND), and TT Indication (TTI), as described in
| Input feature | Description |
|---|---|
| Ratio of Block Size (RBS) | Computed depending on the TT direction from W and H, the width and the height of the parent CU, respectively. |
| Optimal BT Direction (OBD) | A Boolean value indicating whether the optimal direction among the two BT splits is the same as that of the current TT split. |
| Ratio of the Number of Direction (RND) | The ratio between the numbers of horizontal and vertical splits. |
| TT Split Indication (TTI) | An accumulated value updated by the RD-cost comparisons in the block mode decision process; TTI is initialized to 0. |
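As a rough illustration of how these four features could be assembled for one TT check, consider the following sketch. The exact RBS and RND formulas and the helper's signature are assumptions for illustration, not the paper's definitions.

```python
# Hypothetical helper assembling the four input features for one TT check.
# The RBS and RND formulas below are assumptions; the paper's exact
# definitions may differ.

def extract_features(parent_w, parent_h, best_bt_dir, tt_dir,
                     num_hor_splits, num_ver_splits, tti):
    # RBS: ratio of the parent-CU sides, chosen per TT direction (assumed).
    rbs = parent_w / parent_h if tt_dir == "hor" else parent_h / parent_w
    # OBD: 1 if the optimal BT direction matches the current TT direction.
    obd = 1.0 if best_bt_dir == tt_dir else 0.0
    # RND: share of horizontal splits among all splits seen so far (assumed).
    total = num_hor_splits + num_ver_splits
    rnd = num_hor_splits / total if total else 0.5
    # TTI: accumulated RD-cost comparison counter, initialized to 0.
    return [rbs, obd, rnd, tti]
```

Under these assumptions, a 32 × 16 parent CU whose best BT direction is horizontal would give a horizontal-TT feature vector of [2.0, 1.0, 0.5, 0] after two splits in each direction.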
We implemented two LNNs with MLP architectures applied before the TT split, named HTS-LNN and VTS-LNN according to the early termination of the horizontal and vertical TT splits, respectively. Through feature aggregation, a 1-dimensional (1D) column vector containing the four features was constructed as the input vector for both HTS-LNN and VTS-LNN. As shown in
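To make the role of the two networks concrete, the sketch below shows one way their outputs could gate the TT RDO checks. This is an illustrative outline, not the authors' implementation; the 0.5 threshold and the function names are assumptions.

```python
# Hypothetical gating of the TT RDO checks by the two network outputs.
# predict_hor / predict_ver stand in for HTS-LNN and VTS-LNN inference;
# the 0.5 threshold is an assumed operating point.

def tt_splits_to_test(features_hor, features_ver,
                      predict_hor, predict_ver, threshold=0.5):
    """Return which TT splits still need a full RDO evaluation."""
    splits = []
    if predict_hor(features_hor) >= threshold:   # HTS-LNN says "try it"
        splits.append("TT_HOR")
    if predict_ver(features_ver) >= threshold:   # VTS-LNN says "try it"
        splits.append("TT_VER")
    return splits
```

Splits omitted from the returned list are terminated early, which is where the encoding-time saving comes from.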
We implemented both HTS-LNN and VTS-LNN in PyCharm 2020.3.2 using the Keras library. In the process of network training, the output of the
| Category | Network | Layer 2 | Layer 3 | Layer 4 | Total parameters | Memory (KB) |
|---|---|---|---|---|---|---|
| Weight | HTS-LNN | 4 × 15 | 15 × 1 | – | 75 | 0.30 |
| Weight | VTS-LNN | 4 × 30 | 30 × 15 | 15 × 1 | 585 | 2.34 |
| Bias | HTS-LNN | 15 | 1 | – | 16 | 0.06 |
| Bias | VTS-LNN | 30 | 15 | 1 | 46 | 0.18 |
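The weight and bias totals in the table follow directly from the layer widths of a fully connected MLP, as the small helper below illustrates (the helper itself is only for checking the arithmetic).

```python
def mlp_params(layer_sizes):
    """Weight and bias counts for a fully connected MLP with the given
    layer widths, one bias per non-input node."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases

# HTS-LNN: 4 -> 15 -> 1 gives 75 weights and 16 biases;
# VTS-LNN: 4 -> 30 -> 15 -> 1 gives 585 weights and 46 biases.
```

At 4 bytes per parameter, 75 weights occupy roughly 0.30 KB and 585 weights roughly 2.34 KB, matching the memory column.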
Under the AI configuration of JVET CTC [
The selected hyperparameters are presented in
| Hyperparameter | Value |
|---|---|
| Optimizer | Stochastic Gradient Descent (SGD) |
| Activation function | Sigmoid |
| Loss function | Mean Squared Error (MSE) |
| Learning rate | 0.01 |
| Momentum | 0.9 |
| Weight decay | 1e-6 |
| Number of epochs | 5,000 |
| Batch size | 128 |
| Initial weight | Xavier |
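The hyperparameters above can be exercised in a small stand-alone NumPy training loop for an HTS-LNN-shaped network (4 → 15 → 1). This is an illustrative re-implementation, not the authors' Keras code: it trains full-batch without bias terms, the epoch count is reduced from the paper's 5,000 for brevity, and the data used below is synthetic.

```python
import numpy as np

# Sketch of the training setup in the table: SGD with lr = 0.01,
# momentum = 0.9, weight decay = 1e-6, sigmoid activations, MSE loss,
# and Xavier-style initialization, for a 4 -> 15 -> 1 MLP.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, y, hidden=15, lr=0.01, mom=0.9, decay=1e-6, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    # Xavier-style initialization scaled by fan-in.
    W1 = rng.normal(0, np.sqrt(1.0 / X.shape[1]), (X.shape[1], hidden))
    W2 = rng.normal(0, np.sqrt(1.0 / hidden), (hidden, 1))
    v1, v2 = np.zeros_like(W1), np.zeros_like(W2)
    for _ in range(epochs):
        h = sigmoid(X @ W1)                   # hidden activations
        out = sigmoid(h @ W2)                 # predicted split probability
        dz2 = (out - y) * out * (1 - out)     # MSE gradient at the output
        g2 = h.T @ dz2 / len(X) + decay * W2  # weight decay term
        dz1 = (dz2 @ W2.T) * h * (1 - h)
        g1 = X.T @ dz1 / len(X) + decay * W1
        v1 = mom * v1 - lr * g1               # momentum updates
        v2 = mom * v2 - lr * g2
        W1 += v1
        W2 += v2
    mse = float(np.mean((sigmoid(sigmoid(X @ W1) @ W2) - y) ** 2))
    return W1, W2, mse
```

In a mini-batch setting, the same update would be applied per batch of 128 feature vectors, as in the table.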
All experiments were run on an Intel Xeon Gold 6138 processor (40 cores, 2.00 GHz) with 256 GB RAM running 64-bit Windows Server 2016. For the Class A (3,840 × 2,160) and Class B (1,920 × 1,080) sequences of the JVET CTC, the proposed method was evaluated under the AI configuration, and we compared our method with the previous method [
In order to evaluate the coding loss, we measured the Bjontegaard Delta Bit Rate (BDBR) [
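The BDBR metric can be sketched as follows: fit cubic polynomials of log-rate as a function of PSNR for the anchor and the test codec, then average the horizontal gap between the two curves over their overlapping PSNR range. This is a common re-implementation of Bjontegaard's method; the exact script behind the paper's numbers may differ slightly.

```python
import numpy as np

def bd_rate(anchor_rates, anchor_psnr, test_rates, test_psnr):
    """Average bitrate difference of the test codec vs the anchor, in
    percent, over the overlapping PSNR interval (Bjontegaard delta)."""
    p_anchor = np.polyfit(anchor_psnr, np.log10(anchor_rates), 3)
    p_test = np.polyfit(test_psnr, np.log10(test_rates), 3)
    lo = max(min(anchor_psnr), min(test_psnr))   # overlapping PSNR range
    hi = min(max(anchor_psnr), max(test_psnr))
    int_anchor, int_test = np.polyint(p_anchor), np.polyint(p_test)
    avg_diff = ((np.polyval(int_test, hi) - np.polyval(int_test, lo)) -
                (np.polyval(int_anchor, hi) - np.polyval(int_anchor, lo))) / (hi - lo)
    return (10 ** avg_diff - 1) * 100.0
```

A test codec that uniformly spends 10% more bits at the same PSNR yields a BDBR of 10%, and a negative BDBR indicates a bitrate saving.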
| Class | Sequence | C-TTD [ ] BDBR-Y | BDBR-U | BDBR-V | BDBR-YUV | ΔT | Proposed BDBR-Y | BDBR-U | BDBR-V | BDBR-YUV | ΔT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 0.17% | 0.09% | 0.30% | 0.17% | 14% | 0.28% | 0.03% | 0.11% | 0.23% | 30% |
| A | FoodMarket4 | 0.16% | 0.14% | 0.14% | 0.15% | 13% | 0.27% | 0.15% | 0.02% | 0.22% | 27% |
| A | Campfire | 0.27% | 0.35% | 0.58% | 0.32% | 17% | 0.38% | 0.10% | 0.45% | 0.36% | 26% |
| A | CatRobot | 0.36% | 0.56% | 0.49% | 0.40% | 14% | 0.50% | 0.21% | 0.20% | 0.43% | 26% |
| A | DaylightRoad2 | 0.44% | 0.99% | 0.77% | 0.55% | 18% | 0.59% | 0.50% | 0.37% | 0.55% | 26% |
| A | ParkRunning3 | 0.15% | 0.35% | 0.32% | 0.19% | 16% | 0.26% | 0.20% | 0.18% | 0.24% | 22% |
| B | MarketPlace | 0.18% | 0.44% | 0.28% | 0.23% | 16% | 0.33% | 0.28% | -0.03% | 0.28% | 29% |
| B | RitualDance | 0.33% | 0.56% | 0.41% | 0.37% | 18% | 0.53% | 0.41% | 0.23% | 0.48% | 27% |
| B | Cactus | 0.38% | 0.49% | 0.71% | 0.44% | 20% | 0.51% | 0.10% | 0.39% | 0.44% | 26% |
| B | BasketballDrive | 0.39% | 0.66% | 0.60% | 0.45% | 15% | 0.50% | 0.14% | 0.18% | 0.42% | 24% |
| B | BQTerrace | 0.34% | 0.89% | 0.94% | 0.48% | 18% | 0.44% | 0.58% | 0.45% | 0.46% | 21% |
| CVC-ClinicDB [ ] sequence | C-TTD [ ] BDBR | C-TTD ΔT | Proposed BDBR | Proposed ΔT |
|---|---|---|---|---|
| 2nd sequence | 0.42% | 25% | 0.44% | 33% |
| 3rd sequence | 0.31% | 26% | 0.40% | 34% |
| 10th sequence | 0.44% | 31% | 0.62% | 36% |
| 12th sequence | 0.29% | 30% | 0.31% | 39% |
| 25th sequence | 0.28% | 28% | 0.44% | 34% |
In order to optimize the proposed network architecture, it is essential to determine the valid input features, the number of hidden layers, and the number of nodes per hidden layer. Tool-off tests on both the training and validation datasets were performed to measure the effectiveness of the four input features: we first measured the performance with all input features combined ("all features") and then repeated the test while omitting each of the four input features in turn. The results of the tool-off tests are illustrated in
| HTS-LNN | Training accuracy | Training loss | Validation accuracy | Validation loss |
|---|---|---|---|---|
| All features | 0.752 | 0.176 | 0.752 | 0.176 |
| RBS tool-off | 0.734 | 0.185 | 0.733 | 0.185 |
| OBD tool-off | 0.716 | 0.191 | 0.716 | 0.191 |
| RND tool-off | 0.747 | 0.180 | 0.746 | 0.181 |
| TTI tool-off | 0.682 | 0.204 | 0.681 | 0.204 |
| VTS-LNN | Training accuracy | Training loss | Validation accuracy | Validation loss |
|---|---|---|---|---|
| All features | 0.713 | 0.193 | 0.716 | 0.192 |
| RBS tool-off | 0.670 | 0.211 | 0.669 | 0.211 |
| OBD tool-off | 0.682 | 0.205 | 0.684 | 0.205 |
| RND tool-off | 0.645 | 0.221 | 0.644 | 0.221 |
| TTI tool-off | 0.660 | 0.218 | 0.663 | 0.217 |
In terms of the number of hidden layers and the number of nodes per hidden layer, we investigated a few neural network models using different numbers of nodes and hidden layers, as shown in
| HTS-LNN architecture | Training accuracy | Training loss | Validation accuracy | Validation loss |
|---|---|---|---|---|
| 4 × 15 × 1 (selected) | – | – | – | – |
| 4 × 30 × 1 | 0.752 | 0.177 | 0.753 | 0.177 |
| 4 × 45 × 1 | 0.753 | 0.176 | 0.748 | 0.178 |
| 4 × 30 × 15 × 1 | 0.752 | 0.176 | 0.751 | 0.176 |
| 4 × 45 × 30 × 1 | 0.752 | 0.176 | 0.752 | 0.176 |
| VTS-LNN architecture | Training accuracy | Training loss | Validation accuracy | Validation loss |
|---|---|---|---|---|
| 4 × 15 × 1 | 0.710 | 0.195 | 0.714 | 0.193 |
| 4 × 30 × 1 | 0.710 | 0.197 | 0.709 | 0.197 |
| 4 × 45 × 1 | 0.711 | 0.195 | 0.710 | 0.194 |
| 4 × 30 × 15 × 1 (selected) | – | – | – | – |
| 4 × 45 × 30 × 1 | 0.713 | 0.193 | 0.714 | 0.193 |
In this paper, we proposed two LNNs with MLP architectures, HTS-LNN and VTS-LNN, to determine the early termination of the horizontal and vertical TT splits, respectively, in the encoding process. HTS-LNN consists of four input nodes, one hidden layer with 15 nodes, and one output node, whereas VTS-LNN consists of four input nodes, two hidden layers with 30 and 15 nodes, and one output node. Various verification tests of these networks were conducted to determine the optimal network structure. To assess the effectiveness of this method for the transfer of medical images, we evaluated our approach on colonoscopy sequences from CVC-ClinicDB as well as on JVET CTC sequences. Our experimental results indicate that the proposed method reduces the average encoding complexity by 26% and 10% with unnoticeable coding loss compared to the anchor and the previous method, respectively. Through visual comparison, we also demonstrated that the proposed method provides almost the same visual quality as the anchor.
This work was supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00072, Development of Audio/Video Coding and Light Field Media Fundamental Technologies for Ultra Realistic Tera-media).