Computers, Materials & Continua
DOI: 10.32604/cmc.2021.016635

Article
General Steganalysis Method of Compressed Speech Under Different Standards
1Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
2The University of Melbourne, Melbourne, VIC3010, Australia
*Corresponding Author: Songbin Li. Email: lisongbin@mail.ioa.ac.cn
Received: 07 January 2021; Accepted: 17 February 2021
Abstract: Analysis-by-synthesis linear predictive coding (AbS-LPC) is widely used in a variety of low-bit-rate speech codecs. Most of the current steganalysis methods for AbS-LPC low-bit-rate compressed speech steganography are specifically designed for a specific coding standard or category of steganography methods, and thus lack generalization capability. In this paper, a general steganalysis method for detecting steganographies in low-bit-rate compressed speech under different standards is proposed. First, the code-element matrices corresponding to different coding standards are concatenated to obtain a synthetic code-element matrix, which will be mapped into an intermediate feature representation by utilizing the pre-trained dictionaries. Then, bidirectional long short-term memory is employed to capture long-term contextual correlations. Finally, a code-element affinity attention mechanism is used to capture the global inter-frame context, and a full connection structure is used to generate the prediction result. Experimental results show that the proposed method is effective and better than the comparison methods for detecting steganographies in cross-standard low-bit-rate compressed speech.
Keywords: Cross-standard; compressed speech; steganalysis; attention
1 Introduction

Data hiding is a technique of embedding secrets into digital media imperceptibly, and different types of media data are considered for steganography, including image [1,2], text [3,4], and video [5,6]. In recent years, with the continuous growth of network bandwidth and the enhancement of network convergence, network streaming media services for communication have undergone unprecedented development. Since Voice over Internet Protocol (VoIP) technology [7,8] has been widely used for real-time communication, it has become an excellent carrier for transmitting secret information over the Internet. VoIP steganography is a means of imperceptibly embedding secret information into VoIP-based cover speech. There are many VoIP speech codecs, including G.711, G.723.1, G.726, G.728, G.729, internet Low Bitrate Codec (iLBC), and the Adaptive Multi-Rate (AMR) codec. Most of them, including G.723.1, G.729, AMR, and iLBC, are low-bit-rate speech codecs that use analysis-by-synthesis linear predictive coding (AbS-LPC) [9]. At present, most methods of speech steganography utilize AbS-LPC low-bit-rate speech codecs to embed secret information for covert communication. Therefore, it is essential to develop a powerful steganalysis method to analyze low-bit-rate speech streams.
Information-hiding methods based on low-bit-rate speech streams can be divided into three categories according to the embedding position: the first category uses a pitch synthesis filter for information hiding [10–16], the second uses an LPC synthesis filter to hide information [17–22], and the third embeds information by directly modifying the values of certain code elements in the compressed speech stream [23–30].
The existing steganalysis methods for AbS-LPC low-bit-rate compressed speech steganography are designed for a specific coding standard or category of steganography methods, and thus lack generalization capacity. When general steganalysis is required, it is complex and time-consuming to enumerate all the steganalysis methods that correspond to the steganographic methods, which makes it difficult to meet the requirements of practical applications. In this paper, the generality of steganalysis algorithms is divided into two levels: the first is generality across different steganography algorithms under the same compression standard, and the second is generality across steganography algorithms under different standards. For ease of exposition, the first level is referred to as C1 and the second as C2. A general steganalysis algorithm at the C1 level can effectively detect different information-hiding algorithms (e.g., quantization index modulation [31]) under the same standard, such as G.729. A general steganalysis method at the C2 level can detect different information-hiding algorithms under an arbitrary standard. For example, to achieve general steganalysis across coding standards with non-general steganography detection methods, multiple steganalysis methods must be used jointly for the different coding standards and steganography methods, as shown in Fig. 1a. As demonstrated in Fig. 1b, different methods must still be combined across coding standards when using C1-level steganalysis methods. As shown in Fig. 1c, only one C2-level detection method is needed. The ideal steganalysis method therefore achieves C2-level generality, which is the research focus of this paper.
When speech signals are encoded under different coding standards, the number of code elements (CEs) and their meanings differ considerably. Therefore, it is unrealistic to perform C2-level general steganalysis directly on the original compressed speech stream. In this paper, the compressed speech streams of different coding standards are first converted into an intermediate feature representation. Then, a classification network based on a CE affinity attention mechanism is built to accomplish steganalysis.
2 The Proposed Method

The architecture of the proposed steganalysis method is illustrated in Fig. 2. It can be divided into two parts: intermediate feature representation and the steganalysis network. The intermediate feature representation converts compressed speech data under different coding standards into a general intermediate form, and the steganalysis network performs steganalysis based on this intermediate feature. The details are described below.
2.1 Intermediate Feature Representation
Assuming that one must detect $m$ types of coding standards at the same time, the CE matrix corresponding to the $i$th coding standard can be written as

$$\mathbf{X}_i \in \mathbb{Z}^{T \times N_i},$$

where $N_i$ is the number of CEs in a frame corresponding to the $i$th coding standard, and $T$ is the number of frames. The $m$ CE matrices are concatenated along the CE dimension to obtain the synthetic CE matrix

$$\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_m] \in \mathbb{Z}^{T \times N}, \qquad N = \sum_{i=1}^{m} N_i,$$

where $N$ is the total number of CEs in a frame of the synthetic matrix.
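As a concrete illustration of this concatenation step, the following Python sketch builds a synthetic CE matrix from two hypothetical per-standard CE matrices; the frame count and per-frame CE counts are invented for illustration and are not the actual parameters of any codec.

```python
import numpy as np

# Hypothetical CE matrices: T frames, N_i code elements per frame.
T = 100                                          # illustrative frame count
X1 = np.random.randint(0, 256, size=(T, 24))     # stand-in for standard 1
X2 = np.random.randint(0, 256, size=(T, 15))     # stand-in for standard 2

# Concatenate along the CE axis to form the synthetic CE matrix X (T x N).
X = np.concatenate([X1, X2], axis=1)
assert X.shape == (T, X1.shape[1] + X2.shape[1])  # N = N_1 + N_2
```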
To convert the values of CEs into a form that is easy for the neural network to use, one-hot coding is utilized to map each CE into a feature vector. For a CE that occupies $n$ bits, its coded value range is $0$ to $2^n - 1$. In one-hot encoding, a vector of length $2^n$ is used to represent this CE. If the coded value of this CE is $u$, the one-hot representation can be denoted as

$$\mathbf{h} = [h_0, h_1, \ldots, h_{2^n-1}], \qquad h_j = \begin{cases} 1, & j = u, \\ 0, & \text{otherwise}, \end{cases}$$

where $h_j$ denotes the $j$th element of the one-hot vector.
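A minimal sketch of this one-hot mapping (the helper name and the 3-bit example are purely illustrative):

```python
import numpy as np

def one_hot(u: int, n_bits: int) -> np.ndarray:
    """One-hot encode a CE occupying n_bits bits: the vector has
    length 2**n_bits with a single 1 at index u (0 <= u <= 2**n_bits - 1)."""
    vec = np.zeros(2 ** n_bits, dtype=np.float32)
    vec[u] = 1.0
    return vec

# A 3-bit CE with coded value 5 -> [0. 0. 0. 0. 0. 1. 0. 0.]
print(one_hot(5, 3))
```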
After one-hot coding, a group of independent CE one-hot representations is obtained; these are aggregated in the order of the original CEs to form one long feature vector, called a multi-hot vector. This process is called multi-hot coding. For a frame containing $M$ code elements, the length of the corresponding multi-hot vector is

$$L = \sum_{i=1}^{M} 2^{d_i},$$

where $d_i$ denotes the number of bits that the $i$th CE occupies.
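The multi-hot aggregation can be sketched as follows; the CE values and bit widths are invented, and the assertion checks the length formula above.

```python
import numpy as np

def multi_hot(values, bit_widths):
    """Concatenate per-CE one-hot vectors in CE order into one
    multi-hot vector of length sum(2**d_i)."""
    parts = []
    for u, d in zip(values, bit_widths):
        part = np.zeros(2 ** d, dtype=np.float32)
        part[u] = 1.0
        parts.append(part)
    return np.concatenate(parts)

# Three CEs occupying 2, 3 and 4 bits: length 2^2 + 2^3 + 2^4 = 28.
v = multi_hot([1, 5, 9], [2, 3, 4])
assert v.shape == (28,) and v.sum() == 3
```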
However, some CEs occupy many bits, which greatly increases the computational cost of the model: when one-hot coding is applied to such code elements, the resulting one-hot vectors become extremely long. This explosive growth in dimensionality is unaffordable, so dimensionality reduction is needed. Therefore, a frequency-count method is employed for these CEs. Experiments show that this simple coding method is very effective. Specifically, for each CE that occupies more than 8 bits, the occurrence frequency of each of its coded values is counted. The coded values are then ranked by frequency, the first 255 values are selected and encoded as 0–254, and all remaining values are encoded as 255. In this way, the coded values of all CEs are mapped into the range 0–255.
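A minimal sketch of this frequency-count mapping, with invented values for a hypothetical CE occupying more than 8 bits:

```python
from collections import Counter

def build_codebook(train_values, keep=255):
    """Rank coded values by occurrence frequency on training data and
    map the `keep` most frequent ones to 0..keep-1."""
    ranked = [v for v, _ in Counter(train_values).most_common(keep)]
    return {v: i for i, v in enumerate(ranked)}

def encode(value, codebook, overflow=255):
    """All remaining coded values collapse to the overflow code 255."""
    return codebook.get(value, overflow)

observed = [4095, 17, 17, 512, 17, 4095, 3]   # illustrative 12-bit values
cb = build_codebook(observed)
print(encode(17, cb), encode(1234, cb))       # -> 0 255
```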
A sparse representation is thus obtained after multi-hot coding. Since such sparse vectors are inefficient as direct network inputs, a pre-trained dictionary is built for each CE, and the sparse representation is mapped into a compact intermediate feature representation, which serves as the input of the steganalysis network.
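The dictionary construction is not spelled out in this excerpt; the sketch below assumes each per-CE dictionary behaves like a learned embedding table that maps a (reduced) CE code to a dense vector, with all sizes invented for illustration.

```python
import torch
import torch.nn as nn

n_ces, vocab, emb_dim = 39, 256, 32   # illustrative sizes
dictionaries = nn.ModuleList(nn.Embedding(vocab, emb_dim)
                             for _ in range(n_ces))

def to_intermediate(ce_frame: torch.Tensor) -> torch.Tensor:
    """Map one frame of CE codes (shape [n_ces], dtype long) to a dense
    intermediate feature of shape [n_ces * emb_dim]."""
    return torch.cat([dictionaries[i](ce_frame[i]) for i in range(n_ces)])

frame = torch.randint(0, vocab, (n_ces,))
print(to_intermediate(frame).shape)   # torch.Size([1248])
```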
2.2 Steganalysis Network

Since the preceding and succeeding frames of a speech sample influence each other, a two-layer Bi-LSTM [32] is first employed to capture the long-term contextual correlations of the intermediate feature representation.
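In PyTorch, such a two-layer Bi-LSTM over the per-frame intermediate features can be sketched as follows; the input and hidden dimensions are assumptions.

```python
import torch
import torch.nn as nn

feat_dim, hidden = 1248, 128   # assumed dimensions
bilstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                 num_layers=2, batch_first=True, bidirectional=True)

x = torch.randn(8, 100, feat_dim)   # (batch, T frames, features)
H, _ = bilstm(x)                    # H: (8, 100, 2 * hidden)
```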
The architecture of the CE affinity attention module is also illustrated in Fig. 2. It consists of two branches: the first branch calculates the local affinity attention vector, and the second transforms the feature representation output by the Bi-LSTM.
In this paper, the output features calculated by the Bi-LSTM are defined as

$$\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T] \in \mathbb{R}^{T \times C},$$

where $\mathbf{h}_t$ denotes the output feature of the $t$th frame, $T$ is the number of frames, and $C$ is the feature dimension.
After $\mathbf{H}$ is obtained, the first branch computes the affinity between every pair of frames and normalizes these affinities into attention weights, while the second branch applies a linear transformation to $\mathbf{H}$. The transformed frame features are then combined according to the attention weights, where each output frame aggregates information from all frames and thus encodes the global inter-frame context.
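The exact formulation of the module is not reproduced in this excerpt; the following PyTorch sketch shows one generic way such a two-branch frame-affinity attention could look, with the linear projections and the scaling factor as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffinityAttention(nn.Module):
    """Sketch: branch 1 scores pairwise frame affinities, branch 2
    transforms the Bi-LSTM features; the normalized affinities then
    reweight the transformed frames into a global-context feature."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, dim, bias=False)   # branch 1
        self.value = nn.Linear(dim, dim)               # branch 2

    def forward(self, H):                              # H: (batch, T, dim)
        A = torch.matmul(self.score(H), H.transpose(1, 2))
        A = F.softmax(A / H.size(-1) ** 0.5, dim=-1)   # (batch, T, T)
        return torch.matmul(A, self.value(H))          # (batch, T, dim)
```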
To endow the features used for classification with both long-range dependencies and global inter-frame context, the outputs of the Bi-LSTM and of the CE affinity attention module are integrated to form a more powerful feature representation. The integrated features are then input into a classifier that consists of two fully connected layers and a sigmoid activation function, yielding a prediction probability

$$p = \sigma\big(\mathbf{W}_2(\mathbf{W}_1 \mathbf{f} + \mathbf{b}_1) + \mathbf{b}_2\big),$$

where $\mathbf{f}$ is the integrated feature vector, $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{b}_1$, and $\mathbf{b}_2$ are the parameters of the two fully connected layers, and $\sigma(\cdot)$ is the sigmoid function; $p$ determines whether a hidden message exists in the input speech sequence.
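A minimal sketch of such a classification head; the hidden width and the mean-pooling over frames are assumptions, since only the two fully connected layers and the sigmoid are stated above.

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """Two fully connected layers ending in a sigmoid, applied to the
    integrated (Bi-LSTM + attention) features."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats):               # feats: (batch, T, dim)
        pooled = feats.mean(dim=1)          # assumed frame pooling
        return self.fc(pooled).squeeze(-1)  # p in (0, 1) per sample
```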
3 Experiments and Analysis
Seven thousand speech segments were collected from the Internet, covering seven human voice categories, to form the speech database. Each category contains 1,000 speech segments. The seven categories are Chinese man, Chinese woman, English man, English woman, French, German, and Japanese, and each category contains samples from more than five individuals. The duration of each speech segment is 10 s, and each segment is stored as a mono PCM file with an 8,000-Hz sampling rate and 16-bit quantization. The speech segments in each category are divided into a training dataset and a testing dataset at a 4:1 ratio, as sketched below; the training dataset is used to adjust the model parameters, and the testing dataset is used to evaluate model performance. The G.723.1 (6.3 kbit/s) and G.729 codecs are used to evaluate the performance of the proposed method.
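A trivial sketch of the per-category 4:1 split (the helper name and seed are illustrative):

```python
import random

def split_4_to_1(segments, seed=0):
    """Shuffle one category's segments and split them 4:1 into
    training and testing subsets."""
    segs = list(segments)
    random.Random(seed).shuffle(segs)
    cut = len(segs) * 4 // 5
    return segs[:cut], segs[cut:]

train, test = split_4_to_1(range(1000))
print(len(train), len(test))   # 800 200
```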
Both the training and testing stages were executed on a GeForce GTX 2080 graphical processing unit with 11 GB of graphics memory. PyTorch was used to implement the model and algorithm. In addition, the neural network was trained with the Adam optimizer at a fixed learning rate.
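A minimal PyTorch training-step sketch with the Adam optimizer; the model here is a stand-in and the learning rate is a placeholder, since the exact value is not given in this excerpt.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 64), nn.ReLU(),
                      nn.Linear(64, 1), nn.Sigmoid())       # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed lr
criterion = nn.BCELoss()

def train_step(feats, labels):
    """One Adam optimization step on a batch of pooled features."""
    optimizer.zero_grad()
    p = model(feats).squeeze(-1)        # predicted stego probability
    loss = criterion(p, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```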
As mentioned above, three main categories of steganography methods exist for AbS-LPC low-bit-rate compressed speech. To comprehensively test the performance of the proposed model, a representative method [15,17,24] was chosen from each steganography category. For simplicity, the chosen methods are denoted "ACL" [15], "CNV" [17], and "HYF" [24]. It should be noted that the ACL and HYF methods are designed for the G.723.1 standard, whereas the CNV method can also be applied under the G.729 standard; all three methods were used for steganography under the G.723.1 standard.
To the best of our knowledge, no general method has been designed for the detection of steganographies in cross-standard AbS-LPC low-bit-rate compressed speech. The MFCC-based steganalysis method [33] can, in theory, detect any type of steganography from the decoded audio/speech data, and in this sense it can be regarded as general as well. In addition, Hu et al. [34] proposed an SFFN-based steganalysis method that generalizes across steganography methods under a specific coding standard. In the present paper, these methods are used as comparison algorithms to evaluate the proposed method.
The embedding rate is defined as the ratio of the number of embedded bits to the total embedding capacity. Experiments on the three steganography methods for the G.723.1 standard were conducted under five embedding rates (20%–100%). The experimental results are shown in Tab. 1. For ACL, the detection accuracy of the MFCC method is only 51.58% at a 20% embedding rate, barely better than a random guess, whereas the detection accuracy of the proposed method is 98.96%. However, the detection accuracy of SFFN reaches 99.54%, 0.58% higher than that of the proposed method. When the embedding rate is 40% or above, both SFFN and the proposed method achieve a detection accuracy of 100%. For HYF and CNV at a 20% embedding rate, the detection accuracies of the proposed method are 35.73% and 37.26% higher, respectively, than those of MFCC, while those of SFFN are only 8.48% and 12% higher, respectively. When the embedding rate is 80% or above, SFFN achieves detection accuracies greater than 95%, whereas the proposed method already achieves that accuracy at an embedding rate of only 20%.
Because the ACL and HYF methods are designed for the G.723.1 standard, only the CNV method is used for steganography under the G.729 standard. Experiments on the CNV method were conducted under the same five embedding rates (20%–100%). The experimental results are shown in Tab. 2, from which it can be seen that the proposed method outperforms MFCC and SFFN at all embedding rates. At a 20% embedding rate, the detection accuracy of the proposed method is 32.73% higher than that of MFCC and 6.74% higher than that of SFFN. When the embedding rate is 80% or above, SFFN achieves detection accuracies greater than 99%, whereas the proposed method already achieves that accuracy at an embedding rate of only 40%.
In summary, the proposed method achieves the best results at all embedding rates under the G.723.1 and G.729 standards, except for ACL steganography at a 20% embedding rate under the G.723.1 standard, where its accuracy is 0.58% lower than that of SFFN. The experimental results indicate that the proposed steganalysis method is effective for detecting steganographies in cross-standard low-bit-rate compressed speech.
4 Conclusion

In this paper, a general method for detecting steganographies in cross-standard low-bit-rate compressed speech based on an intermediate feature representation is proposed. To detect multiple coding standards at the same time, the code element (CE) matrices corresponding to m coding standards are first concatenated to obtain a synthetic CE matrix. Then, one-hot coding is utilized to convert this matrix into a form that is easy for a neural network to use. Inspired by ideas from natural language processing, a dictionary is built for each CE to transform the sparse representations into more compact intermediate features. These features are input into the steganalysis network to obtain the final classification result. Experimental results demonstrate that the proposed method outperforms the comparison methods in detection accuracy.
Funding Statement: This work is supported partly by Hainan Provincial Natural Science Foundation of China under Grant No. 618QN309, partly by the Important Science & Technology Project of Hainan Province under Grant Nos. ZDKJ201807 and ZDKJ2020010, partly by the Scientific Research Foundation Project of Haikou Laboratory, Institute of Acoustics, Chinese Academy of Sciences, and partly by the IACAS Young Elite Researcher Project (QNYC201829 and QNYC201747).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
References

1. R. Das, M. Baykara and G. Tuna, "A novel approach to steganography: Enhanced least significant bit substitution algorithm integrated with self-determining encryption feature," Computer Systems Science and Engineering, vol. 34, no. 1, pp. 23–32, 2019.
2. Y. Wang, Z. Fu and X. Sun, "High visual quality image steganography based on encoder-decoder model," Journal of Cyber Security, vol. 2, no. 3, pp. 115–121, 2020.
3. L. Y. Xiang, S. H. Yang, Y. H. Liu, Q. Li and C. Z. Zhu, "Novel linguistic steganography based on character-level text generation," Mathematics, vol. 8, no. 9, p. 1558, 2020.
4. Y. J. Tong, Y. L. Liu, J. Wang and G. J. Xin, "Text steganography on RNN-generated lyrics," Mathematical Biosciences and Engineering, vol. 16, no. 5, pp. 5451–5463, 2019.
5. Z. Li, L. Meng, S. Xu, Z. Li, Y. Shi et al., "A HEVC video steganalysis algorithm based on PU partition modes," Computers, Materials & Continua, vol. 59, no. 2, pp. 563–574, 2019.
6. H. Tang, X. Yang, Y. Zhang and K. Niu, "An MV-based steganographic algorithm for H.264/AVC without distortion," Computers, Materials & Continua, vol. 63, no. 3, pp. 1205–1219, 2020.
7. B. Goode, "Voice over internet protocol (VoIP)," Proceedings of the IEEE, vol. 90, no. 9, pp. 1495–1517, 2002.
8. R. J. B. Roslin, O. O. Khalifa and S. S. N. Bhuiyan, "Improved voice over internet protocol for wireless devices," in 7th Int. Conf. on Computer and Communication Engineering, Kuala Lumpur, Malaysia, pp. 498–503, 2018.
9. D. O'Shaughnessy, "Linear predictive coding," IEEE Potentials, vol. 7, no. 1, pp. 29–32, 1988.
10. B. Geiser and P. Vary, "High rate data hiding in ACELP speech codecs," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, pp. 4005–4008, 2008.
11. A. Nishimura, "Data hiding in pitch delay data of the adaptive multi-rate narrow-band speech codec," in Fifth Int. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, Kyoto, Japan, pp. 483–486, 2009.
12. H. Miao, L. Huang, Z. Chen, W. Yang, A. Al-Hawbani et al., "A new scheme for covert communication via 3G encoded speech," Computers & Electrical Engineering, vol. 38, no. 6, pp. 1490–1501, 2012.
13. S. Yan, G. Tang and Y. Chen, "Incorporating data hiding into G.729 speech codec," Multimedia Tools and Applications, vol. 75, no. 18, pp. 11493–11512, 2016.
14. Y. Ren, H. Wu and L. Wang, "An AMR adaptive steganography algorithm based on minimizing distortion," Multimedia Tools and Applications, vol. 77, no. 10, pp. 12095–12110, 2018.
15. Y. F. Huang, C. Liu, S. Y. Tang and S. Bai, "Steganography integration into a low-bit rate speech codec," IEEE Transactions on Information Forensics and Security, vol. 7, no. 6, pp. 1865–1875, 2012.
16. Y. Z. Ren, H. Y. Yang, H. X. Wu, W. P. Tu and L. N. Wang, "A secure AMR fixed codebook steganographic scheme based on pulse distribution model," IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2649–2661, 2019.
17. B. Xiao, Y. F. Huang and S. Y. Tang, "An approach to information hiding in low bit-rate speech stream," in IEEE Global Telecommunications Conf., New Orleans, LA, USA, pp. 1–5, 2008.
18. Y. F. Huang, H. Z. Tao, B. Xiao and C. C. Chang, "Steganography in low bit-rate speech streams based on quantization index modulation controlled by keys," Science China Technological Sciences, vol. 60, no. 10, pp. 1585–1596, 2017.
19. H. Tian, J. Liu and S. B. Li, "Improving security of quantization-index-modulation steganography in low bit-rate speech streams," Multimedia Systems, vol. 20, no. 2, pp. 143–154, 2014.
20. P. Liu, S. B. Li and H. Q. Wang, "Steganography in vector quantization process of linear predictive coding for low-bit-rate speech codec," Multimedia Systems, vol. 23, no. 4, pp. 485–497, 2017.
21. P. Liu, S. B. Li and H. Q. Wang, "Steganography integrated into linear predictive coding for low bit-rate speech codec," Multimedia Tools and Applications, vol. 76, no. 2, pp. 2837–2859, 2017.
22. Y. Z. Ren, W. M. Zheng and L. N. Wang, "SILK steganography scheme based on the distribution of LSF parameter," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conf., Honolulu, HI, USA, pp. 539–548, 2018.
23. T. Z. Xu and Z. Yang, "Simple and effective speech steganography in G.723.1 low-rate codes," in Int. Conf. on Wireless Communications & Signal Processing, Nanjing, China, pp. 1–4, 2009.
24. Y. F. Huang, S. Y. Tang and J. Yuan, "Steganography in inactive frames of VoIP streams encoded by source codec," IEEE Transactions on Information Forensics and Security, vol. 6, no. 2, pp. 296–306, 2011.
25. J. Liu, K. Zhou and H. Tian, "Least-significant-digit steganography in low bitrate speech," in IEEE Int. Conf. on Communications, Ottawa, Canada, pp. 1133–1137, 2012.
26. R. S. Lin, "An imperceptible information hiding in encoded bits of speech signal," in Int. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, Adelaide, SA, Australia, pp. 37–40, 2015.
27. Z. J. Wu, H. J. Cao and D. Z. Li, "An approach of steganography in G.729 bitstream based on matrix coding and interleaving," Chinese Journal of Electronics, vol. 24, no. 1, pp. 157–165, 2015.
28. S. F. Yan, G. M. Tang, Y. F. Sun, Z. Z. Gao and L. Q. Shen, "A triple-layer steganography scheme for low bit-rate speech streams," Multimedia Tools and Applications, vol. 74, no. 24, pp. 11763–11782, 2015.
29. X. S. Peng, Y. F. Huang and F. F. Li, "A steganography scheme in a low-bit rate speech codec based on 3D-sudoku matrix," in 8th IEEE Int. Conf. on Communication Software and Networks, Beijing, China, pp. 13–18, 2016.
30. Y. J. Jiang and S. Y. Tang, "An efficient and secure VoIP communication system with chaotic mapping and message digest," Multimedia Systems, vol. 24, no. 3, pp. 355–363, 2018.
31. S. B. Li, Y. Z. Jia and C. C. J. Kuo, "Steganalysis of QIM steganography in low-bit-rate speech signals," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1011–1022, 2017.
32. G. Liu and J. B. Guo, "Bidirectional LSTM with attention mechanism and convolutional layer for text classification," Neurocomputing, vol. 337, no. 14, pp. 325–338, 2019.
33. Q. Z. Liu, A. H. Sung and M. Y. Qiao, "Temporal derivative-based spectrum and mel-cepstrum audio steganalysis," IEEE Transactions on Information Forensics and Security, vol. 4, no. 3, pp. 359–368, 2009.
34. Y. T. Hu, Y. H. Huang, Z. L. Yang and Y. F. Huang, "Detection of heterogeneous parallel steganography for low bit-rate VoIP speech streams," Neurocomputing, vol. 419, no. 1, pp. 70–79, 2020.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.