In this paper, a robust approach called INLPETWA (an Intelligent Natural Language Processing and English Text Watermarking Approach) is proposed for detecting tampering in English text by integrating zero text watermarking with a hidden Markov model, used as soft computing and natural language processing techniques. In INLPETWA, the watermark key is embedded and detected logically, without altering the plain text. The second-gram word mechanism of the hidden Markov model is used as a text analysis technique to extract features of the English text; these features serve as the watermark key, which is embedded logically and validated during the detection process to reveal any tampering. INLPETWA has been implemented as a self-developed program in PHP using the VS Code IDE, and it has been evaluated in various experiments and simulation scenarios. Comparison with baseline approaches shows that the proposed approach is suitable for detecting the tested types of tampering attacks (insertion, deletion, and reorder). The paper includes implications for integrating natural language processing and text watermarking to propose an intelligent solution. It fulfils an identified need to study how robust text information can be exchanged via various Internet applications.
For the research community, the security and reliability of text information exchanged through the Internet is one of the most promising and challenging fields. In communication technologies, content authentication and automated integrity verification of text in different languages and formats are of great significance. Numerous applications, such as electronic banking and electronic commerce, pose serious challenges when content is transferred via the Internet. Much of the digital content exchanged via the Internet is in textual form and, in terms of content, structure, grammar, and semantics, is highly vulnerable during online transmission. During the transfer process, malicious attackers can tamper with such digital content and thus the changed count [
For information security, many algorithms and techniques are available, such as content authentication, verification of integrity, detection of tampering, identification of owners, access control and copyright protection.
To overcome these issues, digital watermarking (DWM) is a technique that can be used to hide various data, such as text, binary images, video, and audio, and embed them in digital content as watermark information [
A fine-grain text watermarking method is suggested based on the substitution of homoglyph characters for Latin symbols and white spaces [
Several conventional methods and solutions for text watermarking were proposed [
Limited research has focused on appropriate solutions to verify the credibility of critical digital media online [
Proposing the most appropriate techniques and solutions for various formats and content, especially in English and Arabic languages, is the most common challenge in this area [
Examples of such sensitive digital text content are the digital Holy Qur’an in Arabic, eChecks, and online marks and exams. Arabic alphabet characteristics such as diacritics, extended letters, and other Arabic symbols make it easy to alter the key meaning of text material through basic changes such as modifying diacritic arrangements [
In this paper, the authors present a robust approach, INLPETWA (an Intelligent NLP and English Text Watermarking Approach), which makes use of English text zero-watermarking and the second-gram word method of the Markov model. A soft computing tool and a zero-watermarking technique are integrated in INLPETWA in order to analyze the given English text and extract the watermark information. The embedding process is conducted logically, without affecting the contents or size of the plain English text. After the text is transmitted, the hidden watermark is used in the next phase to detect tampering in the received English text and to ensure the authenticity of the transmitted text.
The core objective of the INLPETWA approach is to achieve better performance, with a high detection level for any illegal tampering of English text exchanged electronically via the Internet.
The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 presents the proposed INLPETWA approach. Section 4 explains the implementation, simulation, and experimental details. Section 5 describes the comparison and discusses the results, and Section 6 offers conclusions.
According to the processing domain of NLP and text watermarking, the existing text watermarking methods and solutions reviewed in this paper are classified into linguistic, structural, and zero-watermark techniques [
Linguistic text watermarking approaches rely on natural language to hide the watermark key by altering the semantic and syntactic nature of the original text [
To enhance the capability and imperceptibility of Arabic text, a text watermarking algorithm based on open-word spaces [
A technique of text steganography [
A Kashida-watermark based method has been presented in [
The method of text steganography [
A text steganographic approach [
Structural text watermarking methods are based on a content-dependent framework in which the structure of the original text is altered to hide watermark data [
Text watermarking method based on Unicode extended characters has been proposed in [
The replacement attack method [
Zero watermark-based methods rely on text characteristics. Several zero-text-watermark-based algorithms and techniques have been suggested, as in studies [
To measure the reliability of the electronic texts posted on social applications, the ANiTH method [
Zero-watermarking algorithm has been presented in [
A zero-watermarking approach [
An intelligent approach is proposed in this paper by integrating text watermarking with the hidden Markov model as an NLP technique. It does not need additional information to be embedded as watermark data, and it does not need to change the plain text to insert a watermark into it. The second-gram word method of the Markov model is used as the NLP technique to analyze English content and extract the features of the text. The assumptions of INLPETWA are as follows:
The watermark key is extracted as a result of English text analysis, without altering the original text. The watermark remains highly robust even if attackers obtain the watermark key in some way. Randomly applied tampering attacks of all considered types (insertion, deletion, and reorder) are detected. Tampering of any volume is detected, even when the attack volume is very low. There is no limitation on the size of the English text.
The following subsections explain in detail the two main processes performed in INLPETWA. The first is the watermark generation and embedding process; the second is the watermark extraction and detection process.
Three algorithms are performed in this process: pre-processing, English text analysis and WM generation, and watermark embedding, as illustrated in
Pre-processing of the plain English text is a core activity in both the WM generation and extraction phases. It sets all English letters to lower case and deletes blank spaces and extra new lines, and it affects the accuracy of tampering detection and the robustness of the watermark. The original English text (OET) must be provided as input to this process.
This algorithm contains two sub-procedures: building the Markov chain matrix, and text analysis and WM generation.
where OET is the plain English text and PET is the pre-processed English text.
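The pre-processing step can be sketched as follows. This is a minimal illustration, not the authors' PHP implementation: the function name `preprocess` is hypothetical, and "deleting blank spaces" is assumed here to mean collapsing redundant whitespace so that word boundaries survive for the later word-level analysis.

```python
import re


def preprocess(oet: str) -> str:
    """Turn an original English text (OET) into a pre-processed text (PET)."""
    text = oet.lower()                  # set all letters to lower case
    text = re.sub(r"\s+", " ", text)    # collapse repeated spaces and extra new lines (assumed reading)
    return text.strip()                 # drop leading and trailing blanks


pet = preprocess("The  Quick\n\n\nBrown Fox ")
print(pet)  # the quick brown fox
```

Because the same routine runs before both generation and detection, purely cosmetic differences in case or spacing would not be reported as tampering under this reading.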
This algorithm is performed as the second step of this process, in which the English text is analyzed to extract its features and use them to generate the watermark information. In this algorithm, the number of occurrences of every transition from each present state (pair of words) is computed by
The following example of the provided English text illustrates the work mechanism of this algorithm.
“The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead”
In the second-gram word method of the Markov model, each unique pair of English words represents a unique state.
Assume that “brown fox” is the current state; its transitions are “jumps”, “who”, and “who”. The “who” transition thus appears twice in the given English text sample.
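The state and transition counting described here can be sketched as below. The helper name `build_transitions` and the nested-counter layout are illustrative assumptions, not the paper's data structure, and the sample sentence is assumed to end in "who is dead" so that the "who" transition occurs twice as stated.

```python
from collections import Counter, defaultdict


def build_transitions(text):
    """Count, for every pair of consecutive words (a state), the words that follow it."""
    words = text.lower().split()
    matrix = defaultdict(Counter)
    # Slide a window of three words: (w1, w2) is the state, nxt is the transition.
    for w1, w2, nxt in zip(words, words[1:], words[2:]):
        matrix[(w1, w2)][nxt] += 1
    return matrix


sample = ("The quick brown fox jumps over the brown fox who is slow "
          "jumps over the brown fox who is dead")
matrix = build_transitions(sample)
print(dict(matrix[("brown", "fox")]))  # {'jumps': 1, 'who': 2}
```

A sparse mapping is used instead of a dense matrix because most state and transition combinations never occur in a given text; only the non-zero entries matter for the watermark.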
Based on the second-gram word method of the hidden Markov model, the text analysis and WM generation algorithm is performed as presented in
The feature extraction and WM generation algorithm proceeds formally as presented in
where, pw: previous pair of words, cpw: current pair of words.
In this approach, the watermark is embedded logically, with no need to alter the original plain text. As a result of feature extraction from the given English text, the WM data is embedded logically by collecting the non-zero values in the Markov chain matrix. Those values are concatenated and used to extract the WM key pattern EW2_WMPO, as given in
The watermark embedding algorithm of the INLPETWA approach is executed as shown below in
where, EW2_WMPO: WM patterns, EW2_DWMPO: digested WM.
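A minimal sketch of the logical embedding step is given below. It concatenates the non-zero transition counts into a pattern (playing the role of EW2_WMPO) and digests it (playing the role of EW2_DWMPO). The traversal order (sorted states and transitions) and the use of SHA-256 are assumptions made for illustration; the text does not name the ordering or the digest function, and the tiny `matrix` is a made-up input.

```python
import hashlib


def watermark_pattern(matrix):
    """Concatenate the non-zero transition counts in a fixed, reproducible order."""
    parts = []
    for state in sorted(matrix):                 # assumed ordering: sorted states
        for nxt in sorted(matrix[state]):        # assumed ordering: sorted transitions
            count = matrix[state][nxt]
            if count != 0:                       # only non-zero values carry the watermark
                parts.append(str(count))
    return "".join(parts)


def digest_pattern(pattern):
    """Digest the pattern; SHA-256 is an assumption, not the paper's stated choice."""
    return hashlib.sha256(pattern.encode("utf-8")).hexdigest()


matrix = {
    ("brown", "fox"): {"jumps": 1, "who": 2},
    ("the", "brown"): {"fox": 2, "cat": 0},
}
ew2_wmpo = watermark_pattern(matrix)
ew2_dwmpo = digest_pattern(ew2_wmpo)
print(ew2_wmpo)  # 122
```

Nothing is written into the text itself: the pattern and its digest are stored separately, which is what makes the scheme a zero-watermarking one.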
The attacked English text must first be pre-processed to obtain PETA. The attacked watermark key (EW2_EWMA) is then produced, and the INLPETWA detection process is run to detect any illegal tampering in the given English text.
This process includes two core algorithms: watermark extraction and watermark detection. EW2_EWMA is produced from PETA and compared with EW2_WMPO by the detection process.
PETA is provided as input to this algorithm, and EW2_WMPA is its core output, as illustrated formally in
where, PETA: pre-processed attacked English text document, EW2_WMPA: attacked DWM.
EW2_WMPA and EW2_WMPO are provided as inputs to this algorithm. Its core output is the status of the given English text, which is either reliable or not reliable. The process is performed in two steps as follows:
Primary matching of EW2_WMPO and EW2_WMPA is performed first. If the two WM patterns are identical, the message “Given English text is reliable” is returned. Otherwise, the message “Given English text is not reliable” is returned and the process continues to the next step. Secondary matching is then performed by comparing each state's transitions across the entire generated watermark pattern. That is, each transition of a state in EW2_WMPA is compared with the corresponding transition in EW2_WMPO, as given by
where,
EW2_
where,
n: is a summation value of non zeros transitions. i: is the cumulative pattern matching rate of the word state.
The next step obtains the weight of each state stored in the Markov chain matrix, as illustrated in
where,
The final EW2_PMR of PETA and OETP is computed by
where, N: is summation of non-zeros in EW2_MM.
The watermark distortion rate reflects the volume of tampering attacks applied to the attacked English text; it is denoted by EW2_WDR and computed by
The WM detection algorithm is implemented as shown in
where EW2_SW is the weight value of the properly matched states.
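The two-step detection flow can be illustrated with the rough sketch below. The exact matching-rate and weighting equations are not reproduced in this text, so the sketch substitutes a simple count-based ratio: the share of original transition occurrences that still appear in the attacked pattern. That ratio, the relation WDR = 100 - PMR, and all function names are assumptions for illustration only, not the paper's formulas.

```python
def pattern_matching_rate(original, attacked):
    """Assumed stand-in for EW2_PMR: percentage of original transition counts still matched."""
    total = 0
    matched = 0
    for state, transitions in original.items():
        for nxt, count in transitions.items():
            if count == 0:
                continue                         # only non-zero transitions are compared
            total += count
            matched += min(count, attacked.get(state, {}).get(nxt, 0))
    return 100.0 * matched / total if total else 100.0


def detect(original, attacked):
    """Primary matching first; secondary (per-transition) matching only on mismatch."""
    if original == attacked:
        return "reliable", 100.0, 0.0
    pmr = pattern_matching_rate(original, attacked)
    wdr = 100.0 - pmr                            # assumed relation between EW2_WDR and EW2_PMR
    return "not reliable", pmr, wdr


original = {("brown", "fox"): {"jumps": 1, "who": 2}}
attacked = {("brown", "fox"): {"jumps": 1, "who": 1}}
status, pmr, wdr = detect(original, attacked)
print(status, round(pmr, 2), round(wdr, 2))
```

In this toy case one of the three original transition occurrences is lost, so the assumed matching rate is two thirds and the assumed distortion rate is one third.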
To validate the accuracy of INLPETWA, a self-developed program has been implemented, and several experiment and simulation scenarios have been performed, as explained in detail in the following subsections.
The INLPETWA approach is implemented as a self-developed, object-oriented PHP program using the VS Code IDE, in a modern computing environment.
The experimental and simulation metrics, and the values used to perform the experiments, are given in
Metric | Value |
---|---|
English dataset size | [ESST, 179], [EMST, 421], [EHMST, 559] and [ELST, 2018] |
Attack type | Insertion, deletion and reorder |
Attack volumes | 5%, 10%, 20% and 50% |
Robustness and tampering detection accuracy | H when close to 100; L when close to 0 |
EW2_PMR | H if EW2_PMR > 70; M if 40 < EW2_PMR < 70; L if EW2_PMR < 40 |
EW2_WDR | H if EW2_WDR > 70; M if 40 < EW2_WDR < 70; L if EW2_WDR < 40 |
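The H/M/L bands in the table can be expressed as a small helper, shown here as a sketch. The function name `level` is hypothetical, and since the table leaves the boundary values 40 and 70 unassigned, the choice to place 70 in M and 40 in L is an assumption.

```python
def level(rate: float) -> str:
    """Map an EW2_PMR or EW2_WDR percentage to the H / M / L bands of the table."""
    if rate > 70:
        return "H"
    if rate > 40:      # boundary handling (exactly 70 -> M, exactly 40 -> L) is assumed
        return "M"
    return "L"


for value in (90.45, 47.40, 34.34):
    print(value, level(value))
```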
The performance of INLPETWA refers to accuracy of robustness and tampering detection which is evaluated by using the following parameters.
Accuracy of tampering detection (EW2_PMR and EW2_WDR) is evaluated under four main attack volumes: very low (5%), low (10%), mid (20%), and high (50%). The desired tampering detection accuracy is close to 100%. The effects of text size, attack type, and attack volume on detection accuracy are compared for the proposed INLPETWA approach and the ZWAFWMMM and HNLPZWA baseline approaches.
The performance and accuracy of INLPETWA is compared with HNLPZWA (an intelligent hybrid of natural language processing and zero-watermarking approach) and ZWAFWMMM (Zero-Watermarking Approach based on Fourth level order of Word Mechanism of Markov Model) [
Approach | Attack types | Attack volumes | Dataset size |
---|---|---|---|
ZWAFWMMM | Insertion, deletion and reorder | 5%, 10%, 20% and 50% | Small, medium, and large |
HNLPZWA | Insertion, deletion and reorder | 5%, 10%, 20% and 50% | Small, medium, and large |
In this subsection, the performance of INLPETWA is evaluated. The character set covers all English characters, spaces, special symbols, and numbers. Simulations are performed on various dataset sizes and on various kinds and volumes of attacks, as shown above in
Various simulation scenarios have been conducted to test and evaluate the tampering detection accuracy of INLPETWA using all attack types and volumes, as shown in
Attack volume (%) | Insertion | Deletion | Reorder |
---|---|---|---|
5 | 94.47 | 92.14 | 84.76 |
10 | 90.13 | 85.37 | 74.21 |
20 | 82.07 | 74.06 | 58.85 |
50 | 65.30 | 42.56 | 34.34 |
The results in
The performance and tampering detection accuracy results of INLPETWA and the baseline approaches ZWAFWMMM and HNLPZWA are critically analyzed and compared, and their behavior is discussed under the major factors, i.e., attack volume, attack type, and dataset size.
Attack type | ZWAFWMMM | HNLPZWA | INLPETWA |
---|---|---|---|
Insertion | 80.02 | 71.28 | 82.99 |
Deletion | 66.25 | 59.99 | 73.53 |
Reorder | 44.88 | 37.23 | 63.04 |
Attack volume (%) | ZWAFWMMM | HNLPZWA | INLPETWA |
---|---|---|---|
5 | 83.60 | 82.09 | 90.45 |
10 | 74.33 | 72.74 | 83.24 |
20 | 59.39 | 57.71 | 71.66 |
50 | 37.56 | 13.66 | 47.40 |
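The INLPETWA columns of the two comparison tables above appear to be plain averages of the per-attack detection accuracies reported earlier: averaged over attack types for each volume, and over volumes for each attack type. The short check below recomputes those means from the published figures. The variable names are illustrative, and the small remaining gaps presumably come from rounding in the published values.

```python
# Detection accuracy of INLPETWA per attack volume (%) and attack type, as published.
detection = {
    5:  {"insertion": 94.47, "deletion": 92.14, "reorder": 84.76},
    10: {"insertion": 90.13, "deletion": 85.37, "reorder": 74.21},
    20: {"insertion": 82.07, "deletion": 74.06, "reorder": 58.85},
    50: {"insertion": 65.30, "deletion": 42.56, "reorder": 34.34},
}
# INLPETWA columns of the comparison tables, as published.
reported_by_volume = {5: 90.45, 10: 83.24, 20: 71.66, 50: 47.40}
reported_by_attack = {"insertion": 82.99, "deletion": 73.53, "reorder": 63.04}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


mean_by_volume = {v: mean(row.values()) for v, row in detection.items()}
mean_by_attack = {a: mean(row[a] for row in detection.values())
                  for a in reported_by_attack}

gaps = [abs(mean_by_volume[v] - reported_by_volume[v]) for v in reported_by_volume]
gaps += [abs(mean_by_attack[a] - reported_by_attack[a]) for a in reported_by_attack]
max_gap = max(gaps)
print(f"largest gap between recomputed and reported values: {max_gap:.3f}")
```

Every recomputed mean agrees with the reported figure to within 0.01 percentage points, which supports reading the comparison columns as averages of the detailed results.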
In this subsection, authors present an evaluation of the different dataset size effects on performance of INLPETWA, ZWAFWMMM and HNLPZWA approaches against all attack types under their different volumes as shown in
Dataset size | ZWAFWMMM | HNLPZWA | INLPETWA |
---|---|---|---|
[ESST] | 69.53 | 67.27 | 68.83 |
[EMST] | 68.13 | 63.80 | 78.72 |
[EHMST] | 65.11 | 59.23 | 73.15 |
[ELST] | 62.07 | 54.47 | 72.30 |
The comparative results as shown in
Based on the second-gram word mechanism of the hidden Markov model, a novel hybrid approach of NLP and English text zero-watermarking, abbreviated as INLPETWA, has been proposed in this paper by integrating soft computing and digital watermarking techniques. Soft computing and NLP are used in INLPETWA to analyze the text and find interrelationships between the content of the given English text and the generated watermark. The generated watermark is embedded logically in the original English text, without modifying the plain text or affecting its size. The hidden watermark is used in the next phase to detect illegal tampering in the received English text after it has been transmitted through the Internet. INLPETWA has been developed and implemented in PHP using the VS Code IDE. The experiments were performed on different standard English datasets using various rates of insertion, reorder, and deletion attacks. The experimental results show that INLPETWA is applicable to detecting tampering in English text. In future work, the authors intend to improve the performance using other mechanisms of the Markov model.