Text recognition is a crucial and challenging task that aims to translate a cropped text-instance image into a target string sequence. Recently, convolutional neural networks (CNNs) have been widely used for text recognition, as they can effectively capture semantic and structural information in text. However, most existing methods rely on contextual clues, and their accuracy drops when a single character must be recognized in isolation. For example, a conventional CNN struggles to distinguish "0" from "O" because the two are very similar in composition and structure. To solve this problem, we propose a novel neural network model, the Morphological Feature Aware Multi-CNN Model for Multilingual Text Recognition (MFAM-CNN). We introduce a contour extraction module that enriches the representation and can distinguish characters with similar shapes, and we design a self-adaptive text-density classification module that recognizes characters of different densities, improving per-character accuracy. Overall, the model is more sensitive to the global shape of the text, which improves the recognition rate of visually similar characters. To evaluate the effectiveness of our approach, we build a dataset containing Chinese characters, numbers, and letters, called the SC dataset. Extensive experiments on the SC and ARTI-TEXT datasets demonstrate that our model significantly improves performance, achieving accuracies of 98.03% and 97.77%, respectively.
Recent years have witnessed tremendous success in text recognition [
Most traditional methods solve this kind of problem without explicit feature extraction. After deep learning was introduced, many methods were proposed [
To solve these problems, sophisticated models are required to combine morphological features, semantic features, and cognitive computing into a complete system. In order to reduce the complexity of the system, Théodore Bluche et al. [
The proposed method was tested on two different datasets. The experimental results show that it significantly improves text-recognition performance. The main contributions of this study are summarized as follows: (1) A density-aware multi-CNN module is introduced to recognize text of different densities, enlarging the effective receptive field of the model; recognition of complicated, high-density text in particular gains a measurable improvement. (2) A text contour extraction module is proposed, in which a contour extraction algorithm extracts the maximum bounding box of the text; the resulting aspect ratios are then used to distinguish characters with similar appearance.
Section 2 reviews the development and main issues of text recognition, covering four representative lines of recent research. Our new method, MFAM-CNN, is presented in Section 3. Section 4 describes the experimental setup, results, and analysis. The last section draws conclusions.
In this section, we review approaches to recognition tasks in computer vision, with a focus on text detection and text recognition. Various methods have been proposed by researchers over the years.
Early work on text recognition judged characters by comparing pixel differences between the target and a template. In 1966, Casey et al. [
However, this template-matching approach only records the pixel information of the text and does not generalize to characters of different sizes, fonts, and colors; it also relies on robust image pre-processing. After 2000, the Support Vector Machine (SVM) became a common classification tool for text recognition. An SVM separates samples with a hyperplane and is grounded in VC-dimension theory and the structural-risk-minimization principle of statistical learning theory. Usually, feature-extraction techniques first extract features from the target, and the result is fed into the SVM for classification. Because the SVM fits nonlinear decision functions adaptively, it became a very popular technique in machine learning and at one time achieved state-of-the-art results in various image-classification competitions.
Gao et al. [
Although the support vector machine greatly improved text recognition rates, its image features still had to be extracted manually. To extract high-dimensional features automatically, neural networks were introduced, achieving image-classification performance far beyond that of the SVM.
Wu et al. [
In 2018, a method for recognizing Chinese characters based on time series was proposed [
However, for scenes where multiple characters appear at the same time, no model yet extracts features uniformly and recognizes them efficiently. The main reason is that it is difficult for a single model to perform uniform feature extraction across different languages, so improving the feature-extraction part of the model is important.
Bissacco et al. [
None of the recognition methods mentioned above addresses how to increase recognition accuracy when similar characters from different languages appear at the same time. Lyu et al. [
Although many methods have been proposed for text-image recognition, Chinese character recognition remains challenging. An important reason is that Chinese characters span so many categories that comprehensive general-purpose datasets are hard to find [
In contrast to the efforts demonstrated above, our method shows that it is possible to fuse the morphological features with the density-aware algorithm to recognize the characters.
As illustrated in
When we recognize text, we often treat the overall shape ratio of a character as an important reference, for example the digit "0", the uppercase letter "O", and the lowercase letter "o". These three characters look alike, but compared with the letter "O", the digit "0" is noticeably more slender. Once we observe this difference in ratio, we can distinguish them easily. Moreover, when the text is corrupted by noise, its aspect ratio changes little, so the ratio also supports reasonable guesses about unclear characters.
Based on this, we use a contour extraction algorithm to extract the maximum bounding box of the text; a preliminary statistical study of the average aspect ratio of different character types is shown in the
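To make the bounding-box step concrete, here is a minimal pure-Python sketch; the paper does not specify its exact contour-extraction algorithm, so a simple projection-based bounding box stands in for it, and the function name and toy glyph are our own illustration:

```python
def aspect_ratio(glyph):
    """Height/width ratio of the tightest bounding box around the
    nonzero pixels of a binary glyph (a list of rows of 0/1 ints)."""
    rows = [i for i, row in enumerate(glyph) if any(row)]
    cols = [j for j in range(len(glyph[0])) if any(row[j] for row in glyph)]
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return height / width

# A slender "0"-like blob: its bounding box is 8 rows tall, 2 columns wide.
glyph = [[0] * 10 for _ in range(12)]
for i in range(2, 10):
    glyph[i][4] = glyph[i][5] = 1
```

Under this height-over-width convention, Latin letters and digits yield ratios well above 1, while square-shaped Chinese characters sit near 1.0, consistent with the statistics in the table below.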
In this paper, we take the Heiti ("black body") typeface as an example and compute the aspect ratios of 3000 Chinese characters, 10 numbers, and 52 English uppercase and lowercase letters. The statistical results are shown in the
Character type | Average aspect ratio |
---|---|
English | 1.903735 |
Number | 1.860185 |
Chinese | 1.004891 |
The complexity of different texts varies widely: a Chinese character may consist of a single stroke, such as "一", or be relatively complex, such as "饕". For texts with simple structures, a deeper network may slow recognition unnecessarily. Therefore, for texts of different complexity, we pre-classify them with an adaptive classification method and train networks of different structures for each class, speeding up recognition and increasing recognition accuracy.
We assume that the maximum bounding box of the text is w × h and that M is the number of text pixels inside that box. The density P of the text relative to the maximum bounding box is then: P = M / (w × h).
We scale up the maximum bounding box until the larger of w and h reaches 128, obtaining M' under the scaled size w' × h'. For example, if w is the larger side, then: w' = 128, h' = h × (128 / w), and the pixel count rescales with the area, M' = M × (128 / w)².
Finally, we use M' and the normalized uniform size of 128 × 128 to compute the final relative density P': P' = M' / (128 × 128).
We counted the text density of 3,000 commonly used Chinese characters, 10 numbers, and 52 English letters. The results are shown in the
It can be seen from the statistics that most text densities are concentrated in the interval from 0.3 to 0.5. We therefore set two thresholds, th1 = 0.35 and th2 = 0.4. The total number of characters in each of the three resulting intervals is shown in the
Interval | Quantity |
---|---|
[0, th1] | 621 |
[th1, th2] | 997 |
[th2, 1] | 1444 |
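Routing a character to one of the three branch networks by these thresholds is then a two-comparison lookup; here boundary values are assigned to the lower branch, a detail the paper leaves open:

```python
def select_branch(density, th1=0.35, th2=0.40):
    """Map a relative density P' to one of the three branch networks."""
    if density <= th1:
        return "low"       # interval [0, th1]
    if density <= th2:
        return "medium"    # interval (th1, th2]
    return "high"          # interval (th2, 1]
```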
And
In addition, to prevent characters near a threshold boundary from being misclassified because of noise, when the recognition probability from the model of the current density interval falls below a threshold T, we also run the model of the neighbouring interval. If that model's recognition probability exceeds T, we take its result; otherwise, we multiply the recognition probabilities of both models by a penalty coefficient PC and then decide between them. The specific process steps are shown in the
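The fallback rule can be sketched as follows. Here `primary` and `secondary` stand for the density branch chosen first and its nearest neighbour, and the values of T and PC are illustrative, since the paper does not fix them:

```python
def classify_with_fallback(primary, secondary, image, T=0.9, PC=0.8):
    """Two-stage decision: accept a branch's answer only if it is
    confident (probability >= T); otherwise consult the neighbouring
    branch, and as a last resort compare both PC-penalised scores."""
    label1, p1 = primary(image)
    if p1 >= T:
        return label1
    label2, p2 = secondary(image)
    if p2 >= T:
        return label2
    # Penalise both low-confidence scores by PC, then pick the larger.
    return label1 if p1 * PC >= p2 * PC else label2
```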
To help the network distinguish similar characters across languages, we feed the contour of the text into the network as an additional feature, fuse it with the image features of the text itself, and train the recognition model. Here, let the character contour feature be F, the text image be I, and the jth convolution kernel of the ith convolutional layer be kij. The feature map FM1j produced by convolving image I in the first layer can then be expressed as: FM1j = σ(I ∗ k1j + b1j), where ∗ denotes convolution, b1j is the bias, and σ is the activation function.
Finally, we flatten the j feature maps in the
Likewise, the contour feature F is flattened in the same way before being fused.
All three convolutional neural network structures used in this paper fuse the contour features of the text with the features of the text image in this way.
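The fusion described above amounts to flattening every feature map, flattening the contour feature, and concatenating the results into the combine layer's input. A minimal sketch, with nested lists standing in for tensors:

```python
def flatten(t):
    """Recursively flatten a nested-list feature map into a 1-D list."""
    if not isinstance(t, list):
        return [t]
    out = []
    for x in t:
        out.extend(flatten(x))
    return out

def fuse(feature_maps, contour):
    """Concatenate the flattened conv feature maps FM1..FMj with the
    flattened contour feature F, as the combine layer does."""
    fused = []
    for fm in feature_maps:
        fused.extend(flatten(fm))
    fused.extend(flatten(contour))
    return fused
```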
As shown in the
Input source | Input shape |
---|---|
Input | 128 × 128 × 3 |
Contour | 128 × 128 × 1 |
At the input of the text image, we use 4 convolutional layers and 4 pooling layers to extract the image features. The input and output parameters of the specific convolutional layer and the pooling layer are shown in the
Layer | Input | Output shape |
---|---|---|
Convolutional layer 1 | 128 × 128 × 3 | 128 × 128 × 16 |
Pooling layer 1 | 128 × 128 × 16 | 64 × 64 × 16 |
Convolutional layer 2 | 64 × 64 × 16 | 64 × 64 × 32 |
Pooling layer 2 | 64 × 64 × 32 | 32 × 32 × 32 |
Convolutional layer 3 | 32 × 32 × 32 | 32 × 32 × 64 |
Pooling layer 3 | 32 × 32 × 64 | 16 × 16 × 64 |
Convolutional layer 4 | 16 × 16 × 64 | 16 × 16 × 128 |
Pooling layer 4 | 16 × 16 × 128 | 8 × 8 × 128 |
At the contour input, features are extracted using a flatten layer and a fully connected layer. The specific parameters are as shown in the
Layer | Input shape | Output shape |
---|---|---|
Flatten layer | 128 × 128 × 1 | 16384 |
Fully connect layer | 16384 | 1024 |
After that, we merge the two features in a combine layer, followed by two fully connected layers. The parameters are shown in the
Layer | Input shape | Output shape |
---|---|---|
Combine layer | 1024/8192 | 9216 |
Fully connect layer 1 | 9216 | 4096 |
Fully connect layer 2 | 4096 | 621 |
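The tables above can be sanity-checked with a few lines of shape bookkeeping, assuming 'same'-padded convolutions and 2 × 2 pooling (both assumptions of ours; the paper does not state kernel sizes):

```python
def conv(shape, out_channels):
    """'Same'-padded convolution: spatial size kept, channels change."""
    h, w, _ = shape
    return (h, w, out_channels)

def pool(shape):
    """2 x 2 pooling: spatial dimensions halved."""
    h, w, c = shape
    return (h // 2, w // 2, c)

# Low-density image branch: 4 conv + 4 pool layers.
shape = (128, 128, 3)
for ch in (16, 32, 64, 128):
    shape = pool(conv(shape, ch))

image_features = shape[0] * shape[1] * shape[2]  # flattened: 8 * 8 * 128
contour_features = 1024                          # FC output on the contour
combine_width = image_features + contour_features
```

`combine_width` comes out to 9216, matching the combine-layer row; the 256-channel trunks of the other branches check out the same way (8 × 8 × 256 = 16384, plus 4096 from the contour FC, gives 20480).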
Layer | Input | Output shape |
---|---|---|
Convolutional layer 1 | 128 × 128 × 3 | 128 × 128 × 32 |
Convolutional layer 2 | 128 × 128 × 32 | 128 × 128 × 32 |
Pooling layer 1 | 128 × 128 × 32 | 64 × 64 × 32 |
Convolutional layer 3 | 64 × 64 × 32 | 64 × 64 × 64 |
Convolutional layer 4 | 64 × 64 × 64 | 64 × 64 × 64 |
Pooling layer 2 | 64 × 64 × 64 | 32 × 32 × 64 |
Convolutional layer 5 | 32 × 32 × 64 | 32 × 32 × 128 |
Convolutional layer 6 | 32 × 32 × 128 | 32 × 32 × 128 |
Pooling layer 3 | 32 × 32 × 128 | 16 × 16 × 128 |
Convolutional layer 7 | 16 × 16 × 128 | 16 × 16 × 256 |
Convolutional layer 8 | 16 × 16 × 256 | 16 × 16 × 256 |
Pooling layer 4 | 16 × 16 × 256 | 8 × 8 × 256 |
At the same time, we modified the parameters of the contour feature extractor and the classifier. The specific parameters are described in
Layer | Input shape | Output shape |
---|---|---|
Flatten layer | 128 × 128 × 1 | 16384 |
Fully connect layer | 16384 | 4096 |
Layer | Input shape | Output shape |
---|---|---|
Combine layer | 4096/16384 | 20480 |
Fully connect layer 1 | 20480 | 4096 |
Fully connect layer 2 | 4096 | 997 |
Layer | Input shape | Output shape |
---|---|---|
Combine layer | 4096/16384 | 20480 |
Fully connect layer 1 | 20480 | 4096 |
Fully connect layer 2 | 4096 | 1444 |
In this experiment, we took 3000 printed Chinese characters in the Heiti typeface, 52 English uppercase and lowercase letters, and 10 numbers as the training set, and applied extensive morphology-based random processing to it, including rotation, translation, erosion, dilation, and random noise. Some training samples are shown in the
In our previous work [
All images are 128 × 128 × 3. For each class, we randomly generated 100 samples, of which 80 are used for training and 20 for testing.
All three networks were trained with the Adam optimizer, using a learning rate of 0.01, a momentum term of 0.9, and a learning-rate decay of 10⁻⁶.
All three networks were trained on a K80 GPU. Training the text-type recognition networks for 200 epochs each took 5 h for the low-density network, 8 h for the medium-density network, and 12 h for the high-density network; the training loss and accuracy curves are shown in
As shown in
After training the branch models, we merged them into a single system and tested it using the three test sets.
For comparison, we used the same training set to train network structures commonly used in CNN recognition models, namely LeNet, AlexNet, and VGG, and recorded their accuracy. All models were trained for 200 epochs. We split the test data into three sets to evaluate all trained models; the recognition accuracies are shown in the
Network | Test dataset 1 | Test dataset 2 | Test dataset 3 |
---|---|---|---|
LeNet | 0.9729 | 0.9672 | 0.9714 |
AlexNet | 0.9596 | 0.9693 | 0.9653 |
VGG | 0.9782 | 0.9802 | 0.9786 |
MFAM-CNN | 0.9775 | 0.9837 | 0.9803 |
To test the practicality of the proposed method, we trained the above four networks on Chinese characters together with English uppercase and lowercase letters and numbers rendered in the Bradley Hand ITC font. All datasets were produced with the same self-made pipeline, applying the series of image-processing methods used for the training set to increase sample diversity. The sizes of the mixed training and test sets are shown in the
Dataset | English | Number | Chinese |
---|---|---|---|
Train set | 4160 | 800 | 240000 |
Test set 1 | 520 | 100 | 30000 |
Test set 2 | 520 | 100 | 30000 |
Test set 3 | 520 | 100 | 30000 |
We used the above training set to train the four networks and tested them with the three test sets. The recognition accuracies are shown in the
Network | Test dataset 1 | Test dataset 2 | Test dataset 3 |
---|---|---|---|
LeNet | 0.9422 | 0.9393 | 0.9446 |
AlexNet | 0.9433 | 0.9375 | 0.9431 |
VGG | 0.9483 | 0.9441 | 0.9482 |
MFAM-CNN |
We also trained LeNet, AlexNet, VGG, and MFAM-CNN on the ARTI-TEXT training set. The recognition accuracy is shown in the
Network | Test dataset |
---|---|
LeNet | 0.9733 |
AlexNet | 0.9595 |
VGG | 0.9768 |
MFAM-CNN |
In this paper, we proposed MFAM-CNN, an adaptive multi-convolutional-neural-network text recognition model that incorporates morphological features, fully integrating them with modern deep learning techniques. The proposed model can be used like any other CNN recognition model. Exploiting the different heights and widths of different character types, we compute the character contour and density, and feed the contour information into the model to raise the recognition rate of similar characters. At the same time, we classify all characters by text density against preset thresholds and recognize different texts with different network structures, which increases both the efficiency and the accuracy of recognition. Comparison with other recognition models shows that the proposed model performs better on our benchmarks, while extracting high-quality text features from images with little latency.
In the future, we plan to investigate the advantages and disadvantages of the self-adaptive classification algorithm at the core of our approach, and to extend the model by adjusting its parameters or structure to achieve performance comparable to the state of the art.