
  • Open access
  • Published: 23 March 2020

Deep-learning-based image segmentation integrated with optical microscopy for automatically searching for two-dimensional materials

  • Satoru Masubuchi   ORCID: orcid.org/0000-0001-7039-6694 1 ,
  • Eisuke Watanabe 1 ,
  • Yuta Seo 1 ,
  • Shota Okazaki 2 ,
  • Takao Sasagawa 2 ,
  • Kenji Watanabe   ORCID: orcid.org/0000-0003-3701-8119 3 ,
  • Takashi Taniguchi 1 , 3 &
  • Tomoki Machida 1  

npj 2D Materials and Applications volume 4, Article number: 3 (2020)


  • Materials science
  • Nanoscience and technology

Deep-learning algorithms enable precise image recognition based on high-dimensional hierarchical image features. Here, we report the development and implementation of a deep-learning-based image segmentation algorithm in an autonomous robotic system to search for two-dimensional (2D) materials. We trained the neural network based on Mask-RCNN on annotated optical microscope images of 2D materials (graphene, hBN, MoS 2 , and WTe 2 ). The inference algorithm runs on a 1024 × 1024 px 2 optical microscope image in 200 ms, enabling the real-time detection of 2D materials. The detection process is robust against changes in the microscopy conditions, such as illumination and color balance, which obviates the parameter-tuning process required for conventional rule-based detection algorithms. Integrating the algorithm with a motorized optical microscope enables the automated searching and cataloging of 2D materials. This development will allow researchers to utilize a large number of 2D materials simply by exfoliating them and running the automated searching process. To facilitate research, we make the training codes, dataset, and model weights publicly available.


Introduction

The recent advances in deep-learning technologies based on neural networks have led to the emergence of high-performance algorithms for interpreting images, such as object detection 1 , 2 , 3 , 4 , 5 , semantic segmentation 4 , 6 , 7 , 8 , 9 , 10 , instance segmentation 11 , and image generation 12 . As neural networks can learn the high-dimensional hierarchical features of objects from large sets of training data 13 , deep-learning algorithms can acquire a high generalization ability to recognize images, i.e., they can interpret images that they have not been shown before, which is one of the traits of artificial intelligence 14 . Soon after the success of deep-learning algorithms in general scene recognition challenges 15 , attempts at automation began for imaging tasks that are conducted by human experts, such as medical diagnosis 16 and biological image analysis 17 , 18 . However, despite significant advances in image recognition algorithms, the implementation of these tools for practical applications remains challenging 18 because of the unique requirements for developing deep-learning algorithms that necessitate the joint development of hardware, datasets, and software 18 , 19 .

In the field of two-dimensional (2D) materials 20 , 21 , 22 , the recent advent of autonomous robotic assembly systems has enabled high-throughput searching for exfoliated 2D materials and their subsequent assembly into van der Waals heterostructures 23 . These developments were bolstered by an image recognition algorithm for detecting 2D materials on SiO 2 /Si substrates 23 , 24 ; however, current implementations have been developed within the framework of conventional rule-based image processing 25 , 26 , which uses traditional handcrafted image features, such as color contrast, edges, and entropy 23 , 24 . Although these algorithms are computationally inexpensive, the detection parameters need to be adjusted by experts, with retuning required when the microscopy conditions change. To perform the parameter tuning in conventional rule-based algorithms, one has to manually find at least one sample flake on the SiO 2 /Si substrate every time 2D flakes are exfoliated. Since the exfoliated flakes are sparsely distributed on the SiO 2 /Si substrate, e.g., 3–10 thin flakes on a 1 × 1 cm 2 SiO 2 /Si substrate for MoS 2 23 , manually finding a flake and tuning the parameters requires at least 30 min. The time spent on the parameter-tuning process causes degradation of some two-dimensional materials, such as Bi 2 Sr 2 CaCu 2 O 8+δ 27 , even in a glovebox enclosure.
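For context, the following is a minimal sketch of the kind of rule-based color-contrast detection referred to above, not the authors' implementation. The HSV window and minimum-area threshold are hypothetical values; it is exactly these hand-tuned numbers that must be readjusted whenever the illumination or color balance changes.

```python
# Hypothetical rule-based flake detector: threshold a color-contrast window and
# keep sufficiently large connected regions. All thresholds are illustrative.
import cv2
import numpy as np

HSV_LO = np.array([100, 20, 80])     # hand-tuned for one illumination setting
HSV_HI = np.array([140, 120, 200])
MIN_AREA_PX = 500                    # discard small residue/particles

def detect_flakes_rule_based(bgr_image):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, HSV_LO, HSV_HI)   # pixels inside the color window
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > MIN_AREA_PX]
```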

In contrast, deep-learning algorithms for detecting 2D materials are expected to be robust against changes in optical microscopy conditions, and the development of such an algorithm would provide a generalized 2D material detector that does not require fine-tuning of the parameters. In general, deep-learning algorithms for interpreting images are grouped into two categories 28 . Fully convolutional approaches employ an encoder–decoder architecture, such as SegNet 7 , U-Net 8 , and SharpMask 29 . In contrast, region-based approaches employ feature extraction by a stack of convolutional neural networks (CNNs), such as Mask-RCNN 11 , PSP Net 30 , and DeepLab 10 . In general, the region-based approaches outperform the fully convolutional approaches for most image segmentation tasks when the networks are trained on a sufficiently large number of annotated datasets 11 .

In this work, we implemented and integrated deep-learning algorithms with an automated optical microscope to search for 2D materials on SiO 2 /Si substrates. The neural network architecture based on Mask-RCNN enabled the detection of exfoliated 2D materials while generating a segmentation mask for each object. Transfer learning from the network trained on the Microsoft common objects in context (COCO) dataset 31 enabled the development of a neural network from a relatively small (~2000 optical microscope images) dataset of 2D materials. Owing to the generalization ability of the neural network, the detection process is robust against changes in the microscopy conditions. These properties could not be realized using conventional rule-based image recognition algorithms. To facilitate further research, we make the source codes for network training, the model weights, the training dataset, and the optical microscope drivers publicly available. Our implementation can be deployed on optical microscopes other than the instrument utilized in this study.

System architectures and functionalities

A schematic diagram of our deep-learning-assisted optical microscopy system is shown in Fig. 1a , with photographs shown in Fig. 1b, c . The system comprises three components: (i) an autofocus microscope with a motorized XY scanning stage (Chuo Precision); (ii) a customized software pipeline to capture the optical microscope image, run the deep-learning algorithms, display the results, and record them in a database; and (iii) a set of trained deep-learning algorithms for detecting 2D materials (graphene, hBN, MoS 2 , and WTe 2 ). By combining these components, the system can automatically search for 2D materials exfoliated on SiO 2 /Si substrates (Supplementary Movies 1 and 2 ). When 2D flakes are detected, their positions and shapes are stored in a database (a sample record is presented in the Supplementary Information), which can be browsed and utilized to assemble van der Waals heterostructures with a robotic system 23 . The key component developed in this study was the set of trained deep-learning algorithms for detecting 2D materials in the optical microscope images. Algorithm development required three steps, namely, preparation of a large dataset of annotated optical microscope images, training of the deep-learning algorithm on the dataset, and deployment of the algorithm to run inference on optical microscope images.

Figure 1

a Schematic of the deep-learning-assisted optical microscope system. The optical microscope acquires an image of exfoliated 2D crystals on a SiO 2 /Si substrate. The images are input into the deep-learning inference algorithm. The Mask-RCNN architecture generates a segmentation mask, bounding boxes, and class labels. The inference data and images are stored in a cloud database, which forms a searchable database. The customized computer-assisted-design (CAD) software enables browsing of 2D crystals, and designing of van der Waals heterostructures. b , c Photographs of ( b ) the optical microscope and ( c ) the computer screen for deep-learning-assisted automated searching. d – k Segmentation of 2D crystals. Optical microscope images of ( d ) graphene, ( e ) hBN, ( f ) WTe 2 , and ( g ) MoS 2 on SiO 2 (290 nm)/Si. h – k Inference results for the optical microscope images in d – g , respectively. The segmentation masks and bounding boxes are indicated by polygons and dashed squares, respectively. In addition, the class labels and confidences are displayed. The contaminating objects, such as scotch tape residue, particles, and corrugated 2D flakes, are indicated by the white arrows in e , f , i , and j . The scale bars correspond to 10 µm.

The deep-learning model we employed was Mask-RCNN 11 (Fig. 1a ), which predicts objects, bounding boxes, and segmentation masks in images. When an image is input into the network, the deep convolutional network ResNet101 32 extracts position-aware high-dimensional features. These features are passed to the region proposal network (RPN) and the region of interest alignment network (ROI Align), which propose candidate regions where the targeted objects are located. A fully connected network then performs classification (Class) and bounding-box regression (BBox) for the detected objects. Finally, a convolutional network generates segmentation masks for the objects using the output of the ROI Align layer. This model was developed on the Keras/TensorFlow framework 33 , 34 , 35 .
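For illustration, the sketch below shows how inference with this architecture might look when using the Keras/TensorFlow Mask R-CNN implementation cited above (ref. 33 ). The class layout, weight-file name, and image path are assumptions, not the released configuration.

```python
# Minimal inference sketch based on the Matterport Mask R-CNN (Keras/TensorFlow) package.
import skimage.io
from mrcnn.config import Config
import mrcnn.model as modellib

class InferenceConfig(Config):
    NAME = "2dmaterials"
    NUM_CLASSES = 1 + 3          # background + "mono" / "few" / "thick" (assumed)
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    IMAGE_MIN_DIM = 1024
    IMAGE_MAX_DIM = 1024         # images are resized/padded to 1024 x 1024 px

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="./logs")
model.load_weights("mask_rcnn_graphene.h5", by_name=True)   # hypothetical weight file

image = skimage.io.imread("exfoliated_chip.png")            # RGB optical microscope image
r = model.detect([image], verbose=0)[0]
# r["rois"]: bounding boxes, r["class_ids"]: thickness classes,
# r["scores"]: confidences, r["masks"]: per-object segmentation masks
```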

To train the Mask-RCNN model, we prepared annotated images and trained the networks as follows. In general, the performance of a deep-learning network is known to scale with the size of the dataset 36 . To collect a large set of optical microscope images containing 2D materials, we exfoliated graphene (a covalent material), MoS 2 (a 2D semiconductor), WTe 2 , and hBN crystals onto SiO 2 /Si substrates. Using the automated optical microscope, we collected ~2100 optical microscope images containing graphene, MoS 2 , WTe 2 , and hBN flakes. The images were annotated manually using a web-based labeling tool 37 . The training was performed by the stochastic gradient descent method described later in this paper.

We now show the inference results for optical microscope images containing 2D materials. Figure 1d–g shows optical microscope images of graphene, hBN, WTe 2 , and MoS 2 flakes, which were input into the neural network. The inference results shown in Fig. 1h–k consist of bounding boxes (colored squares), class labels (text), confidences (numbers), and masks (colored polygons). For the layer thickness classification, we defined three categories: “mono” (1 layer), “few” (2–10 layers), and “thick” (10–40 layers). Note that this categorization was sufficient for practical use in the first screening process because final verification of the layer thickness can be conducted either by manual inspection or by a computational post-process, such as color contrast analysis 24 , 38 , 39 , 40 , 41 , 42 , which could be interfaced with the deep-learning algorithms in future work. As indicated in Fig. 1h–k , the 2D flakes are detected by the Mask-RCNN, and the segmentation mask exhibits good overlap with the 2D flakes. The layer thickness was also correctly classified, with monolayer graphene classified as “mono”. The detection process is robust against contaminating objects, such as scotch tape residue, particles, and corrugated 2D flakes (white arrows, Fig. 1e, f, i, j ).

As the neural network locates 2D crystals using the high-dimensional hierarchical features of the image, the detection results were unchanged when the illumination conditions were varied (Supplementary Movie 3 ). Figure 2a–c shows the deep-learning detection of graphene flakes under differing illumination intensities ( I ). For comparison, the results obtained using conventional rule-based detection are presented in Fig. 2d–f . For the deep-learning case, the results were not affected by changing the illumination intensity from I  = 220 (a) to 180 (b) or 90 (c) (red, blue, and green curves, Fig. 2 ). In contrast, with rule-based detection, a slight decrease in the light intensity from I  = 220 (d) to 200 (e) affected the results significantly, and the graphene flakes became undetectable. Further decreasing the illumination intensity to I  = 180 (f) resulted in no objects being detected. These results demonstrate the robustness of the deep-learning algorithms over conventional rule-based image processing for detecting 2D flakes.

Figure 2

Input image and inference results under illumination intensities of I  =  a 220, b 180, and c 90 (arb. unit) for deep-learning detection, and I  =  d 220, e 210, and f 180 (arb. unit) for rule-based detection. The scale bars correspond to 10 µm.

The deep-learning model was integrated with the motorized optical microscope by developing a customized software pipeline using C++ and Python. We employed a server/client architecture to integrate the deep-learning inference algorithms with the conventional optical microscope (Supplementary Fig. 1 ). The image captured by the optical microscope is sent to the inference server, and the inference results are sent back to the client computer. The deep-learning model runs on a graphics-processing unit (NVIDIA Tesla V100) in 200 ms per image. Including the overheads for capturing images, transferring image data, and displaying inference results, frame rates of ~1 fps were achieved. To investigate the applicability of the deep-learning inference to searching for 2D crystals, we selected WTe 2 crystals as a testbed because the exfoliation yields of transition metal dichalcogenides are significantly smaller than that of graphene. We exfoliated WTe 2 crystals onto 1 × 1 cm 2 SiO 2 /Si substrates and then conducted the search, which was completed in 1 h using a ×50 objective lens. The search identified ~25 WTe 2 flakes with various thicknesses (1–10 layers) on the 1 × 1 cm 2 SiO 2 /Si substrate (Supplementary Fig. 2 ).
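To make the server/client split concrete, a minimal hypothetical client is sketched below. The endpoint URL, JSON field names, and response layout are assumptions for illustration, not the released implementation.

```python
# Hypothetical client: send a captured microscope image to the inference server
# and print the returned detections.
import base64
import requests

SERVER_URL = "http://inference-server:5000/detect"   # assumed endpoint

def detect_flakes(image_path):
    with open(image_path, "rb") as f:
        payload = {"image": base64.b64encode(f.read()).decode("ascii")}
    resp = requests.post(SERVER_URL, json=payload, timeout=30)
    resp.raise_for_status()
    # Assumed response format: [{"class": "...", "score": 0.97, "bbox": [x, y, w, h]}, ...]
    return resp.json()

if __name__ == "__main__":
    for det in detect_flakes("chip_0001.png"):
        print(det["class"], det["score"], det["bbox"])
```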

To quantify the performance of the Mask-RCNN detection process, we manually checked over 2300 optical microscope images; the detection metrics are summarized in Supplementary Table 1 . Here, we defined a true- or false-positive detection (TP or FP) according to whether or not an optical microscope image with a positive detection contained at least one correctly detected 2D crystal (examples are presented in Supplementary Figs 2 – 7 ). An image in which a 2D crystal was present but not correctly detected was considered a false negative (FN). Based on these definitions, the precision was TP/(TP + FP) ~0.53, which implies that over half of the optical microscope images with a positive detection contained WTe 2 crystals. Notably, the recall (TP/(TP + FN) ~0.93) was high. In addition, the examples of false-negative detection contained only small fractured WTe 2 crystals, which cannot be utilized for assembling van der Waals heterostructures. These results imply that the deep-learning-based detection process does not miss usable 2D crystals. This property is favorable for the practical application of deep-learning algorithms to searching for 2D crystals, as exfoliated 2D crystals are usually sparsely distributed over SiO 2 /Si substrates. In this case, false-positive detection is less problematic than missing 2D crystals (false negatives), and the screening of the results can be performed by a human operator without significant intervention 43 . In the case of graphene (Supplementary Table 1 ), both the precision and recall were high (~0.95 and ~0.97, respectively), which implies excellent performance of the deep-learning algorithm for detecting 2D crystals. We speculate that this difference arises from the different exfoliation yields of graphene and WTe 2 , because the mean average precision at an intersection-over-union threshold of 50% (mAP@IoU 50%) with respect to the annotated dataset (see preparation methods below) does not differ significantly between the two materials (0.49 for graphene and 0.52 for WTe 2 ). As demonstrated above, these values are sufficiently high for successful application to searches for 2D crystals. These results indicate that the deep-learning inference can be practically utilized to search for 2D crystals.
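The image-level bookkeeping above reduces to simple ratios; the short sketch below illustrates it with hypothetical counts chosen to be of the same order as the reported WTe 2 search.

```python
# Image-level precision/recall from manually assigned TP/FP/FN counts.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, roughly reproducing precision ~0.53 and recall ~0.93:
p, r = precision_recall(tp=25, fp=22, fn=2)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```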

Model training

The Mask-RCNN model was trained on a dataset for which Fig. 3a shows representative annotated images and Fig. 3b shows the annotation metrics. The dataset comprises 353 (hBN), 862 (graphene), 569 (MoS 2 ), and 318 (WTe 2 ) images. The numbers of annotated objects were 456 (hBN), 4805 (graphene), 839 (MoS 2 ), and 1053 (WTe 2 ). The annotations were converted to a JSON format compatible with the Microsoft COCO dataset using our customized scripts written in Python. Finally, the annotated dataset was randomly divided into training and test datasets in an 8:2 ratio. To train the model on the annotated dataset, we utilized the multitask loss function defined in refs 11 , 33 ,

$$L = \alpha L_{{\mathrm{cls}}} + \beta L_{{\mathrm{box}}} + \gamma L_{{\mathrm{mask}}},$$

where L cls , L box , and L mask are the classification, localization, and segmentation mask losses, respectively, and α – γ are the control parameters for tuning the balance between the losses, set as ( α , β , γ ) = (0.6, 1.0, 1.0). The classification loss was

$$L_{{\mathrm{cls}}} = - \log p_u,$$

where p  = ( p 0 , …, p k ) is the probability distribution over classes for each region of interest whose classification result is u . The bounding box loss L box is defined as

$$L_{{\mathrm{box}}} = \mathop {\sum}\limits_{i \in \{ x,y,w,h\} } {{\mathrm{smooth}}_{L_1}\left( {t_i^u - v_i} \right)},$$

where t u is the predicted bounding-box regression target for class u , v is the ground-truth box, and \({\mathrm{smooth}}_{L_1}\left( x \right) = \left\{ {\begin{array}{l} {0.5x^2,\,\left| x \right| < 1} \\ {\left| x \right| - 0.5,\,{\mathrm{otherwise}}} \end{array}} \right.\) is a smooth L 1 loss. The mask loss L mask was defined as the average binary cross-entropy loss:

$$L_{{\mathrm{mask}}} = - \frac{1}{{m^2}}\mathop {\sum}\limits_{1 \le i,j \le m} {\left[ {y_{ij}\log \hat y_{ij}^k + \left( {1 - y_{ij}} \right)\log \left( {1 - \hat y_{ij}^k} \right)} \right]},$$

where y ij is the binary mask value at ( i , j ) in an ROI of ( m  ×  m ) size on the ground truth mask of class k , and \(\hat y_{ij}^k\) is the predicted value of the same cell.
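For concreteness, the following is a minimal NumPy sketch of the multitask loss above, with the class handling and tensor shapes simplified relative to the actual Mask-RCNN implementation.

```python
# Simplified multitask loss: L = alpha*L_cls + beta*L_box + gamma*L_mask.
import numpy as np

def smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x**2, x - 0.5)

def mask_loss(y_true, y_pred, eps=1e-7):
    # Average binary cross-entropy over an m x m mask of the ground-truth class.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def multitask_loss(p, u, t_u, v, mask_true, mask_pred,
                   alpha=0.6, beta=1.0, gamma=1.0):
    l_cls = -np.log(p[u])                    # classification loss for true class u
    l_box = np.sum(smooth_l1(t_u - v))       # bounding-box regression loss
    l_mask = mask_loss(mask_true, mask_pred) # segmentation mask loss
    return alpha * l_cls + beta * l_box + gamma * l_mask
```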

Figure 3

a Examples of annotated datasets for graphene (G), hBN, WTe 2 , and MoS 2 . b Training data metrics. c Schematic of the training procedure. d Learning curves for training on the dataset. The network weights were initialized by the model weights pretrained on the MS-COCO dataset. Solid (dotted) curves are test (train) losses. Training was performed either with (red curve) or without (blue curve) augmentation.

Instead of training the model from scratch, the model weights, except for the network heads, were initialized using those obtained by pretraining on a large-scale object segmentation dataset of general scenes, i.e., the MS-COCO dataset 31 . The remaining parts of the network weights were initialized using random values. The optimization was conducted using stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0001. Each training epoch consisted of 500 iterations. The training comprised four stages, each lasting 30 epochs (Fig. 3c ). For the first two training stages, the learning rate was set to 10 −3 . The learning rate was decreased to 10 −4 and 10 −5 for the last two stages. In the first stage, only the network heads were trained (top row, Fig. 3c ). Next, the parts of the backbone starting at layer 4 were optimized (second row, Fig. 3c ). In the third and fourth stages, the entire model (backbone and heads) was trained (third and fourth rows, Fig. 3c ). The training took 12 h using four GPUs (NVIDIA Tesla V100 with 32-GB memory). To increase the effective size of the training dataset, we used data augmentation techniques, including color channel multiplication, rotation, horizontal/vertical flips, and horizontal/vertical shifts. These operations were applied online to the training data with random probability to reduce disk usage (examples of the augmented data are presented in Supplementary Figs 8 and 9 ). Before being fed to the Mask-RCNN model, each image was resized to 1024 × 1024 px 2 while preserving the aspect ratio, with any remaining space zero-padded.
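A minimal training sketch in the spirit of the staged schedule above is shown below, based on the Matterport Keras/TensorFlow Mask R-CNN package (ref. 33 ). The configuration values, augmentation parameters, weight-file path, and dataset objects are assumptions; the datasets are assumed to be prepared mrcnn.utils.Dataset subclasses holding the annotated 2D-material images.

```python
# Staged Mask R-CNN training with online augmentation (illustrative values only).
import imgaug.augmenters as iaa
from mrcnn.config import Config
from mrcnn import model as modellib

class MaterialsConfig(Config):
    NAME = "2dmaterials"
    NUM_CLASSES = 1 + 3        # background + "mono" / "few" / "thick" (assumed)
    IMAGES_PER_GPU = 2
    IMAGE_MIN_DIM = 1024
    IMAGE_MAX_DIM = 1024
    STEPS_PER_EPOCH = 500      # 500 iterations per epoch, as described above

# Online augmentation roughly matching the listed operations (parameters assumed).
augmentation = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Flipud(0.5),
    iaa.Affine(rotate=(-180, 180), translate_percent=(-0.1, 0.1)),
    iaa.Multiply((0.8, 1.2), per_channel=0.5),   # color-channel multiplication
], random_order=True)

# dataset_train, dataset_val: prepared mrcnn.utils.Dataset subclasses (not shown).
model = modellib.MaskRCNN(mode="training", config=MaterialsConfig(), model_dir="./logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])   # re-initialize the heads

# Four stages of 30 epochs each (epoch numbers are cumulative in this API):
model.train(dataset_train, dataset_val, learning_rate=1e-3, epochs=30,
            layers="heads", augmentation=augmentation)
model.train(dataset_train, dataset_val, learning_rate=1e-3, epochs=60,
            layers="4+", augmentation=augmentation)
model.train(dataset_train, dataset_val, learning_rate=1e-4, epochs=90,
            layers="all", augmentation=augmentation)
model.train(dataset_train, dataset_val, learning_rate=1e-5, epochs=120,
            layers="all", augmentation=augmentation)
```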

To improve the generalization ability of the network, we organized the training of the Mask-RCNN model into two steps. First, the model was trained on a mixed dataset consisting of multiple 2D materials (graphene, hBN, MoS 2 , and WTe 2 ). At this stage, the model was trained to perform segmentation and classification with respect to both material identity and layer thickness. Then, using the trained weights as a source, we performed transfer learning on each material subset to achieve layer thickness classification. By employing this strategy, the feature values that are common to 2D materials in the layers behind the network heads were optimized and shared between the different materials. As shown below, the sharing of the backbone network contributed to faster convergence of the network weights and a smaller test loss.

Training curve

Figure 3d shows the value of the loss function as a function of the epoch count. The solid (dotted) curves represent the test (training) loss. The training was conducted either with (red curves) or without (blue curves) data augmentation. Without augmentation, the training loss decreased to zero, while the test loss increased. The difference between the test and training losses grew significantly with training, which indicates that the generalization error increased and the model overfit the training data 13 . When data augmentation was applied, both the training and test losses decreased monotonically with training, and the difference between them remained small. These results indicate that when ~2000 optical microscope images are prepared, the Mask-RCNN model can be trained on 2D materials without overfitting.

Transfer learning

After training on multiple material categories, we applied transfer learning to the model using each sub-dataset. Figure 4a–d shows the learning curves for training the networks on the graphene, hBN, MoS 2 , and WTe 2 subsets of the annotated data, respectively. The solid (dotted) curves represent the test (training) loss. The network weights were initialized using those at epoch 120 obtained by training on multiple material classes (Fig. 3d ) (red curves, Fig. 4a–d ). For reference, we also trained the model by initializing the network weights using those obtained by pretraining only on the MS-COCO dataset (blue curves, Fig. 4a–d ). Notably, in all cases, the test loss decreased faster for the models pretrained on the 2D crystals and MS-COCO than for those pretrained on MS-COCO only. The loss value after 30 epochs of training on 2D crystals and MS-COCO was almost the same as that obtained after 80 epochs of training on MS-COCO only. In addition, the minimum loss value achieved in the case of pretraining on 2D crystals and MS-COCO was smaller than that achieved with MS-COCO only. These results indicate that the feature values common to 2D materials are learnt in the backbone network. In particular, the trained backbone network weights contribute to improving the model performance on each material.

Figure 4

a – d Test (solid curves) and training (dotted curves) losses as a function of epoch count for training on a graphene, b WTe 2 , c MoS 2 , and d hBN. Each epoch consists of 500 training steps. The model weights were initialized using those pretrained on (blue) MS-COCO and (red) MS-COCO and 2D material datasets. The optical microscope image of graphene (WTe 2 ) and the inference results for these images are shown in e – g ( h – j ). The scale bars correspond to 10 µm.

To investigate the improvement of the model accuracy, we compared the inference results for optical microscope images using the network weights from each training set. Figure 4e, h shows the optical microscope images of graphene and WTe 2 , respectively, which were input into the network. We employed the model weights at which the loss value was minimum (indicated by the red/blue arrows). The inference results in the cases of transferring only from MS-COCO, and from both MS-COCO and 2D materials, are shown in Fig. 4f, g for graphene, and Fig. 4i, j for WTe 2 . For graphene, the model transferred from MS-COCO only failed to detect some thick graphite flakes, as indicated by the white arrows in Fig. 4f , whereas the model trained on MS-COCO and 2D crystals detected these flakes, as indicated by the white arrows in Fig. 4g . Similarly, for WTe 2 , when the inference was performed using the model transferred from MS-COCO only, the surface of the SiO 2 /Si substrate surrounded by thick WTe 2 crystals was misclassified as WTe 2 , as indicated by the white arrow in Fig. 4i . In contrast, when learning was transferred from the model pretrained on MS-COCO and 2D materials (red arrow, Fig. 4b ), this region was not recognized as WTe 2 . These results indicate that pretraining on multiple material classes contributes to improving model accuracy because the common properties of 2D crystals are learnt in the backbone network. The inference results presented in Fig. 1 were obtained by utilizing the model weights at epoch 120 for each material.

Generalization ability

Finally, we investigated the generalization ability of the neural network for detecting graphene flakes in images obtained using different optical microscope setups (Asahikogaku AZ10-T/E, Keyence VHX-900, and Keyence VHX-5000, shown in Fig. 5a–c , respectively). Figure 5d–f shows the optical microscope images of exfoliated graphene captured by each instrument. Across these instruments, there are significant variations in the white balance, magnification, resolution, illumination intensity, and illumination inhomogeneity (Fig. 5d–f ). The model weights from training epoch 120 on the graphene dataset were employed (red arrow, Fig. 4a ). Even though no optical microscope images recorded by these instruments were utilized for training, the deep-learning model successfully detected the regions of exfoliated graphene, as shown by the inference results in Fig. 5g–i . These results indicate that our trained neural network captured the latent general features of graphene flakes and thus constitutes a general-purpose graphene detector that works irrespective of the optical microscope setup. These properties cannot be realized using conventional rule-based detection algorithms for 2D crystals, in which the detection parameters must be retuned whenever the optical conditions are altered.

Figure 5

a–c Optical microscope setups used for capturing images of exfoliated graphene (Asahikogaku AZ10-T/E, Keyence VHX-900, and Keyence VHX-5000, respectively). d–f Optical microscope images recorded using instruments ( a–c ), respectively. g–i Inference results for the optical microscope images in d–f , respectively. The segmentation masks are shown in color, and the category and confidences are also indicated. The scale bars correspond to 10 µm.

To train the neural network for 2D crystals with a different appearance, such as ZrSe 3 , the model weights trained on both MS-COCO and 2D crystals obtained in this study can be used as source weights to start the training. In our experience, a Mask-RCNN trained on a small dataset of ~80 images starting from the MS-COCO pretrained model can already produce rough segmentation masks on graphene. Therefore, providing on the order of 80 annotated images should be sufficient for developing a classification algorithm that detects other 2D materials when our trained weights are used as a source. Our work can thus be utilized as a starting point for developing neural network models that work for various 2D materials.

Moreover, the trained neural networks can be utilized for searching for materials other than those used for training. As a demonstration, we exfoliated WSe 2 and MoSe 2 flakes on SiO 2 /Si substrates and conducted searching with the model trained on WTe 2 . As shown in Supplementary Figs 10 and 11 , thin WSe 2 and MoSe 2 flakes are correctly detected even without training on these materials. This result indicates that the differences in appearance of WSe 2 and MoSe 2 from WTe 2 are covered by the generalization ability of the neural networks.

Finally, our deep-learning inference process can run on a remote server/client architecture. This architecture is suitable for researchers with an occasional need for deep learning, as it provides a cloud-based setup that does not require a local GPU. Conventional optical microscope instruments that were not covered in this study can also be modified to support deep-learning inference by implementing client software that captures an image, sends it to the server, and receives and displays the inference results. The distribution of the deep-learning inference system will benefit the research community by saving the time needed for optical-microscopy-based searching of 2D materials.

In this work, we developed a deep-learning-assisted automated optical microscope to search for 2D crystals on SiO 2 /Si substrates. A neural network with the Mask-RCNN architecture trained on 2D materials enabled the efficient detection of various exfoliated 2D crystals, including graphene, hBN, and transition metal dichalcogenides (WTe 2 and MoS 2 ), while simultaneously generating a segmentation mask for each object. This work, along with other recent attempts to utilize deep-learning algorithms 44 , 45 , 46 , should free researchers from the repetitive tasks of optical microscopy and constitutes a fundamental step toward realizing fully automated fabrication systems for van der Waals heterostructures. To facilitate such research, we make the source codes for training, the model weights, the training dataset, and the optical microscope drivers publicly available.

Optical microscope drivers

The automated optical microscope drivers were written in C++ and Python. The software stack was developed on the Robot Operating System 47 and the HALCON image-processing library (MVTec Software GmbH).

Preparation of the training dataset

To prepare the training data for the Mask-RCNN model to segment 2D crystals, we employed a semiautomatic annotation workflow. First, we trained the Mask-RCNN with a small dataset consisting of ~80 images of graphene. Then, we ran predictions on optical microscope images of graphene. The prediction labels generated by the Mask-RCNN were stored in LabelBox via its API and were then manually corrected by a human annotator. This procedure greatly enhanced the annotation efficiency, allowing each image to be labeled in 20–30 s.
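As an illustration of how such predictions or corrected labels could be converted into the COCO-compatible JSON mentioned earlier, a sketch is given below. The field layout follows the standard COCO annotation schema; the function and category handling are assumptions, not the authors' released scripts.

```python
# Convert a binary instance mask into a COCO-style polygon annotation entry.
import numpy as np
from skimage import measure

def mask_to_coco_annotation(mask, image_id, ann_id, category_id):
    """mask: (H, W) boolean array for one detected/annotated flake."""
    ys, xs = np.where(mask)
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
    segmentation = []
    for contour in measure.find_contours(mask.astype(float), 0.5):
        # COCO stores polygons as [x1, y1, x2, y2, ...] with x = column, y = row.
        poly = np.flip(contour, axis=1).ravel().tolist()
        if len(poly) >= 6:                     # at least three points
            segmentation.append(poly)
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,            # e.g. graphene "mono" / "few" / "thick"
        "segmentation": segmentation,
        "bbox": [float(x0), float(y0), float(x1 - x0), float(y1 - y0)],
        "area": float(mask.sum()),
        "iscrowd": 0,
    }
```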

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Code availability

The source code, the trained network weights, and the training data are available at https://github.com/tdmms/ .

Zhao, Z.-Q., Zheng, P., Xu, S.-t. & Wu, X. Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems 30 , 3212–3232 (2019).

Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems , 91–99 (Neural Information Processing Systems Foundation, 2015).

Girshick, R. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision , 1440–1448 (IEEE, 2015).

Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 580–587 (IEEE, 2014).

Liu, W. et al. SSD: Single shot multibox detector. European Conference on Computer Vision , 21–37 (Springer, 2016).

Garcia-Garcia, A., Orts-Escolano, S., Oprea, S. O., Villena-Martinez, V. & Garcia-Rodriguez, J. A review on deep learning techniques applied to semantic segmentation. Preprint at https://arxiv.org/abs/1704.06857 (2017).

Badrinarayanan, V., Kendall, A. & Cipolla, R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 , 2481–2495 (2017).


Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-assisted Intervention , 234–241 (Springer, 2015).

Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 3431–3440 (IEEE, 2015).

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 , 834–848 (2017).

He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision , 2961–2969 (IEEE, 2017).

Goodfellow, I. et al. Generative adversarial nets. Advances in Neural Information Processing Systems , 2672–2680 (Neural Information Processing Systems Foundation, 2014).

Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning . (MIT Press, 2016).

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).


Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems , 1097–1105 (Neural Information Processing Systems Foundation, 2012).

Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42 , 60–88 (2017).

Falk, T. et al. U-Net: deep learning for cell counting, detection, and morphometry. Nat. Methods 16 , 67–70 (2019).

Moen, E. et al. Deep learning for cellular image analysis. Nat. Methods https://doi.org/10.1038/s41592-019-0403-1 (2019).

Karpathy, A. Software 2.0 . https://medium.com/@karpathy/software-2-0-a64152b37c35 (2017).

Novoselov, K. S., Mishchenko, A., Carvalho, A. & Castro Neto, A. H. 2D materials and van der Waals heterostructures. Science 353 , aac9439 (2016).

Novoselov, K. S. et al. Two-dimensional atomic crystals. Proc. Natl Acad. Sci. USA 102 , 10451–10453 (2005).

Novoselov, K. S. et al. Electric field effect in atomically thin carbon films. Science 306 , 666–669 (2004).

Masubuchi, S. et al. Autonomous robotic searching and assembly of two-dimensional crystals to build van der Waals superlattices. Nat. Commun. 9 , 1413 (2018).

Masubuchi, S. & Machida, T. Classifying optical microscope images of exfoliated graphene flakes by data-driven machine learning. npj 2D Mater. Appl. 3 , 4 (2019).

Nixon, M. S. & Aguado, A. S. Feature Extraction & Image Processing for Computer Vision (Academic Press, 2012).

Szeliski, R. Computer Vision: Algorithms and Applications . (Springer Science & Business Media, 2010).

Yu, Y. et al. High-temperature superconductivity in monolayer Bi 2 Sr 2 CaCu 2 O 8+δ . Nature 575 , 156–163 (2019).

Ghosh, S., Das, N., Das, I. & Maulik, U. Understanding deep learning techniques for image segmentation. Preprint at https://arxiv.org/abs/1907.06119 (2019).

Pinheiro, P. O., Lin, T.-Y., Collobert, R. & Dollár, P. Learning to refine object segments. European Conference on Computer Vision , 75–91 (Springer, 2016).

Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2881–2890 (IEEE, 2017).

Lin, T.-Y. et al. Microsoft COCO: common objects in context. European Conference on Computer Vision , 740–755 (Springer, 2014).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 770–778 (IEEE, 2016).

Abdulla, W. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow https://github.com/matterport/Mask_RCNN (2017).

Chollet, F. Keras: Deep learning for humans https://github.com/keras-team/keras (2015).

Abadi, M. et al. Tensorflow: a system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation , 265–283 (USENIX Association, 2016).

Hestness, J. et al. Deep learning scaling is predictable, empirically. Preprint at https://arxiv.org/abs/1712.00409 (2017).

Labelbox. Labelbox https://labelbox.com (2019).

Lin, X. et al. Intelligent identification of two-dimensional nanostructures by machine-learning optical microscopy. Nano Res. 11 , 6316–6324 (2018).

Li, H. et al. Rapid and reliable thickness identification of two-dimensional nanosheets using optical microscopy. ACS Nano 7 , 10344–10353 (2013).

Ni, Z. H. et al. Graphene thickness determination using reflection and contrast spectroscopy. Nano Lett. 7 , 2758–2763 (2007).

Nolen, C. M., Denina, G., Teweldebrhan, D., Bhanu, B. & Balandin, A. A. High-throughput large-area automated identification and quality control of graphene and few-layer graphene films. ACS Nano 5 , 914–922 (2011).

Taghavi, N. S. et al. Thickness determination of MoS 2 , MoSe 2 , WS 2 and WSe 2 on transparent stamps used for deterministic transfer of 2D materials. Nano Res. 12 , 1691–1695 (2019).

Zhang, P., Zhong, Y., Deng, Y., Tang, X. & Li, X. A survey on deep learning of small sample in biomedical image analysis. Preprint at https://arxiv.org/abs/1908.00473 (2019).

Saito, Y. et al. Deep-learning-based quality filtering of mechanically exfoliated 2D crystals. npj Computational Materials 5 , 1–6 (2019).

Han, B. et al. Deep learning enabled fast optical characterization of two-dimensional materials. Preprint at https://arxiv.org/abs/1906.11220 (2019).

Greplova, E. et al. Fully automated identification of 2D material samples. Preprint at https://arxiv.org/abs/1911.00066 (2019).

Quigley, M. et al. ROS: an open-source Robot Operating System. ICRA Workshop on Open Source Software (Open Robotics, 2009).


Acknowledgements

This work was supported by CREST, Japan Science and Technology Agency Grant Numbers JPMJCR15F3 and JPMJCR16F2, and by JSPS KAKENHI under Grant No. JP19H01820.

Author information

Authors and Affiliations

Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo, 153-8505, Japan

Satoru Masubuchi, Eisuke Watanabe, Yuta Seo, Takashi Taniguchi & Tomoki Machida

Laboratory for Materials and Structures, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503, Japan

Shota Okazaki & Takao Sasagawa

National Institute for Materials Science, 1-1 Namiki, Tsukuba, Ibaraki, 305-0044, Japan

Kenji Watanabe & Takashi Taniguchi


Contributions

S.M. conceived the scheme, implemented the software, trained the neural network, and wrote the paper. E.W. and Y.S. exfoliated the 2D materials and tested the system. S.O. and T.S. synthesized the WTe 2 and WSe 2 crystals. K.W. and T.T. synthesized the hBN crystals. T.M. supervised the research program.

Corresponding authors

Correspondence to Satoru Masubuchi or Tomoki Machida .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF), Supplementary Movies 1–3 (MP4)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Masubuchi, S., Watanabe, E., Seo, Y. et al. Deep-learning-based image segmentation integrated with optical microscopy for automatically searching for two-dimensional materials. npj 2D Mater Appl 4 , 3 (2020). https://doi.org/10.1038/s41699-020-0137-z

Download citation

Received : 20 October 2019

Accepted : 24 February 2020

Published : 23 March 2020

DOI : https://doi.org/10.1038/s41699-020-0137-z




Editorial: Current Trends in Image Processing and Pattern Recognition


  • PAMI Research Lab, Computer Science, University of South Dakota, Vermillion, SD, United States

Editorial on the Research Topic Current Trends in Image Processing and Pattern Recognition

Technological advancements in computing have opened up multiple opportunities in a wide variety of fields, ranging from document analysis ( Santosh, 2018 ), biomedical and healthcare informatics ( Santosh et al., 2019 ; Santosh et al., 2021 ; Santosh and Gaur, 2021 ; Santosh and Joshi, 2021 ), and biometrics to intelligent language processing. These applications primarily leverage AI tools and/or techniques, drawing on topics such as image processing, signal and pattern recognition, machine learning, and computer vision.

With this theme, we opened a call for papers on Current Trends in Image Processing & Pattern Recognition that followed the third International Conference on Recent Trends in Image Processing & Pattern Recognition (RTIP2R), 2020 (URL: http://rtip2r-conference.org ). Our call was not limited to RTIP2R 2020; it was open to all. Altogether, 12 papers were submitted, and seven of them were accepted for publication.

In Deshpande et al. , the authors addressed the use of global fingerprint features (e.g., ridge flow, frequency, and other interest/key points) for matching. With a convolutional neural network (CNN) matching model, which they called “Combination of Nearest-Neighbor Arrangement Indexing (CNNAI),” they achieved their highest rank-1 identification rate of 84.5% on the FVC2004 and NIST SD27 datasets. The authors claimed that their results are comparable with state-of-the-art algorithms and that their approach is robust to rotation and scale. Similarly, in Deshpande et al. , using the same datasets, the same set of authors addressed the importance of minutiae extraction and matching, taking low-quality latent fingerprint images into account. Their minutiae extraction technique showed a remarkable improvement in their results, which, as claimed by the authors, were comparable to state-of-the-art systems.

In Gornale et al. , the authors extracted distinguishing features that were geometrically distorted or transformed by taking Hu’s invariant moments into account. With this, the authors focused on the early detection and grading of knee osteoarthritis, and they claimed that their results were validated by orthopedic surgeons and rheumatologists.

In Tamilmathi and Chithra , the authors introduced a new deep-learned quantization-based coding scheme for 3D airborne LiDAR point-cloud images. In their experiments, the authors showed that their model compressed an image into a constant 16 bits of data and decompressed it with approximately 160 dB PSNR, with an execution time of 174.46 s and an execution speed of 0.6 s per instruction. The authors claimed that their method compares favorably with previous algorithms/techniques when space and time are considered.

In Tamilmathi and Chithra , the authors carefully inspected possible signs of plant leaf diseases. They employed the concept of feature learning and observed the correlations and/or similarities between symptoms related to diseases, so that disease identification becomes possible.

In Das Chagas Silva Araujo et al. , the authors proposed a benchmark environment for comparing multiple algorithms for depth reconstruction from two event-based sensors. In their evaluation, a stereo matching algorithm was implemented, and multiple experiments were conducted with various camera settings and parameters. The authors claimed that this work could serve as a benchmark for the robust evaluation of the multitude of new techniques in the field of event-based stereo vision.

In Steffen et al. and Gornale et al. , the authors employed handwritten signatures to better understand this behavioral biometric trait for document authentication/verification, such as in letters, contracts, and wills. They used handcrafted features such as LBP and HOG to extract features from 4,790 signatures so that shallow learning could be applied efficiently. Using k-NN, decision tree, and support vector machine classifiers, they reported promising performance.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Santosh, KC, Antani, S., Guru, D. S., and Dey, N. (2019). Medical Imaging Artificial Intelligence, Image Recognition, and Machine Learning Techniques . United States: CRC Press . ISBN: 9780429029417. doi:10.1201/9780429029417


Santosh, KC, Das, N., and Ghosh, S. (2021). Deep Learning Models for Medical Imaging, Primers in Biomedical Imaging Devices and Systems . United States: Elsevier . eBook ISBN: 9780128236505.


Santosh, KC (2018). Document Image Analysis - Current Trends and Challenges in Graphics Recognition . United States: Springer . ISBN 978-981-13-2338-6. doi:10.1007/978-981-13-2339-3

Santosh, KC, and Gaur, L. (2021). Artificial Intelligence and Machine Learning in Public Healthcare: Opportunities and Societal Impact . Spain: SpringerBriefs in Computational Intelligence Series . ISBN: 978-981-16-6768-8. doi:10.1007/978-981-16-6768-8

Santosh, KC, and Joshi, A. (2021). COVID-19: Prediction, Decision-Making, and its Impacts, Book Series in Lecture Notes on Data Engineering and Communications Technologies . United States: Springer Nature . ISBN: 978-981-15-9682-7. doi:10.1007/978-981-15-9682-7

Keywords: artificial intelligence, computer vision, machine learning, image processing, signal processing, pattern recognition

Citation: Santosh KC (2021) Editorial: Current Trends in Image Processing and Pattern Recognition. Front. Robot. AI 8:785075. doi: 10.3389/frobt.2021.785075

Received: 28 September 2021; Accepted: 06 October 2021; Published: 09 December 2021.


Copyright © 2021 Santosh. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: KC Santosh, [email protected]

This article is part of the Research Topic

Current Trends in Image Processing and Pattern Recognition

  • Open access
  • Published: 11 February 2019

Research on image classification model based on deep convolution neural network

  • Mingyuan Xin 1 &
  • Yong Wang 2  

EURASIP Journal on Image and Video Processing volume 2019, Article number: 40 (2019)


Based on an analysis of the error backpropagation algorithm, we propose an innovative training criterion for deep neural networks: the maximum-interval minimum classification error (M3CE). At the same time, cross-entropy and M3CE are analyzed and combined to obtain better results. Finally, we tested the proposed M3CE-CEc on two standard deep-learning databases, MNIST and CIFAR-10. The experimental results show that M3CE can enhance the cross-entropy and that it is an effective supplement to the cross-entropy criterion. M3CE-CEc obtained good results on both databases.

1 Introduction

Traditional machine learning methods (such as multilayer perceptrons, support vector machines, etc.) mostly use shallow structures to deal with a limited number of samples and computing units. When the target objects have rich meanings, their performance and generalization ability on complex classification problems are clearly insufficient. The convolutional neural network (CNN) developed in recent years has been widely used in the field of image processing because it is good at dealing with image classification and recognition problems and has brought great improvements in the accuracy of many machine learning tasks. It has become a powerful and universal deep-learning model.

The convolutional neural network (CNN) is a multilayer neural network and the most classical and common deep-learning framework. A new reconstruction algorithm based on convolutional neural networks was proposed by Newman et al. [ 1 ], and its advantages in speed and performance were demonstrated. Wang et al. [ 2 ] discussed three classes of methods, namely, CNN models with pretraining, CNN models with fine-tuning, and hybrid methods. The first two categories pass an image through the network once, while the last uses a patch-based feature extraction scheme. Their survey provides a milestone for modern instance retrieval, reviews a wide selection of previous work in different categories, and provides insights into the link between SIFT and CNN-based approaches. After analyzing and comparing the retrieval performance of the different categories on several datasets, they discuss new directions for general and specialized instance retrieval. Convolutional neural networks have attracted great interest in machine learning and show excellent performance in hyperspectral image classification. Al-Saffar et al. [ 3 ] proposed a classification framework called region-based pluralistic CNN, which can encode semantic context-aware representations to obtain promising features. By combining a set of different discriminative appearance factors, the CNN-based representation exhibits the spatial-spectral contextual sensitivity that is essential for accurate pixel classification. The proposed method of learning contextual interaction features from various region-based inputs is expected to have more discriminative power. The combined representation, which contains rich spectral and spatial information, is then fed to a fully connected network, and the label of each pixel vector is predicted by a softmax layer. Experimental results on widely used hyperspectral image datasets show that the proposed method outperforms traditional deep-learning-based classifiers and other advanced classifiers. Context-based convolutional neural networks (CNNs) with deep structures and pixel-based multilayer perceptrons (MLPs) with shallow structures are well-recognized neural network algorithms, representing the most advanced deep-learning methods and classical non-neural-network algorithms, respectively. These two algorithms, which behave very differently, were integrated in a concise and efficient manner, and a rule-based decision-fusion method was used to classify very-fine-spatial-resolution (VFSR) remote sensing images. The decision-fusion rules, designed mainly around the CNN classification confidence, reflect the generally complementary patterns of the two classifiers. The ensemble classifier MLP-CNN proposed by Said et al. [ 4 ] therefore acquires complementary results from the CNN, based on deep spatial feature representation, and from the MLP, based on spectral discrimination. At the same time, the CNN limitations resulting from the use of convolution filters, such as uncertainty in object boundary segmentation and the loss of useful fine-spatial-resolution details, are compensated. The validity of the ensemble MLP-CNN classifier was tested in urban and rural areas using aerial photography and additional satellite sensor datasets. The MLP-CNN classifier achieves promising performance and consistently outperforms the pixel-based MLP, the spectral- and texture-based MLP, and the context-based CNN in classification accuracy. This research paves the way for effectively solving the complex problem of VFSR image classification.

Periodic inspection of nuclear power plant components is important to ensure safe operation. However, current practice is time-consuming, tedious, and subjective, involving human technicians examining videos and identifying reactor cracks. Some vision-based crack detection methods have been developed for metal surfaces, but they generally perform poorly when used to analyze nuclear inspection videos. Detecting these cracks is a challenging task because of their small size and the presence of noise patterns on the surface of the components. Huang et al. [ 5 ] proposed a deep-learning framework based on a convolutional neural network (CNN) and a Naive Bayes data-fusion scheme (called NB-CNN) that can analyze individual video frames for crack detection. At the same time, a new data-fusion scheme was proposed to aggregate the information extracted from each video frame to enhance the overall performance and robustness of the system. In that work, a CNN detects the fissures in each video frame, the proposed data-fusion scheme maintains the temporal and spatial coherence of the cracks across the video, and the Naive Bayes decision effectively discards false positives. The proposed framework achieves a hit rate of 98.3% at 0.1 false positives per frame, which is significantly higher than the most advanced methods considered in that paper. The prediction of visual-attention data from any type of media is valuable to content creators and can be used to drive coding algorithms effectively. With the current trend toward virtual reality (VR), the adaptation of known technologies to this new medium is beginning to gain momentum. Gupta and Bhavsar [ 6 ] proposed an extension to any convolutional neural network (CNN) architecture to fine-tune traditional 2D saliency prediction to omnidirectional images (ODIs) in an end-to-end manner, showing that each step in their pipeline makes the generated saliency map more accurate with respect to ground-truth data. The convolutional neural network is a deep machine-learning method derived from the artificial neural network (ANN) that has achieved great success in the field of image recognition in recent years. The training algorithm of a neural network is based on the error backpropagation (BP) algorithm, which relies on gradient descent. However, as the number of neural network layers increases, the number of weight parameters increases sharply, which slows the convergence of the BP algorithm and makes the training time too long. The CNN training algorithm is a variant of the BP algorithm. By means of local connections and weight sharing, the network structure becomes more similar to a biological neural network, which not only keeps the deep structure of the network but also greatly reduces the number of network parameters, so that the model has good generalization ability and is easier to train. This advantage is more obvious when the network input is a multi-dimensional image, because the image can be used directly as the network input, avoiding the complex feature extraction and data reconstruction processes of traditional recognition algorithms. Therefore, convolutional neural networks can also be interpreted as multilayer perceptrons designed to recognize two-dimensional shapes, and they are highly invariant to translation, scaling, tilting, and other forms of deformation [ 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 ].

With the rapid development of mobile Internet technology, more and more image information is stored on the Internet, and images have become another important carrier of network information after text. Against this background, it is very important to use computers to classify and recognize these images intelligently so that they can better serve people. In the initial stage of image classification and recognition, the technology was mainly used to meet auxiliary needs; for example, Baidu's star-face function can help users find the most similar celebrity, and OCR technology can extract text and information from images. For graph-based semi-supervised learning methods, it is very important to construct good graphs that can capture the intrinsic data structure. Such methods are widely used in hyperspectral image (HSI) classification with a small number of labeled samples. Among existing graph construction methods, sparse representation (SR)-based approaches show impressive performance in semi-supervised HSI classification tasks. However, most SR-based algorithms fail to consider the rich spatial information of HSI, which has been shown to be beneficial for classification. Yan et al. [ 16 ] proposed a space and class structure regularized sparse representation (SCSSR) graph for semi-supervised HSI classification. Specifically, spatial information is incorporated into the SR model through graph Laplacian regularization, which assumes that spatial neighbors should have similar representation coefficients, so that the obtained coefficient matrix can more accurately reflect the similarity between samples. In addition, they also incorporate the probabilistic class structure (the probabilistic relationship between each sample and each class) into the SR model to further improve the discriminability of the graph. Results on Hyperion and AVIRIS hyperspectral data show that their method is superior to state-of-the-art methods.

Invariant features, such as invariance across samples of the same class and rotation invariance, as studied by Zhang et al. [ 17 ], are very important for object detection and classification applications. Current research focuses on specific invariances of features, such as rotation invariance. In that work, a new multichannel convolutional neural network (mCNN) is proposed to extract invariant features for object classification. Multi-channel convolutions sharing the same weights are used to reduce the feature variance of sample pairs with different rotations within the same class. As a result, within-class invariance and rotation invariance are addressed simultaneously to improve the invariance of the features. More importantly, the proposed mCNN is particularly effective for small training sets. Experimental results on two benchmark datasets for handwriting recognition show that the proposed mCNN is very effective at extracting invariant features from a small number of training samples.

With the arrival of the big data era, convolutional neural networks with more hidden layers have more complex network structures and stronger feature learning and feature expression abilities than traditional machine learning methods. Since the introduction of convolutional neural network models trained by deep learning algorithms, significant achievements have been made in many large-scale recognition tasks in the field of computer vision. Chaib et al.
[ 18 ] first introduced the rise and development of deep learning and convolutional neural networks and summarized the basic model structure, convolutional feature extraction, and pooling operations of the convolutional neural network. Then, the research status and development trends of deep-learning-based convolutional neural network models for image classification are reviewed, and typical network structures, training methods, and performance are introduced. Finally, some problems in current research are briefly summarized and discussed, and new directions for future development are predicted.

Computer-aided diagnostic technology has played an important role in medical diagnosis from its beginnings to the present day. In particular, image classification technology, from initial theoretical research to clinical diagnosis, has provided effective assistance for the diagnosis of various diseases. Moreover, an image is the concrete impression formed in the human brain by objective things existing in the natural environment, and it is an important source of information through which humans acquire knowledge of external things. With the continuous development of computer technology, general object recognition in natural scenes is applied more and more in daily life. From simple barcode recognition and text recognition (such as handwritten character recognition and optical character recognition, OCR) to biometric recognition (such as fingerprint, voice, iris, face, gesture, and emotion recognition), there are many successful applications of image recognition technology. Image recognition, especially object category recognition in natural scenes, is a distinctive skill of human beings. In a complex natural environment, people can identify a specific object (such as a teacup or a swallow) or a general category of objects (household goods, birds, etc.) at a glance. However, there are still many open questions about how humans accomplish this and how to transfer the related abilities to computers so that they possess human-like intelligence. Therefore, research on image recognition algorithms remains active in the fields of machine vision, machine learning, deep learning, and artificial intelligence [ 19 , 20 , 21 , 22 , 23 , 24 ].

Therefore, this paper applies the feature-mining advantages of deep convolutional neural networks to image classification, tests the loss function constructed by M3CE on two standard deep-learning databases, MNIST and CIFAR-10, and pushes forward a new direction for image classification research.

2 Proposed method

Image classification is one of the most active research directions in the field of computer vision, and it also underlies image applications in other fields. An image classification system is usually divided into three important parts: image preprocessing, image feature extraction, and the classifier.

2.1 ZCA whitening process

In this process, we first zero the mean of the data. In this paper, we assume that X represents the set of image vectors x_j [ 25 ]; the mean \( \mu =\frac{1}{m}\sum \limits_{j=1}^m{x}_j \) is computed and subtracted from each sample.

Next, the covariance matrix of the entire dataset is calculated as follows: \( \Sigma =\frac{1}{m}\sum \limits_{j=1}^{m}\left({x}_j-\mu \right){\left({x}_j-\mu \right)}^{\mathrm{T}} \)

where \( \Sigma \) represents the covariance matrix. \( \Sigma \) is decomposed by SVD [ 26 ] as \( \Sigma =US{U}^{\mathrm{T}} \), and its eigenvalues and corresponding eigenvectors are obtained.

Here, U is the eigenvector matrix of \( \Sigma \) and S is the eigenvalue matrix of \( \Sigma \). Based on this, x can be whitened by PCA with the formula \( {x}_{\mathrm{PCAwhiten}}={S}^{-1/2}{U}^{\mathrm{T}}x \).

So \( {x}_{\mathrm{ZCAwhiten}} \) can be expressed as \( {x}_{\mathrm{ZCAwhiten}}=U{x}_{\mathrm{PCAwhiten}}=U{S}^{-1/2}{U}^{\mathrm{T}}x \).

For the dataset used in this paper, because the training and test samples are not separated in advance [ 27 ], a random split is used to avoid the subjective bias of a manual partition.
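
As a concrete illustration of the whitening steps above, the following NumPy sketch implements zero-centering, covariance estimation, SVD, PCA whitening, and the final ZCA rotation. The function name, the small epsilon regularizer, and the example shapes are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten a data matrix X of shape (m, d): m samples, d pixels per sample."""
    # 1. Zero-center: mu = (1/m) * sum_j x_j
    mu = X.mean(axis=0)
    Xc = X - mu
    # 2. Covariance matrix Sigma = (1/m) * Xc^T Xc
    m = Xc.shape[0]
    sigma = (Xc.T @ Xc) / m
    # 3. SVD of the (symmetric) covariance matrix: Sigma = U S U^T
    U, S, _ = np.linalg.svd(sigma)
    # 4. PCA whitening: x_PCAwhiten = S^(-1/2) U^T x  (eps avoids division by zero)
    X_pca = (Xc @ U) / np.sqrt(S + eps)
    # 5. ZCA whitening: rotate back into the original pixel space with U
    return X_pca @ U.T

# Example: whiten a batch of flattened 28 x 28 images (e.g., MNIST digits)
batch = np.random.rand(256, 784)
print(zca_whiten(batch).shape)  # (256, 784)
```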

2.2 Image feature extraction based on time-frequency composite weighting

Feature extraction is a concept in computer vision and image processing. It refers to the use of a computer to extract image information and to determine whether each point of an image belongs to a particular image feature. The purpose of feature extraction is to divide the points of an image into different subsets, which often correspond to isolated points, continuous curves, or regions. There are usually many kinds of features that can describe an image, and these features can be classified according to different criteria, such as point features, line features, and region features according to how the feature is represented in the image data. According to the size of the region from which features are extracted, they can be divided into two categories: global features and local features [ 24 ]. The image features used by the feature extraction methods in this paper include color features and texture features, as well as corner features and edge features.

The time-frequency composite weighting algorithm for multi-frame blurred images is a weighting algorithm that processes the blurred image data simultaneously in the frequency domain and the time domain. Based on the weighting characteristics of the algorithm and the extraction of target-image features in the time and frequency domains, the depth extraction technique applies time-frequency composite weighting to the night image to extract the target information from the depth image. The main steps of the time-frequency composite weighted feature extraction method are as follows:

Step 1: Construct a time-frequency composite weighted signal model for multiple blurred images, as the following expression shows:

where f ( t ) is the original signal and S  = ( c  −  v )/( c  +  v ) is called the image scale factor. Referred to simply as the scale, it represents the signal scaling change in the image time-frequency composite weighting algorithm, and \( \sqrt{S} \) is the normalization factor of the algorithm.

Step 2: Map the one-dimensional function to a two-dimensional function y ( t ) of the time scale a and the time shift b , and perform a time-frequency composite weighted transform on the continuous image using a square-integrable function ψ ( t ), as shown below: \( W\left(a,b\right)=\frac{1}{\sqrt{\mid a\mid }}\int f(t)\,{\psi}^{\ast}\!\left(\frac{t-b}{a}\right)\, dt \)

where the divisor \( 1/\sqrt{\mid a\mid } \) ensures the energy normalization of the unitary transform, and ψ a , b is obtained from ψ ( t ) by applying the affine group transform U ( a ,  b ), as shown by the following expression: \( {\psi}_{a,b}(t)=\frac{1}{\sqrt{\mid a\mid }}\psi \left(\frac{t-b}{a}\right) \)

Step 3: Substitute a  = 1/ s and b  =  τ into the expression for the original image f ( t ) and rewrite it to obtain a new expression.

Step 4: Build a multi-frame fuzzy image time-frequency composite weighted signal form.

where rect( t ) = 1 for ∣ t ∣ ≤ 1/2.

Step 5: The frequency modulation law of the time-frequency composite weighted signal of the multi-frame blurred image is a hyperbolic function,

where K  =  Tf max f min / B , t 0  =  f 0 T / B , f 0 is the arithmetic center frequency, and f max and f min are the maximum and minimum frequencies, respectively.

Step 6: Use the image transformation formula of the multi-detector blurred image time-frequency composite weighted signal to apply time-frequency composite weighting to the image; the image transformation is defined by the following formula.

where \( {b}_a=\left(1-a\right)\left(\frac{1}{a\,{f}_{\max }}-\frac{T}{2}\right) \), and Ei(·) represents the exponential integral.

The final output is the time-frequency composite weighted image signal \( {W}_{uu}\left(a,b\right) \). Therefore, compared with traditional time-domain methods, the extraction of image features can be better realized by the time-frequency composite weighting algorithm.

2.3 Application of deep convolution neural network in image classification

After obtaining the feature vectors from the image, the image can be described as a vector of fixed length, and then a classifier is needed to classify the feature vectors.

In general, a common convolutional neural network consists of an input layer, convolution layers, activation layers, pooling layers, fully connected layers, and a final output layer, from input to output. The convolutional layers establish the relationships between different computational neural nodes and transfer the input information layer by layer, and the successive convolution–pooling structure decodes, deduces, converges, and maps the feature signals of the original data to a hidden feature space [ 28 ]. The subsequent fully connected layers then perform classification and produce the output according to the extracted features.
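
The layer stack described above can be made concrete with a short PyTorch sketch; the layer counts and sizes below are illustrative assumptions for 28 × 28 grayscale inputs such as MNIST, not the exact architecture used in this paper.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Input -> (conv -> activation -> pool) x 2 -> fully connected -> output scores."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # convolution layer
            nn.ReLU(),                                   # activation layer
            nn.MaxPool2d(2),                             # pooling layer: 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(64 * 7 * 7, num_classes)  # fully connected output layer

    def forward(self, x):
        x = self.features(x)       # map the raw pixels to the hidden feature space
        x = x.flatten(1)           # fixed-length feature vector
        return self.classifier(x)  # class scores, typically fed to Softmax/cross-entropy

model = SimpleCNN()
print(model(torch.randn(8, 1, 28, 28)).shape)  # torch.Size([8, 10])
```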

2.3.1 Convolution neural network

Convolution is an important analytical operation in mathematics. It is a mathematical operator that generates a third function from two functions f and g , representing the amount of overlap between f and a flipped and translated version of g . For discrete signals it is usually defined by the following formula: \( \left(f\ast g\right)(n)=\sum \limits_{\tau }f\left(\tau \right)g\left(n-\tau \right) \)

Its integral form is the following: \( \left(f\ast g\right)(t)={\int}_{-\infty}^{\infty }f\left(\tau \right)g\left(t-\tau \right)\, d\tau \)

In image processing, a digital image can be regarded as a discrete function over a two-dimensional space, denoted as f ( x , y ). Assuming the existence of a two-dimensional convolution kernel function g ( x , y ), the output image z ( x , y ) can be represented by the following formula: \( z\left(x,y\right)=f\left(x,y\right)\ast g\left(x,y\right) \)

In this way, the convolution operation can be used to extract image features. Similarly, in deep learning applications, when the input is a color image containing the three RGB channels and composed of individual pixels, the input is a high-dimensional array of size 3 × image width × image height; accordingly, the kernel (called the "convolution kernel" in a convolutional neural network), defined in the learning algorithm as the computational parameter, is also a high-dimensional array. For a two-dimensional image input, the corresponding convolution operation can be expressed by the following formula: \( z\left(x,y\right)=\sum \limits_u\sum \limits_v f\left(u,v\right)g\left(x-u,y-v\right) \)

The integral form is the following: \( z\left(x,y\right)=\iint f\left(u,v\right)g\left(x-u,y-v\right)\, du\, dv \)

If a convolution kernel of size m  ×  n is given, this becomes \( z\left(x,y\right)=\sum \limits_{u=1}^m\sum \limits_{v=1}^n f\left(x-u,y-v\right)g\left(u,v\right) \)

where f represents the input image and g denotes the convolution kernel of size m  ×  n . In a computer, convolution is usually realized as a matrix product. Suppose the size of an image is M × M and the size of the convolution kernel is n  ×  n . During computation, the convolution kernel is multiplied with each n  ×  n region of the image, which is equivalent to extracting each n  ×  n image region and expressing it as a column vector of length n  ×  n . Without padding and with a stride of 1, a total of ( M  −  n  + 1)  ∗  ( M  −  n  + 1) positions are obtained; when each of these small image regions is represented as a column vector of length n  ×  n , the original image can be represented by a matrix of size ( n ∗ n ) × [( M  −  n  + 1)  ∗  ( M  −  n  + 1)]. Assuming that the number of convolution kernels is K , the output of the convolution operation applied to the original image is of size K ∗ ( M  −  n  + 1)  ∗  ( M  −  n  + 1), i.e., the number of convolution kernels × the width of the image after convolution × the height of the image after convolution.
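
The matrix-product view of convolution described above (often called im2col) can be sketched as follows; the function names are illustrative, and the code assumes a single-channel M × M image, an n × n kernel, no padding, and stride 1, producing the (M − n + 1) × (M − n + 1) output discussed in the text. As in most CNN frameworks, the kernel is not flipped (cross-correlation).

```python
import numpy as np

def im2col(image, n):
    """Unfold every n x n patch of a 2-D image into a column of length n*n."""
    M = image.shape[0]
    out = M - n + 1                      # (M - n + 1) positions per axis: stride 1, no padding
    cols = np.empty((n * n, out * out))
    k = 0
    for i in range(out):
        for j in range(out):
            cols[:, k] = image[i:i + n, j:j + n].ravel()
            k += 1
    return cols                          # shape (n*n, (M - n + 1)^2)

def conv2d_as_matmul(image, kernels):
    """Convolution of one image with K kernels expressed as a single matrix product."""
    K, n, _ = kernels.shape
    cols = im2col(image, n)              # (n*n, (M - n + 1)^2)
    W = kernels.reshape(K, n * n)        # each kernel flattened into a row vector
    out = W @ cols                       # (K, (M - n + 1)^2)
    side = image.shape[0] - n + 1
    return out.reshape(K, side, side)    # K feature maps of size (M - n + 1) x (M - n + 1)

image = np.random.rand(28, 28)
kernels = np.random.rand(8, 3, 3)
print(conv2d_as_matmul(image, kernels).shape)  # (8, 26, 26)
```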

2.3.2 M3CE-constructed loss function

In the process of neural network training, the loss function is the evaluation standard for the whole network model. It not only reflects the current state of the network parameters but also provides the gradients of the parameters for the gradient descent method, so the loss function is an important part of deep learning training. In this paper, we introduce the loss function constructed by M3CE. Finally, the combined loss function of M3CE and cross-entropy is obtained by gradient analysis.

According to the definition of the minimum classification error (MCE) criterion, we use the output of the Softmax function as the discriminant function. The misclassification measure is then redefined as \( {d}_k(z)={P}_q-{P}_k \),

where k is the label of the sample and \( q=\arg \max_{l\ne k}{P}_l \) denotes the most confusing class in the output of the Softmax function. If we use the logistic loss function, we can derive the gradient of the loss function with respect to z .

This gradient is used in the backpropagation algorithm to obtain the gradient of the entire network. It is worth noting that if z is severely misclassified, ℓ k will be infinitely close to 1 and ℓ k (1 − ℓ k ) will be close to 0. The gradient will then be close to 0, so almost no gradient is propagated back to the previous layers, which is not good for the training process [ 29 ].

The sigmoid function is used as the activation function in traditional neural networks, and the same problem arises during training: when the activation value is large, the backpropagated gradient is very small, which is called saturation. In shallow neural networks the influence of saturation was not very large, but as the number of network layers increases, this situation affects the learning of the whole network. In particular, if a saturated sigmoid function appears at a higher layer, it affects all of the lower-layer gradients. Therefore, in current deep neural networks, an unsaturated activation function, the rectified linear unit (ReLU), is used to replace the sigmoid function. When the input value is positive, the gradient of the rectified linear unit is 1, so the gradient of the upper layer can be propagated back to the lower layer without attenuation. The literature shows that rectified linear units can accelerate the training process and prevent gradient vanishing.

A saturated activation function in the middle of the network is thus not conducive to training a deep network; similarly, saturation in the top-level loss function also has a great influence on the deep neural network.

We call it the max-margin loss, where the margin is defined as \( {\epsilon}_k=-{d}_k(z)={P}_k-{P}_q \).

Since P k is a probability, that is, P k  ∈ [0, 1], we have d k  ∈ [−1, 1]. As a sample gradually changes from correctly classified to misclassified, d k increases from −1 to 1; compared with the original logistic loss function, even if the sample is seriously misclassified, the loss function still yields at most its maximum loss value. Because 1 +  d k  ≥ 0, the expression can be simplified.

When we need to assign a larger loss value to misclassified samples, the above formula can be extended to \( {L}_k={\left(1+{d}_k(z)\right)}^{\gamma } \),

where γ is a positive integer. If γ  = 2, we obtain the squared max-margin loss function. If this function is to be used for training deep neural networks, its gradient needs to be calculated according to the chain rule.
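
The following PyTorch sketch shows one way such a max-margin loss on Softmax outputs could be computed, assuming the simplified form \( L_k = (1 + d_k)^{\gamma} \) with \( d_k = P_q - P_k \) suggested by the definitions above; it is an illustrative interpretation, not the authors' exact M3CE-CEc implementation.

```python
import torch
import torch.nn.functional as F

def max_margin_loss(logits, labels, gamma=2):
    """Illustrative max-margin loss: L = (1 + d_k)^gamma with d_k = P_q - P_k."""
    probs = F.softmax(logits, dim=1)                           # P_l for every class l
    one_hot = F.one_hot(labels, num_classes=probs.size(1)).bool()
    p_k = probs[one_hot]                                       # probability of the true class k
    p_q = probs.masked_fill(one_hot, -1.0).max(dim=1).values   # most confusing class q != k
    d_k = p_q - p_k                                            # misclassification measure in [-1, 1]
    return ((1.0 + d_k) ** gamma).mean()

logits = torch.randn(16, 10, requires_grad=True)
labels = torch.randint(0, 10, (16,))
loss = max_margin_loss(logits, labels)
loss.backward()                                                # gradient flows via the chain rule
print(float(loss))
```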

Here, we need to discuss three cases: (1) when the dimension corresponds to the sample label, (2) when the dimension corresponds to the most confusing class label, and (3) when the dimension corresponds to neither the sample label nor the confusing class label. The following conclusions have been drawn:

3 Experimental results

3.1 Experimental platform and data preprocessing

The MNIST (Modified National Institute of Standards and Technology) database is a standard database in machine learning. It consists of ten classes of handwritten digit grayscale images, comprising 60,000 training images and 10,000 test images with a resolution of 28 × 28.

In this paper, we mainly use ZCA whitening to process the image data, after steps such as reading the data into arrays and reshaping them to the required size (Figs.  1 , 2 , 3 , 4 , and 5 ). The images of the dataset are normalized and whitened, respectively, so that all pixels have the same mean and variance; this suppresses the white-noise component of the images and removes the correlation between neighboring pixels.

figure 1

ZCA whitening flow chart

figure 2

Sample selection of different fonts and different colors

figure 3

Comparison of image feature extraction

figure 4

Image classification and modeling based on deep convolution neural network

figure 5

Comparison of recognition rates among different species

At the same time, a common way to improve the results of image training is to apply random distortion, cropping, or sharpening to the training inputs. This has the advantage of extending the effective size of the training data through all possible variations of the same image, and it tends to help the network learn to deal with the distortions that will occur when the classifier is used in practice. Therefore, when the training results are abnormal, the images are deformed randomly to avoid the large interference that individual abnormal images would otherwise cause to the whole model.
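
A minimal torchvision sketch of such random distortions (cropping, affine distortion, sharpness changes) is shown below; the specific transform parameters and the MNIST normalization constants are illustrative assumptions rather than the settings used in the paper.

```python
from torchvision import transforms

# Random distortions applied on the fly: every epoch the network sees a slightly
# different variant of each image, which enlarges the effective training set.
train_transform = transforms.Compose([
    transforms.RandomCrop(28, padding=2),                         # random cropping
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),    # small random distortion
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.5),  # random sharpening
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),                   # commonly used MNIST mean/std
])

# Usage (downloads MNIST on first run):
# from torchvision import datasets
# train_set = datasets.MNIST("data", train=True, download=True, transform=train_transform)
```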

3.2 Build a training network

Classification algorithms form a relatively large class of algorithms, and image classification algorithms are among them. Common classification algorithms include the support vector machine, the k -nearest neighbor algorithm, the random forest, and so on. In image classification, the support vector machine (SVM), which is based on maximizing the margin, is the most widely used classification algorithm, especially the SVM with kernel techniques. The SVM is based on VC-dimension theory and structural risk minimization; its main purpose is to find the optimal classification hyperplane in a high-dimensional space so that the classification margin is maximized and the classification error rate is minimized. However, it is better suited to cases where the feature dimension of the image is small and the amount of data after feature extraction is large.

Another commonly used object recognition approach is the deep learning model, which describes the image by hierarchical feature representations. Mainstream deep learning networks include the restricted Boltzmann machine, the deep belief network, the autoencoder, the convolutional neural network, biologically inspired models, and so on. We tested the proposed M3CE-CEc loss and designed different convolutional neural networks for different datasets. The experimental settings are as follows: the weight parameters are initialized randomly, the bias parameters are set as constants, the basic learning rate is set to 0.01, and the momentum term is set to 0.9. During training, when the error rate stops decreasing, the learning rate is multiplied by 0.1.
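
These settings translate directly into an optimizer and learning-rate schedule; the PyTorch sketch below is illustrative (the stand-in network, the patience value, and the placeholder validation error are assumptions, not the authors' training script).

```python
import torch
import torch.nn as nn

# Small stand-in network; any of the CNNs designed for the different datasets could be used.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 14 * 14, 10),
)

# Random weight initialization, constant bias initialization
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, std=0.01)
        nn.init.constant_(m.bias, 0.0)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # base lr 0.01, momentum 0.9
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3                       # lr x 0.1 when the error plateaus
)

for epoch in range(30):
    # ... run one training epoch over the data loader here ...
    val_error = 0.05                 # placeholder: validation error rate measured after the epoch
    scheduler.step(val_error)
```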

3.3 Image classification and modeling based on deep convolution neural network

The following is a model for image classification based on deep convolution neural networks.

Input: Input is a collection of N images; each image label is one of the K classification tags. This set is called the training set.

Learning: The task of this step is to use the training set to learn exactly what each class looks like. This step is generally called a training classifier or learning a model.

Evaluation: The classifier is used to predict the classification labels of images it has not seen before, so as to evaluate the quality of the classifier. We compare the labels predicted by the classifier with the true labels of the images; a prediction is correct when they agree, and the more such cases, the better.

3.4 Evaluation index

In this paper, the image recognition performance is evaluated in three parts: the overall classification accuracy, the classification accuracy of the different categories, and the classification time. The classification accuracy includes the accuracy of the overall classification and the accuracy of each category. Assuming that n ij represents the number of images of category i that are classified into category j , the overall classification accuracy is as follows: \( A=\frac{\sum_i{n}_{ii}}{\sum_i\sum_j{n}_{ij}} \)

The accuracy of each category is as follows: \( {A}_i=\frac{n_{ii}}{\sum_j{n}_{ij}} \)

The run time is the average time from reading an image to obtaining its classification result.
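
A small NumPy sketch of these evaluation indices, computed from a confusion matrix \( n_{ij} \) (rows: true category i , columns: predicted category j ); the matrix values are an invented example for illustration.

```python
import numpy as np

def overall_accuracy(n):
    """Overall accuracy: correctly classified images divided by all images."""
    return np.trace(n) / n.sum()

def per_class_accuracy(n):
    """Accuracy of each category i: n_ii / sum_j n_ij."""
    return np.diag(n) / n.sum(axis=1)

# Example 3-class confusion matrix n_ij
n = np.array([[50,  2,  3],
              [ 4, 45,  1],
              [ 6,  0, 44]])
print(overall_accuracy(n))    # ~0.897
print(per_class_accuracy(n))  # per-category accuracies
```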

4 Discussion

4.1 Comparison of classification effects of different loss functions

We compare the plot of the traditional logistic loss function with that of our proposed max-margin loss function. It can be clearly seen that the value of the loss function increases with the severity of the misclassification, which indicates that the loss function can effectively express the degree of classification error.

4.2 Comparison of recognition rates between the same species

4.3 Comparison of recognition rates among different species

As can be seen from the table below, the recognition rate of this method is broadly similar across different species, reaching a level of more than 80%; the accuracy of the method is relatively high when classifying clearly defined images such as cars. This may be because clearly defined images have greater advantages in feature extraction.

4.4 Time-consuming comparison of SVM, KNN, BP, and CNN methods

On the premise that features are extracted using the same M3CE-constructed loss function, the choice of classifier is the key factor affecting classification accuracy. Therefore, this part of the paper examines the influence of different classifiers on classification accuracy (Table  1 ). The table summarizes the influence of some common classifiers on classification accuracy. These classifiers include the linear-kernel support vector machine (SVM-Linear), the Gaussian-kernel support vector machine (SVM-RBF), Naive Bayes (NB), k -nearest neighbors (KNN), random forest (RF), decision tree (DT), and the gradient boosting decision tree (GBDT).
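
A scikit-learn sketch of such a classifier comparison on a shared feature matrix is given below; the classifiers use default hyperparameters, and the bundled digits dataset stands in for the extracted feature vectors, so all of these choices are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)          # stand-in for the extracted feature vectors
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "SVM-Linear": SVC(kernel="linear"),
    "SVM-RBF": SVC(kernel="rbf"),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(),
    "GBDT": GradientBoostingClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: test accuracy = {acc:.4f}")
```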

The experimental results show that the accuracy of the CNN classifier is higher than that of the other classifiers on both the training set and the test set. Although DT is the fastest of the compared classifiers, its accuracy on the test set is only 69.47%, which is unacceptable. The following conclusions can be drawn from the classifier comparison experiment: compared with the other six common classifiers, the CNN has the highest accuracy, and its time cost of 6 s is acceptable among the seven compared classifiers.

First, because each test image needs to be compared with all of the stored training images, such a classifier takes up a lot of storage space, consumes a lot of computing resources, and requires a long computation time, and in practice we care about test efficiency far more than training efficiency. The convolutional neural network discussed in this paper takes the other extreme in this trade-off: although training takes a lot of time, once training is completed, the classification of new test data is very fast. Such a model meets the requirements of practical use.

5 Conclusions

Deep convolutional neural networks can recognize images that are invariant to scaling, translation, and other forms of distortion. To avoid explicit feature extraction, the convolutional network uses its feature detection layers to learn implicitly from the training data, and because of the weight-sharing mechanism, neurons on the same feature map share the same weights. The network can therefore extract features in parallel, and its number of parameters and computational complexity are clearly smaller than those of a traditional neural network; its layout is also closer to that of an actual biological neural network. Weight sharing greatly reduces the complexity of the network structure. In particular, when the input is a multi-dimensional image, the network effectively avoids the complexity of data reconstruction during feature extraction and image classification. Deep convolutional neural networks have incomparable advantages in image feature representation and classification. However, many researchers still regard the deep convolutional neural network as a black-box feature extraction model. Further research is needed to explore the connection between each layer of the deep convolutional neural network and the visual nervous system of the human brain, and to investigate how deep neural networks can learn incrementally, as humans do, to compensate for earlier learning and to deepen their understanding of the details of target objects.

Abbreviations

ANN: Artificial neural network

BP: Backpropagation

NB-CNN: Convolutional neural network and Naive Bayes

CNN: Convolutional neural network

MLP: Multilayer perceptron

ODI: Omnidirectional image

VFSR: Very fine spatial resolution

VR: Virtual reality

E. Newman, M. Kilmer, L. Horesh, Image classification using local tensor singular value decompositions (IEEE, international workshop on computational advances in multi-sensor adaptive processing. IEEE, Willemstad, 2018), pp. 1–5.


X. Wang, C. Chen, Y. Cheng, et al., Zero-shot image classification based on deep feature extraction. IEEE Transactions on Cognitive and Developmental Systems 10 (2) (2018).

A.A.M. Al-Saffar, H. Tao, M.A. Talab, Review of deep convolution neural network in image classification (International conference on radar, antenna, microwave, electronics, and telecommunications. IEEE, Jakarta, 2018), pp. 26–31.

A.B. Said, I. Jemel, R. Ejbali, et al., A hybrid approach for image classification based on sparse coding and wavelet decomposition (Ieee/acs, international conference on computer systems and applications. IEEE, Hammamet, 2018), pp. 63–68.

Huang G, Chen D, Li T, et al. Multi-Scale Dense Networks for Resource Efficient Image Classification. 2018.

V. Gupta, A. Bhavsar, Feature importance for human epithelial (HEp-2) cell image classification. J Imaging. 4 (3), 46 (2018).


L. Yang, A.M. Maceachren, P. Mitra, et al., Visually-enabled active deep learning for (geo) text and image classification: a review. ISPRS Int. J. Geo-Inf. 7 (2), 65 (2018).

D.A. Chanti, A. Caplier, Improving bag-of-visual-words towards effective facial expressive image classification (VISIGRAPP, International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2018).

X. Long, H. Lu, Y. Peng, X. Wang, S. Feng, Image classification based on improved VLAD. Multimedia Tools Appl. 75 (10), 5533–5555 (2016).

B. Kieffer, M. Babaie, S. Kalra, et al., Convolutional neural networks for histopathology image classification: training vs. using pre-trained networks (International conference on image processing theory. IEEE, Montreal, 2018), pp. 1–6.

J. Zhao, T. Fan, L. Lü, H. Sun, J. Wang, Adaptive intelligent single particle optimizer based image de-noising in shearlet domain. Intelligent Automation & Soft Computing 23 (4), 661–666 (2017).

L. Mou, P. Ghamisi, X.X. Zhu, Unsupervised spectral-spatial feature learning via deep residual conv-deconv network for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing (99), 1–16 (2018).

Newman E, Kilmer M, Horesh L. Image classification using local tensor singular value decompositions IEEE, international workshop on computational advances in multi-sensor adaptive processing. IEEE, 2018:1–5.

S.A. Quadri, O. Sidek, Quantification of biofilm on flooring surface using image classification technique. Neural Comput. & Applic. 24 (7–8), 1815–1821 (2014).

X.-C. Yin, Q. Liu, H.-W. Hao, Z.-B. Wang, K. Huang, FMI image based rock structure classification using classifier combination. Neural Comput. & Applic. 20 (7), 955–963 (2011).

Z. Yan, V. Jagadeesh, D. Decoste, et al., HD-CNN: hierarchical deep convolutional neural network for image classification. Eprint Arxiv 4321-4329 (2014).

C. Zhang, X. Pan, H. Li, et al., A hybrid MLP-CNN classifier for very fine resolution remotely sensed image classification. Isprs Journal of Photogrammetry & Remote Sensing 140 , 133–144 (2018).

Chaib S, Yao H, Gu Y, et al. Deep feature extraction and combination for remote sensing image classification based on pre-trained CNN models. International Conference on Digital Image Processing. 2017:104203D.

S. Roychowdhury, J. Ren, Non-deep CNN for multi-modal image classification and feature learning: an azure-based model (IEEE international conference on big data. IEEE, Washington, D.C., 2017), pp. 2893–2812.

M.Z. Afzal, A. Kölsch, S. Ahmed, et al., Cutting the error by half: investigation of very deep CNN and advanced training strategies for document image classification (Iapr international conference on document analysis and recognition. IEEE computer Society, Kyoto, 2017), pp. 883–888.

X. Fu, L. Li, K. Mao, et al., in Chinese High Technology Letters . Remote sensing image classification based on CNN model (2017).

Sachin R, Sowmya V, Govind D, et al. Dependency of various color and intensity planes on CNN based image classification. International Symposium on Signal Processing and Intelligent Recognition Systems. Springer, Cham, Manipal, 2017:167–177.

Y. Shima, Image augmentation for object image classification based on combination of pre-trained CNN and SVM (International Conference on Informatics, Electronics and Vision & International Symposium in Computational Medical and Health Technology, 2018), pp. 1–6.

J.Y. Lee, J.W. Lim, E.J. Koh, A study of image classification using HMC method applying CNN ensemble in the infrared image. Journal of Electrical Engineering & Technology 13 (3), 1377–1382 (2018).

Zhang C, Pan X, Zhang S Q, et al. A rough set decision tree based Mlp-Cnn for very high resolution remotely sensed image classification. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2017:1451–1454.

M. Kumar, Y.H. Mao, Y.H. Wang, T.R. Qiu, C. Yang, W.P. Zhang, Fuzzy theoretic approach to signals and systems: Static systems. Inf. Sci. 418 , 668–702 (2017).

W. Zhang, J. Yang, Y. Fang, H. Chen, Y. Mao, M. Kumar, Analytical fuzzy approach to biological data analysis. Saudi J Biol Sci. 24 (3), 563–573 (2017).

Z. Sun, F. Li, H. Huang, Large scale image classification based on CNN and parallel SVM. International conference on neural information processing (Springer, Cham, Manipal, 2017), pp. 545–555.

Sachin R, Sowmya V, Govind D, et al. Dependency of various color and intensity planes on CNN based image classification. 2017.


Acknowledgements

The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

About the author

Xin Mingyuan was born in Heihe, Heilongjiang, P.R. China, in 1983. She received her Master's degree from Harbin University of Science and Technology, P.R. China. She now works in the School of Computer and Information Engineering, Heihe University. Her research interests include artificial intelligence, data mining, and information security.

Wang Yong was born in Suihua, Heilongjiang, P.R. China, in 1979. She received her Master's degree from Qiqihar University, P.R. China. She now works at Heihe University. Her research interests include artificial intelligence and education information management.

This work was supported by the University Nursing Program for Young Scholars with Creative Talents in Heilongjiang Province (No. UNPYSCT-2017104) and by the basic scientific research fund for provincial higher education institutions of the Heilongjiang Provincial Department of Education (No. 2017-KYYWF-0353).

Availability of data and materials

Please contact the author for data requests.

Author information

Authors and affiliations

School of Computer and Information Engineering, Heihe University, No. 1 Xueyuan Road education science and technology zone, Heihe, Heilongjiang, China

Mingyuan Xin

Heihe University, No. 1 Xueyuan Road education science and technology zone, Heihe, Heilongjiang, China


Contributions

All authors take part in the discussion of the work described in this paper. XM wrote the first version of the paper. XM and WY did part experiments of the paper. XM revised the paper in a different version of the paper, respectively. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yong Wang .

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article.

Xin, M., Wang, Y. Research on image classification model based on deep convolution neural network. J Image Video Proc. 2019 , 40 (2019). https://doi.org/10.1186/s13640-019-0417-8


Received : 17 October 2018

Accepted : 07 January 2019

Published : 11 February 2019

DOI : https://doi.org/10.1186/s13640-019-0417-8


Keywords

  • Convolution neural network
  • Image classification



A Review of Image Processing Techniques for Deepfakes

Hina Fatima Shahzad

1 Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan; moc.liamg@dazhahsamitafanih

Furqan Rustam

2 Department of Software Engineering, University of Management and Technology, Lahore 544700, Pakistan; [email protected]

Emmanuel Soriano Flores

3 Higher Polytechnic School, Universidad Europea del Atlántico (UNEATLANTICO), Isabel Torres 21, 39011 Santander, Spain; [email protected] (E.S.F.); [email protected] (J.L.V.M.)

4 Department of Project Management, Universidad Internacional Iberoamericana (UNINI-MX), Campeche 24560, Mexico

Juan Luís Vidal Mazón

5 Project Department, Universidade Internacional do Cuanza, Municipio do Kuito, Bairro Sede, EN250, Bié, Angola

Isabel de la Torre Diez

6 Department of Signal Theory and Communications and Telematic Engineering, University of Valladolid, Paseo de Belén 15, 47011 Valladolid, Spain

Imran Ashraf

7 Department of Information and Communication Engineering, Yeungnam University, Gyeongsan 38541, Korea

Associated Data

Not applicable.

Deep learning is used to address a wide range of challenging issues including large data analysis, image processing, object detection, and autonomous control. In the same way, deep learning techniques are also used to develop software and techniques that pose a danger to privacy, democracy, and national security. Fake content in the form of images and videos created through digital manipulation with artificial intelligence (AI) approaches has become widespread during the past few years. Deepfakes, in the form of audio, images, and videos, have consequently become a major concern. Complemented by artificial intelligence, deepfakes swap the face of one person with that of another and generate hyper-realistic videos. Amplified by the speed of social media, deepfakes can immediately reach millions of people and can be very dangerous for producing fake news, hoaxes, and fraud. Besides well-known movie stars, politicians have been victims of deepfakes in the past, especially US presidents Barack Obama and Donald Trump; however, the public at large can also be the target of deepfakes. To overcome the challenge of deepfake identification and mitigate its impact, large efforts have been made to devise novel methods to detect face manipulation. This study also discusses how to counter the threats from deepfake technology and alleviate its impact. The outcomes suggest that although deepfakes are a serious threat to society, business, and political institutions, they can be combated through appropriate policies, regulation, individual actions, training, and education. In addition, the evolution of technology is needed for deepfake identification, content authentication, and deepfake prevention. Different studies have performed deepfake detection using machine learning and deep learning techniques such as the support vector machine, random forest, multilayer perceptron, k-nearest neighbors, and convolutional neural networks with and without long short-term memory, among other similar models. This study aims to highlight recent research in deepfake image and video detection, including deepfake creation, various detection algorithms on self-made datasets, and existing benchmark datasets.

1. Introduction

Fake content in the form of images and videos using digital manipulation with artificial intelligence (AI) approaches has become widespread during the past few years. Deepfakes, a hybrid form of deep learning and fake material, include swapping the face of one human with that of a targeted person in a picture or video and making content to mislead people into believing the targeted person has said words that were said by another person. Facial expression modification or face-swapping on images and videos is known as deepfake [ 1 ]. In particular, deepfake videos where the face of one person is swapped with the face of another person have been regarded as a public concern and threat. Rapidly growing advanced technology has made it simple to make very realistic videos and images by replacing faces that make it very hard to find the manipulation traces [ 2 ]. Deepfakes use AI to combine, merge, replace, or impose images or videos for making fraudulent videos that look real and authentic [ 3 ]. Face-swap apps such as ’FakeApp’ and ’DeepFaceLab’ make it easy for people to use them for malicious purposes by creating deepfakes for a variety of unethical purposes. Both privacy and national security are put at risk by these technologies as they have the potential to be exploited in cyberattacks. In the past two years, the face-swap issue has attracted a lot of attention, especially with the development of deepfake technology, which uses deep learning techniques to edit pictures and videos. Using autoencoders [ 4 ] and generative adversarial networks (GANs) [ 5 ], the deepfake algorithms may swap original faces with fake faces and generate a new video with fake faces. With the first deepfake video appearing in 2017 when a Reddit user transposed celebrities’ faces in porn videos, multiple techniques for generating and detecting deepfake videos have been developed.

Although deepfake technology can be utilized for constructive purposes such as filming and virtual reality applications, it has the potential to be used for destructive purposes [ 6 , 7 , 8 , 9 ]. Manipulation of faces in pictures or films is a serious problem that threatens global security. Faces are essential in human interactions as well as biometric-based human authentication and identity services. As a result, convincing changes in faces have the potential to undermine security applications and digital communications [ 10 ]. Deepfake technology is used to make several kinds of videos such as funny or pornographic videos of a person involving the voice and video of a person without any authorized use [ 11 , 12 ]. The dangerous aspect of deepfakes is their scale, scope, and access which allows anyone with a single computer to create bogus videos that look genuine [ 12 ]. Deepfakes may be used for a wide variety of purposes, including creating fake porn videos of famous people, disseminating fake news, impersonating politicians, and committing financial fraud [ 13 , 14 , 15 ]. Although the initial deepfakes focused on politicians, actresses, leaders, entertainers, and comedians for making porn videos [ 16 ], deepfakes pose a real threat concerning their use for bullying, revenge porn, terrorist propaganda, blackmail, misleading information, and market manipulation [ 3 ].

Thanks to the growing use of social media platforms such as Instagram and Twitter, plus the availability of high-tech camera mobile phones, it has become easier to create and share videos and photos. As digital video recording and uploading have become increasingly easy, digitally changed media can have significant consequences depending on the information being changed. With deepfakes, one may create hyper-realistic videos and fake pictures by using the advanced techniques from deep learning technology. Accompanying the wide use of social media, deepfakes can immediately reach millions of people and can be very dangerous to make fake news, hoaxes, and fraud [ 17 , 18 ]. Fake news contains bogus material that is presented in a news style to deceive people [ 19 , 20 ]. Bogus and misleading information spreads quickly and widely through social media channels, with the potential to affect millions of people [ 21 ]. Research shows that one in every five internet users receives news via Facebook, second only to YouTube [ 22 ]. This rise in popularity and reputation of video necessitates the initiation of proper tools to authenticate media channels and news. Considering the easy access and availability of tools to disseminate false and misleading information using social media platforms, determining the authenticity of the content is becoming difficult day by day [ 22 ]. Current issues are attributed to digital misleading information, also called disinformation, and represent information warfare where fake content is presented deliberately to alter people’s opinions [ 17 , 22 , 23 , 24 ].

1.1. Existing Surveys

Several survey and review papers have been presented on deepfake detection ( Table 1 ). For example, ref. [ 25 ] presents a survey of deepfake creation and detection techniques. Potential trends of deepfake techniques, challenges, and future directions are discussed. The survey does not include a systematic review and contains papers for the period of 2017 to 2019 only and lacks recent papers in deepfake technology. Similarly, ref. [ 26 ] provides a survey of the deepfake approaches with respect to the type of swap used to generate deepfakes. The study discusses the papers utilizing face synthesis, identity swap, attribute manipulation, and expression swap. Another study [ 27 ] presents a systematic literature review that covers the research works related to the evolution of deepfakes in terms of face synthesis, attributes manipulation, identity swap, and expression swap. The approaches are discussed with respect to the mathematical models and the types of signals used for creating and detecting deepfakes. In addition, several current datasets are discussed that are used for testing deepfake creation and detection approaches.

A comparative analysis of review/survey papers on deepfakes.

Similarly, ref. [ 28 ] provides a survey of deepfake creation and detection techniques with a focus on the architecture of various networks used for this purpose. For example, detailed discussions on the capability of various deep learning networks are provided, along with their architectures used in different studies. The authors provide a survey of tools and algorithms used for deepfake creation and detection in [ 29 ]. A brief discussion on deepfake challenges, advances, and strategies is provided, however, the number of discussed studies is very small. The current study presents a systematic literature review (SLR) of deepfake creation and detection techniques and covers images and video similar to other surveys and, additionally, includes the studies related to deepfake tweets.

1.2. Contributions of Study

While spreading fake and misleading information is easier, its identification and correction are becoming harder [ 30 ]. For amending its impact and fighting against deepfakes, it is necessary to understand deepfakes, the technology behind them, and the potential tools and methods to identify and prevent the wide spread of deepfake videos. In this regard, this study makes the following contributions:

  • A brief overview of the process involved in creating deepfake videos is provided.
  • Deepfake content is discussed with respect to different categories such as video, images, and audio, as well as fake content provided in tweets. The process involved in generating these deepfakes is discussed meticulously.
  • A comprehensive review of the methods presented to detect deepfakes is discussed with respect to each kind of deepfake.
  • Challenges associated with deepfake detection and future research directions are outlined.

The rest of the paper is structured in three sections. Section 2 presents the survey methodology, while the process of deepfake creation is described in Section 3 . Section 4 discusses the types of deepfakes and the potential methods to detect such deepfakes. Discussions and future directions are provided in Section 5 and Section 6 while the conclusions are given at the end.

2. Survey Methodology

The first and foremost step in conducting a review is searching and selecting the most appropriate research papers. For this paper, both relevant and recent research papers need to be selected. In this regard, this study searches the deepfake literature from Web of Science (WoS) and Google Scholar which are prominent scientific research databases. Figure 1 shows the methodology used for research article search and selection.

Figure 1

Research article search and selection methodology.

The goal of this systematic review and meta-analysis is to analyze the advancement of deepfake detection techniques. The study also examines the deepfake creation methods. The study is carried out following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) recommendations. A systematic review helps scholars to gain a thorough understanding of a certain research area and provides future insights [ 31 ]. It is also known for its structured method for research synthesis due to its methodological process and identification metrics in identifying relevant studies when compared to conventional approaches [ 32 ]. This makes it a valuable asset not only for researchers but also for post-graduate students in developing an integrated platform for their research studies by identifying existing research gaps and the recent status of the literature [ 33 ].

2.1. PRISMA

PRISMA is a minimal set of elements for systematic reviews and meta-analyses that is based on evidence [ 34 ]. PRISMA consists of 4 phases and a checklist of 27 items [ 35 , 36 ]. PRISMA is generally designed for reporting reviews that evaluate the impact of an intervention, although it may also be utilized for a literature review with goals other than assessing approaches (e.g., evaluating etiology, prevalence, diagnosis, or prognosis). PRISMA is a tool that writers may use to enhance the way they report systematic reviews and meta-analyses [ 34 ]. It is a valuable tool for critically evaluating published systematic reviews, but it is not a quality evaluation tool for determining a systematic review’s quality.

2.2. Information Source

The literature search for this study includes research papers from peer-reviewed journals that are indexed in the Science Citation Index Expanded (SCIE), Social Science Citation Index (SSCI), and Arts & Humanities Citation Index (AHCI). The search query is developed after a review of the literature and refined to obtain the most relevant papers. The databases are chosen for their reliability and scientific rigor, and they are judged to be adequate and most appropriate for our evaluation.

2.3. Search Strategy

The search was conducted in two stages: the first in February 2021, following the conclusion of the manuscript’s primary results; and the second in May 2021, to guarantee that more updated and recent material is included. In all of the utilized databases, Boolean operators are employed for the search, and three groups of queries are used in the procedure, as follows:

Keywords to search papers related to Deepfake in different databases: ‘Deepfake’, ‘Deepfake Detection’, ‘Deepfake creation’, ‘fake videos’, ‘fake tweets’ .

2.4. Inclusion Criteria

The following inclusion criteria are used in the selection of the articles:

  • Studies that applied machine learning algorithms.
  • Studies that applied deep learning algorithms.
  • Studies that evaluated fake image detection, fake video detection, fake audio detection, and fake tweet detection.
  • Studies that used algorithms to analyze deepfakes using physiological and biological signals.

2.5. Exclusion Criteria

The following studies are excluded:

  • Studies that used any machine learning or deep learning approaches for problems that are not directly related to deepfake detection.
  • Studies that used other techniques or classic computer vision approaches and do not focus on deepfake detection.
  • Studies that did not provide a clear explanation of the machine learning or deep learning model that was used to solve their problem.
  • Review studies.

The use of the word ’fake’ resulted in many irrelevant papers such as those on fake news, fake tweets, fake articles, and fake images. Therefore, these studies are also excluded.

2.6. Study Selection

The initial number ( n = 158) is obtained by an article search. Initially, 8 articles are excluded including 3 each in Russian and Spanish languages, and 1 each in Chinese and Portuguese languages. This is followed by the elimination of duplicated articles ( n = 50). Afterward, from the resulting data, articles’ abstracts are studied to refine the selection. For this purpose, the focus is on the studies that are most relevant to deepfake detection and creation. In particular, the research articles are checked for the techniques used to detect and create deepfake data, and a total of ( n = 30) are rejected based on the selection criteria. Seventy articles are assessed in their entirety, from which 10 are discarded due to the unavailability of the full text. Sixty papers met the inclusion criteria and are subjected to data extraction.

2.7. Data Extraction

Seven researchers collected and extracted data from each article in this study to examine and highlight the major points. Every included paper is properly examined for a variety of facts that had been predefined and agreed upon by all researchers. The extracted data consist of general study features such as the study title and publication year.

2.8. Quality Assessment

To guarantee that the findings of the selected publications can contribute to this review, a quality assessment method is developed. These criteria are created using the compiled list published by [ 37 ], which includes aspects of study design, data collection, data analysis, and conclusion. To the best of our knowledge, there is no agreement on the standard criteria for assessing research quality. As a result, we used the aforementioned recommendations because they have been used in several systematic reviews and encompass all aspects required to evaluate the quality of a research publication.

2.9. Quality Assessment Results

Figure 2 shows the flowchart of the study design, indicating that this review has made a meaningful contribution. A breakdown of the selected research articles concerning the studied topic is provided in Table 2 .

Figure 2

Flow chart of paper selection methodology.

Details of the selected articles with respect to sub-topics.

3. Deepfake Creation

Figure 3 shows the original and fake faces generated by deepfake algorithms. The top row shows the original faces whereas the bottom row shows the corresponding fake faces generated by deepfake algorithms. It shows a good example of the potential of deepfake technology to generate genuine-looking pictures.

Figure 3

Examples of original and deepfake videos.

Deepfakes have grown famous due to the ease of using different applications and algorithms, especially those which are available for mobile devices. Predominantly, deep learning techniques are used in these applications, however, there are some variations as well. Data representation using deep learning is well known and has been widely used in recent times. This study discusses the use of deep learning techniques with respect to their use to generate deepfake images and videos.

3.1. FakeApp

FakeApp is an app built by a Reddit user to generate deepfakes by utilizing an autoencoder–decoder pairing structure [ 38 , 39 ]. In this technique, the features of face images are extracted by an encoder and reconstructed by a decoder. Two encoder–decoder pairs are required to swap faces between the input and output pictures: each pair is trained on a different image collection, and the encoder parameters are shared between the two networks, i.e., the two pairs use the same encoder. Faces are often similar in terms of eye, nose, and mouth locations, and this makes it easy for a common encoder to learn the similarities between two sets of face pictures. Figure  4 demonstrates the deepfake creation process, in which the features of face A are linked to decoder B to recreate face B from the original face A. It depicts the use of two encoder–decoder pairs in a deepfake creation model. For the training process, the two networks utilize the same encoder and different decoders ( Figure 4 a). The common encoder encodes an image of face A, which decoder B then decodes to generate a deepfake image ( Figure 4 b).

Figure 4. Deepfake generation process using an encoder–decoder pair [ 40 ].
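To make the shared-encoder idea concrete, the following is a minimal, illustrative PyTorch sketch of one encoder serving two identity-specific decoders. It is not the FakeApp code: the layer sizes, latent dimension, and the `swap_a_to_b` helper are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared encoder: maps a 64x64 RGB face to a latent vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Identity-specific decoder: reconstructs a face from the latent vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )
    def forward(self, z):
        return self.net(self.fc(z).view(-1, 128, 8, 8))

# One shared encoder, one decoder per identity (A and B).
encoder = Encoder()
decoder_a, decoder_b = Decoder(), Decoder()

def swap_a_to_b(face_a):
    """Encode a face of person A and decode it with B's decoder -> deepfake of B."""
    return decoder_b(encoder(face_a))
```

Because the encoder is shared, the latent code captures identity-independent face structure, and the choice of decoder determines whose appearance is rendered.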

3.2. DeepFaceLab

DeepFaceLab [ 41 ] was introduced to overcome the obscure workflows and poor performance commonly found in deepfake generation models. It is an enhanced face-swapping framework [ 41 ], as shown in Figure 5 . Its conversion phase consists of an 'encoder' and a 'destination decoder' with an 'inter' layer between them, and an 'alignment' layer at the end. The proposed LIAE approach is shown in Figure 6 . For feature extraction, the heat-map-based facial landmark algorithm 2DFAN [ 42 ] is used, while face segmentation is performed by TernausNet [ 43 ], a fine-grained face segmentation network.

Figure 5. Architecture of DeepFaceLab from [ 41 ].

Figure 6. Training and testing phases of FC-GAN [ 46 ].

Other deepfake apps such as D-Faker [ 44 ] and Deepfake [ 45 ] use a very similar method for generating deepfakes.

3.3. Face Swap-GAN

Face Swap-GAN is an enhanced version of the deepfake approach that utilizes a GAN [ 5 , 46 ]. It makes use of two kinds of losses common in deep learning: an adversarial loss and a perceptual loss [ 47 ]. The perceptual loss makes eye movements more natural and consistent and helps to smooth artifacts in the segmentation masks, resulting in higher-quality output videos. As a result, it is possible to create outputs with resolutions of 64 × 64, 128 × 128, and 256 × 256.
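A minimal sketch of how an adversarial loss and a VGG-based perceptual loss might be combined for the generator is shown below. This is not faceswap-GAN's actual code: the discriminator `D`, the tensors `fake` and `real`, the VGG16 feature depth, and the weighting factor are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Frozen VGG16 feature extractor for the perceptual loss (an illustrative choice).
vgg_features = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def generator_loss(D, fake, real, perceptual_weight=10.0):
    """Adversarial term pushes D to call `fake` real; the perceptual term matches
    deep VGG features of the swapped face and the target face."""
    logits = D(fake)
    adv = bce(logits, torch.ones_like(logits))                # adversarial loss
    perceptual = l1(vgg_features(fake), vgg_features(real))   # perceptual loss
    return adv + perceptual_weight * perceptual
```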

Additionally, the FaceNet implementation [ 48 ] introduces a multi-task convolutional neural network (CNN) to make face detection more reliable and face recognition more accurate. Generative networks are implemented with CycleGAN [ 49 ]. Table 3 describes some major deepfake techniques and their usual features.

Table 3. Brief overview of deepfake face apps.

3.4. Generative Adversarial Network

Goodfellow et al. [ 5 ] proposed the GAN to model an unknown data distribution by pushing the produced samples to be indistinguishable from actual photos. GANs are among the most effective deepfake algorithms because they create realistic images by combining two neural networks (NNs). Such models can learn from a collection of images and then synthesize new images that feel realistic to human eyes; for example, a GAN can produce realistic pictures of animals, clothing designs, or anything else it has been trained on [ 50 ]. One of the two NNs is referred to as the generator and the other as the discriminator. The generator tries to produce fake images resembling the images in the training dataset, while the discriminator evaluates the generated images for authenticity; this competition drives the generator to produce increasingly realistic images that look real to human eyes [ 5 , 50 ].
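The two-player training just described can be summarized in a compact sketch. It assumes `generator` and `discriminator` are already-defined networks (with the discriminator producing one logit per image) and `real_batch` is a batch of real images; this is a generic GAN step, not code from any of the cited papers.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def gan_step(generator, discriminator, g_opt, d_opt, real_batch, latent_dim=100):
    device = real_batch.device
    batch = real_batch.size(0)

    # 1) Train the discriminator: real images -> label 1, generated images -> label 0.
    z = torch.randn(batch, latent_dim, device=device)
    fake_batch = generator(z).detach()
    d_loss = bce(discriminator(real_batch), torch.ones(batch, 1, device=device)) + \
             bce(discriminator(fake_batch), torch.zeros(batch, 1, device=device))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator: try to make the discriminator label its output as real.
    z = torch.randn(batch, latent_dim, device=device)
    g_loss = bce(discriminator(generator(z)), torch.ones(batch, 1, device=device))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```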

3.5. Encoder/Decoder

An encoder converts data into a compact coded representation, while a decoder converts that coded representation back into an interpretable format. Encoder and decoder networks have been extensively researched in machine learning [ 51 ] and deep learning [ 52 ], especially for voice recognition [ 53 ] and computer vision tasks [ 54 , 55 ]. A study [ 54 ] used an encoder–decoder structure for semantic segmentation; the proposed technique overcomes the constraints of earlier methods based on fully convolutional networks by including deconvolutional networks and pixel-wise prediction, allowing it to identify intricate structures and handle objects of various sizes. An encoder is a network (fully connected, CNN, RNN, etc.) that receives an input and produces a feature map/vector/tensor as output; these feature vectors contain the features that represent the input. The decoder is a network (typically mirroring the encoder's structure in the opposite direction) that receives the feature vector from the encoder and returns the closest possible match to the actual input or intended output [ 56 ]. The encoders are trained together with the decoders, and no labels are required (hence the training is unsupervised). The loss function computes the difference between the real and reconstructed input, and the optimizer trains both the encoder and the decoder to reduce this reconstruction loss [ 56 ].
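The reconstruction-driven training described above can be written in a few lines; the sketch below assumes encoder and decoder modules like the illustrative ones in Section 3.1 and a data loader `faces` yielding batches of face images, with the loss and optimizer choices being assumptions for illustration.

```python
import torch
import torch.nn as nn

def train_autoencoder(encoder, decoder, faces, epochs=10, lr=1e-3):
    """Unsupervised training: no labels, only the reconstruction error is minimized."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.MSELoss()  # difference between the real and reconstructed input
    for _ in range(epochs):
        for x in faces:                      # x: batch of face images
            reconstruction = decoder(encoder(x))
            loss = criterion(reconstruction, x)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                 # updates both encoder and decoder
```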

4. Deepfake Detection

Deepfakes can compromise the privacy and security of both individuals and societies; they threaten national security and progressively harm democracies. To mitigate their impact, different methods have been introduced to identify deepfakes so that appropriate corrective actions can be taken. This section surveys deepfake detection approaches organized into two main categories ( Figure 7 ), deepfake images and deepfake videos, together with deepfake audio and deepfake tweets.

Figure 7. Types of deepfake videos and detection process.

4.1. Deepfake Video Detection

A study [ 62 ] proposed LipForensics, a method for detecting forged face videos. LipForensics focuses on high-level semantic anomalies in mouth motions, which are prevalent in many generated videos. For training, the study used the FaceForensics++ (FF++) dataset, which includes 1.8 million manipulated frames and 4000 fake videos created by two face-swapping algorithms, DeepFakes (DF) and FaceSwap (FS), and two face reenactment methods, Face2Face and NeuralTextures (NT). The DeeperForensics (DFo), FaceShifter (FSh), Celeb-DF-v2 (CDF), and DeepFake Detection Challenge (DFDC) datasets are used for testing. The study obtained 82.4%, 73.5%, 97.1%, and 97.6% accuracy on CDF, DFDC, FSh, and DFo, respectively.

A study [ 63 ] framed deepfake detection as a fine-grained classification problem and presented a novel multi-attentional deepfake detection network. The network is composed of three major components: multiple spatial attention heads that direct the network's attention to distinct local regions, a textural feature enhancement block that amplifies subtle artifacts in shallow features, and a fusion of low-level textural and high-level semantic features guided by the attention maps. The study used the FF++, DFDC, and CDF datasets, achieving 97.60% accuracy on FF++, 67.44% accuracy on CDF, and a log loss of 0.1679 on DFDC.

Along the same lines, [ 64 ] proposes the Multi-Feature Fusion Network (MFF-Net), a deepfake detection system that combines RGB features with textural information extracted by an NN and signal processing techniques. The system is composed of four major modules: (1) a feature extraction module to extract textural and frequency information, (2) a texture enhancement module to amplify subtle textural features in shallow layers, (3) an attention module, and (4) two feature fusion stages. Feature fusion first combines the textural features from the shallow RGB branch with those of the feature extraction module and then fuses the textural features with the semantic information. In the experiments, the study used the DFD, CDF, and FF++ datasets and achieved 99.73% accuracy on FF++, 92.53% on DFD, and 75.07% on CDF.

The authors of [ 65 ] propose Fake-Buster to detect face manipulation in video sequences produced by recent facial manipulation techniques. The study used NN compression techniques such as pruning and knowledge distillation to construct a lightweight system capable of swiftly processing video streams. The technique employs two networks: a face recognition network and a manipulation recognition network. The study used the DFDC dataset, which contains 119,154 training videos, 4000 validation videos, and 5000 testing videos, and achieved 93.9% accuracy. Another study [ 66 ] proposed media forensics for deepfake detection using hand-crafted features, exploring three sets of hand-crafted features and several fusion strategies. These features examine the blinking behavior, the texture of the mouth region, and the degree of texture in the picture foreground. The study used the TIMIT-DF, DFD, and CDF datasets. Evaluation results are obtained using five fusion operators: concatenation of all features (feature-level fusion), simple majority voting (decision-level fusion), and decision-level fusion weighted by accuracy when training on TIMIT-DF, DFD, or CDF. The study concludes that the hand-crafted features achieve 96% accuracy.

A lightweight 3D CNN is proposed in [ 67 ]. This framework involves the use of 3D CNNs for their outstanding learning capacity in integrating spatial features in the time dimension and employs a channel transformation (CT) module to reduce the number of parameters while learning deeper levels of the extracted features. To boost the speed of the spatial–temporal module, spatial rich model (SRM) features are also used to disclose the textures of the frames. For experiments, the study used FF++, DeepFake-TIMIT, DeepFake Detection Challenge Preview (DFDC-pre), and CDF datasets. The study achieved 99.83%, 99.28%, 99.60%, 93.98%, and 98.07% accuracy scores using FF++, TIMIT HQ, TIMIT LQ, DFDC-pre, and CDF datasets, respectively.

4.1.1. Deepfake Video Detection Using Image Processing Techniques

Several approaches detect deepfake videos based on facial landmarks, e.g., the Dlib detector [ 68 ] and the multi-task convolutional neural network (CNN) [ 69 , 70 ]. For example, a deep learning model was proposed by [ 71 ] to detect deepfake videos. The study trains a CNN on the FaceForensics dataset, which comprises 363 real and 3068 fake videos. Additionally, the model is combined with explainability approaches, including layer-wise relevance propagation (LRP) and local interpretable model-agnostic explanations (LIME), to provide clear visualizations of the image regions the model considers prominent. Firstly, the face is extracted from the images using Dlib; afterwards, the XceptionNet CNN, a conventional CNN with depth-wise separable convolutions (DWSCs), is used to extract the features, and LIME is applied to analyze the results. The model is trained with face crops at 1.3× and 2× background scales: it achieves 94.33% accuracy at 1.3× and 90.17% accuracy at 2×.
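A face-crop-then-CNN pipeline of this kind can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it uses Dlib's frontal face detector and the Keras Xception model as a stand-in feature extractor, and the margin and preprocessing choices are assumptions.

```python
import cv2
import dlib
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input

detector = dlib.get_frontal_face_detector()
backbone = Xception(weights="imagenet", include_top=False, pooling="avg")  # 2048-d features

def face_features(frame_bgr, margin=1.3):
    """Detect the largest face, crop it with a background margin, and extract CNN features."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    face = max(faces, key=lambda r: r.width() * r.height())
    cx, cy = face.center().x, face.center().y
    half = int(max(face.width(), face.height()) * margin / 2)
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    crop = frame_bgr[y0:cy + half, x0:cx + half]
    crop = cv2.cvtColor(crop, cv2.COLOR_BGR2RGB)
    crop = cv2.resize(crop, (299, 299))                # Xception input size
    crop = preprocess_input(crop.astype(np.float32)[np.newaxis])
    return backbone.predict(crop, verbose=0)[0]        # feature vector for a classifier
```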

A study [ 72 ] proposes a system for detecting deepfake videos using a support vector machine (SVM). The model is built on feature points (FPs) extracted from the video frames. Different FP extraction methods were tested, including the histogram of oriented gradients (HOG), features from accelerated segment test (FAST), binary robust independent elementary features (BRIEF), oriented FAST and rotated BRIEF (ORB), binary robust invariant scalable keypoints (BRISK), KAZE, and speeded-up robust features (SURF). The study uses the dataset from [ 73 ], which comprises 90 MP4 videos of about 30 s each; half of the videos are fake and the other half are real. The HOG-based FPs obtain 95% accuracy, whereas ORB achieves 91%, SURF 90.5%, BRISK 87%, FAST 86.5%, and KAZE 76.5%. The results show that the HOG-extracted FPs yield the highest accuracy.
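A descriptor-plus-SVM pipeline of this kind can be sketched with OpenCV and scikit-learn. The HOG window size, the RBF kernel, and the frame resizing below are assumptions chosen for illustration rather than the settings used in [ 72 ].

```python
import cv2
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

hog = cv2.HOGDescriptor()  # default OpenCV HOG parameters (64x128 window)

def hog_features(frame_bgr):
    """Resize a frame to the HOG window size and compute its descriptor."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (64, 128))
    return hog.compute(gray).ravel()

def train_detector(frames, labels):
    """frames: list of BGR frames; labels: 1 = fake, 0 = real."""
    X = np.stack([hog_features(f) for f in frames])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, np.asarray(labels))
    return clf
```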

Xin et al. [ 74 ] propose a deepfake detection system based on inconsistencies in head poses. The study exploits the fact that deepfakes are created by splicing a synthesized face region into the original picture and shows how 3D pose estimation can be used to identify manipulated videos. The study points out that the algorithms generate another person's face without altering the original expression, which leads to mismatched facial landmarks and facial features; as a result, the landmark positions of some fake faces may differ from those of real ones. Real and fake videos can therefore be distinguished by the distribution of the cosine distance between two head-orientation vectors. The study uses Dlib to detect faces and retrieve 68 facial landmarks, and OpenFace2 to build a standard 3D face model against which the differences are calculated. The system is evaluated on the UADFV dataset using an SVM classifier with a radial basis function (RBF) kernel, which achieves an area under the receiver operating characteristic curve (AUROC) of 0.89.
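The idea of classifying videos by the mismatch between two estimated head-orientation vectors can be sketched as follows. The per-frame orientation vectors are assumed to have been estimated beforehand (e.g., with a 3D pose solver), and the histogram summarization and classifier settings are illustrative assumptions, not the exact recipe of [ 74 ].

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.svm import SVC

def pose_mismatch_features(central_vecs, whole_vecs, n_bins=10):
    """Per-frame cosine distance between the two head-orientation estimates,
    summarized as a histogram so that videos of any length give a fixed-size vector."""
    dists = [cosine(c, w) for c, w in zip(central_vecs, whole_vecs)]
    hist, _ = np.histogram(dists, bins=n_bins, range=(0.0, 2.0), density=True)
    return hist

def train_head_pose_classifier(X, y):
    """X: one feature vector per video; y: 1 = deepfake, 0 = real (hypothetical data)."""
    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(np.asarray(X), np.asarray(y))
    return clf
```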

An automated deepfake video detection pipeline based on temporal awareness is proposed in [ 75 ], as shown in Figure 8 . The analysis has two stages: a CNN first extracts frame-level features, and a recurrent neural network (RNN) then detects the temporal irregularities introduced by the face-swapping procedure. The study created a dataset of 600 videos, half gathered from a variety of video-hosting websites and the other 300 randomly selected from the HOHA dataset. The performance of the model is evaluated in terms of detection accuracy using sub-sequences of n = 20, 40, and 80 frames; the CNN-LSTM achieves its highest accuracy of 97.1% with 40 and 80 frames.

Figure 8. Deepfake detection using CNN and LSTM [ 75 ].
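The two-stage CNN-plus-RNN pipeline in Figure 8 can be summarized in a short PyTorch sketch. The backbone choice (ResNet-18), hidden size, and sequence handling below are illustrative assumptions rather than the configuration used in [ 75 ].

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CnnLstmDetector(nn.Module):
    """Frame-level CNN features followed by an LSTM over the frame sequence."""
    def __init__(self, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)   # real vs. fake

    def forward(self, clips):                   # clips: (batch, frames, 3, 224, 224)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).flatten(1)   # (b*t, 512)
        feats = feats.view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])               # logits from the final hidden state
```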

Several patterns and clues may be used to exploit spatial and temporal information in deepfake videos. For example, a study [ 76 ] created FSSPOTTER to detect swapped faces in videos. The input videos are split into multiple clips, each comprising a specified number of frames, and a spatial feature extractor (SFE) creates frame-level features from each clip. Visual geometry group (VGG16) convolution layers with batch normalization are used as the backbone network to extract spatial information within frames, and a superpixel-wise binary classification unit (SPBCU) is attached to the backbone to retrieve additional features. A temporal feature aggregator (TFG) based on an LSTM detects temporal anomalies across frames. Finally, a fully connected layer followed by a softmax layer computes the probability that a clip is real or fake. For the evaluation, the study uses the FaceForensics++, Deepfake TIMIT, UADFV, and Celeb-DF datasets. FSSPOTTER achieves 91.1% accuracy on UADFV, 77.6% on Celeb-DF, 98.5% on Deepfake TIMIT HQ, 99.5% on Deepfake TIMIT LQ, and 100% on FaceForensics++. In [ 77 ], a CNN-LSTM combination is used to classify videos as fake or real. The DeepFake Detection (DFD), Celeb-DF, and DeepFake Detection Challenge (DFDC) datasets are used in the analysis, and the experiments are performed with and without transfer learning, with the XceptionNet CNN used for detection. The study combined all three datasets to make predictions; on the combined dataset, the model achieves 79.62% accuracy without transfer learning and 86.49% with transfer learning.

A YOLO-CNN-XGBoost model is presented in [ 10 ] for deepfake detection. It incorporates the you only look once (YOLO) face detector, a CNN, and extreme gradient boosting (XGBoost). The YOLO face detector extracts faces from video frames, and the InceptionResNetV2 CNN extracts facial features from the cropped faces, which are then classified by XGBoost. The study uses the CelebDF-FaceForensics++ (c23) dataset, a combination of the Celeb-DF and FaceForensics++ (c23) datasets. Accuracy, specificity, precision, recall (sensitivity), and F1 score are used as evaluation metrics. On the combined dataset, the model achieves a 90.62% area under the curve (AUC), 90.73% accuracy, 93.53% specificity, 85.39% recall (sensitivity), 87.36% precision, and an average F1 score of 86.36%.
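The feature-extractor-plus-gradient-boosting combination can be approximated as follows. The sketch assumes face crops have already been detected (e.g., by a YOLO face detector) and resized to 299 × 299 RGB, uses Keras' InceptionResNetV2 purely as an embedding network, and the XGBoost hyperparameters are illustrative, not those of [ 10 ].

```python
import numpy as np
from tensorflow.keras.applications.inception_resnet_v2 import (
    InceptionResNetV2, preprocess_input)
from xgboost import XGBClassifier

embedder = InceptionResNetV2(weights="imagenet", include_top=False, pooling="avg")

def embed_faces(face_crops):
    """face_crops: array of RGB face crops already resized to 299x299."""
    x = preprocess_input(np.asarray(face_crops, dtype=np.float32))
    return embedder.predict(x, verbose=0)          # (n, 1536) embeddings

def train_xgb_detector(face_crops, labels):
    """labels: 1 = fake, 0 = real."""
    X = embed_faces(face_crops)
    clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    clf.fit(X, np.asarray(labels))
    return clf
```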

4.1.2. Deepfake Video Detection Using Physiological Signals

Besides image processing approaches, physiological signals can also be used to detect deepfake videos. For example, a study [ 78 ] proposed a system to identify deepfake videos from mouth movements: a CNN-based deepfake detection model with mouth features (DFT-MF). Two datasets containing real and fake videos are used, the Deepfake Forensics dataset and the VidTIMIT dataset. The Deepfake Forensics dataset comprises 1203 videos (408 real and 795 fake), whereas the VidTIMIT-based dataset comprises 320 low-quality (LQ) and 320 high-quality (HQ) videos. In the preprocessing step, a Dlib classifier is utilized to detect facial landmarks; in Dlib's 68-point model, the mouth corresponds to landmark points 49–68, and the eyebrows, nose, and other facial features can likewise be located with the Dlib library. Afterward, all frames with a closed mouth are excluded by measuring the distance between the lips. The model estimates sentence length from the number of words spoken and assumes a speech rate of about 120–150 words per minute, so deepfake videos are identified from the combination of mouth movements and speech rate. The experimental results show that DFT-MF obtains 71.25% accuracy on the Deepfake Forensics dataset, while on the deepfake VidTIMIT dataset it achieves 98.7% accuracy on LQ and 73.1% accuracy on HQ videos.

A study [ 79 ] leverages a deep neural network (DNN) model to expose fraudulent face videos. It detects eye blinking in videos, a physiological signal that is poorly reproduced in synthetically created fake videos. A long-term recurrent convolutional neural network (LRCN) is evaluated alongside a CNN and an eye-aspect-ratio (EAR) method. For this purpose, an eye-blinking video dataset of 50 videos of 30 s duration was generated; 40 videos were used to train the LRCN model and 10 for testing. In terms of area under the curve (AUC), the LRCN performs best with 99%, whereas the CNN and the EAR method achieve 98% and 79%, respectively.

'DeepVision', a novel algorithm to discriminate between real and fake videos, is presented in [ 80 ] and utilizes eye blink patterns for deepfake video detection. Fast-hyperFace and EAR are used to detect the face and calculate the eye aspect ratio. The study created an eye-blinking-pattern dataset for its experiments. The eye blink count and eye blink duration are extracted to determine whether a video is real or a deepfake; experimental results show that DeepVision obtains an accuracy of 87.5%.
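Blink-based cues of this kind usually rest on the eye aspect ratio (EAR) computed from six eye landmarks. The sketch below shows the standard EAR formula and a simple threshold-based blink counter; the threshold and minimum closed-frame count are illustrative values, not those used in [ 79 ] or [ 80 ].

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: array of six (x, y) landmarks p1..p6 ordered around the eye.
    EAR = (|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|); it drops sharply when the eye closes."""
    p1, p2, p3, p4, p5, p6 = np.asarray(eye, dtype=float)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

def count_blinks(ear_sequence, threshold=0.21, min_closed_frames=2):
    """Count blinks as runs of at least `min_closed_frames` frames below the threshold."""
    blinks, closed = 0, 0
    for ear in ear_sequence:
        if ear < threshold:
            closed += 1
        else:
            if closed >= min_closed_frames:
                blinks += 1
            closed = 0
    if closed >= min_closed_frames:
        blinks += 1
    return blinks
```

An abnormally low blink count or implausible blink durations over a clip can then serve as a feature for a downstream classifier.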

Korshunov and Marcel [ 1 ] studied baseline methods based on the discrepancies between mouth movements and speech, as well as several variants of image-based systems frequently employed in biometrics, to identify deepfakes. The study found that audio and visual features can be used to build mouth-movement profiles for deepfakes. An LSTM-based RNN is utilized to recognize real and fake videos, and principal component analysis (PCA) and linear discriminant analysis (LDA) are used to reduce the dimensionality of the data blocks. For the image-based systems, the study used two detection methods: raw faces as features and image quality measurements (IQMs). For the latter, 129 features were investigated, including signal-to-noise ratio, specularity, and blurriness, with the final classification based on PCA-LDA or an SVM. On the Deepfake TIMIT database, the detection techniques based on IQM+SVM produced the best results, with an equal error rate (EER) of 3.3% on low-quality (LQ) and 8.9% on high-quality (HQ) videos.

4.1.3. Deepfake Video Detection Using Biological Signals

Biological signals have been predominantly used in the medical field to determine the physical and emotional state of people [ 81 ]. Using the features from the data indicating heart rate, galvanic skin response, electrocardiogram, etc., abnormal biological signals can be identified by experts. For medical analysis, such approaches require the use of sensors and nodes which are placed on different limbs of the human body; this is not possible for deepfake detection. Intuitively, computer experts have designed algorithms that can measure biological signals using features from the video data such as changes in color, motion, subtle head movements, etc. [ 82 ].

Besides the physiological signals gathered from videos, biological signals present a further opportunity to identify deepfakes. For example, a study [ 83 ] detected deepfakes in videos using the FakeCatcher system, a technique for detecting synthesized portrait videos proposed as a deepfake prevention solution. The method is based on the finding that biological signals extracted from face regions are preserved neither spatially nor temporally in synthetic content. Several enhancements are proposed to improve the quality of the derived photoplethysmography (PPG) signal and the reliability of the extraction process, including chrominance attributes [ 84 ], green channel components [ 85 ], optical properties [ 86 ], Kalman filters [ 87 ], and distinct face regions [ 85 , 86 , 88 ]. The study used six biological signals, GL, GR, GM, CL, CR, and CM, where GL, GR, and GM denote the green channel of the left cheek, right cheek, and mid-region, and CL, CR, and CM denote the chrominance of the left cheek, right cheek, and mid-region, respectively. Experiments were performed on three benchmark datasets, FaceForensics, FaceForensics++, and CelebDF, in addition to a newly collected Deep Fakes (DF) dataset of 142 'in-the-wild' portrait videos with a total length of about 32 min. For detection, the study trains a CNN classifier on the above features. The CNN achieves 91.07% accuracy on the DF dataset, 96% on FaceForensics, 91.59% on CelebDF, and 94.65% on FaceForensics++.
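The simplest of the signal sources listed above, the green-channel component of a cheek region, can be extracted as follows. This is only a crude illustration of the underlying idea: the ROI handling, detrending window, and spectral feature binning are assumptions and do not reproduce FakeCatcher's full pipeline.

```python
import numpy as np

def green_channel_signal(frames, roi):
    """frames: list of RGB frames; roi: (x0, y0, x1, y1) cheek region in pixel coords.
    Returns the mean green intensity per frame, a crude remote-PPG signal."""
    x0, y0, x1, y1 = roi
    signal = np.array([frame[y0:y1, x0:x1, 1].mean() for frame in frames])
    # Remove the slow illumination trend so that the pulse component dominates.
    return signal - np.convolve(signal, np.ones(15) / 15, mode="same")

def spectral_features(signal, fps=30, n_bins=16):
    """Magnitude spectrum binned into a fixed-length feature vector for a classifier."""
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fps)
    hist, _ = np.histogram(freqs, bins=n_bins, weights=spectrum)
    return hist / (hist.sum() + 1e-8)
```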

The authors of [ 89 ] investigated several biological signals for deepfake video detection, in particular eye and gaze properties in which deepfakes differ from real videos. They combined those characteristics into signatures and compared original and fake videos, computing geometric, visual, metric, temporal, and spectral variances. Using the FaceForensics++, Deep Fakes, CelebDF, and DeeperForensics datasets and a deep neural network to classify any in-the-wild video as fake or real, the approach obtains 80.0% accuracy on FaceForensics++, 88.35% on Deep Fakes (in the wild), 99.27% on CelebDF, and 92.48% on DeeperForensics.

The research described in [ 90 ] offers a method that not only distinguishes deepfakes from original videos but also identifies the generative model underpinning a given deepfake. Deep learning techniques, specifically CNNs, are used for the classification. The authors found that manipulation artifacts in biological signals can be used to detect deepfakes, and that spatial–temporal patterns in these signals can be regarded as a representative projection of the generative residuals. The method correctly detects fake videos with 97.29% accuracy and correctly identifies the source model with 93.39% accuracy.

4.2. Deepfake Image Detection

Unlike the detection of video deepfakes, which are sequences of images, deepfake image detection focuses on identifying a single image as a deepfake, as shown in Figure 9 . For example, the study described in [ 91 ] proposed a system to detect deepfake human faces. The expectation-maximization (EM) method was used to extract features from the images, and k-nearest neighbors (k-NN), SVM, and LDA algorithms were applied for classification. The deepfake images were generated with GAN-based approaches: fake pictures were created by five different GAN techniques, AttGAN, StarGAN, GDWCT, StyleGAN, and StyleGAN2, with the CelebA dataset serving as the ground truth for non-fakes. The experiments used 6005 images from AttGAN, 3369 from GDWCT, 9999 from StyleGAN, 5648 from StarGAN, and 3000 from StyleGAN2. The best accuracy, 99.81%, was achieved on StyleGAN2-generated images with a linear SVM.

Figure 9. Deepfake and original image: original image ( left ), deepfake ( right ) [ 92 ].

A comprehensive evaluation of face manipulation detection was conducted in [ 93 ] using a variety of modern detection technologies and experimental settings, including both controlled and uncontrolled scenarios. The study used four distinct deepfake image databases created with different GAN variants. The StyleGAN architecture was used for 150,000 fake faces collected online, and the public 100K-Faces database containing 80,000 synthetic faces was also used. In addition, the GANprintR approach was applied to remove GAN fingerprint information, producing the iFakeFaceDB database, an enhanced version of the previous fake databases, as shown in Figure 10 . The findings reveal an EER of 0.02% in controlled situations, which is on par with the best recent research, whereas the best fake detectors reach only a 4.5% EER on the iFakeFaceDB dataset.

Figure 10. Deepfake and GANprintR-processed deepfake: ( a ) deepfake, ( b ) deepfake after GANprintR [ 93 ].

A method for detecting fake images was developed by Dang et al. [ 94 ]. An attention mechanism was used to enhance the feature maps of the CNN model. Fake pictures were produced using the FaceApp software, which provides up to 28 distinct filters (age, color, glasses, hair, etc.), and with the StarGAN method, which provides up to 40 different filters [ 95 ]. The CNN model was evaluated on the authors' own DFFD dataset with 18,416 real and 79,960 fake pictures produced using FaceApp and StarGAN. The results were outstanding, with an EER of less than 1.0% and a 99.9% AUC.

In the same direction, Wang et al. [ 96 ] used the publicly accessible commercial Face-Aware Liquify tool from Adobe Photoshop to create manipulated faces; in addition, skilled artists modified 50 real images. As part of an Amazon Mechanical Turk (AMT) study, participants were shown fake and real photos and asked to classify them, attaining only 53.5% accuracy, which is close to chance (50%). Two automated methods were evaluated against the human study: one using dilated residual networks (DRNs) to estimate whether a face has been warped, and another using the optical flow field to localize the manipulation and reverse it. The study achieved 99.8% accuracy on automatic face synthesis manipulations and 97.4% accuracy on manual manipulations.

A study [ 97 ] used CNN models such as VGG16, VGG19, ResNet, and XceptionNet to detect fake face images. The CelebA database was used for the real images, whereas for the fake pictures two options were utilized: GAN-based machine learning (ProGAN) and manual editing with Adobe Photoshop CS6, covering alterations such as cosmetics, glasses, sunglasses, hair, and headwear. A range of picture sizes (from 32 × 32 to 256 × 256 pixels) was tested. A 99.99% accuracy was achieved in the machine-created scenario, whereas the CNN model achieved 74.9% accuracy. Another study [ 98 ] suggested detecting fake images using visual features such as eye color and missing details in the eyes, dental regions, and reflections. The machine learning algorithms logistic regression (LR) and multi-layer perceptron (MLP) were used to detect the fake faces; evaluated on a private FaceForensics database, LR achieved 86.6% accuracy and the MLP 82.3%.

A restricted Boltzmann machine (RBM) is used in [ 99 ] to detect deepfake images produced by digital retouching of facial images. By learning discriminative characteristics, each image is classified as original or fake. The authors generated two fake-image datasets from the actual ND-IIITD Retouching (ND-IIITDR) dataset (collection B) and from Celebrity Retouching (CR), a set of celebrity facial pictures retrieved from the internet. The fake pictures were created with Max25's PortraitPro Studio software, which modifies facial elements such as skin texture, skin tone, and eye coloration. The study achieved accuracies of 96.2% and 87.1% on the CR and ND-IIITD Retouching datasets, respectively.

A study [ 100 ] proposed the face X-ray, a novel image representation for identifying fraudulent face images or deepfakes. The key observation is that most current face alteration algorithms share a similar step of blending an altered face into an existing background picture, which leaves inherent image disparities across the blending boundaries. The study used the FF++ and Blending Images (BI) datasets for training and the DFD, DFDC, and CDF datasets for testing. Using a CNN, it achieved 95.40% accuracy on DFD, 80.92% on DFDC, and 80.58% on CDF.

4.3. Deepfake Audio Detection

4.3.1. Fake Audio Datasets

The Fake or Real (FoR) dataset [ 101 ], released in 2019, includes eight synthetically created English-accented voices produced with the Deep Voice 3 and Google WaveNet generation models. It is publicly available, and an important feature is that it provides sounds in two formats, MP3 and WAV. The complete dataset consists of 198,000 files, comprising 111,000 original samples and 87,000 counterfeit samples, each lasting two seconds. The Arabic Diversified Audio dataset (Ar-DAD) [ 102 ], acquired from a Holy Quran audio site, is a fake audio collection of Arabic speakers. It comprises the original and mimicked voices of Quran reciters, 30 male reciters and 12 mimics, with the reciters coming from Egypt, Sudan, Saudi Arabia, Yemen, Kuwait, and the United Arab Emirates. There are 379 fake and 15,810 real samples in the data, and each sample has a 10 s duration.

The H-Voice dataset [ 103 ] was recently established using fake and real voices in several languages, including French, English, Portuguese, Spanish, and Tagalog. Samples are stored as histograms in PNG format. The dataset contains 6672 samples organized into six folders: 'training original', 'training fake', 'validation original', 'validation fake', 'external test 1', and 'external test 2'. Each folder has a different number of samples: 'training original' has 2020 histograms; 'training fake' contains 2088 histograms (2016 imitation and 72 deep-voice samples); 'validation original' contains 864 histograms; 'validation fake' contains 864 histograms; and 'external test 1' and 'external test 2' are further divided into 'fake' and 'original' sub-folders, with 'external test 1' containing 760 histograms (380 fake imitation and 380 original) and 'external test 2' containing 76 histograms (72 fake deep-voice and 4 original).

Finally, the ASVspoof 2021 challenge dataset includes two spoofing scenarios, a logical one and a physical one: in the logical access scenario, fake audio is created using speech synthesis software, whereas in the physical access scenario, fake audio is created by replaying prerecorded segments of genuine speaker data. This dataset had not yet been released at the time of writing; prior versions are freely available (2015 [ 104 ], 2017 [ 105 ], and 2019 [ 106 ]).

4.3.2. Deepfake Audio Detection Techniques

A large variety of methods and techniques for creating fake audio have prompted a wide interest in detecting deepfake audio in many languages. This section presents the works on recognizing mimicked and synthetically created voices. In general, there are two types of techniques that are used currently: ML and DL approaches.

Traditional ML methods are commonly used in the identification of fake audio. A study [ 107 ] created its own fake audio dataset, the imitation-based H-Voice dataset [ 103 ], and extracted entropy features from it. To distinguish between fake and real audio, the study used the ML model LR, which achieved 98% accuracy for real vs. fake audio detection. The study points out that manual feature extraction can boost the performance of the proposed approach.

To distinguish artificial audio from natural human voices, Singh et al. [ 108 ] used the H-Voice dataset and suggested a quadratic SVM (QSVM) method that classifies the audio as human or AI-generated. Additional ML approaches, including linear discriminant (LD), quadratic LD, SVM, weighted KNN, a boosted tree ensemble, and LR, were compared against this model; QSVM beats the other traditional approaches with 97.56% accuracy and only a 2.43% misclassification rate. Similarly, Borrelli et al. [ 109 ] built SVM and RF models to classify artificial voices using novel audio features known as short-term long-term (STLT) features. The Automatic Speaker Verification (ASV) spoof 2019 challenge dataset was used to train the models, and RF performs best, ahead of the SVM, with 71% accuracy. In a similar way, [ 110 ] also used the H-Voice dataset and compared the effectiveness of an SVM with the DL technique CNN to distinguish fake audio from actual stereo audio. The study found that the CNN is more resilient than the SVM, even though both obtained a high classification accuracy of 99%; the SVM, however, suffers from the same feature extraction issues as the LR model.
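A classical ML pipeline of this kind can be sketched with librosa and scikit-learn. MFCC statistics are used here as a generic stand-in for the histogram/entropy features of the cited studies, and the degree-2 polynomial kernel is an approximation of a 'quadratic SVM'; none of the settings are taken from the papers themselves.

```python
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def audio_features(path, sr=16000, n_mfcc=20):
    """Summarize a clip by the mean and standard deviation of its MFCCs."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_audio_detectors(paths, labels):
    """labels: 1 = fake/synthetic voice, 0 = real voice."""
    X = np.stack([audio_features(p) for p in paths])
    y = np.asarray(labels)
    qsvm = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=2))
    logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    qsvm.fit(X, y)
    logreg.fit(X, y)
    return qsvm, logreg
```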

A study [ 111 ] designed a CNN method in which the audio is converted into scatter plots of neighboring samples before being fed into the CNN model. The method was evaluated on the Fake or Real (FoR) dataset and achieved a prediction accuracy of 88.9%. Although the model addressed the generalization problem of DL-based architectures by training with various data generation techniques, it did not perform as well as other models in the literature: its accuracy of roughly 88% and equal error rate (EER) of 11% are worse than those of other DL models.

Another similar study [ 112 ] presented DeepSonar, a DNN-based model that examines the layer-wise neuron behaviors of speaker recognition (SR) systems when confronted with AI-produced fake voices and uses these neuron activations for classification. Using the voices of English speakers from the FoR dataset, the approach achieves 98.1% accuracy. The efficiency of a CNN and a BiLSTM was compared with ML models in [ 113 ], which detects imitation-based fakes in the Ar-DAD Quranic audio samples. Besides the CNN and BiLSTM, the ML algorithms SVM, linear SVM, SVM with a radial basis function kernel (SVM-RBF), LR, decision tree (DT), RF, and XGBoost were investigated. The research concludes that the SVM has the highest accuracy of 99%, while DT has the lowest accuracy of 73.33%; furthermore, the CNN achieves a 94.33% detection rate, which is higher than that of the BiLSTM.

4.4. Deepfake Tweet Detection

Similar to deepfake videos and images posted online, tweets posted on Twitter may also be machine-generated and are then likewise called deepfakes. A dedicated study [ 114 ] therefore focused on detecting deepfake tweets and collected a dataset named TweepFake, consisting of 25,572 randomly selected tweets from 17 human accounts imitated by 23 bots. Markov chains, RNN, RNN+Markov, and LSTM are some of the approaches used to create the bots. The study evaluated 13 deepfake detection methods: LR_BOW, RF_BOW, SVC_BOW, LR_BERT, RF_BERT, SVC_BERT, CHAR_CNN, CHAR_GRU, CHAR_CNNGRU, BERT_FT, DISTILBERT_FT, ROBERTA_FT, and XLNET_FT. Experimental results show that ROBERTA_FT performs best with 89.6% accuracy, whereas LR_BOW achieves 80.4%, RF_BOW 77.2%, SVC_BOW 81.1%, LR_BERT 83.5%, RF_BERT 82.7%, SVC_BERT 84.2%, CHAR_CNN 85.1%, CHAR_GRU 83%, CHAR_CNNGRU 83.7%, BERT_FT 89.1%, DISTILBERT_FT 88.7%, and XLNET_FT 87.7%.
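Fine-tuning a pretrained transformer such as RoBERTa for this binary task follows the standard sequence-classification recipe. The sketch below uses the Hugging Face transformers API with illustrative hyperparameters and a hypothetical `tweets`/`labels` dataset; it is not the TweepFake authors' code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def fine_tune(tweets, labels, epochs=3, lr=2e-5, batch_size=16):
    """tweets: list of strings; labels: 1 = bot-generated, 0 = human-written."""
    enc = tokenizer(tweets, truncation=True, padding=True, max_length=64, return_tensors="pt")
    data = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, y in loader:
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            optimizer.zero_grad()
            out.loss.backward()        # cross-entropy loss from the classification head
            optimizer.step()
    return model
```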

5. Discussion

Deepfakes have become a matter of great concern in recent times due to their large-scale impact on the public as well as on the security of countries. Often aimed at celebrities, politicians, and other important individuals, deepfakes can easily become a matter of national security. Therefore, the analysis of deepfake techniques is an important research area for devising effective countermeasures. This study performs a comprehensive survey of deepfake techniques and divides them into deepfake image, deepfake video, deepfake audio, and deepfake tweet categories. Each category is studied separately, and different research works are discussed with respect to the proposed deepfake detection approaches. The discussed research works can be categorized into two groups with respect to the datasets used: works that collected their own datasets for experiments and works that make use of benchmark datasets. Table 4 lists the research works that created their own datasets to conduct deepfake detection experiments; the authors of [ 72 , 74 , 75 , 79 , 80 , 83 , 91 , 93 , 94 , 96 , 97 , 99 , 114 ] created their own datasets to evaluate the performance of the proposed approaches.

Table 4. Comparison of self-made datasets.

The authors of [ 72 ] used feature point extraction methods on their own dataset of 90 MP4 videos of about 30 s each, which was then used to evaluate the proposed deepfake detection approach. Similarly, the study in [ 74 ] used facial landmarks and facial features to create a dataset containing fake images: the authors used Dlib to detect faces and retrieve the 68 facial landmarks used to generate the fakes, and then recognized deepfakes with the 3D head pose method by finding the dissimilarities between the genuine head pose and the head poses found in the fake images. Along the same lines, a study [ 79 ] created an eye-blinking video dataset of 50 videos of 30 s duration for its deepfake detection experiments. A study [ 75 ] created a dataset of 600 videos in which 300 deepfakes were gathered from a variety of video-hosting websites while the remaining 300 were random selections from the publicly available HOHA dataset. Starting from the CelebA dataset, a study [ 91 ] created a dataset using five different GAN techniques, AttGAN, StarGAN, GDWCT, StyleGAN, and StyleGAN2, which was later used to evaluate the performance of the proposed approach.

A study [ 93 ] used the GANprintR and StyleGAN techniques to create a dataset for deepfake detection. Another study [ 94 ] used the FaceApp software and the StarGAN method to create a dataset named DFFD, which contains 18,416 real and 79,960 fake images of different celebrities. A study [ 96 ] used the Face-Aware Liquify tool provided by Adobe Photoshop to create new faces for deepfake detection by manipulating different facial features. Another study [ 97 ] used GAN-based generation (ProGAN) and Adobe Photoshop CS6 to create a dataset of fake and real images. A study [ 80 ] created a dataset based on two features, eye blink count and eye blink duration, which was then used with several approaches to analyze deepfake detection performance.

Using Max25's PortraitPro Studio software, a study [ 99 ] created two datasets, the ND-IIITD Retouching (ND-IIITDR) database (collection B) and CR, both of which contain fake and real images of different celebrities; the real images were gathered from online sources. Another study [ 83 ] created the DF dataset from portrait videos; in that study, the authors used biological signals and proposed FakeCatcher. For the related problem of deepfake tweet detection, another study [ 114 ] created a unique dataset in which different bots generate tweets about specific topics; the study collected 25,572 deepfake tweets.

Apart from the research works listed in Table 4 , several researchers have made use of publicly available benchmark deepfake datasets. The authors of [ 1 , 10 , 71 , 76 , 77 , 78 , 98 ] used the benchmark datasets shown in Table 5 . The benchmark dataset VidTIMIT [ 115 ] consists of video and accompanying audio recordings of 35 persons speaking short sentences. It can be used in experiments on automated lip-reading, multi-view face recognition, multi-modal voice recognition, and person identification, among others. The dataset was gathered in three sessions, with a 7-day gap between the first and second sessions and a 6-day gap between the second and third. The texts were picked from the test portion of the TIMIT corpus, and each individual has ten sentences: session 1 is made up of the first six sentences (ordered alphabetically by file name), session 2 includes the following two, and session 3 includes the last two.

Table 5. Comparison of benchmark datasets.

The Celeb-DF dataset [ 116 ] has two versions, v1 and v2. The latest version, v2, comprises real and deepfake-generated videos of visual quality comparable to those found on the internet. The Celeb-DF v2 dataset is significantly larger than v1, which only contained 795 deepfake videos; v2 contains 590 original YouTube videos covering subjects of different ages, ethnic groups, and genders, as well as 5639 deepfake videos.

DeepfakeTIMIT [ 117 ] is a collection of videos in which faces have been swapped using free GAN-based software derived from the original autoencoder-based deepfake algorithm. The dataset was built by selecting 16 similar-looking pairs of people from the openly available VidTIMIT database. For each of the 32 subjects, two models of different quality are trained: a lower-quality model with a 64 × 64 input/output size and a higher-quality model with a 128 × 128 size. As each individual has ten videos in the VidTIMIT database, 320 videos are created for each version, totaling 640 face-swapped videos.

FaceForensics++ is a forensics dataset [ 118 ] comprising 1000 original videos and 4000 fake video sequences with 1.8 million manipulated frames, generated using four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap, and NeuralTextures. The data came from 977 YouTube videos, all of which contain a trackable, mostly frontal face without occlusions, allowing the automated tampering methods to create plausible forgeries.

UADFV is another publicly available dataset [ 119 ]. It is a collection of 49 real YouTube videos that were used to make 49 fake videos with the FakeApp mobile application, replacing the original face with Nicolas Cage's face in each of them; as a result, all fraudulent videos involve just one identity. Each video depicts a single person, with a typical resolution of 294 × 500 pixels and an average duration of 11.14 s.

The authors of [ 78 ] used two datasets for their research, the Deepfake Forensics dataset and the VidTIMIT dataset. The Deepfake Forensics dataset contains 1203 videos (408 real and 795 deepfake), whereas the VidTIMIT-based dataset comprises 320 LQ and 320 HQ videos. The study used a CNN and designed DFT-MF, a deepfake detection model based on mouth features. Refs. [ 71 , 98 ] used the FaceForensics dataset, which contains 363 real and 3068 fake videos, for the detection of deepfakes.

The study in [ 71 ] used Dlib to extract faces from videos and the XceptionNet CNN to extract features and detect deepfakes, whereas [ 98 ] used the machine learning algorithms LR and MLP to detect the fake faces. Another study [ 1 ] used the Deepfake TIMIT benchmark dataset with PCA-LDA and SVM classifiers applied to raw face features and image quality measurements, showing that IQM+SVM performs best for fake face detection. The study in [ 76 ] used the FaceForensics++, Deepfake TIMIT, UADFV, and Celeb-DF benchmark datasets with FSSPOTTER to detect fake faces. A study [ 77 ] used the DeepFake Detection (DFD), Celeb-DF, and DeepFake Detection Challenge (DFDC) datasets and, for the experiments, employed the XceptionNet CNN with and without transfer learning. The CelebDF-FaceForensics++ (c23) dataset, a combination of Celeb-DF and FaceForensics++ (c23), is used in [ 10 ], where the YOLO-CNN-XGBoost pipeline is used for deepfake detection.

6. Future Directions

Deepfakes, as the name suggests, involve the use of deep learning approaches to generate fake content; deep learning methods are therefore also the natural choice for detecting them effectively. Although the literature on deepfake detection is still comparatively sparse, this study discusses a substantial number of works. Most of them leverage deep learning models, directly or indirectly, to discriminate between real and fake images and videos. Predominantly, a CNN is used either as the final classifier or at the feature extraction level. Machine learning models such as the SVM and LR are also utilized alongside deep learning models, and further variants are investigated, such as YOLO-CNN and CNNs with attention mechanisms, as well as RF and LR variants.

Most of the discussed research works utilize their own collected datasets, which may or may not be publicly available, so the repeatability and reproducibility of the experiments may be limited. More experiments on publicly available benchmark datasets are needed. Knowledge of the tools used to generate deepfakes plays a vital role in choosing a proper tool or model for deepfake detection, which is helpful but not very practical for real-world scenarios. For benchmark datasets, such information should not be available, so that exhaustive experiments can be conducted to devise robust and efficient approaches for determining fake content.

Although changes in images can be determined using digital signatures found in manipulated content, this is very difficult for deepfake content. Several indicators, such as head pose, eye blink count and duration, cues found in teeth placement, and other facial landmarks, are currently used to detect deepfakes; with the advancement of deep learning technology, such indicators will become less prevalent in future deepfakes. Consequently, more refined indicators will be required to detect future deepfakes.

Keeping in view the rapid growth and wide use of social media, the content needed to make deepfakes will become easier to obtain and deepfake use more widespread in the future. This means that more robust and efficient deepfake detection methods that work in real time are required, which is not yet the case; therefore, special focus should be placed on approaches that can work in real time. This is important because deepfake technology has shown the potential to do irreparable damage: often, the damage is done before it is realized that the posted content is not real, and, coupled with the speed of social media, the damage spreads dangerously fast. Effective, robust, and reliable approaches are therefore needed for real-time deepfake detection. Moreover, due to their resource-hungry nature, deep learning approaches cannot easily be deployed on smartphones, which are a major source of content sharing on social media. For the quick and timely detection of fake content on smartphones, novel methods that can be deployed on such devices are needed.

7. Conclusions

Deepfake content, both images and videos, has grown at an unprecedented speed during the past few years. Deep learning approaches, combined with the wide availability of images, audio, and videos on social media platforms, can create fake content that threatens the goodwill, popularity, and security of both individuals and governments. Manipulated facial images and videos created using deepfake techniques can be distributed quickly through the internet, especially social media platforms, endangering societal stability and personal privacy. As a safeguard against such threats, commercial firms, government offices, and academic organizations are devising and implementing countermeasures to alleviate the harmful effects of deepfakes.

This study highlights recent research on deepfake image and video detection, covering deepfake creation, detection algorithms evaluated on self-made datasets, and existing benchmark datasets. It provides a comprehensive overview of the approaches presented in the literature to detect deepfakes and thus helps to mitigate their impact. The analysis indicates that machine and deep learning models such as the CNN and its variants, SVM, LR, and RF and its variants are quite helpful for discriminating between real and fake images and videos. The study also elaborates on ways to counter the threats of deepfake technology: the analytical findings suggest that deepfakes can be combated through legislation, corporate policies and regulation, individual action, and training and education. In addition, technology for the identification and authentication of content on the internet needs to evolve to prevent the widespread dissemination of deepfakes. To overcome the challenge of deepfake identification and mitigate its impact, substantial efforts are needed to devise novel methods that can detect deepfakes in real time.

Funding Statement

This research was supported by the European University of the Atlantic.

Author Contributions

Conceptualization, F.R., I.A.; methodology, H.F.S., F.R.; software, E.S.F.; validation, J.L.V.M.; E.S.F.; investigation, J.L.V.M., I.d.l.T.D.; resources, I.d.l.T.D.; data curation, H.F.S.; writing—original draft preparation, H.F.S., F.R.; writing—review and editing, I.A.; visualization, E.S.F.; I.A.; funding acquisition, I.d.l.T.D. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Informed Consent Statement, Data Availability Statement, Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Research on Image Processing Technology Based on Artificial Intelligence Algorithm

  • Jiaqi Xu 5  
  • Conference paper
  • First Online: 09 April 2023

272 Accesses

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 156)

Artificial intelligence algorithms can optimize traditional image processing technology so that it yields more accurate, higher-quality results. This paper introduces the basic concepts of artificial intelligence algorithms and their advantages in image processing, and then establishes an artificial-intelligence-based image processing technology system. The research shows that image processing supported by artificial intelligence algorithms produces higher-quality results, indicating that such algorithms have high application value in image processing.

  • Artificial intelligence
  • Image processing



Author information

Authors and affiliations

Guangxi University, Nanning, 530004, Guangx, China


Corresponding author

Correspondence to Jiaqi Xu .

Editor information

Editors and affiliations

Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

Bernard J. Jansen

School of Economics and Management, Changzhou Institute of Mechatronic Technology, Changzhou, China

Qingyuan Zhou

School of Computer Science and Cyberspace Security, Hainan University, Haikou, Hainan, China


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Xu, J. (2023). Research on Image Processing Technology Based on Artificial Intelligence Algorithm. In: Jansen, B.J., Zhou, Q., Ye, J. (eds) Proceedings of the 2nd International Conference on Cognitive Based Information Processing and Applications (CIPA 2022). CIPA 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 156. Springer, Singapore. https://doi.org/10.1007/978-981-19-9376-3_72

DOI: https://doi.org/10.1007/978-981-19-9376-3_72

Published: 09 April 2023

Publisher Name: Springer, Singapore

Print ISBN: 978-981-19-9375-6

Online ISBN: 978-981-19-9376-3

eBook Packages: Intelligent Technologies and Robotics (R0)


