Data mining techniques and applications — A decade review

data mining Recently Published Documents

Distance Based Pattern Driven Mining for Outlier Detection in High Dimensional Big Dataset

Detecting outliers, or anomalies, is one of the vital issues in pattern-driven data mining. Outlier detection identifies the inconsistent behavior of individual objects. It is an important area of data mining with applications such as detecting credit card fraud, discovering hacking, and uncovering criminal activity. Tools are needed to uncover the critical information buried in extensive data. This paper investigates a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying the clusters and outliers in datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, like instant irregular sets of clusters (C) and outliers (O), to boost the results. Applying the algorithm to the dataset improved the results on several parameters. For the comparative analysis, the average accuracy value and the average recall value are computed. The average accuracy value is 74.05% for the existing COID algorithm and 77.21% for the proposed algorithm. The average recall value is 81.19% for the existing algorithm and 89.51% for the proposed one, which shows that the proposed method performs better than the existing COID algorithm.
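To make the idea of distance-based outlier detection concrete, here is a minimal sketch in Python. It is not the paper's method and not the COID algorithm. It assumes a generic k-nearest-neighbour distance score, and the function names, the threshold rule (`factor` times the median score), and the toy data are all hypothetical.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def flag_outliers(points, k=2, factor=3.0):
    """Flag points whose score exceeds `factor` times the median score (assumed rule)."""
    scores = knn_distance_scores(points, k)
    median = sorted(scores)[len(scores) // 2]
    return [s > factor * median for s in scores]


# Hypothetical toy data: two tight clusters plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (20.0, 20.0)]
flags = flag_outliers(points)
```

On this toy data only the isolated point at (20, 20) is flagged. The pairwise loop is quadratic in the number of points, so a high-dimensional big dataset of the kind the abstract targets would need an index or an approximation instead.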

Implementation of Data Mining Technology in Bonded Warehouse Inbound and Outbound Goods Trade

For taxed goods, the actual freight is generally determined by multiplying the allocated freight per kilogram by the actual outgoing weight, based on the outgoing order number on the outgoing bill. Because conventional logistics cannot keep up with the rapid response that e-commerce orders demand, this work discusses the implementation of data mining technology in bonded warehouse inbound and outbound goods trade. Specifically, a bonded warehouse decision-making system was developed, comprising a data warehouse, a conceptual model, an online analytical processing system, a human-computer interaction module, and a web data-sharing platform. The statistical query module can be used to run statistics and queries on warehousing operations. After optimization of the whole warehousing business process, it takes only 19.1 hours to obtain the actual freight, nearly one third less than before optimization. This study could create a better environment for the development of China's processing trade.
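The freight rule and the reported time saving can be written out as a small worked example. This is only an illustrative sketch: the function names and the sample rate and weight are hypothetical, and the pre-optimization time is back-computed under the assumption that the reduction is exactly one third, whereas the abstract says only "nearly".

```python
def actual_freight(rate_per_kg, outgoing_weight_kg):
    """Actual freight = allocated freight per kilogram x actual outgoing weight."""
    return rate_per_kg * outgoing_weight_kg


def time_before_optimization(time_after_hours, reduction_fraction):
    """Back out the earlier processing time from the reduced time and the reduction."""
    return time_after_hours / (1.0 - reduction_fraction)


# Hypothetical shipment: 2.5 currency units per kg, 120 kg outgoing weight.
freight = actual_freight(2.5, 120.0)

# Assumes the reduction is exactly 1/3; the abstract reports "nearly one third".
baseline_hours = time_before_optimization(19.1, 1.0 / 3.0)
```

Under that assumption the earlier process would have taken about 28.65 hours, so the true figure is somewhat below that if the reduction was a little under one third.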

Multi-objective economic load dispatch method based on data mining technology for large coal-fired power plants

User activity classification and domain-wise ranking through social interactions.

Twitter has gained significant prevalence among users across numerous domains, in the majority of countries, and among different age groups. It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes by exposing open APIs, which makes it the most suitable source of data for social media analytics. Applying data mining and machine learning techniques to tweets is attracting growing interest. The most prominent challenge in social media analytics is to identify and rank influencers automatically. This research aims to detect users' topics of interest in social media and to rank users by specific topics, domains, etc. A few hybrid parameters are also identified in this research, based on the post's content, the post's metadata, the user's profile, and the user's network features, to capture different aspects of being influential; they are used in the ranking algorithm. The results indicate that the proposed approach is effective in both the classification and the ranking of individuals in a cluster.

A data mining analysis of COVID-19 cases in states of United States of America

Epidemic diseases can be extremely dangerous because of their harmful effects. They may negatively affect economies, businesses, the environment, humans, and the workforce. In this paper, some of the factors interrelated with the COVID-19 pandemic are examined using data mining methodologies and approaches. The analysis yielded a number of rules and insights, and the performance of the data mining algorithms was evaluated. According to the results, the JRip algorithm had the highest correct classification rate and the lowest root mean squared error (RMSE). Considering classification rate and RMSE, JRip can be regarded as an effective method for understanding the factors related to deaths caused by the coronavirus.
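The two evaluation measures named in the abstract can be computed as follows. This is a minimal sketch with hypothetical labels and scores. The abstract does not say how RMSE was derived for a classifier, so treating it as the error between 0/1 targets and predicted probabilities is an assumption here.

```python
import math


def classification_rate(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(targets, predictions):
    """Root mean squared error between numeric targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical labels for four instances.
y_true = ["high", "low", "low", "high"]
y_pred = ["high", "low", "high", "high"]
rate = classification_rate(y_true, y_pred)

# Hypothetical 0/1 targets and predicted probabilities of the "high" class.
error = rmse([1, 0, 0, 1], [0.9, 0.2, 0.6, 0.8])
```

A higher classification rate and a lower RMSE both indicate a better fit, which is the sense in which the abstract ranks JRip first.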

Exploring distributed energy generation for sustainable development: A data mining approach

A comprehensive guideline for Bengali sentiment annotation

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and Information Extraction (IE) task that primarily aims to determine whether the feelings a writer expresses are positive or negative by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of a given piece of content as Positive, Negative, or Neutral. Although extensive research has been conducted in this area of computational linguistics, most of it has been carried out in the context of the English language. However, Bengali sentiment expression has varying degrees of sentiment labels, which can plausibly differ from those of English. Therefore, it is important that sentiment assessment for the Bengali language be developed and executed properly. In sentiment analysis, the predictive potential of an automatic model depends entirely on the quality of dataset annotation. Bengali sentiment annotation is a challenging task because of the diversified structures (syntax) of the language and its different degrees of innate sentiment (i.e., weakly and strongly positive/negative sentiments). Thus, in this article, we propose a novel and precise guideline that researchers, linguistic experts, and referees can use to annotate Bengali sentences accurately, with a view to building effective datasets for automatic sentiment prediction.

Capturing Dynamics of Information Diffusion in SNS: A Survey of Methodology and Techniques

Studying information diffusion in SNS (Social Networks Service) has remarkable significance in both academia and industry. Theoretically, it boosts the development of other subjects such as statistics, sociology, and data mining. Practically, diffusion modeling provides fundamental support for many downstream applications (e.g., public opinion monitoring, rumor source identification, and viral marketing). Tremendous efforts have been devoted to this area to understand and quantify information diffusion dynamics. This survey investigates and summarizes the emerging distinguished works in diffusion modeling. We first put forward a unified information diffusion concept in terms of three components: information, user decision, and social vectors, followed by a detailed introduction of the methodologies for diffusion modeling. We then propose a new taxonomy adopting a hybrid philosophy (i.e., granularity and techniques) and carry out a series of comparative studies on elementary diffusion models under our taxonomy, covering assumptions, methods, and pros and cons. We further summarize representative diffusion modeling in special scenarios and significant downstream tasks based on these elementary models. Finally, open issues in this field are discussed, following the methodology of diffusion modeling.
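As one concrete instance of an elementary diffusion model, the sketch below simulates an independent cascade on a small directed graph. The abstract does not name this model, so its choice here is an assumption. The graph, the single activation probability `prob`, and the function name are hypothetical.

```python
import random


def independent_cascade(graph, seeds, prob, rng):
    """Simulate one independent-cascade run.

    graph: dict mapping each node to the list of nodes it can influence.
    Each newly activated node gets one chance to activate each inactive
    neighbour, succeeding with probability `prob`.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                if neighbour not in active and rng.random() < prob:
                    active.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return active


# Hypothetical follower graph: an edge u -> v means u can pass information to v.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}

everyone_reached = independent_cascade(graph, ["a"], prob=1.0, rng=random.Random(0))
nobody_reached = independent_cascade(graph, ["a"], prob=0.0, rng=random.Random(0))
```

With `prob=1.0` the cascade reaches every node reachable from the seed (a, b, c, d), and with `prob=0.0` it stays at the seed. Intermediate probabilities give random outcomes, which is why such models are usually averaged over many runs.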

The Influence of E-book Teaching on the Motivation and Effectiveness of Learning Law by Using Data Mining Analysis

This paper studies the motivation for learning law, compares the teaching effectiveness of two teaching methods (e-book teaching and traditional teaching), and uses big data analysis to examine the influence of e-book teaching on the effectiveness of learning law. From the perspective of law student psychology, e-book teaching can attract students' attention, stimulate their interest in learning, deepen the impression that knowledge makes during learning, expand knowledge, and ultimately improve performance in practical assessment. Because the sample size is small, the results may not be fully representative. The findings have reference value for stimulating motivation to learn law and other theoretical disciplines in colleges and universities, and they provide ideas for reforming teaching modes there. This paper uses a decision tree algorithm from data mining for the analysis and identifies, from the students' perspective, the factors that influence law students' learning motivation and effectiveness during the learning process.

Intelligent Data Mining based Method for Efficient English Teaching and Cultural Analysis

The emergence of online education has greatly helped improve the quality of traditional English teaching. However, it only moves the teaching process from offline to online and does not really change the essence of traditional English teaching. In this work, we study an intelligent English teaching method to further improve the quality of English teaching. Specifically, a random forest is first used to analyze and extract the grammatical and syntactic features of English text. Then, a decision-tree-based method is proposed to predict grammar or syntax issues in the English text. The evaluation results indicate that the proposed method can effectively improve the accuracy of English grammar and syntax recognition.

Summary of Data Mining Research on Technical and Economic Information of Power Grid Construction Project

  • Conference paper
  • First Online: 25 January 2024

  • Zhongxuan Li 10 ,
  • Hong Chen 10 ,
  • Weiya Guan 10 ,
  • Zhongshi Xie 11 &
  • Guoxin Jiang 11  

Part of the book series: Mechanisms and Machine Science (Mechan. Machine Science, volume 146)

Included in the following conference series:

  • International Conference on Computational & Experimental Engineering and Sciences

Technical and economic review is a key link in power grid project construction. To reduce the subjectivity and improve the reliability of evaluation in the technical and economic review of power grid projects, it is necessary to summarize historical experience fully and to mine the available data. From this point of view, this paper summarizes data mining technology and its research and application in the technical and economic evaluation of power grid projects and related fields. According to the characteristics of information flow and informatization in the technical and economic review process, the data mining process is divided into four layers: the data processing layer, the method processing layer, the business function layer, and the application layer. For these four layers, the paper reviews, respectively, data acquisition, processing, and storage technologies in the data processing layer; data mining algorithms in the method processing layer; business function implementation methods in the business function layer; and application scenarios in the application layer. Through this hierarchical analysis, the paper systematically presents research on data mining technology in the technical and economic evaluation of power grid projects, which helps realize value-added services from existing data and, on this basis, improve the intelligence of the technical and economic review of power grid projects.


Author information

Authors and affiliations.

Economic and Technical Research Institute of State Grid Jiangsu Electric Power Co. Ltd., Nanjing, China

Zhongxuan Li, Hong Chen & Weiya Guan

School of Computer and Information, Hohai University, Nanjing, China

Zhongshi Xie & Guoxin Jiang

Corresponding author

Correspondence to Zhongxuan Li.

Editor information

Editors and affiliations.

Dept of Civil and Env'l Engg, University of California, Berkeley, Berkeley, CA, USA

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper.

Li, Z., Chen, H., Guan, W., Xie, Z., Jiang, G. (2024). Summary of Data Mining Research on Technical and Economic Information of Power Grid Construction Project. In: Li, S. (eds) Computational and Experimental Simulations in Engineering. ICCES 2023. Mechanisms and Machine Science, vol 146. Springer, Cham. https://doi.org/10.1007/978-3-031-44947-5_3

DOI: https://doi.org/10.1007/978-3-031-44947-5_3

Published: 25 January 2024

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-44946-8

Online ISBN: 978-3-031-44947-5

eBook Packages: Engineering, Engineering (R0)

50 selected papers in Data Mining and Machine Learning

Here is the list of 50 selected papers in Data Mining and Machine Learning. You can download them for your detailed reading and research. Enjoy!

Data Mining and Statistics: What’s the Connection?

Data Mining: Statistics and More? , D. Hand, American Statistician, 52(2):112-118.

Data Mining , G. Weiss and B. Davison, in Handbook of Technology Management, John Wiley and Sons, expected 2010.

From Data Mining to Knowledge Discovery in Databases , U. Fayyad, G. Piatetsky-Shapiro & P. Smyth, AI Magazine, 17(3):37-54, Fall 1996.

Mining Business Databases , Communications of the ACM, 39(11): 42-48.

10 Challenging Problems in Data Mining Research , Q. Yang and X. Wu, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, 597-604.

The Long Tail , by Anderson, C., Wired magazine.

AOL’s Disturbing Glimpse Into Users’ Lives , by McCullagh, D., News.com, August 9, 2006

General Data Mining Methods and Algorithms

Top 10 Algorithms in Data Mining , X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z. Zhou, M. Steinbach, D. J. Hand, D. Steinberg, Knowl Inf Syst (2008) 14:1-37.

Induction of Decision Trees , R. Quinlan, Machine Learning, 1(1):81-106, 1986.

Web and Link Mining

The Pagerank Citation Ranking: Bringing Order to the Web , L. Page, S. Brin, R. Motwani, T. Winograd, Technical Report, Stanford University, 1999.

The Structure and Function of Complex Networks , M. E. J. Newman, SIAM Review, 2003, 45, 167-256.

Link Mining: A New Data Mining Challenge , L. Getoor, SIGKDD Explorations, 2003, 5(1), 84-89.

Link Mining: A Survey , L. Getoor, SIGKDD Explorations, 2005, 7(2), 3-12.

Semi-supervised Learning

Semi-Supervised Learning Literature Survey , X. Zhu, Computer Sciences TR 1530, University of Wisconsin — Madison.

Introduction to Semi-Supervised Learning, in Semi-Supervised Learning (Chapter 1) O. Chapelle, B. Scholkopf, A. Zien (eds.), MIT Press, 2006. (Fordham’s library has online access to the entire text)

Learning with Labeled and Unlabeled Data , M. Seeger, University of Edinburgh (unpublished), 2002.

Person Identification in Webcam Images: An Application of Semi-Supervised Learning , M. Balcan, A. Blum, P. Choi, J. Lafferty, B. Pantano, M. Rwebangira, X. Zhu, Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data , 2005.

Learning from Labeled and Unlabeled Data: An Empirical Study across Techniques and Domains , N. Chawla, G. Karakoulas, Journal of Artificial Intelligence Research , 23:331-366, 2005.

Text Classification from Labeled and Unlabeled Documents using EM , K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Machine Learning , 39, 103-134, 2000.

Self-taught Learning: Transfer Learning from Unlabeled Data , R. Raina, A. Battle, H. Lee, B. Packer, A. Ng, in Proceedings of the 24th International Conference on Machine Learning , 2007.

An iterative algorithm for extending learners to a semisupervised setting , M. Culp, G. Michailidis, 2007 Joint Statistical Meetings (JSM), 2007

Partially-Supervised Learning / Learning with Uncertain Class Labels

Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers , V. Sheng, F. Provost, P. Ipeirotis, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2008.

Logistic Regression for Partial Labels , in 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems , Volume III, pp. 1935-1941, 2002.

Classification with Partial labels , N. Nguyen, R. Caruana, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2008.

Imprecise and Uncertain Labelling: A Solution based on Mixture Model and Belief Functions, E. Come, 2008 (powerpoint slides).

Induction of Decision Trees from Partially Classified Data Using Belief Functions , M. Bjanger, Norwegian University of Science and Technology, 2000.

Knowledge Discovery in Large Image Databases: Dealing with Uncertainties in Ground Truth , P. Smyth, M. Burl, U. Fayyad, P. Perona, KDD Workshop 1994, AAAI Technical Report WS-94-03, pp. 109-120, 1994.

Recommender Systems

Trust No One: Evaluating Trust-based Filtering for Recommenders , J. O’Donovan and B. Smyth, In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), 2005, 1663-1665.

Trust in Recommender Systems, J. O’Donovan and B. Smyth, In Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 2005, 167-174.

General resources available on this topic:

ICML 2003 Workshop: Learning from Imbalanced Data Sets II

AAAI 2000 Workshop on Learning from Imbalanced Data Sets

A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data , G. Batista, R. Prati, and M. Monard, SIGKDD Explorations , 6(1):20-29, 2004.

Class Imbalance versus Small Disjuncts , T. Jo and N. Japkowicz, SIGKDD Explorations , 6(1): 40-49, 2004.

Extreme Re-balancing for SVMs: a Case Study , B. Raskutti and A. Kowalczyk, SIGKDD Explorations , 6(1):60-69, 2004.

A Multiple Resampling Method for Learning from Imbalanced Data Sets , A. Estabrooks, T. Jo, and N. Japkowicz, in Computational Intelligence , 20(1), 2004.

SMOTE: Synthetic Minority Over-sampling Technique , N. Chawla, K. Boyer, L. Hall, and W. Kegelmeyer, Journal of Artificial Intelligence Research , 16:321-357.

Generative Oversampling for Mining Imbalanced Datasets, A. Liu, J. Ghosh, and C. Martin, Third International Conference on Data Mining (DMIN-07), 66-72.

Learning from Little: Comparison of Classifiers Given Little Training , G. Forman and I. Cohen, in 8th European Conference on Principles and Practice of Knowledge Discovery in Databases , 161-172, 2004.

Issues in Mining Imbalanced Data Sets – A Review Paper , S. Visa and A. Ralescu, in Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference , pp. 67-73, 2005.

Wrapper-based Computation and Evaluation of Sampling Methods for Imbalanced Datasets , N. Chawla, L. Hall, and A. Joshi, in Proceedings of the 1st International Workshop on Utility-based Data Mining , 24-33, 2005.

C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , C. Drummond and R. Holte, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

C4.5 and Imbalanced Data sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure , N. Chawla, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Class Imbalances: Are we Focusing on the Right Issue?, N. Japkowicz, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Learning when Data Sets are Imbalanced and When Costs are Unequal and Unknown , M. Maloof, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Uncertainty Sampling Methods for One-class Classifiers , P. Juszczak and R. Duin, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Active Learning

Improving Generalization with Active Learning , D Cohn, L. Atlas, and R. Ladner, Machine Learning 15(2), 201-221, May 1994.

On Active Learning for Data Acquisition , Z. Zheng and B. Padmanabhan, In Proc. of IEEE Intl. Conf. on Data Mining, 2002.

Active Sampling for Class Probability Estimation and Ranking , M. Saar-Tsechansky and F. Provost, Machine Learning 54:2 2004, 153-178.

The Learning-Curve Sampling Method Applied to Model-Based Clustering , C. Meek, B. Thiesson, and D. Heckerman, Journal of Machine Learning Research 2:397-418, 2002.

Active Sampling for Feature Selection , S. Veeramachaneni and P. Avesani, Third IEEE Conference on Data Mining, 2003.

Heterogeneous Uncertainty Sampling for Supervised Learning , D. Lewis and J. Catlett, In Proceedings of the 11th International Conference on Machine Learning, 148-156, 1994.

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction , G. Weiss and F. Provost, Journal of Artificial Intelligence Research, 19:315-354, 2003.

Active Learning using Adaptive Resampling , KDD 2000, 91-98.

Cost-Sensitive Learning

Types of Cost in Inductive Concept Learning , P. Turney, In Proceedings Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning.

Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection , P. Chan and S. Stolfo, KDD 1998.


Data Mining: Concepts and Methods Research Paper

Data mining can be defined as the process through which crucial data patterns can be identified from a large quantity of data. Data mining finds its applications in different industries due to a number of benefits that can be derived from its use. Various methods of data mining include predictive analysis, web mining, and clustering and association discovery (Han, Kamber and Pei, 2011).

Each of these has a number of benefits to a business. In predictive analysis, analytical models are used to deliver solutions. Using this model, a business can uncover hidden data which can be utilized for the purposes of identifying trends and therefore, predicting the future.

This method requires a business to define the problem before data can be explored. Predictive models are then developed and tested. Finally, these models are applied to identify populations and predict behavior. The process helps a business identify its current position in relation to the industry (Simsion and Witt, 2004).

From this, businesses can plan on how best they can improve their positions in relation to other companies in the industry. The trends obtained from analysis of the acquired data can be used for the purpose of planning which might further give a company an edge over its competitors.

In association discovery, the main aim is to discover correlation among different items that make up a shopping basket. The knowledge of these correlations is important in the development of effective marketing strategies. This is possible due to the insight gained on products that customers purchase together.

This method of data analysis can also help retailers design the layout of their stores. The retailer can place items that customers purchase together near one another, making the shopping experience more convenient and increasing the chances of higher sales (Kantardzic, 2011). A business can also use the method to decide which products to put on sale in order to promote the items that are typically bought with them.
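As a rough illustration of association discovery, the sketch below counts how often pairs of items share a basket and reports support and confidence for each pair. The function name `pair_rules`, the sample baskets, and the 0.5 support threshold are assumptions made for this example, not part of any cited method.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return {(a, b): (support, confidence)} for pairwise rules a -> b.

    support    = share of baskets containing both a and b
    confidence = share of baskets containing a that also contain b
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore repeats within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue                         # pair is too rare to report
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules

# Hypothetical shopping baskets
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
]
rules = pair_rules(baskets, min_support=0.5)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket with milk also contains butter, which is the kind of correlation a retailer might use when placing items or planning promotions.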

Web mining is the process through which data present in the World Wide Web or data that has a relationship with a given website activity is made available for various business purposes.

This data can include the contents of web pages, the profiles of website users, and the number of visitors to a given website, among others. Web mining can be used by a business to personalize its products or services to meet the specific needs of customers. This is possible through tracking the movement of a target customer across web pages.

The method can also help a business improve on its marketing strategies through effective advertising. This can be achieved when used together with business intelligence. It also helps a business to identify the relevance of information present in its web sites and how it can improve this information with the view of increasing its visibility in the market.

Clustering involves grouping of data into specific classes based on specific characteristics (Han, Kamber and Pei, 2011). The process helps in the discovery of specific groups that the business should focus on. The method also helps a business to provide specific information that can be used to win over a given class of customers.
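To make the idea of grouping by a characteristic concrete, here is a minimal one-dimensional k-means sketch. It assumes a single numeric characteristic (monthly spend) and hand-picked starting centers; the function name `kmeans_1d` and the sample values are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center
    to the mean of its group. Repeats for a fixed number of iterations."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # index of the closest center to this value
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # an empty group keeps its previous center
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical monthly spend per customer
spend = [12, 15, 14, 95, 102, 99]
centers, groups = kmeans_1d(spend, centers=(0, 50))
print(centers)
print(groups)
```

The two groups that emerge correspond to low-spend and high-spend customers, which a business could then address with different offers.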

Data mining follows a sequence that ensures the data mined meets the requirements set down by the person mining it. Different algorithms handle the process differently based on the content of the data to be mined. Therefore, the reliability of the results depends heavily on the method used and the nature of the data. The speed of the data mining process is important because it affects the relevance of the data mined.

Therefore, a given algorithm should support speedy mining of data. Accuracy is another factor that can be used to measure the reliability of the mined data. For this reason, an algorithm should be able to apply the specifications issued for the mining task. Most algorithms meet these two requirements, which makes them reliable for the purposes of data mining.

Various concerns arise over data mining, including invasion of privacy, ethics, and legality. The issue of privacy arises when private information is obtained without the consent of its owners. Application of such information for business purposes can have detrimental effects on the business. Ethical issues arise when information mined is used by a business to take advantage of the owner of such information (Kantardzic, 2011).

There is also the question of legality of data mining without the consent of the person owning such information. To address the issues above, some businesses request permission from people before they can use information on them for various purposes which must be disclosed to the person.

Predictive analysis is used by businesses in market segmentation, analysis of the shopping basket and the planning of demand. Market segmentation enables a business to serve a given market better than if it had to serve a diverse market. In shopping basket analysis, a business can easily identify the products that are needed at specific times. The business can also determine demand and effectively plan how to meet it.

Han, J., Kamber, M. and Pei, J. (2011). Data Mining: Concepts and Techniques. Amsterdam: Elsevier.

Kantardzic, M. (2011). Data Mining: Concepts, Models, Methods, and Algorithms. New York: John Wiley & Sons.

Simsion, G. C. and Witt, G. C. (2004). Data Modeling Essentials. Massachusetts: Morgan Kaufmann.


What Is Data Mining? How It Works, Benefits, Techniques, and Examples


Data mining is the process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information.

Companies use data mining software to learn more about their customers. It can help them to develop more effective marketing strategies, increase sales, and decrease costs. Data mining relies on effective data collection, warehousing, and computer processing.

Key Takeaways

  • Data mining is the process of analyzing a large batch of information to discern trends and patterns.
  • Data mining can be used by corporations for everything from learning about what customers are interested in or want to buy to fraud detection and spam filtering.
  • Data mining programs break down patterns and connections in data based on what information users request or provide.
  • Social media companies use data mining techniques to commodify their users in order to generate profit.
  • This use of data mining has come under criticism as users are often unaware of the data mining happening with their personal information, especially when it is used to influence preferences.


How Data Mining Works

Data mining involves exploring and analyzing large blocks of information to glean meaningful patterns and trends. It is used in credit risk management, fraud detection, and spam filtering. It also is a market research tool that helps reveal the sentiment or opinions of a given group of people. The data mining process breaks down into four steps:

  • Data is collected and loaded into data warehouses on site or on a cloud service.
  • Business analysts, management teams, and information technology professionals access the data and determine how they want to organize it.
  • Custom application software sorts and organizes the data.
  • The end user presents the data in an easy-to-share format, such as a graph or table.

Data Warehousing and Mining Software

Data mining programs analyze relationships and patterns in data based on user requests. They organize information into classes.

For example, a restaurant may want to use data mining to determine which specials it should offer and on what days. The data can be organized into classes based on when customers visit and what they order.

In other cases, data miners find clusters of information based on logical relationships or look at associations and sequential patterns to draw conclusions about trends in consumer behavior.

Warehousing is an important aspect of data mining. Warehousing is the centralization of an organization's data into one database or program. It allows the organization to spin off segments of data for specific users to analyze and use depending on their needs.

Cloud data warehouse solutions use the space and power of a cloud provider to store data. This allows smaller companies to leverage digital solutions for storage, security, and analytics.

Data Mining Techniques

Data mining uses algorithms and various other techniques to convert large collections of data into useful output. The most popular types of data mining techniques include association rules, classification, clustering, decision trees, K-Nearest Neighbor, neural networks, and predictive analysis.

  • Association rules , also referred to as market basket analysis, search for relationships between variables. This relationship in itself creates additional value within the data set as it strives to link pieces of data. For example, association rules would search a company's sales history to see which products are most commonly purchased together; with this information, stores can plan, promote, and forecast.
  • Classification uses predefined classes to assign to objects. These classes describe the characteristics of items or represent what the data points have in common with each other. This data mining technique allows the underlying data to be more neatly categorized and summarized across similar features or product lines.
  • Clustering is similar to classification. However, clustering identifies similarities between objects, then groups those items based on what makes them different from other items. While classification may result in groups such as "shampoo," "conditioner," "soap," and "toothpaste," clustering may identify groups such as "hair care" and "dental health."
  • Decision trees are used to classify or predict an outcome based on a set list of criteria or decisions. A decision tree is used to ask for the input of a series of cascading questions that sort the dataset based on the responses given. Sometimes depicted as a tree-like visual, a decision tree allows for specific direction and user input when drilling deeper into the data.
  • K-Nearest Neighbor (KNN) is an algorithm that classifies data based on its proximity to other data. The basis for KNN is rooted in the assumption that data points that are close to each other are more similar to each other than other bits of data. This non-parametric, supervised technique is used to predict the features of a group based on individual data points.
  • Neural networks process data through the use of nodes. Each node comprises inputs, weights, and an output. Data is mapped through supervised learning, in a structure loosely modeled on the interconnected neurons of the human brain. This model can be programmed to give threshold values to determine a model's accuracy.
  • Predictive analysis strives to leverage historical information to build graphical or mathematical models to forecast future outcomes. Overlapping with regression analysis, this technique aims to estimate an unknown future figure based on the data currently on hand.
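Of the techniques listed above, K-Nearest Neighbor is the simplest to sketch in a few lines. The example below is a minimal illustration that assumes numeric features, Euclidean distance, and majority voting; the function name `knn_predict` and the sample points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (features, label) pairs, where features is a tuple of numbers.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (visits per month, average basket size) -> segment
train = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"),
    ((5.2, 4.9), "high"),
    ((4.8, 5.1), "high"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 4.8)))
```

Because the method only compares distances, features on very different scales would usually need to be standardized first.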

The Data Mining Process

To be most effective, data analysts generally follow a certain flow of tasks along the data mining process. Without this structure, an analyst may encounter an issue in the middle of their analysis that could have easily been prevented had they prepared for it earlier. The data mining process is usually broken into the following steps.

Step 1: Understand the Business

Before any data is touched, extracted, cleaned, or analyzed, it is important to understand the underlying entity and the project at hand. What are the goals the company is trying to achieve by mining data? What is their current business situation? What are the findings of a SWOT analysis? Before looking at any data, the mining process starts by understanding what will define success at the end of the process.

Step 2: Understand the Data

Once the business problem has been clearly defined, it's time to start thinking about data. This includes what sources are available, how they will be secured and stored, how the information will be gathered, and what the final outcome or analysis may look like. This step also includes determining the limits of the data, storage, security, and collection and assesses how these constraints will affect the data mining process.

Step 3: Prepare the Data

Data is gathered, uploaded, extracted, or calculated. It is then cleaned, standardized, scrubbed for outliers, assessed for mistakes, and checked for reasonableness. During this stage of data mining, the data may also be checked for size as an oversized collection of information may unnecessarily slow computations and analysis.
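A data preparation pass of this kind might look like the sketch below. It assumes records are dictionaries with a `customer` name and an `amount`; it drops missing values, standardizes names, removes duplicates, and scrubs outliers using distance from the median measured in median absolute deviations. The field names, the `prepare` function, and the cutoff of 5.0 are all illustrative assumptions.

```python
import statistics

def prepare(records, field="amount", max_dev=5.0):
    """Clean records: drop missing values, standardize, dedupe, scrub outliers."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec.get(field) is None:
            continue                                   # missing value
        key = (rec["customer"].strip().lower(), float(rec[field]))
        if key in seen:
            continue                                   # duplicate after standardizing
        seen.add(key)
        cleaned.append({"customer": key[0], field: key[1]})
    values = [r[field] for r in cleaned]
    if not values:
        return cleaned
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return cleaned                                 # no spread to judge outliers by
    return [r for r in cleaned if abs(r[field] - med) / mad <= max_dev]

# Hypothetical raw records
raw = [
    {"customer": " Ann ", "amount": "12.5"},
    {"customer": "ann", "amount": 12.5},
    {"customer": "Bob", "amount": None},
    {"customer": "Cy", "amount": 11.0},
    {"customer": "Di", "amount": 13.0},
    {"customer": "Ed", "amount": 12.0},
    {"customer": "Flo", "amount": 900.0},
]
prepared = prepare(raw)
print([r["customer"] for r in prepared])
```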

Step 4: Build the Model

With a clean data set in hand, it's time to crunch the numbers. Data scientists use the types of data mining above to search for relationships, trends, associations, or sequential patterns. The data may also be fed into predictive models to assess how previous bits of information may translate into future outcomes.

Step 5: Evaluate the Results

The data-centered aspect of data mining concludes by assessing the findings of the data model or models. The outcomes from the analysis may be aggregated, interpreted, and presented to decision-makers that have largely been excluded from the data mining process to this point. In this step, organizations can choose to make decisions based on the findings.

Step 6: Implement Change and Monitor

The data mining process concludes with management taking steps in response to the findings of the analysis. The company may decide the information was not strong enough or the findings were not relevant, or the company may strategically pivot based on findings. In either case, management reviews the ultimate impacts on the business and starts future data mining loops by identifying new business problems or opportunities.

Different data mining process models will have different steps, though the general process is usually pretty similar. For example, the Knowledge Discovery in Databases (KDD) model has nine steps, the CRISP-DM model has six steps, and the SEMMA process model has five steps.

Applications of Data Mining

In today's age of information, almost any department, industry, sector, or company can make use of data mining.

Data mining encourages smarter, more efficient use of capital to drive revenue growth. Consider the point-of-sale register at your favorite local coffee shop. For every sale, that coffeehouse collects the time a purchase was made and what products were sold. Using this information, the shop can strategically craft its product line.

Once the coffeehouse knows its ideal line-up, it's time to implement the changes. However, to make its marketing efforts more effective, the store can use data mining to understand where its clients see ads, what demographics to target, where to place digital ads, and what marketing strategies most resonate with customers. This includes aligning marketing campaigns, promotional offers, cross-sell offers, and programs to the findings of data mining.

Manufacturing

For companies that produce their own goods, data mining plays an integral part in analyzing how much each raw material costs, what materials are being used most efficiently, how time is spent along the manufacturing process, and what bottlenecks negatively impact the process. Data mining helps ensure the flow of goods is uninterrupted.

Fraud Detection

The heart of data mining is finding patterns, trends, and correlations that link data points together. Therefore, a company can use data mining to identify outliers or correlations that should not exist. For example, a company may analyze its cash flow and find a recurring transaction to an unknown account. If this is unexpected, the company may wish to investigate whether funds are being mismanaged.
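The cash-flow example can be sketched as a simple rule: flag any account that is not on an approved list yet receives payments repeatedly. The account names, the `recurring_unknown_payees` function, and the threshold of three repeats are assumptions for illustration; real fraud screening would use far richer signals.

```python
from collections import Counter

def recurring_unknown_payees(transactions, known_accounts, min_repeats=3):
    """Return accounts outside `known_accounts` that were paid at least
    `min_repeats` times, sorted by name."""
    counts = Counter(
        t["to"] for t in transactions if t["to"] not in known_accounts
    )
    return sorted(acct for acct, n in counts.items() if n >= min_repeats)

# Hypothetical ledger
known = {"ACME-SUPPLY", "PAYROLL"}
transactions = [
    {"to": "ACME-SUPPLY", "amount": 500.0},
    {"to": "X-9921", "amount": 49.99},
    {"to": "PAYROLL", "amount": 12000.0},
    {"to": "X-9921", "amount": 49.99},
    {"to": "ONE-OFF", "amount": 75.0},
    {"to": "X-9921", "amount": 49.99},
]
flagged = recurring_unknown_payees(transactions, known)
print(flagged)
```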

Human Resources

Human resources departments often have a wide range of data available for processing including data on retention, promotions, salary ranges, company benefits, use of those benefits, and employee satisfaction surveys. Data mining can correlate this data to get a better understanding of why employees leave and what entices new hires.

Customer Service

Customer satisfaction may be caused (or destroyed) by many events or interactions. Imagine a company that ships goods. A customer may be dissatisfied with shipping times, shipping quality, or communications. The same customer may be frustrated with long telephone wait times or slow e-mail responses. Data mining gathers operational information about customer interactions and summarizes the findings to pinpoint weak points and highlight what the company is doing right.

Advantages and Disadvantages of Data Mining

Pros

  • It drives profitability and efficiency
  • It can be applied to any type of data and business problem
  • It can reveal hidden information and trends

Cons

  • It is complex
  • Results and benefits are not guaranteed
  • It can be expensive

Pros Explained

  • Profitability and efficiency: Data mining ensures a company is collecting and analyzing reliable data. It is often a more rigid, structured process that formally identifies a problem, gathers data related to the problem, and strives to formulate a solution. Therefore, data mining helps a business become more profitable, more efficient, or operationally stronger.
  • Wide applications : Data mining can look very different across applications, but the overall process can be used with almost any new or legacy application. Essentially any type of data can be gathered and analyzed, and almost every business problem that relies on qualifiable evidence can be tackled using data mining.
  • Hidden information and trends : The end goal of data mining is to take raw bits of information and determine if there is cohesion or correlation among the data. This benefit of data mining allows a company to create value with the information they have on hand that would otherwise not be overly apparent. Though data models can be complex, they can also yield fascinating results, unearth hidden trends, and suggest unique strategies.

Cons Explained

  • Complexity : The complexity of data mining is one of its greatest disadvantages. Data analytics often requires technical skill sets and certain software tools. Smaller companies may find this to be a barrier of entry too difficult to overcome.
  • No guarantees: Data mining doesn't always mean guaranteed results. A company may perform statistical analysis, make conclusions based on strong data, implement changes, and not reap any benefits. This may be due to inaccurate findings, market changes, model errors, or inappropriate data populations. Data mining can only guide decisions and not ensure outcomes.
  • High cost : There is also a cost component to data mining. Data tools may require costly subscriptions, and some data may be expensive to obtain. Security and privacy concerns can be pacified, though additional IT infrastructure may be costly as well. Data mining may also be most effective when using huge data sets; however, these data sets must be stored and require heavy computational power to analyze.

Even large companies or government agencies have challenges with data mining. Consider the FDA's white paper on data mining that outlines the challenges of bad information, duplicate data, underreporting, or overreporting.

One of the most lucrative applications of data mining has been undertaken by social media companies. Platforms like Facebook, TikTok, Instagram, and X (formerly Twitter) gather reams of data about their users based on their online activities.

That data can be used to make inferences about their preferences. Advertisers can target their messages to the people who appear to be most likely to respond positively.

Data mining on social media has become a big point of contention, with several investigative reports and exposés showing just how intrusive mining users' data can be. At the heart of the issue is that users may agree to the terms and conditions of the sites not realizing how their personal information is being collected or to whom their information is being sold.

Examples of Data Mining

Data mining can be used for good, or it can be used illicitly. Here is an example of each.

eBay and e-Commerce

eBay collects countless bits of information every day from sellers and buyers. The company uses data mining to attribute relationships between products, assess desired price ranges, analyze prior purchase patterns, and form product categories.

eBay outlines the recommendation process as:

  • Raw item metadata and user historical data are aggregated.
  • Scripts are run on a trained model to generate and predict the item and user.
  • A KNN search is performed.
  • The results are written to a database.
  • The real-time recommendation takes the user ID, calls the database results, and displays them to the user.
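The flow above can be sketched in miniature: precompute each user's nearest items offline, store them, and look them up at request time. This is not eBay's implementation. The hand-written vectors stand in for the output of a trained model, cosine similarity is an assumed choice of distance, and a plain dictionary stands in for the database.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_recommendations(user_vecs, item_vecs, k=2):
    """Offline step: KNN search per user, results written to a store."""
    store = {}
    for user, uvec in user_vecs.items():
        ranked = sorted(
            item_vecs, key=lambda item: cosine(uvec, item_vecs[item]), reverse=True
        )
        store[user] = ranked[:k]
    return store

def recommend(store, user_id):
    """Real-time step: take the user ID and read the stored results."""
    return store.get(user_id, [])

# Hypothetical vectors standing in for model output
item_vecs = {"camera": (1.0, 0.0), "lens": (0.9, 0.1), "sofa": (0.0, 1.0)}
user_vecs = {"u1": (1.0, 0.0), "u2": (0.1, 1.0)}

store = build_recommendations(user_vecs, item_vecs, k=2)
print(recommend(store, "u1"))
print(recommend(store, "u2"))
```

Doing the search ahead of time keeps the real-time path to a single lookup, which is the design the listed steps describe.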

Facebook-Cambridge Analytica Scandal

A cautionary example of data mining is the Facebook-Cambridge Analytica data scandal. During the 2010s, the British consulting firm Cambridge Analytica Ltd. collected personal data from millions of Facebook users. This information was later analyzed for use in the 2016 presidential campaigns of Ted Cruz and Donald Trump. It is suspected that Cambridge Analytica interfered with other notable events such as the Brexit referendum.

In light of this inappropriate data mining and misuse of user data, Facebook agreed to pay $100 million for misleading investors about its uses of consumer data. The Securities and Exchange Commission claimed Facebook discovered the misuse in 2015 but did not correct its disclosures for more than two years.

What Are the Types of Data Mining?

There are two main types of data mining: predictive data mining and descriptive data mining. Predictive data mining extracts data that may be helpful in determining an outcome. Descriptive data mining informs users of a given outcome.

How Is Data Mining Done?

Data mining relies on big data and advanced computing processes including machine learning and other forms of artificial intelligence (AI). The goal is to find patterns that can lead to inferences or predictions from large and unstructured data sets.

What Is Another Term for Data Mining?

Data mining also goes by the less-used term "knowledge discovery in databases," or KDD.

Where Is Data Mining Used?

Data mining applications have been designed to take on just about any endeavor that relies on big data. Companies in the financial sector look for patterns in the markets. Governments try to identify potential security threats. Corporations, especially online and social media companies, use data mining to create profitable advertising and marketing campaigns that target specific sets of users.

Modern businesses have the ability to gather information on their customers, products, manufacturing lines, employees, and storefronts. These random pieces of information may not tell a story, but the use of data mining techniques, applications, and tools helps piece together information.

The ultimate goal of the data mining process is to compile data, analyze the results, and execute operational strategies based on data mining results.

Shafique, Umair, and Qaiser, Haseeb. "A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)." International Journal of Innovation and Scientific Research, vol. 12, no. 1, November 2014, pp. 217-222.

Food and Drug Administration. "Data Mining at FDA – White Paper."

eBay. "Building a Deep Learning Based Retrieval System for Personalized Recommendations."

Federal Trade Commission. "FTC Issues Opinion and Order Against Cambridge Analytica for Deceiving Consumers About Collection of Facebook Data, Compliance With EU-U.S. Privacy Shield."

U.S. Securities and Exchange Commission. "Facebook to Pay $100 Million for Misleading Investors About the Risks It Faced From Misuse of User Data."


CONCLUSIONS AND DISCUSSION

Data mining is a good area of scientific study, holding ample promise for the research community. Recently a lot of progress has been reported for large databases, specifically involving association rules, classification, clustering, similar time sequences, similar text document retrieval, similar image retrieval, outlier discovery, etc. Many papers have been published in major conferences and leading journals. However, it still remains a promising and rich field with many challenging research issues.

In this chapter we have provided an introduction to knowledge discovery from databases and data mining. The major functions of data mining have been described from the perspectives of machine learning, pattern recognition, and artificial intelligence. Handling of multimedia data, their compression, matching, and their implications to text and image mining have been discussed. We have also stated principles of string matching, explaining how they can be applied in text retrieval and in Bioinformatics for DNA search type of operations. Different application domains and research challenges have also been highlighted.

Since the databases to be mined are often very large, parallel algorithms are desirable [50]. However, one has to explore a trade-off between computation, communication, memory usage, synchronization, and the use of problem-specific information, in order to select a suitable parallel algorithm for data mining. One can also partition the data appropriately and distribute the subsets to multiple processors, learning concept descriptions in parallel and then combining them. This corresponds to loosely coupled collections of otherwise independent algorithms and is termed distributed data mining [51].
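The partition, learn-in-parallel, and combine pattern can be illustrated with a minimal sketch. Here the "concept description" learned on each subset is just an item-frequency table, and a thread pool stands in for multiple processors; the function names and sample transactions are assumptions made for this example, not the methods of [50] or [51].

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_parts):
    """Split data into n_parts roughly equal subsets (round-robin)."""
    return [data[i::n_parts] for i in range(n_parts)]

def local_counts(transactions):
    """Work done independently on one subset: count item occurrences."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts

def distributed_counts(data, n_parts=2):
    """Distribute subsets to workers, then combine the partial results."""
    parts = partition(data, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(local_counts, parts))
    return sum(partials, Counter())          # combining step

# Hypothetical transactions
data = [
    ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"], ["a"], ["c"],
]
combined = distributed_counts(data, n_parts=3)
print(combined)
```

Counting combines exactly because partial counts simply add; for models that do not merge so cleanly, the combining step is where the trade-offs mentioned above become significant.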

Traditional data mining algorithms require all data to be mined in a single, centralized data warehouse. A fundamental challenge is to develop distributed versions of data mining algorithms, so that data mining can be done while leaving some of the data in different places. In addition, appropriate protocols, languages, and network services are required for mining distributed data, handling the meta-data and the mappings required for mining the distributed data.

Spatial database systems involve spatial data, that is, point objects or spatially extended objects in a 2D/3D or some high-dimensional feature space. Knowledge discovery is becoming more and more important in these databases, as increasingly large amounts of data obtained from satellite images, X-ray crystallography, or other automatic equipment are being stored in the spatial framework. Image mining holds promise in handling such databases. Moreover, Bioinformatics offers applications in modeling or analyzing protein structures that are represented as spatial data.

There exists plenty of scope for the use of soft computing in data mining, because of the imprecise nature of data in many application domains. For example, neural nets can help in learning, fuzzy sets in natural language representation and imprecision handling, and genetic algorithms in search and optimization. However, not much work has been reported on the use of soft computing tools in data mining. The relevance of soft computing lies in its ability to (i) handle subjectivity, imprecision, and uncertainty in queries, (ii) model document relevance as a gradual instead of a crisp property, (iii) provide deduction capability to the search engines, (iv) provide personalization and learning capability, and (v) deal with the dynamism, scale, and heterogeneity of Web documents.

We take this opportunity to compile in this book the existing literature on the various aspects of data mining, highlighting its application to multimedia information and Bioinformatics. Soft computing, an emergent technology, has also demonstrated ample promise in data mining. Chapter 2 focuses on an introduction to soft computing, its tools, and finally its role in the different functions of data mining. The fundamentals of multimedia data compression, particularly text and image compression, are dealt with in Chapter 3. Chapter 4 deals in depth with various issues in string matching. Here we provide examples to show how patterns are matched in general text, as well as how they can be applied in DNA matching in Bioinformatics. The different tasks of data mining like classification, clustering and association rules are covered in Chapters 5, 6, and 7, respectively. The issue of rule generation and modular hybridization, in the soft computing framework, is described in Chapter 8.

Multimedia data mining, including text mining, image mining, and Web mining, is dealt with in Chapter 9. Finally, certain aspects of Bioinformatics, as an application of data mining, are discussed in Chapter 10.

REFERENCES

1. U. Fayyad and R. Uthurusamy, "Data mining and knowledge discovery in databases," Communications of the ACM, vol. 39, pp. 24-27, 1996.

2. W. H. Inmon, "The data warehouse and data mining," Communications of the ACM, vol. 39, pp. 49-50, 1996.

3. T. Acharya and W. Metz, "Multimedia applications: Issues and challenges," in Proceedings of the International Conference on Communications, Computers and Devices (Indian Institute of Technology, Kharagpur, India), pp. 27-34, December 2000.

4. P. Piatetsky-Shapiro and W. J. Frawley, eds., Knowledge Discovery in Databases. Menlo Park, CA: AAAI/MIT Press, 1991.

5. President's Information Technology Advisory Committee's report, Washington, http://www.ccic.gov/ac/interim/, 1998.

6. M. Lesk, Practical Digital Libraries: Books, Bytes, and Bucks. San Francisco: Morgan Kaufmann, 1997.

7. S. L. Salzberg, D. B. Searls, and S. Kasif, eds., Computational Methods in Molecular Biology. Amsterdam: Elsevier Sciences B. V., 1998.

8. R. L. Blum, Discovery and Representation of Causal Relationships from a Large Time-Oriented Clinical Database: The RX Project, vol. 19 of Lecture Notes in Medical Informatics. New York: Springer-Verlag, 1982.

9. J. A. Major and D. R. Riedinger, "EFD-a hybrid knowledge statistical-based system for the detection of fraud," International Journal of Intelligent Systems, vol. 7, pp. 687-703, 1992.

10. R. Heider, "Troubleshooting CFM 56-3 engines for the Boeing 737-using CBR and data-mining," Lecture Notes in Computer Science, vol. 1168, pp. 512-523, 1996.

11. U. Fayyad, D. Haussler, and P. Stolorz, "Mining scientific data," Communications of the ACM, vol. 39, pp. 51-57, 1996.

12. O. Etzioni, "The World-Wide Web: Quagmire or goldmine?," Communications of the ACM, vol. 39, pp. 65-68, 1996.

13. J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Diego: Academic Press, 2001.

14. S. Mitra, S. K. Pal, and P. Mitra, "Data mining in soft computing framework: A survey," IEEE Transactions on Neural Networks, vol. 13, pp. 3-14, 2002.

15. D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining. London: MIT Press, 2001.

16. M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms. Hoboken, NJ: Wiley Interscience, IEEE Press, 2002.

17. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., Advances in Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI/MIT Press, 1996.

18. "Special issue on knowledge discovery in data- and knowledge bases," International Journal of Intelligent Systems, vol. 7, no. 7, 1992.

19. K. J. Cios, W. Pedrycz, and R. Swiniarski, Data Mining Methods for Knowledge Discovery. Dordrecht: Kluwer, 1998.

20. U. Fayyad, G. P. Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM, vol. 39, pp. 27-34, 1996.

21. A. Silberschatz and A. Tuzhilin, "What makes patterns interesting in knowledge discovery systems," IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 970-974, 1996.

22. T. M. Mitchell, "Machine learning and data mining," Communications of the ACM, vol. 42, pp. 30-36, 1999.

23. F. Provost and V. Kolluri, "A survey of methods for scaling up inductive algorithms," Data Mining and Knowledge Discovery, vol. 2, pp. 131-169, 1999.

24. T. Acharya, VLSI Algorithms and Architectures for Data Compression. Ph.D. thesis, Department of Computer Science, University of Central Florida, Orlando, FL, August 1994.

25. K. Sayood, Introduction to Data Compression. San Francisco: Morgan Kaufmann, 2000.

26. P. Mitra, C. A. Murthy, and S. K. Pal, "Density based multiscale data condensation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 734-747, 2002.

27. R. Kohavi, B. Masand, M. Spiliopoulou, and J. Srivastava, "Web mining," Data Mining and Knowledge Discovery, vol. 6, pp. 5-8, 2002.

28. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Reading, MA: Addison-Wesley, 1999.

29. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.

30. O. Etzioni and D. S. Weld, "Intelligent agents on the Internet: Facts, fiction, and forecast," IEEE Expert, vol. 10, pp. 44-49, August 1995.

31. P. Maes, "Agents that reduce work and information overload," Communications of the ACM, vol. 37, pp. 30-40, 1994.

32. I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco: Morgan Kaufmann, 1999.

33. J. Zhang, W. Hsu, and M. L. Lee, "Image mining: Issues, frameworks and techniques," in Proceedings of the 2nd International Workshop on Multimedia Data Mining and the ACM SIGKDD Conference (MDM/KDD '01) (San Francisco), pp. 13-20, August 2001.

34. A. Vailaya, A. T. Figueiredo, A. K. Jain, and H. J. Zhang, "Image classification for content-based indexing," IEEE Transactions on Image Processing, vol. 10, pp. 117-130, 2001.

35. J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles. London: Addison-Wesley, 1974.

36. J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.

37. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks/Cole, 1984.

38. S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan College Publishing Co., 1994.

39. R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proceedings of 1993 ACM SIGMOD International Conference on Management of Data (Washington, D.C.), pp. 207-216, May 1993.

40. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proceedings of 20th International Conference on Very Large Databases, pp. 478-499, September 1994.

41. D. Knuth, J. Morris, and V. Pratt, "Fast pattern matching in strings," SIAM Journal of Computing, vol. 6, pp. 323-350, 1977.

42. R. Boyer and J. A. Moore, "A fast string searching algorithm," Communications of the ACM, vol. 20, pp. 762-772, 1977.

43. G. A. Stephen, String Searching Algorithms. Singapore: World Scientific, 2001.

44. M. J. Fischer and M. S. Paterson, "String matching and other products," in Complexity of Computation, SIAM-AMS Proceedings (R. M. Karp, ed.), vol. 7, pp. 113-125, 1974.

45. T. P. Baker, "A technique for extending rapid exact-match string matching to arrays of more than one dimension," SIAM Journal of Computing, vol. 7, pp. 533-541, 1978.

46. R. S. Bird, "Two dimensional pattern matching," Information Processing Letters, vol. 6, pp. 168-170, 1977.

47. A. Amir, G. M. Landau, and U. Vishkin, "Efficient pattern matching with scaling," Journal of Algorithms, vol. 13, pp. 2-32, 1992.

48. "Special Issue on Bioinformatics, Part I: Advances and Challenges," Proceedings of the IEEE, vol. 90, November 2002.

49. K. A. Dill, S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas, and H. S. Chan, "Principles of protein folding-A perspective from simple exact models," Protein Science, vol. 4, pp. 561-602, 1995.

50. R. Agrawal and J. C. Shafer, "Parallel mining of association rules," IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 962-969, 1996.

51. H. Kargupta and P. Chan, eds., Advances in Distributed and Parallel Knowledge Discovery. Cambridge, MA: MIT Press, 2000.

Data Mining, Research Paper Example

Data mining is the process of discovering patterns in large data sets. It automates the detection of relevant patterns in a database, drawing on methods at the intersection of database systems, machine learning, statistics, and artificial intelligence. Data mining extracts significant information from a data set and uses the discovered patterns to transform that information into a structure that can be readily understood and put to further use. It involves more than the raw analysis step: it also covers online updating, complexity considerations, and model and inference considerations. In addition, data mining encompasses data management, data processing, and the post-processing of discovered structures (Pyle, 2003).

Data mining involves the automatic or semi-automatic analysis of large quantities of data. Patterns are extracted from data records through analyses such as association rule mining, anomaly detection, and cluster analysis, often supported by database techniques such as spatial indexes. The process usually involves searching, collecting, cleaning, and analyzing data from different database sources in order to evaluate them. It can thus be described as an automated analysis of stored data, including files found online, that discovers patterns which would otherwise have gone undiscovered and unexplored. Data mining comprises several classes of tasks: anomaly detection, association rule learning, classification, regression, clustering, and summarization. Each class helps ensure that a business's or organization's data and operations are handled appropriately.

Data mining algorithms are applied only after the target data have been assembled. The target data set must be large enough to contain the patterns of interest, yet small enough to be mined within the available time. A data mart or data warehouse is the usual source of data for mining (Pyle, 2003). Once assembled, the target data are cleaned: observations that contain noise are normally removed, and those with missing data are set aside.

Data mining is a technological advance that has emerged alongside the growth of the IT industry and economic development, and it has become a popular process. In recent years many companies have sought the solutions data mining provides, because it gives them an advantage over their competitors. With its aid, companies have gathered data from various sources, which has improved their efficiency. Business intelligence has developed with the help of data mining; it involves gathering meaningful data from several sources, especially online platforms, with the aim of reaching sensible business decisions (Pyle, 2003). This typically covers economic trends, industry research, competitor and competition analysis, and geographical and market information. With the help of data mining, organizations and businesses have been better able to respond to their competitors.

Data mining helps companies discover information about their customers and about how those customers behave towards products. Businesses can then analyze, evaluate, store, and synthesize crucial customer-related information. Data mining is therefore a significant tool that enables organizations to improve their marketing strategies and to analyze their customers appropriately.

The process has been of considerable help to organizations in providing a solid customer focus, because it is flexible in application and can anticipate crucial data such as customer buying behavior, in addition to supporting industry analysis.

Data mining is a reliable aid in carrying out business processes. It is one of the steps that take place between a company and its customers. The influence of data mining on a business depends on the business's own processes rather than on the data mining process itself. Data mining results are usually distinct from those of other data-driven business processes. When customer data are analyzed with data mining, the results the user obtains may include information already known to them, since it already existed in the database. Data mining has enabled businesses that sell products in different regions to translate the information they find into an appropriate understanding of their various business processes.

The process is valid in that it extracts hidden information from the database, some of which the user might not have known existed. It also helps uncover relationships between customer behavior and other variables, relationships that are often non-intuitive. An advantage of data mining here is that its output can be translated into solutions for business problems, which benefits the business. Data mining has proved reliable in that its output has enabled companies to identify lists of target customers and, for example, to increase their credit limits. The people involved have little left to do, since most of the work is accomplished by the data mining process; this makes the approach effective and efficient, and hence affordable to other businesses.

However, it has proved difficult to put data mining results to use, to get customers to understand the process, and to have them act on it. These processes can be of considerable benefit to customers when the data mining system is explained clearly enough for them to understand it qualitatively. Failure to do so renders the process inappropriate (Kudyba & Hoptroff, 2001).

The reliability of the process can be assessed in several ways. These include measuring statistical validity to determine where problems lie, separating the data into training and test sets to measure prediction accuracy, and reviewing the results to judge whether the discovered patterns are meaningful. Using all of these methods makes data mining more effective. Data mining can be trusted only when the company has put appropriate procedures in place for assessing the information it produces. The process is, however, unreliable when information is extracted from customers' hidden behaviors and understanding these processes becomes complicated.
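The training and testing approach mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the records (a spend value and a yes/no response), the 30% test fraction, and the simple midpoint-threshold rule are all hypothetical and stand in for whatever model a company actually uses.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction of them for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Learn a one-feature rule: the midpoint between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, test):
    """Fraction of held-out records the rule classifies correctly."""
    hits = sum(1 for x, y in test if (x > threshold) == (y == 1))
    return hits / len(test)


# Hypothetical records: (monthly spend, responded to offer).
records = [(x, 1 if x >= 50 else 0) for x in range(0, 100, 5)]
train, test = train_test_split(records)
threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
print(f"threshold={threshold:.1f}, held-out accuracy={test_accuracy:.2f}")
```

The key design point is that accuracy is measured only on records the rule never saw during fitting, which gives a more honest estimate of how the discovered pattern will behave on new customers.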

Privacy is the paramount concern in data mining. The technology is prone to abuse by different parties. For example, when a person supplies information to a bank during loan processing, all the personal information is normally kept in a database, where it may be accessed by anyone with access to that database (Soares & Ghani, 2010). This has caused insecurity, since criminals can use private personal information to trace a person. Data mining usually assumes that the information in databases is held in one location within the organization. This is not the case, however: information in an organizational database can fall into the hands of people inside and outside the organization who access the database, so private information may become public and accessible on the internet, in violation of privacy policy.

Privacy concerns and legal issues in data mining are a leading source of conflict for business entities. In the recent past, government and corporate entities have collected data and stored it in data warehouses, placing the privacy of consumers in jeopardy.

Consumers have raised several privacy concerns, including secondary use of personal information, handling of misinformation, granulated access to personal information, and new privacy threats. The most substantial concern is the secondary use of private information: government and business entities access customer information held in organizational databases and use it for other purposes, mainly for their own benefit. This is a problem for consumers because their privacy is compromised without consultation. The concern is valid, since crucial information is left in the hands of strangers and consumers' security is put at risk (Shmueli et al., 2011).

The handling of misinformation by other parties who gain access to customers' private information in a company's database is also an ethical issue in data mining. Such information is prone to mishandling by third parties, which undermines its value. The consumers' concern is valid, since their personal information can be tampered with by other internet users. New privacy threats, posed by Knowledge Discovery and Data Mining (KDDM), have also led to interference with consumers' information. These threats arise from deductive learning, data collection, and statistical analysis. They endanger consumer privacy because there is no guarantee that personal information is secure. The concern is valid because, left unaddressed, personal information can end up in the hands of untrustworthy individuals.

Kudyba, S. & Hoptroff, R. (2001). Data Mining and Business Intelligence: A Guide to Productivity. Idea Group Inc (IGI).

Pyle, D. (2003). Business Modeling and Data Mining. Morgan Kaufmann.

Shmueli, G., Patel, N., & Bruce, P. (2011). Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. John Wiley & Sons.

Soares, C. & Ghani, R. (2010). Data Mining for Business Applications: Frontiers in Artificial Intelligence and Applications. IOS Press.

  • Open access
  • Published: 27 May 2024

Research on domain ontology construction based on the content features of online rumors

Jianbo Zhao, Huailiang Liu, Weili Zhang, Tong Sun, Qiuyi Chen, Yuehai Wang, Jiale Cheng, Yan Zhuang, Xiaojin Zhang, Shanzhuang Zhang, Bowei Li & Ruiyu Ding

Scientific Reports volume 14, Article number: 12134 (2024)

  • Computational neuroscience
  • Computer science
  • Data acquisition
  • Data integration
  • Data mining
  • Data processing
  • Human behaviour
  • Information technology
  • Literature mining
  • Machine learning
  • Scientific data

Online rumors are widespread and difficult to identify, and they cause serious harm to society and individuals. To detect and govern online rumors effectively, it is necessary to conduct in-depth semantic analysis and understand the content features of rumors. This paper proposes a TFI domain ontology construction method that aims to achieve semantic parsing and reasoning over rumor text content. Starting from the term layer, the frame layer, and the instance layer, and based on reuse of the top-level ontology, extraction of content features from the core literature, and discovery of new concepts in a real corpus, the paper obtains the core classes of the rumor domain ontology (five parent classes and 88 subclasses) and defines their concept hierarchy. Object properties and data properties are designed to describe relationships between entities or their features, and the instance layer is created from real rumor datasets. The ontology is encoded in OWL and visualized in Protégé; SWRL rules and the Pellet reasoner are used to mine and verify the implicit knowledge of the ontology and to judge the category of rumor texts. The resulting rumor domain ontology has high consistency and reliability.

Introduction

Online rumors are false information spread through online media; their content is wide-ranging [1] and they are hard to identify [2, 3]. Online rumors can mislead the public, disrupt social order, damage personal and collective reputations, and pose a great challenge to the governance of internet information content. Therefore, to detect and govern online rumors effectively, it is necessary to conduct an in-depth semantic analysis of the content features of rumor texts.

Research on the content features of online rumors focuses on the rumor text itself, including lexical, syntactic and semantic features [4], syntactic structure and functional features [5], source features [5, 6], rhetorical methods [7], narrative structure [6, 7, 8], language style [6, 9, 10], corroborative means [10, 11] and emotional features [10, 12, 13, 14, 15, 16, 17, 18]. Most existing research on rumor content features mines features within a single domain topic and does not examine how multiple features influence one another. Therefore, this paper proposes to build an online rumor domain ontology that supports fine-grained hierarchical modeling of the relationships between rumor content features and credible verification of its effectiveness. A domain ontology is a systematic description of what objectively exists in a specific discipline [19]. Construction methods mainly include the TOVE method [20], the skeleton method [21], the IDEF5 method [22, 23], the methontology method [24, 25] and the seven-step method [26, 27]. The seven-step method is currently the most mature and widely used [28] and is strongly systematic and applicable [29], but it does not provide quantitative indicators or methods for assessing the quality and effect of an ontology. Construction technology can be divided into technology based on thesaurus conversion, technology based on reuse of existing ontologies, and semi-automatic and automatic technology based on ontology engineering methods [30]. Thesaurus conversion and ontology reuse save construction time and cost and improve ontology reusability and interoperability, but the source and target often differ in structure, semantics and application scenario. Semi-automatic and automatic construction technology based on ontology engineering methods applies artificial intelligence to extract ontology elements and structures from data sources automatically, with high efficiency and low cost, but quality and accuracy are difficult to guarantee. Traditional domain ontology construction methods thus lack effective support for quality evaluation, and construction technologies are rarely applied in an integrated way. Therefore, this paper proposes an improved TFI online rumor domain ontology construction method based on the seven-step method. Starting from the term layer, the frame layer and the instance layer, it integrates reuse of the top-level ontology and of content features from the core literature with bottom-up semi-automatic construction based on an N-gram new word discovery algorithm and a RoBERTa-Kmeans clustering algorithm, defines the fine-grained features of online rumor content, and models them hierarchically. SWRL rules and the Pellet reasoner are used to mine the implicit knowledge of the ontology and to evaluate and verify its validity and consistency.
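To give a feel for what N-gram based new word discovery involves, the Python sketch below counts word bigrams across a small corpus and keeps those that occur often enough and are not already in a known vocabulary. This is only a simplified frequency-based illustration: the function names, the `min_count` threshold, the toy English corpus, and the known-term set are assumptions, and the paper's actual algorithm and its scoring criteria are not described in this section.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in one tokenized document."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def candidate_terms(documents, known_terms, n=2, min_count=2):
    """Return frequent n-grams that are not already known terms."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngram_counts(tokens, n))
    return sorted(
        (" ".join(gram), count)
        for gram, count in counts.items()
        if count >= min_count and " ".join(gram) not in known_terms
    )


# Hypothetical tokenized rumor texts and an existing term list.
documents = [
    "vaccine causes magnetic arm".split(),
    "magnetic arm after vaccine".split(),
    "experts deny magnetic arm claim".split(),
]
known_terms = {"vaccine causes"}
new_terms = candidate_terms(documents, known_terms)
```

In practice, candidates found this way would still need filtering and human review before being added to an ontology as concepts, which is consistent with the semi-automatic character of the construction described above.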

The structure of this paper is as follows: Sect. “Related work” introduces the content features of rumors and related work on domain ontology construction; Sect. “Research method” describes the research method; Sect. “Domain ontology construction” constructs the term layer, the frame layer and the instance layer of the domain ontology; Sect. “Ontology reasoning and validation” mines and verifies the implicit knowledge of the ontology based on SWRL rules and the Pellet reasoner; Sect. “Discussion” points out the research limitations and future research directions; Sect. “Conclusion” summarizes the research content and contribution of this paper.

Related Work

Content features of online rumors

The content features of online rumors refer to the description of the vocabulary, syntax and semantics of rumor texts. Fu et al. [5] carried out a linguistic analysis of online rumors about COVID-19 from the perspectives of pragmatics, discourse analysis and syntax, and concluded that the source of information, the specific place and time of the event, the length of the title and statement, and the emotions aroused are important characteristics for judging the authenticity of rumors. Zhang et al. [6] summarized the narrative theme, narrative characteristics, topic characteristics, language style and source characteristics of new media rumors. Li et al. [7] found that rumor headlines rely rhetorically on authoritative endorsement and fear appeals, make extensive use of news-style and numerical headlines, and mostly construct topics with fixed, formulaic structures. Yu et al. [8] analyzed and summarized in detail the content distribution, narrative structure, topic scene construction and title characteristics of rumors. Mourao et al. [9] found that the language style of rumors differs significantly from that of real texts, and that rumors tend to use simpler, more emotional and more radical discourse strategies. Zhou et al. [10] analyzed rumor texts using six analysis categories, including content type, focus object and corroboration means, and found that epidemic rumors mostly concerned “infectious” topics, most commonly used narrative expression, conveyed strong fear, and preferred an exaggerated and polarized discourse style. Huang et al. [11] conducted an empirical study of WeChat rumors and found that the “confirmation” means of rumors include data corroboration and specific information, hot events and authoritative release. Butt et al. [12] analyzed the psycholinguistic features of rumors and extracted four feature sets from the rumor data set: LIWC, readability, SenticNet and emotions. Zhou et al. [13] analyzed the semantic features of fake news content in terms of theme and emotion, and found that fake news and real news are distributed differently across themes, and that overall emotion, negative emotion and anger are higher in fake news. Tan et al. [14] divided the content characteristics of rumors into content characteristics with a certain emotional tendency and social characteristics that affect credibility. Damstra et al. [15] identified elements that are consistent indicators of intentionally deceptive news content, including negative emotions that provoke anger or fear, lengthy sensational headlines, and informal language or swearing. Lai et al. [16] proposed that emotional rumors can induce similar positive or negative emotions in their audience through emotional contagion. Yuan et al. [17] found that multimedia evidence and topic shaping are important means of creating rumors, that rumors mostly convey the negative emotions of fear and anger, and that the provision of information sources is related to the popularity and duration of rumors. Ruan et al. [18] analyzed the content types, emotional types and discourse focus of rumor samples from Weibo, and found that social life rumors accounted for the highest proportion, that the emotional types were mainly hostile and fearful, and that the focus was on the general public and the personnel of party, government and military institutions.

The forms and contents of online rumors tend to be diversified and complicated. The existing research on the content features of rumors is mostly aimed at the mining of content characteristics under specific topics, which cannot cover various types of rumor topics, and lacks fine-grained hierarchical modeling of the relationship between features and credible verification of their effectiveness.

Domain ontology construction

Domain ontology is a unified definition, standardized organization and visual representation of the concepts of knowledge in a specific domain 31, 32, and it is an important source of information for knowledge-based systems 19, 33. Theoretical methods include the TOVE method 20, the skeleton method 21, the IDEF5 method 22, 23, the methontology method 24, 25 and the seven-step method 26, 27. The TOVE method transforms informal descriptions into a formal ontology, which suits fields that need accurate knowledge, but it is complex and time-consuming, requires high-level domain knowledge and is not easy to extend and maintain. The skeleton method forms an ontology skeleton by defining the concepts and relationships of goals, activities, resources, organizations and environment; the skeleton can be adjusted as needed and suits fields that need multi-perspective and multi-level knowledge, but it lacks formal semantics and reasoning ability. Based on this method, Ran et al. 34 constructed the ontology of idioms and allusions. The IDEF5 method uses a chart language and a detailed description language to construct ontologies, formalizing and visualizing objective knowledge; it suits fields that need multi-source data and multi-party participation, but it lacks a unified ontology representation language. Based on this method, Li et al. 35 constructed the business process activity ontology of military equipment maintenance support, and Song et al. 36 established the air defense and anti-missile operation process ontology. Methontology is a method close to software engineering. It develops ontologies systematically through the processes of specification, knowledge acquisition, conceptualization, integration, implementation, evaluation and documentation, which suits fields that need multi-technology and multi-ontology integration, but it is complicated and tedious and requires a lot of resources and time 37. Based on this method, Yang et al. 38 completed the ontology of emergency plans, Duan et al. 39 established the ontology of high-resolution images of rural residents, and Chen et al. 40 constructed the corpus ontology of Jiangui. The seven-step method is the most mature and widely used method at present 28. It constructs an ontology systematically by determining its purpose, scope, terms, structure, attributes, limitations and examples 29, but it does not provide quantitative indicators and methods for the quality and effect of the ontology. Based on this method, Zhu et al. 41 constructed the disease ontology of asthma, Li et al. 42 constructed the ontology of military events, the ontology of weapons and equipment and the ontology model of the battlefield environment, and Zhang et al. 43 constructed the ontology of the stroke nursing field and verified the construction results by expert consultation.

Domain ontology construction technology includes thesaurus conversion, existing ontology reuse, and semi-automatic and automatic construction based on ontology engineering methods 30. Construction based on thesaurus conversion takes an existing thesaurus as the knowledge source and transforms its concepts, terms and relationships into the entities and relationships of the domain ontology through certain rules and methods, which saves the time and cost of ontology construction and improves the quality and reusability of the ontology. However, the structural and semantic differences between thesaurus and ontology must be resolved, and the result must be adjusted and optimized according to the characteristics of different fields and application scenarios. Wu et al. 44 constructed the ontology of the natural gas market according to the thesaurus of the natural gas market and the mapping of subject words to the ontology, and Li et al. 45 constructed the ontology of the medical field according to the Chinese medical thesaurus. Construction based on existing ontology reuse uses existing ontologies or knowledge resources to generate new domain ontologies through modification, expansion, merging and mapping, which saves time and cost and improves the consistency and interoperability of ontologies, but it also needs to resolve semantic differences and conflicts between ontologies. Chen et al. 46 reused the top-level framework of the scientific evidence source information ontology (SEPIO) and the traditional Chinese medicine language system (TCMLS) to construct the ontology of clinical trials of traditional Chinese medicine, and Xiao et al. 47 constructed the domain ontology of COVID-19 by extracting the existing ontology and the knowledge related to COVID-19 in the diagnosis and treatment guide. Semi-automatic and automatic construction based on ontology engineering methods extracts the elements and structures of an ontology from data sources by using natural language processing, machine learning and other technologies, so as to realize large-scale, fast and low-cost domain ontology construction 48. However, it faces technical difficulties: the quality and accuracy of knowledge extraction cannot be well guaranteed, and the quality and consistency of different knowledge sources need to be considered. Su et al. 48 used regular templates and a clustering algorithm to construct the ontology of port machinery, Zheng et al. 49 realized the automatic construction of a mobile phone ontology through LDA and other models, Dong et al. 50 realized the automatic construction of an ontology for human–machine ternary data fusion in the manufacturing field, Linli et al. 51 proposed an ontology learning algorithm based on hypergraphs, and Zhai et al. 52 carried out ontology learning through part-of-speech tagging, dependency syntax analysis and pattern matching.

At present, domain ontology construction still faces several problems: construction methods are not easy to extend, effective quality evaluation support is lacking, construction technologies are not effectively integrated and applied, ontologies built in isolation from real data cannot guide subsequent practice, and ontology verification is subjective. To address these problems in the research on content characteristics and domain ontology construction of online rumors, this paper proposes an improved TFI online rumor domain ontology construction method based on the seven-step method. The method combines top-down reuse of existing ontologies with bottom-up semi-automatic construction, and establishes the rumor domain ontology across the term layer, frame layer and instance layer on the basis of top-level ontology reuse, content feature extraction from core literature, and new concept discovery in a real corpus. Using Protégé as the visualization tool, the implicit knowledge of the ontology is mined by constructing SWRL rules, in order to verify the semantic parsing ability and consistency of the domain ontology.

Research method

This paper proposes a TFI online rumor domain ontology construction method based on the improvement of the seven-step method, which includes the term layer, the frame layer and the instance layer construction.

Term layer construction

Determine the domain and scope: the purpose of constructing the rumor domain ontology is to support the credible detection and governance of online rumors, and the domain and scope of the ontology are determined by answering questions.

Three-dimensional term set construction: investigate the top-level ontology and related core literature, complete the mapping of reusable top-level ontology and rumor content feature concept extraction semi-automatically from top to bottom; establish authoritative real rumor datasets, and complete the domain new concept discovery automatically from bottom to top; based on this, determine the term set of the domain ontology.

Frame layer construction

Define core classes and hierarchical relationships: combine the concepts of the three-dimensional rumor term set, based on the data distribution of the rumor dataset, define the parent class, summarize the subclasses, design hierarchical relationships and explain the content of each class.

Define core properties and facets of properties: in order to achieve deep semantic parsing of rumor text contents, define object properties, data properties and property facets for each category in the ontology.

Instance layer construction

Create instances: analyze the real rumor dataset, extract instance data, and add them to the corresponding concepts in the ontology.

Encode and visualize ontology: use the OWL language to encode the ontology, and use Protégé to visualize it, so that the ontology can be understood and processed by computers.

Ontology verification: use SWRL rules and the Pellet reasoner to mine the implicit knowledge of the ontology, and verify its semantic parsing ability and consistency.

Ethical statements

This article does not contain any studies with human participants performed by any of the authors.

Determine the professional domain and scope of the ontology description

This paper determines the domain and scope of the online rumor domain ontology by answering the following four questions:

(1) What is the domain covered by the ontology?

The “Rumor Domain Ontology” constructed in this paper only considers content features, not user features and propagation features; the data covers six rumor types of politics and military, disease prevention and treatment, social life, science and technology, nutrition and health, and others involved in China’s mainstream internet rumor-refuting websites.

(2) What is the purpose of the ontology?

To perform fine-grained hierarchical modeling of the relationships among the features of multi-domain online rumor contents, realize semantic parsing and credibility reasoning verification of rumor texts, and guide fine-grained rumor detection and governance. It can also be used as a guiding framework and constraint condition for online rumor knowledge graph construction.

(3) What kind of questions should the information in the ontology provide answers for?

To provide answers for questions such as the fine-grained rumor types of rumor instances, the valid features of rumor types, etc.

(4) Who will use the ontology in the future?

Users of online rumor detection and governance, users of online rumor knowledge graphs construction.

Three-dimensional term set construction

Domain concepts reused by top-level ontology.

As a mature and authoritative common ontology, a top-level ontology can be shared and reused widely, providing reference and support for the construction of domain ontologies. The domain ontology of online rumors established in this paper focuses on content characteristics, mainly the content theme, events and emotions of rumor texts. Reusing the terminology concepts of existing top-level ontologies unifies and standardizes the terms in the term set. At the same time, the top-level concepts and their subclass structures can guide the framework construction of the domain ontology and reduce the difficulty and cost of ontology construction. After screening, the reusable top-level ontologies are SUMO, SenticNet and ERE.

SUMO ontology: a public upper-level knowledge ontology containing some general concepts and relations for describing knowledge in different domains. The partial reusable SUMO top-level concepts and subclasses selected in this paper are shown in Table 1 , which provides support for the sub-concept design of text topics in rumor domain ontology.

SenticNet: a knowledge base for concept-based sentiment analysis, which contains semantic, emotional, and polarity information related to natural language concepts. The partial reusable SenticNet top-level concepts and subclasses selected in this paper are shown in Table 2, which provides support for the sub-concept design of text topics in rumor domain ontology.

Entities, relations, and events (ERE): a knowledge base of events and entity relations. The partial reusable ERE top-level concepts and subclasses selected in this paper are shown in Table 3 , which provides support for the sub-concept design of text elements in the rumor domain ontology.

Extracting domain concepts based on core literature content features

Domain core literature is an important source for extracting feature concepts. This paper uses ‘rumor detection’ as the search term to retrieve 274 papers from the WOS core literature database and 257 papers from the CNKI core literature database. The content features of rumor texts involved in the literature samples are extracted, duplicate content features are eliminated, the core content features are screened, and synonymous concepts from different papers are given canonical names, which yields the domain concepts shown in Table 4. Among them, text theme, text element, text style, text feature and text rhetoric are classified as text features; emotional category, emotional appeal and rumor motive are classified as emotional characteristics; source credibility, evidence credibility and testimony method are classified as information credibility characteristics; social context is implicit.

Extracting domain concepts based on new concept discovery

This paper builds a general rumor dataset using China’s mainstream rumor-refuting websites as data sources, and proposes a domain new concept discovery algorithm. The algorithm discovers new domain words in the dataset, adds them to the word segmentation dictionary to improve the accuracy of word segmentation, and clusters them according to rumor type. The result is a concept subclass dictionary based on the real rumor dataset, which provides a realistic basis and data support for the conceptual design of each subclass in the domain ontology.

Building a general rumor dataset

The rumor dataset constructed in this paper contains 12,472 texts, with 6,236 rumors and 6,236 non-rumors. The data sources are China’s mainstream internet rumor-refuting websites: 1,032 texts from the internet rumor exposure platform of China internet joint rumor-refuting platform, 270 from today’s rumor-refuting of China internet joint rumor-refuting platform, 1,852 from Tencent news Jiaozhen platform, 1,744 from Baidu rumor-refuting platform, 7,036 from science rumor-refuting platform, and 538 from Weibo community management center. Eight researchers were invited to annotate the labels (rumor, non-rumor) and categories (politics and military, disease prevention and treatment, social life, science and technology, nutrition and health, others) of the rumor dataset. Because data annotation is manual and subjective, annotation standards were formulated before the researchers were invited to annotate, in order to ensure the effectiveness and consistency of annotation. The standards cover the screening method, trigger words and sentence break identification of rumor information and corresponding rumor information, and clearly explain and exemplify the screening method and trigger words of the rumor categories, so as to reduce differences in understanding among researchers. Based on these standards, the researchers were trained in labeling to familiarize them with the labeling specifications and to improve their labeling ability and efficiency. Multi-person cross-labeling is adopted: each piece of data is independently labeled by at least two researchers, and in case of conflicting labeling results, the result is jointly decided by the data annotators to increase the reliability and accuracy of labeling. After labeling, a multi-person cross-validation method is used to evaluate the labeling results. Each piece of data is independently verified by at least two researchers who did not participate in labeling, and conflicting labeling results are jointly decided by at least five researchers to ensure the consistency of the evaluation results. Examples of the results are shown in Table 5.
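The cross-labeling step described above can be illustrated with a short sketch. It is only a minimal illustration under assumptions: the function name `reconcile` and the item identifiers are hypothetical, and the joint decision on conflicts is left to the annotators rather than automated. The per-source counts are those reported in the text, and the snippet checks that they add up to the stated dataset size.

```python
# Per-source text counts as reported in the paragraph above.
SOURCE_COUNTS = {
    "internet rumor exposure platform": 1032,
    "today's rumor-refuting": 270,
    "Tencent news Jiaozhen platform": 1852,
    "Baidu rumor-refuting platform": 1744,
    "science rumor-refuting platform": 7036,
    "Weibo community management center": 538,
}


def reconcile(labels_by_item):
    """Split items into agreed labels and conflicts (hypothetical helper).

    labels_by_item maps an item id to the labels given independently by
    at least two annotators. Items with identical labels are accepted;
    the rest are returned for a joint decision by the annotators.
    """
    agreed, conflicts = {}, []
    for item_id, labels in labels_by_item.items():
        if len(labels) < 2:
            raise ValueError(f"{item_id} needs at least two independent labels")
        if len(set(labels)) == 1:
            agreed[item_id] = labels[0]
        else:
            conflicts.append(item_id)
    return agreed, conflicts


if __name__ == "__main__":
    print(sum(SOURCE_COUNTS.values()))  # total number of texts across sources
    print(reconcile({"t1": ["rumor", "rumor"], "t2": ["rumor", "non-rumor"]}))
```

Keeping the conflicts as an explicit list mirrors the protocol in the text, where disagreements are resolved by discussion instead of by majority vote.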

N-gram word granularity rumor text new word discovery algorithm

Existing neologism discovery algorithms are mostly based on the granularity of Chinese characters, and the time complexity of long word discovery is high and the accuracy rate is low. The algorithm’s usefulness is low, and the newly discovered words are mostly already found in general domain dictionaries. To solve these problems, this paper proposes an online rumor new word discovery algorithm based on N-gram word granularity, as shown in Fig.  1 .

Fig. 1. Flowchart of the domain new word discovery algorithm.

First, obtain the corpus to be processed \(c=\{s_{1},s_{2},\ldots,s_{n_{c}}\}\) and perform the first preprocessing on it, which includes sentence segmentation, Chinese word segmentation and punctuation removal. This yields the first corpus \(c^{p}=\{s_{1}^{p},s_{2}^{p},\ldots,s_{n_{c}}^{p}\}\), where \(s_{i}\) is the \(i\)-th sentence in the corpus to be processed, \(n_{c}\) is the number of sentences in the corpus to be processed, and \(s_{i}^{p}\) is the \(i\)-th sentence in the first corpus. Next, perform the N-gram operation on each sentence in the first corpus separately to obtain multiple candidate words, with \(n=2\sim 5\); count the word frequency of each candidate word in the first corpus and remove the candidate words whose word frequency is below the first threshold, which gives the first class of candidate word set; then calculate the cohesion of each candidate word in the first class of candidate word set according to the following formula:

In the formula, \(P(\cdot )\) represents word frequency. Then filter according to the second threshold corresponding to the N-gram operation to obtain the second class of candidate word set. After loading the new words in the second class of candidate word set into the LTP dictionary, perform the second preprocessing on the corpus to be processed \(c=\{s_{1},s_{2},\ldots,s_{n_{c}}\}\) to obtain the second corpus \(c^{p'}=\{s_{1}^{p'},s_{2}^{p'},\ldots,s_{n_{c}}^{p'}\}\), where the second preprocessing includes sentence segmentation, Chinese word segmentation and stop word removal. After obtaining the vector representation of each word in the second corpus, determine the vector representation of each new word in the second class of candidate word set; cluster the new words with the K-means algorithm according to these vector representations; and, according to the clustering results and preset classification rules, classify each new word into the corresponding domain. Examples of the new words discovered are shown in Table 6.
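The candidate generation and filtering steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes that sentences are already segmented into words, and it uses an assumed PMI-style cohesion score (the relative frequency of the n-gram divided by that of its most probable two-way split), since the paper's cohesion formula is not reproduced here. A single cohesion threshold stands in for the per-n thresholds, and the names `discover_candidates`, `min_freq` and `min_cohesion` are hypothetical. The LTP dictionary update and the K-means clustering step are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def discover_candidates(sentences, min_freq=2, min_cohesion=1.0, n_range=(2, 5)):
    """Word-granularity n-gram candidates filtered by frequency and cohesion.

    sentences: list of token lists (output of word segmentation).
    Returns {ngram tuple: (frequency, cohesion)}.
    """
    # Count all n-grams from 1..max_n so every sub-sequence frequency is known.
    counts = Counter()
    for tokens in sentences:
        for n in range(1, n_range[1] + 1):
            counts.update(ngrams(tokens, n))
    total_words = sum(c for gram, c in counts.items() if len(gram) == 1)

    def p(gram):
        # Relative word frequency, playing the role of P(.) in the text.
        return counts[gram] / total_words

    found = {}
    for gram, freq in counts.items():
        if not n_range[0] <= len(gram) <= n_range[1]:
            continue
        if freq < min_freq:  # first threshold: word frequency
            continue
        # ASSUMED cohesion: P(w) over the most probable left/right split.
        denom = max(p(gram[:i]) * p(gram[i:]) for i in range(1, len(gram)))
        cohesion = p(gram) / denom
        if cohesion >= min_cohesion:  # second threshold: cohesion
            found[gram] = (freq, cohesion)
    return found


if __name__ == "__main__":
    corpus = [
        ["nuclear", "wastewater", "discharge", "is", "safe"],
        ["nuclear", "wastewater", "harms", "fish"],
        ["fish", "is", "safe"],
    ]
    for gram, (freq, cohesion) in discover_candidates(corpus).items():
        print(" ".join(gram), freq, round(cohesion, 2))
```

Working on word tokens rather than single characters keeps the number of candidates small for long terms, which is the motivation the text gives for word granularity.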

RoBERTa-Kmeans rumor text concepts extraction algorithm

After adding the new words obtained by the new word discovery to the LTP dictionary, the accuracy of LTP word segmentation is improved. The five types of rumor texts established in this paper are segmented by using the new LTP dictionary, and the word vectors are obtained by inputting them into the RoBERTa word embedding layer after removing the stop words. The word vectors are clustered by k-means according to rumor type to obtain the concept subclass dictionary. The main process is as follows:

(1) Word embedding layer

The RoBERTa model uses the Transformer encoder for computation, and each module contains a multi-head attention mechanism, residual connection and layer normalization, and a feed-forward neural network. The word vectors are obtained by representing the accurately segmented rumor texts through one-hot encoding, and the position encoding represents the relative or absolute position of each word in the sequence. The word embedding vectors generated by superimposing the two are used as the input X. The multi-head attention mechanism uses multiple independent attention modules to perform parallel operations on the input information, as shown in formula (2):
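Written in the standard scaled dot-product form for a single attention module, which matches the symbols defined in the next sentence (a reconstruction under that assumption, not necessarily the authors' exact notation), formula (2) reads:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V \tag{2}
```

In the multi-head setting, this operation is applied in parallel by each module and the outputs are concatenated.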

where \(\{Q,K,V\}\) are the input matrices and \(d_{k}\) is the dimension of the input matrix. The resulting hidden vectors pass through a residual connection and layer normalization, and are then fed into a feed-forward neural network consisting of two fully connected layers, as shown in formula (3):
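A two-layer feed-forward form consistent with the weight and bias symbols defined in the next sentence is given below. It is a sketch under assumptions: the input symbol \(h\) for the hidden vectors, the activation \(\phi\) (GELU in BERT-style encoders) and the order in which the two layers are applied are not stated in the text.

```latex
\mathrm{FFN}(h) = \phi\!\left(h W_{e} + b_{e}\right) W_{0}^{\prime} + b_{0}^{\prime} \tag{3}
```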

where \(\{W_{e},W_{0}^{\prime}\}\) are the weight matrices of the two fully connected layers and \(\{b_{e},b_{0}^{\prime}\}\) are the bias terms of the two fully connected layers.

After calculation, a bidirectional association between word embedding vectors is established, which enables the model to learn the semantic features contained in each word embedding vector in different contexts. Through fine-tuning, the learned knowledge is transferred to the downstream clustering task.

(2) K-means clustering

Randomly select k initial points to obtain k classes, and iterate until the loss function of the clustering result is minimized. The loss function can be defined as the sum of squared errors of each sample point from its cluster center point, as shown in formula ( 4 ).
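Using the symbols defined in the next sentence, the sum-of-squared-errors loss described here can be written as follows (the loss symbol \(J\) is an assumption):

```latex
J = \sum_{i=1}^{N} \left\| x_{i} - u_{a_{i}} \right\|^{2} \tag{4}
```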

where \(x_{i}\) represents the \(i\)-th sample, \(a_{i}\) is the cluster that \(x_{i}\) belongs to, \(u_{a_{i}}\) represents the corresponding cluster center point, and \(N\) is the total number of samples.
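The clustering step can be illustrated with a small pure-Python sketch. It is not the authors' code: toy 2-D points stand in for the RoBERTa word vectors, and the function names `kmeans` and `sq_dist`, the `seed` parameter and the stopping rule (stop when assignments no longer change) are assumptions made for illustration. The returned loss is the sum of squared distances between each sample and its cluster center.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style K-means; returns (assignments, centers, loss)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly select k initial points
    assign = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center.
        new_assign = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assign == assign:  # converged: assignments are stable
            break
        assign = new_assign
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Loss: sum of squared errors of each sample from its cluster center.
    loss = sum(sq_dist(p, centers[a]) for p, a in zip(points, assign))
    return assign, centers, loss


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    assign, centers, loss = kmeans(pts, k=2)
    print(assign, centers, round(loss, 4))
```

In practice the result depends on the initial points, so several random restarts are commonly run and the partition with the lowest loss is kept.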

After the RoBERTa-Kmeans calculation, the concept subclasses obtained are manually screened: repeated items are merged and invalid items are deleted. This finally yields 79 rumor concept subclasses, including 14 politics and military subclasses, 23 disease prevention and treatment subclasses, 15 social life subclasses, 13 science and technology subclasses, and 14 nutrition and health subclasses. Some statistics are shown in Table 7.

Each concept subclass is obtained by clustering several topic words. For example, the topic words that constitute the subclasses of body part, epidemic prevention and control, chemical drugs, etc. under the disease prevention and treatment topic are shown in Table 8 .

(3) Determining the terminology set

This paper constructs a three-dimensional rumor domain ontology terminology set based on the above three methods, and unifies the naming of the terms. Some of the terms are shown in Table 9 .

Framework layer construction

Define core classes and hierarchy, define parent classes.

This paper aims at fine-grained hierarchical modeling of the relationships among the content characteristics of multi-domain online rumors. Therefore, the top-level parent classes need to include the rumor type and the main content characteristics on which the design of the rumor subclasses is based. The main content characteristics are the clustering results of the domain concepts extracted from the content features of the core literature, that is, rumor text feature, rumor emotional characteristic, rumor credibility and social context. The five top-level parent classes are as follows:

Rumor type: the specific classification of rumors under different subject categories. Rumor text feature: the common features of rumor texts in terms of theme, style, rhetoric, etc. Rumor emotional characteristic: the emotional elements of rumor texts, the motive of the publisher, and the emotional changes the publisher hopes to trigger in the receiver. Rumor credibility: the authority of the information source, the credibility of the evidence material provided by the publisher, and the effectiveness of the testimony method. Social context: the relevant issues and events in society when the rumor is published.
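The five parent classes and their direct subclasses can be sketched as a small data structure and serialized as OWL classes in Turtle syntax. This is an illustration only, under stated assumptions: the namespace `http://example.org/rumor#`, the helper names `iri` and `to_turtle`, and the CamelCase class identifiers are hypothetical. The entries under "Rumor type" simply mirror the five labeled dataset categories, and the remaining subclass names follow the feature groups named in this paper.

```python
import re

# Parent classes mapped to direct subclasses (illustrative selection).
RUMOR_ONTOLOGY = {
    "Rumor type": [
        "Politics and military rumor",
        "Disease prevention and treatment rumor",
        "Social life rumor",
        "Science and technology rumor",
        "Nutrition and health rumor",
    ],
    "Rumor text feature": [
        "Text theme", "Text element", "Text style", "Text feature", "Text rhetoric",
    ],
    "Rumor emotional characteristic": [
        "Emotion category", "Emotional appeal", "Rumor motive",
    ],
    "Rumor credibility": [
        "Source credibility", "Evidence credibility", "Testimony method",
    ],
    "Social context": ["Social issue", "Public attention", "Emergency"],
}


def iri(name):
    """Turn a class label into a CamelCase local name (hypothetical scheme)."""
    return "".join(word.capitalize() for word in re.findall(r"[A-Za-z]+", name))


def to_turtle(hierarchy, prefix="rumor", base="http://example.org/rumor#"):
    """Serialize the two-level hierarchy as OWL classes in Turtle syntax."""
    lines = [
        f"@prefix {prefix}: <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for parent, children in hierarchy.items():
        lines.append(f"{prefix}:{iri(parent)} a owl:Class .")
        for child in children:
            lines.append(
                f"{prefix}:{iri(child)} a owl:Class ; "
                f"rdfs:subClassOf {prefix}:{iri(parent)} ."
            )
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_turtle(RUMOR_ONTOLOGY))
```

A file produced this way could be opened in an ontology editor such as Protégé for inspection, although the full ontology also needs object properties, data properties and instances.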

Induce subclasses and design hierarchical relationships

Under the top-level parent classes, according to the top-level concepts and subclass structures of top-level ontologies such as SUMO, SenticNet and ERE, and the rumor text features of each category extracted from the real rumor text dataset, this paper summarizes 88 subclasses and designs their hierarchical relationships, as shown in Fig. 2, which include:

(1) Rumor text feature

Fig. 2. Diagram of the core classes and hierarchy of the rumor domain ontology.

① Text theme 6 , 8 , 13 , 18 , 53 : the theme or topic that the rumor text content involves. Based on the self-built rumor dataset, it is divided into politics and military 54 , involving information such as political figures, political policies, political relations, political activities, military actions, military events, strategic objectives, politics and military reviews, etc.; nutrition and health 55 , involving information such as the relationship between human health and nutrition, the nutritional components and value of food, the plan and advice for healthy eating, health problems and habits, etc.; disease prevention and treatment 10 , involving information such as the definition of disease, vaccine, treatment, prevention, data, etc.; social life 56 , involving information such as social issues, social environment, social values, cultural activities, social media, education system, etc.; science and technology 57 , involving information such as scientific research, scientific discovery, technological innovation, technological application, technological enterprise, etc.; other categories.

② Text element 15: the structured information of the rumor text contents. It is divided into character (political character, public character, etc.); geographical position (city, region, area, etc.); event (historical event, current event, crisis event, policy event, etc.); action (protection, prevention and control, exercise, fighting, crime, eating, breeding, health preservation, rest, education, sports, social, cultural, ideological, business, economic, transportation, etc.); material (food and products such as food, medicine, health products and cosmetics, the materials they contain and their relationship with human health); effect (nutrition, health, harm, natural disaster, man-made disaster, guarantee, prevention, treatment, etc.); institution (government, enterprise, school, hospital, army, police, social group, etc.); and nature (weather, astronomy, environment, agriculture, disease, etc.).

③ Text style 7, 10: the discourse style of the rumor text contents, preferring exaggerated and emotional expression. It is divided into gossip style, creating conflict or entertainment effect; curious style, satisfying people’s curiosity and desire for stimulation; critical style, using receivers’ stereotypes or preconceptions; lyrical style, creating resonance and influencing emotion; didactic style, influencing receivers’ thought and behavior from an authoritative perspective; plain style, concise and objective while arousing resonance; etc.

④ Text feature 7, 58: special language means in the rumor text contents that can increase the transmission and influence of the rumor. It is divided into extensive punctuation, reminding or attracting receivers’ attention; many mood words, enhancing emotional color and persuasiveness; many emoji, conveying attitude; induced forwarding, using the @ symbol etc. to induce receivers to forward; etc.

⑤ Text rhetoric 15: common rhetorical devices in rumor contents. It is divided into metaphor, hyperbole, repetition, personification, etc.

(2) Rumor emotional characteristic

① Emotion category 17, 59, 60: the emotional tendency and intensity expressed in the rumor texts. It is divided into positive emotion (happy, praise, etc.); negative emotion (fear 10, anger, sadness, anxiety 61, dissatisfaction, depression, etc.); and neutral emotion (no preference, plain, objective, etc.).

② Emotional appeal 16, 62, 63: the emotional changes that the online rumor disseminator hopes the rumor will trigger in the receiver. It is divided into “joy”: happy, pleasant, satisfied emotions that prompt receivers to spread or believe rumors that are conducive to social harmony; “love”: love, appreciation, admiration emotions that prompt receivers to spread or believe rumors that are conducive to the interests of some people or groups; “anger”: angry, annoyed, dissatisfied emotions that prompt receivers to spread or believe rumors that are anti-social or intensify conflicts; “fear”: fearful, afraid, nervous emotions that prompt receivers to spread or believe rumors whose bad effects are deliberately exaggerated; “repugnance”: disgusted, nauseous emotions that prompt receivers to spread or believe rumors that are detrimental to social harmony; “surprise”: surprised, shocked, amazed emotions that prompt receivers to spread or believe rumors that are exaggerated or fabricated to deliberately attract traffic; etc.

③ Rumor motive 17, 64, 65, 66: the purpose and need of the rumor publisher to publish rumors and of the receiver to forward rumors, such as profit-driven, seeking fame and fortune by deceiving receivers; emotional catharsis, relieving dissatisfaction by venting; creating panic, causing social unrest and riots and disrupting social order; entertainment, fooling receivers and seeking stimulation; information verification, digging out the truth of events; etc.

(3) Rumor credibility

① Source credibility 7, 17: the degree of trustworthiness of the information source, such as official institutions and authoritative experts and scholars in the field, with high credibility; well-known encyclopedias and large-scale civil organizations, with medium credibility; small-scale civil organizations, personal hearsay and personal experience, with low credibility; etc.

② Evidence credibility 61: the credibility of the proof material provided by the publisher. It includes data support, such as a scientific basis grounded in scientific theory or method; related features, backed by definite research or investigation results; temporal background, with clear time, place, character, event and other elements related to the information content; and common sense of life, consistent with widely recognized facts and scientific common sense.

③ Testimony method 10, 11, 17: the method used to support or refute a certain point of view, such as multimedia material, expressing or fabricating content details through pictures, videos and audio; authority endorsement, citing policy documents, research papers, etc. of authorized institutions or persons; and social identity, the identity of social relation groups.

(4) Social context

① Social issue 67 : undesirable phenomena or difficulties in society, such as poverty, pollution, corruption, crime, and declining government credibility 68 .

② Public attention 63 : events or topics that arouse widespread attention or discussion in society, such as sports events, technological innovation, food safety, religious beliefs, Myanmar fraud, and nuclear wastewater discharge.

③ Emergency (public sentiment) 69 : major or urgent events that occur suddenly in society, such as earthquakes, floods, public safety incidents, and outbreaks of malignant infectious diseases.

(5) Rumor type

① Political and military rumor:

Political image rumor: rumors related to images closely connected to politics and military, such as countries, political figures, institutions, symbols, etc. These include positive political image smear rumor, negative political image whitewash rumor, political image fabrication and distortion rumor, etc.

Political event rumor: rumors about military and political events, such as international relations, security cooperation, military strategy, judicial trial, etc. These include positive political event smear rumor, negative political event whitewash rumor, political event fabrication and distortion rumor, etc.

② Nutrition and health rumor:

Food product rumor: rumors related to food, products (food, medicine, health products, cosmetics, etc.), the materials they contain and their association with human health. These include positive effect of food product rumor, negative effect of food product rumor, food product knowledge rumor, etc.

Living habit rumor: rumors related to habitual actions in life and their association with human health. These include positive effect of living habit rumor, negative effect of living habit rumor, living habit knowledge rumor, etc.

③ Disease prevention and treatment rumor:

Disease management rumor: rumors related to disease management and control methods that maintain and promote individual and group health. These include positive prevention and treatment rumor, negative aggravating disease rumor, disease management knowledge rumor, etc.

Disease confirmed transmission rumor: rumors about the confirmation, transmission, and immunity of epidemic diseases at the social level in terms of causes, processes, results, etc. These include local confirmed cases rumor, celebrity confirmed cases rumor, transmission mechanism rumor, etc.

Disease notification and advice rumor: rumors that fabricate or distort the statements of authorized institutions or experts in the field, and provide false policies or suggestions related to diseases. These include institutional notification rumor, expert advice rumor, etc.

④ Social life rumor:

Public figure public opinion rumor: rumors related to public figures’ opinions, actions, private lives, etc. These include positive public figure smear rumor, negative public figure whitewash rumor, public figure life exposure rumor, etc.

Social life event rumor: rumors related to events, actions, and impacts on people's social life. These include positive event sharing rumor, negative event exposure rumor, neutral event knowledge rumor, etc.

Disaster occurrence rumor: rumors related to natural disasters or man-made disasters and their subsequent developments. These include natural disaster occurrence rumor, man-made disaster occurrence rumor, etc.

⑤ Science and technology rumor:

Scientific knowledge rumor: rumors related to natural science or social science theories and knowledge. These include scientific theory rumor, scientific concept rumor, etc.

Science and technology application rumor: rumors related to the research and development and practical application of science and technology and related products. These include scientific and technological product rumor, scientific and technological information rumor, etc.

⑥ Other rumor: rumors that do not contain elements from the above categories.

Definition of core properties and facets of properties

Properties in the ontology are used to describe the relationships between entities or the characteristics of entities. Object properties are relationships that connect two entities, describing the interactions between entities; data properties represent the characteristics of entities, usually in the form of some data type. Based on the self-built rumor dataset, this paper designs object properties, data properties and facets of properties for the parent classes and subclasses of the rumor domain ontology.
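The distinction between the two property kinds can be illustrated with a minimal sketch. It is not the paper's implementation: the `Individual` class and the instance names are hypothetical, and the value 0.8 is made up for illustration. Only the property names HasEmotion and Polarity value come from the ontology described here.

```python
from dataclasses import dataclass, field


@dataclass
class Individual:
    """A named instance in a toy ontology (hypothetical minimal model)."""
    name: str
    # Object properties: property name -> names of the related individuals
    object_props: dict = field(default_factory=dict)
    # Data properties: property name -> literal value (number, string, ...)
    data_props: dict = field(default_factory=dict)

    def relate(self, prop, other):
        """Assert an object property linking this individual to another one."""
        self.object_props.setdefault(prop, set()).add(other.name)

    def set_value(self, prop, value):
        """Assert a data property linking this individual to a literal."""
        self.data_props[prop] = value


# Hypothetical instances; the polarity value is invented for illustration.
rumor = Individual("rumor_001")
joy = Individual("Joy")
rumor.relate("HasEmotion", joy)         # object property: entity to entity
rumor.set_value("PolarityValue", 0.8)   # data property: entity to literal
```

Keeping the two kinds in separate maps mirrors the OWL split between object properties and datatype properties, where the range of the former is a class and the range of the latter is a data type.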

Object properties

A partial set of object properties is shown in Table 10 .

Data properties

A partial set of data properties is shown in Table 11 .

Creating instances

Based on the defined core classes and properties, this paper creates instances according to the real rumor dataset. An example is shown in Table 12 .

This paper selects the online rumor that “Lin Chi-ling was abused by her husband Kuroki Meisa, the tears of betrayal, the shadow of gambling, all shrouded her head. Even if she tried to divorce, she could not get a solution…..” as an example, and draws a structure diagram of the rumor domain ontology instance, as shown in Fig.  3 . This instance shows the seven major text features of the rumor text: text theme, text element, text style, emotion category, emotional appeal, rumor motivation, and rumor credibility, as well as the related subclass instances, laying a foundation for building a multi-source rumor domain knowledge graph.

Figure 3: Schematic example of the rumor domain ontology.

Encoding ontology and visualization

Encoding ontology

This paper uses the OWL language to encode the rumor domain ontology, so as to describe entities, concepts, and their relationships accurately and to support knowledge reasoning and semantic understanding. Classes in the rumor domain ontology are represented by the OWL construct “Class”, and hierarchical relationships are represented by “subClassOf”. For example, the OWL code that creates the rumor emotional feature class and its subclasses is shown in Fig.  4 :

Figure 4: Partial OWL codes of the rumor domain ontology.

The ontology is formalized in OWL as described above and stored as a code file, which supports subsequent reasoning.
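As a rough illustration of how a class hierarchy maps onto `owl:Class` and `rdfs:subClassOf` statements, the sketch below emits Turtle, a compact OWL serialization chosen here for readability; the figure in the paper may use a different syntax. The namespace `http://example.org/rumor#` and every class identifier other than RumorEmotionalFeatures are assumed spellings, not taken from the paper's code.

```python
# child class -> parent class (assumed identifiers, partial hierarchy)
SUBCLASS_OF = {
    "EmotionCategory": "RumorEmotionalFeatures",
    "EmotionalAppeal": "RumorEmotionalFeatures",
    "RumorMotive": "RumorEmotionalFeatures",
    "Joy": "EmotionalAppeal",
    "Fear": "EmotionalAppeal",
}


def to_turtle(subclass_of, prefix="rumor"):
    """Serialize the hierarchy as Turtle text with class and subclass axioms."""
    classes = sorted(set(subclass_of) | set(subclass_of.values()))
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix {prefix}: <http://example.org/rumor#> .",  # hypothetical namespace
        "",
    ]
    for cls in classes:
        lines.append(f"{prefix}:{cls} a owl:Class .")
    for child, parent in sorted(subclass_of.items()):
        lines.append(f"{prefix}:{child} rdfs:subClassOf {prefix}:{parent} .")
    return "\n".join(lines)


def ancestors(cls, subclass_of):
    """Follow subClassOf links upward; a reasoner derives the same chain."""
    chain = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        chain.append(cls)
    return chain


print(to_turtle(SUBCLASS_OF))
```

The `ancestors` helper shows the simplest inference that such an encoding enables: because `rdfs:subClassOf` is transitive, an instance of Joy is also an instance of EmotionalAppeal and of RumorEmotionalFeatures.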

Ontology visualization

This paper uses Protégé 5.5 to visualize the rumor domain ontology, showing the hierarchical structure and relationships of the ontology parent classes and their subclasses. Due to space limitations, only the parent class “RumorEmotionalFeatures” and its subclasses are shown, as in Fig.  5 .

Figure 5: Ontology parent class “RumorEmotionalFeatures” and its subclasses.

Ontology reasoning and validation

SWRL reasoning rule construction

SWRL is an ontology-based rule language for defining Horn-like rules that enhance the reasoning and expressive ability of an ontology. This paper uses SWRL rules to handle conflicting relationships between classes, and between classes and instances, in the rumor domain ontology, and uses the Pellet reasoner to mine the implicit semantic relationships between classes and instances, thereby verifying the semantic parsing ability and consistency of the ontology.

Based on the self-built rumor dataset, this paper summarizes the object property features of the various types of online rumors, maps real rumor texts to the rumor domain ontology, and constructs SWRL reasoning rules for judging 32 typical rumor types, as shown in Table 13 . The rules are imported into the Protégé rule library, as shown in Fig.  6 . In these rules, x, n, e, z, i, t, v, l, etc. are instances of rumor type, text theme, emotion category, effect, institution, event, action, geographical position, etc. in the ontology. HasTheme, HasEmotion, HasElement, HasSource, HasMood, and HasSupport are object properties, and Polarity value is a data property.

Figure 6: Partial SWRL rules for the rumor domain ontology.
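To make the shape of such a Horn-like rule concrete, the following sketch represents a rule as a list of body atoms and head atoms, renders it in the human-readable SWRL style (`body ^ ... -> head`), and checks the SWRL safety condition that every variable in the head also occurs in the body. The rule itself is a hypothetical illustration and is not copied from Table 13; class names such as `DiseasePreventionTheme` and `PositiveEmotion` are assumed spellings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """One rule atom, e.g. HasTheme(?x, ?n). Arguments starting with '?' are variables."""
    predicate: str
    args: tuple

    def render(self):
        return f"{self.predicate}({', '.join(self.args)})"


def variables(atoms):
    """Collect every variable name used in a list of atoms."""
    return {arg for atom in atoms for arg in atom.args if arg.startswith("?")}


def render_rule(body, head):
    """Render a rule in the human-readable SWRL form: a1 ^ a2 -> h."""
    lhs = " ^ ".join(atom.render() for atom in body)
    rhs = " ^ ".join(atom.render() for atom in head)
    return f"{lhs} -> {rhs}"


def is_safe(body, head):
    """SWRL safety: only variables that occur in the body may occur in the head."""
    return variables(head) <= variables(body)


# Hypothetical rule for one rumor type (not taken from the paper's rule table).
body = [
    Atom("Rumor", ("?x",)),
    Atom("HasTheme", ("?x", "?n")),
    Atom("DiseasePreventionTheme", ("?n",)),
    Atom("HasEmotion", ("?x", "?e")),
    Atom("PositiveEmotion", ("?e",)),
]
head = [Atom("PositivePreventionAndTreatmentRumor", ("?x",))]

print(render_rule(body, head))
print("safe:", is_safe(body, head))
```

Class atoms take one argument and property atoms take two, which is why the variables x, n, and e listed above appear both in class membership tests and as the ends of HasTheme and HasEmotion links.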

Implicit knowledge mining and verification based on pellet reasoner

This paper extracts the corresponding instances from the rumor dataset, imports the rumor domain ontology and the SWRL rule descriptions into the Pellet reasoner in Protégé, mines the implicit knowledge of the rumor domain ontology, judges the rumor type of each instance, and verifies the semantic parsing ability and consistency of the ontology.

Positive prevention and treatment rumors mainly take disease prevention and treatment as their theme. They usually mention products to be sold (including drugs, vaccines, equipment, etc.), effects, and disease names, and claim that the product has positive effects (such as prevention, cure, or relief) on certain diseases or symptoms, arousing positive emotions such as surprise and happiness among patients and their families in order to sell the product. The text features and emotional features of this kind of rumor are relatively clear, so this paper takes the rumor text “Hong Kong MDX Medical Group released the ‘DCV Cancer Vaccine’, which can prevent more than 12 kinds of cancers, including prostate cancer, breast cancer and lung cancer.” as an example to verify the semantic parsing ability of the rumor domain ontology. The analysis result for this instance is shown in Fig.  7 . The text theme is cancer prevention within disease prevention and treatment; the text style is plain narrative; and the text elements include product (DCV cancer vaccine), positive effect (prevention), and disease names (prostate cancer, breast cancer, lung cancer). The emotion category of this instance is positive emotion, and the emotional appeals are joy, love, and surprise. The motive for releasing the rumor is profit-driven (selling products), the information source is Hong Kong MDX Medical Group, and pictures and celebrity endorsements are used as the testimony method. The Pellet reasoner then reasons over the parsed instance using the SWRL rules and infers that the specific rumor type of this instance is positive prevention and treatment rumor. Similar instance analysis and reasoning verification were conducted for other types of rumor texts, and the results show that the ontology has high consistency and reliability.

Figure 7: Rumor instance parsing results and the implicit relationships mined by the Pellet reasoner.
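The inference step for the DCV vaccine instance can be imitated with a naive pattern matcher over triples. This is only a sketch of rule application, not a substitute for Pellet, which also performs full OWL-DL reasoning. The fact encoding, the instance identifier `rumor_dcv`, the class names, and the rule body are all assumptions made for illustration; only the parsed features (theme, product, positive effect, positive emotion) follow the description above.

```python
# Parsed features of the example rumor as (predicate, subject, object) triples.
FACTS = {
    ("HasTheme", "rumor_dcv", "CancerPrevention"),
    ("HasElement", "rumor_dcv", "DCVCancerVaccine"),
    ("HasElement", "rumor_dcv", "Prevention"),
    ("HasEmotion", "rumor_dcv", "PositiveEmotion"),
    ("type", "CancerPrevention", "DiseasePreventionAndTreatmentTheme"),
    ("type", "DCVCancerVaccine", "Product"),
    ("type", "Prevention", "PositiveEffect"),
}

# Hypothetical rule: theme + product + positive effect + positive emotion.
RULE_BODY = [
    ("HasTheme", "?x", "?n"),
    ("type", "?n", "DiseasePreventionAndTreatmentTheme"),
    ("HasElement", "?x", "?p"),
    ("type", "?p", "Product"),
    ("HasElement", "?x", "?z"),
    ("type", "?z", "PositiveEffect"),
    ("HasEmotion", "?x", "PositiveEmotion"),
]
RULE_HEAD = ("type", "?x", "PositivePreventionAndTreatmentRumor")


def match(pattern, fact, binding):
    """Extend the binding so that pattern equals fact, or return None."""
    new = dict(binding)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if new.setdefault(p, f) != f:   # variable already bound elsewhere
                return None
        elif p != f:                        # constant mismatch
            return None
    return new


def solve(body, facts, binding=None):
    """Yield every variable binding that satisfies all atoms in the body."""
    binding = binding or {}
    if not body:
        yield binding
        return
    first, rest = body[0], body[1:]
    for fact in facts:
        extended = match(first, fact, binding)
        if extended is not None:
            yield from solve(rest, facts, extended)


def infer(body, head, facts):
    """Apply one rule once and return the derived triples."""
    return {tuple(b.get(term, term) for term in head) for b in solve(body, facts)}


print(infer(RULE_BODY, RULE_HEAD, FACTS))
```

Removing any one required feature, for example the positive emotion triple, leaves the body unsatisfied and nothing is derived, which is the behavior expected of a conjunctive Horn rule.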

Comparison and evaluation of ontology performance

The constructed ontology is compared with representative rumor index systems in the field. Four experts were invited to make a comprehensive evaluation based on a self-built index system 70 , 71 , 72 , assessing each subject on the indicators of reliability, coverage, and operability. For each indicator, subjects receive 1–4 points according to the ranking order given by the experts, with first place receiving four points. The average value given by three experts is taken as the single-indicator score of each subject, and the sum of the indicator scores is taken as the final score of the subject.

As can be seen from Table 14 , the rumor domain ontology constructed in this paper builds its term set in three ways: reusing existing ontologies, extracting content features from core documents, and discovering new concepts from real rumor datasets. Its structure has been verified by SWRL rule reasoning with the Pellet reasoner, which gives it high reliability. The ontology covers six kinds of Chinese online rumors and includes the grammatical, semantic, pragmatic, and social characteristics captured by rumor text features, emotional features, rumor credibility, and social background, which gives it high coverage. The ontology is encoded according to the OWL language specification and visualized in Protégé, which makes it convenient for scholars to extend and reuse and gives it high operability.
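The scoring arithmetic can be sketched as follows, under the reading that ranks are converted to points, points are averaged over experts per indicator, and the indicator averages are summed per subject. The subject labels A to D and the two expert rankings are invented placeholders, not the paper's data, and the function works for any number of experts.

```python
INDICATORS = ("reliability", "coverage", "operability")


def rank_to_points(rank, n_subjects=4):
    """First place gets n_subjects points, last place gets 1 point."""
    return n_subjects + 1 - rank


def score_subjects(rankings, indicators=INDICATORS):
    """rankings: one dict per expert, mapping indicator -> subjects ordered best to worst.

    Returns subject -> final score (sum over indicators of the expert-average points).
    """
    totals = {}
    n_experts = len(rankings)
    for indicator in indicators:
        points = {}
        for expert in rankings:
            order = expert[indicator]
            for place, subject in enumerate(order, start=1):
                points[subject] = points.get(subject, 0) + rank_to_points(place, len(order))
        for subject, total_points in points.items():
            totals[subject] = totals.get(subject, 0.0) + total_points / n_experts
    return totals


# Hypothetical rankings from two experts over four subjects (illustration only).
expert_1 = {ind: ["A", "B", "C", "D"] for ind in INDICATORS}
expert_2 = {ind: ["B", "A", "C", "D"] for ind in INDICATORS}
print(score_subjects([expert_1, expert_2]))
```

With three indicators and four subjects, the final score of a subject lies between 3 (last on every indicator for every expert) and 12 (first on every indicator for every expert).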

The TFI domain ontology construction method proposed in this paper comprises a terminology layer, a framework layer, and an instance layer. Compared with traditional methods, the terminology layer is built with a three-dimensional dataset construction method. Top-level ontologies and related core documents are surveyed, so that reusable top-level ontologies are mapped from the top down and concepts of rumor content features are extracted from existing literature. An authoritative real rumor dataset is established from the mainstream internet rumor-dispelling websites in China, and an N-gram new word discovery algorithm combined with the RoBERTa-Kmeans clustering algorithm is used to discover new domain concepts automatically from the bottom up. Together, these steps determine the terminology set of the domain ontology more comprehensively and efficiently. In the framework layer, the parent classes of rumor content features, namely rumor text features, rumor emotional features, rumor credibility features, and social background features, are selected from the clustering results of domain concepts extracted from the content features of core documents; the rumor category features are defined from the emotional features and the entity categories of the real rumor dataset. The subclasses of rumor content features combine the concepts in the three-dimensional rumor term set with the concept distribution in the real rumor dataset, so that the subclass concepts and hierarchical relationships are close to real needs and the relationships among multi-domain online rumor content features are modeled in a fine-grained hierarchy.

In the instance layer, the OWL language is used to encode the rumor domain ontology, and the SWRL rule language and the Pellet reasoner are used to handle conflicts, mine tacit knowledge, and judge the fine-grained categories of rumor texts, which realizes an effective quality evaluation of the rumor ontology. As a result, the rumor domain ontology constructed in this paper has high consistency and reliability and can effectively analyze and reason about different types of rumor texts, which enriches the knowledge system in this field and provides a solid foundation for subsequent credible rumor detection and governance.

However, this study has the following limitations:

(1) The rumor domain ontology constructed in this paper considers only content characteristics and does not consider user characteristics or communication characteristics. Both are important factors in the emergence and spread of online rumors, and both can be used to analyze the motivation and influence of rumors. Because these factors are not included in the rumor feature system, the expressive and reasoning ability of the rumor ontology may be limited, and the ontology may not fully reflect the complexity and multidimensional nature of online rumors.

(2) This paper takes the mainstream Internet rumor-dispelling websites in China as the data source for ontology instantiation. The data covers five rumor categories (political and military, disease prevention, social life, science and technology, and nutrition and health), so its range is limited. These data sources are mainly official or authoritative rumor-dispelling websites, whose data volume and update frequency may not be enough to reflect the diversity and variability of online rumors and cannot fully guarantee the timeliness and comprehensiveness of the rumor data.

(3) The SWRL reasoning rules used in this paper are written manually, so they may not cover all reasoning scenarios, and the degree of automation needs to be improved. The Pellet reasoner used in this paper is an ontology reasoner based on OWL-DL, which may suffer from computational complexity problems and lacks more advanced reasoning ability.

The following aspects can be considered for optimization and improvement in the future:

(1) Future work will introduce user characteristics into the rumor ontology and analyze the factors that lead users to create and accept rumors, such as social attributes, psychological state, knowledge level, beliefs and attitudes, and behavioral intentions. It will also introduce communication characteristics and analyze the propagation dynamics of various types of rumors, such as propagation path, speed, range, period, and effect. Adding these factors to the rumor feature system is expected to increase the breadth and depth of the rumor domain ontology and to provide more credible clues and evidence for the detection, intervention, and prevention of rumors.

(2) Future work will expand the data sources by collecting original rumor data directly from social media, news media, authoritative rumor-dispelling institutions, and other channels, building a rumor dataset with comprehensive types, diverse expressions, and rich features. The latest rumor data will be crawled regularly from these sources so that the dataset is updated and improved in time, which strengthens the expressive ability of the instance layer of the rumor ontology and provides full data support and verification for the effective application of the ontology.

(3) Future work will introduce GPT, LLaMA, ChatGLM, and other language models, and explore algorithms and techniques for automatically generating ontology inference rules based on the rumor ontology and dynamic prompts, so as to realize more effective and intelligent rumor ontology evaluation and complex reasoning.

This paper proposes a method for constructing a TFI online rumor domain ontology. Based on the three-dimensional term set and the concept distribution of the real rumor dataset, the main features of online rumors are defined, including text features, emotional features, credibility features, social background features, and category features, and the relationships among these multi-domain features are modeled in a fine-grained hierarchy comprising five parent classes and 88 subclasses. At the instance level, reasoning rules for judging 32 typical rumor categories are constructed, and the SWRL rule language and the Pellet reasoner are used for conflict handling and tacit knowledge mining. This realizes semantic analysis and reasoning over rumor text content, demonstrates the effectiveness of the ontology in dealing with complex, fuzzy, and uncertain information in online rumors, and provides a new perspective and tool for the interpretable analysis and processing of online rumors.

Data availability

The datasets generated during the current study are available from the corresponding author upon reasonable request.

Jiang, S. The production scene and content characteristics of scientific rumors. Youth J. https://doi.org/10.15997/j.cnki.qnjz.2020.33.011 (2020).

Jin, X. & Zhao, Y. Analysis of internet rumors from the perspective of co-governance—Practice of rumor governance on wechat platform. News and Writing. 6 , 41–44 (2017).

Bai, S. Research on the causes and countermeasures of internet rumors. Press https://doi.org/10.15897/j.cnki.cn51-1046/g2.2010.04.035 (2010).

Garg, S. & Sharma, D. K. Linguistic features based framework for automatic fake news detection. Comput. Ind. Eng. 172 , 108432 (2022).

Zhao, J., Fu, C. & Kang, X. Content characteristics predict the putative authenticity of COVID-19 rumors. Front. Public Health 10 , 920103 (2022).

Zhang, Z., Shu, K. & He, L. The theme and characteristics of wechat rumors. News and Writing. 1 , 60–64 (2016).

Li, B. & Yu, G. Research on the discourse space and communication field of internet rumors in the post-truth era—Based on the analysis of 4160 rumors in wechat circle of friends. Journalism Research. 2 , 103–112 (2018).

Yu, G. Text structure and expression characteristics of internet rumors—Analysis of 6000+ rumors based on tencent big data screening and identification. News and Writing. 2 , 53–59 (2018).

Mourão, R. R. & Robertson, C. T. Fake news as discursive integration: An analysis of sites that publish false, misleading, hyperpartisan and sensational information. J. Stud. 20 , 2077–2095 (2019).

Zhou, G. Analysis on the content characteristics and strategies of epidemic rumors—Based on Sina’s “novel coronavirus epidemic rumors list”. Sci. Popul. https://doi.org/10.19293/j.cnki.1673-8357.2021.05.002 (2021).

Huang, Y. An analysis of the internal logic and methods of rumor “confirmation”—An empirical study based on 60 rumors spread on wechat. J. Party Sch. Tianjin Munic. Comm. CPC 20 , 7 (2018).

Butt, S. et al . What goes on inside rumour and non-rumour tweets and their reactions: A psycholinguistic analyses. Comput. Hum. Behav. 135 , 107345 (2022).

Zhou, L., Tao, J. & Zhang, D. Does fake news in different languages tell the same story? An analysis of multi-level thematic and emotional characteristics of news about COVID-19. Inf. Syst. Front. 25 , 493–512. https://doi.org/10.1007/s10796-022-10329-7 (2023).

Tan, L. et al . Research status of deep learning methods for rumor detection. Multimed. Tools Appl. 82 , 2941–2982 (2023).

Damstra, A. et al. What does fake look like? A review of the literature on intentional deception in the news and on social media. J. Stud. 22 , 1947–1963. https://doi.org/10.1080/1461670X.2021.1979423 (2021).

Lai, S. & Tang, X. Research on the influence of information emotionality on the spread of online rumors. J. Inf. 35 , 116–121 (2016).

Yuan, H. & Xie, Y. Research on the rumor maker of internet rumors about public events—Based on the content analysis of 118 influential Internet rumors about public events. Journalist https://doi.org/10.16057/j.cnki.31-1171/g2.2015.05.008 (2015).

Ruan, Z. & Yin, L. Types and discourse focus of weibo rumors—Based on the content analysis of 307 weibo rumors. Contemporary Communication. 4 , 77–78+84 (2014).

Zhang, W. & Zhu, Q. Research on the Construction Method of Domain Ontology. Books and Information. 5 , 16–19+40 (2011).

Tham, K.D., Fox, M.S. & Gruninger, M. A cost ontology for enterprise modelling. In Proceedings of 3rd IEEE Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises. IEEE , 197–210. https://doi.org/10.1109/ENABL.1994.330502 (1994).

Uschold, M. & Gruninger, M. Ontologies: Principles, methods and applications. Knowl. Eng. Rev. 11 , 93–136 (1996).

Menzel, C. P., Mayer, R. J. & Painter, M. K. IDEF5 ontology description capture method: Concepts and formal foundations (Armstrong Laboratory, Air Force Materiel Command, Wright-Patterson Air Force, 1992).

Song, Z., Zhu, F. & Zhang, D. Research on air and missile defense domain ontology development based on IDEF5 and OWL. Journal of Projectiles, Rockets, Missiles and Guidance. 30 , 176–178 (2010).

Fernández-López, M., Gómez-Pérez, A. & Juristo, N. Methontology: From ontological art towards ontological engineering. AAAI-97 Spring Symposium Series . https://oa.upm.es/5484/ (1997).

Sawsaa, A. & Lu, J. Building information science ontology (OIS) with methontology and protégé. J. Internet Technol. Secur. Trans. 1 , 100–109 (2012).

Yue, L. & Liu, W. Comparative study on the construction methods of domain ontology at home and abroad. Inf. Stud. Theory Appl. 39 , 119–125. https://doi.org/10.16353/j.cnki.1000-7490.2016.08.024 (2016).

Noy, N.F. & McGuinness, D.L. Ontology development 101: A guide to creating your first ontology. Stanford knowledge systems laboratory technical report. KSL-01–05 (2001).

Luo, Y. et al . vim: Research on OWL-based vocabulary ontology construction method for units of measurement. Electronics 12 , 3783 (2023).

Al-Aswadi, F. N., Chan, H. Y. & Gan, K. H. Automatic ontology construction from text: A review from shallow to deep learning trend. Artif. Intell. Rev. 53 , 3901–3928 (2020).

Chen, X. & Mao, T. Ontology construction of documentary heritage—Taking China archives documentary heritage list as an example. Libr. Trib. 43 , 120–131 (2023).

Zhao, X. & Li, T. Research on the ontology construction of archives oriented to digital humanism—Taking Wanli tea ceremony archives as an example. Inf. Stud. Theory Appl. 45 , 154–161. https://doi.org/10.16353/j.cnki.1000-7490.2022.08.021 (2022).

Huang, X. et al . Construction of special knowledge base of government website pages based on domain ontology—Taking “COVID-19 vaccine science popularization” as an example. Libr. Inf. Serv. 66 , 35–46. https://doi.org/10.13266/j.issn.0252-3116.2022.17.004 (2022).

Jindal, R., Seeja, K. & Jain, S. Construction of domain ontology utilizing formal concept analysis and social media analytics. Int. J. Cogn. Comput. Eng. 1 , 62–69 (2020).

Ran, J. et al . Research on ontology construction of idioms and allusions based on OWL. Comput. Technol. Dev. 20 , 63–66 (2010).

Li, L. et al . Research on business process modeling of army equipment maintenance support based on IDEF5. Technol. Innov. Appl. 11 , 80–82 (2021).

Song, Z. et al . Ontology modeling of air defense and anti-missile operation process based on IDEF5/OWL. J. Missiles Guid. 30 , 176–178 (2010).

Li, A., Xu, Y. & Chi, Y. Summary of ontology construction and application. Inf. Stud. Theory Appl 46 , 189–195. https://doi.org/10.16353/j.cnki.1000-7490.2023.11.024 (2023).

Yang, J., Song, C. & Jin, L. Ontology construction of emergency plan based on methontology method. J. Saf. Environ. 18 , 1427–1431. https://doi.org/10.13637/j.issn.1009-6094.2018.04.033 (2018).

Duan, L. & Li, H. Ontology modeling method of high-resolution image rural residential area supported by OIA technology. Modern Agricultural Science and Technology. 2 , 338–340 (2016).

Chen, Y. & Jiang, H. Construction of fire inspection knowledge map based on GIS geospatial relationship. J. Subtrop. Resour. Environ. 18 , 109–118. https://doi.org/10.19687/j.cnki.1673-7105.2023.03.014 (2023).

Zhu, L. et al. Construction of TCM asthma domain ontology. Chin. J. Exp. Tradit. Med. Formulae 23 , 222–226. https://doi.org/10.13422/j.cnki.syfjx.2017150222 (2017).

Li, H. et al . Domain ontology construction and relational reasoning. J. Inf. Eng. Univ. 24 , 321–327 (2023).

Zhang, Y. et al. Construction of ontology of stroke nursing field based on corpus. Chin. Nurs. Res. 36 , 4186–4190 (2022).

Wu, M. et al. Ontology construction of natural gas market knowledge map. Pet. New Energy 34 , 71–76 (2022).

Li, X. et al . Research on ontology construction based on thesaurus and its semantic relationship. Inf. Sci. 36 , 83–87 (2018).

Chen, Q. et al . Construction of knowledge ontology of clinical trial literature of traditional Chinese medicine. Chin. J. Exp. Tradit. Med. Formulae 29 , 190–197. https://doi.org/10.13422/j.cnki.syfjx.20231115 (2023).

Xiao, Y. et al. Construction and application of novel coronavirus domain ontology. Mil. Med. 46 , 263–268 (2022).

Su, N. et al . Automatic construction method of domain-limited ontology. Lifting the Transport Machinery. 8 , 49–57 (2023).

Zheng, S. et al . Ontology construction method for user-generated content. Inf. Sci. 37 , 43–47. https://doi.org/10.13833/j.issn.1007-7634.2019.11.007 (2019).

Dong, J., Wang, J. & Wang, Z. Ontology automatic construction method for human-machine-object ternary data fusion in manufacturing field. Control Decis. 37 , 1251–1257. https://doi.org/10.13195/j.kzyjc.2020.1298 (2022).

Zhu, L., Hua, G. & Gao, W. Mapping ontology vertices to a line using hypergraph framework. Int. J. Cogn. Comput. Eng. 1 , 1–8 (2020).

Zhai, Y. & Wang, F. Research on the construction method of Chinese domain ontology based on text mining. Inf. Sci. 33 , 3–10. https://doi.org/10.13833/j.cnki.is.2015.06.001 (2015).

Duan, Z. Generation mechanism of internet rumors and countermeasures. Guizhou Soc. Sci. https://doi.org/10.13713/j.cnki.cssci.2016.04.014 (2016).

Du, Z. & Zhi, S. The harm and governance of network political rumors. Academic Journal of Zhongzhou. 4 , 161–165 (2019).

Song, X. et al . Research on influencing factors of health rumor sharing willingness based on MOA theory. J. China Soc. Sci. Tech. Inf. 39 , 511–520 (2020).

Jiang, S. Research on the characteristics, causes and countermeasures of social rumors dissemination in china in recent years. Red Flag Manuscript . 16 , 4 (2011).

Huang, J., Wang, G. & Zhong, S. Research on the propagation law and function mode of sci-tech rumors. Journal of Information. 34 , 156–160 (2015).

Liu, Y. et al . A survey of rumor recognition in social media. Chin. J. Comput. 41 , 1536–1558 (2018).

Wei, D. et al. Public emotions and rumors spread during the covid-19 epidemic in China: Web-based correlation study. J. Med. Internet Res. 22 , e21933 (2020).

Runxi, Z. & Di, Z. A model and simulation of the emotional contagion of netizens in the process of rumor refutation. Sci. Rep. https://doi.org/10.1038/s41598-019-50770-4 (2019).

Tang, X. & Lai, S. Research on the forwarding of network health rumors in public health security incidents—Interaction between perceived risk and information credibility. J. Inf. 40 , 101–107 (2021).

Nicolas, P., Dominik, B. & Stefan, F. Emotions in online rumor diffusion. EPJ Data Sci. https://doi.org/10.1140/epjds/s13688-021-00307-5 (2021).

Deng, G. & Tang, G. Research on the spread of network rumors and its social impact. Seeker https://doi.org/10.16059/j.cnki.cn43-1008/c.2005.10.031 (2005).

Ji, Y. Research on the communication motivation of wechat rumors. Youth J. https://doi.org/10.15997/j.cnki.qnjz.2019.17.006 (2019).

Yuan, G. Analysis on the causes and motives of internet rumors in emergencies—Taking social media as an example. Media. 21 , 80–83 (2016).

Zhao, N., Li, Y. & Zhang, J. A review of the research on influencing factors and motivation mechanism of rumor spread. J. Psychol. Sci. 36 , 965–970. https://doi.org/10.16719/j.cnki.1671-6981.2013.04.015 (2013).

Hu, H. On the formation mechanism of social rumors from the perspective of “rumors and salt storm”. J. Henan Univ. 52 , 63–68 (2012).

Yue, Y. et al. Trust in government buffers the negative effect of rumor exposure on people’s emotions. Curr. Psychol. 42 , 23917–23930 (2023).

Wang, C. & Hou, X. Analysis of rumor discourse in major emergencies. J. Commun. 19 , 34–38 (2012).

Xu, L. Research progress of ontology evaluation. J. China Soc. Scie. Tech. Inf. 35 , 772–784 (2016).

Lantow, B. & Sandkuhl, K. An analysis of applicability using quality metrics for ontologies on ontology design patterns. Intell. Syst. Acc. Financ. Manag. 22 , 81–99 (2015).

Pak, J. & Zhou, L. A framework for ontology evaluation. In Exploring the Grand Challenges for Next Generation E-Business: 8th Workshop on E-Business, WEB 2009, Phoenix, AZ, USA, December 15, 2009, Revised Selected Papers 8 , 10–18. https://doi.org/10.1007/978-3-642-17449-0_2 (Springer Berlin Heidelberg, 2011).

Acknowledgements

This study was financially supported by Xi'an Major Scientific and Technological Achievements Transformation and Industrialization Project (20KYPT0003-10).

This work was supported by Xi’an Municipal Bureau of Science and Technology, 20KYPT0003-10.

Author information

Authors and affiliations.

School of Economics and Management, Xidian University, 266 Xifeng Road, Xi’an, 710071, China

Jianbo Zhao, Huailiang Liu, Weili Zhang, Tong Sun, Qiuyi Chen, Yan Zhuang, Xiaojin Zhang & Shanzhuang Zhang

School of Artificial Intelligence, Xidian University, 266 Xifeng Road, Xi’an, 710071, China

Yuehai Wang, Jiale Cheng & Ruiyu Ding

School of Telecommunications Engineering, Xidian University, 266 Xifeng Road, Xi’an, 710071, China

Contributions

H.L. formulated the overall research strategy and guided the work. J.Z. kept the original data on which the paper was based and verified whether the charts and conclusions accurately reflected the collected data. J.Z., W.Z. and T.S. wrote the main manuscript text. W.Z., Y.W. and Q.C. collected and sorted the data. J.C., Y.Z. and X.Z. prepared Figs. 1–7. S.Z., B.L. and R.D. prepared Tables 1–14. All authors reviewed the manuscript.

Corresponding author

Correspondence to Jianbo Zhao.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Zhao, J., Liu, H., Zhang, W. et al. Research on domain ontology construction based on the content features of online rumors. Sci Rep 14, 12134 (2024). https://doi.org/10.1038/s41598-024-62459-4

Received: 07 December 2023

Accepted: 16 May 2024

Published: 27 May 2024

DOI: https://doi.org/10.1038/s41598-024-62459-4

  • Rumor content features
  • Domain ontology
  • Top-level ontology reuse
  • New concept discovery

data mining research paper conclusion

COMMENTS

  1. Data Mining for the Internet of Things: Literature Review and

    Motivated by this, in this paper, we attempt to make a comprehensive survey of the important recent developments of data mining research. This survey focuses on knowledge view, utilized techniques view, and application view of data mining. ... Conclusions. The Internet of Things concept arises from the need to manage, automate, and explore all ...

  2. (PDF) Data mining techniques and applications

    Data Mining Algorithms and Techniques. Various algorithms and techniques like Classification, Clustering, Regression, Artificial Intelligence, Neural Networks, Association Rules, Decision Trees ...

  3. PDF Overview of Data Mining's Potential Benefits and Limitations in ...

    contrasted data mining to traditional statistics (Grover & Mehra, 2008; Zhao & Luan, 2006), which turned out to be an important theoretical framework through which to understand the purported benefits and drawbacks of data mining. Potential benefits of using data mining in education research Most scholars were optimistic about the benefits

  4. Data Mining Methods and Obstacles: A Comprehensive Analysis

    Data mining is a step in the knowledge discovery in databases (KDD) procedure, a data science method for gathering, processing, and conducting more in-depth analyses of data. While they ...

  5. (PDF) DATA MINING IN CLOUD COMPUTING: A REVIEW

    This paper introduces the basic concept of cloud computing and data mining firstly, and sketches out how data mining is used in cloud computing; Then summarizes the research of parallel ...

  6. A comprehensive survey of data mining

    Data mining plays an important role in various human activities because it extracts unknown, useful patterns (or knowledge). Due to its capabilities, data mining has become an essential task in a large number of application domains such as banking, retail, medical, insurance, bioinformatics, etc. To take a holistic view of the research trends in the area of data mining, a comprehensive survey is ...

  7. Data mining techniques and applications

    Data mining is also known as Knowledge Discovery in Database (KDD). It is also defined as the process which includes extracting the interesting, interpretable and useful information from the raw data. There are different sources that generate raw data in very large amount. This is the main reason the applications of data mining are increasing rapidly. This paper reviews data mining techniques ...

  8. Review Paper on Data Mining Techniques and Applications

    Abstract. Data mining is the process of extracting hidden and useful patterns and information from data. Data mining is a new technology that helps businesses to predict future trends and behaviors, allowing them to make proactive, knowledge driven decisions. The aim of this paper is to show the process of data mining and how it can help ...

  9. Conclusion and Future Direction in Data Mining and Machine Learning

    Due to COVID-19, significantly fewer flights are running, and hence scientists cannot forecast the weather accurately. Data capture has also dropped because of the smaller number of flights. Data mining techniques play a vital role in collecting data for prediction and forecasting using different machine learning techniques.

  10. Statistical Analysis and Data Mining: The ASA Data Science Journal

    Statistical Analysis and Data Mining addresses the broad area of data analysis, including data mining algorithms, statistical approaches, and practical applications. Topics include problems involving massive and complex datasets, solutions utilizing innovative data mining algorithms and/or novel statistical approaches. ... RESEARCH ARTICLE ...

  11. PDF Data Mining Research: Opportunities and Challenges

    mining is human centered and is sometimes coupled with human-computer interfaces research. Data mining is a step in the data mining process, which is an interactive, semi-automated process which begins with raw data. Results of the data mining process may be insights, rules, or predictive models. The field of data mining draws upon several ...

  12. data mining Latest Research Papers

    The accurate average value is 74.05% of the existing COID algorithm, and our proposed algorithm has 77.21%. The average recall value is 81.19% and 89.51% of the existing and proposed algorithm, which shows that the proposed work efficiency is better than the existing COID algorithm.

  13. (PDF) Trends in data mining research: A two-decade review using topic

    Address: 20, Myasnitskaya Street, Moscow 101000, Russia. Abstract. This work analyzes the intellectual structure of data mining as a scientific discipline. To do this, we use topic analysis ...

  14. Summary of Data Mining Research on Technical and Economic ...

    This paper starts from the data mining work related to the economic data review of the entire power grid project, and summarizes the above four layers, providing strong support for further optimizing and improving the level of intelligent review technology, and enhancing the overall process review capability of system design, feasibility study ...

  15. Writing a Research Paper Conclusion

    Table of contents. Step 1: Restate the problem. Step 2: Sum up the paper. Step 3: Discuss the implications. Research paper conclusion examples. Frequently asked questions about research paper conclusions.

  16. PDF Crime Analysis and Prediction Using Datamining: a Review

    This paper explains various types of criminal analysis and crime prediction using several data mining techniques. KEYWORDS Crime prediction, Decision trees, Linear Regression, k-means. 1. INTRODUCTION Day by day, the crime rate is increasing because modern technologies and hi-tech methods help the

  17. 50 selected papers in Data Mining and Machine Learning

    Active Sampling for Feature Selection, S. Veeramachaneni and P. Avesani, Third IEEE Conference on Data Mining, 2003. Heterogeneous Uncertainty Sampling for Supervised Learning, D. Lewis and J. Catlett, In Proceedings of the 11th International Conference on Machine Learning, 148-156, 1994. Learning When Training Data are Costly: The Effect of ...

  18. Data Mining: Concepts and Methods Research Paper

    Various methods of data mining include predictive analysis, web mining, and clustering and association discovery (Han, Kamber and Pei, 2011). Each of these has a number of benefits to a business. In predictive analysis, analytical models are used to deliver solutions.

  19. What Is Data Mining? How It Works, Benefits, Techniques, and Examples

    Data mining is a process used by companies to turn raw data into useful information. By using software to look for patterns in large batches of data, businesses can learn more about their ...

  20. 345193 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATA MINING. Find methods information, sources, references or conduct a literature review on DATA MINING

  21. CONCLUSIONS AND DISCUSSION

    CONCLUSIONS AND DISCUSSION. In the document Data Mining (pages 47-54). Data mining is a good area of scientific study, holding ample promise for the research community. Recently a lot of progress has been reported for large databases, specifically involving association rules, classification, clustering, similar time sequences, similar text ...

  22. Data Mining, Research Paper Example

    The data mining process involves automatic or semi-automatic analysis of large quantities of data. Patterns are usually extracted from data records through analyses such as association rule mining, anomaly detection and cluster analysis. All of these analyses use spatial indexes as an appropriate database technique.

  23. (PDF) A Review of Data Mining Literature

    REVIEW OF LITERATURE. Fayyad et al. (1996) [3], in their paper "From data mining to knowledge discovery in databases", described KDD as "a nontrivial process of recognizing valid, novel ...

  24. Research on domain ontology construction based on the content ...

    This paper proposes a TFI online rumor domain ontology construction method based on the improvement of the seven-step method, which includes the term layer, the frame layer and the instance layer ...