text mining Recently Published Documents
Automated Text Classification of Maintenance Data of Higher Education Buildings Using Text Mining and Machine Learning Techniques
Evaluation of the Synergy Degree of Industrial De-capacity Policies Based on Text Mining: A Case Study of China's Coal Industry
Application of Informetrics on Financial Network Text Mining Based on Affective Computing
Recycling Behaviour: Mapping Knowledge Domain Through Bibliometrics and Text Mining
Application of Text Mining to Cluster Tweet Data from the Blibli Account on Twitter Using K-Means Clustering
Social media are computer-based technologies that facilitate the sharing of ideas, thoughts, and information through the building of virtual networks and communities. Twitter is one of the most popular social media platforms in Indonesia, with 78 million users, and businesses rely heavily on it for advertising. By knowing which types of tweet content are most often retweeted by their followers, businesses can target those types of content at Twitter users. In this study, text mining was applied to cluster tweets from the @bliblidotcom Twitter account using the K-means method, with the best number of clusters selected by the silhouette coefficient, in order to determine the types of tweet content most often retweeted by @bliblidotcom followers. The tweets with the most retweets and favourites were discount offers and flash sales, and prize-quiz tweets were well liked by the account's followers, so Blibli Indonesia could use these types of tweets when advertising on Twitter.
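The abstract's approach of choosing the number of K-means clusters by the silhouette coefficient can be sketched as follows; the data here are synthetic stand-ins for vectorised tweets, not the @bliblidotcom dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy stand-in for vectorised tweet data: three well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(10, 10), scale=0.3, size=(30, 2)),
    rng.normal(loc=(20, 0), scale=0.3, size=(30, 2)),
])

# Fit K-means for each candidate k and record the mean silhouette.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
```

On real tweet data, X would come from a vectoriser (e.g. TF-IDF over preprocessed tweet text) rather than synthetic points.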
The Epilepsy Ontology: a community-based ontology tailored for semantic interoperability and text-mining
Motivation: Epilepsy is a multi-faceted, complex disorder that requires a precise understanding of its classification, diagnosis, treatment, and underlying disease mechanisms. Although scattered resources on epilepsy are available, comprehensive and structured knowledge is missing. To promote multidisciplinary knowledge exchange and facilitate advances in clinical management, especially in pre-clinical research, a disease-specific ontology is necessary. The presented ontology is designed to enable better interconnection between members of the scientific community in the epilepsy domain. Results: The Epilepsy Ontology (EPIO) is an assembly of structured knowledge on various aspects of epilepsy, developed according to Basic Formal Ontology (BFO) and Open Biological and Biomedical Ontology (OBO) Foundry principles. Concepts and definitions are collected from the latest International League Against Epilepsy (ILAE) classification, domain-specific ontologies, and the scientific literature. The ontology consists of 1,879 classes and 28,151 axioms (2,171 declaration axioms, 2,219 logical axioms) covering several aspects of epilepsy, and is intended to be used for data management and text-mining purposes.
Analysis of Trends in Public Reports to "LaporGub..!" of Central Java Province Using Text Mining with Fuzzy C-Means Clustering
Effective communication between the government and society is essential to achieve good governance. The government provides a means for public complaints through an online aspiration and complaint service called "LaporGub..!". To make grouping incoming reports easier, report topics are identified using clustering, with text mining used to convert the text data into numeric data for further processing. Clustering methods are classified as soft (fuzzy) or hard clustering. Hard clustering divides data into clusters strictly, without overlapping membership; soft clustering can place data in several clusters, each with a certain degree of membership. These graded memberships make fuzzy clustering more natural than hard clustering, because objects on the boundary between several classes are not forced fully into one class but are instead assigned a degree of membership in each. Fuzzy c-means also has the advantage of placing cluster centres more precisely than many other clustering methods, by refining the centres iteratively. The best number of clusters was chosen by the maximum silhouette coefficient, and word clouds, a form of text-data visualisation, were used to determine the dominant topic in each cluster. The results show that the maximum silhouette coefficient for fuzzy c-means clustering is achieved with three clusters: the first cluster concerns road conditions (449 reports), the second concerns COVID assistance (964 reports), and the third concerns fertiliser for farmers (176 reports). Reports about COVID assistance form the cluster with the most members.
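The fuzzy c-means procedure the abstract describes, in which each object receives a degree of membership in every cluster and centres are refined iteratively, can be sketched as a minimal NumPy implementation on toy data (not the LaporGub..! reports):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=50, seed=0):
    """Minimal fuzzy c-means: returns cluster centres and membership matrix."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        # Centres are membership-weighted means of the data points.
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances from every point to every centre (small eps avoids /0).
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-10
        # Standard FCM membership update, then renormalise across clusters.
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return centres, U

# Two obvious groups of 1-D "report" vectors.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
centres, U = fuzzy_c_means(X, c=2)
hard = U.argmax(axis=1)                         # defuzzified assignment
```

Real report text would first be converted to numeric vectors (e.g. TF-IDF), and the number of clusters c swept while tracking the silhouette coefficient.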
Text visualization for geological hazard documents via text mining and natural language processing
Analysis of sebaceous gland carcinoma associated genes using network analysis to identify potentially actionable genes.
Eyelid sebaceous gland carcinoma (SGC) is a rare but life-threatening condition. However, there is limited computational research on the underlying protein interactions specific to eyelid sebaceous gland carcinoma. The aim of our study is to identify and analyse the genes associated with eyelid sebaceous gland carcinoma using text mining and to develop a protein-protein interaction network to predict significant biological pathways using bioinformatics tools. Genes associated with eyelid sebaceous gland carcinoma were retrieved from the PubMed database using text mining with the key terms 'eyelid' and 'sebaceous gland carcinoma', excluding the genes for 'Muir-Torre Syndrome'. The interaction partners were identified using STRING. Cytoscape was used for visualisation and analysis of the PPI network. Molecular complexes in the network were predicted using the MCODE plug-in and analysed for gene ontology terms using DAVID. The PubMed retrieval process identified 79 genes related to eyelid sebaceous gland carcinoma. The PPI network associated with eyelid sebaceous gland carcinoma comprised 79 nodes and 1,768 edges. Network analysis using Cytoscape identified nine key genes and two molecular complexes enriched in the protein-protein interaction network. GO enrichment analysis identified the biological processes cell fate commitment, Wnt signalling pathway, retinoic acid signalling, and response to cytokines as enriched in our network. The genes identified in the study might play a pivotal role in understanding the underlying molecular pathways involved in the development and progression of eyelid sebaceous gland carcinoma. Furthermore, they may aid in the identification of candidate biomarkers and therapeutic targets for its treatment.
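The network-analysis step, ranking nodes of a PPI network by connectivity to nominate key genes, can be illustrated with a toy edge list. The gene symbols below are hypothetical placeholders, not the study's actual 79-gene network:

```python
from collections import Counter

# Hypothetical protein-protein interaction edges (illustrative only).
edges = [
    ("TP53", "CTNNB1"), ("TP53", "MYC"), ("TP53", "EGFR"),
    ("CTNNB1", "MYC"), ("CTNNB1", "WNT1"), ("EGFR", "MYC"),
    ("WNT1", "FZD1"),
]

# Degree centrality: count how many interactions each gene participates in.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# The highest-degree nodes are candidate "key genes" (hubs).
key_genes = [g for g, _ in degree.most_common(3)]
```

Tools like Cytoscape apply the same idea (plus richer centrality and module-detection measures such as MCODE) at the scale of the full STRING-derived network.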
Determining banking service attributes from online reviews: text mining and sentiment analysis
Purpose: The study employs text mining and sentiment analysis to identify core banking service attributes and customer sentiment in online user-generated reviews, and explains customer satisfaction based on the identified predictors. Design/methodology/approach: A total of 32,217 customer reviews posted from 2014 to 2021 were collected across 29 top banks on bankbazaar.com. Three conceptual models were developed and evaluated using regression analysis. Findings: All variables were found to be statistically significant and to affect customer satisfaction in their respective models, except the interest rate. Research limitations/implications: The study is confined to the geographical representation of its subjects, i.e. Indian customers. A cross-cultural and socioeconomic analysis of banking customers in different countries may help to better generalise the findings. Practical implications: The study makes essential theoretical and managerial contributions to the existing literature on services, particularly the banking sector. Originality/value: This paper is unique in that it focuses on banking customer satisfaction derived from online reviews and ratings using text mining and sentiment analysis.
- Open Access
- Published: 02 November 2020
Comprehensive review of text-mining applications in finance
- Aaryan Gupta
- Vinya Dengre
- Hamza Abubakar Kheruwala
- Manan Shah
Financial Innovation, volume 6, Article number: 39 (2020)
Text-mining technologies have substantially affected financial industries. As the data in every sector of finance have grown immensely, text mining has emerged as an important field of research in the domain of finance. Therefore, reviewing the recent literature on text-mining applications in finance can be useful for identifying areas for further research. This paper focuses on the text-mining literature related to financial forecasting, banking, and corporate finance. It also analyses the existing literature on text mining in financial applications and provides a summary of some recent studies. Finally, the paper briefly discusses various text-mining methods being applied in the financial domain, the challenges faced in these applications, and the future scope of text mining in finance.
Today, technology is deeply integrated with everyone’s lives. Nearly every activity in modern life, from phone calls to satellites sent into space, has evolved exponentially with technology (Patel et al. 2020a , b , c ; Panchiwala and Shah 2020 ). The increasing ability to create and manage information has been an influential factor in the development of technology. According to the National Security Agency of the United States, 1826 petabytes on average are handled daily over the Internet (Hariri et al. 2019 ; Jaseena and David 2014 ). With the rapid increase in data and information communicated over the Internet, it has become necessary to regulate and ease the flow of the same (Ahir et al. 2020 ; Gandhi et al. 2020 ). A number of commercial and social applications have been introduced for these purposes. Aspects of data and information, such as security, research, and sentiment analysis, can be of great help to organisations, governments, and the public (Jani et al. 2019 ; Jha et al. 2019 ). There are various optimized techniques that aid us in tasks such as classification, summarisation, and ease of access and management of data, among others (Shah et al. 2020a , b ; Talaviya et al. 2020 ). Algorithms related to machine learning and deep learning (DL) are just some of the many algorithms that can be used to process the available information (Kakkad et al. 2019 ; Kundalia et al. 2020 ). Even though there is a massive amount of available information, the use of computational techniques can help us process information from top to bottom and analyse entire documents as well as individual words (Pandya et al. 2019 ; Parekh et al. 2020 ).
Human-generated ‘natural’ data in the form of text, audio, video, and so on are rapidly increasing (Shah et al. 2020a , b ). This has led to a rise in interest in methods and tools that can help extract useful information automatically from enormous amounts of unstructured data (Jaseena and David 2014 ; David and Balakrishnan 2011 ). One crucial method is text mining, which is a combined derivative of techniques such as data mining, machine learning, and computational linguistics, among others. Text mining aims to extract information and patterns from textual data (Talib et al. 2016b ; Fan et al. 2006 ). The trivial approach to text mining is manual, in which a human reads the text and searches for useful information in it. A more logical approach is automatic, which mines text in an efficient way in terms of speed and cost (Herranz et al. 2018 ; Sukhadia et al. 2020 ; Pathan et al. 2020 ).
According to the India Brand Equity Foundation (IBEF 2019 ), the Indian financial industry alone had US $340.48 billion in assets under management as of February 2019. This value only provides us with a limited indication of the actual size and reach of the global finance industry. Technology has paved the way for digitalisation in this rapidly growing behemoth. ‘FinTech’ is a developing domain in the finance industry, which has been defined as a union of finance and information technology (Zavolokina et al. 2016 ). Marrara et al. ( 2019 ) examined how FinTech relates to Italian small and medium-sized enterprises (SMEs), where FinTech has witnessed huge growth in terms of investment and development, and how it has proved fruitful for the SME market in a short amount of time. FinTech has popularised the use of data in the financial industry. This data is substantially in the form of structured or unstructured text. Therefore, traditionally and technically, textual data can be regarded as always having been a prevailing and essential element in the finance sector.
Unstructured textual data have been increasing rapidly in the finance industry (Lewis and Young 2019 ). This is where text mining has a lot of potential. Kumar and Ravi ( 2016 ) explored various applications in the financial domain in which text mining could play a significant role. They concluded that it had numerous applications in this industry, such as various kinds of predictions, customer relationship management, and cybersecurity issues, among others. Many novel methods have been proposed for analysing financial results in recent years, and artificial intelligence has made it possible to analyse and even predict financial outcomes based on historical data.
Finance has been an important force in human life since the earliest civilisations. It is noteworthy that from barter systems to cryptocurrencies, finance has always been associated with data, such as transactions, accounts, prices, and reports. Manual approaches to processing data have been reduced in use and significance over time. Researchers and practitioners have come to prefer digitised and automated approaches for studying and analysing financial data. Financial data contain a significant amount of latent information. If the latent information were to be extracted manually from a huge corpus of data, it might take years. Advancements in text mining have made it possible to efficiently examine textual data pertaining to finance. Bach et al. ( 2019 ) published a literature review on text mining for big-data analysis in finance. They structured the review in terms of three critical questions. These questions pertained to the intellectual core of finance, the text-mining techniques used in finance, and the data sources of financial sectors. Kumar and Ravi ( 2016 ) discussed the model presented by Vu et al. ( 2012 ) that implemented text mining on Twitter messages to perform sentiment analysis for the prediction of stock prices. They also mentioned the model of Lavrenko et al. ( 2000 ), which could classify news stories in a way that could help identify which of them affected trends in finance and to what degree. We will further discuss text-mining applications in finance in subsequent sections.
Apart from finance, we present a brief overview of text mining in other industries. On social media, people generate text data in the form of posts, blogs, and web forum activity, among many others (Agichtein et al. 2008 ). Despite the vast quantity of data available, the relatively low proportion of content of significant quality is still a problem (Kinsella et al. 2011 ), which is an issue that can be solved by text mining (Salloum et al. 2017 ). In the biomedical field too, there is a need for effective text-mining and classification methods (Krallinger et al. 2011 ). On e-commerce websites, text mining is used to prevent the repetition of information to the same audience (Da-sheng et al. 2009 ) and improve product listings through reviews (Kang and Park 2016 ; Ur-Rahman and Harding 2012 ). In healthcare, researchers have worked on applications such as the identification of healthcare topics directly from personal messages over the Internet (Lu 2013 ), classification of online data (Srivastava et al. 2018 ), and analysis of patient feedback (James et al. 2017 ). The agriculture industry has also used text mining in, for example, the classification of agricultural regulations (Espejo-Garcia et al. 2018 ), ontology-based agricultural text clustering (Su et al. 2012 ), and analysis of agricultural network public opinions (Lee 2019 ). Text mining has also been utilised in the detection of malicious web URLs which evolve over time and have complex features (Li et al. 2020a ; b , c ).
This paper discusses the use of text mining in the financial domain in detail, taking into consideration three major areas of application: financial forecasting, banking, and corporate finance. We also discuss the widely used methodologies and techniques for text mining in finance, the challenges faced by researchers, and the future scope for text-mining methods in finance.
Overview of text-mining methodologies
Text mining is a process through which the user derives high-quality information from a given piece of text. Text mining has seen a significant increase in demand over the last few years. Coupled with big data analytics, the field of text mining is evolving continuously. Finance is one major sector that can benefit from these techniques; the analysis of large volumes of financial data is both a need and an advantage for corporates, government, and the general public. This section discusses some important and widely used techniques in the analysis of textual data in the context of finance.
Sentiment analysis (SA)
One of the most important techniques in the field is SA. It has applications in numerous sectors. This technique extracts the underlying opinions within textual data and is therefore also referred to as opinion mining (Akaichi et al. 2013 ). It is of prime use in a number of domains, such as e-commerce platforms, blogs, online social media, and microblogs. The motives behind sentiment analysis can be broadly divided into emotion recognition and polarity detection. Emotion detection is focused on the extraction of a set of emotion labels, and polarity detection is more of a classifier-oriented approach with discrete outputs (e.g., positive and negative) (Cambria 2016 ).
There are two main approaches to SA, namely lexicon-based (dictionary-based) and machine learning (ML). The latter is further classified into supervised and unsupervised learning approaches (Xu et al. 2019; Pradhan et al. 2016). Lexicon-based approaches use word maps such as SentiWordNet, whereas ML treats SA as a classification problem and applies established techniques to it. In lexicon-based approaches, the overall sentiment score is calculated by dividing the sentiment frequency by the sum of positive and negative sentiments. In ML approaches, the major techniques used are the Naïve Bayes (NB) classifier and support vector machines (SVMs), which use labelled data for classification. SA using ML has an edge over the lexicon approach, as it does not require costly word dictionaries; however, ML requires domain-specific datasets, which can be considered a limitation (Al-Natour and Turetken 2020). After data preprocessing, feature selection is performed as required, following which the final results are obtained by analysing the given data according to the adopted approach (Hassonah et al. 2019).
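A lexicon-based scorer of the kind described above can be sketched in a few lines; the tiny word lists here are illustrative stand-ins for a real resource such as SentiWordNet:

```python
# Tiny illustrative lexicon; real systems use resources like SentiWordNet.
POSITIVE = {"good", "great", "profit", "gain", "strong"}
NEGATIVE = {"bad", "loss", "weak", "risk", "decline"}

def lexicon_score(text):
    """Score = (positive hits - negative hits) / total sentiment hits,
    giving a polarity in [-1, 1]; 0.0 when no sentiment words appear."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

score = lexicon_score("Strong quarter with good profit despite some risk")
```

A production scorer would additionally handle negation ("not good"), intensifiers, and per-word polarity strengths, which simple set membership ignores.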
In the financial domain, stock market prediction is one of the applications in which SA has been used to predict future stock market trends and prices from the analysis of financial news articles. Joshi et al. ( 2016 ) compared three ML algorithms and observed that random forest (RF) and SVMs performed better than NB. Renault ( 2019 ) used StockTwits (a platform where people share ideas about the stock market) as a data source and applied five algorithms, namely NB, a maximum entropy method, a linear SVM, an RF, and a multilayer perceptron and concluded that the maximum entropy and linear SVM methods gave the best results. Over the years, researchers have combined deep learning methods with traditional machine learning techniques (e.g., construction of sentiment lexicon), thus obtaining more promising results (Yang et al. 2020 ).
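As a rough illustration of the ML route, the following sketch trains NB and linear-SVM classifiers on a handful of invented headlines (not the datasets used in the cited studies) and compares their fit:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical labelled headlines; real studies use thousands of examples.
headlines = [
    "shares surge on record profit", "strong earnings lift stock",
    "market rallies after upbeat forecast", "stock gains on merger news",
    "shares plunge on weak earnings", "stock falls amid fraud probe",
    "market slides on recession fears", "losses deepen after warning",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

# Vectorise once, then fit both classifiers on the same features.
X = TfidfVectorizer().fit_transform(headlines)
nb = MultinomialNB().fit(X, labels)
svm = LinearSVC().fit(X, labels)

nb_acc = nb.score(X, labels)    # training accuracy, for illustration only
svm_acc = svm.score(X, labels)
```

In practice, performance would be reported on held-out data (cross-validation), which is where differences such as those observed by Joshi et al. and Renault emerge.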
Information extraction (IE) is used to extract predefined data types from a text document. IE systems mainly aim at object identification, extracting relevant information from text fragments and then assembling the extracted pieces into a framework. Post extraction, DiscoTEX (Discovery from Text EXtraction) is one of the core methods used to convert the extracted structured data into a meaningful form from which knowledge can be discovered (Salloum et al. 2018).
In finance, named-entity recognition (NER) is used for extracting predefined types of data from a document. In banking, transaction order documents of customers may come via fax, which results in very diverse documents because of the lack of a fixed template and creates the need for proper feature extraction to obtain a structured document (Emekligil et al. 2016 ).
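Extracting predefined fields from such template-free documents is often approximated with patterns; the sketch below is a simplified, hypothetical illustration (invented document text and patterns), not the method of Emekligil et al.:

```python
import re

# Illustrative free-text payment order (no fixed template, as with faxes).
doc = """Please transfer the amount of 12,500.00 TRY
from account TR33 0006 1005 1978 6457 8413 26
to beneficiary ACME Tekstil, value date 05.03.2016."""

# Hypothetical patterns for three predefined field types.
patterns = {
    "amount": r"amount of\s+([\d.,]+)\s+[A-Z]{3}",
    "iban": r"\b(TR\d{2}(?:\s?\d{4}){5}\s?\d{2})\b",
    "value_date": r"value date\s+(\d{2}\.\d{2}\.\d{4})",
}

# Collect the first match for each field into a structured record.
fields = {}
for name, pat in patterns.items():
    m = re.search(pat, doc)
    if m:
        fields[name] = m.group(1)
```

Production NER systems replace such hand-written patterns with statistical sequence models, precisely because fax-derived documents vary too much for fixed rules.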
Natural language processing (NLP)
NLP is a part of the artificial intelligence domain and attempts to transform imprecise and ambiguous messages into unambiguous and precise ones. In the financial sector, it has been used to assess a firm's current and future performance, domain standards, and regulations, and is often used to mine documents for insights that support conclusions (Fisher et al. 2016). NLP can support various analyses, such as NER, which in turn helps identify relationships and other information needed to pin down key concepts. However, NLP lacks a dictionary list covering all the named entities used for identification (Talib et al. 2016a, b).
As NLP is a pragmatic research approach to analysing the huge amount of available data, Xing et al. (2017) applied it to bridge the gap between NLP and financial forecasting by considering topics of interest to both research fields. Figure 1 provides an intuitive grasp of natural language-based financial forecasting (NLFF).
An intersection of NLP and financial forecasting to illustrate the concept of NLFF (Xing et al. 2017 )
Chen et al. ( 2020 ) discussed the role of NLP in FinTech in the past, present, and future. They reviewed three aspects, namely know your customer (KYC), know your product (KYP), and satisfy your customer (SYC). In KYC, a lot of textual data is generated in the process of acquiring information about customers (corporate sector and retail). With respect to KYP, salespersons are required to know all the attributes of their product, which again requires data in order to know the prospects, risks, and opportunities of the product. In SYC, salespersons/traders and researchers try to make the financial activities more efficient to satisfy the customers in the business-to-customer as well as customer-to-customer business models. Herranz et al. ( 2018 ) discussed the role of NLP in teaching finance and reported that it enhanced the transfer of knowledge within an environment overloaded with information.
Text classification
Text classification is a four-step process comprising feature extraction, dimension reduction, classifier selection, and evaluation. Feature extraction can be done with common techniques such as term frequency and Word2Vec; then, dimensionality reduction is performed using techniques such as principal component analysis and linear discriminant analysis. Choosing a classifier is an important step, and it has been observed that deep learning approaches have surpassed the results of other machine learning algorithms. The evaluation step helps in understanding the performance of the model; it is conducted using various metrics, such as the Matthews correlation coefficient (MCC), area under the ROC curve (AUC), and accuracy, of which accuracy is the simplest to compute. Figure 2 shows an overview of the text classification process (Kowsari et al. 2019).
A general overview of the text classification process (Kowsari et al. 2019 )
Brindha et al. ( 2016 ) compared the performance of various text classification techniques, namely NB, k-nearest neighbour (KNN), SVM, decision tree, and regression, and found that based on the precision, recall, and F1 measures, SVM provided better results than the others.
Deep learning is a part of machine learning, which trains a data model to make predictions about new data. Deep learning has a layered architecture, where the input data goes into the lowest level and the output data is generated at the highest level. The input is transformed at the various middle levels by applying algorithms to extract features, transform features into factors, and then input the factors into the deeper layer again to obtain transformed features (Heaton et al. 2016 ). Widiastuti ( 2018 ) focused on the input data, as it plays an important role in the performance of any algorithm. The author concluded that modification of the network architecture with deep learning algorithms can markedly affect performance and provide good results.
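The layered transformation described above, features turned into factors that feed the next layer, can be sketched as a bare forward pass with random (untrained, purely illustrative) weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Non-linearity applied between layers.
    return np.maximum(z, 0.0)

# Each layer re-represents its input at a higher level of abstraction.
x = rng.normal(size=(5, 8))                        # 5 samples, 8 raw features
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)    # features -> factors
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)     # factors -> deeper factors
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)      # final prediction layer

h1 = relu(x @ W1 + b1)
h2 = relu(h1 @ W2 + b2)
y = h2 @ W3 + b3                                   # one output per sample
```

Training would adjust the weight matrices by backpropagation; the sketch only shows the layered input-to-output data flow the text describes.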
In finance, deep learning solves the problem of complexity and ambiguity of natural language. Kraus and Feuerriegel ( 2017 ) used a corpus of 13,135 German ad hoc announcements in English to predict stock market movements and concluded that deep learning was better than the traditional bag-of-words approach. The results also showed that the long short-term memory models outperformed all the existing machine learning algorithms when transfer learning was performed to pre-train word embeddings.
Review of text-mining applications in finance
As mentioned in earlier sections, this paper focuses on the applications of text mining in three sectors of finance, namely financial predictions, banking, and corporate finance. In the subsections, we review various studies. Some literature has been summarised in detail, and in the end, a tabular summary of some more studies is included. Figure 3 shows a summarised link between the text-mining techniques and their corresponding applications in the respective domains. Although the following subsections discuss the studies pertaining to each sector individually, there has also been research on techniques that can be applied to multiple financial sectors. One such system was proposed by Li et al. ( 2020a ), which was a classifier based on adaptive hyper-spheres. It could be helpful in tasks such as credit scoring, stock price prediction, and anti-fraud analysis.
An overview of how text mining can be used in the financial domain. This paper follows a systematic approach for reviewing text-mining applications, as depicted by the flowchart in the figure. The two independent entities, namely finance and text mining, are linked together to show the possible applications of various text-mining techniques in various financial domains
Prediction of financial trends
Using the ever-expanding pool of textual data to improve the dynamics of the market has long been a practice in the financial industry. The increasing volume of press releases, financial data, and related news articles has been motivating continued and increasingly sophisticated analysis, dating back to the 1980s, in pursuit of a competitive advantage (Xing et al. 2017). Abundant data investigated with text mining can deliver an advantage in a variety of scenarios. According to Tkáč and Verner (2016) and Schneider and Gupta (2016), among the many topics covered in financial forecasting, from credit scoring to inflation-rate prediction, a large proportion of the focus is on stock market and forex prediction. Wen et al. (2019) proposed an approach in which retail investor attention is used to evaluate stock price crash risk.
Wu et al. (2012) proposed a model that combined the features of technical analysis of stocks with sentiment analysis, since stock prices also depend on the decisions of investors who read stock news articles. They focused on obtaining the overall sentiment behind each news article and assigned it a sentiment label according to the weight it carried. Next, using different indicators, such as price, direction, and volume, technical analysis was performed and the learning prediction model was generated. The model was used to predict Taiwan's stock market, and its results proved more promising than those of models employing either technique alone. This indicates an efficient system that can be integrated with even better features in the future.
Al-Rubaiee et al. (2015) analysed the relationship between Saudi Twitter posts and the country's stock market (Tadawul). They used a number of algorithms, such as SVM, KNN, and NB, to classify Arabic text for the purpose of stock trading, with a major focus on properly preprocessing the data before analysis. Comparing the results, they found that SVM had the best recall and KNN the best precision. The one-to-one model that they built showcased the positive and negative sentiments as well as the closing values of the Tadawul All Share Index (TASI). The relationship between a rise in the TASI index and an increase in positive sentiment was found to be stronger than that between a decline in the index and negative sentiment. The researchers mentioned that in future work they would incorporate the Saudi stock market closing values and sentiment features on tweets to explore the patterns between the Saudi stock index and public opinion on Twitter.
Vijayan and Potey (2016) proposed a model based on recent news headlines that predicted forex trends under given market situations. Information about past forex currency-pair trends was analysed along with the news headlines from the corresponding timeline, under the assumption that the market would behave in the future as it had in the past. The researchers focused on eliminating redundancy, and their model used news headlines rather than entire articles. Multilayer dimension reduction algorithms were used for text mining, the Synchronous Targeted Label Prediction algorithm was used for optimal feature reduction, and the J48 algorithm was used to generate decision trees. The main focus was on fundamental analysis targeting unstructured textual data, in addition to technical analysis for making predictions based on historical data. The J48 algorithm improved the accuracy and performance of the overall system, with better efficiency and less runtime. In fact, the researchers reported that the algorithm could be applied to diverse subjects, such as movie reviews.
Nassirtoussi et al. ( 2015 ) proposed an approach for forex prediction wherein the major focus was on strengthening text-mining aspects that had not been focused upon in previous studies. Dimensionality reduction, semantic integration, and sentiment analysis enabled efficient results. The system predicted the directional movement of a currency pair based on news headlines in the sector from a few hours before. Again, headlines were taken into consideration for the analysis, and a multilayer algorithm was used to address semantics, sentiments, and dimensionality reduction. This model’s process was highly accurate, with results of up to 83%. The strong results obtained in that study demonstrate that the studied relationships exist. The models can be applied to other contexts as well.
Nikfarjam et al. (2010) discussed the components that constitute a forecasting model in this sector and the prototypes that had recently been introduced, comparing the main components with each other. Feature selection and feature weighting, used either individually or in combination, were applied to select pieces of news, after which feature weighting was used to calculate weights for the given terms. The feature weighting methodology was based on the study by Fung et al. (2002), who enhanced term frequency-inverse document frequency (TF-IDF) weighting by assigning additional weights. For text classification, most researchers have applied SVMs to classify the input text into either good or bad news; some have used Bayesian classifiers, and others a combination of binary classifiers to reach the final classification decision. Many authors have focused on news features without equally addressing the available market data. Most studies have analysed news and indicator values separately, which has proved less efficient; combining market news with the status of market trends at the same time is expected to provide stronger results.
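Plain TF-IDF weighting, the scheme Fung et al. built upon, can be computed directly; the three short documents below are invented:

```python
import math
from collections import Counter

# Invented mini-corpus of headlines.
docs = [
    "stock prices rise on strong earnings",
    "earnings fall as stock slides",
    "central bank holds interest rates",
]
tokenised = [d.split() for d in docs]
N = len(tokenised)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenised for t in set(doc))

def tfidf(doc):
    """Weight each term by (term frequency) x log(N / document frequency)."""
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

weights = tfidf(tokenised[0])
```

Terms that occur in many documents ("stock", "earnings") are down-weighted relative to terms unique to one document ("prices", "rise"), which is exactly why TF-IDF is a common feature-weighting baseline before any enhancement.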
Gupta et al. ( 2019 ) proposed a combination of two models: the primary model obtained the dataset for prediction, preprocessed it using logistic regression to remove redundancy, and employed a genetic algorithm, KNN, and support vector regression (SVR). Of the three, KNN formed the basis for the predictions, with an efficiency of more than 50%. The genetic algorithm was then applied in search of better accuracy, and SVR was used to support it further, yielding the opening price for any day in the future. For sentiment analysis, Twitter was used, as it was considered the most popular source for related news. The model divided tweets into two categories and predicted the rise or fall of the market from a huge pool of keywords. In the end, the model had an accuracy of about 70–75%, which seems reasonable for a dynamic environment.
Nguyen et al. ( 2015 ) focused on sentiment analysis of social media. They obtained the sentiments behind specific topics of discussion about a company on social media and achieved promising results in comparison with the accuracy of stocks in the preceding year. Human-annotated sentiments on social media messages regarding stock prediction were analysed, and the percentage of desired sentiments was calculated for each class. For the remaining messages without explicit sentiments, a classification model was trained using the annotated sentiments in the dataset. For both of these tasks, an SVM was used as the classification model. In another study, after lemmatisation by CoreNLP, latent Dirichlet allocation (LDA) (Blei et al. 2003 ) was used as the generative probabilistic model. The authors also implemented the JST model (Lin and He 2009 ) and aspect-based sentiment analysis for analysing topic sentiments for stock prediction. The study’s limitation was that the topics and models were selected beforehand. The accuracy was around 54%; however, the model predicted only whether the stock went up or down. As the model focused only on sentiments and historical prices, the authors intended to add more factors to build a more accurate model.
Li et al. ( 2009 ) approached financial risk analysis through the available financial data and information sentiments, using machine learning and sentiment analysis. The uniqueness of their study lay in the volume of data and the information sentiments. A generalised autoregressive conditional heteroskedasticity (GARCH)-based artificial neural network and a GARCH-based SVM were used. A special training process, named the ‘dynamic training technique’, was applied because the data was non-stationary and noisy and could have caused overfitting. For analysing news, the semantic orientation-based approach was adopted, mainly because of the number of articles analysed in the study. Future work on this model was expected to include more input data and better sentiment analysis algorithms to obtain better results.
The use of sentiment analysis as a tool to facilitate investment and risk decisions by stock investors was demonstrated by Wu et al. ( 2014 ). Sina Finance, an experimental platform, was the basis for the collection of financial data for this model. The method incorporated machine learning based on SVM and GARCH with sentiment analysis. At the specific opening and closing times for each day, the GARCH-based SVM was used to identify the relations between the obtained information’s sentiment and stock price volatility. This model showed better results when predicting individual stocks rather than at the industry level. The machine learning approach was about 6% more accurate than the lexicon-based semantic approach, and it performed better with bigger datasets. The model performed better on datasets relating to small companies, as small companies were observed to be more sensitive to online reviews. The authors mentioned their future scope as expanding their dataset and attempting to create a more efficient sentiment calculation algorithm to increase the overall accuracy, similar to the one made by Li et al. ( 2009 ).
A slightly different approach was taken by Ahmad et al. ( 2006 ), who focused on sentiment analysis of financial news streams in multiple languages. The automatic sentiment analysis was replicated across three widely spoken languages, namely Arabic, Chinese, and English. The authors adopted a local grammar approach using a local archive of the three languages. A statistical criterion applied to the training collection of texts helped identify keywords. The most widely available corpus was for English, followed by Chinese and Arabic. Based on their frequencies, the most widely used words were ranked and selected. Under manual evaluation, the accuracy of extraction ranged from 60 to 75%. A more robust evaluation of this model, including more than one news vendor at a time, would be necessary for use in real-time markets.
Over the years, deep learning has become acknowledged as a useful machine learning technique that enables state-of-the-art results. It uses multiple layers to create representations and features from the input data. Text-mining analysis has also continuously evolved. Early basic models used lexicon-based analysis to score a particular entity (sentiment analysis). Considering the complexity of language, completely understanding what a piece of text aims to convey requires a more complex analysis that identifies and targets relevant entities and related aspects (Dohaiha et al. 2018 ). The most important aspect is the relationship between the words in the text, which is dominant in determining the meaning of the content. Several language elements, such as implications (Ray and Chakrabarti 2019 ) and sarcasm, require high-level methods for handling. This problem calls for deep learning models that can help completely understand a given piece of text. Deep learning may incorporate time series analysis and aspect-based sentiment analysis, which enhance data mining, feature selection, and fast information retrieval. Deep learning models learn features during the process of learning; they create abstract representations of the given data and are therefore invariant to local changes in the input data (Sohangir et al. 2018 ). Word embeddings place words that are similar in context close together. By measuring similarities between words (e.g., cosine similarity in the case of vectors), one can employ word embeddings in the initial data preprocessing layers for faster and more efficient NLP execution (Young et al. 2018 ).
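The cosine-similarity measure mentioned above can be sketched in a few lines. The embedding values below are invented for illustration; real embeddings would come from a trained model such as word2vec or GloVe and have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings (illustrative values only)
emb = {
    "stock":  [0.9, 0.1, 0.0],
    "share":  [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(emb["stock"], emb["share"])      # high: vectors point the same way
sim_unrelated = cosine_similarity(emb["stock"], emb["banana"])   # near zero: nearly orthogonal
```

Because cosine similarity depends only on direction, not magnitude, it is a natural choice for comparing embedding vectors whose norms carry little semantic meaning.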
The huge amount of streaming financial news and articles is impossible for humans to process, interpret, and apply on a daily basis. In a number of uses, such as portfolio construction, forecasting a financial time series is essential. The application of DL techniques on such data for forecasting purposes is therefore of interest to industry professionals. It has been reported that repeated patterns of price movements can be estimated using econometric and statistical models (Souma et al. 2019 ). Even though the market is dynamic, a combination of deep learning models and past market trends is very useful for accurate predictions. Comparing real trades with market trades generated with the use of SA, Kordonis et al. ( 2016 ) found a considerable effect of sentiments on the predictions. Because of the promising results, the use of artificial intelligence and deep learning has attracted the interest of many researchers and practitioners seeking to improve forecasting.
With deep learning, little work has to be done by hand, while large amounts of computation and data can be harnessed. DL techniques that use distributed representations are considered state-of-the-art methods for a large variety of NLP problems. We expect these models to improve and get better at handling unlabelled data through the development and use of approaches such as reinforcement learning.
Owing to advancements in technology, several factors can now be used in models that aim to predict market movements. Beyond price, a number of related models include macroeconomic variables (e.g., investment). Although macroeconomic indicators are important, they tend to be updated infrequently. Unlike such economic factors, public mood and sentiments (Xing et al. 2018a , b ) are dynamic and can be monitored instantaneously. For instance, behavioural science researchers have found that the stock market is affected by investors’ psychology (Daniel et al. 2001 ). Depending on their mood states, investors make numerous decisions, a large proportion of which are risky. The impact of sentiment and attention measures on stock market volatility (Audrino et al. 2018 ) can be gauged through news articles, social media, and search engine results. Models that combine technical indicators of the market with sentiments obtained from these sources outperform those that rely on only one of the two (Li et al. 2009 ). In a study on optimal portfolio allocation, Malandri et al. ( 2018 ) combined historical data of the New York Stock Exchange with sentiment data to obtain comparatively better returns for the portfolios under consideration.
Empirical studies have shown that current market prices reflect recently published news, consistent with the Efficient Market Hypothesis (Fama 1991 ). Rather than depending on existing information, price changes are markedly affected by new information or news. ML and DL methods have allowed data scientists to play a part in financial sector analysis and prediction (Picasso et al. 2019 ), and there has been increasing use of text-mining methods in trading decisions (Wu et al. 2012 ). Different kinds of models, including neural networks, are used to derive sentiment embeddings from news, tweets, and financial blogs. Mudinas et al. ( 2019 ) examined whether sentiment alone Granger-causes stock price changes; although this did not provide promising results, integration with prediction models did. This is because sentiments alone cannot be determinant factors, but combined with prediction models they lead to better and more dynamic results.
As discussed above, a plethora of proposals and approaches in relation to financial forecasting have been studied, the two main applications of which have been stock prediction and forex. The main focus of these studies was on obtaining sentiments from news headlines and not from entire articles. Researchers have used a variety of text-mining approaches to integrate the abundant amount of useful information with financial patterns. Table 1 summarises some more research studies that have been conducted in recent years on the subject of text mining in financial predictions.
Banking and related applications
Banking is one of the largest and fastest-growing industries in this era of globalisation, and it is heading towards adopting the most efficient practices for each of its departments. The total lending in the financial year 2017–2018 increased from US $429.92 billion to $1347.18 billion at a CAGR of 10.94% (Ministry of Commerce and Industry, Government of India, 2019). This huge rise is promoting strong economic growth, increasing incomes, enhancing trouble-free access to bank credit, and increasing consumerism. Amid the IT revolution, competitive pressures have driven the rising importance and adoption of banking automation. IT enables the implementation of various techniques for risk control and the smooth flow of transactions over electronic media, and it supports financial product innovation and development.
Gao and Ye ( 2007 ) proposed a framework for preventing money laundering with the help of the transaction histories of customers. They did this by identifying suspicious data from various textual reports from law enforcement agencies. They also mined unstructured databases and text documents for knowledge discovery in order to automatically extract the profiles of the entities that could be involved in money laundering. They employed SVM, decision trees, and Bayesian inference to develop a hierarchical structure of the suspicious reports and regression to identify hidden patterns.
Bholat et al. ( 2015 ) analysed the utility of text mining for central banks (CBs), which must evaluate a wide range of data sources to assess monetary and financial stability and achieve policy objectives; for this, text-mining techniques are more powerful than manual means. The authors elucidated two major aspects: the use of text as data for research purposes in CBs, and the various text-mining techniques for this purpose. For the former, they suggested that textual data in the form of social narratives can be used by central banks as financial indicators for risk and uncertainty management by employing topic clustering on the narratives. The latter aspect involved preprocessing the data to de-duplicate it, convert it into text files, and reduce it into tokens using various tokenisation techniques. Thereafter, text-mining techniques, such as dictionary techniques, vector space models, latent semantic analysis, LDA, and the NB algorithm, were applied to the tokenised data. The authors concluded that, taken together, these techniques can be a very useful addition to the efficient functioning of CBs.
Bach et al. ( 2019 ) stated that the huge amount of unstructured data from various sources has created a need for keyword extraction in the banking sector. They mentioned four different procedures for keyword extraction, drawn from the study by Bharti and Babu ( 2017 ). Bach et al. further discussed how keyword extraction can be implemented to retrieve related useful comments and documents and to compare banking institutions. They also reviewed other text-mining techniques that banks can utilise. NER was used on large datasets to extract entities such as persons, locations, and organisations. Sentiment analysis was applied to analyse customer opinions, which is crucial for a bank’s functioning. Topic extraction was found to be useful mainly in credit banking. Social network analysis, a graph theory-based methodology for studying the social media user structure, provided an outlook on how customers are connected on social media and how influential they are in spreading information through the network of interest. This social network analysis could then be coupled with text mining to identify the keywords corresponding to the customers’ common interests.
Yap et al. ( 2011 ) discussed the issue faced by recreational clubs with respect to potential defaulters and non-defaulters. They proposed a credit scoring model that utilised text mining to estimate the financial obligations of credit applicants. A scorecard was built from the past performance reports of borrowers, with different clubs using different criteria for evaluating the historic data. The data was split in a 70:30 ratio for training and validation, respectively. They used three different models, namely a credit scorecard model, a logistic regression model, and a decision tree model, with accuracy rates of 72.0%, 71.9%, and 71.2%, respectively. Although the model benefitted club administration, it also had a few limitations, such as the poor quality of the scorecard and the biased samples used to evaluate new applicants, as the model was built on historic data.
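The scorecard idea and the 70:30 split can be sketched as follows. The attributes, point values, and cutoff below are entirely hypothetical, invented for illustration rather than taken from the cited study:

```python
import random

# Hypothetical point table (illustrative, not from the cited study)
SCORECARD = {
    "payment_history": {"good": 40, "fair": 20, "poor": 0},
    "tenure_years": lambda y: min(y, 10) * 3,        # up to 30 points
    "arrears_count": lambda a: max(0, 20 - 10 * a),  # up to 20 points
}

def score(applicant):
    """Sum the points an applicant earns on each scorecard attribute."""
    return (SCORECARD["payment_history"][applicant["payment_history"]]
            + SCORECARD["tenure_years"](applicant["tenure_years"])
            + SCORECARD["arrears_count"](applicant["arrears_count"]))

def classify(applicant, cutoff=50):
    """Applicants scoring at or above the cutoff are predicted non-defaulters."""
    return "non-defaulter" if score(applicant) >= cutoff else "defaulter"

# Toy historical records with known outcomes
good = {"payment_history": "good", "tenure_years": 8, "arrears_count": 0}
poor = {"payment_history": "poor", "tenure_years": 1, "arrears_count": 3}
records = [(good, "non-defaulter")] * 5 + [(poor, "defaulter")] * 5

# 70:30 split into training (scorecard calibration) and validation sets
random.seed(0)
random.shuffle(records)
split = int(0.7 * len(records))
train, validate = records[:split], records[split:]
accuracy = sum(classify(a) == label for a, label in validate) / len(validate)
```

In a real scorecard, the point values themselves would be fitted on the training portion (e.g., from weight-of-evidence bins); here they are fixed so the validation step stays self-contained.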
Xiong et al. ( 2013 ) devised a model for personal bankruptcy prediction using sequence mining techniques. The sequence results showed good prediction ability. This model has potential value in many industries. For clustering categorical sequences, a model-based k-means algorithm was designed. A comparative study of three models, namely SVM, credit scoring, and the one proposed by them, found that the accuracies were 89.3%, 80.54%, and 94.07% respectively. The sequence mining used in the proposed model outperformed the other two models. In terms of loss prediction, the KNN algorithm had the potential to identify bad accounts with promising predictive ability.
Bhattacharyya et al. ( 2011 ) explored the use of text mining in credit card fraud detection by evaluating two predictive models: one based on SVM, and the other based on a combination of random forest with logistic regression. They discussed various challenges and problems in the implementation of the models. They recommended that the models should always be kept updated to account for the growing malpractices. The original dataset used in the study comprised more than 50 million real-time credit card transactions. The dataset was split into multiple datasets as per the requirements of different techniques. Because of imbalanced data, the performance was not solely measured by the overall accuracy but also by sensitivity, specificity, and area under the curve. Although the random forest model showed the highest overall accuracy of 96.2%, the study provided some other noteworthy observations. The accuracy of each model varied according to the proportion of the fraudulent cases, with all of them having more than 99% accuracy for a dataset with 2% fraud rates. The authors concluded with suggestions for future exploration: modifying the models to make them more accurate and devising a more reliable approach to split datasets into training and testing sets.
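The point about imbalanced data made above is easy to demonstrate: with a 2% fraud rate, a classifier that never flags anything still achieves 98% accuracy, which is why sensitivity and specificity must be reported alongside it. A minimal sketch (the labels are synthetic):

```python
def confusion_counts(y_true, y_pred, positive="fraud"):
    """Count true/false positives and negatives for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # fraction of fraud actually caught
        "specificity": tn / (tn + fp),   # fraction of legitimate cases left alone
    }

# Synthetic dataset with a 2% fraud rate and a do-nothing classifier
y_true = ["fraud"] * 2 + ["ok"] * 98
y_pred = ["ok"] * 100
m = metrics(y_true, y_pred)
# accuracy is high (0.98) even though sensitivity is 0.0: no fraud is caught
```

Area under the ROC curve extends the same idea across all decision thresholds, which is why the study cited above reports it as well.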
Kou et al. ( 2014 ) used data regarding credit approval and bankruptcy risk from credit card applications to analyse financial risks using clustering algorithms. They made evaluations based on 11 performance measures using multicriteria decision-making (MCDM) methods. A previous study by Kou et al. ( 2012 ) had proposed these MCDM methods for the evaluation of classification algorithms. In a later study (Kou et al. 2019 ), they employed these methods for assessing the feature selection methods for text classification.
In addition to the above-discussed literature in this section, Table 2 provides a summary of some more studies related to the banking finance industry. As visible in Table 2 , banking has a lot of different text-mining applications. Risk assessment, quality assessment, money laundering detection, and customer relationship management are just a few examples from the wide pool of possible text-mining applications in banking.
Applications in corporate finance
Corporate finance is an important aspect of the financial domain because it integrates a company’s functioning with its financial structure. Various corporate documents such as the annual reports of a company have a lot of hidden financial context. Text-mining techniques can be employed to extract this hidden information and also to predict the company’s future financial sustainability.
Guo et al. ( 2016 ) implemented text-mining algorithms that are widely used in accounting and finance. They merged the Thomson Reuters News Archive database and the News Analytics database; the former provides original news, and the latter provides sentiment scores ranging from − 1 to 1 for positive, negative, and neutral sentiment. To balance the dataset, 3000 news articles were randomly selected for training and 500 for testing. Three algorithms, namely NB, SVM, and a neural network, were run on the dataset, with overall accuracies of 58.7%, 78.2%, and 79.6%, respectively. As the neural network had the highest accuracy, the authors concluded that it can be used for text mining-based finance studies. Another model based on semantic analysis was also implemented, which used LDA to extract document relationships and the most relevant information from the documents. According to the authors, this technique has proven advantageous in accounting and finance for examining analyst reports and financial reporting.
Lewis and Young ( 2019 ) discussed the importance of text mining in financial reports, with a preference for NLP methods. They highlighted the exploding growth of unstructured textual data in corporate reporting, which opens numerous possibilities for financial applications. According to the authors, NLP methods for text mining solve two significant problems: first, they prevent overload through automated procedures for dealing with immense amounts of data; second, unlike human cognition, they are able to identify important underlying latent features. The authors reviewed the methodologies widely used for financial reporting, including keyword searches and word counts, attribute dictionaries, NB classification, cosine similarity, and LDA. Some factors, such as limited access to text data resources and insufficient collaboration between sectors and disciplines, were identified as challenges hindering progress in the application of text mining to finance.
Arguing that corporate sustainability reports (CSR) have increased dramatically, become crucial from the financial reporting perspective, and are not amenable to manual analysis, Shahi et al. ( 2014 ) proposed an automated model based on text-mining approaches for more intelligent scoring of CSR reports. After preprocessing of the dataset, four classification algorithms were implemented, namely NB, random subspace, decision table, and neural networks. Various parameters were evaluated, and the training categories and feature selection algorithms were tuned to determine the most effective model. NB with the Correlation-based Feature Selection (CFS) filter was chosen as the preferred model. Based on this model, software was designed for CSR report scoring that lets the user input a CSR report and receive its score as automated output. The software was tested and had an overall effectiveness of 81.10%. The authors concluded that the software could be utilised for other purposes as well, such as assessing the popularity of performance indicators.
Holton ( 2009 ) implemented a model for preventing corporate financial fraud from a different and interesting perspective: the author considered employee disgruntlement, or employee dissatisfaction, a hidden indicator of fraud risk. A minimal dataset of intra-company communication messages and emails from online discussion groups was prepared. After document clustering established that the data possessed sufficient predictive power, an NB classifier was implemented to classify the messages into disgruntled and non-disgruntled classes, achieving an accuracy of 89%. The author proposed using the model for fraud risk assessment in corporations and organisations, with the motivation that it could prevent huge financial losses. Comparing the performance of other models, such as neural networks and decision trees, was left for future work.
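A multinomial naive Bayes message classifier of the kind used above can be written compactly in pure Python. The messages and labels below are invented for illustration; this is a generic NB sketch with add-one smoothing, not the cited study's implementation:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        best, best_lp = None, -math.inf
        total = sum(self.class_counts.values())
        for label, count in self.class_counts.items():
            lp = math.log(count / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in doc:
                if word in self.vocab:  # ignore unseen words
                    lp += math.log((self.word_counts[label][word] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Toy intra-company messages (invented for illustration)
docs = [
    "management ignores us and pay is unfair".split(),
    "this company treats employees terribly".split(),
    "great teamwork on the quarterly release".split(),
    "happy with the new project direction".split(),
]
labels = ["disgruntled", "disgruntled", "content", "content"]
clf = NaiveBayesText().fit(docs, labels)
clf.predict("pay is terribly unfair".split())  # classified as 'disgruntled'
```

Log-probabilities are summed rather than probabilities multiplied to avoid floating-point underflow on longer messages.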
Chan and Franklin ( 2011 ) developed a new decision-support system to predict the occurrence of an event by analysing patterns and extracting sequences from financial reports. After text preprocessing, textual information generalisation was performed with the help of a shallow parser, which had an F-measure of 85%. The extracted information was stored in a separate database, from which event sequences were identified and extracted. A decision tree model was then implemented on these sequences to create an inference engine that could predict the occurrence of new events based on the training sequences. With an 85:15 training-to-testing split, the model achieved an overall accuracy of 89.09%. The authors concluded by highlighting that their model performed better and more robustly than prevailing models.
Humpherys et al. ( 2011 ) reviewed various text-mining methods and theories that have been proposed for the detection of corporate fraud in financial statements and subsequently devised a methodology of their own. Their dataset comprised the Management’s Discussion and Analysis section of corporate annual financial reports. After basic analysis and reduction, various statistical and machine learning algorithms were implemented on the dataset, among which the NB and C4.5 decision tree models both gave the highest accuracy of 67.3% for classifying 10-K reports into fraudulent and non-fraudulent. The authors suggested that their model can be used by auditors for detecting fraudulent statements in reports with the aid of the Agent99 analyser tool.
Loughran and McDonald ( 2011 ) argued that the word lists in the Harvard Dictionary, which is commonly used for textual analysis, are not suitable for financial text classification, because many words that are negative in the Harvard list are not actually negative in a financial context. Corporate 10-K reports were taken as data sources to create a new dictionary with word lists tailored to financial purposes, and the authors advised the use of term weighting for the word lists. The new word lists were compared with the Harvard word lists on multiple financial data items, such as 10-K filing returns, material weaknesses, and standardised unexpected earnings. Although a significant difference between the word lists was not observed for classification, the authors still recommended their lists as the more careful choice for preventing erroneous results.
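The effect of swapping a general-purpose negative word list for a finance-specific one can be sketched as follows. Both word lists and the sample filing text below are invented miniatures for illustration, not the actual Harvard or Loughran-McDonald dictionaries:

```python
# Illustrative word lists only -- not the actual published dictionaries
GENERAL_NEGATIVE = {"liability", "tax", "cost", "loss", "decline"}
FINANCIAL_NEGATIVE = {"loss", "decline", "impairment", "restatement", "felony"}

def negative_tone(tokens, word_list, weights=None):
    """Fraction of tokens hitting the negative list.

    `weights` optionally maps a term to a weight (e.g. a TF-IDF-style
    term weight, as the authors advise); unweighted counting is the default.
    """
    weights = weights or {}
    hits = sum(weights.get(t, 1) for t in tokens if t in word_list)
    return hits / len(tokens)

filing = ("the company recorded a tax liability while "
          "revenue continued its decline").split()

general = negative_tone(filing, GENERAL_NEGATIVE)
financial = negative_tone(filing, FINANCIAL_NEGATIVE)
# 'tax' and 'liability' are routine in filings; only 'decline' counts as
# negative under the finance-specific list, so its tone score is lower
```

This mirrors the paper's core point: words like "liability" inflate the negative tone of a filing under a general dictionary even though they are neutral financial vocabulary.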
Whereas other researchers have mostly focused on fraud detection and financial predictions from corporate financial reports, Song et al. ( 2018 ) focused on sentiment analysis of these reports with respect to the CSR score. The sentences in the sample reports were manually labelled as positive or negative to create sample data for the machine learning algorithm. SVM was implemented on the dataset with a 3:1 training-to-test split, achieving a precision ratio of 86.83%. Following this, an object library was created, with objects referring to the internal and external environment of the company, and sentiment analysis was conducted on these objects. Then, six regression models were developed to obtain the CSR score, with the model comprising the Political, Economic, Social, Technological, Environmental and Legal (PESTEL) framework, Porter’s Five Forces, and Primary and Support Activities showing the best performance in predicting the CSR score. The authors concluded that CSR plays a vital role in a company’s sustainability and that their research could aid stakeholders in company-related decision-making.
There have been more studies on CSR reports and sustainability. Liew et al. ( 2014 ) analysed the sustainability trends of process industries with the help of the CSR and sustainability reports of a large number of big companies. The RapidMiner tool was used for text preprocessing, followed by frequency statistics generation, pruning, and further text refinement, which produced sustainability-related terms for analysis. The most frequently occurring terms were used to create a hierarchical tree model. Environment, health and safety, and social were identified as the key concepts for sustainability. Based on term occurrence and involvement, the authors classified sustainability issues as specific, critical, rare, or general.
Table 3 presents some more studies on the applications of text mining in corporate finance. As evident from the table and the above-mentioned studies, the annual corporate reports are the most commonly used data source for text-mining applications.
Challenges and future scope
The financial sector is a significant driver of broader industry, and the increasing amount of data in this field has given rise to a number of applications that can be used to improve the field and achieve commercial objectives.
Figure 4 shows some common challenges faced by various text-mining techniques in the financial sector. The huge amount of available data is highly unstructured and carries implicit meanings in addition to explicit ones. The data needs to undergo proper preprocessing before it can be used for analysis. Although lexicon lists are available for various domains, the financial sector requires a domain-specific dictionary for such approaches, so that proper weights can be assigned to the corresponding aspects in a document. In addition, access to classified information is still restricted, which is a significant obstacle. Lastly, current techniques focus on obtaining static results that hold only for a given period of time. A system that applies text-mining techniques to dynamically obtained data and outputs real-time results would enable even better insights.
Major challenges to text mining in finance
The combination of text-mining techniques and financial data analytics can produce a model that can potentially be the most efficient model for this problem domain. The results obtained from mining textual data can be integrated with those from financial analysis, thereby providing models that focus on historical data as well as opinions from diverse sources.
This paper conducted an organised qualitative review of recent literature pertaining to three specific sectors of finance. First, this paper analysed the growing importance of text mining in predicting financial trends. While the prior consensus may have been that financial markets are unpredictable, text mining has challenged this notion. The second area of study was banking, which has seen constant growth in technological innovation over the years, especially in digitisation. Text mining has played a key role in supporting these advancements both directly and indirectly through combination with other technologies. Corporate finance was the third study area. We discussed the importance of text mining in enabling the utilisation of corporate reports and financial statements for serving various purposes in addition to supporting corporate sustainability goals. The use of text mining in financial applications is not limited to these sectors. Researchers are increasingly showing interest in text-mining applications and constantly seeking to build more accurate models. There are still many unexplored possibilities in the financial domain, and the related research can help develop more robust and accurate predictive and analytic systems.
Availability of data and materials
All relevant data and material are presented in the main paper.
Agichtein E, Castillo C, Donato D, Gionis A, Mishne G (2008) Finding high-quality content in social media. In: Proceedings of the international conference on web search and web data mining—WSDM ’08. https://doi.org/10.1145/1341531.1341557
Ahir K, Govani K, Gajera R, Shah M (2020) Application on virtual reality for enhanced education learning, military training and sports. Augment Hum Res 5:7
Ahmad K, Cheng D, Almas Y (2006) Multi-lingual sentiment analysis of financial news streams. In: Proceedings of science, pp 1–8
Akaichi J, Dhouioui Z, López-Huertas Pérez MJ (2013) Text mining facebook status updates for sentiment classification. In: 2013 17th international conference on system theory, control and computing (ICSTCC), Sinaia, 2013, pp 640–645. https://doi.org/10.1109/ICSTCC.2013.6689032
Al-Natour S, Turetken O (2020) A comparative assessment of sentiment analysis and star ratings for consumer reviews. Int J Inf Manage. https://doi.org/10.1016/j.ijinfomgt.2020.102132
AL-Rubaiee H, Qiu R, Li D (2015) Analysis of the relationship between Saudi twitter posts and the Saudi stock market. In: 2015 IEEE seventh international conference on intelligent computing and information systems (ICICIS). https://doi.org/10.1109/intelcis.2015.7397193
Audrino F, Sigrist F, Ballinari D (2018) The impact of sentiment and attention measures on stock market volatility. Available at SSRN: https://ssrn.com/abstract=3188941 or https://doi.org/10.2139/ssrn.3188941
Aureli S (2017) A comparison of content analysis usage and text mining in CSR corporate disclosure. Int J Digit Account Res 17:1–32
Bach MP, Krsti Z, Seljan S, Turulja L (2019) Text mining for big data analysis in financial sector: a literature review. Sustainability 2019(11):1277
Bharti SK, Babu KS (2017) Automatic keyword extraction for text summarization: a survey. CoRR. abs/1704.03242.
Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: a comparative study. Decis Support Syst 50(3):602–613
Bholat D, Hansen S, Santos P, Schonhardt-Bailey C (2015) Text mining for central banks: handbook. Centre Cent Bank Stud 33:1–19
Bidulya Y, Brunova E (2016) Sentiment analysis for bank service quality: a rule-based classifier. In: 2016 IEEE 10th international conference on application of information and communication technologies (AICT). https://doi.org/10.1109/icaict.2016.7991688
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(2003):993–1022
Brindha S, Prabha K, Sukumaran S (2016) A survey on classification techniques for text mining. In: 2016 3rd international conference on advanced computing and communication systems (ICACCS), Coimbatore, 2016, pp 1–5. https://doi.org/10.1109/ICACCS.2016.7586371
Bruno G (2016) Text mining and sentiment extraction in central bank documents. In: 2016 IEEE international conference on big data (big data). https://doi.org/10.1109/bigdata.2016.7840784
Cambria E (2016) Affective computing and sentiment analysis. IEEE Intell Syst 31(2):102–107. https://doi.org/10.1109/MIS.2016.31
Chakraborty V, Chiu V, Vasarhelyi M (2014) Automatic classification of accounting literature. Int J Account Inf Syst 15(2):122–148
Chan SWK, Franklin J (2011) A text-based decision support system for financial sequence prediction. Decis Support Syst 52(1):189–198
Chaturvedi D, Chopra S (2014) Customers sentiment on banks. Int J Comput Appl 98(13):8–13
Chen CC, Huang HH, Chen HH (2020) NLP in FinTech applications: past, present and future
Cook A, Herron B (2018) Harvesting unstructured data to reduce anti-money laundering (AML) compliance risk, pp 1–10
Daniel K, Hirshleifer D, Teoh S (2001) Investor psychology in capital markets: evidence and policy implications. J Monet Econ 49:139–209. https://doi.org/10.1016/S0304-3932(01)00091-5
Da-sheng W, Qin-fen Y, Li-juan L (2009) An efficient text classification algorithm in E-commerce application. In: 2009 WRI world congress on computer science and information engineering. https://doi.org/10.1109/csie.2009.346
David JM, Balakrishnan K (2011) Prediction of key symptoms of learning disabilities in school-age children using rough sets. Int J Comput Electr Eng Hong Kong 3(1):163–169
Dohaiha H, Prasad PWC, Maag A, Alsadoon A (2018) Deep learning for aspect-based sentiment analysis: a comparative review. Expert Syst Appl. https://doi.org/10.1016/j.eswa.2018.10.003
Elagamy MN, Stanier C, Sharp B (2018) Stock market random forest-text mining system mining critical indicators of stock market movements. In: 2018 2nd international conference on natural language and speech processing (ICNLSP). https://doi.org/10.1109/icnlsp.2018.8374370
Emekligil E, Arslan S, Agin O (2016) A bank information extraction system based on named entity recognition with CRFs from noisy customer order texts in Turkish. In: Knowledge engineering and semantic web, pp 93–102
Espejo-Garcia B, Martinez-Guanter J, Pérez-Ruiz M, Lopez-Pellicer FJ, Javier Zarazaga-Soria F (2018) Machine learning for automatic rule classification of agricultural regulations: a case study in Spain. Comput Electron Agric 150:343–352
Fama EF (1991) Efficient capital markets: II. J Finance 46(5):1575–1617. https://doi.org/10.2307/2328565
Fan W, Wallace L, Rich S, Zhang Z (2006) Tapping the power of text mining. Commun ACM 49(9):76–82
Feuerriegel S, Gordon J (2018) Long-term stock index forecasting based on text mining of regulatory disclosures. Decis Support Syst 112:88–97
Fisher I, Garnsey M, Hughes M (2016) Natural language processing in accounting, auditing and finance: a synthesis of the literature with a roadmap for future research. Intell Syst Account Finance Manag. https://doi.org/10.1002/isaf.1386
Fritz D, Tows E (2018) Text mining and reporting quality in German banks—a cooccurrence and sentiment analysis. Univers J Account Finance 6(2):54–81
Fung G, Yu J, Lam W (2002) News sensitive stock trend prediction. Adv Knowl Discov Data Min. https://doi.org/10.1007/3-540-47887-6_48
Gandhi M, Kamdar J, Shah M (2020) Preprocessing of non-symmetrical images for edge detection. Augment Hum Res 5:10. https://doi.org/10.1007/s41133-019-0030-5
Gao Z, Ye M (2007) A framework for data mining-based anti-money laundering research. J Money Laund Control 10(2):170–179
Gemar G, Jiménez-Quintero JA (2015) Text mining social media for competitive analysis. Tour Manag Stud 11(1):84–90
Gulaty M (2016) Aspect-based sentiment analysis in bank reviews. https://doi.org/10.13140/RG.2.1.2072.3445
Guo L, Shi F, Tu J (2016) Textual analysis and machine leaning: crack unstructured data in finance and accounting. J Finance Data Sci 2(3):153–170
Gupta R, Gill NS (2012) Financial statement fraud detection using text mining. Int J Adv Comput Sci Appl 3(12):189–191
Gupta A, Simaan M, Zaki MJ (2016) Investigating bank failures using text mining. In: 2016 IEEE symposium series on computational intelligence (SSCI). https://doi.org/10.1109/ssci.2016.7850006
Gupta A, Bhatia P, Dave K, Jain P (2019) Stock market prediction using data mining techniques. In: 2nd international conference on advances in science and technology, pp 1–5
Hagenau M, Liebmann M, Neumann D (2013) Automated news reading: stock price prediction based on financial news using context-capturing features. Decis Support Syst 55(3):685–697
Hájek P, Olej V (2013) Evaluating sentiment in annual reports for financial distress prediction using neural networks and support vector machines. In: Communications in computer and information science, pp 1–10.
Hariri RH, Fredericks EM, Bowers KM (2019) Uncertainty in big data analytics: survey, opportunities, and challenges. J Big Data. https://doi.org/10.1186/s40537-019-0206-3
Hassonah M, Al-Sayyed R, Rodan A, Al-Zoubi A, Aljarah I, Faris H (2019) An efficient hybrid filter and evolutionary wrapper approach for sentiment analysis of various topics on Twitter. Knowl Based Syst. https://doi.org/10.1016/j.knosys.2019.105353
Heaton JB, Polson NG, Witte JH (2016) Deep learning in finance. arXiv:1602.06561
Heidari M, Felden C (2015) Financial footnote analysis: developing a text mining approach. In: Int'l conf. data mining, pp 10–16
Herranz S, Palomo J, Cruz M (2018) Building an educational platform using NLP: a case study in teaching finance. J Univ Comput Sci 24:1403
Holton C (2009) Identifying disgruntled employee systems fraud risk through text mining: a simple solution for a multi-billion dollar problem. Decis Support Syst 46(4):853–864
Humpherys SL, Moffitt KC, Burns MB, Burgoon JK, Felix WF (2011) Identification of fraudulent financial statements using linguistic credibility analysis. Decis Support Syst 50(3):585–594
IBEF (2019) https://www.ibef.org/download/financial-services-april-2019.pdf
James TL, Calderon EDV, Cook DF (2017) Exploring patient perceptions of healthcare service quality through analysis of unstructured feedback. Expert Syst Appl 71:479–492
Jani K, Chaudhuri M, Patel H, Shah M (2019) Machine learning in films: an approach towards automation in film censoring. J Data Inf Manag. https://doi.org/10.1007/s42488-019-00016-9
Jaseena KU, David JM (2014) Issues, challenges, and solutions: big data mining. In: Natarajan Meghanathan et al. (eds) NeTCoM, CSIT, GRAPH-HOC, SPTM—2014, pp 131–140
Jha K, Doshi A, Patel P, Shah M (2019) A comprehensive review on automation in agriculture using artificial intelligence. Artif Intell Agric 2:1–12
Joshi K, Bharathi N, Jyothi R (2016) Stock trend prediction using news sentiment analysis. Int J Comput Sci Inf Technol 8:67–76. https://doi.org/10.5121/ijcsit.2016.8306
Junqué de Fortuny E, De Smedt T, Martens D, Daelemans W (2014) Evaluating and understanding text-based stock price prediction models. Inf Process Manag 50(2):426–441
Kakkad V, Patel M, Shah M (2019) Biometric authentication and image encryption for image security in cloud framework. Multiscale Multidiscip Model Exp Des. https://doi.org/10.1007/s41939-019-00049-y
Kamaruddin SS, Hamdan AR, Bakar AA (2007) Text mining for deviation detection in financial statements. In: Proceedings of the international conference on electrical engineering and informatics. Institut Teknologi Bandung, Indonesia, 2007, June 17–19
Kang T, Park DH (2016) The effect of expert reviews on consumer product evaluations: a text mining approach. J Intell Inf Syst 22(1):63–82
Kinsella S, Passant A, Breslin JG (2011) Topic classification in social media using metadata from hyperlinked objects. Adv Inf Retr. https://doi.org/10.1007/978-3-642-20161-5_20
Kloptchenko A, Eklund T, Karlsson J, Back B, Vanharanta H, Visa A (2004) Combining data and text mining techniques for analysing financial reports. Intell Syst Account Finance Manag 12(1):29–41
Kordonis J, Symeonidis S, Arampatzis A (2016) Stock price forecasting via sentiment analysis on twitter. https://doi.org/10.1145/3003733.3003787 .
Kou G, Lu Y, Peng Y, Shi Y (2012) Evaluation of classification algorithms using MCDM and rank correlation. Int J Inf Technol Decis Mak. https://doi.org/10.1142/S0219622012500095
Kou G, Peng Yi, Wang G (2014) Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Inf Sci 275:1–12. https://doi.org/10.1016/j.ins.2014.02.137
Kou G, Yang P, Peng Yi, Xiao F, Chen Y, Alsaadi F (2019) Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Appl Soft Comput 86:105836. https://doi.org/10.1016/j.asoc.2019.105836
Kowsari K, Jafari Meimandi K, Heidarysafa M, Mendu S, Barnes L, Brown D (2019) Text classification algorithms: a survey. Information 10:150
Krallinger M, Vazquez M, Leitner F, Salgado D, Chatr-aryamontri A, Winter A, Perfetto L, Briganti L, Licata L, Iannuccelli M, Castagnoli L, Cesareni G, Tyers M, Schneider G, Rinaldi F, Leaman R, Gonzalez G, Matos S, Kim S, Wilbur WJ, Rocha L, Shatkay H, Tendulkar AV, Agarwal S, Liu F, Wang X, Rak R, Noto K, Elkan C, Lu Z, Dogan RI, Fontaine JF, Andrade-Navarro MA, Valencia A (2011) The protein–protein interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinform 12(Suppl 8):S3. https://doi.org/10.1186/1471-2105-12-s8-s3
Kraus M, Feuerriegel S (2017) Decision support from financial disclosures with deep neural networks and transfer learning. Decis Support Syst. https://doi.org/10.1016/j.dss.2017.10.001
Krstić Ž, Seljan S, Zoroja J (2019) Visualization of big data text analytics in financial industry: a case study of topic extraction for Italian banks (September 12, 2019). In: 2019 ENTRENOVA conference proceedings. https://ssrn.com/abstract=3490108 or https://doi.org/10.2139/ssrn.3490108
Kumar BS, Ravi V (2016) A survey of the applications of text mining in financial domain. Knowl Based Syst 114:128–147
Kundalia K, Patel Y, Shah M (2020) Multi-label movie genre detection from a movie poster using knowledge transfer learning. Augment Hum Res 5:11. https://doi.org/10.1007/s41133-019-0029-y
Lavrenko V, Schmill M, Lawrie D, Ogilvie P, Jensen D, Allan J (2000) Mining of concurrent text and time series. In: KDD-2000 Workshop on text mining, vol 2000. Citeseer, pp 37–44
Lee CT (2019) Early warning mechanism of agricultural network public opinion based on text mining. Revista De La Facultad De Agronomia De La Universidad Del Zulia, 36
Lee B, Park JH, Kwon L, Moon YH, Shin Y, Kim G, Kim H (2018) About relationship between business text patterns and financial performance in corporate data. J Open Innov Technol Market Complex. https://doi.org/10.1186/s40852-018-0080-9
Lewis C, Young S (2019) Fad or future? Automated analysis of financial text and its implications for corporate reporting. Account Bus Res 49(5):587–615
Li N, Liang X, Li X, Wang C, Wu DD (2009) Network Environment and Financial Risk Using Machine Learning and Sentiment Analysis. Human Ecol Risk Assess Int J 15(2):227–252. https://doi.org/10.1080/10807030902761056
Li T, Kou G, Peng Y, Shi Y (2020a) Classifying with adaptive hyper-spheres: an incremental classifier based on competitive learning. IEEE Trans Syst Man Cybern Syst 50(4):1218–1229. https://doi.org/10.1109/TSMC.2017.2761360
Li X, Wu P, Wang W (2020b) Incorporating stock prices and news sentiments for stock market prediction: a case of Hong Kong. Inf Process Manag. https://doi.org/10.1016/j.ipm.2020.102212
Li T, Kou G, Peng Yi (2020c) Improving malicious URLs detection via feature engineering: linear and nonlinear space transformation methods. Inf Syst 91:101494. https://doi.org/10.1016/j.is.2020.101494
Liew WT, Adhitya A, Srinivasan R (2014) Sustainability trends in the process industries: a text mining-based analysis. Comput Ind 65(3):393–400
Lin C, He Y (2009) Joint sentiment/topic model for sentiment analysis. In: Proceeding of the 18th ACM conference on information and knowledge management—CIKM ’09. https://doi.org/10.1145/1645953.1646003
Loughran T, Mcdonald B (2011) When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. J Finance 66(1):35–65
Lu Y (2013) Automatic topic identification of health-related messages in online health community using text classification. SpringerPlus 2(1):309
Malandri L, Xing F, Orsenigo C, Vercellis C, Cambria E (2018) Public mood-driven asset allocation: the importance of financial sentiment in portfolio management. Cogn Comput. https://doi.org/10.1007/s12559-018-9609-2
Marrara S, Pejic Bach M, Seljan S, Topalovic A (2019) FinTech and SMEs—the Italian case. https://doi.org/10.4018/978-1-5225-7805-5.ch002
Matthies B, Coners A (2015) Computer-aided text analysis of corporate disclosures—demonstration and evaluation of two approaches. Int J Digit Account Res 15:69–98
Mudinas A, Zhang D, Levene M (2019) Market trend prediction using sentiment analysis: lessons learned and paths forward. arXiv:1903.05440
Nan L, Xun L, Xinli L, Chao W, Desheng DW (2009) Network environment and financial risk using machine learning and sentiment analysis. Hum Ecol Risk Assess Int J 15(2):227–252
Nassirtoussi AK, Aghabozorgi S, Wah TY, Ngo DC (2015) Text mining of news-headlines for FOREX market prediction: A Multi-layer Dimension Reduction Algorithm with semantics and sentiment. Expert Syst Appl 42(1):306–324. https://doi.org/10.1016/j.eswa.2014.08.004
Nguyen TH, Shirai K, Velcin J (2015) Sentiment analysis on social media for stock movement prediction. Expert Syst Appl 42(24):9603–9611
Nikfarjam A, Emadzadeh E, Muthaiyah S (2010) Text mining approaches for stock market prediction. In: 2010 the 2nd international conference on computer and automation engineering (ICCAE). https://doi.org/10.1109/iccae.2010.5451705
Nopp C, Hanbury A (2015) Detecting risks in the banking system by sentiment analysis. In: Proceedings of the 2015 conference on empirical methods in natural language processing, pp 591–600
Panchiwala S, Shah MA (2020) Comprehensive study on critical security issues and challenges of the IoT world. J Data Inf Manag. https://doi.org/10.1007/s42488-020-00030-2
Pandya R, Nadiadwala S, Shah R, Shah M (2019) Buildout of methodology for meticulous diagnosis of K-complex in EEG for aiding the detection of Alzheimer’s by artificial intelligence. Augment Hum Res. https://doi.org/10.1007/s41133-019-0021-6
Parekh V, Shah D, Shah M (2020) Fatigue detection using artificial intelligence framework. Augment Hum Res 5:5
Patel D, Shah Y, Thakkar N, Shah K, Shah M (2020a) Implementation of artificial intelligence techniques for cancer detection. Augment Hum Res. https://doi.org/10.1007/s41133-019-0024-3
Patel D, Shah D, Shah M (2020b) The intertwine of brain and body: a quantitative analysis on how big data influences the system of sports. Ann Data Sci. https://doi.org/10.1007/s40745-019-00239-y
Patel H, Prajapati D, Mahida D, Shah M (2020c) Transforming petroleum downstream sector through big data: a holistic review. J Petrol Explor Prod Technol. https://doi.org/10.1007/s13202-020-00889-2
Pathan M, Patel N, Yagnik H, Shah M (2020) Artificial cognition for applications in smart agriculture: a comprehensive review. Artif Intell Agric. https://doi.org/10.1016/j.aiia.2020.06.001
Pejic Bach M, Krstić Ž, Seljan S, Turulja L (2019) Text mining for big data analysis in financial sector: a literature review. Sustainability 11:1277. https://doi.org/10.3390/su11051277
Picasso A, Merello S, Ma Y, Oneto L, Cambria E (2019) Technical analysis and sentiment embeddings for market trend prediction. Expert Syst Appl 135:60–70. https://doi.org/10.1016/j.eswa.2019.06.014
Pradhan MV, Vala J, Balani P (2016) A survey on sentiment analysis algorithms for opinion mining. Int J Comput Appl 133:7–11. https://doi.org/10.5120/ijca2016907977
Ray P, Chakrabarti A (2019) A mixed approach of deep learning method and rule-based method to improve aspect level sentiment analysis. Appl Comput Inform. https://doi.org/10.1016/j.aci.2019.02.002
Renault T (2019) Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digit Finance. https://doi.org/10.1007/s42521-019-00014-x
Sabo T (2017) Applying text analytics and machine learning to assess consumer financial complaints. In: Proceedings of the SAS global forum 2017 conference. SAS Institute Inc., Cary NC. https://support.sas.com/resources/papers/proceedings17/SAS0282-2017.pdf
Salloum S, Al-Emran M, Monem A, Shaalan K (2017) A survey of text mining in social media: facebook and twitter perspectives. Adv Sci Technol Eng Syst J 2:127–133. https://doi.org/10.25046/aj020115
Salloum S, Mostafa A, Monem A, Shaalan K (2018) Using text mining techniques for extracting information from research articles. https://doi.org/10.1007/978-3-319-67056-0_18
Schneider MJ, Gupta S (2016) Forecasting sales of new and existing products using consumer reviews: a random projections approach. Int J Forecast 32(2):243–256
Schumaker RP, Chen H (2009) Textual analysis of stock market prediction using breaking financial news. ACM Trans Inf Syst 27(2):1–19
Shah D, Isah H, Zulkernine F (2018a) Predicting the effects of news sentiments on the stock market. In: 2018 IEEE international conference on big data (big data). https://doi.org/10.1109/bigdata.2018.8621884
Shah T, Shaikh I, Patel A (2018b) Comparison of different kernels of support vector machine for predicting stock prices. Int J Eng Technol 9(6):4288–4291
Shah G, Shah A, Shah M (2019) Panacea of challenges in real-world application of big data analytics in healthcare sector. Data Inf Manag. https://doi.org/10.1007/s42488-019-00010-1
Shah D, Dixit R, Shah A, Shah P, Shah M (2020) A Comprehensive analysis regarding several breakthroughs based on computer intelligence targeting various syndromes. Augment Hum Res 5:14. https://doi.org/10.1007/s41133-020-00033-z
Shah K, Patel H, Sanghvi D, Shah M (2020) A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augment Hum Res 5:12. https://doi.org/10.1007/s41133-020-00032-0
Shahi AM, Issac B, Modapothala JR (2014) Automatic analysis of corporate sustainability reports and intelligent SCORING. Int J Comput Intell Appl 13(01):1450006. https://doi.org/10.1142/s1469026814500060
Shirata CY, Takeuchi H, Ogino S, Watanabe H (2011) Extracting key phrases as predictors of corporate bankruptcy: empirical analysis of annual reports by text mining. J Emerg Technol Account 8(1):31–44
Sohangir S, Wang D, Pomeranets A et al (2018) Big data: deep learning for financial sentiment analysis. J Big Data 5:3. https://doi.org/10.1186/s40537-017-0111-6
Song Y, Wang H, Zhu M (2018) Sustainable strategy for corporate governance based on the sentiment analysis of financial reports with CSR. Financ Innov. https://doi.org/10.1186/s40854-018-0086-0
Souma W, Vodenska I, Aoyama H (2019) Enhanced news sentiment analysis using deep learning methods. J Comput Soc Sci 2:33–46. https://doi.org/10.1007/s42001-019-00035-x
Srivastava SK, Singh SK, Suri JS (2018) Healthcare text classification system and its performance evaluation: a source of better intelligence by characterizing healthcare text. J Med Syst. https://doi.org/10.1007/s10916-018-0941-6
Su Y, Wang R, Chen P, Wei Y, Li C, Hu Y (2012) Agricultural ontology based feature optimization for agricultural text clustering. J Integr Agric 11(5):752–759
Sukhadia A, Upadhyay K, Gundeti M, Shah S, Shah M (2020) Optimization of smart traffic governance system using artificial intelligence. Augment Hum Res 5:13. https://doi.org/10.1007/s41133-020-00035-x
Sumathi N, Sheela T (2017) Opinion mining analysis in banking system using rough feature selection technique from social media text. Int J Mech Eng Technol 8(12):274–289
Talaviya T, Shah D, Patel N, Yagnik H, Shah M (2020) Implementation of artificial intelligence in agriculture for optimisation of irrigation and application of pesticides and herbicides. Artif Intell Agric. https://doi.org/10.1016/j.aiia.2020.04.002
Talib R, Hanif MK, Ayesha S, Fatima F (2016a) Text mining: techniques. Appl Issues 7(11):414–418
Talib R, Kashif M, Ayesha S, Fatima F (2016b) Text mining: techniques, applications and issues. Int J Adv Comput Sci Appl. https://doi.org/10.14569/IJACSA.2016.071153
Tkáč M, Verner R (2016) Artificial neural networks in business: two decades of research. Appl Soft Comput 38:788–804
Ur-Rahman N, Harding JA (2012) Textual data mining for industrial knowledge management and text classification: a business oriented approach. Expert Syst Appl 39(5):4729–4739
Vijayan R, Potey MA (2016) Improved accuracy of FOREX intraday trend prediction through text mining of news headlines using J48. Int J Adv Res Comput Eng Technol 5(6):1862–1866
Vu TT, Chang S, Ha QT, Collier N (2012) An experiment in integrating sentiment features for tech stock prediction in twitter. In: Workshop on information extraction and entity analytics on social media data, COLING, Mumbai, India, pp 23–38
Wang B, Huang H, Wang X (2012) A novel text mining approach to financial time series forecasting. Neurocomputing 83:136–145
Wen F, Xu L, Ouyang G, Kou G (2019) Retail investor attention and stock price crash risk: evidence from China. Int Rev Financ Anal 65:101376. https://doi.org/10.1016/j.irfa.2019.101376
Widiastuti N (2018) Deep learning—now and next in text mining and natural language processing. IOP Conf Ser Mater Sci Eng 407:012114. https://doi.org/10.1088/1757-899X/407/1/012114
Wu JL, Su CC, Yu LC, Chang PC (2012) Stock price predication using combinational features from sentimental analysis of stock news and technical analysis of trading information. Int Proc Econ Dev Res. https://doi.org/10.7763/ipedr
Wu DD, Zheng L, Olson DL (2014) A decision support approach for online stock forum sentiment analysis. IEEE Trans Syst Man Cybern Syst 44(8):1077–1087
Xing FZ, Cambria E, Welsch RE (2017) Natural language based financial forecasting: a survey. Artif Intell Rev 50(1):49–73
Xing FZ, Cambria E, Welsch RE (2018a) Natural language based financial forecasting: a survey. Artif Intell Rev 50:49–73. https://doi.org/10.1007/s10462-017-9588-9
Xing F, Cambria E, Welsch R (2018b) Intelligent asset allocation via market sentiment views. IEEE Comput Intell Mag 13:25–34. https://doi.org/10.1109/MCI.2018.2866727
Xiong T, Wang S, Mayers A, Monga E (2013) Personal bankruptcy prediction by mining credit card data. Expert Syst Appl 40(2):665–676
Xu G, Yu Z, Yao H, Li F, Meng Y, Wu X (2019) Chinese text sentiment analysis based on extended sentiment dictionary. IEEE Access 7:43749–43762. https://doi.org/10.1109/ACCESS.2019.2907772
Yang Li, Li Y, Wang J, Sherratt R (2020) Sentiment analysis for E-commerce product reviews in Chinese based on sentiment lexicon and deep learning. IEEE Access 8:1–1. https://doi.org/10.1109/ACCESS.2020.2969854
Yap BW, Ong SH, Husain NHM (2011) Using data mining to improve assessment of credit worthiness via credit scoring models. Expert Syst Appl 38(10):13274–13283
Young T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learning based natural language processing [review article]. IEEE Comput Intell Mag 13(3):55–75. https://doi.org/10.1109/MCI.2018.2840738
Yusuuf H, Shihabeldeen A (2019) Using text mining to predicate exchange rates with sentiment indicators. J Bus Theory Pract 7(2):60–75
Zavolokina L, Dolata M, Schwabe G (2016) The FinTech phenomenon: antecedents of financial innovation perceived by the popular press. Financ Innov. https://doi.org/10.1186/s40854-016-0036-7
The authors are grateful to Nirma University and the Department of Chemical Engineering, School of Technology, Pandit Deendayal Petroleum University for permission to publish this research.
Authors and affiliations
Department of Computer Science, Nirma University, Ahmedabad, Gujarat, India
Aaryan Gupta, Vinya Dengre & Hamza Abubakar Kheruwala
Department of Chemical Engineering, School of Technology, Pandit Deendayal Petroleum University, Gandhinagar, Gujarat, 382007, India
All the authors made substantial contributions to this manuscript. AG, VD, HA and MS participated in drafting the manuscript; AG, VD and HA wrote the main manuscript; and all the authors discussed the results and their implications at all stages. All authors read and approved the final manuscript.
Correspondence to Manan Shah.
Competing interests
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gupta, A., Dengre, V., Kheruwala, H.A. et al. Comprehensive review of text-mining applications in finance. Financ Innov 6 , 39 (2020). https://doi.org/10.1186/s40854-020-00205-1
Received: 29 January 2020
Accepted: 17 September 2020
Published: 02 November 2020
DOI: https://doi.org/10.1186/s40854-020-00205-1
- Text mining
- Machine learning
- Financial forecasting
- Sentiment analysis
- Corporate finance
Systematic review article
Toward an Ethical Framework for the Text Mining of Social Media for Health Research: A Systematic Review
- 1 Department of Primary Care and Public Health, Brighton and Sussex Medical School, Brighton, United Kingdom
- 2 Population Data Science, Medical School, Swansea University, Swansea, United Kingdom
- 3 Division of Informatics, Imaging & Data Sciences, University of Manchester, Manchester, United Kingdom
Background: Text-mining techniques are advancing all the time and vast corpora of social media text can be analyzed for users' views and experiences related to their health. There is great promise for new insights into health issues such as drug side effects and spread of disease, as well as patient experiences of health conditions and health care. However, this emerging field lacks ethical consensus and guidance. We aimed to bring together a comprehensive body of opinion, views, and recommendations in this area so that academic researchers new to the field can understand relevant ethical issues.
Methods: After registration of a protocol in PROSPERO, three parallel systematic searches were conducted, to identify academic articles comprising commentaries, opinion, and recommendations on ethical practice in social media text mining for health research and gray literature guidelines and recommendations. These were integrated with social media users' views from qualitative studies. Papers and reports that met the inclusion criteria were analyzed thematically to identify key themes, and an overarching set of themes was deduced.
Results: A total of 47 reports and articles were reviewed, and eight themes were identified. Commentators suggested that publicly posted social media data could be used without consent and formal research ethics approval, provided that the anonymity of users is ensured, although we note that privacy settings are difficult for users to navigate on some sites. Even without the need for formal approvals, we note ethical obligations: to actively identify and minimize possible harms, to conduct research for public benefit rather than private gain, to ensure transparency and quality of data access and analysis methods, and to abide by the law and the terms and conditions of social media sites.
Conclusion: Although social media text mining can often legally and reasonably proceed without formal ethics approvals, we recommend improving ethical standards in health-related research by increasing transparency of the purpose of research, data access, and analysis methods; consultation with social media users and target groups to identify and mitigate against potential harms that could arise; and ensuring the anonymity of social media users.
In the last two decades, social media platforms, social networking sites, and internet discussion forums have undergone an exponential increase in users, with 3.48 billion active users in 2019, a 9% increase from the previous year ( 1 ). Facebook continues to be the highest-ranking social media website, with 2.27 billion active users per month, while Twitter attracts around 326 million active users per month ( 1 ). Other sites such as Instagram (nearly 1 billion users) and TikTok are rapidly gaining users, particularly of a younger demographic. Additionally, many discussion forums on the internet provide ways for users to discuss and share their experiences and seek advice from their peers.
Social media platforms are used to disseminate health information ( 2 , 3 ) and are used by health care professionals to interact with and advise users ( 4 , 5 ). In addition, users post information about their health behaviors, experiences, and attitudes, often in publicly open forums ( 4 , 6 ). With 57% of the world's population accessing social media each year ( 1 ), this opens new opportunities for researchers to harvest and analyze data for health research, gaining information regarding people's health on a scale that would have previously been unachievable ( 6 ). This new source of health data may also give researchers access to the views and experiences of people who have traditionally been hard to recruit to research studies ( 7 ). The availability of this information is already being utilized by health researchers, to study adverse drug effects for pharmacovigilance ( 8 , 9 ), flu outbreak surveillance ( 10 ), and mental health monitoring ( 11 ).
The vast amount of data available from social media and patient discussion forums, and the necessity of identifying relevant posts among a large number of irrelevant ones, mean that computer-based text-mining techniques are often used. Natural language processing (NLP) is a branch of computer science that uses sets of rules and/or machine learning algorithms to identify relevant information in written natural language. Usually, algorithms are able to recognize named entities with high accuracy and also assess the context for negation, subject, timing, and hedging ( 12 ). Further work, such as looking for links between drugs and side effects, focuses on extracting relationships between entities within the same sentence or document ( 13 ). Usually, quantitative data are derived from this information extraction, which can then be analyzed statistically, allowing simultaneous analysis of multiple posts or documents. Some research using social media data has been qualitative and thematic, which involves more detailed and in-depth reading and analysis of the full written content ( 14 ).
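As an illustration of the entity-plus-context approach described above, the following is a minimal rule-based sketch in Python. The lexicons, the three-token negation window, and the example posts are all illustrative assumptions for this article, not an implementation from any of the reviewed studies; real pipelines use curated vocabularies or trained models.

```python
import re

# Both lexicons are illustrative assumptions; a real pipeline would use
# a curated vocabulary (e.g., a drug/symptom ontology) or a trained model.
SIDE_EFFECTS = {"headache", "nausea", "dizziness"}
NEGATION_CUES = {"no", "not", "never", "without"}

def extract_entities(post: str):
    """Return (entity, negated) pairs found in a single post."""
    tokens = re.findall(r"[a-z']+", post.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok in SIDE_EFFECTS:
            # Crude context check: a negation cue within the three
            # preceding tokens marks the mention as negated.
            negated = any(t in NEGATION_CUES for t in tokens[max(0, i - 3):i])
            hits.append((tok, negated))
    return hits

print(extract_entities("Started the new med last week, awful nausea every morning"))
# → [('nausea', False)]
print(extract_entities("Three months in and no headache at all"))
# → [('headache', True)]
```

The pairs produced by such a step are the kind of quantitative data that can then be aggregated and analyzed statistically across many posts.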
NLP research uses publicly available text posted on discussion forums and social media; as it usually does not involve any interaction or direct contact with social media users, it is assumed to be ethically low risk. It is also perceived as a “low-stakes” approach for student projects, and therefore, the research community may include many new or inexperienced researchers who are not well-versed in ethical issues. While medical researchers are usually trained in “human subjects” research ethics, computer science researchers may be less experienced with key ethical issues ( 15 ). In addition, previous controversial incidents have raised important questions about the ethics and acceptability of this approach. In 2008, researchers published data that they had collected from the Facebook accounts of an entire cohort of university students over 4 years. This research subsequently came under intense scrutiny when it was discovered that the “anonymous” university used in the study could be easily re-identified ( 16 ). In a second controversial case in 2016, Danish researchers published data from 70,000 OkCupid users, with information including usernames, age, gender, location, and sexual preference being made publicly available with no attempts to anonymize the data ( 17 ).
Such incidents have led to concern that individuals' privacy could be threatened, with only a small amount of effort from an adversary, if harvested datasets are made public. Previous studies reviewing public and patients' views have indicated the fears of harms that patients have when their sensitive, personal health data are used for the secondary purpose of research ( 18 , 19 ). Participants feared that if they were re-identified from their data following unauthorized disclosure or access, this could lead to identity theft, consequences for employment, pension eligibility, increased insurance costs, social discomfort, community embarrassment, unnecessary stigmatizing judgments in clinical settings, or the use of their data for financial gain ( 19 ). Although these studies focused on clinical and administrative data sources, it is possible that social media users may have similar fears about potential consequences of the secondary use of their internet posts.
However, personal content posted on social media platforms and internet discussion forums has been made public to a greater or lesser extent by the content creator and therefore differs considerably from clinical data created by a health care professional in the course of recording a confidential consultation. It is therefore important to have a separate framework guiding best practice for using this type of health data in research. Core research ethics principles for biomedical research, such as those proposed in the Belmont Report ( 20 ) and by Beauchamp and Childress ( 21 ), and for ICT research, such as those proposed in the Menlo Report ( 22 ), are likely to be relevant, as they give overarching principles that apply to many scenarios in health research. Beauchamp and Childress' four main principles have been shown to underpin public or lay thinking about ethical issues in data-sharing for health research ( 19 ).
Previously, researchers have acknowledged the lack of ethical guidance in social media data mining ( 6 , 23 ), and institutional ethics review boards often report feeling ill-equipped to keep pace with rapidly changing technologies used in research ( 24 , 25 ). We therefore aimed to review the literature on best ethical practice in this field to bring together recommendations for text-mining/NLP researchers who are using social media and patient discussion forum data for health research. We cast our net wide for this study, systematically searching both academic and gray literature, and aiming to include social media users' perspectives in our recommendations.
Materials and Methods
A systematic review protocol was registered with the PROSPERO database (CRD42018112923) ( 26 ). Three systematic searches were undertaken in parallel; two are reported using PRISMA flow charts ( 27 ), and the manuscript reporting adheres to the Enhancing Transparency in Reporting the Synthesis of Qualitative Research (ENTREQ) statement ( 28 ). No ethical approval was sought for this study as it involved analysis of previously published reports.
We conducted three systematic searches, two of academic literature and one of the gray literature.
Search 1 and 2: Academic Literature (1: Commentaries, Editorials, Tutorials, and Recommendations; 2: Qualitative Studies Reporting Social Media Users' Views on Social Media Text Mining for Health Research)
Three databases were searched: MEDLINE (Ovid), Scopus, and ASSIA (Proquest). The terms used in the search are given in Table 1 . Truncations and Boolean operators were used to allow for a comprehensive but specific search. The search was limited to results published in English, with no constraint on the country of publication. A date restriction of 2006–2018 was employed, as this was when Facebook was expanded beyond educational institutions and made public to anyone with a registered email address, changing the landscape of online social networking. Results from each search were imported into separate Zotero files and duplicates were discarded. Reference screening was conducted to identify further papers.
Table 1 . Search date, search strings, and databases for three searches.
Search 3: Gray Literature
Gray literature has no single agreed definition, but it is generally recognized as publications and documents not controlled by commercial organizations or publishers, or not collected and preserved in library holdings ( 29 ). We searched for these documents using the Google search engine; search terms are given in Table 1 . The first 60 results were examined.
Inclusion and Exclusion Criteria
Due to the qualitative nature of this research and its exploration of a phenomenon rather than an intervention, the SPIDER tool ( 30 ) was used to construct the inclusion and exclusion criteria for academic articles ( Table 2 ). Following each search, one author (SS) screened each set of articles based on their title and abstract. The remaining articles from each search were then screened using the full text by two authors (SS and EF), with a consensus being met for each article regarding its eligibility.
Table 2 . Inclusion and exclusion criteria for systematic review using the SPIDER tool.
For the gray literature search, articles were accepted if they were published commentaries, editorials, tutorials, guidelines, or recommendations for ethical health or social science research, which used social media data in a passive way (e.g., qualitative analysis of posts; text mining), i.e., not involving recruitment or interaction with social media users, or providing any interventions through social media. Screening was conducted by EF and reports were checked for eligibility by LH. Where this search brought up academic publications, these were examined according to search 1 criteria and, if eligible, were added to the pool of academic articles covering commentaries, editorials, tutorials, and recommendations.
Academic articles reporting qualitative studies on social media users' views were assessed for quality by author SS using the Mixed Methods Appraisal Tool (MMAT) ( 31 ). For a study to be included in the final review, it had to score at least three points, with two points coming from the initial two screening questions. All eligible studies met this criterion.
Data Extraction and Synthesis
The following information was extracted from qualitative studies (where relevant): date, location, and publication type, authors, data, study design, number of participants, research objective, and findings.
For thematic analysis, all articles and documents were imported into NVivo 12, and separate thematic analyses were carried out for commentaries, editorials, tutorials, and recommendations and for social media user studies, following the thematic synthesis principles of Thomas and Harden ( 32 ). These principles allow for transparency and reproducibility due to their detailed methodology. Analysis involved coding all relevant text, line by line, into nodes (initial coding was conducted by SS). Existing nodes were used for subsequent papers where appropriate, and new nodes were created where necessary. Once a full coding and node structure had been completed, nodes were examined and discussed between SS and EF. Nodes were then aggregated into larger descriptive themes, and following iterative refinement and discussion, these were then used to deductively generate the final analytical themes for the results of the study. When synthesizing the findings, priority was given to data that contributed to the formation of a set of ethical guidelines.

Once the two types of academic articles had each been analyzed separately, the themes generated for each set were examined and matched together. Next, the gray literature results were examined in the same way, identifying extracts that related to existing themes from the previous two analyses and coding extracts that related to new themes (coding was conducted by EF, and nodes were examined and discussed by LH). Once all gray literature documents had been coded, a final complete set of themes was agreed by all authors. All articles and documents were then re-read to ensure all content relating to the final themes was extracted.
Search 1: Academic Articles Comprising Commentaries, Editorials, Tutorials, and Recommendations
From 1,690 articles returned by the search and by reference screening, 26 met eligibility criteria and were included in the study. A further 9 articles and a book chapter were identified as eligible from the gray literature search, making 36 articles in total ( Figure 1 ). These were perspective, commentary, or recommendation full articles ( N = 20), literature/systematic reviews ( N = 7), case studies ( N = 5), opinion ( N = 1), conference proceedings ( N = 1), book chapter ( N = 1), and editorial ( N = 1) papers. All papers were published between 2008 and 2019, with 31 of the articles published from 2013. Authors of the papers were geographically spread between the USA ( N = 15), UK ( N = 13), Germany ( N = 2), Switzerland ( N = 2), Canada ( N = 1), France ( N = 1), Australia ( N = 1), and Saudi Arabia ( N = 1).
Figure 1 . PRISMA flow chart for inclusion of articles for Search 1.
Search 2: Qualitative Studies of Social Media Users' Views
A total of 7,402 peer-reviewed articles were identified through the systematic search, with an additional 4 articles identified through reference screening. Of these, four papers met the inclusion criteria and were included in the review ( Figure 2 ). All studies were published after 2012 and were geographically spread between the USA ( N = 2), the UK ( N = 1), and Australia ( N = 1). All studies were qualitative and ranged in the number of participants from 26 to 132, with a total of 232 participants across all studies. All studies published some demographic information, with males and females being well-represented. Three of the four studies included the average age or range of ages for participants, with two of the studies focusing on adolescent populations. Two of the studies focused on specific conditions, one being diabetes and the other mental health. Study characteristics are shown in Table 3 .
Figure 2 . PRISMA flow chart for inclusion of articles for Search 2.
Table 3 . Study characteristics for qualitative studies of social media users.
Search 3: Gray Literature
The separate search strings produced a range of different and overlapping results, so the full number of results screened is not available. After examining 12 reports in detail, 7 were eligible. From the gray literature search, a further nine academic articles and one book chapter, not picked up by search 1, were also found to be eligible and added to the pool of articles from search 1. The seven included gray reports were dated between 2012 and 2016; five were authored by researchers in the UK and two had international authorship.
A total of 47 articles were therefore included for thematic analysis, from the three sources: 36 commentaries, 4 qualitative studies, and 7 gray literature reports. During the thematic analysis, four themes were identified from the qualitative studies of user views: private vs. public data; consent; anonymity; and research for public benefit. Six themes were separately identified from the academic articles comprising commentaries, editorials, tutorials, and recommendations; these were private vs. public data; consent; anonymity; weighing harms against benefits; necessity for ethical approval for studies; and governance of data, annotations, algorithms, and linkage. Three of these themes overlapped, giving seven from the academic literature. These themes were largely supported by the gray literature, and an eighth theme was identified from these reports: legal issues and terms and conditions of social media sites. The reports and articles contributing to each theme are shown in Table 4 .
Table 4 . Contribution of each article or report to the eight themes.
Private vs. Public Data—To Whom Does the Data Belong?
One of the leading themes to emerge from the included articles was the argument over whether social media posts should be considered public or private data. Many commentators argued that once data have been openly posted on social media, they become part of the public domain, which dismantles the expectation of privacy and implies consent for the use of the data for any purpose ( 45 , 51 , 52 , 56 , 57 , 61 , 69 ).
“ However, those posting information do so knowing it is an open public forum; therefore, researchers may conclude that consent is implied for the use of the postings for any purpose.” ( 57 )
However, different social media sites give different levels of public access to content; for example, access to Facebook content can be restricted to predefined “friends,” whereas public Twitter posts are viewable to anyone, even people without a Twitter account. This leads researchers to question how they should approach the posted data ( 61 ).
“ The nature of new media itself blurs the boundary between public and private. Should what is posted be considered the equivalent of a personal diary or a newspaper? In some cases, the answer may be obvious but often it will not be.” ( 61 )
The ESOMAR guidelines suggest that:
“ Researchers should not copy or scrape content within private areas, even if they have permission of the site owner. If researchers do so, it should be made clear to all users that this is happening and they should provide individuals with a process to be excluded from such data collection.” ( 38 )
Adding to the complexity of blurred boundaries, public opinion should also be considered. Some social media users believed that information posted on the internet is in the public domain, thus removing the need for consent, and were also likely to think that they had forfeited the right to privacy ( 34 ).
“ I don't pay to use Twitter. I sort of signed up with the expectations that it's a free site and you just kind of throw things out publicly, [so] I don't really have an expectation that anything that I post is going to remain private.” (Control Group, 29, male) ( 34 )
The concept of privacy varied for individuals and was often framed by generational, cultural, and social norms ( 37 , 58 ). Additionally, what is considered public and private is continuously changing, even within individual conceptions ( 37 , 59 ). One issue that arose was the idea that just because something has been published, it does not mean the user expects their content to be re-used for any other purpose.
“ People may operate in public spaces but maintain strong perceptions or expectations of privacy. Or, they may acknowledge that the substance of their communication is public, but that the specific context in which it appears implies restrictions on how that information is – or ought to be – used by other parties .” ( 37 )
“ Data may be public (or semi-public) but this does not simplistically equate with full permission being given for all uses .” ( 49 )
Assuming that content posted in public implies consent for re-use also assumes that social media users have had agency over the settings on their social media account. However, there is strong evidence that users lack knowledge of social media privacy settings ( 46 , 58 , 60 ), which can be difficult to navigate and lack transparency ( 46 , 58 ), and that users are not always adequately equipped with the digital knowledge to operate those settings and protect their data. Just because a post is public, therefore, the researcher cannot assume the creator of the post deliberately made it public ( 46 ).
“ For example, Facebook's privacy settings are problematic because they are opaque and dependent on the user's self-education.” ( 46 )
“ It was felt that Facebook often blurs the lines between what is public and private, and this lack of distinction is not made at all clear to Facebook users, e.g. the fact that ‘private' posts can be made public by re-posting .” ( 41 )
“ One can only properly grasp how to maximize privacy by knowing precisely what the default settings imply and what the users have signed up for … However, these privacy settings are not particularly user friendly.” ( 58 )
Some participants expressed a lack of understanding of how privacy settings work and believed that they should not be forced into research because of that ( 35 , 36 ). In Moreno et al. ( 36 ), participants expressed confusion about whether their Facebook settings were private or public and, after discussion in the group, made comments such as:
“ Yes, so that means my Facebook is public right now? I don't want that.” (Facebook user) ( 36 )
“ I guess I'm surprised because I thought it was private.” (Facebook user) ( 36 )
“ There are quite a few people who are late to join to Facebook or are of a generation who don't know how to use it … they shouldn't be punished for that.” (Focus Group 7) ( 35 )
This sense of confusion may be especially true for users of Facebook, where users can control to some extent which other users can see their posts, and may belong to closed groups for which there is a reasonable expectation of privacy. For other types of social media where public accounts are the norm, such as Twitter, there is likely to be much less expectation of privacy, especially as tweets written by celebrities are regularly reposted in mainstream media. However, even with no expectation of privacy, misconceptions were also evident among users about how social media works regarding permanence of posts, how far back data can reach, who can access the posts, and how data can be analyzed ( 34 ).
“ I would say definitely . < chuckles > Maybe it's because I'm young, so I started into social media when I was younger, like really young. So every once in a while, I'll go through [and delete].” (Control Group, 21, female) ( 34 )
“ I would say most of the time I'm not afraid to rock the boat. But I mean, Twitter won't let you scroll back that far, so I'm not super concerned.” (Depression Group, 20, male) ( 34 )
“ Are you naïve enough to think that your public tweet is going to be seen by like a million people? I mean sure, it's public. Anyone could go and find it, or search for it, or whatever. I mean, but it's not like Beyoncé tweeting is the same as me tweeting.” (Depression Group, 54, male) ( 34 )
However, participants in qualitative studies expressed that consent was implied and the data should not be considered private if posters had failed to protect the data through privacy settings ( 34 – 36 ).
“ And I don't think it's bad that you went and looked at people's profiles, ‘cause if they have them open, it's their choice.” (Facebook user) ( 36 )
“ I think it's the own person's fault for having a public profile because it's out there.” (Focus Group 8) ( 35 )
This assertion becomes especially difficult as Facebook and other platforms have been known to periodically redesign their privacy settings. Even if users are required to formally accept changes that may affect their existing privacy settings, there is no guarantee that they have comprehended, or even read, the updated terms and conditions. Thus, users may remain unaware of such changes. This means researchers should consider that users may not be in full control of their privacy settings and that the settings users have chosen (or left) may not reflect their actual perception of privacy.
Consent—Should Users Be Asked?
Following directly from the debate on whether data could be considered public or private was the question of whether social media users should be asked for consent to re-use their posted content for the new purpose of health research. Commentators were divided between arguing for and against acquiring consent from individual users ( 46 , 71 , 75 ).
Many researchers followed the established notion that consent was not required to collect information already in the public domain ( 45 , 56 , 72 ). This position followed from the assumption that data mining is a form of secondary data analysis of publicly available material; therefore, as long as the data are freely available and log-in details are not required to view the data, consent is not required ( 39 ). Also discussed was whether social media users were “participants” or “authors of public written content” ( 38 , 47 ). Commentators also considered the practicality of gathering consent, noting that opt-in consent from large numbers of users is impractical.
“ Individual informed consent is impractical for research involving large datasets. In these cases researchers should ensure data use is in line with terms and conditions and care should be taken to protect the identity of users .” ( 39 )
In all qualitative studies of social media users' views, there were participants who felt consent was not required. Some participants acknowledged that the internet is a public domain, and therefore posted data are freely available for anyone to see and use, including researchers ( 33 – 36 ).
“ That's a good way to do [the study]. Because if people are publicly showing their pictures, then it's, like, open for anyone to see.” (Facebook user) ( 36 )
“ As the information is posted on the very public Internet, I don't think there is a need for permission to use the posts.” (Participant 10) ( 33 )
Although this second quote indicates that some users feel the internet is public, not every user may agree, especially given the nuance of difficulty over privacy settings described above. The contrasting view was that users' consent should be solicited. Some participants in the qualitative studies felt that consent should be required ( 33 – 35 ), rationalizing this because the data would be used for a purpose other than the one the user originally intended ( 35 ).
“ I reckon they should ask first ‘cos they have only posted on a public site like you are posting it for people to see not for them to take the information.” (Focus Group 5) ( 35 )
Social media users acknowledged that consent for the re-use of their posted data was often granted through acceptance of the social media website's terms and conditions ( 34 , 35 ), and this was often the view taken by commentators. However, these policies are often not read by users, especially adolescents ( 35 ). Furthermore, participants felt that the terms and conditions policy was inaccessible, as the blanket language used was often difficult to understand and lacked transparency ( 34 ), and commentators agreed that signing terms and conditions did not constitute informed consent as understood in traditional research methodologies ( 43 ).
“ When consent is sought through a terms and conditions document upon entry to a particular social media site, young people may be unlikely to read it; “I don't think anybody reads them.” (Focus Group 1) ( 35 )
Respondents did not feel as though simple blanket language in the “terms and conditions” constituted transparency. Such language was confusing and buried in what one participant terms, “a wall of text that no one ever reads.” (Twitter user) ( 34 )
“ Whilst there is a fair and lawful process for analysing social media data on quantitative scale, this is not synonymous with user consent to be included in a research project.” ( 43 )
Where data are not freely available without a log-in (e.g., closed Facebook groups), then there was agreement that consent should be sought before their data are used, as there is an expectation of privacy in these groups. However, a policy of gaining consent in these cases may be considered unworkable, given the logistical difficulties encountered with so many potential participants ( 66 , 73 ). A suggestion attempting to mitigate the need for individual consent is that a “group consent” can be acquired, with or without an opt-out for members of the group ( 66 ); however, this too has its critics:
“ Even when conducting research on a large community that possesses a distinctive identity, such as the black community or the cancer-survivor community, obtaining the consent of the group as a whole is futile. Who can truly speak on behalf of the group of cancer survivors? This is an unanswerable question.” ( 62 )
There was some consensus that consent should be gained for use of data posted by vulnerable groups, including children. Children are one group that often require special forms of protection. This is often sought by gaining the consent of parents or guardians, as children do not always have adequate decision-making capacities. This poses a greater level of difficulty in the realm of online research, as adolescents can feel insulted at the notion of having to ask their parents for permission ( 69 ). Children may also be less likely to understand the full implications of posting content publicly and its possible reach ( 62 ). Also, it is not always possible to identify that users may be underage or vulnerable, so again, a blanket policy is not possible. Rather, the most appropriate form of consent is best decided for each individual study.
One further issue is that of deleted posts. Users can subsequently delete their posts from the social media platform, and this could happen after the post has already been captured in the research. This might imply that consent for use of posts has been withdrawn. Researchers should plan in advance how to manage this issue. One plan might be to check that all quotes of individual posts still exist prior to publication of the results. If the quoted post has been deleted, it should be removed from the report ( 39 ).
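The pre-publication check described above could be sketched as follows. The function names, the injectable `fetch` hook, and the reliance on HTTP status codes are assumptions for illustration; where a platform offers an API for retrieving posts, that would be a more reliable signal of deletion than a raw HTTP check.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def post_still_exists(url: str, fetch=urlopen) -> bool:
    """Return True if the post URL still resolves.

    `fetch` is injectable so the check can be exercised without network
    access; relying on HTTP status alone is an assumption, and platform
    APIs would be more reliable where available.
    """
    try:
        with fetch(url) as resp:
            return getattr(resp, "status", 200) == 200
    except (HTTPError, URLError):
        return False

def filter_quotes(quotes, exists=post_still_exists):
    """Drop quoted posts that have been deleted since capture,
    treating deletion as withdrawal of consent to be quoted."""
    return [q for q in quotes if exists(q["url"])]
```

Running such a filter immediately before submission would implement the recommendation that deleted posts be removed from the report ( 39 ).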
Anonymity—Users Should Not Be Identifiable
There was a consensus throughout reports and articles that researchers have an important duty to maintain the anonymity and protect the identity of posters of social media content ( 65 ), throughout the analysis and especially in any publication of results. Participants in qualitative studies also highlighted the importance of ensuring the anonymity of social media users when posts were re-used for research purposes ( 33 – 36 ). Some participants felt that as long as the anonymity of the poster was assured, consent to use the data was not needed ( 33 ).
“ If you're using the data in some kind of statistical analysis – and not quoting directly the posting then I'd say no permission is probably needed.” (Participant 10) ( 33 )
“ As long as it's de-identified, that's all I really care about.” (Focus Group 1) ( 35 )
Throughout the literature, commentators have suggested various ways to achieve anonymity. For example, the data should be locally encrypted ( 58 ), identifiable information should be removed prior to publication ( 61 ), users' identity should be hidden through disguise (e.g., using pseudonyms or synthetic quotes) ( 59 ), and data aggregation methods should be applied ( 53 ). Participants in qualitative studies offered suggestions of acceptable ways to ensure their anonymity was maintained. These included aggregation of data ( 34 ), making generalizations, and removing identifiable information ( 33 ).
“ I'm OK as long as we can, you know, figure out ways to keep the data anonymous and completely, highly aggregated.” (Depression Group, 47, male) ( 34 )
“ Few people would have a problem with generalised and anonymised references.” (Participant 17) ( 33 )
“ [Permission isn't needed] as long as you don't identify the poster by more than sex, age, and type of diabetic.” (Participant 20) ( 33 )
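The identifier-removal step suggested by commentators and participants might be sketched as below. The placeholder labels and regular expressions are illustrative assumptions; on their own they would not constitute adequate de-identification, since names, places, and rare combinations of attributes also need review.

```python
import re

# Ordered so email addresses are masked before @-handles would split them.
# These patterns are illustrative assumptions, not an exhaustive
# de-identification scheme.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("HANDLE", re.compile(r"@\w+")),
]

def scrub(post: str) -> str:
    """Replace direct identifiers with placeholder tokens."""
    for label, pattern in PATTERNS:
        post = pattern.sub(f"[{label}]", post)
    return post

print(scrub("Thanks @drsmith, details at https://example.com/me"))
# → Thanks [HANDLE], details at [URL]
```

A step like this would typically be combined with the aggregation and generalization strategies participants described, rather than used alone.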
In the social media context, there may be a greater risk to an individual's confidentiality and anonymity compared to conventional research because search engines may be able to locate the original post of a user when a key phrase or quote is published in a research article ( 37 , 41 , 74 ). This could allow for identification of the user's personal profile, which opens the possibility for them to be contacted via personal messages through the website ( 51 , 56 , 61 , 63 , 68 , 69 ).
To avoid this risk, commentators have suggested avoiding the use of direct quotes from a user's post ( 38 , 51 , 63 ) or paraphrasing quotes if it is felt necessary to include this type of data in the writeup ( 42 , 56 , 59 , 61 ). A third possibility is to synthesize quotes to illustrate a finding, based on patterns within the data rather than any individual post.
“ To better maintain the principle of respect for person while presenting verifiable data, we recommend that researchers paraphrase users' comments .” ( 56 ).
Participants in qualitative studies also identified the threat from the use of direct quotes from participants' posts ( 33 ).
“ If you want to use actual quotes from people that's a different matter as even if you make the quote anonymous in your research it will be quite easy to find the author simply by typing in key phrases into Google which will then give links back to [the forum].” (Participant 29) ( 33 )
The ESOMAR guidelines recommend how to deal with this problem:
“ If researchers wish to quote publicly made comments, they must first check if the user's identity can be easily discoverable using online search services. If it can, they must make reasonable efforts to either seek permission from the user to quote them or mask the comment to such an extent that the identity of the user cannot be obtained.” ( 38 )
There have also been concerns raised that even data aggregation is not fool-proof in terms of disguising group identities. For example, in one case study, an adversary found the identity of an anonymized university, when only aggregated data were presented. This was due to the uniqueness of the information given about the university that the participants attended. If the course and year group of students were also identified, it would then be a short step for an adversary with inside knowledge to re-identify individuals [( 63 ), referencing ( 16 )].
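One common safeguard relevant to this risk is small-cell suppression before publishing aggregates. The function and the threshold of 5 below are an illustrative assumption, and, as the case study above shows, suppression reduces but does not eliminate re-identification risk when the remaining categories are themselves distinctive.

```python
def safe_aggregate(counts: dict, k: int = 5) -> dict:
    """Suppress categories with fewer than k members before publication.

    The threshold k = 5 is an illustrative assumption; small-cell
    suppression reduces, but does not eliminate, re-identification risk.
    """
    return {group: (n if n >= k else "<k") for group, n in counts.items()}

print(safe_aggregate({"University A": 120, "University B": 3}))
# → {'University A': 120, 'University B': '<k'}
```

Even with suppression applied, reviewers would still need to ask whether the published groupings, taken together, are unique enough to be re-identified by an adversary with inside knowledge.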
The importance of protecting anonymity was of greater concern to certain populations, particularly those who live with a stigmatized diagnosis or who are part of a vulnerable group (such as adolescents) ( 34 , 35 ). Stigmatization and bullying were two key concerns:
“ Once you've got the taint of depression – mental illness at all in our society, it's an uphill battle. Even now, people in my family are like, ‘Oh, you sound cranky. Have you taken your meds?'” (Depression Group, 33, male) ( 34 )
“ De-identification of social media posts was crucial to minimise negative ramifications; “If you do write something on there, it is going back to maybe someone you know, you could get bullied for that reason.”” (Focus Group 3) ( 35 )
The ESOMAR report also lists features of social media content for which it may be especially important to protect user identities: if the topic being discussed is sensitive or personal; if abusive language is used; if it includes anything against the law; if it includes anything embarrassing or likely to impact career opportunities; and if it includes any personally identifiable information, or data about others, that is not already public ( 38 ).
Weighing Harms Against Benefits
“ Researchers have an obligation to avoid causing physical, emotional or psychological harm to participants.” ( 40 )
According to Beauchamp and Childress' principles of research ethics, researchers have an obligation to ensure no harm comes to participants and that the research will have potential benefits for the target group ( 21 ). The risk of harm is greatest when “ a social media user's privacy and anonymity have been breached ” ( 42 ). After considering and minimizing the risks of a breach in confidentiality, researchers must also identify any further sources of harm that their study could precipitate ( 37 , 38 , 41 , 42 , 46 ). This is often difficult for researchers mining social media data, as they need to predict, ahead of publication, sources of harm that are often not yet apparent ( 74 ).
“ Is the project a potential source of harm? … it may be difficult to identify “harm.” Researchers have to be thoughtful about any potential harm that their research might incur by being sensitive to the content extracted from social media websites, the degree and context of content exposure, and the authenticity of the material used.” ( 46 )
“ Does the connection between one's online data and his or her physical person enable psychological, economic, or physical harm?” ( 37 )
Throughout the literature, commentators identified several sources of harm that should be considered by online researchers. Possible harms include the following: blurring of personal and professional boundaries; creating a culture of mistrust between researchers and users, which can arise when data are taken without consent ( 54 ); leaving the user at risk of “abuse” ( 40 ); embarrassment, reputational damage, or prosecution ( 42 ); and abusing the power imbalance that can exist between researcher and user ( 56 ).
“ Power, especially the power differential between the researcher and the researched, must be considered. Where the power imbalance is abused, there is a significant threat to justice and the potential for harm.” ( 56 )
“ Participants felt that being identifiable in research could lead to unsolicited attention online and, more seriously, ‘abuse'. This might be from people they knew, or from organisations that could ‘exploit' them.” ( 40 )
Special care to assess risks and benefits must be taken in certain situations, including with vulnerable groups such as children and with sensitive issues that could lead to stigmatization ( 52 , 60 , 74 ). Harms could result from users being re-identified within the research, or from the publication of sensitive findings that could harm an entire group ( 58 ). This places a responsibility on researchers to consider methods to ensure that individuals' anonymity is maintained ( 60 ), along with being mindful of the generalizability of their findings ( 52 ).
“ Vigilance is required when conducting social media research on sensitive topics that might permit the identification of participants, resulting in stigmatization; the dissemination of findings that could harm an individual or social group; challenges to an individual's values or beliefs; and instances of bullying and abuse. Such research risks inducing or exacerbating emotional distress.” ( 58 )
When re-using potentially personally identifiable data, the benefits of the research must be justified ( 46 , 51 ), especially since the introduction of the General Data Protection Regulation (GDPR), which places a legal responsibility on European researchers to justify the benefits their research will provide ( 58 ). Researchers should consider the risk–benefit relationship within each study, so that information discovery that can contribute to patient care and well-being can proceed while causing minimal harm to users ( 56 ). One way of ensuring a thorough consideration of risks and harms, which may not be obvious to researchers without lived experience of a condition, is to include social media users from the target population in the study team, or to consult with them before the research is conducted.
Research for Public Benefit
An overarching theme that emerged from the qualitative studies with participants was an altruistic view that if the research was being done for the greater good, and not for commercial gain, then many people were in support ( 33 – 36 ). Some participants were even willing to put privacy concerns aside for the greater good of the research ( 34 ).
“ Well, I mean Facebook is pretty much open to anyone, so as long as it's not for a bad intention I think it's fine.” (Facebook user) ( 36 )
“ When people post on the Internet, it is there for all to see. They should not complain if it can be harvested and used for the general good.” (Participant 12) ( 33 )
“ I can't be in a position to know all the possible things that someone could come up with, all the beneficial things, all the harmful things. I think [it represents one-percent of the issues], the whole array of things that are possible shouldn't be stopped because we're so overly worried about [privacy].” (Depression Group, 54, male) ( 34 )
“ ‘It could well be of benefit to the, you know, the people who deal with these kind of things that, good information about domestic violence. And if there's no risk to the individuals [whose information is being used] then it's probably a good thing.'” (Male, age 61 + , High User) ( 40 )
This was felt strongly by some adolescent participants, who also felt that the research had to be done by reliable organizations, so that their words were not taken out of context ( 35 ).
“ Like having a trustworthy organisation that we know you're not going to like spin our words and make us look like bratty teenagers who just post because we can. Like try and understand it from our point of view I guess.” (Focus Group 1) ( 35 )
This contrasted with research being done “for profit” by private companies or to drive an agenda.
“ Research being conducted by a not-for-profit organisation, rather than for ‘commercial' reasons, was preferred for two reasons. Participants who preferred not-for-profit researchers to commercial organisations did so because the former were felt to be more ‘productive', more ‘ethical' and ‘not exploitative'. The second reason not-for-profit researchers were preferred is because participants did not like to think of their social media posts being used to generate a profit for others.” ( 40 )
Ethical Approval for Studies—Is It Needed?
Throughout the papers, there was mixed opinion regarding whether research ethics committee (REC) approval was required for studies that mined publicly available data from social media sites. A systematic review of mental health social media research papers identified that only 9 out of a total 48 papers gained REC approval, and a further 2 used a dataset that had ethical approvals in place ( 78 ); no obvious methodological differences were reported between these studies and those that did not seek ethical approvals.
A key consideration in determining whether social media mining requires REC approval is whether the social media users who posted the content are considered human research subjects, and therefore participants in the research, or whether their posted content can be treated as stand-alone public data of which they are authors, in which case the research can be considered a secondary data analysis ( 37 , 47 ).
“ In internet research, ‘human subject' has never been a good fit for describing many internet-based research environments.” ( 37 )
Some researchers feel that the process of mining social media data is analogous to observing people in a public place, such as a park, though they acknowledge that this perspective may not be shared by everyone.
“ The researcher believes that … individuals are aware that they are in the public sphere and potentially being observed. She seeks research ethics consultation because she recognizes that others may feel that viewing of publicly available Facebook pages is qualitatively different from observing unknown people in a park, for example.” ( 73 )
On the other hand, some researchers have attempted to define more clearly the meaning of human subjects research by applying a legal definition to the term “human subject.” With this definition excluding the authors of social media content, it is then suggested that this form of passive data use is exempt from the REC approval process, especially if the researcher is not interacting with users or publishing identifiable information ( 55 , 59 , 63 , 76 ).
“ If the following conditions are met: access to the [social media websites] is public; information is identifiable, but not private; and information gathering requires no interaction with the person who posted it online; then presumably the proposed project does not constitute human subjects research.” ( 63 )
“ Since most social media projects rely on publicly available data, and do not include interventions or interactions with the population, they may qualify for IRB [institutional review board] exempt status” ( 47 )
Where researchers do seek ethical approvals from a committee, they should not assume that this absolves them of considering all the ethical issues around the project themselves (such as user anonymity, and risk/benefit ratios):
“ Many ethics boards do not understand the processes of mining and anonymizing Big Data, let alone the errors that can cause data to become personally identifiable. Accountability requires rigorous thinking about the ramifications of Big Data, rather than assuming that ethics boards will necessarily do the work of ensuring that people are protected .” ( 49 )
Legal Issues and Terms and Conditions of Sites
Fewer articles mentioned this theme, and it was found most commonly in articles and reports found through the gray literature search. While automated technological tools “ can collect, clean, store and analyse large volumes of data at high velocity ” ( 39 ), researchers are not always permitted to scrape data in this way by social media sites' rules ( 50 ). Researchers should be clear on the constraints within social media sites' terms and conditions and should make sure they are operating within the law.
“ It is important for researchers to take the time to read user agreements for social media platforms as they govern what practices are permissible and provide guidance on publishing posts.” ( 44 )
Several guidelines advise researchers to abide by the regulations of the website that they are mining data from ( 37 , 40 , 42 , 72 ) and by the law ( 38 ), and to identify what users consented to at the time of data capture ( 41 ). For example, if the research is being conducted on Facebook, its terms and conditions explicitly state that researchers should both inform and gain the consent of the people from whom they collect data ( 58 , 63 ).
“ Facebook's Statement of Rights and Responsibilities now states that, when collecting users' personal data, one must obtain consent, make it clear who is collecting the information, and declare how the data will be used.” ( 58 )
By agreeing to Twitter's terms of service, users consent to their information being collected and used by third parties ( 44 ). Researchers using Twitter data often justify their collection and analysis of Twitter posts on the grounds that users have signed this agreement. However, Twitter's terms and conditions do not allow scraping or downloading of tweets (instead, researchers should use an approved Twitter API); researchers who scrape data may therefore be in contravention of them ( 44 ). A further consideration is that Twitter users retain the “right to be forgotten” ( 77 ), which complicates the publication of direct quotes, especially without consent, as these cannot easily be removed from peer-reviewed publications. Other sites such as Reddit also have official APIs for accessing posts and associated metadata for use in research.
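As an illustration, retrieving public posts through an official interface rather than scraping rendered pages might look like the sketch below, which uses Reddit's public JSON listing format as an example. The endpoint path, User-Agent string, and field names here are assumptions for illustration; researchers should check the site's current API documentation and terms before collecting data.

```python
import json
import urllib.request


def fetch_listing(subreddit: str, limit: int = 25) -> dict:
    """Request recent posts from a subreddit via Reddit's public JSON
    endpoint. Reddit asks for a descriptive User-Agent; the value below
    is a placeholder to be replaced with real project/contact details."""
    url = f"https://www.reddit.com/r/{subreddit}/new.json?limit={limit}"
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "health-research-study/0.1 (contact: researcher@example.org)"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def extract_posts(listing: dict) -> list:
    """Keep only the fields needed for analysis from a listing payload,
    rather than storing the full records returned by the API."""
    children = listing.get("data", {}).get("children", [])
    return [
        {"id": c["data"].get("id"), "title": c["data"].get("title")}
        for c in children
    ]
```

Working through a documented endpoint in this way keeps collection within the site's published rules and makes the provenance of the resulting dataset auditable.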
Governance of Data, Annotations, Algorithms, and Linkage
An emerging theme, populated only by the most recent articles and those with an NLP focus, suggested that researchers should focus on transparency of methods and good governance practices with regard to datasets, annotations, and linkage.
There is a tension over the datasets created by harvesting social media data: arguments have been made both for openly sharing datasets and annotations, and for protecting datasets in case they contain potentially sensitive data. Datasets carrying a risk of user identity disclosure may, for example, be placed on a protected server ( 47 ).
“ We strongly encourage researchers to share datasets and annotations they have created so that others can replicate research findings and develop new uses for existing datasets… However, where there may be risk to users, data should not be shared blindly without concern for how it will be used.” ( 47 )
While sharing annotated datasets substantially reduces the burden on other researchers to create and annotate new sets, annotations of the data should also be considered as potentially sensitive:
“ Domain experts may manually label users for different medical conditions based on their public statements. These annotations, either manually identified or automatically extracted, may be considered sensitive user information even when derived from public data.” ( 47 )
It is recommended that annotations be stored separately from the raw data where possible.
To reduce the likelihood of re-identification of, or harm to, users, it is recommended to remove the author's name and @tag from the dataset; strip out other named persons and place names; remove metadata such as geolocation; generalize data such as locations to broader categories (e.g., city rather than street name); and identify where “ the need for creating derived characteristics is crucial to a project, and not running these algorithms as standard” ( 43 ).
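A minimal sketch of these de-identification steps, assuming a hypothetical post schema (field names such as `author`, `geolocation`, and `region` are illustrative, not any real platform's), might look like:

```python
import re

MENTION = re.compile(r"@\w+")       # @tags of the author or other users
URL = re.compile(r"https?://\S+")   # links that may point back to profiles

# Allowlist of fields permitted to leave the raw dataset; everything not
# named here (author, geolocation, profile URL, ...) is dropped by default.
ALLOWED_FIELDS = {"text", "created_year", "region"}


def deidentify_post(post: dict) -> dict:
    """Copy only allowlisted fields and mask mentions and links in the text.

    `region` stands in for an already-generalized location (e.g., city
    rather than street); precise coordinates are never copied.
    """
    cleaned = {k: v for k, v in post.items() if k in ALLOWED_FIELDS}
    if "text" in cleaned:
        cleaned["text"] = URL.sub("[link]", MENTION.sub("@user", cleaned["text"]))
    return cleaned
```

An allowlist is deliberately used instead of a blocklist: new identifying fields added to the source data are excluded by default rather than leaking through silently.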
A further level of transparency is needed around the data processing and information extraction methods used in the research. While data science methods can be couched in claims of objectivity, researchers should be aware that biases may be introduced by their algorithms for case identification, entity recognition, or relationship extraction. It is also important that these methods are made available to the public and to social media users in a culture of transparency:
“ Machine learning algorithms do not learn perfectly predictive models. Errors and misclassifications will be made, and this should be accounted for by the researcher.” ( 47 )
“ If using software that enhances Twitter data, ensure algorithms that derive characteristics and classify users are open to researchers and research subjects. The accuracy, configurations and threshold settings of algorithms should be made public to ensure reproducible research.” ( 77 )
“ Many are not aware of the multiplicity of agents and algorithms currently gathering and storing their data for future use. Researchers are rarely in a user's imagined audience.” ( 49 )
Special caution should be used when using “off-the-shelf” algorithms, which are not well-understood by the researcher:
“ Data scraping without context may result in potentially inaccurate algorithms that may get reported and reused in application, leading to potentially harmful consequences.” ( 50 )
Researchers should be very cautious when linking data across sites or to other data sources. While users may share data publicly on multiple platforms, they may not intend for combinations of data across platforms to be public ( 61 ). Caution should especially be used if trying to identify the same user posting on separate sites or platforms, as they may not wish to be identifiable on all platforms (e.g., on Twitter vs. an anonymous patient forum). Given the high likelihood of making individuals more identifiable by linking data across different sources, REC approvals should be sought for this activity.
Discussion
Our review demonstrates key ethical issues in approaching text mining of social media data for health research and is relevant to all NLP and text-mining researchers who engage in this endeavor. Like previous reviews and guidelines, we have shown the existence of a complex intertwined matrix of ethical considerations around the use of social media data for research purposes. We have extended previous work by showing some themes that are specific to analysts and computer scientists who employ algorithms and other methods for processing multiple documents automatically.
The key issue for academic researchers, which may determine whether consent or ethical review is needed, is reaching consensus on whether social media users are considered human subjects within the research, or whether their data are public data of which the researcher is undertaking a secondary analysis. Many commentators held the view that much social media content is in the public domain, its public or private nature being controlled by the user via their account privacy settings. Therefore, if users freely display data in an observable public domain, they relinquish the privacy rights over those data, and the data can be considered public. However, this should be balanced by researcher consideration of how easy it is for users to control their privacy settings on the platform from which the data are being collected.
Informed consent is a voluntary agreement by a participant with mental capacity who understands the full consequences of participating in research. It is one of the most critical aspects of research ethics ( 79 ), which likely accounts for the strong theme discussing issues around consent in this review. The principles of informed consent are embedded in the most influential research ethics reports, such as The Nuremberg Code, The Declaration of Helsinki, and The Belmont Report ( 20 ). In the case of social media data being considered public, experts generally agreed that consent was not needed, except in special cases involving vulnerable populations and sensitive issues.
Another issue raised in the arguments against gaining consent was that it is often impractical to do so when such large datasets are being used. Some previous commentators have discussed whether impracticality in gaining consent can be considered a justification for not doing so. This has been discussed in relation to clinical trials of emergency treatment, for example, but also in circumstances where requiring consent would reduce inclusion in the study to the most engaged or empowered sectors of society, as is likely with big datasets in health ( 84 ). This affects data quality, biases participant selection, and may leave the most disenfranchised sectors of society under-researched. In addition, it has been noted that users with health conditions actively manage how their identities are presented online, with privacy in mind ( 85 ). Those who suspect their data may be used to infer their health status (e.g., for targeted advertising) may deliberately deploy strategies to mask their underlying condition, such as filtering or changing the content or frequency of posts ( 86 ). This again will have an impact on data quality and reliability.
Despite data often being considered public, experts and users expressed the importance of ensuring the anonymity of social media users in project publications, and acknowledged that this can be difficult to achieve if full direct quotes are printed. Therefore, researchers should take appropriate precautions to ensure anonymity by employing methods such as removing identifiable information, disguising identities, aggregating data, making generalizations, and creating synthetic quotes. Anonymization of the data also contributes to the argument that consent is not needed: the 2001 bulletin of the Declaration of Helsinki states that “ Medical research involving human subjects includes research on identifiable human material or identifiable data” ( 87 ); thus, removing, or not collecting, users' identifiers supports the determination that social media mining does not fit the criteria of human subject research.
Even if, under certain conditions, social media mining does not fit the mold of human subjects research, this does not mean it is free of ethical issues. The issues involved may not be addressed by traditional ethical frameworks ( 88 ), and new, creative ways of thinking about research ethics may be needed. One potential approach is the Ethics of Care Theory ( 89 ), a theory developed from feminist thought on the morals of caring for others. Unlike more traditional bases of ethics, it focuses on the relationships in the research, acknowledging the differences in power that exist along with vulnerability. It also emphasizes the ethical decisions that need to take place in project design ( 90 ).
In line with this theory, our findings suggest that researchers should consider and identify potential harms that could arise from the use of social media data, whether these relate to individuals whose data was used, or to a patient group as a whole (such as those posting about stigmatized issues), or to the relationships and trust between social media users and the research community. Within our findings, there was reference to special consideration that should be awarded to vulnerable groups and stigmatized issues, but it can be difficult to define what these terms encompass. More thought should be given to defining these groups, but it is likely that researchers will need to be reflective and determine this on a project-by-project basis. Including social media users with lived experience of the conditions under study within research teams will help researchers to be more reflective about the potential for stigmatization or harm, which could result from their work. Harms should ideally be explored prior to research commencing, by consulting with all relevant stakeholders. To ensure widespread approbation of the research, researchers should make it clear when the primary goal is public benefit, rather than for private profit or to further an agenda, and the routes to benefits should be made transparent in the dissemination of the research.
Research quality and transparency are additional issues that contribute to the overall ethical nature of scientific studies. It is especially important that text-mining researchers ensure the quality of their work by developing or using high-quality techniques and by carefully examining any limitations or biases in their understanding of the data and the context of its production, and in the transparency and quality of their data-scraping and analysis methods. A lack of care and rigor in these elements would render a study scientifically questionable even apart from other ethical issues ( 50 ).
Strengths and Limitations
We identified an extensive collection of peer-reviewed articles, commentaries, editorials, and gray literature reports, but may not have captured every piece of writing on this issue. The search specifications for the three searches were challenging, due to the interdisciplinary nature of this review (computer science, medical research, and ethics), meaning that MeSH terms could not be used. It is therefore possible that some terms were neglected; for example, while we used the term “discussion forum” in relation to internet forums, we did not specify sites such as Reddit, LinkedIn, Instagram, WeChat, or YouTube, so we may have missed papers focusing on specific sites. This may have limited the ability of themes, such as the legal and terms-and-conditions theme, to capture issues relating to each site. Furthermore, because user demographics vary across sites, this choice may have affected the range of participants (and views) included in the review.
In addition, it is likely that this research missed some viewpoints by excluding papers not focused on health research and academic research, meaning that the guidelines formed here may be missing considerations specific to other scientific sectors such as social and market research. As our searches were limited to medical-, social science-, and life science-focused databases, we may have missed some of the computer science literature, such as papers published as conference proceedings. Some of these were picked up in the gray literature search, but there remains the possibility that some may have been missed.
Once the final articles were selected, it was not possible to assess the quality of expert commentary articles and gray literature guidelines, and public views reported were taken from a small pool of four papers, due to the lack of peer-reviewed research in this area. This likely resulted in a selection bias in the results, as the articles focused on specific populations (adolescents, mental health, and diabetes), and so the views demonstrated here may not be generalizable to all other social media users or conditions that could be studied.
While every effort was made to be neutral and data-led in this thematic analysis, due to the qualitative nature of this study, it is important to recognize the possibility of unintentional bias or subjectivity in the results because of the researchers' academic interests and knowledge. However, our results are comparable with other reviews in this field, such as Golder et al. ( 23 ).
Recommendations, Future Work, and Conclusions
Social media research using text analytics and NLP is evolving quickly in a largely unregulated landscape ( 24 ), with many researchers acknowledging the absence and subsequent need for guidance ( 23 , 51 , 66 , 68 , 69 , 72 , 75 ). While, in a range of circumstances, social media text mining can legally and reasonably proceed without specific ethics committee approvals, there are certain circumstances where scrutiny from ethical committees should be sought.
Ethical approval was considered necessary for research using data from closed groups, engaging in direct contact with users, when conducting any kind of intervention through social media, if research was specifically about users who are under 18 or lack capacity, if users could be identified from the study publication or dataset, if multiple sources of data are being linked, or if, following consultation, it is assessed that there are reasonable risks of potential harms or stigmatization occurring. Likewise, researchers should gain consent from social media users in the circumstances above.
Regardless of whether formal approvals are sought, we make some additional recommendations to improve ethical standards in all text-mining research using social media data for health research purposes, including increasing public awareness about research uses of social media data; aiming for transparency in data access and analysis methods; transparency in routes to benefits for users from the research; consultation with social media users and target groups to identify and mitigate against potential harms that could arise from the research; and ensuring the anonymity of social media users by masking or synthesizing direct quotes and aggregating quantitative data. Researchers should always act within the law and abide by the social media site's terms and conditions, for example, using approved APIs to access data, such as exist for Twitter and Reddit, among other sites. The research community as a whole should foster a culture of continuous improvement in terms of technology and transparency of methods for the processing of social media data for health research ( 91 ).
Future work in this area will aim to distill a list of recommendations or guidance for text-mining researchers that can be widely disseminated, working with national regulators and advisors. In addition, we propose consulting with social media providers to improve the transparency of terms and conditions, and the accessibility of guidance, for accessing and using their users' data for health research. This would help to ensure that transparent and ethical practice becomes embedded in the culture of text-mining social media data and that ethical guidance is available to all. We also note that the number of studies asking social media users for their perspective on their data being used for health research is very limited, and we recommend that more studies be conducted in this area.
Data Availability Statement
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.
Author Contributions
EF conceived the study, carried out some searches, screening, and analysis, and redrafted the paper. SS conducted searches, screening, and analysis and wrote the first draft of the paper. LH carried out some screening and commented on paper drafts. KJ commented on paper drafts. All authors agreed to be accountable for the content of the work.
Funding
This study was supported by the Medical Research Council (grant awarded to LH) (Ref: MR/S004025/1).
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
1. Kemp S. Digital 2019: Global Internet Use Accelerates. (2019). Available online at: https://wearesocial.com/blog/2019/01/digital-2019-global-internet-use-accelerates (accessed January 30, 2019).
2. Denecke K, and Nejdl W. How valuable is medical social media data? Content analysis of the medical web. Inform Sci. (2009) 179:1870–80. doi: 10.1016/j.ins.2009.01.025
3. VanDam C, Kanthawala S, Pratt W, Chai J, and Huh J. Detecting clinically related content in online patient posts. J Biomed Inform. (2017) 75:96–106. doi: 10.1016/j.jbi.2017.09.015
4. Moorhead SA, Hazlett DE, Harrison L, Carroll JK, Irwin A, and Hoving C. A new dimension of health care: systematic review of the uses, benefits, and limitations of social media for health communication. J Med Internet Res. (2013) 15:e85. doi: 10.2196/jmir.1933
5. Antheunis ML, Tates K, and Nieboer TE. Patients' and health professionals' use of social media in health care: motives, barriers and expectations. Patient Educ Couns. (2013) 92:426–31. doi: 10.1016/j.pec.2013.06.020
6. Sinnenberg L, Buttenheim AM, Padrez K, Mancheno C, Ungar L, and Merchant RM. Twitter as a tool for health research: a systematic review. Am J Public Health. (2017) 107:e1–8. doi: 10.2105/AJPH.2016.303512
7. Barber TX, and Silver MJ. Fact, fiction, and the experimenter bias effect. Psychol Bull. (1968) 70:1. doi: 10.1037/h0026724
8. Nikfarjam A, Sarker A, O'Connor K, Ginn R, and Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc. (2015) 22:671–81. doi: 10.1093/jamia/ocu041
9. Liu X, Liu J, and Chen H. Identifying adverse drug events from health social media: a case study on heart disease discussion forums. In: International Conference on Smart Health . Cham: Springer (2014).
PubMed Abstract | Google Scholar
10. Corley CD, Cook DJ, Mikler AR, and Singh KP. Text and structural data mining of influenza mentions in web and social media. Int J Environ Res Public Health. (2010) 7:596–615. doi: 10.3390/ijerph7020596
11. Jashinsky J, Burton SH, Hanson CL, West J, Giraud-Carrier C, Barnes MD, et al. Tracking suicide risk factors through Twitter in the US. Crisis. (2014) 35:51–9. doi: 10.1027/0227-5910/a000234
12. Conway M, Hu M, and Chapman WW. Recent advances in using natural language processing to address public health research questions using social media and consumergenerated data. Yearbook Med Inform. (2019) 28:208–17. doi: 10.1055/s-0039-1677918
13. Cepoiu M, McCusker J, Cole MG, Sewitch M, Belzile E, and Ciampi A. Recognition of depression by non-psychiatric physicians—a systematic literature review and meta-analysis. J Gen Intern Med. (2008) 23:25–36. doi: 10.1007/s11606-007-0428-5
14. Attard A, and Coulson NS. A thematic analysis of patient communication in Parkinson's disease online support group discussion forums. Comput Hum Behav. (2012) 28:500–6. doi: 10.1016/j.chb.2011.10.022
15. Eysenbach G, and Till JE. Ethical issues in qualitative research on internet communities. BMJ. (2001) 323:1103–5. doi: 10.1136/bmj.323.7321.1103
16. Zimmer M. “But the data is already public”: on the ethics of research in Facebook. Ethics Inf Technol. (2010) 12:313–25. doi: 10.1007/s10676-010-9227-5
17. Zimmer M. OkCupid and the Ethics of Big Data Research . (2016). Available online at: https://blog.primr.org/okcupid-ethics-big-data-research/ (accessed March 6, 2019).
18. Aitken M, and de St. Jorre J, Pagliari C, Jepson R, Cunningham-Burley S. Public responses to the sharing and linkage of health data for research purposes: a systematic review and thematic synthesis of qualitative studies. BMC Med Ethics. (2016) 17:73. doi: 10.1186/s12910-016-0153-x
19. Stockdale J, Cassell J, and Ford E. “Giving something back”: a systematic review and ethical enquiry into public views on the use of patient data for research in the United Kingdom and the Republic of Ireland. Wellcome Open Res. (2019) 3:6. doi: 10.12688/wellcomeopenres.13531.2
20. Department of Health Education and Welfare. The Belmont Report. Ethical Principles and Guidelines for the Protection of Human Subjects of Research 1079 . Available online at: https://www.hhs.gov/ohrp/sites/default/files/the-belmont-report-508c_FINAL.pdf (accessed August 05, 2020).
21. Beauchamp TL, and Childress JF. Principles of Biomedical Ethics . 6th ed. New York, NY: Oxford University Press (2009).
22. Dittrich D, and Kenneally E. The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research . US Department of Homeland Security (2012). doi: 10.2139/ssrn.2445102
23. Golder S, Ahmed S, Norman G, and Booth A. Attitudes toward the ethics of research using social media: a systematic review. J Med Internet Res. (2017) 19:e195. doi: 10.2196/jmir.7082
24. Pagoto S, and Nebeker C. How scientists can take the lead in establishing ethical practices for social media research. J Am Med Inform Assoc. (2019) 26:311–3. doi: 10.1093/jamia/ocy174
25. Nebeker C, Harlow J, Espinoza Giacinto R, Orozco-Linares R, Bloss CS, and Weibel N. Ethical and regulatory challenges of research using pervasive sensing and other emerging technologies: IRB perspectives. AJOB Empir Bioeth. (2017) 8:266–76. doi: 10.1080/23294515.2017.1403980
26. Ford E, and Shepherd S. Towards an ethical framework for the text mining of social media for health research: a systematic review. PROSPERO Int Prospect Register Syst Rev. (2018) CRD42018112923. Available online at: https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=112923 (accessed January 08, 2021).
27. Moher D, Liberati A, Tetzlaff J, and Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Ann Intern Med. (2009) 151:264–9. doi: 10.7326/0003-4819-151-4-200908180-00135
28. Tong A, Flemming K, McInnes E, Oliver S, and Craig J. Enhancing transparency in reporting the synthesis of qualitative research: ENTREQ. BMC Med Res Methodol. (2012) 12:181. doi: 10.1186/1471-2288-12-181
29. Adams J, Hillier-Brown FC, Moore HJ, Lake AA, Araujo-Soares V, White M, et al. Searching and synthesising ‘grey literature' and ‘grey information' in public health: critical reflections on three case studies. Syst Rev. (2016) 5:164. doi: 10.1186/s13643-016-0337-y
30. Cooke A, Smith D, and Booth A. Beyond PICO: the SPIDER tool for qualitative evidence synthesis. Qual Health Res. (2012) 22:1435–43. doi: 10.1177/1049732312452938
31. Pace R, Pluye P, Bartlett G, Macaulay AC, Salsberg J, Jagosh J, et al. Testing the reliability and efficiency of the pilot mixed methods appraisal tool (MMAT) for systematic mixed studies review. Int J Nurs Stud. (2012) 49:47–53. doi: 10.1016/j.ijnurstu.2011.07.002
32. Thomas J, and Harden A. Methods for the thematic synthesis of qualitative research in systematic reviews. BMC Med Res Methodol. (2008) 8:45. doi: 10.1186/1471-2288-8-45
33. Bond CS, Ahmed OH, Hind M, Thomas B, and Hewitt-Taylor J. The conceptual and practical ethical dilemmas of using health discussion board posts as research data. J Med Internet Res. (2013) 15:e112. doi: 10.2196/jmir.2435
34. Mikal J, Hurst S, and Conway M. Ethical issues in using Twitter for population-level depression monitoring: a qualitative study. BMC Med Ethics. (2016) 17:22. doi: 10.1186/s12910-016-0105-5
35. Monks H, Cardoso P, Papageorgiou A, Carolan C, Costello L, and Thomas L. Young people's views regarding participation in mental health and well being research through social media. Int J Emot Educ. (2015) 7:4–19.
36. Moreno MA, Grant A, Kacvinsky L, Moreno P, and Fleming M. Older adolescents' views regarding participation in facebook research. J Adolesc Health. (2012) 51:439–44. doi: 10.1016/j.jadohealth.2012.02.001
37. Markham A, and Buchanan E. Ethical Decision-Making and Internet Research: Version 2.0 . Recommendations From the AoIR Ethics Working Committee (2012). Available online at: http://aoir.org/reports/ethics2.pdf
38. ESOMAR. Esomar Guideline on Social Media Research. Amsterdam: ESOMAR (2011). Available online at: https://www.esomar.org/uploads/public/knowledge-and-standards/codes-and-guidelines/ESOMAR-Guideline-on-Social-Media-Research.pdf
39. Social Media Research Group. Using Social Media for Social Research: An Introduction . (2016). Available online at: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/524750/GSR_Social_Media_Research_Guidance_-_Using_social_media_for_social_research.pdf
40. Beninger K, Fry A, Jago N, Lepps H, Nass L, and Silvester H. Research using Social Media; Users' Views . (2014). Available online at: http://www.natcen.ac.uk/media/282288/p0639-research-using-social-media-report-final-190214.pdf (accessed August 05, 2020).
41. Unwin L, and Kenny A. Report of a Research Ethics Workshop held on Thursday 14 July 2016. The Ethics of Internet-based and Social Media Research . (2016). Available online at: https://www.sheffield.ac.uk/polopoly_fs/1.644904!/file/Report_Ethics_of_Social_Media_Research_Jul16.pdf (accessed August 05, 2020).
42. Townsend L, and Wallace C. Social Media Research: A Guide to Ethics. (2016). Available online at: https://www.gla.ac.uk/media/Media_487729_smxx.pdf (accessed August 05, 2020).
43. Evans H, Ginnis S, and Bartlett J. SocialEthics: A Guide to Embedding Ethics in Social Media Research . (2015). Available online at: https://www.ipsos.com/sites/default/files/migrations/en-uk/files/Assets/Docs/Publications/im-demos-social-ethics-in-social-media-research-summary.pdf (accessed August 05, 2020).
44. Ahmed W, Bath PA, and Demartini G. Using Twitter as a data source: an overview of ethical, legal, and methodological challenges. Ethics Online Res. (2017) 2:4. doi: 10.1108/S2398-601820180000002004
45. Azam R. Accessing social media information for pharmacovigilance: what are the ethical implications? Ther Adv Drug Saf. (2018) 9:385–7. doi: 10.1177/2042098618778191
46. Azer SA. Social media channels in health ca r e research and rising ethical issues. JAMA J Ethics. (2017) 19:1061–9. doi: 10.1001/journalofethics.2017.19.11.peer1-1711
47. Benton A, Coppersmith G, and Dredze M. Ethical research protocols for social media health research. In: Proceedings of the First ACL Workshop on Ethics in Natural Language Processing . Valencia (2017).
48. Bica M, and Anderson J. You are what you tweet!” the ethics of (Re) publishing public data as crafted narratives. In: Workshop on Ethical Encounters in HCI Research/ACM Conf on Human Factors in Computing , San Jose, CA (2016).
49. Boyd D, and Crawford K. Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inform Commun Soc. (2012) 15:662–79. doi: 10.1080/1369118X.2012.678878
50. Chiauzzi E, and Wicks P. Digital trespass: ethical and terms-of-use violations by researchers accessing data from an online patient community. J Med Internet Res. (2019) 21:e11985. doi: 10.2196/11985
51. Conway M. Ethical issues in using twitter for public health surveillance and research: da taxonomy of ethical concepts from the research literature. J Med Internet Res. (2014) 16:e290. doi: 10.2196/jmir.3617
52. Conway M, and O'Connor D. Social media, big data, and mental health: current advances and ethical implications. Curr Opin Psychol. (2016) 9:77–82. doi: 10.1016/j.copsyc.2016.01.004
53. Denecke K. Ethical aspects of using medical social media in healthcare applications. Stud Health Technol Inform. (2014) 198:55–62. doi: 10.3233/978-1-61499-397-1-55
54. Farnan JM. Connectivity and consent: does posting imply participation? Am J Bioethics. (2014) 14:62–3. doi: 10.1080/15265161.2014.947823
55. Goodyear VA. Social media, apps and wearable technologies: navigating ethical dilemmas and procedures. Qual Res Sport Exerc Health. (2017) 9:285–302. doi: 10.1080/2159676X.2017.1303790
56. Gustafson DL, and Woodworth CF. Methodological and ethical issues in research using social media: a metamethod of human papillomavirus vaccine studies. BMC Med Res Methodol. (2014) 14:127. doi: 10.1186/1471-2288-14-127
57. Hammer MJ. Ethical considerations when using social media for research. Oncol Nurs Forum. (2017) 44:410–2. doi: 10.1188/17.ONF.410-412
58. Hunter RF, Gough A, O'Kane N, McKeown G, Fitzpatrick A, Walker T, et al. Ethical issues in social media research for public health. Am J Public Health. (2018) 108:343–8. doi: 10.2105/AJPH.2017.304249
59. Lafferty NT, and Manca A. Perspectives on social media in and as research: a synthetic review. Int Rev Psychiatry. (2015) 27:85–96. doi: 10.3109/09540261.2015.1009419
60. Li J. Privacy policies for health social networking sites. J Am Med Inform Assoc. (2013) 20:704–7. doi: 10.1136/amiajnl-2012-001500
61. McKee R. Ethical issues in using social media for health and health care research. Health Policy. (2013) 110:298–301. doi: 10.1016/j.healthpol.2013.02.006
62. Moreno MA, Fost NC, and Christakis DA. Research ethics in the mySpace era. Pediatrics. (2008) 121:157–61. doi: 10.1542/peds.2007-3015
63. Moreno MA, Goniu N, Moreno PS, and Diekema D. Ethics of social media research: common concerns and practical considerations. Cyberpsychol Behav Soc Netw. (2013) 16:708–13. doi: 10.1089/cyber.2012.0334
64. Norva C, and Henderson T. Contextual consent: ethical mining of social media for health research. arXiv Preprint. (2017) arXiv:170107765.
65. Ravoire S Lang M Perrin E participants of Giens XRTN. Advantages and limitations of online communities of patients for research on health products. Therapie. (2017) 72:135–43. doi: 10.1016/j.therap.2016.11.058
66. Schneble CO, Elger BS, and Shaw D. The cambridge analytica affair and internet-mediated research. EMBO Rep. (2018) 19:e46579. doi: 10.15252/embr.201846579
67. Sharkey S, Jones R, Smithson J, Hewis E, Emmens T, Ford T, et al. Ethical practice in internet research involving vulnerable people: lessons from a self-harm discussion forum study (SharpTalk). J Med Ethics. (2011) 37:752–8. doi: 10.1136/medethics-2011-100080
68. Smith J, and Milnes LJ. Social media: the relevance for research. Evid Based Nurs. (2016) 19:99–100. doi: 10.1136/eb-2016-102475
69. Spriggs M. Consent in cyberspace: internet-based research involving young people. Monash Bioethics Rev. (2009) 28:25–39. doi: 10.1007/BF03351319
70. Sugiura L, Wiles R, and Pope C. Ethical challenges in online research: public/private perceptions. Res Ethics. (2017) 13:184–99. doi: 10.1177/1747016116650720
71. Swirsky ES, Hoop JG, and Labott S. Using social media in research: new ethics for a new meme? Am J Bioethics. (2014) 14:60–1. doi: 10.1080/15265161.2014.948302
72. Taylor J, and Pagliari C. Mining social media data: how are research sponsors and researchers addressing the ethical challenges? Res Ethics. (2018) 14:1–39. doi: 10.1177/1747016117738559
73. Taylor HA, Kuwana E, and Wilfond BS. Ethical implications of social media in health care research. Ame J Bioethics. (2014) 14:58–9. doi: 10.1080/15265161.2014.947820
74. Valente TW, and Pitts SR. An appraisal of social network theory and analysis as applied to public health: challenges and opportunities. Annu Rev Public Health. (2017) 38:103–18. doi: 10.1146/annurev-publhealth-031816-044528
75. Vayena E, Mastroianni A, and Kahn J. Caught in the web: informed consent for online health research. Sci Transl Med. (2013) 5:173fs6. doi: 10.2139/ssrn.2272177
76. Weigmann K. Health research 2.0: the use in research of personal fitness or health data shared on social network raises both scientific and ethical concerns. EMBO Rep. (2014) 15:223–6. doi: 10.1002/embr.201438510
77. Williams ML, Burnap P, and Sloan L. Towards an ethical framework for publishing Twitter data in social research: taking into account users' views, online context and algorithmic estimation. Sociology. (2017) 51:1149–68. doi: 10.1177/0038038517708140
78. Wongkoblap A, Vadillo MA, and Curcin V. Researching mental health disorders in the era of social media: systematic review. J Med Internet Res. (2017) 19:1. doi: 10.2196/jmir.7215
79. Gupta UC. Informed consent in clinical research: revisiting few concepts and areas. Perspect Clin Res. (2013) 4:26–32. doi: 10.4103/2229-3485.106373
80. Gibson JJ. The theory of affordances. In: Shaw RJB, editor. Perceiving, Acting and Knowing. Hillsdale, NJ: Lawrence Erlbaum (1977).
81. Vos JFJ, Boonstra AA-O, Kooistra A, Seelen M, and van Offenbeek M. The influence of electronic health record use on collaboration among medical specialties. BMC Health Serv Res. 20:676. doi: 10.1186/s12913-020-05542-6
82. Bardram JE, and Houben S. Collaborative affordances of medical records. Comput Support Cooperative Work. (2018) 27:1–36. doi: 10.1007/s10606-017-9298-5
83. van Dijck J. Facebook and the engineering of connectivity:A multi-layered approach to social media platforms. Convergence. (2013) 19:141–55. doi: 10.1177/1354856512457548
84. Rebers S, Aaronson NK, van Leeuwen FE, and Schmidt MK. Exceptions to the rule of informed consent for research with an intervention. BMC Med Ethics. (2016) 17:9. doi: 10.1186/s12910-016-0092-6
85. Brady E, Segar J, and Sanders C. “I always vet things”: navigating privacy and the presentation of self on health discussion boards among individuals with long-term conditions. J Med Internet Res. (2016) 18:e274. doi: 10.2196/jmir.6019
86. Ford E, Curlewis K, Wongkoblap A, and Curcin V. Public opinions on using social media content to identify users with depression and target mental health care advertising: mixed methods survey. JMIR Mental Health. (2019) 6:e12942. doi: 10.2196/12942
87. World Medical Association Declaration of Helsinki. Ethical principles for medical research involving human subjects. Bull World Health Organ. (2001) 79:373–4. doi: 10.1097/00063110-200109000-00010
88. Herron M, Sinclair M, Kernohan G, and Stockdale J. Ethical issues in undertaking internet research of user-generated content: a review of the literature. Evidence Based Midwifery . (2011) 9:9–15.
89. Held V. The Ethics of Care: Personal, Political, and Global . Oxford University Press on Demand (2006).
90. Suomela T, Chee F, Berendt B, and Rockwell G. Applying an ethics of care to internet research: gamergate and digital humanities. Digital Stud. (2019) 9:4. doi: 10.16995/dscn.302
91. Ford E, Oswald M, Hassan L, Bozentko K, Nenadic G, and Cassell J. Should free-text data in electronic medical records be shared for research? A citizens' jury study in the UK. J Med Ethics. (2020) 46:367–77. doi: 10.1136/medethics-2019-105472
Keywords: social media, text-mining, health research, natural language processing, ethics
Citation: Ford E, Shepherd S, Jones K and Hassan L (2021) Toward an Ethical Framework for the Text Mining of Social Media for Health Research: A Systematic Review. Front. Digit. Health 2:592237. doi: 10.3389/fdgth.2020.592237
Received: 06 August 2020; Accepted: 18 December 2020; Published: 26 January 2021.
Copyright © 2021 Ford, Shepherd, Jones and Hassan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Elizabeth Ford, email@example.com
This article is part of the Research Topic
Healthcare Text Analytics: Unlocking the Evidence from Free Text
Published on 24.5.2023 in Vol 25 (2023)
Text Analysis of Trends in Health Equity and Disparities From the Internal Revenue Service Tax Documentation Submitted by US Nonprofit Hospitals Between 2010 and 2019: Exploratory Study
Authors of this article:
- Emily Hadley, MS;
- Laura Haak Marcial, PhD;
- Wes Quattrone, MA;
- Georgiy Bobashev, PhD
RTI International, Durham, NC, United States
Emily Hadley, MS
3040 East Cornwallis Road
Durham, NC, 27514
Phone: 1 919 541 6000
Email: [email protected]
Background: Many US hospitals are classified as nonprofits and receive tax-exempt status partially in exchange for providing benefits to the community. Proof of compliance is collected with the Schedule H form submitted as part of the annual Internal Revenue Service Form 990 (F990H), including a free-response text section that is known for being ambiguous and difficult to audit. This research is among the first to use natural language processing approaches to evaluate this text section with a focus on health equity and disparities.
Objective: This study aims to determine the extent to which the free-response text in F990H reveals how nonprofit hospitals address health equity and disparities, including alignment with public priorities.
Methods: We used free-response text submitted by hospital reporting entities in Parts V and VI of the Internal Revenue Service Form 990 Schedule H between 2010 and 2019. We identified 29 main themes connected to health equity and disparities and 152 related key phrases. We tallied occurrences of these phrases through term frequency analysis, calculated the Moran I statistic to assess geographic variation in 2018, analyzed Google Trends use of the same terms over the same period, and used semantic search with Sentence-BERT in Python to understand contextual use.
Results: We found increased use from 2010 to 2019 across all 29 phrase themes related to health equity and disparities. More than 90% of hospital reporting entities used terms in 2018 and 2019 related to affordability (2018: 2117/2131, 99.34%; 2019: 1620/1627, 99.57%), government organizations (2018: 2053/2131, 96.33%; 2019: 1577/1627, 96.93%), mental health (2018: 1937/2131, 90.9%; 2019: 1517/1627, 93.24%), and data collection (2018: 1947/2131, 91.37%; 2019: 1502/1627, 92.32%). The themes with the largest relative increase were LGBTQ (lesbian, gay, bisexual, transgender, and queer; 1676%; 2010: 12/2328, 0.51%; 2019: 149/1627, 9.16%) and social determinants of health (958%; 2010: 68/2328, 2.92%; 2019: 503/1627, 30.92%). Terms related to homelessness varied geographically from 2010 to 2018, and terms related to equity, health IT, immigration, LGBTQ, oral health, rural, social determinants of health, and substance use showed statistically significant (P<.05) geographic variation in 2018. The largest percentage point increase was for terms related to substance use (2010: 403/2328, 17.31%; 2019: 1149/1627, 70.62%). However, use of themes such as LGBTQ, disability, oral health, and race and ethnicity ranked lower than public interest in these topics, and some increased mentions of themes were to explicitly say that no action was taken.
Conclusions: Hospital reporting entities demonstrate an increasing awareness of health equity and disparities in community benefit tax documentation, but these do not necessarily correspond with general population interests or additional action. We propose further investigation of alignment with community health needs assessments and make suggestions for improvements to F990H reporting requirements.
Nonprofit hospitals in the United States are exempt from federal taxes. In exchange for this exemption, these hospitals have an obligation to provide community benefit [ 1 ]. The proof of compliance is collected with Schedule H, a form submitted as part of the annual Form 990 (F990) Internal Revenue Service (IRS) tax documentation for nonprofit hospitals. A substantial section of the F990 Schedule H (F990H) is composed of free-response (unstructured) text fields, where reporting entities can voluntarily provide details on community benefit spending. This may include discussion of community needs and the measures a hospital has or has not taken to address these needs.
Community needs can and do include topics related to health equity and disparities. Health equity is commonly understood as an opportunity for all individuals to be healthy, regardless of membership in a group that has historically been economically or socially disadvantaged [ 2 ]. Health disparities are defined as a particular type of health difference that is worse among socially disadvantaged individuals, namely members of disadvantaged race or ethnicity groups, or economically disadvantaged people within any racial or ethnic group [ 2 ]. Addressing social determinants of health (SDOH), defined as the economic and social conditions that impact the health of people and communities, is considered a primary approach for reducing health disparities and achieving health equity [ 3 ].
In recent years, legislators and other stakeholders have paid increasing attention to whether hospitals are providing adequate community benefits to justify their tax-exempt status [ 4 ]. The IRS is required to review each tax-exempt hospital’s community benefit activities at least once every 3 years, although historically, this requirement has been ambiguous and difficult to track [ 5 ]. In 2020, the Government Accountability Office completed a review of the IRS’s implementation of requirements for tax-exempt hospitals and made a series of recommendations [ 5 ]. One recommendation was that the IRS establish a well-documented process for identifying hospitals at risk of noncompliance with the community benefit standard; the IRS added instructions in April and July 2021 for employees to document case files with relevant facts and circumstances considered during their review to determine whether the hospital organization satisfied the community benefit standard [ 5 ]. One unfulfilled recommendation is updating F990H to ensure that the community benefit a hospital is providing is clear and can be easily identified by Congress and the public [ 5 ]. The IRS recognizes that 3 of the factors currently addressed through open-ended narrative responses are not part of the quantitative, machine-readable files and that a revised F990H could more clearly, consistently, and comprehensively provide community benefit information to the public [ 5 ]. A related open recommendation is that Congress should specify which hospital services and activities are sufficient for community benefit [ 5 ]. These recommendations provide an opportunity for an explicit alignment with approaches to address health equity and disparities.
Most existing studies that use data from F990H have focused on financial data. Empirical studies suggest that nonprofit hospital community benefit spending focuses on charity care and patient care services with little effort to improve community health [ 6 - 8 ]. Limited literature has explored community benefit spending with a focus on health equity or disparities, generally finding that increased IRS reporting clarification or explicit goals to address health disparities could more directly address community needs [ 9 - 11 ]. The text in F990H is unstructured data that varies in detail and length and has been historically challenging to analyze or review in large quantities. Only 1 study by Chen et al [ 12 ] reviewed this text in depth using a manual review of a small sample of 47 hospitals from 2015 to 2017. Recent advances in natural language processing techniques have made text analysis with much larger and longer text data sets more accessible [ 13 ].
We present a novel text analysis of F990H tax documentation to understand if and how US nonprofit hospitals address health equity and disparities through community benefit programming. Our research is the first known work to analyze the F990H free-response text on a national scale across a 9-year period using text analytics approaches. We contribute to a larger body of work regarding hospital community benefits programming, including the limited existing discussion on how hospitals use community benefits programming to address health equity and disparities. By comparing our results with public search trends and identifying gaps in term use and action, we provide findings that stakeholders can use to advocate for community benefit approaches and improvements to F990H to better address health equity and disparities.
The data for this analysis come from free-response text in Parts V and VI of Schedule H from F990s submitted by US nonprofit hospitals for tax years between 2010 and 2019 [ 14 ]. Detailed descriptions of the specific IRS requirements for Parts V and VI are included in Multimedia Appendix 1. Data from 2020 onward were not available at the time of analysis because of a data lag that has been exacerbated by the COVID-19 pandemic. F990H is submitted annually (although sometimes delayed by extensions) by a hospital facility or, in many cases, by a hospital organization with a shared employee identification number for multiple hospital facilities. All free-response answers for Parts V and VI were combined for this analysis. Data were collected and maintained through the Community Benefit Insight project [ 15 ]. Analysis was completed in Python (Python Software Foundation) using pandas, numpy, nltk, PyTorch, and SentenceTransformers. Visualizations were created in Tableau (Tableau) and R (R Foundation for Statistical Computing) using ggplot2.
The free-response text sections include answers to several questions regarding community health needs assessments (CHNAs), financial eligibility assistance programs, and descriptions of whether and how identified community needs are addressed by a hospital facility. With a few exceptions, these questions are often ambiguous, and hospitals voluntarily choose the level of detail they provide [ 16 ]. The F990H for each hospital is reviewed by the IRS at least once every 3 years but rarely audited; an audit is more complicated and thorough than a review. Even if audited, there is no clear definition of the activities and services that are sufficient to prove community benefit [ 5 ]. Text responses are generally full sentences and paragraphs. Colloquial terminology and misspellings are infrequent.
Figure 1 shows the number of hospital entities that report each year. As portions of the free-response text are required in the IRS Schedule H Instructions (Multimedia Appendix 1), every nonprofit hospital reporting entity is expected to have a free-response text entry; an average of 99.9% of hospital reporting entities that submitted the IRS Form 990 have completed the free-response text in Schedule H. Of the 2131 reporting hospital entities in 2018, a total of 90.5% (n=1930) had continuously submitted free-response text data since 2010. However, the overall number of hospital reporting entities decreased from 2010 to 2018, likely reflecting national trends in hospital consolidation and closure [ 17 , 18 ]. The number of reporting hospitals was notably low in 2019 (n=1627), likely because of the reporting extensions permitted in 2020 during the COVID-19 pandemic. Though we anticipate that 2019 is likely missing data from some hospitals, we decided to retain the 2019 data in the analysis as we prioritized the timeliness of the findings. The median number of words in text responses increased from 1629 in 2010 to 3439 in 2019, whereas the average number of words increased from 2840 in 2010 to 10,123 in 2019. The average was skewed by hospital reporting entities in California, Arizona, and Utah, which submitted average responses of over 30,000 words. Most hospitals (2406/2554, 94.21%) do not submit duplicate text across years.
Outcomes and Variable Construction
Term frequency analysis is a type of lexical analysis that searches for an exact word or phrase using a bag-of-words model [ 19 ]. In this analysis, we used term frequency to flag whether a word or phrase was used one or more times by a particular hospital organization in each tax year. Term frequency analysis was proposed as an option in the first step of computational grounded theory in sociology, which combines expert human knowledge with the processing power and pattern recognition of computers for content analysis [ 20 ]. This study worked from clearly defined stakeholder expectations, which did not align with the more exploratory nature of the computational grounded theory approach. However, we applied its key principles by deriving a list of terms, regularly seeking expert feedback, and validating our results. We also evaluated a semisupervised topic modeling approach that could have surfaced additional terms and topics related to key anchor words; however, the suggested topics were too broad or included terms unrelated to the topic as defined by stakeholders [ 21 , 22 ]. Improving semisupervised topic modeling in this context could be an area of future research.
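The per-entity flagging step can be sketched as a whole-word search over a hospital-year response. This is a minimal illustration with a hypothetical theme and toy text; the study's full 152-term set and exact matching rules are not reproduced here.

```python
import re

def theme_flag(text: str, phrases: list[str]) -> int:
    """Bag-of-words flag: 1 if any phrase in the theme occurs at least once."""
    return int(any(re.search(r"\b" + re.escape(p) + r"\b", text)
                   for p in phrases))

# Toy hospital-year response and two hypothetical themes
response = "the hospital expanded its opioid treatment and naloxone program"
print(theme_flag(response, ["opioid", "opioids", "fentanyl"]))      # 1
print(theme_flag(response, ["lead-based paint", "air pollution"]))  # 0
```

Aggregating these binary flags across reporting entities and tax years yields percentages of the kind reported in the Results section.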
Term frequency analysis requires a predefined list of words or phrases, and a limitation of the approach is that it considers only the exact words and phrases searched for. We therefore built the term set through a thorough, four-step process combining text analysis approaches and stakeholder input.
The first step in key term selection was providing stakeholders the opportunity to suggest specific words, phrases, and themes related to health equity and disparities. Stakeholders included 11 subject matter experts and programming staff with experience in health and community benefits from the Robert Wood Johnson Foundation and the Robert Wood Johnson Foundation grantees Community Catalyst and Healthy Food in Health Care. These stakeholders provided suggestions and agreed on words, phrases, and themes in meetings and by email from fall 2021 to fall 2022. The terms included both single words (rural) and phrases (data collection). Terms with singular and specific meanings were selected. For example, specific drugs such as opioids and fentanyl were used for the substance use theme, as opposed to drug, which is broader and not always related to substance use. In some cases, both singular and plural versions of a term were included (ie, equity and equities).
The second step involved n-gram tokenization of the text, followed by a search for the 1000 most common single words, bigrams, and trigrams. We reviewed the most common words and phrases. Common terms closely related to the themes suggested by the stakeholders were added to the full list of terms. For example, the terms listening tour and focus group were both added to the data collection theme through this process. The third step involved a review of the SDOH literature associated with the Healthy People 2030 initiative led by the US Office of Disease Prevention and Health Promotion for any other words or phrases that should be included [ 23 ]. For example, this review led to the addition of lead-based paint and air pollution to the environment theme. The final step was to provide the list of terms to a variety of stakeholders, including those from the first step, for feedback.
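The n-gram tokenization step can be sketched with the Python standard library. This is a minimal illustration, not the study's pipeline: the documents, tokenization rule, and counts here are invented for demonstration.

```python
import re
from collections import Counter

def top_ngrams(docs, n_values=(1, 2, 3), top_k=1000):
    """Count the most common single words, bigrams, and trigrams across documents."""
    counts = Counter()
    for doc in docs:
        # Simple alphabetic tokenization; the study's exact tokenizer is not specified.
        tokens = re.findall(r"[a-z]+", doc.lower())
        for n in n_values:
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(top_k)

# Toy documents standing in for F990H free-response text.
docs = ["We held a listening tour and a focus group.",
        "The focus group discussed data collection."]
print(top_ngrams(docs, top_k=5))
```

Common n-grams surfaced this way (eg, "listening tour", "focus group") can then be reviewed and mapped to stakeholder themes.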
The resulting term set spanned 29 themes and included 152 words and phrases. These words and themes are shown in Figure 2 . A clean version of the text was created such that it was all lower case with no symbols or punctuation.
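The cleaning and term-flagging steps can be sketched as follows, assuming simple regular-expression normalization; the study's exact cleaning rules beyond lowercasing and removing symbols and punctuation are not specified, so treat this as illustrative.

```python
import re

def clean_text(text):
    """Lowercase and replace everything except letters, digits, and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def uses_term(clean_doc, term):
    """Flag whether a term (word or phrase) is used one or more times."""
    return re.search(rf"\b{re.escape(term)}\b", clean_doc) is not None

doc = clean_text("Rural-health outreach and data collection!")
print(uses_term(doc, "rural"), uses_term(doc, "data collection"))
```

Word-boundary matching keeps a term such as "rural" from matching inside an unrelated longer word.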
Term frequency analysis was performed using this text. We ranked the use of these terms in 2019. We also calculated the percentage and percentage point change from 2010 to 2019. Percent change measurements are useful for understanding the relative increase in term use, especially for less-frequent themes. Percentage point change measurements are useful for understanding the raw change in term use.
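The two change measures are simple to compute; using the substance use figures reported in the Results (17.31% of reporting entities in 2010, 70.62% in 2019) makes the distinction concrete.

```python
def percent_change(p0, p1):
    """Relative change in term use, as a percentage of the baseline value."""
    return (p1 - p0) / p0 * 100

def percentage_point_change(p0, p1):
    """Raw change in the percentage of reporting entities using a term."""
    return p1 - p0

# Substance use theme: 17.31% of entities in 2010, 70.62% in 2019.
print(round(percent_change(17.31, 70.62)))               # 308
print(round(percentage_point_change(17.31, 70.62), 2))   # 53.31
```

The same raw change reads very differently on the two scales, which is why both are reported: relative change emphasizes growth in less-frequent themes, while percentage point change reflects absolute shifts.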
Geographic Variation in F990H Use
Community benefit programming is intended to align with community needs, which may vary by geographic region. To assess the alignment of F990H theme use by geography, we first aggregated the percentage of hospital reporting entities with one or more uses of a term by state for each year from 2010 to 2019 and mapped the findings. We visually reviewed the maps for trends in changes over time. We then calculated the Moran I statistic for 2018 to evaluate the presence of spatial autocorrelation to determine if there is a pattern of similarity between observations that are geographically close to one another [ 24 ]. The data from 2018 were used as it is the most recent year with the most complete data. Neighboring states were identified through centroids up to 1000 km apart. We performed a 1-sided statistical significance test with an alternative hypothesis that the observed spatial autocorrelation in the data was significantly greater than what would be expected by chance under a null hypothesis of no spatial autocorrelation. Statistical significance was assessed at α values of .001, .01, and .05.
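The global Moran I statistic can be computed directly with NumPy. This sketch uses a toy chain of 4 units with binary adjacency weights; the study instead defined neighbors through state centroids up to 1000 km apart and assessed significance with a 1-sided test.

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I for values x under an n x n spatial weight matrix w
    (zero diagonal): I = (n / W) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    num = (w * np.outer(z, z)).sum()
    den = (z ** 2).sum()
    return len(x) / w.sum() * num / den

# Four units on a line; adjacent units are neighbors.
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
# Clustered values (high next to high) yield positive spatial autocorrelation.
print(morans_i([1, 1, 0, 0], w))
```

Clustered values give a positive I, alternating values a negative I, matching the interpretation of neighboring state similarity used here.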
Comparison With Google Trends
One criticism of hospital community benefits programming (and related spending) is that it is out of touch with community-identified needs [ 8 ]. Google Trends has emerged as a source for investigating how social trends change over time [ 25 ]. We used Google Trends as a proxy for the general population’s interest in topics related to health equity and disparities. We obtained the relative frequency of term use in Google searches related to health (as categorized by Google) between January 1, 2010, and December 31, 2019, in the United States, for the words and phrases in the term list. We note a limitation in that it is unclear how Google determines the category to which a search query belongs, and we are unable to determine the scope of the health category; however, for the purposes of this work, filtering to the health category is still preferable to using all search categories. Further details on Google Trends are provided in Multimedia Appendix 2 .
Considering only the searches in 2019, we ranked public interest in each of the 29 themes. We compared this Google Trends ranking with the ranking of use in F990H in 2019. We assigned similar relative usage to themes within 5 rankings of each other (eg, Nutrition has rank 11 in 2019 F990H use and rank 10 in 2019 in Google Trends searches; Nutrition is considered a theme with “similar relative usage” as rank 11 is only 1 rank difference from rank 10). A difference of 6 to 18 rankings was considered large, whereas a difference of more than 18 rankings was considered very large. These thresholds were selected to ensure bands of similar width in the figure were used for comparison of the Google Trends and F990H results.
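The banding rule can be expressed as a small function using the thresholds stated above: within 5 ranks is similar, 6 to 18 is a large difference, and more than 18 is a very large difference.

```python
def rank_band(f990h_rank, trends_rank):
    """Classify agreement between a theme's F990H rank and Google Trends rank."""
    diff = abs(f990h_rank - trends_rank)
    if diff <= 5:
        return "similar"
    if diff <= 18:
        return "large difference"
    return "very large difference"

# Nutrition: rank 11 in 2019 F990H use, rank 10 in 2019 Google Trends searches.
print(rank_band(11, 10))  # similar
```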
A major criticism of term frequency analysis is that, although it is useful for determining whether a keyword or phrase is used, it is challenging to determine the context or meaning of the term. Therefore, a semantic search was used to augment term frequency analysis. This methodology is used by major search engines to search for meaning by evaluating both the searcher’s intent and the contextual meaning of the terms. We used the question-answer retrieval implementation of Sentence-BERT, a model pretrained on the Natural Questions data set which uses real questions from Google Search with annotated data from Wikipedia as the answers [ 26 ]. This approach is best for an asymmetrical search task in which a short query (such as a question or keyword) is used to find a sentence or paragraph.
For this project, we built a semantic search model and used two search queries for each theme: (1) “took action on <theme>” and (2) “did not take action on <theme>.”
We reviewed the top 20 sentences returned for each query and reported summary findings from the themes with the 3 largest percent or percentage point increase. We selected up to 3 sentences for each theme that best reflected the dichotomy between taking action and not taking action. Not all themes had examples of both action and inaction, and we intentionally did not seek to quantify the results from the semantic search, as it is an imperfect method that can return ambiguous or unrelated sentences.
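The ranking underlying this kind of asymmetric semantic search is cosine similarity between query and passage embeddings. The sketch below substitutes hand-made 3-dimensional vectors and invented passages for the Sentence-BERT embeddings used in the study, to show only the retrieval step.

```python
import numpy as np

def top_passages(query_vec, passage_vecs, passages, k=3):
    """Rank passages by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q
    order = np.argsort(scores)[::-1][:k]
    return [(passages[i], float(scores[i])) for i in order]

# Toy "embeddings" standing in for Sentence-BERT output; passages are invented.
passages = ["hospital funded a housing program",
            "no resources were available for this need",
            "annual report of charges billed"]
vecs = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.9, 0.1],
                 [0.0, 0.1, 0.9]])
query = np.array([1.0, 0.0, 0.0])  # eg, an embedding of "took action on housing"
print(top_passages(query, vecs, passages, k=1))
```

In practice, a pretrained question-answer model embeds both the short query and the candidate sentences, and the top-k sentences are reviewed manually, as described above.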
This research was completed using publicly available secondary data from hospital reporting entities and did not require institutional review board review because it did not involve human participants. All research was conducted with an ethic of respect for cultures, communities, individuals, and independent knowledge. Feedback was obtained from stakeholders likely to be impacted by the findings of this study.
Figure 2 shows the results of the term frequency analysis sorted by prevalence in 2019. Figures S1-S29 in Multimedia Appendix 3 provide the detailed disaggregation for each theme. Figure 2 illustrates that nearly every hospital organization uses a term related to affordability (2018: 2117/2131, 99.34%; 2019: 1620/1627, 99.57%), and more than 90% of hospital reporting entities used a term in 2018 and 2019 related to government organizations (2018: 2053/2131, 96.33%; 2019: 1577/1627, 96.93%), mental health (2018: 1937/2131, 90.9%; 2019: 1517/1627, 93.24%), and data collection (2018: 1947/2131, 91.37%; 2019: 1502/1627, 92.32%).
The least used themes, with a prevalence of less than 10% in 2018 and 2019, were related to immigration (2018: 191/2131, 8.96%; 2019: 149/1627, 9.16%), LGBTQ (2018: 161/2131, 7.56%; 2019: 149/1627, 9.16%), and the environment (2018: 104/2131, 4.88%; 2019: 100/1627, 6.14%).
Figure 2 provides additional details on the relative percentage increase and the raw percentage point increase in the use of a term at least once from 2010 to 2019. All 29 themes showed an increase in use in Schedule H of F990, as indicated by both relative and percentage point change. Although LGBTQ (lesbian, gay, bisexual, transgender, and queer)-related terms were used by only a small percentage of hospital reporting entities ( Figure 2 ), this theme saw the largest relative increase from 2010 to 2019 (1676%; 2010: 12/2328, 0.51%; 2019: 149/1627, 9.16%). Other themes with large relative increases were SDOH (958%; 2010: 68/2328, 2.92%; 2019: 503/1627, 30.92%) and environment (522%; 2010: 23/2328, 0.99%; 2019: 100/1627, 6.15%). The themes with the smallest relative increase include affordability (2.06%; 2010: 2270/2328, 97.51%; 2019: 1620/1627, 99.57%) and insurance (13.39%; 2010: 1763/2328, 75.73%; 2019: 1450/1627, 89.12%). Terms related to substance use saw the largest raw percentage point increase: less than a fifth of hospital reporting entities used any substance use language in 2010 (403/2328, 17.31%), and more than two-thirds of hospital reporting entities used a substance use term in 2019 (1149/1627, 70.62%). Other themes with notable increases included keywords related to oral health (48.26 percentage points; 2010: 429/2328, 18.43%; 2019: 1085/1627, 66.69%) and nutrition (40.31 percentage points; 2010: 853/2328, 36.64%; 2019: 1252/1627, 76.95%). Themes with the smallest percentage point increase included affordability (2.06 percentage points; 2010: 2270/2328, 97.51%; 2019: 1620/1627, 99.57%) and environment (5.16 percentage points; 2010: 23/2328, 0.99%; 2019: 100/1627, 6.15%).
Geographic Variation in Themes From 2010 to 2019
Theme use can vary across states and time. In Figure 3 , we highlight the theme with the most visually clear example of change in geographic variability, homelessness . In 2010, the percentage of hospital reporting entities using one or more key terms related to homelessness was small and similar across states. From 2012 to 2015, the proportion of hospital reporting entities using one or more key terms related to homelessness increased in states on the West Coast. In 2018 and 2019, the majority of hospital reporting entities on the West Coast (2018: 99/168, 58.9%; 2019: 115/161, 71.4%) used one or more terms related to homelessness . The geographic maps for all the themes are available in Multimedia Appendix 4 .
Although homelessness was the most visually striking theme for showing change across time, Table 1 highlights the results of the Moran test for spatial autocorrelation among themes in 2018. A total of 8 themes showed statistically significant ( equity : P =.001; health IT : P =.02; immigration : P =.002; LGBTQ : P =.007; oral health : P =.04; rural : P <.001; SDOH : P <.001; substance use : P =.003) positive spatial autocorrelation, suggesting some degree of neighboring state clustering in the discussion of themes in F990H. The maps in Multimedia Appendix 4 help illustrate where clusters may occur. The equity theme was used in clusters of states on the East and West coasts, with less use in the Midwest, excluding a small cluster around Indiana. The health IT theme had higher use in a cluster of neighboring states, including Nebraska, South Dakota, Colorado, Montana, Utah, and Iowa, as well as another cluster, including Indiana, Michigan, Ohio, Kentucky, and Illinois. The use of the immigration theme was clustered in the Pacific Northwest (Washington, Idaho, and Oregon) and a smaller cluster in the Northeast (Massachusetts, New York, New Jersey, and Connecticut). The LGBTQ theme was almost exclusively used in 3 clusters: West Coast (Washington, Oregon, and California), Midwest (Minnesota, Wisconsin, Illinois, Kentucky, and Oklahoma), and New England (Maine, Vermont, Massachusetts, and Connecticut), with some mention in New York and New Jersey. The oral health theme had a number of clusters, including the Mountain West (Idaho, Montana, Wyoming, and Colorado), mid-Atlantic (Pennsylvania, New Jersey, Maryland, and Virginia), and New England (New Hampshire, Vermont, Massachusetts, Connecticut, and Rhode Island). The use of rural was higher in the Midwest (Wisconsin, Minnesota, Iowa, Missouri, Arkansas, Oklahoma, Kansas, North Dakota, and South Dakota) and Mountain West (Montana, Idaho, Utah, Colorado, and Oregon).
The use of terms in the SDOH theme was higher on the West Coast (Washington, Oregon, and California) and in the North Atlantic (Maine, Massachusetts, New York, Rhode Island, New Jersey, and Maryland). Finally, the substance use theme had clusters in the Southwest (Utah, Arizona, New Mexico, and Colorado), Midwest (Wisconsin, Illinois, Missouri, Iowa, Ohio, Kentucky, and Tennessee), and on the East Coast (excluding North Carolina and Georgia).
a LGBTQ: lesbian, gay, bisexual, transgender, and queer.
b SDOH: social determinants of health.
Figure 4 shows the relative rank of use of a theme in hospital F990H reporting versus Google Trends in 2019. A rank of 1 was the most used, whereas a rank of 29 was the least used. Items in the middle band (eg, government organizations , chronic illness , and immigration ) reflect similar relative use. Items in the bands near the top-left corner of the figure (eg, activity and insurance ) reflect themes where the relative rank in Schedule H reporting is higher than that in Google Trends. Items in the bands toward the bottom-right corner of the figure (eg, LGBTQ , oral health, and disability ) reflect themes where the relative rank in Google Trends is higher than in Schedule H. A total of 2 themes ( government organizations and mental health ) ranked in the top 5 and 1 theme ( environment ) ranked in the bottom 5 for relative use in both Google Trends and Schedule H in 2019.
Figure 5 highlights the percentage change in relative use from 2010 to 2019 by hospitals for Google Trends versus F990H. Although the LGBTQ theme has greater use in Google Trends in Figure 4 , this theme saw a much larger relative increase in use in the analysis timeframe in F990H. The SDOH and environment themes also increased considerably in use in F990H and saw some increase in use in Google Trends. The right panel in Figure 5 highlights that many themes had small and sometimes negative changes in Google Trends, even though the theme saw a substantial increase in use in F990H. Oral health and substance use were 2 themes with more than a 75% increase in Google Trends searches (oral health weighted average Google Trends increase: 98.9%; 2010 less frequent Google Trends terms average use: 16.7; 2019 less frequent Google Trends terms average use: 32.78; 2010 more frequent Google Trends terms average use: 14.6; 2019 more frequent Google Trends terms average use: 29.6; substance use weighted average Google Trends increase: 77.7%; 2010 less frequent Google Trends terms average use: 1.9; 2019 less frequent Google Trends terms average use: 4.8; 2010 more frequent Google Trends terms average use: 9.4; 2019 more frequent Google Trends terms average use: 10.9) and more than a 250% increase in use in F990H text from 2010 to 2019 (oral health: 261.9%; 2010: 429/2328, 18.43%; 2019: 1085/1627, 66.69%; substance use: 308%; 2010: 403/2328, 17.31%; 2019: 1149/1627, 70.62%).
Key findings from the semantic search are shown in Table 2 for the most common phrases in themes with large percentage point increases in F990 Schedule H use from 2010 to 2019. These results highlight the presence of both action and nonaction statements in the free-response text submitted by hospitals.
Given the increased attention paid to health equity and disparities [ 27 ], it is gratifying to see greater use by hospital reporting entities of language related to these terms across all 29 health equity and disparity themes in this analysis. This increased use may suggest an improved hospital awareness of these issues or community needs related to these issues. It is worth noting that this increased use of words and phrases related to these specific themes is, with 2 exceptions, not required by the IRS and is therefore mostly voluntary on the part of hospitals.
The 2 exceptions to voluntary reporting are the affordability and data collection themes. Both topics require explicit descriptions in community benefit documentation. Data collection is of particular interest, because the 2010 Affordable Care Act required the implementation of CHNAs, including data collection and focus groups, by 2013. Data collection was one of the themes with the largest percentage point increase, with a particularly large jump (approximately 22 percentage points) between 2012 and 2013. This reflects the impact that data collection legislation can have on both the implementation and documentation of meeting requirements.
Some of the relative changes in theme use in the F990 documentation paralleled national events. Substance use was a theme with a substantial percentage and percentage point increase from 2010 to 2019 in F990H. This corresponds to the second wave of the US opioid epidemic, which began in 2010 with a rapid increase in heroin overdoses, and the third wave, which began in 2013 with significant overdoses resulting from synthetic opioids [ 28 ]. Heroin overdoses continued to increase through 2016, whereas synthetic opioid overdoses continued to increase through 2019. Nutrition, a theme that includes the term obesity , also increased from 2010 to 2019 in F990H. This use likely reflects an increase in obesity in the United States, where obesity-related conditions are among the leading causes of preventable premature death [ 29 ]. SDOH gained increased attention in 2010 when the World Health Organization published the “Conceptual Framework for Action on the Social Determinants of Health” [ 30 ]. The substantial increase in use during a similar timeframe suggests growing awareness among hospitals regarding the SDOH framework.
Variation in use can also be attributed to geography. Figure 3 shows the clearest depiction of geographic differences: the percentage of reporting entities in West Coast states using key terms related to homelessness grew substantially between 2010 and 2019. This may reflect the fact that Washington, Oregon, and California were among the top 10 states with the highest rates of homelessness in 2019 [ 31 ]. However, of these 3 states, only California was also among the states with a high percentage increase in homelessness from 2010 to 2019. Other states with a large increase in documented homelessness from 2010 to 2019 included New York, South Dakota, Kansas, and Massachusetts, but among these, only Massachusetts saw a corresponding increase in homelessness terms in F990H [ 31 ]. Chen et al [ 12 ] also documented lower mentions of investment in housing in F990H, with only 12 of 47 hospital organizations in 5 cities with high rates of homelessness reporting housing-related spending between 2015 and 2017. These findings suggest a lack of concordance between community needs and hospital community benefits spending in regions with high current or markedly increasing rates of homelessness. Chen et al [ 12 ] suggested that hospitals should be provided with evidence-based strategies from early adopters of homelessness strategies to see how housing may fit within their purview and that F990H instructions should be updated so that hospitals are getting adequate credit for housing investment.
The Moran results in Table 1 highlight other themes with discrepancies in F990H theme use between neighboring states and regions with a documented need. For example, oral health has clusters of F990H use in the Mountain West, mid-Atlantic, and New England, but all the states in the clusters with higher F990H oral health use are also among states with medium to high proportions of adults reporting in 2018 that they had visited a dentist or dental clinic within the past year [ 32 ]. As regular dental visits are important to both oral health and overall wellness, it is notable that no state among those with the lowest proportion of previous year adult dental visits was among the states with the highest mentions of oral health in F990H [ 33 ]. The F990H use of the immigration theme was clustered in the Pacific Northwest and mid-Atlantic and distinctly lower in the 3 states where nearly half (45%) of US immigrants live: California, Texas, and Florida [ 34 ]. This insight aligns with the documented finding that adult immigrants, regardless of immigration and citizenship status, are underserved in the US health care system [ 35 ]. Not all F990H use was misaligned—the use of rural was high in states from more rural regions (Midwest and Mountain West) and showed some overlap with the health IT theme, potentially reflecting the pre–COVID-19 pandemic emphasis on the use of health IT approaches such as telehealth in rural communities [ 36 , 37 ]. The substance use theme was highest in the Southwest, Midwest, and East Coast, all regions where most states maintained similar opioid overdose rates between 2017 and 2018 [ 38 ]. However, the remaining themes with statistically significant geographic clustering— equity, LGBTQ, and SDOH —are all themes with national applicability that transcend the few clusters with higher F990H use. 
The clustered use in the West Coast and North Atlantic, with some sporadic clustered use in the Midwest, suggests that these terms may be politically charged and more commonly used in politically liberal states [ 39 ].
Themes including LGBTQ , disability, oral health, and race and ethnicity are more highly ranked in Google Trends than in F990H, suggesting a degree of misalignment in general public interests and hospital activities as described in F990H. Individuals may be more interested in how these personal topics impact them, whereas hospital reporting entities discuss insurance and activity themes much more in F990H than in the broader population. This is likely expected with the current version of F990H as it includes specific questions related to how hospitals address uninsured patients but has ambiguous questions requesting information on how a hospital is addressing “significant needs” as identified in a CHNA or why a significant need is not being addressed. Given that a major motivation of community benefits reporting is to ensure that nonprofit hospitals are addressing community needs, the misalignment in ranking of some themes in Google Trends health searches and F990H free-response text suggests that a revamp of F990H with more explicit and granular community needs questions may generate greater accountability for addressing health equity and disparities.
A notable limitation of the Google Trends comparison is that the trends are from national searches, whereas hospital reporting entities generally aim to align with local community needs. Although national searches may not always be applicable to a local community, attention to larger trends may still help hospital entities that are missing opportunities to address important rising topics that are still relevant to their community. For example, the increased attention in the past decade to LGBTQ rights and disparities by race and ethnicity suggests that both LGBTQ and race and ethnicity have national applicability, and the Google Trends results suggest that they are currently more prioritized by the public than by hospitals. More hospitals could seek to discuss how they are addressing these themes in the F990H.
The increased use of terms related to health equity and disparities is promising, but the semantic search results in Table 2 make it clear that increased use of terms may not necessarily correspond to increased community benefit programming. Of the 6 themes with the largest relative or raw increase, 4 themes had results with hospitals explicitly stating their inaction. A common reason for lack of action often appears to be a lack of resources or services or misalignment with the hospital’s mission. Young et al [ 7 ] found that most benefit-related expenditures were related to patient care, rather than community health improvement. Sapirstein et al [ 9 ] found no evidence of dramatic shifts in community benefit spending from 2014 to 2019. Hospitals may be mentioning community-related themes more but may not actually take additional action or allocate funding to new community benefits themes. Governing authorities such as the IRS could better scrutinize these hospital statements, such as the inconsistency in Table 2 , where 1 hospital says that improvement of air quality does not align with a hospital’s mission, whereas a different hospital says that it provides air filters to a housing authority. Similar to the recommendation of Chen et al [ 12 ] for improved reporting of efforts to address homelessness in F990H, the IRS should clarify that providing air filters to meet a community need is mission-aligned and credit hospitals reporting efforts such as this on F990H [ 12 ].
Although it is exciting to see that hospitals use more language related to health equity and disparities, it cannot be presumed that there is a corresponding increase in activities related to these needs; mentions in text alone cannot prove action. Policy makers should consider additional language in F990H that requires a clear description of health equity and disparities, including explicit recognition of work on SDOH as a community benefit [ 10 , 11 ]. The Government Accountability Office has called for updating Form 990, including Schedule H, to more clearly, consistently, and comprehensively describe community benefit activities, as well as for Congress to specify which services and activities are sufficient to meet community benefit standards [ 5 ]. These updates could be used to improve auditing of community benefits, contribute to efforts to score hospitals on community benefits programming, or highlight innovative hospitals providing exemplary community benefits. Greater transparency, documentation of activities, and community benefits–specific IRS audit processes for F990H could lead to increased accountability and action by hospitals to address community health equity and disparities. In their study, Chen et al [ 12 ] also recommended that the IRS seek to ensure greater alignment of F990H activities with CHNAs [ 12 ]. The study by Carlton and Singh [ 40 ] found that joint CHNAs with hospital-local health department collaboration encouraged greater hospital investment in community health improvement activities. Further research can use text analytics to explore the programs that hospitals describe in F990H and assess their alignment with implementation plans in CHNAs.
We created a health equity and disparities term list and showed increased use of terms across all 29 themes by hospital reporting entities in free-response text submitted annually from 2010 to 2019 in F990H. We found variations across years and geographies. We suggest that hospitals demonstrate an increased awareness of health equity and disparities yet also show potential misalignment with public interests, as demonstrated through Google Trends and varying changes in action or programming with semantic search. Further research can continue to explore the degree to which hospitals have satisfactorily addressed community needs, as described in the free-response text. Policy changes to the F990H could improve transparency and accountability related to hospital community benefit efforts to address health equity and disparities.
The authors would like to thank Mark Rukavina and colleagues at the Community Catalyst and the staff at the Robert Wood Johnson Foundation for their inputs. Funding for this work was provided by the Robert Wood Johnson Foundation grant 77387.
Conflicts of Interest
Details the requirements of Form 990 Schedule H Part V Section C and Schedule H Part VI.
Additional details on the Google Trends methodology.
Details of the percentage of hospital reporting entities with one or more uses of a word or phrase in each theme in a given tax year. The figures are sorted according to the percentage of hospitals using one or more terms in 2019.
The figures provide additional details of the geographic variation for each theme by illustrating the percentage of hospital reporting entities by state with one or more uses of a particular theme.
- Ortiz A. Tracking community benefit spending. The Medical Care Blog. 2021. URL: https://www.themedicalcareblog.com/community-benefit-spending/ [accessed 2022-08-25]
- Braveman P. What are health disparities and health equity? We need to be clear. Public Health Rep 2014 Jan;129 Suppl 2(Suppl 2):5-8 [ FREE Full text ] [ CrossRef ] [ Medline ]
- Frequently asked questions, NCHHSTP social determinants of health. Centers for Disease Control and Prevention. URL: https://www.cdc.gov/nchhstp/socialdeterminants/faq.html [accessed 2022-08-29]
- Atkeson A. How states can hold hospitals accountable for their community benefit expenditures. National Academy for State Health Policy. 2021 Mar 15. URL: https://www.nashp.org/states-can-hold-hospitals-accountable-for-their-community-benefit-expenditures/ [accessed 2022-08-29]
- Tax administration: opportunities exist to improve oversight of hospitals' tax-exempt status. U.S. Government Accountability Office. 2020 Sep 17. URL: https://www.gao.gov/assets/gao-20-679.pdf [accessed 2022-02-01]
- Schneider H, Yilmaz H. Hospital community benefits and the effect of Schedule H: a difference-in-difference approach. Health 2013 Oct;5(10):1681-1688 [ FREE Full text ] [ CrossRef ]
- Young GJ, Chou CH, Alexander J, Lee SY, Raver E. Provision of community benefits by tax-exempt U.S. hospitals. N Engl J Med 2013 Apr 18;368(16):1519-1527 [ FREE Full text ] [ CrossRef ]
- Rubin DB, Singh SR, Young GJ. Tax-exempt hospitals and community benefit: new directions in policy and practice. Annu Rev Public Health 2015 Mar 18;36:545-557. [ CrossRef ] [ Medline ]
- Sapirstein A, Rao A, Steimle LN. Alignment of community benefit spending and initiatives to improve community health: is there evidence of progress? medRxiv 2022 Aug [ FREE Full text ] [ CrossRef ]
- Rozier MD. Nonprofit hospital community benefit in the U.S.: a scoping review from 2010 to 2019. Front Public Health 2020 Mar 11;8:72 [ FREE Full text ] [ CrossRef ] [ Medline ]
- Rozier M, Goold S, Singh S. How should nonprofit hospitals' community benefit be more responsive to health disparities? AMA J Ethics 2019 Mar 01;21(3):E273-E280 [ FREE Full text ] [ CrossRef ] [ Medline ]
- Chen KL, Chen K, Holaday LW, Lopez 3rd L. Assessing concordance across nonprofit hospitals' public reporting on housing as a community health need in the era of the affordable care act. J Public Health Manag Pract 2022 Mar;28(2):E615-E618 [ FREE Full text ] [ CrossRef ] [ Medline ]
- Hirschberg J, Manning CD. Advances in natural language processing. Science 2015 Jul 17;349(6245):261-266. [ CrossRef ] [ Medline ]
Edited by A Mavragani; submitted 15.11.22; peer-reviewed by K Chen, D Valdes, M Torii; comments to author 20.12.22; revised version received 13.03.23; accepted 14.04.23; published 24.05.23
©Emily Hadley, Laura Haak Marcial, Wes Quattrone, Georgiy Bobashev. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 24.05.2023.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.