Machine Learning for Blockchain Data Analysis: Progress and Opportunities

Blockchain technology has rapidly risen to mainstream attention, while its publicly accessible, heterogeneous, massive-volume, and temporal data are reminiscent of the complex dynamics encountered during the last decade of big data. Unlike any prior data source, blockchain datasets encompass multiple layers of interactions across real-world entities, e.g., human users, autonomous programs, and smart contracts. Furthermore, blockchain’s integration with cryptocurrencies has introduced financial aspects of unprecedented scale and complexity such as decentralized finance, stablecoins, non-fungible tokens, and central bank digital currencies. These unique characteristics present both opportunities and challenges for machine learning on blockchain data.

On the one hand, we examine state-of-the-art solutions, applications, and future directions for leveraging machine learning in blockchain data analysis tasks that are critical to improving blockchain technology, such as e-crime detection and trend prediction. On the other hand, we shed light on the pivotal role of blockchain in providing vast datasets and tools that can catalyze the growth of the evolving machine learning ecosystem. This paper serves as a comprehensive resource for researchers, practitioners, and policymakers, offering a roadmap for navigating this dynamic and transformative field.

1 Introduction

Blockchain, originally designed as the underlying technology for cryptocurrencies, e.g., Bitcoin Nakamoto (2008), has evolved into a robust framework for recording and verifying transactions. Its inherent features, including decentralization and cryptographic security, make it an ideal candidate for myriad applications beyond finance, such as the internet of things, healthcare, and smart cities. One of the most intriguing aspects of blockchain is its ability to generate vast and publicly accessible datasets, containing records of transactions involving diverse real-life entities and autonomous agents.

Simultaneously, the field of machine learning (ML) is experiencing an exponential surge in its application to data analysis across domains, thanks to deep neural methods and artificial general intelligence. ML and deep learning algorithms, capable of discerning patterns, trends, and anomalies within vast datasets, have proven indispensable for extracting meaningful insights and enabling predictions from complex data in an automated and end-to-end manner.

The importance of blockchain is increasingly felt: the United Nations, through its Innovation Fund, has committed substantial resources ($35M + 2267 ETH + 8 BTC) to explore and develop blockchain technologies for creating transparent, efficient systems and rethinking problem-solving approaches in enhancing lives and developing communities Chapiro et al. (2021). Our exploration reveals that “Machine Learning for Blockchain Data Analysis” has emerged as a vibrant and influential field since 2018, with more than 1750 publications dedicated to this field in the ACM Digital Library.

We apply rigorous criteria to select and evaluate papers that contribute the most to the “ML for Blockchain Data Analysis” field. They encompass factors such as the relevance of the research, the significance of the problem addressed, the quality of the methodology employed, and the impact of the findings on the broader artificial intelligence community. Our search particularly focused on articles that analyzed and built models for data from a public blockchain such as Bitcoin, Ethereum, Litecoin, Eosio, Ripple, Monero, Zcash, and Dash.

Contributions and Roadmap. Our survey offers several key contributions to the field. First, it provides a comprehensive taxonomy (§ 2 ) and overview (§ 4 ) of the latest advancements in “ML for Blockchain Data Analysis” since 2018, offering insights into the state of the art. Second, in § 5 we discuss how the datasets and tools we have highlighted can significantly facilitate future ML research, benchmarking, and the development of innovative applications in the field. Additionally, we discuss the unique challenges (§ 3 ) and opportunities (§ 6 ) inherent in this domain, shedding light on areas that require further exploration and innovation. Ultimately, our survey aims to guide researchers, practitioners, and policymakers in harnessing the potential of machine learning within the blockchain ecosystem, promoting user-friendly, explainable, and responsible data analysis practices. To the best of our knowledge, this is the first comprehensive survey that covers all five areas of ML on blockchains (see Table  1 ).

We discuss our taxonomy of machine learning methods (§ 2.1 ), blockchain components (§ 2.2 ), data models (§ 2.3 ), and applications of blockchain data analysis (§ 2.4 ).

2.1 Machine Learning Methods

The integration of machine learning is unlocking new potential in blockchain data analysis and decision-making Khan and Akcora ( 2022 ) . ML approaches, including graph-based learning, recurrent neural networks (RNN), and transformers, have become pivotal in extracting insights from blockchain’s complex and varied data structures. These methods enable a nuanced understanding of blockchain components, such as transaction networks and smart contracts, by identifying patterns and anomalies that might otherwise remain obscured.

Graph ML approaches such as unsupervised methods, graph embedding, and graph neural networks, e.g., graph convolutional neural networks (GCNs) and graph attention networks (GATs) Xia et al. (2021), are essential for analyzing complex network structures. Sequential ML methods, e.g., RNNs and transformers, are adept at processing sequential data Wen et al. (2023) and are thus crucial for transaction analysis. Code ML techniques for smart contract analysis focus on interpreting code and bytecode Pierro et al. (2020). Temporal ML handles time-sensitive data, revealing trends, prices, and patterns over time Benidis et al. (2023). Lastly, Text ML, particularly NLP on social media posts, offers insights into public perception and interactions regarding blockchains Rouhani and Abedin (2020). The categories are not mutually exclusive, e.g., temporal graph learning combines graph ML and temporal ML; it has been exploited in cryptocurrency e-crime detection Akcora et al. (2021).

2.2 Blockchain Components

The key blockchain components include the transaction network, which records asset (e.g., cryptocurrency) movements; token networks, which manage the distribution and interactions of various tokens; and smart contracts, which are automated agreements encoded directly in the blockchain. Additionally, the peer-to-peer (P2P) network underpins the decentralized nature of blockchains, allowing direct interactions among users. User accounts represent individuals or entities with their transaction histories and balances. A decentralized application (dApp) combines one or more smart contracts to support a certain functionality on a distributed, peer-to-peer network; for example, decentralized finance (DeFi) applications are dApps for financial services. One may also consider external sources, including social media data, online blogs, cryptocurrency prices, Google Trends, etc., to mine public sentiments and trends about blockchains. For a detailed survey on blockchain components, we refer to Khan (2022).

2.3 Blockchain Data Models

The data model for blockchain analysis in ML includes i) simple graphs that illustrate basic peer-to-peer connections, ii) temporal graphs that capture changes across time, iii) attributed graphs where nodes and edges carry distinct properties and iv) weighted graphs with varying importance assigned to connections. Furthermore, directed graphs indicating transaction directions, dynamic graphs reflecting evolving relationships, stream graphs representing continuous data flows, and higher-order graphs offering a multi-dimensional perspective on interactions, have been considered  Akcora et al. ( 2022 ) .

Another aspect of the data model is the analysis of smart contract code, which is essential for understanding the functional mechanics of blockchain systems  Bartoletti et al. ( 2020 ) . This includes both the source code, which offers insights into the logic and rules governing the contracts; and the bytecode, which is the executable form deployed on the blockchain. Furthermore, analyzing text data from transaction descriptions, user comments, and other textual inputs provides a unique perspective on user behaviors and social dynamics within the blockchain ecosystem. The integration of these varied data types, including sequential data models, e.g., time series, is indispensable for a comprehensive analysis. This integration not only helps in decoding the current state of the blockchain but also in forecasting future trends. We shall highlight graph, time series, and smart contract code data models, as well as their combinations in our survey.

2.4 Applications of Blockchain Data Analysis

Blockchain data analysis has diverse applications pivotal to the advancement of blockchain technology. This domain facilitates predictive analytics in financial cryptocurrency markets and anomaly detection within blockchain networks Li et al. ( 2020a ) . Furthermore, the field is useful in identifying and mitigating financial crimes, including ransomware, money laundering, darknet markets, and Ponzi schemes Wu et al. ( 2023 ) . Additionally, blockchain data analysis is key in address/transaction clustering and scrutinizing code for duplicates or malicious contents, thus enhancing the security and integrity of blockchain systems.

3 Challenges of Machine Learning for Blockchain Data Analysis

In the realm of blockchain technology, a complex web of challenges emerges from the technology itself, its usage, its control mechanisms, the nature of its data, and the ML methods employed.

Blockchain Technology. A fundamental aspect of all public blockchains is the anonymous nature of blockchain addresses. The anonymity allows fast and easy access to blockchain for users, but it also presents a significant hurdle when tracking addresses and analyzing transaction patterns. A second technological challenge in blockchain arises from the fact that only the compiled binary of smart contract code is visible on the blockchain. This limited visibility restricts our understanding of the underlying source code, obscuring the logic and potential vulnerabilities of these contracts. This opacity is a significant concern for ensuring the integrity and security of the blockchain network, as it hinders comprehensive auditing and analysis of smart contracts.

Blockchain Usage. A blockchain is characterized by the dynamic nature of its data. With new transactions arriving in blocks every 15 seconds (as seen on Ethereum  Wood ( 2018 ) ) to 10 minutes (as on Bitcoin  Nakamoto ( 2008 ) ), the data is in a constant state of evolution. This poses a significant challenge in maintaining updated and relevant analyses in real-time. The sheer volume of this data, compounded by its sparse and graph-like structure, exacerbates computational and analytical difficulties. Additionally, the complexity is further intensified by coin-mixing schemes  Wu et al. ( 2022a ) , which deliberately muddle the process of tracking transaction flows, often to obscure the origins of funds for purposes such as coin-laundering  Akcora et al. ( 2020 ) .

Blockchain Control Mechanisms. The open and decentralized nature of blockchains, while one of its strengths, also invites a range of adversarial behaviors. This includes long-range attacks and manipulations, challenging the system’s integrity and reliability. The lack of a centralized review mechanism for both code and users in the blockchain further heightens these risks, leaving the network vulnerable to malicious smart contracts and abusive users.

Blockchain Data. Data-related challenges in blockchains are multifaceted. When utilizing labeled data in blockchain analysis, the rarity of the positive class (such as instances of ransomware or money laundering) compared to the vast size of the networks results in a significant bias in the methods employed. Such a skewed distribution can lead to misleadingly high accuracy metrics. The scarcity of verified, reliable ground truth data hampers the development and validation of robust analytical models. Furthermore, the challenge of train-test mismatch in blockchain analytics is accentuated by the ever-evolving nature of blockchains, which are frequently impacted by real-world events such as government regulations or bans Xie ( 2019 ) . These external influences can significantly alter the nature of the data within a given period, leading to a scenario where the blockchain’s state during the training phase may be different from that in the testing phase. This divergence between training and testing data distributions severely compromises the accuracy and generalizability of models, presenting a substantial obstacle to the effectiveness of machine learning applications in blockchain analysis.
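The effect of a rare positive class on accuracy can be illustrated with a minimal sketch. The counts below (20 illicit addresses among 10,000) are hypothetical numbers chosen for illustration and are not taken from any dataset cited in this survey.

```python
# Hypothetical label distribution: 20 illicit addresses among 10,000.
n_total = 10_000
n_illicit = 20

y_true = [1] * n_illicit + [0] * (n_total - n_illicit)
# A trivial "model" that labels every address as licit.
y_pred = [0] * n_total

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_total   # fraction of all addresses labeled correctly
recall = true_pos / n_illicit  # fraction of illicit addresses that were found

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The trivial classifier reaches 99.8% accuracy while detecting no illicit address, which is why class-sensitive metrics such as recall or precision are more informative in this setting.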

ML Models. The challenges extend into the domain of machine learning methods used for blockchain data analysis. The “black-box” neural models, particularly deep learning, raise concerns about explainability and interpretability. These are critical issues in a field that demands transparency and accountability to comply with financial regulations. Inherent biases in ML algorithms pose risks of unfairness, contradicting the ethos of blockchain technology. Furthermore, the high computational demands, including extensive training and inference times and the need for large volumes of labeled training data, present substantial challenges, especially when data is often scarce, dynamic, and unlabeled.


4 Survey: Blockchain Data Models, Machine Learning Methods, and Applications

We primarily investigate three non-exclusive ML approaches: graph machine learning (§ 4.1 ), temporal machine learning (§ 4.2 ), and machine learning for smart contracts (§ 4.3 ). We survey their methods for blockchain data analysis, respective data models, and applications. A schematic diagram connecting various articles in our survey is illustrated in Figure 1 .

4.1 Graph Machine Learning on Blockchains

4.1.1 Graph Data Models

UTXO Data Models. Blockchain technology, which started with Bitcoin, utilizes a distinctive data structure known as an “output” that contains an address and an amount. Such blockchains are referred to as UTXO (Unspent Transaction Output) blockchains. An address is a unique string representation of the holder within the transaction network. Bitcoin transactions, in which a later transaction consumes one or more earlier outputs to generate new outputs, can effectively be modeled as a heterogeneous graph comprising two primary node types: addresses and transactions. However, a significant challenge arises with most graph libraries, e.g., NetworkX Hagberg et al. (2008), which are designed to handle graphs with a single node type. This limitation has led researchers to frequently model the Bitcoin transaction network as either an address graph Spagnuolo et al. (2014) by omitting transactions, or a transaction graph Ron and Shamir (2013) by omitting addresses. Specifically, both the address graph and the transaction graph are edge-weighted, directed graphs whose nodes represent their respective namesakes and whose directed edges record the flow of coins. An edge weight represents the amount of coins transferred.
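The two projections can be sketched on a toy example. The transaction records, the function names, and the rule for splitting amounts across multiple inputs (proportional to each input's share) are all assumptions made for illustration; the cited works may construct edge weights differently.

```python
from collections import defaultdict

# Hypothetical toy ledger. Each input references (previous_tx, output_index);
# each output is an (address, amount) pair. Fees are ignored.
transactions = {
    "tx1": {"inputs": [], "outputs": [("A", 5.0)]},  # coinbase-like
    "tx2": {"inputs": [("tx1", 0)], "outputs": [("B", 3.0), ("C", 2.0)]},
    "tx3": {"inputs": [("tx2", 0)], "outputs": [("D", 3.0)]},
}

def spent_output(txs, ref):
    """Return the (address, amount) output that an input reference consumes."""
    prev_tx, index = ref
    return txs[prev_tx]["outputs"][index]

def transaction_graph(txs):
    """Edge (u, v): transaction v spends an output of u; weight = amount spent."""
    edges = defaultdict(float)
    for tx_id, tx in txs.items():
        for ref in tx["inputs"]:
            _, amount = spent_output(txs, ref)
            edges[(ref[0], tx_id)] += amount
    return dict(edges)

def address_graph(txs):
    """Edge (a, b): coins flow from address a to b; transactions are omitted.

    Assumption: each output amount is attributed to the input addresses in
    proportion to their share of the total input value.
    """
    edges = defaultdict(float)
    for tx in txs.values():
        spent = [spent_output(txs, ref) for ref in tx["inputs"]]
        total_in = sum(amount for _, amount in spent)
        for src, in_amount in spent:
            for dst, out_amount in tx["outputs"]:
                edges[(src, dst)] += out_amount * in_amount / total_in
    return dict(edges)

print(transaction_graph(transactions))
print(address_graph(transactions))
```

Both results are dictionaries keyed by directed edges, which matches the edge-weighted, directed graph view described above and can be loaded into a single-node-type graph library.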

Account Data Models. The emergence of Ethereum introduced a shift in blockchain data models. Unlike Bitcoin, Ethereum employs an account-based model that eschews the output data structure. Instead, the representation shifts to a graph of address nodes. A key feature of these networks is the variety of edge types, which can represent different forms of value transfer, such as the native cryptocurrency (Ether), tokens, or other user-defined assets. This complexity transforms the network into a multiplex network  Dickison et al. ( 2016 ) , where address nodes are shared, but the edges differ in their types and meanings. Therefore, these graphs are categorized as directed, edge-weighted multigraphs.
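A minimal sketch of the multiplex view follows, with one layer per asset type over a shared set of address nodes. The transfer records and the function name are hypothetical. For brevity, parallel transfers of the same asset between the same pair of addresses are summed into one weighted edge, so the multigraph structure appears across layers rather than within them.

```python
from collections import defaultdict

# Hypothetical transfers: (sender, receiver, asset, amount).
transfers = [
    ("0xA", "0xB", "ETH", 1.5),
    ("0xA", "0xB", "USDT", 200.0),
    ("0xB", "0xC", "ETH", 0.5),
    ("0xA", "0xB", "ETH", 0.5),
]

def build_layers(records):
    """Group directed, weighted edges by asset type (one layer per asset)."""
    layers = defaultdict(lambda: defaultdict(float))
    for src, dst, asset, amount in records:
        layers[asset][(src, dst)] += amount
    return {asset: dict(edges) for asset, edges in layers.items()}

layers = build_layers(transfers)
for asset, edges in layers.items():
    print(asset, edges)
```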

Moreover, the application of hypergraphs  Antelmi et al. ( 2023 ) presents a new dimension in modeling blockchain transactions, particularly beneficial in e-crime scenarios where coins flow between seemingly different addresses which are, in reality, owned by the same user. For instance, in coin mixing networks such as Tornado Cash  Wu et al. ( 2022b ) , the flow of coins creates a hyper-edge that connects more than two nodes, providing a more nuanced view of asset transfers in such systems.

4.1.2 Graph Machine Learning Methods

We categorize the discussion based on unsupervised and supervised graph ML, as well as techniques to scale graph ML.

Unsupervised Learning. The evolution of blockchain analytics has been significantly influenced by the application of unsupervised learning techniques. Initial research in this domain mainly focused on examining transaction patterns within blockchain networks to understand the flow of digital currencies, identify trends, and detect anomalies  Ron and Shamir ( 2013 ) . This analysis typically included studying aspects such as transaction volumes, frequency, and the interrelationships between different addresses   Lee et al. ( 2020 ) .

As the research progressed, a shift towards more address and transaction-centric views emerged. Address clustering, aiming to deduce which addresses are controlled by the same user, gained considerable attention  Victor ( 2020 ); Harrigan and Fretter ( 2016 ) . Address clustering employs various heuristics that exploit the characteristics of UTXO transactions. This process is largely unsupervised and focuses on linking entities behind blockchain addresses. Clustering plays a crucial role in identifying and understanding address behaviors and transaction patterns Spagnuolo et al. ( 2014 ) . Similar unsupervised analyses have been performed on reportedly “anonymous” cryptocurrencies, e.g., Monero Möser et al. ( 2017 ) , Zcash Kappos et al. ( 2018 ) , and a diverse set of cryptocurrency ledgers Yousaf et al. ( 2019 ) .
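One widely known clustering heuristic for UTXO transactions is the multi-input (common-input-ownership) heuristic, which assumes that all input addresses of a transaction are controlled by the same user. The sketch below implements it with a union-find structure. It is an illustration under that assumption, not the exact procedure of any work cited above, and the input lists are hypothetical.

```python
def cluster_addresses(tx_inputs):
    """Cluster addresses that appear together as inputs of a transaction.

    tx_inputs: list of lists, each holding the input addresses of one
    transaction. Returns a sorted list of sorted address clusters.
    """
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in tx_inputs:
        for addr in inputs:
            find(addr)  # register every address, even singletons
        for addr in inputs[1:]:
            union(inputs[0], addr)

    clusters = {}
    for addr in parent:
        clusters.setdefault(find(addr), set()).add(addr)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical input lists of four transactions.
example = [["a1", "a2"], ["a2", "a3"], ["b1"], ["b1", "b2"]]
print(cluster_addresses(example))
```

The heuristic is known to break down for transactions that deliberately combine inputs from different users, such as coin-mixing transactions, so its output should be treated as a hypothesis about ownership rather than ground truth.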

Supervised Learning. The advent of public datasets, e.g., Elliptic  Weber et al. ( 2019 ) signified a pivotal moment in the realm of blockchain graph machine learning, providing a rich source of labeled data. This marked a transition towards more supervised learning approaches, broadening the scope and precision of blockchain data analysis. We categorize these supervised methods into three classes: graph features extraction, graph embeddings, and graph neural networks.

Graph Features Extraction. Harlev et al. Harlev et al. (2018) first use unsupervised clustering on the transaction graph to link Bitcoin addresses owned by the same user. Next, supervised machine learning based on cluster features is employed to de-anonymize entities on the Bitcoin blockchain. This approach relies on known data about entities whose identities were previously exposed to form a training dataset, thereby reducing the level of anonymity inherent in Bitcoin transactions. Supervised learning has also been effectively used in detecting blacklisted addresses in the Ethereum blockchain Kılıç et al. (2022). The approach uses both local and global features extracted from the Ethereum transaction graph to train various machine learning models. The method’s pipeline employs resampling techniques such as random undersampling and SMOTE Chawla et al. (2002) to address label scarcity.
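Random undersampling, one of the resampling techniques named above, can be sketched as follows. The function name and the toy data are hypothetical, and the sketch is not the pipeline of Kılıç et al.; SMOTE, which instead synthesizes new minority samples by interpolating between neighbors, is not shown.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Downsample every class to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):  # sample without replacement
            balanced.append((x, y))
    rng.shuffle(balanced)
    return balanced

# Hypothetical data: 5 blacklisted (label 1) and 95 benign (label 0) addresses.
samples = list(range(100))
labels = [1] * 5 + [0] * 95
balanced = random_undersample(samples, labels)
print(len(balanced), sum(y for _, y in balanced))
```

Undersampling discards most majority-class examples, which is acceptable when benign addresses are abundant but can remove informative negatives.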

Graph Embeddings. Graph embeddings map each node in a graph to a low-dimensional vector, e.g., for supervised node classification, which has been pivotal in detecting phishing activities within blockchain networks. Yuan et al.  Yuan et al. ( 2020 ) introduce a graph-based classification framework leveraging an improved Graph2Vec algorithm to analyze Ethereum transaction networks for this purpose. The paper’s focus on Ether flow in phishing scams integrates this aspect into the machine learning model, enhancing phishing detection capabilities. Similarly, Wang et al.  Wang et al. ( 2021 ) develop the transaction subgraph network model to identify phishing accounts in the Ethereum blockchain, utilizing a directed version of the model that retains transaction flow information crucial for identifying such illicit activities.

Graph Neural Networks. GNNs are deep learning models developed for graph-related tasks in an end-to-end manner. A notable contribution in this domain is the work on detecting Ponzi schemes within the Ethereum blockchain  Yu et al. ( 2021b ) . Here, a model based on a graph convolutional network is developed to classify nodes in the Ethereum transaction network as Ponzi or non-Ponzi. This approach demonstrates the efficacy of supervised learning in identifying fraudulent schemes by examining the topological structure and transactional characteristics of smart contracts. The development of graph attention network models to identify abnormal transactions in dynamically generated data is also a key area where supervised learning has shown great promise. Yu et al.  Yu et al. ( 2021a ) introduce a GAT approach, focusing on exploiting the graph structure of transactions. The method’s dynamic graph handling capability and weight assignment to nodes based on their relevance to abnormal transactions offer advanced capabilities.
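For reference, the widely used layer-wise propagation rule of a graph convolutional network can be written as below. This is the standard formulation, given as background only; the models surveyed here may deviate from it. In our notation, A is the adjacency matrix of the transaction graph, I the identity matrix, H^{(l)} the matrix of node representations at layer l (with H^{(0)} the input node features), W^{(l)} a trainable weight matrix, and σ a nonlinearity.

```latex
H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2} \, \tilde{A} \, \tilde{D}^{-1/2} \, H^{(l)} \, W^{(l)} \right),
\qquad \tilde{A} = A + I,
\qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}
```

Each layer therefore averages (with degree normalization) the representations of a node and its neighbors, so stacking k layers lets a node's prediction depend on its k-hop neighborhood.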

Moreover, the concept of anomaly detection in Ethereum’s blockchain network has been explored. Patel et al.  Patel et al. ( 2020 ) employ the “one-class” graph neural network capturing complex relationships and interactions between accounts for more effective identification of anomalous patterns. Analogously, the paper by Patel et al.  Patel et al. ( 2022 ) develops EvAnGCN, a dynamic GCN for detecting anomalous behaviors in blockchain networks by structuring the data as temporal graphs. This model efficiently learns from the dynamic and evolving structures of blockchain networks, utilizing both temporal and structural features.

Furthermore, the identification of illicit Bitcoin addresses has been enhanced through the integration of structure and temporal information of Bitcoin transactions. Tian et al.  Tian et al. ( 2021 ) develop an attention-based graph neural network that refines address embeddings through neighbor embedding and attention mechanisms. An LSTM-based auto-encoder is used to capture hidden temporal features from transaction records, augmenting identification accuracy.

Scaling Graph Machine Learning. Scaling graph machine learning on blockchains is crucial for handling the vast and continuously growing volume of data within transaction networks. For example, Bitcoin has ≈700,000 unique addresses daily in 500,000 transactions (source: https://www.blockchain.com/charts/n-unique-addresses). Examining the Bitcoin transaction network for even a single day poses a computationally demanding challenge for graph neural networks which are considered state-of-the-art in a multitude of predictive tasks, such as node classification Yang et al. (2023).

In their initial efforts to analyze large graphs, researchers typically focus on extracting information from the local neighborhoods of nodes. Kılıç et al. employ easily calculable features, including neighbor counts and the time difference between the first and last transactions of a given address  Kılıç et al. ( 2022 ) . If computing power permits, e.g., using parallel computing, researchers may extend their analysis to higher-hop neighborhoods  Yu et al. ( 2021a ) .
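Local features of this kind can be computed in a single pass over the transfers. The sketch below is illustrative: the record layout, the function name, and the feature names are assumptions and do not reproduce the feature set of Kılıç et al.

```python
def local_features(address, transfers):
    """Compute simple 1-hop features for one address.

    transfers: iterable of (sender, receiver, timestamp) tuples.
    """
    in_neighbors, out_neighbors, times = set(), set(), []
    for sender, receiver, ts in transfers:
        if sender == address:
            out_neighbors.add(receiver)
            times.append(ts)
        if receiver == address:
            in_neighbors.add(sender)
            times.append(ts)
    return {
        "in_degree": len(in_neighbors),    # distinct senders
        "out_degree": len(out_neighbors),  # distinct receivers
        # Time between the first and last transaction of the address.
        "active_span": max(times) - min(times) if times else 0,
    }

# Hypothetical transfers with integer timestamps.
transfers = [("A", "B", 100), ("C", "A", 250), ("A", "D", 400), ("B", "D", 500)]
features = local_features("A", transfers)
print(features)
```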

One common scaling approach is node sampling. This technique has been widely employed to manage large transaction networks. For instance, Harlev et al. classify entities based on transactional behaviors without necessitating analysis of the entire network Harlev et al. (2018). Similarly, Yu et al. identify Ponzi schemes within the Ethereum blockchain by node sampling to create subgraphs for analysis Yu et al. (2021b). The authors randomly sample centered contracts to obtain their first-order neighbors, significantly reducing the computational load. Another scaling strategy involves the use of subgraph sampling, where transaction subgraphs are extracted and analyzed. This is evident in the work of Yu et al., where a GAT model for dynamic graphs relies on the structure of the sampled edges rather than requiring a complete graph for analysis Yu et al. (2021a). This method is particularly effective in processing dynamic graph structures and adapting to real-time transaction data.
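Sampling center nodes together with their first-order neighbors can be sketched as follows. The edge list and function names are hypothetical, and the sketch keeps the subgraph induced on the sampled nodes, which is one of several reasonable choices and not necessarily the one made by the cited authors.

```python
import random

def sample_centers(candidates, k, seed=0):
    """Randomly pick k center nodes (e.g., contracts) to seed the subgraph."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(candidates), k))

def first_order_subgraph(edges, centers):
    """Keep the centers, their direct neighbors, and all edges among them."""
    nodes = set(centers)
    for u, v in edges:
        if u in centers:
            nodes.add(v)
        if v in centers:
            nodes.add(u)
    sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, sub_edges

# Hypothetical directed edge list; "c1" and "c2" play the role of contracts.
edges = [("c1", "a"), ("b", "c1"), ("c2", "d"), ("d", "e"), ("a", "b")]
centers = sample_centers({"c1", "c2"}, k=1)
nodes, sub_edges = first_order_subgraph(edges, centers)
print(centers, nodes, sub_edges)
```

Only the neighborhoods of the sampled centers are materialized, so the cost of downstream learning depends on the sample size rather than on the size of the full network.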

4.1.3 Open Questions and Challenges

Graph machine learning for blockchains faces several critical challenges. Label scarcity is a prominent and well-known issue. An under-reported issue is undisclosed e-crime transactions (e.g., ransomware payments), which may create false positives in node classification tasks. The scale of blockchain graphs presents a computational hurdle, demanding efficient algorithms and scalable systems. Real-time analysis is crucial because blockchain data evolves rapidly, and latency in detecting anomalies can cause billions of dollars in lost value (e.g., in the Terra-Luna collapse). Integrating machine learning across multiple blockchains is complex, involving data heterogeneity and interoperability challenges (e.g., in UTXO-account data integration). Detecting data shifts within blockchain graphs is essential for maintaining model accuracy as usage patterns by ordinary users, as well as e-crime operators, change. Tackling these challenges is essential for harnessing machine learning’s potential in blockchain data analysis.

4.2 Temporal Machine Learning on Blockchains

The integration of ML with blockchain’s temporal data offers unique opportunities for enhanced security, predictive analytics, and understanding dynamic market behaviors.

4.2.1 Temporal Data Models

Temporal data on blockchains offer a rich variety, including time series of crypto asset prices; temporal, multilayer graphs of transaction and asset networks; discrete and continuous dynamic graphs; and graphs with temporal node and edge features. The market volumes of native coins have reached billions of dollars. Hence, the most critical temporal data relates to the price of the native coins, such as Ether on the Ethereum network, denominated in fiat currency. The price data also exists for a subset of crypto assets on blockchains, such as tokens on Ethereum due to global trading activities, thereby establishing an external pricing dataset. Transaction and asset trading networks provide temporal transaction data in the form of networks where both node and edge attributes, as well as edge types, may change. When a blockchain has a short block creation interval (e.g., Ethereum’s ≈12 sec gap between two blocks), the network can be effectively modeled as an (almost) continuous-time dynamic graph.
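The difference between the continuous and discrete views can be made concrete with a short sketch that bins timestamped edges into fixed-length windows, yielding a discrete-time dynamic graph as a sequence of snapshots. The edge list, the window length, and the function name are hypothetical.

```python
from collections import defaultdict

def to_snapshots(edges, window):
    """Bin (src, dst, timestamp) edges into snapshots of `window` time units.

    Returns {snapshot_index: [(src, dst), ...]}; empty windows are omitted.
    """
    snapshots = defaultdict(list)
    for src, dst, ts in edges:
        snapshots[ts // window].append((src, dst))
    return dict(sorted(snapshots.items()))

# Hypothetical timestamped edges (seconds); window of 12 s, i.e., roughly one
# Ethereum block interval.
edges = [("A", "B", 5), ("B", "C", 11), ("C", "D", 14), ("A", "D", 30)]
snapshots = to_snapshots(edges, window=12)
print(snapshots)
```

Choosing a coarser window reduces the number of snapshots at the cost of temporal resolution, while keeping the raw timestamped edge list corresponds to the continuous-time view.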

4.2.2 Temporal Machine Learning Methods

Time Series Analysis. Early work in time series analysis for cryptocurrencies used abundant transaction network data to extract predictive signals. Abay et al. Abay et al. ( 2019 ) use Bitcoin graph substructures, called chainlets Akcora et al. ( 2018 ) , to predict Bitcoin prices. Kwon et al.  Kwon et al. ( 2019 ) use the long short-term memory (LSTM) model  Schmidhuber and Hochreiter ( 1997 ) on the historic cryptocurrency price time series data to classify the time series. Livieris et al. use ensemble-averaging, bagging, and stacking with deep learning models for forecasting hourly cryptocurrency prices  Livieris et al. ( 2020 ) .

Unsupervised Learning. The transaction network provides a dynamic dataset abundant in user behavior, enabling the mining of complex patterns. For instance, Alqassem et al. analyze the Bitcoin transaction graph from its inception Alqassem et al. (2018). They observe changes in network diameter, node connectivity, and community structure over time. Their findings include patterns like the densification power law and shrinking diameter. Importantly, they underscore the influence of anonymity-seeking behavior on Bitcoin’s network dynamics. Zhao et al. investigate the evolutionary nature of the Ethereum blockchain network, including its growth rate and the active lifespan of high-degree nodes, detect anomalies based on temporal changes in global network properties, and forecast the survival of network communities Z. et al. (2021). In the context of blockchain selection, Scheid et al. Scheid et al. (2022) introduce an ML-based approach to simplify the selection process for non-technical individuals. The authors present a novel metric to quantify the subjective popularity of blockchain platforms, contributing to the feature set used in their ML model. This work emphasizes the temporal flexibility of their ML model, which adapts over time to new parameters and data.
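The densification power law mentioned above is commonly stated as follows, where n(t) and e(t) denote the number of nodes and edges of the graph at time t (the symbols are our notation). In the graph-evolution literature the exponent typically lies between 1 and 2; the value fitted on the Bitcoin graph is not restated here.

```latex
e(t) \propto n(t)^{\alpha}, \qquad 1 < \alpha < 2
```

An exponent α > 1 means that edges grow superlinearly in the number of nodes, i.e., the average degree increases as the network grows.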

Supervised Learning. Many temporal ML articles study graph ML topics with a temporal view. Alarab et al. divide the popular Elliptic dataset into 49 time-steps, each representing a distinct set of transactions within a three-hour window  Alarab et al. ( 2020 ) . This temporal division of data ensures that the model can handle real-time transaction data and be trained on temporally coherent subsets. Temporal information is also useful in profiling blockchain addresses. Harlev et al. focus on de-anonymizing entities on the Bitcoin blockchain by analyzing transactions over time and extracting useful features, such as transaction patterns and time-series data  Harlev et al. ( 2018 ) . This temporal dimension enables predicting behaviors based on transaction history.
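A temporally coherent train/test split over such time steps can be sketched as follows. The cut-off at time step 34 and the synthetic records are assumptions chosen for illustration, not the protocol of the cited works.

```python
def temporal_split(records, train_until):
    """Train on time steps <= train_until, test on strictly later ones."""
    train = [r for r in records if r["time_step"] <= train_until]
    test = [r for r in records if r["time_step"] > train_until]
    return train, test

# Hypothetical records: two transactions in each of 49 time steps.
records = [{"tx": i, "time_step": 1 + i % 49} for i in range(98)]
train, test = temporal_split(records, train_until=34)
print(len(train), len(test))
```

Unlike a random split, this keeps every test transaction strictly later than every training transaction, so the evaluation reflects how a deployed model encounters future data.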

In e-crime research, the temporal transaction patterns exhibited by operators such as ransomware hackers Akcora et al. (2021) are invaluable. Pocher et al. effectively utilize such patterns by first grouping Bitcoin transactions into distinct time steps and then using a chronological analysis of transaction patterns to find characteristics of e-crime activities Pocher et al. (2023). In anonymity-seeking behavior, users employ a different address for each transaction to maintain their anonymity. This anonymity is further strengthened by coin-mixing services, through which one can launder coins. Wu et al. propose a feature-based network analysis framework to identify such mixing services on Bitcoin Wu et al. (2022a). In their work, temporal motifs are crucial to distinguish normal transactions from those associated with mixing services.

Sequence-based Models. Li et al. focus on identifying illicit Bitcoin addresses by extracting temporal features from the change in the balance of addresses over time Li et al. (2020b). They use an auto-encoder with LSTM to generate discriminating temporal features, enhancing the model’s ability to identify illicit addresses based on temporal patterns. This approach highlights the importance of temporal analysis in distinguishing normal transaction behavior from illicit activities. Lahmiri and Bekiros use LSTM neural networks for predicting cryptocurrency prices Lahmiri and Bekiros (2019). Their model memorizes both long-term and short-term temporal information, which is crucial for predicting the volatile and dynamic nature of cryptocurrency markets. One recent contribution in this field is BlockGPT, a dynamic, real-time approach for detecting anomalous blockchain transactions Gai et al. (2023). This tool is notable for its ability to generate tracing representations of blockchain activity and train an LLM as a real-time intrusion detection system. Unlike traditional methods, BlockGPT does not rely on predefined rules or patterns, which can make it more effective in detecting anomalies in Ethereum transactions.
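The balance-over-time signal that such sequence models consume can be derived from raw transfers as sketched below. The record layout and function name are hypothetical, fees are ignored, and the sketch covers only the construction of the input sequence, not the LSTM auto-encoder itself.

```python
def balance_series(address, transfers):
    """Return [(timestamp, balance)] for one address, in chronological order.

    transfers: iterable of (timestamp, sender, receiver, amount) tuples.
    """
    balance, series = 0.0, []
    for ts, sender, receiver, amount in sorted(transfers, key=lambda t: t[0]):
        if address not in (sender, receiver):
            continue
        if receiver == address:
            balance += amount
        if sender == address:
            balance -= amount
        series.append((ts, balance))
    return series

# Hypothetical transfers, deliberately listed out of order.
transfers = [(3, "X", "A", 2.0), (1, "B", "X", 5.0), (2, "X", "C", 1.5)]
series = balance_series("X", transfers)
print(series)
```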

Graph Neural Networks. Zhuang et al. propose a novel method for detecting vulnerabilities in smart contracts using graph neural networks  Zhuang et al. ( 2021 ) . They introduce a degree-free graph convolutional neural network and a temporal message propagation network for automatic detection. The temporal aspect is central to their approach, considering the sequence of operations and interactions within smart contracts to detect vulnerabilities over time. Liu et al. introduce a method for detecting vulnerabilities in smart contracts by combining graph neural networks with expert knowledge  Liu et al. ( 2021b ) . They transform smart contract source code into a contract graph, focusing on critical nodes through a node elimination phase. A temporal message propagation network is employed to extract graph features, considering the sequential nature of smart contract execution. This approach is pivotal in detecting vulnerabilities by capturing the temporal dynamics of data and control flows within smart contracts. Other notable works include Patel et al. ( 2022 ); Yu et al. ( 2021a ) for detecting anomalous transactions; due to the non-exclusive nature of our categorization, they have been discussed earlier in graph ML (§ 4.1.2 ).

4.2.3 Open Questions and Challenges

Linking temporal data across multiple blockchains (e.g., between Bitcoin and Monero in money laundering) to identify behavior patterns presents a complex challenge. Blockchains operate independently, and cross-chain data analysis requires addressing issues related to data heterogeneity, interoperability, and privacy while uncovering valuable insights into cross-blockchain behaviors. Identifying significant changes or anomalies in temporal blockchain data is critical for understanding and responding to emerging trends or irregularities such as hacked blockchain bridges, seized addresses, and external events  Xie ( 2019 ) . Developing effective change point detection algorithms tailored to blockchain data remains an open question on (sparse) transaction graphs. Another challenge is dealing with data staleness issues. As blockchain data continuously evolves, ensuring that ML models operate on informative and up-to-date information is essential.
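To make the change point detection problem concrete, the sketch below applies a one-sided CUSUM detector to a series of daily transaction counts. CUSUM is a generic textbook technique, not a method tailored to sparse transaction graphs as called for above; the `baseline`, `drift`, and `threshold` values and the toy series are assumptions chosen for illustration.

```python
def cusum_change_points(series, baseline, drift, threshold):
    """Return indices where the upward one-sided CUSUM statistic exceeds the threshold.

    baseline:  expected level of the series under normal behavior (assumed known here)
    drift:     slack subtracted at every step so small fluctuations do not accumulate
    threshold: alarm level for the cumulative statistic
    """
    alarms = []
    s = 0.0
    for i, x in enumerate(series):
        # Accumulate only positive deviations above baseline + drift.
        s = max(0.0, s + (x - baseline - drift))
        if s > threshold:
            alarms.append(i)
            s = 0.0  # restart the statistic after raising an alarm
    return alarms


# Hypothetical daily transaction counts with a burst on days 5 to 7.
daily_tx_counts = [10, 11, 9, 10, 12, 30, 31, 29, 10, 11]
alarms = cusum_change_points(daily_tx_counts, baseline=10.0, drift=2.0, threshold=15.0)
```

In practice the baseline would have to be estimated and updated as the chain evolves, which ties this problem to the data staleness issue noted above.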

4.3 Machine Learning for Smart Contracts

4.3.1 Smart Contract Data Models

We consider four types of smart contract data: transaction, contract state, event log, and source code. Transaction data includes information on each transaction executed on the blockchain, e.g., sender and receiver addresses, and block numbers. Smart contracts have a state, which is essentially the current data stored in the contract. This state includes variables, balances, and other information specific to the contract’s functionality. Events, emitted by contracts, record specific occurrences, such as the completion of a task, or the occurrence of an event-triggering condition. The source code of a smart contract (in bytecode or higher level languages, e.g., Solidity) is another critical element for ML analysis.
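The four data types can be pictured as one record per contract. The sketch below is a hypothetical in-memory layout, not a schema used by any of the surveyed works; all class names, field names, and the placeholder addresses are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    sender: str
    receiver: str
    block_number: int


@dataclass(frozen=True)
class EventLog:
    contract: str
    name: str
    block_number: int
    args: tuple = ()


@dataclass
class ContractRecord:
    address: str
    code: str  # bytecode (hex) or higher-level source, e.g., Solidity
    state: dict = field(default_factory=dict)         # current contract variables
    transactions: list = field(default_factory=list)  # Transaction items
    events: list = field(default_factory=list)        # EventLog items

    def activity_by_block(self):
        """Count transactions and events per block number."""
        counts = {}
        for item in self.transactions + self.events:
            counts[item.block_number] = counts.get(item.block_number, 0) + 1
        return counts


# Placeholder addresses and values, for illustration only.
record = ContractRecord(address="0xC0", code="6001600101")
record.state["owner"] = "0xA1"
record.transactions.append(Transaction("0xA1", "0xC0", 100))
record.transactions.append(Transaction("0xB2", "0xC0", 101))
record.events.append(EventLog("0xC0", "Transfer", 100, ("0xA1", "0xB2", 5)))
activity = record.activity_by_block()
```

Keeping the four sources side by side makes it straightforward to derive features that combine them, such as per-block activity counts.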

4.3.2 Machine Learning Methods for Smart Contracts

Contract Graph Analysis. Ferreira et al. automate detection and investigation of attacks on Ethereum smart contracts, utilizing logic-driven and graph-driven analysis of transactions  Ferreira T. et al. ( 2021 ) . Zhuang et al. construct a contract graph to represent both syntactic and semantic structures of contract functions  Zhuang et al. ( 2021 ) . Liu et al. propose a method that transforms smart contract source code into a contract graph, highlights critical nodes via a node elimination phase, and employs a temporal message propagation network to extract graph features  Liu et al. ( 2021b ) . These features, combined with expert-designed security patterns, contribute to an effective and scalable vulnerability detection system on platforms, e.g., Ethereum and VNT Chain.

Source Code Analysis. Mi et al. propose a metric learning-based deep neural network for vulnerability detection in smart contracts, focusing on analyzing bytecode  Mi et al. ( 2021 ) . Fan et al. detect smart Ponzi schemes in blockchain systems by extracting smart contract features from OpCodes  Fan et al. ( 2021 ) . Qian et al. present a deep learning model, BiLSTM-Attention, for detecting defects in smart contracts, treating contract operation codes as sequential sentences, and utilizing attention mechanisms for accurate detection  Qian et al. ( 2022 ) . Tang et al. identify vulnerabilities by analyzing code snippets of functions  Tang et al. ( 2023 ) .
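Several of the approaches above start from opcode sequences. The sketch below shows one simple way to turn EVM bytecode into an opcode sequence and a frequency vector; it is an illustration, not the feature extraction of any cited paper. The opcode table is a deliberately small subset, unlisted bytes are mapped to the placeholder "UNKNOWN", and the bytecode string is a toy example.

```python
from collections import Counter

# Small subset of the EVM opcode table; a real analysis would need the full table.
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x52: "MSTORE", 0x54: "SLOAD", 0x55: "SSTORE",
    0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST", 0xF1: "CALL",
    0xF3: "RETURN", 0xFD: "REVERT",
}


def disassemble(bytecode_hex):
    """Convert a hex bytecode string into a list of opcode mnemonics."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    ops = []
    i = 0
    while i < len(code):
        byte = code[i]
        if 0x60 <= byte <= 0x7F:
            # PUSH1..PUSH32 carry 1..32 immediate bytes that are data, not opcodes.
            width = byte - 0x5F
            ops.append(f"PUSH{width}")
            i += 1 + width
        else:
            ops.append(OPCODES.get(byte, "UNKNOWN"))
            i += 1
    return ops


def opcode_counts(bytecode_hex):
    """Bag-of-opcodes feature: frequency of each mnemonic."""
    return Counter(disassemble(bytecode_hex))


# Toy bytecode: PUSH1 0x01, PUSH1 0x01, ADD, PUSH1 0x00, SSTORE
ops = disassemble("0x6001600101600055")
counts = opcode_counts("0x6001600101600055")
```

The resulting sequence could be fed to a sequence model, while the counts give a fixed-length vector for classical classifiers. Note that this linear sweep also decodes any data or metadata appended to deployed bytecode as if it were code.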

Community and Transaction Analysis. Huang et al. provide a large-scale analysis of the EOSIO blockchain ecosystem, identifying bot activities at both community-level and account-level  Huang et al. ( 2020 ) . SoliAudit combines ML and fuzz testing for vulnerability assessment using Solidity machine code as learning features and incorporating gray-box fuzz testing  Liao et al. ( 2019 ) . Chen et al. detect Ponzi schemes in Ethereum by extracting features from user accounts and operation codes of contracts  Chen et al. ( 2018 ) .

4.3.3 Open Questions and Challenges

One significant challenge in code machine learning for blockchains is the difficulty in finding the high-level code of smart contracts. Smart contracts often have their bytecode uploaded to the blockchain, making it challenging to access their human-readable source code. Lack of access to high-level code hinders comprehensive analysis and interpretation.

The decentralized and distributed nature of blockchain networks can introduce vulnerabilities, such as reentrancy attacks, that are not found in typical software projects. Analyzing the scripting languages of blockchains for these vulnerabilities requires blockchain domain knowledge as well as a good understanding of how distributed systems work. As a result, coding for blockchains is a challenging software domain.

Additionally, functions and opcodes on blockchains often lack direct equivalents in conventional programming languages, which makes it challenging to apply standard code analysis techniques, as the mapping between blockchain code and traditional code constructs may not be straightforward.

5 Datasets and Tools

Graphs. Blockchain network data have become increasingly valuable in research on financial transactions, network dynamics, and user behavior. The Elliptic dataset  Weber et al. ( 2019 ) stands out with its labeled Bitcoin transaction graph, which has been utilized in GNNs. However, the dataset employs anonymized addresses, and descriptions of node features are not shared due to intellectual property rights issues. The BitcoinHeist dataset shares addresses and labels for about 30K addresses linked to ransomware, facilitating more direct transaction pattern analysis  Akcora et al. ( 2021 ) .

The evolution of blockchain datasets has been notable. Initially, datasets were released in conjunction with academic articles in isolated repositories  Anoaica and Levard ( 2018 ); Liang et al. ( 2018 ); Lee et al. ( 2020 ) . However, recent trends, particularly highlighted in benchmark tracks of conferences, e.g., NeurIPS, have led to the development of standardized and accessible benchmarks, such as Chartalist  Shamsi et al. ( 2022 ) and NFTGraph  Zhang et al. ( 2023 ) . These benchmarks provide large-scale, labeled graph data crucial for diverse research areas, from financial fraud detection to network dynamics analysis. The datasets are also used in the analysis of real-life phenomena where datasets are quite difficult to access. For example, Zhang et al. have proposed to use blockchain networks for studying the resilience of power networks  Zhang and Y. ( 2021 ) .

Code. Smart contract code datasets, such as  Ortner and Eskandari ( 2024 ); di Angelo et al. ( 2023 ) , include vulnerable smart contract code, offering valuable insights into security vulnerabilities within blockchain applications. Ibba provides token and non-fungible token contract code datasets  Ibba ( 2022 ) , shedding light on the intricacies of these specialized smart contract types.

6 Conclusion and Future Direction

The field of machine learning for blockchains has made significant progress in addressing numerous challenges, as highlighted in this survey. However, several promising future directions await further advancement. Firstly, ensuring that ML model decisions are transparent and interpretable is crucial for responsible and trustworthy blockchain data analysis. As blockchain data continues to grow in size and complexity, the development of scalable learning and inference techniques becomes imperative. Efficient algorithms and distributed computing approaches will play a pivotal role in handling the ever-expanding datasets. Furthermore, exploring the application of machine learning to complex blockchain networks, including cross-chain analysis, offers new insights and opportunities for research. Moreover, the dynamic nature of blockchain data requires the development of machine unlearning and continuous learning techniques, enabling models to adapt to evolving data distributions and maintain accuracy over time. Lastly, harnessing the capabilities of large language models for understanding natural language, interacting with data, and generating source code can revolutionize blockchain data and smart contract analysis.

  • Abay et al. [2019] N. C. Abay, C. G. Akcora, Y. R Gel, et al. Chainnet: Learning on blockchain graphs with topological features. In ICDM , 2019.
  • Akcora et al. [2018] C. G. Akcora, A. K. Dey, Y. R Gel, and M. Kantarcioglu. Forecasting bitcoin price with graph chainlets. In PAKDD , 2018.
  • Akcora et al. [2020] C. G. Akcora, S. Purusotham, et al. How to not get caught when you launder money on blockchain? arXiv:2010.15082 , 2020.
  • Akcora et al. [2021] C. G. Akcora, Y. Li, Y. R Gel, and M. Kantarcioglu. Bitcoinheist: Topological data analysis for ransomware prediction on the bitcoin blockchain. In IJCAI , 2021.
  • Akcora et al. [2022] C. G. Akcora, Y. R. Gel, and M. Kantarcioglu. Blockchain Networks: Data Structures of Bitcoin, Monero, Zcash, Ethereum, Ripple, and Iota. WIREs Data Mining Knowl. Discov. , 12(1), 2022.
  • Alarab et al. [2020] I. Alarab, S. Prakoonwit, and M. I. Nacer. Competence of graph convolutional networks for anti-money laundering in bitcoin blockchain. In ICMLT , 2020.
  • Alqassem et al. [2018] I. Alqassem, I. Rahwan, and D. Svetinovic. The anti-social system properties: Bitcoin network data analysis. IEEE Trans Syst Man Cybern , 50(1):21–31, 2018.
  • Anoaica and Levard [2018] A. Anoaica and H. Levard. Quantitative description of internal activity on the ethereum public blockchain. In NTMS , 2018.
  • Antelmi et al. [2023] A. Antelmi, G. Cordasco, et al. A survey on hypergraph representation learning. ACM Comp. Sur. , 56(1):1–38, 2023.
  • Bartoletti et al. [2020] M. Bartoletti, S. Carta, T. Cimoli, and R. Saia. Dissecting ponzi schemes on ethereum: identification, analysis, and impact. Future Generation Computer Systems , 102:259–277, 2020.
  • Benidis. et al. [2023] K. Benidis., Syama S. R., et al. Deep learning for time series forecasting: Tutorial and literature survey. ACM Comput. Surv. , 55(6):121:1–121:36, 2023.
  • Chapiro et al. [2021] C. Chapiro, M. Hydary, and C. Lomazzo. Linking blockchain to impact, 2021.
  • Chawla et al. [2002] N. V Chawla, K. W Bowyer, et al. Smote: Synthetic minority over-sampling technique. Journal of artificial intelligence research , 16:321–357, 2002.
  • Chen et al. [2018] W. Chen, Z. Zheng, et al. Detecting ponzi schemes on ethereum: Towards healthier blockchain technology. In WWW , 2018.
  • di Angelo et al. [2023] M. di Angelo, T. Durieux, J. F. Ferreira, and G. Salzer. SmartBugs 2.0: An execution framework for weakness detection in Ethereum smart contracts. In ASE , 2023. to appear.
  • Dickison et al. [2016] M. E. Dickison, M. Magnani, and L. Rossi. Multilayer social networks . Cambridge University Press, 2016.
  • Durieux et al. [2020] T. Durieux, J. F. Ferreira, et al. Empirical review of automated analysis tools on 47,587 ethereum smart contracts. In ICSE , pages 530–541. ACM, 2020.
  • Fan et al. [2021] S. Fan, S. Fu, H. Xu, and X. Cheng. Al-spsd: Anti-leakage smart ponzi schemes detection in blockchain. IPM , 58(4):102587, 2021.
  • Ferreira T. et al. [2021] Christof Ferreira T., A. K. I., A. Gervais, and R. State. The eye of horus: Spotting and analyzing attacks on ethereum smart contracts. In FC , 2021.
  • Gai et al. [2023] Y. Gai, L. Zhou, K. Qin, D. Song, and A. Gervais. Blockchain large language models. arXiv preprint arXiv:2304.12749 , 2023.
  • Hagberg et al. [2008] A. Hagberg, P. Swart, and D. S Chult. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab, United States, 2008.
  • Harlev et al. [2018] M. A. Harlev, H. Sun Yin, et al. Breaking bad: De-anonymising entity types on the bitcoin blockchain using supervised machine learning. HICSS , 2018.
  • Harrigan and Fretter [2016] M. Harrigan and C. Fretter. The unreasonable effectiveness of address clustering. In UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld , 2016.
  • Hassan et al. [2022] Muneeb Ul Hassan, Mubashir Husain Rehmani, and Jinjun Chen. Anomaly detection in blockchain networks: A comprehensive survey. IEEE Communications Surveys & Tutorials , 2022.
  • Hou et al. [2021] Wenhan Hou, Bo Cui, and Ru Li. A survey on blockchain data analysis. In COMPSAC , 2021.
  • Huang et al. [2020] Y. Huang, H. Wang, L. Wu, et al. Understanding (mis) behavior on the eosio blockchain. AMCS , 4(2):1–28, 2020.
  • Ibba [2022] G. Ibba. A smart contracts repository for top trending contracts. In IWETSEB , pages 17–20, 2022.
  • Kappos et al. [2018] G. Kappos, H. Yousaf, M. Maller, and S. Meiklejohn. An empirical analysis of anonymity in zcash. In USENIX Security , 2018.
  • Khan and Akcora [2022] Arijit Khan and Cuneyt Gurcan Akcora. Graph-based management and mining of blockchain data. In CIKM , 2022.
  • Khan [2022] Arijit Khan. Graph analysis of the ethereum blockchain data: A survey of datasets, methods, and future work. In Blockchain , 2022.
  • Kılıç et al. [2022] B. Kılıç, A. Sen, and C. Özturan. Fraud detection in blockchains using machine learning. In BCCA , 2022.
  • Kushwaha et al. [2022] S. S. Kushwaha, S. Joshi, et al. Ethereum smart contract analysis tools: A systematic review. IEEE Access , 10:57037–57062, 2022.
  • Kwon et al. [2019] D. Kwon, J. Kim, J. Heo, C. Kim, and Y. Han. Time series classification of cryptocurrency price trend based on a recurrent lstm neural network. Journal of Information Processing Systems , 15(3):694–706, 2019.
  • Lahmiri and Bekiros [2019] S. Lahmiri and S. Bekiros. Cryptocurrency forecasting with deep learning chaotic neural networks. Chaos, Solitons & Fractals , 118:35–40, 2019.
  • Lee et al. [2020] X. T. Lee, A. Khan, et al. Measurements, analyses, and insights on the entire ethereum blockchain network. In WebConf , 2020.
  • Li et al. [2020a] Ji Li, C. Gu, F. Wei, and Xi Chen. A survey on blockchain anomaly detection using data mining techniques. In BlockSys , pages 491–504. Springer, 2020.
  • Li et al. [2020b] Y. Li, Y. Cai, H. Tian, G. Xue, and Z. Zheng. Identifying illicit addresses in bitcoin network. In BlockSys , pages 99–111. Springer, 2020.
  • Liang et al. [2018] J Liang, L. Li, and D. Zeng. Evolutionary dynamics of cryptocurrency transaction networks: An empirical study. PLOS ONE , 13(8):1–18, 08 2018.
  • Liao et al. [2019] J. Liao, T. Tsai, C. He, and C. Tien. Soliaudit: Smart contract vulnerability assessment based on machine learning and fuzz testing. In IOTSMS , pages 458–465. IEEE, 2019.
  • Liu et al. [2021a] X. Liu, X. Jiang, et al. Knowledge discovery in cryptocurrency transactions: A survey. Ieee access , 9:37229–37254, 2021.
  • Liu et al. [2021b] Z. Liu, P. Qian, X. Wang, et al. Combining graph neural networks with expert knowledge for smart contract vulnerability detection. IEEE TKDE , 2021.
  • Livieris et al. [2020] I. E Livieris, E. Pintelas, S. Stavroyiannis, and P. Pintelas. Ensemble deep learning models for forecasting cryptocurrency time-series. Algorithms , 13(5):121, 2020.
  • Mi et al. [2021] F. Mi, Z. Wang, et al. Vscl: automating vulnerability detection in smart contracts with deep learning. In ICBC , pages 1–9. IEEE, 2021.
  • Möser et al. [2017] M. Möser, Kyle Soska, et al. An empirical analysis of traceability in the monero blockchain. arXiv preprint arXiv:1704.04299 , 2017.
  • Mujlid [2023] Hana Mujlid. A survey on machine learning approaches in cryptocurrency: Challenges and opportunities. In iCoMET , pages 1–6. IEEE, 2023.
  • Nakamoto [2008] S. Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System, 2008.
  • Ortner and Eskandari [2024] M. Ortner and S. Eskandari. Smart contract sanctuary, 2024.
  • Patel et al. [2020] V. Patel, L. Pan, and S. Rajasegarar. Graph deep learning based anomaly detection in ethereum blockchain network. In ICNSS , pages 132–148. Springer, 2020.
  • Patel et al. [2022] V. Patel, S Rajasegarar, et al. Evangcn: Evolving graph deep neural network based anomaly detection in blockchain. In ICADMA , pages 444–456. Springer, 2022.
  • Pierro et al. [2020] G. A. Pierro, R. Tonelli, and M. Marchesi. An Organized Repository of Ethereum Smart Contracts’ Source Codes and Metrics. Future Internet , 12(11):197, 2020.
  • Pocher et al. [2023] N. Pocher, M. Zichichi, et al. Detecting anomalous cryptocurrency transactions: An aml/cft application of machine learning-based forensics. Electronic Markets , 33(1):37, 2023.
  • Qi et al. [2023] Y. Qi, J. Wu, H. Xu, and M. Guizani. Blockchain data mining with graph learning: A survey. IEEE Trans. on Patt. An. and Ma. Int. , 2023.
  • Qian et al. [2022] C. Qian, T. Hu, and B. Li. A bilstm-attention model for detecting smart contract defects more accurately. In QRS , pages 53–62. IEEE, 2022.
  • Ron and Shamir [2013] D. Ron and A. Shamir. Quantitative analysis of the full bitcoin transaction graph. In FC 2013 , pages 6–24. Springer, 2013.
  • Rouhani and Abedin [2020] S. Rouhani and E. Abedin. Crypto-currencies narrated on tweets: a sentiment analysis approach. IJES , 36(1):58–72, 2020.
  • Scheid et al. [2022] E. J Scheid, R. Hy, et al. On the employment of machine learning in the blockchain selection process. IEEE Transactions on Network and Service Management , 19(4):3835–3846, 2022.
  • Schmidhuber and Hochreiter [1997] J. Schmidhuber and S. Hochreiter. Long short-term memory. Neural Comput , 9(8):1735–1780, 1997.
  • Shamsi et al. [2022] K. Shamsi, F. Victor, et al. Chartalist: Labeled graph datasets for utxo and account-based blockchains. NeurIPS , 35:34926–34939, 2022.
  • Spagnuolo et al. [2014] M. Spagnuolo, F. Maggi, and S. Zanero. Bitiodine: Extracting intelligence from the bitcoin network. In FC , 2014.
  • Tang et al. [2023] X. Tang, Y. Du, A. Lai, et al. Deep learning-based solution for smart contract vulnerabilities detection. Scientific Reports , 13(1):20106, 2023.
  • Tian et al. [2021] H. Tian, Y. Li, Y. Cai, X. Shi, and Z. Zheng. Attention-based graph neural network for identifying illicit bitcoin addresses. In BlockSys , 2021.
  • Victor [2020] F. Victor. Address clustering heuristics for ethereum. In FC , 2020.
  • Wang et al. [2021] J. Wang, P. Chen, S. Yu, and Q. Xuan. Tsgn: Transaction subgraph networks for identifying ethereum phishing accounts. In BlockSys , 2021.
  • Weber et al. [2019] M. Weber, G. Domeniconi, et al. Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics. arXiv:1908.02591 , 2019.
  • Wen et al. [2023] M Wen, R. Lin, et al. Large sequence models for sequential decision-making: a survey. Frontiers Comput. Sci. , 17(6):176349, 2023.
  • Wood [2018] G. Wood. Ethereum: A secure decentralised generalised transaction ledger. https://github.com/ethereum/yellowpaper , 2018.
  • Wu et al. [2021] J Wu, J. Liu, Y. Zhao, and Z. Zheng. Analysis of cryptocurrency transactions from a network perspective: An overview. Journal of Network and Computer Applications , 190:103139, 2021.
  • Wu et al. [2022a] J. Wu, J. Liu, W. Chen, et al. Detecting mixing services via mining bitcoin transaction network with hybrid motifs. IEEE Trans. Syst. Man Cybern. Syst. , 52(4):2237–2249, 2022.
  • Wu et al. [2022b] M. Wu, W. McTighe, et al. Tutela: An open-source tool for assessing user-privacy on ethereum and tornado cash. arXiv:2201.06811 , 2022.
  • Wu et al. [2023] J. Wu, K. Lin, Dan Lin, et al. Financial crimes in web3-empowered metaverse: Taxonomy, countermeasures, and opportunities. IEEE Open Journal of the Computer Society , 4:37–49, 2023.
  • Xia et al. [2021] F. Xia, K. Sun, et al. Graph learning: A survey. IEEE Trans. Artif. Intell. , 2(2):109–127, 2021.
  • Xie [2019] Rain Xie. Why china had to ban cryptocurrency but the us did not: a comparative analysis of regulations on crypto-markets between the us and china. Wash. U. Global Stud. L. Rev. , 18:457, 2019.
  • Yang et al. [2023] Z. Yang, G. Zhang, J. Wu, et al. A comprehensive survey of graph-level learning. arXiv preprint arXiv:2301.05860 , 2023.
  • Yousaf et al. [2019] H. Yousaf, G. Kappos, and S. Meiklejohn. Tracing transactions across cryptocurrency ledgers. In USENIX Security , 2019.
  • Yu et al. [2021a] L. Yu, N. Zhang, and W. Wen. Abnormal transaction detection based on graph networks. In COMPSAC , 2021.
  • Yu et al. [2021b] S. Yu, J. Jin, Y. Xie, J. Shen, and Q. Xuan. Ponzi scheme detection in ethereum transaction network. In BlockSys , 2021.
  • Yuan et al. [2020] Z. Yuan, Q. Yuan, and J. Wu. Phishing detection on ethereum via learning representation of transaction subgraphs. In BlockSys , 2020.
  • Z. et al. [2021] Lin Z., S. S. Gupta, A. Khan, and R. Luo. Temporal analysis of the entire ethereum blockchain network. In WebConf , 2021.
  • Zhang and Y. [2021] X. Zhang and Gel Y. Eager: Collaborative research: Blockchain graphs as testbeds of power grid resilience and functionality metrics, 2021.
  • Zhang et al. [2023] Z. Zhang, B. Luo, S. Lu, and B. He. Live graph lab: Towards open, dynamic and real transaction graphs with NFT. CoRR , abs/2310.11709, 2023.
  • Zhuang et al. [2021] Y. Zhuang, Z. Liu, P. Qian, Q. Liu, X. Wang, and Q. He. Smart contract vulnerability detection using graph neural networks. In IJCAI , 2021.
  • Open access
  • Published: 29 February 2024

How are texts analyzed in blockchain research? A systematic literature review

  • Xian Zhuo   ORCID: orcid.org/0000-0002-6638-3505 1 ,
  • Felix Irresberger 2 &
  • Denefa Bostandzic 1 , 3  

Financial Innovation, volume 10, Article number: 60 (2024)

This paper provides a systematic literature review of text analysis methodologies used in blockchain-related research to comprehend and synthesize existing studies across disciplines and define future research directions. We summarize the research scope, text data, and methodologies of 124 papers and identify the two most common combinations of these dimensions: (1) papers that focus on specific cryptocurrencies tend to apply sentiment analysis to instant user-generated content or news articles to discover the correlations between public opinion and market behavior, and (2) studies that examine the broad concept of blockchain with text data from documents published by companies tend to apply topic modeling techniques to explore classifications and trends in blockchain development. We discover five major research topics in the academic literature: relationship discovery, cryptocurrency performance prediction, classification and trend, crime and regulation, and perception of blockchain. Based on these findings, we highlight three potential research directions for researchers to select topics and implement suitable methodologies for text analysis.

Introduction

Blockchain technology and its economics have attracted considerable attention from academic researchers. The total volume of research has increased dramatically, with the proportion of empirical studies growing gradually in recent years (Casino et al. 2019 ; Xu et al. 2019 ; Frizzo-Barker et al. 2020 ). Data availability is often a primary obstacle in empirical studies in emerging research areas, such as blockchain, where it is not clear which alternative data sources should or can be used for quantitative analysis. Owing to its nature, a blockchain primarily comprises numerical data such as on-chain transactions by users or network (value) metrics, trading activity, price data of cryptoassets, or financial reports of the few available companies, most of which are readily available in the public blockchain. However, these datasets can be complemented by text data to obtain more data from consortiums and private blockchains, thus expanding the research span and deriving additional relevant insights.

Given the decentralized nature of the public blockchain ecosystem, there are limited compulsory disclosures or official platforms representing the comprehensive information of single blockchain projects that can serve as sources of blockchain-related information. Alternative sources of textual data play a vital role for different parties in gathering information and making decisions within a blockchain network. For example, the sentiments of a crowd (via news, social media, or other text sources) may be a more relevant reference for investment in the blockchain ecosystem than in corporations. Such data can affect the market, influence investors’ decisions, and provide an impetus for blockchain development. Researchers can make use of texts in blockchain-related contexts to obtain information from more perspectives (i.e., explore not only the metadata describing the data but also the actual content of the data) and make inferences that could not previously be made with numbers alone.

Therefore, in this study, we focus on providing an overview of text analysis methodologies and data sources as they pertain to blockchains, which differ from the text-based analyses of corporations. There is no consensus on the type of text data that should or could be used to analyze a specific blockchain network or project; therefore, our systematic overview helps alleviate this concern.

Several types of blockchain-related text data are publicly available. First, blockchain is a frequent topic in news reporting, with subtopics including the performance of cryptocurrencies and the latest developments in the technology. Second, because of the technical nature of blockchain technology, online platforms or forums such as Twitter, GitHub, and Reddit have been actively used by different groups (e.g., investors and developers) to express their opinions and share and track new developments (Mendoza-Tello et al. 2018 ). Blockchain startups also use social media for marketing. Third, blockchain project whitepapers provide key information (e.g., technical and marketing) to potential investors and are the primary method for understanding project details (Cohney et al. 2019 ).

In all these cases, manual examination of large-scale text content is exceptionally labor-intensive and time-consuming, if not impossible. Hence, computer-based text analysis is essential. Researchers across disciplines have provided guidelines for using such approaches. Grimmer and Stewart ( 2013 ), for example, illustrate the promise and the pitfalls of text analysis for political science. Günther and Quandt ( 2016 ) give a comprehensive overview of text analysis methods useful in digital journalism research. Studies in economics and finance have addressed the advantages and disadvantages of different methodologies (Loughran and McDonald 2016 ; Cong et al. 2021 ; Gentzkow et al. 2019 ).

Such reviews have not been conducted in blockchain-related research areas, despite the close connection between blockchain technology and multiple text datasets. Therefore, we argue that it is necessary to use a transparent approach and an academic standpoint to synthesize the current knowledge in the literature to better understand the relevance and potential of text analysis. In this study, we conduct a systematic literature review by examining published and unpublished academic literature, focusing on text analysis associated with blockchain topics across disciplines. We provide the fundamental principles and relevant sources of text analysis methodologies and connect the relationships of research scopes, text data, and methodologies to provide researchers with a reference for choosing suitable combinations of the above elements with respect to their research question at hand. We then pinpoint the specific research topics studied in the literature and propose directions for future research. This review serves as a guide for researchers from different disciplines interested in conducting blockchain-related text analysis studies.

Research methodology

We conduct a systematic review of the academic literature on blockchain-related research using text analysis. Research in this area has expanded because of the rapid development of blockchain technology. However, because of the interdisciplinary nature of blockchain research, research perspectives vary starkly, posing difficulties in searching for and gathering knowledge beyond a single field. We focus on computer-based text analysis used in blockchain research to comprehend and synthesize studies across disciplines that utilize text analysis as a primary or ancillary methodology. We aim to gain knowledge from the existing literature in this area and discover future research opportunities. We adopt the guidelines of Siddaway et al. ( 2019 ) and the PRISMA statement (Liberati et al. 2009 ; Moher et al. 2009 ; Page et al. 2021a , b ).

Definition of research questions

The first stage of a systematic review involves defining research questions that guide subsequent actions. We propose the following research questions to achieve the objectives of our review:

Which research scope, text data, and methodology are used to conduct text analysis in the blockchain area?

Both blockchain and text analysis are broad concepts. This question is designed to identify the specific scope of the studies (e.g., cryptocurrency, Footnote 1 smart contract Footnote 2 ), the text data being analyzed (e.g., social media posts and news), and specific methodologies or techniques used to perform the analyses (e.g., sentiment analysis). We aim to bridge and highlight the connections between these elements in each study. This will assist researchers in selecting the appropriate data and methodologies for their research.

What topics are addressed using text analysis in current literature?

The research questions determine how the research develops, and text analysis is one of the methods used to serve the purposes of a study. Regardless of whether text analysis is used alone or as part of a broader analysis, we intend to provide an interdisciplinary overview of the topics and research questions addressed in the existing literature, and illustrate how text analysis contributes to the study of these topics.

What are the research gaps and promising future research topics?

Based on the findings of our review, we identify understudied areas and future research opportunities using text analysis in blockchain research. This allows researchers to recognize promising research topics and specify the methodologies (and data) they can use.

Literature search and selection

Initial keyword searches were conducted on May 24, 2022, followed by updated searches on August 23, 2022, to find relevant studies. We chose the Web of Science (WoS) and Scopus databases to cover publications indexed in academic databases. As text analysis in blockchain research is relatively new, some studies may not have been published. Therefore, we also performed a keyword search of the Social Science Research Network (SSRN) to identify unpublished papers (e.g., working and discussion papers) (Garanina et al. 2021 ). Subsequently, backward snowballing of the articles obtained through keyword searches was performed to identify additional articles.

For a comprehensive result, our query keywords encompassed not only blockchain and text analysis but also synonyms and multiple specific topics relevant to the area. Relevant words from blockchain included blockchain, cryptocurrency, stablecoin, Footnote 3 crypto token, smart contract, initial coin offering (ICO), security token offering (STO), initial exchange offering (IEO), Footnote 4 and non-fungible token (NFT). Footnote 5 Keywords from text analysis included text analysis, textual analysis, text analytics, topic modeling, natural language processing (NLP), word embedding, sentence embedding, bag of words, and sentiment analysis. We also used asterisks (*) and quotation marks (“”) to account for plural forms, hyphens, and spelling variations. A description of our keyword-selection process and a complete list of keywords are included in the Appendix.

Keywords were searched in the title, abstract, and keywords. Footnote 6 The exact query is as follows:

(blockchain* OR cryptocurrenc* OR stablecoin* OR “crypto token*” OR “smart contract*” OR “initial coin offering*” OR “security token offering*” OR “initial exchange offering*” OR “non*fungible token*”) AND (“text* analysis” OR “text analytics” OR “topic model*” OR “natural language processing*” OR “word embedding*” OR “sentence embedding*” OR “bag of words” OR “sentiment analysis”)

The details of the literature search and selection process are presented in Fig.  1 . Search queries in the two databases returned 517 records. First, we screened the metadata of the articles to remove (1) non-English articles, (2) notes, editorials, conference proceedings titles, and preliminary papers, (3) duplicates, and (4) articles without full-text access. Next, we screened the titles and abstracts to remove articles based on our content-based exclusion criteria. To obtain relevant articles from multiple perspectives, we did not set inclusion/exclusion criteria by discipline. Instead, we checked the content of the articles and excluded an article only if (1) it did not contain information related to both blockchain and text analysis, (2) it focused purely on the technical aspect of blockchain, or (3) it did not specify the text analysis techniques used. After this screening, 140 articles remained for full-text assessment; we applied the exclusion criteria again and obtained 99 published articles. Our search on SSRN initially returned 30 articles. We removed 24 articles based on our exclusion criteria, leaving six unpublished articles. Subsequently, we conducted backward snowballing on the 105 articles obtained from the keyword searches (i.e., we went through the references of the included articles) to find additional articles that did not appear in the keyword searches. This process yielded 19 additional papers. A total of 124 studies were included in the literature review.

Fig. 1 The flowchart of the literature selection phases

Descriptive results

This section reports the descriptive results of the papers, including publication trends, keyword networks, and citation rankings.

Publication trend

Figure 2 depicts the number of papers on a yearly basis subject to article type and research area. Although we did not set any timeframe restrictions in our keyword search, the first blockchain paper using text analysis appeared in 2015, 6 years after the birth of the Bitcoin blockchain (Nakamoto 2008 ). The total number of papers published annually has been increasing, indicating the growing interest in and recognition of text analysis as a methodology for blockchain-related research. Until 2019, conference proceedings were the main channels through which related papers were published; however, from 2020 onward, the number of papers published in journals began to increase. For several years, computer science papers have largely dominated the topic, which can be explained by the entry requirements for coding skills in many machine learning-based text analyses. Nevertheless, later years saw a growing number of papers from business-, economics-, and finance-related fields. Studies from other areas, such as social sciences and multidisciplinary studies, have also contributed to this topic. The number of papers in most of these areas remains limited. However, the growing diversification of research areas indicates that interest has begun to spread from computer science to these areas.

Fig. 2 The types and research areas of the publications in each year

We analyzed the network of papers’ keywords (see Fig.  3 ). Footnote 7 The size of a node reflects the frequency of the keyword, a connection between two nodes indicates the co-occurrence of the keywords in a paper, and the color of a node indicates the average year in which the keyword appears. The most common keywords are the three blockchain concepts: Bitcoin , cryptocurrency , and blockchain . Bitcoin had the earliest average occurrence and was associated with crime (e.g., crime, DarkNet market), social media (e.g., social networking, Twitter), and sentiment (e.g., opinion mining and sentiment analysis). Cryptocurrency is associated not only with crime but also with financial activities (e.g., financial services and investments), classification, and clustering (e.g., recurrent neural networks, deep learning, and topic modeling). The keyword blockchain tends to co-occur with specific applications (e.g., commerce and FinTech), topic modeling, and relationship analysis (e.g., network and trend analyses). These different keyword associations imply that the different scopes within blockchain research are related to distinct economic activities and analyses. Individual text analysis-related keywords are mentioned less frequently; however, they appear in each blockchain scope. Sentiment analysis tends to go together with Bitcoin and cryptocurrency, whereas topic modeling and the corresponding keywords connect closely to cryptocurrency and blockchain.

Fig. 3 The keyword frequency and co-occurrence networks

Citation ranking

Citation analysis helps identify the impact and common concerns of papers. However, one problem with using citations as an indicator of impact is that older papers have longer periods of citation accumulation. Thus, to offset this problem, we ranked the papers in terms of both total citations and citations per year (CPY) (Dumay and Cai 2014 ) and considered the top ten papers from both criteria. Table  1 lists these papers and summarizes their text data, sample period, text analysis techniques, and brief abstracts of the papers.
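The two-criteria ranking described above can be sketched as follows. This is a minimal illustration, assuming that CPY is total citations divided by the number of years since publication (with a floor of one year); the paper titles and citation counts are invented, and `citations_per_year` and `top_n_union` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str       # hypothetical label, not a real reference
    year: int        # publication year
    citations: int   # total citations at the time of the search


def citations_per_year(paper: Paper, search_year: int) -> float:
    """Total citations divided by the years since publication.

    Assumption: a paper published in the search year counts as one
    year old, which avoids division by zero. Other conventions exist.
    """
    age = max(search_year - paper.year, 1)
    return paper.citations / age


def top_n_union(papers, search_year, n):
    """Titles that rank in the top n by total citations or by CPY."""
    by_total = sorted(papers, key=lambda p: p.citations, reverse=True)[:n]
    by_cpy = sorted(papers,
                    key=lambda p: citations_per_year(p, search_year),
                    reverse=True)[:n]
    return {p.title for p in by_total} | {p.title for p in by_cpy}


# Made-up numbers for illustration only.
papers = [
    Paper("older paper", 2015, 140),
    Paper("mid paper", 2018, 120),
    Paper("newer paper", 2020, 90),
]
selected = top_n_union(papers, search_year=2022, n=2)
```

In this toy data the older paper ranks first by total citations but last by CPY, so taking the union of both top lists keeps it alongside the newer paper.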

Nine papers appeared on both lists; one older paper (Georgoula et al. 2015 ) fell short of CPY and was surpassed by a newer paper (Kim et al. 2020 ). The topics of high-impact papers tended to concentrate on a narrow range. Ten studies applied sentiment analysis and nine explored the predictive power of sentiment from social media platforms/news for cryptocurrency prices. Most studies focused on Bitcoin or a few altcoins with large market caps, while Kraaijeveld and de Smedt ( 2020 ) included nine cryptocurrencies, and Li et al. ( 2019 ) studied a smaller cryptocurrency called ZClassic (ZCL). One study examined the sentiments of blockchain-related tweets and found that blockchain benefits were discussed more than its drawbacks (Grover et al. 2019 ). The study by Kim et al. ( 2020 ) proposed a new topic modeling method and applied it to conduct a literature review on blockchain research to discover research trends. A detailed discussion is provided in Table  1 .

Discussion of research questions

In this section, we briefly introduce the scope, text data, and methodologies used in the papers and bridge the elements to identify the most used combinations. Figure  4 displays the connections among research scopes, text data, and methodologies in proportion to the number of papers. Footnote 8

Fig. 4 The connections among research scope, text data, and the methodology

Research scope

‘Specific cryptocurrency’ (72 papers, 58%) is the most frequently used scope and Bitcoin in particular is the most studied cryptocurrency. To better recognize the importance of Bitcoin, we separate studies that focus exclusively on Bitcoin (40 papers, 32%) from the others. Other studies examine cryptocurrencies with large market caps, special small cryptocurrencies (Li et al. 2019 ; Mnif et al. 2021 ; Vacca et al. 2021 ), or a large number of cryptocurrencies to represent the market (Steinert and Herff 2018 ; Schwenkler and Zheng 2021 ).

Another substantial scope is the general concept of blockchain (26 studies, 21%). These studies treat blockchain technology and its applications as a whole and discover its uses in particular fields (e.g., supply chain management (Medhi 2020 ; Hirata et al. 2021 ; Xu and He 2022 ), banking (Daluwathumullagamage and Sims 2020 ), and accounting (Garanina et al. 2021 )) and how blockchain-related topics evolve (over time) (Zhang et al. 2021a ; Chousein et al. 2020 ; Medhi 2020 ; da Silva and Moro 2021 ; Zeng et al. 2018 ; Shahid and Jungpil 2020 ; Perdana et al. 2021 ).

The literature also covers the scope of the cryptocurrency market as a whole (11 papers, 8.9%) (Caliskan 2020 ; Siu et al. 2021 ), ICO projects (13 papers, 10.5%) (Toma and Cerchiello 2020 ; Liu et al. 2021 ; Sapkota and Grobys 2021 ), and smart contracts (two papers, 1.6%) (Ibba et al. 2021 ; Zhang et al. 2021a ).

It is worth noting that, although our search keywords also included stablecoin, NFT, and STO, we found no papers that used text analysis to examine these scopes. This may have resulted from the late development of these blockchain use cases. However, such applications have grown rapidly in recent years (Lambert et al. 2021 ; Wang et al. 2021b ), creating both opportunities and the need to address relevant research questions using text analysis.

Table  2 summarizes the text data and corresponding data sources we identify from the papers, which helps researchers navigate to the sources of their target data. We categorize texts into four groups: (1) corporate-produced documents, (2) user-generated content, (3) news, and (4) academic papers.

Corporate-produced document Corporate-produced documents use formal and technical language to provide detailed information about a company or its specific products and services. Despite the precise information provided by these documents, we found only 18 studies that used such texts. An ICO whitepaper, which pitches the project idea and outlines the business plan, is a voluntary disclosure by the ICO project team to attract potential investors (Florysiak and Schandlbauer 2022 ; Thewissen et al. 2022 ). Another example of such a document is smart contract code. Although code is not, strictly speaking, human language, its fixed format enables researchers to obtain information regarding the subject of the contract (Ibba et al. 2021 ; Zhang et al. 2021a ). Blockchain-related texts can also be extracted from corporate documents, such as SEC and patent filings, through keyword searches and used to examine blockchain adoption (Yen and Wang 2021 ; Wang et al. 2021a ; Zhang et al. 2021a ; Stratopoulos et al. 2022 ).

User-generated content Among all text data, user-generated content was the most frequently used (85 times, 64%). This type of text features a shorter length and informal language, and generally expresses the opinions of users on a particular topic. Social media platforms offer rich resources for such texts (56 times, 42%). Specifically, most studies chose Twitter to extract text data for conducting the analyses (Patil et al. 2018 ; Huynh 2021 ; Mareddy and Gupta 2022 ), while others used Sina Weibo (a Chinese microblogging website) or Stocktwits (a social media platform focused on financial topics) (Chen et al. 2019a ; Pan et al. 2020 ; Huang et al. 2021 ).

Compared with social media platforms, online forums often have a specific focus and attract users with shared interests; therefore, they tend to offer deeper discussions. Cryptocurrency-specific forums, such as bitcointalk, XRPChat, and Ethereum Community Forum (Kim et al. 2016 ; Gurdgiev and O’Loughlin 2020 ), have sections with distinctive topics. User discussions on topic-focused forums, such as GitHub, Reddit, and StackExchange, have provided insights into the development of blockchain (Hinds-Charles et al. 2019 ; Bahamazava and Reznik 2022 ; Ortu et al. 2022 ). There are numerous cryptocurrency-related communities (i.e., subreddits) on Reddit (e.g., r/CryptoMarkets, r/Bitcoin), and users can join these communities to share up-to-date news or express their opinions on topics. In contrast, HackForums contains posts on illicit activities (Siu et al. 2021 ).

News News articles are one of the most widespread and accessible types of textual data. They provide up-to-date factual information on events, as well as commentaries and opinions on a topic. Analyzing blockchain news at scale allows researchers to trace the evolution of the technology and public sentiment toward it. For instance, multiple news channels report the upcoming Ethereum Shanghai Hard Fork, but they contain different sentiments toward the event: FXStreet ( 2023 ) neutrally introduces the updates it would bring; U.Today ( 2023 ) illustrates multiple reasons for developers to be concerned about the hard fork, while Bloomberg ( 2023 ) is comparatively optimistic about it by emphasizing that “Shanghai is expected to push more people and institutional investors to stake their coins to support the Ethereum network and earn yield.”

Many studies use cryptocurrency-specific news channels (e.g., Coindesk and Cointelegraph) as their primary news data sources (Karalevicius et al. 2018 ; Farimani et al. 2022 ), whereas others search for blockchain-related news from financial newspapers (e.g., The Financial Times and The Economist) through keyword searches (Azqueta-Gavaldón 2020 ).

Academic paper Literature reviews assist researchers in understanding the current status of research, identifying research gaps, and guiding future research (Chakkarwar and Tamane 2019 ; Shahid and Jungpil 2020 ; Garanina et al. 2021 ). Unlike standard literature reviews, in which researchers spend time manually examining papers, text-analysis-assisted literature reviews process papers automatically, enabling researchers to acquire insights into a large number of papers in a specific area in a short time.

Methodology

Choosing a suitable methodology depends not only on the data characteristics but also on the research questions of the study. Our goal is not to provide a systematic classification of the methodologies, but to provide a big picture of the methodologies used in blockchain-related literature. Therefore, the methodologies presented in this section may overlap. For example, the underlying methodology of sentiment analysis can be a machine-learning-based classifier. This section outlines the principal methodologies most directly related to the research questions. In addition, we summarize the specific text analysis techniques used in the papers in Table  3 to provide supplementary details. Footnote 9

Text preprocessing Before conducting the actual analysis, multiple cleaning procedures should be applied to the raw text to prepare it as the input material. The necessary steps vary depending on the text condition and planned analysis. However, we identified standard preprocessing steps suitable for the majority of texts: removing special characters and punctuation, removing numbers and stopwords, lower-casing, spelling corrections, tokenization, assigning part-of-speech tags, and stemming/lemmatization. Some raw texts require more cleaning than others. For example, texts from social media and online forums usually use informal language and emojis, which can lead to misinterpretation. Papers therefore conducted additional procedures (Birim and Sönmez 2022 ; Critien et al. 2022 ): removing # and @user, removing URL links, converting emojis to words, and converting abbreviations to words. These procedures remove redundant text and convert unrecognizable characters into valuable information, making them vital preparation steps.
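Several of the cleaning steps listed above can be sketched with the Python standard library. This is a minimal illustration rather than the pipeline of any reviewed paper: the stopword list, the abbreviation map, and the emoji map are small hypothetical stand-ins, and spelling correction, part-of-speech tagging, and stemming/lemmatization are omitted because they usually rely on external libraries.

```python
import re

# Tiny illustrative stopword list; real studies typically use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}

# Hypothetical abbreviation and emoji maps, for illustration only.
ABBREVIATIONS = {"btc": "bitcoin", "eth": "ethereum"}
EMOJI_WORDS = {"\U0001F680": " rocket ", "\U0001F4C9": " chart_down "}


def preprocess(text: str) -> list[str]:
    """Clean a social media post and return a list of tokens."""
    for emoji, word in EMOJI_WORDS.items():       # convert emojis to words
        text = text.replace(emoji, word)
    text = text.lower()                            # lower-casing
    text = re.sub(r"https?://\S+", " ", text)      # remove URL links
    text = re.sub(r"@\w+", " ", text)              # remove @user mentions
    text = text.replace("#", " ")                  # drop '#', keep the tag word
    text = re.sub(r"[^a-z_\s]", " ", text)         # remove punctuation and digits
    tokens = text.split()                          # whitespace tokenization
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]   # expand abbreviations
    return [t for t in tokens if t not in STOPWORDS]     # remove stopwords


tokens = preprocess(
    "@alice BTC is going to the moon \U0001F680 #crypto https://example.com 100%"
)
```

Keeping the hashtag word while dropping the `#` symbol is one possible reading of "removing #"; dropping the whole tag is an equally plausible choice that depends on the planned analysis.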

Feature extraction The cleaned texts should be transformed into numerical representations so that the computer can read them and use them in further analyses. This step can also reduce computational complexity, enhance performance, and avoid overfitting, making it an essential procedure in text analysis (Kou et al. 2020 ). The representation per se can also provide information and insight. Count-based methods are straightforward to understand and interpret. Bag-of-words (BoW) is one of the most widely used approaches. It represents words according to their frequency in the corpus, disregarding order and context. N-grams extend BoW by breaking the corpus into contiguous sequences of n words. They capture more context around each word but produce a sparser feature set than BoW. BoW and N-grams assume that words that occur more frequently are more relevant, which does not always hold true. Term frequency-inverse document frequency (TF-IDF) (Salton et al. 1975 ) adds a measure of how rarely a word occurs across the entire corpus and assigns rarer words a higher score. Although such representations are generally used as inputs for further analysis, we identify papers that highlight frequent words and interpret them as blockchain topics (Zeng et al. 2018 ; Burnie and Yilmaz 2019 ; El-Masri and Hussain 2021 ). However, this approach can be misleading, because count-based methods discard linguistic structures and may miss crucial text information.
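The three count-based representations can be sketched in a few lines of Python. The TF-IDF weighting below uses tf = count / document length and idf = log(N / df), which is one common variant among several and is an assumption here; the toy corpus is invented.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Word counts, ignoring order and context."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """TF-IDF weights per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus of already-tokenized documents (made up for illustration).
docs = [
    ["bitcoin", "price", "rises"],
    ["bitcoin", "price", "falls"],
    ["bitcoin", "hack", "reported"],
]
bow = bag_of_words(docs[0] + docs[1])
bigrams = ngrams(docs[0], 2)
weights = tf_idf(docs)
```

Here "bitcoin" appears in every document and receives a TF-IDF weight of zero, whereas the rarer "hack" receives the highest weight. The example also reflects the limitation of count-based methods: none of these representations records what a word means or how it relates to other words.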

Word embedding mitigates this problem by representing words as vectors that capture their semantic and syntactic contexts in a document (Cong et al. 2021 ). In the vector space, the shorter the distance between two word vectors, the higher the similarity of the words. Word2vec (Mikolov et al. 2013 ) is one of the most frequently used word embedding methods. It includes two configurations: skip-gram and continuous bag of words (CBOW). Skip-gram uses the current word to predict the surrounding words, whereas CBOW predicts the current word using its surrounding words. Doc2vec (Le and Mikolov 2014 ), a generalization of word2vec, adds a document feature vector to the word vectors to capture the semantics of paragraphs and documents. Word-embedding techniques are not frequently used in the literature, but we found that Kim et al. ( 2020 ) and Liu et al. ( 2021 ) integrated these techniques when processing their texts. Two other word-embedding models, GloVe and fastText , were used by Kilimci ( 2020 ).
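The difference between the two word2vec configurations lies in which side of the (input, target) pair the current word occupies. The sketch below only builds the training pairs and does not train any embedding; the function names and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window):
    """(current word, surrounding word) pairs: input -> target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window):
    """(surrounding words, current word) pairs: input -> target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tuple(tokens[j] for j in range(lo, hi) if j != i)
        pairs.append((context, center))
    return pairs


sentence = ["bitcoin", "price", "rises"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_pairs(sentence, window=1)
```

For the middle word "price", skip-gram produces two pairs, one per neighbor, while CBOW produces a single pair whose input is both neighbors together.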

Analysis Sentiment analysis is the dominant text-analysis approach in the literature (80 times, 53%). There are two major types of sentiment analysis: lexicon/rule-based and machine learning-based (Vohra and Teraiya 2013 ).

Lexicon-based sentiment analysis calculates the sentiment score of a text based on the polarity of each word (i.e., positive, negative, or neutral) from sentiment dictionaries in which each word is assigned a sentiment score. Examples of well-established sentiment dictionaries include the Valence Aware Dictionary and sEntiment Reasoner (VADER) (Hutto and Gilbert 2014 ), which is particularly suitable for social media contexts, and the Loughran and McDonald sentiment lexicon (LM lexicon) (Loughran and McDonald 2011 ) in the finance domain. However, off-the-shelf dictionaries can sometimes generate inaccurate results because the same word can carry different sentiments in different contexts (Loughran and McDonald 2011 ). Therefore, some researchers have developed new and additional dictionaries (e.g., new vocabularies and emojis) in blockchain contexts for higher accuracy of sentiment quantification (Chen et al. 2019a ; Barth et al. 2020 ; Kraaijeveld and de Smedt 2020 ).
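A lexicon-based score can be sketched as a dictionary lookup followed by aggregation. The mini-lexicon below is hypothetical: its words and scores are invented and are not taken from VADER or the LM lexicon, and averaging the matched scores is only one of several possible aggregation rules.

```python
# Hypothetical mini-lexicon; the scores are invented and are not taken
# from VADER or the LM lexicon.
LEXICON = {"gain": 1.0, "bullish": 1.0, "surge": 1.0,
           "loss": -1.0, "scam": -1.0, "crash": -1.0}


def lexicon_score(tokens, lexicon=LEXICON):
    """Mean polarity of the tokens found in the lexicon (0.0 if none match)."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def label(score, threshold=0.05):
    """Map a score to a class; the threshold is an arbitrary choice."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


posts = [
    ["bitcoin", "surge", "bullish"],
    ["exchange", "scam", "crash", "gain"],
    ["new", "wallet", "released"],
]
labels = [label(lexicon_score(p)) for p in posts]

# Extending the dictionary with a domain-specific term (score is invented).
custom = {**LEXICON, "hodl": 1.0}
base_score = lexicon_score(["hodl"])
custom_score = lexicon_score(["hodl"], custom)
```

The last three lines mirror the dictionary extensions mentioned above: a domain term that the base lexicon ignores contributes to the score once it is added.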

Machine learning-based sentiment analysis adopts machine learning classifiers to study the sentiments of texts and classify them into distinct sentiment groups. Researchers can build a model and train it on their data or apply a pre-trained model (e.g., Bidirectional Encoder Representations from Transformers (BERT)) to their analysis. Compared to lexicon/rule-based sentiment analysis, this approach is dynamic and can better fit the research context. We identified 12 papers that adopted this approach (e.g., Patil et al. 2018 ; Balfagih and Keselj 2019 ; Inamdar et al. 2019 ; Aslam et al. 2022 ). In particular, Han et al. ( 2020 ) and Akba et al. ( 2021 ) propose and assess new models for sentiment analysis.

Sentiment analysis tools have also been utilized in academic studies (Lu et al. 2017 ; Stanley 2019 ; Caviggioli et al. 2020 ; Moustafa et al. 2022 ). Such tools implement their own algorithms and reduce the programming requirements for researchers. However, most of these tools are commercially oriented, incur high subscription fees, and lack transparency regarding their algorithms. Hence, despite their convenience, researchers should be cautious when using such tools.

In some studies, emotion-detection metrics have been applied in conjunction with sentiment analysis to achieve more precise emotion separation. For example, the NRC-VAD Emotion lexicon has three dimensions: valence, arousal, and dominance (Mohammad 2018 ). This provides another layer for sentiment and can increase the quality of the analysis.

Latent Dirichlet Allocation (LDA) and its variations were frequently chosen (33 times, 22%) for text analysis. LDA is a topic-modeling algorithm developed by Blei et al. ( 2003 ). Topic modeling can identify the patterns of vocabulary and phrases in documents (within the corpus of interest), detect the differences in their topics, and cluster the documents according to the topics they discuss. LDA is one of the most popular topic-modeling algorithms. It assumes that each document in the corpus consists of a number of latent topics and that each topic is characterized by a word distribution. Each topic is presented as a list of words with their probabilities. Its variations include dynamic topic models (DTM), which add temporal features to the model (Blei and Lafferty 2006 ), and SentLDA, which considers the boundaries between sentences and assumes that all words in a sentence are sampled from the same topic (Bao and Datta 2014 ). The texts used in LDA models are typically unlabeled, and the researchers’ task is to choose the number of topics, guided primarily by perplexity and coherence scores (Blei et al. 2003 ; Newman et al. 2010 ). After these scores narrow down the candidates, researchers apply their own interpretation to select the final number of topics for the model. Like other topic modeling and clustering algorithms, LDA is an unsupervised machine learning method. Evaluations of unsupervised machine learning vary from model to model, and human judgment is often required to evaluate the model quality. Nevertheless, these models are valuable for exploring the underlying features of a text without establishing an upfront framework (Grimmer and Stewart 2013 ). This is especially applicable to research in blockchain, which is still understudied and has few established classifications.
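The assumption that each document mixes latent topics, and that each topic is a word distribution, can be made concrete by simulating how LDA supposes a document is generated. The sketch below illustrates only this generative assumption; it does not fit a topic model to data, and the vocabulary, the two hand-made topics, and the Dirichlet parameter are invented for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document under the LDA generative assumption."""
    theta = sample_dirichlet(alpha, rng)             # document-topic mixture
    topics = list(range(len(topic_word)))
    words = []
    for _ in range(length):
        z = rng.choices(topics, weights=theta)[0]    # pick a latent topic
        w = rng.choices(vocab, weights=topic_word[z])[0]  # pick a word from it
        words.append(w)
    return theta, words


# Toy vocabulary and two hand-made topics (each word distribution sums to 1).
vocab = ["price", "trading", "market", "hack", "scam", "security"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "trading" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "security" topic
]
rng = random.Random(0)
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5],
                               length=8, rng=rng)
```

Fitting LDA reverses this process: given only the observed words, the algorithm estimates the topic-word distributions and each document's topic mixture.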

In contrast, supervised machine-learning classifiers are applied to pre-labeled texts, and the texts are classified into pre-specified groups. The idea is to first manually categorize a set of documents and then train a supervised model that automatically learns how to assign categories to documents using a training set (Bao and Datta 2014 ). Owing to the training process, they are domain-specific and better fit the research context (Grimmer and Stewart 2013 ). Multiple models are often applied to the same dataset and researchers can easily compare the performance of classifiers using certain metrics (e.g., precision, recall, accuracy, F1-score) to select the best-fitting model. Nevertheless, in blockchain-related research, they are utilized much less for text data (nine times, 6%).
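The metrics named above can be computed directly from true and predicted labels. The sketch below handles the binary case and compares two hypothetical classifiers on invented held-out labels; selecting by F1-score is an arbitrary choice made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Made-up held-out labels and predictions from two hypothetical classifiers.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
model_a = [1, 1, 0, 0, 0, 1, 1, 0]
model_b = [1, 0, 0, 0, 0, 0, 1, 0]
scores = {name: binary_metrics(y_true, pred)
          for name, pred in {"model_a": model_a, "model_b": model_b}.items()}
best = max(scores, key=lambda name: scores[name]["f1"])
```

Both toy models reach the same accuracy, but they differ in precision and recall, which is why studies typically report several metrics rather than accuracy alone.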

Bridging the elements

Figure  4 shows that the combinations of the elements are diversified depending on the purpose of the studies. Nevertheless, we observe two primarily adopted paths for text analysis in blockchain research: (a) papers studying specific cryptocurrencies tend to apply sentiment analysis to instant user-generated content or news articles to discover the correlations between public opinions/emotions and cryptocurrency market behavior, and (b) papers studying the broad concept of blockchain primarily choose official documents from companies (e.g., SEC and patent filings) and apply topic models to explore the classifications or trends in the sector.

The links among the above elements are not fixed; that is, researchers can choose combinations according to their requirements. To select effective combinations, researchers must understand the characteristics of the data, the assumptions underlying a particular methodology, and the questions they intend to investigate. The design should facilitate the generation of interpretable and meaningful results to answer the research questions.

The data and methodologies are used to serve the purpose of the study and should be chosen depending on the research questions (Grimmer and Stewart 2013 ). In the following section, we summarize blockchain-related topics discussed in the existing literature that involve text analyses.

Relationship discovery

Researchers have used different text data (often combined with other variables) to identify correlations. The speculative nature and high volatility of cryptocurrencies have led to studies exploring the relationship between market fluctuations and information on online platforms. Different factors of online discussions, including the counts of specific keywords, discussions of different topics, and sentiment classes, are extracted. These factors are used as variables to test whether they are associated with cryptocurrency market activities, such as price changes and the co-movement of peer cryptocurrencies (Polasik et al. 2015 ; Phillips and Gorse 2018 ; Barth et al. 2020 ; Schwenkler and Zheng 2021 ). From more specific perspectives, studies distinguish different user groups and vocabularies and find that content from certain groups or the presence of certain words is more closely related to changes in the cryptocurrency market (Burnie and Yilmaz 2019 ; Kang et al. 2020 ). Xie ( 2021 ) explores the relationships among online discussions and demonstrates that online communities’ conflicting opinions and redundant discussions result in low trading volumes.

An ICO whitepaper, which can be seen as a less regulated counterpart of a prospectus for an initial public offering (IPO), provides information that can impact investors’ decisions and, to some extent, determine the success of projects. Many dimensions of such texts influence the performance of ICOs. For instance, ICO projects with higher technological sophistication shown in whitepapers are more likely to be successful and less likely to be delisted (Liu et al. 2021 ). Whitepapers that are unique (that is, they contain more project-specific information and avoid borrowing common phrases from previous whitepapers) can lead to higher fundraising amounts and better post-ICO performance (Yen and Wang 2021 ; Florysiak and Schandlbauer 2022 ). The readability and sentiment expressed in whitepapers can also affect investors’ decisions to invest in the described project (Stanley 2019 ; Sapkota and Grobys 2021 ).

For public companies that meet higher disclosure standards, blockchain-related information can be extracted from 10-K filings and used to investigate whether blockchain adoption brings value and efficiency to companies (Yen et al. 2021 ).

Cryptocurrency performance prediction

Forecasting has always been an important topic in cryptocurrency studies. In addition to econometric methods and statistical models for price prediction, sentiment has also been used as a predictor of market movement (Mao et al. 2011 ; Fang et al. 2022 ). The effect of sentiment on the cryptocurrency market could be magnified by the lack of traditional financial fundamentals in valuation, and vocal and active investors on social media (Corbet et al. 2018 ; Gurdgiev and O’Loughlin 2020 ). Machine learning models, especially supervised models, are often applied to use sentiment data for prediction. Sentiment is used as the sole input to a model or as a supplement to conventional variables (e.g., price, trading volume, blockchain metadata (Sebastião and Godinho 2021 )).

Texts from social media are extracted, and each document is assigned a sentiment score using a sentiment analysis technique (see Table  3 for details). The scores (along with other variables) are subsequently used as inputs for the prediction models. They have predictive power for the direction of price movement (Loginova et al. 2021 ; Critien et al. 2022 ) and the short-term (e.g., hourly and daily) magnitude of price changes (Li et al. 2019 ; Farimani et al. 2022 ; Ortu et al. 2022 ).
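The pipeline described above, from document-level scores to a directional prediction, can be sketched as follows. A deliberately naive sign rule stands in for the supervised models used in the reviewed papers, and the scores, returns, and day labels are invented; the sketch shows only the data flow, not any reported result.

```python
from collections import defaultdict
from statistics import mean


def daily_sentiment(scored_posts):
    """Average the per-document sentiment scores by day."""
    by_day = defaultdict(list)
    for day, score in scored_posts:
        by_day[day].append(score)
    return {day: mean(scores) for day, scores in by_day.items()}


def direction_hit_rate(sentiment, returns, days):
    """Share of days whose return sign matches the previous day's sentiment sign."""
    hits, total = 0, 0
    for prev, curr in zip(days, days[1:]):
        if sentiment[prev] == 0 or returns[curr] == 0:
            continue                       # skip days without a clear direction
        total += 1
        hits += (sentiment[prev] > 0) == (returns[curr] > 0)
    return hits / total if total else 0.0


# Invented scores and returns, for illustration only.
scored_posts = [("d1", 0.6), ("d1", 0.2), ("d2", -0.5), ("d2", -0.1), ("d3", 0.3)]
returns = {"d1": 0.01, "d2": 0.02, "d3": -0.03, "d4": -0.01}
days = ["d1", "d2", "d3", "d4"]
sentiment = daily_sentiment(scored_posts)
rate = direction_hit_rate(sentiment, returns, days)
```

In practice the daily sentiment series would be combined with other variables and passed to a trained model, and performance would be evaluated on data not used for training.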

The impact of social media content depends particularly on the level of information dissemination. Thus, posts by celebrities or opinion leaders (i.e., influencers), or discussions about them, could have more power than other posts (Kang et al. 2020 ). Huynh ( 2021 ; 2022 ) quantifies the tweet sentiments of Donald Trump and Elon Musk using the LM lexicon and finds that negativity in Trump’s tweets leads to higher returns on Bitcoin, whereas both pessimistic and optimistic expressions from Musk have a positive effect on Bitcoin returns. Cary ( 2021 ) analyzes the tweet sentiment about Elon Musk’s performance on Saturday Night Live on 8 May 2021 and finds that the negative opinion toward his performance led to a decline in the price of Dogecoin.

Prediction models have also been used in ICO studies. Text data variables (e.g., expert reviews and social media sentiment) and non-text variables (e.g., sale price, project duration, and expert ratings) are utilized simultaneously to predict the success of ICO projects (Xu et al. 2021 ; Chursook et al. 2022 ).

Overall, studies focusing on predicting market movements and project success constitute a large proportion of the papers in this review. However, the data and methodologies mainly follow a similar direction: applying sentiment analysis to Twitter posts and associating the respective sentiment metrics with high market capitalization cryptocurrencies.

Classification and trend

One step in understanding large-scale texts containing multiple documents is to categorize the documents and create classifications. Using clustering/topic models or classifiers, content features (i.e., the topics discussed) in documents can be extracted and used to group documents into different classifications. By adding a temporal dimension to the static classification, the classification information can provide the trends of a particular group of topics.

Such models can be valuable when applied to academic papers in literature reviews to facilitate an understanding of existing studies and identify further research. Unlike standard literature reviews, in which researchers read through papers to derive results, topic modeling-based literature reviews extract the titles and abstracts of papers and rely on algorithms to extract topics from the texts. Classification algorithms are used to understand the current state and development of blockchain research (Chakkarwar and Tamane 2019 ; Shahid and Jungpil 2020 ; Lee et al. 2022 ). Some studies have examined blockchain applications within a sector (e.g., consumer trust, banking, and accounting) to help researchers and practitioners identify future research areas and business opportunities (da Silva and Moro 2021 ; Daluwathumullagamage and Sims 2021 ; Garanina et al. 2021 ). Although text analysis enables researchers to examine text content on a large scale without time-consuming manual reading, one of its drawbacks for literature reviews is the lack of an information screening process, during which irrelevant papers are excluded from the review.

Most papers included in this review (Xu and He ( 2022 ) is an exception) directly use all papers from the keyword search results as their input for topic models and further analyses. In this case, many irrelevant papers may be erroneously included in the models and the noise information they contain can be significant, leading to biased or inaccurate conclusions. To avoid undermining the advantages of topic modeling, researchers must carefully design the selection criteria for their dataset when performing such studies.
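One simple form such selection criteria could take is a rule-based screen on titles and abstracts applied before topic modeling. The sketch below is an assumption-laden illustration, not a procedure taken from the reviewed papers: the term lists, the record fields, and the function name `is_relevant` are hypothetical, and substring matching is crude, so manual checks would still be needed.

```python
def is_relevant(record, required_terms, excluded_terms):
    """Keep a record only if it mentions a required term and no excluded term."""
    text = f"{record['title']} {record['abstract']}".lower()
    has_required = any(term in text for term in required_terms)
    has_excluded = any(term in text for term in excluded_terms)
    return has_required and not has_excluded


# Hypothetical keyword-search results.
records = [
    {"title": "Topic modeling of blockchain research",
     "abstract": "We apply LDA to abstracts on distributed ledgers."},
    {"title": "Block copolymer chain dynamics",
     "abstract": "A blockchain-like notation is used for polymer segments."},
    {"title": "Sentiment and Bitcoin returns",
     "abstract": "Twitter sentiment predicts cryptocurrency prices."},
]

required = ["blockchain", "bitcoin", "cryptocurrency"]  # assumed inclusion terms
excluded = ["polymer"]                                  # assumed exclusion terms

screened = [r for r in records if is_relevant(r, required, excluded)]
print([r["title"] for r in screened])
```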

At a more technical level, the classification and trends of blockchain infrastructure and application design problems have also been addressed. Using texts from technique-oriented platforms (e.g., GitHub and StackExchange), some studies have observed a shift in developers’ interests from mining to software development (Alahi et al. 2019; Hinds-Charles et al. 2019). A special case involves the use of smart contract code as an input for topic models or classifiers. Researchers can then discover the most common uses of smart contracts and identify Ponzi schemes by analyzing the code (Ibba et al. 2021; Zhang et al. 2021b). Despite the focus on technical information, such studies have implications not only for developers and computer scientists but also for researchers in finance and economics, for instance, by identifying investor interests and customer demands.

The evolution of the blockchain topic is often tied to unique events that affect market activity and trigger changes in investor behavior. Linton et al. (2017), for example, study how blockchain topics change during periods of significant events in the cryptocurrency world, such as the insolvency of the MtGox Bitcoin exchange in 2014 (Goldstein and Tabuchi 2014) and the hack of Bitfinex in 2016 (Baldwin 2016). Around such events, the predominant topics in online forums shift, e.g., from solely ‘Bitcoin trading’ to ‘security issues’ or ‘scams’. Other researchers (Daluwathumullagamage and Sims 2020; Pan et al. 2020; Bahamazava and Nanda 2022) incorporate the influence of specific events (e.g., Bitcoin halving events, the introduction of regulations, and COVID-19) into their models to better interpret the change in interest during different periods.

Crime and regulation

Illegal activities and crimes have always surrounded discussions on cryptocurrency. Many early users valued the (pseudo)anonymity of cryptocurrency and used it as currency for illicit purchases on the DarkNet. In the early stages, it was suggested that cryptocurrencies contribute to improving black markets (Foley et al. 2019).

Bahamazava and Reznik (2022) and Bahamazava and Nanda (2022) explore posts from Reddit (subreddit DarkNet) to study the evolution of criminal topics and the mainstream methods of trading cryptocurrencies illegally. Crime-related texts on other channels such as Twitter, Telegram, and HackForums are also used to identify the specific illegal activities discussed (Barth et al. 2020; Nizzoli et al. 2020; Siu et al. 2021). One rich first-hand source for examining fraud from the victim’s side is the reports from https://www.bitcoinabuse.com, where the victims of Bitcoin fraud share their experiences and post the original messages they received from the abusers. Choi et al. (2022) cluster these messages and find that a large number of them are highly similar, suggesting that fraud messages are reused with only slight modifications and that Bitcoin fraud instigators follow certain patterns of language use. Zhang et al. (2021b) apply an improved CatBoost classifier to smart contract code to find the common characteristics of Ponzi schemes hidden in the lines.
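To illustrate how slightly modified fraud messages can end up in the same cluster, the following sketch groups messages by Jaccard similarity of their word sets with a greedy single-pass procedure. This is a deliberately simple stand-in, not the two-fold NLP-based method of Choi et al. (2022); the example messages, the threshold of 0.6, and all function names are assumptions made for illustration.

```python
import re


def tokens(message):
    """Lowercase a message and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", message.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_messages(messages, threshold=0.6):
    """Greedy clustering: a message joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # list of (representative token set, member indices)
    for i, msg in enumerate(messages):
        tok = tokens(msg)
        for rep, members in clusters:
            if jaccard(tok, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((tok, [i]))
    return [members for _, members in clusters]


# Hypothetical abuse-report messages.
messages = [
    "I hacked your webcam, send 500 USD in bitcoin to this address",
    "I hacked your webcam, send 900 USD in bitcoin to this address",
    "Double your bitcoin today with our guaranteed investment plan",
]

groups = cluster_messages(messages)
print(groups)
```

The first two messages differ only in the requested amount and fall into one cluster, while the third forms its own cluster.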

Although studies inspecting illegal activities have accumulated, the number of studies exploring relevant regulations remains minimal. We identified only two studies that explicitly discussed regulatory issues. In the study by Bahamazava and Nanda ( 2022 ), after discovering the preferred methods of buying cryptocurrencies for money laundering, they cross-examined anti-money laundering regulations in Italy and Russia to see if they have corresponding paragraphs to address such purchasing methods. Chousein et al. ( 2020 ) investigate how service providers of public blockchain systems communicate with their users about the influences of the EU General Data Protection Regulation (GDPR) on their services and find a shortage of communication and transparency on GDPR compliance issues.

There are two reasons for the lack of regulation-oriented text analysis studies. First, the time lag between the introduction of regulations in different jurisdictions limits the availability of data for regulatory studies. Second, analyzing the content of regulations requires a computer program to understand the legal terms. Therefore, context-specific dictionaries are required to correctly extract information. Researchers should also have domain knowledge to interpret the results accurately, which can be challenging in many areas. Nevertheless, because understanding regulatory frameworks is essential to advance our understanding, combat blockchain crimes, and promote blockchain adoption, more research is needed from the perspective of blockchain-related regulations.

Perception of blockchain

The perception of (potential) users is crucial for the development of emerging technologies such as blockchain. Public acceptance does not merely rely on economic benefits, but also on other aspects. Studies have attempted to discover how the public perceives blockchain technology and the drivers of attitude construction. Such studies are closely associated with social and cultural factors and are, therefore, located in interdisciplinary studies, such as behavioral finance. The number of papers was not significant (seven papers) in this review; however, the questions discussed were diverse.

Blockchain was initially surrounded by suspicion and considered a questionable technology; however, its acceptance grew gradually. Users are attracted to the security, privacy, transparency, trust, and traceability offered by blockchain (Grover et al. 2019), but their adoption is still hindered by a lack of blockchain knowledge and distrust of blockchain (Yadav et al. 2021). Doubts can be removed by building channels for the public to gain knowledge about it: (1) articles from the media help the public obtain more information about blockchain, which boosts further exploration of the technology and acceptance; (2) existing business problems motivate experimenting with blockchain and enhance trust (Perdana et al. 2021). Cultural background also helps shape the perceived value of blockchain. Grassman et al. (2021) conduct a comparative study between Sweden and Japan on attitudes towards the autonomy that cryptocurrency brings. The principle of autonomy has a higher intrinsic value in Sweden, whereas Japan adopts a more pragmatic view of autonomy (i.e., facilitating investment prospects).

Within the broad term blockchain, specific products with distinctive characteristics are viewed differently. Some studies (Caliskan 2020; Mnif et al. 2021; Bashchenko 2022) explore the perceptions of Bitcoin, Bitcoin Green, and cryptocurrency exchanges and explain the reasons for these perceptions.

We now summarize the research gaps described in the papers and observed by us, and develop research topics that future studies could address.

Improvement of data preparation

The quality of the input data largely determines the model output results; however, the complexity of text data makes it challenging to prepare. Many current studies merely conduct standard data preparation and omit the features of different types of text. To prevent “garbage-in-garbage-out”, future research can look more deeply into the characteristics of specific texts and prepare the data in a way that fits the characteristics of the texts.

Data selection. After text preprocessing, the text data should be further selected or weighted by considering the text features. This procedure is still neglected by a substantial number of papers. For example, Twitter offers millions of short texts daily, but misinformation is omnipresent. Bots and fake accounts should not be ignored and should be separated from other accounts (Burnie and Yilmaz 2019; Kraaijeveld and de Smedt 2020). Bashchenko (2022) divides news into two types: (a) endogenous news, which describes past price movements; and (b) fundamental news, which provides information that can have a greater impact. When using news for price prediction, endogenous news should be filtered out because it has a limited influence on future prices.

Data preparation can also be improved by setting relevance levels for the texts. Twitter accounts can be weighted according to their influence levels (e.g., number of followers, retweets, and user networks) (Jain et al. 2018; Li et al. 2019), and the influence of a patent is reflected by the number of citations.
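The selection and weighting steps described above can be sketched as follows. The snippet drops tweets flagged as coming from bots and then averages sentiment with influence-based weights. The weighting formula (sum of log-scaled followers and retweets), the `is_bot` flag, and the field names are hypothetical choices for illustration; the cited studies may weight accounts differently.

```python
import math


def influence_weight(followers, retweets):
    """Hypothetical influence weight: log-scaled followers plus retweets."""
    return math.log1p(followers) + math.log1p(retweets)


def weighted_sentiment(tweets):
    """Influence-weighted mean sentiment over tweets not flagged as bots."""
    kept = [t for t in tweets if not t["is_bot"]]
    weights = [influence_weight(t["followers"], t["retweets"]) for t in kept]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * t["sentiment"] for w, t in zip(weights, kept)) / total


# Hypothetical tweets with precomputed sentiment scores in [-1, 1].
tweets = [
    {"sentiment": 0.8, "followers": 100_000, "retweets": 500, "is_bot": False},
    {"sentiment": -0.4, "followers": 50, "retweets": 0, "is_bot": False},
    {"sentiment": 1.0, "followers": 10, "retweets": 0, "is_bot": True},
]

score = weighted_sentiment(tweets)
print(round(score, 3))
```

In this toy example the unweighted mean of the two human accounts would be 0.2, whereas the weighted score is pulled towards the more influential account.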

Dictionary building. Dictionaries are essential in text analysis models (e.g., sentiments and topics). However, they are generally only applicable to a specific context since words can change their meanings depending on the discipline (Loughran and McDonald 2011). The impact of using an off-the-shelf dictionary from another area can be substantial for blockchain studies, as new vocabulary and jargon have been invented in blockchain. Studies have indicated that designing a domain-specific lexicon for blockchain could potentially improve the accuracy of analysis (Balfagih and Keselj 2019; Chen et al. 2019a; Sattarov et al. 2020). Existing studies primarily adopt the VADER (Hutto and Gilbert 2014) and LM lexicons (Loughran and McDonald 2011), and only a few studies have developed or integrated blockchain-specific lexicons (Chen et al. 2019a; Barth et al. 2020; Kraaijeveld and de Smedt 2020; Huang et al. 2021).
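A toy example may clarify why a domain-specific lexicon matters. The sketch below scores a post with a small general lexicon and again after adding blockchain jargon. Both word lists and all scores are invented for illustration and are not taken from VADER, the LM lexicon, or any cited study; the scorer is a plain average that ignores negation and intensity.

```python
import re

# Hypothetical general-purpose lexicon (word -> sentiment score).
GENERAL_LEXICON = {"good": 1.0, "bad": -1.0, "crash": -1.5, "gain": 1.0}

# Hypothetical blockchain-specific additions, for illustration only.
BLOCKCHAIN_LEXICON = {"hodl": 1.0, "moon": 1.5, "rekt": -2.0, "rug": -2.0}


def lexicon_score(text, lexicon):
    """Average score of the words in `text` that appear in `lexicon`."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


combined = {**GENERAL_LEXICON, **BLOCKCHAIN_LEXICON}
post = "Got rekt on that rug pull, not a good week"

general_score = lexicon_score(post, GENERAL_LEXICON)
domain_score = lexicon_score(post, combined)
print(general_score, domain_score)
```

With the general lexicon alone, only "good" is recognized and the post looks positive; once the jargon terms are included, the score turns negative.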

Extension to underused data and growing areas

In this review, we find a concentration of text data uses from social media, online forums, and academic papers. Meanwhile, many other documents containing valuable information are underused. Corporate-generated documents (e.g., SEC and patent filings) are not frequently utilized despite their importance in revealing corporate-level information. For instance, in finance studies, patent filings are used to identify specific FinTech categories (Chen et al. 2019b, 2022). Studies use 10-Ks for different purposes: product description sections to define new industry sets according to product similarity (Hoberg and Phillips 2016), business descriptions to measure a company’s asset specificity (Chen et al. 2022), and risk disclosures for risk detection (Bao and Datta 2014; Hanley and Hoberg 2019). Corporate disclosures are versatile and cater to multiple research purposes. One limitation of corporate disclosures is that blockchain startups have limited mandatory disclosures. Nevertheless, future research can make greater use of such documents to gain insights into the blockchain adoption strategies of established companies.

Another gap in the review is the absence of papers related to the keywords NFT, STO, IEO, and stablecoin. These are relatively new concepts in blockchain and are largely understudied. Researchers investigating these areas will contribute to a better understanding of market mechanisms. For example, potential text data in NFTs include descriptions and social media discussions of NFT items. STOs are treated as traditional securities and adhere to all rights and obligations, including approved prospectuses for public offerings. IEO project whitepapers are thoroughly vetted by exchanges prior to launch. Therefore, the above documents are more standardized and can be used similarly to standard corporate disclosures. Stablecoins are connected to conventional financial systems and have drawn attention to financial stability issues. News (integrated with event studies) could provide coverage from this perspective.

Given the increasing trend of cryptocurrency in the monetary system, government policies and regulations are essential for counteracting risks, restricting illicit activities, and protecting consumers (Chokor and Alfieri 2021 ).

Many jurisdictions have updated or supplemented their regulatory frameworks to accommodate the existence of cryptocurrencies and other blockchain-based decentralized applications (e.g., Markets in Crypto-Assets (MiCA) and the Framework for International Engagement on Digital Assets). Issues such as money laundering, terrorist financing, and tax evasion have been extensively recognized and addressed. In addition, organizations such as the International Organization for Standardization (ISO) and the Financial Stability Board (FSB) are working to establish international rules and standards to promote collaboration among jurisdictions. Many proposed frameworks are still in their initial stages or awaiting implementation, and updates can be expected.

Texts used in regulation-related research are not limited to regulatory documents, but also include other texts, such as corporate disclosures related to blockchain or cryptocurrency (SEC 2022), terms of service agreements, and online discussions about regulatory terms. Future research could integrate regulatory factors into the study, examine the impact of regulations on markets in different jurisdictions (Barth et al. 2020), and observe users’ perceptions of and reactions to specific regulations. This could provide insightful implications for practitioners and policymakers regarding the implementation of relevant regulations and how those subject to specific regulations will adopt them.

The uncomplicated access and rich information in blockchain-related texts make them ideal for complementing numerical data in research. However, a comprehensive review of this topic to provide guidance for researchers is lacking.

This study addresses this issue by making several contributions to the literature. First, we provide comprehensive summaries of research scope, text data sources, and text analysis methodologies in the existing literature to guide researchers in finding pertinent resources. Second, we go beyond individual elements and exhibit the connections between them. We combine the above elements and display the two most frequently used combinations: (1) papers focusing on cryptocurrencies conduct sentiment analysis on posts from instant user-generated content or news articles to find the correlations between sentiment and market behavior, and (2) papers examining the concept of blockchain apply topic modeling to formal documents to discover classifications and trends. We emphasize that it is crucial to choose appropriate combinations considering various perspectives, such as data characteristics and research questions. Finally, we integrate blockchain-related research areas and text analysis approaches into a joint framework. By not restricting our search to one discipline, we are able to capture the use of text analysis in non-technical blockchain studies across disciplines and provide multiple perspectives on the topic. We highlight five major research topics discussed in the literature: relationship discovery, cryptocurrency performance prediction, classification and trend, crime and regulation, and the perception of blockchain. Furthermore, by referring to individual papers and aggregated information, we uncover three future research topics that researchers can explore: improvement of data preparation, studies with underused data and growing areas, and regulation-related research.

We are aware that this review shares the publication bias common to literature reviews: studies with statistically significant results are more likely to be published (Rosenthal 1979). To alleviate the impact of this bias, we searched the most comprehensive databases for peer-reviewed papers and chapters. We also included unpublished working papers on SSRN in keyword searches. Backward snowballing was conducted on the included papers to identify more papers that did not appear in the keyword searches. We believe that through our multiple procedures for identifying targeted papers, we obtained a comprehensive collection of papers for this literature review.
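For readers unfamiliar with backward snowballing, the step can be sketched as a simple set operation: collect every work cited by the included papers and keep those not yet in the collection. The paper identifiers and the function name below are hypothetical, and the resulting candidates would still need to be screened for relevance.

```python
def backward_snowball(included, references):
    """Return works cited by the included papers that are not yet included.

    `references` maps a paper ID to the IDs of the works it cites.
    """
    candidates = set()
    for paper in included:
        candidates.update(references.get(paper, ()))
    return candidates - set(included)


# Hypothetical paper IDs and reference lists.
included = {"P1", "P2"}
references = {
    "P1": ["P2", "P3", "P4"],
    "P2": ["P4", "P5"],
}

new_candidates = backward_snowball(included, references)
print(sorted(new_candidates))
```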

Despite this limitation, this study provides a timely academic-oriented review of the text analysis approaches used in blockchain research. Our detailed summaries will help researchers navigate specific text data types and methodologies. The findings of the current research landscape and suggested future directions could facilitate the selection of promising research topics and the implementation of suitable methodologies for their analyses. Overall, this review will be useful for researchers from various disciplines interested in exploring large-scale text data in blockchain-related research.

Availability of data and materials

The papers used in this review can be obtained using the same search terms in relevant databases. The full list of papers included in this review is available from the corresponding author upon reasonable request.

The first use case for blockchains is the creation of cryptocurrencies (e.g., Bitcoin), where Nakamoto ( 2008 ) proposed a design for a decentralized payment system in which all transactions are stored in transparent blocks, and transactions are validated through a consensus protocol. The idea is to build trust through protocols and operate the system without authority (i.e., a trusted third party).

A smart contract is essentially a computer-coded contract on blockchain that is automatically executed when the contract terms are met. This increases the enforceability of business contracts without the involvement of a trusted third party (Cong and He 2019 ).

Stablecoins are cryptocurrencies designed to be price-stable by pegging their values to a specific asset (or a basket of assets), making them a better medium of exchange than typical cryptocurrencies. The most common peg is to the US dollar.

ICO is an alternative way of financing projects or startups by creating and issuing tokens on a blockchain and selling them to raise funds. IEOs can be seen as an ICO supervised by cryptocurrency exchange platforms: the project goes through due diligence before commencing the sale, which gives investors more assurance about the validity and success of the project. STOs are tokenized digital securities and are sold in security token exchanges. They are classified as securities and are subject to rigorous vetting before issuance.

NFTs differ from other tokens in their non-fungibility. A token can represent ownership of a specific item (e.g., painting, land) and is not interchangeable with other tokens because it has unique (digital) properties encoded in the smart contract that creates it.

For WoS, we also searched in Keywords Plus. This is a feature of WoS that returns an article if the words or phrases in our search appear frequently in the titles of the article’s references but not in the title of the article itself. By doing this, we also collected articles that are potentially relevant to our topic but do not contain the keywords themselves.

We cleaned the keywords of the papers before conducting the network analysis to eliminate the effects of plural forms, abbreviations, spelling variations, and similar inconsistencies.
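A keyword-cleaning step of this kind could look like the sketch below, which lowercases keywords, maps spelling variants and abbreviations to a canonical form, and applies a naive plural rule. The mapping tables and the plural heuristic are assumptions for illustration, not the rules actually used in this review.

```python
from collections import Counter

# Hypothetical mapping tables; a real study would derive them from its keyword list.
SPELLING_VARIANTS = {"crypto-currency": "cryptocurrency", "block chain": "blockchain"}
ABBREVIATIONS = {"ico": "initial coin offering", "nft": "non-fungible token"}


def singularize(keyword):
    """Naive plural handling: '-ies' -> '-y', drop a final 's' except after ss/us/is."""
    if keyword.endswith("ies") and len(keyword) > 4:
        return keyword[:-3] + "y"
    if keyword.endswith("s") and not keyword.endswith(("ss", "us", "is")):
        return keyword[:-1]
    return keyword


def normalize_keyword(keyword):
    """Map a raw author keyword to a canonical form."""
    kw = keyword.strip().lower()
    kw = SPELLING_VARIANTS.get(kw, kw)
    kw = singularize(kw)
    return ABBREVIATIONS.get(kw, kw)


raw_keywords = ["ICOs", "Initial coin offerings", "Cryptocurrencies",
                "Crypto-currency", "Sentiment analysis", "Blockchains", "NFT"]

keyword_counts = Counter(normalize_keyword(k) for k in raw_keywords)
print(keyword_counts)
```

After normalization, variants such as "ICOs" and "Initial coin offerings" count as the same node in the keyword network.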

Some of the papers use various types of text data and methodologies; therefore, the sums of text data and methodology exceed the number of papers.

The mathematical principles of the methodologies are beyond the scope of this review, but for each methodology, interested readers can refer to the cited studies for details.

Abbreviations

BERT: Bidirectional encoder representations from transformers

BoW: Bag-of-words

CBOW: Continuous bag of words

CPY: Citation per year

DTM: Dynamic topic models

FSB: Financial Stability Board

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

GDPR: General Data Protection Regulation

ICO: Initial coin offering

IEO: Initial exchange offering

IPO: Initial public offering

ISO: International Organization for Standardization

LDA: Latent Dirichlet allocation

LM: Loughran and McDonald sentiment lexicon

LSTM: Long short-term memory

MiCA: Markets in Crypto-Assets

NFT: Non-fungible token

NLP: Natural language processing

SEC: U.S. Securities and Exchange Commission

SSRN: Social Science Research Network

STO: Security token offering

TF-IDF: Term frequency-inverse document frequency

VADER: Valence Aware Dictionary for Sentiment Reasoning

WoS: Web of Science

Abraham J, Higdon D, Nelson J, Ibarra J (2018) Cryptocurrency price prediction using tweet volumes and sentiment analysis. SMU Data Sci Rev 1(3):1

Akba F, Medeni IT, Guzel MS, Askerzade I (2021) Manipulator detection in cryptocurrency markets based on forecasting anomalies. IEEE Access 9:108819–108831

Alahi I, Islam M, Iqbal A, Bosu A (2019) Identifying the challenges of the blockchain community from Stackexchange topics and trends. 2019 IEEE 43rd Ann Comput Softw Appl Conf (COMPSAC) 1:123–128

Anamika A, Subramaniam S (2022) Do news headlines matter in the cryptocurrency market? Appl Econ 54(54):6322–6338

Aslam N, Rustam F, Lee E, Washington PB, Ashraf I (2022) Sentiment analysis and emotion detection on cryptocurrency related tweets using ensemble LSTM-GRU model. IEEE Access 10:39313–39324

Ayvaz S, Shiha MO (2018) A scalable streaming big data architecture for real-time sentiment analysis. In: Proceedings of the 2018 2nd international conference on cloud and big data computing, pp 47–51

Azqueta-Gavaldón A (2020) Causal inference between cryptocurrency narratives and prices: evidence from a complex dynamic ecosystem. Phys A: Stat Mech Appl 537:122574

Baccianella S, Esuli A, Sebastiani F (2010) Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the seventh international conference on language resources and evaluation (LREC’10)

Bahamazava K, Nanda R (2022) The shift of darknet illegal drug trade preferences in cryptocurrency: the question of traceability and deterrence. Forens Sci Int: Digit Investig 40:301377

Bahamazava K, Reznik S (2022) The comparative analysis of regulations in the Italian Republic and the Russian Federation against cryptolaundering techniques. J Money Laundering Control

Baldwin C (2016) Bitcoin worth $72 million stolen from Bitfinex exchange in Hong Kong. Reuters Media. Accessed 6 Nov 2022

Balfagih AM, Keselj V (2019) Evaluating sentiment classifiers for Bitcoin tweets in price prediction task. In: 2019 IEEE international conference on big data (Big Data), pp 5499–5506

Bao Y, Datta A (2014) Simultaneously discovering and quantifying risk types from textual risk disclosures. Manag Sci 60(6):1371–1391

Barth JR, Herath HS, Herath TC, Xu P (2020) Cryptocurrency valuation and ethics: a text analytic approach. J Manag Anal 7(3):367–388

Bashchenko O (2022) Bitcoin price factors: natural language processing approach. Available at SSRN 4079091

Birim ŞÖ, Sönmez FE (2022) Social sentiment analysis for prediction of cryptocurrency prices using neuro-fuzzy techniques. In: International conference on intelligent and fuzzy systems, pp 606–616

Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on machine learning, pp 113–120

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(Jan):993–1022

Bloomberg: Ethereum Developers Push Ahead With Shanghai Upgrade to Enable Withdrawals (2023). https://www.bloomberg.com/news/articles/2023-01-05/ethereum-developers-push-ahead-with-update-enabling-withdrawals Accessed 01/24/2023

Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146

Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory, pp 144–152

Burnie A, Yilmaz E (2019) Social media and Bitcoin metrics: which words matter. R Soc Open Sci 6(10):191068

Caliskan K (2020) Platform works as stack economization: cryptocurrency markets and exchanges in perspective. Sociologica 14(3):115–142

Cary M (2021) Down with the #dogefather: evidence of a cryptocurrency responding in real time to a crypto-tastemaker. J Theor Appl Electron Commer Res 16(6):2230–2240

Casino F, Dasaklis TK, Patsakis C (2019) A systematic literature review of blockchain-based applications: current status, classification and open issues. Telemat Inform 36:55–81

Caviggioli F, Lamberti L, Landoni P, Meola P (2020) Technology adoption news and corporate reputation: sentiment analysis about the introduction of Bitcoin. J Prod Brand Manag 29(7):877–897

Cavnar WB, Trenkle JM et al (1994) N-gram-based text categorization. In: Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, pp 161–175

Chakkarwar V, Tamane SC (2019) Quick insight of research literature using topic modeling. In: Zhang Y-D, Mandal JK, So-In C, Thakur NV (eds) Smart trends in computing and communications 2019, vol. 165, pp 189–197

Chen CY-H, Després R, Guo L, Renault T (2019a) What makes cryptocurrencies special? Investor sentiment and return predictability during the bubble. Technical report, IRTG 1792 Discussion Paper

Chen MA, Wu Q, Yang B (2019b) How valuable is fintech innovation? Rev Financ Stud 32(5):2062–2106

Chen MA, Hu S, Wang J, Wu Q (2022) Can blockchain technology help overcome contractual incompleteness? Evidence from state laws. Available at SSRN 3915895

Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd Acm Sigkdd international conference on knowledge discovery and data mining, pp 785–794

Cheuque CG, Reutter LJ (2019) Bitcoin price prediction through opinion mining. In: Companion proceedings of The 2019 World Wide Web conference, pp 755–762

Chiarello F, Belingheri P, Bonaccorsi A, Fantoni G, Martini A (2021) Value creation in emerging technologies through text mining: the case of blockchain. Technol Anal Strateg Manag 33(12):1404–1420

Choi J, Lee T, Kim K, Seo M, Cui J, Shin S (2022) Discovering message templates on large scale Bitcoin abuse reports using a two-fold NLP-based clustering method. IEICE Trans Inf Syst 105(4):824–827

Chokor A, Alfieri E (2021) Long and short-term impacts of regulation in the cryptocurrency market. Q Rev Econ Financ 81:157–173

Chousein Z, Tetik HY, Sağlam RB, Bülbül A, Li S (2020) Tension between GDPR and public blockchains: a data-driven analysis of online discussions. In: 13th international conference on security of information and networks, pp 1–8

Christie WG, Huang RD (1995) Following the pied piper: do individual returns herd around the market? Financ Anal J 51(4):31–37

Chursook A, Dawod AY, Chanaim S, Naktnasukanjn N, Chakpitak N (2022) Twitter sentiment analysis and expert ratings of initial coin offering fundraising: evidence from Australia and Singapore markets. TEM J 11(1):44–55

Cohney S, Hoffman D, Sklaroff J, Wishnick D (2019) Coin-operated capitalism. Columbia Law Rev 119(3):591–676

Coleman M, Liau TL (1975) A computer readability formula designed for machine scoring. J Appl Psychol 60(2):283

Cong LW, He Z (2019) Blockchain disruption and smart contracts. Rev Financ Stud 32(5):1754–1797

Cong LW, Liang T, Yang B, Zhang X (2021) Analyzing textual information at scale. In: Information for Efficient decision making: big data, blockchain and relevance. World Scientific, Singapore, pp 239–271

Corbet S, Meegan A, Larkin C, Lucey B, Yarovaya L (2018) Exploring the dynamic relationships between cryptocurrencies and other financial assets. Econ Lett 165:28–34

Critien JV, Gatt A, Ellul J (2022) Bitcoin price change and trend prediction through Twitter sentiment and data volume. Financ Innov 8(1):1–20

da Silva CF, Moro S (2021) Blockchain technology as an enabler of consumer trust: a text mining literature analysis. Telemat Inform 60:101593

Dale E, Chall JS (1948) A formula for predicting readability: instructions. Educ Res Bull 27(2):37–54

Daluwathumullagamage DJ, Sims A (2021) Fantastic beasts: blockchain based banking. J Risk Financ Manag 14(4):1–43

Daluwathumullagamage DJ, Sims A (2020) Blockchain-enabled corporate governance and regulation. Int J Financ Stud 8(2):1–38

Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

Dittmar R, Wu DA (2019) Initial coin offerings hyped and dehyped: an empirical examination. SSRN Electron J

Dumay J, Cai L (2014) A review and critique of content analysis as a methodology for inquiring into IC disclosure. J Intell Cap 15(2):264–290

El-Masri M, Hussain EMA (2021) Blockchain as a mean to secure internet of things ecosystems: a systematic literature review. J Enterp Inf Manag 34(5):1371–1405

Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96:226–231

Fang F, Ventre C, Basios M, Kanthan L, Martinez-Rego D, Wu F, Li L (2022) Cryptocurrency trading: a comprehensive survey. Financ Innov 8(1):1–59

Farimani SA, Jahan MV, Fard AM, Tabbakh SRK (2022) Investigating the informativeness of technical indicators and news sentiment in financial market price prediction. Knowl-Based Syst 247:108742

Flesch R (1979) How to write plain English: a book for lawyers and consumers, 1st edn. Harper & Row, New York

Florysiak D, Schandlbauer A (2022) Experts or charlatans? ICO analysts and white paper informativeness. J Bank Financ 139:106476

Foley S, Karlsen JR, Putniņš TJ (2019) Sex, drugs, and bitcoin: how much illegal activity is financed through cryptocurrencies? Rev Financ Stud 32(5):1798–1853

Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232

Frizzo-Barker J, Chow-White PA, Adams PR, Mentanko J, Ha D, Green S (2020) Blockchain as a disruptive technology for business: a systematic review. Int J Inf Manag 51:102029

Fu C, Koh A, Griffin P (2019) Automated theme search in ICO whitepapers. J Financ Data Sci 1(4):140–158

FXStreet: Ethereum Shanghai Upgrade: guide to the ETH hard fork, unstaking and liquid staking projects (2023)

Galeshchuk S, Vasylchyshyn O, Krysovatyy A (2018) Bitcoin response to Twitter sentiments. In: CEUR workshop proceedings, pp 160–168

Gallagher RJ, Reing K, Kale D, Ver Steeg G (2017) Anchored correlation explanation: topic modeling with minimal domain knowledge. Trans Assoc Comput Linguist 5:529–542

Garanina T, Ranta M, Dumay J (2021) Blockchain in accounting research: current trends and emerging topics. Account Audit Account J 35(7):1507–1533

Ge C, Shi H, Jiang J, Xu X (2021) Investigating the demand for blockchain talents in the recruitment market: evidence from topic modeling analysis on job postings. Inf Manag 59(7):103513

Gentzkow M, Kelly B, Taddy M (2019) Text as data. J Econ Lit 57(3):535–74

Georgoula I, Pournarakis D, Bilanakos C, Sotiropoulos D, Giaglis GM (2015) Using time-series and sentiment analysis to detect the determinants of Bitcoin prices. Available at SSRN 2607167

Goldberg Y (2017) Neural network methods for natural language processing. Synth Lect Human Lang Technol 10(1):1–309

Goldstein RAM, Tabuchi H (2014) Erosion of faith was death knell for Mt. Gox. NY Times. Accessed 6 Nov 2022

Grassman R, Bracamonte V, Davis M, Sato M (2021) Attitudes to cryptocurrencies: a comparative study between Sweden and Japan. Rev Socionetwork Strateg 15(1):169–194

Grimmer J, Stewart BM (2013) Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit Anal 21(3):267–297

Grover P, Kar AK, Janssen M, Ilavarasan PV (2019) Perceived usefulness, ease of use and user acceptance of blockchain technology for digital transactions: insights from user-generated content on Twitter. Enterp Inf Syst 13(6):771–800

Gunning R (1952) Technique of clear writing

Günther E, Quandt T (2016) Word counts and topic models. Digit J 4(1):75–88

Gurdgiev C, O’Loughlin D (2020) Herding and anchoring in cryptocurrency markets: investor reaction to fear and uncertainty. J Behav Exp Financ 25:100271

Han S, Ye S, Zhang H (2020) Visual exploration of internet news via sentiment score and topic models. Comput Vis Med 6(3):333–347

Hanley KW, Hoberg G (2019) Dynamic interpretation of emerging risks in the financial sector. Rev Financ Stud 32(12):4543–4603

Hashimoto K, Xiong C, Tsuruoka Y, Socher R (2016) A joint many-task model: growing a neural network for multiple NLP tasks. arXiv preprint arXiv:1611.01587

Hassan MK, Hudaefi FA, Caraka RE (2021) Mining netizen’s opinion on cryptocurrency: sentiment analysis of Twitter data. Stud Econ Financ 39(3):365–385

Henry E (2008) Are investors influenced by how earnings press releases are written? J Bus Commun (1973) 45(4):363–407

Hinds-Charles C, Adames J, Yang Y, Shen Y, Wang Y (2019) A longitude analysis on Bitcoin issue repository. In: 2018 1st IEEE international conference on hot information-centric networking (HotICN), pp 212–217

Hirata E, Lambrou M, Watanabe D (2021) Blockchain technology in supply chain management: insights from machine learning algorithms. Marit Bus Rev 6(2):114–128

Ho TK (1995) Random decision forests. Proc 3rd Int Conf Doc Anal Recognit 1:278–282

Hoberg G, Phillips G (2016) Text-based network industries and endogenous product differentiation. J Polit Econ 124(5):1423–1465

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Article   CAS   PubMed   Google Scholar  

Huang X, Zhang W, Tang X, Zhang M, Surbiryala J, Iosifidis V, Liu Z, Zhang J (2021) LSTM based sentiment analysis for cryptocurrency prediction. In: International conference on database systems for advanced applications, pp 617–621

Hutto C, Gilbert E (2014) VADER: a parsimonious rule-based model for sentiment analysis of social media text. Proc Int AAAI Conf Web and Soc Med 8:216–225

Huynh TLD (2021) Does Bitcoin react to Trump’s tweets? J Behav Exp Financ 31:100546

Huynh TLD (2022) When Elon musk changes his tone, does Bitcoin adjust its tune? Comput Econ. https://doi.org/10.1007/s10614-021-10230-6

Ibba G, Ortu M, Tonelli R (2021) Smart contracts categorization with topic modeling techniques. In: Marin B, Wautelet Y, Heng S, Assar S, Asprion PM, Morichetta A (eds) CEUR Workshop proceedings, vol. 3031, pp 64–73

Inamdar A, Bhagtani A, Bhatt S, Shetty PM (2019) Predicting cryptocurrency value using sentiment analysis. In: 2019 International conference on intelligent computing and control systems (ICCS), pp 932–934

Jaccard P (1912) The distribution of the flora in the Alpine zone. New Phytol 11(2):37–50

Jain A, Tripathi S, Dwivedi HD, Saxena P (2018) Forecasting price of cryptocurrencies using tweets sentiment analysis. In: 2018 Eleventh international conference on contemporary computing (IC3), pp 1–7

Jurafsky D, Martin J (2017) Naive bayes and sentiment classification. Speech and language processing. Standford University Press, Redwood City, pp 74–91

Kang K, Choo J, Kim Y (2020) Whose opinion matters? Analyzing relationships between Bitcoin prices and user groups in online community. Soc Sci Comput Rev 38(6):686–702

Karalevicius V, Degrande N, de Weerdt J (2018) Using sentiment analysis to predict interday Bitcoin price movements. J Risk Financ 19(1):56–75

Kilimci ZH (2020) Sentiment analysis based direction prediction in Bitcoin using deep learning algorithms and word embedding models. Int J Intell Syst Appl Eng 8(2):60–65

Kim S, Park H, Lee J (2020) Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: a study on blockchain technology trend analysis. Exp Syst Appl 152:113401

Kim YB, Kim JG, Kim W, Im JH, Kim TH, Kang SJ, Kim CH (2016) Predicting fluctuations in cryptocurrency transactions based on user comments and replies. PloS One 11(8):0161197

Klare GR (1974) Assessing readability. Read Res Q 10(1):62–102

Kou G, Yang P, Peng Y, Xiao F, Chen Y, Alsaadi FE (2020) Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Appl Soft Comput 86:105836

Kraaijeveld O, de Smedt J (2020) The predictive power of public Twitter sentiment for forecasting cryptocurrency prices. J Int Financ Mark Inst Money 65:101188

Kwon O-W, Lee J-H (2003) Text categorization based on K-nearest neighbor approach for web site classification. Inf Process Manag 39(1):25–44

Lambert T, Liebau D, Roosenboom P (2021) Security token offerings. Small Bus Econ. https://doi.org/10.1007/s11187-021-00539-9

Laturnus V (2022) What matters to crypto investors? Insights from token offerings on the Ethereum blockchain. Available at SSRN 4087795

Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: International conference on machine learning, pp 1188–1196

Lee D, Seung HS (2000) Algorithms for non-negative matrix factorization. In: Leen T, Dietterich T, Tresp V (eds) Advances in neural information processing systems. vol 13. MIT Press

Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791

Article   CAS   PubMed   ADS   Google Scholar  

Lee J, Zo HJ, Steinberger T (2022) Exploring trends in blockchain publications with topic modeling: implications for forecasting the emergence of industry applications. Available at SSRN 4079332

Li TR, Chamrajnagar AS, Fong XR, Rizik NR, Fu F (2019) Sentiment-based prediction of alternative cryptocurrency price fluctuations using gradient boosting tree model. Front Phys 7:98

Article   CAS   Google Scholar  

Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JP, Clarke M, Devereaux PJ, Kleijnen J, Moher D (2009) The prisma statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: Explanation and elaboration. Ann Internal Med 151(4):65

Lin C, He Y (2009) Joint sentiment/topic model for sentiment analysis. In: Proceedings of the 18th ACM conference on information and knowledge management, pp 375–384

Linton M, Teo EGS, Chen CY, Härdle WK (2017) Dynamic topic modelling for cryptocurrency community forums. In: Härdle WK, Chen CY-H, Overbeck L (eds) Applied quantitative finance. Springer, Berlin, pp 355–372

Chapter   Google Scholar  

Liu Y, Sheng J, Wang W (2021) Technology and cryptocurrency valuation: evidence from machine learning. Available at SSRN 3577208

Loginova E, Tsang WK, van Heijningen G, Kerkhove L-P, Benoit DF (2021) Forecasting directional Bitcoin price returns using aspect-based sentiment analysis on online text data. Mach Learn. https://doi.org/10.1007/s10994-021-06095-3

Loughran TIM, McDonald B (2016) Textual analysis in accounting and finance: a survey. J Account Res 54(4):1187–1230

Loughran T, McDonald B (2011) When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. J Financ 66(1):35–65

Lu H-K, Yang L-w, Lin P-C, Yang T-H, Chen AN (2017) A study on adoption of Bitcoin in Taiwan: using big data analysis of social media. In: Proceedings of the 3rd international conference on communication and information processing, pp 32–38

MacQueen J (1967) Classification and analysis of multivariate observations. In: 5th Berkeley symposium on mathematical statistics and probability, pp 281–297

Mai F, Shan Z, Bai Q, Wang X, Chiang RH (2018) How does social media impact Bitcoin value? A test of the silent majority hypothesis. J Manag Inf Syst 35(1):19–52

Mao H, Counts S, Bollen J (2011) Predicting financial markets: comparing survey, news, Twitter and search engine data. arXiv preprint arXiv:1112.1051

Mareddy S, Gupta D (2022) Analysis of Twitter data for identifying trending domains in blockchain technology. In: Smys S, Bestak R, Palanisamy R, Kotuliak I (eds) Computer networks and inventive communication technologies, vol. 75, pp 651–672

McLaughlin GH (1969) SMOG grading: a new readability formula. J Read 12(8):639–646

Medhi PK (2020) Blockchain-enabled supply chain transparency, supply chain structural dynamics, and sustainability of complex global supply chains: a text mining analysis. In: Information for efficient decision making: big data. Blockchain And relevance. World Scientific Publishing Co, Singapore, pp 273–312

Mendoza-Tello JC, Mora H, Pujol-López FA, Lytras MD (2018) Social commerce as a driver to enhance trust and intention to use cryptocurrencies for electronic payments. IEEE Access 6:50737–50751

Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

Mnif E, Lacombe I, Jarboui A (2021) Users’ perception toward Bitcoin Green with big data analytics. Soc Bus Rev 16(4):592–615

Mohammad SM (2018) Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In: Proceedings of The annual conference of the association for computational linguistics (ACL), Melbourne, Australia

Moher D, Liberati A, Tetzlaff J, Altman DG (2009) PRISMA Group: preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Ann Internal Med 151(4):264–269

Mousa A, Schuller B (2017) Contextual bidirectional long short-term memory recurrent neural network language models: a generative approach to sentiment analysis. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics: volume 1, Long papers. Association for Computational Linguistics, Valencia, Spain, pp 1023–1032

Moustafa H, Malli M, Hazimeh H (2022) Real-time Bitcoin price tendency awareness via social media content tracking. In: 2022 10th international symposium on digital forensics and security (ISDFS), pp 1–6

Nakamoto S (2008) A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf . Accessed 10 Apr 2022

Narman HS, Uulu AD, Liu J (2018) Profile analysis for cryptocurrency in social media. In: 2018 IEEE international symposium on signal processing and information technology (ISSPIT), pp 229–234

Newman D, et al (2010) Automatic evaluation of topic coherence. In: Human language technologies: the 2010 annual conference of the north american chapter of the association for computational linguistics, pp 100–108

Nguyen TH, Shirai K (2015) Topic modeling based sentiment analysis on social media for stock market prediction. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing, vol. 1: Long Papers, pp 1354–1364

Nielsen FÅ (2011) A new ANEW: evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903

Nizzoli L, Tardelli S, Avvenuti M, Cresci S, Tesconi M, Ferrara E (2020) Charting the landscape of online cryptocurrency manipulation. IEEE Access 8:113230–113245

Ortu M, Uras N, Conversano C, Bartolucci S, Destefanis G (2022) On technical trading and social media indicators for cryptocurrency price classification through deep learning. Exp Syst Appl 198:116804

Page MJ, Moher D, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE et al (2021) PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. bmj. https://doi.org/10.1007/s10994-021-06095-3

Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, Moher D (2021) The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ (Clin Res ed) 372:71

Pan L, Feng L, Jiayin Q (2020) Adaptive evolution mechanism of blockchain community based on token-based halving event. In: 2020 Chinese automation congress (CAC), pp 6140–6144

Pant DR, Neupane P, Poudel A, Pokhrel AK, Lama BK (2018) Recurrent neural network based Bitcoin price prediction by Twitter sentiment analysis. In: 2018 IEEE 3rd international conference on computing, communication and security (ICCCS), pp 128–132

Patil AP, Akarsh TS, Parkavi A (2018) A study of opinion mining and data mining techniques to analyse the cryptocurrency market. In: 2018 3rd international conference on computational systems and information technology for sustainable solutions (CSITSS), pp 198–203

Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543

Perdana A, Robb A, Balachandran V, Rohde F (2021) Distributed ledger technology: its evolutionary path and the road ahead. Inf Manag 58(3):103316

Phillips RC, Gorse D (2018) Mutual-excitation of cryptocurrency market returns and social media topics. In: 4th international conference on frontiers of educational technologies, pp 80–86

Polasik M, Piotrowska AI, Wisniewski TP, Kotkowski R, Lightfoot G (2015) Price fluctuations and the use of Bitcoin: an empirical inquiry. Int J Electron Commer 20(1):9–49

Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A (2018) Catboost: unbiased boosting with categorical features. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, vol 31. Curran Associates, Inc., 

Rahman S, Hemel JN, Anta SJA, Al Muhee H, Uddin J (2018) Sentiment analysis using R: an approach to correlate cryptocurrency price fluctuations with change in user sentiment using machine learning. In: 2018 Joint 7th international conference on informatics, electronics and vision (ICIEV) and 2018 2nd international conference on imaging, vision and pattern recognition (icIVPR), pp 492–497

Ramos J (2003) Using TF-IDF to determine word relevance in document queries. In: Proceedings of the first instructional conference on machine learning, vol. 242, pp 29–48

Reimers N, Gurevych I (2020) Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv preprint arXiv:2004.09813

Rognone L, Hyde S, Zhang SS (2020) News sentiment in the cryptocurrency market: an empirical comparison with forex. Int Rev Financ Anal 69:101462

Rosenthal R (1979) The file drawer problem and tolerance for null results. Psychol Bull 86(3):638

Salton G, Yang C-S, Yu CT (1975) A theory of term importance in automatic text analysis. J Am Soc Inf Sci 26(1):33–44

Sapkota N, Grobys K (2021) Fear sells: determinants of fund-raising success in the cross-section of initial coin offerings. Available at SSRN 3843138

Şaşmaz E, Tek FB (2021) Tweet sentiment analysis for cryptocurrencies. In: 2021 6th international conference on computer science and engineering (UBMK), pp 613–618

Sattarov O, Jeon HS, Oh R, Lee JD (2020) Forecasting Bitcoin price fluctuation by Twitter sentiment analysis. In: 2020 international conference on information science and communications technologies (ICISCT)

Schwenkler G, Zheng H (2021) News-driven peer co-movement in crypto markets. Available at SSRN 3572471

Sebastião H, Godinho P (2021) Forecasting and trading cryptocurrencies with machine learning under changing market conditions. Financ Innov 7(1):1–30

SEC: Staff Accounting Bulletin No. 121 (2022)

Senter R, Smith EA (1967) Automated readability index. Technical report, Cincinnati Univ OH

Shahid MN, Jungpil H (2020) A cross-disciplinary review of blockchain research trends and methodologies: topic modeling approach. In: 53rd annual hawaii international conference on system sciences, HICSS 2020, vol. 2020, pp 4053–4059

Siddaway AP, Wood AM, Hedges LV (2019) How to do a systematic review: a best practice guide for conducting and reporting narrative reviews, meta-analyses, and meta-syntheses. Ann Rev Psychol 70:747–770

Siu GA, Collier B, Hutchings A (2021) Follow the money: the relationship between currency exchange and illicit behaviour in an underground forum. In: 2021 IEEE european symposium on security and privacy workshops (EuroS &PW), pp 191–201

Stanley M (2019) The application of behavioural heuristics to initial coin offerings valuation and investment. J Br Blockchain Assoc 2(1):7776

Steinert L, Herff C (2018) Predicting altcoin returns using social media. PloS One 13(12):0208119

Stone PJ, Dunphy DC, Smith MS (1966) The general inquirer: a computer approach to content analysis. M.I.T. Press, Oxford, England

Stratopoulos TC, Wang VX, Ye H (2022) Use of corporate disclosures to identify the stage of blockchain adoption. Account Horiz 36(1):197–220

Thewissen J, Shrestha P, Torsin W, Pastwa AM (2022) Unpacking the black box of ICO white papers: a topic modeling approach. J Corp Financ 75:102225

Toma AM, Cerchiello P (2020) Initial coin offerings: risk or opportunity? Front Artif Intell 3:18

U.Today: Ethereum (ETH): Shanghai Hard Fork Causes Concern Among Developers, Here Are Reasons (2023)

Vacca S, Costerbosa CL, Spada A, Riotta G, Uras N (2021) Investigation of coronavirus impact on blockchain and cryptocurrencies markets. In: 2021 IEEE/ACM 4th international workshop on emerging trends in software engineering for blockchain (WETSEB), pp 56–60

Valencia F, Gómez-Espinosa A, Valdés-Aguirre B (2019) Price movement prediction of cryptocurrencies using sentiment analysis and machine learning. Entropy 21(6):589

Vohra S, Teraiya J (2013) A comparative study of sentiment analysis techniques. J Jikrce 2(2):313–317

Voskobojnikov A, Wiese O, Mehrabi Koushki M, Roth V, Beznosov K (2021) The U in crypto stands for usable: An empirical study of user experience with mobile cryptocurrency wallets. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–14

Wang J, Fan Y, Zhang H, Feng L (2021) Technology hotspot tracking: topic discovery and evolution of china’s blockchain patents based on a dynamic LDA model. Symmetry 13(3):415

Article   ADS   Google Scholar  

Wang Q, Li R, Wang Q, Chen S (2021b) Non-fungible token (NFT): overview, evaluation, opportunities and challenges. arXiv preprint arXiv:2105.07447

Xie P (2021) The interplay between investor activity on virtual investment community and the trading dynamics: evidence from the Bitcoin market. Inf Syst Front 24(4):1287–1303

Xu M, Chen X, Kou G (2019) A systematic review of blockchain. Financ Innov 5(1):1–14

Xu W, Wang T, Chen R, Zhao JL (2021) Prediction of initial coin offering success based on team knowledge and expert evaluation. Decis Support Syst 147:113574

Xu XF, He YY (2022) Blockchain application in modern logistics information sharing: a review and case study analysis. Prod Plan Control. https://doi.org/10.1080/09537287.2022.2058997

Yadav J, Misra M, Rana NP, Singh K, Goundar S (2021) Netizens’ behavior towards a blockchain-based esports framework: a TPB and machine learning integrated approach. Int J Sports Mark Spons 23(4):665–683

Yao W, Xu K, Li Q (2019) Exploring the influence of news articles on Bitcoin price with machine learning. In: 2019 IEEE Symposium on computers and communications (ISCC), pp 1–6

Yen J-C, Wang T (2021) Stock price relevance of voluntary disclosures about blockchain technology and cryptocurrencies. Int J Account Inf Syst 40:100499

Yen J-C, Wang T, Chen Y-H (2021) Different is better: how unique initial coin offering language in white papers enhances success. Account Financ 61(4):5309–5340

Zeng S, Ni X, Yuan Y, Wang F-Y (2018) A bibliometric analysis of blockchain research. In: 2018 IEEE intelligent vehicles symposium (IV), vol. 2018, pp 102–107

Zhang H, Daim T, Zhang YP (2021) Integrating patent analysis into technology roadmapping: a latent dirichlet allocation based technology assessment and roadmapping in the field of blockchain. Technol Forecast Soc Change 167:120729

Zhang Y, Kang S, Dai W, Chen S, Zhu J (2021b) Code will speak: early detection of Ponzi smart contracts on Ethereum. In: 2021 IEEE international conference on services computing (SCC), pp 301–308

Zhang Y, Jin R, Zhou Z-H (2010) Understanding bag-of-words model: a statistical framework. Int J Mach Learn Cybernet 1(1):43–52

Download references

Acknowledgements

The authors would like to thank the anonymous reviewers and the editors for their valuable comments and suggestions which led to great improvement of this paper through the revision process.

This research is supported by the Manchot Graduate School “Competitiveness of Young Enterprises” at the Heinrich-Heine-University of Düsseldorf. Funding was provided by the Jürgen Manchot Stiftung.

Author information

Authors and affiliations.

Manchot Graduate School, Heinrich Heine University Düsseldorf, Universitätsstr. 1, 40225, Düsseldorf, Germany

Xian Zhuo & Denefa Bostandzic

Durham University Business School, Durham University, Mill Hill Lane, DH1 3LB, Durham, UK

Felix Irresberger

Department of Management and Entrepreneurship, Witten/Herdecke University, Alfred-Herrhausen- Str. 50, 58448, Witten, Germany

Denefa Bostandzic


Contributions

XZ: Conceptualization, Methodology, Writing. FI: Supervision, Reviewing, Writing. DB: Supervision, Reviewing, Writing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xian Zhuo.

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: The list of keywords for the query

We start from an initial list of keywords covering fundamental blockchain concepts, based on our knowledge of the blockchain ecosystem (i.e., blockchain, cryptocurrency, smart contract, and ICO), and expand the list by sampling academic papers that include additional keywords. In this way, we build up a wider set of keywords by adding non-redundant keywords observed in the academic literature. Our list of keywords is an intersection of keywords used in many blockchain-related papers. The complete list of keywords for the query is as follows:

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Zhuo, X., Irresberger, F. & Bostandzic, D. How are texts analyzed in blockchain research? A systematic literature review. Financ Innov 10, 60 (2024). https://doi.org/10.1186/s40854-023-00501-6


Received: 21 June 2022

Accepted: 25 April 2023

Published: 29 February 2024

DOI: https://doi.org/10.1186/s40854-023-00501-6


  • Text analysis
  • Systematic literature review
  • Machine learning algorithm
  • Topic modeling
  • Sentiment analysis



A Systematic Overview of Blockchain Research

Blockchain has been receiving growing attention from both academia and practice. This paper investigates the status of blockchain-related research and analyzes the development and evolution of this emerging area via bibliometric analysis. We selected and explored 2451 papers published between 2013 and 2019 from the Web of Science Core Collection database. The analysis considers different dimensions, including annual publication and citation trends, author distribution, popular research themes, collaboration among countries (regions) and institutions, top papers, major publication journals (conferences), supportive funding agencies, and emerging research trends. The results show that the volume of blockchain literature is still increasing, and that research priorities shifted during the observation period from bitcoin, cryptocurrency, blockchain, smart contracts, and the Internet of Things to the distributed ledger and to the challenges and inefficiencies of blockchain. The findings deliver a holistic picture of blockchain research, which illuminates future research directions and provides implications for both academic research and enterprise practice.

1 Introduction

With the advent of bitcoin, digital cash denoted as BTC makes it possible to store and transmit value through the bitcoin network [1]. Blockchain, the technology underlying bitcoin, adopts a peer-to-peer network to authenticate transactions and has been gaining growing attention in practice, notably through Libra, a global currency and financial infrastructure launched by Facebook, and through digital currency electronic payment. Blockchain is also an increasingly important topic in academia. Blockchain research has progressed considerably, attracting attention from researchers, practitioners, and policy-makers [2, 3, 4, 5, 6, 7, 8, 9].

Blockchain promises substantial benefits across industries, for instance in finance and economy [10, 11, 12], the Internet of Things [13, 14, 15], energy [16, 17], supply chain [18, 19], and other areas. It is often compared with the Internet and is even referred to as a new form of the Internet. As a result, the number of blockchain publications is growing rapidly. According to an initial search on the Web of Science Core Collection, over 2000 published scientific papers are related to blockchain.

As the number of blockchain publications increases quickly, several studies have tried to provide insights into blockchain research via literature reviews [20, 21, 22, 23, 24]. A comprehensive scientometric analysis of academic articles published in influential journals would nonetheless benefit the further development of blockchain research. This research conducts a bibliometric visualization review and attempts to deliver an overview of the research in this fast-growing field.

The objectives of this research are as follows. First, we intend to build an overview of the distribution of blockchain-related research by time, authors, journals, institutions, countries (regions), and areas in the blockchain academic community. Second, we probe the key research topics of blockchain study, for which we conduct keyword co-occurrence analysis. Third, we picture the intellectual structure of blockchain study based on co-citation analysis of articles and author co-citation analysis. Finally, we identify the direction of the evolution of blockchain study. We adopt CiteSpace to detect and visualize emerging trends in blockchain study. To achieve these targets, we posed the following research questions:

Q1: What is the distribution pattern of blockchain publications and citations over recent years?
Q2: Which are the main international contributing countries (regions) and institutions in blockchain research, and what is the collaboration network among them?
Q3: What are the characteristics of the authorship distribution pattern?
Q4: What are the key blockchain subjects based on the number of publications?
Q5: Which are the major journals or conferences for blockchain-related research?
Q6: Which are the most influential papers in blockchain research based on the number of citations?
Q7: Who are the most influential authors in blockchain research according to the author co-citation network?
Q8: What are the research trends in blockchain?
Q9: What are the most supportive funding agencies for blockchain research?
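The keyword co-occurrence and co-citation analyses named among the objectives above both reduce to counting how often two items appear together in the same record: two keywords listed by one paper, or two references cited by one paper. The following is a minimal sketch of that counting step only, not the CiteSpace procedure used in this study. The function name `pair_counts` and the toy records are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_counts(item_sets):
    """Count how often each unordered pair of items appears in the same record."""
    counts = Counter()
    for items in item_sets:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy records; real input would come from a bibliographic export.
keywords_per_paper = [
    ["blockchain", "bitcoin", "smart contract"],
    ["blockchain", "bitcoin"],
    ["blockchain", "internet of things"],
]
references_per_paper = [
    ["ref_A", "ref_B"],
    ["ref_A", "ref_B", "ref_C"],
    ["ref_C"],
]

# Keyword co-occurrence: pairs of keywords used by the same paper.
keyword_cooccurrence = pair_counts(keywords_per_paper)
# Co-citation: pairs of references cited together by the same paper.
cocitation = pair_counts(references_per_paper)

print(keyword_cooccurrence.most_common(3))
print(cocitation.most_common(3))
```

The resulting pair counts can serve as edge weights of a co-occurrence or co-citation network, which is the kind of structure that visualization tools then cluster and draw.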

Our intended contributions are twofold. First, this is an attempt to adopt co-citation analysis to provide comprehensive and up-to-date development trends in the latest hot area, blockchain. Second, this study depicts the state of the art of blockchain research and sheds light on the evolution of blockchain. The findings will be illuminating for academic researchers, entrepreneurs, and policymakers.

The rest of the article is organized as follows. The “Literature review” section summarizes related work. The “Data and methodology” section describes the data source and methodological process. The “Results” section presents the main results based on bibliometric and statistical analysis. The “Conclusions and implications” section concludes this research, provides answers to the aforementioned research questions, and poses directions for further work.

2 Literature Review

Scientometric analysis, also known as bibliometric network visualization analysis, has been widely adopted in numerous areas to identify and visualize the trends in certain fields. For instance, Bonilla, et al. analyzed the development of academic research in economics in Latin America based on a scientometric analysis [25]. Li, et al. conducted research on emerging trends in the business model study using co-citation analysis [26]. Gaviriamarin, et al. applied bibliometric analysis to the publications in the Journal of Knowledge Management [27].

Since the birth of bitcoin, blockchain, as its foundation, has gained an increasing amount of attention in academic research and in practice. Research papers focusing on blockchain are abundant and continue to emerge. Among them, a few studies investigate the research trends of blockchain based on bibliometric analysis [22, 23, 28, 29, 30].

Table 1 presents a summary of these bibliometric studies. They summarized some findings on blockchain research, yet very few investigated the co-citation network and the evolution of popular topics in a timeline view. The number of papers these studies analyzed is relatively small, possibly because they used simple retrieval formulas when searching for blockchain-related articles, which could pose a threat to the validity of bibliometric analysis. Therefore, this research aims to conduct a comprehensive analysis of the status of blockchain research, which is beneficial to future research and practice.

An overview of existing bibliometric studies on blockchain research

Note: NP = number of publications; WOS = Web of Science Core Collection; CNKI = China National Knowledge Infrastructure Databases; EI = EI Compendex, an engineering bibliographic database published by Elsevier; Scopus = Elsevier’s abstract and citation database.

3 Data and Methodology

This section elaborates on the steps taken to conduct a comprehensive bibliometric analysis: 1) data collection and 2) methodological process. The overall approach and methodology are shown in Figure 1, and the details are as follows.

Figure 1 Research methodology

3.1 Data Collection

As the leading database for science and literature, the Web of Science Core Collection has been widely used in bibliometrics analysis. It gives access to multidisciplinary information from over 18,000 high impact journals and over 180,000 conference proceedings, which allows for in-depth exploration of the complete network of citations in any field.

To acquire enough articles related to blockchain, we select keywords from Wikipedia, industry information on blockchain, and some existing research literature [1, 20, 23, 30]. Moreover, blockchain research papers span various fields, and although some papers use these keywords in the abstract or the main body, blockchain is not their emphasis. Therefore, to obtain more accurate results, we conduct a title search instead of a topic search. Table 2 presents the retrieval results with different keywords in the titles. Among publications related to blockchain, proceedings papers are the most numerous, closely followed by articles, with only a few reviews. Based on the comparison of the five search results in Table 2, and for accuracy and comprehensiveness, we manually go through the abstracts of all the papers returned by the title search and choose those related to blockchain. Finally, a dataset of 2,451 articles is used in the subsequent analysis.

The dataset has good representativeness. Although it may not cover all papers on blockchain, it contains the core papers, and in bibliometric analysis core papers are enough to provide a holistic view of blockchain research.

Blockchain research article characteristics by year from 2013 to 2019

Note: Document types include Article (A), Proceedings Paper (P), and Review (R); Timespan = 2013–2019, downloaded on May 31, 2019; Indexes = SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, ESCI, CCR-EXPANDED, IC.

3.2 Methodological Process

The bibliometric approach has received increasing attention in many research domains. In this study, the methodological process includes three main methods: 1) descriptive statistical analysis; 2) article co-citation, author co-citation, and cluster analysis on co-cited articles; and 3) time-zone analysis on co-cited keywords.

Descriptive statistical analysis displays the overall status of research development in the target field. It presents an overview by publication year, document type, research area of the publishing journals, and number of citations, and identifies the most cited papers and the most influential authors, institutions, and countries. Co-citation analysis helps to identify the frequency of co-cited papers and authors and provides crucial insights into the intellectual structure of a research field [31]. Time-zone analysis helps to understand the flow of information and research trends in the target area [32].

Various visualization tools have been designed and developed as computer software, such as Citespace and VOSviewer. In this study, we use Citespace for co-citation analysis and time-zone analysis, and VOSviewer for social network analysis and visualization. We also apply other tools such as Excel and Tableau for basic statistical analysis and the visualization of the bibliometric results. Notably, in Citespace, core nodes are displayed as “citation tree-rings”, which contain abundant information about an article. For instance, the color of a citation ring denotes the year of the corresponding citations, with the oldest in dark blue, the newest in light orange, and a spectrum of colors in between. The thickness of a ring is proportional to the number of citations in a time slice [33]. Figure 2 illustrates the details of the citation tree-rings. In addition, Citespace adopts a time-slicing mechanism to produce a synthesized network visualization [34].

Figure 2 Citation tree-rings[33]

4.1 Distribution by Publication Year

Table 3 illustrates several characteristics of blockchain-related publications sorted by the year of publication. Although Nakamoto’s paper was released in 2008 [1], the first blockchain research paper was published in 2013, and the annual numbers of articles and participating countries have grown continuously since then. Only eight articles were published in 2013. Afterward, with a continuous increase, a peak of 1,148 articles was published in 2018, and the number of publications is likely to keep growing. Meanwhile, the annual number of countries taking part in blockchain research also rapidly increased from 6 to 93 between 2013 and 2017, whereas the average number of Times Cited per article declined from 34.00 to 1.73 between 2013 and 2018. Over the observation period, 97 countries took part in blockchain research, and the H-index of our sample is 44.
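For readers unfamiliar with the H-index reported above, the following is a minimal sketch of the standard definition: the largest h such that h papers each have at least h citations. The function name and the citation counts are hypothetical and are not taken from the dataset.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical per-paper citation counts, for illustration only.
example_counts = [10, 8, 5, 4, 3, 0]
example_h = h_index(example_counts)
```

In this illustration, four papers have at least four citations each, but there are not five papers with at least five, so the index is 4.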

Statistical description of Blockchain research article from 2013 to 2019

Note: NP = number of publications; No.CO = number of countries; AV.TC = average number of Times Cited.

Figure 3 presents the cumulative numbers of published articles and citations from 2013 to 2019. There was a drastic increase in the number of papers published annually after 2016. As for the cumulative number of citations, there was no citation of blockchain literature before 2013, and there were 272 citations in 2013. By 2018, this number had grown to over 10,000, which implies the widespread influence of, and attention to, blockchain study in recent years.

Figure 3 Cumulative growth in blockchain publications and citations, 2013–2019

Exponential growth is a typical characteristic of the development of research fields [35]. The model can be expressed as:

$$C = \alpha e^{\beta Y}$$

where C is the cumulative number of articles or citations, Y is the publication or citation year, and α and β are parameters. In this study period, the cumulative articles and citations in the field grow exponentially, with $R^2_{\text{articles}} = 0.9463$ and $R^2_{\text{citations}} = 0.8691$, respectively. This shows that the research quantity curve of blockchain resembles an exponential function, which means that the attention of academic circles to blockchain has been increasing in recent years.
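As an illustration of how such a growth curve could be fitted, the sketch below assumes the model has the form C = α·exp(βY) and estimates the parameters by ordinary least squares on ln C. The function name, the shift of Y to years since the first observation, and the synthetic data are assumptions made for illustration. The fitting procedure used in this study is not specified, so the R² computed here (on the log scale) need not match the reported values.

```python
import math


def fit_exponential(years, cumulative):
    """Fit C = alpha * exp(beta * (Y - Y0)) by least squares on ln(C).

    Y0 is the first year in `years`; shifting the year only rescales alpha.
    Returns (alpha, beta, r_squared), with r_squared computed on the log scale.
    """
    y0 = years[0]
    xs = [y - y0 for y in years]
    zs = [math.log(c) for c in cumulative]
    n = len(xs)
    x_mean = sum(xs) / n
    z_mean = sum(zs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxz = sum((x - x_mean) * (z - z_mean) for x, z in zip(xs, zs))
    beta = sxz / sxx
    ln_alpha = z_mean - beta * x_mean
    ss_res = sum((z - (ln_alpha + beta * x)) ** 2 for x, z in zip(xs, zs))
    ss_tot = sum((z - z_mean) ** 2 for z in zs)
    return math.exp(ln_alpha), beta, 1.0 - ss_res / ss_tot


# Synthetic cumulative counts generated from a known curve (not the study's data).
years = list(range(2013, 2020))
cumulative = [5.0 * math.exp(0.8 * (y - 2013)) for y in years]
alpha, beta, r_squared = fit_exponential(years, cumulative)
```

Because the synthetic series is exactly exponential, the fit recovers the generating parameters. Real publication counts would give an R² below 1.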

4.2 Distribution and International Collaboration Among Countries/Regions

A total of 97 countries/areas participated in blockchain research during the observation period. Table 4 shows the number of articles for each country (region) contributing to publications. Notably, an article may be written by several authors from different countries/areas; therefore, the sum of articles published by each country is larger than the total number of articles. As can be seen from Table 4, the USA and China play leading roles among all countries/areas observed, with 532 (20.94%) and 489 (19.24%) articles respectively, followed by the UK, which published 214 (8.42%) articles.

Blockchain research country (region) ranked by number of articles (top 25)

Note: NP = number of publications; No.TC = number of total Times Cited; AV.TC = average number of Times Cited; No.CA = number of Citing Articles.

From the perspective of citations, according to the country/area distribution in Table 4, USA-authored papers were cited by 1,810 papers with 3,709 citations, accounting for 36.57% of total citations. Articles from the USA also have a very high average number of citations per paper, 6.97, which ranks third among the top 25 countries/areas. Interestingly, articles from Austria and Singapore have the highest average numbers of citations per paper, 7.44 and 7.16 respectively, although the number of publications from these two countries is relatively low compared with the USA. China ranks second after the USA in total citations: its papers were cited by 753 articles with 1,357 (13.38%) citations. Although the number of articles from China is close to that of the USA, the average number of citations per paper is lower, at 2.78. The subsequent countries include the UK, Germany, and Italy. The results indicate that the USA is the most influential country in blockchain research.

International collaboration in science research is both a reality and a necessity [36]. A network whose nodes are the collaborating countries (regions) during the observation period is shown in Figure 4. The network is created with VOSviewer, and the thickness of the linking line between two countries (regions) is directly proportional to their collaboration frequency. We can see from Figure 4 that the USA has the closest collaborative relationships with China, the UK, Australia, Germany, and Canada. China has the closest collaborative relationships with the USA, Australia, Singapore, the UK, and South Korea. The UK has the closest collaborative relationships with the USA, China, France, and Switzerland. Overall, collaboration mainly emerges among highly productive countries (regions).
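The link weights in such a collaboration network can be derived from the country affiliations of each paper. The sketch below is a simplified illustration, not VOSviewer's own procedure: it counts, for every pair of countries, the number of papers on which both appear. The function name and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def collaboration_weights(papers):
    """Count, for every unordered pair of countries, the papers they co-authored.

    `papers` is a list of lists, each holding the countries on one paper.
    """
    weights = Counter()
    for countries in papers:
        # Deduplicate so several authors from one country count only once.
        for a, b in combinations(sorted(set(countries)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical affiliation records, for illustration only.
papers = [
    ["USA", "China"],
    ["USA", "China", "UK"],
    ["UK"],
    ["China", "USA"],
]
weights = collaboration_weights(papers)
```

Each paper is credited in full to every participating country, which is also why per-country article counts sum to more than the total number of articles.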

Figure 4 International collaboration network of the top 25 countries (territories), 2013–2019

4.3 Institution Distribution and Collaboration

A total of 2,190 institutions participated in blockchain-related research. Based on the number of publications, the 25 most productive institutions are shown in Table 5. The Chinese Academy of Sciences had the highest number of publications with 43 papers, followed by the University of London with 42 papers, and Beijing University of Posts and Telecommunications ranked third with 36 papers. The subsequent institutions included the University of California System and the Commonwealth Scientific Industrial Research Organization (CSIRO). In terms of the total number of Times Cited, Cornell University is cited most, with 499 citations and an average number of Times Cited of 20.79. The Massachusetts Institute of Technology followed closely with 407 citations and an average of 22.61. The University of California System ranks third with 258 citations and an average of 8.06. ETH Zurich ranked fourth with 257 citations and an average of 10.28. It is notable that the National University of Singapore also had a high average number of Times Cited, at 12.56. These results indicate that most of the influential institutions are in the USA, Europe, and Singapore. The number of publications from institutions in China is large, whereas few of these papers have a high average Times Cited; among them, papers from the National University of Defense Technology China had the highest average Times Cited, at 7.79.

Blockchain research country (territory) ranked by number of articles (top 25)

To further explore the data, the top 186 institutions with at least 5 articles each are chosen for collaboration network analysis. The collaboration network map is shown in Figure 5, where the thickness of the linking line between two institutions is directly proportional to their collaboration frequency. As seen from the cooperation network, the Chinese Academy of Sciences, Cornell University, Commonwealth Scientific Industrial Research Organization (CSIRO), University of Sydney, and ETH Zurich cooperated widely with other institutions. This suggests that collaboration between institutions may boost blockchain research, which echoes extant research proposing that institutional collaboration and international collaboration may both contribute to article quality [37].

Figure 5 Collaboration network for institutions, 2013–2019

4.4 Authorship Distribution

The total number of authors who contributed to blockchain publications is 5,862. Notably, an article may be written by several authors from different countries (regions) or institutions. Therefore, the total number of authors is larger than the total number of articles. In fact, during the observation period, the average number of authors per paper is 2.4. Table 6 reveals the distribution of the number of authors with different numbers of papers. As seen from the results, most of the authors had very few papers: among 5,862 authors, 4,808 authors have only one paper, 662 authors have two papers, and 213 authors have three papers.
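A distribution of this kind can be tabulated directly from per-paper author lists. The following sketch uses hypothetical author labels and function names. It also computes the ratio of distinct authors to papers, which appears to be how a figure of about 2.4 follows from 5,862 authors and 2,451 articles; that reading is an assumption.

```python
from collections import Counter


def authorship_distribution(papers):
    """Return (papers_per_author, distribution).

    `distribution` maps a number of papers to the number of authors
    who wrote exactly that many papers.
    """
    papers_per_author = Counter(
        author for authors in papers for author in set(authors)
    )
    distribution = Counter(papers_per_author.values())
    return papers_per_author, distribution


# Hypothetical author lists, one per paper, for illustration only.
papers = [["A", "B"], ["A", "C"], ["A"], ["B", "D", "E"]]
papers_per_author, distribution = authorship_distribution(papers)

# Ratio of distinct authors to papers (assumed reading of "authors per paper").
authors_to_papers = len(papers_per_author) / len(papers)
```

Here three authors have one paper, one author has two, and one author has three, which mirrors the long-tailed shape described above.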

Ranked by the number of articles in which they participated, the most productive author in blockchain research is Choo, Kim-Kwang Raymond from Univ Texas San Antonio, who took part in 14 articles, followed by Marchesi, Michele from Univ of Cagliari, who took part in 13 articles. Bouri, Elie from the Holy Spirit University of Kaslik and Roubaud, David from Montpellier Business School rank third. Miller, Andrew, Shetty, Sachin, and Xu, Xiwei rank fourth, each having taken part in 10 articles related to blockchain.

The distribution of the number of authors with different numbers of articles

Note: No.AU = number of authors; No.AR = number of articles.

Figure 6 displays the collaboration network for authors. The thickness of the linking line between two authors is directly proportional to their collaboration frequency. Figure 6 indicates that the most productive authors cooperate widely with others.

Figure 6 Collaboration network for authors, 2013–2019

4.5 Distribution of Subject Categories

Table 7 presents the top 25 blockchain categories ranked by the number of articles published. As can be seen from Table 7, six of the top 10 categories are related to the Computer Science field, which indicates that blockchain-related research is more abundant in Computer Science than in other research fields. Besides, there are also publications in the category of Business & Economics, with 385 records.

The top 25 blockchain categories ranked by the number of publications

Figure 7 illustrates the betweenness centrality network of papers in the above categories, generated with Citespace and simplified with Minimum Spanning Tree network scaling, which retains the most prominent connections. As Figure 7 shows, the centrality of Computer Science, Engineering Electrical Electronic, Telecommunications, Engineering, and Business & Economics is notable.

Figure 7 Categories involved in blockchain, 2013–2019

4.6 Journal Distribution

Blockchain research is published in 1,206 journals (conferences); the top 25 are displayed in Table 8. Blockchain research papers are concentrated in these top journals (conferences), with a concentration ratio of nearly 20%. The major blockchain research journals include Lecture Notes in Computer Science, IEEE Access, Economics Letters, Future Generation Computer Systems, and Finance Research Letters, with more than 20 articles in each one. Meanwhile, the major blockchain research conferences include IEEE International Conference on Hot Information-Centric Networking, International Conference on Parallel and Distributed Systems Proceedings, International Conference on New Technologies Mobility and Security, and Financial Cryptography and Data Security, with at least 14 articles published in each of these.

The top 25 blockchain publication journals (conferences)

Note: NP = number of papers; No.TC = number of total Times Cited; Italic represents conference.

4.7 Intellectual Structure of Blockchain

Since the notion of co-citation was introduced, a host of researchers have adopted the visualization of co-citation relationships. This work was followed by White and Griffith [38], who identified the intellectual structure of science; researchers then broadened the unit of analysis from articles to authors [39, 40]. There are two major types of co-citation analysis, namely article co-citation analysis and author co-citation analysis, which are commonly adopted to visualize the intellectual structure of a research field. In this study, we explore the intellectual structure of blockchain by using both article co-citation analysis and author co-citation analysis. We apply Citespace to analyze and visualize the intellectual structure [41].
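To make the two units of analysis concrete, the sketch below counts co-citations from the reference lists of citing papers: two references are co-cited once for every paper that cites both. Passing a different `key` switches the unit from articles to authors. This is a simplified illustration rather than Citespace's implementation, and the function name, the `(first_author, year)` reference format, and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists, key=lambda ref: ref):
    """Count how often two units are cited together by the same paper.

    `reference_lists` holds one list of references per citing paper.
    `key` maps a reference to the unit of analysis (the article itself
    by default, or e.g. its first author for author co-citation).
    """
    counts = Counter()
    for refs in reference_lists:
        units = sorted({key(ref) for ref in refs})
        for a, b in combinations(units, 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical references given as (first_author, year) tuples.
reference_lists = [
    [("Author A", 2008), ("Author B", 2014)],
    [("Author A", 2008), ("Author B", 2014), ("Author B", 2016)],
    [("Author B", 2014), ("Author C", 2015)],
]
article_pairs = cocitation_counts(reference_lists)
author_pairs = cocitation_counts(reference_lists, key=lambda ref: ref[0])
```

In the author-level count, the two works by "Author B" collapse into one unit, so a paper citing both contributes only once to each author pair.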

In this study, minimum spanning trees were adopted to present the patterns in the author co-citation network; a visualization of the network is shown in Figure 8. In the visualization of the co-citation network, pivot points are highlighted with a purple ring, and landmark nodes are identified by a large radius. From Figure 8, there are six pivot nodes and landmark nodes: Nakamoto S, Buterin V, Eyal I, Wood G, Swan M, and Christidis K. These authors played crucial roles during the development of blockchain research. Table 9 shows the ranking of author citation counts, as well as their prominent publications.

Figure 8 Network of author co-citation, 2013–2019

The top 15 co-cited authors ranked by citation counts

Nakamoto S, the creator of bitcoin who authored the bitcoin white paper and created and deployed bitcoin’s original reference implementation, is unsurprisingly at the top of the co-citation count ranking, with 1,202 citations in our dataset. Buterin V, a Russian-Canadian programmer and writer primarily known as a co-founder of Ethereum and of Bitcoin Magazine, follows Nakamoto S with 257 citations. Eyal I, an assistant professor at the Technion, ranks third, with the representative article “Majority is not enough: Bitcoin mining is vulnerable”. Wood G, an Ethereum co-founder and free-trust technologist, ranks fourth with 244 citations. The other core authors with high citation counts include Swan M, Christidis K, Bonneau J, Szabo N, Zyskind G, Castro M, and Meiklejohn S, each with more than 150 citations; their typical publications are presented in Table 9.

To further investigate the features of the intellectual structure of blockchain research, we conducted an article co-citation analysis, using cluster mapping of the article co-citation network to visualize the evolution of the blockchain research field. We adopted Citespace to divide the article co-citation network into several clusters of co-cited articles. The visualization of these clusters is displayed in Figure 9.

Figure 9 Clusters of co-cited articles, 2013–2019

As mentioned earlier in the “Data and Methodology” section, the colors of citation rings and links correspond to the different time slices. Therefore, the deeper purple cluster (Cluster #1) is relatively old, and the prominent clusters (Cluster #0 and #2) are more recent. Cluster #0 is the youngest and Cluster #1 is the oldest. Cluster labels are identified based on burst terms extracted from the titles, abstracts, and keywords of bibliographic records [26, 41]. Table 10 presents the six predominant clusters by the number of members in each cluster.

Results show that the research priorities of the clusters kept changing during the observation period. In the earlier period (Cluster #1), bitcoin and the bitcoin network were the major priorities of researchers; some researchers then shifted their focus to cryptocurrency. Notably, more researchers have recently become interested in blockchain technology and the public ledger.

The landmark and pivot nodes in the article co-citation network are shown in Figure 10. The five pivot nodes are Nakamoto S [1], Wood G [44], Kosba A [51], Eyal I [12], and Maurer B [55]. The main landmark nodes are Christidis K [45], Swan M [2], Zyskind G [48], Nakamoto S [1], and Kosba A [51]. Notably, some nodes can be landmark and pivot nodes at the same time.

Figure 10 Landmark and pivot nodes, 2013–2019

Summary of the largest 6 blockchain clusters

Details of the largest cluster (Cluster #0, top10)

Details of the largest cluster (Cluster #1, top10)

Details of the largest cluster (Cluster #2, top10)

As seen from Table 10, Cluster #0 is the largest cluster, containing 36 nodes. To obtain more information about these clusters, we explored the details of the largest clusters. Table 11 illustrates the details of Cluster #0.

We also explored Cluster #1 and #2 in more detail. Table 12 and Table 13 present the details of Cluster #1 and Cluster #2 respectively. It is notable that the most active citation in Cluster #1 is “Bitcoin: A peer-to-peer electronic cash system”, and the most active citation in Cluster #2 is “Bitcoin: Economics, technology, and governance”. The core members of Cluster #1 and Cluster #2 represent milestones of blockchain research related to the bitcoin system and cryptocurrency.

Table 14 lists the 10 most cited blockchain research articles indexed by the Web of Science, ranked by the total number of citations during the observation period. Among these articles, “Blockchains and smart contracts for the internet of things” by Christidis is the most cited paper, with 266 citations. The paper also has the highest average number of citations per year.

The top 10 cited blockchain articles

4.8 Keywords Co-Citation Analysis

According to Callon, et al. [77], co-word analysis is a useful way of examining the evolution of science. In our study, among 2,451 articles related to blockchain, we obtained 4,834 keywords: 594 keywords appeared at least 3 times, 315 keywords appeared at least 5 times, and 130 keywords appeared at least 10 times. Table 15 presents the most important keywords according to frequency. As seen, ‘blockchain’ ranks first with an occurrence frequency of 1,105, followed by ‘bitcoin’ with 606. The other high-frequency keywords include ‘cryptocurrency’, ‘smart contract’, and ‘iot’ (Internet of Things).

The top 25 keywords ranked by frequency

To further explore the relations among the major keywords in blockchain research papers, we used the top 315 keywords, each with a frequency of at least 5, for co-occurrence network analysis. The keyword co-occurrence network is illustrated in Figure 11. In a co-occurrence network, the size of a node represents how frequently the keyword co-occurs with other keywords. The higher the co-occurrence frequency of two keywords, the closer the relationship between them.
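The frequency threshold and the co-occurrence links described above can be sketched as follows. The sketch first counts in how many papers each keyword appears, keeps keywords at or above a minimum frequency, and then counts co-occurrences among the kept keywords. The function name, the lowercase normalization, and the sample keyword lists are assumptions for illustration, and the threshold of 2 is chosen only to suit the tiny sample.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(keyword_lists, min_frequency):
    """Return (frequency, links) for keywords appearing in at least
    `min_frequency` papers. `links` counts papers shared by each pair."""
    normalized = [{kw.strip().lower() for kw in kws} for kws in keyword_lists]
    frequency = Counter(kw for kws in normalized for kw in kws)
    kept = {kw for kw, n in frequency.items() if n >= min_frequency}
    links = Counter()
    for kws in normalized:
        for a, b in combinations(sorted(kws & kept), 2):
            links[(a, b)] += 1
    return frequency, links


# Hypothetical author keywords, one list per paper.
keyword_lists = [
    ["Blockchain", "smart contract", "IoT"],
    ["blockchain", "bitcoin"],
    ["Bitcoin", "cryptocurrency"],
    ["blockchain", "Smart Contract"],
]
frequency, links = keyword_cooccurrence(keyword_lists, min_frequency=2)
```

With this sample, 'iot' and 'cryptocurrency' fall below the threshold and are excluded from the links, while 'blockchain' and 'smart contract' are linked with weight 2.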

Figure 11 The keywords co-occurrence network, 2013–2019

As Figure 11 shows, the nodes for blockchain and bitcoin are the largest among all keywords. This means that, in general, blockchain and bitcoin are more likely to co-occur with other keywords. Besides, blockchain is closer to smart contract, iot, Ethereum, security, internet, and privacy, whereas bitcoin is closer to digital currency and cryptocurrency.

Figure 12 displays the time-zone view of co-cited keywords, which arranges nodes from left to right according to the year in which they were published. Nodes on the left were published in the last five years, and nodes on the right were published in the most recent two years. Correspondingly, some pivot keyword nodes are listed in the boxes. We hope to show the evolution of blockchain in general and the changes of focus in blockchain study.

Figure 12 The time-zone view of co-cited keywords, 2013–2019

The results suggest that, in 2013, when blockchain research began to surface, bitcoin dominated the blockchain research field. This is reasonable, as bitcoin is the first cryptocurrency based on blockchain technology. The influential essays include “Quantitative analysis of the full bitcoin transaction graph” [54]; “A fistful of bitcoins: Characterizing payments among men with no names” [50]; and “Bitcoin meets Google Trends and Wikipedia: Quantifying the relationship between phenomena of the internet era” [69]. Afterward, as various altcoins appeared, cryptocurrency and digital currency were widely discussed in blockchain-related research. The high-citation article is “Zerocash: Decentralized anonymous payments from bitcoin” [74], which concerns privacy, a prominent characteristic of cryptocurrency. In 2015, blockchain and smart contracts became a hotspot; the core publications include “Blockchain: A blueprint for a new economy” [2] and “Decentralizing privacy: Using blockchain to protect personal data” [48]. At the same time, some researchers also focused on the volatility and mining of cryptocurrency. In 2016, a growing number of researchers focused on the internet of things. The most popular article is “Blockchains and smart contracts for the internet of things” [45]. In 2017, distributed ledger and blockchain technology became a research focus. From 2018 onward, research focusing on the challenges and inefficiency of blockchain appears.

4.9 Funding Agencies of Blockchain-Related Research

Based on the funding information of all 2,451 articles analyzed in this study, the National Natural Science Foundation of China (NSFC) has supported the largest number of publications with 231 papers, followed by the National Key Research and Development Program of China, which supported 88 papers. Comparatively, the National Science Foundation of the USA has supported only 46 papers. It is remarkable that the “Ministry of Science and Technology Taiwan” supported 22 papers, which is more than the European Union. Table 16 lists the top 20 funding agencies for blockchain research ranked by the number of supported papers. The results indicate that China is one of the major investing countries in blockchain research, with the largest number of supported articles.

The top 20 funding agencies of blockchain-related research

5 Conclusions and Implications

5.1 Conclusions

This research comprehensively investigates blockchain-related publications based on the Web of Science Core Collection and provides a quick overview of blockchain research. In this study, a coherent comprehensive bibliometric evaluation framework is adopted to investigate the hot and promising blockchain domain. We outline the core development landscape of blockchain, including the distribution of publications over time, by authors, journals, categories, institutions, countries (territories), intellectual structure, and research trends in the blockchain academic community. Combining the results of statistical analysis and co-cited articles, authors, and keywords, we formulate the answers to the following research questions:

RQ1 What is the distribution pattern of blockchain publications and citations over recent years?

The number of published blockchain papers has increased significantly since 2013, when the first blockchain paper was published. The number of articles peaked in 2018, when 1,148 articles were published, and the number of publications is likely to keep growing. As for the cumulative number of citations, there were only 272 citations in 2013. By 2018 this number had grown to more than 10,000, which implies the widespread influence of, and attention attracted by, blockchain study in recent years.

RQ2 Which are the main international contributing countries (regions) and institutions in blockchain research, as well as collaboration networks among them?

A total of 97 countries (regions) participated in blockchain research during the observation period. The USA and China play the leading roles among all countries (regions), with 532 (20.94%) and 489 (19.24%) articles respectively, followed by the UK, Germany, Italy, and Australia. In terms of citations, USA-authored papers were cited by 1,810 papers with 3,709 citations, accounting for 36.57% of total citations. Articles from the USA also have a very high average number of citations per paper, 6.97. Although the number of articles from China is close to that of the USA, the average number of citations per paper is lower, at 2.78. The results indicate that the USA is the most influential country in the field of blockchain.

A total of 2,190 institutions participated in blockchain-related research. Among them, the Chinese Academy of Sciences has the highest number of publications with 43 papers, followed by the University of London, Beijing University of Posts and Telecommunications, University of California System, Commonwealth Scientific Industrial Research Organization (CSIRO), Beihang University, University of Texas System, and ETH Zurich. With respect to the total and average numbers of Times Cited, Cornell University is cited the most, with 499 citations and an average number of Times Cited of 20.79, followed by the Massachusetts Institute of Technology, University of California System, and ETH Zurich. The number of publications from institutions in China is large, whereas few of these papers have a high average Times Cited.

In terms of collaboration networks among different institutions, we found that the Chinese Academy of Sciences, Cornell University, Commonwealth Scientific Industrial Research Organization (CSIRO), University of Sydney, and ETH Zurich cooperated widely with other institutions.

RQ3 What are the characteristics of the authorship distribution?

A total of 5,862 authors contributed to blockchain publications, and the average number of authors per paper is 2.4. Among these 5,862 authors, 4,808 have only one paper, 662 have two papers, and 213 have three papers. By number of papers, the most productive author in the field is Choo, Kim-Kwang Raymond of the University of Texas at San Antonio, who participated in 14 articles, followed by Marchesi M, Bouri E, David R, Miller A, Shetty S, and Xu X.
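To make the skew of this distribution concrete, the short sketch below converts the reported counts into shares of all authors. It uses only the figures given above. The group of authors with four or more papers is derived by subtraction and is not reported directly in the text.

```python
total_authors = 5862
# Number of authors by number of papers, as reported in the text.
authors_by_paper_count = {1: 4808, 2: 662, 3: 213}

# Authors not covered by the three reported groups (four or more papers).
four_or_more = total_authors - sum(authors_by_paper_count.values())

shares = {k: v / total_authors for k, v in authors_by_paper_count.items()}
shares["4+"] = four_or_more / total_authors

for group, share in shares.items():
    print(f"{group} paper(s): {share:.1%} of authors")
```

Under these figures, roughly four out of five authors appear on a single paper, which is typical of a young and fast-growing field.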

RQ4 What are the core blockchain subjects and journals based on the number of publications?

Blockchain-related research is most abundant in Computer Science. Other major fields include Engineering, Telecommunications, and Business & Economics.

RQ5 What are the major journals or conferences for blockchain-related research?

Blockchain research is published in 1,206 journals and conference proceedings. The major journals include Lecture Notes in Computer Science, IEEE Access, Economics Letters, Future Generation Computer Systems, and Finance Research Letters. The major conferences include the IEEE International Conference on Hot Information-Centric Networking, the International Conference on Parallel and Distributed Systems, the International Conference on New Technologies, Mobility and Security, and Financial Cryptography and Data Security.

RQ6 What are the most influential papers in blockchain research based on the number of citations?

Ranked by total citations during the observation period, “Blockchains and smart contracts for the internet of things” by Christidis and Devetsikiotis [45] is the most cited paper with 266 citations; it also has the highest average number of citations per year. It is followed by “Decentralizing privacy: Using blockchain to protect personal data” [48], with 169 citations and an average of 33.80 citations per year.

According to the number of times co-cited, the top five influential publications are: Bitcoin: A peer-to-peer electronic cash system [1]; A next-generation smart contract and decentralized application platform [42]; Majority is not enough: Bitcoin mining is vulnerable [43]; Ethereum: A secure decentralised generalised transaction ledger [44]; and Blockchain: Blueprint for a new economy [2].
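Two publications are co-cited when they appear together in the reference list of a third paper. The sketch below illustrates how such counts can be tallied. It is a minimal illustration and not the exact procedure of the bibliometric tool used in this study. The citing papers and reference keys in `toy_reference_lists` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of references is cited together."""
    pair_counts = Counter()
    for refs in reference_lists:
        # Deduplicate and sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(refs)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical citing papers and their reference lists.
toy_reference_lists = {
    "paper_A": ["Nakamoto2008", "Buterin2014", "Wood2014"],
    "paper_B": ["Nakamoto2008", "Buterin2014"],
    "paper_C": ["Nakamoto2008", "Swan2015"],
}

counts = cocitation_counts(toy_reference_lists.values())
for pair, n in counts.most_common():
    print(pair, n)
```

In this toy data, the pair (`Buterin2014`, `Nakamoto2008`) is co-cited twice and every other pair once.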

RQ7 Who are the most influential authors in blockchain research according to the author co-citation network?

Some authors played a crucial role in the development of blockchain research. Nakamoto S, the creator of Bitcoin and author of the Bitcoin white paper, created and deployed Bitcoin’s original reference implementation; it is therefore not surprising that Nakamoto tops the co-citation ranking, with 1,202 citations in our dataset. Buterin V, a Russian-Canadian programmer and writer known primarily as a co-founder of Ethereum and of Bitcoin Magazine, follows with 257 citations. Other core authors with high citations include Eyal I, Wood G, Swan M, Christidis K, Bonneau J, Szabo N, Zyskind G, Castro M, and Meiklejohn S.

According to the clusters of co-cited articles, research priorities in blockchain-related research changed over the observation period. Bitcoin and the Bitcoin network were initially the main priorities, after which some researchers shifted their focus to cryptocurrency.

RQ8 What are the research trends of blockchain?

Research priorities in blockchain-related research evolved during the observation period. In 2013, when research on blockchain first appeared, Bitcoin dominated the field. Only one year later, as various altcoins began to appear, cryptocurrency and digital currency were widely discussed. In 2015, blockchain and smart contracts became a hotspot and remained so until 2016, when a growing body of research began to focus on the internet of things. In 2017, distributed ledger and blockchain technology became the focal point. From 2018 onward, research has focused on the challenges and inefficiency of blockchain.

RQ9 What are the most supportive funding agencies of blockchain research?

The most supportive funding agency of blockchain research is the National Natural Science Foundation of China (NSFC), which supported the publication of 231 papers. This indicates that China is one of the major countries investing in blockchain research, with the largest number of supported articles.

Given the potential of blockchain, governments, enterprises, and researchers are all paying increasing attention to this field. The application of blockchain in various industries, the supervision of cryptocurrencies, and the newly emerging central bank digital currencies and Libra are becoming central issues for society as a whole.

In this study, we conducted a comprehensive exploration of blockchain-related research through bibliometric analysis. Our results provide guidance and implications for academic research and practice. First, the findings present a holistic view of research in the blockchain domain, which benefits researchers and practitioners who want a quick, visualized overview of blockchain research. Second, our findings on the evolution and trends of blockchain research help researchers better understand the development and status of the field, which is useful in choosing valuable research topics. The distributed ledger, the inefficiency and challenges of blockchain technology, the supervision of cryptocurrencies, and central bank digital currency are emerging research topics that deserve more attention from the academic community.

5.2 Limitations and Future Work

As with any research, the design employed has limitations that open avenues for future research. First, this study is based on 2,451 articles retrieved from the Web of Science Core Collection. Although the Web of Science Core Collection is a powerful database for bibliometric analysis, relying on a single data source is a limitation. Future research can address it by merging publications from other sources, for instance Scopus and CNKI, as well as patent databases and investment data on blockchain, which could help validate the conclusions. Second, we mainly adopted the frequency indicator to outline the state of the art of blockchain research. Although frequency is the most commonly used indicator in bibliometric analysis, and we also used the H-index and citations to improve our analysis, some other valuable indicators, such as sigma and betweenness centrality, were not considered; combining those indicators would be beneficial in future research. Finally, in co-citation analysis a paper must have been published for a certain period before it is cited by enough authors [26], so the most recently published papers may not be included in the co-citation analysis. This is an intrinsic drawback of bibliometric methods.

Supported by the National Natural Science Foundation of China (71872171) and the Open Project of Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences

[1] Nakamoto S. Bitcoin: A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf, 2008.

[2] Swan M. Blockchain: Blueprint for a new economy. O’Reilly, 2015.

[3] Weber I, Xu X, Riveret R, et al. Untrusted business process monitoring and execution using blockchain. Business Process Management, 2016: 329–347. doi:10.1007/978-3-319-45348-4_19

[4] Mettler M. Blockchain technology in healthcare: The revolution starts here. International Conference on E-health Networking, Applications and Services, 2016: 1–3. doi:10.1109/HealthCom.2016.7749510

[5] Paech P. The governance of blockchain financial networks. Modern Law Review, 2016, 80(6): 1073–1110. doi:10.1111/1468-2230.12303

[6] Yue X, Wang H, Jin D, et al. Healthcare data gateways: Found healthcare intelligence on blockchain with novel privacy risk control. Journal of Medical Systems, 2016, 40(10): 1–8. doi:10.1007/s10916-016-0574-6

[7] Maupin J. Mapping the global legal landscape of blockchain and other distributed ledger technologies. https://ssrn.com/abstract=2930077, 2017. doi:10.2139/ssrn.2930077

[8] Sachin K, Angappa G, Himanshu A. Understanding the blockchain technology adoption in supply chains. International Journal of Production Research, 2018: 1–25.

[9] Tasca P, Tessone C J. A taxonomy of blockchain technologies: Principles of identification and classification. Ledger, 2019, 4: 140. doi:10.5195/ledger.2019.140

[10] Catalini C, Gans J S. Some simple economics of the blockchain. Communications of the ACM, 2016, 63(7): 80–90. doi:10.3386/w22952

[11] Csoka P, Herings P J. Decentralized clearing in financial networks. Management Science, 2017, 64(5): 4681–4699. doi:10.1287/mnsc.2017.2847

[12] Eyal I. Blockchain technology: Transforming libertarian cryptocurrency dreams to finance and banking realities. Computer, 2017, 50(9): 38–49. doi:10.1109/MC.2017.3571042

[13] Huckle S, Bhattacharya R, White M, et al. Internet of things, blockchain and shared economy applications. Procedia Computer Science, 2016: 461–466. doi:10.1016/j.procs.2016.09.074

[14] Bahga A, Madisetti V K. Blockchain platform for industrial internet of things. Journal of Software Engineering and Applications, 2016, 9(10): 533–546. doi:10.4236/jsea.2016.910036

[15] Dorri A, Kanhere S S, Jurdak R. Towards an optimized blockchain for IoT. IEEE/ACM Second International Conference on Internet-of-Things Design and Implementation (IoTDI), 2017: 173–178. doi:10.1145/3054977.3055003

[16] Aitzhan N Z, Svetinovic D. Security and privacy in decentralized energy trading through multi-signatures, blockchain and anonymous messaging streams. IEEE Transactions on Dependable and Secure Computing, 2018, 15(5): 840–852. doi:10.1109/TDSC.2016.2616861

[17] Mengelkamp E, Notheisen B, Beer C, et al. A blockchain-based smart grid: Towards sustainable local energy markets. Computer Science - Research and Development, 2018: 207–214. doi:10.1007/s00450-017-0360-9

[18] Korpela K, Hallikas J, Dahlberg T. Digital supply chain transformation toward blockchain integration. Hawaii International Conference on System Sciences, 2017: 1–10. doi:10.24251/HICSS.2017.506

[19] Toyoda K, Mathiopoulos P T, Sasase I, et al. A novel blockchain-based product ownership management system (POMS) for anti-counterfeits in the post supply chain. IEEE Access, 2017: 17465–17477. doi:10.1109/ACCESS.2017.2720760

[20] Yuan Y, Wang F Y. Blockchain: The state of the art and future trends. Acta Automatica Sinica, 2016, 42(4): 481–494.

[21] Yli-Huumo J, Ko D, Choi S, et al. Where is current research on blockchain technology? - A systematic review. PLoS One, 2016, 11(10): e0163477. doi:10.1371/journal.pone.0163477

[22] Zeng S, Ni X. A bibliometric analysis of blockchain research. IEEE Intelligent Vehicles Symposium, 2018: 102–107. doi:10.1109/IVS.2018.8500606

[23] Miau S, Yang J. Bibliometrics-based evaluation of the blockchain research trend: 2008–2017. Technology Analysis & Strategic Management, 2018, 30(9): 1029–1045. doi:10.1080/09537325.2018.1434138

[24] Hawlitschek F, Notheisen B, Teubner T. The limits of trust-free systems: A literature review on blockchain technology and trust in the sharing economy. Electronic Commerce Research and Applications, 2018: 50–63. doi:10.1016/j.elerap.2018.03.005

[25] Bonilla C A, Merigo J M, Torres-Abad C. Economics in Latin America: A bibliometric analysis. Scientometrics, 2015, 105(2): 1239–1252. doi:10.1007/s11192-015-1747-7

[26] Li X, Qiao H, Wang S. Exploring evolution and emerging trends in business model study: A co-citation analysis. Scientometrics, 2017, 111(2): 869–887. doi:10.1007/s11192-017-2266-5

[27] Gaviria-Marin M, Merigo J M, Popa S. Twenty years of the journal of knowledge management: A bibliometric analysis. Journal of Knowledge Management, 2018, 22(8): 1655–1687. doi:10.1108/JKM-10-2017-0497

[28] Faming W, Meijuan Z. Bibliometric analysis on the research hotspots of blockchain in China. Journal of Intelligence, 2017, 36(12): 69–74.

[29] Mu-Nan L. Analyzing intellectual structure of related topics to blockchain and bitcoin: From co-citation clustering and bibliographic coupling perspectives. Acta Automatica Sinica, 2017, 43(9): 1501–1519.

[30] Dabbagh M, Sookhak M, Safa N S. The evolution of blockchain: A bibliometric study. IEEE Access, 2019: 19212–19221. doi:10.1109/ACCESS.2019.2895646

[31] Chen C, Ibekwesanjuan F, Hou J. The structure and dynamics of co-citation clusters: A multiple perspective co-citation analysis. Journal of the Association for Information Science and Technology, 2010, 61(7): 1386–1409.

[32] Chen C. Visualizing scientific paradigms: An introduction. Journal of the Association for Information Science and Technology, 2003, 54(5): 392–393. doi:10.1002/asi.10224

[33] Chen C. CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the Association for Information Science and Technology, 2006, 57(3): 359–377. doi:10.1002/asi.20317

[34] Chen C, Morris S A. Visualizing evolving networks: Minimum spanning trees versus pathfinder networks. IEEE Symposium on Information Visualization, 2003: 67–74.

[35] Guan J, Ma N. China’s emerging presence in nanoscience and nanotechnology: A comparative bibliometric study of several nanoscience giants. Research Policy, 2007, 36(6): 880–886. doi:10.1016/j.respol.2007.02.004

[36] Wagner C S, Leydesdorff L. Network structure, self-organization, and the growth of international collaboration in science. Research Policy, 2005, 34(10): 1608–1618. doi:10.1016/j.respol.2005.08.002

[37] He Z, Geng X, Campbellhunt C. Research collaboration and research output: A longitudinal study of 65 biomedical scientists in a New Zealand University. Research Policy, 2009, 38(2): 306–317. doi:10.1016/j.respol.2008.11.011

[38] White H D, Griffith B C. Author cocitation: A literature measure of intellectual structure. Journal of the American Society for Information Science, 1981, 32(3): 163–171. doi:10.1002/asi.4630320302

[39] Chen C. Searching for intellectual turning points: Progressive knowledge domain visualization. Proceedings of the National Academy of Sciences of the United States of America, 2004: 5303–5310. doi:10.1073/pnas.0307513100

[40] Borner K, Chen C, Boyack K W. Visualizing knowledge domains. Annual Review of Information Science and Technology, 2005, 37(1): 179–255. doi:10.1002/aris.1440370106

[41] Chen C. CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the Association for Information Science and Technology, 2006, 57(3): 359–377. doi:10.1002/asi.20317

[42] Buterin V. A next-generation smart contract and decentralized application platform. https://ethereum.org/en/whitepaper/, 2014.

[43] Eyal I, Sirer E G. Majority is not enough: Bitcoin mining is vulnerable. International Conference on Financial Cryptography and Data Security, 2014: 436–454. doi:10.1145/3212998

[44] Wood G. Ethereum: A secure decentralised generalised transaction ledger. Ethereum Project Yellow Paper, 2014, 151: 1–32.

[45] Christidis K, Devetsikiotis M. Blockchains and smart contracts for the internet of things. IEEE Access, 2016: 2292–2303. doi:10.1109/ACCESS.2016.2566339

[46] Bonneau J, Miller A, Clark J, et al. SoK: Research perspectives and challenges for bitcoin and cryptocurrencies. IEEE Symposium on Security and Privacy, 2015: 104–121. doi:10.1109/SP.2015.14

[47] Szabo N. Formalizing and securing relationships on public networks. First Monday, 1997. doi:10.5210/fm.v2i9.548

[48] Zyskind G, Nathan O, Pentland A. Decentralizing privacy: Using blockchain to protect personal data. IEEE Security and Privacy Workshops, 2015: 180–184. doi:10.1109/SPW.2015.27

[49] Castro M, Liskov B. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems (TOCS), 2002, 20(4): 398–461. doi:10.1145/571637.571640

[50] Meiklejohn S, Pomarole M, Jordan G, et al. A fistful of bitcoins: Characterizing payments among men with no names. Internet Measurement Conference, 2013: 127–140. doi:10.1145/2504730.2504747

[51] Kosba A, Miller A, Shi E, et al. Hawk: The blockchain model of cryptography and privacy-preserving smart contracts. IEEE Symposium on Security and Privacy, 2016: 839–858. doi:10.1109/SP.2016.55

[52] Reid F, Harrigan M. An analysis of anonymity in the bitcoin system. Security and Privacy in Social Networks, 2013: 197–223. doi:10.1109/PASSAT/SocialCom.2011.79

[53] Luu L, Narayanan V, Zheng C, et al. A secure sharding protocol for open blockchains. ACM SIGSAC Conference on Computer and Communications Security, 2016: 17–30. doi:10.1145/2976749.2978389

[54] Ron D, Shamir A. Quantitative analysis of the full bitcoin transaction graph. International Conference on Financial Cryptography and Data Security, 2013: 6–24. doi:10.1007/978-3-642-39884-1_2

[55] Maurer B, Nelms T C, Swartz L. When perhaps the real problem is money itself? The practical materiality of bitcoin. Social Semiotics, 2013, 23(2): 261–277. doi:10.1080/10350330.2013.777594

[56] Tschorsch F, Scheuermann B. Bitcoin and beyond: A technical survey on decentralized digital currencies. IEEE Communications Surveys & Tutorials, 2016, 18(3): 2084–2123. doi:10.1109/COMST.2016.2535718

[57] Radziwill N. Blockchain revolution: How the technology behind bitcoin is changing money, business, and the world. The Quality Management Journal, 2018, 25(1): 64–65. doi:10.1080/10686967.2018.1404373

[58] Azaria A, Ekblaw A, Vieira T, et al. Medrec: Using blockchain for medical data access and permission management. International Conference on Open and Big Data (OBD), 2016: 25–30. doi:10.1109/OBD.2016.11

[59] Narayanan A, Bonneau J, Felten E, et al. Bitcoin and cryptocurrency technologies: A comprehensive introduction. Princeton University Press, 2016.

[60] Miers I, Garman C, Green M, et al. Zerocoin: Anonymous distributed e-cash from bitcoin. IEEE Symposium on Security and Privacy, 2013: 397–411. doi:10.1109/SP.2013.34

[61] Ober M, Katzenbeisser S, Hamacher K. Structure and anonymity of the bitcoin transaction graph. Future Internet, 2013, 5(2): 237–250. doi:10.3390/fi5020237

[62] Moore T, Christin N. Beware the middleman: Empirical analysis of bitcoin-exchange risk. International Conference on Financial Cryptography and Data Security, 2013: 25–33. doi:10.1007/978-3-642-39884-1_3

[63] Androulaki E, Karame G O, Roeschlin M, et al. Evaluating user privacy in bitcoin. International Conference on Financial Cryptography and Data Security, 2013: 34–51. doi:10.1007/978-3-642-39884-1_4

[64] Barber S, Boyen X, Shi E, et al. Bitter to better - How to make bitcoin a better currency. International Conference on Financial Cryptography and Data Security, 2012: 399–414. doi:10.1007/978-3-642-32946-3_29

[65] Böhme R, Christin N, Edelman B, et al. Bitcoin: Economics, technology, and governance. Journal of Economic Perspectives, 2015, 29(2): 213–238. doi:10.1257/jep.29.2.213

[66] Cheah E T, Fry J. Speculative bubbles in bitcoin markets? An empirical investigation into the fundamental value of bitcoin. Economics Letters, 2015, 130: 32–36. doi:10.1016/j.econlet.2015.02.029

[67] Urquhart A. The inefficiency of bitcoin. Economics Letters, 2016, 148: 80–82. doi:10.1016/j.econlet.2016.09.019

[68] Dyhrberg A H. Bitcoin, gold and the dollar - A GARCH volatility analysis. Finance Research Letters, 2016, 16: 85–92. doi:10.1016/j.frl.2015.10.008

[69] Ciaian P, Rajcaniova M, Kancs D A. The economics of bitcoin price formation. Applied Economics, 2016, 48(19): 1799–1815. doi:10.1080/00036846.2015.1109038

[70] Kristoufek L. BitCoin meets Google Trends and Wikipedia: Quantifying the relationship between phenomena of the internet era. Scientific Reports, 2013, 3(1): 3415. doi:10.1038/srep03415

[71] Dwyer G P. The economics of bitcoin and similar private digital currencies. Journal of Financial Stability, 2015, 17: 81–91. doi:10.1016/j.jfs.2014.11.006

[72] Nadarajah S, Chu J. On the inefficiency of bitcoin. Economics Letters, 2017, 150: 6–9. doi:10.1016/j.econlet.2016.10.033

[73] Katsiampa P. Volatility estimation for bitcoin: A comparison of GARCH models. Economics Letters, 2017, 158: 3–6. doi:10.1016/j.econlet.2017.06.023

[74] Bouri E, Gupta R, Tiwari A K, et al. Does bitcoin hedge global uncertainty? Evidence from wavelet-based quantile-in-quantile regressions. Finance Research Letters, 2017, 23: 87–95. doi:10.1016/j.frl.2017.02.009

[75] Sasson E B, Chiesa A, Garman C, et al. Zerocash: Decentralized anonymous payments from bitcoin. IEEE Symposium on Security and Privacy, 2014: 459–474. doi:10.1109/SP.2014.36

[76] Garay J, Kiayias A, Leonardos N. The bitcoin backbone protocol: Analysis and applications. International Conference on the Theory and Applications of Cryptographic Techniques, 2015: 281–310. doi:10.1007/978-3-662-46803-6_10

[77] Callon M, Courtial J P, Laville F. Co-word analysis as a tool for describing the network of interactions between basic and technological research: The case of polymer chemistry. Scientometrics, 1991, 22(1): 155–205. doi:10.1007/BF02019280

© 2021 Walter de Gruyter GmbH, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.


Journal of Systems Science and Information


Ekta Mourya

XRP stays above $0.50 while Ripple backs research on new frontiers in blockchain technology

  • Ripple highlighted a study from its University Blockchain Research Initiative on the latest developments in blockchain. 
  • The research outlines challenges faced by blockchains like XRP Ledger, against fast-evolving quantum computing. 
  • XRP sustained its gains, holds steady above key support at $0.50 on Monday. 

XRP price holds steady above the key $0.50 support level and edges higher on Monday, trading at $0.5130, up 0.70% on the day at the time of writing. Ripple (XRP) published research from its University Blockchain Research Initiative (UBRI) on the intersection of blockchain and quantum computing. The May 17 report on Ripple’s website surveys the impact of advances in quantum computing on blockchains like the XRP Ledger, which is key to future technical developments in the firm’s network. 

XRP holders await a ruling in the Securities and Exchange Commission (SEC) vs. Ripple lawsuit on two major issues: the fine to be imposed on the payment remittance firm for allegedly selling securities (XRP) to institutional investors and an injunction on Ripple’s On-Demand Liquidity (ODL) platform, used by clients to manage liquidity. 

Daily Digest Market Movers: Ripple supports technical development-focused research

  • Ripple funds a University Blockchain Research Initiative to aid research and development in new frontiers of blockchain technology. 
  • The payment remittance firm published a report on May 17, highlighting a leading cause for concern among scholars in the sector, quantum computing and its impact on blockchains like the XRP Ledger. 
  • A Deloitte report titled “Quantum computers and the Bitcoin blockchain” shares the concern and reads:
  • “…the security of cryptocurrencies is threatened by future quantum computers, which are expected to break some of the currently used cryptography algorithms for which there are no clear replacements.”
  • In other words, advancements in quantum computing pose a threat to Bitcoin and other encrypted blockchains that rely on cryptography for security. 
  • Ripple dives into these issues in its Ripple Insights series. 
  • The SEC vs. Ripple lawsuit is another key catalyst influencing XRP price on Monday. 
  • XRP holders are awaiting a ruling on two major issues: a fine (the SEC proposed a $2 billion fine on Ripple for alleged violations of securities laws and sale of XRP to institutions, Ripple countered with a $10 million offer), and a likely injunction (order to stop) the sale of XRP to institutional investors through its On-Demand Liquidity (ODL) platform. 
  • Data from Cryptoeq.io shows that the sentiment among XRP holders is positive on Monday. 

Technical analysis: XRP poised for correction on longer timeframe

Ripple is likely to correct further, as seen in the 1-week XRP/USDT price chart. While XRP held above key support at $0.50, the long-timeframe chart shows bearish signs from indicators. 

The Moving Average Convergence Divergence (MACD), a momentum indicator, flashes red histogram bars below the neutral line, corresponding to the decline in XRP price between April 8 and May 20. A recent crossover of MACD below the signal line supports a bearish thesis for XRP. 
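For readers unfamiliar with the indicator, the sketch below shows one common way to compute the MACD line, the signal line, and the histogram from closing prices. The 12/26/9 spans are the conventional defaults and an assumption here, since the article does not say which settings its chart uses. The closing prices are synthetic and are not XRP data. Seeding each exponential moving average with the first value is a simplification that charting platforms may handle differently.

```python
def ema(values, span):
    """Exponential moving average, seeded with the first value."""
    alpha = 2 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) for a list of closes."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


# Synthetic, steadily falling closes (illustrative only).
closes = [0.64 - 0.005 * i for i in range(40)]
macd_line, signal_line, histogram = macd(closes)

# In a steady decline the MACD line sits below the signal line,
# so the histogram bars are negative (drawn in red on many charts).
print(macd_line[-1] < signal_line[-1], histogram[-1] < 0)
```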

XRP price could suffer a nearly 9% drop to the April 19 low at $0.4665. Once the altcoin sweeps liquidity at this level, it could reattempt a recovery, depending on signs from technical indicators this week. 

XRP/USDT 1-week chart 

On the 1-day timeframe, XRP could attempt to test resistance at $0.5310, the 50% Fibonacci retracement of the decline from the April 9 top of $0.6431 to the April 13 low at $0.4188. A failure to break above this resistance could see XRP testing support at the key level of $0.50. XRP price has consistently closed above support at $0.50 throughout last week. 
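The 50% retracement level quoted above is the midpoint between the swing high and the swing low. The sketch below reproduces it from the two prices in the article. The function name `retracement_level` is illustrative, and measuring the retracement upward from the low is the usual convention for a decline.

```python
def retracement_level(swing_high: float, swing_low: float, ratio: float) -> float:
    """Price that retraces `ratio` of a decline, measured up from the low."""
    return swing_low + ratio * (swing_high - swing_low)


# April 9 top and April 13 low, as quoted in the article.
level_50 = retracement_level(0.6431, 0.4188, 0.5)
print(f"50% retracement: ${level_50:.3f}")
```

The result is about $0.531, in line with the $0.5310 resistance cited in the text.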

The MACD supports a short-term bullish thesis on the 1-day chart, as seen below. 

XRP/USDT 1-day chart

Looking up, clearing the resistance at $0.5310 paves the way for XRP to test the May 7 high at $0.5443 (nearly 6% gains from current levels) and target the April 4 low at $0.5623. 
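The percentage moves quoted in this analysis (a nearly 9% drop to $0.4665 and nearly 6% gains to $0.5443) can be checked against the $0.5130 price at the time of writing. The helper `pct_change` below is an illustrative name, not part of any charting library.

```python
def pct_change(current: float, target: float) -> float:
    """Percentage move from the current price to a target price."""
    return (target - current) / current * 100


current_price = 0.5130  # price at the time of writing, per the article

drop_to_april_19_low = pct_change(current_price, 0.4665)
gain_to_may_7_high = pct_change(current_price, 0.5443)

print(f"To $0.4665: {drop_to_april_19_low:.1f}%")
print(f"To $0.5443: {gain_to_may_7_high:.1f}%")
```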

SEC vs Ripple lawsuit FAQs

Is XRP a security?

It depends on the transaction, according to a court ruling released on July 14: For institutional investors or over-the-counter sales, XRP is a security. For retail investors who bought the token via programmatic sales on exchanges, on-demand liquidity services and other platforms, XRP is not a security.

How does the ruling affect Ripple in its legal battle against the SEC?

The United States Securities & Exchange Commission (SEC) accused Ripple and its executives of raising more than $1.3 billion through an unregistered asset offering of the XRP token. While the judge ruled that programmatic sales aren’t considered securities, sales of XRP tokens to institutional investors are indeed investment contracts. In this last case, Ripple did breach US securities law and will need to keep litigating over the roughly $729 million it received under written contracts.

What are the implications of the ruling for the overall crypto industry?

The ruling offers a partial win for both Ripple and the SEC, depending on what one looks at. Ripple gets a big win over the fact that programmatic sales aren’t considered securities, and this could bode well for the broader crypto sector as most of the assets eyed by the SEC’s crackdown are handled by decentralized entities that sold their tokens mostly to retail investors via exchange platforms, experts say. Still, the ruling doesn’t help much to answer the key question of what makes a digital asset a security, so it isn’t clear yet if this lawsuit will set precedent for other open cases that affect dozens of digital assets. Topics such as which is the right degree of decentralization to avoid the “security” label or where to draw the line between institutional and programmatic sales are likely to persist.

Is the SEC stance toward crypto assets likely to change after the ruling?

The SEC has stepped up its enforcement actions toward the blockchain and digital assets industry, filing charges against platforms such as Coinbase or Binance for allegedly violating the US Securities law. The SEC claims that the majority of crypto assets are securities and thus subject to strict regulation. While defendants can use parts of Ripple’s ruling in their favor, the SEC can also find reasons in it to keep its current strategy of regulation by enforcement.

Can the court ruling be overturned?

The court decision is a partial summary judgment. The ruling can be appealed once a final judgment is issued or if the judge allows it before then. The case is in a pretrial phase, in which both Ripple and the SEC still have the chance to settle.


Recommended content

Recommended content, editors’ picks, cardano price forecast: ada's recent breakout could trigger 15% rally.

Cardano price rally on Monday suggests that a larger bullish move might be in the works. On-chain data shows a comeback in investor interest and entry of institutional investors.

Ondo price poised to break ATH of $1.04

Ondo price is slowly testing its major support level at $0.63-$0.61. Aggressive buyers can consider accumulating around the immediate support zone, which extends from $0.83 to $0.80.

Shiba Inu price flashes buy signal, 25% rally likely Premium

Shiba Inu price has flipped bullish to the tune of the crypto market and breached key hurdles, showing signs of a potential rally. Investors looking to accumulate SHIB have a good opportunity to do so before the meme coin shoots up.

New York Attorney General reaches $2 billion settlement with Genesis after claims of fraud

New York Attorney General Letitia James says her office has secured a $2 billion settlement from Genesis. AG Letitia James sued the crypto company in October on account of lying and defrauding investors of over $1 billion.

Bitcoin: Is BTC out of the woods? Premium

Bitcoin price shows signs of continuing its uptrend, providing a buying opportunity between $64,580 to $63,095. On-chain metrics forecast a bullish outlook for BTC ahead. If BTC clears $70,000, the chances of resuming the uptrend would skyrocket.

Crypto partners in your location

  • Published: 08 May 2024

A meta-analysis on global change drivers and the risk of infectious disease

  • Michael B. Mahon   ORCID: orcid.org/0000-0002-9436-2998 1 , 2   na1 ,
  • Alexandra Sack 1 , 3   na1 ,
  • O. Alejandro Aleuy 1 ,
  • Carly Barbera 1 ,
  • Ethan Brown   ORCID: orcid.org/0000-0003-0827-4906 1 ,
  • Heather Buelow   ORCID: orcid.org/0000-0003-3535-4151 1 ,
  • David J. Civitello 4 ,
  • Jeremy M. Cohen   ORCID: orcid.org/0000-0001-9611-9150 5 ,
  • Luz A. de Wit   ORCID: orcid.org/0000-0002-3045-4017 1 ,
  • Meghan Forstchen 1 , 3 ,
  • Fletcher W. Halliday 6 ,
  • Patrick Heffernan 1 ,
  • Sarah A. Knutie 7 ,
  • Alexis Korotasz 1 ,
  • Joanna G. Larson   ORCID: orcid.org/0000-0002-1401-7837 1 ,
  • Samantha L. Rumschlag   ORCID: orcid.org/0000-0003-3125-8402 1 , 2 ,
  • Emily Selland   ORCID: orcid.org/0000-0002-4527-297X 1 , 3 ,
  • Alexander Shepack 1 ,
  • Nitin Vincent   ORCID: orcid.org/0000-0002-8593-1116 1 &
  • Jason R. Rohr   ORCID: orcid.org/0000-0001-8285-4912 1 , 2 , 3   na1  

Nature (2024)

  • Infectious diseases

Anthropogenic change is contributing to the rise in emerging infectious diseases, which are significantly correlated with socioeconomic, environmental and ecological factors [1]. Studies have shown that infectious disease risk is modified by changes to biodiversity [2–6], climate change [7–11], chemical pollution [12–14], landscape transformations [15–20] and species introductions [21]. However, it remains unclear which global change drivers most increase disease and under what contexts. Here we amassed a dataset from the literature that contains 2,938 observations of infectious disease responses to global change drivers across 1,497 host–parasite combinations, including plant, animal and human hosts. We found that biodiversity loss, chemical pollution, climate change and introduced species are associated with increases in disease-related end points or harm, whereas urbanization is associated with decreases in disease end points. Natural biodiversity gradients, deforestation and forest fragmentation are comparatively unimportant or idiosyncratic as drivers of disease. Overall, these results are consistent across human and non-human diseases. Nevertheless, context-dependent effects of the global change drivers on disease were found to be common. The findings uncovered by this meta-analysis should help target disease management and surveillance efforts towards global change drivers that increase disease. Specifically, reducing greenhouse gas emissions, managing ecosystem health, and preventing biological invasions and biodiversity loss could help to reduce the burden of plant, animal and human diseases, especially when coupled with improvements to social and economic determinants of health.

Data availability

All the data for this Article have been deposited at Zenodo (https://doi.org/10.5281/zenodo.8169979) [52] and GitHub (https://github.com/mahonmb/GCDofDisease) [53].

Code availability

All the code for this Article has been deposited at Zenodo (https://doi.org/10.5281/zenodo.8169979) [52] and GitHub (https://github.com/mahonmb/GCDofDisease) [53]. R markdown is provided in Supplementary Data 1.

References

1. Jones, K. E. et al. Global trends in emerging infectious diseases. Nature 451, 990–994 (2008).
2. Civitello, D. J. et al. Biodiversity inhibits parasites: broad evidence for the dilution effect. Proc. Natl Acad. Sci. USA 112, 8667–8671 (2015).
3. Halliday, F. W., Rohr, J. R. & Laine, A.-L. Biodiversity loss underlies the dilution effect of biodiversity. Ecol. Lett. 23, 1611–1622 (2020).
4. Rohr, J. R. et al. Towards common ground in the biodiversity–disease debate. Nat. Ecol. Evol. 4, 24–33 (2020).
5. Johnson, P. T. J., Ostfeld, R. S. & Keesing, F. Frontiers in research on biodiversity and disease. Ecol. Lett. 18, 1119–1133 (2015).
6. Keesing, F. et al. Impacts of biodiversity on the emergence and transmission of infectious diseases. Nature 468, 647–652 (2010).
7. Cohen, J. M., Sauer, E. L., Santiago, O., Spencer, S. & Rohr, J. R. Divergent impacts of warming weather on wildlife disease risk across climates. Science 370, eabb1702 (2020).
8. Rohr, J. R. et al. Frontiers in climate change-disease research. Trends Ecol. Evol. 26, 270–277 (2011).
9. Altizer, S., Ostfeld, R. S., Johnson, P. T. J., Kutz, S. & Harvell, C. D. Climate change and infectious diseases: from evidence to a predictive framework. Science 341, 514–519 (2013).
10. Rohr, J. R. & Cohen, J. M. Understanding how temperature shifts could impact infectious disease. PLoS Biol. 18, e3000938 (2020).
11. Carlson, C. J. et al. Climate change increases cross-species viral transmission risk. Nature 607, 555–562 (2022).
12. Halstead, N. T. et al. Agrochemicals increase risk of human schistosomiasis by supporting higher densities of intermediate hosts. Nat. Commun. 9, 837 (2018).
13. Martin, L. B., Hopkins, W. A., Mydlarz, L. D. & Rohr, J. R. The effects of anthropogenic global changes on immune functions and disease resistance. Ann. N. Y. Acad. Sci. 1195, 129–148 (2010).
14. Rumschlag, S. L. et al. Effects of pesticides on exposure and susceptibility to parasites can be generalised to pesticide class and type in aquatic communities. Ecol. Lett. 22, 962–972 (2019).
15. Allan, B. F., Keesing, F. & Ostfeld, R. S. Effect of forest fragmentation on Lyme disease risk. Conserv. Biol. 17, 267–272 (2003).
16. Brearley, G. et al. Wildlife disease prevalence in human-modified landscapes. Biol. Rev. 88, 427–442 (2013).
17. Rohr, J. R. et al. Emerging human infectious diseases and the links to global food production. Nat. Sustain. 2, 445–456 (2019).
18. Bradley, C. A. & Altizer, S. Urbanization and the ecology of wildlife diseases. Trends Ecol. Evol. 22, 95–102 (2007).
19. Allen, T. et al. Global hotspots and correlates of emerging zoonotic diseases. Nat. Commun. 8, 1124 (2017).
20. Sokolow, S. H. et al. Ecological and socioeconomic factors associated with the human burden of environmentally mediated pathogens: a global analysis. Lancet Planet. Health 6, e870–e879 (2022).
21. Young, H. S., Parker, I. M., Gilbert, G. S., Guerra, A. S. & Nunn, C. L. Introduced species, disease ecology, and biodiversity–disease relationships. Trends Ecol. Evol. 32, 41–54 (2017).
22. Barouki, R. et al. The COVID-19 pandemic and global environmental change: emerging research needs. Environ. Int. 146, 106272 (2021).
23. Nova, N., Athni, T. S., Childs, M. L., Mandle, L. & Mordecai, E. A. Global change and emerging infectious diseases. Ann. Rev. Resour. Econ. 14, 333–354 (2021).
24. Zhang, L. et al. Biological invasions facilitate zoonotic disease emergences. Nat. Commun. 13, 1762 (2022).
25. Olival, K. J. et al. Host and viral traits predict zoonotic spillover from mammals. Nature 546, 646–650 (2017).
26. Guth, S. et al. Bats host the most virulent—but not the most dangerous—zoonotic viruses. Proc. Natl Acad. Sci. USA 119, e2113628119 (2022).
27. Nelson, G. C. et al. in Ecosystems and Human Well-Being (Millennium Ecosystem Assessment) Vol. 2 (eds Rola, A. et al) Ch. 7, 172–222 (Island Press, 2005).
28. Read, A. F., Graham, A. L. & Raberg, L. Animal defenses against infectious agents: is damage control more important than pathogen control? PLoS Biol. 6, 2638–2641 (2008).
29. Medzhitov, R., Schneider, D. S. & Soares, M. P. Disease tolerance as a defense strategy. Science 335, 936–941 (2012).
30. Torchin, M. E. & Mitchell, C. E. Parasites, pathogens, and invasions by plants and animals. Front. Ecol. Environ. 2, 183–190 (2004).
31. Bellay, S., de Oliveira, E. F., Almeida-Neto, M. & Takemoto, R. M. Ectoparasites are more vulnerable to host extinction than co-occurring endoparasites: evidence from metazoan parasites of freshwater and marine fishes. Hydrobiologia 847, 2873–2882 (2020).
32. Scheffer, M. Critical Transitions in Nature and Society Vol. 16 (Princeton Univ. Press, 2020).
33. Rohr, J. R. et al. A planetary health innovation for disease, food and water challenges in Africa. Nature 619, 782–787 (2023).
34. Reaser, J. K., Witt, A., Tabor, G. M., Hudson, P. J. & Plowright, R. K. Ecological countermeasures for preventing zoonotic disease outbreaks: when ecological restoration is a human health imperative. Restor. Ecol. 29, e13357 (2021).
35. Hopkins, S. R. et al. Evidence gaps and diversity among potential win–win solutions for conservation and human infectious disease control. Lancet Planet. Health 6, e694–e705 (2022).
36. Mitchell, C. E. & Power, A. G. Release of invasive plants from fungal and viral pathogens. Nature 421, 625–627 (2003).
37. Chamberlain, S. A. & Szöcs, E. taxize: taxonomic search and retrieval in R. F1000Research 2, 191 (2013).
38. Newman, M. Fundamentals of Ecotoxicology (CRC Press/Taylor & Francis Group, 2010).
39. Rohatgi, A. WebPlotDigitizer v.4.5 (2021); automeris.io/WebPlotDigitizer.
40. Lüdecke, D. esc: effect size computation for meta analysis (version 0.5.1). Zenodo https://doi.org/10.5281/zenodo.1249218 (2019).
41. Lipsey, M. W. & Wilson, D. B. Practical Meta-Analysis (SAGE, 2001).
42. R Core Team. R: A Language and Environment for Statistical Computing Vol. 2022 (R Foundation for Statistical Computing, 2020); www.R-project.org/.
43. Viechtbauer, W. Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 36, 1–48 (2010).
44. Pustejovsky, J. E. & Tipton, E. Meta-analysis with robust variance estimation: Expanding the range of working models. Prev. Sci. 23, 425–438 (2022).
45. Lenth, R. emmeans: estimated marginal means, aka least-squares means. R package v.1.5.1 (2020).
46. Bartoń, K. MuMIn: multi-modal inference. Model selection and model averaging based on information criteria (AICc and alike) (2019).
47. Burnham, K. P. & Anderson, D. R. Multimodel inference: understanding AIC and BIC in model selection. Sociol. Methods Res. 33, 261–304 (2004).
48. Marks-Anglin, A. & Chen, Y. A historical review of publication bias. Res. Synth. Methods 11, 725–742 (2020).
49. Nakagawa, S. et al. Methods for testing publication bias in ecological and evolutionary meta-analyses. Methods Ecol. Evol. 13, 4–21 (2022).
50. Gurevitch, J., Koricheva, J., Nakagawa, S. & Stewart, G. Meta-analysis and the science of research synthesis. Nature 555, 175–182 (2018).
51. Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48 (2015).
52. Mahon, M. B. et al. Data and code for ‘A meta-analysis on global change drivers and the risk of infectious disease’. Zenodo https://doi.org/10.5281/zenodo.8169979 (2024).
53. Mahon, M. B. et al. Data and code for ‘A meta-analysis on global change drivers and the risk of infectious disease’. GitHub github.com/mahonmb/GCDofDisease (2024).

Acknowledgements

We thank C. Mitchell for contributing data on enemy release; L. Albert and B. Shayhorn for assisting with data collection; J. Gurevitch, M. Lajeunesse and G. Stewart for providing comments on an earlier version of this manuscript; and C. Carlson and two anonymous reviewers for improving this paper. This research was supported by grants from the National Science Foundation (DEB-2109293, DEB-2017785, DEB-1518681, IOS-1754868), National Institutes of Health (R01TW010286) and US Department of Agriculture (2021-38420-34065) to J.R.R.; a US Geological Survey Powell grant to J.R.R. and S.L.R.; University of Connecticut Start-up funds to S.A.K.; grants from the National Science Foundation (IOS-1755002) and National Institutes of Health (R01 AI150774) to D.J.C.; and an Ambizione grant (PZ00P3_202027) from the Swiss National Science Foundation to F.W.H. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

These authors contributed equally: Michael B. Mahon, Alexandra Sack, Jason R. Rohr

Authors and Affiliations

Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA

Michael B. Mahon, Alexandra Sack, O. Alejandro Aleuy, Carly Barbera, Ethan Brown, Heather Buelow, Luz A. de Wit, Meghan Forstchen, Patrick Heffernan, Alexis Korotasz, Joanna G. Larson, Samantha L. Rumschlag, Emily Selland, Alexander Shepack, Nitin Vincent & Jason R. Rohr

Environmental Change Initiative, University of Notre Dame, Notre Dame, IN, USA

Michael B. Mahon, Samantha L. Rumschlag & Jason R. Rohr

Eck Institute of Global Health, University of Notre Dame, Notre Dame, IN, USA

Alexandra Sack, Meghan Forstchen, Emily Selland & Jason R. Rohr

Department of Biology, Emory University, Atlanta, GA, USA

David J. Civitello

Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA

Jeremy M. Cohen

Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA

Fletcher W. Halliday

Department of Ecology and Evolutionary Biology, Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA

Sarah A. Knutie

Contributions

J.R.R. conceptualized the study. All of the authors contributed to the methodology. All of the authors contributed to investigation. Visualization was performed by M.B.M. The initial study list and related information were compiled by D.J.C., J.M.C., F.W.H., S.A.K., S.L.R. and J.R.R. Data extraction was performed by M.B.M., A.S., O.A.A., C.B., E.B., H.B., L.A.d.W., M.F., P.H., A.K., J.G.L., E.S., A.S. and N.V. Data were checked for accuracy by M.B.M. and A.S. Analyses were performed by M.B.M. and J.R.R. Funding was acquired by D.J.C., J.R.R., S.A.K. and S.L.R. Project administration was done by J.R.R. J.R.R. supervised the study. J.R.R. and M.B.M. wrote the original draft. All of the authors reviewed and edited the manuscript. J.R.R. and M.B.M. responded to reviewers.

Corresponding author

Correspondence to Jason R. Rohr.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Colin Carlson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 PRISMA flowchart.

The PRISMA flow diagram of the search and selection of studies included in this meta-analysis. Note that 77 studies came from the Halliday et al. [3] database on biodiversity change.

Extended Data Fig. 2 Summary of the number of studies (A-F) and parasite taxa (G-L) in the infectious disease database across ecological contexts.

The contexts are global change driver (A, G), parasite taxa (B, H), host taxa (C, I), experimental venue (D, J), study habitat (E, K), and human parasite status (F, L).

Extended Data Fig. 3 Summary of the number of effect sizes (A-I), studies (J-R), and parasite taxa (S-a) in the infectious disease database for various parasite and host contexts.

Shown are parasite type (A, J, S), host thermy (B, K, T), vector status (C, L, U), vector-borne status (D, M, V), parasite transmission (E, N, W), free living stages (F, O, X), host (e.g. disease, host growth, host survival) or parasite (e.g. parasite abundance, prevalence, fecundity) endpoint (G, P, Y), micro- vs macroparasite (H, Q, Z), and zoonotic status (I, R, a).

Extended Data Fig. 4 The effects of global change drivers and subsequent subcategories on disease responses with Log Response Ratio instead of Hedges' g.

Here, Log Response Ratio shows trends similar to those of Hedges' g presented in the main text. The displayed points represent the mean predicted values (with 95% confidence intervals) from a meta-analytical model with separate random intercepts for study. Points that do not share letters are significantly different from one another (p < 0.05) based on a two-sided Tukey's posthoc multiple comparison test with adjustment for multiple comparisons. See Table S3 for pairwise comparison results. Effects of the five common global change drivers (A) have the same directionality, similar magnitude, and significance as those presented in Fig. 2. Global change driver effects are significant when confidence intervals do not overlap with zero and were explicitly tested with a two-tailed t-test (indicated by asterisks; t_{80.62} = 2.16, p = 0.034 for CP; t_{71.42} = 2.10, p = 0.039 for CC; t_{131.79} = −3.52, p < 0.001 for HLC; t_{61.9} = 2.10, p = 0.040 for IS). The subcategories (B) also show patterns similar to those presented in Fig. 3. Subcategories are significant when confidence intervals do not overlap with zero and were explicitly tested with a two-tailed one-sample t-test (t_{30.52} = 2.17, p = 0.038 for CO2; t_{40.03} = 4.64, p < 0.001 for Enemy Release; t_{47.45} = 2.18, p = 0.034 for Mean Temperature; t_{110.81} = −4.05, p < 0.001 for Urbanization); all other subcategories have p > 0.20. Note that effect size and study numbers are lower here than in Figs. 3 and 4, because log response ratios cannot be calculated for studies that provide coefficients (e.g., odds ratio) rather than raw data; as a result, none of the observations within BC had associated RR values. Despite strong differences in sample size, patterns are consistent across effect sizes, and therefore we can be confident that the results presented in the main text are not biased because of effect size selection.
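
As a rough illustration of the two effect-size metrics compared in this figure, the following Python sketch computes Hedges' g and the log response ratio from group summary statistics. It uses the standard textbook formulas (pooled standard deviation with the usual approximate small-sample correction, and the delta-method variance for the log ratio). The function and argument names are hypothetical, and this is not the authors' code, which is provided as R markdown.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return (g, variance) for a treatment vs. control comparison."""
    df = n_t + n_c - 2
    # Pooled standard deviation of the two groups.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / sd_pooled
    # Approximate small-sample bias correction factor.
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    g = j * d
    # Common large-sample approximation of the variance of d.
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2.0 * (n_t + n_c))
    return g, j**2 * var_d


def log_response_ratio(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return (lnRR, variance); both group means must be positive."""
    if mean_t <= 0 or mean_c <= 0:
        raise ValueError("log response ratio needs positive group means")
    ln_rr = math.log(mean_t / mean_c)
    var = sd_t**2 / (n_t * mean_t**2) + sd_c**2 / (n_c * mean_c**2)
    return ln_rr, var
```

Both functions need raw group means, standard deviations and sample sizes. This is consistent with the caption's note that a log response ratio cannot be obtained from studies that report only coefficients such as odds ratios.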

Extended Data Fig. 5 Average standard errors of the effect sizes (A) and sample sizes per effect size (B) for each of the five global change drivers.

The displayed points represent the mean predicted values (with 95% confidence intervals) from the generalized linear mixed effects models with separate random intercepts for study (Gaussian distribution for standard error model, A; Poisson distribution for sample size model, B). Points that do not share letters are significantly different from one another (p < 0.05) based on a two-sided Tukey's posthoc multiple comparison test with adjustment for multiple comparisons. Sample sizes (number of studies, n, and effect sizes, k) for each driver are as follows: n = 77, k = 392 for BC; n = 124, k = 364 for CP; n = 202, k = 380 for CC; n = 517, k = 1449 for HLC; n = 96, k = 355 for IS.

Extended Data Fig. 6 Forest plots of effect sizes, associated variances, and relative weights (A), Funnel plots (B), and Egger’s Test plots (C) for each of the five global change drivers and leave-one-out publication bias analyses (D).

In panel A, points are the individual effect sizes (Hedges' g), error bars are standard errors of the effect size, and point size is the relative weight of the observation in the model, with larger points representing observations with higher weight. Sample sizes are provided for each effect size in the meta-analytic database. Effect sizes were plotted in a random order. Egger's tests indicated significant asymmetries (p < 0.05) in Biodiversity Change (the strongest asymmetry, which likely reflects a real positive relationship between diversity and disease rather than bias), Climate Change (weak asymmetry, again likely not bias, because climate change generally increases disease) and Introduced Species (relatively weak asymmetry; it is unclear whether this is a bias, and it may be driven by some outliers). No significant asymmetries (p > 0.05) were found in Chemical Pollution and Habitat Loss/Change, suggesting negligible publication bias in reported disease responses across these global change drivers (B, C). Egger's test included publication year as a moderator but found no significant relationship between Hedges' g and publication year (p > 0.05), implying no temporal bias in effect size magnitude or direction. In panel D, the horizontal red lines denote the grand mean and SE of Hedges' g (g = 0.1009, SE = 0.0338). Grey points and error bars indicate the Hedges' g values and SEs, respectively, obtained with the leave-one-out method (the grand mean is recalculated after a given study is removed from the dataset). While the removal of certain studies resulted in values that differed from the grand mean, all estimated Hedges' g values fell well within the standard error of the grand mean. This sensitivity analysis indicates that our results were robust to the iterative exclusion of individual studies.
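
The leave-one-out procedure described for panel D can be sketched as follows. This is a deliberately simplified illustration in Python: it pools effect sizes with a plain inverse-variance weighted mean, whereas the caption describes meta-analytic models with random intercepts for study. The record layout and function names are hypothetical.

```python
import math


def weighted_mean(effects, variances):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, effects)) / total
    return mean, math.sqrt(1.0 / total)


def leave_one_study_out(records):
    """Recompute the pooled mean after dropping each study in turn.

    records: list of (study_id, effect_size, variance) tuples; a study
    may contribute several effect sizes, all of which are dropped together.
    Returns {study_id: (pooled_mean, standard_error)}.
    """
    studies = sorted({study for study, _, _ in records})
    if len(studies) < 2:
        raise ValueError("need at least two studies")
    results = {}
    for left_out in studies:
        kept = [r for r in records if r[0] != left_out]
        results[left_out] = weighted_mean(
            [effect for _, effect, _ in kept],
            [variance for _, _, variance in kept],
        )
    return results
```

Comparing each recomputed mean with the grand mean from the full dataset then shows whether any single study drives the overall estimate.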

Extended Data Fig. 7 The effects of habitat loss/change on disease depend on parasite taxa and land use conversion contexts.

A) Enemy type influences the magnitude of the effect of urbanization on disease: helminths, protists, and arthropods were all negatively associated with urbanization, whereas viruses were non-significantly positively associated with urbanization. B) Reference (control) land use type influences the magnitude of the effect of urbanization on disease: disease was reduced in urban settings compared to rural and peri-urban settings, whereas there were no differences in disease along urbanization gradients or between urban and natural settings. C) The effect of forest fragmentation depends on whether a large/continuous habitat patch is compared to a small patch or whether disease is measured along an increasing fragmentation gradient (Z = −2.828, p = 0.005). Conversely, the effect of deforestation on disease does not depend on whether the habitat has been destroyed and allowed to regrow (e.g., clearcutting, second growth forests, etc.) or whether it has been replaced with agriculture (e.g., row crop, agroforestry, livestock grazing; Z = 1.809, p = 0.0705). The displayed points represent the mean predicted values (with 95% confidence intervals) from a metafor model where the response variable was a Hedges' g (representing the effect on an infectious disease endpoint relative to control), study was treated as a random effect, and the independent variables included enemy type (A), reference land use type (B), or land use conversion type (C). Data for (A) and (B) were only those studies that were within the “urbanization” subcategory; data for (C) were only those studies that were within the “deforestation” and “forest fragmentation” subcategories. Sample sizes (number of studies, n, and effect sizes, k) in (A) for each enemy are n = 48, k = 98 for Virus; n = 193, k = 343 for Protist; n = 159, k = 490 for Helminth; n = 10, k = 24 for Fungi; n = 103, k = 223 for Bacteria; and n = 30, k = 73 for Arthropod. Sample sizes in (B) for each reference land use type are n = 391, k = 1073 for Rural; n = 29, k = 74 for Peri-urban; n = 33, k = 83 for Natural; and n = 24, k = 58 for Urban Gradient. Sample sizes in (C) for each land use conversion type are n = 7, k = 47 for Continuous Gradient; n = 16, k = 44 for High/Low Fragmentation; n = 11, k = 27 for Clearcut/Regrowth; and n = 21, k = 43 for Agriculture.

Extended Data Fig. 8 The effects of common global change drivers on mean infectious disease responses in the literature depend on whether the endpoint is the host or parasite; whether the parasite is a vector, is vector-borne, has a complex or direct life cycle, or is a macroparasite; whether the host is an ectotherm or endotherm; or the venue and habitat in which the study was conducted.

(A) Parasite endpoints. (B) Vector-borne status. (C) Parasite transmission route. (D) Parasite size. (E) Venue. (F) Habitat. (G) Host thermy. (H) Parasite type (ecto- or endoparasite). See Table S2 for number of studies and effect sizes across ecological contexts and global change drivers. See Table S3 for pairwise comparison results. The displayed points represent the mean predicted values (with 95% confidence intervals) from a metafor model where the response variable was a Hedges' g (representing the effect on an infectious disease endpoint relative to control), study was treated as a random effect, and the independent variables included the main effects and an interaction between global change driver and the focal independent variable (whether the endpoint measured was a host or parasite, whether the parasite is vector-borne, has a complex or direct life cycle, is a macroparasite, whether the study was conducted in the field or lab, habitat, the host is ectothermic, or the parasite is an ectoparasite).

Extended Data Fig. 9 The effects of five common global change drivers on mean infectious disease responses in the literature only occasionally depend on location, host taxon, and parasite taxon.

(A) Continent in which the field study occurred. Lack of replication in chemical pollution precluded us from including South America, Australia, and Africa in this analysis. (B) Host taxa. (C) Enemy taxa. See Table S2 for number of studies and effect sizes across ecological contexts and global change drivers. See Table S3 for pairwise comparison results. The displayed points represent the mean predicted values (with 95% confidence intervals) from a metafor model where the response variable was a Hedges' g (representing the effect on an infectious disease endpoint relative to control), study was treated as a random effect, and the independent variables included the main effects and an interaction between global change driver and continent, host taxon, and enemy taxon.

Extended Data Fig. 10 The effects of human vs. non-human endpoints for the zoonotic disease subset of database and wild vs. domesticated animal endpoints for the non-human animal subset of database are consistent across global change drivers.

(A) Zoonotic disease responses measured on human hosts responded less positively (closer to zero when positive, further from zero when negative) than those measured on non-human (animal) hosts (Z = 2.306, p = 0.021). Note that IS studies were removed because of missing cells. (B) Disease responses measured on domestic animal hosts responded less positively (closer to zero when positive, further from zero when negative) than those measured on wild animal hosts (Z = 2.636, p = 0.008). These results were consistent across global change drivers (i.e., no significant interaction between endpoint and global change driver). As many of the global change drivers increase zoonotic parasites in non-human animals and all parasites in wild animals, this may suggest that anthropogenic change might increase the occurrence of parasite spillover from animals to humans and thus also pandemic risk. The displayed points represent the mean predicted values (with 95% confidence intervals) from a metafor model where the response variable was a Hedges' g (representing the effect on an infectious disease endpoint relative to control), study was treated as a random effect, and the independent variables were global change driver and human/non-human host. Data for (A) were only those diseases that are considered “zoonotic”; data for (B) were only those endpoints that were measured on non-human animals. Sample sizes in (A) for zoonotic disease measured on human endpoints across global change drivers are n = 3, k = 17 for BC; n = 2, k = 6 for CP; n = 25, k = 39 for CC; and n = 175, k = 331 for HLC. Sample sizes in (A) for zoonotic disease measured on non-human endpoints across global change drivers are n = 25, k = 52 for BC; n = 2, k = 3 for CP; n = 18, k = 29 for CC; n = 126, k = 289 for HLC. Sample sizes in (B) for wild animal endpoints across global change drivers are n = 28, k = 69 for BC; n = 21, k = 44 for CP; n = 50, k = 89 for CC; n = 121, k = 360 for HLC; and n = 29, k = 45 for IS. Sample sizes in (B) for domesticated animal endpoints across global change drivers are n = 2, k = 4 for BC; n = 4, k = 11 for CP; n = 7, k = 20 for CC; n = 78, k = 197 for HLC; and n = 1, k = 2 for IS.

Supplementary information

Supplementary Information

Supplementary Discussion, Supplementary References and Supplementary Tables 1–3.

Reporting Summary

Peer Review File

Supplementary Data 1

R markdown code and output associated with this paper.

Supplementary Table 4

EcoEvo PRISMA checklist.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article.

Mahon, M.B., Sack, A., Aleuy, O.A. et al. A meta-analysis on global change drivers and the risk of infectious disease. Nature (2024). https://doi.org/10.1038/s41586-024-07380-6

Received: 02 August 2022

Accepted: 03 April 2024

Published: 08 May 2024

DOI: https://doi.org/10.1038/s41586-024-07380-6

research and analysis of blockchain data

An Analysis of Blockchain and GDPR under the Data Lifecycle Perspective

  • Published: 29 August 2020
  • Volume 26, pages 266–276 (2021)

  • Gislaine Parra Freund
  • Priscila Basto Fagundes (ORCID: orcid.org/0000-0002-9461-311X)
  • Douglas Dyllon Jeronimo de Macedo

1432 Accesses

16 Citations

1 Altmetric

This article analyzes the personal data privacy principles prescribed in the General Data Protection Regulation (GDPR) and the treatment of data in Blockchain technology in its primary version, guided by the phases of the Data Lifecycle. The Data Lifecycle describes the stages through which data pass in a given process; the stages are related to each other and form a chain of dependence. The GDPR, in turn, sets out privacy principles that cover the treatment data require at every stage of their lifecycle. The analysis identified the influence that the Data Lifecycle phases have on compliance with the GDPR principles, and related the data treatment performed by Blockchain technology to each phase, yielding an approach to guide the process of adapting the technology to the legislation. Data treatment options are presented for the phases that showed gaps. The result of this analysis can therefore serve as a support tool for systematizing legal compliance in organizations that use or intend to adopt Blockchain technology.

Acknowledgements

This research was partially supported by Coordenação de Aperfeiçoamento de Pessoal de Nivel Superior (CAPES) and the Santa Catarina Research Foundation (FAPESC) by grant Public Note FAPESC N° 03/2017.

Author information

Authors and affiliations.

Department of Information Science (CIN), Federal University of Santa Catarina (UFSC), Florianópolis, Santa Catarina, Brazil

Gislaine Parra Freund, Priscila Basto Fagundes & Douglas Dyllon Jeronimo de Macedo

Corresponding author

Correspondence to Gislaine Parra Freund.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Freund, G.P., Fagundes, P.B. & de Macedo, D.D.J. An Analysis of Blockchain and GDPR under the Data Lifecycle Perspective. Mobile Netw Appl 26, 266–276 (2021). https://doi.org/10.1007/s11036-020-01646-9

Published: 29 August 2020

Issue Date: February 2021

DOI: https://doi.org/10.1007/s11036-020-01646-9

  • Data lifecycle
  • General data protection regulation
  • Data privacy

When Online Content Disappears

  • Acknowledgments

Table of Contents

  • Webpages from the last decade
  • Links on government websites
  • Links on news websites
  • Reference links on Wikipedia
  • Posts on Twitter
  • Collection and analysis of Twitter data
  • Data collection for World Wide Web websites, government websites and news websites
  • Data collection for Wikipedia source links
  • Evaluating the status of pages and links
  • Definition of links

This report is a collaborative effort based on the input and analysis of the following individuals:

Primary Researchers

  • Athena Chapekis, Data Science Analyst
  • Samuel Bestvater, Computational Social Scientist
  • Emma Remy, Former Data Science Analyst
  • Gonzalo Rivero, Former Associate Director, Data Labs

Research Team

  • Aaron Smith, Director, Data Labs
  • Brian Broderick, Senior Data Engineer
  • Galen Stocking, Senior Computational Social Scientist
  • Regina Widjaya, Computational Social Scientist
  • Meltem Odabaş, Former Computational Social Scientist

Editorial and Graphic Design

  • Alissa Scheller, Senior Information Graphics Designer
  • Anna Jackson, Editorial Assistant

Communications and Web Publishing

  • Sogand Afkari, Communications Manager
  • Janakee Chavda, Assistant Digital Producer

In addition, the project benefited greatly from feedback by Jeff Diamant, Jenn Hatfield, Monica Anderson and Lee Rainie of Pew Research Center.

Not all data are created equal; some are structured, but most of them are unstructured. Structured and unstructured data are sourced, collected and scaled in different ways and each one resides in a different type of database.

In this article, we will take a deep dive into both types so that you can get the most out of your data.

Structured data, typically categorized as quantitative data, is highly organized and easily decipherable by machine learning algorithms. Structured query language (SQL), developed by IBM® in 1974, is the programming language used to manage structured data. By using a relational (SQL) database, business users can quickly input, search and manipulate structured data.

Examples of structured data include dates, names, addresses, credit card numbers, among others. Their benefits are tied to ease of use and access, while liabilities revolve around data inflexibility:

Benefits:

  • Easily used by machine learning (ML) algorithms: The specific and organized architecture of structured data eases the manipulation and querying of ML data.
  • Easily used by business users: Structured data do not require an in-depth understanding of different types of data and how they function. With a basic understanding of the topic relative to the data, users can easily access and interpret the data.
  • Accessible by more tools: Since structured data predates unstructured data, there are more tools available for using and analyzing structured data.

Liabilities:

  • Limited usage: Data with a predefined structure can only be used for its intended purpose, which limits its flexibility and usability.
  • Limited storage options: Structured data are usually stored in data storage systems with rigid schemas (for example, “data warehouses”). Therefore, changes in data requirements necessitate an update of all structured data, which leads to a massive expenditure of time and resources.

Tools:

  • OLAP: Performs high-speed, multidimensional data analysis from unified, centralized data stores.
  • SQLite: Implements a self-contained, serverless, zero-configuration, transactional relational database engine.
  • MySQL: Embeds data into mass-deployed software, particularly mission-critical, heavy-load production systems.
  • PostgreSQL: Supports SQL and JSON querying as well as high-tier programming languages (C/C++, Java, Python, among others).

Use cases:

  • Customer relationship management (CRM): CRM software runs structured data through analytical tools to create datasets that reveal customer behavior patterns and trends.
  • Online booking: Hotel and ticket reservation data (for example, dates, prices, destinations, among others) fits the “rows and columns” format indicative of the pre-defined data model.
  • Accounting: Accounting firms or departments use structured data to process and record financial transactions.

Unstructured data, typically categorized as qualitative data, cannot be processed and analyzed through conventional data tools and methods. Since unstructured data does not have a predefined data model, it is best managed in  non-relational (NoSQL) databases . Another way to manage unstructured data is to use  data lakes  to preserve it in raw form.

The importance of unstructured data is rapidly increasing. Recent projections indicate that unstructured data makes up over 80% of all enterprise data, while 95% of businesses prioritize unstructured data management.

Examples of unstructured data include text, mobile activity, social media posts, Internet of Things (IoT) sensor data, among others. Their benefits involve advantages in format, speed and storage, while liabilities revolve around expertise and available resources:

Benefits:

  • Native format: Unstructured data, stored in its native format, remains undefined until needed. Its adaptability increases file formats in the database, which widens the data pool and enables data scientists to prepare and analyze only the data they need.
  • Fast accumulation rates: Since there is no need to predefine the data, it can be collected quickly and easily.
  • Data lake storage: Allows for massive storage and pay-as-you-use pricing, which cuts costs and eases scalability.

Liabilities:

  • Requires expertise: Due to its undefined or non-formatted nature, data science expertise is required to prepare and analyze unstructured data. This is beneficial to data analysts but alienates unspecialized business users who might not fully understand specialized data topics or how to utilize their data.
  • Specialized tools: Specialized tools are required to manipulate unstructured data, which limits product choices for data managers.

Tools:

  • MongoDB: Uses flexible documents to process data for cross-platform applications and services.
  • DynamoDB: Delivers single-digit millisecond performance at any scale through built-in security, in-memory caching and backup and restore.
  • Hadoop: Provides distributed processing of large data sets using simple programming models and no formatting requirements.
  • Azure: Enables agile cloud computing for creating and managing apps through Microsoft’s data centers.

Use cases:

  • Data mining: Enables businesses to use unstructured data to identify consumer behavior, product sentiment and purchasing patterns to better accommodate their customer base.
  • Predictive data analytics: Alerts businesses to important activity ahead of time so they can plan properly and adjust to significant market shifts.
  • Chatbots: Perform text analysis to route customer questions to the appropriate answer sources.

While structured (quantitative) data gives a “birds-eye view” of customers, unstructured (qualitative) data provides a deeper understanding of customer behavior and intent. Let’s explore some of the key areas of difference and their implications:

  • Sources:  Structured data is sourced from GPS sensors, online forms, network logs, web server logs,  OLTP systems , among others; whereas unstructured data sources include email messages, word-processing documents, PDF files, and others.
  • Forms: Structured data consists of numbers and values, whereas unstructured data consists of sensor data, text files, audio and video files, among others.
  • Models:  Structured data has a predefined data model and is formatted to a set data structure before being placed in data storage (for example, schema-on-write), whereas unstructured data is stored in its native format and not processed until it is used (for example, schema-on-read).
  • Storage: Structured data is stored in tabular formats (for example, Excel sheets or SQL databases) that require less storage space. It can be stored in data warehouses, which makes it highly scalable. Unstructured data, on the other hand, is stored as media files or in NoSQL databases, which require more space. It can be stored in data lakes, which makes it difficult to scale.
  • Uses:  Structured data is used in machine learning (ML) and drives its algorithms, whereas unstructured data is used in  natural language processing  (NLP) and text mining.
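
The schema-on-write versus schema-on-read distinction in the list above can be made concrete with a small sketch. It uses only Python's standard `sqlite3` and `json` modules; the `bookings` table, its columns and the sample records are hypothetical and chosen purely for illustration.

```python
import json
import sqlite3

# Schema-on-write: the structure is declared before any data is stored,
# and rows that do not fit it are rejected at insert time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings ("
    "guest TEXT NOT NULL, check_in TEXT NOT NULL, price REAL NOT NULL)"
)
conn.execute("INSERT INTO bookings VALUES (?, ?, ?)", ("Ada", "2024-05-01", 120.0))


def write_rejected(row):
    """Return True if the database refuses a row that violates the schema."""
    try:
        conn.execute("INSERT INTO bookings VALUES (?, ?, ?)", row)
        return False
    except sqlite3.IntegrityError:
        return True


# Schema-on-read: raw records are kept as they arrived; a structure is
# imposed only when the data is read, and missing fields surface as None.
raw_records = [
    '{"guest": "Ada", "note": "late arrival", "price": 120.0}',
    '{"guest": "Lin", "tags": ["vip"]}',
]


def read_prices(records):
    """Apply a 'price' field to raw JSON records at read time."""
    return [json.loads(r).get("price") for r in records]
```

Calling `write_rejected(("Lin", None, 95.0))` shows the write-time check in action, while `read_prices(raw_records)` tolerates the record without a price and leaves it to the reader to handle the gap.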

Semi-structured data (for example, JSON, CSV, XML) is the “bridge” between structured and unstructured data. It does not have a predefined data model and is more complex than structured data, yet easier to store than unstructured data.

Semi-structured data uses “metadata” (for example, tags and semantic markers) to identify specific data characteristics and scale data into records and preset fields. Metadata ultimately enables semi-structured data to be better cataloged, searched and analyzed than unstructured data.

  • Example of metadata usage:  An online article displays a headline, a snippet, a featured image, image alt-text, slug, among others, which helps differentiate one piece of web content from similar pieces.
  • Example of semi-structured data vs. structured data:  A tab-delimited file containing customer data versus a database containing CRM tables.
  • Example of semi-structured data vs. unstructured data:  A tab-delimited file versus a list of comments from a customer’s Instagram.
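
As a minimal sketch of how metadata lets semi-structured data be mapped onto records and preset fields, the example below projects a JSON document onto a fixed set of fields. The document contents, the field names and the `to_record` helper are hypothetical and mirror the online-article example above.

```python
import json

# A semi-structured document: keys act as metadata describing each value,
# but no schema forces every document to carry the same keys.
article_json = """
{"headline": "Structured vs. unstructured data",
 "slug": "structured-vs-unstructured",
 "image": {"alt_text": "Database diagram"},
 "body": "Free text of the article"}
"""


def to_record(doc, fields=("headline", "slug")):
    """Project a JSON document onto preset fields using its metadata keys."""
    data = json.loads(doc)
    record = {name: data.get(name) for name in fields}
    # Nested metadata is flattened into its own column; absent keys become None.
    record["image_alt_text"] = data.get("image", {}).get("alt_text")
    return record
```

The resulting dictionary has a fixed set of keys, so a list of such records can be loaded into a table, whereas the free-text `body` would still need text-analysis tools.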

Recent developments in  artificial intelligence  (AI) and machine learning (ML) are driving the future wave of data, which is enhancing business intelligence and advancing industrial innovation. In particular, the data formats and models that are covered in this article are helping business users to do the following:

  • Analyze digital communications for compliance:  Pattern recognition and email threading analysis software that can search email and chat data for potential noncompliance.
  • Track high-volume customer conversations in social media:  Text analytics and sentiment analysis that enables monitoring of marketing campaign results and identifying online threats.
  • Gain new marketing intelligence:  ML analytics tools that can quickly cover massive amounts of data to help businesses analyze customer behavior.

Furthermore, smart and efficient usage of data formats and models can help you with the following:

  • Understand customer needs at a deeper level to better serve them
  • Create more focused and targeted marketing campaigns
  • Track current metrics and create new ones
  • Create better product opportunities and offerings
  • Reduce operational costs

Whether you are a seasoned data expert or a novice business owner, being able to handle all forms of data is conducive to your success. By using structured, semi-structured and unstructured data options, you can perform optimal data management that will ultimately benefit your mission.

Stanford Graduate School of Education

70 years after Brown v. Board of Education, new research shows rise in school segregation

As the nation prepares to mark the 70th anniversary of the landmark U.S. Supreme Court ruling in Brown v. Board of Education, a new report from researchers at Stanford and USC shows that racial and economic segregation among schools has grown steadily in large school districts over the past three decades, an increase that appears to be driven in part by policies favoring school choice over integration.

Analyzing data from U.S. public schools going back to 1967, the researchers found that segregation between white and Black students has increased by 64 percent since 1988 in the 100 largest districts, and segregation by economic status has increased by about 50 percent since 1991.

The report also provides new evidence about the forces driving recent trends in school segregation, showing that the expansion of charter schools has played a major role.  

The findings were released on May 6 with the launch of the Segregation Explorer, a new interactive website from the Educational Opportunity Project at Stanford University. The website provides searchable data on racial and economic school segregation in U.S. states, counties, metropolitan areas, and school districts from 1991 to 2022.

“School segregation levels are not at pre-Brown levels, but they are high and have been rising steadily since the late 1980s,” said Sean Reardon, the Professor of Poverty and Inequality in Education at Stanford Graduate School of Education and faculty director of the Educational Opportunity Project. “In most large districts, school segregation has increased while residential segregation and racial economic inequality have declined, and our findings indicate that policy choices – not demographic changes – are driving the increase.”

“There’s a tendency to attribute segregation in schools to segregation in neighborhoods,” said Ann Owens, a professor of sociology and public policy at USC. “But we’re finding that the story is more complicated than that.”

Assessing the rise

In the Brown v. Board decision issued on May 17, 1954, the U.S. Supreme Court ruled that racially segregated public schools violated the Equal Protection Clause of the Fourteenth Amendment and established that “separate but equal” schools were not only inherently unequal but unconstitutional. The ruling paved the way for future decisions that led to rapid school desegregation in many school districts in the late 1960s and early 1970s.

Though segregation in most school districts is much lower than it was 60 years ago, the researchers found that over the past three decades, both racial and economic segregation in large districts increased. Much of the increase in economic segregation since 1991, measured by segregation between students eligible and ineligible for free lunch, occurred in the last 15 years.

White-Hispanic and white-Asian segregation, while lower on average than white-Black segregation, have both more than doubled in large school districts since the 1980s. 

Racial-economic segregation – specifically the difference in the proportion of free-lunch-eligible students between the average white and Black or Hispanic student’s schools – has increased by 70 percent since 1991. 

School segregation is strongly associated with achievement gaps between racial and ethnic groups, especially the rate at which achievement gaps widen during school, the researchers said.  

“Segregation appears to shape educational outcomes because it concentrates Black and Hispanic students in higher-poverty schools, which results in unequal learning opportunities,” said Reardon, who is also a senior fellow at the Stanford Institute for Economic Policy Research and a faculty affiliate of the Stanford Accelerator for Learning.

Policies shaping recent trends 

The recent rise in school segregation appears to be the direct result of educational policy and legal decisions, the researchers said. 

Both residential segregation and racial disparities in income declined between 1990 and 2020 in most large school districts. “Had nothing else changed, that trend would have led to lower school segregation,” said Owens. 

But since 1991, roughly two-thirds of districts that were under court-ordered desegregation have been released from court oversight. Meanwhile, since 1998, the charter sector – a form of expanded school choice – has grown.

Expanding school choice could influence segregation levels in different ways: If families sought schools that were more diverse than the ones available in their neighborhood, it could reduce segregation. But the researchers found that in districts where the charter sector expanded most rapidly in the 2000s and 2010s, segregation grew the most. 

The researchers’ analysis also quantified the extent to which the release from court orders accounted for the rise in school segregation. They found that, together, the release from court oversight and the expansion of choice accounted entirely for the rise in school segregation from 2000 to 2019.

The researchers noted enrollment policies that school districts can implement to mitigate segregation, such as voluntary integration programs, socioeconomic-based student assignment policies, and school choice policies that affirmatively promote integration. 

“School segregation levels are high, troubling, and rising in large districts,” said Reardon. “These findings should sound an alarm for educators and policymakers.”

Additional collaborators on the project include Demetra Kalogrides, Thalia Tom, and Heewon Jang. This research, including the development of the Segregation Explorer data and website, was supported by the Russell Sage Foundation, the Robert Wood Johnson Foundation, and the Bill and Melinda Gates Foundation.   

COMMENTS

  1. A comprehensive review of blockchain technology ...

    The presented work provides a comprehensive analysis and survey of blockchain technology covering all aspects from the background, history, needs, roles, issues, applications, and challenges. ... This makes the blockchain tamperproof and ensures that once a block is added to the blockchain its data cannot be modified, forged, replaced, or ...

  2. Machine Learning for Blockchain Data Analysis: Progress and Opportunities

    We apply rigorous criteria to select and evaluate papers that contribute the most to the "ML for Blockchain Data Analysis" field. They encompass factors such as the relevance of the research, the significance of the problem addressed, the quality of the methodology employed, and the impact of the findings on the broader artificial intelligence community.

  3. Research and Analysis of Blockchain Data

    Abstract. Blockchain technology is characterized by anti-counterfeiting, non-tampering and easy to implement smart contracts, and is known as a new technology that will lead to social change. Therefore, the study of data in blockchain has important theoretical and practical significance. The author proposes a three-layer model from the ...

  4. (PDF) Research and Analysis of Blockchain Data

    Research and Analysis of Blockchain Data. Xiaojing Yang, Jinshan Liu, Xiaohe Li. College of Computer Science, Xi'an Shiyou University, Xi'an 710065, China. Abstract ...

  5. Research and Analysis of Blockchain Data

    The author proposes a three-layer model from the perspective of data analysis; on the basis of this model, the data structure and data type of blockchain, as well as theData structure and operation principle of smart contract are studied. Blockchain technology is characterized by anti-counterfeiting, non-tampering and easy to implement smart contracts, and is known as a new technology that ...

  6. A Systematic Overview of Blockchain Research

    3 Data and Methodology. This section elaborates steps to conduct a comprehensive bibliometric-based analysis: 1) data collection, 2) methodological process. The overall approach and methodology are shown in Figure 1, the details could be seen as follows. Figure 1 Research methodology.

  7. (PDF) Visualization of Blockchain Data: A Systematic Review

    In this work, we systematically reviewed 76 blockchain data visualization. sources—14 visualization articles, 17 data analysis articles, and 45 online web-based tools. W e classified those ...

  8. A General Framework for Blockchain Data Analysis

    The architecture of the framework is shown in Fig. 1. The Data collector first gathers raw data on concerning blockchain networks and stores it in a cloud database. The raw data, including blocks, transactions, and events, is the core of the networks and can be reused in multiple analysis projects.

  9. Deciphering Bitcoin Blockchain Data by Cohort Analysis

    Scientific Data - Deciphering Bitcoin Blockchain Data by Cohort Analysis. ... Data Science Research Center and Social Science Division, Duke Kunshan University, Kunshan, Jiangsu, 215316, China ...

  10. An Analysis of Data Modelling for Blockchain

    Additionally, the research topic of blockchain analysis Footnote 6 [1, 31], shows us that (old) new problems, such as fast data querying and management, arise from the volume of data generated by blockchain networks, where data is often churned in order to extract extra information from it. Some examples of applications are: (i) Detecting ...

  11. Visualization of Blockchain Data: A Systematic Review

    Abstract: We present a systematic review of visual analytics tools used for the analysis of blockchains-related data. The blockchain concept has recently received considerable attention and spurred applications in a variety of domains. We systematically and quantitatively assessed 76 analytics tools that have been proposed in research as well as online by professionals and blockchain enthusiasts.

  12. The landscape of Blockchain research: impacts and opportunities

    The blockchain technology envisioned a new form of the economy with the core value of trust. Blockchain 3.0 is a blueprint for popularizing the technology in fields other than cryptocurrency and finance, such as government, health, science, culture, and the arts (Swan 2015).It focuses on the regulation and governance of blockchain-based decentralization in every aspect of society.

  13. (PDF) Data Analytics and Blockchain: A Review

    Data analytics in the Blockchain is a new field, with only a few studies looking into the difficulties involving Blockchain data analytics. In this article, a Systematic analysis presents how data ...

  14. How are texts analyzed in blockchain research? A systematic literature

    This paper provides a systematic literature review of text analysis methodologies used in blockchain-related research to comprehend and synthesize existing studies across disciplines and define future research directions. We summarize the research scope, text data, and methodologies of 124 papers and identify the two most common combinations of these dimensions: (1) papers that focus on ...

  15. Blockchain Data Analysis from the Perspective of Complex Networks: Overview

    a summary of seven research problems on the data analysis of the blockchain. With the increasingly large cryptocurrency transaction networks, an ever-increasing number of blockchain data analysts have chosen to apply social network analysis methods to cryptocurrency data analysis. Complex networks[10] have been widely proven to be a powerful

  16. Blockchain Data Analysis from the Perspective of Complex Networks

    Cryptocurrency based on blockchain technology has gradually become a choice for people to invest in, and several users have participated in the accumulation of massive transaction data. Complete transaction records in blockchains and the openness of data provide researchers with opportunities to mine and analyze data in blockchains. Network modeling and analysis of cryptocurrency transaction ...

  17. Blockchain integration in big data: Review, vision, and opportunities

    A technical taxonomy is presented for blockchain integration into big data, including data storage, security, and applications based on the field of articles selected by the SLR method. Finally, the achievements and shortcomings of each study are discussed, and future research challenges and open issues related to blockchain integration in big ...

  18. Knowledge Mapping Visualization Analysis of Research on Blockchain in

    As an emerging technology enabling economic, social, and environmental sustainability, blockchain has drawn considerable attention from the academic community in management and economics. It is essential to clarify this field's research status, hotspots, and evolution trends. This study took 642 pieces of literature on blockchain in management and economics from the Web of Science (WoS) as ...

  19. VLVRI%ORFNFKDLQ'DWD

    blockchain data. Therefore, the study of blockchain data has important theoretical and practical significance. At the same time, since the blockchain technology is still in the initial stage, the analysis of blockchain data is also in the exploratory stage, and there is a lack of research on blockchain data.

  20. A Systematic Overview of Blockchain Research

    Blockchain has been receiving growing attention from both academia and practices. This paper aims to investigate the research status of blockchain-related studies and to analyze the development and evolution of this latest hot area via bibliometric analysis. We selected and explored 2451 papers published between 2013 and 2019 from the Web of Science Core Collection database.

  21. A Comprehensive Study of Blockchain Technology and Its Role in ...

    Blockchain technology has evolved as an innovative strategy that substantially transformed industries by offering a variety of applications across multiple domains including manufacturing, construction, supply chain, food, health, energy, transport and retailing industry. By offering a decentralized ledger that enables the source-tracking of materials and goods, blockchain enables a quick ...

  22. Blockchain Intelligence to Investigate Crypto Crime

    When it comes to efficiently and effectively investigating crypto crime, blockchain intelligence provides a game-changing advantage for law enforcement as it offers the ability to analyze extensive on-chain datasets, graphing capabilities to organize that data, and contextualized insight into illicit crypto activity.

  23. A Quantitative and Qualitative Review of Blockchain Research from 2015

    The First Institute of Resources and Environment Investigation of Henan Province, Zhengzhou 450000, China. * Correspondence: [email protected]. Abstract: Blockchain has the potential to ...

  24. XRP stays above $0.50 while Ripple backs research on new ...

    Data from Cryptoeq.io shows that the sentiment among XRP holders is positive on Monday. Technical analysis: XRP poised for correction on longer timeframe Ripple is likely to correct further as ...

  25. A meta-analysis on global change drivers and the risk of infectious

    The database resulting from our literature search includes 972 studies and 2,938 observations of global change drivers on disease or parasitism from 1,006 parasite taxa, 480 host taxa and 1,497 ...

  26. An Analysis of Blockchain and GDPR under the Data Lifecycle ...

    The purpose of this article is to present an analysis of the privacy principles of personal data prescribed in the General Data Protection Regulation and the treatment of data in Blockchain technology in its primary version, guided by the phases of the Data LifeCycle. The Data Life Cycle presents the stages in which the data act in a given process and are related to each other, forming a chain ...

  27. Acknowledgments

    This report is a collaborative effort based on the input and analysis of the following individuals: Primary Researchers. Athena Chapekis, Data Science Analyst Samuel Bestvater, Computational Social Scientist Emma Remy, Former Data Science Analyst Gonzalo Rivero, Former Associate Director, Data Labs. Research Team. Aaron Smith, Director, Data Labs

  28. Structured vs. unstructured data: What's the difference?

    Structured data—typically categorized as quantitative data—is highly organized and easily decipherable by machine learning algorithms. Developed by IBM® in 1974, structured query language (SQL) is the programming language used to manage structured data.By using a relational (SQL) database, business users can quickly input, search and manipulate structured data.

  29. 70 years after Brown v. Board of Education, new research shows rise in

    As the nation prepares to mark the 70th anniversary of the landmark U.S. Supreme Court ruling in Brown v. Board of Education, a new report from researchers at Stanford and USC shows that racial and economic segregation among schools has grown steadily in large school districts over the past three decades — an increase that appears to be driven in part by policies favoring