
Natural Language Processing: Recently Published Documents


Towards Developing Uniform Lexicon Based Sorting Algorithm for Three Prominent Indo-Aryan Languages

Three different Indic/Indo-Aryan languages - Bengali, Hindi, and Nepali - are explored here at the character level to find out their similarities and dissimilarities. Sharing the same root, Sanskrit, these Indic languages bear common characteristics, so computer and language scientists can take the opportunity to develop common Natural Language Processing (NLP) techniques and algorithms. Bearing this in mind, we compare and analyze the three languages character by character. As an application of the hypothesis, we also developed a uniform sorting algorithm in two steps: first for the Bengali and Nepali languages only, and then extended to Hindi in the second step. Our thorough investigation with more than 30,000 words from each language suggests that the algorithm achieves total accuracy, as defined by the local language authorities of the respective languages, with good efficiency.
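
The core idea of lexicon-based sorting is to order words by a language-specific character ranking rather than by raw Unicode code points. The following minimal Python sketch illustrates this with a small, illustrative fragment of the Bengali alphabet; it is not the paper's algorithm, and it ignores complications such as conjunct consonants:

```python
# Minimal sketch of lexicon-based sorting with a custom character order.
# The alphabet below is a small illustrative fragment, not the full ordering
# developed in the paper.
custom_order = {ch: i for i, ch in enumerate("অআইঈউঊএঐওঔকখগঘ")}

def sort_key(word):
    # Map each character to its rank in the custom alphabet; unknown
    # characters sort after all known ones.
    return [custom_order.get(ch, len(custom_order)) for ch in word]

words = ["খগ", "অআ", "কখ"]
print(sorted(words, key=sort_key))  # ['অআ', 'কখ', 'খগ']
```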

Efficient Channel Attention Based Encoder–Decoder Approach for Image Captioning in Hindi

Image captioning refers to the process of generating a textual description that describes objects and activities present in a given image. It connects two fields of artificial intelligence, computer vision and natural language processing, which deal with image understanding and language modeling, respectively. In the existing literature, most of the work on image captioning has been carried out for the English language. This article presents a novel method for image captioning in the Hindi language using an encoder–decoder based deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention mechanism with Bahdanau attention and a gated recurrent unit for developing an image captioning model in the Hindi language. Color images usually consist of three channels, namely red, green, and blue. The channel attention mechanism focuses on an image’s important channels while performing the convolution, essentially assigning higher importance to specific channels over others, and it has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture utilizes the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi, India’s official language, is the fourth most spoken language globally and is widely spoken in India and South Asia. By translating the well-known MSCOCO dataset from English to Hindi, a dataset for image captioning in Hindi was manually created. The efficiency of the proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results obtained illustrate that the proposed method outperforms the other baselines. The proposed method attained improvements of 0.59%, 2.51%, 4.38%, and 3.30% in terms of BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, with respect to the state of the art. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to illustrate the proposed method’s efficacy.
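
The channel-attention idea above is simple enough to sketch. Below is a minimal, illustrative PyTorch module in the spirit of ECA-Net's efficient channel attention (a sketch based on the published ECA-Net design, not the authors' Hindi-captioning code): global average pooling produces a per-channel descriptor, a small 1-D convolution models local cross-channel interaction, and a sigmoid yields per-channel weights.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention sketch: a 1-D convolution over the channel
    descriptor produced by global average pooling."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                         # x: (batch, channels, H, W)
        y = x.mean(dim=(2, 3))                    # global average pool -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        w = torch.sigmoid(y)                      # per-channel attention weights
        return x * w.unsqueeze(-1).unsqueeze(-1)

feats = torch.randn(2, 64, 14, 14)
print(ECA()(feats).shape)                         # torch.Size([2, 64, 14, 14])
```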

Model Transformation Development Using Automated Requirements Analysis, Metamodel Matching, and Transformation by Example

In this article, we address how the production of model transformations (MT) can be accelerated by automation of transformation synthesis from requirements, examples, and metamodels. We introduce a synthesis process based on metamodel matching, correspondence patterns between metamodels, and completeness and consistency analysis of matches. We describe how the limitations of metamodel matching can be addressed by combining matching with automated requirements analysis and model transformation by example (MTBE) techniques. We show that in practical examples a large percentage of required transformation functionality can usually be constructed automatically, thus potentially reducing development effort. We also evaluate the efficiency of synthesised transformations. Our novel contributions are:

  • The concept of correspondence patterns between the metamodels of a transformation.
  • Requirements analysis of transformations using natural language processing (NLP) and machine learning (ML).
  • Symbolic MTBE using “predictive specification” to infer transformations from examples.
  • Transformation generation in multiple MT languages and in Java, from an abstract intermediate language.

A Computational Look at Oral History Archives

Computational technologies have revolutionized the archival sciences field, prompting new approaches to process the extensive data in these collections. Automatic speech recognition and natural language processing create unique possibilities for the analysis of oral history (OH) interviews, where otherwise the transcription and analysis of the full recording would be too time-consuming. However, many oral historians note the loss of aural information when converting speech into text, pointing out the relevance of subjective cues for a full understanding of the interviewee’s narrative. In this article, we explore various computational technologies for social signal processing and their potential application space in OH archives, as well as neighboring domains where qualitative study is a frequently used method. We also highlight the latest developments in key technologies for multimedia archiving practices such as natural language processing and automatic speech recognition. We discuss the analysis of both visual cues (body language and facial expressions) and non-visual cues (paralinguistics, breathing, and heart rate), stating the specific challenges introduced by the characteristics of OH collections. We argue that applying social signal processing to OH archives will have a wider influence than solely on OH practices, bringing benefits to various fields from the humanities to computer science, as well as to the archival sciences. Looking at human emotions and somatic reactions in extensive interview collections would give scholars from multiple fields the opportunity to focus on feelings, mood, culture, and subjective experiences expressed in these interviews on a larger scale.

Which environmental features contribute to positive and negative perceptions of urban parks? A cross-cultural comparison using online reviews and Natural Language Processing methods

Natural Language Processing for Smart Construction: Current Status and Future Directions

Attention-Based Unsupervised Keyphrase Extraction and Phrase Graph for COVID-19 Medical Literature Retrieval

Searching, reading, and finding information in massive medical text collections is challenging. With a typical biomedical search engine, it is not feasible to navigate each article to find critical information or keyphrases, and few tools provide a visualization of the phrases relevant to a query. There is therefore a need to extract the keyphrases from each document for indexing and efficient search. Transformer-based neural networks such as BERT have been used for various natural language processing tasks, and their built-in self-attention mechanism can capture the associations between words and phrases in a sentence. This research investigates whether these self-attentions can be utilized to extract keyphrases from a document in an unsupervised manner, and to identify the relevance between phrases so as to construct a query relevancy phrase graph that visualizes the corpus phrases by their relevance and importance. A comparison with six baseline methods shows that the self-attention-based unsupervised keyphrase extraction works well on a medical literature dataset. The unsupervised keyphrase extraction model can also be applied to other text data. The query relevancy graph model is applied to the COVID-19 literature dataset to demonstrate that the attention-based phrase graph can successfully identify the medical phrases relevant to the query terms.
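
A minimal sketch of the underlying mechanism, using Hugging Face Transformers: pull the self-attention maps out of BERT and rank tokens by how much attention they receive. This is a simplified illustration of attention-based scoring, not the paper's full candidate-phrase pipeline:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "Unsupervised keyphrase extraction ranks candidate phrases for retrieval."
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# Average attention each token *receives*, over all layers and heads.
att = torch.stack(out.attentions)                # (layers, batch, heads, seq, seq)
received = att.mean(dim=(0, 2))[0].mean(dim=0)   # (seq,)

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
for t, s in sorted(zip(tokens, received.tolist()), key=lambda p: -p[1])[:5]:
    print(f"{t:15s} {s:.4f}")
```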

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB .
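
The released models are available on the Hugging Face hub. A minimal usage sketch follows; the model identifier is the PubMedBERT abstract-only checkpoint as originally published (the hub may have since renamed or redirected it):

```python
from transformers import pipeline

# Domain-specific model pretrained from scratch on PubMed text, per the
# paper's from-scratch pretraining approach.
fill = pipeline(
    "fill-mask",
    model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
)
for pred in fill("The patient was treated with [MASK] for the infection."):
    print(pred["token_str"], round(pred["score"], 3))
```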

An ensemble approach for healthcare application and diagnosis using natural language processing

Machine Learning and Natural Language Processing Enable a Data-Oriented Experimental Design Approach for Producing Biochar and Hydrochar from Biomass


Top Recent NLP Research

Featured Post | Modeling | NLP & LLMs | posted by Daniel Gutierrez, ODSC | October 1, 2021

Natural language processing (NLP), including conversational AI, is arguably one of the most exciting technology fields today. NLP is important because it works to resolve ambiguity in language and adds useful analytical structure to the data for a plethora of downstream applications such as speech recognition and text analytics. NLP helps computers communicate with humans in their own language and scales other language-centric tasks. For example, NLP makes it possible for computers to read text, listen to speech, interpret conversations, measure sentiment, and determine which segments are important. Even though budgets were hit hard by the pandemic, 53% of technical leaders said their NLP budget was at least 10% higher compared to 2019. In addition, many NLP breakthroughs are moving from research to production, with much of this progress coming from recent NLP research.

The last couple of years have been big for NLP, with a number of high-profile research efforts involving: generative pre-training models (GPT), transfer learning, transformers (e.g., BERT, ELMo), multilingual NLP, training models with reinforcement learning, automating customer service with a new era of chatbots, NLP for social media monitoring, fake news detection, and much more.

In this article, I’ll help get you up to speed with current NLP research efforts by curating a list of the top recent papers published at a variety of research venues, including arXiv.org, The International Conference on Learning Representations (ICLR), The Stanford NLP Group, NeurIPS, and KDD. Enjoy!

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point, further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, this paper presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that the proposed methods lead to models that scale much better compared to the original BERT. The paper also uses a self-supervised loss that focuses on modeling inter-sentence coherence, and shows that it consistently helps downstream tasks with multi-sentence inputs. As a result, the best model from this NLP research establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The GitHub repo associated with this paper can be found HERE.
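
One of ALBERT's two parameter-reduction techniques, factorized embedding parameterization, is easy to sketch (the other is cross-layer parameter sharing). The snippet below is a minimal, illustrative PyTorch version, not the authors' implementation: the V x H embedding table is decomposed into a V x E table plus an E x H projection with E much smaller than H.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Sketch of ALBERT's factorized embedding parameterization:
    V x H parameters become V x E + E x H with E << H."""
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # V x E
        self.project = nn.Linear(embed_dim, hidden_dim)    # E x H

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))

factored = FactorizedEmbedding()
full = nn.Embedding(30000, 768)                            # V x H baseline
n = lambda m: sum(p.numel() for p in m.parameters())
print(n(factored), "vs", n(full))   # ~3.9M vs ~23M parameters
```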

CogLTX: Applying BERT to Long Texts

BERT is incapable of processing long texts due to its quadratically increasing memory and time consumption. The attempts to address this problem, such as slicing the text with a sliding window or simplifying transformers, suffer from insufficient long-range attention or need customized CUDA kernels. The limited text length of BERT reminds us of the limited capacity (5 ∼ 9 chunks) of the working memory of humans – so how do human beings “Cognize Long TeXts”? Founded on the cognitive theory stemming from Baddeley, the CogLTX framework described in this NLP research paper identifies key sentences by training a judge model, concatenates them for reasoning, and enables multi-step reasoning via rehearsal and decay. Since relevance annotations are usually unavailable, it is proposed to use treatment experiments to create supervision. As a general algorithm, CogLTX outperforms or gets comparable results to SOTA models on NewsQA, HotpotQA, multi-class, and multi-label long-text classification tasks, with memory overheads independent of the text length.
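
The selection step can be sketched independently of the trained judge. Below is a minimal, hypothetical Python sketch of the idea: rank sentences by a judge's relevance score, keep the best ones under a token budget, and concatenate them in document order for the reasoner. The `judge_score` function here is a stand-in, not the paper's trained model.

```python
def select_key_sentences(sentences, judge_score, budget=512,
                         len_fn=lambda s: len(s.split())):
    """Greedy sketch of CogLTX-style selection: rank sentences by the judge's
    relevance score, keep the best ones within a token budget, and restore
    document order before concatenating."""
    ranked = sorted(range(len(sentences)), key=lambda i: -judge_score(sentences[i]))
    chosen, used = [], 0
    for i in ranked:
        cost = len_fn(sentences[i])
        if used + cost <= budget:
            chosen.append(i)
            used += cost
    return " ".join(sentences[i] for i in sorted(chosen))

doc = ["Irrelevant filler.", "The answer is Paris.", "More filler."]
print(select_key_sentences(doc, judge_score=lambda s: float("Paris" in s), budget=5))
```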


ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, this NLP research paper proposes a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, the new approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, the new approach trains a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by this approach substantially outperform the ones learned by BERT given the same model size, data, and compute.
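
The replaced-token-detection objective is straightforward to sketch. The following minimal PyTorch example (an illustration, not the authors' code) builds the discriminator's training signal: mask some positions, fill them with samples from a generator (here a random stand-in), and label each position as original or replaced. Note that when the generator happens to reproduce the original token, the position is labeled original, as in the paper.

```python
import torch

def replaced_token_detection_labels(input_ids, generator_fill, mask_prob=0.15):
    """Sketch of ELECTRA's data construction: corrupt a subset of tokens with
    generator samples and label every position 'replaced' (1) or 'original' (0)
    for the discriminator. `generator_fill` stands in for sampling from the
    generator's MLM distribution."""
    corrupted = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob
    corrupted[mask] = generator_fill(input_ids)[mask]   # plausible alternatives
    labels = (corrupted != input_ids).long()            # unchanged samples stay 'original'
    return corrupted, labels

ids = torch.randint(0, 30000, (2, 16))
fake_generator = lambda x: torch.randint(0, 30000, x.shape)  # stand-in generator
corrupted, labels = replaced_token_detection_labels(ids, fake_generator)
print(labels.float().mean())   # fraction of positions the discriminator must flag
```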

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. This NLP research paper explores a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation. RAG models are introduced where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. Two RAG formulations are compared: one conditions on the same retrieved passages across the whole generated sequence, while the other can use different passages per token.
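
The released RAG checkpoints can be exercised with a few lines of Hugging Face Transformers code. This sketch is patterned on the library's documented example; the dummy index stands in for the full Wikipedia dense index, which is very large:

```python
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# Retrieval happens inside generate(): passages are fetched for the question
# and the seq2seq generator conditions on them token by token.
inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```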

ConvBERT: Improving BERT with Span-based Dynamic Convolution

Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT relies heavily on the global self-attention block and thus suffers a large memory footprint and computation cost. Although all its attention heads query the whole input sequence to generate the attention map from a global perspective, some heads only need to learn local dependencies, which means there is computation redundancy. This NLP research paper proposes a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies. The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. Equipping BERT with this mixed attention design yields the ConvBERT model. Experiments have shown that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training costs and fewer model parameters.
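
Dynamic convolution, the building block behind ConvBERT's convolution heads, can be sketched compactly. The PyTorch module below is an illustrative, simplified version in which each position predicts its own kernel from its hidden state alone; ConvBERT's span-based variant conditions the kernel on a local span of inputs rather than a single token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """Sketch of position-wise dynamic convolution: each position predicts its
    own kernel, then mixes a local window of its neighbors."""
    def __init__(self, dim=64, k=5):
        super().__init__()
        self.k = k
        self.kernel_proj = nn.Linear(dim, k)   # per-position kernel weights

    def forward(self, x):                      # x: (B, T, D)
        kernels = F.softmax(self.kernel_proj(x), dim=-1)  # (B, T, k)
        pad = self.k // 2
        xp = F.pad(x, (0, 0, pad, pad))                   # pad the time axis
        windows = xp.unfold(1, self.k, 1)                 # (B, T, D, k) neighbors
        return torch.einsum("btdk,btk->btd", windows, kernels)

out = DynamicConv()(torch.randn(2, 10, 64))
print(out.shape)   # torch.Size([2, 10, 64])
```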

The Lottery Ticket Hypothesis for Pre-trained BERT Networks

In NLP, enormous pre-trained models like BERT have become the standard starting point for training on a range of downstream tasks, and similar trends are emerging in other areas of deep learning. In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy and transferring to other tasks. The work in this paper combines these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, matching subnetworks at 40% to 90% sparsity are found. These subnetworks are found at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion if at all. As large-scale pre-training becomes an increasingly central paradigm in deep learning, the results demonstrate that the main lottery ticket observations remain relevant in this context. The GitHub repo associated with this paper can be found HERE.
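
Subnetworks like these are typically found with magnitude pruning. The snippet below is a minimal, illustrative sketch of one-shot global magnitude pruning (the paper uses iterative magnitude pruning with rewinding, which repeats a step like this several times): keep the largest-magnitude weights across layers and zero out the rest.

```python
import torch

def magnitude_mask(weights, sparsity=0.6):
    """Find a subnetwork by global magnitude pruning: keep the
    largest-magnitude weights, zero the rest."""
    flat = torch.cat([w.abs().flatten() for w in weights])
    threshold = flat.kthvalue(int(sparsity * flat.numel())).values
    return [(w.abs() > threshold).float() for w in weights]

layers = [torch.randn(768, 768), torch.randn(3072, 768)]  # toy weight matrices
masks = magnitude_mask(layers, sparsity=0.6)
pruned = [w * m for w, m in zip(layers, masks)]
print([round(m.mean().item(), 2) for m in masks])  # ~0.4 of weights survive
```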

BERT Loses Patience: Fast and Robust Inference with Early Exit

This NLP research paper proposes Patience-based Early Exit, a straightforward yet effective inference method that can be used as a plug-and-play technique to simultaneously improve the efficiency and robustness of a pretrained language model (PLM). To achieve this, the approach couples an internal classifier with each layer of a PLM and dynamically stops inference when the intermediate predictions of the internal classifiers do not change for a pre-defined number of steps. The approach improves inference efficiency because it allows the model to predict with fewer layers. Meanwhile, experimental results with an ALBERT model show that the method can improve the accuracy and robustness of the model by preventing it from overthinking and by exploiting multiple classifiers for prediction, yielding a better accuracy-speed trade-off compared to existing early-exit methods.
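
The inference loop is simple enough to sketch. Below is a minimal, hypothetical Python version of the patience rule: read an internal classifier at each layer and stop once the predicted label has been stable for `patience` consecutive layers. The toy layers and classifiers are stand-ins for a real PLM's hidden states and trained internal classifiers.

```python
import torch

def patient_early_exit(hidden_states, classifiers, patience=2):
    """Stop as soon as the intermediate prediction has not changed for
    `patience` consecutive layers (Patience-based Early Exit sketch)."""
    prev, streak = None, 0
    for layer_idx, (h, clf) in enumerate(zip(hidden_states, classifiers)):
        pred = clf(h).argmax(dim=-1).item()
        streak = streak + 1 if pred == prev else 1
        prev = pred
        if streak >= patience:
            return pred, layer_idx + 1        # exited early
    return prev, len(hidden_states)           # used every layer

# Toy setup: 6 "layers" of pooled features, one linear classifier per layer.
states = [torch.randn(1, 16) for _ in range(6)]
clfs = [torch.nn.Linear(16, 3) for _ in range(6)]
print(patient_early_exit(states, clfs))       # (label, layers actually used)
```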

The Curious Case of Neural Text Degeneration

Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as a training objective leads to high-quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. This NLP research paper reveals surprising distributional differences between human text and machine text. In addition, it finds that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. The findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better matches the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
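
Nucleus (top-p) sampling is a few lines of code. The sketch below is a standard, minimal implementation of the idea described above: sort the vocabulary by probability, keep the smallest prefix whose cumulative mass exceeds p, renormalize, and sample.

```python
import torch

def nucleus_sample(logits, p=0.9):
    """Nucleus (top-p) sampling sketch: sample from the smallest set of tokens
    whose cumulative probability exceeds p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Mask tokens outside the nucleus; the top token is always kept.
    outside = cumulative - sorted_probs >= p
    sorted_probs[outside] = 0.0
    sorted_probs /= sorted_probs.sum()        # renormalize the nucleus
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx[choice]

vocab_logits = torch.randn(50257)             # one step's next-token logits
print(nucleus_sample(vocab_logits, p=0.9).item())
```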

Encoding word order in complex embeddings

Sequential word order is important when processing text. Currently, neural networks (NNs) address this by modeling word position using position embeddings. The problem is that position embeddings capture the position of individual words, but not the ordered relationship (e.g., adjacency or precedence) between individual word positions. This NLP research paper presents a novel and principled solution for modeling both the global absolute positions of words and their order relationships. The solution generalizes word embeddings, previously defined as independent vectors, to continuous word functions over a variable (position). The benefit of continuous functions over variable positions is that word representations shift smoothly with increasing positions, so word representations in different positions can correlate with each other in a continuous function. The general solution of these functions is extended to a complex-valued domain to obtain richer representations. CNN, RNN, and Transformer NNs are extended to complex-valued versions to incorporate complex embeddings.
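
A minimal sketch of such an embedding, under the assumption (consistent with the paper's general solution) that each word-dimension pair carries an amplitude, a frequency, and an initial phase, so a word's representation is a continuous complex-valued function of its position:

```python
import torch
import torch.nn as nn

class ComplexWordEmbedding(nn.Module):
    """Order-aware complex embedding sketch: f(pos) = r * exp(i(w * pos + theta)),
    with learnable amplitude r, frequency w, and phase theta per word-dimension."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.amplitude = nn.Embedding(vocab, dim)
        self.frequency = nn.Embedding(vocab, dim)
        self.phase = nn.Embedding(vocab, dim)

    def forward(self, token_ids):              # token_ids: (B, T)
        pos = torch.arange(token_ids.size(1), dtype=torch.float)
        pos = pos.unsqueeze(0).unsqueeze(-1)   # (1, T, 1), broadcasts over dims
        angle = self.frequency(token_ids) * pos + self.phase(token_ids)
        r = self.amplitude(token_ids)
        return torch.complex(r * torch.cos(angle), r * torch.sin(angle))

emb = ComplexWordEmbedding()(torch.randint(0, 1000, (2, 7)))
print(emb.shape, emb.dtype)   # torch.Size([2, 7, 64]) torch.complex64
```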

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

This paper introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic, fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. Stanza was trained on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and the authors show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. The GitHub repo associated with this NLP research paper, along with source code, documentation, and pretrained models for 66 languages, can be found HERE.
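
Getting started with the pipeline takes a few lines; the snippet below follows Stanza's standard public API (downloading the models requires network access):

```python
import stanza

stanza.download("en")                      # fetch the English models once
nlp = stanza.Pipeline("en")                # tokenize, POS, lemma, depparse, NER
doc = nlp("Stanza was built by the Stanford NLP Group.")

for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.upos, word.lemma, word.deprel)
print(doc.ents)                            # named entities found in the text
```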

Mogrifier LSTM

Many advances in NLP have been based upon more expressive models for how inputs interact with the context in which they occur. Recurrent networks, which have enjoyed a modicum of success, still lack the generalization and systematicity ultimately required for modeling language. This NLP research paper proposes an extension to the venerable Long Short-Term Memory (LSTM) in the form of mutual gating of the current input and the previous output. This mechanism affords the modeling of a richer space of interactions between inputs and their context. Equivalently, the model can be viewed as making the transition function given by the LSTM context-dependent. 
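
The mutual gating can be sketched in a few lines. Below is a simplified, illustrative PyTorch version: for a few rounds, the previous state gates the input, then the updated input gates the state, before a standard LSTM cell runs. The paper alternates in essentially this way (with separate, often low-rank matrices per round); this sketch uses full-rank linear maps for brevity.

```python
import torch
import torch.nn as nn

class Mogrifier(nn.Module):
    """Sketch of Mogrifier-style mutual gating between input x and state h."""
    def __init__(self, dim=64, rounds=4):
        super().__init__()
        self.rounds = rounds
        self.q = nn.ModuleList(nn.Linear(dim, dim) for _ in range(rounds))

    def forward(self, x, h):
        for i in range(self.rounds):
            if i % 2 == 0:
                x = 2 * torch.sigmoid(self.q[i](h)) * x   # h modulates x
            else:
                h = 2 * torch.sigmoid(self.q[i](x)) * h   # x modulates h
        return x, h                                       # fed into the LSTM cell

x, h = torch.randn(2, 64), torch.randn(2, 64)
cell = nn.LSTMCell(64, 64)
mx, mh = Mogrifier()(x, h)
h_new, c_new = cell(mx, (mh, torch.zeros(2, 64)))         # standard LSTM update
print(h_new.shape)
```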

DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling

For sequence models with large vocabularies, a majority of network parameters lie in the input and output layers. This NLP research paper describes a new method, DeFINE, for learning deep token representations efficiently. The architecture uses a hierarchical structure with novel skip-connections which allows for the use of low dimensional input and output layers, reducing total parameters and training time while delivering similar or better performance versus existing methods. DeFINE can be incorporated easily in new or existing sequence models. Compared to state-of-the-art methods including adaptive input representations, this technique results in a 6% to 20% drop in perplexity. 

FreeLB: Enhanced Adversarial Training for Natural Language Understanding

Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved effective for improving the generalization of language models. This paper proposes a novel adversarial training algorithm, FreeLB, that promotes higher invariance in the embedding space by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, it is applied to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the finetuning stage, it is able to improve the overall test score of the BERT-base model from 78.3 to 79.4, and that of the RoBERTa-large model from 88.5 to 88.8. The GitHub repo associated with this paper can be found HERE.
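
The heart of the method is gradient ascent on a perturbation of the word embeddings. The sketch below shows a single, simplified ascent step on a toy model (FreeLB itself takes several ascent steps and accumulates the model gradients from each, a detail omitted here); all names are illustrative.

```python
import torch
import torch.nn as nn

def adversarial_embedding_step(model, embeds, labels, loss_fn,
                               step_size=0.1, eps=1.0):
    """One simplified ascent step in the spirit of FreeLB: perturb the word
    embeddings in the direction that increases the loss, keep the perturbation
    inside an epsilon ball, and return the adversarial loss to minimize."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    loss = loss_fn(model(embeds + delta), labels)
    loss.backward()                                        # gradient w.r.t. delta
    with torch.no_grad():
        delta = step_size * delta.grad / delta.grad.norm() # ascent direction
        delta = delta.clamp(-eps, eps)                     # stay inside the ball
    return loss_fn(model(embeds + delta), labels)

toy = nn.Sequential(nn.Flatten(), nn.Linear(8 * 16, 2))    # stand-in classifier
embeds = torch.randn(4, 8, 16)                             # (batch, tokens, dim)
labels = torch.randint(0, 2, (4,))
adv_loss = adversarial_embedding_step(toy, embeds, labels, nn.CrossEntropyLoss())
adv_loss.backward()                                        # gradients for the update
print(adv_loss.item())
```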

Dynabench: Rethinking Benchmarking in NLP

This paper introduces Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. It is argued that Dynabench addresses a critical need in the NLP community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. The paper reports on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and addresses potential objections to dynamic benchmarking as a new standard for the field.

Causal Effects of Linguistic Properties

This paper considers the problem of using observational data to estimate the causal effects of linguistic properties. For example, does writing a complaint politely lead to a faster response time? How much will a positive product review increase sales? The paper addresses two technical challenges before developing a practical method. First, it formalizes the causal quantity of interest as the effect of a writer’s intent, and establishes the assumptions necessary to identify this from observational data. Second, in practice, only noisy proxies for the linguistic properties of interest are accessible—e.g., predictions from classifiers and lexicons. An estimator is proposed for this setting, and it is proved that its bias is bounded when an adjustment for the text is performed. Based on these results, TEXTCAUSE is introduced, an algorithm for estimating causal effects of linguistic properties. The method leverages (1) distant supervision to improve the quality of noisy proxies, and (2) a pre-trained language model (BERT) to adjust for the text. It is shown that the proposed method outperforms related approaches when estimating the effect of Amazon review sentiment on semi-simulated sales figures.

LM-Critic: Language Models for Unsupervised Grammatical Error Correction

Training a model for grammatical error correction (GEC) requires a set of labeled ungrammatical/grammatical sentence pairs, but manually annotating such pairs can be expensive. Recently, the Break-It-Fix-It (BIFI) framework has demonstrated strong results on learning to repair a broken program without any labeled examples, but this relies on a perfect critic (e.g., a compiler) that returns whether an example is valid or not, which does not exist for the GEC task. This paper shows how to leverage a pretrained language model (LM) in defining an LM-Critic, which judges a sentence to be grammatical if the LM assigns it a higher probability than its local perturbations. The LM-Critic and BIFI are applied, along with a large set of unlabeled sentences, to bootstrap realistic ungrammatical/grammatical pairs for training a corrector.
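
The critic itself is easy to sketch with an off-the-shelf LM. Below is a minimal, illustrative version using GPT-2 via Hugging Face Transformers; the paper additionally defines an edit-based local-perturbation sampler, which is replaced here by caller-supplied perturbations.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def lm_logprob(sentence):
    """Total log-probability the LM assigns to a sentence."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)              # loss = mean negative log-likelihood
    return -out.loss.item() * (ids.size(1) - 1)

def lm_critic(sentence, perturbations):
    """LM-Critic sketch: grammatical iff the sentence outscores all of its
    local perturbations under the LM."""
    return all(lm_logprob(sentence) > lm_logprob(p) for p in perturbations)

print(lm_critic("The cat sat on the mat.",
                ["The cat sat in on the mat.", "The cat sats on the mat."]))
```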

Generative Adversarial Transformers

This paper introduces the GANformer, a novel and efficient type of transformer, and explores it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image while maintaining linear computational efficiency, and can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. The model’s strength and robustness are demonstrated through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing that it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data efficiency. The GitHub repo associated with this paper can be found HERE.

Learn More About NLP and NLP Research at ODSC West 2021

At our upcoming event this November 16th-18th in San Francisco, ODSC West 2021 will feature a plethora of talks, workshops, and training sessions on NLP and NLP research. You can register now for 30% off all ticket types before the discount drops to 20% in a few weeks. Some highlighted sessions on NLP and NLP research include:

  • Transferable Representation in Natural Language Processing: Kai-Wei Chang, PhD | Director/Assistant Professor | UCLA NLP/UCLA CS
  • Build a Question Answering System using DistilBERT in Python: Jayeeta Putatunda | Data Scientist | MediaMath
  • Introduction to NLP and Topic Modeling: Zhenya Antić, PhD | NLP Consultant/Founder | Practical Linguistics Inc
  • NLP Fundamentals: Leonardo De Marchi | Lead Instructor | ideai.io

Sessions on Deep Learning and Deep Learning Research:

  • GANs: Theory and Practice, Image Synthesis With GANs Using TensorFlow: Ajay Baranwal | Center Director | Center for Deep Learning in Electronic Manufacturing, Inc
  • Machine Learning With Graphs: Going Beyond Tabular Data: Dr. Clair J. Sullivan | Data Science Advocate | Neo4j
  • Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0: Oliver Zeigermann | Software Developer | embarc Software Consulting GmbH
  • Get Started with Time-Series Forecasting using the Google Cloud AI Platform: Karl Weinmeister | Developer Relations Engineering Manager | Google

Sessions on Machine Learning:

  • Towards More Energy-Efficient Neural Networks? Use Your Brain!: Olaf de Leeuw | Data Scientist | Dataworkz
  • Practical MLOps: Automation Journey: Evgenii Vinogradov, PhD | Head of DHW Development | YooMoney
  • Applications of Modern Survival Modeling with Python: Brian Kent, PhD | Data Scientist | Founder The Crosstab Kite
  • Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems: Veena Mendiratta, PhD | Adjunct Faculty, Network Reliability, and Analytics Researcher | Northwestern University


Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data long before the field came in vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.



Most Popular NLP Papers Of 2021

Published on December 17, 2021, by Debolina Biswas


Natural Language Processing, or NLP, is a set of techniques for teaching computers to process and comprehend human (natural) languages. NLP is a part of data science and includes the analysis of data to extract, process, and output meaningful information. Some of the important applications of NLP include:

  • Text mining
  • Text and sentiment analysis
  • Speech generation
  • Text classification
  • Speech classification
In this article, Analytics India Magazine lists the top NLP papers of the year that one must read. These papers can help one stay at the top of their NLP game.

(Note that the list is in no particular order.)

Dynabench: Rethinking Benchmarking in NLP 

This year, researchers from Facebook and Stanford University open-sourced Dynabench, a platform for model benchmarking and dynamic dataset creation. Dynabench runs on the web and supports human-and-model-in-the-loop dataset creation. It addresses the observation that contemporary models quickly reach strong performance on benchmark tasks yet still fail on simple examples or in real-world scenarios. Dynabench supports dataset creation, model development, and model assessment, which leads to more robust and informative benchmarks.

Causal Effects of Linguistic Properties 


This paper on Causal Effects of Linguistic Properties deals with the problem of using observational data. The paper addresses challenges related to the problem before developing a practical method. Based on the results, it introduces TextCause, an algorithm to estimate the causal effects of linguistic properties. It leverages distant supervision to improve the quality of noisy proxies, and BERT, the pre-trained language model, to adjust for the text. Finally, it presents an applied case study to investigate the effects. The paper was presented at NAACL 2021.

Transformer-based Binary Word Sense Disambiguation 

Released at the second International Conference on NLP and Big Data, this paper treats word sense disambiguation as a classification task and presents a transformer-based model for text ambiguity problems. Transformers have produced improvements across recent NLP tasks; here, the task is to find the correct meaning of every word in a particular text. The paper further shows how using pre-trained transformer models improves the accuracy of the architecture. The experiments also showcase how NLP task performance can be improved with the help of data augmentation techniques.

Single Headed Attention RNN: Stop thinking with your head 

Published by Harvard University graduate Stephen Merity, the paper ‘Single Headed Attention RNN: Stop thinking with your head’ introduces a state-of-the-art NLP model called the Single Headed Attention RNN, or SHA-RNN. The author combines an LSTM with a single attention head to achieve state-of-the-art, byte-level language model results on enwik8.
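As a rough illustration of the core idea, the PyTorch sketch below pairs an LSTM with one causal attention head. The block layout and dimensions are assumptions made for illustration, not Merity's implementation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SHABlock(nn.Module):
    """LSTM followed by one causal attention head (illustrative only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):
        h, _ = self.rnn(x)                        # recurrent encoding
        q, k, v = self.q(h), self.k(h), self.v(h)
        scores = q @ k.transpose(-2, -1) / math.sqrt(h.size(-1))
        # Causal mask: each position attends only to itself and its past.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), 1)
        attn = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        return h + attn @ v                       # the single attention head

x = torch.randn(1, 128, 256)   # batch, sequence length, model dim
print(SHABlock(256)(x).shape)  # -> torch.Size([1, 128, 256])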

NLP applied on issue trackers 

The NLP applied on issue trackers paper discusses various NLP techniques, including topic analysis, similarity algorithms (N-grams, Jaccard, the LSI algorithm), and descriptive statistics, along with machine learning (ML) algorithms such as support vector machines (SVMs) and decision trees. These techniques are typically used to better understand the characteristics, classification, lexical relations, and prediction of duplicate development tasks. By tuning the different features to predict development tasks with a fidelity loss function, a system can identify duplicate tasks with almost 100 percent accuracy.

Attention in Natural Language Processing

Attention is a popular mechanism in neural architectures and has been realised in a variety of formats. However, owing to the fast-paced advances in this domain, a systematic overview of attention is still missing. This paper defines a unified model for attention architectures in NLP, focusing on those designed to work with vector representations of textual data. The authors propose a taxonomy of attention models according to four dimensions:

  • Representation of input 
  • Compatibility function 
  • Distribution function 
  • Multiplicity of the input and output 

Additionally, the paper provides instances of how prior information can be exploited in attention models, discusses ongoing research efforts and open challenges, and gives an extensive categorisation of the huge body of literature.
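To make two of these dimensions concrete, here is a small PyTorch sketch instantiating one possible choice for each: an additive (Bahdanau-style) compatibility function and a softmax distribution function. It is a toy example for illustration, not code from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """One compatibility function + one distribution function."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_query = nn.Linear(dim, dim, bias=False)
        self.w_keys = nn.Linear(dim, dim, bias=False)
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, query, keys):
        # Compatibility: score(q, k_i) = v^T tanh(W_q q + W_k k_i)
        energies = self.score(torch.tanh(
            self.w_query(query).unsqueeze(1) + self.w_keys(keys))).squeeze(-1)
        weights = F.softmax(energies, dim=-1)            # distribution function
        summary = (weights.unsqueeze(-1) * keys).sum(1)  # weighted combination
        return summary, weights

keys = torch.randn(1, 10, 32)   # one sequence of ten 32-d input vectors
query = torch.randn(1, 32)
summary, weights = AdditiveAttention(32)(query, keys)
print(summary.shape, weights.shape)

Swapping the compatibility function (e.g., dot-product instead of additive) or the distribution function (e.g., sparsemax instead of softmax) moves the model to a different cell of the taxonomy.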


Natural Language Processing and Its Applications in Machine Translation: A Diachronic Review


Natural Language Processing

Natural Language Processing (NLP) research at Google focuses on algorithms that apply at scale, across languages, and across domains. Our systems are used in numerous ways across Google, impacting user experience in search, mobile, apps, ads, translate and more.

Our work spans the range of traditional NLP tasks, with general-purpose syntax and semantic algorithms underpinning more specialized systems. We are particularly interested in algorithms that scale well and can be run efficiently in a highly distributed environment.

Our syntactic systems predict part-of-speech tags for each word in a given sentence, as well as morphological features such as gender and number. They also label relationships between words, such as subject, object, modification, and others. We focus on efficient algorithms that leverage large amounts of unlabeled data, and recently have incorporated neural net technology.

On the semantic side, we identify entities in free text, label them with types (such as person, location, or organization), cluster mentions of those entities within and across documents (coreference resolution), and resolve the entities to the Knowledge Graph.
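Google's internal systems are not public, but the open-source spaCy library can illustrate the same kinds of predictions on a small example: POS tags, morphological features, dependency labels, and typed entity mentions. The model name below is spaCy's small English model, and the tags printed are spaCy's, not Google's.

# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sundar Pichai announced new translation features in Mountain View.")

for token in doc:
    # Part of speech, morphology, grammatical relation, and its head word
    print(token.text, token.pos_, str(token.morph), token.dep_, token.head.text)

for ent in doc.ents:
    # Entity mentions labeled with types such as PERSON or GPE
    print(ent.text, ent.label_)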

Recent work has focused on incorporating multiple sources of knowledge and information to aid with analysis of text, as well as applying frame semantics at the noun phrase, sentence, and document level.


Natural Language Processing and Computational Linguistics

1 Introduction
2 NLP, CL, and Related Disciplines
3 Machine Translation
4 Grammar Formalisms and Parsing
5 Text Mining for Biomedicine
6 Future – Conclusions
Acknowledgments


Junichi Tsujii; Natural Language Processing and Computational Linguistics. Computational Linguistics 2021; 47 (4): 707–727. doi: https://doi.org/10.1162/coli_a_00420


As an engineering field, research on natural language processing (NLP) is much more constrained by currently available resources and technologies, compared with theoretical work on computational linguistics (CL). In today’s technology-driven society, it is almost impossible to imagine the degree to which computational resources, the capacity of secondary and main storage, and software technologies were restricted when I embarked upon my research career 50 years ago. While these restrictions inevitably shaped my early research into NLP, my subsequent work evolved, according to the significant progress made in associated technologies and related academic fields, particularly CL.

Figure 1 shows the research topics in which I have been engaged. My initial NLP research was concerned with a question answering system, which I worked on during my M.Eng and D.Eng degrees. The research focused on reasoning and language understanding, which I soon found was too ambitious and ill-defined. After receiving my D.Eng., I changed my direction of research, and began to be engaged in processing forms of language expressions, with less commitment to language understanding, machine translation (MT), and parsing. However, I returned to research into reasoning and language understanding in the later stage of my career, with clearer definitions of tasks and relevant knowledge, and equipped with access to more advanced supporting technologies.

Research topics.

In this article, I begin by briefly describing my views on mutual relationships among disciplines related to CL and NLP, and then move on to discussing my own research.

Language is a complex topic to study, infinitely harder than I first imagined when I began to work in the field of NLP.

There is a whole discipline on the study of language—namely, linguistics. Linguistics is concerned not only with language per se, but must also deal with how humans model the world. 1 The study of semantics, for example, must relate language expressions to their meanings, which reside in the mental models possessed by humans.

Apart from linguistics, there are two fields of science that are concerned with language, that is, brain science and psychology. These are concerned with how humans process language. Then, there are two disciplines in which we are involved—namely, CL and NLP.

Figure 2 is a schematic view of these research disciplines. Both of the lower disciplines are concerned with processing language, that is, how language is processed in our minds or our brains, and how computer systems should be designed to process language efficiently and effectively.

Language-related disciplines.

The top discipline, linguistics, on the other hand, is concerned with rules that are followed by languages. That is to say, linguists study language as a system. This schematic view is certainly oversimplified, and there are subject fields in which these disciplines overlap. Psycholinguistics, for example, is a subfield of linguistics which is concerned with how the human mind processes language. A broader definition of CL may include NLP as its subfield.

In this article, for the sake of discussion, I adopt narrower definitions of linguistics and CL. In this narrower definition, linguistics is concerned with the rules followed by languages as a system, whereas CL, as a subfield of linguistics, is concerned with the formal or computational description of rules that languages follow. 2

CL, which focuses on formal/computational description of languages as a system, is expected to bridge broader fields of linguistics with the lower disciplines, which are concerned with processing of language.

Given my involvement in NLP, I would like to address the question of whether the narrowly defined CL is relevant to NLP. The simple answer is yes. However, the answer is not so straightforward, and requires us to examine the degree to which the representations used to describe language as a system are relevant to the representations used for processing language.

Although my colleagues and I have been engaged in diverse research areas, I pick up only on a subset of these, to illustrate how I view the relationships between NLP and CL. Due to the nature of the article, I ignore technical details and focus instead on the motivation of the research and the lessons which I have learned through research.

Background and Motivation. Following the ALPAC report (Pierce et al. 1966), research into MT had been largely abandoned by academia, with the exception of a small number of institutes (notably, GETA at Grenoble, France, and Kyoto University, Japan). There were only a handful of commercial MT systems, being used for limited purposes. These commercial systems were legacy systems that had been developed over years and had become complicated collections of ad hoc programs. They had become too convoluted to allow for changes and improvements. To re-initiate MT research in academia, we needed more systematic and disciplined design methodologies.

On the other hand, theoretical linguistics, initiated by Noam Chomsky (Chomsky 1957 , 1965 ) had attracted linguists with a mathematical orientation, who were interested in formal frameworks of describing rules followed by language. Those linguists with interests in formal ways of describing rules were the first generation of computational linguists.

Although computational linguists did not necessarily follow the Chomskyan way of thinking, they shared the general view of treating language as a system of rules. They had developed formal ways of describing rules of language and showed that these rules consisted of different layers, such as morphology, syntax, and semantics, and that each layer required different formal frameworks with different computational powers. Their work had also motivated work on how one could process language by computerizing its rules. This work constituted the beginning of NLP research, and resulted in the development of parsing algorithms for context-free languages, finite-state machines, and so forth. 3 It was natural to use this work as the basis for designing the second generation of MT systems, which was initiated by an MT project (the MU project, 1982-1986) led by Prof. M. Nagao (Nagao, Tsujii, and Nakamura 1985).

Research Contributions. When I began research into MT in the late 1970s, there was a common view largely shared by the community, which had been advocated by the GETA group in France. The view was called the transfer approach of MT (Boitet 1987).

The transfer approach viewed translation as a process consisting of three phases: analysis, transfer, and generation. According to linguists, a language is a system of rules. The analysis and generation phases were monolingual phases that were concerned with a set of rules for a single language, the analysis phase using the rules of the source language and the generation phase using the rules of the target language. Only the transfer phase was a bilingual phase.

Another view shared by the community was an abstraction hierarchy of representation, called the triangle of translation. For example, Figure 3(a) 4 shows the hierarchy of representation used in the Eurotra project, with their definition of each level (Figure 3(b)).

(a) Hierarchy of representation of the transfer approach. (b) Hierarchy of representation (Eurotra).

By climbing up such a hierarchy, the differences among languages would become increasingly small, so that the mapping (i.e., the transfer phase) from one language to another would become as simple as possible. Independently of the target language, the goal of the analysis phase was to climb up the hierarchy, while the aim of the generation phase was to climb down the hierarchy to generate surface expressions in the target language. Both phases are concerned only with rules of single languages.

In the extreme view, the top of the hierarchy was taken as the language-independent representation of meaning. Proponents of the interlingual approach claimed that, if the analysis phase reached this level, then no transfer phase would be required. Rather, translation would consist only of the two monolingual phases (i.e., the analysis and generation phases).

However, in Tsujii (1986), I claimed, and still maintain, that this was a mistaken view about the nature of translation. In particular, this view assumed that a translation pair (consisting of the source and target sentences) encodes the same “information”. This assumption does not hold, in particular, for a language pair such as Japanese and English, which belong to very different language families. Although a good translation should preserve the information conveyed by the source sentence as much as possible in the target sentence, translation may lose some information or add extra information. 5

Furthermore, the goal of translation may not be to preserve information but to convey the same pragmatic effects to readers of the translation.

More seriously, the abstract level of representation such as Interface Structure 6 in Eurotra focused only on the propositional content encoded in language, and tended to abstract away other aspects of information, such as the speaker’s empathy, distinction of old/new information, emphasis, and so on.

Climbing up the hierarchy led to loss of information from the lower levels of representation. In Tsujii (1986), instead of mapping at the abstract level, I proposed “transfer based on a bundle of features of all the levels”, in which the transfer would refer to all levels of representation in the source language to produce a corresponding representation in the target language (Figure 4). Because different levels of representation require different geometrical structures (i.e., different tree structures), the realization of this proposal had to wait for the development of a clear mathematical formulation of feature-based representation with reentrancy, which allowed multiple levels (i.e., multiple trees) to be represented with their mutual relationships (see the next section).

Description-based transfer (Tsujii 1986).

Another idea we adopted to systematize the transfer phase was recursive transfer (Nagao and Tsujii 1986 ), which was inspired by the idea of compositional semantics in CL. According to the views of linguists at the time, a language is an infinite set of expressions which, in turn, is defined by a finite set of rules. By applying this finite number of rules, one can generate infinitely many grammatical sentences of the language. Compositional semantics claimed that the meaning of a phrase was determined by combining the meanings of its subphrases, using the rules that generated the phrase. Compositional translation applied the same idea to translation. That is, the translation of a phrase was determined by combining the translations of its subphrases. In this way, translations of infinitely many sentences of the source language could be generated.

Using the compositional translation approach, the translation of a sentence would be undertaken by recursively tracing the tree structure of the source sentence. The translation of a phrase would then be formulated by combining the translations of its subphrases. That is, translation would be constructed in a bottom-up manner, from smaller units of translation to larger units.

Furthermore, because the mapping of a phrase from the source to the target would be determined by the lexical head of the phrase, the lexical entry for the head word specified how to map a phrase to the target. In the MU project, we called this lexicon-driven, recursive transfer (Nagao and Tsujii 1986 ) ( Figure 5 ).

Lexicon-driven recursive structure transfer (Nagao and Tsujii 1986).

Compared with the first-generation MT systems, which replaced source expressions with target ones in an undisciplined and ad hoc order, the order of transfer in the MU project was clearly defined and systematically performed.
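A minimal sketch of lexicon-driven recursive transfer is given below, under heavy simplifying assumptions: the toy lexicon, tree format, and target-language patterns are invented for illustration and ignore disambiguation entirely. The translation of a phrase is assembled bottom-up from its subphrases, with the head word's lexical entry deciding how to combine them.

from dataclasses import dataclass, field

@dataclass
class Node:
    head: str                       # lexical head of the phrase
    children: list = field(default_factory=list)

# Toy transfer lexicon: head word -> target-language template.
# "{0}", "{1}", ... are filled with the children's translations.
LEXICON = {
    "read": "{0} liest {1}",        # English verb -> German pattern
    "book": "das Buch",
    "she": "sie",
}

def transfer(node: Node) -> str:
    # Recursively translate the subphrases first (bottom-up)...
    sub = [transfer(c) for c in node.children]
    # ...then let the head word's lexical entry combine them.
    return LEXICON[node.head].format(*sub)

tree = Node("read", [Node("she"), Node("book")])
print(transfer(tree))  # -> "sie liest das Buch"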

Lessons. Research and development of the second-generation MT systems benefitted from research into CL, allowing more clearly defined architectures and design principles than first-generation MT systems. The MU project successfully delivered English-Japanese and Japanese-English MT systems within the space of four years. Without these CL-driven design principles, we could not have delivered these results in such a short period of time.

However, the differences between the objectives of the two disciplines also became clear. Whereas CL theories tend to focus on specific aspects of language (such as morphology, syntax, semantics, discourse, etc.), MT systems must be able to handle all aspects of information conveyed by language. As discussed, climbing up a hierarchy that focuses on propositional content alone does not result in good translation.

A more serious discrepancy between CL and NLP is the treatment of ambiguities of various kinds. Disambiguation is the single most significant challenge in most NLP tasks; it requires the context in which expressions to be disambiguated occur to be processed. In other words, it requires understanding of context.

Typical examples of disambiguation are shown in Figure 6 . The Japanese word asobu has a core meaning of “spend time without engaging in any specific useful tasks”, and would be translated into “to play”, “to have fun”, “to spend time”, “to hang around”, and so on, depending on the context (Tsujii 1986 ).

Disambiguation at lexical transfer.

Considering context for disambiguation conflicts with recursive transfer, since it requires larger units to be handled (i.e., the context in which a unit to be translated occurs). The nature of disambiguation made the process of recursive transfer clumsy. Disambiguation was also a major problem in the analysis phase, which I discuss in the next section.

The major (although hidden) general limitation of CL or linguistics is that it tends to view language as an independent, closed system and avoids the problem of understanding, which requires reference to knowledge or non-linguistic context. 7 However, many NLP tasks, including MT, require an understanding or interpretation of language expressions in terms of knowledge and context, which may involve other input modalities, such as visual stimuli, sound, and so forth. I discuss this in the section on the future of research.

Background and Motivation. At the time I was engaged in MT research, new developments took place in CL, namely, feature-based grammar formalisms (Kriege 1993 ).

At its early stage, transformational grammar in theoretical linguistics by N. Chomsky assumed that sequential stages of application of tree transformation rules linked the two levels of structures, that is, deep and surface structures. A similar way of thinking was also shared by the MT community. They assumed that climbing up the hierarchy would involve sequential stages of rule application, which map from the representation at one level to another representation at the next adjacent level. 8 Because each level of the hierarchy required its own geometrical structure, it was not considered possible to have a unified non-procedural representation, in which representations of all the levels co-exist.

This view was changed by the emergence of feature-based formalisms that used directed acyclic graphs (DAGs) to allow reentrancy. Instead of mappings from one level to another, it described mutual relationships among different levels of representation in a declarative manner. This view was in line with our idea of description-based transfer, which used a bundle of features of different levels for transfer. Moreover, some grammar formalisms at the time emphasized the importance of lexical heads. That is, local structures of all the levels are constrained by the lexical head of a phrase, and these constraints are encoded in lexicon. This was also in line with our lexicon-driven transfer.

A further significant development in CL took place at the same time. Namely, a number of sizable tree bank projects, most notably the Penn Treebank and the Lancaster/IBM Treebank, had reinvigorated corpus linguistics and started to have significant impacts on research into CL and NLP (Marcus et al. 1994 ). From the NLP point of view, the emergence of large tree banks led to the development of powerful tools (i.e., probabilistic models) for disambiguation. 9

We started research that would combine these two trends to systematize the analysis phase—that is, parsing based on feature-based grammar formalisms.

Research Contributions. It is often claimed that ambiguities occur because of insufficient constraints. In the analysis phase of the “climbing up the hierarchy” model, lower levels of processing could not refer to constraints in higher levels of representation. This was considered the main cause of the combinatorial explosion of ambiguities at the early stages of climbing up the hierarchy. Syntactic analysis could not refer to semantic constraints, meaning that ambiguities in syntactic analysis would explode.

On the other hand, because the feature-based formalisms could describe constraints at all levels in a single unified framework, it was possible to refer to constraints at all levels, to narrow down the set of possible interpretations.

However, in practice, the actual grammar was still vastly underconstrained. This was partly because we do not have effective ways of expressing semantic and pragmatic constraints. Computational linguists were interested in formal declarative ways for relating syntactic and semantic levels of representation, but not so much in how semantic constraints are to be expressed. To specify semantic or pragmatic constraints, one may have to refer to the mental models of the world (i.e., how humans see the world), or discourse structures beyond single sentences, and so on. These fell outside of the scope of CL research at the time, whose main focus is on grammar formalisms.

Furthermore, it is questionable whether semantics or pragmatics can be used as constraints. They may be more concerned with the plausibility of an interpretation than the constraints which an interpretation should satisfy (for example, see the discussion in Wilks [ 1975 ]).

Therefore, even for parsing using feature-based formalisms, issues of disambiguation and how to handle the explosion of ambiguities remained major issues for NLP.

Probabilistic models were one of the most powerful tools for disambiguation and handling the plausibility of an interpretation. However, probabilistic models for simpler formalisms, such as regular and context-free grammars, had to be changed for more complex grammar formalisms. Techniques for handling combinatorial explosion, such as packing, had to be reformulated for feature-based formalisms.

Furthermore, although feature-based formalisms were neat in terms of describing constraints in a declarative manner, the unification operation, which was a basic operation for treating feature-based descriptions, was computationally very expensive. To deliver practical NLP systems, we had to develop efficient implementation technologies and processing architectures for feature-based formalisms.

The team at the University of Tokyo started to study how we could transform a feature-based grammar (we chose HPSG) into effective and efficient representations for parsing. The research included:

Design of an abstract machine for processing of typed-feature structures and development of a logic programming system—LiLFeS (Makino et al. 1998 ; Miyao et al. 2000 ).

Transforming HPSG grammar into a more processing-oriented representation, such as extracting CFG skeletons (Torisawa and Tsujii 1996 ; Torisawa et al. 2000 ) and supertags from original HPSG.

Packing of feature structures (feature forest) and log-linear probabilistic models (Miyao and Tsujii 2003, 2005, 2008).

A staged architecture of parsing based on transformation of grammar formalisms and their probabilistic modeling (Matsuzaki, Miyao and Tsujii 2007 ; Ninomiya et al. 2010 ).

A simplified representation of our parsing model is shown in Figure 7. Given a sentence, its representation of all the levels was constructed at the final stage by using the HPSG grammar. Disambiguation took place mainly in the first two phases. The first phase was a supertagger that would disambiguate supertags assigned to words in a sentence. Supertags were derived from the original HPSG grammar, and a set of supertags was attached to each word in the lexicon. The supertagger would choose the most probable sequence of supertags for the given sequence of words. The task was a sequence labeling task, which could be carried out in a very efficient manner (Zhang, Matsuzaki, and Tsujii 2009). This means that the surface local context (i.e., local sequences of supertags) was used for disambiguation, without constructing actual DAGs of features.

Staged architecture of parsing.

The second phase was CFG filtering. A CFG skeleton, which also was derived from the HPSG grammar, was used to check whether sequences of supertags chosen by the first phase could reach a successful derivation tree. The supertagger did not build actual parse trees explicitly to check whether a chosen sequence could reach legitimate derivation trees or not. The second phase of CFG filtering would filter out supertag sequences that could not reach legitimate trees.

The final phase not only built the final representation of all the levels, but it also checked extra constraints specified in the original grammar. Because the first two phases only use partial constraints specified in the HPSG grammar, the final phase would reject results produced by the first two phases if they failed to satisfy these extra constraints. In this case, the system would backtrack to the previous phases to obtain the next candidate.
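The control flow of this staged architecture can be sketched as follows. The three components below are trivial stand-ins with invented names and interfaces; only the best-first, filter-then-check flow with backtracking reflects the description above.

def staged_parse(words, nbest_supertags, cfg_derivable, hpsg_build):
    for tag_seq in nbest_supertags(words):     # phase 1: best-first tagging
        if not cfg_derivable(tag_seq):         # phase 2: cheap CFG filter
            continue
        analysis = hpsg_build(words, tag_seq)  # phase 3: full unification
        if analysis is not None:               # extra constraints satisfied
            return analysis
    return None                                # backtracked through all candidates

# Dummy components illustrating the interface, not a real grammar.
tags = lambda ws: iter([["NP", "V"], ["NP", "N"]])
derivable = lambda seq: "V" in seq             # pretend CFG-skeleton check
build = lambda ws, seq: {"words": ws, "supertags": seq}
print(staged_parse(["dogs", "bark"], tags, derivable, build))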

All of these research efforts collectively produced a practical efficient parser based on HPSG (Enju 9 ).

Lessons. As in MT, CL theories were effective for the systematic development of NLP systems. Feature-based grammar formalisms drastically changed the view of parsing as “climbing up the hierarchy”. Moreover, mathematically well-defined formalisms supported the systematic development of efficient implementations of unification, the transformation of grammar into supertags and CFG skeletons, and so forth. These formalisms also provided solid ground for operations in NLP such as the packing of feature structures, which are essential for treating combinatorial explosion.

On the other hand, direct application of CL theories to NLP did not work, since this would result in extremely slow processing. We had to transform them into more processing-oriented formats, which required significant efforts and time on the NLP side. For example, we had to transform the original HPSG grammar into processing-oriented forms, such as supertags, CFG skeletons, and so on. It is worth noting that, while the resultant architecture was similar to the climbing-up hierarchy processing, each stage in the final architecture was clearly defined and related to each other through the single declarative grammar.

I also note that advances in the fields of computer science/engineering significantly changed what was possible to achieve in NLP. For example, the design of an abstract machine and its efficient implementation for unification in LiLFeS (Makino et al. 1998 ), effective support systems for maintaining large banks of parsed trees (Ninomiya, Makino, and Tsujii 2002 ; Ninomiya, Tsujii, and Miyao 2004 ), and so forth, would be impossible without advances in the broader fields of computer science/engineering and without much improved computational power (Taura et al. 2010 ).

On the other hand, disambiguation remained the major issue in NLP. Probabilistic models enabled major breakthroughs in terms of solving the problem. Compared with the fairly clumsy rule-based disambiguation that we adopted for the MU project, 10 probabilistic modeling provided the NLP community with systematic ways of handling ambiguities. Combined with large tree banks, objective quantitative comparison of different models also became feasible, which made systematic development of NLP systems possible. However, the error rate in parsing remained (and still remains) high.

While reported error rates are getting lower, measuring the error rate in terms of the number of incorrectly recognized dependency relations is misleading. At the sentence level, the error rate remains high. That is, a sentence in which all dependency relations are correctly recognized remains very rare. Because most dependency relations are trivial (i.e., pairs of adjacent words or pairs of close neighbors), errors in semantically critical dependencies, such as PP-attachments, scopes of conjunction, and so on, remain abundant (Hara, Miyao, and Tsujii 2009).

Even using probabilistic models, there are obvious limits to disambiguation performance, unless a deeper understanding is taken into account. This leads me to the next research topic: language and knowledge.

Background and Motivation. I was interested in the topic of how to relate language with knowledge at the very beginning of my career. At the time, my naiveté led me to believe that a large collection of text could be used as a knowledge base, and I was engaged in research on a question-answering system based on a large text base (Nagao and Tsujii 1973, 1979). However, resources such as a large collection of text, storage capacity, the processing speed of computer systems, and basic NLP technologies, such as parsing, were not available at the time.

I soon realized, however, that the research would involve a whole range of difficult research topics in artificial intelligence, such as representation of common sense, human ways of reasoning, and so on. Moreover, the topics had to deal with uncertainty and peculiarities of individual humans. Knowledge or the world models that individual humans have may differ from one person to another. I felt that the research target was ill-defined.

However, through research in MT and parsing in the later stages of my career, I started to realize that NLP research is incomplete if it ignores how knowledge is involved in processing, and that challenging NLP problems are all related to issues of understanding and knowledge. At the same time, considering NLP as an engineering field, I took it to be essential to have a clear definition of the knowledge or information with which language is to be related. I wanted to avoid the vagueness of research into commonsense knowledge and reasoning, and to restrict our research focus to the relationship between language and knowledge. As a research strategy, I chose to focus on biomedicine as the application domain. There were two reasons for this choice.

One reason was that microbiology colleagues at the two universities with which I was affiliated told me that, in order to understand life-related phenomena, it had become increasingly important for them to organize pieces of information scattered in a large collection of published papers in diverse subject fields such as microbiology, medical sciences, chemistry, and agriculture. In addition to the large collection of papers, they also had diverse databases that had to be linked with each other. In other words, they had a solid body of knowledge shared by domain specialists that was to be linked with information in text.

The other reason was that there were colleagues at the University of Manchester who were interested in sublanguages. Given the discussion of information formats in a medical sublanguage by the NYU group (Sager 1978) and research into medical terminology at the University of Manchester, focusing on relations between terms and concepts (Ananiadou 1994; Frantzi and Ananiadou 1996; Mima et al. 2002), the biomedical domain was a natural choice for sublanguage research. The important point here was that the information formats in a sublanguage and terminology concepts were defined by the target domain, and not by NLP researchers. Furthermore, domain experts had actual needs and concrete requirements to help solve their own problems in the target domains.

Research Contributions. Although there had been quite a large amount of research into information retrieval and text mining for the biomedical domain, there had been no serious efforts to apply structure-based NLP techniques to text mining in the domain. To address this, the teams at the University of Manchester and the University of Tokyo jointly launched a new research program in this direction.

Because this was a novel research program, we first had to define concrete tasks to solve, to prepare resources, and to involve not only NLP researchers, but also experts in the target domains.

Regarding the involvement of NLP researchers and domain experts, we found that a few groups around the world had also begun to be interested in similar research topics. In response, we organized a number of research gatherings in collaboration with colleagues around the world, which led to the establishment of a SIG (SIGBIOMED) at ACL. The first workshop took place in 2002, collocated with the ACL conference (Workshop 2002). The SIG now organizes annual workshops and co-located shared tasks. It has been expanding rapidly and has become one of the most active SIGs in NLP applications. The field of applying structure-based NLP to text mining is broadening to cover the clinical/medical domains (Xu et al. 2012; Sohrab et al. 2020) and the chemistry and material science domains (Kuniyoshi et al. 2019).

Research contributions by the two teams include the GENIA corpus (Kim et al. 2003; Thompson, Ananiadou, and Tsujii 2017), a large repository of acronyms with their original terms (Okazaki, Ananiadou, and Tsujii 2008, 2010), the GENIA POS tagger (Tsuruoka et al. 2005), the BRAT annotation tool (Stenetorp et al. 2012), a workflow design tool for information extraction (Kano et al. 2011), an intelligent search system based on entity association (Tsuruoka, Tsujii, and Ananiadou 2008), and a system for pathway construction (Kemper et al. 2010).

The GENIA annotated corpus is one of the most frequently used corpora in the biomedical domain. To see what information domain experts considered important in text and how it was encoded in language, we annotated 2000 abstracts, not only from the linguistic point of view but also from the viewpoint of domain experts. Two types of annotations, namely, linguistic annotations (POS, and syntactic trees) and domain-specific annotations (biological entities, relations, and events) were added to the corpus (Ohta et al. 2006 ).

Domain-specific annotations were linked with ontologies of the target domain (GENE ontology, anatomy ontology, etc.) which had been constructed by the target domain communities to share information in diverse databases.

To involve domain experts in annotation, we developed a user-friendly annotation tool with intuitive visualization (BRAT), which is now used widely by the NLP community. In close cooperation with domain experts, we defined a set of NLP tasks (Hirshman et al. 2002 ; Ananiadou, Friedman, and Tsujii 2004 ; Ananiadou, Kell, and Tsujii 2006 ; Ananiadou et al. 2020 ), and developed a set of basic IE tools (Nobata et al. 2008 ; Miwa et al. 2009 ; Pyysalo et al. 2012 ) for solving them, which were to be combined into workflows to meet specific needs of individual groups of domain experts (Kano et al. 2011 ; Rak et al. 2012 ).

As a result of this work, we recognized large discrepancies between linguistic units such as words, phrases, and clauses, and domain-specific semantic units, such as named entities, and relations and events that link them together ( Figure 8 ). The mapping between linguistic structures and the semantic ones defined by domain specialists was far more complex than the mapping assumed by computational semanticists.

Event and Relation Extraction (Ananiadou et al. 2020).

We soon realized that, as an NLP task, information extraction (IE) was very different from MT. In particular, there were considerable differences between the information that the authors intended to convey and encode in text and the information that the readers (i.e., a group of domain experts) wanted to identify and extract from text. Regardless of the information that the authors intended to convey, the reader would identify the information that they were interested in. 11

These characteristics of IE as an NLP task made the mapping from language to information very different from the transfer phase in MT, which attempts to convey the same information in the source and target languages. While the task of named entity recognition (NER) benefited from linguistic structures (i.e., noun phrases and their coordination), linguistic structures would only give cues for the automatic recognition of relations and events, and these cues were to be combined with other cues. The mapping became similar to the transfer based on a bundle of features.

Like the transfer at the higher level of representation, we first used the HPSG parser to climb up the hierarchy to the IS (which we called PAS [predicate-argument structure]), from which we tried to identify a set of pattern-rules to extract events (Figure 9) (Yakushiji 2001, 2006). We assumed that, although extraction patterns based on surface sequences of words may be diverse, 12 this diversity would be reduced at a higher level of abstraction, that is, the same approach as simple transfer at the abstract level. Although this approach initially achieved reasonable performance, it soon reached its limit; the extracted patterns became increasingly clumsy and convoluted.

Event recognition of the climbing-up model (Yakushiji 2006).

As discussed above, we realized that this was because of the nature of IE tasks, and switched to the approach based on a bundle of features ( Figure 10 ) (Miwa et al. 2009 ). This shift continued further to the ongoing research, which uses a large language model (BioBERT). In this recent work, linguistic information is assumed to be implicitly embedded in the language model. The information is not represented explicitly in IE systems ( Figure 11 ) (Ju, Miwa, and Ananiadou 2018 ; Trieu et al. 2020 ).

Event recognizer using diverse cues including parse trees (Miwa et al. 2009).

Event extraction using BERT (Trieu et al. 2020).
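As a hedged illustration of this BERT-style approach, the sketch below casts entity/trigger detection as token classification over contextual embeddings, with no explicit linguistic structure in the pipeline. The checkpoint name and label set are assumptions, and the classification head here is untrained; a real system would be fine-tuned on annotated event data first.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Protein", "I-Protein", "B-Trigger", "I-Trigger"]
name = "dmis-lab/biobert-base-cased-v1.1"    # assumed BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(
    name, num_labels=len(labels))            # randomly initialized head:
                                             # fine-tune before real use
enc = tokenizer("IL-2 expression is regulated by NF-kB.",
                return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)[0]  # one label id per subword
for tok, i in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), pred):
    print(tok, labels[int(i)])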

On the other hand, our interest in biomedical text mining extended beyond the traditional IE tasks and moved toward coherent integration of extracted information. In this integration, it became apparent that linguistic structures play more significant roles.

For example, claims about an event extracted from different articles often contradict each other. As such, techniques for measuring the credibility or reliability of claims are crucial.

In scientific fields such as biology and medical sciences, claims about an event can be made affirmatively or speculatively, with different degrees of confidence. To measure the degree of confidence of a claim, we have to examine the type of linguistic structure in which the claim is embedded (Zerva et al. 2017 ; Zerva and Ananiadou 2018 ). Additionally, depending on the position of an extracted event in a sentence, it may be considered as a pre-supposed fact, hypothesis, and so forth. The manner in which structural information recognized by a parser can be utilized to detect and integrate contradicting claims remains an important research issue.

Another typical example of an integration problem is the automatic curation of pathways, in which an NLP system is used to combine a set of different events extracted from different articles to build a coherent network of events (Kemper et al. 2010 ). In this task, a system must be able to decide whether two events reported in different articles can be treated as the same event in a pathway. To do this, the system must be able to detect the biological environments in which two reported events take place, by considering the surrounding contexts of the events. This may also require linguistic structures to be taken into account.

Lessons. By focusing on the biomedical domain, we introduced concrete forms of extra-linguistic knowledge (i.e., domain ontologies built by the target domain communities) and diverse databases, which include manually curated pathway networks. The task of linking information in text with these resources helped to define concrete research topics focusing on the relation between language and knowledge of the target domains. Because scientific communities such as microbiologists have agreed views on which pieces of information constitute their domain knowledge, we can avoid the uncertainty and individuality of knowledge that may have hampered research in the general domain.

Linguistic structures, with which NLP technologies such as parsing have previously been concerned, play less important roles than we initially expected. Nevertheless, ongoing research into the integration of extracted information has started to reassess the importance of linguistic structures.

Another important finding is the nature of human reasoning. The CL and NLP communities tend to consider reasoning as a kind of logical process based on symbolic knowledge. However, the actual reasoning that the experts in the biomedical domain perform may not be so symbolic in nature.

We found, for example, that the reasoning carried out by domain experts on pathways is based on similarities between entities: they infer that a protein A is likely to be involved in an event by observing a reported fact that a protein B, which is similar to protein A, is involved in the same type of event. The similarity between proteins A and B is based on the similarities between their 3D structures. Because such similarities among proteins are scarcely manifested in their occurrences in text, large language models trained on a large collection of papers would be unable to capture them. Symbolic domain ontologies (i.e., classification schemes of biological entities) also fail to capture such fine-grained similarities. Accordingly, it may be necessary to use heterogeneous sources of information, such as databases of protein structures, large collections of pathways, and so on, to capture such semantic similarities among entities and to carry out reasoning based on them.
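A toy sketch of this similarity-based inference follows, with invented vectors standing in for structure-derived embeddings and an arbitrary threshold: a protein whose embedding is close to that of a known participant is proposed as a candidate participant.

import numpy as np

structure_vecs = {                       # pretend 3D-structure embeddings
    "proteinA": np.array([0.9, 0.1, 0.3]),
    "proteinB": np.array([0.85, 0.15, 0.28]),
    "proteinC": np.array([0.1, 0.9, 0.2]),
}
known_participants = {"proteinB"}        # reported in the literature

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for name, vec in structure_vecs.items():
    if name in known_participants:
        continue
    score = max(cosine(vec, structure_vecs[p]) for p in known_participants)
    if score > 0.95:                     # arbitrary similarity threshold
        print(f"{name} is a candidate participant (similarity {score:.2f})")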

We have witnessed the rapid progress and significant changes that neural network (NN) models and deep learning (DL) have brought to the field of NLP. This is a typical example in which advances in broader fields of computer science/engineering open up new opportunities to change and enhance the NLP field.

The changes brought by NN and DL are broad and have had a profound impact not only on NLP but also visual/image processing, speech/signal processing, and many other areas of artificial intelligence. It is a paradigm shift.

The new paradigm has significantly improved the performance of diverse NLP tasks. Furthermore, I expect it will contribute significantly toward solving the most challenging NLP problems, by integrating NLP with the processing of other information modalities (images, sounds, haptics, etc.), and with knowledge processing, and so on.

In technological fields such as image and speech processing, reasoning based on knowledge traditionally used different modeling and processing techniques. They now share the same technological basis of NN and DL. It is becoming much easier to integrate heterogeneous forms of processing, meaning that carrying out NLP in multimodal contexts and NLP with knowledge bases are far more feasible than we previously thought. The research teams of the institutions with which I am affiliated are now working on these directions (Kumar Sahu et al. 2019 ; Iso et al. 2020 ; Christopoulou, Miwa, and Ananiadou 2021 ).

On the negative side, NLP based on large language models is increasingly separating itself from the other research disciplines that study language. The black-box nature of NN and DL also makes analytical methods of assessing NLP systems difficult.

Although the characteristics are very different, I fear that the paradigm may encounter difficulties similar to those suffered by first-generation MT systems. One could improve overall performance by tweaking computational models, but without rational and systematic analysis of problems, this approach failed to solve the real difficulties or to recognize the limits of the technology.

As revealed through detailed analysis of parsing errors, even when the overall quantitative performance improved, semantically crucial errors of specific types remained unsolved.

Without analysis based on theories provided by other language-related disciplines, erratic and unexpected behaviors of NN-based NLP systems will remain and limit potential applications.

On the other hand, CL tends to treat language as a closed system or to focus on specific aspects of the regularities that languages show. By examining what takes place in NLP systems, together with NLP practitioners, CL researchers would be able to enrich the scope of their theories and to provide a theoretical basis for the analytic assessment of NLP systems.

It is time to re-connect NLP and CL.

I was introduced to the field of NLP by my long-time mentor, Professor Makoto Nagao, who was a recipient of the Lifetime Achievement Award (2003). He passed away last May. It is unfortunate that I could not share my honor and happiness with him.

In my career of almost 50 years, I have conducted research into NLP at several institutes worldwide, including Kyoto University; CNRS (GETA, Grenoble), France; University of Manchester, UK; the University of Tokyo; and Microsoft Research, China. I receive the honor on behalf of the colleagues, research fellows, and students who worked with me at these institutions. Without them, my research could not have progressed in the way that it has. I deeply appreciate their support.

This statement is a bit of a simplification. Some formal semanticists, like R. Montague, do not assume a “mental” model. They are interested in a universal theory of semantics of language. Language need not be human language (Montague 1970 ).

In this definition, I take research on parsing as part of NLP, since it is concerned with processing of language. Research on parsing algorithms, however, may be quite different in nature from the engineering side of NLP. Thus, the boundary between NLP and CL is not so clear-cut.

The explanation here is simplified. Formal language theory was not necessarily concerned with human language; it revealed the relationship between classes of languages and the computational power of their recognizers. Parsing algorithms for formal languages were studied not necessarily for human languages. Strictly speaking, this research was not conducted as NLP research.

The left side of this figure is the transfer with a pre- and post-cycle of adjustment. These two cycles are required to treat language pairs like Japanese and English. The cycles adjust differences in grammaticalization of the two languages. Differences are abundant when we treat languages that belong to very different language families (Tsujii 1982).

Apart from the equality of information, the interlingual approach assumed that the language-independent representation consists only of language-independent lexemes. This involved implausible work of defining a set of language-independent concepts.

IS (Interface Structure) is dependent on a specific language. In particular, unlike the interlingual approach, Eurotra did not assume language-independent lexemes in ISs, so the transfer phase between the two ISs (source and target) was indispensable. See footnote 5.

This is an overgeneralization. Theoretical linguistics by N. Chomsky explicitly avoided problems related to interpretation and treated language as a closed system. Other linguistic traditions have had more relaxed, open attitudes.

Note that transformational grammar considered a set of rules applied to generate surface structures from the deep structure. On the other hand, the “climbing up the hierarchy” model of analysis considered a set of rules to reveal the abstract level of representation from the surface level of representation. The directions are opposite. Ambiguities did not cause problems in transformational grammar.

There had been attempts to construct probabilistic models without the supervision of annotated tree banks (for example, Fujisaki [1984]). They used EM algorithms such as the inside-outside algorithm. However, they could not have brought significant results on their own. Significant progress was made possible by large treebanks.

The MU project, like the GETA group in France, took the approach of “procedural grammar”, in which rules for disambiguation check the context explicitly to choose a plausible interpretation. Without probabilistic models, this approach was the only option for delivering working systems. However, the analysis phase in this approach becomes clumsy and convoluted (Tsujii, Nakamura, and Nagao 1984; Tsujii et al. 1988).

This means that the reader identifies information in text that may not be the main information the writer intends to convey. The task of IE is essentially concerned with interpretation of text by the reader, and the reader infers diverse sorts of information from text.

The information-format approach in the NYU group’s String Grammar tried to use surface patterns as rules of a sublanguage grammar. I thought this approach would not work on the language of published papers: compared with the language of medical records, the language of published papers is not so restricted and is intertwined with the rules of the general language.


Georgia Tech at EMNLP 2021

GT Research in Natural Language Processing


Welcome to Georgia Tech’s virtual experience for Empirical Methods in Natural Language Processing 2021 (EMNLP). Look around, explore, and discover insights into the institute’s research contributions at the conference, taking place Nov. 7-11. EMNLP is a leading venue for natural language processing and artificial intelligence research, giving researchers the opportunity to present and discuss the latest contributions to the field. Our coverage includes trends from EMNLP’s entire papers program.


Georgia Tech research papers include natural language processing work from across the EMNLP program. Explore all the people and individual papers by clicking the data graphic. Georgia Tech’s 28 authors in the program contributed to 18 long papers and two short papers.


Georgia Tech authors have work represented in 14 of EMNLP’s 22 main research tracks. The map shows all authors from Georgia Tech and their external partners by track. The institute has seven authors in Computational Social Science and Cultural Analytics, where GT is strongest. Georgia Tech’s presence here includes the most authors from any organization, and all seven authors have oral papers, the highest tier of research in the program. Explore the map.

FACULTY FOCUS | Q&A with Experts at EMNLP


Alan Ritter, Associate Professor, School of Interactive Computing

If you had to pick, which of your research results accepted at EMNLP would you want to highlight?

I am excited about my Ph.D. student Yang Chen’s paper “Model Selection for Cross-Lingual Transfer”. In this work, Yang addressed an important problem related to cross-lingual transfer learning, where the goal is to solve a natural language understanding task in a specific language, for example German, but the model only has access to supervised training data in English. Recently, pre-trained transformer models, such as multilingual BERT or XLM-RoBERTa, have demonstrated surprisingly good performance, yet there is a lot of variance in performance on the target language between different fine-tuning runs. It is hard to know which model will work best, because in the zero-shot setting you don’t have access to any labeled data in the target language. Our solution is a ranker trained to predict which multilingual model will perform best on a given target language. This was found to work better than selecting models on English development data, the standard method currently used in the literature.
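The general idea of learned model selection can be pictured with a small sketch. The code below is an illustrative approximation, not the paper's actual implementation: it assumes each fine-tuned checkpoint is summarized by a hypothetical feature vector (e.g., English dev accuracy and training statistics) and that past runs with measured target-language accuracy exist to fit a regressor that ranks new candidates.

```python
# Illustrative sketch of learned model selection for zero-shot
# cross-lingual transfer (assumptions: synthetic features and targets).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical history: 200 past fine-tuning runs, 3 features per run,
# plus the target-language accuracy that was eventually measured.
X_history = rng.random((200, 3))
y_history = 0.5 * X_history[:, 0] + 0.1 * rng.random(200)  # toy signal

ranker = GradientBoostingRegressor().fit(X_history, y_history)

# New task: 5 candidate checkpoints, no target-language labels available.
X_candidates = rng.random((5, 3))
predicted_acc = ranker.predict(X_candidates)
best = int(np.argmax(predicted_acc))
print(f"Select checkpoint {best} (predicted accuracy {predicted_acc[best]:.3f})")
```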

What recent advancements or new challenges have you seen in the research areas that you are involved in?

One challenge is that pre-trained models, such as BERT, GPT, etc., are becoming more and more costly to train. In our paper “Pre-train or Annotate? Domain Adaptation with a Constrained Budget”, my Ph.D. student Fan Bai, together with myself and Professor Wei Xu, investigated the costs of renting GPUs using a service such as Amazon EC2 or Google Cloud, and whether that money could have been better spent if it were instead allocated towards hiring annotators to simply label more supervised training data. We found that for small budgets it is actually better to invest all your funding into data annotation; however, as the available budget increases, a combination of annotation and pre-training becomes the most economical strategy. This is a surprising result, because the conventional wisdom in the field has been that annotating data is expensive, so methods that can make use of unlabeled data, through pre-training or other methods, are more cost-effective.
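As a back-of-the-envelope illustration of that budget trade-off (with entirely invented prices and accuracy curves, not the paper's actual cost model):

```python
# Toy cost model contrasting "annotate only" with "pre-train + annotate"
# under a fixed budget. All numbers are made up for illustration.
import math

def accuracy(n_labels: int, pretrained: bool) -> float:
    """Hypothetical diminishing-returns accuracy curve."""
    base = 0.62 if pretrained else 0.55  # pre-training raises the floor
    return base + 0.08 * math.log10(max(n_labels, 1))

def best_strategy(budget: float, label_cost=0.30, pretrain_cost=500.0):
    # Option 1: spend everything on annotation.
    annotate_only = accuracy(int(budget / label_cost), pretrained=False)
    # Option 2: pay for pre-training first, annotate with the remainder.
    if budget > pretrain_cost:
        mixed = accuracy(int((budget - pretrain_cost) / label_cost), pretrained=True)
    else:
        mixed = float("-inf")  # cannot afford pre-training at all
    if annotate_only >= mixed:
        return ("annotate only", annotate_only)
    return ("pre-train + annotate", mixed)

for b in (100, 1000, 10000):
    print(b, best_strategy(b))  # small budget -> annotate only; large -> mixed
```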

If you were to focus on just one social application of your EMNLP work, what might it be?  

In our paper “Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts”, written by my Ph.D. student Ashutosh Baheti together with myself, Maarten Sap, and Mark Riedl, we showed that neural chatbots, such as DialoGPT or GPT-3, are two times more likely to agree with toxic comments made by users on Reddit. It appears these models have learned to reproduce an echo-chamber effect present in the data they are trained on: users are less likely to reply to an offensive comment online unless they agree with the author. There was a well-written article covering this paper on The Next Web.


Wei Xu, Assistant Professor, School of Interactive Computing

My three papers happen to be on three different topics I am very excited about, each in collaboration with different researchers. Some point to a new direction; some tackle a long-standing research problem:

  • Natural language generation —  “BiSECT: Learning to Split and Rephrase Sentences with Bitexts”
  • Economics of pre-trained language models —  “Pre-train or Annotate? Domain Adaptation with a Constrained Budget”
  • Stylistics to reduce biases in language —  “WIKIBIAS: Detecting Multi-Span Subjective Biases in Language”

It has been very exciting that large pre-trained language models, such as T5, have significantly improved natural language generation; at the same time, the generated outputs are still far from good enough (e.g., with hallucinations and an inability to generate or recognize paraphrases) to be useful for real-world applications. My research group aims to develop a complete solution with new machine learning models, datasets, and evaluation metrics.

If you were to focus on just one social application of your EMNLP work, what might it be?

Our work in the paper “BiSECT: Learning to Split and Rephrase Sentences with Bitexts” is developed for simplifying long, complex sentences into shorter ones to help people read. This NLP research topic is called “text simplification,” which has direct applications in accessibility and education. We are interested in improving textual accessibility for all people: it can help schoolchildren read news articles and STEM material, help the general public read medical documents and government policies, help non-native speakers read in their second language, and more.


Diyi Yang, Assistant Professor, School of Interactive Computing

I would pick “Latent Hatred”. Hate speech is pervasive in social media, causing serious consequences for victims of all demographics. Mai and Caleb’s work significantly advances our understanding of how implicit hate speech works today. Their work introduces a theoretically justified taxonomy of implicit hate speech and a benchmark corpus with fine-grained labels to support research on detecting and explaining implicit hate speech. This interdisciplinary work is a collaboration between the School of Interactive Computing and the Sam Nunn School of International Affairs within Georgia Tech.

Despite the increasing success of NLP, current language technologies often fail in low-resource data, language, and dialect settings, and still suffer from bias and fairness issues. My lab has been addressing these issues and we will continue working on this emerging direction of socially aware language technologies.  

An automatic meeting note-taker! We have been working on automatic conversation summarization for a while, combining rich linguistic structures in conversations and meetings with large NLP models, and CODA, which we are going to present at EMNLP, will enable us to do such conversation summarization with limited supervision. With the rise of remote and video meetings since COVID-19, our automatic meeting note-taker would benefit a lot of people.


Tuo Zhao, Assistant Professor, H. Milton Stewart School of Industrial and Systems Engineering

I would highlight our result on automated evaluation of dialogue systems. Dialogue systems are among the most fundamental and important problems in natural language processing. However, an ideal environment for evaluating dialogue agents, i.e., the Turing test, requires human interaction, which is not affordable for large-scale experiments. Our EMNLP 2021 paper (arxiv.org/abs/2102.10242) proposes a new reinforcement learning framework – ENIGMA – for automating the Turing test. ENIGMA adopts an off-policy approach and only requires a handful of pre-collected experience data, so the evaluation does not require human interaction with the target agent, making automated evaluation feasible. We believe our proposed framework can serve as a fundamental building block for the evaluation of dialogue systems and will motivate more sophisticated and stronger follow-up work on the automated Turing test.

Ultra-large neural language models (NLMs) are undoubtedly one of the most important breakthroughs in natural language processing in the past few years. They have significantly advanced research on natural language understanding, generation, etc. Despite the huge success, most efforts have been devoted to developing ever-larger models, while the importance of the training algorithms has been less recognized. As a result, these large models are becoming more and more parameter-inefficient: making a model 10 times larger only leads to a marginal improvement in prediction performance. Our EMNLP 2021 paper (arxiv.org/abs/2104.04886) proposes a new computational framework – SALT – for training large NLMs. SALT adopts a game-theoretic approach, introducing a Stackelberg competition between an NLM and adversarially perturbed training data. Such competition encourages the NLM to make stable and robust predictions and significantly improves generalization performance. Our experiments suggest that NLMs trained by SALT can outperform conventionally trained ones that are several times larger.

My research interests in NLP mainly focus on developing new methodologies for fundamental problems such as dialogue systems and neural language models, so I am not very familiar with social applications. If I had to choose one, I would be interested in chatbots for health screening, because they can greatly improve triage and screening processes in a scalable, contact-free manner. This is particularly useful for dealing with infectious diseases such as COVID-19.


EMNLP ’21 | GLOBAL COMMUNITY

Special thanks to the EMNLP program committee for the collaboration.

Machine Learning Center at Georgia Tech

The Machine Learning Center was founded in 2016 as an interdisciplinary research center (IRC) at the Georgia Institute of Technology. Since then, we have grown to include over 190 affiliated faculty members and 60 Ph.D. students, all publishing at world-renowned conferences. The center aims to research and develop innovative and sustainable technologies using machine learning and artificial intelligence (AI) that serve our community in socially and ethically responsible ways. Our mission is to establish a research community that leverages the Georgia Tech interdisciplinary context, trains the next generation of machine learning and AI pioneers, and is home to current leaders in machine learning and AI.

Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1, pp. 634–647

A Systematic Literature Review of Natural Language Processing: Current State, Challenges and Risks

  • Eghbal Ghazizadeh
  • Pengxiang Zhu

Conference paper, first online: 31 October 2020

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1288)

In this research paper, a comprehensive literature review was undertaken to analyze Natural Language Processing (NLP) applications across different domains. Through qualitative research, we analyze the current state and challenges of NLP as a key Artificial Intelligence (AI) technology, pointing out some of its limitations, risks, and opportunities. In our research, we rely on primary data from applicable legislation and on secondary public-domain data sources providing related information from case studies. By studying the structure and content of the published literature, NLP-based applications are classified into different fields, including natural language understanding, natural language generation, voice or speech recognition, machine translation, spell correction, and grammar checking. Development trends, open issues, and limitations are also analyzed.

  • Natural Language Processing
  • Voice recognition
  • Artificial intelligence
  • NLP applications



Author information

Authors and affiliations

Eghbal Ghazizadeh & Pengxiang Zhu, Whitireia Polytechnic, 450 Queen Street, Auckland 1010, New Zealand

Correspondence to Eghbal Ghazizadeh.

Editor information

Kohei Arai (Faculty of Science and Engineering, Saga University, Saga, Japan); Supriya Kapoor and Rahul Bhatia (The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK)

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this paper

Ghazizadeh, E., Zhu, P. (2021). A Systematic Literature Review of Natural Language Processing: Current State, Challenges and Risks. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1. FTC 2020. Advances in Intelligent Systems and Computing, vol 1288. Springer, Cham. https://doi.org/10.1007/978-3-030-63128-4_49




Computer Science > Computation and Language

Title: Datasets: A Community Library for Natural Language Processing

Abstract: The scale, variety, and quantity of publicly-available NLP datasets have grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at this https URL.
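For readers unfamiliar with the library, a minimal usage example follows; the dataset name is just a common benchmark chosen for illustration.

```python
# Minimal use of the `datasets` library: one interface for small and
# internet-scale corpora alike. Data is downloaded and cached on first call.
from datasets import load_dataset

dataset = load_dataset("glue", "sst2")   # returns a DatasetDict of splits
print(dataset["train"][0])               # one example as a plain dict
print(dataset["train"].features)         # typed schema of the columns

# Transformations are applied with map(), which processes and caches results.
with_len = dataset["validation"].map(
    lambda ex: {"n_words": len(ex["sentence"].split())}
)
```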


Including Signed Languages in Natural Language Processing

Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, Malihe Alikhani

Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani. 2021. Including Signed Languages in Natural Language Processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7347–7360, Online. Association for Computational Linguistics. https://aclanthology.org/2021.acl-long.570


Natural language processing for urban research: A systematic review

Associated data

Data included in article/supplementary material/referenced in article.

Natural language processing (NLP) has shown potential as a promising tool to exploit under-utilized urban data sources. This paper presents a systematic review of urban studies published in peer-reviewed journals and conference proceedings that adopted NLP. The review suggests that the application of NLP in studying cities is still in its infancy. Current applications fall into five areas: urban governance and management, public health, land use and functional zones, mobility, and urban design. NLP demonstrates the advantages of improving the usability of urban big data sources, expanding study scales, and reducing research costs. On the other hand, to take advantage of NLP, urban researchers face the challenges of raising good research questions; overcoming data incompleteness, inaccessibility, and non-representativeness; coping with immature NLP techniques; and meeting computational skill requirements. This review is among the first efforts intended to provide an overview of existing applications and challenges for advancing urban research through the adoption of NLP.

Natural language processing; Urban research; Urban big data; Text mining

1. Introduction

The advancement of technologies is not just changing cities (Urban Land Institute, 2019); it is also transforming the way urban researchers are able to study cities. Gray's notion of the fourth paradigm of science pointed out that the wide availability of data changes the practice of science (Hey et al., 2009). Abundant urban big data is being generated and stored at unprecedented speed and scale; researchers nowadays are able to ask and answer questions in ways that were impossible in the past.

The paradigm shift of scientific research highlights the need for a new generation of scientific tools and methods. Among all existing data, 95% are in unstructured form, which lacks an identifiable tabular organization required by traditional data analysis methods (Gandomi and Haider, 2015). Unstructured data, such as Web pages, emails, and mobile phone records, may contain numerical information (e.g. dates) but is usually text-heavy. Unlike numbers, textual data are inherently inaccurate and vague. According to a conservative estimate by Britton (1978), at least 32% of the words used in English text are lexically ambiguous. The messy reality of textual data makes it challenging for researchers to take advantage of urban big data.

On the other hand, the large quantity of textual data provides new opportunities for urban researchers to examine people's perceptions, attitudes, and behaviors, so as to advance the knowledge and understanding of urban dynamics. For example, Jang and Kim (2019) have shown that crowd-sourced text data gathered from social media can effectively represent the collective identity of urban space. Conventional data gathering techniques, such as surveys, focus groups, and interviews, are oftentimes expensive and time-consuming. If used wisely, organic text data collected without pre-specified purposes could be incredibly powerful and complement purposefully designed data collection.

Natural language processing (NLP) has demonstrated tremendous capabilities in harvesting the abundance of textual data. As a form of artificial intelligence, it uses computational algorithms to learn, understand, and produce human language content (Hirschberg and Manning, 2015). It is interrelated with machine learning and deep learning. Basic NLP procedures include processing text data, converting text to features, and identifying semantic relationships (Ghosh and Gunning, 2019). In addition to its ability to structure large volumes of unstructured data, NLP can improve the accuracy of text processing and analysis because it follows rules and criteria in a consistent way. NLP has proven useful in many fields. For example, in medical research, Guetterman et al. (2018) conducted an experiment comparing the results of an NLP analysis with a traditional text analysis and reported that NLP was able to identify the major themes that had been manually summarized by the traditional analysis.
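The three basic procedures named above can be made concrete with a minimal sketch (toy documents invented for illustration; scikit-learn chosen only as one common tool):

```python
# A minimal illustration of three basic NLP steps: processing text,
# converting text to features, and identifying semantic relationships
# (here via cosine similarity of TF-IDF vectors).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The city expanded its bus network last year.",
    "New bus routes improved urban transit coverage.",
    "Housing prices rose sharply in the downtown core.",
]

# Steps 1-2: tokenize, normalize, and vectorize the documents.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(docs)

# Step 3: measure semantic relatedness between documents.
print(cosine_similarity(features[0], features[1]))  # transit docs: higher
print(cosine_similarity(features[0], features[2]))  # unrelated docs: lower
```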

Here, a comprehensive review of the ways that researchers have utilized NLP in urban studies is presented. This work is among the first efforts intended to provide a synthesis of opportunities and challenges for advancing urban research through the adoption of NLP.

2. Methodology

2.1. Literature search

The aim of this literature search was to gather all scientific publications in urban studies that utilized NLP. To serve this aim, journal articles and conference papers were searched in four online databases: EBSCO Urban Studies Abstracts, Scopus, ProQuest, and Web of Science. Because each database has different searchable fields and filtering options, slightly different search criteria were adopted for each (see Table 1). Besides the criteria listed in Table 1, the language of publications in all four database searches was also constrained so that the results included only literature in English. The search timeframe was “all years,” meaning the results contained all publications to date (November 2019).

Table 1

Literature search criteria.

The initial search returned 271 publications: 6 from EBSCO Urban Studies Abstracts, 69 from Scopus, 125 from ProQuest, and 71 from Web of Science. After removing 73 duplicates, the titles and abstracts of the remaining articles were reviewed. The publications were further narrowed down: 152 were on topics irrelevant to urban research, such as travel planning, regional linguistic variations, or corpus development; 18 studies did not use NLP; and four articles lacked full-text access. These 174 articles were removed, and the remaining 24 studies were reviewed in full text. Two articles identified from the citations of the included studies were added for review. As a result of reviewing the publications found based on criteria of relevance and full-text access, this study included a total of 26 publications for detailed analysis.

2.2. Limitations

While the strategy used during the literature search was meant to be a comprehensive and systematic approach, it had several limitations. First, the search had a language bias because it only included studies published in English. Articles in non-English languages with English abstracts were not included either. Second, the method for retrieving publications may have excluded studies that used NLP techniques but had been labeled with other terminology. For example, studies that used latent Dirichlet allocation (LDA), a statistical model in NLP, listed LDA as a keyword rather than NLP and therefore did not match the literature search criteria. Third, this review only included peer-reviewed journal articles and conference papers, which eliminated possible NLP applications documented in dissertations, theses, reports, and working papers. This was a tradeoff between literature quantity and quality.

3. Literature search results

3.1. Number of publications

The systematic literature search returned a total of 26 urban studies that used NLP, of which 21 were journal articles and five were conference papers. All of them appeared from 2012 onwards, and more than half (62%) from 2018 onwards (Figure 1). The sharp increase in the number of publications reflects the growing interest in NLP among urban researchers.

Figure 1

Number of urban studies using NLP by year.

3.2. NLP applications

Urban researchers have explored diverse topics using NLP, as summarized in Table 2. In general, researchers have applied NLP in five areas: urban governance and management, public health, land use and functional zones, mobility, and urban design. Urban governance and management is the most dominant topic (39% of all studies), including discussions of citizen engagement, disaster response, crime detection, and construction management. Researchers have also used NLP to study urban health (19% of all studies), such as urban epidemic prediction, air quality monitoring, and assessment of living environments. Land use and functional zones is another popular area of research (19% of all studies), in which researchers used NLP to model urban spatiotemporal dynamics. Finally, though only in limited numbers, researchers have also adopted NLP in mobility (15% of all studies) and urban design research (8% of all studies).

Table 2

Summary of included studies.

3.3. Data sources

The majority of studies in this review used social media as their data source, including Twitter, Instagram, Facebook, Foursquare, Craigslist, Minube, and Yelp (Table 2). Researchers typically extract and analyze the text along with the geolocation information embedded in social media posts. However, data sources are not limited to social media: researchers have also used NLP to process information gathered from interviews, focus groups, phone call records, building permits, online hotel reviews, event listings, and neighborhood reviews. The data size could be big (e.g. millions of tweets) or small (e.g. a dozen interviews). Additionally, researchers have extended the usage of NLP from textual data to non-textual data, such as points of interest (POIs) in maps and GPS trajectories generated by cell phones and taxis. It is worth mentioning that when a study objective was predictive modeling, it was common to check the validity of NLP results against records from official sources.

3.4. Study area

Studies using NLP have covered a wide range of geographic locations ( Table 2 ). Most studies focused on major cities with large populations, such as New York City, US and Beijing, China. Some examined multiple cities for comparison. The scale of analysis ranges from a single city to a continent.

4. Applications of NLP in urban research

Using NLP has the advantages of improving the usability of urban big data sources, expanding study areas and scales, and reducing research costs. In this section, the opportunities shown in the current applications of NLP are discussed in five areas: urban governance and management, public health, land use and functional zones, mobility, and urban design.

4.1. Urban governance and management

NLP adds new opportunities to citizen engagement, which is the most dominant topic among studies of urban governance challenges (Cruz et al., 2019). NLP techniques combined with online crowd-sourced data open up a communication channel between city managers and the general public. From 2001 to 2004, the Electronic Democracy European Network (EDEN) project launched a real-life pilot to test whether a particular NLP approach could improve communication between citizens and public administrators (European Commission, 2015). Though the EDEN project ran into multiple obstacles, the project managers and engineers concluded that “it seems reasonable to approach e-democracy by seeking a democratic approach to software solutions” (Carenini, Whyte, Bertorello, and Vanocchi, 2007, p. 27). Computer scientists have also developed NLP applications that function as a citizen feedback gathering tool (Estévez-Ortiz et al., 2016), a citizen concern detector (Abali et al., 2018), and an urban community identifier (Vargas-Calderón and Camargo, 2019), all with promising results. In addition, combining NLP with interviews and focus groups, Bardhan et al. (2019) discovered gender inequality in Indian slum rehabilitation housing management, which suggested the need for a more systematic participatory approach to improve well-being among the rehabilitated occupants.

Additionally, NLP shows potential to support natural disaster responses. According to the US Congress's think tank, there are two ways that government agencies could use social media in emergency and disaster management: 1) as an outlet for information dissemination, and 2) as a systematic tool for emergency communication, victim assistance, situation monitoring, and damage estimation (Lindsay, 2011). The second category is where NLP has a direct role. An early work by Imran et al. (2013) trained a model that extracts disaster-relevant information from tweets and achieved 40%–80% correctness. More recently, Hong et al. (2018) built an unsupervised NLP topic model that requires minimal human effort in text collecting and analyzing, which could help government agencies identify citizens' needs and prioritize tasks during natural disasters. Additionally, with the integration of NLP and geospatial clustering methods, Hu et al. (2019a, 2019b) collected local place names from housing advertisements, which has implications for disaster response, because these names may not exist in official gazetteers, and their absence could lead to miscommunication between local residents and disaster responders.

Furthermore, researchers have completed proof-of-concept studies using NLP, machine learning, and spatial analysis to spot urban crime. In Brazil, Souza et al. (2016) trained a classification model on emergency phone call records from the state police department, and their model was able to analyze real-time tweets for crime detection. In the US, Helderop et al. (2019) succeeded in detecting prostitution activity by examining hotel reviews, locations, and prices. Though the generalizability of the methods used in these studies needs further verification, with future improvements they could eventually contribute to improving urban security.

Finally, scholars have demonstrated the power of NLP in construction management research. Lai and Kontokosta (2019) conducted an exploratory study analyzing building permit records to uncover building renovation and adaptive reuse patterns in seven major US cities. The method they developed may benefit the monitoring of building alterations in urban areas.

4.2. Public health

Urban public health has drawn growing attention among researchers in recent years. Studies have revealed that various economic, social, and environmental factors, including the spread of infectious diseases, poor living conditions, unhealthy lifestyles, and pollution, could negatively affect public health in urban areas (Moscato and Poscia, 2015).

NLP is essential to the large-scale use of social media as a sensor for predicting epidemic outbreaks. Traditional epidemic monitoring relies on clinical reports gathered by public health authorities (Vaughan et al., 1989). For instance, health care providers in the US depend on information provided by the Centers for Disease Control and Prevention (CDC) to learn about disease outbreaks (CDC, 2018). However, the time lag between the date a disease starts and the date clinical cases are reported to authorities is a major drawback of official surveillance systems (CDC, 2018). For this reason, many researchers have developed data processing and modeling techniques to use social media as a data source for real-time epidemic analysis (Al-garadi et al., 2016). Though manually filtering and classifying relevant messages eliminates false positive and false negative errors, the tradeoff is a slow analysis process (Nagar et al., 2014). NLP classification, on the other hand, can process data relatively fast with reasonable accuracy, supporting early detection of a disease. For example, in a Japanese nationwide study, Wakamiya et al. (2018) used an NLP module to effectively estimate when and where influenza outbreaks were happening.
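A hedged sketch of the kind of classifier such surveillance systems rely on follows; the tweets and labels are invented, and a real system would train on thousands of annotated messages and aggregate predictions by location and date.

```python
# Illustrative text classifier for epidemic surveillance: label messages
# as illness-related or not, then count positives as an outbreak signal.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "home sick with a fever and chills",
    "terrible flu, staying in bed all week",
    "great concert downtown last night",
    "trying a new ramen place for lunch",
]
train_labels = [1, 1, 0, 0]  # 1 = illness-related, 0 = not

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

new_tweets = ["woke up with a sore throat and fever", "beach day with friends"]
print(model.predict(new_tweets))  # positive counts would feed an outbreak signal
```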

In a similar sense, researchers view social media users as soft sensors to measure urban air quality. Riga and Karatzas (2014) adopted an NLP bag-of-words model to process social media posts and concluded that users’ reports of their surrounding environmental conditions on social media platforms are highly correlated with the actual observations obtained from official monitoring sites.

Analyzing urban residents' perceptions of their living environments and evaluating urban communities' lifestyles is another sphere of public health research in which NLP appears useful. Hu et al. (2019a) used NLP to process online neighborhood reviews to assess New Yorkers' satisfaction with their living conditions and their perceived quality of life. Also using NLP, Fu et al. (2018) derived urban citizens' activities from their linguistic patterns. Additionally, Rahimi, Mottahedi, and Liu (2018) examined different communities' food consumption behaviors and lifestyles in ten major North American cities using a bag-of-words model. Findings from these studies could serve as a valuable reference for city policymakers, as they provide multifaceted health-related information complementary to the conventional census.
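As one illustration of review-level sentiment scoring (using NLTK's off-the-shelf VADER analyzer rather than the pipelines of the cited studies, with invented example reviews):

```python
# Score the sentiment of neighborhood reviews with NLTK's rule-based
# VADER analyzer. The lexicon is downloaded on first use.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
reviews = [
    "Quiet streets, friendly neighbors, and great parks nearby.",
    "Constant noise at night and the rent keeps going up.",
]
for r in reviews:
    # The compound score ranges from -1 (negative) to +1 (positive).
    print(sia.polarity_scores(r)["compound"], r)
```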

4.3. Land use and functional zones

Looking back at the history of urban planning, collecting information concerning land use functions is a critical step before laying out urban plans (Breheny and Batey, 1981). Traditional approaches to examining structures and changes in urban land use include analyzing aerial photographs (Philipson, 1997), field surveys (Pissourios, 2019), and remote sensing (Bowden, 1975).

More recently, researchers have extended the usage of NLP from textual to non-textual data and applied it to urban land use and functionality studies. NLP typically detects underlying correlations between words according to their context. To capture urban spatial structures, researchers treat a region as a text document, a function as a topic, and entities of interest as words (Li, Fei, and Zhang, 2019; Yuan et al., 2015). In this way, various NLP modeling methods allow researchers to determine contextual relationships between urban functional regions or different land use types based on the similarities among entities (i.e. geographic space interactions). This method makes use of urban data generated by sensors, vehicle geolocation tracking systems, and location-based services.

Yuan et al. (2015) explained the concept of mobility semantics by arguing that people's socioeconomic activities in a region are strongly correlated with the spatiotemporal patterns of those who visit the region (i.e. mobility semantics). Another key concept, location semantics, refers to urban road networks and the allocation of POIs (Yuan et al., 2015). By leveraging mobility and location semantics, Yuan et al. (2012, 2015) identified urban functional zones (e.g. residential, business, and educational areas) through topic modeling. Similarly, Yao et al. (2017) classified urban land use at the level of irregular land parcels by integrating a semantic model and deep learning. Huang et al. (2018) quantified industrial land use changes in a bay area in China using POI data. Based on a Word2Vec model, Li et al. (2019) proposed a regionalization method that clusters similar spatial units in an area and inspects the clusters' socioeconomic patterns by analyzing the mobility trajectories of people in that area.
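The "region as document, POI category as word" analogy can be sketched as follows (toy POI sequences invented for illustration; the cited studies train on large POI or trajectory datasets):

```python
# Word2Vec learns embeddings for POI categories from their co-occurrence
# within regions; regions are then represented by mean POI vectors and
# clustered into candidate functional zones.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Each "sentence" is the sequence of POI categories observed in one region.
regions = [
    ["restaurant", "bar", "cinema", "restaurant", "hotel"],
    ["school", "library", "dormitory", "cafeteria"],
    ["office", "bank", "office", "restaurant", "subway"],
    ["park", "playground", "school", "residential"],
]

w2v = Word2Vec(sentences=regions, vector_size=16, window=3, min_count=1, seed=1)

# Region embedding = mean of its POI-category vectors.
region_vecs = np.array([np.mean([w2v.wv[p] for p in r], axis=0) for r in regions])

# Cluster regions into functional zones (here, 2 toy clusters).
zones = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(region_vecs)
print(zones)  # cluster ids stand in for functional zone types
```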

Demonstrating the advantages of being efficient and capable of handling large volumes of data, these NLP modeling approaches to classifying land use and functional zones show potential as tools to monitor urban landscape dynamics and to inform urban planning.

4.4. Mobility

Urban mobility researchers have begun to leverage NLP in their studies as well. Serna, Gerrikagoitia, Bernabé, and Ruiz (2017) demonstrated the feasibility of using NLP to automatically identify sustainable mobility issues from social media data, which could enrich the data of traditional travel surveys. Markou, Kaiser, and Pereira (2019) were able to predict taxi demand hotspots for special events by a tool they developed that scans the internet for time-series data.

Similar to the previously explained usage of NLP in land use and functional zones, researchers have also adopted NLP for non-textual data analysis of urban mobility. By analyzing taxi moving paths recorded by the Global Navigation Satellite System, researchers measured spatiotemporal relationships among roads (Liu et al., 2017) and identified the interaction patterns of vehicle movements on road networks (Liu et al., 2019), which, they argued, could be useful in understanding and managing urban traffic.

4.5. Urban design

NLP can facilitate urban design with imageability analysis. “Urban design is the process of understanding people and place in an urban context, leading to a strategy for the improvement of urban life and the evolution of the built environment...” (Building Design Partnership, 1991, p. 14). Introduced by Lynch (1960), imageability is an important concept in urban design that is still being discussed today. It involves subjective urban identity, emphasizing the quality of the built environment as perceived and assessed by observers (Lynch, 1960). NLP enables researchers to evaluate the emotional responses evoked by urban places and to visually map urban identity at various scales and times.

Researchers have already used NLP to process hashtags from Instagram photos; using this information together with the photos' geolocations, they created a cognitive map of the Seoul metropolitan area in Korea to represent its residents' collective perceptions of the place (Jang and Kim, 2019). To explore the emotional dynamics of urban space, Iaconesi (2015) used NLP to connect geographical locations within a city with emotions expressed in social media, thereby establishing urban emotional landmarks. The observation of urban identity and emotional landmarks helps urban designers and planners make interventions and shape urban spaces into more positive and imageable places.

5. NLP challenges in urban research

Ultimately, a method or technique alone will never solve any problem. NLP opens up an exciting new direction, but at the same time it brings about more challenges. In this section, four aspects of potential challenges that apply to urban research are discussed: research questions, data, the method itself, and researchers.

The challenge of research questions lies in identifying novel issues that could not be well solved by traditional techniques. NLP holds great promise for the quest to untangle the complex relationships among urban systems; however, what questions it enables researchers to answer remains to be explored. In fact, the study of cities involves a variety of disciplines in urban contexts (Ramadier, 2004). Existing urban studies have complemented or, sometimes, completely replaced traditional methods with NLP to solve problems in various urban-related fields. While conventional text analysis methods have high accuracy, NLP has advantages in dealing with massive amounts of data at large scale and fine resolution. In a reality of limited time and resources, NLP could provide insight into questions that are impossible to answer with traditional methods. Looking ahead, more urban studies using NLP to answer questions that could not otherwise be answered are sure to emerge.

The data challenge for NLP goes hand-in-hand with the characteristics of urban big data. As pointed out by Salganik (2018), big data's characteristics of incompleteness, inaccessibility, and non-representativeness are generally problematic for academic research. Data incompleteness refers to the fact that, no matter the size, urban big data are not purposefully designed structured data and are very likely to miss information valuable to research, such as demographic factors. In addition, inaccessibility means that data owned by private companies or government agencies are not always accessible to urban researchers due to legal or ethical barriers. Moreover, big data often cannot represent a given urban population. As a result, studies that use NLP to process big data are unlikely to yield generalizable results and face the risk of overlooking certain populations.

Though there have been revolutionary advances in NLP, its mainstream application is still very limited (Hirschberg and Manning, 2015). While the goal of NLP is that algorithms will ultimately be able to determine the relationships between words and grammar in human language and organize meaning by computer logic, current techniques cannot resolve natural language exactly as humans do. NLP still needs improvement in “deal[ing] with irony, humor and other linguistic, psychological, anthropologic and cultural issues” (Iaconesi, 2015, p. 16), which is a difficult task for human analysts as well. Most recently, significant progress has been made in NLP with the increasing ease of implementing pre-trained models such as ULMFiT (Howard and Ruder, 2018) and BERT (Devlin et al., 2019). While this may trigger wider adoption of NLP among researchers, it is important to validate NLP analyses against results reached by traditional methods.
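For a sense of how low the barrier to using pre-trained models has become, the snippet below applies one through the Hugging Face transformers library; the default checkpoint for this task is a distilled BERT variant fine-tuned for sentiment, and the printed output is indicative only.

```python
# Applying a pre-trained transformer with no task-specific training.
# Model weights are downloaded and cached on first run.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The new bike lanes made my commute so much easier."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```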

Furthermore, people who study cities usually do not have professional training or a background in computer science. As a result, the complexity of detecting patterns, fitting models, and training classifiers limits urban researchers' ability to take full advantage of NLP. This further hinders the transfer of knowledge into practice for urban planners and designers. To harness the new opportunities offered by NLP, urban researchers face the challenge of expanding their skill set. On the other hand, computer scientists who wish to conduct urban research face the challenge of comprehending sophisticated social concepts and theories. Robust collaboration among researchers in different fields is likely to drive NLP applications in the study of cities.

While some of these challenges are common to studies using NLP in all fields, others are more prominent for urban studies. Almost every researcher adopting NLP faces the challenges of acquiring good data and of immature NLP techniques; for example, addressing implicit bias is still a daunting task when building NLP models. The success of NLP applications in urban studies and other domains depends highly on the quality of data and modeling. Asking good research questions and meeting skill requirements are more specific challenges facing urban researchers who intend to use NLP in their work. The spatial aspect of urban studies further compounds the challenge of adopting NLP: for instance, while NLP is effective in harvesting location data from texts, urban researchers need to be mindful of the massiveness and messiness of such data and assess the accuracy of the uncovered geospatial information.

6. Conclusion

This systematic literature review suggests that only a limited number of urban studies have adopted NLP. Current applications fall into five areas of study: urban governance and management, public health, land use and functional zones, mobility, and urban design. Using NLP in urban research demonstrates the advantages of improving the usability of urban big data sources, expanding study areas and scales, and reducing research costs. While this new opportunity is exciting, it is important for urban researchers not to overestimate what NLP is capable of accomplishing and to acknowledge its limitations. To take advantage of NLP, urban researchers face the challenges of raising good research questions; overcoming data incompleteness, inaccessibility, and non-representativeness; coping with immature NLP techniques; and meeting computational skill requirements.

Declarations

Author contribution statement

M. Cai: Developed and wrote this article.

Funding statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data availability statement

Declaration of interests statement

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.

Acknowledgements

I am very grateful to Dr. Mark Wilson, Dr. Jason Zhao, and the two anonymous reviewers for their insightful suggestions and comments on this paper.

  • Abali, G., Karaarslan, E., Hurriyetoglu, A., Dalkilic, F. (2018). Detecting citizen problems and their locations using Twitter data. 2018 6th International Istanbul Smart Grids and Cities Congress and Fair (ICSG), 30–33.
  • Al-garadi, M.A., Khan, M.S., Varathan, K.D., Mujtaba, G., Al-Kabsi, A.M. (2016). Using online social networks to track a pandemic: a systematic review. J. Biomed. Inf. 62, 1–11.
  • Bardhan, R., Sunikka-Blank, M., Haque, A.N. (2019). Sentiment analysis as tool for gender mainstreaming in slum rehabilitation housing management in Mumbai, India. Habitat Int. 92, 102040.
  • Bowden, L.W. (1975). Urban environments: inventory and analysis. In Manual of Remote Sensing, Vol. 12, American Society of Photogrammetry, 1815–1880.
  • Breheny, M.J., Batey, P.W.J. (1981). The history of planning methodology: a preliminary sketch. Built Environ. 7(2), 109–120.
  • Britton, B.K. (1978). Lexical ambiguity of words used in English text. Behav. Res. Methods Instrum. 10(1), 1–7.
  • Building Design Partnership (1991). Urban design in practice. Urban Design Quarterly 40.
  • Carenini, M., Whyte, A., Bertorello, L., Vanocchi, M. (2007). Improving communication in e-democracy using natural language processing. IEEE Intell. Syst. 22(1), 20–27.
  • CDC (2018, November 16). Interpretation of epidemic (epi) curves during ongoing outbreak investigations. Centers for Disease Control and Prevention. https://www.cdc.gov/foodsafety/outbreaks/investigating-outbreaks/epi-curves.html
  • Cruz, N.F. da, Rode, P., McQuarrie, M. (2019). New urban governance: a review of current themes and future priorities. J. Urban Aff. 41(1), 1–19.
  • Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.
  • Estévez-Ortiz, F.-J., García-Jiménez, A., Glösekötter, P. (2016). An application of people's sentiment from social media to smart cities. El Prof. Inf. 25(6), 851.
  • European Commission (2015, June 13). Electronic Democracy European Network | EDEN project. CORDIS | European Commission. https://cordis.europa.eu/project/rcn/57135/factsheet/en
  • Fu, C., McKenzie, G., Frias-Martinez, V., Stewart, K. (2018). Identifying spatiotemporal urban activities through linguistic signatures. Comput. Environ. Urban Syst. 72, 25–37.
  • Gandomi, A., Haider, M. (2015). Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manag. 35(2), 137–144.
  • Ghosh, S., Gunning, D. (2019). Natural Language Processing Fundamentals: Build Intelligent Applications that Can Interpret the Human Language to Deliver Impactful Results. Packt Publishing Ltd.
  • Guetterman, T.C., Chang, T., DeJonckheere, M., Basu, T., Scruggs, E., Vydiswaran, V.V. (2018). Augmenting qualitative text analysis with natural language processing: methodological study. J. Med. Internet Res. 20(6).
  • Helderop, E., Huff, J., Morstatter, F., Grubesic, A., Wallace, D. (2019). Hidden in plain sight: a machine learning approach for detecting prostitution activity in Phoenix, Arizona. Appl. Spat. Anal. Pol. 12(4), 941–963.
  • Hey, T., Tansley, S., Tolle, K. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/
  • Hirschberg, J., Manning, C.D. (2015). Advances in natural language processing. Science 349(6245), 261–266.
  • Hong, L., Fu, C., Wu, J., Frias-Martinez, V. (2018). Information needs and communication gaps between citizens and local governments online during natural disasters. Inf. Syst. Front. 20(5), 1027–1039.
  • Howard, J., Ruder, S. (2018). Universal language model fine-tuning for text classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 328–339.
  • Hu, Y., Deng, C., Zhou, Z. (2019). A semantic and sentiment analysis on online neighborhood reviews for understanding the perceptions of people toward their living environments. Ann. Assoc. Am. Geogr. 109(4), 1052–1073.
  • Hu, Y., Mao, H., McKenzie, G. (2019). A natural language processing and geospatial clustering framework for harvesting local place names from geotagged housing advertisements. Int. J. Geogr. Inf. Sci. 33(4), 714–738.
  • Huang, L., Wu, Y., Zheng, Q., Zheng, Q., Zheng, X., Gan, M., Wang, K., Shahtahmassebi, A., Deng, J., Wang, J., Zhang, J. (2018). Quantifying the spatiotemporal dynamics of industrial land uses through mining free access social datasets in the mega Hangzhou Bay region, China. Sustainability 10(10), 3463.
  • Iaconesi, S. (2015). Emotional landmarks in cities. Sociologica 9(3), 22.
  • Imran, M., Elbassuoni, S., Castillo, C., Diaz, F., Meier, P. (2013). Practical extraction of disaster-relevant information from social media. Proceedings of the 22nd International Conference on World Wide Web – WWW '13 Companion, 1021–1024.
  • Jang, K.M., Kim, Y. (2019). Crowd-sourced cognitive mapping: a new way of displaying people's cognitive perception of urban space. PLoS One 14(6).
  • Lai, Y., Kontokosta, C.E. (2019). Topic modeling to discover the thematic structure and spatial-temporal patterns of building renovation and adaptive reuse in cities. Comput. Environ. Urban Syst. 78, 101383.
  • Li, Y., Fei, T., Zhang, F. (2019). A regionalization method for clustering and partitioning based on trajectories from NLP perspective. Int. J. Geogr. Inf. Sci. 33(12), 2385–2405.
  • Lindsay, B.R. (2011). Social Media and Disasters: Current Uses, Future Options, and Policy Considerations, p. 13.
  • Liu, K., Gao, S., Lu, F. (2019). Identifying spatial interaction patterns of vehicle movements on urban road networks by topic modelling. Comput. Environ. Urban Syst. 74, 50–61.
  • Liu, K., Gao, S., Qiu, P., Liu, X., Yan, B., Lu, F. (2017). Road2Vec: measuring traffic interactions in urban road system from massive travel routes. ISPRS Int. J. Geo Inf. 6(11), 321.
  • Lynch, K. (1960). The Image of the City. MIT Press.
  • Markou, I., Kaiser, K., Pereira, F.C. (2019). Predicting taxi demand hotspots using automated internet search queries. Transport. Res. C Emerg. Technol. 102, 73–86.
  • Moscato, U., Poscia, A. (2015). Urban public health. In Boccia, S., Villari, P., Ricciardi, W. (eds.), A Systematic Review of Key Issues in Public Health. Springer International Publishing, 223–247.
  • Nagar, R., Yuan, Q., Freifeld, C.C., Santillana, M., Nojima, A., Chunara, R., Brownstein, J.S. (2014). A case study of the New York City 2012–2013 influenza season with daily geocoded Twitter data from temporal and spatiotemporal perspectives. J. Med. Internet Res. 16(10).
  • Philipson W.R. Vol. 2. American Society of Photogrammetry and Remote Sensing; 1997. Urban analysis and planning; pp. 517–554. (Manual of Photographic Interpretation). [ Google Scholar ]
  • Pissourios I.A. Survey methodologies of urban land uses: an oddment of the past, or a gap in contemporary planning theory? Land Use Pol. 2019; 83 :403–411. [ Google Scholar ]
  • Rahimi S., Mottahedi S., Liu X. The geography of taste: using Yelp to study urban culture. ISPRS Int. J. Geo Inf. Basel. 2018; 7 (9) [ Google Scholar ]
  • Ramadier T. Transdisciplinarity and its challenges: the case of urban studies. Futures. 2004; 36 (4):423–439. [ Google Scholar ]
  • Riga M., Karatzas K. Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14) - WIMS ’14, 1–7. 2014. Investigating the relationship between social media content and real-time observations for urban air quality and public health. [ Google Scholar ]
  • Salganik M. Princeton University Press; 2018. Bit By Bit: Social Research in the Digital Age (Open Review Edition) https://www.bitbybitbook.com/en/preface/ [ Google Scholar ]
  • Serna A., Gerrikagoitia J.K., Bernabé U., Ruiz T. Sustainability analysis on urban mobility based on social media content. Transp. Res. Proc. 2017; 24 :1–8. [ Google Scholar ]
  • Souza A., Figueredo M., Cacho N., Araujo D., Coelho J., Prolo C.A. 2016 IEEE International Smart Cities Conference (ISC2), 1–6. 2016. Social smart city: a platform to analyze social streams in smart city initiatives. [ Google Scholar ]
  • Urban Land Institute . 2019. Urban Technology Framework. https://ulidigitalmarketing.blob.core.windows.net/ulidcnc/2019/05/ULI-Urban-Technology-Framework-2019.pdf [ Google Scholar ]
  • Vargas-Calderón V., Camargo J.E. Characterization of citizens using word2vec and latent topic analysis in a large set of tweets. Cities. 2019; 92 :187–196. [ Google Scholar ]
  • Vaughan J.P., Morrow R.H., Organization W.H. World Health Organization; 1989. Manual of Epidemiology for District Health Management. http://apps.who.int/iris/handle/10665/37032 [ Google Scholar ]
  • Wakamiya S., Kawai Y., Aramaki E. Twitter-based influenza detection after flu peak via tweets with indirect information: text mining study. JMIR Publ. Health Surv. 2018; 4 (3):e65. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Yao Y., Li X., Liu X., Liu P., Liang Z., Zhang J., Mai K. Sensing spatial distribution of urban land use by integrating points-of-interest and Google Word2Vec model. Int. J. Geogr. Inf. Sci. 2017; 31 (4):825–848. [ Google Scholar ]
  • Yuan J., Zheng Y., Xie X. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD. Vol. 12. 2012. Discovering regions of different functions in a city using human mobility and POIs; p. 186. [ Google Scholar ]
  • Yuan N.J., Zheng Y., Xie X., Wang Y., Zheng K., Xiong H. Discovering urban functional zones using latent activity trajectories. IEEE Trans. Knowl. Data Eng. 2015; 27 (3):712–725. [ Google Scholar ]

Natural language processing research: Signed languages

Advancements in natural language processing (NLP) enable computers to understand what humans say and help people communicate through tools like machine translation, voice-controlled assistants and chatbots.

But NLP research often focuses only on spoken languages, excluding the more than 200 signed languages around the world and the roughly 70 million people who may rely on them to communicate.

Kayo Yin, a master's student in the Language Technologies Institute, wants that to change. Yin co-authored a paper that called for NLP research to include signed languages.

"Signed languages, even though they are a significant part of the languages used in the world, aren't included," Yin said. "There is a demand and an importance in having technology that can handle signed languages."

The paper, "Including Signed Languages in Natural Language Processing," won the Best Theme Paper award at this month's 59th Annual Meeting of the Association for Computational Linguistics. Yin's co-authors included Amit Moryossef of Bar-Ilan University in Israel; Julie Hochgesang of Gallaudet University; Yoav Goldberg of Bar-Ilan University and the Allen Institute for AI; and Malihe Alikhani of the University of Pittsburgh's School of Computing and Information.

The authors wrote that communities relying on signed language have fought for decades both to learn and use those languages, and for them to be recognized as legitimate.

"However, in a predominantly oral society, deaf people are constantly encouraged to use spoken languages through lipreading or text-based communication," the authors wrote. "The exclusion of signed languages from modern language technologies further suppresses signing in favor of spoken languages."

Yin first became interested in sign language during outreach work at a homeless shelter while she was an undergraduate at École Polytechnique in Paris. There, she met a deaf woman and saw how difficult it was for her to establish social connections with others. Yin started learning French sign language and pursued sign language translation as part of her undergraduate research.

Once at the LTI, she noticed that almost all NLP research addressed only spoken languages. Computer vision research sought to understand signed languages but often lost the linguistic properties that signed languages share with spoken languages.

Signed languages use hand gestures, facial expressions, and head and body movements and can convey multiple words at once. For example, someone could sign "I am happy," but shake their head while doing it to indicate that they are not happy. Signed languages also employ shortcuts similar to the use of pronouns in spoken languages. Natural language processing tools are better equipped than computer vision methods alone to handle these types of complexities.

"We need researchers in both fields to work hand in hand," Yin said. "We can't fully understand signed language if we only look at the visuals."

Hochgesang, a deaf linguist who focuses on signed languages, said that when she was studying for her degree, there was barely any mention of signed languages in the literature, in her linguistics classes and in research like NLP. Language was speech; other methods of expressing language were ignored.

"On a personal scale, this hurt. It completely ignored my way of being," Hochgesang said. "When I was a student, I didn't see myself in the data being described and that made it really hard for me to connect. That it still hasn't improved much these days is unfortunate. The only way this kind of thing will change is if we are included more."

Yin said the paper was well received by both natural language processing researchers and people studying and using signed languages -- the two groups she sought to bring together.

"It's really exciting to see a paper the I wrote motivate people, and I hope can make a change in these communities," Yin said.

Story Source:

Materials provided by Carnegie Mellon University. Original written by Aaron Aupperlee. Note: Content may be edited for style and length.

FURTHER READING

  1. Natural Language Processing

    Papers With Code tracks the latest trending ML papers with code, research developments, libraries, methods, and datasets. Its Natural Language Processing area currently lists 2297 benchmarks, 655 tasks, 1976 datasets, and 26424 papers with code; classification alone accounts for 368 benchmarks.

  2. [2112.11739] A Survey of Natural Language Generation

    A Survey of Natural Language Generation. Chenhe Dong, Yinghui Li, Haifan Gong, Miaoxin Chen, Junxin Li, Ying Shen, Min Yang. This paper offers a comprehensive review of the research on Natural Language Generation (NLG) over the past two decades, especially in relation to data-to-text generation and text-to-text generation deep learning methods ...

  3. natural language processing Latest Research Papers

    Image captioning refers to the process of generating a textual description that describes objects and activities present in a given image. It connects two fields of artificial intelligence, computer vision and natural language processing, which deal with image understanding and language ...

  4. Vision, status, and research topics of Natural Language Processing

    The field of Natural Language Processing (NLP) has evolved with, as well as influenced, recent advances in Artificial Intelligence (AI) and computing technologies, opening up new applications and novel interactions with humans. ... Fig. 1 shows the trend of NLP-related scientific papers from 1999 to 2021, with an overall growing ...

  5. [2111.01243] Recent Advances in Natural Language Processing via Large

    Large, pre-trained transformer-based language models such as BERT have drastically changed the Natural Language Processing (NLP) field. We present a survey of recent work that uses these large language models to solve NLP tasks via pre-training then fine-tuning, prompting, or text generation approaches. We also present approaches that use pre-trained language models to generate data for ... (A minimal fine-tuning sketch appears after this list.)

  6. NLP Research: Top Papers from 2021 So Far

    This paper introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological ... (A short Stanza usage sketch appears after this list.)

  7. Conference on Empirical Methods in Natural Language Processing (2021

    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations 43 papers. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts 7 papers. Findings of the Association for Computational Linguistics: EMNLP 2021 425 papers.

  8. Most Popular NLP Papers Of 2021

    Published by Harvard University graduate Stephen Merity, the paper 'Single Headed Attention RNN: Stop thinking with your head' introduces a state-of-the-art NLP model called the Single Headed Attention RNN, or SHA-RNN. The author uses the example of an LSTM model with SHA to achieve state-of-the-art, byte-level language ...

  9. Annual Meeting of the Association for Computational Linguistics (2021

    Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers): 572 papers. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers ...

  10. Deep Learning in Natural Language Processing: A State-of-the-Art Survey

    Deep learning has attracted the interest of the research community through its overwhelming success in information-processing tasks such as video and speech recognition. In this paper, we provide a state-of-the-art analysis of deep learning and its applications in an important direction: natural language processing. We attempt to provide a clear and critical summarization for researchers and ...

  11. Graph Neural Networks for Natural Language Processing: A Survey

    Deep learning has become the dominant approach in coping with various tasks in Natural Language Processing (NLP). Although text inputs are typically represented as a sequence of tokens, there is a rich variety of NLP problems that can be best expressed with a graph structure. As a result, there is a surge of interest in developing new deep learning techniques on graphs for a large number of NLP ...

  12. Natural Language Processing and Its Applications in ...

    As an essential part of artificial intelligence technology, natural language processing is rooted in multiple disciplines such as linguistics, computer science, and mathematics. The rapid advancements in natural language processing provide strong support for machine translation research. This paper first introduces the key concepts and main content of natural language processing, and briefly ...

  13. Natural Language Processing

    Natural Language Processing (NLP) research at Google focuses on algorithms that apply at scale, across languages, and across domains. Our systems are used in numerous ways across Google, impacting user experience in search, mobile, apps, ads, translate and more. Our work spans the range of traditional NLP tasks, with general-purpose syntax and ...

  14. Natural language processing: state of the art, current trends and

    Natural language processing (NLP) has recently gained much attention for representing and analyzing human language computationally. It has spread its applications in various fields such as machine translation, email spam detection, information extraction, summarization, medical, and question answering etc. In this paper, we first distinguish four phases by discussing different levels of NLP ...

  15. The State of the Art of Natural Language Processing—A Systematic

    Nowadays, natural language processing (NLP) is one of the most popular areas of, broadly understood, artificial intelligence. Therefore, every day, new research contributions are posted, for instance, to the arXiv repository. Hence, it is rather difficult to capture the current "state of the field" and thus, to enter it. This motivated applying state-of-the-art NLP techniques to analyse the NLP ...

  16. A systematic review of applications of natural language processing and

    The paper presents a systematic literature review of the existing literature published between 2005 and 2021 in TBED (text-based emotion detection). This review has meticulously examined 63 research papers from the IEEE, Science Direct, Scopus, and Web of Science databases to address four primary research questions.

  17. [2107.13586] Pre-train, Prompt, and Predict: A Systematic Survey of

    This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning". Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly. To use these models to perform prediction tasks ... (A small prompting sketch appears after this list.)

  18. Natural Language Processing and Computational Linguistics

    As an engineering field, research on natural language processing (NLP) is much more constrained by currently available resources and technologies, compared with theoretical work on computational linguistics (CL). In today's technology-driven society, it is almost impossible to imagine the degree to which computational resources, the capacity of secondary and main storage, and software ...

  19. Georgia Tech at EMNLP 2021

    Welcome to Georgia Tech's virtual experience for Empirical Methods in Natural Language Processing 2021 (EMNLP). Look around, explore, and discover insights into the institute's research contributions at the conference, taking place Nov. 7-11. EMNLP is a leading venue for the areas of natural language processing and artificial ...

  20. A Systematic Literature Review of Natural Language Processing: Current

    In this research paper, a comprehensive literature review was undertaken in order to analyze Natural Language Processing (NLP) applications across different domains. Also, by conducting qualitative research, we try to analyze the development of the current state and the challenges of NLP technology as a key for Artificial Intelligence (AI ...

  21. Datasets: A Community Library for Natural Language Processing

    The scale, variety, and quantity of publicly-available NLP datasets have grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for ... (A small interface sketch appears after this list.)

  22. Natural Language Processing: History, Evolution, Application, and

    On Jan 1, 2021, Prashant Johri and others published "Natural Language Processing: History, Evolution, Application, and Future Work" (available on ResearchGate).

  23. Including Signed Languages in Natural Language Processing

    This position paper calls on the NLP community to include signed languages as a research area with high social and scientific impact. We first discuss the linguistic properties of signed languages to consider during their modeling. Then, we review the limitations of current SLP models and identify the open challenges to extend NLP to signed ...

  24. Natural Language Processing with Improved Deep Learning ...

    As one of the core tasks in the field of natural language processing, syntactic analysis has always been a hot topic for researchers, including tasks such as Question and Answer (Q&A), search string comprehension, semantic analysis, and knowledge base construction. This paper aims to study the application of deep learning and neural networks in natural language syntax analysis, which has ...

  25. Natural language processing for urban research: A systematic review

    Natural language processing (NLP) has demonstrated tremendous capabilities of harvesting the abundance of textual data. As a form of artificial intelligence, it uses computational algorithms to learn, understand, and produce human language content (Hirschberg and Manning, 2015). It is interrelated with machine learning and deep learning.

  26. Natural language processing research: Signed languages

    Carnegie Mellon University. (2021, August 9). Natural language processing research: Signed languages. ScienceDaily. Retrieved March 10, 2024 from www.sciencedaily.com/releases/2021/08 ...
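
Item 5 above summarizes the pre-train-then-fine-tune recipe that BERT popularized. The following is a minimal sketch of that recipe using the Hugging Face transformers and datasets libraries; the checkpoint, the dataset (GLUE SST-2), and the hyperparameters are illustrative assumptions, not anything the survey prescribes.

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # A pre-trained checkpoint and a sentiment benchmark (illustrative choices).
    dataset = load_dataset("glue", "sst2")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["sentence"], truncation=True, padding="max_length")

    encoded = dataset.map(tokenize, batched=True)

    # Fine-tune the whole model on the labeled task; hyperparameters are not tuned.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="sst2-bert", num_train_epochs=1,
                               per_device_train_batch_size=16),
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],
    )
    trainer.train()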
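
Item 6 above describes Stanza's fully neural pipeline. A short usage sketch of the documented API, assuming the English models can be downloaded; the input sentence is arbitrary:

    import stanza

    stanza.download("en")  # fetch the English models once
    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")

    doc = nlp("Stanza supports 66 human languages.")
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.upos, word.lemma)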
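
Item 17 above contrasts supervised P(y|x) prediction with prompt-based learning, where a language model scores text directly. Here is a minimal sketch of the idea using a masked language model; the prompt wording and the verbalizer mapping are our own illustrative choices, not ones the survey prescribes:

    from transformers import pipeline

    # Recast sentiment classification as cloze completion: the masked LM scores
    # candidate fill-ins instead of a classifier head predicting P(y|x).
    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    prompt = "The movie was absolutely wonderful. Overall it was [MASK]."
    verbalizer = {"great": "positive", "terrible": "negative"}  # illustrative mapping

    scores = {}
    for candidate in unmasker(prompt, targets=list(verbalizer)):
        scores[verbalizer[candidate["token_str"]]] = candidate["score"]
    print(max(scores, key=scores.get))  # -> "positive"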
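
Item 21 above describes the standardized interface of the Hugging Face Datasets library. A small sketch of what that interface looks like in practice, using IMDB as an arbitrary example dataset:

    from datasets import load_dataset

    imdb = load_dataset("imdb")           # resolves the dataset, version, and splits
    print(imdb)                           # DatasetDict: train / test / unsupervised
    print(imdb["train"].features)         # typed schema shared across datasets
    print(imdb["train"][0]["text"][:80])  # plain dict-style row access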