Computer Science > Machine Learning

Title: Data Engineering for Everyone

Abstract: Data engineering is one of the fastest-growing fields within machine learning (ML). As ML becomes more common, the appetite for data grows more ravenous. But ML requires more data than individual teams of data engineers can readily produce, which presents a severe challenge to ML deployment at scale. Much like the software-engineering revolution, where mass adoption of open-source software replaced the closed, in-house development model for infrastructure code, there is a growing need to enable rapid development and open contribution to massive machine learning data sets. This article shows that open-source data sets are the rocket fuel for research and innovation at even some of the largest AI organizations. Our analysis of nearly 2000 research publications from Facebook, Google and Microsoft over the past five years shows the widespread use and adoption of open data sets. Open data sets that are easily accessible to the public are vital to accelerating ML innovation for everyone. But such open resources are scarce in the wild. So, what if we are able to accelerate data-set creation via automatic data set generation tools?

research papers on data engineering

Category-Level Contrastive Learning for Unsupervised Hashing in Cross-Modal Retrieval

  • Mengying Xu

An Overview Based on the Overall Architecture of Traffic Forecasting

  • Xuehua Liao

Uncovering Flat and Hierarchical Topics by Community Discovery on Word Co-occurrence Network

  • Eric Austin
  • Shraddha Makwana
  • Osmar R. Zaïane

A Meta-adversarial Framework for Cross-Domain Cold-Start Recommendation

  • Shaoqing Wang

Special Issue Editorial on “The Innovative Use of Data Science to Transform How We Work and Live”

  • Yee Ling Boo
  • Manik Gupta
  • Philippe Fournier-Viger

Leveraging Semantic Information for Enhanced Community Search in Heterogeneous Graphs

  • Guosheng Zang
  • Tingjian Ge

FL-GUARD: A Holistic Framework for Run-Time Detection and Recovery of Negative Federated Learning

Welcome to a New Era of the Data Science and Engineering Journal (DSE)

Efficient Top- k Frequent Itemset Mining on Massive Data

  • Xiaolong Wan

Where To Go at the Next Timestamp

  • Xiangfu Meng
  • Guihong Liu

Construct and Query A Fine-Grained Geospatial Knowledge Graph

DB-GPT: Large Language Model Meets Database

  • Xuanhe Zhou
  • Zhaoyan Sun
  • Guoliang Li

Explicit Behavior Interaction with Heterogeneous Graph for Multi-behavior Recommendation

  • Zhongping Zhang

AIoT-CitySense: AI and IoT-Driven City-Scale Sensing for Roadside Infrastructure Maintenance

  • Abdur Rahim Mohammad Forkan
  • Yong-Bin Kang
  • Prem Prakash Jayaraman

Anomaly Detection with Sub-Extreme Values: Health Provider Billing

  • Rob Muspratt
  • Musa Mammadov

Graph Neural Network-Based Short‑Term Load Forecasting with Temporal Convolution

  • Chenchen Sun
  • Tiezheng Nie

Joint Representation Learning with Generative Adversarial Imputation Network for Improved Classification of Longitudinal Data

  • Sharon Torao Pingi
  • Duoyi Zhang
  • Richi Nayak

A Reinduction-Based Approach for Efficient High Utility Itemset Mining from Incremental Datasets

  • Satish Chand

Few-Shot Relation Prediction of Knowledge Graph via Convolutional Neural Network with Self-Attention

  • Shanna Zhong
  • Jiahui Wang

Efficient Network Representation Learning via Cluster Similarity

  • Yasuhiro Fujiwara
  • Yasutoshi Ida
  • Naonori Ueda

An Efficient Keywords Search in Temporal Social Networks

  • Zitong Chen

Special Issue of DASFAA 2023

  • Maria Luisa Sapino
  • Hongzhi Yin

Fully Dynamic Contraction Hierarchies with Label Restrictions on Road Networks

  • Liping Wang

Learning with Small Data: Subgraph Counting Queries

  • Kangfei Zhao

A Survey of Personalized News Recommendation

  • Hongjin Huo

Combining Graph Contrastive Embedding and Multi-head Cross-Attention Transfer for Cross-Domain Recommendation

  • Dongqing Zhu
  • Zhenzhen Huang

Personalized Re-ranking for Recommendation with Mask Pretraining

Deep Learning-Based Bloom Filter for Efficient Multi-key Membership Testing

  • Haitian Chen

Evolving Interest with Feature Co-action Network for CTR Prediction

  • Zhiyang Yuan
  • Wenguang Zheng
  • Yingyuan Xiao

SSTP: Social and Spatial-Temporal Aware Next Point-of-Interest Recommendation

  • Junzhuang Wu
  • Yujing Zhang
  • Zhenyu Zhang

A Neural Inference of User Social Interest for Item Recommendation

  • Junyang Chen
  • Zhiguo Gong

A One-Size-Fits-Three Representation Learning Framework for Patient Similarity Search

  • Yefan Huang

Signal Contrastive Enhanced Graph Collaborative Filtering for Recommendation

  • Zhi-Yuan Li
  • Man-Sheng Chen
  • Chang-Dong Wang

PosKHG: A Position-Aware Knowledge Hypergraph Model for Link Prediction

A Framework to Maximize Group Fairness for Workers on Online Labor Platforms

  • Anis El Rabaa
  • Shady Elbassuoni
  • Sihem Amer-Yahia

A Survey of Advanced Information Fusion System: from Model-Driven to Knowledge-Enabled

  • Hailian Yin

A Survey on the Integration of Blockchains and Databases

  • Changhao Zhu
  • Meihui Zhang

UMP-MG: A Uni-directed Message-Passing Multi-label Generation Model for Hierarchical Text Classification

  • Shuangyong Song

Improving Gender-Related Fairness in Sentence Encoders: A Semantics-Based Approach

  • Tommaso Dolci
  • Fabio Azzalini
  • Mara Tanelli

Probing the Impacts of Visual Context in Multimodal Entity Alignment

  • Yinghui Shi
  • Yefeng Zheng

Memory-Enhanced Transformer for Representation Learning on Temporal Heterogeneous Graphs

  • Zhaohang Luo

Learning Weight Signed Network Embedding with Graph Neural Networks

  • Qiancheng Yu
  • Qinwen Yang

Construct Trip Graphs by Using Taxi Trajectory Data

  • Taohong Zhang

A Personalized Explainable Learner Implicit Friend Recommendation Method

  • Chunying Li
  • Bingyang Zhou

A Novel Link Prediction Framework Based on Gravitational Field

  • Yanlin Yang
  • Zhonglin Ye

Multi-Model Fusion-Based Hierarchical Extraction for Chinese Epidemic Event

  • Zenghua Liao
  • Zongqiang Yang

A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce

  • Guozheng Wang
  • Yongmei Lei

A Multi-level Mesh Mutual Attention Model for Visual Question Answering

  • Guixian Zhang
  • Rongjiao Liang

Joint Attention Networks with Inherent and Contextual Preference-Awareness for Successive POI Recommendation

  • Haiting Zhong

An Efficient Algorithm of Star Subgraph Queries on Urban Traffic Knowledge Graph

IMAGES

  1. International Journal of Scientific Engineering and Technology (IJSET

  2. (PDF) Experimental Research Methods for Students in Built Environment

  3. Scientific Research Paper Sample

  4. (PDF) Introduction to Data Science and Engineering

  5. Draft For Research Paper Example : How to Write an APA Research Paper

  6. IJSER Publishing

VIDEO

  1. SCAM: All Teachers & Students, Beware

  2. Introduction to Data Engineering in ArcGIS Pro

  3. Introduction to Data Engineering

  4. Zenodo Datasets Repository

  5. SQL for data engineers

  6. Data Engineering in E-Commerce

COMMENTS

  1. [2102.11447] Data Engineering for Everyone

  2. Articles

    Lei Zhao. Research Paper. Open access. 12 October 2022. Pages: 316-327. Part of 1 collection: Information Extraction. Data Science and Engineering is a peer-reviewed, open access journal focusing on theoretical background and advanced engineering approaches in data science ...