Title: Data Engineering for Everyone
Abstract: Data engineering is one of the fastest-growing fields within machine learning (ML). As ML becomes more common, the appetite for data grows more ravenous. But ML requires more data than individual teams of data engineers can readily produce, which presents a severe challenge to ML deployment at scale. Much like the software-engineering revolution, where mass adoption of open-source software replaced the closed, in-house development model for infrastructure code, there is a growing need to enable rapid development and open contribution to massive machine learning data sets. This article shows that open-source data sets are the rocket fuel for research and innovation at even some of the largest AI organizations. Our analysis of nearly 2000 research publications from Facebook, Google and Microsoft over the past five years shows the widespread use and adoption of open data sets. Open data sets that are easily accessible to the public are vital to accelerating ML innovation for everyone. But such open resources are scarce in the wild. So, what if we could accelerate data set creation via automatic data set generation tools?
Category-Level Contrastive Learning for Unsupervised Hashing in Cross-Modal Retrieval
- Mengying Xu
An Overview Based on the Overall Architecture of Traffic Forecasting
- Xuehua Liao
Uncovering Flat and Hierarchical Topics by Community Discovery on Word Co-occurrence Network
- Eric Austin
- Shraddha Makwana
- Osmar R. Zaïane
A Meta-adversarial Framework for Cross-Domain Cold-Start Recommendation
- Shaoqing Wang
Special Issue Editorial on “The Innovative Use of Data Science to Transform How We Work and Live”
- Yee Ling Boo
- Manik Gupta
- Philippe Fournier-Viger
Leveraging Semantic Information for Enhanced Community Search in Heterogeneous Graphs
- Guosheng Zang
- Tingjian Ge
FL-GUARD: A Holistic Framework for Run-Time Detection and Recovery of Negative Federated Learning
Welcome to a New Era of the Data Science and Engineering Journal (DSE)
Efficient Top-k Frequent Itemset Mining on Massive Data
- Xiaolong Wan
Where To Go at the Next Timestamp
- Xiangfu Meng
- Guihong Liu
Construct and Query A Fine-Grained Geospatial Knowledge Graph
DB-GPT: Large Language Model Meets Database
- Xuanhe Zhou
- Zhaoyan Sun
- Guoliang Li
Explicit Behavior Interaction with Heterogeneous Graph for Multi-behavior Recommendation
- Zhongping Zhang
AIoT-CitySense: AI and IoT-Driven City-Scale Sensing for Roadside Infrastructure Maintenance
- Abdur Rahim Mohammad Forkan
- Yong-Bin Kang
- Prem Prakash Jayaraman
Anomaly Detection with Sub-Extreme Values: Health Provider Billing
- Rob Muspratt
- Musa Mammadov
Graph Neural Network-Based Short-Term Load Forecasting with Temporal Convolution
- Chenchen Sun
- Tiezheng Nie
Joint Representation Learning with Generative Adversarial Imputation Network for Improved Classification of Longitudinal Data
- Sharon Torao Pingi
- Duoyi Zhang
- Richi Nayak
A Reinduction-Based Approach for Efficient High Utility Itemset Mining from Incremental Datasets
- Satish Chand
Few-Shot Relation Prediction of Knowledge Graph via Convolutional Neural Network with Self-Attention
- Shanna Zhong
- Jiahui Wang
Efficient Network Representation Learning via Cluster Similarity
- Yasuhiro Fujiwara
- Yasutoshi Ida
- Naonori Ueda
An Efficient Keywords Search in Temporal Social Networks
- Zitong Chen
Special Issue of DASFAA 2023
- Maria Luisa Sapino
- Hongzhi Yin
Fully Dynamic Contraction Hierarchies with Label Restrictions on Road Networks
- Liping Wang
Learning with Small Data: Subgraph Counting Queries
- Kangfei Zhao
A Survey of Personalized News Recommendation
- Hongjin Huo
Combining Graph Contrastive Embedding and Multi-head Cross-Attention Transfer for Cross-Domain Recommendation
- Dongqing Zhu
- Zhenzhen Huang
Personalized Re-ranking for Recommendation with Mask Pretraining
Deep Learning-Based Bloom Filter for Efficient Multi-key Membership Testing
- Haitian Chen
Evolving Interest with Feature Co-action Network for CTR Prediction
- Zhiyang Yuan
- Wenguang Zheng
- Yingyuan Xiao
SSTP: Social and Spatial-Temporal Aware Next Point-of-Interest Recommendation
- Junzhuang Wu
- Yujing Zhang
- Zhenyu Zhang
A Neural Inference of User Social Interest for Item Recommendation
- Junyang Chen
- Zhiguo Gong
A One-Size-Fits-Three Representation Learning Framework for Patient Similarity Search
- Yefan Huang
Signal Contrastive Enhanced Graph Collaborative Filtering for Recommendation
- Zhi-Yuan Li
- Man-Sheng Chen
- Chang-Dong Wang
PosKHG: A Position-Aware Knowledge Hypergraph Model for Link Prediction
A Framework to Maximize Group Fairness for Workers on Online Labor Platforms
- Anis El Rabaa
- Shady Elbassuoni
- Sihem Amer-Yahia
A Survey of Advanced Information Fusion System: from Model-Driven to Knowledge-Enabled
- Hailian Yin
A Survey on the Integration of Blockchains and Databases
- Changhao Zhu
- Meihui Zhang
UMP-MG: A Uni-directed Message-Passing Multi-label Generation Model for Hierarchical Text Classification
- Shuangyong Song
Improving Gender-Related Fairness in Sentence Encoders: A Semantics-Based Approach
- Tommaso Dolci
- Fabio Azzalini
- Mara Tanelli
Probing the Impacts of Visual Context in Multimodal Entity Alignment
- Yinghui Shi
- Yefeng Zheng
Memory-Enhanced Transformer for Representation Learning on Temporal Heterogeneous Graphs
- Zhaohang Luo
Learning Weight Signed Network Embedding with Graph Neural Networks
- Qiancheng Yu
- Qinwen Yang
Construct Trip Graphs by Using Taxi Trajectory Data
- Taohong Zhang
A Personalized Explainable Learner Implicit Friend Recommendation Method
- Chunying Li
- Bingyang Zhou
A Novel Link Prediction Framework Based on Gravitational Field
- Yanlin Yang
- Zhonglin Ye
Multi-Model Fusion-Based Hierarchical Extraction for Chinese Epidemic Event
- Zenghua Liao
- Zongqiang Yang
A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce
- Guozheng Wang
- Yongmei Lei
A Multi-level Mesh Mutual Attention Model for Visual Question Answering
- Guixian Zhang
- Rongjiao Liang
Joint Attention Networks with Inherent and Contextual Preference-Awareness for Successive POI Recommendation
- Haiting Zhong
An Efficient Algorithm of Star Subgraph Queries on Urban Traffic Knowledge Graph
Lei Zhao. Research Paper. Open access. 12 October 2022. Pages: 316-327. Part of 1 collection: Information Extraction.
Data Science and Engineering is a peer-reviewed, open access journal focusing on theoretical background and advanced engineering approaches in data science ...