Web data: Amazon movie reviews
Dataset information.
This dataset consists of movie reviews from Amazon. The data span a period of more than 10 years, including all ~8 million reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories.
Source (citation)
- J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.
Data format
- product/productId : asin , e.g. amazon.com/dp/B00006HAXW
- review/userId : id of the user, e.g. A1RSDE90N6RSZF
- review/profileName : name of the user
- review/helpfulness : fraction of users who found the review helpful
- review/score : rating of the product
- review/time : time of the review (unix time)
- review/summary : review summary
Amazon Review Data (2018)
Jianmo Ni , UCSD
Description
- The total number of reviews is 233.1 million (142.8 million in 2014).
- Current data includes reviews in the range May 1996 - Oct 2018.
- Product information, e.g. color (white or black), size (large or small), package type (hardcover or electronics), etc.
- Product images that are taken after the user received the product.
- Bullet-point descriptions under product title.
- Technical details table (attribute-value pairs).
- Similar products table.
- Includes 5 new product categories.
You can also download the review data from our previous datasets.
Amazon review (2014)
Amazon review (2013)
Please cite the following paper if you use the data in any way:
Justifying recommendations using distantly-labeled reviews and fine-grained aspects. Jianmo Ni, Jiacheng Li, Julian McAuley. Empirical Methods in Natural Language Processing (EMNLP), 2019.
05/2021 We added high-resolution image URLs to the metadata!
08/2020 We have updated the metadata and now it includes much less HTML/CSS code. Feel free to download the updated data!
- Load the metadata (e.g. as JSON or DataFrame)
- Check if title has HTML contents and filter them
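The two steps above might look like the following minimal sketch. It assumes the metadata has already been loaded into a list of dictionaries; the helper names and the regular expression are illustrative assumptions, not the filtering rule used by the dataset authors.

```python
import re

# Heuristic for leftover markup: an HTML tag such as <span ...> or </div>.
# This pattern is an assumption for illustration; tune it to the debris you see.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")

def looks_like_html(title):
    """Return True if the title appears to contain HTML markup."""
    return bool(HTML_TAG.search(title or ""))

def drop_html_titles(products):
    """Keep only products whose 'title' field looks like plain text."""
    return [p for p in products if not looks_like_html(p.get("title"))]

# Hand-made rows for illustration (the second one is not from the dataset).
products = [
    {"asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink"},
    {"asin": "X000000001", "title": "<span class='a-size-medium'>Some item</span>"},
]
clean = drop_html_titles(products)
```

A tag-shaped pattern is used instead of a bare "<" check so that titles such as "3 < 5" are not discarded.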
We provide a colab notebook that helps you find target products and obtain their reviews!
- Unparsed HTML contents
- Duplicate items which have same reviews
Complete review data
Please only download these (large!) files if you really need them. We recommend using the smaller datasets (i.e. k-core and CSV files) as shown in the next section .
raw review data (34gb) - all 233.1 million reviews
user review data (18gb) - duplicate items removed (83.68 million reviews), sorted by user
product review data (18gb) - duplicate items removed, sorted by product
5-core (14.3gb) - subset of the data in which all users and items have at least 5 reviews (75.26 million reviews)
Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks:
aggressively deduplicated data (18gb) - no duplicates whatsoever (82.83 million reviews)
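The page does not say exactly how duplicates are keyed, so the following is only a sketch of the two levels of deduplication under an assumed rule: the ordinary pass keeps one review per (reviewer, text) pair, while the aggressive pass keeps one review per text regardless of who wrote it. The `dedupe` function and its whitespace and case normalization are hypothetical.

```python
def dedupe(reviews, aggressive=False):
    """Drop repeated reviews, keeping the first occurrence.

    aggressive=False: same reviewer and same text counts as a duplicate.
    aggressive=True:  same text counts as a duplicate even across reviewers.
    """
    seen = set()
    kept = []
    for review in reviews:
        # Normalize case and whitespace so trivial variations still match.
        text = " ".join(review.get("reviewText", "").lower().split())
        key = text if aggressive else (review.get("reviewerID"), text)
        if key in seen:
            continue
        seen.add(key)
        kept.append(review)
    return kept

# Hand-made reviews for illustration.
reviews = [
    {"reviewerID": "A", "asin": "1", "reviewText": "Great book"},
    {"reviewerID": "A", "asin": "2", "reviewText": "Great  book"},
    {"reviewerID": "B", "asin": "1", "reviewText": "great book"},
]
normal = dedupe(reviews)
strict = dedupe(reviews, aggressive=True)
```

Here the ordinary pass keeps two reviews (one per reviewer) and the aggressive pass keeps one.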
Per-category data - the review and product metadata for each category.
To download the complete review data and the per-category files, the following links will direct you to enter a form. Please contact me if you can't get access to the form.
"Small" subsets for experimentation
If you're using this data for a class project (or similar) please consider using one of these smaller datasets below before requesting the larger files.
K-cores (i.e., dense subsets): These data have been reduced to extract the k-core, such that each of the remaining users and items has at least k reviews.
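As a rough illustration of the k-core idea, the sketch below repeatedly removes users and items with fewer than k reviews until nothing changes. It works on plain (user, item) pairs; the function name and the toy data are assumptions, and this is not the script used to build the released files.

```python
from collections import Counter

def k_core(pairs, k):
    """Return the (user, item) pairs that survive k-core filtering."""
    pairs = list(pairs)
    while True:
        user_counts = Counter(user for user, _ in pairs)
        item_counts = Counter(item for _, item in pairs)
        kept = [
            (user, item)
            for user, item in pairs
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        # Removing one pair can push another user or item below k,
        # so repeat until a full pass removes nothing.
        if len(kept) == len(pairs):
            return kept
        pairs = kept

# Toy interactions: u3 has a single review and is dropped for k=2.
pairs = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i1")]
core2 = k_core(pairs, 2)
```

The loop is needed because a single filtering pass is not enough in general: dropping a sparse user can leave an item with too few reviews.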
Ratings only: These datasets include no metadata or reviews, but only (item,user,rating,timestamp) tuples. Thus they are suitable for use with mymedialite (or similar) packages.
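A minimal sketch of reading such tuples follows. It assumes a headerless CSV with the column order given above (item, user, rating, timestamp); the sample rows are made up from IDs that appear elsewhere on this page.

```python
import csv
import io
from collections import defaultdict

def read_ratings(handle):
    """Yield (item, user, rating, timestamp) tuples from a headerless CSV."""
    for item, user, rating, timestamp in csv.reader(handle):
        yield item, user, float(rating), int(timestamp)

def mean_rating_per_item(rows):
    """Average the ratings of each item."""
    totals = defaultdict(lambda: [0.0, 0])  # item -> [sum, count]
    for item, _user, rating, _timestamp in rows:
        totals[item][0] += rating
        totals[item][1] += 1
    return {item: total / count for item, (total, count) in totals.items()}

# In-memory stand-in for a ratings file (illustrative rows only).
sample = io.StringIO(
    "0000013714,A2SUAM1J3GNN3B,5.0,1252800000\n"
    "0000013714,AUI6WTTT0QZYS,4.0,1514764800\n"
)
means = mean_rating_per_item(read_ratings(sample))
```

For a real file, pass an open file handle (for example from `open(path, newline='')`) instead of the `io.StringIO` object.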
You can directly download the following smaller per-category datasets.
Data format
Format is one-review-per-line in json. See examples below for further help reading the data.
Sample review:
{
  "image": ["https://images-na.ssl-images-amazon.com/images/I/71eG75FTJJL._SY88.jpg"],
  "overall": 5.0,
  "vote": "2",
  "verified": True,
  "reviewTime": "01 1, 2018",
  "reviewerID": "AUI6WTTT0QZYS",
  "asin": "5120053084",
  "style": {
    "Size:": "Large",
    "Color:": "Charcoal"
  },
  "reviewerName": "Abbey",
  "reviewText": "I now have 4 of the 5 available colors of this shirt... ",
  "summary": "Comfy, flattering, discreet--highly recommended!",
  "unixReviewTime": 1514764800
}
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "vote": 5,
  "style": {
    "Format:": "Hardcover"
  },
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- vote - helpful votes of the review
- style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
- image - images that users post after they have received the product
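In the two sample reviews, `vote` appears once as a string ("2") and once as a number (5), so it may help to normalize it before analysis. The sketch below does that and also derives a UTC calendar date from `unixReviewTime`. The `reviewDate` key is a hypothetical name introduced here, and stripping thousands separators from `vote` is an assumption about how larger counts might be written.

```python
from datetime import datetime, timezone

def normalize_review(review):
    """Return a copy of the review with an integer vote and an ISO date."""
    out = dict(review)
    vote = out.get("vote", 0)  # reviews without the field count as 0 votes
    if isinstance(vote, str):
        vote = int(vote.replace(",", ""))  # assumption: "1,234" style counts
    out["vote"] = vote
    # Hypothetical extra field: UTC date derived from the unix timestamp.
    moment = datetime.fromtimestamp(out["unixReviewTime"], tz=timezone.utc)
    out["reviewDate"] = moment.date().isoformat()
    return out

first = normalize_review({"vote": "2", "unixReviewTime": 1514764800})
second = normalize_review({"vote": 5, "unixReviewTime": 1252800000})
```

With the sample timestamps, `first["reviewDate"]` is "2018-01-01" and `second["reviewDate"]` is "2009-09-13", matching the raw `reviewTime` values shown above.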
Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:
metadata (24gb) - metadata for 15.5 million products
Sample metadata:
{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "feature": ["Botiquecutie Trademark exclusive Brand", "Hot Pink Layered Zebra Print Tutu", "Fits girls up to a size 4T", "Hand wash / Line Dry", "Includes a Botiquecutie TM Exclusive hair flower bow"], "description": "This tutu is great for dress up play for your little ballerina. Botiquecute Trade Mark exclusive brand. Hot Pink Zebra print tutu.", "price": 3.17, "imageURL": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "imageURLHighRes": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL.jpg", "also_buy": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }
- asin - ID of the product, e.g. 0000031852
- title - name of the product
- feature - bullet-point format features of the product
- description - description of the product
- price - price in US dollars (at time of crawl)
- imageURL - url of the product image
- imageURLHighRes - url of the high resolution product image
- related - related products (also bought, also viewed, bought together, buy after viewing); the sample above stores these as also_buy and also_viewed
- salesRank - sales rank information
- brand - brand name
- categories - list of categories the product belongs to
- tech1 - the first technical detail table of the product
- tech2 - the second technical detail table of the product
- similar - similar product table
Visual Features
We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.
visual features (141gb) - visual features for all products
The images themselves can be extracted from the image field in the metadata files.
Below are files for individual product categories, which have already had duplicate item reviews removed.
Reading the data
Data can be treated as python dictionary objects. A simple script to read any of the above data is as follows:
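import gzip
import json

def parse(path):
    # Yield one dictionary per line of a gzipped JSON-lines file.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)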
Convert to 'strict' json
The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:
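import ast
import gzip
import json

def parse(path):
    # Each line is a Python literal (e.g. True rather than true), so it is
    # read with ast.literal_eval, a safer stand-in for eval.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.dumps(ast.literal_eval(line))

with open("output.strict", 'w') as f:
    for line in parse("reviews_Video_Games.json.gz"):
        f.write(line + '\n')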
Pandas data frame
This code reads the data into a pandas data frame:
import gzip
import json
import pandas as pd

def parse(path):
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield json.loads(line)

def getDF(path):
    # Collect one dictionary per review, keyed by row number.
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Video_Games.json.gz')
Convert to CSV
This code converts (a selection of fields from) the above files to CSV format:
import csv
import gzip

fields = ["asin", "description", "brand"]
# Text mode ('wt') with newline='' is what csv.writer expects in Python 3.
with gzip.open("meta_Video_Games.csv.gz", 'wt', newline='') as csvOut:
    writer = csv.writer(csvOut)
    # parse() is the reader defined under "Reading the data".
    for product in parse("meta_Video_Games.json.gz"):
        # Missing fields are written as empty strings.
        writer.writerow([product.get(f, "") for f in fields])
Read image features
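import array

def readImageFeatures(path):
    with open(path, 'rb') as f:
        while True:
            asin = f.read(10)  # 10-character product ID
            if not asin:  # empty read means end of file
                break
            a = array.array('f')
            a.fromfile(f, 4096)  # 4096 floats per product
            yield asin.decode('ascii'), a.tolist()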
Example: compute average rating
ratings = []
for review in parse("reviews_Video_Games.json.gz"):
    ratings.append(review['overall'])
print(sum(ratings) / len(ratings))
Example: latent-factor model in mymedialite
Predicts ratings from a rating-only CSV file
./rating_prediction --recommender=BiasedMatrixFactorization --training-file=ratings_Video_Games.csv --test-ratio=0.1
Introduction
This is a large-scale Amazon Reviews dataset, collected in 2023 by McAuley Lab, and it includes rich features such as:
- User Reviews (ratings, text, helpfulness votes, etc.);
- Item Metadata (descriptions, price, raw image, etc.);
- Links (user-item / bought together graphs).
What's New?
In the Amazon Reviews'23, we provide:
- Larger Dataset: We collected 571.54M reviews, about 2.45 times as many as the last version (233.1M);
- Newer Interactions: Current interactions range from May 1996 to Sep. 2023;
- Richer Metadata: More descriptive features in item metadata;
- Fine-grained Timestamp: Interaction timestamp at the second or finer level;
- Cleaner Processing: Cleaner item metadata than previous versions;
- Standard Splitting: Standard data splits to encourage RecSys benchmarking.
Basic Statistics
We define the #R_Tokens as the number of tokens in user reviews and #M_Tokens as the number of tokens if treating the dictionaries of item attributes as strings. We emphasize them as important statistics in the era of LLMs.
We count the number of items based on user reviews rather than item metadata files. Note that some items lack metadata.
Compared to Previous Versions
Grouped by Category
Check Pure ID files and corresponding data splitting strategies in the Common Data Processing section.
Quick Start
Load User Reviews
Load Item Metadata
Check data loading examples and Huggingface datasets APIs in the Common Data Loading section.
Data Fields
For User Reviews
For Item Metadata
Contact Us
Report Bugs : To report bugs in the dataset, please file an issue on our GitHub .
Others : For research collaborations or other questions, please email yphou AT ucsd.edu .
amazon_us_reviews
- Description :
Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon's iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.
More than 130 million customer reviews are available to researchers as part of this release. The data is available in TSV files in the amazon-reviews-pds S3 bucket in AWS US East Region. Each line in the data files corresponds to an individual review (tab delimited, with no quote and escape characters).
Each dataset contains the following columns:
- marketplace - 2 letter country code of the marketplace where the review was written.
- customer_id - Random identifier that can be used to aggregate reviews written by a single author.
- review_id - The unique ID of the review.
- product_id - The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same product_id.
- product_parent - Random identifier that can be used to aggregate reviews for the same product.
- product_title - Title of the product.
- product_category - Broad product category that can be used to group reviews (also used to group the dataset into coherent parts).
- star_rating - The 1-5 star rating of the review.
- helpful_votes - Number of helpful votes.
- total_votes - Number of total votes the review received.
- vine - Review was written as part of the Vine program.
- verified_purchase - The review is on a verified purchase.
- review_headline - The title of the review.
- review_body - The review text.
- review_date - The date the review was written.
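Because the files are tab delimited with no quote or escape characters, a reader should be told not to treat quotation marks specially. The sketch below does this with Python's csv module. It assumes the first line of each file is a header row naming the columns listed above, which this page does not confirm; the sample row is invented.

```python
import csv
import io

COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]

def read_reviews(handle):
    """Yield one dict per review from a tab-delimited file with a header row."""
    # QUOTE_NONE makes the reader treat quotation marks as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    for row in reader:
        yield dict(zip(header, row))

# Invented row; note the quotation marks inside two of the fields.
row = ["US", "12345", "R1", "B000TEST00", "67890", '"Great" phone case',
       "Wireless", "4", "1", "2", "N", "Y", "Fits well",
       'He said "great"', "2015-08-31"]
sample = io.StringIO("\t".join(COLUMNS) + "\n" + "\t".join(row) + "\n")
reviews = list(read_reviews(sample))
```

All values arrive as strings, so numeric columns such as star_rating still need an explicit `int(...)` conversion.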
Homepage : https://s3.amazonaws.com/amazon-reviews-pds/readme.html
Source code : tfds.datasets.amazon_us_reviews.Builder
Versions :
- 0.1.0 (default): No release notes.
Supervised keys (See as_supervised doc ): None
Figure ( tfds.show_examples ): Not supported.
amazon_us_reviews/Wireless_v1_00 (default config)
Config description : A dataset consisting of reviews of Amazon Wireless_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.59 GiB
Dataset size : 7.21 GiB
Auto-cached ( documentation ): No
amazon_us_reviews/Watches_v1_00
Config description : A dataset consisting of reviews of Amazon Watches_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 155.42 MiB
Dataset size : 753.08 MiB
amazon_us_reviews/Video_Games_v1_00
Config description : A dataset consisting of reviews of Amazon Video_Games_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 453.19 MiB
Dataset size : 1.78 GiB
amazon_us_reviews/Video_DVD_v1_00
Config description : A dataset consisting of reviews of Amazon Video_DVD_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.41 GiB
Dataset size : 5.31 GiB
amazon_us_reviews/Video_v1_00
Config description : A dataset consisting of reviews of Amazon Video_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 132.49 MiB
Dataset size : 465.08 MiB
amazon_us_reviews/Toys_v1_00
Config description : A dataset consisting of reviews of Amazon Toys_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 799.61 MiB
Dataset size : 3.61 GiB
amazon_us_reviews/Tools_v1_00
Config description : A dataset consisting of reviews of Amazon Tools_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 318.32 MiB
Dataset size : 1.37 GiB
amazon_us_reviews/Sports_v1_00
Config description : A dataset consisting of reviews of Amazon Sports_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 832.06 MiB
Dataset size : 3.64 GiB
amazon_us_reviews/Software_v1_00
Config description : A dataset consisting of reviews of Amazon Software_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 89.66 MiB
Dataset size : 366.16 MiB
amazon_us_reviews/Shoes_v1_00
Config description : A dataset consisting of reviews of Amazon Shoes_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 612.50 MiB
Dataset size : 3.06 GiB
amazon_us_reviews/Pet_Products_v1_00
Config description : A dataset consisting of reviews of Amazon Pet_Products_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 491.92 MiB
Dataset size : 2.11 GiB
amazon_us_reviews/Personal_Care_Appliances_v1_00
Config description : A dataset consisting of reviews of Amazon Personal_Care_Appliances_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 16.82 MiB
Dataset size : 75.03 MiB
Auto-cached ( documentation ): Yes
amazon_us_reviews/PC_v1_00
Config description : A dataset consisting of reviews of Amazon PC_v1_00 products in US marketplace. Each product has its own version as specified with it.
Dataset size : 5.93 GiB
amazon_us_reviews/Outdoors_v1_00
Config description : A dataset consisting of reviews of Amazon Outdoors_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 428.16 MiB
Dataset size : 1.83 GiB
amazon_us_reviews/Office_Products_v1_00
Config description : A dataset consisting of reviews of Amazon Office_Products_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 488.59 MiB
Dataset size : 2.12 GiB
amazon_us_reviews/Musical_Instruments_v1_00
Config description : A dataset consisting of reviews of Amazon Musical_Instruments_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 184.43 MiB
Dataset size : 792.16 MiB
amazon_us_reviews/Music_v1_00
Config description : A dataset consisting of reviews of Amazon Music_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.42 GiB
Dataset size : 5.16 GiB
amazon_us_reviews/Mobile_Electronics_v1_00
Config description : A dataset consisting of reviews of Amazon Mobile_Electronics_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 21.81 MiB
Dataset size : 94.97 MiB
amazon_us_reviews/Mobile_Apps_v1_00
Config description : A dataset consisting of reviews of Amazon Mobile_Apps_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 532.11 MiB
Dataset size : 3.13 GiB
amazon_us_reviews/Major_Appliances_v1_00
Config description : A dataset consisting of reviews of Amazon Major_Appliances_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 23.23 MiB
Dataset size : 96.36 MiB
amazon_us_reviews/Luggage_v1_00
Config description : A dataset consisting of reviews of Amazon Luggage_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 57.53 MiB
Dataset size : 274.07 MiB
amazon_us_reviews/Lawn_and_Garden_v1_00
Config description : A dataset consisting of reviews of Amazon Lawn_and_Garden_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 464.22 MiB
Dataset size : 2.00 GiB
amazon_us_reviews/Kitchen_v1_00
Config description : A dataset consisting of reviews of Amazon Kitchen_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 887.63 MiB
Dataset size : 3.85 GiB
amazon_us_reviews/Jewelry_v1_00
Config description : A dataset consisting of reviews of Amazon Jewelry_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 235.58 MiB
Dataset size : 1.22 GiB
amazon_us_reviews/Home_Improvement_v1_00
Config description : A dataset consisting of reviews of Amazon Home_Improvement_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 480.02 MiB
Dataset size : 2.08 GiB
amazon_us_reviews/Home_Entertainment_v1_00
Config description : A dataset consisting of reviews of Amazon Home_Entertainment_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 184.22 MiB
Dataset size : 741.78 MiB
amazon_us_reviews/Home_v1_00
Config description : A dataset consisting of reviews of Amazon Home_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.01 GiB
Dataset size : 4.60 GiB
amazon_us_reviews/Health_Personal_Care_v1_00
Config description : A dataset consisting of reviews of Amazon Health_Personal_Care_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 964.34 MiB
Dataset size : 4.21 GiB
amazon_us_reviews/Grocery_v1_00
Config description : A dataset consisting of reviews of Amazon Grocery_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 382.74 MiB
Dataset size : 1.77 GiB
amazon_us_reviews/Gift_Card_v1_00
Config description : A dataset consisting of reviews of Amazon Gift_Card_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 11.57 MiB
Dataset size : 93.82 MiB
amazon_us_reviews/Furniture_v1_00
Config description : A dataset consisting of reviews of Amazon Furniture_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 142.08 MiB
Dataset size : 646.69 MiB
amazon_us_reviews/Electronics_v1_00
Config description : A dataset consisting of reviews of Amazon Electronics_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 666.45 MiB
Dataset size : 2.74 GiB
amazon_us_reviews/Digital_Video_Games_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Video_Games_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 26.17 MiB
Dataset size : 124.19 MiB
amazon_us_reviews/Digital_Video_Download_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Video_Download_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 483.49 MiB
Dataset size : 2.68 GiB
amazon_us_reviews/Digital_Software_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Software_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 18.12 MiB
Dataset size : 89.59 MiB
amazon_us_reviews/Digital_Music_Purchase_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Music_Purchase_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 241.82 MiB
Dataset size : 1.20 GiB
amazon_us_reviews/Digital_Ebook_Purchase_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Ebook_Purchase_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 2.51 GiB
Dataset size : 10.82 GiB
amazon_us_reviews/Camera_v1_00
Config description : A dataset consisting of reviews of Amazon Camera_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 422.15 MiB
Dataset size : 1.69 GiB
amazon_us_reviews/Books_v1_00
Config description : A dataset consisting of reviews of Amazon Books_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 2.55 GiB
Dataset size : 10.01 GiB
amazon_us_reviews/Beauty_v1_00
Config description : A dataset consisting of reviews of Amazon Beauty_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 871.73 MiB
Dataset size : 3.88 GiB
amazon_us_reviews/Baby_v1_00
Config description : A dataset consisting of reviews of Amazon Baby_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 340.84 MiB
Dataset size : 1.45 GiB
amazon_us_reviews/Automotive_v1_00
Config description : A dataset consisting of reviews of Amazon Automotive_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 555.18 MiB
Dataset size : 2.54 GiB
amazon_us_reviews/Apparel_v1_00
Config description : A dataset consisting of reviews of Amazon Apparel_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 618.59 MiB
Dataset size : 3.99 GiB
amazon_us_reviews/Digital_Ebook_Purchase_v1_01
Config description : A dataset consisting of reviews of Amazon Digital_Ebook_Purchase_v1_01 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.21 GiB
Dataset size : 4.87 GiB
amazon_us_reviews/Books_v1_01
Config description : A dataset consisting of reviews of Amazon Books_v1_01 products in US marketplace. Each product has its own version as specified with it.
Dataset size : 8.48 GiB
amazon_us_reviews/Books_v1_02
Config description : A dataset consisting of reviews of Amazon Books_v1_02 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.24 GiB
Dataset size : 4.15 GiB
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License , and code samples are licensed under the Apache 2.0 License . For details, see the Google Developers Site Policies . Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2022-12-06 UTC.
Datasets: McAuley-Lab/Amazon-Reviews-2023
Amazon Reviews 2023
Please also visit amazon-reviews-2023.github.io/ for more details, loading scripts, and preprocessed benchmark files.
[April 7, 2024] We added two useful files:
- all_categories.txt : 34 lines (33 categories + "Unknown"), each line contains a category name.
- asin2category.json : A mapping from parent_asin (item ID) to its corresponding category name.
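The two files can be combined to count items per category, as in the sketch below. It assumes that asin2category.json is a flat JSON object of the form {parent_asin: category name} and that all_categories.txt holds one name per line, as described above; the IDs and the short category list are placeholders, not real contents of the files.

```python
import json
from collections import Counter

def load_categories(lines):
    """Read category names, one per line, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]

def count_items_per_category(asin2category, categories):
    """Count how many item IDs map to each listed category."""
    counts = Counter(asin2category.values())
    return {category: counts.get(category, 0) for category in categories}

# In-memory stand-ins for the two files (placeholder values).
categories = load_categories(["All_Beauty\n", "Books\n", "Unknown\n"])
asin2category = json.loads(
    '{"B00EXAMPLE1": "Books", "B00EXAMPLE2": "Books", "B00EXAMPLE3": "Unknown"}'
)
per_category = count_items_per_category(asin2category, categories)
```

With real files, replace the stand-ins with `open("all_categories.txt")` and `json.load(open("asin2category.json"))`.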
Models trained or fine-tuned on McAuley-Lab/Amazon-Reviews-2023
hyp1231/blair-roberta-base
hyp1231/blair-roberta-large
Amazon product data
Julian McAuley , UCSD
New!: See our updated (2018) version of the Amazon data here
New: repository of recommender systems datasets.
See a variety of other datasets for recommender systems research on our lab's dataset webpage
Description
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
"Small" subsets for experimentation
If you're using this data for a class project (or similar) please consider using one of these smaller datasets below before requesting the larger files. To obtain the larger files you will need to contact me to obtain access.
K-cores (i.e., dense subsets): These data have been reduced to extract the k-core, such that each of the remaining users and items has at least k reviews.
Ratings only: These datasets include no metadata or reviews, but only (user,item,rating,timestamp) tuples. Thus they are suitable for use with mymedialite (or similar) packages.
Complete review data
Please see the per-category files below, and only download these (large!) files if you really need them:
raw review data (20gb) - all 142.8 million reviews
The above file contains some duplicate reviews, mainly due to near-identical products whose reviews Amazon merges, e.g. VHS and DVD versions of the same movie. These duplicates have been removed in the files below:
user review data (18gb) - duplicate items removed (83.68 million reviews), sorted by user
product review data (18gb) - duplicate items removed, sorted by product
5-core (9.9gb) - subset of the data in which all users and items have at least 5 reviews (41.13 million reviews)
Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks:
aggressively deduplicated data (18gb) - no duplicates whatsoever (82.83 million reviews)
Format is one-review-per-line in (loose) json. See examples below for further help reading the data.
Sample review:
{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- helpful - helpfulness rating of the review, e.g. 2/3
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
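As a small illustration of the `helpful` field, the sketch below turns the two-element list into a fraction. It assumes, following the "2/3" example above, that the list is [helpful votes, total votes]; the function name `helpfulness` is hypothetical.

```python
def helpfulness(review):
    """Fraction of votes that found the review helpful, or None if unvoted."""
    helpful_votes, total_votes = review["helpful"]
    if total_votes == 0:
        return None  # no votes, so the fraction is undefined
    return helpful_votes / total_votes
```

Returning None for reviews without votes keeps them distinguishable from reviews that were voted unhelpful.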
Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:
metadata (3.1gb) - metadata for 9.4 million products
Sample metadata:
{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }
- asin - ID of the product, e.g. 0000031852
- title - name of the product
- price - price in US dollars (at time of crawl)
- imUrl - url of the product image
- related - related products (also bought, also viewed, bought together, buy after viewing)
- salesRank - sales rank information
- brand - brand name
- categories - list of categories the product belongs to
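The `related` field can be flattened into an edge list for the also viewed/also bought graphs. The following is a minimal sketch, assuming each metadata record is a dictionary shaped like the sample above and that `related` (or a given relation inside it) may be absent for some products; `related_edges` is a hypothetical helper name.

```python
def related_edges(product, relation="also_bought"):
    """Yield (source asin, target asin) pairs for one relation type."""
    source = product["asin"]
    related = product.get("related", {})  # assumed optional
    for target in related.get(relation, []):
        yield (source, target)
```

For the sample product, `list(related_edges(product, "bought_together"))` would give a single edge from 0000031852 to B002BZX8Z6.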
Visual Features
We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.
visual features (141gb) - visual features for all products
The images themselves can be extracted from the imUrl field in the metadata files.
Below are files for individual product categories, which have already had duplicate item reviews removed.
Please cite one or both of the following if you use the data in any way:
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering R. He, J. McAuley WWW , 2016 pdf
Image-based recommendations on styles and substitutes J. McAuley, C. Targett, J. Shi, A. van den Hengel SIGIR , 2015 pdf
Inferring networks of substitutable and complementary products J. McAuley, R. Pandey, J. Leskovec Knowledge Discovery and Data Mining , 2015 pdf
Hidden factors and hidden topics: understanding rating dimensions with review text J. McAuley, J. Leskovec RecSys pdf | reviews | bibtex | code (C++) slides
Reading the data
Data can be treated as Python dictionary objects. A simple script to read any of the above data is as follows:
import gzip

def parse(path):
    # Each line of the gzipped file holds one record as a Python literal
    g = gzip.open(path, 'r')
    for l in g:
        yield eval(l)
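Since `eval` executes whatever expression it is given, a more defensive reader may be preferable when a file's origin is uncertain. The sketch below swaps in `ast.literal_eval`, which only accepts Python literals. It assumes every line is a plain literal dictionary and that the files are UTF-8 encoded; `parse_safe` is a hypothetical name, not part of the original scripts.

```python
import ast
import gzip

def parse_safe(path):
    """Yield one dictionary per line without evaluating arbitrary code."""
    # Text mode, because ast.literal_eval expects str rather than bytes
    with gzip.open(path, "rt", encoding="utf-8") as g:
        for line in g:
            yield ast.literal_eval(line)
```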
Convert to 'strict' json
The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:
import json
import gzip

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        yield json.dumps(eval(l))

f = open("output.strict", 'w')
for l in parse("reviews_Video_Games.json.gz"):
    f.write(l + '\n')
Pandas data frame
This code reads the data into a pandas data frame:
import pandas as pd
import gzip

def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Video_Games.json.gz')
Convert to CSV
This code converts (a selection of fields from) the above files to CSV format:
import csv
import gzip

# Uses parse() from "Reading the data" above
fields = ["asin", "description", "brand"]
# Text mode ('wt') so that csv.writer can write str under Python 3
csvOut = gzip.open("meta_Video_Games.csv.gz", 'wt', newline='')
writer = csv.writer(csvOut)
for product in parse("meta_Video_Games.json.gz"):
    # Fields missing from a product are written as empty strings
    writer.writerow([product.get(f, "") for f in fields])
csvOut.close()
Read image features
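import array

def readImageFeatures(path):
    f = open(path, 'rb')
    while True:
        asin = f.read(10)  # 10-character product ID (bytes under Python 3)
        if not asin:  # an empty read means end of file
            break
        a = array.array('f')
        a.fromfile(f, 4096)  # 4096 floats per product
        yield asin, a.tolist()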
Example: compute average rating
ratings = []
for review in parse("reviews_Video_Games.json.gz"):
    ratings.append(review['overall'])
print(sum(ratings) / len(ratings))
Example: latent-factor model in mymedialite
Predicts ratings from a rating-only CSV file
./rating_prediction --recommender=BiasedMatrixFactorization --training-file=ratings_Video_Games.csv --test-ratio=0.1
amazon-review-dataset
Here are 18 public repositories matching this topic.
vinaykanigicherla / amazon_reviews_sentiment
Sentiment Analysis on the Amazon Reviews Dataset using BERT-based transfer learning approach.
- Updated Apr 19, 2021
- Jupyter Notebook
Kavitha-Kothandaraman / Product-Recommendation-Systems
To build a recommendation system to recommend products to customers based on their previous ratings for other products
- Updated May 29, 2020
imdeepmind / RatePrediction
Rate Prediction using Amazon Review Dataset and Deep Learning
- Updated Nov 21, 2022
pallavitilloo / Data-Mining-on-Amazon-Reviews
Data Mining on Amazon user reviews for musical instruments
- Updated Mar 3, 2023
rkarwayun / MSCI-641
Assignments for MSCI 641: Text Analytics, Spring 2020 at University of Waterloo.
- Updated Aug 24, 2020
NohanJoemon / Automatic-review-labelling-using-BERT
Sentiment analysis of amazon reviews dataset using BERT - model development and deployment
- Updated Nov 11, 2023
Rahulraj31 / NLP_Review_SportsAndOutdoor
Performing NLP on Amazon's review on sports and outdoor
- Updated Jul 14, 2021
rdadrl / reviewlytics
Sentimentally analyze product reviews to predict opinion honesty.
- Updated May 26, 2020
joshivaibhav / AmazonCustomerReview
Analysing Amazon customer reviews via Clustering, Visualization and Classification
- Updated Feb 17, 2021
MrRaghav / Complaints-mining-from-Hindi-product-reviews
The public dataset in Hindi language published for paper 28 - AICS2020, Ireland
- Updated Nov 13, 2020
Harsh251299 / Sentimental-Analysis
Sentiment Analysis is the process of determining whether a piece of writing is positive, negative or neutral.
- Updated Sep 16, 2021
InsiderPants / AmazonReview-Sentiment-Analysis
Sentiment Analysis using Conv1D and LSTM
- Updated Apr 17, 2019
dewith / reviews_polarity
Predicting polarity of Amazon user reviews using Deep Learning 🎭
- Updated Dec 23, 2020
jirenmaa / test_sentiment_analysis_dataset
A simple sentiment analysis using SGD and LinearSVC for Amazon Reviews
- Updated Nov 13, 2023
banurekhaMohan279 / AmazonReviews-Analyser
React App in AWS with CI/CD workflow
- Updated Jul 20, 2023
lcarcamo1526 / Amazon-Reviews-Analysis
Amazon Reviews Analysis
- Updated May 23, 2019
kuldeep27396 / Apparel-recommendation
Apparel-recommendation-engine-Machine-Learning
- Updated Mar 28, 2021
roshancyriacmathew / Deep-Learning-on-Amazon-Alexa-Reviews
This notebook will show you how to implement a deep learning algorithm (LSTM) on the Amazon Alexa Reviews dataset
- Updated Apr 5, 2023
Recommender Systems and Personalization Datasets
Julian McAuley , UCSD
Description
This page contains a collection of datasets that have been collected for research by our lab. Datasets contain the following features:
- user/item interactions
- star ratings
- product reviews
- social networks
- item-to-item relationships (e.g. copurchases, compatibility)
- product images
- price, brand, and category information
- heart-rate sequences
- other metadata
Please cite the appropriate reference if you use any of the datasets below.
Datasets are in (loose) json format unless specified otherwise, meaning they can be treated as python dictionary objects. A simple script to read json-formatted data is as follows:
import gzip

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        yield eval(l)
Directory by Dataset
Twitch live-streaming interactions
NPR interview dialog data
This American Life podcast transcripts
Recipes and interactions from food.com
Paired Recipes from food.com
EndoMondo fitness tracking data
Amazon product reviews and metadata
Amazon question/answer data
Amazon marketing bias data
Google Local business reviews and metadata
Google Restaurants restaurant reviews and metadata
Steam video game reviews and bundles
Goodreads book reviews
Goodreads spoilers
Fashion explanations
Pinterest fashion compatibility data
ModCloth clothing fit feedback
ModCloth marketing bias data
RentTheRunway clothing fit feedback
Tradesy bartering data
RateBeer bartering data
Gameswap bartering data
Behance community art reviews and image features
Librarything reviews and social data
Epinions reviews and social data
Cant understanding data
Dance Dance Revolution step charts
NES song data
BeerAdvocate multi-aspect beer reviews
RateBeer multi-aspect beer reviews
Facebook social circles data
Twitter social circles data
Google+ social circles data
Reddit submission popularity and metadata
Directory by Metadata Type
The datasets below can be roughly organized in terms of the types of metadata they contain:
Review text: see Amazon , BeerAdvocate, RateBeer , Google Local , Google Restaurants
Image data: Amazon , Behance , Pinterest , Google Restaurants
Item-to-item relationships: Amazon
Q/A data: Amazon Q/A
Geographical data: Google Local , Google Restaurants , EndoMondo
Heart-Rate data: EndoMondo
Bundle data: Steam
Peer-to-peer trades: Tradesy, RateBeer, Gameswap
Social connections: Librarything, Epinions
Fit feedback: Modcloth, Renttherunway
Multiple aspects: BeerAdvocate, RateBeer
This is a dataset of users consuming streaming content on Twitch. We retrieved all streamers, and all users connected in their respective chats, every 10 minutes during 43 days.
Basic statistics
- User ID (anonymized)
- Streamer username
1,34347669376,grimnax,5415,5419
1,34391109664,jtgtv,5869,5870
1,34395247264,towshun,5898,5899
1,34405646144,mithrain,6024,6025
2,33848559952,chfhdtpgus1,206,207
2,33881429664,sal_gu,519,524
2,33921292016,chfhdtpgus1,922,924
Download link
See our data folder containing all Twitch files. The file full_a.csv.gz contains the full dataset while 100k.csv is a subset of 100k users for benchmark purposes. The code is available in our Github repository .
Please cite the following if you use the data:
Recommendation on Live-Streaming Platforms: Dynamic Availability and Repeat Consumption Jérémie Rappaz, Julian McAuley and Karl Aberer RecSys , 2021
Interview: NPR Media Dialog Data
This dataset contains interview transcripts from National Public Radio (NPR) . Data includes full interview transcripts and news article headlines.
- Episode Date and Title
- Speaker Names
- Speaker Utterances
- News Article Headlines
episode: 79679
program: Talk of the Nation
title: Forecasting the Future of the Internet
date: 2006-05-26
episode_order: 48
speaker: Professor LARRY PETERSON (Princeton University)
utterance: And this is almost like the neutrality aspect of the issue, that there are places you just can't get to and the universal connectivity of the original Internet is deteriorating. Because of a lack of security built into the Internet your only recourse is to throw up all sorts of protections that are extremely suspicious of every bit of traffic that happens to fly by.
See the Interview Dataset Page for download information.
Interview: Large-scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley EMNLP , 2020 pdf
This American Life Podcast Transcripts
This dataset contains program transcripts from This American Life . Data includes full program transcripts and associated audio.
- Episode Act
- Utterance Lengths
- Episode Audio
episode: ep-1
act: prologue
utterance_start: 39.96
utterance_end: 54.89
duration: 14.93
speaker: ira glass
utterance: Well, one great thing about starting a new show is utter anonymity. Nobody really knows what to expect from you. This interviewee did not know us from Adam.
See the This American Life Dataset Page for download information.
Speech Recognition and Multi-Speaker Diarization of Long Conversations Huanru Henry Mao, Shuyang Li, Julian McAuley, Garrison W. Cottrell INTERSPEECH , 2020 pdf
Food.com Recipe & Review Data
These datasets contain recipe details and reviews from Food.com (formerly GeniusKitchen). Data includes cooking recipes and review texts.
- Ratings and Reviews
- Recipe Name, Description, Ingredients, and Directions
- Recipe Categories (Tags)
- Recipe Nutrition Information
name: beer mac n cheese soup
id: 499490
minutes: 45
contributor_id: 560491
submitted: 2013-04-27
tags: 60-minutes-or-less time-to-make preparation
nutrition: 678.8 70.0 20.0 46.0 61.0 134.0 11.0
n_steps: 7
steps: cook the bacon in a pan over medium heat and set aside on paper towels to drain , reserving 2 tablespoons of the grease in the pan add the onion , carrot , celery and jalapeno and cook until tender , about 10-15 minutes add the garlic and cook until fragrant , about a minute mix in the flour and let it cook for 2-3 minutes add the broth , beer , nutmeg , bacon and macaroni and let cook until the macaroni is al-dente , about 7-8 minutes add the cream , mustard , worcestershire sauce and cheese and cook until the cheese has melted without bringing it back to a boil season with cayenne , salt and pepper to taste
description: all of the flavors of mac n' cheese in the form of a hot bowl of soup! submitted by kevin lynch
ingredients: bacon onion carrots celery jalapeno pepper garlic cloves flour chicken broth beer nutmeg elbow macaroni heavy cream dijon mustard worcestershire sauce cheddar cheese cayenne salt and pepper
n_ingredients: 17

user_id: 8937
recipe_id: 44394
date: 2002-12-01
rating: 4
review: This worked very well and is EASY. I used not quite a whole package (10oz) of white chips. Great!
See the Food.com Dataset Page for download information.
Generating Personalized Recipes from Historical User Preferences Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley EMNLP , 2019 pdf
Recipe Pairs data
This is a collection of recipes paired with variants, e.g. a recipe matched with a vegan version of the same recipe.
See the Recipe Pairs Dataset Page for download information.
SHARE: a System for Hierarchical Assistive Recipe Editing Shuyang Li, Yufei Li, Jianmo Ni, Julian McAuley EMNLP , 2022 pdf
EndoMondo Fitness Tracking Data
This is a collection of workout logs from users of EndoMondo . Data includes multiple sources of sequential sensor data such as heart rate logs, speed, GPS, as well as sport type, gender and weather conditions.
- User Identifier
- Latitude/Longitude/Altitude sequences (with timestamps)
- Heart rates
- Various derived sequences
userId: 10921915
gender: male
sport: bike
id: 396826535
longitude: [24.64977040886879, 24.65014273300767, 24.650910682976246, 24.650668865069747, 24.649145286530256, ...]
latitude: [60.173348765820265, 60.173239801079035, 60.17298021353781, 60.172477969899774, 60.17186114564538, ...]
altitude: [-1.8044666444624418, -1.8190453555595787, -1.8190453555595787, -1.8511185199732794, -1.871528715509271, ...]
timestamp: [1408898746, 1408898754, 1408898765, 1408898778, 1408898794, ...]
time_elapsed: [-0.12256752559145224, -0.12221090169596584, -0.12172054383967204, -0.12114103000950663, -0.12042778221853381, ...]
heart_rate: [-8.197369036801112, -5.867841701016304, -3.961864789919643, -4.173640002263717, -3.961864789919643, ...]
derived_speed: [-7.0829444390064396, -2.8061928357004815, -0.3976286593020398, -0.7571073884764162, 2.6415189187026646, ...]
distance: [-4.372303649217691, -2.374952819539426, -0.07926348591212737, 0.4284751220389811, 4.710835498111755, ...]
tar_heart_rate: [100, 111, 120, 119, 120, ...]
tar_derived_speed: [0, 10.751376415573548, 16.806294372816662, 15.902596545765366, 24.446443398153843, ...]
since_begin: [1378478.8892184314, 1378478.8892184314, 1378478.8892184314, 1378478.8892184314, 1378478.8892184314, ...]
since_last: [2158.84607810351, 2158.84607810351, 2158.84607810351, 2158.84607810351, 2158.84607810351, ...]
See the FitRec Dataset Page for download information.
Modeling heart rate and activity data for personalized fitness recommendation Jianmo Ni, Larry Muhlstein, Julian McAuley WWW , 2019 pdf
Amazon Product Reviews
This is a large-scale Amazon Reviews dataset collected in 2023. This dataset contains 48.19 million items, and 571.54 million reviews from 54.51 million users.
- User Reviews (ratings, text, helpfulness votes, etc.);
- Item Metadata (descriptions, price, raw image, etc.);
- Links (user-item / bought together graphs).
{ "sort_timestamp": 1634275259292, "rating": 3.0, "helpful_votes": 0, "title": "Meh", "text": "These were lightweight and soft but much too small for my liking. I would have preferred two of these together to make one loc. For that reason I will not be repurchasing.", "images": [ { "small_image_url": " https://m.media-amazon.com/images/I/81FN4c0VHzL._SL256_.jpg ", "medium_image_url": " https://m.media-amazon.com/images/I/81FN4c0VHzL._SL800_.jpg ", "large_image_url": " https://m.media-amazon.com/images/I/81FN4c0VHzL._SL1600_.jpg ", "attachment_type": "IMAGE" } ], "asin": "B088SZDGXG", "verified_purchase": true, "parent_asin": "B08BBQ29N5", "user_id": "AEYORY2AVPMCPDV57CE337YU5LXA" }
See the Amazon Dataset Page for download information.
See the Amazon Reviews 2023 page for download information.
You can also download data from previous versions of these datasets:
Amazon Reviews 2018
Amazon Reviews 2014
2023 version
Bridging Language and Items for Retrieval and Recommendation Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, Julian McAuley arXiv pdf
2018 version
Justifying recommendations using distantly-labeled reviews and fine-grained aspects Jianmo Ni, Jiacheng Li, Julian McAuley EMNLP , 2019 pdf
2014 version
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering Ruining He, Julian McAuley WWW , 2016 pdf
Image-based recommendations on styles and substitutes Julian McAuley, Christopher Targett, Javen Shi, Anton van den Hengel SIGIR , 2015 pdf
This is a large crawl of product reviews from Amazon. This dataset contains 82.83 million unique reviews, from around 20 million users.
- reviews and ratings
- item-to-item relationships (e.g. "people who bought X also bought Y")
- helpfulness votes
- product image (and CNN features)
{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }
The 2014 version of this dataset is also available .
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering R. He, J. McAuley WWW , 2016 pdf
Image-based recommendations on styles and substitutes J. McAuley, C. Targett, J. Shi, A. van den Hengel SIGIR , 2015 pdf
Amazon Question and Answer Data
These datasets contain questions and answers about products from the Amazon dataset above.
- question and answer text
- is the question binary (yes/no), and if so does it have a yes/no answer?
- product ID (to reference the review dataset)
{ "asin": "B000050B6Z", "questionType": "yes/no", "answerType": "Y", "answerTime": "Aug 8, 2014", "unixTime": 1407481200, "question": "Can you use this unit with GEL shaving cans?", "answer": "Yes. If the can fits in the machine it will despense hot gel lather. I've been using my machine for both , gel and traditional lather for over 10 years." }
See the Amazon Q/A Page for download information.
Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems Mengting Wan, Julian McAuley International Conference on Data Mining (ICDM) , 2016 pdf
Addressing complex and subjective product-related queries with customer reviews Julian McAuley, Alex Yang World Wide Web (WWW) , 2016 pdf
Marketing Bias data
These datasets contain attributes about products sold on ModCloth and Amazon which may be sources of bias in recommendations (in particular, attributes about how the products are marketed). Data also includes user/item interactions for recommendation.
- user identities
- item sizes, user genders
Example (ModCloth)
item_id,user_id,rating,timestamp,size,fit,user_attr,model_attr,c...
7443,Alex,4,2010-01-21 08:00:00+00:00,,,Small,Small,Dresses,,2012,0
7443,carolyn.agan,3,2010-01-27 08:00:00+00:00,,,,Small,Dresses,,...
7443,Robyn,4,2010-01-29 08:00:00+00:00,,,Small,Small,Dresses,,20...
7443,De,4,2010-02-13 08:00:00+00:00,,,,Small,Dresses,,2012,0
7443,tasha,4,2010-02-18 08:00:00+00:00,,,Small,Small,Dresses,,20...
7443,gina.chihos,5,2010-02-25 08:00:00+00:00,,,,Small,Dresses,,2...
7443,Kim,2,2010-02-26 08:00:00+00:00,,,Small,Small,Dresses,,2012,0
7443,jess.betcher,5,2010-03-26 07:00:00+00:00,,,,Small,Dresses,,...
Download links
See our project page for download links.
Addressing Marketing Bias in Product Recommendations Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley WSDM , 2020 pdf
Google Local Reviews (2021)
This dataset contains review information from Google Maps (ratings, text, images, etc.), business metadata (address, geographic info, descriptions, category information, price, open hours, etc.), and links (related businesses) up to Sep 2021 in the United States.
See also two variants of this dataset below, including a 2021 version, and a version containing item images.
{ 'user_id': '101463350189962023774', 'name': 'Jordan Adams', 'time': 1627750414677, 'rating': 5, 'text': 'Cool place, great people, awesome dentist!', 'pics': [ { 'url': ['https://lh5.googleusercontent.com/p/AF1QipNq2nZC5TH4_M7h5xRAd 61hoTgvY1o9lozABguI=w150-h150-k-no-p'] } ], 'resp': { 'time': 1628455067818, 'text': 'Thank you for your five-star review! -Dr. Blake' }, 'gmap_id': '0x87ec2394c2cd9d2d:0xd1119cfbee0da6f3' }
- user_id - ID of the reviewer
- name - name of the reviewer
- time - time of the review (unix time)
- rating - rating of the business
- text - text of the review
- pics - pictures of the review
- resp - business response to the review including unix time and text of the response
- gmap_id - ID of the business
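The `time` value in the sample review (1627750414677) has 13 digits, which suggests milliseconds rather than seconds since the epoch. The sketch below converts it to a timezone-aware datetime under that assumption; the millisecond interpretation and the helper name `review_datetime` are not stated by the dataset page.

```python
from datetime import datetime, timezone

def review_datetime(review):
    """Convert the review's 'time' field (assumed milliseconds) to UTC."""
    return datetime.fromtimestamp(review["time"] / 1000, tz=timezone.utc)
```

Under this reading the sample review falls on 31 July 2021, consistent with the stated coverage up to Sep 2021.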
{ 'name': 'Walgreens Pharmacy', 'address': 'Walgreens Pharmacy, 124 E North St, Kendallville, IN 46755', 'gmap_id': '0x881614ce7c13acbb:0x5c7b18bbf6ec4f7e', 'description': 'Department of the Walgreens chain providing prescription medications & other health-related items.', 'latitude': 41.451859999999996, 'longitude': -85.2666757, 'category': ['Pharmacy'], 'avg_rating': 4.2, 'num_of_reviews': 5, 'price': '$$', 'hours': [['Thursday', '8AM–1:30PM'], ['Friday', '8AM–1:30PM'], ['Saturday', '9AM–1:30PM'], ['Sunday', '10AM–1:30PM'], ['Monday', '8AM–1:30PM'], ['Tuesday', '8AM–1:30PM'], ['Wednesday', '8AM–1:30PM']], 'MISC': { 'Service options': ['Curbside pickup', 'Drive-through', 'In-store pickup', 'In-store shopping'], 'Health & safety': ['Mask required', 'Staff wear masks', 'Staff get temperature checks'], 'Accessibility': ['Wheelchair accessible entrance', 'Wheelchair accessible parking lot'], 'Planning': ['Quick visit'], 'Payments': ['Checks', 'Debit cards'] }, 'state': 'Closes soon ⋅ 1:30PM ⋅ Reopens 2PM', 'relative_results': ['0x881614cd49e4fa33:0x2d507c24ff4f1c74', '0x8816145bf5141c89:0x535c1d605109f94b', '0x881614cda24cc591:0xca426e3a9b826432', '0x88162894d98b91ef:0xd139b34de70d3e03', '0x881615400b5e57f9:0xc56d17dbe420a67f'], 'url': 'https://www.google.com/maps/place//data=!4m2!3m1!1s0x881614ce7c13acb b:0x5c7b18bbf6ec4f7e?authuser=-1&hl=en&gl=us' }
- name - name of the business
- address - address of the business
- description - description of the business
- latitude - latitude of the business
- longitude - longitude of the business
- category - category of the business
- avg_rating - average rating of the business
- num_of_reviews - number of reviews
- price - price of the business
- hours - open hours
- MISC - MISC information
- state - the current status of the business (e.g., permanently closed)
- relative_results - related businesses recommended by Google
- url - URL of the business
See the Google Local Dataset Page for download information.
UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining Jiacheng Li, Jingbo Shang, Julian McAuley Annual Meeting of the Association for Computational Linguistics (ACL) , 2022 pdf
Personalized Showcases: Generating Multi-Modal Explanations for Recommendations An Yan, Zhankui He, Jiacheng Li, Tianyang Zhang, Julian Mcauley arXiv:2207.00422 , 2022 pdf
Google Local Reviews (2018)
These datasets contain reviews about businesses from Google Local (Google Maps). Data includes geographic information for each business as well as reviews.
- GPS coordinates and address
- User information (places lived, jobs)
- business category, opening hours, etc.
Example (review)
{ 'rating': 3.0, 'reviewerName': u'an lam', 'reviewText': u'Ch\u1ea5t l\u01b0\u1ee3ng t\u1ea1m \u1ed5n', 'categories': [u'Gi\u1ea3i Tr\xed - Caf\xe9'], 'gPlusPlaceId': u'108103314380004200232', 'unixReviewTime': 1372686659, 'reviewTime': u'Jul 1, 2013', 'gPlusUserId': u'100000010817154263736' }
Example (business)
{ 'name': u'Diamond Valley Lake Marina', 'price': None, 'address': [u'2615 Angler Ave', u'Hemet, CA 92545'], 'hours': [[u'Monday', [[u'6:30 am--4:15 pm']]], [u'Tuesday', [[u'6:30 am--4:15 pm']]], [u'Wednesday', [[u'6:30 am--4:15 pm']], 1], [u'Thursday', [[u'6:30 am--4:15 pm']]], [u'Friday', [[u'6:30 am--4:15 pm']]], [u'Saturday', [[u'6:30 am--4:15 pm']]], [u'Sunday', [[u'6:30 am--4:15 pm']]]], 'phone': u'(951) 926-7201', 'closed': False, 'gPlusPlaceId': '104699454385822125632', 'gps': [33.703804, -117.003209] }
Places Data (276mb)
User Data (178mb)
Review Data (1.4gb)
Translation-based factorization machines for sequential recommendation Rajiv Pasricha, Julian McAuley RecSys , 2018 pdf
Translation-based recommendation Ruining He, Wang-Cheng Kang, Julian McAuley RecSys , 2017 pdf
Google Restaurants
This is a multi-modal dataset of restaurants from Google Local (Google Maps). Data includes images and reviews posted by users, as well as other metadata for each restaurant.
- Geographical location and address
- Reviews, ratings and images
- Business category, opening status, price, etc.
"name":"The Fish Spot", "address":"5101 W Pico Blvd, Los Angeles, CA 90019", "Description":null, "Latitude":34.0481627, "Longitude":-118.3494339, "category":["Seafood restaurant"], "gmap_url":"https://www.google.com/maps/place/The+Fish+Spot/", "Avg_rating":4.3, "Num_of_reviews":80, "price":"$$", "Reviews": [ {"user_id":"111210125124533240892", "time":"3 years ago", "Rating":5, "text":"Absolutely love this place.", "pics":[ {"id":"AF1QipO1ejvRhkVBlg-v52UczxYMD7uebcZIhKC9uGud", "url":["https://lh5.googleusercontent.com/p/"]}, ], "link":"https://www.google.com/maps/reviews/"}, ...,]
See our data folder containing all related files. The file image_review_all.json contains the full dataset, while filter_all_t.json is a subset with filtered review sentences that have higher correlation with images. Code is available in our Github repository .
Steam Video Game and Bundle Data
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.
- purchases, plays, recommends ("likes")
- product bundles
- pricing information
Example (bundle)
{ 'bundle_final_price': '$29.66', 'bundle_url': 'http://store.steampowered.com/bundle/1482/?utm_source=SteamDB...', 'bundle_price': '$32.96', 'bundle_name': 'Two Tribes Complete Pack!', 'bundle_id': '1482', 'items': [{'genre': 'Casual, Indie', 'item_id': '38700', 'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38700', 'item_name': 'Toki Tori'}, {'genre': 'Adventure, Casual, Indie', 'item_id': '201420', 'discounted_price': '$14.99', 'item_url': 'http://store.steampowered.com/app/201420', 'item_name': 'Toki Tori 2+'}, {'genre': 'Strategy, Indie, Casual', 'item_id': '38720', 'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38720', 'item_name': 'RUSH'}, {'genre': 'Action, Indie', 'item_id': '38740', 'discounted_price': '$7.99', 'item_url': 'http://store.steampowered.com/app/38740', 'item_name': 'EDGE'}], 'bundle_discount': '10%' }
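Prices in the bundle record are stored as strings, so they need parsing before any arithmetic. The sketch below compares the sum of the item prices with the bundle prices. It assumes every price looks like '$4.99', as in the sample; other records may use different formats, and the helper names are hypothetical.

```python
from decimal import Decimal

def price(text):
    """Parse a price string such as '$4.99' (assumed format)."""
    return Decimal(text.lstrip("$"))

def bundle_summary(bundle):
    """Sum of item prices alongside the listed and final bundle prices."""
    item_total = sum((price(item["discounted_price"])
                      for item in bundle["items"]), Decimal("0"))
    return {
        "item_total": item_total,
        "bundle_price": price(bundle["bundle_price"]),
        "bundle_final_price": price(bundle["bundle_final_price"]),
    }
```

`Decimal` is used instead of float so that currency amounts add up exactly.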
Version 1: Review Data (6.7mb)
Version 1: User and Item Data (71mb)
Version 2: Review Data (1.3gb)
Version 2: Item metadata (2.7mb)
Bundle Data (92kb)
Self-attentive sequential recommendation Wang-Cheng Kang, Julian McAuley ICDM , 2018 pdf
Item recommendation on monotonic behavior chains Mengting Wan, Julian McAuley RecSys , 2018 pdf
Generating and personalizing bundle recommendations on Steam Apurva Pathak, Kshitiz Gupta, Julian McAuley SIGIR , 2017 pdf
Goodreads Book Reviews
These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, ranging from adding to a "shelf" to rating and reading.
- add-to-shelf, read, review actions
- book attributes: title, isbn
- graph of similar books
Example (interaction data)
{ "user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "130580", "review_id": "330f9c153c8d3347eb914c06b89c94da", "isRead": true, "rating": 4, "date_added": "Mon Aug 01 13:41:57 -0700 2011", "date_updated": "Mon Aug 01 13:42:41 -0700 2011", "read_at": "Fri Jan 01 00:00:00 -0800 1988", "started_at": "" }
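The date fields in the interaction record are strings with a UTC offset, and some (such as `started_at`) can be empty. A minimal parsing sketch, assuming all non-empty dates follow the layout shown in the sample; the constant and function names are hypothetical.

```python
from datetime import datetime

# Layout of e.g. "Mon Aug 01 13:41:57 -0700 2011" (assumed for all records)
GOODREADS_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_goodreads_date(text):
    """Return a timezone-aware datetime, or None for an empty field."""
    if not text:
        return None
    return datetime.strptime(text, GOODREADS_FORMAT)
```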
Goodreads Spoilers
These datasets contain reviews from the Goodreads book review website, along with annotated "spoiler" information from each review.
- see also metadata from the complete Goodreads dataset
Example (spoiler data)
Sentences are annotated as "1" if the sentence contains a spoiler, "0" otherwise.
{ 'user_id': '01ec1a320ffded6b2dd47833f2c8e4fb', 'timestamp': '2013-12-28', 'review_sentences': [[0, 'First, be aware that this book is not for the faint of heart.'], [0, 'Human trafficking, drugs, kidnapping, abuse in all forms - this story contains all of this and more.'], ..., [0, '(ARC provided by the author in return for an honest review.)']], 'rating': 5, 'has_spoiler': False, 'book_id': '18398089', 'review_id': '4b3ffeaf14310ac6854f140188e191cd' }
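Since each entry of `review_sentences` is a `[label, text]` pair, sentence-level statistics can be derived directly from a record. The sketch below is illustrative: the function names are hypothetical, and the second sentence in the sample is a made-up placeholder rather than data from the corpus.

```python
def spoiler_sentences(review):
    """Return the text of each sentence whose label is 1 (spoiler)."""
    return [text for label, text in review["review_sentences"] if label == 1]

def spoiler_fraction(review):
    """Return the share of sentences labelled as spoilers (0.0 if there are none)."""
    sentences = review["review_sentences"]
    if not sentences:
        return 0.0
    return sum(label for label, _ in sentences) / len(sentences)

review = {
    "review_sentences": [
        [0, "First, be aware that this book is not for the faint of heart."],
        [1, "(hypothetical sentence standing in for a spoiler)"],
    ],
}
flagged = spoiler_sentences(review)
fraction = spoiler_fraction(review)
```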
Fine-grained spoiler detection from large-scale review corpora Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley ACL , 2019 pdf
Pairwise Fashion Explanations
The Pair Fashion Explanation (PFE) dataset contains 6407 instances, with each instance including items, features and the reason why these items are a good match.
Mentioned Items and the Percentages:
- Items (dress, top, skirt, etc.);
- Features (kilt, studded, etc.);
- Explanations (The outfit looks cohesive because the oversized layers are cinched with a studded belt, which complements the little strip from a kilt skirt that is also affixed to the belt, creating a visually pleasing balance in the outfit.);
{ "items": ['trousers', 'belt'], "features": ['tone-on-tone burgundy, slight flare', 'big circular gold buckle'], "explanations": "They all share a similar color scheme and the pieces have a cohesive silhouette that creates a polished and sophisticated look." }
See our project page for download information.
Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation. Yu Wang, Zexue He, Zhankui He, Hao Xu, Julian McAuley. AAAI 2024 pdf
Pinterest Fashion Compatibility
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.
- product IDs
- bounding boxes
Example (fashion.json)
{ "product": "0027e30879ce3d87f82f699f148bff7e", "scene": "cdab9160072dd1800038227960ff6467", "bbox": [ 0.434097, 0.859363, 0.560254, 1.0 ] }
See our project page for download links, and for instructions as to how the product images can be collected from Pinterest.
Complete the Look: Scene-based complementary product recommendation Wang-Cheng Kang, Eric Kim, Jure Leskovec, Charles Rosenberg, Julian McAuley CVPR , 2019 pdf
Clothing Fit Data
These datasets contain measurements of clothing fit from ModCloth and RentTheRunway .
- ratings and reviews
- fit feedback (small/fit/large etc.)
- user/item measurements
- category information
Example (RentTheRunway)
{ "fit": "fit", "user_id": "420272", "bust size": "34d", "item_id": "2260466", "weight": "137lbs", "rating": "10", "rented for": "vacation", "review_text": "An adorable romper! Belt and zipper were a little hard to navigate in a full day of wear/bathroom use, but that's to be expected. Wish it had pockets, but other than that-- absolutely perfect! I got a million compliments.", "body type": "hourglass", "review_summary": "So many compliments!", "category": "romper", "height": "5' 8\"", "size": 14, "age": "28", "review_date": "April 20, 2016" }
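Several measurement fields in this record (`height`, `weight`) are free-form strings rather than numbers. The following sketch shows one way to convert them, assuming every value follows the patterns seen in the example (`5' 8"` and `137lbs`); the function names are hypothetical, and values in any other form are returned as `None` rather than guessed.

```python
import re

def parse_height_inches(text):
    """Convert a height such as 5' 8" to total inches; None if it does not match."""
    match = re.fullmatch(r"\s*(\d+)'\s*(\d+)\"\s*", text)
    if match is None:
        return None
    feet, inches = map(int, match.groups())
    return feet * 12 + inches

def parse_weight_lbs(text):
    """Convert a weight such as '137lbs' to an integer number of pounds."""
    match = re.fullmatch(r"\s*(\d+)\s*lbs\s*", text)
    return int(match.group(1)) if match else None

height = parse_height_inches("5' 8\"")
weight = parse_weight_lbs("137lbs")
```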
Modcloth (8.5mb)
Renttherunway (31mb)
Decomposing fit semantics for product size recommendation in metric spaces Rishabh Misra, Mengting Wan, Julian McAuley RecSys , 2018 pdf
Product Exchange/Bartering Data
These datasets contain peer-to-peer trades from various recommendation platforms.
- peer-to-peer trades
- "have" and "want" lists
- image data (tradesy)
Example (tradesy)
{ 'lists': { 'bought': ['466', '459', '457', '449'], 'selling': [], 'want': [], 'sold': ['104', '103', '102'] }, 'uid': '2' }
Tradesy (3.8mb)
See the project page for ratebeer, gameswap (and other) datasets
Bartering books to beers: A recommender system for exchange platforms Jérémie Rappaz, Maria-Luiza Vladarean, Julian McAuley, Michele Catasta WSDM , 2017 pdf
VBPR: Visual bayesian personalized ranking from implicit feedback Ruining He, Julian McAuley AAAI , 2016 pdf
Behance Community Art Data
Likes and image data from the community art website Behance. This is a small, anonymized version of a larger proprietary dataset.
- appreciates (likes)
- extracted image features
Example ("appreciate" data)
Each entry is a user, item, timestamp triple:
276633 01588231 1307583271
1238354 01529213 1307583273
165550 00485000 1307583337
2173258 00776972 1307583340
165550 00158226 1307583406
1238354 01540285 1307583495
2459267 01578261 1307583509
165550 00264669 1307583518
165550 00171501 1307583536
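A triple of this kind can be parsed with a few lines of Python. The sketch below assumes the fields are separated by whitespace and that the third field is a Unix timestamp in seconds, which is consistent with the values shown but is not stated on this page; `parse_appreciate` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def parse_appreciate(line):
    """Split one 'user item timestamp' line into (user_id, item_id, datetime)."""
    user_id, item_id, timestamp = line.split()
    # IDs stay as strings so that leading zeros (e.g. '01588231') are preserved.
    when = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return user_id, item_id, when

user_id, item_id, when = parse_appreciate("276633 01588231 1307583271")
```

Keeping the item ID as a string matters here, because converting `01588231` to an integer would drop the leading zero and no longer match the ID in the data.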
Code to read image features
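import struct

def readImageFeatures(path):
    # Each record is an 8-byte item ID followed by 4096 float32 values.
    with open(path, 'rb') as f:
        while True:
            itemId = f.read(8)
            if itemId == b'':  # end of file (read() returns bytes in Python 3)
                break
            feature = struct.unpack('f'*4096, f.read(4*4096))
            yield itemId, feature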
See our data folder containing all Behance files. The folder also contains additional documentation.
Vista: A visually, socially, and temporally-aware model for artistic recommendation Ruining He, Chen Fang, Zhaowen Wang, Julian McAuley RecSys , 2016 pdf
Social Recommendation Data
These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and epinions (general consumer reviews).
- price paid (epinions)
- helpfulness votes (librarything)
- flags (librarything)
Example (LibraryThing reviews)
{ 'work': '3067', 'flags': [], 'unixtime': 1160265600, 'stars': 4.5, 'nhelpful': 0, 'time': 'Oct 8, 2006', 'comment': 'great storytelling in this novel about a couple crossed by a time travelling disorder ', 'user': 'justine' }
Example (LibraryThing social network)
Rodo anehan
Rodo sevilemar
Rodo dingsi
Rodo slash
RelaxedReader AnnRig
RelaxedReader bookbroke
RelaxedReader Bumpersmom
RelaxedReader DivaColumbus
RelaxedReader AnnRig
RelaxedReader bookbroke
RelaxedReader BookWorm2729
RelaxedReader Bumpersmom
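The social network is given as pairs of user names, and the example contains repeated pairs. The sketch below builds an adjacency map from such pairs, assuming one whitespace-separated pair per line in the data file. It treats each pair as a directed link from the first name to the second, which is an assumption, and `build_adjacency` is a hypothetical helper name.

```python
from collections import defaultdict

def build_adjacency(lines):
    """Map each user to the set of users listed after them; duplicates collapse."""
    adjacency = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        source, target = parts
        adjacency[source].add(target)
    return adjacency

# A few pairs from the example, including one repeated pair.
edges = [
    "Rodo anehan",
    "Rodo sevilemar",
    "RelaxedReader AnnRig",
    "RelaxedReader bookbroke",
    "RelaxedReader AnnRig",
]
graph = build_adjacency(edges)
```

Storing neighbours in a set means the repeated `RelaxedReader AnnRig` pair is counted once.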
LibraryThing (594mb)
epinions (66mb)
SPMC: Socially-aware personalized Markov chains for sparse sequential recommendation Chenwei Cai, Ruining He, Julian McAuley IJCAI , 2017 pdf
Improving latent factor models via personalized feature projection for one-class recommendation Tong Zhao, Julian McAuley, Irwin King Conference on Information and Knowledge Management (CIKM) , 2015 pdf
Other Non-Recommender-Systems Datasets
Below are various datasets collected by my lab that are not related to recommender systems specifically. Formats of these datasets vary, so their respective project pages should be consulted for further details.
DogWhistle: Cant Understanding Data
DogWhistle is a Chinese dataset collected from the historical records of an online game. It provides hidden words and the cant for them, with human answers. The dataset is suitable for evaluating semantic similarity in large language models.
- cant and the hidden words
- cant history
- human answers
Example (insider subtask)
0 高铁,周末,无情,条纹 冷漠,休息,斑马 冷漠 2
1 高铁,周末,无情,条纹 冷漠,休息,斑马 休息 1
2 高铁,周末,无情,条纹 冷漠,休息,斑马 斑马 3
Please refer to our leaderboard page for download instructions.
Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with Common Sense and World Knowledge Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian McAuley, Furu Wei NAACL , 2021 pdf
Video Game Data
Step charts from the video game Dance Dance Revolution , and audio files from the NES platform.
See the project pages for Dance Dance Convolution and NES MDB for further details and links to the data.
Dance Dance Convolution Chris Donahue, Zachary Lipton, Julian McAuley ICML , 2017 pdf
The NES Music Database: A symbolic music dataset with expressive performance attributes Chris Donahue, Henry Mao, Julian McAuley International Society for Music Information Retrieval Conference (ISMIR) , 2018 pdf
Multi-aspect Reviews
These datasets include reviews with multiple rated dimensions. The most comprehensive of these are beer review datasets from Ratebeer and Beeradvocate, which include sensory aspects such as taste, look, feel, and smell.
- aspect-specific ratings (taste, look, feel, smell, overall impression)
- product category
Example (ratebeer)
beer/name: John Harvards Simcoe IPA
beer/beerId: 63836
beer/brewerId: 8481
beer/ABV: 5.4
beer/style: India Pale Ale (IPA)
review/appearance: 4/5
review/aroma: 6/10
review/palate: 3/5
review/taste: 6/10
review/overall: 13/20
review/time: 1157587200
review/profileName: hopdog
review/text: On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized off white head. Aromas or oranges and all around citric. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass.
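The aspect ratings in this record use different scales (`4/5`, `6/10`, `13/20`), so comparing them requires normalisation. The sketch below is an illustration under the assumption that each field sits on its own line in the data file, in `key: value` form; `parse_review` and `normalise_rating` are hypothetical helper names.

```python
def parse_review(lines):
    """Build a dict from 'key: value' lines, splitting at the first ': ' only."""
    record = {}
    for line in lines:
        key, separator, value = line.partition(": ")
        if separator:
            record[key] = value
    return record

def normalise_rating(text):
    """Convert a rating such as '13/20' to a fraction between 0 and 1."""
    score, scale = text.split("/")
    return float(score) / float(scale)

# A subset of the fields from the example above.
lines = [
    "beer/name: John Harvards Simcoe IPA",
    "beer/ABV: 5.4",
    "review/appearance: 4/5",
    "review/overall: 13/20",
    "review/text: On tap at the Springfield, PA location.",
]
review = parse_review(lines)
overall = normalise_rating(review["review/overall"])
appearance = normalise_rating(review["review/appearance"])
```

Splitting only at the first `": "` keeps any later colons in the review text intact.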
BeerAdvocate (433mb)
RateBeer (388mb)
Sentences with aspect labels (annotator 1) (758kb)
Sentences with aspect labels (annotator 2) (759kb)
Learning attitudes and attributes from multi-aspect reviews Julian McAuley, Jure Leskovec, Dan Jurafsky International Conference on Data Mining (ICDM) , 2012 pdf
From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews Julian McAuley, Jure Leskovec WWW , 2013 pdf
Social Circles
These datasets contain social connections and "circles" from Facebook, Twitter, and Google Plus.
- social connections
- circles (sets of friends sharing a common property)
- user metadata
Example (Kaggle egonet data)
UserId: Friends
1: 4 6 12 2 208
2: 5 3 17 90 7
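Each egonet line pairs a user ID with a space-separated list of friend IDs. A minimal sketch of a reader follows, assuming one `user: friends` entry per line and an optional header line; `parse_egonets` is a hypothetical helper name.

```python
def parse_egonets(lines):
    """Return {user_id: [friend_id, ...]} from 'user: f1 f2 ...' lines."""
    egonets = {}
    for line in lines:
        user, separator, friends = line.partition(":")
        if not separator or not user.strip().isdigit():
            continue  # skips the 'UserId: Friends' header and malformed lines
        egonets[int(user)] = [int(friend) for friend in friends.split()]
    return egonets

egonets = parse_egonets(["UserId: Friends", "1: 4 6 12 2 208", "2: 5 3 17 90 7"])
```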
See SNAP facebook , twitter , and Google Plus data, as well as the Kaggle competition based on the same data.
Learning to Discover Social Circles in Ego Networks Julian McAuley, Jure Leskovec Neural Information Processing Systems (NIPS) , 2012 pdf
Reddit Submissions
Submissions of reddit posts (and in particular resubmissions of the same content) along with metadata.
- upvotes/downvotes
- post title, subreddit, etc.
#image_id,unixtime,rawtime,title,total_votes,reddit_id,... number_of_downvotes,localtime,score,number_of_comments,username
1005,1335861624,2012-05-01T15:40:24.968266-07:00,I immediately regret this decision,27,t296r,20,pics,7,1335886824,13,0,ninjaroflmaster
1005,1336470481,2012-05-08T16:48:01.418140-07:00,"Pushing your friend into the water,Level: 99",18,tds4i,16,funny,2,1336495681,14,0,hme4
1005,1339566752,2012-06-13T12:52:32.371941-07:00,I told him. He Didn't Listen,6,v0cma,4,funny,2,1339591952,2,0,HeyPatWhatsUp
1005,1342200476,2012-07-14T00:27:56.857805-07:00,Don't end up as this guy.,16,wjivx,7,funny,9,1342225676,-2,2,catalyst24
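Some titles in this file contain commas and are therefore quoted, so splitting rows on `,` gives the wrong number of fields. The sketch below uses Python's standard `csv` module on two rows copied from the example. Because the header shown above is partly elided, fields are accessed by position rather than by name; the title being the fourth field (index 3) is read off the visible part of the header.

```python
import csv
import io

# Two rows from the example; the second has a quoted title containing a comma.
sample = (
    "1005,1335861624,2012-05-01T15:40:24.968266-07:00,"
    "I immediately regret this decision,27,t296r,20,pics,7,1335886824,13,0,"
    "ninjaroflmaster\n"
    "1005,1336470481,2012-05-08T16:48:01.418140-07:00,"
    '"Pushing your friend into the water,Level: 99",18,tds4i,16,funny,2,'
    "1336495681,14,0,hme4\n"
)

rows = list(csv.reader(io.StringIO(sample)))
titles = [row[3] for row in rows]

# For comparison: a plain split breaks the quoted title into two fields.
naive_field_count = len(sample.splitlines()[1].split(","))
```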
resubmissions data (7.3mb)
raw html of resubmissions (1.8gb)
See also the SNAP project page .
Understanding the interplay between titles, content, and communities in social media Himabindu Lakkaraju, Julian McAuley, Jure Leskovec ICWSM , 2013 pdf
Questions and comments to Julian McAuley
Sentiment analysis of movie reviews based on NB approaches using TF–IDF and count vectorizer
- Original Article
- Published: 16 April 2024
- Volume 14 , article number 87 , ( 2024 )
Mian Muhammad Danyal, Sarwar Shah Khan, Muzammil Khan, Subhan Ullah, Muhammad Bilal Ghaffar & Wahab Khan
Movies have been important in our lives for many years. Movies provide entertainment, inspire, educate, and offer an escape from reality. Movie reviews help us choose better movies, but reading them all can be time-consuming and overwhelming. To make this easier, sentiment analysis can classify movie reviews into positive and negative categories. Opinion mining (OP), also called sentiment analysis (SA), uses natural language processing to identify and extract opinions expressed through text. Naive Bayes, a supervised learning algorithm, offers simplicity, efficiency, and strong performance in classification tasks due to its feature independence assumption. This study evaluates the performance of four Naive Bayes variations using two vectorization techniques, Count Vectorizer and Term Frequency–Inverse Document Frequency (TF–IDF), on two movie review datasets: the IMDb Movie Reviews Dataset and Rotten Tomatoes Movie Reviews. Bernoulli Naive Bayes achieved the highest accuracy using Count Vectorizer on the IMDb and Rotten Tomatoes datasets. Multinomial Naive Bayes, on the other hand, achieved better accuracy on the IMDb dataset with TF–IDF. During preprocessing, we implemented different techniques to enhance the quality of our datasets. These included data cleaning, spelling correction, fixing chat words, lemmatization, and removing stop words. Additionally, we fine-tuned our models through hyperparameter tuning to achieve optimal results. Using TF–IDF, we observed a slight performance improvement compared to using the count vectorizer. The experiment highlights the significant role of sentiment analysis in understanding the attitudes and emotions expressed in movie reviews. By predicting the sentiments of each review and calculating the average sentiment of all reviews, it becomes possible to make an accurate prediction about a movie’s overall performance.
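To make the count-based Naive Bayes idea concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier over raw word counts with Laplace smoothing. It is not the authors' implementation: it omits TF–IDF weighting, the preprocessing steps, and hyperparameter tuning, and the function names and four toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(texts, labels, alpha=1.0):
    """Collect per-class word counts; alpha is the Laplace smoothing constant."""
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    class_counts = Counter(labels)
    log_priors = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    return {"log_priors": log_priors, "word_counts": word_counts,
            "vocabulary": vocabulary, "alpha": alpha}

def predict(model, text):
    """Return the class with the highest log posterior; unseen words are ignored."""
    tokens = [t for t in text.lower().split() if t in model["vocabulary"]]
    vocab_size = len(model["vocabulary"])
    alpha = model["alpha"]
    scores = {}
    for label, log_prior in model["log_priors"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        scores[label] = log_prior + sum(
            math.log((counts[t] + alpha) / (total + alpha * vocab_size))
            for t in tokens
        )
    return max(scores, key=scores.get)

# Hypothetical toy reviews, not drawn from either dataset.
texts = ["great fun movie", "loved the acting", "boring slow plot", "terrible waste of time"]
labels = ["pos", "pos", "neg", "neg"]
model = train_multinomial_nb(texts, labels)
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on longer reviews.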
Data availability statement.
The data that support the findings of this study are openly available through the Open Science Framework at https://github.com/Ankit152/IMDB-sentiment-analysis.git and https://www.kaggle.com/datasets/talha002/rottentomatoes-400k-review
Abbreviations
Aspect-based sentiment analysis
Artificial intelligence
Bag-of-words
Bernoulli Naive Bayes
Complement Naive Bayes
Cross-validation
Deep learning
Gaussian Naive Bayes
Grid search
Internet movie database
K-Nearest Neighbours
Support vector machines
Machine learning
Multinomial Naive Bayes
Naive Bayes
Natural language processing
Natural language tool kit
Opinion mining
Rotten Tomatoes
True Positive
True Negative
False Positive
False Negative
Sentiment analysis
Term Frequency–Inverse Document Frequency
Word to vector
Abimanyu AJ, Dwifebri M, Astuti W (2023) Sentiment analysis on movie review from rotten tomatoes using logistic regression and information gain feature selection. Build Inf Technol Sci (BITS) 5(1):162–170
Adam NL, Rosli NH, Soh SC (2021) Sentiment analysis on movie review using Naïve Bayes. In: 2021 2nd International conference on artificial intelligence and data sciences (AiDAS), pp 1–6. https://doi.org/10.1109/AiDAS53897.2021.9574419
Agrawal T (2021) Introduction to hyperparameters. In: Hyperparameter optimization in machine learning: make your machine learning and deep learning models more efficient, pp 1–8. APRESS: New York
Arsyah UI, Pratiwi M, Muhammad A (2024) Twitter sentiment analysis of public space opinions using SVM and TF–IDF methods. Indon J Comput Sci 13(1)
Artur M (2021) Review the performance of the bernoulli Naïve Bayes classifier in intrusion detection systems using recursive feature elimination with cross-validated selection of the best number of features. Proc Comput Sci 190:564–570
Asghar MZ, Khan A, Ahmad S, Kundi FM (2014) A review of feature extraction in sentiment analysis. J Basic Appl Sci Res 4(3):181–186
Baid P, Gupta A, Chaplot N (2017) Sentiment analysis of movie reviews using machine learning techniques. Int J Comput Appl 179(7):45–49
Banik N, Rahman MHH (2018) Evaluation of Naïve Bayes and support vector machines on Bangla textual movie reviews. In: 2018 International conference on Bangla speech and language processing (ICBSLP), pp 1–6. IEEE
Başarslan MS, Kayaalp F (2023) MBI-GRUMCONV: a novel multi BI-GRU and multi CNN-based deep learning model for social media sentiment analysis. J Cloud Comput. https://doi.org/10.1186/s13677-022-00386-3
Bilal Khan S, Muhammad Arshad SK (2023) Comparative analysis of machine learning models for pdf malware detection: Evaluating different training and testing criteria. J Cyber Secur 5(1), 1–11 https://doi.org/10.32604/jcs.2023.042501
Bodapati JD, Veeranjaneyulu N, Shareef SN (2019) Sentiment analysis from movie reviews using LSTMS. Ingénierie des Systèmes d Inf 24(1):125–129
Cahyanti FE, AlFaraby S (2020) On the feature extraction for sentiment analysis of movie reviews based on SVM. In: 2020 8th International conference on information and communication technology (ICoICT), pp 1–5, IEEE
Danyal MM, Khan SS, Khan M, Ullah S, Mehmood F, Ali I (2024) Proposing sentiment analysis model based on BERT and XLNET for movie reviews. Multimed Tools Appl 1–25
Deepa D, Raaji Tamilarasi A (2019) Sentiment analysis using feature extraction and dictionary-based approaches. In: 2019 Third international conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, pp 786–790. https://doi.org/10.1109/I-SMAC47947.2019.9032456
Dewi C, Chen R-C, Christanto HJ, Cauteruccio F (2023) Multinomial Naïve Bayes classifier for sentiment analysis of internet movie database. Vietnam J Comput Sci 10(04):485–498
Dey L, Chakraborty S, Biswas A, Bose B, Tiwari S (2016) Sentiment analysis of review datasets using Naive Bayes and k-NN classifier. arXiv preprint arXiv:1610.09982
Danyal M M, Haseeb M, Khan S S, Khan B, Ullah S (2024) Opinion Mining on Movie Reviews Based on Deep Learning Models. J Artif Intell (6):(2579–0021).
Danyal M M, Khan S S, Khan M, Ghaffar M B, Khan B, Arshad, M (2023) Sentiment Analysis Based on Performance of Linear Support Vector Machine and Multinomial Naïve Bayes Using Movie Reviews with Baseline Techniques. J Big Data (5).
Horsa OG, Tune KK, et al (2023) Aspect-based sentiment analysis for AFAAN OROMOO movie reviews using machine learning techniques. Appl Comput Intell Soft Comput 2023
Jahromi AH, Taheri M (2017) A non-parametric mixture of gaussian Naive Bayes classifiers based on local independent features. In: 2017 Artificial intelligence and signal processing conference (AISP), pp 209–212. IEEE
Khan M, Khan M S, Alharbi Y (2020) Text mining challenges and applications—a comprehensive review. IJCSNS 20(12):138
Khan SS, Khan M, Ran Q, Naseem R (2018) Challenges in opinion mining, comprehensive. Sci Technol J (Ciencia e Tecnica Vitivinicola) 33(11):123–135
Maas AL, Daly R, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, pp 142–150
Mall P, Kumar M, Kumar A, Gupta A, Srivastava S, Narayan V, Chauhan AS, Srivastava AP (2024) Self-attentive CNN + BERT: An approach for analysis of sentiment on movie reviews using word embedding. Int J Intell Syst Appl Eng 12(12s):612–623
Maulana R, Rahayuningsih PA, Irmayani W, Saputra D, Jayanti WE (2020) Improved accuracy of sentiment analysis movie review using support vector machine based information gain. J Phys Conf Ser 1641:012060
Pimpalkar A, Raj RJR (2022) Mbilstmglove: embedding glove knowledge into the corpus using multi-layer Bilstm deep learning model for social media sentiment analysis. Exp Syst Appl 203:117581. https://doi.org/10.1016/j.eswa.2022.117581
Rahat AM, Kahir A, Masum AKM (2019) Comparison of Naive Bayes and SVM algorithm based on sentiment analysis using review dataset. In: 2019 8th International conference system modeling and advancement in research trends (SMART), pp 266–270. IEEE
Rahman R, Masud MA, Mimi RJ, Dina MNS (2021) Sentiment analysis on Bengali movie reviews using multinomial Naïve Bayes. In: 2021 24th International conference on computer and information technology (ICCIT), pp 1–6. https://doi.org/10.1109/ICCIT54785.2021.9689787
Rizal C, Kifta DA, Nasution RH, Rengganis A, Watrianthos R (2023) Opinion classification for IMDB review based using Naive Bayes method. In: AIP conference proceedings, vol 2913. AIP Publishing: New York
Rotten Tomatoes Movie Reviews dataset https://www.rottentomatoes.com . Accessed on 02 Mar 2023 (2020)
Samsir S, Kusmanto K, Dalimunthe AH, Aditiya R, Watrianthos R (2022) Implementation Naïve Bayes classification for sentiment analysis on internet movie database. Build Inf Technol Sci (BITS) 4(1):1–6
Shackley D, Folajimi Y (2023) Sentiment analysis of fake health news using Naive Bayes classification models. Int J Cognit Lang Sci 17(3):217–224
Sudha N, Govindarajan M (2016) Mining movie reviews using machine learning techniques. Int J Comput Appl 144 (5)
Teja JS, Sai GK, Kumar MD, Manikandan R (2018) Sentiment analysis of movie reviews using machine learning algorithms—a survey. Int J Pure Appl Math 118(20):3277–3284
Ullah K, Rashad, A, Khan M, Ghadi Y, Aljuaid H, Nawaz Z et al (2022) A deep neural network-based approach for sentiment analysis of movie reviews. Complexity 2022
Veziroğlu M, Eziroğlu E, Bucak İ.Ö (2024) Performance comparison between Naive Bayes and machine learning algorithms for news classification. In: Bayesian inference-recent trends. IntechOpen
Vielma C, Verma A, Bein D (2023) Sentiment analysis with novel GRU based deep learning networks. In: 2023 IEEE World AI IoT congress (AIIoT), pp 0440–0446. https://doi.org/10.1109/AIIoT58121.2023.10174396
Yang L, Shami A (2020) On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 415(1):295–316
Yusran M, Siswanto S, Islamiyati A (2024) Comparison of multinomial Naïve Bayes and Bernoulli Naïve Bayes on sentiment analysis of Kurikulum Merdeka with query expansion ranking. SISTEMASI 13(1):96–106
Acknowledgements
We sincerely thank everyone who helped us finish this research paper. We are grateful to the participants for their helpful feedback and ideas, which improved our research methods and the quality of our results. We appreciate everyone who gave their time to join our study, as this research wouldn’t have been possible without them. Thank you to everyone who took the time to contribute to this research paper.
This paper is for free publication.
Author information
Mian Muhammad Danyal, Muzammil Khan, Subhan Ullah, Muhammad Bilal Ghaffar, Wahab Khan have contributed equally to this work.
Authors and Affiliations
Center for Excellence in Information Technology, Institute of Management Sciences, Peshawar, 24720, Pakistan
Mian Muhammad Danyal
Department of Computer Science, City University of Science and Information Technology, Peshawar, 25000, Pakistan
Mian Muhammad Danyal, Subhan Ullah, Muhammad Bilal Ghaffar & Wahab Khan
Department of Computer and Software Technology, University of Swat, Swat, 19130, Pakistan
Sarwar Shah Khan & Muzammil Khan
Department of Computer Science, Iqra University Swat Campus, Swat, 19130, Pakistan
Sarwar Shah Khan
Contributions
The author contributions are as follows: conceptualization, MMD and SSK; methodology, MBG and MK; software, MMD and SU; validation, SSK and WK; formal analysis, MK, WK, and MBG; investigation, SU; data curation, SU and SSK; writing (original draft preparation), MMD and MBG; writing (review and editing), SSK; visualization, MBG and MK.
Corresponding author
Correspondence to Muzammil Khan .
Ethics declarations
Conflict of interest.
The authors of this paper declare that they do not have any conflicts of interest.
Financial interests
The authors of this paper have no conflict of interest relevant to this article’s content to declare.
Ethical approval
Not applicable.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Danyal, M.M., Khan, S.S., Khan, M. et al. Sentiment analysis of movie reviews based on NB approaches using TF–IDF and count vectorizer. Soc. Netw. Anal. Min. 14 , 87 (2024). https://doi.org/10.1007/s13278-024-01250-9
Received : 02 April 2023
Revised : 16 March 2024
Accepted : 20 March 2024
Published : 16 April 2024
DOI : https://doi.org/10.1007/s13278-024-01250-9
- IMDB dataset
- Rotten tomatoes dataset
- Count vectorizer
Amazon Review Data (2018) Jianmo Ni, UCSD. Description. This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
In the Amazon Reviews'23, we provide: Larger Dataset: We collected 571.54M reviews, 245.2% larger than the last version; Newer Interactions: Current interactions range from May. 1996 to Sep. 2023; Richer Metadata: More descriptive features in item metadata; Fine-grained Timestamp: Interaction timestamp at the second or finer level;
Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon's iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website.
This is a large-scale Amazon Reviews dataset, collected in 2023 by McAuley Lab, and it includes rich features such as: User Reviews ( ratings, text, helpfulness votes, etc.); Item Metadata ( descriptions, price, raw image, etc.); Links ( user-item / bought together graphs).
Contains a table with 300,000 unique reviews. The format of the table has two columns; i) 'Id': contains an id that corresponds to a review in train.csv for which you predict a score ii) 'Score': the values for this column are missing since it will include the score predictions. You are required to predict the star ratings of these Id using the ...
The Amazon Movies Reviews dataset consists of 7,911,684 reviews Amazon users left between Aug 1997 - Oct 2012 about 253,059 products. Data format: product/productId: B00006HAXW review/userId: A1RSDE90N6RSZF review/profileName: Joseph M. Kotow review/helpfulness: 9/9 review/score: 5.0 review/time: 1042502400 review/summary: Pittsburgh - Home of the OLDIES ...
It has removed the profile name of the reviewer, the review-summary, and the review-text from the primary data. The data span a period of more than 10 years, including all up to October 2012. Each row contains 6 fields: 1. Product ID (e.g. B003AI2VGA) 2. User ID (e.g. A141HP4LYPWMSR) 3. Count of thumb-ups received by this review (e.g. 7) 4.
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). Files.
In this project, we analyze a dataset consisting of 50,000 movie reviews on Amazon. The dataset includes the observed rating on a 1 - 5 scale, along with the text of the user's review. As the dataset is quite sparse, these reviews prove to be valuable in predicting unobserved ratings. We train two supervised learning algorithms
In this case study, we use the Amazon movies reviews dataset. The dataset spans a period between August of 1997 and October of 2012. It consists of 889,176 users and 253,059 movies. The users have given 7,911,684 reviews, and their median word count is 101. Each entry in the dataset consists of a user identification number, the movie id that ...
This is a large-scale Amazon Reviews dataset collected in 2023. This dataset contains 48.19 million items, and 571.54 million reviews from 54.51 million users. Basic statistics ... These datasets contain reviews from the Steam video game platform, and information about which games were bundled together. Basic statistics. Reviews: 7,793,069 ...
I have created and published a new dataset containing the movies streaming on the Amazon Prime Video platform. This dataset contains over 7K+ unique movies. The metadata contains information about the IMDb rating that the movie received, the total running time of the movie, audio language, maturing rating, and a short descriptive summary of the ...
Amazon Review is a dataset to tackle the task of identifying whether the sentiment of a product review is positive or negative. This dataset includes reviews from four different merchandise categories: Books (B) (2834 samples), DVDs (D) (1199 samples), Electronics (E) (1883 samples), and Kitchen and housewares (K) (1755 samples).
4.1.2 Rotten Tomatoes Movie reviews dataset. The Rotten Tomatoes Reviews dataset consists of movie reviews and labels indicating whether they are "fresh" or "rotten". The dataset covers various movies and genres and includes metadata such as year of release, genre, and cast (Asghar et al. 2014). For this research experiment, 50K samples are ...
The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10.
The current state-of-the-art on Amazon Review Full is BERT large. See a full comparison of 9 papers with code.