amazon movie review dataset

Web data: Amazon movie reviews

Dataset information.

This dataset consists of movie reviews from Amazon. The data span a period of more than 10 years, including all ~8 million reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories.

Source (citation)

  • J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.

Data format

  • product/productId : ASIN, e.g. amazon.com/dp/B00006HAXW
  • review/userId : id of the user, e.g. A1RSDE90N6RSZF
  • review/profileName : name of the user
  • review/helpfulness : fraction of users who found the review helpful
  • review/score : rating of the product
  • review/time : time of the review (unix time)
  • review/summary : review summary
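
The field list above can be read with a small parser. The sketch below assumes each review is stored as consecutive `key: value` lines with a blank line between records, and that `review/helpfulness` is written as `helpful/total` (for example `7/7`). Both are assumptions about the file layout rather than something stated above, and `parse_snap_reviews` and `helpfulness_fraction` are hypothetical names.

```python
from typing import Dict, Iterable, Iterator, Optional


def parse_snap_reviews(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per review from 'key: value' lines (assumed layout)."""
    record: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line is assumed to end the current record.
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    if record:
        yield record


def helpfulness_fraction(value: str) -> Optional[float]:
    """Convert an assumed 'helpful/total' string to a fraction (None if no votes)."""
    helpful, _, total = value.partition("/")
    if not total or int(total) == 0:
        return None
    return int(helpful) / int(total)


# Made-up sample lines that reuse the example IDs from the field list.
sample = [
    "product/productId: B00006HAXW",
    "review/userId: A1RSDE90N6RSZF",
    "review/helpfulness: 7/7",
    "review/score: 5.0",
    "",
    "product/productId: B00006HAXW",
    "review/helpfulness: 0/0",
]
reviews = list(parse_snap_reviews(sample))
```

Splitting on the first colon keeps free-text values such as summaries intact even when they contain further colons.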

Amazon Review Data (2018)

Jianmo Ni, UCSD

Description

  • The total number of reviews is 233.1 million (142.8 million in 2014).
  • Current data includes reviews in the range May 1996 - Oct 2018.
  • Product information, e.g. color (white or black), size (large or small), package type (hardcover or electronics), etc.
  • Product images that are taken after the user received the product.
  • Bullet-point descriptions under product title.
  • Technical details table (attribute-value pairs).
  • Similar products table.
  • Includes 5 new product categories.

You can also download the review data from our previous datasets.

Amazon review (2014)

Amazon review (2013)

Please cite the following paper if you use the data in any way:

Justifying recommendations using distantly-labeled reviews and fine-grained aspects. Jianmo Ni, Jiacheng Li, Julian McAuley. Empirical Methods in Natural Language Processing (EMNLP), 2019.

05/2021 We added high-resolution image URLs to the metadata!

08/2020 We have updated the metadata and now it includes much less HTML/CSS code. Feel free to download the updated data!

  • Load the metadata (e.g. as JSON or DataFrame)
  • Check if title has HTML contents and filter them
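
A minimal sketch of the two steps above, assuming the metadata has already been loaded as a list of dictionaries. The regular expression is only a heuristic chosen for illustration (it flags HTML tags, CSS-like `{...:...}` blocks, and `function(`), and `has_markup` and `drop_markup_titles` are hypothetical helper names.

```python
import re
from typing import Dict, Iterable, List

# Heuristic (assumption): a title containing an HTML tag, a CSS-like block,
# or a JavaScript function header is treated as unparsed markup.
_MARKUP = re.compile(r"<[a-zA-Z/][^>]*>|\{[^}]*:[^}]*\}|function\s*\(")


def has_markup(title: str) -> bool:
    """Return True if the title looks like leftover HTML/CSS/JS."""
    return bool(_MARKUP.search(title))


def drop_markup_titles(products: Iterable[Dict]) -> List[Dict]:
    """Keep only products whose title does not look like markup."""
    return [p for p in products if not has_markup(p.get("title", ""))]


products = [
    {"asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink"},
    # Hypothetical record with an unparsed title.
    {"asin": "HYPOTHETICAL", "title": "<span class='a-size-medium'>var x = 1;</span>"},
]
clean = drop_markup_titles(products)
```

A stricter or looser pattern may suit a given category better; inspecting a sample of the rejected titles is a reasonable way to tune it.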

We provide a colab notebook that helps you find target products and obtain their reviews!

  • Unparsed HTML contents
  • Duplicate items which have same reviews

Complete review data

Please only download these (large!) files if you really need them. We recommend using the smaller datasets (i.e. k-core and CSV files) as shown in the next section.

raw review data (34gb) - all 233.1 million reviews

user review data (18gb) - duplicate items removed (83.68 million reviews), sorted by user

product review data (18gb) - duplicate items removed, sorted by product

5-core (14.3gb) - subset of the data in which all users and items have at least 5 reviews (75.26 million reviews)

Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks:

aggressively deduplicated data (18gb) - no duplicates whatsoever (82.83 million reviews)
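
The idea of removing duplicates regardless of author can be sketched as follows. The text above does not say how duplicates were detected, so this example assumes two reviews are duplicates when their `reviewText` is identical after lower-casing and collapsing whitespace; `dedupe_by_text` is a hypothetical helper.

```python
from typing import Dict, Iterable, Iterator


def dedupe_by_text(reviews: Iterable[Dict]) -> Iterator[Dict]:
    """Yield the first review for each normalized text, ignoring the reviewer."""
    seen = set()
    for review in reviews:
        # Normalize: lower-case and collapse runs of whitespace.
        key = " ".join(review.get("reviewText", "").lower().split())
        # Reviews with empty text are always kept (assumption).
        if key and key in seen:
            continue
        seen.add(key)
        yield review


# The second and third reviewer IDs are made up for illustration.
sample = [
    {"reviewerID": "A2SUAM1J3GNN3B", "reviewText": "Great  purchase though!"},
    {"reviewerID": "USER_B", "reviewText": "great purchase though!"},
    {"reviewerID": "USER_C", "reviewText": "Hard to read."},
]
unique = list(dedupe_by_text(sample))
```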

Per-category data - the review and product metadata for each category.

To download the complete review data and the per-category files, the following links will direct you to enter a form. Please contact me if you can't get access to the form.

"Small" subsets for experimentation

If you're using this data for a class project (or similar) please consider using one of these smaller datasets below before requesting the larger files.

K-cores (i.e., dense subsets): These data have been reduced to extract the k-core, such that each of the remaining users and items has at least k reviews.
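
As an illustration of k-core extraction, the sketch below repeatedly drops reviews whose user or item has fewer than k reviews until nothing changes. It works on plain (user, item) pairs; `k_core` is a hypothetical helper and not the script used to build the published files.

```python
from collections import Counter
from typing import List, Tuple


def k_core(pairs: List[Tuple[str, str]], k: int) -> List[Tuple[str, str]]:
    """Keep (user, item) pairs whose user and item both have >= k reviews."""
    while True:
        user_counts = Counter(user for user, _ in pairs)
        item_counts = Counter(item for _, item in pairs)
        kept = [
            (user, item)
            for user, item in pairs
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        # Stop once a full pass removes nothing.
        if len(kept) == len(pairs):
            return kept
        pairs = kept


# Made-up interactions: u3 survives the first pass but not the second,
# because item c (reviewed only by u3) is removed first.
pairs = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"), ("u3", "c"),
]
core = k_core(pairs, k=2)
```

The loop is needed because removing a sparse item can push one of its reviewers below the threshold, as the `u3` example shows.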

Ratings only: These datasets include no metadata or reviews, but only (item,user,rating,timestamp) tuples. Thus they are suitable for use with mymedialite (or similar) packages.

You can directly download the following smaller per-category datasets.

Data format

Format is one-review-per-line in json. See examples below for further help reading the data.

Sample review:

{ "image": ["https://images-na.ssl-images-amazon.com/images/I/71eG75FTJJL._SY88.jpg"], "overall": 5.0, "vote": "2", "verified": true, "reviewTime": "01 1, 2018", "reviewerID": "AUI6WTTT0QZYS", "asin": "5120053084", "style": { "Size:": "Large", "Color:": "Charcoal" }, "reviewerName": "Abbey", "reviewText": "I now have 4 of the 5 available colors of this shirt... ", "summary": "Comfy, flattering, discreet--highly recommended!", "unixReviewTime": 1514764800 }

{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "vote": 5, "style": { "Format:": "Hardcover" }, "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }

  • reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
  • asin - ID of the product, e.g. 0000013714
  • reviewerName - name of the reviewer
  • vote - helpful votes of the review
  • style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
  • reviewText - text of the review
  • overall - rating of the product
  • summary - summary of the review
  • unixReviewTime - time of the review (unix time)
  • reviewTime - time of the review (raw)
  • image - images that users post after they have received the product
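
In the sample reviews, `vote` appears once as a string ("2") and once as an integer (5), so it helps to normalize records before analysis. The sketch below is one way to do that; treating a missing `vote` as 0, stripping thousands separators, and the derived `reviewDate` key are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def normalize_review(review: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with an integer vote count and an ISO review date."""
    out = dict(review)
    vote = review.get("vote", 0)  # assumption: a missing vote means 0
    if isinstance(vote, str):
        # Assumption: string counts may contain thousands separators.
        vote = vote.replace(",", "")
    out["vote"] = int(vote)
    # unixReviewTime is in seconds; interpret it as UTC.
    out["reviewDate"] = datetime.fromtimestamp(
        review["unixReviewTime"], tz=timezone.utc
    ).date().isoformat()
    return out


first = normalize_review({"vote": "2", "unixReviewTime": 1514764800})
second = normalize_review({"vote": 5, "unixReviewTime": 1252800000})
third = normalize_review({"unixReviewTime": 1252800000})
```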

Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:

metadata (24gb) - metadata for 15.5 million products

Sample metadata:

{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "feature": ["Botiquecutie Trademark exclusive Brand", "Hot Pink Layered Zebra Print Tutu", "Fits girls up to a size 4T", "Hand wash / Line Dry", "Includes a Botiquecutie TM Exclusive hair flower bow"], "description": "This tutu is great for dress up play for your little ballerina. Botiquecute Trade Mark exclusive brand. Hot Pink Zebra print tutu.", "price": 3.17, "imageURL": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "imageURLHighRes": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL.jpg", "also_buy": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }

  • asin - ID of the product, e.g. 0000031852
  • title - name of the product
  • feature - bullet-point format features of the product
  • description - description of the product
  • price - price in US dollars (at time of crawl)
  • imageURL - url of the product image
  • imageURLHighRes - url of the high resolution product image
  • related - related products (also bought, also viewed, bought together, buy after viewing)
  • salesRank - sales rank information
  • brand - brand name
  • categories - list of categories the product belongs to
  • tech1 - the first technical detail table of the product
  • tech2 - the second technical detail table of the product
  • similar - similar product table

Visual Features

We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.

visual features (141gb) - visual features for all products

The images themselves can be extracted from the image field in the metadata files.

Below are files for individual product categories, which have already had duplicate item reviews removed.

Reading the data

Data can be treated as Python dictionary objects. A simple script to read any of the above data is as follows:

    import gzip
    import json

    def parse(path):
        # Yield one record (a dict) per line of the gzipped file.
        with gzip.open(path, 'rb') as g:
            for l in g:
                yield json.loads(l)

Convert to 'strict' json

The above data can be read with Python's 'eval' (or, more safely, 'ast.literal_eval'), but is not strict JSON. If you'd like to use some language other than Python, you can convert the data to strict JSON as follows:

    import ast
    import gzip
    import json

    def parse(path):
        # Each line is a Python dict literal; literal_eval parses it
        # without executing arbitrary code.
        with gzip.open(path, 'rt') as g:
            for l in g:
                yield json.dumps(ast.literal_eval(l))

    with open("output.strict", 'w') as f:
        for l in parse("reviews_Video_Games.json.gz"):
            f.write(l + '\n')

Pandas data frame

This code reads the data into a pandas data frame:

    import gzip
    import json
    import pandas as pd

    def parse(path):
        with gzip.open(path, 'rb') as g:
            for l in g:
                yield json.loads(l)

    def getDF(path):
        # Collect records keyed by row number, then build the frame once.
        df = {}
        for i, d in enumerate(parse(path)):
            df[i] = d
        return pd.DataFrame.from_dict(df, orient='index')

    df = getDF('reviews_Video_Games.json.gz')

Convert to CSV

This code converts (a selection of fields from) the above files to CSV format:

    import csv
    import gzip

    fields = ["asin", "description", "brand"]
    # Text mode ('wt') is needed because csv.writer writes str, not bytes.
    with gzip.open("meta_Video_Games.csv.gz", 'wt', newline='') as csvOut:
        writer = csv.writer(csvOut)
        # parse() is the reader defined above.
        for product in parse("meta_Video_Games.json.gz"):
            # Write an empty string for any field the product lacks.
            writer.writerow([product.get(f, "") for f in fields])

Read image features

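    import array

    def readImageFeatures(path):
        with open(path, 'rb') as f:
            while True:
                asin = f.read(10)       # 10-character product ID
                if not asin:            # empty read means end of file
                    break
                a = array.array('f')
                a.fromfile(f, 4096)     # 4096 floats follow each ID
                yield asin.decode('ascii'), a.tolist()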

Example: compute average rating

    ratings = []
    for review in parse("reviews_Video_Games.json.gz"):
        ratings.append(review['overall'])
    print(sum(ratings) / len(ratings))

Example: latent-factor model in mymedialite

Predicts ratings from a rating-only CSV file

./rating_prediction --recommender=BiasedMatrixFactorization --training-file=ratings_Video_Games.csv --test-ratio=0.1

Introduction

This is a large-scale Amazon Reviews dataset, collected in 2023 by McAuley Lab, and it includes rich features such as:

  • User Reviews (ratings, text, helpfulness votes, etc.);
  • Item Metadata (descriptions, price, raw image, etc.);
  • Links (user-item / bought together graphs).

What's New?

In Amazon Reviews'23, we provide:

  • Larger Dataset: We collected 571.54M reviews, about 2.45 times as many as the previous version (233.1M);
  • Newer Interactions: Current interactions range from May 1996 to Sep 2023;
  • Richer Metadata: More descriptive features in item metadata;
  • Fine-grained Timestamp: Interaction timestamp at the second or finer level;
  • Cleaner Processing: Cleaner item metadata than previous versions;
  • Standard Splitting: Standard data splits to encourage RecSys benchmarking.

Basic Statistics

We define the #R_Tokens as the number of tokens in user reviews and #M_Tokens as the number of tokens if treating the dictionaries of item attributes as strings. We emphasize them as important statistics in the era of LLMs.

We count the number of items based on user reviews rather than item metadata files. Note that some items lack metadata.

Compared to Previous Versions

Grouped by Category

See the Common Data Processing section for the pure ID files and the corresponding data splitting strategies.

Quick Start

The quick start covers loading user reviews and loading item metadata. See the Common Data Loading section for data loading examples and the Hugging Face datasets APIs.

Data Fields

Fields are documented separately for user reviews and for item metadata.

Contact Us

Report Bugs: To report bugs in the dataset, please file an issue on our GitHub.

Others: For research collaborations or other questions, please email yphou AT ucsd.edu.

amazon_us_reviews

  • Description:

Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon's iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.

More than 130 million customer reviews are available to researchers as part of this release. The data is available in TSV files in the amazon-reviews-pds S3 bucket in AWS US East Region. Each line in the data files corresponds to an individual review (tab delimited, with no quote and escape characters).

Each dataset contains the following columns:

  • marketplace - 2 letter country code of the marketplace where the review was written.
  • customer_id - Random identifier that can be used to aggregate reviews written by a single author.
  • review_id - The unique ID of the review.
  • product_id - The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same product_id.
  • product_parent - Random identifier that can be used to aggregate reviews for the same product.
  • product_title - Title of the product.
  • product_category - Broad product category that can be used to group reviews (also used to group the dataset into coherent parts).
  • star_rating - The 1-5 star rating of the review.
  • helpful_votes - Number of helpful votes.
  • total_votes - Number of total votes the review received.
  • vine - Review was written as part of the Vine program.
  • verified_purchase - The review is on a verified purchase.
  • review_headline - The title of the review.
  • review_body - The review text.
  • review_date - The date the review was written.
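
Because the files are tab delimited with no quote or escape characters, a reader should switch off quote handling. The sketch below uses Python's standard `csv` module with `csv.QUOTE_NONE`; it assumes the first line of each file is a header row naming the columns, and the sample row (including its IDs) is made up for illustration.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def read_reviews_tsv(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per review from a tab-delimited file with a header row."""
    # QUOTE_NONE keeps double quotes in review text as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Small in-memory stand-in for a TSV file (subset of the columns).
sample = "\n".join([
    "marketplace\tcustomer_id\treview_id\tstar_rating\thelpful_votes\ttotal_votes\treview_body",
    'US\t12345\tR1EXAMPLE\t5\t3\t4\t"Best" case I own',
])
rows = list(read_reviews_tsv(io.StringIO(sample)))
```

With the default quoting mode, a field that begins with a double quote would be parsed as a quoted field and its text altered, which is why `QUOTE_NONE` is used here.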

Homepage: https://s3.amazonaws.com/amazon-reviews-pds/readme.html

Source code: tfds.datasets.amazon_us_reviews.Builder

Versions:

  • 0.1.0 (default): No release notes.

Supervised keys (see as_supervised doc): None

Figure (tfds.show_examples): Not supported.

amazon_us_reviews/Wireless_v1_00 (default config)

Config description : A dataset consisting of reviews of Amazon Wireless_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 1.59 GiB

Dataset size : 7.21 GiB

Auto-cached ( documentation ): No

amazon_us_reviews/Watches_v1_00

Config description : A dataset consisting of reviews of Amazon Watches_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 155.42 MiB

Dataset size : 753.08 MiB

amazon_us_reviews/Video_Games_v1_00

Config description : A dataset consisting of reviews of Amazon Video_Games_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 453.19 MiB

Dataset size : 1.78 GiB

amazon_us_reviews/Video_DVD_v1_00

Config description : A dataset consisting of reviews of Amazon Video_DVD_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 1.41 GiB

Dataset size : 5.31 GiB

amazon_us_reviews/Video_v1_00

Config description : A dataset consisting of reviews of Amazon Video_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 132.49 MiB

Dataset size : 465.08 MiB

amazon_us_reviews/Toys_v1_00

Config description : A dataset consisting of reviews of Amazon Toys_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 799.61 MiB

Dataset size : 3.61 GiB

amazon_us_reviews/Tools_v1_00

Config description : A dataset consisting of reviews of Amazon Tools_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 318.32 MiB

Dataset size : 1.37 GiB

amazon_us_reviews/Sports_v1_00

Config description : A dataset consisting of reviews of Amazon Sports_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 832.06 MiB

Dataset size : 3.64 GiB

amazon_us_reviews/Software_v1_00

Config description : A dataset consisting of reviews of Amazon Software_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 89.66 MiB

Dataset size : 366.16 MiB

amazon_us_reviews/Shoes_v1_00

Config description : A dataset consisting of reviews of Amazon Shoes_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 612.50 MiB

Dataset size : 3.06 GiB

amazon_us_reviews/Pet_Products_v1_00

Config description : A dataset consisting of reviews of Amazon Pet_Products_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 491.92 MiB

Dataset size : 2.11 GiB

amazon_us_reviews/Personal_Care_Appliances_v1_00

Config description : A dataset consisting of reviews of Amazon Personal_Care_Appliances_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 16.82 MiB

Dataset size : 75.03 MiB

Auto-cached ( documentation ): Yes

amazon_us_reviews/PC_v1_00

Config description : A dataset consisting of reviews of Amazon PC_v1_00 products in US marketplace. Each product has its own version as specified with it.

Dataset size : 5.93 GiB

amazon_us_reviews/Outdoors_v1_00

Config description : A dataset consisting of reviews of Amazon Outdoors_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 428.16 MiB

Dataset size : 1.83 GiB

amazon_us_reviews/Office_Products_v1_00

Config description : A dataset consisting of reviews of Amazon Office_Products_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 488.59 MiB

Dataset size : 2.12 GiB

amazon_us_reviews/Musical_Instruments_v1_00

Config description : A dataset consisting of reviews of Amazon Musical_Instruments_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 184.43 MiB

Dataset size : 792.16 MiB

amazon_us_reviews/Music_v1_00

Config description : A dataset consisting of reviews of Amazon Music_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 1.42 GiB

Dataset size : 5.16 GiB

amazon_us_reviews/Mobile_Electronics_v1_00

Config description : A dataset consisting of reviews of Amazon Mobile_Electronics_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 21.81 MiB

Dataset size : 94.97 MiB

amazon_us_reviews/Mobile_Apps_v1_00

Config description : A dataset consisting of reviews of Amazon Mobile_Apps_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 532.11 MiB

Dataset size : 3.13 GiB

amazon_us_reviews/Major_Appliances_v1_00

Config description : A dataset consisting of reviews of Amazon Major_Appliances_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 23.23 MiB

Dataset size : 96.36 MiB

amazon_us_reviews/Luggage_v1_00

Config description : A dataset consisting of reviews of Amazon Luggage_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 57.53 MiB

Dataset size : 274.07 MiB

amazon_us_reviews/Lawn_and_Garden_v1_00

Config description : A dataset consisting of reviews of Amazon Lawn_and_Garden_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 464.22 MiB

Dataset size : 2.00 GiB

amazon_us_reviews/Kitchen_v1_00

Config description : A dataset consisting of reviews of Amazon Kitchen_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 887.63 MiB

Dataset size : 3.85 GiB

amazon_us_reviews/Jewelry_v1_00

Config description : A dataset consisting of reviews of Amazon Jewelry_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 235.58 MiB

Dataset size : 1.22 GiB

amazon_us_reviews/Home_Improvement_v1_00

Config description : A dataset consisting of reviews of Amazon Home_Improvement_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 480.02 MiB

Dataset size : 2.08 GiB

amazon_us_reviews/Home_Entertainment_v1_00

Config description : A dataset consisting of reviews of Amazon Home_Entertainment_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 184.22 MiB

Dataset size : 741.78 MiB

amazon_us_reviews/Home_v1_00

Config description : A dataset consisting of reviews of Amazon Home_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 1.01 GiB

Dataset size : 4.60 GiB

amazon_us_reviews/Health_Personal_Care_v1_00

Config description : A dataset consisting of reviews of Amazon Health_Personal_Care_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 964.34 MiB

Dataset size : 4.21 GiB

amazon_us_reviews/Grocery_v1_00

Config description : A dataset consisting of reviews of Amazon Grocery_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 382.74 MiB

Dataset size : 1.77 GiB

amazon_us_reviews/Gift_Card_v1_00

Config description : A dataset consisting of reviews of Amazon Gift_Card_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 11.57 MiB

Dataset size : 93.82 MiB

amazon_us_reviews/Furniture_v1_00

Config description : A dataset consisting of reviews of Amazon Furniture_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 142.08 MiB

Dataset size : 646.69 MiB

amazon_us_reviews/Electronics_v1_00

Config description : A dataset consisting of reviews of Amazon Electronics_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 666.45 MiB

Dataset size : 2.74 GiB

amazon_us_reviews/Digital_Video_Games_v1_00

Config description : A dataset consisting of reviews of Amazon Digital_Video_Games_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 26.17 MiB

Dataset size : 124.19 MiB

amazon_us_reviews/Digital_Video_Download_v1_00

Config description : A dataset consisting of reviews of Amazon Digital_Video_Download_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 483.49 MiB

Dataset size : 2.68 GiB

amazon_us_reviews/Digital_Software_v1_00

Config description : A dataset consisting of reviews of Amazon Digital_Software_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 18.12 MiB

Dataset size : 89.59 MiB

amazon_us_reviews/Digital_Music_Purchase_v1_00

Config description : A dataset consisting of reviews of Amazon Digital_Music_Purchase_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 241.82 MiB

Dataset size : 1.20 GiB

amazon_us_reviews/Digital_Ebook_Purchase_v1_00

Config description : A dataset consisting of reviews of Amazon Digital_Ebook_Purchase_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 2.51 GiB

Dataset size : 10.82 GiB

amazon_us_reviews/Camera_v1_00

Config description : A dataset consisting of reviews of Amazon Camera_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 422.15 MiB

Dataset size : 1.69 GiB

amazon_us_reviews/Books_v1_00

Config description : A dataset consisting of reviews of Amazon Books_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 2.55 GiB

Dataset size : 10.01 GiB

amazon_us_reviews/Beauty_v1_00

Config description : A dataset consisting of reviews of Amazon Beauty_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 871.73 MiB

Dataset size : 3.88 GiB

amazon_us_reviews/Baby_v1_00

Config description : A dataset consisting of reviews of Amazon Baby_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 340.84 MiB

Dataset size : 1.45 GiB

amazon_us_reviews/Automotive_v1_00

Config description : A dataset consisting of reviews of Amazon Automotive_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 555.18 MiB

Dataset size : 2.54 GiB

amazon_us_reviews/Apparel_v1_00

Config description : A dataset consisting of reviews of Amazon Apparel_v1_00 products in US marketplace. Each product has its own version as specified with it.

Download size : 618.59 MiB

Dataset size : 3.99 GiB

amazon_us_reviews/Digital_Ebook_Purchase_v1_01

Config description : A dataset consisting of reviews of Amazon Digital_Ebook_Purchase_v1_01 products in US marketplace. Each product has its own version as specified with it.

Download size : 1.21 GiB

Dataset size : 4.87 GiB

amazon_us_reviews/Books_v1_01

Config description : A dataset consisting of reviews of Amazon Books_v1_01 products in US marketplace. Each product has its own version as specified with it.

Dataset size : 8.48 GiB

amazon_us_reviews/Books_v1_02

Config description : A dataset consisting of reviews of Amazon Books_v1_02 products in US marketplace. Each product has its own version as specified with it.

Download size : 1.24 GiB

Dataset size : 4.15 GiB

Last updated 2022-12-06 UTC.

Datasets: McAuley-Lab/Amazon-Reviews-2023

Amazon Reviews 2023

Please also visit amazon-reviews-2023.github.io/ for more details, loading scripts, and preprocessed benchmark files.

[April 7, 2024] We add two useful files:

  • all_categories.txt : 34 lines (33 categories + "Unknown"), each line contains a category name.
  • asin2category.json : A mapping from parent_asin (item ID) to its corresponding category name.
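
A short sketch of how such a mapping might be used. It assumes `asin2category.json` is a single JSON object whose keys are `parent_asin` strings and whose values are category names; the IDs and category names below are made up, and `category_of` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Dict

# Tiny stand-in for the contents of asin2category.json (made-up entries).
raw = '{"B0HYPOTH01": "Video_Games", "B0HYPOTH02": "Books", "B0HYPOTH03": "Books"}'
asin2category: Dict[str, str] = json.loads(raw)


def category_of(parent_asin: str, mapping: Dict[str, str]) -> str:
    """Look up an item's category, falling back to 'Unknown' if it is absent."""
    return mapping.get(parent_asin, "Unknown")


# Count how many items fall into each category.
items_per_category = Counter(asin2category.values())
```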

Models trained or fine-tuned on McAuley-Lab/Amazon-Reviews-2023

  • hyp1231/blair-roberta-base
  • hyp1231/blair-roberta-large

Amazon product data

Julian McAuley, UCSD

New!: See our updated (2018) version of the Amazon data here

New: repository of recommender systems datasets.

See a variety of other datasets for recommender systems research on our lab's dataset webpage

Description

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

"Small" subsets for experimentation

If you're using this data for a class project (or similar) please consider using one of these smaller datasets below before requesting the larger files. To obtain the larger files you will need to contact me to obtain access.

K-cores (i.e., dense subsets): These data have been reduced to extract the k-core, such that each of the remaining users and items has at least k reviews.

Ratings only: These datasets include no metadata or reviews, but only (user,item,rating,timestamp) tuples. Thus they are suitable for use with mymedialite (or similar) packages.

Complete review data

Please see the per-category files below, and only download these (large!) files if you really need them:

raw review data (20gb) - all 142.8 million reviews

The above file contains some duplicate reviews, mainly due to near-identical products whose reviews Amazon merges, e.g. VHS and DVD versions of the same movie. These duplicates have been removed in the files below:

user review data (18gb) - duplicate items removed (83.68 million reviews), sorted by user

product review data (18gb) - duplicate items removed, sorted by product

5-core (9.9gb) - subset of the data in which all users and items have at least 5 reviews (41.13 million reviews)

Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks:

aggressively deduplicated data (18gb) - no duplicates whatsoever (82.83 million reviews)

Format is one-review-per-line in (loose) json. See examples below for further help reading the data.

Sample review:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

  • reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
  • asin - ID of the product, e.g. 0000013714
  • reviewerName - name of the reviewer
  • helpful - helpfulness rating of the review, e.g. 2/3
  • reviewText - text of the review
  • overall - rating of the product
  • summary - summary of the review
  • unixReviewTime - time of the review (unix time)
  • reviewTime - time of the review (raw)
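Two of the fields above usually need a small conversion before analysis: `helpful` is a pair of vote counts rather than a ratio, and `unixReviewTime` is a number of seconds since the Unix epoch. The following is a minimal sketch; the helper names are hypothetical, and returning `None` for reviews with no votes is one possible convention, not something the dataset prescribes.

```python
from datetime import datetime, timezone

def helpfulness_ratio(review):
    """Return helpful votes / total votes, or None if nobody voted."""
    up, total = review["helpful"]
    return up / total if total else None

def review_datetime(review):
    """Convert unixReviewTime (seconds since the epoch) to a UTC datetime."""
    return datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc)

# Values taken from the sample review above.
sample = {"helpful": [2, 3], "unixReviewTime": 1252800000}
ratio = helpfulness_ratio(sample)
when = review_datetime(sample)
```

For the sample review, `when` falls on 2009-09-13, which agrees with its raw `reviewTime` field.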

Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:

metadata (3.1gb) - metadata for 9.4 million products

Sample metadata:

{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }

  • asin - ID of the product, e.g. 0000031852
  • title - name of the product
  • price - price in US dollars (at time of crawl)
  • imUrl - url of the product image
  • related - related products (also bought, also viewed, bought together, buy after viewing)
  • salesRank - sales rank information
  • brand - brand name
  • categories - list of categories the product belongs to
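The `related` field can be turned into graph edges for the also-bought and also-viewed graphs mentioned in the description. This is a minimal sketch, assuming (as a precaution, not a documented fact) that some products may lack the `related` field or a given relation; `related_edges` is a hypothetical helper name.

```python
def related_edges(product, relation="also_bought"):
    """Yield (source_asin, target_asin) pairs for one relation type."""
    # Fall back to empty containers when the field or relation is missing.
    related = product.get("related", {})
    for target in related.get(relation, []):
        yield product["asin"], target

# Abbreviated version of the sample metadata above.
product = {
    "asin": "0000031852",
    "related": {
        "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
        "bought_together": ["B002BZX8Z6"],
    },
}
edges = list(related_edges(product))
```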

Visual Features

We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.

visual features (141gb) - visual features for all products

The images themselves can be extracted from the imUrl field in the metadata files.

Below are files for individual product categories, which have already had duplicate item reviews removed.

Please cite one or both of the following if you use the data in any way:

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering R. He, J. McAuley WWW , 2016 pdf

Image-based recommendations on styles and substitutes J. McAuley, C. Targett, J. Shi, A. van den Hengel SIGIR , 2015 pdf

Inferring networks of substitutable and complementary products J. McAuley, R. Pandey, J. Leskovec Knowledge Discovery and Data Mining , 2015 pdf

Hidden factors and hidden topics: understanding rating dimensions with review text J. McAuley, J. Leskovec RecSys pdf | reviews | bibtex | code (C++) slides

Reading the data

Data can be treated as python dictionary objects. A simple script to read any of the above data is as follows:

import gzip

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        yield eval(l)
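Because `eval` executes whatever expression a line contains, a more cautious reader may prefer `ast.literal_eval`, which accepts only Python literals. The following is a sketch under two assumptions that the page does not state: every line holds nothing but literals (dicts, lists, strings, numbers, True/False/None), and the files are UTF-8 encoded. The name `parse_safe` is hypothetical.

```python
import ast
import gzip

def parse_safe(path):
    """Yield one dict per line, accepting Python literals only."""
    # Text mode ("rt") decodes bytes to str; the encoding is an assumption.
    with gzip.open(path, "rt", encoding="utf-8") as g:
        for line in g:
            yield ast.literal_eval(line)
```

It is used the same way as `parse`, for example `for review in parse_safe("reviews_Video_Games.json.gz")`.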

Convert to 'strict' json

The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:

import json
import gzip

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        yield json.dumps(eval(l))

# The with-block closes the output file once all lines are written.
with open("output.strict", 'w') as f:
    for l in parse("reviews_Video_Games.json.gz"):
        f.write(l + '\n')

Pandas data frame

This code reads the data into a pandas data frame:

import pandas as pd
import gzip

def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Video_Games.json.gz')

Convert to CSV

This code converts (a selection of fields from) the above files to CSV format:

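import csv
import gzip

# Uses parse() from "Reading the data" above.
fields = ["asin", "description", "brand"]

# csv.writer needs a text-mode file on Python 3, hence 'wt'.
with gzip.open("meta_Video_Games.csv.gz", 'wt', newline='') as csvOut:
    writer = csv.writer(csvOut)
    for product in parse("meta_Video_Games.json.gz"):
        # dict.has_key() no longer exists in Python 3; get() supplies the default.
        writer.writerow([product.get(f, "") for f in fields])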

Read image features

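import array

def readImageFeatures(path):
    f = open(path, 'rb')
    while True:
        asin = f.read(10)
        # In binary mode, end of file is an empty bytes object.
        if asin == b'':
            break
        a = array.array('f')
        a.fromfile(f, 4096)
        # On Python 3, asin is a bytes object of length 10.
        yield asin, a.tolist()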

Example: compute average rating

ratings = []
for review in parse("reviews_Video_Games.json.gz"):
    ratings.append(review['overall'])
print(sum(ratings) / len(ratings))

Example: latent-factor model in mymedialite

Predicts ratings from a rating-only CSV file

./rating_prediction --recommender=BiasedMatrixFactorization --training-file=ratings_Video_Games.csv --test-ratio=0.1

amazon-review-dataset

Here are 18 public repositories matching this topic...

vinaykanigicherla / amazon_reviews_sentiment

Sentiment Analysis on the Amazon Reviews Dataset using BERT-based transfer learning approach.

  • Updated Apr 19, 2021
  • Jupyter Notebook

Kavitha-Kothandaraman / Product-Recommendation-Systems

To build a recommendation system to recommend products to customers based on their previous ratings for other products

  • Updated May 29, 2020

imdeepmind / RatePrediction

Rate Prediction using Amazon Review Dataset and Deep Learning

  • Updated Nov 21, 2022

pallavitilloo / Data-Mining-on-Amazon-Reviews

Data Mining on Amazon user reviews for musical instruments

  • Updated Mar 3, 2023

rkarwayun / MSCI-641

Assignments for MSCI 641: Text Analytics, Spring 2020 at University of Waterloo.

  • Updated Aug 24, 2020

NohanJoemon / Automatic-review-labelling-using-BERT

Sentiment analysis of amazon reviews dataset using BERT - model development and deployment

  • Updated Nov 11, 2023

Rahulraj31 / NLP_Review_SportsAndOutdoor

Performing NLP on Amazon's review on sports and outdoor

  • Updated Jul 14, 2021

rdadrl / reviewlytics

Sentimentally analyze product reviews to predict opinion honesty.

  • Updated May 26, 2020

joshivaibhav / AmazonCustomerReview

Analysing Amazon customer reviews via Clustering, Visualization and Classification

  • Updated Feb 17, 2021

MrRaghav / Complaints-mining-from-Hindi-product-reviews

The public dataset in Hindi language published for paper 28 - AICS2020, Ireland

  • Updated Nov 13, 2020

Harsh251299 / Sentimental-Analysis

Sentiment Analysis is the process of determining whether a piece of writing is positive, negative or neutral.

  • Updated Sep 16, 2021

InsiderPants / AmazonReview-Sentiment-Analysis

Sentiment Analysis using Conv1D and LSTM

  • Updated Apr 17, 2019

dewith / reviews_polarity

Predicting polarity of Amazon user reviews using Deep Learning 🎭

  • Updated Dec 23, 2020

jirenmaa / test_sentiment_analysis_dataset

A simple sentiment analysis using SGD and LinearSVC for Amazon Reviews

  • Updated Nov 13, 2023

banurekhaMohan279 / AmazonReviews-Analyser

React App in AWS with CI/CD workflow

  • Updated Jul 20, 2023

lcarcamo1526 / Amazon-Reviews-Analysis

Amazon Reviews Analysis

  • Updated May 23, 2019

kuldeep27396 / Apparel-recommendation

Apparel-recommendation-engine-Machine-Learning

  • Updated Mar 28, 2021

roshancyriacmathew / Deep-Learning-on-Amazon-Alexa-Reviews

This notebook will show you how to implement a deep learning algorithm (LSTM) on the Amazon Alexa Reviews dataset

  • Updated Apr 5, 2023

Recommender Systems and Personalization Datasets

Julian McAuley , UCSD

Description

This page contains a collection of datasets that have been collected for research by our lab. Datasets contain the following features:

  • user/item interactions
  • star ratings
  • product reviews
  • social networks
  • item-to-item relationships (e.g. copurchases, compatibility)
  • product images
  • price, brand, and category information
  • heart-rate sequences
  • other metadata

Please cite the appropriate reference if you use any of the datasets below.

Datasets are in (loose) json format unless specified otherwise, meaning they can be treated as python dictionary objects. A simple script to read json-formatted data is as follows:

import gzip

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        yield eval(l)

Directory by Dataset

Twitch live-streaming interactions

NPR interview dialog data

This American Life podcast transcripts

Recipes and interactions from food.com

Paired Recipes from food.com

EndoMondo fitness tracking data

Amazon product reviews and metadata

Amazon question/answer data

Amazon marketing bias data

Google Local business reviews and metadata

Google Restaurants restaurant reviews and metadata

Steam video game reviews and bundles

Goodreads book reviews

Goodreads spoilers

Fashion explanations

Pinterest fashion compatibility data

ModCloth clothing fit feedback

ModCloth marketing bias data

RentTheRunway clothing fit feedback

Tradesy bartering data

RateBeer bartering data

Gameswap bartering data

Behance community art reviews and image features

Librarything reviews and social data

Epinions reviews and social data

Cant understanding data

Dance Dance Revolution step charts

NES song data

BeerAdvocate multi-aspect beer reviews

RateBeer multi-aspect beer reviews

Facebook social circles data

Twitter social circles data

Google+ social circles data

Reddit submission popularity and metadata

Directory by Metadata Type

The datasets below can be roughly organized in terms of the types of metadata they contain:

Review text: see Amazon , BeerAdvocate, RateBeer , Google Local , Google Restaurants

Image data: Amazon , Behance , Pinterest , Google Restaurants

Item-to-item relationships: Amazon

Q/A data: Amazon Q/A

Geographical data: Google Local , Google Restaurants , EndoMondo

Heart-Rate data: EndoMondo

Bundle data: Steam

Peer-to-peer trades: Tradesy, RateBeer, Gameswap

Social connections: Librarything, Epinions

Fit feedback: Modcloth, Renttherunway

Multiple aspects: BeerAdvocate, RateBeer

This is a dataset of users consuming streaming content on Twitch. We retrieved all streamers, and all users connected to their respective chats, every 10 minutes over a period of 43 days.

Basic statistics

  • User ID (anonymized)
  • Streamer username

1,34347669376,grimnax,5415,5419
1,34391109664,jtgtv,5869,5870
1,34395247264,towshun,5898,5899
1,34405646144,mithrain,6024,6025
2,33848559952,chfhdtpgus1,206,207
2,33881429664,sal_gu,519,524
2,33921292016,chfhdtpgus1,922,924
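Rows like the ones above can be read with the standard `csv` module. The sketch below is illustrative only: the column names are assumptions (the field list above names only the user ID and the streamer username), and the last two columns are assumed to be start and stop times counted in 10-minute crawl intervals.

```python
import csv
import io

# Assumed column names; the page does not list all of them.
COLUMNS = ["user_id", "stream_id", "streamer", "start", "stop"]

sample = """1,34347669376,grimnax,5415,5419
1,34391109664,jtgtv,5869,5870
2,33881429664,sal_gu,519,524
"""

def watch_intervals(text):
    """Return the total (stop - start) per user, in crawl intervals."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text), fieldnames=COLUMNS):
        span = int(row["stop"]) - int(row["start"])
        totals[row["user_id"]] = totals.get(row["user_id"], 0) + span
    return totals

totals = watch_intervals(sample)
```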

Download link

See our data folder containing all Twitch files. The file full_a.csv.gz contains the full dataset while 100k.csv is a subset of 100k users for benchmark purposes. The code is available in our Github repository .

Please cite the following if you use the data:

Recommendation on Live-Streaming Platforms: Dynamic Availability and Repeat Consumption Jérémie Rappaz, Julian McAuley and Karl Aberer RecSys , 2021

Interview: NPR Media Dialog Data

This dataset contains interview transcripts from National Public Radio (NPR) . Data includes full interview transcripts and news article headlines.

  • Episode Date and Title
  • Speaker Names
  • Speaker Utterances
  • News Article Headlines

episode: 79679
program: Talk of the Nation
title: Forecasting the Future of the Internet
date: 2006-05-26
episode_order: 48
speaker: Professor LARRY PETERSON (Princeton University)
utterance: And this is almost like the neutrality aspect of the issue, that there are places you just can't get to and the universal connectivity of the original Internet is deteriorating. Because of a lack of security built into the Internet your only recourse is to throw up all sorts of protections that are extremely suspicious of every bit of traffic that happens to fly by.

See the Interview Dataset Page for download information.

Interview: Large-scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley EMNLP , 2020 pdf

This American Life Podcast Transcripts

This dataset contains program transcripts from This American Life . Data includes full program transcripts and associated audio.

  • Episode Act
  • Utterance Lengths
  • Episode Audio

episode: ep-1
act: prologue
utterance_start: 39.96
utterance_end: 54.89
duration: 14.93
speaker: ira glass
utterance: Well, one great thing about starting a new show is utter anonymity. Nobody really knows what to expect from you. This interviewee did not know us from Adam.

See the This American Life Dataset Page for download information.

Speech Recognition and Multi-Speaker Diarization of Long Conversations Huanru Henry Mao, Shuyang Li, Julian McAuley, Garrison W. Cottrell INTERSPEECH , 2020 pdf

Food.com Recipe & Review Data

These datasets contain recipe details and reviews from Food.com (formerly GeniusKitchen). Data includes cooking recipes and review texts.

  • Ratings and Reviews
  • Recipe Name, Description, Ingredients, and Directions
  • Recipe Categories (Tags)
  • Recipe Nutrition Information

name: beer mac n cheese soup
id: 499490
minutes: 45
contributor_id: 560491
submitted: 2013-04-27
tags: 60-minutes-or-less time-to-make preparation
nutrition: 678.8 70.0 20.0 46.0 61.0 134.0 11.0
n_steps: 7
steps: cook the bacon in a pan over medium heat and set aside on paper towels to drain , reserving 2 tablespoons of the grease in the pan add the onion , carrot , celery and jalapeno and cook until tender , about 10-15 minutes add the garlic and cook until fragrant , about a minute mix in the flour and let it cook for 2-3 minutes add the broth , beer , nutmeg , bacon and macaroni and let cook until the macaroni is al-dente , about 7-8 minutes add the cream , mustard , worcestershire sauce and cheese and cook until the cheese has melted without bringing it back to a boil season with cayenne , salt and pepper to taste
description: all of the flavors of mac n' cheese in the form of a hot bowl of soup! submitted by kevin lynch
ingredients: bacon onion carrots celery jalapeno pepper garlic cloves flour chicken broth beer nutmeg elbow macaroni heavy cream dijon mustard worcestershire sauce cheddar cheese cayenne salt and pepper
n_ingredients: 17

user_id: 8937
recipe_id: 44394
date: 2002-12-01
rating: 4
review: This worked very well and is EASY. I used not quite a whole package (10oz) of white chips. Great!

See the Food.com Dataset Page for download information.

Generating Personalized Recipes from Historical User Preferences Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley EMNLP , 2019 pdf

Recipe Pairs data

This is a collection of recipes paired with variants, e.g. a recipe matched with a vegan version of the same recipe.

See the Recipe Pairs Dataset Page for download information.

SHARE: a System for Hierarchical Assistive Recipe Editing Shuyang Li, Yufei Li, Jianmo Ni, Julian McAuley EMNLP , 2022 pdf

EndoMondo Fitness Tracking Data

This is a collection of workout logs from users of EndoMondo . Data includes multiple sources of sequential sensor data such as heart rate logs, speed, GPS, as well as sport type, gender and weather conditions.

  • User Identifier
  • Latitude/Longitude/Altitude sequences (with timestamps)
  • Heart rates
  • Various derived sequences

userId: 10921915
gender: male
sport: bike
id: 396826535
longitude: [24.64977040886879, 24.65014273300767, 24.650910682976246, 24.650668865069747, 24.649145286530256, ...]
latitude: [60.173348765820265, 60.173239801079035, 60.17298021353781, 60.172477969899774, 60.17186114564538, ...]
altitude: [-1.8044666444624418, -1.8190453555595787, -1.8190453555595787, -1.8511185199732794, -1.871528715509271, ...]
timestamp: [1408898746, 1408898754, 1408898765, 1408898778, 1408898794, ...]
time_elapsed: [-0.12256752559145224, -0.12221090169596584, -0.12172054383967204, -0.12114103000950663, -0.12042778221853381, ...]
heart_rate: [-8.197369036801112, -5.867841701016304, -3.961864789919643, -4.173640002263717, -3.961864789919643, ...]
derived_speed: [-7.0829444390064396, -2.8061928357004815, -0.3976286593020398, -0.7571073884764162, 2.6415189187026646, ...]
distance: [-4.372303649217691, -2.374952819539426, -0.07926348591212737, 0.4284751220389811, 4.710835498111755, ...]
tar_heart_rate: [100, 111, 120, 119, 120, ...]
tar_derived_speed: [0, 10.751376415573548, 16.806294372816662, 15.902596545765366, 24.446443398153843, ...]
since_begin: [1378478.8892184314, 1378478.8892184314, 1378478.8892184314, 1378478.8892184314, 1378478.8892184314, ...]
since_last: [2158.84607810351, 2158.84607810351, 2158.84607810351, 2158.84607810351, 2158.84607810351, ...]

See the FitRec Dataset Page for download information.

Modeling heart rate and activity data for personalized fitness recommendation Jianmo Ni, Larry Muhlstein, Julian McAuley WWW , 2019 pdf

Amazon Product Reviews

This is a large-scale Amazon Reviews dataset collected in 2023. This dataset contains 48.19 million items, and 571.54 million reviews from 54.51 million users.

  • User Reviews (ratings, text, helpfulness votes, etc.);
  • Item Metadata (descriptions, price, raw image, etc.);
  • Links (user-item / bought together graphs).

{
  "sort_timestamp": 1634275259292,
  "rating": 3.0,
  "helpful_votes": 0,
  "title": "Meh",
  "text": "These were lightweight and soft but much too small for my liking. I would have preferred two of these together to make one loc. For that reason I will not be repurchasing.",
  "images": [
    {
      "small_image_url": "https://m.media-amazon.com/images/I/81FN4c0VHzL._SL256_.jpg",
      "medium_image_url": "https://m.media-amazon.com/images/I/81FN4c0VHzL._SL800_.jpg",
      "large_image_url": "https://m.media-amazon.com/images/I/81FN4c0VHzL._SL1600_.jpg",
      "attachment_type": "IMAGE"
    }
  ],
  "asin": "B088SZDGXG",
  "verified_purchase": true,
  "parent_asin": "B08BBQ29N5",
  "user_id": "AEYORY2AVPMCPDV57CE337YU5LXA"
}
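The `sort_timestamp` in this sample has 13 digits, which suggests milliseconds since the Unix epoch rather than the seconds used by `unixReviewTime` in the 2014 data. The sketch below converts it under that assumption; `ms_to_datetime` is a hypothetical helper, and the unit should be confirmed against the dataset's own documentation.

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Interpret an integer as milliseconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Values taken from the sample review above.
review = {"sort_timestamp": 1634275259292, "rating": 3.0}
when = ms_to_datetime(review["sort_timestamp"])
```

Read this way, the sample review dates to October 2021.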

See the Amazon Dataset Page for download information.

See the Amazon Reviews 2023 page for download information.

You can also download data from previous versions of these datasets:

Amazon Reviews 2018

Amazon Reviews 2014

2023 version

Bridging Language and Items for Retrieval and Recommendation Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, Julian McAuley arXiv pdf

2018 version

Justifying recommendations using distantly-labeled reviews and fined-grained aspects Jianmo Ni, Jiacheng Li, Julian McAuley EMNLP , 2019 pdf

2014 version

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering Ruining He, Julian McAuley WWW , 2016 pdf

Image-based recommendations on styles and substitutes Julian McAuley, Christopher Targett, Javen Shi, Anton van den Hengel SIGIR , 2015 pdf

This is a large crawl of product reviews from Amazon. This dataset contains 82.83 million unique reviews, from around 20 million users.

  • reviews and ratings
  • item-to-item relationships (e.g. "people who bought X also bought Y")
  • helpfulness votes
  • product image (and CNN features)

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

The 2014 version of this dataset is also available .

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering R. He, J. McAuley WWW , 2016 pdf

Image-based recommendations on styles and substitutes J. McAuley, C. Targett, J. Shi, A. van den Hengel SIGIR , 2015 pdf

Amazon Question and Answer Data

These datasets contain questions and answers about products from the Amazon dataset above.

  • question and answer text
  • is the question binary (yes/no), and if so does it have a yes/no answer?
  • product ID (to reference the review dataset)

{ "asin": "B000050B6Z", "questionType": "yes/no", "answerType": "Y", "answerTime": "Aug 8, 2014", "unixTime": 1407481200, "question": "Can you use this unit with GEL shaving cans?", "answer": "Yes. If the can fits in the machine it will despense hot gel lather. I've been using my machine for both , gel and traditional lather for over 10 years." }

See the Amazon Q/A Page for download information.

Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems Mengting Wan, Julian McAuley International Conference on Data Mining (ICDM) , 2016 pdf

Addressing complex and subjective product-related queries with customer reviews Julian McAuley, Alex Yang World Wide Web (WWW) , 2016 pdf

Marketing Bias data

These datasets contain attributes about products sold on ModCloth and Amazon which may be sources of bias in recommendations (in particular, attributes about how the products are marketed). Data also includes user/item interactions for recommendation.

  • user identities
  • item sizes, user genders

Example (ModCloth)

item_id,user_id,rating,timestamp,size,fit,user_attr,model_attr,c... 7443,Alex,4,2010-01-21 08:00:00+00:00,,,Small,Small,Dresses,,2012,0 7443,carolyn.agan,3,2010-01-27 08:00:00+00:00,,,,Small,Dresses,,... 7443,Robyn,4,2010-01-29 08:00:00+00:00,,,Small,Small,Dresses,,20... 7443,De,4,2010-02-13 08:00:00+00:00,,,,Small,Dresses,,2012,0 7443,tasha,4,2010-02-18 08:00:00+00:00,,,Small,Small,Dresses,,20... 7443,gina.chihos,5,2010-02-25 08:00:00+00:00,,,,Small,Dresses,,2... 7443,Kim,2,2010-02-26 08:00:00+00:00,,,Small,Small,Dresses,,2012,0 7443,jess.betcher,5,2010-03-26 07:00:00+00:00,,,,Small,Dresses,,...

Download links

See our project page for download links.

Addressing Marketing Bias in Product Recommendations Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley WSDM , 2020 pdf

Google Local Reviews (2021)

This dataset contains review information from Google Maps (ratings, text, images, etc.), business metadata (address, geographic info, descriptions, category information, price, open hours, etc.), and links (related businesses) up to Sep 2021 in the United States.

See also two variants of this dataset below, including a 2021 version, and a version containing item images.

{
  'user_id': '101463350189962023774',
  'name': 'Jordan Adams',
  'time': 1627750414677,
  'rating': 5,
  'text': 'Cool place, great people, awesome dentist!',
  'pics': [
    {
      'url': ['https://lh5.googleusercontent.com/p/AF1QipNq2nZC5TH4_M7h5xRAd61hoTgvY1o9lozABguI=w150-h150-k-no-p']
    }
  ],
  'resp': {
    'time': 1628455067818,
    'text': 'Thank you for your five-star review! -Dr. Blake'
  },
  'gmap_id': '0x87ec2394c2cd9d2d:0xd1119cfbee0da6f3'
}

  • user_id - ID of the reviewer
  • name - name of the reviewer
  • time - time of the review (unix time)
  • rating - rating of the business
  • text - text of the review
  • pics - pictures of the review
  • resp - business response to the review including unix time and text of the response
  • gmap_id - ID of the business

{
  'name': 'Walgreens Pharmacy',
  'address': 'Walgreens Pharmacy, 124 E North St, Kendallville, IN 46755',
  'gmap_id': '0x881614ce7c13acbb:0x5c7b18bbf6ec4f7e',
  'description': 'Department of the Walgreens chain providing prescription medications & other health-related items.',
  'latitude': 41.451859999999996,
  'longitude': -85.2666757,
  'category': ['Pharmacy'],
  'avg_rating': 4.2,
  'num_of_reviews': 5,
  'price': '$$',
  'hours': [['Thursday', '8AM–1:30PM'], ['Friday', '8AM–1:30PM'], ['Saturday', '9AM–1:30PM'], ['Sunday', '10AM–1:30PM'], ['Monday', '8AM–1:30PM'], ['Tuesday', '8AM–1:30PM'], ['Wednesday', '8AM–1:30PM']],
  'MISC': {
    'Service options': ['Curbside pickup', 'Drive-through', 'In-store pickup', 'In-store shopping'],
    'Health & safety': ['Mask required', 'Staff wear masks', 'Staff get temperature checks'],
    'Accessibility': ['Wheelchair accessible entrance', 'Wheelchair accessible parking lot'],
    'Planning': ['Quick visit'],
    'Payments': ['Checks', 'Debit cards']
  },
  'state': 'Closes soon ⋅ 1:30PM ⋅ Reopens 2PM',
  'relative_results': ['0x881614cd49e4fa33:0x2d507c24ff4f1c74', '0x8816145bf5141c89:0x535c1d605109f94b', '0x881614cda24cc591:0xca426e3a9b826432', '0x88162894d98b91ef:0xd139b34de70d3e03', '0x881615400b5e57f9:0xc56d17dbe420a67f'],
  'url': 'https://www.google.com/maps/place//data=!4m2!3m1!1s0x881614ce7c13acbb:0x5c7b18bbf6ec4f7e?authuser=-1&hl=en&gl=us'
}

  • name - name of the business
  • address - address of the business
  • description - description of the business
  • latitude - latitude of the business
  • longitude - longitude of the business
  • category - category of the business
  • avg_rating - average rating of the business
  • num_of_reviews - number of reviews
  • price - price of the business
  • hours - open hours
  • MISC - MISC information
  • state - the current status of the business (e.g., permanently closed)
  • relative_results - related businesses recommended by Google
  • url - URL of the business

See the Google Local Dataset Page for download information.

UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining Jiacheng Li, Jingbo Shang, Julian McAuley Annual Meeting of the Association for Computational Linguistics (ACL) , 2022 pdf

Personalized Showcases: Generating Multi-Modal Explanations for Recommendations An Yan, Zhankui He, Jiacheng Li, Tianyang Zhang, Julian Mcauley arXiv:2207.00422 , 2022 pdf

Google Local Reviews (2018)

These datasets contain reviews about businesses from Google Local (Google Maps). Data includes geographic information for each business as well as reviews.

  • GPS coordinates and address
  • User information (places lived, jobs)
  • business category, opening hours, etc.

Example (review)

{ 'rating': 3.0, 'reviewerName': u'an lam', 'reviewText': u'Ch\u1ea5t l\u01b0\u1ee3ng t\u1ea1m \u1ed5n', 'categories': [u'Gi\u1ea3i Tr\xed - Caf\xe9'], 'gPlusPlaceId': u'108103314380004200232', 'unixReviewTime': 1372686659, 'reviewTime': u'Jul 1, 2013', 'gPlusUserId': u'100000010817154263736' }

Example (business)

{ 'name': u'Diamond Valley Lake Marina', 'price': None, 'address': [u'2615 Angler Ave', u'Hemet, CA 92545'], 'hours': [[u'Monday', [[u'6:30 am--4:15 pm']]], [u'Tuesday', [[u'6:30 am--4:15 pm']]], [u'Wednesday', [[u'6:30 am--4:15 pm']], 1], [u'Thursday', [[u'6:30 am--4:15 pm']]], [u'Friday', [[u'6:30 am--4:15 pm']]], [u'Saturday', [[u'6:30 am--4:15 pm']]], [u'Sunday', [[u'6:30 am--4:15 pm']]]], 'phone': u'(951) 926-7201', 'closed': False, 'gPlusPlaceId': '104699454385822125632', 'gps': [33.703804, -117.003209] }

Places Data (276mb)

User Data (178mb)

Review Data (1.4gb)

Translation-based factorization machines for sequential recommendation Rajiv Pasricha, Julian McAuley RecSys , 2018 pdf

Translation-based recommendation Ruining He, Wang-Cheng Kang, Julian McAuley RecSys , 2017 pdf

Google Restaurants

This is a multi-modal dataset of restaurants from Google Local (Google Maps). Data includes images and reviews posted by users, as well as other metadata for each restaurant.

  • Geographical location and address
  • Reviews, ratings and images
  • Business category, opening status, price, etc.

"name":"The Fish Spot", "address":"5101 W Pico Blvd, Los Angeles, CA 90019", "Description":null, "Latitude":34.0481627, "Longitude":-118.3494339, "category":["Seafood restaurant"], "gmap_url":"https://www.google.com/maps/place/The+Fish+Spot/", "Avg_rating":4.3, "Num_of_reviews":80, "price":"$$", "Reviews": [ {"user_id":"111210125124533240892", "time":"3 years ago", "Rating":5, "text":"Absolutely love this place.", "pics":[ {"id":"AF1QipO1ejvRhkVBlg-v52UczxYMD7uebcZIhKC9uGud", "url":["https://lh5.googleusercontent.com/p/"]}, ], "link":"https://www.google.com/maps/reviews/"}, ...,]

See our data folder containing all related files. The file image_review_all.json contains the full dataset, while filter_all_t.json is a subset with filtered review sentences that have higher correlation with images. Code is available in our Github repository .

Steam Video Game and Bundle Data

These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.

  • purchases, plays, recommends ("likes")
  • product bundles
  • pricing information

Example (bundle)

{ 'bundle_final_price': '$29.66', 'bundle_url': 'http://store.steampowered.com/bundle/1482/?utm_source=SteamDB...', 'bundle_price': '$32.96', 'bundle_name': 'Two Tribes Complete Pack!', 'bundle_id': '1482', 'items': [{'genre': 'Casual, Indie', 'item_id': '38700', 'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38700', 'item_name': 'Toki Tori'}, {'genre': 'Adventure, Casual, Indie', 'item_id': '201420', 'discounted_price': '$14.99', 'item_url': 'http://store.steampowered.com/app/201420', 'item_name': 'Toki Tori 2+'}, {'genre': 'Strategy, Indie, Casual', 'item_id': '38720', 'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38720', 'item_name': 'RUSH'}, {'genre': 'Action, Indie', 'item_id': '38740', 'discounted_price': '$7.99', 'item_url': 'http://store.steampowered.com/app/38740', 'item_name': 'EDGE'}], 'bundle_discount': '10%' }
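Prices in the bundle record are strings with a leading dollar sign, so they need parsing before any arithmetic. The sketch below uses `Decimal` to avoid floating-point rounding and checks how the item prices relate to the bundle price in the sample above. The helper `price` is hypothetical, and the assumption that every price has the form `$<number>` comes only from this one example.

```python
from decimal import Decimal

def price(text):
    """Parse a price string such as '$4.99' into a Decimal."""
    return Decimal(text.lstrip("$"))

# Abbreviated version of the sample bundle above.
bundle = {
    "bundle_price": "$32.96",
    "bundle_final_price": "$29.66",
    "items": [
        {"item_name": "Toki Tori", "discounted_price": "$4.99"},
        {"item_name": "Toki Tori 2+", "discounted_price": "$14.99"},
        {"item_name": "RUSH", "discounted_price": "$4.99"},
        {"item_name": "EDGE", "discounted_price": "$7.99"},
    ],
}

# Sum of the individual item prices, and the saving from buying the bundle.
items_total = sum(price(i["discounted_price"]) for i in bundle["items"])
saving = price(bundle["bundle_price"]) - price(bundle["bundle_final_price"])
```

In this sample the item prices add up to the listed `bundle_price`, and the saving of 3.30 is consistent with the 10% `bundle_discount`.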

Version 1: Review Data (6.7mb)

Version 1: User and Item Data (71mb)

Version 2: Review Data (1.3gb)

Version 2: Item metadata (2.7mb)

Bundle Data (92kb)

Self-attentive sequential recommendation Wang-Cheng Kang, Julian McAuley ICDM , 2018 pdf

Item recommendation on monotonic behavior chains Mengting Wan, Julian McAuley RecSys , 2018 pdf

Generating and personalizing bundle recommendations on Steam Apurva Pathak, Kshitiz Gupta, Julian McAuley SIGIR , 2017 pdf

Goodreads Book Reviews

These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, ranging from adding to a "shelf" to rating and reading.

  • add-to-shelf, read, review actions
  • book attributes: title, isbn
  • graph of similar books

Example (interaction data)

{ "user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "130580", "review_id": "330f9c153c8d3347eb914c06b89c94da", "isRead": true, "rating": 4, "date_added": "Mon Aug 01 13:41:57 -0700 2011", "date_updated": "Mon Aug 01 13:42:41 -0700 2011", "read_at": "Fri Jan 01 00:00:00 -0800 1988", "started_at": "" }
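The date fields in the interaction record use a fixed textual layout with a numeric UTC offset, and some fields can be empty strings. A minimal sketch of parsing them with the standard library, assuming every non-empty date follows the layout shown in the example; the constant DATE_FORMAT and the helper parse_date are illustrative names:

```python
import json
from datetime import datetime

# A shortened copy of the interaction record shown above.
record = json.loads('''{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "130580",
 "isRead": true, "rating": 4,
 "date_added": "Mon Aug 01 13:41:57 -0700 2011",
 "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
 "started_at": ""}''')

# Assumed layout: weekday, month, day, time, UTC offset, year.
DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_date(text):
    """Return a timezone-aware datetime, or None for an empty field."""
    return datetime.strptime(text, DATE_FORMAT) if text else None


added = parse_date(record["date_added"])
updated = parse_date(record["date_updated"])
started = parse_date(record["started_at"])
seconds_between = (updated - added).total_seconds()
```

Keeping the offset (via %z) makes the resulting datetimes comparable across users in different time zones.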

Goodreads Spoilers

These datasets contain reviews from the Goodreads book review website, along with annotated "spoiler" information from each review.

  • see also metadata from the complete Goodreads dataset

Example (spoiler data)

Sentences are annotated as "1" if the sentence contains a spoiler, "0" otherwise.

{ 'user_id': '01ec1a320ffded6b2dd47833f2c8e4fb', 'timestamp': '2013-12-28', 'review_sentences': [[0, 'First, be aware that this book is not for the faint of heart.'], [0, 'Human trafficking, drugs, kidnapping, abuse in all forms - this story contains all of this and more.'], ..., [0, '(ARC provided by the author in return for an honest review.)']], 'rating': 5, 'has_spoiler': False, 'book_id': '18398089', 'review_id': '4b3ffeaf14310ac6854f140188e191cd' }
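Each entry of 'review_sentences' is a [label, sentence] pair, so review-level statistics can be derived from the sentence labels. A minimal sketch, under the assumption (consistent with the example, but not stated by the source) that 'has_spoiler' is true exactly when some sentence is labeled 1; the second review below is hypothetical:

```python
def summarize_spoilers(review_sentences):
    """Summarize sentence-level spoiler labels for one review."""
    labels = [label for label, _sentence in review_sentences]
    flagged = [sentence for label, sentence in review_sentences if label == 1]
    return {
        'has_spoiler': any(labels),
        'n_sentences': len(labels),
        'spoiler_sentences': flagged,
    }


# Two sentences taken from the example record above (all labeled 0).
clean_review = [
    [0, 'First, be aware that this book is not for the faint of heart.'],
    [0, '(ARC provided by the author in return for an honest review.)'],
]

# Hypothetical review containing one spoiler sentence, for illustration only.
spoiler_review = [
    [0, 'The pacing is slow at first.'],
    [1, 'The narrator turns out to be the culprit.'],
]

clean_summary = summarize_spoilers(clean_review)
spoiler_summary = summarize_spoilers(spoiler_review)
```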

Fine-grained spoiler detection from large-scale review corpora Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley ACL , 2019 pdf

Pairwise Fashion Explanations

The Pair Fashion Explanation (PFE) dataset contains 6407 instances, with each instance including items, features and the reason why these items are a good match.

Mentioned Items and the Percentages:

  • Items (dress, top, skirt, etc.);
  • Features (kilt, studded, etc.);
  • Explanations (The outfit looks cohesive because the oversized layers are cinched with a studded belt, which complements the little strip from a kilt skirt that is also affixed to the belt, creating a visually pleasing balance in the outfit.);

{ "items": ['trousers', 'belt'], "features": ['tone-on-tone burgundy, slight flare', 'big circular gold buckle'], "explanations": "They all share a similar color scheme and the pieces have a cohesive silhouette that creates a polished and sophisticated look." }

See our project page for download information.

Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation. Yu Wang, Zexue He, Zhankui He, Hao Xu, Julian McAuley. AAAI 2024 pdf

Pinterest Fashion Compatibility

This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.

  • product IDs
  • bounding boxes

Example (fashion.json)

{ "product": "0027e30879ce3d87f82f699f148bff7e", "scene": "cdab9160072dd1800038227960ff6467", "bbox": [ 0.434097, 0.859363, 0.560254, 1.0 ] }

See our project page for download links, and for instructions as to how the product images can be collected from Pinterest.

Complete the Look: Scene-based complementary product recommendation Wang-Cheng Kang, Eric Kim, Jure Leskovec, Charles Rosenberg, Julian McAuley CVPR , 2019 pdf

Clothing Fit Data

These datasets contain measurements of clothing fit from ModCloth and RentTheRunway .

  • ratings and reviews
  • fit feedback (small/fit/large etc.)
  • user/item measurements
  • category information

Example (RentTheRunway)

{ "fit": "fit", "user_id": "420272", "bust size": "34d", "item_id": "2260466", "weight": "137lbs", "rating": "10", "rented for": "vacation", "review_text": "An adorable romper! Belt and zipper were a little hard to navigate in a full day of wear/bathroom use, but that's to be expected. Wish it had pockets, but other than that-- absolutely perfect! I got a million compliments.", "body type": "hourglass", "review_summary": "So many compliments!", "category": "romper", "height": "5' 8\"", "size": 14, "age": "28", "review_date": "April 20, 2016" }
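Several measurement fields in the RentTheRunway record are stored as strings ("5' 8\"", "137lbs", "10"), so they need converting before numeric use. A minimal sketch, assuming all heights and weights follow the patterns in the example; the helper names are illustrative and values that do not match are returned as None rather than guessed:

```python
import re


def height_to_inches(text):
    """Parse a height such as 5' 8" into total inches; None if the pattern differs."""
    match = re.fullmatch(r"\s*(\d+)'\s*(\d+)\"\s*", text)
    if match is None:
        return None
    feet, inches = map(int, match.groups())
    return feet * 12 + inches


def weight_to_pounds(text):
    """Parse a weight such as '137lbs' into an integer; None if the pattern differs."""
    match = re.fullmatch(r"\s*(\d+)\s*lbs\s*", text)
    return int(match.group(1)) if match else None


# Fields copied from the example record above.
record = {"height": "5' 8\"", "weight": "137lbs", "rating": "10", "size": 14}

height_in = height_to_inches(record["height"])
weight_lb = weight_to_pounds(record["weight"])
rating = int(record["rating"])
```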

Modcloth (8.5mb)

Renttherunway (31mb)

Decomposing fit semantics for product size recommendation in metric spaces Rishabh Misra, Mengting Wan, Julian McAuley RecSys , 2018 pdf

Product Exchange/Bartering Data

These datasets contain peer-to-peer trades from various recommendation platforms.

  • peer-to-peer trades
  • "have" and "want" lists
  • image data (tradesy)

Example (tradesy)

{ 'lists': { 'bought': ['466', '459', '457', '449'], 'selling': [], 'want': [], 'sold': ['104', '103', '102'] }, 'uid': '2' }

Tradesy (3.8mb)

See the project page for ratebeer, gameswap (and other) datasets

Bartering books to beers: A recommender system for exchange platforms Jérémie Rappaz, Maria-Luiza Vladarean, Julian McAuley, Michele Catasta WSDM , 2017 pdf

VBPR: Visual bayesian personalized ranking from implicit feedback Ruining He, Julian McAuley AAAI , 2016 pdf

Behance Community Art Data

Likes and image data from the community art website Behance. This is a small, anonymized version of a larger proprietary dataset.

  • appreciates (likes)
  • extracted image features

Example ("appreciate" data)

Each entry is a user, item, timestamp triple:

276633 01588231 1307583271
1238354 01529213 1307583273
165550 00485000 1307583337
2173258 00776972 1307583340
165550 00158226 1307583406
1238354 01540285 1307583495
2459267 01578261 1307583509
165550 00264669 1307583518
165550 00171501 1307583536
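A minimal sketch of loading the "appreciate" triples, assuming one whitespace-separated triple per line and that the third field is a Unix timestamp in seconds (the source only calls it a timestamp). Item IDs are kept as strings because they carry leading zeros; the function name parse_appreciates is illustrative:

```python
from collections import Counter
from datetime import datetime, timezone

# The first five triples from the example above.
sample = """276633 01588231 1307583271
1238354 01529213 1307583273
165550 00485000 1307583337
2173258 00776972 1307583340
165550 00158226 1307583406"""


def parse_appreciates(text):
    """Return a list of (user_id, item_id, timestamp) tuples."""
    triples = []
    for line in text.splitlines():
        user_id, item_id, timestamp = line.split()
        # IDs stay as strings so that leading zeros in item IDs are preserved.
        triples.append((user_id, item_id, int(timestamp)))
    return triples


triples = parse_appreciates(sample)
per_user = Counter(user for user, _item, _ts in triples)
first_time = datetime.fromtimestamp(triples[0][2], tz=timezone.utc)
```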

Code to read image features

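import struct

def readImageFeatures(path):
    # Each record is an 8-byte item ID followed by 4096 4-byte floats.
    with open(path, 'rb') as f:
        while True:
            itemId = f.read(8)
            # In binary mode, read() returns an empty bytes object at end of file.
            if itemId == b'':
                break
            feature = struct.unpack('f'*4096, f.read(4*4096))
            yield itemId, feature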
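A quick way to check a reader for this layout is a round trip on a small synthetic file. The sketch below assumes the layout described by the code above (an 8-byte item ID followed by 4096 4-byte floats) and, as a further assumption, that item IDs are ASCII text; the file contents and the snake_case function name are made up for illustration. Under Python 3, a file opened in binary mode returns bytes, so the end-of-file check compares against b''.

```python
import os
import struct
import tempfile

FEATURE_DIM = 4096  # floats per item, as in the reader above


def read_image_features(path):
    """Yield (item_id, feature_tuple) pairs from a binary feature file."""
    with open(path, 'rb') as f:
        while True:
            item_id = f.read(8)
            if item_id == b'':  # end of file
                break
            feature = struct.unpack('f' * FEATURE_DIM, f.read(4 * FEATURE_DIM))
            # Decoding as ASCII is an assumption about how IDs are stored.
            yield item_id.decode('ascii'), feature


# Write two synthetic records to a temporary file, then read them back.
with tempfile.NamedTemporaryFile(delete=False, suffix='.b') as tmp:
    for raw_id, value in [(b'01588231', 0.5), (b'01529213', 2.0)]:
        tmp.write(raw_id)
        tmp.write(struct.pack('f' * FEATURE_DIM, *([value] * FEATURE_DIM)))
    path = tmp.name

records = list(read_image_features(path))
os.remove(path)
```

Because the reader is a generator, the real feature file can be streamed item by item instead of being loaded into memory at once.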

See our data folder containing all Behance files. The folder also contains additional documentation.

Vista: A visually, socially, and temporally-aware model for artistic recommendation Ruining He, Chen Fang, Zhaowen Wang, Julian McAuley RecSys , 2016 pdf

Social Recommendation Data

These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and epinions (general consumer reviews).

  • price paid (epinions)
  • helpfulness votes (librarything)
  • flags (librarything)

Example (LibraryThing reviews)

{ 'work': '3067', 'flags': [], 'unixtime': 1160265600, 'stars': 4.5, 'nhelpful': 0, 'time': 'Oct 8, 2006', 'comment': 'great storytelling in this novel about a couple crossed by a time travelling disorder ', 'user': 'justine' }

Example (LibraryThing social network)

Rodo anehan
Rodo sevilemar
Rodo dingsi
Rodo slash
RelaxedReader AnnRig
RelaxedReader bookbroke
RelaxedReader Bumpersmom
RelaxedReader DivaColumbus
RelaxedReader AnnRig
RelaxedReader bookbroke
RelaxedReader BookWorm2729
RelaxedReader Bumpersmom
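The social network example lists pairs of user names, and some pairs repeat. A minimal sketch of turning such pairs into an adjacency map, assuming one pair per line, user names without whitespace, and a directed reading of each pair (first user lists second); none of these assumptions is confirmed by the source, and build_adjacency is an illustrative name:

```python
from collections import defaultdict

# The first ten pairs from the example above.
edge_text = """Rodo anehan
Rodo sevilemar
Rodo dingsi
Rodo slash
RelaxedReader AnnRig
RelaxedReader bookbroke
RelaxedReader Bumpersmom
RelaxedReader DivaColumbus
RelaxedReader AnnRig
RelaxedReader bookbroke"""


def build_adjacency(text):
    """Map each user to the set of users they are paired with."""
    adjacency = defaultdict(set)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        source, target = parts
        adjacency[source].add(target)  # a set removes repeated pairs
    return adjacency


adjacency = build_adjacency(edge_text)
```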

LibraryThing (594mb)

epinions (66mb)

SPMC: Socially-aware personalized Markov chains for sparse sequential recommendation Chenwei Cai, Ruining He, Julian McAuley IJCAI , 2017 pdf

Improving latent factor models via personalized feature projection for one-class recommendation Tong Zhao, Julian McAuley, Irwin King Conference on Information and Knowledge Management (CIKM) , 2015 pdf

Other Non-Recommender-Systems Datasets

Below are various datasets collected by my lab that are not related to recommender systems specifically. Formats of these datasets vary, so their respective project pages should be consulted for further details.

DogWhistle: Cant Understanding Data

DogWhistle is a Chinese dataset collected from the historical records for an online game. It provides hidden words and the cant for them, with human answers. The dataset is suitable for semantic similarity evaluation for large language models.

  • cant and the hidden words
  • cant history
  • human answers

Example (insider subtask)

0 高铁,周末,无情,条纹 冷漠,休息,斑马 冷漠 2
1 高铁,周末,无情,条纹 冷漠,休息,斑马 休息 1
2 高铁,周末,无情,条纹 冷漠,休息,斑马 斑马 3

Please refer to our leaderboard page for download instructions.

Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with Common Sense and World Knowledge Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian McAuley, Furu Wei NAACL , 2021 pdf

Video Game Data

Step charts from the video game Dance Dance Revolution , and audio files from the NES platform.

See the project pages for Dance Dance Convolution and NES MDB for further details and links to the data

Dance Dance Convolution Chris Donahue, Zachary Lipton, Julian McAuley ICML , 2017 pdf

The NES Music Database: A symbolic music dataset with expressive performance attributes Chris Donahue, Henry Mao, Julian McAuley International Society for Music Information Retrieval Conference (ISMIR) , 2018 pdf

Multi-aspect Reviews

These datasets include reviews with multiple rated dimensions. The most comprehensive of these are beer review datasets from Ratebeer and Beeradvocate, which include sensory aspects such as taste, look, feel, and smell.

  • aspect-specific ratings (taste, look, feel, smell, overall impression)
  • product category

Example (ratebeer)

beer/name: John Harvards Simcoe IPA
beer/beerId: 63836
beer/brewerId: 8481
beer/ABV: 5.4
beer/style: India Pale Ale (IPA)
review/appearance: 4/5
review/aroma: 6/10
review/palate: 3/5
review/taste: 6/10
review/overall: 13/20
review/time: 1157587200
review/profileName: hopdog
review/text: On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized off white head. Aromas or oranges and all around citric. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass.
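The aspect ratings in the ratebeer record use different scales (out of 5, 10, and 20), so they are easier to compare after normalization. A minimal sketch, assuming one "key: value" field per line in the raw file; the helper names are illustrative and the record is shortened from the example above:

```python
record_text = """beer/name: John Harvards Simcoe IPA
beer/beerId: 63836
beer/ABV: 5.4
review/appearance: 4/5
review/aroma: 6/10
review/palate: 3/5
review/taste: 6/10
review/overall: 13/20
review/time: 1157587200
review/profileName: hopdog"""


def parse_record(text):
    """Split 'key: value' lines into a dictionary of strings."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(': ')
        if sep:  # ignore lines without a 'key: value' separator
            record[key] = value
    return record


def normalize_rating(text):
    """Convert a rating such as '13/20' to a fraction between 0 and 1."""
    numerator, denominator = text.split('/')
    return int(numerator) / int(denominator)


record = parse_record(record_text)
aspects = ['appearance', 'aroma', 'palate', 'taste', 'overall']
normalized = {a: normalize_rating(record['review/' + a]) for a in aspects}
```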

BeerAdvocate (433mb)

RateBeer (388mb)

Sentences with aspect labels (annotator 1) (758kb)

Sentences with aspect labels (annotator 2) (759kb)

Learning attitudes and attributes from multi-aspect reviews Julian McAuley, Jure Leskovec, Dan Jurafsky International Conference on Data Mining (ICDM) , 2012 pdf

From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews Julian McAuley, Jure Leskovec WWW , 2013 pdf

Social Circles

These datasets contain social connections and "circles" from Facebook, Twitter, and Google Plus.

  • social connections
  • circles (sets of friends sharing a common property)
  • user metadata

Example (Kaggle egonet data)

UserId: Friends
1: 4 6 12 2 208
2: 5 3 17 90 7
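A minimal sketch of reading the egonet format, assuming a header line followed by one "user: friend friend ..." line per user, as in the example; parse_egonet is an illustrative name:

```python
egonet_text = """UserId: Friends
1: 4 6 12 2 208
2: 5 3 17 90 7"""


def parse_egonet(text):
    """Map each user ID to the list of friend IDs on its line."""
    friends = {}
    for line in text.splitlines():
        user, sep, rest = line.partition(':')
        if not sep or not user.strip().isdigit():
            continue  # skip the header and any malformed line
        friends[int(user)] = [int(token) for token in rest.split()]
    return friends


friends = parse_egonet(egonet_text)
```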

See SNAP facebook , twitter , and Google Plus data, as well as the Kaggle competition based on the same data.

Learning to Discover Social Circles in Ego Networks Julian McAuley, Jure Leskovec Neural Information Processing Systems (NIPS) , 2012 pdf

Reddit Submissions

Submissions of reddit posts (and in particular resubmissions of the same content) along with metadata.

  • upvotes/downvotes
  • post title, subreddit, etc.

#image_id,unixtime,rawtime,title,total_votes,reddit_id,... number_of_downvotes,localtime,score,number_of_comments,username
1005,1335861624,2012-05-01T15:40:24.968266-07:00,I immediately regret this decision,27,t296r,20,pics,7,1335886824,13,0,ninjaroflmaster
1005,1336470481,2012-05-08T16:48:01.418140-07:00,"Pushing your friend into the water,Level: 99",18,tds4i,16,funny,2,1336495681,14,0,hme4
1005,1339566752,2012-06-13T12:52:32.371941-07:00,I told him. He Didn't Listen,6,v0cma,4,funny,2,1339591952,2,0,HeyPatWhatsUp
1005,1342200476,2012-07-14T00:27:56.857805-07:00,Don't end up as this guy.,16,wjivx,7,funny,9,1342225676,-2,2,catalyst24
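Some titles contain commas and are quoted, so splitting rows on "," would break them; a CSV parser handles the quoting. The sketch below assumes one row per line. Because the header shown above elides some column names, it only names the six leading and five trailing columns that are listed and leaves the middle fields unnamed; to_record is an illustrative helper:

```python
import csv
import io

# Two rows copied from the example above.
raw = '''1005,1335861624,2012-05-01T15:40:24.968266-07:00,I immediately regret this decision,27,t296r,20,pics,7,1335886824,13,0,ninjaroflmaster
1005,1336470481,2012-05-08T16:48:01.418140-07:00,"Pushing your friend into the water,Level: 99",18,tds4i,16,funny,2,1336495681,14,0,hme4
'''

# Only the column names that appear in the (partly elided) header.
LEADING = ['image_id', 'unixtime', 'rawtime', 'title', 'total_votes', 'reddit_id']
TRAILING = ['number_of_downvotes', 'localtime', 'score', 'number_of_comments', 'username']


def to_record(row):
    """Name the known leading and trailing fields of one row."""
    record = dict(zip(LEADING, row[:len(LEADING)]))
    record.update(zip(TRAILING, row[-len(TRAILING):]))
    return record


rows = list(csv.reader(io.StringIO(raw)))
records = [to_record(row) for row in rows]
```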

resubmissions data (7.3mb)

raw html of resubmissions (1.8gb)

See also the SNAP project page .

Understanding the interplay between titles, content, and communities in social media Himabindu Lakkaraju, Julian McAuley, Jure Leskovec ICWSM , 2013 pdf

Questions and comments to Julian McAuley

Sentiment analysis of movie reviews based on NB approaches using TF–IDF and count vectorizer

  • Original Article
  • Published: 16 April 2024
  • Volume 14, article number 87 (2024)


  • Mian Muhammad Danyal,
  • Sarwar Shah Khan,
  • Muzammil Khan,
  • Subhan Ullah,
  • Muhammad Bilal Ghaffar &
  • Wahab Khan


Movies have been important in our lives for many years. They provide entertainment, inspire, educate, and offer an escape from reality. Movie reviews help us choose better movies, but reading them all can be time-consuming and overwhelming. To make this easier, sentiment analysis can classify movie reviews into positive and negative categories. Opinion mining (OP), also called sentiment analysis (SA), uses natural language processing to identify and extract opinions expressed in text. Naive Bayes, a supervised learning algorithm, offers simplicity, efficiency, and strong performance in classification tasks due to its feature independence assumption. This study evaluates the performance of four Naive Bayes variations using two vectorization techniques, Count Vectorizer and Term Frequency–Inverse Document Frequency (TF–IDF), on two movie review datasets: the IMDb Movie Reviews Dataset and Rotten Tomatoes Movie Reviews. Bernoulli Naive Bayes achieved the highest accuracy using Count Vectorizer on both the IMDb and Rotten Tomatoes datasets. Multinomial Naive Bayes, on the other hand, achieved better accuracy on the IMDb dataset with TF–IDF. During preprocessing, we applied several techniques to improve the quality of the datasets: data cleaning, spelling correction, fixing chat words, lemmatization, and removing stop words. We also fine-tuned the models through hyperparameter tuning. Using TF–IDF, we observed a slight performance improvement compared to using the count vectorizer. The experiment highlights the significant role of sentiment analysis in understanding the attitudes and emotions expressed in movie reviews. By predicting the sentiment of each review and calculating the average sentiment of all reviews, it becomes possible to make an accurate prediction about a movie’s overall performance.
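To make the two vectorization techniques concrete, the sketch below builds count vectors, binary presence vectors, and TF–IDF weights for three made-up reviews in plain Python. It is an illustration only, not the authors' pipeline: the reviews are hypothetical, and the unsmoothed weight tf × ln(N / df) is one common textbook variant (libraries such as scikit-learn use smoothed variants, so their numbers differ). In general, Multinomial Naive Bayes works with counts or weights of this kind, while Bernoulli Naive Bayes works with binary presence features.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed reviews.
docs = ["great great movie", "dull movie", "great acting movie"]
tokenized = [doc.split() for doc in docs]
n_docs = len(docs)

# Count vectors: how often each term occurs in each review.
counts = [Counter(tokens) for tokens in tokenized]

# Binary presence vectors: whether each term occurs at all.
binary = [{term: 1 for term in count} for count in counts]

# Document frequency: number of reviews containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(count_vector):
    """Weight each term count by ln(N / df), an unsmoothed IDF variant."""
    return {term: tf * math.log(n_docs / df[term])
            for term, tf in count_vector.items()}


weights = [tfidf(count) for count in counts]
```

Here "movie" appears in every review, so its TF–IDF weight is zero, while "great" keeps a positive weight in the first review; the count vector alone would rank "movie" as informative as any other term that occurs once.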


Data availability statement.

The data that support the findings of this study are openly available through the Open Science Framework at https://github.com/Ankit152/IMDB-sentiment-analysis.git and https://www.kaggle.com/datasets/talha002/rottentomatoes-400k-review

Abbreviations

  • Aspect-based sentiment analysis
  • Artificial intelligence
  • Bag-of-words
  • Bernoulli Naive Bayes
  • Complement Naive Bayes
  • Cross-validation
  • Deep learning
  • Gaussian Naive Bayes
  • Grid search
  • Internet movie database
  • K-Nearest Neighbours
  • Support vector machines
  • Machine learning
  • Multinomial Naive Bayes
  • Naive Bayes
  • Natural language processing
  • Natural language tool kit
  • Opinion mining
  • Rotten Tomatoes
  • True Positive
  • True Negative
  • False Positive
  • False Negative
  • Sentiment analysis
  • Term Frequency–Inverse Document Frequency
  • Word to vector

Abimanyu AJ, Dwifebri M, Astuti W (2023) Sentiment analysis on movie review from rotten tomatoes using logistic regression and information gain feature selection. Build Inf Technol Sci (BITS) 5(1):162–170


Adam NL, Rosli NH, Soh SC (2021) Sentiment analysis on movie review using Naïve Bayes. In: 2021 2nd International conference on artificial intelligence and data sciences (AiDAS), pp 1–6. https://doi.org/10.1109/AiDAS53897.2021.9574419

Agrawal T (2021) Introduction to hyperparameters. In: Hyperparameter optimization in machine learning: make your machine learning and deep learning models more efficient, pp 1–8. APRESS: New York

Arsyah UI, Pratiwi M, Muhammad A (2024) Twitter sentiment analysis of public space opinions using SVM and TF–IDF methods. Indon J Comput Sci 13(1)

Artur M (2021) Review the performance of the bernoulli Naïve Bayes classifier in intrusion detection systems using recursive feature elimination with cross-validated selection of the best number of features. Proc Comput Sci 190:564–570


Asghar MZ, Khan A, Ahmad S, Kundi FM (2014) A review of feature extraction in sentiment analysis. J Basic Appl Sci Res 4(3):181–186

Baid P, Gupta A, Chaplot N (2017) Sentiment analysis of movie reviews using machine learning techniques. Int J Comput Appl 179(7):45–49

Banik N, Rahman MHH (2018) Evaluation of Naïve Bayes and support vector machines on Bangla textual movie reviews. In: 2018 International conference on Bangla speech and language processing (ICBSLP), pp 1–6. IEEE

Başarslan MS, Kayaalp F (2023) MBI-GRUMCONV: a novel multi BI-GRU and multi CNN-based deep learning model for social media sentiment analysis. J Cloud Comput. https://doi.org/10.1186/s13677-022-00386-3

Bilal Khan S, Muhammad Arshad SK (2023) Comparative analysis of machine learning models for PDF malware detection: evaluating different training and testing criteria. J Cyber Secur 5(1):1–11. https://doi.org/10.32604/jcs.2023.042501

Bodapati JD, Veeranjaneyulu N, Shareef SN (2019) Sentiment analysis from movie reviews using LSTMS. Ingénierie des Systèmes d Inf 24(1):125–129

Cahyanti FE, AlFaraby S (2020) On the feature extraction for sentiment analysis of movie reviews based on SVM. In: 2020 8th International conference on information and communication technology (ICoICT), pp 1–5, IEEE

Danyal MM, Khan SS, Khan M, Ullah S, Mehmood F, Ali I (2024) Proposing sentiment analysis model based on BERT and XLNET for movie reviews. Multimed Tools Appl 1–25

Deepa D, Raaji Tamilarasi A (2019) Sentiment analysis using feature extraction and dictionary-based approaches. In: 2019 Third international conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, pp 786–790. https://doi.org/10.1109/I-SMAC47947.2019.9032456

Dewi C, Chen R-C, Christanto HJ, Cauteruccio F (2023) Multinomial Naïve Bayes classifier for sentiment analysis of internet movie database. Vietnam J Comput Sci 10(04):485–498

Dey L, Chakraborty S, Biswas A, Bose B, Tiwari S (2016) Sentiment analysis of review datasets using Naive Bayes and k-NN classifier. arXiv preprint arXiv:1610.09982

Danyal MM, Haseeb M, Khan SS, Khan B, Ullah S (2024) Opinion mining on movie reviews based on deep learning models. J Artif Intell (6):(2579–0021)

Danyal MM, Khan SS, Khan M, Ghaffar MB, Khan B, Arshad M (2023) Sentiment analysis based on performance of linear support vector machine and multinomial Naïve Bayes using movie reviews with baseline techniques. J Big Data (5)

Horsa OG, Tune KK, et al (2023) Aspect-based sentiment analysis for AFAAN OROMOO movie reviews using machine learning techniques. Appl Comput Intell Soft Comput 2023

Jahromi AH, Taheri M (2017) A non-parametric mixture of gaussian Naive Bayes classifiers based on local independent features. In: 2017 Artificial intelligence and signal processing conference (AISP), pp 209–212. IEEE

Khan M, Khan MS, Alharbi Y (2020) Text mining challenges and applications—a comprehensive review. IJCSNS 20(12):138

Khan SS, Khan M, Ran Q, Naseem R (2018) Challenges in opinion mining, comprehensive. Sci Technol J (Ciencia e Tecnica Vitivinicola) 33(11):123–135

Maas AL, Daly R, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, pp 142–150

Mall P, Kumar M, Kumar A, Gupta A, Srivastava S, Narayan V, Chauhan AS, Srivastava AP (2024) Self-attentive CNN + BERT: An approach for analysis of sentiment on movie reviews using word embedding. Int J Intell Syst Appl Eng 12(12s):612–623

Maulana R, Rahayuningsih PA, Irmayani W, Saputra D, Jayanti WE (2020) Improved accuracy of sentiment analysis movie review using support vector machine based information gain. J Phys Conf Ser 1641:012060

Pimpalkar A, Raj RJR (2022) Mbilstmglove: embedding glove knowledge into the corpus using multi-layer Bilstm deep learning model for social media sentiment analysis. Exp Syst Appl 203:117581. https://doi.org/10.1016/j.eswa.2022.117581

Rahat AM, Kahir A, Masum AKM (2019) Comparison of Naive Bayes and SVM algorithm based on sentiment analysis using review dataset. In: 2019 8th International conference system modeling and advancement in research trends (SMART), pp 266–270. IEEE

Rahman R, Masud MA, Mimi RJ, Dina MNS (2021) Sentiment analysis on Bengali movie reviews using multinomial Naïve Bayes. In: 2021 24th International conference on computer and information technology (ICCIT), pp 1–6. https://doi.org/10.1109/ICCIT54785.2021.9689787

Rizal C, Kifta DA, Nasution RH, Rengganis A, Watrianthos R (2023) Opinion classification for IMDB review based using Naive Bayes method. In: AIP conference proceedings, vol 2913. AIP Publishing: New York

Rotten Tomatoes Movie Reviews dataset https://www.rottentomatoes.com . Accessed on 02 Mar 2023 (2020)

Samsir S, Kusmanto K, Dalimunthe AH, Aditiya R, Watrianthos R (2022) Implementation Naïve Bayes classification for sentiment analysis on internet movie database. Build Inf Technol Sci (BITS) 4(1):1–6

Shackley D, Folajimi Y (2023) Sentiment analysis of fake health news using Naive Bayes classification models. Int J Cognit Lang Sci 17(3):217–224

Sudha N, Govindarajan M (2016) Mining movie reviews using machine learning techniques. Int J Comput Appl 144 (5)

Teja JS, Sai GK, Kumar MD, Manikandan R (2018) Sentiment analysis of movie reviews using machine learning algorithms—a survey. Int J Pure Appl Math 118(20):3277–3284

Ullah K, Rashad A, Khan M, Ghadi Y, Aljuaid H, Nawaz Z et al (2022) A deep neural network-based approach for sentiment analysis of movie reviews. Complexity 2022

Veziroğlu M, Eziroğlu E, Bucak İÖ (2024) Performance comparison between Naive Bayes and machine learning algorithms for news classification. In: Bayesian inference-recent trends. IntechOpen

Vielma C, Verma A, Bein D (2023) Sentiment analysis with novel GRU based deep learning networks. In: 2023 IEEE World AI IoT congress (AIIoT), pp 0440–0446. https://doi.org/10.1109/AIIoT58121.2023.10174396

Yang L, Shami A (2020) On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 415(1):295–316

Yusran M, Siswanto S, Islamiyati A (2024) Comparison of multinomial Naïve Bayes and Bernoulli Naïve Bayes on sentiment analysis of Kurikulum Merdeka with query expansion ranking. SISTEMASI 13(1):96–106


Acknowledgements

We sincerely thank everyone who helped us finish this research paper. We are grateful to the participants for their helpful feedback and ideas, which improved our research methods and the quality of our results. We appreciate everyone who gave their time to join our study, as this research wouldn’t have been possible without them. Thank you to everyone who took the time to contribute to this research paper.

This paper is for free publication.

Author information

Mian Muhammad Danyal, Muzammil Khan, Subhan Ullah, Muhammad Bilal Ghaffar, Wahab Khan have contributed equally to this work.

Authors and Affiliations

Center for Excellence in Information Technology, Institute of Management Sciences, Peshawar, 24720, Pakistan

Mian Muhammad Danyal

Department of Computer Science, City University of Science and Information Technology, Peshawar, 25000, Pakistan

Mian Muhammad Danyal, Subhan Ullah, Muhammad Bilal Ghaffar & Wahab Khan

Department of Computer and Software Technology, University of Swat, Swat, 19130, Pakistan

Sarwar Shah Khan & Muzammil Khan

Department of Computer Science, Iqra University Swat Campus, Swat, 19130, Pakistan

Sarwar Shah Khan


Contributions

The author contributions are as follows: conceptualization, MMD and SSK; methodology, MBG and MK; software, MMD and SU; validation, SSK and WK; formal analysis, MK, WK, and MBG; investigation, SU; data curation, SU and SSK; writing (original draft preparation), MMD and MBG; writing (review and editing), SSK; visualization, MBG and MK.

Corresponding author

Correspondence to Muzammil Khan .

Ethics declarations

Conflict of interest.

The authors of this paper declare that they do not have any conflicts of interest.

Financial interests

The authors of this paper have no conflict of interest relevant to this article’s content to declare.

Ethical approval

Not applicable.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Danyal, M.M., Khan, S.S., Khan, M. et al. Sentiment analysis of movie reviews based on NB approaches using TF–IDF and count vectorizer. Soc. Netw. Anal. Min. 14 , 87 (2024). https://doi.org/10.1007/s13278-024-01250-9


Received : 02 April 2023

Revised : 16 March 2024

Accepted : 20 March 2024

Published : 16 April 2024

DOI : https://doi.org/10.1007/s13278-024-01250-9


  • IMDB dataset
  • Rotten tomatoes dataset
  • Count vectorizer


COMMENTS

  1. SNAP: Web data: Amazon movie reviews

    Dataset information. This dataset consists of movie reviews from amazon. The data span a period of more than 10 years, including all ~8 million reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories .

  2. Amazon review data

    Amazon Review Data (2018) Jianmo Ni, UCSD. Description. This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

  3. Introduction

    In the Amazon Reviews'23, we provide: Larger Dataset: We collected 571.54M reviews, 245.2% larger than the last version; Newer Interactions: Current interactions range from May. 1996 to Sep. 2023; Richer Metadata: More descriptive features in item metadata; Fine-grained Timestamp: Interaction timestamp at the second or finer level;

  4. amazon_us_reviews

    Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazons iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website.

  5. McAuley-Lab/Amazon-Reviews-2023 · Datasets at Hugging Face

    This is a large-scale Amazon Reviews dataset, collected in 2023 by McAuley Lab, and it includes rich features such as: User Reviews ( ratings, text, helpfulness votes, etc.); Item Metadata ( descriptions, price, raw image, etc.); Links ( user-item / bought together graphs).

  6. Web data: Amazon movie reviews

    Refresh. Kaggle is the world's largest data science community with powerful tools and resources to help you achieve your data science goals.

  7. YashvardhanDas/Amazon-Movie-Reviews-Sentiment-Analysis

    Contains a table with 300,000 unique reviews. The format of the table has two columns; i) 'Id': contains an id that corresponds to a review in train.csv for which you predict a score ii) 'Score': the values for this column are missing since it will include the score predictions. You are required to predict the star ratings of these Id using the ...

  8. bazakoskon/labels-on-Amazon-movie-reviews-dataset

    The Amazon Movies Reviews dataset consists of 7,911,684 reviews Amazon users left between Aug 1997 - Oct 2012 about 253,059 products.. Data format: product/productId: B00006HAXW review/userId: A1RSDE90N6RSZF review/profileName: Joseph M. Kotow review/helpfulness: 9/9 review/score: 5.0 review/time: 1042502400 review/summary: Pittsburgh - Home of the OLDIES ...

  9. Amazon movie reviews

    It has removed the profile name of the reviewer, the review-summary, and the review-text from the primary data. The data span a period of more than 10 years, including all up to October 2012. Each row contains 6 fields: 1. Product ID (e.g. B003AI2VGA) 2. User ID (e.g. A141HP4LYPWMSR) 3. Count of thumb-ups received by this review (e.g. 7) 4.

  10. Amazon review data

    This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). Files.

  11. amazon-review-dataset · GitHub Topics · GitHub

    Add this topic to your repo. To associate your repository with the amazon-review-dataset topic, visit your repo's landing page and select "manage topics." GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

  12. PDF Rating Prediction for Amazon Movies

    In this project, we analyze a dataset consist-ing of 50,000 movie reviews on Amazon. The dataset includes the observed rating on a 1 - 5 scale, along with the text of the user's review. As the dataset is quite sparse, these reviews prove to be valuable in predicting unobserved ratings. We train two supervised learning algorithms

  13. PDF Ratings vs. Reviews in Recommender Systems: A Case study on the Amazon

    In this case study, we use the Amazon movies reviews dataset. The dataset spans a period between August of 1997 and October of 2012. It consists of 889,176 users and 253,059 movies. The users have given 7,911,684 reviews, and their median word count is 101. Each entry in the dataset consists of a user identification number, the movie id that ...

  14. Recommender Systems Datasets

    This is a large-scale Amazon Reviews dataset collected in 2023. This dataset contains 48.19 million items, and 571.54 million reviews from 54.51 million users. Basic statistics ... These datasets contain reviews from the Steam video game platform, and information about which games were bundled together. Basic statistics. Reviews: 7,793,069 ...

  15. Amazon Prime Movies Dataset : r/kaggle

    I have created and published a new dataset containing the movies streaming on the Amazon Prime Video platform. This dataset contains over 7K unique movies. The metadata contains information about the IMDb rating that the movie received, the total running time of the movie, audio language, maturity rating, and a short descriptive summary of the ...

  16. Amazon Review Dataset

    Amazon Review is a dataset to tackle the task of identifying whether the sentiment of a product review is positive or negative. This dataset includes reviews from four different merchandise categories: Books (B) (2834 samples), DVDs (D) (1199 samples), Electronics (E) (1883 samples), and Kitchen and housewares (K) (1755 samples).

  17. Sentiment analysis of movie reviews based on NB approaches ...

    4.1.2 Rotten Tomatoes Movie reviews dataset. The Rotten Tomatoes Reviews dataset consists of movie reviews and labels indicating whether they are "fresh" or "rotten". The dataset covers various movies and genres and includes metadata such as year of release, genre, and cast (Asghar et al. 2014). For this research experiment, 50K samples are ...

  18. Amazon Movie Reviews-processed

    Processed version of Amazon movie reviews.

  19. IMDb Movie Reviews Dataset

    The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10.
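The labeling rule in this snippet can be written as a small function. This is an illustrative sketch only: the function name `imdb_label` is hypothetical, and returning `None` for scores between the two thresholds is an assumption that follows from the statement that only highly polarizing reviews are considered.

```python
from typing import Optional


def imdb_label(score: int) -> Optional[str]:
    """Map a score out of 10 to a sentiment label using the stated thresholds."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    # Scores of 5 and 6 are neither label (assumed to be excluded).
    return None


labels = {score: imdb_label(score) for score in range(1, 11)}
```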

  20. Amazon Review Full Benchmark (Sentiment Analysis)

    The current state-of-the-art on Amazon Review Full is BERT large. See a full comparison of 9 papers with code.

  21. Amazon Prime Movies and TV Shows

    Movies and TV Shows listings on Amazon Prime Video.