Introduction to the Amazon Review Dataset

springtostring · published 2021-01-29 22:25:05

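The Amazon Review Dataset records user reviews of products sold on Amazon and is a classic dataset for recommender systems. It has been updated over time, and in chronological order it comes in three versions: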

2013 version: http://snap.stanford.edu/data/web-Amazon-links.html
2014 version: http://jmcauley.ucsd.edu/data/amazon/index_2014.html
If the 2014 link redirects straight to the 2018 version, use http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ instead.
2018 version: https://nijianmo.github.io/amazon/index.html

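The Amazon dataset is split by product category into sub-datasets such as Books, Electronics, Movies and TV, and CDs and Vinyl. Each sub-dataset contains two kinds of information:

Taking the 2014 version as an example: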

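Product metadata

asin	product ID
title	product name
price	price
imUrl	URL of the product image
related	related products (also bought, also viewed, bought together)
salesRank	sales rank information
brand	brand
categories	list of category paths the product belongs to

Official example: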

{
"asin": "0000031852",
"title": "Girls Ballet Tutu Zebra Hot Pink",
"price": 3.17,
"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
"related":
{
 "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
 "also_viewed": ["B002BZX8Z6", "B00JHONN1S"],
 "bought_together": ["B002BZX8Z6"]
},
"salesRank": {"Toys & Games": 211836},
"brand": "Coxlures",
"categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}
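The nested `related` and `categories` fields are awkward to store in a flat CSV row. As a minimal sketch (the helper name `flatten_meta` and the output column names are hypothetical, not part of the dataset), one way to reduce a metadata record to flat values is:

```python
def flatten_meta(record):
    """Reduce one metadata record to flat, CSV-friendly values.

    Hypothetical helper: the output keys are illustrative choices only.
    """
    # "categories" is a list of category paths; use the first path if present.
    categories = record.get("categories") or [[]]
    first_path = categories[0]
    related = record.get("related") or {}
    return {
        "asin": record["asin"],
        "top_category": first_path[0] if first_path else None,
        "leaf_category": first_path[-1] if first_path else None,
        "n_also_bought": len(related.get("also_bought", [])),
    }


example = {
    "asin": "0000031852",
    "related": {"also_bought": ["B00JHONN1S", "B002BZX8Z6"]},
    "categories": [["Sports & Outdoors", "Other Sports", "Dance"]],
}
flat = flatten_meta(example)
print(flat)
```

Records that lack `related` or `categories` fall back to `None` and `0` rather than raising an error.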

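User review records

reviewerID	user ID
asin	product ID
reviewerName	user name
helpful	helpfulness rating of the review, e.g. 2/3 (stored as [helpful votes, total votes])
reviewText	text of the review
overall	rating given to the product
summary	summary of the review
unixReviewTime	review time as a Unix timestamp
reviewTime	review time as a raw string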

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
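Two of these fields usually need a small conversion before analysis: `helpful` is a two-element list and `unixReviewTime` is seconds since the epoch. The sketch below uses hypothetical helper names (`helpful_ratio`, `review_date`) and assumes the timestamp should be read as UTC:

```python
from datetime import datetime, timezone


def helpful_ratio(helpful):
    """Return helpful votes / total votes, or None when nobody voted."""
    up, total = helpful
    return up / total if total else None


def review_date(unix_ts):
    """Convert a Unix timestamp to an ISO date string (UTC assumed)."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).date().isoformat()


review = {"helpful": [2, 3], "overall": 5.0, "unixReviewTime": 1252800000}
ratio = helpful_ratio(review["helpful"])
date = review_date(review["unixReviewTime"])
print(ratio, date)
```

For the sample record the date comes out as 2009-09-13, which matches its `reviewTime` of "09 13, 2009".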

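Reading the Amazon dataset:

The downloaded data comes as JSON files with one record per line, which are awkward to work with directly, so this section shows how to convert them to CSV. The 2014 Amazon Electronics dataset is used as the example:

Reading product metadata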

import ast
import pandas as pd

file_path = 'meta_Electronics.json'
useless_col = ['imUrl', 'salesRank', 'related', 'title', 'description']  # fields to drop

rows = []
with open(file_path, 'r') as fin:
    for line in fin:
        # Lines may be Python-style dict literals rather than strict JSON,
        # so parse them with ast.literal_eval, a safer replacement for eval.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # ignore fields missing from this record
        rows.append(d)

df = pd.DataFrame(rows)
df.to_csv('meta_Electronics.csv', index=False)

Reading user review records

import ast
import pandas as pd

file_path = 'Electronics_10.json'
useless_col = ['reviewerName', 'reviewText', 'unixReviewTime', 'summary']  # fields to drop

rows = []
with open(file_path, 'r') as fin:
    for line in fin:
        # One record per line; ast.literal_eval is a safer replacement for eval.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # ignore fields missing from this record
        rows.append(d)

df = pd.DataFrame(rows)
df.to_csv('Electronics_10.csv', index=False)
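The two scripts above load every record into memory before writing. For larger categories, a streaming variant may be preferable. The following is only a sketch under stated assumptions: the function names `iter_records` and `convert_to_csv` are hypothetical, the columns to keep are listed up front instead of dropping unwanted ones, and files ending in `.gz` are assumed to be gzip-compressed copies of the same line-per-record format. It uses the standard-library `csv` module in place of pandas.

```python
import ast
import csv
import gzip


def iter_records(path):
    """Yield one dict per non-empty line of a plain or gzipped dump."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if line:
                yield ast.literal_eval(line)


def convert_to_csv(src, dst, keep_cols):
    """Stream records from src into a CSV at dst, keeping only keep_cols.

    Returns the number of records written.
    """
    count = 0
    with open(dst, "w", newline="", encoding="utf-8") as fout:
        # extrasaction="ignore" drops unlisted fields; restval fills missing ones.
        writer = csv.DictWriter(
            fout, fieldnames=keep_cols, extrasaction="ignore", restval=""
        )
        writer.writeheader()
        for record in iter_records(src):
            writer.writerow(record)
            count += 1
    return count
```

A call such as `convert_to_csv('Electronics_10.json', 'Electronics_10.csv', ['reviewerID', 'asin', 'helpful', 'overall', 'reviewTime'])` would then produce roughly the same columns as the pandas version, without holding the whole file in memory.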
