Recently I was working on a project analyzing music recommendation algorithms. I found that one of the popular free music datasets is the Last.fm Dataset. However, this dataset only includes users’ recently played tracks. Users’ rating histories are normally important for recommendation algorithms. The play histories in the Last.fm Dataset only represent users’ implicit ratings, meaning the ratings have to be inferred from play counts, whether a track was skipped, and so on. On the other hand, explicit ratings, which come from explicit user actions such as marking a song as loved or banned, are not available in this dataset. After some study, I decided to scrape the data myself using the Last.fm API.
Before You Start
If you are on Windows, do not upgrade Ruby to version 2.2. An issue on GitHub shows that the nokogiri gem is not well supported on Ruby 2.2.
The following Ruby gems are required:
- mechanize
- url
- httparty
- json
Get User List
This part is not supported by the API, so I scrape it from the Last.fm website. One way is to collect users from Recently Active Users, but that list only contains 8 pages, which is sometimes not enough. Another way is to take the member list of a Last.fm group. I used the second method.
The complete code can be found here.
First, create a new Mechanize agent.
require 'mechanize'
agent = Mechanize.new
And use the following code to get usernames.
for i in 1..page_no
Here page_no is the total number of pages in this group, which can be obtained from the first members page:
page = agent.get("http://www.last.fm/group/#{url_name}/members?memberspage=1")
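The paging loop can be sketched as follows. This is a minimal sketch that only builds the member-page URLs from the pattern shown above; the helper name `member_page_urls` and the group name `example-group` are made up for illustration, and extracting usernames from each fetched page is left out because it depends on the page markup.

```ruby
# URL pattern taken from the post; url_name is the group's name in the URL.
MEMBERS_URL = "http://www.last.fm/group/%s/members?memberspage=%d"

# Hypothetical helper: returns the URL of every member page, 1..page_no.
def member_page_urls(url_name, page_no)
  (1..page_no).map { |i| format(MEMBERS_URL, url_name, i) }
end

# "example-group" is a placeholder, not a real group.
urls = member_page_urls("example-group", 3)
urls.each { |url| puts url }
```

Each URL would then be passed to `agent.get(url)` and the usernames read from the returned page.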
Get Loved and Banned Tracks
Before you start, apply for an API key here.
Loved Tracks
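To get the loved tracks of each username uid, the URL is shown below. The format=json parameter asks the API for a JSON response instead of its default XML.
url = "http://ws.audioscrobbler.com/2.0/?method=user.getlovedtracks&user=#{uid}&limit=1&api_key=#{api_key}&format=json"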
Then send an HTTP GET request to fetch the response:
response = HTTParty.get(url, :verify => false)
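Interpolating uid directly into the URL can break the query string if a username contains spaces or symbols. A minimal sketch of building the same request with the standard library's `URI.encode_www_form`, which percent-encodes each value, might look like this. The helper name `loved_tracks_url` is hypothetical, and the `format: "json"` parameter is an assumption that the response should be JSON.

```ruby
require "uri"

API_ROOT = "http://ws.audioscrobbler.com/2.0/"

# Hypothetical helper: builds a user.getlovedtracks request URL.
# URI.encode_www_form escapes every value in the query string.
def loved_tracks_url(uid, api_key, limit: 1)
  query = URI.encode_www_form(
    method: "user.getlovedtracks",
    user: uid,
    limit: limit,
    api_key: api_key,
    format: "json" # assumption: request JSON rather than XML
  )
  "#{API_ROOT}?#{query}"
end

# Placeholder username and key, for illustration only.
url = loved_tracks_url("some user", "YOUR_API_KEY")
puts url
```

The resulting string can be passed to `HTTParty.get` exactly like the hand-written URL.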
The response is a JSON object; convert it to a Ruby Hash:
h = JSON.parse(response.body)
If the user has only one loved track:
artist = h["lovedtracks"]["track"]["artist"]["name"]
music = h["lovedtracks"]["track"]["name"]
If the user has more than one loved track, then h["lovedtracks"]["track"] is an array:
h["lovedtracks"]["track"].each { |track|
  artist = track["artist"]["name"]
  # each element is itself a track Hash, so the title is track["name"]
  music = track["name"]
}
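Since "track" can be either a single Hash or an Array, both cases can be handled in one place. The following is a minimal sketch, not the code from the original script: the helper name `extract_tracks` is hypothetical, and the two sample responses are made up to match the shapes described above.

```ruby
require "json"

# Hypothetical helper: returns [artist, title] pairs whether "track" holds
# a single Hash (one track) or an Array of Hashes (several tracks).
def extract_tracks(h, key = "lovedtracks")
  tracks = h.dig(key, "track")
  return [] if tracks.nil?
  tracks = [tracks] if tracks.is_a?(Hash) # wrap the single-track case
  tracks.map { |t| [t["artist"]["name"], t["name"]] }
end

# Made-up sample responses, shaped like the two cases above.
one  = JSON.parse('{"lovedtracks":{"track":{"name":"Song A","artist":{"name":"Artist X"}}}}')
many = JSON.parse('{"lovedtracks":{"track":[{"name":"Song A","artist":{"name":"Artist X"}},{"name":"Song B","artist":{"name":"Artist Y"}}]}}')

p extract_tracks(one)
p extract_tracks(many)
```

Wrapping the single Hash in an Array means the rest of the script only ever has to deal with a list.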
Banned Tracks
This works the same way as loved tracks, but the URL changes to
url = "http://ws.audioscrobbler.com/2.0/?method=user.getbannedtracks&user=#{uid}&limit=1&api_key=#{api_key}&format=json"
and the top-level key of h is "bannedtracks".
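Because the two requests differ only in the method name and the top-level key, both can be kept in one lookup table. This is a small sketch under that assumption; the names `ENDPOINTS` and `track_entry` and the sample Hash are hypothetical.

```ruby
# Method name and top-level response key for each kind of explicit rating,
# as described in the post.
ENDPOINTS = {
  loved:  { method: "user.getlovedtracks",  key: "lovedtracks" },
  banned: { method: "user.getbannedtracks", key: "bannedtracks" }
}.freeze

# Hypothetical helper: returns the raw "track" entry for the given kind,
# or nil when the response does not contain it.
def track_entry(h, kind)
  h.dig(ENDPOINTS.fetch(kind)[:key], "track")
end

# Made-up parsed response for a user with one banned track.
sample = { "bannedtracks" => { "track" => { "name" => "Song C", "artist" => { "name" => "Artist Z" } } } }

p track_entry(sample, :banned)["name"]
p track_entry(sample, :loved)
```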
See here for the complete code.