Scrape Last.fm using Ruby

Recently I was working on a project analyzing music recommendation algorithms. I found that one of the popular free music datasets is the Last.fm Dataset. However, this dataset only includes users’ recently played tracks. Users’ rating histories are normally important for recommendation algorithms. The play histories in the Last.fm Dataset only represent users’ implicit ratings, meaning ratings that have to be inferred from play counts, whether a track was skipped, and so on. Explicit ratings, that is, deliberate user actions such as marking a song as loved or banned, are not available in this dataset. After some study, I decided to scrape the data myself using the Last.fm API.

Before You Start

If you are on Windows, do not upgrade Ruby to version 2.2. According to an issue on GitHub, the nokogiri gem (a dependency of mechanize) is not well supported on Ruby 2.2.

The following Ruby gems are required:

  • mechanize
  • url
  • httparty
  • json

Get User List

This part is not supported by the API, so I scrape the user list from the Last.fm website. One option is to collect users from Recently Active Users, but that list only has 8 pages, which is sometimes not enough. Another option is to collect the member list of a Last.fm Group. I used the second method.

The complete code can be found here.

First, get a new Mechanize agent.

require 'mechanize'

agent = Mechanize.new

Then use the following code to collect the usernames.

usernames = []
(1..page_no).each do |i|
  puts "Fetching page no.#{i}..."
  page = agent.get("http://www.last.fm/group/#{url_name}/members?memberspage=#{i}")
  # Each member link sits inside a <strong> tag and points to /user/<name>
  page.search("strong").search("a").each do |link|
    usernames << link.attributes['href'].text.gsub("/user/", "")
  end
end
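The href-to-username step in the loop can also be pulled out into a small helper that is easy to check without hitting the site. This is only a minimal sketch, not part of the original script: the name usernames_from_hrefs is hypothetical, the sample hrefs are made up, and it assumes member links look like /user/<name>, as in the loop above.

```ruby
# Hypothetical helper (not from the original script): turns link hrefs into a
# de-duplicated list of usernames. Assumes member links look like "/user/<name>".
def usernames_from_hrefs(hrefs)
  hrefs
    .select { |href| href.start_with?("/user/") }                  # ignore unrelated links
    .map { |href| href.sub(%r{\A/user/}, "").split("/").first }    # keep only the name segment
    .compact                                                       # drop empty "/user/" links
    .uniq
end

# Made-up sample data for illustration
sample_hrefs = ["/user/alice", "/user/bob", "/group/some-group", "/user/alice"]
usernames = usernames_from_hrefs(sample_hrefs)
puts usernames.inspect
```

De-duplicating matters if the same member link appears more than once on a page or across pages.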

Here page_no is the total number of member pages in the group, which can be obtained with:

page = agent.get("http://www.last.fm/group/#{url_name}/members?memberspage=1")
page_no = page.search("a[class='pagelink lastpage']")[0].text.to_i
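One thing to watch for: if the selector matches nothing, [0] returns nil and calling .text on it raises an error. A small guard could look like the sketch below. The helper name page_count is hypothetical, and the fallback to 1 rests on my assumption that a group with a single page of members shows no "last page" link.

```ruby
# Hypothetical helper: converts the text of the "last page" link into a page count.
# Falls back to 1 when the link is missing (nil) or not a positive number,
# on the assumption that a single-page group has no pagination links.
def page_count(last_page_text)
  count = last_page_text.to_s.strip.to_i
  count > 0 ? count : 1
end

# Intended use with the Mechanize page from above (not executed here):
#   link = page.search("a[class='pagelink lastpage']")[0]
#   page_no = page_count(link && link.text)
puts page_count("12")
puts page_count(nil)
```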

Get Loved and Banned Tracks

Before you start, apply for an API key here.

Loved Tracks

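To get the loved tracks of each username uid, build the URL as follows. The format=json parameter asks the API for JSON instead of its default XML.

url = "http://ws.audioscrobbler.com/2.0/?method=user.getlovedtracks&user=#{uid}&limit=1&api_key=#{api_key}&format=json"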
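Usernames scraped from a web page may contain characters that are not safe to interpolate into a query string, so it can be worth building the URL with URI.encode_www_form from the standard library instead. The sketch below is one way to do it, not the original script: lastfm_url and API_ROOT are hypothetical names, "YOUR_API_KEY" is a placeholder, and format=json is included on the assumption that the later JSON parsing step needs a JSON response.

```ruby
require 'uri'

API_ROOT = "http://ws.audioscrobbler.com/2.0/"

# Hypothetical helper: builds a Last.fm API URL for the given method and user,
# escaping every query parameter.
def lastfm_url(method, uid, api_key, limit: 1)
  query = URI.encode_www_form(
    method: method,
    user: uid,
    limit: limit,
    api_key: api_key,
    format: "json" # assumption: we want JSON rather than the default XML
  )
  "#{API_ROOT}?#{query}"
end

# "some user" and "YOUR_API_KEY" are placeholders
url = lastfm_url("user.getlovedtracks", "some user", "YOUR_API_KEY")
puts url
```

The same helper works for any other method name, so only the first argument has to change between calls.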

Then send an HTTP GET request to get the response:

require 'httparty'

response = HTTParty.get(url, :verify => false)

The response is a JSON object; convert it to a Ruby Hash:

require 'json'

h = JSON.parse(response.body)
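Not every response will contain tracks. A request can fail, for example for an unknown user, so it may help to check the parsed hash before digging into it. The following is a minimal sketch: parse_lastfm is a hypothetical name, the sample bodies are made up, and the error shape (top-level "error" and "message" keys) is my assumption about how the API reports failures in JSON.

```ruby
require 'json'

# Hypothetical helper: parses a response body and raises if it looks like an
# API error. Assumes errors arrive as {"error": <code>, "message": <text>}.
def parse_lastfm(body)
  h = JSON.parse(body)
  if h.is_a?(Hash) && h.key?("error")
    raise "Last.fm API error #{h['error']}: #{h['message']}"
  end
  h
end

# Made-up sample bodies for illustration
ok = parse_lastfm('{"lovedtracks":{"track":[]}}')
puts ok.keys.inspect

begin
  parse_lastfm('{"error":6,"message":"User not found"}')
rescue RuntimeError => e
  puts e.message
end
```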

If the user has only one loved track, h["lovedtracks"]["track"] is a single hash:

artist = h["lovedtracks"]["track"]["artist"]["name"]
music = h["lovedtracks"]["track"]["name"]

If the user has more than one loved track, h["lovedtracks"]["track"] is an array:

h["lovedtracks"]["track"].each do |track|
  artist = track["artist"]["name"]
  music = track["name"]
end
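Since "track" can be either a single hash or an array, it is convenient to normalize it once and handle both cases with the same code. Here is a minimal sketch under that assumption. The helper name track_pairs and the sample data are made up, and treating a missing "track" key as "no tracks" is my own choice rather than documented behavior.

```ruby
# Hypothetical helper: returns [artist, title] pairs whether "track" is a
# single hash, an array of hashes, or missing altogether.
def track_pairs(h, key = "lovedtracks")
  tracks = (h[key] || {})["track"]
  tracks = [tracks] if tracks.is_a?(Hash) # a single track comes back as a bare hash
  Array(tracks).map { |t| [t["artist"]["name"], t["name"]] }
end

# Made-up sample data for illustration
single = { "lovedtracks" => { "track" => { "name" => "Song A", "artist" => { "name" => "Artist X" } } } }
many = { "lovedtracks" => { "track" => [
  { "name" => "Song A", "artist" => { "name" => "Artist X" } },
  { "name" => "Song B", "artist" => { "name" => "Artist Y" } }
] } }

puts track_pairs(single).inspect
puts track_pairs(many).inspect
```

Note that Array(tracks) alone would not be enough here, because calling Array() on a hash splits it into key/value pairs instead of wrapping it.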

Banned Tracks

Banned tracks work the same way as loved tracks, but the URL changes to

url = "http://ws.audioscrobbler.com/2.0/?method=user.getbannedtracks&user=#{uid}&limit=1&api_key=#{api_key}&format=json"

and the top-level key of h is "bannedtracks" instead of "lovedtracks".
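Because the two cases differ only in the method name and the top-level key, both can be driven from one small lookup table. This is a sketch of that idea, not the original script: ENDPOINTS and tracks_node are hypothetical names, and the sample response is made up on the assumption that banned tracks have the same shape as loved tracks.

```ruby
# Hypothetical lookup table: API method name and top-level response key per kind.
ENDPOINTS = {
  loved:  { method: "user.getlovedtracks",  key: "lovedtracks" },
  banned: { method: "user.getbannedtracks", key: "bannedtracks" }
}.freeze

# Hypothetical helper: returns the raw "track" node for the given kind,
# or nil if the response has no such key.
def tracks_node(h, kind)
  key = ENDPOINTS.fetch(kind)[:key]
  (h[key] || {})["track"]
end

# Made-up sample response, assumed to mirror the loved-tracks shape
sample = { "bannedtracks" => { "track" => [
  { "name" => "Song B", "artist" => { "name" => "Artist Y" } }
] } }

banned = tracks_node(sample, :banned)
puts banned.first["name"]
```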

See here for the complete code.