Create Spotify Song Pool

This Jupyter notebook builds a pool of unique songs obtained through Spotify’s API. It requests data for every unique song found in a subset of Spotify’s 1 Million Playlist Dataset. The subset consists of 10,000 playlists. Approximately 170,000 unique songs are found in this subset and written to a pool CSV file. We assume this pool is large enough for recommending songs to a playlist. Song features are added to the pool so that they can be used as a source of information.


We import a set of functions we wrote to make the notebook code easier to read. These functions, stored in a .py file called “spotify_api_function_set”, wrap Spotipy, a library that communicates with the Spotify API. The Spotipy documentation can be found here (https://spotipy.readthedocs.io/en/latest/). Note that these functions are specific to this project (see the EDA section for the list of functions inside this .py file).

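import json #reads the playlist slice files
import time #times the slow API steps
import numpy as np
import pandas as pd
import spotify_api_function_set as sps #imports set of functions created to use spotify API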

We load a subset of 10,000 playlists from the 1 Million Playlist Dataset from Spotify using the json library.

path = 'data'
file_names = ["mpd.slice.0-999", "mpd.slice.1000-1999", "mpd.slice.2000-2999",
              "mpd.slice.3000-3999", "mpd.slice.4000-4999", "mpd.slice.5000-5999",
              "mpd.slice.6000-6999", "mpd.slice.7000-7999", "mpd.slice.8000-8999", "mpd.slice.9000-9999"]

spotify_playlist = []
for file in file_names:
    # Each slice file stores its playlists under the 'playlists' key
    with open(path+"/"+file+".json", "r") as fd:
        plylist_temp = json.load(fd).get('playlists')
    spotify_playlist.extend(plylist_temp)
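
The slice names follow a regular pattern, so they could also be generated instead of typed out. The following is a minimal sketch, assuming every slice holds exactly 1,000 playlists; the helper name `slice_names` is our own and is not part of the project code.

```python
def slice_names(n_slices, slice_size=1000):
    """Build slice file stems such as 'mpd.slice.0-999' (hypothetical helper)."""
    names = []
    for k in range(n_slices):
        start = k * slice_size          # first playlist id in the slice
        end = start + slice_size - 1    # last playlist id in the slice
        names.append(f"mpd.slice.{start}-{end}")
    return names


file_names = slice_names(10)
print(file_names[0], file_names[-1])
```

Changing the argument to `slice_names` would be one way to load a smaller or larger subset without editing the list by hand.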

We define the number of playlists we wish to use as a source for song pool generation. In this case we use all 10,000 playlists. Then, for each playlist, we extract each song’s Uniform Resource Identifier (URI) and the URI of its artist so we can pass them to Spotify’s API later.

N = 10000 #Number of playlists to request

track_uri = []
artist_uri = []

for i in range(N):
    track_id = sps.get_playlist_n(spotify_playlist[i], feature = 'track_uri', n_playlist = i)
    artist_id = sps.get_playlist_n(spotify_playlist[i], feature = 'artist_uri', n_playlist = i)  

    track_uri.extend(track_id)
    artist_uri.extend(artist_id)
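
The helper `sps.get_playlist_n` lives in the project’s .py file and is not reproduced in this notebook. As a rough illustration of the kind of extraction involved, the sketch below assumes each playlist is a dict with a `tracks` list whose entries carry `track_uri` and `artist_uri` keys. `extract_field` is a hypothetical stand-in, not the project’s function, and the URIs are made up.

```python
def extract_field(playlist, field):
    """Return the value of `field` for every track in one playlist dict."""
    return [track[field] for track in playlist.get("tracks", [])]


# Toy playlist in the assumed layout (URIs are invented for illustration).
toy_playlist = {
    "name": "demo",
    "tracks": [
        {"track_uri": "spotify:track:AAA", "artist_uri": "spotify:artist:X"},
        {"track_uri": "spotify:track:BBB", "artist_uri": "spotify:artist:Y"},
    ],
}

toy_track_ids = extract_field(toy_playlist, "track_uri")
toy_artist_ids = extract_field(toy_playlist, "artist_uri")
print(toy_track_ids)
print(toy_artist_ids)
```

Because both lists are built from the same track order, position i in one list refers to the same song as position i in the other, which is what the later deduplication step relies on.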

Since we expect many songs to be repeated from playlist to playlist, we store the track and artist URIs in a pandas dataframe in order to drop any duplicates based on track URIs.

# Build the dataframe directly from the two parallel lists
temp_df = pd.DataFrame({'track_uri': track_uri, 'artist_uri': artist_uri})

We check the length of the dataframe containing all songs extracted from the 10,000 playlists. It currently holds 664,712 entries, duplicates included.

len(temp_df)
664712

Dropping duplicated songs reduces the dataframe to 170,089 unique songs. We do this before requesting API information in order to avoid unnecessary requests.

temp_df = temp_df.drop_duplicates(subset='track_uri') #Remove duplicates
len(temp_df)
170089
track_uri = list(temp_df.track_uri)
artist_uri = list(temp_df.artist_uri)
sp = sps.create_spotipy_obj() #create spotify object to use to request songs
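
To make the deduplication rule explicit: `drop_duplicates(subset='track_uri')` keeps the first row seen for each track URI by default. The plain-Python sketch below mirrors that behaviour on toy data; the function name `dedupe_by_track` and the short identifiers are ours, for illustration only.

```python
def dedupe_by_track(track_uris, artist_uris):
    """Keep the first (track, artist) pair seen for each track URI."""
    seen = set()
    unique_tracks, unique_artists = [], []
    for t, a in zip(track_uris, artist_uris):
        if t not in seen:
            seen.add(t)
            unique_tracks.append(t)
            unique_artists.append(a)
    return unique_tracks, unique_artists


toy_tracks = ["t1", "t2", "t1", "t3", "t2"]
toy_artists = ["a1", "a2", "a1", "a3", "a2"]
u_tracks, u_artists = dedupe_by_track(toy_tracks, toy_artists)
print(u_tracks)   # ['t1', 't2', 't3']
print(u_artists)  # ['a1', 'a2', 'a3']
```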

We request the song and artist features provided by Spotify’s API for all unique songs found in the 10,000-playlist subset, and time the request to get a sense of speed. Note that this code took us approximately 22 minutes to run. Feel free to use a smaller playlist subset (N above) to test the code first.

start_time = time.time()
t_features, a_features = sps.get_all_features(track_uri, artist_uri, sp)
print("--- %s seconds ---" % (time.time() - start_time))
--- 1293.6080300807953 seconds ---
# Pair each track's features with its artist's features, one row per song
feature_pd = pd.DataFrame({'t_features': t_features, 'a_features': a_features})
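
How `sps.get_all_features` talks to the API is not shown here. One common way to keep the number of requests low is to ask for many items per call. The sketch below illustrates that idea only, under the assumption that the endpoint accepts up to 100 IDs per request; `chunked`, `fetch_in_batches` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real API call.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(uris, fetch_batch, batch_size=100):
    """Call `fetch_batch` once per chunk and concatenate the results.

    `fetch_batch` is a placeholder for whatever performs the real request.
    """
    results = []
    for batch in chunked(uris, batch_size):
        results.extend(fetch_batch(batch))
    return results


# Stand-in for an API call: records each batch size and returns fake records.
calls = []

def fake_fetch(batch):
    calls.append(len(batch))
    return [{"uri": uri} for uri in batch]


toy_uris = [f"spotify:track:{i}" for i in range(250)]
toy_features = fetch_in_batches(toy_uris, fake_fetch)
print(len(toy_features), calls)  # 250 [100, 100, 50]
```

At 100 items per call, the 170,089 unique tracks would need 1,701 requests rather than one per song. Whether the project’s helper batches in this way is not something this notebook confirms.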

Before proceeding any further, we check whether Spotify returned any NoneType objects and drop them. When we ran the code, we got only one NoneType object for a song, so our pool was reduced by one song.

feature_pd = feature_pd.dropna()
t_features = list(feature_pd.t_features)
a_features = list(feature_pd.a_features)
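
The effect of `dropna()` here is that a song is removed whenever either its track features or its artist features are missing, which keeps the two lists aligned. A plain-Python sketch of the same rule, using a hypothetical helper `drop_missing` and toy values:

```python
def drop_missing(track_features, artist_features):
    """Drop every pair in which either entry is None."""
    kept_t, kept_a = [], []
    for t, a in zip(track_features, artist_features):
        if t is not None and a is not None:
            kept_t.append(t)
            kept_a.append(a)
    return kept_t, kept_a


toy_t = [{"tempo": 120.0}, None, {"tempo": 98.5}]
toy_a = [{"name": "A"}, {"name": "B"}, {"name": "C"}]
clean_t, clean_a = drop_missing(toy_t, toy_a)
print(len(clean_t), len(clean_a))  # 2 2
```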

We create a pandas dataframe containing the unique songs with their features and assign each song a genre, just as we did during data exploration and preparation. We also timed this step; for the 10,000 playlists it took about four and a half minutes.

songs_df = sps.create_song_df(t_features, a_features, list(range(len(t_features))))
start_time = time.time()
songs_df_unique = sps.genre_generator(songs_df)
print("--- %s seconds ---" % (time.time() - start_time))
--- 265.5861220359802 seconds ---
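
`sps.genre_generator` is defined in the project’s helper file and is not reproduced here. Purely as an illustration of what collapsing fine-grained artist genre tags into one coarse label might look like, the sketch below uses a keyword lookup. The keyword list, its precedence order, the fallback label and the name `coarse_genre` are all assumptions and may differ from the project’s actual rules.

```python
# Hypothetical (keyword, label) pairs; earlier entries take precedence.
GENRE_RULES = [
    ("rap", "rap"),
    ("hip hop", "rap"),
    ("rock", "rock"),
    ("country", "country"),
    ("pop", "pop"),
]


def coarse_genre(artist_genres, default="other"):
    """Map a list of genre tags to a single coarse label."""
    for keyword, label in GENRE_RULES:
        if any(keyword in tag for tag in artist_genres):
            return label
    return default


print(coarse_genre(["dance pop", "pop"]))    # pop
print(coarse_genre(["hip hop", "pop rap"]))  # rap
print(coarse_genre([]))                      # other
```

A rule order like this decides ties: an artist tagged with both rap and pop genres lands in "rap" only because "rap" is listed first.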

We clean the data a bit further by keeping only the columns needed for the pool.

cols = ['song_uri', 'duration_ms', 'time_signature', 'key', 'tempo',
       'energy', 'mode', 'loudness', 'speechiness', 'danceability',
       'acousticness', 'instrumentalness', 'valence', 'liveness',
       'artist_followers', 'artist_name', 'artist_popularity', 'artist_uri','genre']
# Drop every column that is not listed in cols (original column order is kept)
drop = set(songs_df_unique.columns) - set(cols)
pool_df = songs_df_unique.drop(columns=list(drop))
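
A note on computing the columns to drop: a symmetric difference (`^`) and a plain difference (`-`) give the same answer only when every wanted name actually exists in the dataframe. If a wanted column is absent, the symmetric difference also contains that name, and `DataFrame.drop` raises a `KeyError` for labels it cannot find. The toy column names below are for illustration only.

```python
wanted = ["song_uri", "tempo", "genre"]
available = ["song_uri", "tempo", "album_name"]  # 'genre' is missing here

symmetric = set(wanted) ^ set(available)    # names found in exactly one list
difference = set(available) - set(wanted)   # names only in the dataframe

print(sorted(symmetric))   # ['album_name', 'genre']
print(sorted(difference))  # ['album_name']
```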

We check that the result looks reasonable.

pool_df.head()
|   | song_uri | duration_ms | time_signature | key | tempo | energy | mode | loudness | speechiness | danceability | acousticness | instrumentalness | valence | liveness | artist_followers | artist_uri | artist_name | artist_popularity | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI | 226864 | 4 | 4 | 125.461 | 0.813 | 0 | -7.105 | 0.1210 | 0.904 | 0.03110 | 0.006970 | 0.810 | 0.0471 | 909185 | spotify:artist:2wIVse2owClT7go1WT98tk | Missy Elliott | 76 | rap |
| 1 | spotify:track:6I9VzXrHxO9rA9A5euc8Ak | 198800 | 4 | 5 | 143.040 | 0.838 | 0 | -3.914 | 0.1140 | 0.774 | 0.02490 | 0.025000 | 0.924 | 0.2420 | 5455441 | spotify:artist:26dSoYclwsYLMAKD3tpOr4 | Britney Spears | 82 | pop |
| 2 | spotify:track:0WqIKmW4BTrj3eJFmnCKMv | 235933 | 4 | 2 | 99.259 | 0.758 | 0 | -6.583 | 0.2100 | 0.664 | 0.00238 | 0.000000 | 0.701 | 0.0598 | 16678709 | spotify:artist:6vWDO969PvNqNYHIOW5v0m | Beyoncé | 87 | pop |
| 3 | spotify:track:1AWQoqb9bSvzTjaLralEkT | 267267 | 4 | 4 | 100.972 | 0.714 | 0 | -6.055 | 0.1400 | 0.891 | 0.20200 | 0.000234 | 0.818 | 0.0521 | 7341126 | spotify:artist:31TPClRtHm23RisEBtV3X7 | Justin Timberlake | 83 | rap |
| 4 | spotify:track:1lzr43nnXAijIGYnCT8M8H | 227600 | 4 | 0 | 94.759 | 0.606 | 1 | -4.596 | 0.0713 | 0.853 | 0.05610 | 0.000000 | 0.654 | 0.3130 | 1044532 | spotify:artist:5EvFsr3kj42KNv97ZEnqij | Shaggy | 74 | rap |

Finally, we store the pool at the specified path. We drop the index because it isn’t necessary.

pool_df.to_csv(path+'/'+'big_song_pool.csv', index=False)