Create Spotify Song Pool

This Jupyter notebook creates a pool of unique songs obtained through Spotify’s API. It requests data for all unique songs found in a subset of Spotify’s 1 Million Playlist Dataset. The subset consists of 10,000 playlists. The number of unique songs found in this subset and written to a pool CSV file is approximately 170,000. We assume this pool is large enough for recommending songs to a playlist. Song features are added to the pool so that they can be used as a source of information.

We import a set of functions we created to make the notebook code easier to read. These functions, stored in a .py file called “spotify_api_function_set”, wrap Spotipy, a library that communicates with the Spotify API. The Spotipy library can be found here (https://spotipy.readthedocs.io/en/latest/). Note that the functions are specific to this project (see the EDA section for the list of functions inside this .py file).

import spotify_api_function_set as sps #imports set of functions created to use spotify API

We load a subset of 10,000 playlists from the 1 Million Playlist Dataset from Spotify using the json library.

import json

path = 'data'
file_names = ["mpd.slice.0-999", "mpd.slice.1000-1999", "mpd.slice.2000-2999",
              "mpd.slice.3000-3999", "mpd.slice.4000-4999", "mpd.slice.5000-5999",
              "mpd.slice.6000-6999", "mpd.slice.7000-7999", "mpd.slice.8000-8999", "mpd.slice.9000-9999"]

spotify_playlist = []
for file_name in file_names:
    with open(path+"/"+file_name+".json", "r") as fd:
        playlist_temp = json.load(fd)
    # Each slice file stores its playlists under the 'playlists' key
    spotify_playlist.extend(playlist_temp.get('playlists', []))
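
The ten slice names follow a regular pattern, so the list can also be generated instead of typed out. The sketch below assumes every slice holds exactly 1,000 playlists and that file names follow the `mpd.slice.START-END` pattern seen above; the helper name `slice_names` is ours and is not part of the dataset or of `spotify_api_function_set`.

```python
def slice_names(n_slices, slice_size=1000):
    """Build slice file names such as 'mpd.slice.0-999' (without extension)."""
    names = []
    for k in range(n_slices):
        start = k * slice_size
        end = start + slice_size - 1
        names.append(f"mpd.slice.{start}-{end}")
    return names


file_names = slice_names(10)
print(file_names[0], file_names[-1])
```

Changing the argument to `slice_names` is then enough to load a smaller or larger subset, provided the corresponding files exist in the data folder.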

We define the number of playlists we wish to use as a source for song pool generation. In this case we use all 10,000 playlists. Then, for each playlist, we extract every song’s Uniform Resource Identifier (URI) and the URI of its artist so we can pass them to Spotify’s API later.

N = 10000 #Number of playlists to request

track_uri = []
artist_uri = []

for i in range(N):
    track_id = sps.get_playlist_n(spotify_playlist[i], feature = 'track_uri', n_playlist = i)
    artist_id = sps.get_playlist_n(spotify_playlist[i], feature = 'artist_uri', n_playlist = i)  

    track_uri.extend(track_id)
    artist_uri.extend(artist_id)
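
For readers without access to `spotify_api_function_set`, the following is a minimal sketch of what this extraction amounts to. It assumes each playlist is a dict with a `tracks` list whose entries carry `track_uri` and `artist_uri` keys, which is our understanding of the slice layout. `extract_field` is a hypothetical stand-in and not the actual `sps.get_playlist_n`; the URIs are made up.

```python
def extract_field(playlist, field):
    """Return the value of `field` for every track in one playlist dict."""
    return [track[field] for track in playlist.get("tracks", [])]


# Toy playlist with made-up URIs, shaped like a dataset entry (assumed layout)
toy_playlist = {
    "name": "example",
    "tracks": [
        {"track_uri": "spotify:track:A", "artist_uri": "spotify:artist:X"},
        {"track_uri": "spotify:track:B", "artist_uri": "spotify:artist:Y"},
    ],
}

toy_track_uris = extract_field(toy_playlist, "track_uri")
toy_artist_uris = extract_field(toy_playlist, "artist_uri")
print(toy_track_uris, toy_artist_uris)
```

Because both lists are built in track order, position `k` in one list always refers to the same song as position `k` in the other, which is what the later dataframe construction relies on.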

Since we expect many songs to be repeated from playlist to playlist, we store the track and artist URIs in a pandas dataframe in order to drop any duplicates based on track URIs.

import numpy as np
import pandas as pd

# One row per playlist entry: column 0 holds track URIs, column 1 artist URIs
data = np.transpose([np.array(track_uri), np.array(artist_uri)])
temp_df = pd.DataFrame(data, columns=['track_uri', 'artist_uri'])

We check the length of the dataframe containing all songs extracted from the 10,000 playlists. It currently holds 664,712 entries, with repeated songs counted once for every playlist they appear in.

len(temp_df)

Dropping duplicated songs, we reduce the dataframe to 170,089 unique songs. We do this before requesting API information in order to prevent unnecessary requests.

temp_df = temp_df.drop_duplicates(subset='track_uri') #Remove duplicates
len(temp_df)
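
The effect of `drop_duplicates` with a `subset` argument can be seen on a toy dataframe. The URIs below are made up for illustration.

```python
import pandas as pd

# Toy data with made-up URIs: track A appears in two playlists
toy_df = pd.DataFrame({
    "track_uri": ["spotify:track:A", "spotify:track:B", "spotify:track:A"],
    "artist_uri": ["spotify:artist:X", "spotify:artist:Y", "spotify:artist:X"],
})

# Keep the first row for each track URI; later repeats are discarded
toy_unique = toy_df.drop_duplicates(subset="track_uri")
print(len(toy_df), len(toy_unique))
```

Only the `track_uri` column is compared, so the artist URI that survives is the one from the first occurrence of each track.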

track_uri = list(temp_df.track_uri)
artist_uri = list(temp_df.artist_uri)
sp = sps.create_spotipy_obj() #create spotify object to use to request songs

We request the song and artist features provided by Spotify’s API for all unique songs found in the 10,000-playlist subset. We time the request to get a sense of speed. Note that this code took us approximately 22 minutes to run. Feel free to use a smaller playlist subset (N above) to test the code first.

import time

start_time = time.time()
t_features, a_features = sps.get_all_features(track_uri, artist_uri, sp)
print("--- %s seconds ---" % (time.time() - start_time))

--- 1293.6080300807953 seconds ---
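
Most of this run time is spent on network round trips, so requests are usually sent in batches instead of one song at a time. The sketch below shows only the batching logic, under stated assumptions: `fetch_batch` is a hypothetical placeholder for a real API call (with Spotipy this might be something like `sp.audio_features(batch)`), the batch size of 100 is an assumed per-request limit, and none of this is necessarily how `sps.get_all_features` is implemented.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(uris, fetch_batch, batch_size=100):
    """Call `fetch_batch` once per chunk and concatenate the results."""
    results = []
    for batch in chunked(uris, batch_size):
        results.extend(fetch_batch(batch))
    return results


# Stand-in for an API call: records the batch size, returns one dict per URI
calls = []
def fake_fetch(batch):
    calls.append(len(batch))
    return [{"uri": uri} for uri in batch]


toy_uris = [f"spotify:track:{i}" for i in range(250)]
toy_results = fetch_in_batches(toy_uris, fake_fetch)
print(len(toy_results), calls)
```

With 250 URIs and a batch size of 100, the placeholder is called three times, and the results come back in the same order as the input list.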

data = [np.array(t_features).T, np.array(a_features).T]
data = np.transpose(data)
feature_pd = pd.DataFrame(data)
feature_pd.columns = ['t_features', 'a_features']

Before proceeding any further, we check whether Spotify returned any NoneType objects and drop them. When we ran the code, we got only one NoneType object for a song, so our pool was reduced by one song.

feature_pd = feature_pd.dropna()
t_features = list(feature_pd.t_features)
a_features = list(feature_pd.a_features)
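
The following toy example shows how `dropna` handles a missing lookup. The feature dicts are made up and far smaller than what the API returns.

```python
import pandas as pd

# Toy feature lists: the second track lookup returned nothing
toy_t = [{"tempo": 120.0}, None, {"tempo": 98.5}]
toy_a = [{"artist_name": "X"}, {"artist_name": "Y"}, {"artist_name": "Z"}]

toy_feature_df = pd.DataFrame({"t_features": toy_t, "a_features": toy_a})
n_missing = int(toy_feature_df["t_features"].isna().sum())

# dropna removes every row with a missing value in any column
toy_clean = toy_feature_df.dropna()
print(n_missing, len(toy_clean))
```

Dropping whole rows keeps the track and artist features aligned: when a song's track features are missing, its artist features are removed as well.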

We create a pandas dataframe containing the unique songs with their features and categorize each song into a genre, just as we did during data exploration and preparation. We also timed the genre step; for the 10,000 playlists it took just under 4.5 minutes.

songs_df = sps.create_song_df(t_features, a_features, list(range(len(t_features))))

start_time = time.time()
songs_df_unique = sps.genre_generator(songs_df)
print("--- %s seconds ---" % (time.time() - start_time))

--- 265.5861220359802 seconds ---
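
The actual rules live in `sps.genre_generator` and are not reproduced in this notebook. Purely as an illustration of the idea, the hypothetical sketch below collapses a list of fine-grained artist genre tags into one coarse label by keyword matching. The function name, the keyword list, and the matching rule are all assumptions made for this example.

```python
def coarse_genre(artist_genres, keywords=("rap", "pop", "rock", "country")):
    """Return the first keyword found in any artist genre tag, else 'other'."""
    for keyword in keywords:
        if any(keyword in tag for tag in artist_genres):
            return keyword
    return "other"


label_a = coarse_genre(["dance pop", "pop rap"])   # 'rap' is checked before 'pop'
label_b = coarse_genre(["cool jazz"])              # no keyword matches
print(label_a, label_b)
```

In a rule like this the order of the keywords decides ties, so an artist tagged with both pop and rap genres ends up under whichever keyword is listed first.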

We clean the data a bit further by keeping only the columns needed for the pool.

cols = ['song_uri', 'duration_ms', 'time_signature', 'key', 'tempo',
       'energy', 'mode', 'loudness', 'speechiness', 'danceability',
       'acousticness', 'instrumentalness', 'valence', 'liveness',
       'artist_followers', 'artist_name', 'artist_popularity', 'artist_uri','genre']
# Drop every column not listed in cols; the dataframe's column order is kept
drop_cols = [c for c in songs_df_unique.columns if c not in cols]

pool_df = songs_df_unique.drop(columns=drop_cols)
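
Selecting columns by set arithmetic has one subtlety worth illustrating: the symmetric difference (`^`) and the plain difference (`-`) agree only when every wanted column exists in the dataframe. The toy frame below is made up, and `genre` is deliberately missing from it.

```python
import pandas as pd

toy = pd.DataFrame({"song_uri": ["spotify:track:A"], "tempo": [120.0], "album": ["Z"]})
wanted = ["song_uri", "tempo", "genre"]   # 'genre' is absent from the toy frame

sym_diff = set(wanted) ^ set(toy.columns)   # names in exactly one of the two sets
to_drop = set(toy.columns) - set(wanted)    # names in the frame but not wanted

print(sorted(sym_diff), sorted(to_drop))
toy_pool = toy.drop(columns=list(to_drop))
```

Here the symmetric difference also contains `genre`, a name the frame does not have, and passing it to `drop` would raise a `KeyError`. The plain difference contains only columns that actually exist.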

We check that the resulting dataframe looks as expected.

pool_df.head()

	song_uri	duration_ms	time_signature	key	tempo	energy	mode	loudness	speechiness	danceability	acousticness	instrumentalness	valence	liveness	artist_followers	artist_uri	artist_name	artist_popularity	genre
0	spotify:track:0UaMYEvWZi0ZqiDOoHU3YI	226864	4	4	125.461	0.813	0	-7.105	0.1210	0.904	0.03110	0.006970	0.810	0.0471	909185	spotify:artist:2wIVse2owClT7go1WT98tk	Missy Elliott	76	rap
1	spotify:track:6I9VzXrHxO9rA9A5euc8Ak	198800	4	5	143.040	0.838	0	-3.914	0.1140	0.774	0.02490	0.025000	0.924	0.2420	5455441	spotify:artist:26dSoYclwsYLMAKD3tpOr4	Britney Spears	82	pop
2	spotify:track:0WqIKmW4BTrj3eJFmnCKMv	235933	4	2	99.259	0.758	0	-6.583	0.2100	0.664	0.00238	0.000000	0.701	0.0598	16678709	spotify:artist:6vWDO969PvNqNYHIOW5v0m	Beyoncé	87	pop
3	spotify:track:1AWQoqb9bSvzTjaLralEkT	267267	4	4	100.972	0.714	0	-6.055	0.1400	0.891	0.20200	0.000234	0.818	0.0521	7341126	spotify:artist:31TPClRtHm23RisEBtV3X7	Justin Timberlake	83	rap
4	spotify:track:1lzr43nnXAijIGYnCT8M8H	227600	4	0	94.759	0.606	1	-4.596	0.0713	0.853	0.05610	0.000000	0.654	0.3130	1044532	spotify:artist:5EvFsr3kj42KNv97ZEnqij	Shaggy	74	rap

Finally, we store the pool in the specified path. We drop the index because it isn’t necessary.

pool_df.to_csv(path+'/'+'big_song_pool.csv', index=False)
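
As a quick illustration of why the index is dropped, the sketch below writes a made-up one-row pool to an in-memory buffer instead of a file and reads it back.

```python
import io
import pandas as pd

toy_pool = pd.DataFrame({"song_uri": ["spotify:track:A"], "tempo": [120.0]})

buffer = io.StringIO()
toy_pool.to_csv(buffer, index=False)   # no index column is written
buffer.seek(0)

restored = pd.read_csv(buffer)
print(list(restored.columns))
```

With `index=False` the file contains only the data columns, so reading it back does not introduce an extra unnamed column.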