The goal of this section is to build the final dataset of playlists (our sample), with the independent variables (track and artist features) and the dependent variable (the genre of each playlist). Most importantly, we made sure that the sample is equally distributed across the classes, since balanced classes matter when fitting the models to the training dataset. To do so, we carried out the following steps:
- Requesting playlist IDs, track features, and artist features from Spotify's API using the Spotipy package
- Setting up a pandas dataframe at the track level
- Classifying each song into one of 5 genres (rock, pop, pop rock, rap, and other)
- Collapsing the songs to unique playlist IDs, so that each playlist is characterized by a vector of the average features of its songs
- Classifying each playlist into one of 5 genres (rock, pop, pop rock, rap, and other), according to the most frequent genre among its songs
- Setting up a final sample of playlists equally distributed across the classes (genres)
The following function takes a number of playlists and returns the chosen feature for the tracks of those playlists:
def feature_list_func(plylist_dic, feature, n_playlist, first_pid):
    """
    This function takes a number of playlists and returns the chosen feature for the tracks of those playlists.
    input:
        1 - plylist_dic: dictionary of all playlists (dataset in dictionary format: json)
        2 - feature: feature to be selected from each song in the selected playlists
        3 - n_playlist: number of playlists to be selected
        4 - first_pid: position of the first playlist to be selected
    output: list of playlist IDs (one per track) and list of observations for the chosen feature,
            for all of the tracks that belong to the selected playlists
    """
    feature_list = []
    pid_list = []
    # never read past the end of the data: use the smaller of n_playlist
    # and the number of playlists available from first_pid onwards
    length_playlist = min(n_playlist, len(plylist_dic) - first_pid)
    for i in range(length_playlist):
        playlist = plylist_dic[first_pid + i]
        playlist_pid = playlist.get('pid')
        for track in playlist.get('tracks'):
            feature_list.append(track.get(feature))
            pid_list.append(playlist_pid)
    return pid_list, feature_list
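To make the return value concrete, here is a minimal sketch on a toy input. The list `toy_playlists` and the helper name `flatten_feature` are made up for illustration; only the `pid`, `tracks` and `track_uri` keys mirror the ones used above.

```python
# Toy stand-in for the playlist data (hypothetical values, same key layout as above).
toy_playlists = [
    {"pid": 0, "tracks": [{"track_uri": "spotify:track:a"},
                          {"track_uri": "spotify:track:b"}]},
    {"pid": 1, "tracks": [{"track_uri": "spotify:track:c"}]},
]

def flatten_feature(playlists, feature, n_playlist, first_pid=0):
    """Return parallel lists: one playlist ID and one feature value per track."""
    pid_list, feature_list = [], []
    # Slicing never reads past the end of the list.
    for playlist in playlists[first_pid:first_pid + n_playlist]:
        for track in playlist["tracks"]:
            pid_list.append(playlist["pid"])
            feature_list.append(track.get(feature))
    return pid_list, feature_list

pids, uris = flatten_feature(toy_playlists, "track_uri", n_playlist=2)
print(pids)  # [0, 0, 1]
print(uris)  # ['spotify:track:a', 'spotify:track:b', 'spotify:track:c']
```

The two lists have the same length, so position `k` of `pids` gives the playlist that track `k` of `uris` belongs to.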
The following code calls the function above to get the playlist IDs and the track and artist URIs, which are used later to request the features that make up our dataframe.
pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = 10, first_pid = 0)
pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = 10, first_pid = 0)
After getting the URIs of the tracks and artists, we requested their features from the Spotify API to create a pandas dataframe at the track level. We used the Spotipy package, whose documentation can be found at: https://spotipy.readthedocs.io
import os

def create_spotipy_obj():
    """
    Uses dbarjum's client id for DS Project.
    The client id and secret are read from environment variables instead of being hard-coded.
    """
    SPOTIPY_CLIENT_ID = os.environ['SPOTIPY_CLIENT_ID']
    SPOTIPY_CLIENT_SECRET = os.environ['SPOTIPY_CLIENT_SECRET']
    SPOTIPY_REDIRECT_URI = 'http://localhost/'
    username = 'dbarjum'
    scope = 'user-library-read'
    token = util.prompt_for_user_token(username, scope, client_id=SPOTIPY_CLIENT_ID,
                                       client_secret=SPOTIPY_CLIENT_SECRET,
                                       redirect_uri=SPOTIPY_REDIRECT_URI)
    client_credentials_manager = SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID,
                                                          client_secret=SPOTIPY_CLIENT_SECRET, proxies=None)
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
    return sp
sp = create_spotipy_obj()
def get_all_features(track_list, artist_list, sp=None):
    """
    This function takes in a list of tracks and a list of artists, along
    with a spotipy object and generates two lists of features from Spotify's API.
    inputs:
        1. track_list: list of all tracks to be included in dataframe
        2. artist_list: list of all artists corresponding to tracks
        3. sp: spotipy object to communicate with Spotify API
    returns:
        1. track_features: list of all features for each track in track_list
        2. artist_features: list of all artist features for each artist in artist_list
    """
    track_features = []
    artist_features = []
    # the API is queried in batches of 50 URIs per request
    track_iters = len(track_list) // 50
    track_remainders = len(track_list) % 50
    start = 0
    end = start + 50
    for i in range(track_iters):
        track_features.extend(sp.audio_features(track_list[start:end]))
        artist_features.extend(sp.artists(artist_list[start:end]).get('artists'))
        start += 50
        end = start + 50
    # last, partial batch
    if track_remainders:
        end = start + track_remainders
        track_features.extend(sp.audio_features(track_list[start:end]))
        artist_features.extend(sp.artists(artist_list[start:end]).get('artists'))
    return track_features, artist_features
start_time = time.time()
t_features, a_features = get_all_features(track_uri, artist_uri, sp)
print("--- %s seconds ---" % (time.time() - start_time))
--- 2.8701727390289307 seconds ---
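The batching logic can be checked without touching the API. The sketch below uses a hypothetical helper `chunked` and made-up URIs; the batch size of 50 is the one used in `get_all_features`. It shows how a list is cut into consecutive slices, with the last slice holding the remainder.

```python
def chunked(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

uris = [f"spotify:track:{i}" for i in range(120)]  # made-up URIs
batch_sizes = [len(batch) for batch in chunked(uris)]
print(batch_sizes)  # [50, 50, 20]
```

Stepping through `range(0, len(items), size)` covers the full batches and the remainder in one loop, so no separate branch for the last batch is needed.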
The following function takes in the lists of track and artist features and generates a dataframe of the features. It also creates columns that hold the genres listed for the artist of each track. These columns are used later to classify each track into one of 5 genres (rock, pop, pop rock, rap, and other).
def create_song_df(track_features, artist_features, pid):
    """
    This function takes in two lists of track and artist features, respectively,
    and generates a dataframe of the features.
    inputs:
        1. track_features: list of all tracks including features
        2. artist_features: list of all artists including features
        3. pid: list with the playlist ID of each track
    returns:
        1. df: a pandas dataframe of size (N, X) where N corresponds to the number of songs
           in track_features, X is the number of features in the dataframe.
    """
    import pandas as pd
    selected_song_features = ['uri', 'duration_ms', 'time_signature', 'key',
                              'tempo', 'energy', 'mode', 'loudness', 'speechiness',
                              'danceability', 'acousticness', 'instrumentalness',
                              'valence', 'liveness']
    selected_artist_features = ['followers', 'uri', 'name', 'popularity', 'genres']
    col_names = ['song_uri', 'duration_ms', 'time_signature', 'key',
                 'tempo', 'energy', 'mode', 'loudness', 'speechiness',
                 'danceability', 'acousticness', 'instrumentalness',
                 'valence', 'liveness', 'artist_followers', 'artist_uri',
                 'artist_name', 'artist_popularity']
    data = []
    for i, j in zip(track_features, artist_features):
        temp = []
        for sf in selected_song_features:
            temp.append(i.get(sf))
        for af in selected_artist_features:
            if af == 'followers':
                temp.append(j.get('followers').get('total'))
            elif af == 'genres':
                # one entry per artist genre, so rows can have different lengths
                for g in j.get('genres'):
                    temp.append(g)
            else:
                temp.append(j.get(af))
        data.append(list(temp))
    df = pd.DataFrame(data)
    # name the extra genre columns g1, g2, ...
    for i in range(len(df.columns) - len(col_names)):
        col_names.append('g' + str(i + 1))
    df.columns = col_names
    df.insert(loc=0, column='pid', value=pid)
    return df
songs_df = create_song_df(t_features, a_features, pid_t)
songs_df.head()
| | pid | song_uri | duration_ms | time_signature | key | tempo | energy | mode | loudness | speechiness | danceability | acousticness | instrumentalness | valence | liveness | artist_followers | artist_uri | artist_name | artist_popularity | g1 | g2 | g3 | g4 | g5 | g6 | g7 | g8 | g9 | g10 | g11 | g12 | g13 | g14 | g15 | g16 | g17 | g18 | g19 | g20 | g21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI | 226864 | 4 | 4 | 125.461 | 0.813 | 0 | -7.105 | 0.1210 | 0.904 | 0.03110 | 0.006970 | 0.810 | 0.0471 | 909647 | spotify:artist:2wIVse2owClT7go1WT98tk | Missy Elliott | 76 | dance pop | hip hop | hip pop | pop | pop rap | r&b | rap | southern hip hop | urban contemporary | None | None | None | None | None | None | None | None | None | None | None | None |
| 1 | 0 | spotify:track:6I9VzXrHxO9rA9A5euc8Ak | 198800 | 4 | 5 | 143.040 | 0.838 | 0 | -3.914 | 0.1140 | 0.774 | 0.02490 | 0.025000 | 0.924 | 0.2420 | 5457673 | spotify:artist:26dSoYclwsYLMAKD3tpOr4 | Britney Spears | 82 | dance pop | pop | post-teen pop | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
| 2 | 0 | spotify:track:0WqIKmW4BTrj3eJFmnCKMv | 235933 | 4 | 2 | 99.259 | 0.758 | 0 | -6.583 | 0.2100 | 0.664 | 0.00238 | 0.000000 | 0.701 | 0.0598 | 16686181 | spotify:artist:6vWDO969PvNqNYHIOW5v0m | Beyoncé | 87 | dance pop | pop | post-teen pop | r&b | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
| 3 | 0 | spotify:track:1AWQoqb9bSvzTjaLralEkT | 267267 | 4 | 4 | 100.972 | 0.714 | 0 | -6.055 | 0.1400 | 0.891 | 0.20200 | 0.000234 | 0.818 | 0.0521 | 7343717 | spotify:artist:31TPClRtHm23RisEBtV3X7 | Justin Timberlake | 83 | dance pop | pop | pop rap | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
| 4 | 0 | spotify:track:1lzr43nnXAijIGYnCT8M8H | 227600 | 4 | 0 | 94.759 | 0.606 | 1 | -4.596 | 0.0713 | 0.853 | 0.05610 | 0.000000 | 0.654 | 0.3130 | 1044930 | spotify:artist:5EvFsr3kj42KNv97ZEnqij | Shaggy | 74 | dance pop | pop rap | reggae fusion | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
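Artists carry different numbers of genres, so the rows built in `create_song_df` have different lengths and the shorter ones end up padded with `None` in the table above. The sketch below reproduces that padding and the `g1`, `g2`, ... naming in plain Python; the three rows are shortened from the table and the variable names are illustrative only.

```python
# One row per song: a fixed column (artist name) followed by a variable number of genres.
rows = [
    ["Britney Spears", "dance pop", "pop", "post-teen pop"],
    ["Shaggy", "dance pop", "pop rap", "reggae fusion"],
    ["Beyoncé", "dance pop", "pop", "post-teen pop", "r&b"],
]
n_fixed = 1  # number of non-genre columns in this toy example

# Pad every row to the length of the longest one.
width = max(len(row) for row in rows)
padded = [row + [None] * (width - len(row)) for row in rows]

# Name the genre columns g1..gN, where N is the largest genre count seen.
columns = ["artist_name"] + [f"g{i + 1}" for i in range(width - n_fixed)]
print(columns)    # ['artist_name', 'g1', 'g2', 'g3', 'g4']
print(padded[0])  # ['Britney Spears', 'dance pop', 'pop', 'post-teen pop', None]
```

The number of `g` columns therefore depends on the data: it equals the largest genre count of any artist in the batch being processed.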
This section collapses the songs to unique playlist IDs, so that each playlist is characterized by a vector of the average features of its songs. In this section we also classify both the songs and the playlists by genre.
The following function classifies each song according to the genres listed for its artist, using a chain of “if” statements:
def genre_generator(songs_df):
    """
    This function classifies songs according to the given genres of the artist of the song, using "if" statements.
    Input: dataframe with a list of songs
    Output: dataframe with an added column holding a unique genre for each song
    """
    # list of keywords that assign a song the unique genre "rap"
    # (the artist genres are spelled "hip hop" and "r&b", so in practice only "rap" matches)
    rap = ["rap", "hiphop", "r&d"]
    # the artist genre columns run from "g1" to the last "gX" column
    columns = list(songs_df.columns)
    g1_index = columns.index("g1") if "g1" in columns else len(columns)
    genre_columns = [c for c in columns[g1_index:] if c != "genre"]
    # create new column that will hold the unique genre (class) of each song
    songs_df["genre"] = ""
    # loop to assign a genre to each song in the dataframe
    for j in range(len(songs_df)):
        # list of artist genres for the given song,
        # e.g. ['british invasion', 'merseybeat', 'psychedelic']
        genres_row = list(songs_df.iloc[[j]][genre_columns].dropna(axis=1).values.flatten())
        # classifying the genre of the song
        genre = "other"
        if any("rock" in s for s in genres_row) and any("pop" in s for s in genres_row):
            genre = "pop rock"
        elif any("rock" in s for s in genres_row):
            genre = "rock"
        elif any("pop" in s for s in genres_row):
            genre = "pop"
        # the rap check comes last, so it overrides the labels above
        for i in rap:
            if any(i in s for s in genres_row):
                genre = "rap"
        # store the classified genre in the "genre" column
        songs_df.at[j, 'genre'] = genre
    return songs_df
The code below calls the song genre generator function. The result is a dataframe in which every song carries a single genre, derived from the genres of its artist.
songs_df_new = genre_generator(songs_df)
songs_df_new.head()
| | pid | song_uri | duration_ms | time_signature | key | tempo | energy | mode | loudness | speechiness | danceability | acousticness | instrumentalness | valence | liveness | artist_followers | artist_uri | artist_name | artist_popularity | g1 | g2 | g3 | g4 | g5 | g6 | g7 | g8 | g9 | g10 | g11 | g12 | g13 | g14 | g15 | g16 | g17 | g18 | g19 | g20 | g21 | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI | 226864 | 4 | 4 | 125.461 | 0.813 | 0 | -7.105 | 0.1210 | 0.904 | 0.03110 | 0.006970 | 0.810 | 0.0471 | 909647 | spotify:artist:2wIVse2owClT7go1WT98tk | Missy Elliott | 76 | dance pop | hip hop | hip pop | pop | pop rap | r&b | rap | southern hip hop | urban contemporary | None | None | None | None | None | None | None | None | None | None | None | None | rap |
| 1 | 0 | spotify:track:6I9VzXrHxO9rA9A5euc8Ak | 198800 | 4 | 5 | 143.040 | 0.838 | 0 | -3.914 | 0.1140 | 0.774 | 0.02490 | 0.025000 | 0.924 | 0.2420 | 5457673 | spotify:artist:26dSoYclwsYLMAKD3tpOr4 | Britney Spears | 82 | dance pop | pop | post-teen pop | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | pop |
| 2 | 0 | spotify:track:0WqIKmW4BTrj3eJFmnCKMv | 235933 | 4 | 2 | 99.259 | 0.758 | 0 | -6.583 | 0.2100 | 0.664 | 0.00238 | 0.000000 | 0.701 | 0.0598 | 16686181 | spotify:artist:6vWDO969PvNqNYHIOW5v0m | Beyoncé | 87 | dance pop | pop | post-teen pop | r&b | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | pop |
| 3 | 0 | spotify:track:1AWQoqb9bSvzTjaLralEkT | 267267 | 4 | 4 | 100.972 | 0.714 | 0 | -6.055 | 0.1400 | 0.891 | 0.20200 | 0.000234 | 0.818 | 0.0521 | 7343717 | spotify:artist:31TPClRtHm23RisEBtV3X7 | Justin Timberlake | 83 | dance pop | pop | pop rap | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | rap |
| 4 | 0 | spotify:track:1lzr43nnXAijIGYnCT8M8H | 227600 | 4 | 0 | 94.759 | 0.606 | 1 | -4.596 | 0.0713 | 0.853 | 0.05610 | 0.000000 | 0.654 | 0.3130 | 1044930 | spotify:artist:5EvFsr3kj42KNv97ZEnqij | Shaggy | 74 | dance pop | pop rap | reggae fusion | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | rap |
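The classification rule can be isolated as a pure function on a list of genre strings, which makes the order of the checks easier to see. This is a sketch, not the function used above: `classify_genres` and `RAP_KEYWORDS` are illustrative names, the keyword list is copied as written in `genre_generator`, and the last input is a made-up example.

```python
RAP_KEYWORDS = ["rap", "hiphop", "r&d"]  # copied as written in genre_generator

def classify_genres(artist_genres):
    """Map a list of artist genre strings to one of five classes."""
    has_rock = any("rock" in g for g in artist_genres)
    has_pop = any("pop" in g for g in artist_genres)
    if has_rock and has_pop:
        genre = "pop rock"
    elif has_rock:
        genre = "rock"
    elif has_pop:
        genre = "pop"
    else:
        genre = "other"
    # The rap check comes last, so it overrides any earlier label.
    if any(k in g for k in RAP_KEYWORDS for g in artist_genres):
        genre = "rap"
    return genre

print(classify_genres(["dance pop", "pop", "post-teen pop"]))              # pop
print(classify_genres(["dance pop", "pop rap", "reggae fusion"]))          # rap
print(classify_genres(["british invasion", "merseybeat", "psychedelic"]))  # other
print(classify_genres(["album rock", "dance pop"]))                        # pop rock
```

Because the match is by substring, a single artist genre such as "pop rap" is enough to label the song "rap", which is consistent with the rows labelled "rap" in the table above.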
The following lines clean the dataframe by dropping the columns that are no longer needed: the artist genre columns (g1 to g21), which were only used to create the single genre column of each song, and, because the slice starts at position 18, artist_popularity as well.
temp = songs_df_new.copy()
column_names_temp = songs_df_new.columns.values[18:-1]
column_names_temp
array(['artist_popularity', 'g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g7', 'g8',
'g9', 'g10', 'g11', 'g12', 'g13', 'g14', 'g15', 'g16', 'g17', 'g18',
'g19', 'g20', 'g21'], dtype=object)
temp = temp.drop(column_names_temp,axis=1)
temp.head()
| | pid | song_uri | duration_ms | time_signature | key | tempo | energy | mode | loudness | speechiness | danceability | acousticness | instrumentalness | valence | liveness | artist_followers | artist_uri | artist_name | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI | 226864 | 4 | 4 | 125.461 | 0.813 | 0 | -7.105 | 0.1210 | 0.904 | 0.03110 | 0.006970 | 0.810 | 0.0471 | 909647 | spotify:artist:2wIVse2owClT7go1WT98tk | Missy Elliott | rap |
| 1 | 0 | spotify:track:6I9VzXrHxO9rA9A5euc8Ak | 198800 | 4 | 5 | 143.040 | 0.838 | 0 | -3.914 | 0.1140 | 0.774 | 0.02490 | 0.025000 | 0.924 | 0.2420 | 5457673 | spotify:artist:26dSoYclwsYLMAKD3tpOr4 | Britney Spears | pop |
| 2 | 0 | spotify:track:0WqIKmW4BTrj3eJFmnCKMv | 235933 | 4 | 2 | 99.259 | 0.758 | 0 | -6.583 | 0.2100 | 0.664 | 0.00238 | 0.000000 | 0.701 | 0.0598 | 16686181 | spotify:artist:6vWDO969PvNqNYHIOW5v0m | Beyoncé | pop |
| 3 | 0 | spotify:track:1AWQoqb9bSvzTjaLralEkT | 267267 | 4 | 4 | 100.972 | 0.714 | 0 | -6.055 | 0.1400 | 0.891 | 0.20200 | 0.000234 | 0.818 | 0.0521 | 7343717 | spotify:artist:31TPClRtHm23RisEBtV3X7 | Justin Timberlake | rap |
| 4 | 0 | spotify:track:1lzr43nnXAijIGYnCT8M8H | 227600 | 4 | 0 | 94.759 | 0.606 | 1 | -4.596 | 0.0713 | 0.853 | 0.05610 | 0.000000 | 0.654 | 0.3130 | 1044930 | spotify:artist:5EvFsr3kj42KNv97ZEnqij | Shaggy | rap |
feature_indexes = list(range(len(temp.columns)-1))
col_names_temp = ['duration_ms','time_signature','key','tempo','energy','loudness','speechiness','danceability','acousticness',
'instrumentalness', 'valence', 'liveness', 'artist_followers', 'artist_popularity' ]
col_names = temp.columns
The code below one-hot encodes the genre variable, so that we can calculate the proportion of songs of each genre in each playlist. This helps classify each playlist according to the most frequent genre among its songs.
songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)
songs_encoded.head()
| | pid | song_uri | duration_ms | time_signature | key | tempo | energy | mode | loudness | speechiness | danceability | acousticness | instrumentalness | valence | liveness | artist_followers | artist_uri | artist_name | genre_other | genre_pop | genre_pop rock | genre_rap | genre_rock |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI | 226864 | 4 | 4 | 125.461 | 0.813 | 0 | -7.105 | 0.1210 | 0.904 | 0.03110 | 0.006970 | 0.810 | 0.0471 | 909647 | spotify:artist:2wIVse2owClT7go1WT98tk | Missy Elliott | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | spotify:track:6I9VzXrHxO9rA9A5euc8Ak | 198800 | 4 | 5 | 143.040 | 0.838 | 0 | -3.914 | 0.1140 | 0.774 | 0.02490 | 0.025000 | 0.924 | 0.2420 | 5457673 | spotify:artist:26dSoYclwsYLMAKD3tpOr4 | Britney Spears | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | spotify:track:0WqIKmW4BTrj3eJFmnCKMv | 235933 | 4 | 2 | 99.259 | 0.758 | 0 | -6.583 | 0.2100 | 0.664 | 0.00238 | 0.000000 | 0.701 | 0.0598 | 16686181 | spotify:artist:6vWDO969PvNqNYHIOW5v0m | Beyoncé | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | spotify:track:1AWQoqb9bSvzTjaLralEkT | 267267 | 4 | 4 | 100.972 | 0.714 | 0 | -6.055 | 0.1400 | 0.891 | 0.20200 | 0.000234 | 0.818 | 0.0521 | 7343717 | spotify:artist:31TPClRtHm23RisEBtV3X7 | Justin Timberlake | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | spotify:track:1lzr43nnXAijIGYnCT8M8H | 227600 | 4 | 0 | 94.759 | 0.606 | 1 | -4.596 | 0.0713 | 0.853 | 0.05610 | 0.000000 | 0.654 | 0.3130 | 1044930 | spotify:artist:5EvFsr3kj42KNv97ZEnqij | Shaggy | 0 | 0 | 0 | 1 | 0 |
The following function takes a dataframe of songs (with playlist IDs) and collapses it at the playlist ID level, averaging each column to characterize each playlist. This creates a dataframe at the playlist level.
def collapse_pid(df):
    """
    This function takes a dataframe of songs (with playlist IDs) and collapses it at the playlist ID level, to get averages for each column.
    Input: dataframe of songs (with playlist IDs)
    Output: dataframe of playlists (collapsing songs into playlist IDs, using the average)
    """
    # Group by playlist ID
    pid_groups = df.groupby('pid')
    # Apply mean function to all numeric columns (text columns such as the URIs are left out)
    return pid_groups.mean(numeric_only=True)
playlists_collapsed = collapse_pid(songs_encoded)
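The reason for one-hot encoding before averaging is that the mean of a 0/1 column is a proportion. The sketch below shows this on a made-up five-song dataframe; `toy`, `encoded` and `collapsed` are illustrative names, and `dtype=float` is passed only so that the dummy columns are plainly numeric.

```python
import pandas as pd

# Five made-up songs in two playlists.
toy = pd.DataFrame({
    "pid": [0, 0, 0, 1, 1],
    "tempo": [120.0, 100.0, 110.0, 90.0, 70.0],
    "genre": ["pop", "rap", "rap", "rock", "rock"],
})

# One 0/1 column per genre, then one row per playlist with column means.
encoded = pd.get_dummies(toy, columns=["genre"], dtype=float)
collapsed = encoded.groupby("pid").mean()
print(collapsed)
```

For playlist 0 the `genre_rap` column averages to 2/3, that is, two of its three songs are rap, while `tempo` averages to 110.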
The following function classifies playlists according to the most frequent genre of the songs in the playlist:
def playlist_genre_generator(df, first_row):
    """
    This function classifies playlists according to the most frequent genre of the songs in the playlist.
    Input: dataframe with a list of playlists; first_row is the playlist ID (index label) of the first row
    Output: dataframe with an added column holding a unique genre for each playlist
    """
    # create new column that will hold the unique genre (class) of each playlist
    df["playlist_genre"] = ""
    # the genre proportion columns start right after "artist_followers"
    # and end just before the new "playlist_genre" column
    column_names = df.columns.values
    g1_index = list(column_names).index("artist_followers") + 1
    last_column_index = len(column_names) - 1
    for j in range(len(df)):
        # list of genre proportions for the given playlist
        genres_row = list(df.iloc[[j]][column_names[g1_index:last_column_index]].dropna(axis=1).values.flatten())
        # the playlist genre is the one with the highest proportion of songs
        max_value = max(genres_row)
        max_index = genres_row.index(max_value)
        playlist_genre = column_names[g1_index + max_index]
        # store the classified genre in the "playlist_genre" column
        df.at[j + first_row, 'playlist_genre'] = playlist_genre
    return df
The following code creates a “baseline” sample of playlists with a defined minimum size (2000 playlists). This baseline has an unequal distribution of genres among the playlists; the first rows of the resulting dataframe are shown below.
### creating base_line data frame
import warnings
warnings.filterwarnings('ignore')
n_playlist = 2000
pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = n_playlist, first_pid = 0)
pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = n_playlist, first_pid = 0)
t_features, a_features = get_all_features(track_uri, artist_uri, sp)
#create dataframe of songs
songs_df = create_song_df(t_features, a_features, pid_t)
songs_df_new = genre_generator(songs_df)
temp = songs_df_new.copy()
column_names_temp = songs_df_new.columns.values[18:-1]
temp = temp.drop(column_names_temp,axis=1)
songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)
#create dataframe of playlists
playlists_collapsed = collapse_pid(songs_encoded)
genre_classified_playlists = playlist_genre_generator (playlists_collapsed, first_row = 0)
genre_classified_playlists.head()
| pid | duration_ms | time_signature | key | tempo | energy | mode | loudness | speechiness | danceability | acousticness | instrumentalness | valence | liveness | artist_followers | genre_other | genre_pop | genre_pop rock | genre_rap | genre_rock | playlist_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221777.461538 | 4.000000 | 5.038462 | 123.006885 | 0.782173 | 0.692308 | -4.881942 | 0.107021 | 0.659288 | 0.083440 | 0.000676 | 0.642904 | 0.192127 | 4.800843e+06 | 0.000000 | 0.288462 | 0.230769 | 0.461538 | 0.019231 | genre_rap |
| 1 | 298844.128205 | 3.769231 | 4.461538 | 122.669615 | 0.691077 | 0.538462 | -8.291667 | 0.088449 | 0.496459 | 0.163100 | 0.222270 | 0.476667 | 0.178433 | 1.704673e+06 | 0.358974 | 0.000000 | 0.051282 | 0.000000 | 0.589744 | genre_rock |
| 2 | 219374.875000 | 4.000000 | 5.000000 | 114.600672 | 0.693203 | 0.515625 | -4.874156 | 0.096288 | 0.671875 | 0.269230 | 0.000638 | 0.565078 | 0.169028 | 1.691574e+06 | 0.062500 | 0.937500 | 0.000000 | 0.000000 | 0.000000 | genre_pop |
| 3 | 229575.055556 | 3.952381 | 5.103175 | 125.032413 | 0.621282 | 0.714286 | -9.614937 | 0.067186 | 0.513714 | 0.273870 | 0.202042 | 0.451623 | 0.188585 | 2.125109e+05 | 0.246032 | 0.150794 | 0.317460 | 0.071429 | 0.214286 | genre_pop rock |
| 4 | 255014.352941 | 3.941176 | 3.352941 | 127.759882 | 0.650535 | 0.823529 | -7.634471 | 0.041159 | 0.576765 | 0.177148 | 0.081875 | 0.490765 | 0.166524 | 1.167521e+06 | 0.117647 | 0.117647 | 0.705882 | 0.000000 | 0.058824 | genre_pop rock |
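Picking the most frequent genre amounts to taking, for each row, the name of the column with the largest proportion. A compact way to sketch this is with `idxmax` along the columns; this is an alternative formulation, not the function used above, and the proportions are rounded from the first two rows of the table.

```python
import pandas as pd

# Genre proportions of two playlists (rounded from the table above).
proportions = pd.DataFrame(
    {
        "genre_other": [0.00, 0.36],
        "genre_pop": [0.29, 0.00],
        "genre_pop rock": [0.23, 0.05],
        "genre_rap": [0.46, 0.00],
        "genre_rock": [0.02, 0.59],
    },
    index=pd.Index([0, 1], name="pid"),
)

# For each playlist, the column holding the largest proportion.
playlist_genre = proportions.idxmax(axis=1)
print(playlist_genre.tolist())  # ['genre_rap', 'genre_rock']
```

As with the `max` and `index` calls in `playlist_genre_generator`, a tie is resolved in favour of the first column that reaches the maximum.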
The following code is an intermediate step in adjusting the sample towards an equal distribution of genres among all playlists. It finds the most frequent genre among the playlists and counts the playlists of each genre, so that in the next step we can fill up the sample with playlists of underrepresented genres.
# number of playlists per genre
table = genre_classified_playlists['playlist_genre'].value_counts()
# most frequent genre and its number of playlists
mode_genre = table.idxmax()
number_mode_genre = table.loc[mode_genre]
number_genre_pop = table.loc["genre_pop"]
number_genre_rap = table.loc["genre_rap"]
number_genre_other = table.loc["genre_other"]
number_genre_poprock = table.loc["genre_pop rock"]
number_genre_rock = table.loc["genre_rock"]
total_number = number_genre_pop + number_genre_rap + number_genre_other + number_genre_poprock + number_genre_rock
total_number
2675
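The quantity that drives the next step is how many playlists each genre is missing relative to the most frequent one. The sketch below computes that deficit on a handful of made-up labels with `collections.Counter`; the names `labels`, `target` and `deficit` are illustrative.

```python
from collections import Counter

# Made-up playlist labels: pop is the most frequent genre.
labels = ["genre_pop"] * 4 + ["genre_rap"] * 2 + ["genre_rock"] * 1
all_genres = ["genre_other", "genre_pop", "genre_pop rock", "genre_rap", "genre_rock"]

counts = Counter(labels)
mode_genre, target = counts.most_common(1)[0]

# Playlists still needed per genre to match the most frequent one.
deficit = {g: target - counts.get(g, 0) for g in all_genres}
print(mode_genre, target)  # genre_pop 4
print(deficit)
```

Using `counts.get(g, 0)` also covers a genre that does not appear in the baseline at all, which would otherwise have no entry to look up.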
The code below takes one playlist at a time from the pool of 15,000 playlists (read from the Million Playlist json files at the beginning of this page), checks which genre it belongs to, and adds it to the baseline sample if its genre is underrepresented, until the full sample is equally distributed.
Playlists are drawn from the pool in sequence, starting right after the playlists already in the baseline sample. A playlist is discarded if it belongs to an already “well represented” genre.
### adjusting base_line data frame to get to desired distribution
start_time = time.time()
t = 0
while total_number < number_mode_genre*5:
    first_pid = n_playlist + t
    # get uri for tracks and artists of playlist selected
    pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = 1, first_pid = first_pid)
    pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = 1, first_pid = first_pid)
    t_features, a_features = get_all_features(track_uri, artist_uri, sp)
    # create dataframe of songs
    songs_df = create_song_df(t_features, a_features, pid_t)
    songs_df_new = genre_generator(songs_df)
    temp = songs_df_new.copy()
    column_names_temp = songs_df_new.columns.values[18:-1]
    temp = temp.drop(column_names_temp, axis=1)
    songs_encoded = pd.get_dummies(temp, columns = ['genre'], drop_first=False)
    # create dataframe of playlists
    playlists_collapsed = collapse_pid(songs_encoded)
    genre_classified_SinglePlaylist = playlist_genre_generator(playlists_collapsed, first_row = first_pid)
    # add the playlist only if its genre still has fewer playlists than the most frequent genre
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    single_genre = genre_classified_SinglePlaylist.playlist_genre.iloc[0]
    if single_genre == "genre_pop":
        if number_genre_pop < number_mode_genre:
            genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
            number_genre_pop += 1
    elif single_genre == "genre_rap":
        if number_genre_rap < number_mode_genre:
            genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
            number_genre_rap += 1
    elif single_genre == "genre_other":
        if number_genre_other < number_mode_genre:
            genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
            number_genre_other += 1
    elif single_genre == "genre_pop rock":
        if number_genre_poprock < number_mode_genre:
            genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
            number_genre_poprock += 1
    elif single_genre == "genre_rock":
        if number_genre_rock < number_mode_genre:
            genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
            number_genre_rock += 1
    t += 1
    total_number = number_genre_pop + number_genre_rap + number_genre_other + number_genre_poprock + number_genre_rock
    # print (total_number)
    # print (number_genre_pop)
    # print (number_genre_rap)
    # print (number_genre_other)
    # print (number_genre_poprock)
    # print (number_genre_rock)
print("--- %s seconds ---" % (time.time() - start_time))
genre_classified_playlists.head()
--- 0.0009975433349609375 seconds ---
| | pid | duration_ms | time_signature | key | tempo | energy | mode | loudness | speechiness | danceability | acousticness | instrumentalness | valence | liveness | artist_followers | genre_other | genre_pop | genre_pop rock | genre_rap | genre_rock | playlist_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 221777.461538 | 4.000000 | 5.038462 | 123.006885 | 0.782173 | 0.692308 | -4.881942 | 0.107021 | 0.659288 | 0.083440 | 0.000676 | 0.642904 | 0.192127 | 4.797984e+06 | 0.000000 | 0.288462 | 0.230769 | 0.461538 | 0.019231 | genre_rap |
| 1 | 1 | 298844.128205 | 3.769231 | 4.461538 | 122.669615 | 0.691077 | 0.538462 | -8.291667 | 0.088449 | 0.496459 | 0.163100 | 0.222270 | 0.476667 | 0.178433 | 1.702573e+06 | 0.358974 | 0.000000 | 0.051282 | 0.000000 | 0.589744 | genre_rock |
| 2 | 2 | 219374.875000 | 4.000000 | 5.000000 | 114.600672 | 0.693203 | 0.515625 | -4.874156 | 0.096288 | 0.671875 | 0.269230 | 0.000638 | 0.565078 | 0.169028 | 1.688725e+06 | 0.062500 | 0.937500 | 0.000000 | 0.000000 | 0.000000 | genre_pop |
| 3 | 3 | 229575.055556 | 3.952381 | 5.103175 | 125.032413 | 0.621282 | 0.714286 | -9.614937 | 0.067186 | 0.513714 | 0.273870 | 0.202042 | 0.451623 | 0.188585 | 2.123258e+05 | 0.246032 | 0.150794 | 0.317460 | 0.071429 | 0.214286 | genre_pop rock |
| 4 | 4 | 255014.352941 | 3.941176 | 3.352941 | 127.759882 | 0.650535 | 0.823529 | -7.634471 | 0.041159 | 0.576765 | 0.177148 | 0.081875 | 0.490765 | 0.166524 | 1.166320e+06 | 0.117647 | 0.117647 | 0.705882 | 0.000000 | 0.058824 | genre_pop rock |
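Stripped of the API calls, the balancing loop is a greedy fill: walk through the candidates in order and keep one only if its genre is still below the size of the largest class. The sketch below shows that idea on plain label lists; `fill_to_balance` and the toy labels are hypothetical and stand in for the playlists and genres above.

```python
from collections import Counter
from itertools import cycle, islice

def fill_to_balance(baseline, candidates, genres):
    """Add candidate labels in order until every genre matches the largest baseline class."""
    counts = Counter(baseline)
    target = max(counts[g] for g in genres)
    sample = list(baseline)
    for label in candidates:
        if all(counts[g] == target for g in genres):
            break  # the sample is already balanced
        if counts[label] < target:
            sample.append(label)  # underrepresented genre: keep it
            counts[label] += 1
        # otherwise the candidate is discarded
    return sample

genres = ["pop", "rap", "rock"]
baseline = ["pop", "pop", "pop", "rap", "rock"]
candidates = list(islice(cycle(["pop", "rock", "rap"]), 12))  # made-up pool
balanced = fill_to_balance(baseline, candidates, genres)
print(sorted(Counter(balanced).items()))  # [('pop', 3), ('rap', 3), ('rock', 3)]
```

Note that the result is only balanced if the pool holds enough candidates of every underrepresented genre; if the pool runs out first, the sketch returns whatever it has collected.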
Finally, we check to make sure that the final dataframe is equally distributed among all genres:
display(genre_classified_playlists['playlist_genre'].value_counts())
display(genre_classified_playlists['playlist_genre'].value_counts(normalize=True))
genre_other 535
genre_pop 535
genre_rock 535
genre_rap 535
genre_pop rock 535
Name: playlist_genre, dtype: int64
genre_other 0.2
genre_pop 0.2
genre_rock 0.2
genre_rap 0.2
genre_pop rock 0.2
Name: playlist_genre, dtype: float64
We then export the final dataframe as a csv file, which will be used as the sample data for our machine learning models. This sample will be split into training and test data, the former for training different models and assessing their performance, and the latter for evaluating how well our trained models perform on the test data.
genre_classified_playlists.to_csv ("playlist_df.csv")
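Since the sample is balanced by construction, a split that keeps the same genre proportions in both parts preserves that balance. The sketch below is one way to do such a stratified split in plain Python; the function name, the 20% test fraction and the seed are assumptions made for illustration, not choices taken from this project.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class proportions in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Made-up balanced labels: 10 playlists per genre.
labels = ["genre_pop"] * 10 + ["genre_rap"] * 10 + ["genre_rock"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 24 6
```

Each genre contributes 2 of its 10 playlists to the test part here, so both parts stay equally distributed across genres.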