Creating balanced dataset

Description (summary):

The goal of this section is to create a final dataset of playlists (our sample), with independent variables (track and artist features) and a dependent variable (the genre of the playlist). Most importantly, we made sure that the sample was evenly distributed across the classes, since class balance matters when fitting models to the training data. To do so, we carried out the following steps:

- Requesting playlist IDs, track features, and artist features from Spotify's API using the Spotipy package
- Setting up a pandas dataframe at the track level
- Classifying each song into one of 5 genres (rock, pop, pop rock, rap, and other)
- Collapsing the songs to unique playlist IDs, so that each playlist is characterized by a vector of the average features of its songs
- Classifying each playlist into one of 5 genres (rock, pop, pop rock, rap, and other), according to the most frequent genre in that playlist
- Setting up the final sample of playlists, evenly distributed across the classes (genres)

Requesting playlist IDs, tracks and artist features from Spotify’s API using Spotipy Package

The following function takes a number of playlists and returns, for every track in those playlists, the playlist ID and the value of the selected feature:

def feature_list_func(plylist_dic, feature, n_playlist, first_pid):
    """
    This function takes a number of playlists and returns the selected feature of the tracks in those playlists.

    input:
        1 - plylist_dic: collection of all playlists (dataset loaded from json), indexed by position
        2 - feature: feature to be selected from each song in the selected playlists
        3 - n_playlist: number of playlists to be selected
        4 - first_pid: position of the first playlist to be selected

    output:
        1 - pid_list: playlist ID of each track in the selected playlists
        2 - feature_list: value of the chosen feature for each of those tracks

    """
    feature_list = []
    pid_list = []
    length_playlist = min(n_playlist, len(plylist_dic) - first_pid) # never read past the end of the playlist collection, also when first_pid > 0
    for i in range(length_playlist):
        playlist = plylist_dic[first_pid + i]
        playlist_pid = playlist.get('pid')
        for j in range(len(playlist.get('tracks'))):
            feature_list.append(playlist.get('tracks')[j].get(feature))
            pid_list.append(playlist_pid)
    return pid_list, feature_list

The following code calls the function above to get the playlist IDs and the track and artist URIs, which will be used later to request the features that make up our dataframe.

pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = 10, first_pid = 0)
pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = 10, first_pid = 0)
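
To make the shape of the two returned lists concrete, here is a minimal sketch on a toy input. The list `toy_playlists` and the helper name `flatten_feature` are made up for illustration, and the sketch assumes the playlists are held in a list of dictionaries; only the `pid`, `tracks` and `track_uri` keys mirror the function above.

```python
# Toy stand-in for the playlist data: a list of playlist dictionaries.
toy_playlists = [
    {"pid": 0, "tracks": [{"track_uri": "spotify:track:a"},
                          {"track_uri": "spotify:track:b"}]},
    {"pid": 1, "tracks": [{"track_uri": "spotify:track:c"}]},
]

def flatten_feature(playlists, feature, n_playlist, first_pid=0):
    """Hypothetical compact version of the same idea: one (pid, value) pair per track."""
    pids, values = [], []
    for playlist in playlists[first_pid:first_pid + n_playlist]:
        for track in playlist["tracks"]:
            pids.append(playlist["pid"])       # repeat the playlist ID for every track
            values.append(track.get(feature))  # None if the track lacks the feature
    return pids, values

pids, uris = flatten_feature(toy_playlists, "track_uri", n_playlist=2)
print(pids)  # [0, 0, 1]
print(uris)
```

The two lists have the same length, so they can later be placed side by side as columns of a track-level dataframe.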

After getting the URIs of the tracks and artists, we requested their features from the Spotify API to create a pandas dataframe at the track level. We used the Spotipy package, which is documented at: https://spotipy.readthedocs.io

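def create_spotipy_obj():

    """
    Uses dbarjum's client id for DS Project.
    The client id and secret are read from the SPOTIPY_CLIENT_ID and
    SPOTIPY_CLIENT_SECRET environment variables instead of being hard-coded.
    """
    import os

    SPOTIPY_CLIENT_ID = os.environ['SPOTIPY_CLIENT_ID']
    SPOTIPY_CLIENT_SECRET = os.environ['SPOTIPY_CLIENT_SECRET']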
    SPOTIPY_REDIRECT_URI = 'http://localhost/'

    username = 'dbarjum'
    scope = 'user-library-read'

    token = util.prompt_for_user_token(username,scope,client_id=SPOTIPY_CLIENT_ID,
                           client_secret=SPOTIPY_CLIENT_SECRET,
                           redirect_uri=SPOTIPY_REDIRECT_URI)
    client_credentials_manager = SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID,
                                                          client_secret=SPOTIPY_CLIENT_SECRET, proxies=None)
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

    return sp

sp = create_spotipy_obj()

def get_all_features(track_list, artist_list, sp=None):

    """
    This function takes in a list of tracks and a list of artists, along
    with a spotipy object and generates two lists of features from Spotify's API.

    inputs:
        1. track_list: list of all tracks to be included in dataframe
        2. artist_list: list of all artists corresponding to tracks
        3. sp: spotipy object to communicate with Spotify API

    returns:
        1. track_features: list of all features for each track in track_list
        2. artist_features: list of all artist features for each artist in artist_list
    """

    track_features = []
    artist_features = []

    track_iters = int(len(track_list)/50)
    track_remainders = len(track_list)%50

    start = 0
    end = start+50

    for i in range(track_iters):
        track_features.extend(sp.audio_features(track_list[start:end]))
        artist_features.extend(sp.artists(artist_list[start:end]).get('artists'))
        start += 50
        end = start+50


    if track_remainders:
        end = start + track_remainders
        track_features.extend(sp.audio_features(track_list[start:end]))
        artist_features.extend(sp.artists(artist_list[start:end]).get('artists'))

    return track_features, artist_features

start_time = time.time()
t_features, a_features = get_all_features(track_uri, artist_uri, sp)
print("--- %s seconds ---" % (time.time() - start_time))
--- 2.8701727390289307 seconds ---
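
The batching used in get_all_features (50 items per request, plus one final shorter batch) can be sketched without touching the API. The helper name `chunks` and the placeholder URIs below are illustrative assumptions, not part of the original code.

```python
def chunks(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

uris = ["uri_%d" % i for i in range(120)]  # placeholder URIs, not real Spotify IDs
batch_sizes = [len(batch) for batch in chunks(uris)]
print(batch_sizes)  # [50, 50, 20]
```

Stepping through the list with `range(0, len(items), size)` covers the full batches and the remainder in a single loop, so no separate remainder branch is needed.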

Setting up a pandas dataframe at the track level

The following function takes the lists of track and artist features and generates a dataframe of the features. It also creates columns (g1, g2, ...) that hold the genres listed for the artist of each track. These columns are used later to classify each track into one of 5 genres (rock, pop, pop rock, rap, and other).

def create_song_df(track_features, artist_features, pid):

    """
    This function takes in two lists of track and artist features, respectively,
    and generates a dataframe of the features.

    inputs:
        1. track_features: list of all tracks including features
        2. artist_features: list of all artists including features
        3. pid: list with the playlist ID of each track, in the same order as track_features

    returns:
        1. df: a pandas dataframe of size (N, X) where N corresponds to the number of songs
        in track_features, X is the number of features in the dataframe.
    """

    import pandas as pd

    selected_song_features = ['uri', 'duration_ms', 'time_signature', 'key',
                              'tempo', 'energy', 'mode', 'loudness', 'speechiness',
                              'danceability', 'acousticness', 'instrumentalness',
                              'valence', 'liveness']
    selected_artist_features = ['followers', 'uri', 'name', 'popularity', 'genres']

    col_names = ['song_uri', 'duration_ms', 'time_signature', 'key',
                 'tempo', 'energy', 'mode', 'loudness', 'speechiness',
                 'danceability', 'acousticness', 'instrumentalness',
                 'valence', 'liveness', 'artist_followers', 'artist_uri',
                 'artist_name', 'artist_popularity']


    data = []

    for i, j in zip(track_features, artist_features):
        temp = []
        for sf in selected_song_features:
            temp.append(i.get(sf))
        for af in selected_artist_features:
            if af == 'followers':
                temp.append(j.get('followers').get('total'))
            elif af == 'genres':
                for g in j.get('genres'):
                    temp.append(g)
            else:
                temp.append(j.get(af))

        data.append(list(temp))

    df = pd.DataFrame(data)

    for i in range(len(df.columns)- len(col_names)):
        col_names.append('g'+str(i+1))

    df.columns = col_names

    df.insert(loc=0, column='pid', value=pid)

    return df

songs_df = create_song_df(t_features, a_features, pid_t)
songs_df.head()
pid song_uri duration_ms time_signature key tempo energy mode loudness speechiness danceability acousticness instrumentalness valence liveness artist_followers artist_uri artist_name artist_popularity g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g17 g18 g19 g20 g21
0 0 spotify:track:0UaMYEvWZi0ZqiDOoHU3YI 226864 4 4 125.461 0.813 0 -7.105 0.1210 0.904 0.03110 0.006970 0.810 0.0471 909647 spotify:artist:2wIVse2owClT7go1WT98tk Missy Elliott 76 dance pop hip hop hip pop pop pop rap r&b rap southern hip hop urban contemporary None None None None None None None None None None None None
1 0 spotify:track:6I9VzXrHxO9rA9A5euc8Ak 198800 4 5 143.040 0.838 0 -3.914 0.1140 0.774 0.02490 0.025000 0.924 0.2420 5457673 spotify:artist:26dSoYclwsYLMAKD3tpOr4 Britney Spears 82 dance pop pop post-teen pop None None None None None None None None None None None None None None None None None None
2 0 spotify:track:0WqIKmW4BTrj3eJFmnCKMv 235933 4 2 99.259 0.758 0 -6.583 0.2100 0.664 0.00238 0.000000 0.701 0.0598 16686181 spotify:artist:6vWDO969PvNqNYHIOW5v0m Beyoncé 87 dance pop pop post-teen pop r&b None None None None None None None None None None None None None None None None None
3 0 spotify:track:1AWQoqb9bSvzTjaLralEkT 267267 4 4 100.972 0.714 0 -6.055 0.1400 0.891 0.20200 0.000234 0.818 0.0521 7343717 spotify:artist:31TPClRtHm23RisEBtV3X7 Justin Timberlake 83 dance pop pop pop rap None None None None None None None None None None None None None None None None None None
4 0 spotify:track:1lzr43nnXAijIGYnCT8M8H 227600 4 0 94.759 0.606 1 -4.596 0.0713 0.853 0.05610 0.000000 0.654 0.3130 1044930 spotify:artist:5EvFsr3kj42KNv97ZEnqij Shaggy 74 dance pop pop rap reggae fusion None None None None None None None None None None None None None None None None None None
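
The trailing None values in the output come from artists having different numbers of genre tags. A minimal sketch of that padding, using made-up rows and only two fixed columns instead of the full feature list:

```python
import pandas as pd

# Made-up rows: two fixed fields followed by a variable number of genre tags.
rows = [
    ["artist A", 76, "dance pop", "hip hop", "rap"],
    ["artist B", 82, "pop"],
]
df = pd.DataFrame(rows)  # shorter rows are padded with missing values

fixed = ["artist_name", "artist_popularity"]
n_genre_cols = len(df.columns) - len(fixed)
df.columns = fixed + ["g%d" % (i + 1) for i in range(n_genre_cols)]

print(df.columns.tolist())
print(df.loc[1, ["g2", "g3"]].isna().all())
```

The number of g columns is therefore set by the artist with the most genre tags in the batch, which is why it can differ from one call to the next.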

Collapsing songs to unique playlists

In this section we collapse songs to unique playlist IDs, so that each playlist is characterized by a vector of the average features of its songs. We also classify the songs and the playlists by genre.

The following function classifies each song according to the genres listed for its artist, using a series of “if” statements:

def genre_generator(songs_df):

    """
    This function classifies songs according to the given genres of the artist of the song, using a series of "if" statements.

    Input: dataframe with a list of songs

    Output: dataframe with added column with unique genre for each song

    """
    # list of substrings that mark a song as "rap"; they are matched literally,
    # so Spotify tags written "hip hop" or "r&b" are not caught by "hiphop" or "r&d"
    rap = ["rap","hiphop", "r&d"]

    # finding the position of "g1" (first genre column) and of the last "gX" column, used later to read the genres of each song
    g1_index = 0
    last_column_index = 0

    column_names = songs_df.columns.values

    # finding first column with genres ("g1")
    for i in column_names:
        if i == "g1":
            break
        g1_index += 1

    # finding last column with genres ("gX")
    for i in column_names:
        last_column_index += 1

    # create new column that will have unique genre (class) of each song
    songs_df["genre"] = ""

    # loop to create genre for each song in dataframe     
    for j in range(len(songs_df)):

        # Creating list of genres for a given song  
        genres_row = list(songs_df.iloc[[j]][column_names[g1_index:last_column_index-1]].dropna(axis=1).values.flatten())
        # genres_row = ['british invasion', 'merseybeat', 'psychedelic']

        # classifying genre for the song

        genre = "other"

        if any("rock" in s for s in genres_row) and any("pop" in s for s in genres_row):
            genre = "pop rock"
        elif any("rock" in s for s in genres_row):
            genre = "rock"
        elif any("pop" in s for s in genres_row):
            genre = "pop"

        for i in rap:
            if any(i in s for s in genres_row):
                genre = "rap"

        # assigning the classified genre to the genre column for this song
        songs_df.at[j, 'genre'] = genre

    return songs_df
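
The decision rule itself can be isolated from the dataframe handling. The sketch below applies the same substring checks to a plain list of tags; the function name `classify` is an assumption for illustration, and the rap substrings are copied as written in genre_generator.

```python
RAP_TAGS = ["rap", "hiphop", "r&d"]  # copied as written in genre_generator

def classify(genres):
    """Return one of 'rap', 'pop rock', 'rock', 'pop', 'other' for a list of tags."""
    has_rock = any("rock" in g for g in genres)
    has_pop = any("pop" in g for g in genres)
    if any(tag in g for tag in RAP_TAGS for g in genres):
        return "rap"  # a rap match overrides every other label
    if has_rock and has_pop:
        return "pop rock"
    if has_rock:
        return "rock"
    if has_pop:
        return "pop"
    return "other"

print(classify(["dance pop", "pop", "post-teen pop"]))              # pop
print(classify(["dance pop", "pop", "pop rap"]))                    # rap
print(classify(["british invasion", "merseybeat", "psychedelic"]))  # other
```

Because the checks are substring matches, a tag such as "pop rap" counts as rap, and any tag containing "rock" or "pop" contributes to those labels.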

The code below calls the song genre generator function. The result is a dataframe in which each song has a single genre, assigned from the genres of its artist.

songs_df_new = genre_generator(songs_df)
songs_df_new.head()
pid song_uri duration_ms time_signature key tempo energy mode loudness speechiness danceability acousticness instrumentalness valence liveness artist_followers artist_uri artist_name artist_popularity g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g17 g18 g19 g20 g21 genre
0 0 spotify:track:0UaMYEvWZi0ZqiDOoHU3YI 226864 4 4 125.461 0.813 0 -7.105 0.1210 0.904 0.03110 0.006970 0.810 0.0471 909647 spotify:artist:2wIVse2owClT7go1WT98tk Missy Elliott 76 dance pop hip hop hip pop pop pop rap r&b rap southern hip hop urban contemporary None None None None None None None None None None None None rap
1 0 spotify:track:6I9VzXrHxO9rA9A5euc8Ak 198800 4 5 143.040 0.838 0 -3.914 0.1140 0.774 0.02490 0.025000 0.924 0.2420 5457673 spotify:artist:26dSoYclwsYLMAKD3tpOr4 Britney Spears 82 dance pop pop post-teen pop None None None None None None None None None None None None None None None None None None pop
2 0 spotify:track:0WqIKmW4BTrj3eJFmnCKMv 235933 4 2 99.259 0.758 0 -6.583 0.2100 0.664 0.00238 0.000000 0.701 0.0598 16686181 spotify:artist:6vWDO969PvNqNYHIOW5v0m Beyoncé 87 dance pop pop post-teen pop r&b None None None None None None None None None None None None None None None None None pop
3 0 spotify:track:1AWQoqb9bSvzTjaLralEkT 267267 4 4 100.972 0.714 0 -6.055 0.1400 0.891 0.20200 0.000234 0.818 0.0521 7343717 spotify:artist:31TPClRtHm23RisEBtV3X7 Justin Timberlake 83 dance pop pop pop rap None None None None None None None None None None None None None None None None None None rap
4 0 spotify:track:1lzr43nnXAijIGYnCT8M8H 227600 4 0 94.759 0.606 1 -4.596 0.0713 0.853 0.05610 0.000000 0.654 0.3130 1044930 spotify:artist:5EvFsr3kj42KNv97ZEnqij Shaggy 74 dance pop pop rap reggae fusion None None None None None None None None None None None None None None None None None None rap

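The following lines clean the dataframe by dropping columns that are no longer needed: the artist genre columns (g1 to g21), which were only used to create the single genre column that the algorithm uses later. Because the slice starts at index 18, artist_popularity is dropped as well, as the output below shows.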

temp = songs_df_new.copy()
column_names_temp = songs_df_new.columns.values[18:-1]
column_names_temp
array(['artist_popularity', 'g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g7', 'g8',
       'g9', 'g10', 'g11', 'g12', 'g13', 'g14', 'g15', 'g16', 'g17', 'g18',
       'g19', 'g20', 'g21'], dtype=object)
temp = temp.drop(column_names_temp,axis=1)
temp.head()
pid song_uri duration_ms time_signature key tempo energy mode loudness speechiness danceability acousticness instrumentalness valence liveness artist_followers artist_uri artist_name genre
0 0 spotify:track:0UaMYEvWZi0ZqiDOoHU3YI 226864 4 4 125.461 0.813 0 -7.105 0.1210 0.904 0.03110 0.006970 0.810 0.0471 909647 spotify:artist:2wIVse2owClT7go1WT98tk Missy Elliott rap
1 0 spotify:track:6I9VzXrHxO9rA9A5euc8Ak 198800 4 5 143.040 0.838 0 -3.914 0.1140 0.774 0.02490 0.025000 0.924 0.2420 5457673 spotify:artist:26dSoYclwsYLMAKD3tpOr4 Britney Spears pop
2 0 spotify:track:0WqIKmW4BTrj3eJFmnCKMv 235933 4 2 99.259 0.758 0 -6.583 0.2100 0.664 0.00238 0.000000 0.701 0.0598 16686181 spotify:artist:6vWDO969PvNqNYHIOW5v0m Beyoncé pop
3 0 spotify:track:1AWQoqb9bSvzTjaLralEkT 267267 4 4 100.972 0.714 0 -6.055 0.1400 0.891 0.20200 0.000234 0.818 0.0521 7343717 spotify:artist:31TPClRtHm23RisEBtV3X7 Justin Timberlake rap
4 0 spotify:track:1lzr43nnXAijIGYnCT8M8H 227600 4 0 94.759 0.606 1 -4.596 0.0713 0.853 0.05610 0.000000 0.654 0.3130 1044930 spotify:artist:5EvFsr3kj42KNv97ZEnqij Shaggy rap
feature_indexes = list(range(len(temp.columns)-1))
col_names_temp = ['duration_ms','time_signature','key','tempo','energy','loudness','speechiness','danceability','acousticness',
         'instrumentalness', 'valence', 'liveness', 'artist_followers', 'artist_popularity'  ]

col_names = temp.columns

The code below one-hot encodes the genre variable, so that we can calculate the proportion of songs of each genre in each playlist. This lets us classify each playlist by the most frequent genre among its songs.

songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)
songs_encoded.head()
pid song_uri duration_ms time_signature key tempo energy mode loudness speechiness danceability acousticness instrumentalness valence liveness artist_followers artist_uri artist_name genre_other genre_pop genre_pop rock genre_rap genre_rock
0 0 spotify:track:0UaMYEvWZi0ZqiDOoHU3YI 226864 4 4 125.461 0.813 0 -7.105 0.1210 0.904 0.03110 0.006970 0.810 0.0471 909647 spotify:artist:2wIVse2owClT7go1WT98tk Missy Elliott 0 0 0 1 0
1 0 spotify:track:6I9VzXrHxO9rA9A5euc8Ak 198800 4 5 143.040 0.838 0 -3.914 0.1140 0.774 0.02490 0.025000 0.924 0.2420 5457673 spotify:artist:26dSoYclwsYLMAKD3tpOr4 Britney Spears 0 1 0 0 0
2 0 spotify:track:0WqIKmW4BTrj3eJFmnCKMv 235933 4 2 99.259 0.758 0 -6.583 0.2100 0.664 0.00238 0.000000 0.701 0.0598 16686181 spotify:artist:6vWDO969PvNqNYHIOW5v0m Beyoncé 0 1 0 0 0
3 0 spotify:track:1AWQoqb9bSvzTjaLralEkT 267267 4 4 100.972 0.714 0 -6.055 0.1400 0.891 0.20200 0.000234 0.818 0.0521 7343717 spotify:artist:31TPClRtHm23RisEBtV3X7 Justin Timberlake 0 0 0 1 0
4 0 spotify:track:1lzr43nnXAijIGYnCT8M8H 227600 4 0 94.759 0.606 1 -4.596 0.0713 0.853 0.05610 0.000000 0.654 0.3130 1044930 spotify:artist:5EvFsr3kj42KNv97ZEnqij Shaggy 0 0 0 1 0
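
The reason one-hot encoding helps is that the mean of a 0/1 column within a playlist equals the share of songs with that genre. A minimal sketch on made-up data (the values and the reduced set of columns are assumptions for illustration):

```python
import pandas as pd

# Made-up track-level data: playlist 0 has three songs, playlist 1 has two.
songs = pd.DataFrame({
    "pid":    [0, 0, 0, 1, 1],
    "energy": [0.8, 0.6, 0.7, 0.2, 0.4],
    "genre":  ["rap", "rap", "pop", "rock", "rock"],
})

# dtype=float keeps the indicator columns numeric across pandas versions
encoded = pd.get_dummies(songs, columns=["genre"], dtype=float)
per_playlist = encoded.groupby("pid").mean()

print(per_playlist)
```

In this toy case, playlist 0 ends up with a rap share of two thirds and a pop share of one third, while playlist 1 is entirely rock.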

The following function takes a dataframe of songs (with playlist IDs) and collapses it at the playlist ID level, taking the average of each column (which characterizes each playlist). This creates a dataframe at the playlist level.

def collapse_pid(df):

    """
    This function takes a dataframe of songs (with playlist IDs) and collapses the dataframe at the playlist ID level, to get averages for each column.

    Input: dataframe of songs (with playlist IDs)

    Output: dataframe of playlists (collapsing songs into playlist IDs, using average)

    """

    # Group by playlist ID
    pid_groups = df.groupby('pid')

    # Apply the mean to all numeric columns (the URI and name columns are text and are left out)
    return pid_groups.mean(numeric_only=True)

playlists_collapsed = collapse_pid(songs_encoded)

Classifying each playlist into one of 5 genres (rock, pop, pop rock, rap, and other)

The following function classifies playlists according to the most frequent genre of the songs in the playlist:

def playlist_genre_generator (df, first_row):

    """
    This function classifies playlists according to the most frequent genre of the songs in the playlist

    Input: dataframe with a list of playlists

    Output: dataframe with added column with unique genre for each playlist

    """

    # create new column that will have unique genre (class) of each playlist
    df ["playlist_genre"] = ""

    for j in range(len(df)):

        # finding the position of the first genre share column (the one right after "artist_followers") and of the last one, used later to find the most frequent genre of the playlist
        g1_index = 0
        last_column_index = 0

        column_names = df.columns.values

        # finding the column right after "artist_followers" (first genre share column)
        for i in column_names:
            if i == "artist_followers":
                break
            g1_index += 1
        g1_index += 1

        # finding the last genre share column (the one right before "playlist_genre")
        for i in column_names:
            last_column_index += 1
        last_column_index -= 1

        # Creating list of genre shares for a given playlist
        genres_row = list(df.iloc[[j]][column_names[g1_index:last_column_index]].dropna(axis=1).values.flatten())

        # classifying genre for the playlist
        max_value = max(genres_row)
        max_index = genres_row.index(max_value)
        playlist_genre = column_names[g1_index + max_index]

        # assigning the classified genre to the playlist_genre column for this playlist
        df.at[j + first_row, 'playlist_genre'] = playlist_genre
    return df
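
Picking the most frequent genre amounts to taking, for each row, the name of the genre share column with the largest value. A minimal vectorized sketch of that idea, on made-up shares and with an assumed `genre_` column prefix as in the encoded dataframe:

```python
import pandas as pd

# Made-up playlist-level genre shares, indexed by playlist ID.
playlists = pd.DataFrame(
    {"genre_other": [0.0, 0.5], "genre_pop": [0.3, 0.1], "genre_rap": [0.7, 0.4]},
    index=pd.Index([0, 1], name="pid"),
)

genre_cols = [c for c in playlists.columns if c.startswith("genre_")]
playlists["playlist_genre"] = playlists[genre_cols].idxmax(axis=1)  # column name of the row maximum

print(playlists["playlist_genre"].tolist())  # ['genre_rap', 'genre_other']
```

Selecting the columns by prefix avoids relying on their position relative to artist_followers. On ties, `idxmax` returns the first column with the maximum, which matches the behavior of `list.index(max(...))` in the function above.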

Setting up the final sample of playlists, evenly distributed across the classes (genres)

The following code creates a “baseline” sample with a defined starting size (2000 playlists). Its genres are not yet evenly distributed among the playlists; the first rows of the resulting dataframe are shown below.

### creating base_line data frame

import warnings
warnings.filterwarnings('ignore')

n_playlist = 2000
pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = n_playlist, first_pid = 0)
pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = n_playlist, first_pid = 0)

t_features, a_features = get_all_features(track_uri, artist_uri, sp)

#create dataframe of songs
songs_df = create_song_df(t_features, a_features, pid_t)
songs_df_new = genre_generator(songs_df)
temp = songs_df_new.copy()
column_names_temp = songs_df_new.columns.values[18:-1]
temp = temp.drop(column_names_temp,axis=1)
songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)

#create dataframe of playlists
playlists_collapsed = collapse_pid(songs_encoded)
genre_classified_playlists = playlist_genre_generator (playlists_collapsed, first_row = 0)
genre_classified_playlists.head()
duration_ms time_signature key tempo energy mode loudness speechiness danceability acousticness instrumentalness valence liveness artist_followers genre_other genre_pop genre_pop rock genre_rap genre_rock playlist_genre
pid
0 221777.461538 4.000000 5.038462 123.006885 0.782173 0.692308 -4.881942 0.107021 0.659288 0.083440 0.000676 0.642904 0.192127 4.800843e+06 0.000000 0.288462 0.230769 0.461538 0.019231 genre_rap
1 298844.128205 3.769231 4.461538 122.669615 0.691077 0.538462 -8.291667 0.088449 0.496459 0.163100 0.222270 0.476667 0.178433 1.704673e+06 0.358974 0.000000 0.051282 0.000000 0.589744 genre_rock
2 219374.875000 4.000000 5.000000 114.600672 0.693203 0.515625 -4.874156 0.096288 0.671875 0.269230 0.000638 0.565078 0.169028 1.691574e+06 0.062500 0.937500 0.000000 0.000000 0.000000 genre_pop
3 229575.055556 3.952381 5.103175 125.032413 0.621282 0.714286 -9.614937 0.067186 0.513714 0.273870 0.202042 0.451623 0.188585 2.125109e+05 0.246032 0.150794 0.317460 0.071429 0.214286 genre_pop rock
4 255014.352941 3.941176 3.352941 127.759882 0.650535 0.823529 -7.634471 0.041159 0.576765 0.177148 0.081875 0.490765 0.166524 1.167521e+06 0.117647 0.117647 0.705882 0.000000 0.058824 genre_pop rock

The following code is an intermediate step in adjusting the sample towards an equal distribution of genres. It finds the most frequent genre among the playlists and counts the playlists of each genre, so that in the next step we can fill up the sample with playlists of the underrepresented genres.

# count the playlists of each genre
table = genre_classified_playlists['playlist_genre'].value_counts()
mode_genre = genre_classified_playlists['playlist_genre'].value_counts().idxmax()
number_mode_genre = table.loc[mode_genre]

number_genre_pop = table.loc["genre_pop"]
number_genre_rap = table.loc["genre_rap"]
number_genre_other = table.loc["genre_other"]
number_genre_poprock = table.loc["genre_pop rock"]
number_genre_rock = table.loc["genre_rock"]

mode_genre = genre_classified_playlists['playlist_genre'].value_counts().idxmax()
mode_genre
total_number = number_genre_pop + number_genre_rap + number_genre_other + number_genre_poprock + number_genre_rock
total_number
2675
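
The quantities computed above can also be read as a per-genre deficit relative to the most frequent genre. A minimal sketch on made-up labels (three genres instead of five, with arbitrary counts):

```python
import pandas as pd

# Made-up playlist labels: 5 pop, 3 rap, 1 rock.
labels = pd.Series(["genre_pop"] * 5 + ["genre_rap"] * 3 + ["genre_rock"] * 1)

counts = labels.value_counts()
target = counts.max()          # size of the most frequent class
deficit = target - counts      # playlists still needed per genre

print(deficit.to_dict())
print(target * len(counts) - counts.sum())  # total number of playlists still to add
```

The loop in the next step stops once the total count reaches `target * number_of_genres`, which is the same as every deficit reaching zero.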

The code below takes one playlist at a time from the pool of 15,000 playlists (read from the Million Playlist json files at the beginning of this page), checks which genre it belongs to, and adds it to the baseline sample if its genre is underrepresented, until the full sample is evenly distributed.

Playlists are drawn from the pool in sequence, starting right after those already in the baseline sample. A playlist that belongs to an already “well represented” genre is discarded.

### adjusting base_line data frame to get to desired distribution

start_time = time.time()

t = 0

while total_number < number_mode_genre*5:

    first_pid = n_playlist + t

    # get uri for tracks and artists of playlist selected
    pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = 1, first_pid = first_pid)
    pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = 1, first_pid = first_pid)
    t_features, a_features = get_all_features(track_uri, artist_uri, sp)

    #create dataframe of songs
    songs_df = create_song_df(t_features, a_features, pid_t)
    songs_df_new = genre_generator(songs_df)
    temp = songs_df_new.copy()
    column_names_temp = songs_df_new.columns.values[18:-1]
    temp = temp.drop(column_names_temp,axis=1)
    songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)

    #create dataframe of playlists
    playlists_collapsed = collapse_pid(songs_encoded)
    genre_classified_SinglePlaylist = playlist_genre_generator (playlists_collapsed, first_row = first_pid)

    # checking if playlist selected belongs to one of the genres that is not the most frequent in baseline dataframe
    if total_number != 5*number_mode_genre:

        if genre_classified_SinglePlaylist.playlist_genre.iloc[0] == "genre_pop":
            if number_genre_pop < number_mode_genre:
                genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
                number_genre_pop += 1
        elif genre_classified_SinglePlaylist.playlist_genre.iloc[0] == "genre_rap":
            if number_genre_rap < number_mode_genre:
                genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
                number_genre_rap += 1
        elif genre_classified_SinglePlaylist.playlist_genre.iloc[0] == "genre_other":
            if number_genre_other < number_mode_genre:
                genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
                number_genre_other += 1
        elif genre_classified_SinglePlaylist.playlist_genre.iloc[0] == "genre_pop rock":
            if number_genre_poprock < number_mode_genre:
                genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
                number_genre_poprock += 1
        elif genre_classified_SinglePlaylist.playlist_genre.iloc[0] == "genre_rock":
            if number_genre_rock < number_mode_genre:
                genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
                number_genre_rock += 1

    t += 1

    total_number = number_genre_pop + number_genre_rap + number_genre_other + number_genre_poprock + number_genre_rock

    # print (total_number)
    # print (number_genre_pop)
    # print (number_genre_rap)
    # print (number_genre_other)
    # print (number_genre_poprock)
    # print (number_genre_rock)

print("--- %s seconds ---" % (time.time() - start_time))

genre_classified_playlists.head()
--- 0.0009975433349609375 seconds ---
pid duration_ms time_signature key tempo energy mode loudness speechiness danceability acousticness instrumentalness valence liveness artist_followers genre_other genre_pop genre_pop rock genre_rap genre_rock playlist_genre
0 0 221777.461538 4.000000 5.038462 123.006885 0.782173 0.692308 -4.881942 0.107021 0.659288 0.083440 0.000676 0.642904 0.192127 4.797984e+06 0.000000 0.288462 0.230769 0.461538 0.019231 genre_rap
1 1 298844.128205 3.769231 4.461538 122.669615 0.691077 0.538462 -8.291667 0.088449 0.496459 0.163100 0.222270 0.476667 0.178433 1.702573e+06 0.358974 0.000000 0.051282 0.000000 0.589744 genre_rock
2 2 219374.875000 4.000000 5.000000 114.600672 0.693203 0.515625 -4.874156 0.096288 0.671875 0.269230 0.000638 0.565078 0.169028 1.688725e+06 0.062500 0.937500 0.000000 0.000000 0.000000 genre_pop
3 3 229575.055556 3.952381 5.103175 125.032413 0.621282 0.714286 -9.614937 0.067186 0.513714 0.273870 0.202042 0.451623 0.188585 2.123258e+05 0.246032 0.150794 0.317460 0.071429 0.214286 genre_pop rock
4 4 255014.352941 3.941176 3.352941 127.759882 0.650535 0.823529 -7.634471 0.041159 0.576765 0.177148 0.081875 0.490765 0.166524 1.166320e+06 0.117647 0.117647 0.705882 0.000000 0.058824 genre_pop rock
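
Stripped of the API calls and dataframe handling, the balancing loop is an accept-or-discard rule over a stream of genre labels. The sketch below shows that rule in plain Python; the function name `fill_to_balance`, the three-genre counts and the incoming labels are all made up for illustration.

```python
from collections import Counter

def fill_to_balance(counts, stream):
    """Accept labels from `stream` until every class reaches the largest initial count."""
    counts = Counter(counts)
    target = max(counts.values())
    accepted = []
    for label in stream:
        if all(n >= target for n in counts.values()):
            break                      # every class is full
        if counts[label] < target:     # only underrepresented classes are accepted
            counts[label] += 1
            accepted.append(label)
    return counts, accepted

start = {"pop": 3, "rap": 1, "rock": 2}  # made-up baseline counts
incoming = ["pop", "rap", "rock", "rap", "pop", "rap", "rock"]
final, accepted = fill_to_balance(start, incoming)
print(dict(final))
print(accepted)
```

One difference from the loop above: this sketch simply stops when the stream runs out, so the result may remain unbalanced if the pool does not contain enough playlists of a rare genre.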

Finally, we check to make sure that the final dataframe is equally distributed among all genres:

display(genre_classified_playlists['playlist_genre'].value_counts())
display(genre_classified_playlists['playlist_genre'].value_counts(normalize=True))
genre_other       535
genre_pop         535
genre_rock        535
genre_rap         535
genre_pop rock    535
Name: playlist_genre, dtype: int64



genre_other       0.2
genre_pop         0.2
genre_rock        0.2
genre_rap         0.2
genre_pop rock    0.2
Name: playlist_genre, dtype: float64

We then export the final dataframe as a csv file, which will be used as the sample data for our machine learning models. This sample will be split into training and test data, the former for training different models and assessing their performance, and the latter for evaluating how well our trained models perform on the test data.

genre_classified_playlists.to_csv ("playlist_df.csv")
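
One way to split such a sample while keeping the genres balanced in both parts is to sample the same fraction within each genre. This is only a sketch under stated assumptions: the 80/20 ratio, the random seed and the toy dataframe are illustrative choices, not the split used in the project.

```python
import pandas as pd

# Made-up balanced sample: 10 playlists per genre.
sample = pd.DataFrame({
    "playlist_genre": ["genre_pop"] * 10 + ["genre_rap"] * 10,
    "energy": [i / 20 for i in range(20)],
})

# Draw 80% of the rows within each genre; the remaining rows form the test set.
train = sample.groupby("playlist_genre").sample(frac=0.8, random_state=0)
test = sample.drop(train.index)

print(train["playlist_genre"].value_counts().to_dict())
print(test["playlist_genre"].value_counts().to_dict())
```

Sampling within each genre keeps the classes balanced in both the training and the test data, which a purely random split would only do approximately.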