Creating balanced dataset

Description (summary):

The goal of this section is to build the final dataset of playlists (our sample), with the independent variables (track and artist features) and the dependent variable (the genre of the playlist). Most importantly, we made sure that the sample contains the same number of playlists in each class, since a balanced sample matters when fitting the models to the training dataset. Doing so required the following steps:

- Requesting playlist IDs, track features and artist features from Spotify's API using the Spotipy package
- Setting up a pandas dataframe at the track level
- Classifying each song into one of 5 genres (rock, pop, pop rock, rap, and other)
- Collapsing the songs to unique playlist IDs, so that each playlist is characterized by a vector with the average features of its songs
- Classifying each playlist into one of 5 genres (rock, pop, pop rock, rap, and other), according to the most frequent genre among its songs
- Setting up the final sample of playlists, equally distributed across the classes (genres)

Requesting playlist IDs, tracks and artist features from Spotify’s API using Spotipy Package

The following function takes a number of playlists and returns the chosen feature for every track in those playlists, together with the ID of the playlist each track belongs to:

def feature_list_func(plylist_dic, feature, n_playlist, first_pid):
    """
    This function takes a number of playlists and returns one feature of the tracks of those selected playlists.

    input:
        1 - plylist_dic: collection of all playlists, indexed by position (dataset loaded from json)
        2 - feature: feature to be selected from each song in the selected playlists
        3 - n_playlist: number of playlists to be selected
        4 - first_pid: position in plylist_dic of the first playlist to be selected

    output:
        1 - pid_list: playlist ID of each track that belongs to the selected playlists
        2 - feature_list: value of the chosen feature for each of those tracks

    """
    feature_list = []
    pid_list = []
    # never read past the end of the data: use the smaller of n_playlist and the playlists left after first_pid
    length_playlist = max(0, min(n_playlist, len(plylist_dic) - first_pid))
    for i in range(length_playlist):
        playlist = plylist_dic[first_pid + i]
        playlist_pid = playlist.get('pid')
        for track in playlist.get('tracks'):
            feature_list.append(track.get(feature))
            pid_list.append(playlist_pid)
    return pid_list, feature_list

The following code calls the function above to get the playlist IDs and the track and artist URIs, which are used later to request the features that make up our dataframe.

pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = 10, first_pid = 0)
pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = 10, first_pid = 0)
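To make the shape of these outputs concrete, the sketch below runs the same kind of flattening on two made-up playlists. The helper name `flatten_feature` and the toy URIs are illustrative assumptions, not part of the project code; only the `pid`, `tracks`, `track_uri` and `artist_uri` keys follow the data format used above.

```python
# Toy stand-in for the playlist data: a list of playlist dictionaries.
toy_playlists = [
    {"pid": 0, "tracks": [
        {"track_uri": "spotify:track:a1", "artist_uri": "spotify:artist:x"},
        {"track_uri": "spotify:track:a2", "artist_uri": "spotify:artist:y"},
    ]},
    {"pid": 1, "tracks": [
        {"track_uri": "spotify:track:b1", "artist_uri": "spotify:artist:x"},
    ]},
]


def flatten_feature(playlists, feature, n_playlist, first_pid=0):
    """Return two parallel lists: the pid and the chosen feature of every track."""
    pid_list, feature_list = [], []
    # never read past the end of the list of playlists
    last = min(first_pid + n_playlist, len(playlists))
    for playlist in playlists[first_pid:last]:
        for track in playlist["tracks"]:
            pid_list.append(playlist["pid"])
            feature_list.append(track[feature])
    return pid_list, feature_list


pids, track_uris = flatten_feature(toy_playlists, "track_uri", n_playlist=10)
print(pids)        # [0, 0, 1]
print(track_uris)  # ['spotify:track:a1', 'spotify:track:a2', 'spotify:track:b1']
```

Both lists have one entry per track, so position k in the list of IDs tells which playlist the k-th URI came from.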

After getting the URIs of the tracks and artists, we requested their features from the Spotify API in order to build a pandas dataframe at the track level. We used the Spotipy package, whose documentation can be found at: https://spotipy.readthedocs.io

import os

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

def create_spotipy_obj():

    """
    Creates a spotipy object using the client credentials of the DS Project.

    The client id and secret are read from the SPOTIPY_CLIENT_ID and
    SPOTIPY_CLIENT_SECRET environment variables, so that they are not
    written in the code.
    """

    # no user token is requested: only public track and artist data are used below
    client_credentials_manager = SpotifyClientCredentials(client_id=os.environ['SPOTIPY_CLIENT_ID'],
                                                          client_secret=os.environ['SPOTIPY_CLIENT_SECRET'])
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

    return sp

sp = create_spotipy_obj()

def get_all_features(track_list, artist_list, sp=None):

    """
    This function takes in a list of tracks and a list of artists, along
    with a spotipy object and generates two lists of features from Spotify's API.

    inputs:
        1. track_list: list of all tracks to be included in dataframe
        2. artist_list: list of all artists corresponding to tracks
        3. sp: spotipy object to communicate with Spotify API

    returns:
        1. track_features: list of all features for each track in track_list
        2. artist_features: list of all artist features for each artist in artist_list
    """

    track_features = []
    artist_features = []

    # request the features in batches of 50 URIs per call; the last batch may be shorter
    batch_size = 50

    for start in range(0, len(track_list), batch_size):
        end = start + batch_size
        track_features.extend(sp.audio_features(track_list[start:end]))
        artist_features.extend(sp.artists(artist_list[start:end]).get('artists'))

    return track_features, artist_features

start_time = time.time()
t_features, a_features = get_all_features(track_uri, artist_uri, sp)
print("--- %s seconds ---" % (time.time() - start_time))

--- 2.8701727390289307 seconds ---
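The batching used in get_all_features can be seen in isolation in the sketch below, which splits a list of URIs into groups of at most 50 and makes one call per group. The names `batched` and `fake_lookup` are hypothetical; `fake_lookup` only stands in for a request to the API and returns made-up records.

```python
def batched(items, batch_size=50):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_lookup(uris):
    """Hypothetical stand-in for an API call: one result per requested URI."""
    return [{"uri": uri} for uri in uris]


uris = ["spotify:track:%d" % i for i in range(120)]

results = []
batch_sizes = []
for batch in batched(uris):
    batch_sizes.append(len(batch))
    results.extend(fake_lookup(batch))

print(batch_sizes)   # [50, 50, 20]
print(len(results))  # 120
```

With 120 URIs this makes three calls instead of 120, and the results keep the order of the input list.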

Setting up a pandas dataframe at the track level

The following function takes in the lists of track and artist features and generates a dataframe of the features. It also creates one column per genre tag (g1, g2, ...) provided for the artist of each track. These columns are used later to classify each track into one of 5 genres (rock, pop, pop rock, rap, and other).

def create_song_df(track_features, artist_features, pid):

    """
    This function takes in two lists of track and artist features, respectively,
    and generates a dataframe of the features.

    inputs:
        1. track_features: list of all tracks including features
        2. artist_features: list of all artists including features
        3. pid: list with the playlist ID of each track, in the same order as track_features

    returns:
        1. df: a pandas dataframe of size (N, X) where N corresponds to the number of songs
        in track_features, X is the number of features in the dataframe.
    """

    import pandas as pd

    selected_song_features = ['uri', 'duration_ms', 'time_signature', 'key',
                              'tempo', 'energy', 'mode', 'loudness', 'speechiness',
                              'danceability', 'acousticness', 'instrumentalness',
                              'valence', 'liveness']
    selected_artist_features = ['followers', 'uri', 'name', 'popularity', 'genres']

    col_names = ['song_uri', 'duration_ms', 'time_signature', 'key',
                 'tempo', 'energy', 'mode', 'loudness', 'speechiness',
                 'danceability', 'acousticness', 'instrumentalness',
                 'valence', 'liveness', 'artist_followers', 'artist_uri',
                 'artist_name', 'artist_popularity']


    data = []

    for i, j in zip(track_features, artist_features):
        temp = []
        for sf in selected_song_features:
            temp.append(i.get(sf))
        for af in selected_artist_features:
            if af == 'followers':
                temp.append(j.get('followers').get('total'))
            elif af == 'genres':
                for g in j.get('genres'):
                    temp.append(g)
            else:
                temp.append(j.get(af))

        data.append(list(temp))

    df = pd.DataFrame(data)

    for i in range(len(df.columns)- len(col_names)):
        col_names.append('g'+str(i+1))

    df.columns = col_names

    df.insert(loc=0, column='pid', value=pid)

    return df

songs_df = create_song_df(t_features, a_features, pid_t)
songs_df.head()

	song_uri	duration_ms	time_signature	key	tempo	energy	mode	loudness	speechiness	danceability	acousticness	instrumentalness	valence	liveness	artist_followers	artist_uri	artist_name	artist_popularity	g1	g2	g3	g4	g5	g6	g7	g8	g9	g10	g11	g12	g13	g14	g15	g16	g17	g18	g19	g20	g21
0	spotify:track:0UaMYEvWZi0ZqiDOoHU3YI	226864	4	4	125.461	0.813	0	-7.105	0.1210	0.904	0.03110	0.006970	0.810	0.0471	909647	spotify:artist:2wIVse2owClT7go1WT98tk	Missy Elliott	76	dance pop	hip hop	hip pop	pop	pop rap	r&b	rap	southern hip hop	urban contemporary	None	None	None	None	None	None	None	None	None	None	None	None
1	spotify:track:6I9VzXrHxO9rA9A5euc8Ak	198800	4	5	143.040	0.838	0	-3.914	0.1140	0.774	0.02490	0.025000	0.924	0.2420	5457673	spotify:artist:26dSoYclwsYLMAKD3tpOr4	Britney Spears	82	dance pop	pop	post-teen pop	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None
2	spotify:track:0WqIKmW4BTrj3eJFmnCKMv	235933	4	2	99.259	0.758	0	-6.583	0.2100	0.664	0.00238	0.000000	0.701	0.0598	16686181	spotify:artist:6vWDO969PvNqNYHIOW5v0m	Beyoncé	87	dance pop	pop	post-teen pop	r&b	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None
3	spotify:track:1AWQoqb9bSvzTjaLralEkT	267267	4	4	100.972	0.714	0	-6.055	0.1400	0.891	0.20200	0.000234	0.818	0.0521	7343717	spotify:artist:31TPClRtHm23RisEBtV3X7	Justin Timberlake	83	dance pop	pop	pop rap	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None
4	spotify:track:1lzr43nnXAijIGYnCT8M8H	227600	4	0	94.759	0.606	1	-4.596	0.0713	0.853	0.05610	0.000000	0.654	0.3130	1044930	spotify:artist:5EvFsr3kj42KNv97ZEnqij	Shaggy	74	dance pop	pop rap	reggae fusion	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None
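The many None entries in the g columns come from artists having different numbers of genre tags. A minimal sketch of that padding, using three of the tag lists from the table above (the variable names are illustrative):

```python
import pandas as pd

# Genre tags of three artists from the table above.
artist_genres = [
    ["dance pop", "pop", "post-teen pop"],
    ["dance pop", "pop", "post-teen pop", "r&b"],
    ["dance pop", "pop rap", "reggae fusion"],
]

# Rows of different length are padded with missing values up to the longest row.
genres_df = pd.DataFrame(artist_genres)
genres_df.columns = ["g" + str(i + 1) for i in range(genres_df.shape[1])]

print(list(genres_df.columns))          # ['g1', 'g2', 'g3', 'g4']
print(genres_df["g4"].isna().tolist())  # [True, False, True]
```

The number of g columns therefore depends on the artist with the most tags in the data at hand, which is why the full table goes up to g21.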

Collapsing songs to unique playlists

This section collapses the songs to unique playlist IDs, so that each playlist is characterized by a vector with the average features of its songs. In this section we also classify the songs and the playlists by genre.

The following function classifies each song according to the genre tags of the song's artist, using a chain of "if" statements:

def genre_generator(songs_df):

    """
    This function classifies each song according to the genre tags of the song's artist, using a chain of "if" statements.

    Input: dataframe with a list of songs (genre tags in columns "g1", "g2", ...)

    Output: dataframe with an added column holding a single genre for each song

    """
    # substrings that mark a song as "rap"
    # (the tags in the data are spelled "hip hop" and "r&b", so "hiphop" and "r&d" probably never match)
    rap = ["rap","hiphop", "r&d"]

    # the genre tag columns run from "g1" to the last column of the dataframe
    column_names = list(songs_df.columns)
    g1_index = column_names.index("g1") if "g1" in column_names else len(column_names)
    genre_columns = column_names[g1_index:]

    # create new column that will hold the unique genre (class) of each song
    songs_df["genre"] = ""

    # loop to assign a genre to each song in the dataframe
    for j in range(len(songs_df)):

        # list of genre tags for the given song, without the empty slots
        # e.g. genres_row = ['british invasion', 'merseybeat', 'psychedelic']
        genres_row = list(songs_df.iloc[j][genre_columns].dropna())

        # classifying the genre of the song
        genre = "other"

        if any("rock" in s for s in genres_row) and any("pop" in s for s in genres_row):
            genre = "pop rock"
        elif any("rock" in s for s in genres_row):
            genre = "rock"
        elif any("pop" in s for s in genres_row):
            genre = "pop"

        # the rap check comes last, so it overrides the labels above
        for i in rap:
            if any(i in s for s in genres_row):
                genre = "rap"

        # storing the classified genre in the "genre" column (.at replaces the deprecated set_value)
        songs_df.at[j, 'genre'] = genre

    return songs_df

The code below calls the song genre generator function. The result is a dataframe in which each song has a single genre, assigned according to the genre tags of the song's artist.

songs_df_new = genre_generator(songs_df)
songs_df_new.head()

	song_uri	duration_ms	time_signature	key	tempo	energy	mode	loudness	speechiness	danceability	acousticness	instrumentalness	valence	liveness	artist_followers	artist_uri	artist_name	artist_popularity	g1	g2	g3	g4	g5	g6	g7	g8	g9	g10	g11	g12	g13	g14	g15	g16	g17	g18	g19	g20	g21	genre
0	spotify:track:0UaMYEvWZi0ZqiDOoHU3YI	226864	4	4	125.461	0.813	0	-7.105	0.1210	0.904	0.03110	0.006970	0.810	0.0471	909647	spotify:artist:2wIVse2owClT7go1WT98tk	Missy Elliott	76	dance pop	hip hop	hip pop	pop	pop rap	r&b	rap	southern hip hop	urban contemporary	None	None	None	None	None	None	None	None	None	None	None	None	rap
1	spotify:track:6I9VzXrHxO9rA9A5euc8Ak	198800	4	5	143.040	0.838	0	-3.914	0.1140	0.774	0.02490	0.025000	0.924	0.2420	5457673	spotify:artist:26dSoYclwsYLMAKD3tpOr4	Britney Spears	82	dance pop	pop	post-teen pop	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	pop
2	spotify:track:0WqIKmW4BTrj3eJFmnCKMv	235933	4	2	99.259	0.758	0	-6.583	0.2100	0.664	0.00238	0.000000	0.701	0.0598	16686181	spotify:artist:6vWDO969PvNqNYHIOW5v0m	Beyoncé	87	dance pop	pop	post-teen pop	r&b	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	pop
3	spotify:track:1AWQoqb9bSvzTjaLralEkT	267267	4	4	100.972	0.714	0	-6.055	0.1400	0.891	0.20200	0.000234	0.818	0.0521	7343717	spotify:artist:31TPClRtHm23RisEBtV3X7	Justin Timberlake	83	dance pop	pop	pop rap	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	rap
4	spotify:track:1lzr43nnXAijIGYnCT8M8H	227600	4	0	94.759	0.606	1	-4.596	0.0713	0.853	0.05610	0.000000	0.654	0.3130	1044930	spotify:artist:5EvFsr3kj42KNv97ZEnqij	Shaggy	74	dance pop	pop rap	reggae fusion	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	rap
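The classification rule can be checked on single tag lists. The sketch below restates it as a standalone function (the name `classify_genres` and the `rap_markers` parameter are illustrative) and applies it to tag lists from the table above, plus a few made-up ones.

```python
def classify_genres(genre_tags, rap_markers=("rap", "hiphop", "r&d")):
    """Map a list of artist genre tags to one of five classes."""
    has_rock = any("rock" in tag for tag in genre_tags)
    has_pop = any("pop" in tag for tag in genre_tags)

    if has_rock and has_pop:
        genre = "pop rock"
    elif has_rock:
        genre = "rock"
    elif has_pop:
        genre = "pop"
    else:
        genre = "other"

    # the rap check runs last, so it overrides any earlier label
    if any(marker in tag for marker in rap_markers for tag in genre_tags):
        genre = "rap"
    return genre


print(classify_genres(["dance pop", "pop", "post-teen pop"]))         # pop
print(classify_genres(["dance pop", "pop", "post-teen pop", "r&b"]))  # pop
print(classify_genres(["dance pop", "pop", "pop rap"]))               # rap
print(classify_genres(["classic rock", "soft rock"]))                 # rock
print(classify_genres(["pop rock"]))                                  # pop rock
print(classify_genres([]))                                            # other
```

Because the rule matches substrings, any tag that contains "rap" counts as rap, including tags such as "trap", and a single rap tag is enough to override a pop or rock label.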

The following lines clean the dataframe by dropping the columns that are no longer needed: the genre tag columns (g1 to g21), which were only used to create the single genre column that is used later in the algorithm. As the output below shows, the slice also drops artist_popularity.

temp = songs_df_new.copy()

column_names_temp = songs_df_new.columns.values[18:-1]
column_names_temp

array(['artist_popularity', 'g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g7', 'g8',
       'g9', 'g10', 'g11', 'g12', 'g13', 'g14', 'g15', 'g16', 'g17', 'g18',
       'g19', 'g20', 'g21'], dtype=object)

temp = temp.drop(column_names_temp,axis=1)
temp.head()

	song_uri	duration_ms	time_signature	key	tempo	energy	mode	loudness	speechiness	danceability	acousticness	instrumentalness	valence	liveness	artist_followers	artist_uri	artist_name	genre
0	spotify:track:0UaMYEvWZi0ZqiDOoHU3YI	226864	4	4	125.461	0.813	0	-7.105	0.1210	0.904	0.03110	0.006970	0.810	0.0471	909647	spotify:artist:2wIVse2owClT7go1WT98tk	Missy Elliott	rap
1	spotify:track:6I9VzXrHxO9rA9A5euc8Ak	198800	4	5	143.040	0.838	0	-3.914	0.1140	0.774	0.02490	0.025000	0.924	0.2420	5457673	spotify:artist:26dSoYclwsYLMAKD3tpOr4	Britney Spears	pop
2	spotify:track:0WqIKmW4BTrj3eJFmnCKMv	235933	4	2	99.259	0.758	0	-6.583	0.2100	0.664	0.00238	0.000000	0.701	0.0598	16686181	spotify:artist:6vWDO969PvNqNYHIOW5v0m	Beyoncé	pop
3	spotify:track:1AWQoqb9bSvzTjaLralEkT	267267	4	4	100.972	0.714	0	-6.055	0.1400	0.891	0.20200	0.000234	0.818	0.0521	7343717	spotify:artist:31TPClRtHm23RisEBtV3X7	Justin Timberlake	rap
4	spotify:track:1lzr43nnXAijIGYnCT8M8H	227600	4	0	94.759	0.606	1	-4.596	0.0713	0.853	0.05610	0.000000	0.654	0.3130	1044930	spotify:artist:5EvFsr3kj42KNv97ZEnqij	Shaggy	rap

feature_indexes = list(range(len(temp.columns)-1))

col_names_temp = ['duration_ms','time_signature','key','tempo','energy','loudness','speechiness','danceability','acousticness',
         'instrumentalness', 'valence', 'liveness', 'artist_followers', 'artist_popularity'  ]

col_names = temp.columns

The code below one-hot encodes the variable genre, so that we can calculate the proportion of songs of each genre in each playlist. This helps classify the genre of a playlist according to the most frequent genre among the songs that belong to it.

songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)
songs_encoded.head()

	song_uri	duration_ms	time_signature	key	tempo	energy	mode	loudness	speechiness	danceability	acousticness	instrumentalness	valence	liveness	artist_followers	artist_uri	artist_name	genre_pop	genre_rap
0	spotify:track:0UaMYEvWZi0ZqiDOoHU3YI	226864	4	4	125.461	0.813	0	-7.105	0.1210	0.904	0.03110	0.006970	0.810	0.0471	909647	spotify:artist:2wIVse2owClT7go1WT98tk	Missy Elliott	0	1
1	spotify:track:6I9VzXrHxO9rA9A5euc8Ak	198800	4	5	143.040	0.838	0	-3.914	0.1140	0.774	0.02490	0.025000	0.924	0.2420	5457673	spotify:artist:26dSoYclwsYLMAKD3tpOr4	Britney Spears	1	0
2	spotify:track:0WqIKmW4BTrj3eJFmnCKMv	235933	4	2	99.259	0.758	0	-6.583	0.2100	0.664	0.00238	0.000000	0.701	0.0598	16686181	spotify:artist:6vWDO969PvNqNYHIOW5v0m	Beyoncé	1	0
3	spotify:track:1AWQoqb9bSvzTjaLralEkT	267267	4	4	100.972	0.714	0	-6.055	0.1400	0.891	0.20200	0.000234	0.818	0.0521	7343717	spotify:artist:31TPClRtHm23RisEBtV3X7	Justin Timberlake	0	1
4	spotify:track:1lzr43nnXAijIGYnCT8M8H	227600	4	0	94.759	0.606	1	-4.596	0.0713	0.853	0.05610	0.000000	0.654	0.3130	1044930	spotify:artist:5EvFsr3kj42KNv97ZEnqij	Shaggy	0	1

The following function takes a dataframe of songs (with playlist IDs) and collapses it at the playlist ID level, averaging each column. These averages characterize each playlist, and the result is a dataframe at the playlist level.

def collapse_pid(df):

    """
    This function takes a data frame of songs (with playlists IDs) and collapses the dataframe at the playlist ID level, to get averages for each column.

    Input: data frame of songs (with playlists IDs)

    Output: data frame of playlists (collapsing songs into playlist IDs, using average)

    """

    # Group songs by playlist ID
    pid_groups = df.groupby('pid')

    # Apply mean function to the numeric columns (URIs and names are left out)
    return pid_groups.mean(numeric_only=True)

playlists_collapsed = collapse_pid(songs_encoded)
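The reason the one-hot encoding and the averaging work together is that the mean of a 0/1 indicator is a proportion. A small sketch with made-up songs (the data and variable names are assumptions for illustration):

```python
import pandas as pd

songs = pd.DataFrame({
    "pid":   [0, 0, 0, 0, 1, 1],
    "tempo": [120.0, 100.0, 90.0, 110.0, 80.0, 100.0],
    "genre": ["rap", "rap", "pop", "rock", "pop", "pop"],
})

# one indicator column per genre; dtype=float keeps the indicators numeric
encoded = pd.get_dummies(songs, columns=["genre"], dtype=float)

# averaging the indicators within a playlist gives the share of each genre
playlists = encoded.groupby("pid").mean()

print(playlists.loc[0, "genre_rap"])  # 0.5
print(playlists.loc[1, "genre_pop"])  # 1.0
print(playlists.loc[0, "tempo"])      # 105.0
```

Playlist 0 has two rap songs out of four, so its rap share is 0.5, while the ordinary features such as tempo are simply averaged.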

Classifying each playlist to one of 5 genres (rock, pop, pop rock, rap, and other)

The following function classifies playlists according to the most frequent genre of the songs in the playlist:

def playlist_genre_generator (df, first_row):

    """
    This function classifies playlists according to the most frequent genre of the songs in the playlist

    Input:
        1 - df: dataframe of playlists indexed by playlist ID, with the share of each genre in the columns after "artist_followers"
        2 - first_row: playlist ID of the first row (kept for existing calls; rows are addressed through the index of df)

    Output: dataframe with added column with unique genre for each playlist

    """

    # the genre share columns are the ones that follow "artist_followers"
    column_names = list(df.columns)
    genre_columns = column_names[column_names.index("artist_followers") + 1:]

    # create new column that will hold the unique genre (class) of each playlist
    df["playlist_genre"] = ""

    for j in range(len(df)):

        # share of each genre among the songs of the given playlist
        genre_shares = list(df.iloc[j][genre_columns])

        # the playlist genre is the one with the largest share (the first one in case of a tie)
        max_index = genre_shares.index(max(genre_shares))
        playlist_genre = genre_columns[max_index]

        # storing the classified genre (.at replaces the deprecated set_value)
        df.at[df.index[j], 'playlist_genre'] = playlist_genre
    return df
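The most-frequent-genre rule can also be written without an explicit loop. The sketch below is an alternative formulation, not the code used for the results on this page; it applies pandas' `idxmax` to genre shares rounded from the output table further down.

```python
import pandas as pd

# Genre shares of three playlists, rounded from the output table below.
shares = pd.DataFrame(
    {
        "genre_other": [0.00, 0.36, 0.06],
        "genre_pop": [0.29, 0.00, 0.94],
        "genre_pop rock": [0.23, 0.05, 0.00],
        "genre_rap": [0.46, 0.00, 0.00],
        "genre_rock": [0.02, 0.59, 0.00],
    },
    index=pd.Index([0, 1, 2], name="pid"),
)

# for each row, idxmax returns the name of the column with the largest value
genre_columns = [c for c in shares.columns if c.startswith("genre_")]
shares["playlist_genre"] = shares[genre_columns].idxmax(axis=1)

print(shares["playlist_genre"].tolist())  # ['genre_rap', 'genre_rock', 'genre_pop']
```

Selecting the columns by their "genre_" prefix avoids relying on column positions, and the labels do not depend on the playlist IDs being consecutive.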

Setting up final sample of playlists equally distributed across the classes (genres)

The following code creates a baseline sample of a fixed size (2,000 playlists), in which the genres are unequally distributed among the playlists. The first rows of the baseline sample are shown in the output table below.

### creating base_line data frame

import warnings
warnings.filterwarnings('ignore')

n_playlist = 2000
pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = n_playlist, first_pid = 0)
pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = n_playlist, first_pid = 0)

t_features, a_features = get_all_features(track_uri, artist_uri, sp)

#create dataframe of songs
songs_df = create_song_df(t_features, a_features, pid_t)
songs_df_new = genre_generator(songs_df)
temp = songs_df_new.copy()
column_names_temp = songs_df_new.columns.values[18:-1]
temp = temp.drop(column_names_temp,axis=1)
songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)

#create dataframe of playlists
playlists_collapsed = collapse_pid(songs_encoded)
genre_classified_playlists = playlist_genre_generator (playlists_collapsed, first_row = 0)
genre_classified_playlists.head()

	duration_ms	time_signature	key	tempo	energy	mode	loudness	speechiness	danceability	acousticness	instrumentalness	valence	liveness	artist_followers	genre_other	genre_pop	genre_pop rock	genre_rap	genre_rock	playlist_genre
pid
0	221777.461538	4.000000	5.038462	123.006885	0.782173	0.692308	-4.881942	0.107021	0.659288	0.083440	0.000676	0.642904	0.192127	4.800843e+06	0.000000	0.288462	0.230769	0.461538	0.019231	genre_rap
1	298844.128205	3.769231	4.461538	122.669615	0.691077	0.538462	-8.291667	0.088449	0.496459	0.163100	0.222270	0.476667	0.178433	1.704673e+06	0.358974	0.000000	0.051282	0.000000	0.589744	genre_rock
2	219374.875000	4.000000	5.000000	114.600672	0.693203	0.515625	-4.874156	0.096288	0.671875	0.269230	0.000638	0.565078	0.169028	1.691574e+06	0.062500	0.937500	0.000000	0.000000	0.000000	genre_pop
3	229575.055556	3.952381	5.103175	125.032413	0.621282	0.714286	-9.614937	0.067186	0.513714	0.273870	0.202042	0.451623	0.188585	2.125109e+05	0.246032	0.150794	0.317460	0.071429	0.214286	genre_pop rock
4	255014.352941	3.941176	3.352941	127.759882	0.650535	0.823529	-7.634471	0.041159	0.576765	0.177148	0.081875	0.490765	0.166524	1.167521e+06	0.117647	0.117647	0.705882	0.000000	0.058824	genre_pop rock

The following code is an intermediate step in adjusting the sample towards an equal distribution of genres among all playlists. It finds the most frequent genre among the playlists and counts the playlists of each genre, so that in the next step we can fill up the sample with playlists of the underrepresented genres.

# number of playlists of each genre in the baseline sample
table = genre_classified_playlists['playlist_genre'].value_counts()

# most frequent genre and its number of playlists (the target for every genre)
mode_genre = table.idxmax()
number_mode_genre = table.loc[mode_genre]

number_genre_pop = table.loc["genre_pop"]
number_genre_rap = table.loc["genre_rap"]
number_genre_other = table.loc["genre_other"]
number_genre_poprock = table.loc["genre_pop rock"]
number_genre_rock = table.loc["genre_rock"]

total_number = number_genre_pop + number_genre_rap + number_genre_other + number_genre_poprock + number_genre_rock
total_number
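The same counts can be turned into a per-genre shortfall, which shows how many playlists the next step has to add. This is a sketch on made-up labels; `labels`, `all_genres` and `deficit` are illustrative names.

```python
import pandas as pd

# Made-up playlist labels; "genre_pop rock" does not occur at all.
labels = pd.Series(
    ["genre_pop"] * 5 + ["genre_rap"] * 3 + ["genre_rock"] * 2 + ["genre_other"],
    name="playlist_genre",
)
all_genres = ["genre_pop", "genre_rap", "genre_other", "genre_pop rock", "genre_rock"]

# reindex adds a count of 0 for genres that are absent from the labels
counts = labels.value_counts().reindex(all_genres, fill_value=0)
target = counts.max()

# playlists still missing from each genre to reach the most frequent one
deficit = {genre: int(target - n) for genre, n in counts.items()}

print(deficit)
# {'genre_pop': 0, 'genre_rap': 2, 'genre_other': 4, 'genre_pop rock': 5, 'genre_rock': 3}
print(int(target) * len(all_genres))  # 25
```

Reindexing on the full list of genres is a design choice of this sketch: it avoids a KeyError when one of the five genres does not appear in the counts at all.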

The code below takes one playlist at a time from the pool of 15,000 playlists (read from the Million Playlist json files at the beginning of this page), checks which genre it belongs to, and adds it to the baseline sample if that genre is underrepresented, until the full sample is equally distributed.

The playlists are taken in sequence, starting right after the ones already in the baseline sample. A playlist that belongs to an already well represented genre is discarded.

### adjusting base_line data frame to get to desired distribution

start_time = time.time()

# running number of playlists of each genre in the sample
genre_counts = {"genre_pop": number_genre_pop,
                "genre_rap": number_genre_rap,
                "genre_other": number_genre_other,
                "genre_pop rock": number_genre_poprock,
                "genre_rock": number_genre_rock}

t = 0

# stop when every genre has as many playlists as the most frequent one, or when the pool is exhausted
while total_number < number_mode_genre*5 and n_playlist + t < len(plylist):

    first_pid = n_playlist + t

    # get uri for tracks and artists of playlist selected
    pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = 1, first_pid = first_pid)
    pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = 1, first_pid = first_pid)
    t_features, a_features = get_all_features(track_uri, artist_uri, sp)

    #create dataframe of songs
    songs_df = create_song_df(t_features, a_features, pid_t)
    songs_df_new = genre_generator(songs_df)
    temp = songs_df_new.copy()
    column_names_temp = songs_df_new.columns.values[18:-1]
    temp = temp.drop(column_names_temp,axis=1)

    songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)

    #create dataframe of playlists
    playlists_collapsed = collapse_pid(songs_encoded)
    genre_classified_SinglePlaylist = playlist_genre_generator (playlists_collapsed, first_row = first_pid)

    # add the playlist only if its genre is still below the count of the most frequent genre
    selected_genre = genre_classified_SinglePlaylist.playlist_genre.iloc[0]
    if genre_counts[selected_genre] < number_mode_genre:
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
        genre_counts[selected_genre] += 1

    t += 1

    total_number = sum(genre_counts.values())

print("--- %s seconds ---" % (time.time() - start_time))

genre_classified_playlists.head()

--- 0.0009975433349609375 seconds ---

	pid	duration_ms	time_signature	key	tempo	energy	mode	loudness	speechiness	danceability	acousticness	instrumentalness	valence	liveness	artist_followers	genre_other	genre_pop	genre_pop rock	genre_rap	genre_rock	playlist_genre
0	0	221777.461538	4.000000	5.038462	123.006885	0.782173	0.692308	-4.881942	0.107021	0.659288	0.083440	0.000676	0.642904	0.192127	4.797984e+06	0.000000	0.288462	0.230769	0.461538	0.019231	genre_rap
1	1	298844.128205	3.769231	4.461538	122.669615	0.691077	0.538462	-8.291667	0.088449	0.496459	0.163100	0.222270	0.476667	0.178433	1.702573e+06	0.358974	0.000000	0.051282	0.000000	0.589744	genre_rock
2	2	219374.875000	4.000000	5.000000	114.600672	0.693203	0.515625	-4.874156	0.096288	0.671875	0.269230	0.000638	0.565078	0.169028	1.688725e+06	0.062500	0.937500	0.000000	0.000000	0.000000	genre_pop
3	3	229575.055556	3.952381	5.103175	125.032413	0.621282	0.714286	-9.614937	0.067186	0.513714	0.273870	0.202042	0.451623	0.188585	2.123258e+05	0.246032	0.150794	0.317460	0.071429	0.214286	genre_pop rock
4	4	255014.352941	3.941176	3.352941	127.759882	0.650535	0.823529	-7.634471	0.041159	0.576765	0.177148	0.081875	0.490765	0.166524	1.166320e+06	0.117647	0.117647	0.705882	0.000000	0.058824	genre_pop rock
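The balancing logic is independent of Spotify and can be tried on plain labels. The sketch below, with the hypothetical helper `top_up` and made-up labels, walks through a stream of candidates in order and keeps only those whose class is still below the size of the largest baseline class.

```python
from collections import Counter


def top_up(baseline_labels, candidate_labels):
    """Add candidates in order until every class matches the largest baseline class."""
    counts = Counter(baseline_labels)
    target = max(counts.values())
    classes = sorted(counts)
    kept = []
    for position, label in enumerate(candidate_labels):
        # stop as soon as all classes have reached the target
        if all(counts[c] == target for c in classes):
            break
        # keep a candidate only while its class is still below the target
        if label in counts and counts[label] < target:
            counts[label] += 1
            kept.append(position)
    return counts, kept


baseline = ["pop", "pop", "pop", "rap", "rock"]
stream = ["pop", "rap", "rock", "rap", "pop", "rock", "rap"]

counts, kept = top_up(baseline, stream)
print(dict(counts))  # {'pop': 3, 'rap': 3, 'rock': 3}
print(kept)          # [1, 2, 3, 5]
```

Candidates of the already largest class are skipped, and the last candidate is never examined because the sample is balanced before it is reached. One limitation of this sketch is that a class absent from the baseline is never added.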

Finally, we check to make sure that the final dataframe is equally distributed among all genres:

display(genre_classified_playlists['playlist_genre'].value_counts())
display(genre_classified_playlists['playlist_genre'].value_counts(normalize=True))

genre_other       535
genre_pop         535
genre_rock        535
genre_rap         535
genre_pop rock    535
Name: playlist_genre, dtype: int64



genre_other       0.2
genre_pop         0.2
genre_rock        0.2
genre_rap         0.2
genre_pop rock    0.2
Name: playlist_genre, dtype: float64

We then export the final dataframe as a csv file, which is used as the sample data for our machine learning models. This sample will be split into training and test data: the former for training different models and assessing their performance, and the latter for evaluating how well the trained models perform on data they have not seen.

genre_classified_playlists.to_csv ("playlist_df.csv")
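One way to split a balanced sample so that both parts stay balanced is to draw the same fraction from every genre. The sketch below uses made-up data; the 20% test fraction, the random seed and the variable names are assumptions, and the split used for the models may differ.

```python
import pandas as pd

# Made-up balanced sample: 5 playlists for each of 4 genres.
sample = pd.DataFrame({
    "pid": range(20),
    "playlist_genre": ["genre_pop", "genre_rap", "genre_rock", "genre_other"] * 5,
})

# draw 20% of each genre for the test set, so both parts stay balanced
test = sample.groupby("playlist_genre").sample(frac=0.2, random_state=0)
train = sample.drop(test.index)

print(len(train), len(test))  # 16 4
print(test["playlist_genre"].value_counts().tolist())  # [1, 1, 1, 1]
```

Sampling within each genre keeps the 1:1 class proportions of the full sample in both the training and the test data.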