import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Read in csv file to create tabular dataframe
piero_top_songs = pd.read_csv("/Users/piero/Downloads/Spotify_Project/Piero_Top_Songs_2023.csv")
nirvit_top_songs = pd.read_csv("/Users/piero/Downloads/Spotify_Project/Nirvit_Top_Songs_2023.csv")
# Add a truth column to classify whether a song is from Piero's or Nirvit's playlist
piero_top_songs['Playlist Owner'] = 'Piero'
nirvit_top_songs['Playlist Owner'] = 'Nirvit'
# Convert 'MM:SS' duration strings to total seconds (prepend '00:' so to_timedelta parses 'HH:MM:SS')
piero_top_songs['Time Seconds'] = pd.to_timedelta('00:' + piero_top_songs['Time']).dt.total_seconds().astype(int)
nirvit_top_songs['Time Seconds'] = pd.to_timedelta('00:' + nirvit_top_songs['Time']).dt.total_seconds().astype(int)
# Remove unnecessary columns
piero_top_songs = piero_top_songs.drop(columns=['Song Preview', 'Spotify Track Img', 'Album Label', 'Spotify Track Id', 'Added At', '#', 'Album', 'Album Date', 'Time'])
nirvit_top_songs = nirvit_top_songs.drop(columns=['Song Preview', 'Spotify Track Img', 'Album Label', 'Spotify Track Id', 'Added At', '#', 'Album', 'Album Date', 'Time'])
# Join playlists into one dataframe
all_songs = pd.concat([piero_top_songs, nirvit_top_songs])
# Convert all object columns to type string
object_columns = all_songs.select_dtypes(include=['object']).columns # First, create list of object columns to convert
all_songs[object_columns] = all_songs[object_columns].astype('string')
#print(all_songs.dtypes) # Check that column types have been converted to string
Introduction
In this project, my friend Nirvit and I shared our 2023 Spotify Wrapped playlists so we could visually compare our music tastes and then build models to predict whose playlist a song belongs to. Finally, I compiled the results of each model into an interactive dashboard using Panel.
This blog post will have the following sections:
Setup and Preprocessing
Exploratory Data Analysis for Feature Selection
Prepping Data For Machine Learning Models
Creating Machine Learning Models
Panel Dashboard
Final Thoughts
Now, let’s dive into the exciting world of music data analysis!
Understanding our Spotify Dataset
Track Metadata
| column | description |
|---|---|
| Song | Song title |
| Artist | Song artist |
| Genre | Song genre category |
Audio Numerical Quantitative Data
| column | description |
|---|---|
| Loud | How loud a song is (dB) |
| Time Seconds | Duration of the song in seconds |
| BPM | Average song tempo / how fast a song is |
Audio Qualitative Data
| column | description |
|---|---|
| Energy | How energetic the song is |
| Dance | How easy the song is to dance to |
| Happy | How positive the mood of the song is |
| Acoustic | How acoustic sounding the song is |
| Speech | How much of a song is spoken word |
| Popularity | How popular a song is (at time of data collection) |
| Live | How likely the song is a live recording (higher value = live recording) |
| Instrumental | How instrumental a song is (more music, fewer vocals) |
Audio Categorical Data
| column | description |
|---|---|
| Key | The most repeated key in the song |
| Time Signature | Numerical representation of rhythmic structure in song |
| Camelot | Musical key of a song for harmonic mixing |
| Playlist Owner | Whose playlist the song belongs to |
Setup and Preprocessing
# Check for null values
all_songs.isnull().sum().sum() # 9 NaN values in 'Genres' and 'Parent Genres' columns
# Create dataframe of songs containing NaN values in either 'Genres' or 'Parent Genres'
nan_rows = all_songs[(all_songs['Genres'].isnull()) | (all_songs['Parent Genres'].isnull())]
# Fill NaNs in 'Genres' column with 'Unknown' since I cannot find them on Spotify or Google
all_songs[['Genres']] = all_songs[['Genres']].fillna('Unknown')
# Populate NaN values in 'Parent Genres' column with genres found on Spotify or Google for the specified song and artist
all_songs.loc[(all_songs['Song'] == 'Mumbo Sugar') & (all_songs['Artist'] == 'Arc De Soleil'), ['Parent Genres']] = ['R&B, Soul']
all_songs.loc[(all_songs['Song'] == 'Give It Back') & (all_songs['Artist'] == 'Gaelle'), ['Parent Genres']] = ['Dance, Electronic']
all_songs.loc[(all_songs['Song'] == '愛してる') & (all_songs['Artist'] == "callin'"), ['Parent Genres']] = ['Anime, J-Pop']
all_songs.loc[(all_songs['Song'] == 'You Are Mine') & (all_songs['Artist'] == 'Jay Robinson'), ['Parent Genres']] = ['Classic Soul']
all_songs.loc[(all_songs['Song'] == 'Thank You DubNation! (the page will never be long enough)') & (all_songs['Artist'] == 'herlovebeheadsdaisies'), ['Parent Genres']] = ['Screamo']
# Convert categorical variables to factors - allow us to use non-numeric data in statistical modeling
object_columns = all_songs.select_dtypes(include=['object']).columns # First, create list of object columns to convert
all_songs[object_columns] = all_songs[object_columns].astype('category')
# Making sure there are no null values left in the dataset
nan_rows = all_songs[(all_songs['Genres'].isnull()) | (all_songs['Parent Genres'].isnull())]
nan_rows
| Song | Artist | Popularity | BPM | Genres | Parent Genres | Dance | Energy | Acoustic | Instrumental | Happy | Speech | Live | Loud | Key | Time Signature | Camelot | Playlist Owner | Time Seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
# Splitting 'Parent Genres' column since there are so many different genres
first_instance = all_songs['Parent Genres'].str.split(',').str[0] # extract first genre element
# Assign first instance to new 'Genre' column
all_songs['Genre'] = first_instance
unique_genres = all_songs['Genre'].unique()
num_unique_genres = len(unique_genres)
print("Number of unique genres:", num_unique_genres)
# Counting unique genres
all_songs['Genre'].value_counts() # 17 (now) vs 56 (before)
# Remove unnecessary columns
all_songs = all_songs.drop(columns=['Parent Genres', 'Genres'])
Number of unique genres: 17
print(all_songs.dtypes) # Check categorical column types have been converted to string
Song                string
Artist string
Popularity int64
BPM int64
Dance int64
Energy int64
Acoustic int64
Instrumental int64
Happy int64
Speech int64
Live int64
Loud int64
Key string
Time Signature int64
Camelot string
Playlist Owner string
Time Seconds int64
Genre object
dtype: object
Final Dataset
# Save dataset as csv file
#all_songs.to_csv('all_spotify_songs.csv')
# Final dataset
all_songs
| | Song | Artist | Popularity | BPM | Dance | Energy | Acoustic | Instrumental | Happy | Speech | Live | Loud | Key | Time Signature | Camelot | Playlist Owner | Time Seconds | Genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CAN'T SAY | Travis Scott | 80 | 148 | 70 | 71 | 20 | 0 | 71 | 0 | 10 | -5 | A#/B♭ Minor | 4 | 3A | Piero | 198 | Hip Hop |
| 1 | New Gold (feat. Tame Impala and Bootie Brown) | Gorillaz,Tame Impala,Bootie Brown | 71 | 108 | 70 | 92 | 4 | 5 | 55 | 0 | 10 | -4 | C♯/D♭ Minor | 3 | 12A | Piero | 215 | Hip Hop |
| 2 | 1AM FREESTYLE | Joji | 68 | 126 | 62 | 54 | 75 | 0 | 12 | 0 | 10 | -6 | C Minor | 4 | 5A | Piero | 113 | Pop |
| 3 | 20 Min | Lil Uzi Vert | 84 | 123 | 77 | 75 | 11 | 0 | 78 | 10 | 10 | -4 | G#/A♭ Minor | 4 | 1A | Piero | 220 | Hip Hop |
| 4 | The Less I Know The Better | Tame Impala | 88 | 117 | 64 | 74 | 1 | 1 | 79 | 0 | 10 | -4 | E Major | 4 | 12B | Piero | 216 | Metal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | FLASH CASANOVA | Yabujin | 53 | 143 | 42 | 72 | 1 | 0 | 41 | 10 | 0 | -10 | C♯/D♭ Major | 4 | 3B | Nirvit | 163 | Hip Hop |
| 96 | Sinceramente | Sérgio Sampaio | 51 | 92 | 71 | 25 | 94 | 0 | 85 | 0 | 10 | -11 | E Minor | 4 | 9A | Nirvit | 78 | Jazz |
| 97 | 24 Hr Drive-Thru | Origami Angel | 52 | 155 | 57 | 96 | 2 | 0 | 26 | 10 | 30 | -4 | G#/A♭ Major | 4 | 4B | Nirvit | 164 | Rock |
| 98 | If I Ain't Got You | Alicia Keys | 84 | 118 | 61 | 44 | 60 | 0 | 17 | 10 | 10 | -9 | G Major | 3 | 9B | Nirvit | 228 | R&B |
| 99 | Solitude | Lord Snow | 20 | 85 | 16 | 99 | 0 | 9 | 9 | 30 | 40 | -5 | A Minor | 4 | 8A | Nirvit | 312 | Metal |
200 rows × 18 columns
Exploratory Data Analysis for Feature Selection
Correlation Heatmap
import plotly.graph_objects as go
def corr_plot(data):
# Calculate the correlation matrix
correlation_matrix = data.corr(numeric_only=True) # restrict to numeric columns (avoids the pandas FutureWarning)
# Create heatmap using Plotly
annotations = []
for i, row in enumerate(correlation_matrix.values):
for j, value in enumerate(row):
font_color = 'white' if value > -0.4 else '#7fc591' # Set font color based on z value
annotations.append(dict(x=correlation_matrix.columns[j], y=correlation_matrix.index[i],
text=str(round(value, 2)),
showarrow=False, font=dict(color=font_color)))
# Create heatmap using Plotly
fig = go.Figure(data=go.Heatmap(
z=correlation_matrix.values,
x=correlation_matrix.columns,
y=correlation_matrix.index,
colorscale='Greens', # Choose your preferred colorscale
colorbar=dict(title='Correlation<br>Strength<br>')
))
fig.update_layout(
title=dict(text ='<b>Correlation Heatmap</b>', x=0.5, y=0.85),
xaxis=dict(title='<b>Features</b>'),
yaxis=dict(title='<b>Features</b>'),
annotations=annotations,
template="plotly_dark",
height=500,
width=700,
hoverlabel=dict(
bgcolor="#008000")
)
return fig
corr_plot(all_songs)
A correlation heatmap visualizes how strongly pairs of variables move together. By illustrating the strength and direction of these relationships, it helps identify patterns, trends, and dependencies within the data. Therefore, we are most interested in the features with very dark or very light tiles.
A few main takeaways:
- Loud: Strong positive correlation with Energy suggests that louder songs tend to have higher energy levels.
- Acoustic: Strong negative correlation with Energy and Loudness implies that acoustic songs tend to have lower energy and loudness levels.
- Energy: Strong positive correlation with Loudness indicates that energetically intense songs tend to be louder.
- Instrumental: Moderate negative correlation with Danceability and Popularity suggests that instrumental songs are less danceable and less popular.
- Dance: Moderate positive correlation with Popularity, Energy, and Happiness suggests that more danceable songs tend to be more popular, energetic, and happier.
- Popularity: Weak positive correlations with Dance, Energy, Happy, and Loudness indicate that more popular songs tend to be slightly more danceable, energetic, happy, and loud.
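The takeaways above can also be pulled out programmatically by ranking feature pairs by absolute correlation. A minimal sketch, using a small hypothetical frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the notebook's numeric features
toy = pd.DataFrame({
    "Loud":     [-5, -4, -6, -4, -11],
    "Energy":   [71, 92, 54, 75, 25],
    "Acoustic": [20, 4, 75, 11, 94],
})
corr = toy.corr(numeric_only=True)
# Mask the diagonal, then rank the remaining pairs by absolute strength
pairs = corr.where(~np.eye(len(corr), dtype=bool)).stack().abs().sort_values(ascending=False)
print(pairs.head())
```

Sorting by absolute value surfaces both the very dark and very light tiles of the heatmap at once.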
Interactive Scatterplot Comparing Similarity Between Music Tastes
import plotly.graph_objects as go
# Define colors for each playlist owner
color_map = {'Piero': '#1ED760', 'Nirvit': '#ff00ff'} #1db96e , #b91d82
# Define symbols for each playlist owner
symbol_map = {'Piero': 'circle', 'Nirvit': 'diamond'} #triangle-up
# Define a function to create scatter plot with my original dataset
def create_original_scatter_plot(all_songs):
# Create scatter plot
fig = go.Figure()
# Add text markers when hovering over points
for group, data in all_songs.groupby('Playlist Owner'):
fig.add_trace(go.Scatter(
x=data['Happy'],
y=data['Energy'],
opacity=0.75,
mode='markers',
name=group,
text=data.apply(lambda row: f"Song: {row['Song']}, Artist: {row['Artist']}, Energy: {row['Energy']}, Happiness: {row['Happy']}", axis=1), # Hover text
marker=dict(
color=color_map[group], # Color points based on group
size=10,
symbol=symbol_map.get(group, 'circle'),
line=dict(
color='#2a8ccb',
width=2
)
)
))
# Scatterplot layout
fig.update_layout(
title={
'text': "<b>Top 100 Songs by Mood</b>", # Top 100 Songs by Positivity and Energy Levels
'font': {'size': 14},
'x': 0.5, # Centered title
'y': 0.9 # Adjust vertical position of title
},
xaxis_title="Happiness Level",
yaxis_title="Energy Level",
legend_title="Listener",
width=1070, # Set width to 1000 pixels
height=525, # Set height to 600 pixels
template="plotly_dark",
# Make hover text white
hoverlabel=dict(
font=dict(
color="white" # Text color inside hover label
))
)
# Label song mood quadrants
fig.add_annotation(
x=0, y=105,
text="<b>Chaotic/Angry</b>",
font=dict(
size=12,
color="white",
),
showarrow=False
)
fig.add_annotation(
x= 100, y=105,
text="<b>Happy/Upbeat</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x= 100, y=-5,
text="<b>Chill/Peaceful</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x=0, y=-5,
text="<b>Sad/Depressing</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
# Adding cross section to distinguish mood sectors
# Vertical line
fig.add_shape(
type="line",
x0=50, y0=0,
x1=50, y1=100,
line=dict(
color="white",
width=1,
dash="dash"
)
)
# Horizontal line
fig.add_shape(
type="line",
x0=0, y0=50,
x1=100, y1=50,
line=dict(
color="white",
width=1,
dash="dash"
)
)
# Show the plot
return fig
create_original_scatter_plot(all_songs)
This scatterplot compares the energy and happiness levels of all songs in our Spotify Wrapped playlists. To interpret the plot, it’s important to think about how energy and happiness features interact.
Low Energy + Low Happiness = Sad / Depressing
Low Energy + High Happiness = Chill / Peaceful
High Energy + High Happiness = Happy / Upbeat
High Energy + Low Happiness = Chaotic / Angry
The scatterplot reveals that songs from my playlist are primarily clustered in the upper half of the plot, reflecting a mix of chaotic/angry and happy/upbeat tunes. This clustering could significantly influence a model's predictive power, potentially making the dataset more separable than anticipated. Another notable trend: while Nirvit's music taste is spread fairly evenly across the plot, he gravitates towards a higher proportion of sad and chill music than I do.
Make sure to hover over the points on the scatterplot to see which songs they represent.
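The quadrant rules above can be captured in a small helper function. This is a sketch of my own; `mood_quadrant` and the 50-point midpoint are illustrative labels, not part of the notebook:

```python
def mood_quadrant(happy, energy, midpoint=50):
    """Map a song's Happy and Energy scores (0-100) to a mood quadrant."""
    if energy >= midpoint:
        return "Happy/Upbeat" if happy >= midpoint else "Chaotic/Angry"
    return "Chill/Peaceful" if happy >= midpoint else "Sad/Depressing"

# e.g. "1AM FREESTYLE" (Happy 12, Energy 54) lands in the Chaotic/Angry quadrant
print(mood_quadrant(12, 54))
```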
Prepping Data For Machine Learning Models
Normalize Data
from sklearn.preprocessing import MinMaxScaler
# Remove Artist and Song columns
normalized_songs = all_songs.drop(columns=['Song', 'Artist'])
# Select numerical columns to normalize
columns_to_normalize = ['Popularity', 'BPM', 'Dance', 'Energy', 'Acoustic', 'Instrumental', 'Happy', 'Speech', 'Live', 'Loud', 'Time Signature', 'Time Seconds']
# Initialize MinMaxScaler
scaler = MinMaxScaler()
# Fit the scaler on the selected columns
scaler.fit(normalized_songs[columns_to_normalize])
# Transform the selected columns
normalized_songs[columns_to_normalize] = scaler.transform(normalized_songs[columns_to_normalize])
# Create a new binary response column
normalized_songs['Binary Response'] = (normalized_songs['Playlist Owner'] == 'Piero').astype(int)
# Drop the original 'playlist' column if no longer needed
normalized_songs.drop(columns=['Playlist Owner'], inplace=True)
Now we've got ourselves a normalized dataset!
normalized_songs
| | Popularity | BPM | Dance | Energy | Acoustic | Instrumental | Happy | Speech | Live | Loud | Key | Time Signature | Camelot | Time Seconds | Genre | Binary Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.857143 | 0.550725 | 0.732558 | 0.707071 | 0.202020 | 0.000000 | 0.715789 | 0.000000 | 0.125 | 0.820513 | A#/B♭ Minor | 0.75 | 3A | 0.115156 | Hip Hop | 1 |
| 1 | 0.750000 | 0.260870 | 0.732558 | 0.919192 | 0.040404 | 0.050505 | 0.547368 | 0.000000 | 0.125 | 0.846154 | C♯/D♭ Minor | 0.50 | 12A | 0.127786 | Hip Hop | 1 |
| 2 | 0.714286 | 0.391304 | 0.639535 | 0.535354 | 0.757576 | 0.000000 | 0.094737 | 0.000000 | 0.125 | 0.794872 | C Minor | 0.75 | 5A | 0.052006 | Pop | 1 |
| 3 | 0.904762 | 0.369565 | 0.813953 | 0.747475 | 0.111111 | 0.000000 | 0.789474 | 0.166667 | 0.125 | 0.846154 | G#/A♭ Minor | 0.75 | 1A | 0.131501 | Hip Hop | 1 |
| 4 | 0.952381 | 0.326087 | 0.662791 | 0.737374 | 0.010101 | 0.010101 | 0.800000 | 0.000000 | 0.125 | 0.846154 | E Major | 0.75 | 12B | 0.128529 | Metal | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.535714 | 0.514493 | 0.406977 | 0.717172 | 0.010101 | 0.000000 | 0.400000 | 0.166667 | 0.000 | 0.692308 | C♯/D♭ Major | 0.75 | 3B | 0.089153 | Hip Hop | 0 |
| 96 | 0.511905 | 0.144928 | 0.744186 | 0.242424 | 0.949495 | 0.000000 | 0.863158 | 0.000000 | 0.125 | 0.666667 | E Minor | 0.75 | 9A | 0.026003 | Jazz | 0 |
| 97 | 0.523810 | 0.601449 | 0.581395 | 0.959596 | 0.020202 | 0.000000 | 0.242105 | 0.166667 | 0.375 | 0.846154 | G#/A♭ Major | 0.75 | 4B | 0.089896 | Rock | 0 |
| 98 | 0.904762 | 0.333333 | 0.627907 | 0.434343 | 0.606061 | 0.000000 | 0.147368 | 0.166667 | 0.125 | 0.717949 | G Major | 0.50 | 9B | 0.137444 | R&B | 0 |
| 99 | 0.142857 | 0.094203 | 0.104651 | 0.989899 | 0.000000 | 0.090909 | 0.063158 | 0.500000 | 0.500 | 0.820513 | A Minor | 0.75 | 8A | 0.199851 | Metal | 0 |
200 rows × 16 columns
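As a quick sanity check on what MinMaxScaler is doing, each column is rescaled to [0, 1] via (x - min) / (max - min). The BPM values below are illustrative, not the real column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

bpm = np.array([[148.0], [108.0], [126.0]])  # illustrative BPM values
scaled = MinMaxScaler().fit_transform(bpm)
# The column minimum maps to 0, the maximum to 1, everything else interpolates linearly
print(scaled.ravel())
```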
Setting Up Training and Testing Data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
# Define features (X) and target variable (y)
X = normalized_songs[['Popularity', 'BPM', 'Dance', 'Energy', 'Acoustic', 'Instrumental', 'Happy', 'Speech', 'Live', 'Loud', 'Key', 'Time Signature', 'Camelot', 'Time Seconds', 'Genre']] # Features
y = normalized_songs['Binary Response'] # Target variable Playlist Owner
# Initialize OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse_output=False) # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
# One-hot encode categorical columns
X_encoded = pd.DataFrame(encoder.fit_transform(X[['Key', 'Camelot', 'Genre']])) # Only encode categorical columns
X_encoded.columns = encoder.get_feature_names_out(['Key', 'Camelot', 'Genre']) # Get categorical column names
# Reset indices of X and X_encoded
X.reset_index(drop=True, inplace=True)
X_encoded.reset_index(drop=True, inplace=True)
# Concatenate numerical and encoded categorical columns
X_final = pd.concat([X, X_encoded], axis=1)
# Drop original columns since they have been encoded to new columns
X_final.drop(columns=['Key', 'Camelot', 'Genre'], inplace=True)
# Splitting up the data into training and testing sets (60% training, 40% testing)
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.4, random_state=18, shuffle=True)
Now that the data has been split into testing and training sets, the next step involves creating machine learning models to predict which Spotify Wrapped playlist a song belongs to.
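With `test_size=0.4` on 200 songs, the split yields 120 training rows and 80 test rows. A minimal sketch of the split behavior, using dummy arrays in place of `X_final` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_dummy = np.arange(200).reshape(-1, 1)    # stand-in for X_final (200 songs)
y_dummy = np.array([1] * 100 + [0] * 100)  # 100 songs per playlist owner
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.4, random_state=18, shuffle=True)
print(len(X_tr), len(X_te))
```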
Creating Machine Learning Models
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Create and train the logistic regression model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred_lr = lr_model.predict(X_test)
# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy:", accuracy_lr)
Accuracy: 0.7375
Feature Importance Plot (Logistic Regression)
I developed this feature importance plot function to identify the most and least useful predictors in each model.
import plotly.graph_objects as go
import panel as pn
def plot_linear_feature_importance(model_name):
# Get feature importances
lr_importances = model_name.coef_[0]
indices = np.argsort(lr_importances)[::-1]
# Get feature names
feature_names = X_train.columns
# Create custom color gradient
colors = ['#1DB954', '#2BBE60', '#3AC26C', '#48C778', '#57CB84', '#65D08F', '#74D49B', '#83D9A7', '#91DDB3', '#9FE2BF']
# Create figure
fig = go.Figure()
# Add bars to plot
fig.add_trace(go.Bar(
x=lr_importances[indices][:10], # Grabs the top 10 features
y=[feature_names[i] for i in indices[:10]], # Grabs their corresponding feature names
marker=dict(color=colors),
orientation='h' # Style as horizontal bar chart
))
# Style barplot
fig.update_layout(
title=dict(text="<b>Top 10 Feature Importances</b>", x=0.5, font=dict(size=16, color='white', family='Arial, sans-serif')),
xaxis=dict(title='<b>Importance</b>', titlefont=dict(size=14, color='white', family='Arial, sans-serif')),
yaxis=dict(title='<b>Features</b>', titlefont=dict(size=14, color='white', family='Arial, sans-serif')),
font=dict(size=12, color='white', family='Arial, sans-serif'),
margin=dict(l=100, r=20, t=40, b=20),
height=500, #500
width=700, # 800
template="plotly_dark", # dark mode
# Make hover markers have white text
hoverlabel=dict(
font=dict(
color="white"
)
)
)
return fig # Display plot in dashboard when clicked
# Call function for logistic regression
plot_linear_feature_importance(lr_model)
Creating a Visualization Dataset
To craft scatterplots, we need a streamlined visualization dataset containing only essential columns. This dataset, labeled viz_dataset, is extracted from the original dataset, all_songs, and encompasses descriptive song attributes like ‘Song’, ‘Artist’, ‘Playlist Owner’, in addition to ‘Happy’ and ‘Energy’ levels. The extraction process involves selecting rows corresponding to indices found within the X_test dataset.
# Reset index of the all_songs DataFrame
all_songs_reset_index = all_songs.reset_index(drop=True)
# Extract rows from the original dataset based on indices in X_test
viz_dataset = all_songs_reset_index.loc[X_test.index, ['Song', 'Artist', 'Playlist Owner','Happy', 'Energy']]
viz_dataset
| | Song | Artist | Playlist Owner | Happy | Energy |
|---|---|---|---|---|---|
| 134 | Suite bergamasque, L. 75: III. Clair de lune | Claude Debussy,Philippe Entremont | Nirvit | 4 | 6 |
| 91 | Fair Trade (with Travis Scott) | Drake,Travis Scott | Piero | 29 | 47 |
| 81 | Father Stretch My Hands Pt. 1 | Kanye West | Piero | 44 | 57 |
| 108 | 愛してる | callin' | Nirvit | 31 | 31 |
| 170 | Disfarça E Chora | Cartola | Nirvit | 96 | 44 |
| ... | ... | ... | ... | ... | ... |
| 126 | Kiss the Ladder | Fleshwater | Nirvit | 25 | 99 |
| 37 | lose | Travis Scott | Piero | 28 | 56 |
| 27 | Doin' it Right (feat. Panda Bear) | Daft Punk,Panda Bear | Piero | 19 | 45 |
| 2 | 1AM FREESTYLE | Joji | Piero | 12 | 54 |
| 77 | Hot Air Balloon | Don Diablo,AR/CO | Piero | 57 | 71 |
80 rows × 5 columns
Creating a Scatterplot Function to Show Logistic Regression Classification Results
This function can create scatterplots for any type of model, whether linear, tree-based, or distance-based. The plan is to use it in the dashboard to visually represent each model's song classification predictions.
import plotly.graph_objects as go
def model_plot(y_pred):
# Define colors for each playlist owner
color_map = {1: '#1ED760', 0: '#ff00ff'} #1db96e , #b91d82
# Define symbols for each playlist owner
symbol_map = {1: 'circle', 0: 'diamond'}
# Map class labels to name legend labels
legend_labels = {1: 'Piero', 0: 'Nirvit'}
# Replace prediction labels (1,0) for names (Piero, Nirvit) in the legend
legend_names = [legend_labels[label] for label in color_map.keys()]
# Add the model's predictions (`y_pred`) as a new column
viz_dataset['Predicted Owner'] = y_pred
# Create scatter plot
fig = go.Figure()
# Add text markers when hovering over points
for group, data in viz_dataset.groupby('Predicted Owner'):
fig.add_trace(go.Scatter(
x=data['Happy'],
y=data['Energy'],
opacity=0.75,
mode='markers',
name=legend_labels[group],
text=data.apply(lambda row: f"Song: {row['Song']}, Artist: {row['Artist']}, Energy: {row['Energy']}, Happiness: {row['Happy']}", axis=1), # Hover text
marker=dict(
color=color_map[group], # Color points based on group
size=10,
symbol=symbol_map.get(group, 'circle'),
line=dict(
color='#2a8ccb', ##2a8ccb
width=2
)
)
))
# Change scatterplot appearance / styles
fig.update_layout(
title={
'text': "<b>Top 100 Songs by Mood</b>", # Top 100 Songs by Positivity and Energy Levels
'font': {'size': 14},
'x': 0.5, # Centered title
'y': 0.9 # Adjust vertical position of title
},
xaxis_title="Happiness Level",
yaxis_title="Energy Level",
legend_title="Listener",
width=1070,
height=525,
template="plotly_dark",
# Make hover text white
hoverlabel=dict(
font=dict(
color="white"
)
)
)
# Label song mood quadrants
fig.add_annotation(
x=0, y=105,
text="<b>Chaotic/Angry</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x= 100, y=105,
text="<b>Happy/Upbeat</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x= 100, y=-5,
text="<b>Chill/Peaceful</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x=0, y=-5,
text="<b>Sad/Depressing</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
# Adding cross section to distinguish mood sectors
# Vertical line
fig.add_shape(
type="line",
x0=50, y0=0,
x1=50, y1=100,
line=dict(
color="white",
width=1,
dash="dash"
)
)
# Horizontal line
fig.add_shape(
type="line",
x0=0, y0=50,
x1=100, y1=50,
line=dict(
color="white",
width=1,
dash="dash"
)
)
# Show scatterplot
return fig
model_plot(y_pred_lr) # Scatterplot for logistic regression
Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
# Create random forest model
rf_model = RandomForestClassifier(n_estimators=1000, random_state=18)
# Train Model
rf_model.fit(X_train, y_train)
# Predictions
y_pred_rf = rf_model.predict(X_test)
# Evaluate model performance
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print("Accuracy:", rf_accuracy)
Accuracy: 0.7875
Feature Importance Plot (Random Forest)
Since linear models and tree-based models store their feature importances differently, two separate feature importance plot functions are required.
In linear models, such as linear regression or logistic regression, feature importance is derived directly from the coefficients assigned to each feature during the model fitting process. These coefficients represent the magnitude and direction of the relationship between each feature and the target variable. Therefore, accessing the .coef_ attribute retrieves these coefficients, which can be interpreted as feature importances.
In tree-based models like Random Forests, feature importance is typically computed based on how much each feature contributes to decreasing impurity (e.g., Gini impurity or entropy) across all the trees in the forest. The .feature_importances_ attribute of a trained Random Forest model provides the importance scores for each feature, calculated based on this criterion.
So, while linear models directly use the coefficients as feature importance, Random Forest models use a measure of impurity decrease to determine feature importance across the ensemble of trees.
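The difference is easy to see on toy data where one feature drives the label and the other is pure noise. This is a sketch with synthetic data, not the notebook's models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)  # feature 0 determines the label; feature 1 is noise

lr = LogisticRegression().fit(X_toy, y_toy)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

lr_imp = lr.coef_[0]              # signed coefficients (direction + magnitude)
rf_imp = rf.feature_importances_  # non-negative impurity decreases, summing to 1
print(lr_imp, rf_imp)
```

In both models the informative feature dominates, but only the linear coefficients carry a sign.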
import plotly.graph_objects as go
import panel as pn
def plot_tree_feature_importance(model_name):
# Get feature importances for tree-based model
lr_importances = model_name.feature_importances_
indices = np.argsort(lr_importances)[::-1]
# Get corresponding feature names
feature_names = X_train.columns
# Create custom color gradient
colors = ['#1DB954', '#2BBE60', '#3AC26C', '#48C778', '#57CB84', '#65D08F', '#74D49B', '#83D9A7', '#91DDB3', '#9FE2BF']
# Create figure
fig = go.Figure()
# Add bars to plot
fig.add_trace(go.Bar(
x=lr_importances[indices][:10], # Grab top 10 features in the model
y=[feature_names[i] for i in indices[:10]], # Get corresponding feature names
marker=dict(color=colors), # assign color gradient to bars
orientation='h' # Style as horizontal barplot
))
# Style barplot
fig.update_layout(
title=dict(text="<b>Top 10 Feature Importances</b>", x=0.5, font=dict(size=16, color='white', family='Arial, sans-serif')),
xaxis=dict(title='<b>Importance</b>', titlefont=dict(size=14, color='white', family='Arial, sans-serif')),
yaxis=dict(title='<b>Features</b>', titlefont=dict(size=14, color='white', family='Arial, sans-serif')),
font=dict(size=12, color='white', family='Arial, sans-serif'),
margin=dict(l=100, r=20, t=40, b=20),
height=500,
width=700,
template="plotly_dark",
# Make hover text white
hoverlabel=dict(
font=dict(
color="white"
)
)
)
return fig # display plot in dashboard when clicked
# Plot random forest barplot
plot_tree_feature_importance(rf_model)
Boosted Trees
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Create boosted trees model
boost_model = GradientBoostingClassifier(n_estimators=1000,
max_depth=3,
learning_rate=0.1,
min_samples_split=3)
# Fit the model to training set
boost_model.fit(X_train, y_train)
# Predictions
y_pred_boost = boost_model.predict(X_test)
# Evaluate boosted trees model accuracy
boost_accuracy = accuracy_score(y_test, y_pred_boost)
print("Accuracy:", boost_accuracy)
Accuracy: 0.8
Feature Importance Plot (Boosted Trees)
plot_tree_feature_importance(boost_model)
K-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Create K-nearest neighbors classifier
knn_model = KNeighborsClassifier(n_neighbors=5) # You can adjust the number of neighbors as needed
# Fit the model to training set
knn_model.fit(X_train, y_train)
# Make predictions
y_pred_knn = knn_model.predict(X_test)
# Calculate accuracy
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print("Accuracy:", knn_accuracy)
Accuracy: 0.6375
Feature Importance Plot (KNN)
Unfortunately, a feature importance bar plot cannot be created because the K-Nearest Neighbors algorithm doesn't inherently provide feature importance scores the way tree-based algorithms or linear models do. Instead, K-Nearest Neighbors is a distance-based algorithm that makes predictions by measuring proximity between data points, typically with Euclidean distance. Due to the lack of feature importance scores and its low accuracy, it will not be included in the final dashboard.
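One model-agnostic workaround, not used in this project, is scikit-learn's permutation importance, which measures how much shuffling each feature's values degrades model accuracy. A sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(18)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)  # only feature 0 is informative

knn = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)
# Shuffle each column in turn and record the drop in accuracy
result = permutation_importance(knn, X_toy, y_toy, n_repeats=10, random_state=18)
print(result.importances_mean)
```

Shuffling the informative feature tanks accuracy, while shuffling the noise columns barely matters, so the importances recover the right ranking even for a distance-based model.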
Support Vector Machine
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Create Support Vector Machine Classifier
svm_model = SVC(kernel='linear') # Other kernel options include 'rbf' and 'poly'
# Fit the model to training set
svm_model.fit(X_train, y_train)
# Make predictions
y_pred_svm = svm_model.predict(X_test)
# Calculate accuracy
svm_accuracy = accuracy_score(y_test, y_pred_svm)
print("Accuracy:", svm_accuracy)
Accuracy: 0.75
Feature Importance Plot (SVM)
plot_linear_feature_importance(svm_model)
Decision Trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Create Decision Tree Classifier
dec_tree_model = DecisionTreeClassifier()
# Fit the model to training set
dec_tree_model.fit(X_train, y_train)
# Make predictions
y_pred_dec_tree = dec_tree_model.predict(X_test)
# Calculate accuracy
dec_tree_accuracy = accuracy_score(y_test, y_pred_dec_tree)
print("Accuracy:", dec_tree_accuracy)
Accuracy: 0.7875
Feature Importance Plot (Decision Tree)
plot_tree_feature_importance(dec_tree_model)
Gauge Visualization
Now, a gauge visualization function is developed to showcase model accuracy on the dashboard.
import panel as pn
import plotly.graph_objects as go
# Create gauge visualization function
def gauge_accuracy_viz(model_performance, last_reference):
    # Create gauge chart; the Indicator computes the delta from 'reference'
    # below to show whether the current model performs better or worse
    fig = go.Figure(go.Indicator(
        mode="gauge+number+delta",
        value=model_performance * 100,
        domain={'x': [0, 1], 'y': [0, 1]},
        title={'text': "Accuracy", 'font': {'size': 24, 'color': "#00ff7f"}},
        delta={'reference': last_reference * 100, 'increasing': {'color': "#00ff00"}, 'decreasing': {'color': "#ff7373"}},
        gauge={
            'axis': {'range': [None, 100], 'tickwidth': 2, 'tickcolor': "#70D2A2"},
            'bar': {'color': "#1DB954"},
            'bgcolor': "white",
            'borderwidth': 3,
            'bordercolor': "#00ff7f",
            'steps': [
                {'range': [0, 50], 'color': '#b91d82'},
                {'range': [50, 100], 'color': '#fff68f'}],
            'threshold': {
                'line': {'color': "#cc0000", 'width': 4},
                'thickness': 0.75,
                'value': model_performance * 100}}
    ))
    # Add percent sign to value and delta
    fig.update_traces(number={'suffix': '%'}, delta={'suffix': '%'})
    # Visualize gauge in dark mode
    fig.update_layout(template="plotly_dark", font={'color': "#00ff7f", 'family': "Arial"}, height=500, width=364)
    return fig
Panel Dashboard
import panel as pn
import pandas as pd
import plotly.graph_objects as go
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
pn.extension('echarts')
# Create buttons for selecting models
button_original_dataset = pn.widgets.Button(name='Original Dataset')
button_logistic_regression = pn.widgets.Button(name='Logistic Regression')
button_random_forest = pn.widgets.Button(name='Random Forest')
button_boosted_trees = pn.widgets.Button(name='Boosted Trees')
button_decision_trees = pn.widgets.Button(name='Decision Trees')
button_svm = pn.widgets.Button(name='Support Vector Machine')
# Note: no button for K-Nearest Neighbors, since it lacks feature importance
# scores and underperformed the other models
last_reference = 0 # Global variable storing the previous model's accuracy score
# Define callback functions for the buttons
def on_click_original_dataset(event):
    global last_reference
    scatter_plot.object = create_original_scatter_plot(all_songs)
    feature_importance_plot.object = corr_plot(all_songs) # Swap in a corr plot since there are no features to show
    gauge_pane.object = gauge_accuracy_viz(0, 0)
    last_reference = 0 # Reset the stored accuracy so the next model's delta starts fresh
def on_click_logistic_regression(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_lr)
    feature_importance_plot.object = plot_linear_feature_importance(lr_model)
    gauge_pane.object = gauge_accuracy_viz(accuracy_lr, last_reference)
    last_reference = accuracy_lr
def on_click_random_forest(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_rf)
    feature_importance_plot.object = plot_tree_feature_importance(rf_model)
    gauge_pane.object = gauge_accuracy_viz(rf_accuracy, last_reference)
    last_reference = rf_accuracy
def on_click_boosted_trees(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_boost)
    feature_importance_plot.object = plot_tree_feature_importance(boost_model)
    gauge_pane.object = gauge_accuracy_viz(boost_accuracy, last_reference)
    last_reference = boost_accuracy
def on_click_decision_trees(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_dec_tree)
    feature_importance_plot.object = plot_tree_feature_importance(dec_tree_model)
    gauge_pane.object = gauge_accuracy_viz(dec_tree_accuracy, last_reference)
    last_reference = dec_tree_accuracy
def on_click_svm(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_svm)
    feature_importance_plot.object = plot_linear_feature_importance(svm_model)
    gauge_pane.object = gauge_accuracy_viz(svm_accuracy, last_reference)
    last_reference = svm_accuracy
# Bind callbacks when each button is clicked
button_original_dataset.on_click(on_click_original_dataset)
button_logistic_regression.on_click(on_click_logistic_regression)
button_random_forest.on_click(on_click_random_forest)
button_boosted_trees.on_click(on_click_boosted_trees)
button_decision_trees.on_click(on_click_decision_trees)
button_svm.on_click(on_click_svm)
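The callbacks above are near-identical; they differ only in which predictions, model, and accuracy they reference. One way to cut the duplication is a closure factory that captures the per-model accuracy and tracks the previous score in shared state instead of a global. A sketch with the pane updates reduced to plain bookkeeping (the dict `state` and `make_callback` are illustrative names, not from the original code):

```python
# Sketch: build one click handler per model with a factory, capturing the
# model's accuracy and recording the previous score in a shared dict.
def make_callback(accuracy, state):
    def on_click(event):
        previous = state.get('last_reference', 0)
        state['delta'] = accuracy - previous  # what the gauge would display
        state['last_reference'] = accuracy
    return on_click

state = {}
make_callback(0.75, state)(None)    # simulate clicking the SVM button
make_callback(0.7875, state)(None)  # then the Decision Trees button
print(state['delta'])  # ~0.0375 improvement over the previous model
```

In the real dashboard, the factory would also take the prediction array, the fitted model, and the importance-plot function as arguments, so each `pn.widgets.Button` could be wired with a single `make_callback(...)` call.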
# Create empty Plotly panes; the button callbacks populate them via .object
scatter_plot = pn.pane.Plotly()
feature_importance_plot = pn.pane.Plotly()
gauge_pane = pn.pane.Plotly()
# Create logo pane
panel_logo = pn.pane.PNG(
'/Users/piero/Downloads/Spotify_Project/Spotify_Logo_RGB_Green.png',
width=150, height=95, align='center'
)
text2 = 'Select a model using the buttons above to visualize its performance.'
text3 = '[View dashboard code](https://github.com/suppiero/spotify_classification_dash)'
# Dashboard layout
template = pn.template.FastListTemplate(
    theme="dark",
    logo='/Users/piero/Downloads/Spotify_Project/Spotify_Logo_RGB_Green.png',
    title="Visualizing Spotify Song Classification Performance",
    sidebar=[pn.pane.Markdown("## Reset"), button_original_dataset,
             pn.pane.Markdown("## Models"), button_logistic_regression, button_random_forest,
             button_boosted_trees, button_decision_trees, button_svm, text2, text3],
    main=[
        pn.Row(pn.Column(scatter_plot, sizing_mode='stretch_both', margin=(-20, 0, 0, -24))),
        pn.Row(pn.Column(feature_importance_plot, margin=(11, 0, 0, -24)),
               pn.Column(gauge_pane, margin=(11, 0, 0, -13)),
               sizing_mode='stretch_both', min_height=400, min_width=950)
    ],
    theme_toggle=False,
    accent_base_color="#0bff38", # change color of hyperlink text
    header_background="#1f2630", # change color of header banner | previous color: #009E60
    header_color='#0bff38', # change color of header text | previous color: #57ff76
    main_max_width='900px', # maximum width of the main area containing all plots
    main_layout=None,
    sidebar_width=172, # adjust sidebar size
    font='https://fonts.googleapis.com/css2?family=Raleway:ital,wght@0,100;1,100&display=swap'
)
# Load original dataset button images on startup
on_click_original_dataset(None)
# Display the dashboard
template.show()
Launching server at http://localhost:50831
Final Thoughts
Overall, this project was an amazing opportunity to delve into music data analysis, machine learning, and dashboarding. It was fascinating to uncover patterns within our Spotify Wrapped playlists, statistically compare our music tastes, and gain deeper insight into our listening habits. While our classification models weren't perfect, they still produced solidly accurate results, hinting at meaningful distinctions between the songs Nirvit and I favor.
For those interested in conducting a similar analysis using Python, I recommend exploring my GitHub repository dedicated to this project.
Sources
I’d like to extend a special thank you to the wonderful data analysts who inspired me to make this project, offering invaluable ideas and sharing fantastic source code!
- Whose Song is it Anyway? by Lewis White
Citation
@online{trujillo2024,
author = {Trujillo, Piero},
title = {Spotify {Classification} {Dashboard} and {Model} {Analysis}},
date = {2024-04-01},
url = {https://suppiero.github.io/projects/spotify_classification_dashboard/},
langid = {en}
}