import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Read in csv file to create tabular dataframe
piero_top_songs = pd.read_csv("/Users/piero/Downloads/Spotify_Project/Piero_Top_Songs_2023.csv")
nirvit_top_songs = pd.read_csv("/Users/piero/Downloads/Spotify_Project/Nirvit_Top_Songs_2023.csv")
# Add a truth column to classify whether a song is from Piero's or Nirvit's playlist
piero_top_songs['Playlist Owner'] = 'Piero'
nirvit_top_songs['Playlist Owner'] = 'Nirvit'
# Convert 'MM:SS' duration strings to total seconds (prepend '00:' so to_timedelta parses 'HH:MM:SS')
piero_top_songs['Time Seconds'] = pd.to_timedelta('00:' + piero_top_songs['Time']).dt.total_seconds().astype(int)
nirvit_top_songs['Time Seconds'] = pd.to_timedelta('00:' + nirvit_top_songs['Time']).dt.total_seconds().astype(int)
# Remove unnecessary columns
piero_top_songs = piero_top_songs.drop(columns=['Song Preview', 'Spotify Track Img', 'Album Label', 'Spotify Track Id', 'Added At', '#', 'Album', 'Album Date', 'Time'])
nirvit_top_songs = nirvit_top_songs.drop(columns=['Song Preview', 'Spotify Track Img', 'Album Label', 'Spotify Track Id', 'Added At', '#', 'Album', 'Album Date', 'Time'])
# Join playlists into one dataframe
all_songs = pd.concat([piero_top_songs, nirvit_top_songs])
# Convert all object columns to type string
object_columns = all_songs.select_dtypes(include=['object']).columns # First, create list of object columns to convert
all_songs[object_columns] = all_songs[object_columns].astype('string')
#print(all_songs.dtypes) # Check that column types have been converted to string
Introduction
In this project, my friend Nirvit and I shared our 2023 Spotify Wrapped playlists so we could visually compare our music tastes and then build models to predict whose playlist a song belongs to. Finally, I compiled the results of each model into an interactive dashboard using Panel.
This blog post will have the following sections:
Setup and Preprocessing
Exploratory Data Analysis for Feature Selection
Prepping Data For Machine Learning Models
Creating Machine Learning Models
Panel Dashboard
Final Thoughts
Now, let’s dive into the exciting world of music data analysis!
Understanding our Spotify Dataset
Track Metadata
| column | description |
|---|---|
| Song | Song title |
| Artist | Song artist |
| Genre | Song genre category |
Audio Numerical Quantitative Data
| column | description |
|---|---|
| Loud | How loud a song is (dB) |
| Time Seconds | Duration of the song in seconds |
| BPM | Average song tempo / how fast a song is |
Audio Qualitative Data
| column | description |
|---|---|
| Energy | How energetic the song is |
| Dance | How easy the song is to dance to |
| Happy | How positive the mood of the song is |
| Acoustic | How acoustic sounding the song is |
| Speech | How much of a song is spoken word |
| Popularity | How popular a song is (at time of data collection) |
| Live | How likely the song is a live recording (higher value = live recording) |
| Instrumental | How instrumental a song is (more music, fewer vocals) |
Audio Categorical Data
| column | description |
|---|---|
| Key | The most repeated key in the song |
| Time Signature | Numerical representation of rhythmic structure in song |
| Camelot | Musical key of a song for harmonic mixing |
| Playlist Owner | Whose playlist the song belongs to |
Setup and Preprocessing
# Check for null values
all_songs.isnull().sum().sum() # 9 NaN values in 'Genres' and 'Parent Genres' columns
# Create dataframe of songs containing NaN values in either 'Genres' or 'Parent Genres'
nan_rows = all_songs[(all_songs['Genres'].isnull()) | (all_songs['Parent Genres'].isnull())]
# Fill NaNs in 'Genres' column with 'Unknown' since I cannot find them on Spotify or Google
all_songs[['Genres']] = all_songs[['Genres']].fillna('Unknown')
# Populate NaN values in 'Parent Genres' column with genres found on Spotify or Google for the specified song and artist
all_songs.loc[(all_songs['Song'] == 'Mumbo Sugar') & (all_songs['Artist'] == 'Arc De Soleil'), ['Parent Genres']] = ['R&B, Soul']
all_songs.loc[(all_songs['Song'] == 'Give It Back') & (all_songs['Artist'] == 'Gaelle'), ['Parent Genres']] = ['Dance, Electronic']
all_songs.loc[(all_songs['Song'] == '愛してる') & (all_songs['Artist'] == "callin'"), ['Parent Genres']] = ['Anime, J-Pop']
all_songs.loc[(all_songs['Song'] == 'You Are Mine') & (all_songs['Artist'] == 'Jay Robinson'), ['Parent Genres']] = ['Classic Soul']
all_songs.loc[(all_songs['Song'] == 'Thank You DubNation! (the page will never be long enough)') & (all_songs['Artist'] == 'herlovebeheadsdaisies'), ['Parent Genres']] = ['Screamo']
# Convert categorical variables to factors - allow us to use non-numeric data in statistical modeling
object_columns = all_songs.select_dtypes(include=['object']).columns # First, create list of object columns to convert
all_songs[object_columns] = all_songs[object_columns].astype('category')
# Making sure there are no null values left in the dataset
nan_rows = all_songs[(all_songs['Genres'].isnull()) | (all_songs['Parent Genres'].isnull())]
nan_rows
| Song | Artist | Popularity | BPM | Genres | Parent Genres | Dance | Energy | Acoustic | Instrumental | Happy | Speech | Live | Loud | Key | Time Signature | Camelot | Playlist Owner | Time Seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
# Splitting 'Parent Genres' column since there are so many different genres
first_instance = all_songs['Parent Genres'].str.split(',').str[0] # extract first genre element
# Assign first instance to new 'Genre' column
all_songs['Genre'] = first_instance
unique_genres = all_songs['Genre'].unique()
num_unique_genres = len(unique_genres)
print("Number of unique genres:", num_unique_genres)
# Counting unique genres
all_songs['Genre'].value_counts() # 17 (now) vs 56 (before)
# Remove unnecessary columns
all_songs = all_songs.drop(columns=['Parent Genres', 'Genres'])
Number of unique genres: 17
print(all_songs.dtypes) # Check categorical column types have been converted to string
Song                string
Artist string
Popularity int64
BPM int64
Dance int64
Energy int64
Acoustic int64
Instrumental int64
Happy int64
Speech int64
Live int64
Loud int64
Key string
Time Signature int64
Camelot string
Playlist Owner string
Time Seconds int64
Genre object
dtype: object
Final Dataset
# Save dataset as csv file
#all_songs.to_csv('all_spotify_songs.csv')
# Final dataset
all_songs
| | Song | Artist | Popularity | BPM | Dance | Energy | Acoustic | Instrumental | Happy | Speech | Live | Loud | Key | Time Signature | Camelot | Playlist Owner | Time Seconds | Genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CAN'T SAY | Travis Scott | 80 | 148 | 70 | 71 | 20 | 0 | 71 | 0 | 10 | -5 | A#/B♭ Minor | 4 | 3A | Piero | 198 | Hip Hop |
| 1 | New Gold (feat. Tame Impala and Bootie Brown) | Gorillaz,Tame Impala,Bootie Brown | 71 | 108 | 70 | 92 | 4 | 5 | 55 | 0 | 10 | -4 | C♯/D♭ Minor | 3 | 12A | Piero | 215 | Hip Hop |
| 2 | 1AM FREESTYLE | Joji | 68 | 126 | 62 | 54 | 75 | 0 | 12 | 0 | 10 | -6 | C Minor | 4 | 5A | Piero | 113 | Pop |
| 3 | 20 Min | Lil Uzi Vert | 84 | 123 | 77 | 75 | 11 | 0 | 78 | 10 | 10 | -4 | G#/A♭ Minor | 4 | 1A | Piero | 220 | Hip Hop |
| 4 | The Less I Know The Better | Tame Impala | 88 | 117 | 64 | 74 | 1 | 1 | 79 | 0 | 10 | -4 | E Major | 4 | 12B | Piero | 216 | Metal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | FLASH CASANOVA | Yabujin | 53 | 143 | 42 | 72 | 1 | 0 | 41 | 10 | 0 | -10 | C♯/D♭ Major | 4 | 3B | Nirvit | 163 | Hip Hop |
| 96 | Sinceramente | Sérgio Sampaio | 51 | 92 | 71 | 25 | 94 | 0 | 85 | 0 | 10 | -11 | E Minor | 4 | 9A | Nirvit | 78 | Jazz |
| 97 | 24 Hr Drive-Thru | Origami Angel | 52 | 155 | 57 | 96 | 2 | 0 | 26 | 10 | 30 | -4 | G#/A♭ Major | 4 | 4B | Nirvit | 164 | Rock |
| 98 | If I Ain't Got You | Alicia Keys | 84 | 118 | 61 | 44 | 60 | 0 | 17 | 10 | 10 | -9 | G Major | 3 | 9B | Nirvit | 228 | R&B |
| 99 | Solitude | Lord Snow | 20 | 85 | 16 | 99 | 0 | 9 | 9 | 30 | 40 | -5 | A Minor | 4 | 8A | Nirvit | 312 | Metal |
200 rows × 18 columns
Exploratory Data Analysis for Feature Selection
Correlation Heatmap
import plotly.graph_objects as go
def corr_plot(data):
# Calculate the correlation matrix
correlation_matrix = data.corr(numeric_only=True) # restrict to numeric columns (avoids the pandas FutureWarning)
# Create heatmap using Plotly
annotations = []
for i, row in enumerate(correlation_matrix.values):
for j, value in enumerate(row):
font_color = 'white' if value > -0.4 else '#7fc591' # Set font color based on z value
annotations.append(dict(x=correlation_matrix.columns[j], y=correlation_matrix.index[i],
text=str(round(value, 2)),
showarrow=False, font=dict(color=font_color)))
# Create heatmap using Plotly
fig = go.Figure(data=go.Heatmap(
z=correlation_matrix.values,
x=correlation_matrix.columns,
y=correlation_matrix.index,
colorscale='Greens', # Choose your preferred colorscale
colorbar=dict(title='Correlation<br>Strength<br>')
))
fig.update_layout(
title=dict(text ='<b>Correlation Heatmap</b>', x=0.5, y=0.85),
xaxis=dict(title='<b>Features</b>'),
yaxis=dict(title='<b>Features</b>'),
annotations=annotations,
template="plotly_dark",
height=500,
width=700,
hoverlabel=dict(
bgcolor="#008000")
)
return fig
corr_plot(all_songs)
A correlation heatmap visualizes how strongly pairs of variables move together. By illustrating the strength and direction of these relationships, it helps identify patterns, trends, and dependencies within the data. Therefore, we are most interested in the features with very dark or very light tiles.
A few main takeaways:
- Loud: Strong positive correlation with Energy suggests that louder songs tend to have higher energy levels.
- Acoustic: Strong negative correlation with Energy and Loudness implies that acoustic songs tend to have lower energy and loudness levels.
- Energy: Strong positive correlation with Loudness indicates that energetically intense songs tend to be louder.
- Instrumental: Moderate negative correlation with Danceability and Popularity suggests that instrumental songs are less danceable and less popular.
- Dance: Moderate positive correlation with Popularity, Energy, and Happiness suggests that more danceable songs tend to be more popular, energetic, and happier.
- Popularity: Weak positive correlations with Dance, Energy, Happy, and Loudness indicate that more popular songs tend to be slightly more danceable, energetic, happy, and loud.
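The takeaways above can also be pulled out programmatically by ranking feature pairs by absolute correlation. A minimal sketch, using a small hypothetical frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the notebook's numeric features
toy = pd.DataFrame({
    "Loud":     [-5, -4, -6, -4, -11],
    "Energy":   [71, 92, 54, 75, 25],
    "Acoustic": [20, 4, 75, 11, 94],
})
corr = toy.corr(numeric_only=True)
# Mask the diagonal, then rank the remaining pairs by absolute strength
pairs = corr.where(~np.eye(len(corr), dtype=bool)).stack().abs().sort_values(ascending=False)
print(pairs.head())
```

Sorting by absolute value surfaces both the very dark and very light tiles of the heatmap at once.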
Interactive Scatterplot Comparing Similarity Between Music Tastes
import plotly.graph_objects as go
# Define colors for each playlist owner
color_map = {'Piero': '#1ED760', 'Nirvit': '#ff00ff'} #1db96e , #b91d82
# Define symbols for each playlist owner
symbol_map = {'Piero': 'circle', 'Nirvit': 'diamond'} #triangle-up
# Define a function to create scatter plot with my original dataset
def create_original_scatter_plot(all_songs):
# Create scatter plot
fig = go.Figure()
# Add text markers when hovering over points
for group, data in all_songs.groupby('Playlist Owner'):
fig.add_trace(go.Scatter(
x=data['Happy'],
y=data['Energy'],
opacity=0.75,
mode='markers',
name=group,
text=data.apply(lambda row: f"Song: {row['Song']}, Artist: {row['Artist']}, Energy: {row['Energy']}, Happiness: {row['Happy']}", axis=1), # Hover text
marker=dict(
color=color_map[group], # Color points based on group
size=10,
symbol=symbol_map.get(group, 'circle'),
line=dict(
color='#2a8ccb',
width=2
)
)
))
# Scatterplot layout
fig.update_layout(
title={
'text': "<b>Top 100 Songs by Mood</b>", # Top 100 Songs by Positivity and Energy Levels
'font': {'size': 14},
'x': 0.5, # Centered title
'y': 0.9 # Adjust vertical position of title
},
xaxis_title="Happiness Level",
yaxis_title="Energy Level",
legend_title="Listener",
width=1070, # Set width to 1000 pixels
height=525, # Set height to 600 pixels
template="plotly_dark",
# Make hover text white
hoverlabel=dict(
font=dict(
color="white" # Text color inside hover label
))
)
# Label song mood quadrants
fig.add_annotation(
x=0, y=105,
text="<b>Chaotic/Angry</b>",
font=dict(
size=12,
color="white",
),
showarrow=False
)
fig.add_annotation(
x= 100, y=105,
text="<b>Happy/Upbeat</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x= 100, y=-5,
text="<b>Chill/Peaceful</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x=0, y=-5,
text="<b>Sad/Depressing</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
# Adding cross section to distinguish mood sectors
# Vertical line
fig.add_shape(
type="line",
x0=50, y0=0,
x1=50, y1=100,
line=dict(
color="white",
width=1,
dash="dash"
)
)
# Horizontal line
fig.add_shape(
type="line",
x0=0, y0=50,
x1=100, y1=50,
line=dict(
color="white",
width=1,
dash="dash"
)
)
# Show the plot
return fig
create_original_scatter_plot(all_songs)
This scatterplot compares the energy and happiness levels of all songs in our Spotify Wrapped playlists. To interpret the plot, it’s important to think about how energy and happiness features interact.
Low Energy + Low Happiness = Sad / Depressing
Low Energy + High Happiness = Chill / Peaceful
High Energy + High Happiness = Happy / Upbeat
High Energy + Low Happiness = Chaotic / Angry
The scatterplot reveals that songs from my playlist are primarily clustered in the upper half of the plot, reflecting a mix of chaotic/angry and happy/upbeat tunes. This clustering could significantly influence a model's predictive power, potentially making the dataset more separable than anticipated. Another notable trend: while Nirvit's music taste is spread fairly evenly across the plot, he gravitates towards a higher proportion of sad and chill music than I do.
Make sure to hover over the points on the scatterplot to see which songs they represent.
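The quadrant rules above can be captured in a small helper function. This is a sketch of my own; `mood_quadrant` and the 50-point midpoint are illustrative labels, not part of the notebook:

```python
def mood_quadrant(happy, energy, midpoint=50):
    """Map a song's Happy and Energy scores (0-100) to a mood quadrant."""
    if energy >= midpoint:
        return "Happy/Upbeat" if happy >= midpoint else "Chaotic/Angry"
    return "Chill/Peaceful" if happy >= midpoint else "Sad/Depressing"

# e.g. "1AM FREESTYLE" (Happy 12, Energy 54) lands in the Chaotic/Angry quadrant
print(mood_quadrant(12, 54))
```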
Prepping Data For Machine Learning Models
Normalize Data
from sklearn.preprocessing import MinMaxScaler
# Remove Artist and Song columns
normalized_songs = all_songs.drop(columns=['Song', 'Artist'])
# Select numerical columns to normalize
columns_to_normalize = ['Popularity', 'BPM', 'Dance', 'Energy', 'Acoustic', 'Instrumental', 'Happy', 'Speech', 'Live', 'Loud', 'Time Signature', 'Time Seconds']
# Initialize MinMaxScaler
scaler = MinMaxScaler()
# Fit the scaler on the selected columns
scaler.fit(normalized_songs[columns_to_normalize])
# Transform the selected columns
normalized_songs[columns_to_normalize] = scaler.transform(normalized_songs[columns_to_normalize])
# Create a new binary response column
normalized_songs['Binary Response'] = (normalized_songs['Playlist Owner'] == 'Piero').astype(int)
# Drop the original 'playlist' column if no longer needed
normalized_songs.drop(columns=['Playlist Owner'], inplace=True)
Now we've got ourselves a normalized dataset!
normalized_songs
| | Popularity | BPM | Dance | Energy | Acoustic | Instrumental | Happy | Speech | Live | Loud | Key | Time Signature | Camelot | Time Seconds | Genre | Binary Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.857143 | 0.550725 | 0.732558 | 0.707071 | 0.202020 | 0.000000 | 0.715789 | 0.000000 | 0.125 | 0.820513 | A#/B♭ Minor | 0.75 | 3A | 0.115156 | Hip Hop | 1 |
| 1 | 0.750000 | 0.260870 | 0.732558 | 0.919192 | 0.040404 | 0.050505 | 0.547368 | 0.000000 | 0.125 | 0.846154 | C♯/D♭ Minor | 0.50 | 12A | 0.127786 | Hip Hop | 1 |
| 2 | 0.714286 | 0.391304 | 0.639535 | 0.535354 | 0.757576 | 0.000000 | 0.094737 | 0.000000 | 0.125 | 0.794872 | C Minor | 0.75 | 5A | 0.052006 | Pop | 1 |
| 3 | 0.904762 | 0.369565 | 0.813953 | 0.747475 | 0.111111 | 0.000000 | 0.789474 | 0.166667 | 0.125 | 0.846154 | G#/A♭ Minor | 0.75 | 1A | 0.131501 | Hip Hop | 1 |
| 4 | 0.952381 | 0.326087 | 0.662791 | 0.737374 | 0.010101 | 0.010101 | 0.800000 | 0.000000 | 0.125 | 0.846154 | E Major | 0.75 | 12B | 0.128529 | Metal | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.535714 | 0.514493 | 0.406977 | 0.717172 | 0.010101 | 0.000000 | 0.400000 | 0.166667 | 0.000 | 0.692308 | C♯/D♭ Major | 0.75 | 3B | 0.089153 | Hip Hop | 0 |
| 96 | 0.511905 | 0.144928 | 0.744186 | 0.242424 | 0.949495 | 0.000000 | 0.863158 | 0.000000 | 0.125 | 0.666667 | E Minor | 0.75 | 9A | 0.026003 | Jazz | 0 |
| 97 | 0.523810 | 0.601449 | 0.581395 | 0.959596 | 0.020202 | 0.000000 | 0.242105 | 0.166667 | 0.375 | 0.846154 | G#/A♭ Major | 0.75 | 4B | 0.089896 | Rock | 0 |
| 98 | 0.904762 | 0.333333 | 0.627907 | 0.434343 | 0.606061 | 0.000000 | 0.147368 | 0.166667 | 0.125 | 0.717949 | G Major | 0.50 | 9B | 0.137444 | R&B | 0 |
| 99 | 0.142857 | 0.094203 | 0.104651 | 0.989899 | 0.000000 | 0.090909 | 0.063158 | 0.500000 | 0.500 | 0.820513 | A Minor | 0.75 | 8A | 0.199851 | Metal | 0 |
200 rows × 16 columns
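As a quick sanity check on what MinMaxScaler is doing, each column is rescaled to [0, 1] via (x - min) / (max - min). The BPM values below are illustrative, not the real column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

bpm = np.array([[148.0], [108.0], [126.0]])  # illustrative BPM values
scaled = MinMaxScaler().fit_transform(bpm)
# The column minimum maps to 0, the maximum to 1, everything else interpolates linearly
print(scaled.ravel())
```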
Setting Up Training and Testing Data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
# Define features (X) and target variable (y)
X = normalized_songs[['Popularity', 'BPM', 'Dance', 'Energy', 'Acoustic', 'Instrumental', 'Happy', 'Speech', 'Live', 'Loud', 'Key', 'Time Signature', 'Camelot', 'Time Seconds', 'Genre']] # Features
y = normalized_songs['Binary Response'] # Target variable Playlist Owner
# Initialize OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse_output=False) # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
# One-hot encode categorical columns
X_encoded = pd.DataFrame(encoder.fit_transform(X[['Key', 'Camelot', 'Genre']])) # Only encode categorical columns
X_encoded.columns = encoder.get_feature_names_out(['Key', 'Camelot', 'Genre']) # Get categorical column names
# Reset indices of X and X_encoded
X.reset_index(drop=True, inplace=True)
X_encoded.reset_index(drop=True, inplace=True)
# Concatenate numerical and encoded categorical columns
X_final = pd.concat([X, X_encoded], axis=1)
# Drop original columns since they have been encoded to new columns
X_final.drop(columns=['Key', 'Camelot', 'Genre'], inplace=True)
# Splitting up the data into training and testing sets (60% training, 40% testing)
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.4, random_state=18, shuffle=True)
Now that the data has been split into testing and training sets, the next step involves creating machine learning models to predict which Spotify Wrapped playlist a song belongs to.
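With `test_size=0.4` on 200 songs, the split yields 120 training rows and 80 test rows. A minimal sketch of the split behavior, using dummy arrays in place of `X_final` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_dummy = np.arange(200).reshape(-1, 1)    # stand-in for X_final (200 songs)
y_dummy = np.array([1] * 100 + [0] * 100)  # 100 songs per playlist owner
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.4, random_state=18, shuffle=True)
print(len(X_tr), len(X_te))
```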
Creating Machine Learning Models
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Create and train the logistic regression model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred_lr = lr_model.predict(X_test)
# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy:", accuracy_lr)
Accuracy: 0.7375
Feature Importance Plot (Logistic Regression)
I developed this feature importance plot function to identify the most and least useful predictors in each model.
import plotly.graph_objects as go
import panel as pn
def plot_linear_feature_importance(model_name):
# Get feature importances
lr_importances = model_name.coef_[0]
indices = np.argsort(lr_importances)[::-1]
# Get feature names
feature_names = X_train.columns
# Create custom color gradient
colors = ['#1DB954', '#2BBE60', '#3AC26C', '#48C778', '#57CB84', '#65D08F', '#74D49B', '#83D9A7', '#91DDB3', '#9FE2BF']
# Create figure
fig = go.Figure()
# Add bars to plot
fig.add_trace(go.Bar(
x=lr_importances[indices][:10], # Grabs the top 10 features
y=[feature_names[i] for i in indices[:10]], # Grabs their corresponding feature names
marker=dict(color=colors),
orientation='h' # Style as horizontal bar chart
))
# Style barplot
fig.update_layout(
title=dict(text="<b>Top 10 Feature Importances</b>", x=0.5, font=dict(size=16, color='white', family='Arial, sans-serif')),
xaxis=dict(title='<b>Importance</b>', titlefont=dict(size=14, color='white', family='Arial, sans-serif')),
yaxis=dict(title='<b>Features</b>', titlefont=dict(size=14, color='white', family='Arial, sans-serif')),
font=dict(size=12, color='white', family='Arial, sans-serif'),
margin=dict(l=100, r=20, t=40, b=20),
height=500, #500
width=700, # 800
template="plotly_dark", # dark mode
# Make hover markers have white text
hoverlabel=dict(
font=dict(
color="white"
)
)
)
return fig # Display plot in dashboard when clicked
# Call function for logistic regression
plot_linear_feature_importance(lr_model)
Creating a Visualization Dataset
To craft scatterplots, we need a streamlined visualization dataset containing only essential columns. This dataset, labeled viz_dataset, is extracted from the original dataset, all_songs, and encompasses descriptive song attributes like ‘Song’, ‘Artist’, ‘Playlist Owner’, in addition to ‘Happy’ and ‘Energy’ levels. The extraction process involves selecting rows corresponding to indices found within the X_test dataset.
# Reset index of the all_songs DataFrame
all_songs_reset_index = all_songs.reset_index(drop=True)
# Extract rows from the original dataset based on indices in X_test
viz_dataset = all_songs_reset_index.loc[X_test.index, ['Song', 'Artist', 'Playlist Owner','Happy', 'Energy']]
viz_dataset
| | Song | Artist | Playlist Owner | Happy | Energy |
|---|---|---|---|---|---|
| 134 | Suite bergamasque, L. 75: III. Clair de lune | Claude Debussy,Philippe Entremont | Nirvit | 4 | 6 |
| 91 | Fair Trade (with Travis Scott) | Drake,Travis Scott | Piero | 29 | 47 |
| 81 | Father Stretch My Hands Pt. 1 | Kanye West | Piero | 44 | 57 |
| 108 | 愛してる | callin' | Nirvit | 31 | 31 |
| 170 | Disfarça E Chora | Cartola | Nirvit | 96 | 44 |
| ... | ... | ... | ... | ... | ... |
| 126 | Kiss the Ladder | Fleshwater | Nirvit | 25 | 99 |
| 37 | lose | Travis Scott | Piero | 28 | 56 |
| 27 | Doin' it Right (feat. Panda Bear) | Daft Punk,Panda Bear | Piero | 19 | 45 |
| 2 | 1AM FREESTYLE | Joji | Piero | 12 | 54 |
| 77 | Hot Air Balloon | Don Diablo,AR/CO | Piero | 57 | 71 |
80 rows × 5 columns
Creating a Scatterplot Function to Show Logistic Regression Classification Results
This function can create scatterplots for any type of model, whether linear, tree-based, or distance-based. The plan is to use it in the dashboard to visually represent each model's song classification predictions.
import plotly.graph_objects as go
def model_plot(y_pred):
# Define colors for each playlist owner
color_map = {1: '#1ED760', 0: '#ff00ff'} #1db96e , #b91d82
# Define symbols for each playlist owner
symbol_map = {1: 'circle', 0: 'diamond'}
# Map class labels to name legend labels
legend_labels = {1: 'Piero', 0: 'Nirvit'}
# Replace prediction labels (1,0) for names (Piero, Nirvit) in the legend
legend_names = [legend_labels[label] for label in color_map.keys()]
# Add the model's predictions (`y_pred`) as a new column
viz_dataset['Predicted Owner'] = y_pred
# Create scatter plot
fig = go.Figure()
# Add text markers when hovering over points
for group, data in viz_dataset.groupby('Predicted Owner'):
fig.add_trace(go.Scatter(
x=data['Happy'],
y=data['Energy'],
opacity=0.75,
mode='markers',
name=legend_labels[group],
text=data.apply(lambda row: f"Song: {row['Song']}, Artist: {row['Artist']}, Energy: {row['Energy']}, Happiness: {row['Happy']}", axis=1), # Hover text
marker=dict(
color=color_map[group], # Color points based on group
size=10,
symbol=symbol_map.get(group, 'circle'),
line=dict(
color='#2a8ccb', ##2a8ccb
width=2
)
)
))
# Change scatterplot appearance / styles
fig.update_layout(
title={
'text': "<b>Top 100 Songs by Mood</b>", # Top 100 Songs by Positivity and Energy Levels
'font': {'size': 14},
'x': 0.5, # Centered title
'y': 0.9 # Adjust vertical position of title
},
xaxis_title="Happiness Level",
yaxis_title="Energy Level",
legend_title="Listener",
width=1070,
height=525,
template="plotly_dark",
# Make hover text white
hoverlabel=dict(
font=dict(
color="white"
)
)
)
# Label song mood quadrants
fig.add_annotation(
x=0, y=105,
text="<b>Chaotic/Angry</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x= 100, y=105,
text="<b>Happy/Upbeat</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x= 100, y=-5,
text="<b>Chill/Peaceful</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
fig.add_annotation(
x=0, y=-5,
text="<b>Sad/Depressing</b>",
font=dict(
size=12,
color="white"
),
showarrow=False
)
# Adding cross section to distinguish mood sectors
# Vertical line
fig.add_shape(
type="line",
x0=50, y0=0,
x1=50, y1=100,
line=dict(
color="white",
width=1,
dash="dash"
)
)
# Horizontal line
fig.add_shape(
type="line",
x0=0, y0=50,
x1=100, y1=50,
line=dict(
color="white",
width=1,
dash="dash"
)
)
# Show scatterplot
return fig
model_plot(y_pred_lr) # Scatterplot for logistic regression
Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
# Create random forest model
rf_model = RandomForestClassifier(n_estimators=1000, random_state=18)
# Train Model
rf_model.fit(X_train, y_train)
# Predictions
y_pred_rf = rf_model.predict(X_test)
# Evaluate model performance
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print("Accuracy:", rf_accuracy)
Accuracy: 0.7875
Feature Importance Plot (Random Forest)
Since linear models and tree-based models store their feature importances differently, two separate feature importance plot functions are required.
In linear models, such as linear regression or logistic regression, feature importance is derived directly from the coefficients assigned to each feature during the model fitting process. These coefficients represent the magnitude and direction of the relationship between each feature and the target variable. Therefore, accessing the .coef_ attribute retrieves these coefficients, which can be interpreted as feature importances.
In tree-based models like Random Forests, feature importance is typically computed based on how much each feature contributes to decreasing impurity (e.g., Gini impurity or entropy) across all the trees in the forest. The .feature_importances_ attribute of a trained Random Forest model provides the importance scores for each feature, calculated based on this criterion.
So, while linear models directly use the coefficients as feature importance, Random Forest models use a measure of impurity decrease to determine feature importance across the ensemble of trees.
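The difference is easy to see on toy data where one feature drives the label and the other is pure noise. This is a sketch with synthetic data, not the notebook's models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)  # feature 0 determines the label; feature 1 is noise

lr = LogisticRegression().fit(X_toy, y_toy)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

lr_imp = lr.coef_[0]              # signed coefficients (direction + magnitude)
rf_imp = rf.feature_importances_  # non-negative impurity decreases, summing to 1
print(lr_imp, rf_imp)
```

In both models the informative feature dominates, but only the linear coefficients carry a sign.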
import plotly.graph_objects as go
import panel as pn
def plot_tree_feature_importance(model_name):
# Get feature importances for tree-based model
lr_importances = model_name.feature_importances_
indices = np.argsort(lr_importances)[::-1]
# Get corresponding feature names
feature_names = X_train.columns
# Create custom color gradient
colors = ['#1DB954', '#2BBE60', '#3AC26C', '#48C778', '#57CB84', '#65D08F', '#74D49B', '#83D9A7', '#91DDB3', '#9FE2BF']
# Create figure
fig = go.Figure()
# Add bars to plot
fig.add_trace(go.Bar(
x=lr_importances[indices][:10], # Grab top 10 features in the model
y=[feature_names[i] for i in indices[:10]], # Get corresponding feature names
marker=dict(color=colors), # assign color gradient to bars
orientation='h' # Style as horizontal barplot
))
# Style barplot
fig.update_layout(
title=dict(text="<b>Top 10 Feature Importances</b>", x=0.5, font=dict(size=16, color='white', family='Arial, sans-serif')),
xaxis=dict(title='<b>Importance</b>', titlefont=dict(size=14, color='white', family='Arial, sans-serif')),
yaxis=dict(title='<b>Features</b>', titlefont=dict(size=14, color='white', family='Arial, sans-serif')),
font=dict(size=12, color='white', family='Arial, sans-serif'),
margin=dict(l=100, r=20, t=40, b=20),
height=500,
width=700,
template="plotly_dark",
# Make hover text white
hoverlabel=dict(
font=dict(
color="white"
)
)
)
return fig # display plot in dashboard when clicked
# Plot random forest barplot
plot_tree_feature_importance(rf_model)
Boosted Trees
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Create boosted trees model
boost_model = GradientBoostingClassifier(n_estimators=1000,
max_depth=3,
learning_rate=0.1,
min_samples_split=3)
# Fit the model to training set
boost_model.fit(X_train, y_train)
# Predictions
y_pred_boost = boost_model.predict(X_test)
# Evaluate boosted trees model accuracy
boost_accuracy = accuracy_score(y_test, y_pred_boost)
print("Accuracy:", boost_accuracy)
Accuracy: 0.8
Feature Importance Plot (Boosted Trees)
plot_tree_feature_importance(boost_model)
K-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Create K-nearest neighbors classifier
knn_model = KNeighborsClassifier(n_neighbors=5) # You can adjust the number of neighbors as needed
# Fit the model to training set
knn_model.fit(X_train, y_train)
# Make predictions
y_pred_knn = knn_model.predict(X_test)
# Calculate accuracy
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print("Accuracy:", knn_accuracy)
Accuracy: 0.6375
Feature Importance Plot (KNN)
Unfortunately, a feature importance bar plot cannot be created because the K-Nearest Neighbors algorithm doesn't inherently provide feature importance scores the way tree-based algorithms or linear models do. Instead, K-Nearest Neighbors is a distance-based algorithm that makes predictions by measuring proximity between data points, typically with Euclidean distance. Due to the lack of feature importance scores and its low accuracy, it will not be included in the final dashboard.
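One model-agnostic workaround, not used in this project, is scikit-learn's permutation importance, which measures how much shuffling each feature's values degrades model accuracy. A sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(18)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)  # only feature 0 is informative

knn = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)
# Shuffle each column in turn and record the drop in accuracy
result = permutation_importance(knn, X_toy, y_toy, n_repeats=10, random_state=18)
print(result.importances_mean)
```

Shuffling the informative feature tanks accuracy, while shuffling the noise columns barely matters, so the importances recover the right ranking even for a distance-based model.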
Support Vector Machine
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Create Support Vector Machine Classifier
svm_model = SVC(kernel='linear') # Other kernel options include 'rbf' and 'poly'
# Fit the model to training set
svm_model.fit(X_train, y_train)
# Make predictions
y_pred_svm = svm_model.predict(X_test)
# Calculate accuracy
svm_accuracy = accuracy_score(y_test, y_pred_svm)
print("Accuracy:", svm_accuracy)
Accuracy: 0.75
Feature Importance Plot (SVM)
plot_linear_feature_importance(svm_model)
Decision Trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Create Decision Tree Classifier
dec_tree_model = DecisionTreeClassifier()
# Fit the model to training set
dec_tree_model.fit(X_train, y_train)
# Make predictions
y_pred_dec_tree = dec_tree_model.predict(X_test)
# Calculate accuracy
dec_tree_accuracy = accuracy_score(y_test, y_pred_dec_tree)
print("Accuracy:", dec_tree_accuracy)
Accuracy: 0.7875
Feature Importance Plot (Decision Tree)
plot_tree_feature_importance(dec_tree_model)
Gauge Visualization
Now, a gauge visualization function is developed to showcase model accuracy on the dashboard.
import panel as pn
import plotly.graph_objects as go
# Create gauge visualization function
def gauge_accuracy_viz(model_performance, last_reference):
    # Create gauge chart; the Indicator computes the delta from 'reference'
    # below to show whether the current model performs better or worse
    fig = go.Figure(go.Indicator(
        mode="gauge+number+delta",
        value=model_performance * 100,
        domain={'x': [0, 1], 'y': [0, 1]},
        title={'text': "Accuracy", 'font': {'size': 24, 'color': "#00ff7f"}},
        delta={'reference': last_reference * 100, 'increasing': {'color': "#00ff00"}, 'decreasing': {'color': "#ff7373"}},
        gauge={
            'axis': {'range': [None, 100], 'tickwidth': 2, 'tickcolor': "#70D2A2"},
            'bar': {'color': "#1DB954"},
            'bgcolor': "white",
            'borderwidth': 3,
            'bordercolor': "#00ff7f",
            'steps': [
                {'range': [0, 50], 'color': '#b91d82'},
                {'range': [50, 100], 'color': '#fff68f'}],
            'threshold': {
                'line': {'color': "#cc0000", 'width': 4},
                'thickness': 0.75,
                'value': model_performance * 100}}
    ))
    # Add percent sign to value and delta
    fig.update_traces(number={'suffix': '%'}, delta={'suffix': '%'})
    # Visualize gauge in dark mode
    fig.update_layout(template="plotly_dark", font={'color': "#00ff7f", 'family': "Arial"}, height=500, width=364)
    return fig
Panel Dashboard
import panel as pn
import pandas as pd
import plotly.graph_objects as go
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
pn.extension('echarts')
# Create buttons for selecting models
button_original_dataset = pn.widgets.Button(name='Original Dataset')
button_logistic_regression = pn.widgets.Button(name='Logistic Regression')
button_random_forest = pn.widgets.Button(name='Random Forest')
button_boosted_trees = pn.widgets.Button(name='Boosted Trees')
button_decision_trees = pn.widgets.Button(name='Decision Trees')
button_svm = pn.widgets.Button(name='Support Vector Machine')
# Note: no button for K-Nearest Neighbors, since it lacks feature importance
# scores and underperformed the other models
last_reference = 0 # Global variable storing the previous model's accuracy score
# Define callback functions for the buttons
def on_click_original_dataset(event):
    global last_reference
    scatter_plot.object = create_original_scatter_plot(all_songs)
    feature_importance_plot.object = corr_plot(all_songs) # Swap in a corr plot since there are no features to show
    gauge_pane.object = gauge_accuracy_viz(0, 0)
    last_reference = 0 # Reset the stored accuracy so the next model's delta starts fresh
def on_click_logistic_regression(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_lr)
    feature_importance_plot.object = plot_linear_feature_importance(lr_model)
    gauge_pane.object = gauge_accuracy_viz(accuracy_lr, last_reference)
    last_reference = accuracy_lr
def on_click_random_forest(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_rf)
    feature_importance_plot.object = plot_tree_feature_importance(rf_model)
    gauge_pane.object = gauge_accuracy_viz(rf_accuracy, last_reference)
    last_reference = rf_accuracy
def on_click_boosted_trees(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_boost)
    feature_importance_plot.object = plot_tree_feature_importance(boost_model)
    gauge_pane.object = gauge_accuracy_viz(boost_accuracy, last_reference)
    last_reference = boost_accuracy
def on_click_decision_trees(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_dec_tree)
    feature_importance_plot.object = plot_tree_feature_importance(dec_tree_model)
    gauge_pane.object = gauge_accuracy_viz(dec_tree_accuracy, last_reference)
    last_reference = dec_tree_accuracy
def on_click_svm(event):
    global last_reference
    scatter_plot.object = model_plot(y_pred_svm)
    feature_importance_plot.object = plot_linear_feature_importance(svm_model)
    gauge_pane.object = gauge_accuracy_viz(svm_accuracy, last_reference)
    last_reference = svm_accuracy
# Bind callbacks when each button is clicked
button_original_dataset.on_click(on_click_original_dataset)
button_logistic_regression.on_click(on_click_logistic_regression)
button_random_forest.on_click(on_click_random_forest)
button_boosted_trees.on_click(on_click_boosted_trees)
button_decision_trees.on_click(on_click_decision_trees)
button_svm.on_click(on_click_svm)
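The callbacks above are near-identical; they differ only in which predictions, model, and accuracy they reference. One way to cut the duplication is a closure factory that captures the per-model accuracy and tracks the previous score in shared state instead of a global. A sketch with the pane updates reduced to plain bookkeeping (the dict `state` and `make_callback` are illustrative names, not from the original code):

```python
# Sketch: build one click handler per model with a factory, capturing the
# model's accuracy and recording the previous score in a shared dict.
def make_callback(accuracy, state):
    def on_click(event):
        previous = state.get('last_reference', 0)
        state['delta'] = accuracy - previous  # what the gauge would display
        state['last_reference'] = accuracy
    return on_click

state = {}
make_callback(0.75, state)(None)    # simulate clicking the SVM button
make_callback(0.7875, state)(None)  # then the Decision Trees button
print(state['delta'])  # ~0.0375 improvement over the previous model
```

In the real dashboard, the factory would also take the prediction array, the fitted model, and the importance-plot function as arguments, so each `pn.widgets.Button` could be wired with a single `make_callback(...)` call.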
# Create empty Plotly panes; the button callbacks populate them via .object
scatter_plot = pn.pane.Plotly()
feature_importance_plot = pn.pane.Plotly()
gauge_pane = pn.pane.Plotly()
# Create logo pane
panel_logo = pn.pane.PNG(
'/Users/piero/Downloads/Spotify_Project/Spotify_Logo_RGB_Green.png',
width=150, height=95, align='center'
)
text2 = 'Select a model using the buttons above to visualize its performance.'
text3 = '[View dashboard code](https://github.com/suppiero/spotify_classification_dash)'
# Dashboard layout
template = pn.template.FastListTemplate(
    theme="dark",
    logo='/Users/piero/Downloads/Spotify_Project/Spotify_Logo_RGB_Green.png',
    title="Visualizing Spotify Song Classification Performance",
    sidebar=[pn.pane.Markdown("## Reset"), button_original_dataset,
             pn.pane.Markdown("## Models"), button_logistic_regression, button_random_forest,
             button_boosted_trees, button_decision_trees, button_svm, text2, text3],
    main=[
        pn.Row(pn.Column(scatter_plot, sizing_mode='stretch_both', margin=(-20, 0, 0, -24))),
        pn.Row(pn.Column(feature_importance_plot, margin=(11, 0, 0, -24)),
               pn.Column(gauge_pane, margin=(11, 0, 0, -13)),
               sizing_mode='stretch_both', min_height=400, min_width=950)
    ],
    theme_toggle=False,
    accent_base_color="#0bff38", # change color of hyperlink text
    header_background="#1f2630", # change color of header banner | previous color: #009E60
    header_color='#0bff38', # change color of header text | previous color: #57ff76
    main_max_width='900px', # maximum width of the main area containing all plots
    main_layout=None,
    sidebar_width=172, # adjust sidebar size
    font='https://fonts.googleapis.com/css2?family=Raleway:ital,wght@0,100;1,100&display=swap'
)
# Load original dataset button images on startup
on_click_original_dataset(None)
# Display the dashboard
template.show()
Launching server at http://localhost:50831
Final Thoughts
Overall, this project was an amazing opportunity to delve into music data analysis, machine learning, and dashboarding. It was fascinating to uncover patterns within our Spotify Wrapped playlists, statistically compare our music tastes, and gain deeper insight into our listening habits. While our classification models weren't perfect, they still produced solidly accurate results, hinting at meaningful distinctions between the songs Nirvit and I favor.
For those interested in conducting a similar analysis using Python, I recommend exploring my GitHub repository dedicated to this project.
Sources
I’d like to extend a special thank you to the wonderful data analysts who inspired me to make this project, offering invaluable ideas and sharing fantastic source code!
- Whose Song is it Anyway? by Lewis White
Citation
@online{trujillo2024,
author = {Trujillo, Piero},
title = {Spotify {Classification} {Dashboard} and {Model} {Analysis}},
date = {2024-04-01},
url = {https://suppiero.github.io/projects/spotify_classification_dashboard/},
langid = {en}
}