Music Taste Analysis
When I met my first girlfriend's friends for the first time, somebody innocently tried to make conversation with me by asking about my music taste. I clammed up. How could I answer such a complicated question convincingly? There are so many things! I became annoyed with myself for not knowing what to say, not even any genre names. How uncool. My frustration leaked out, and my then-girlfriend later told me someone had asked what was wrong with me. (Hi, Sam.)
But now when someone asks, I have the perfect answer. "Would you like to see my data analysis!?" Enjoy lol.
How to Use
My first pass at this depended upon Watsonbox's Exportify, but I decided I didn't like his version because of bugs and inadequate output detail. So I went and forked it, cleaned up the code, and hosted it myself.
As such, the code here depends on .csv inputs in the format output by my version.
- To get started, hop on over there, sign in to Spotify to give the app access to your playlists, and export whatever you like.
- Next, either download this .ipynb file and run the notebook yourself or launch it in Binder.
- Either put the downloaded .csv in the same directory as the notebook, or upload it in Binder.
- Open the .ipynb through your browser, update the filename variable in the first code cell to point to your playlist instead, and shift+enter in each following code cell to generate the corresponding plot. (Or select Cell -> Run All from the menu to make all graphs at once.)
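If you'd like to check an export before running every cell, a sketch like the one below might help. It uses only the standard library and assumes that the columns that matter are the three this notebook looks up by name (Artist Name(s), Added At, and Genres). The helper missing_columns and the inline sample are made up for illustration; they are not part of Exportify or of this notebook.

```python
import csv
import io

# Columns the notebook's code cells read by name.
REQUIRED_COLUMNS = {"Artist Name(s)", "Added At", "Genres"}


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header, sorted."""
    header = next(csv.reader(csv_file), [])
    return sorted(REQUIRED_COLUMNS - set(header))


# Hypothetical stand-in for an exported file. For a real export you would
# pass open(filename, newline="") instead.
sample = io.StringIO(
    "Spotify ID,Track Name,Artist Name(s),Added At\n"
    "2bdZDXDoFLzazaomjzoER8,Highschool Lover,Air,2014-12-28T00:59:35Z\n"
)
print(missing_columns(sample))  # prints ['Genres']
```

An empty list means every column the plots need is present; anything else names what the export lacks.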
Read the Data
For years I've been accumulating my favorite songs in a single master playlist called music that tickles my fancy. It's thousands of songs. This is what I'll be analyzing. Let's take a look at the first few rows to get a sense of what we're dealing with.
filename = 'music_that_tickles_my_fancy.csv'
from matplotlib import pyplot
import seaborn
import pandas
from collections import defaultdict
from scipy.stats import pareto, gamma
from datetime import date
# read the data
data = pandas.read_csv(filename)
print("total songs:", data.shape[0])
print(data[:3])
total songs: 5256
               Spotify ID                                     Artist IDs  \
0  3T9HSgS5jBFdXIBPav51gj  0nJvyjVTb8sAULPYyA1bqU,5yxyJsFanEAuwSM5kOuZKc
1  2bdZDXDoFLzazaomjzoER8                         1P6U1dCeHxPui5pIrGmndZ
2  1fE3ddAlmjJ99IIfLgZjTy                         0id62QV2SZZfvBn9xpmuCl

                   Track Name  \
0  Fanfare for the Common Man
1            Highschool Lover
2             I Need a Dollar

                                          Album Name  \
0  Copland Conducts Copland - Expanded Edition (F...
1                                    Virgin Suicides
2                                    I Need A Dollar

                            Artist Name(s) Release Date  Duration (ms)  \
0  Aaron Copland,London Symphony Orchestra         1963         196466
1                                      Air         2000         162093
2                               Aloe Blacc   2010-03-16         244373

   Popularity              Added By              Added At  ...  Key  Loudness  \
0          32  spotify:user:pvlkmrv  2014-12-28T00:57:17Z  ...   10   -15.727
1           0  spotify:user:pvlkmrv  2014-12-28T00:59:35Z  ...    1   -15.025
2           0  spotify:user:pvlkmrv  2014-12-28T01:03:38Z  ...    8   -11.829

   Mode  Speechiness  Acousticness  Instrumentalness  Liveness  Valence  \
0     1       0.0381         0.986             0.954    0.0575   0.0377
1     0       0.0302         0.952             0.959    0.2520   0.0558
2     0       0.0387         0.178             0.000    0.0863   0.9620

     Tempo  Time Signature
0  104.036               4
1  130.052               4
2   95.509               4

[3 rows x 23 columns]
Artist Bar Chart
Number of songs binned by artist.
# count songs per artist
artists = defaultdict(int)
for i,song in data.iterrows():
    if isinstance(song['Artist Name(s)'], str):
        for musician in song['Artist Name(s)'].split(','):
            artists[musician] += 1
# sort for chart
artists = pandas.DataFrame(artists.items(), columns=['Artist', 'Num Songs']
                          ).sort_values('Num Songs', ascending=False).reset_index(drop=True)
print("number of unique artists:", artists.shape[0])
pyplot.figure(figsize=(18, 6))
pyplot.bar(artists['Artist'], artists['Num Songs'])
pyplot.xticks(visible=False)
pyplot.xlabel(artists.columns[0])
pyplot.ylabel(artists.columns[1])
pyplot.title('everybody')
pyplot.show()
number of unique artists: 2612
Note I've attributed songs with multiple artists to multiple bars, so the integral here is the number of unique song-artist pairs, not just the number of songs.
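To make the difference between songs and song-artist pairs concrete, here is a tiny standard-library sketch of the same counting idea. The four rows are hypothetical stand-ins for the Artist Name(s) column (the first three artist strings echo the sample rows printed earlier, and the repeated "Air" row is invented so that one artist appears twice).

```python
from collections import Counter

# Hypothetical values of the 'Artist Name(s)' column, one string per song.
artist_fields = [
    "Aaron Copland,London Symphony Orchestra",  # one song, two artists
    "Air",
    "Aloe Blacc",
    "Air",
]

# One count per (song, artist) pair, mirroring the split(',') loop above.
pair_counts = Counter(
    name for field in artist_fields for name in field.split(",")
)

num_songs = len(artist_fields)
num_pairs = sum(pair_counts.values())
print(num_songs, num_pairs)  # prints 4 5
```

The bars sum to 5 even though there are only 4 songs, because the collaboration lands in two bins.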
The artist bar chart seems to follow a Pareto distribution. Let's try to fit one.
# Let's find the best parameters. Need x, y data 'sampled' from the distribution for
# parameter fit.
y = []
for i in range(artists.shape[0]):
    for j in range(artists['Num Songs'][i]):
        y.append(i) # just let y have index[artist] repeated for each song
# sanity check. If the dataframe isn't sorted properly, y isn't either.
#pyplot.figure()
#pyplot.hist(y, bins=30)
# The documentation is pretty bad, but this is okay:
# https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python
param = pareto.fit(y, 100)
pareto_fitted = len(y)*pareto.pdf(range(artists.shape[0]), *param)
# param = gamma.fit(y) # gamma fits abysmally; see for yourself by uncommenting
# gamma_fitted = len(y)*gamma.pdf(range(artists.shape[0]), *param)
pyplot.figure(figsize=(18, 6))
pyplot.bar(artists['Artist'], artists['Num Songs'])
pyplot.plot(pareto_fitted, color='r')
#pyplot.plot(gamma_fitted, color='g')
pyplot.xticks(visible=False)
pyplot.xlabel(artists.columns[0])
pyplot.ylabel(artists.columns[1])
pyplot.title('everybody');
Best fit is still too sharp for the data, and I tried for a good long while to get it to fit better, so I conclude this doesn't quite fit a power law.
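A cruder cross-check, and a different technique from the scipy fit above, is to look at the counts on log-log axes: a pure power law in rank would be a straight line there, so the least-squares slope and the scatter around it give a quick read. The sketch below is only that, a sketch. The function loglog_slope is a made-up helper, and the synthetic counts are constructed to follow count = 1000 / rank exactly, so they say nothing about my actual playlist.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), ranks from 1.

    Assumes every count is positive.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data that is an exact power law with exponent -1.
synthetic = [1000 / rank for rank in range(1, 201)]
print(loglog_slope(synthetic))  # very close to -1
```

With the real data you would pass in the Num Songs column as a list and then look at how far the points stray from the fitted line.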
Let's plot the top 50 artists so we can actually read who they are.
pyplot.figure(figsize=(18, 10))
pyplot.bar(artists['Artist'][:50], artists['Num Songs'][:50])
pyplot.xticks(rotation=80)
pyplot.xlabel(artists.columns[0])
pyplot.ylabel(artists.columns[1])
pyplot.title('top 50');
Volume Added Over Time
My proclivity to add songs to this playlist is a proxy for my interest in listening to music generally. How has it waxed and waned over time?
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters() # to suppress warning
# Plot of added volume over time
parse_date = lambda d:(int(d[:4]), int(d[5:7]), int(d[8:10]))
pyplot.figure(figsize=(10, 6))
pyplot.hist([date(*parse_date(d)) for d in data['Added At']], bins=30)
pyplot.title('volume added over time');
The initial spike is from when I first started using Spotify as the home for this collection and manually added hundreds from my previous list.
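If you'd rather see exact numbers than eyeball 30 histogram bins, counting additions per calendar month is one option. This is a minimal standard-library sketch, assuming the Added At strings always look like the ones printed earlier (for example 2014-12-28T00:57:17Z). The helper month_of is made up, and the last timestamp in the sample is invented.

```python
from collections import Counter
from datetime import datetime

# The first three values come from the sample rows printed earlier;
# the last one is hypothetical.
added_at = [
    "2014-12-28T00:57:17Z",
    "2014-12-28T00:59:35Z",
    "2014-12-28T01:03:38Z",
    "2015-01-03T12:00:00Z",
]


def month_of(timestamp):
    """Turn an 'Added At' timestamp into a 'YYYY-MM' label."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime("%Y-%m")


per_month = Counter(month_of(t) for t in added_at)
for month in sorted(per_month):
    print(month, per_month[month])
```

Using strptime instead of slicing the string means a malformed timestamp raises an error rather than silently producing a wrong date.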
Eclecticness Measure (Frequency Transform)
This one is a personal favorite. I want to know how many of my songs are one-offs from that artist for me (just individual pieces I found fantastic and ended up adding after a few listens), how many are two-offs, et cetera. I know it must be heavily skewed toward the low numbers.
# bar chart of first bar chart == hipster diversity factor
frequency = defaultdict(int)
for n in artists['Num Songs']:
    frequency[n] += n
frequency = pandas.DataFrame(frequency.items(), columns=['Unique Count', 'Volume']
                            ).sort_values('Volume', ascending=False)
print("number of song-artist pairs represented in the eclecticness chart:",
sum(frequency['Volume']))
pyplot.figure(figsize=(10, 6))
pyplot.bar(frequency['Unique Count'].values, frequency['Volume'].values)
pyplot.title('volume of songs binned by |songs from that artist|')
pyplot.xlabel('quasi-frequency domain')
pyplot.ylabel(frequency.columns[1]);
number of song-artist pairs represented in the eclecticness chart: 5973
So, yes, it's much more common for an artist to make it onto my list a few times than many times. In fact, a plurality of my favorite songs come from artists who appear only once.
Conversely, this view also makes stark those few musicians from whom I've collected dozens.
Note that here, as in the artist bar charts, some songs are counted more than once, because where artists collaborated I listed the song in each artist's bin.
Genres Bar Chart
Alright, enough messing around. All the above were possible with the output from Watsonbox's Exportify. Let's get to the novel stuff you came here for.
People describe music by genre. As we'll see, genre names are flippin' hilarious and extremely varied, but in theory if I cluster around a few, that should give you a flavor of my tastes.
# count songs per genre
genres = defaultdict(int)
for i,song in data.iterrows():
    if isinstance(song['Genres'], str): # sometimes there aren't any, and this is NaN
        for genre in song['Genres'].split(','):
            if len(genre) > 0: # skip empty strings, which show up as if they were a genre
                genres[genre] += 1
# sort for chart
genres = pandas.DataFrame(genres.items(), columns=['Genre', 'Num Songs']
                         ).sort_values('Num Songs', ascending=False).reset_index(drop=True)
print("number of unique genres:", genres.shape[0])
pyplot.figure(figsize=(18, 6))
pyplot.bar(genres['Genre'], genres['Num Songs'])
pyplot.xticks(visible=False)
pyplot.xlabel(genres.columns[0])
pyplot.ylabel(genres.columns[1])
pyplot.title('All the genera');
number of unique genres: 1138
So many! Let's do the same thing as with the artists and for giggles see if it fits a power law.
y = []
for i in range(genres.shape[0]):
    for j in range(genres['Num Songs'][i]):
        y.append(i)
# sanity check
#pyplot.figure()
#pyplot.hist(y, bins=30)
param = pareto.fit(y, 100)
pareto_fitted = len(y)*pareto.pdf(range(genres.shape[0]), *param)
pyplot.figure(figsize=(18, 6))
pyplot.bar(genres['Genre'], genres['Num Songs'])
pyplot.plot(pareto_fitted, color='r')
pyplot.xticks(visible=False)
pyplot.xlabel(genres.columns[0])
pyplot.ylabel(genres.columns[1])
pyplot.title('All the genera');
Still too sharp, but fits better than with the artists.
Let's look at the top 50 so we can read the names.
pyplot.figure(figsize=(18, 10))
pyplot.bar(genres['Genre'][:50], genres['Num Songs'][:50])
pyplot.xticks(rotation=80)
pyplot.xlabel(genres.columns[0])
pyplot.ylabel(genres.columns[1])
pyplot.title('top 50');
"Indie poptimism" lol. wtf? "Dreamo", "Vapor soul", "Freak folk", "Tropical house", "Post-grunge", "Hopebeat", "Noise pop", "Mellow gold"
These are too good. Next time someone asks me my music taste, I'm definitely using these.
If these are the most popular names, what are the really unique ones at the bottom of the chart?
pyplot.figure(figsize=(18, 1))
pyplot.bar(genres['Genre'][-50:], genres['Num Songs'][-50:])
pyplot.xticks(rotation=80)
pyplot.xlabel(genres.columns[0])
pyplot.ylabel(genres.columns[1])
pyplot.title('bottom 50');