Music Taste Analysis

When I met my first girlfriend's friends for the first time, somebody innocently tried to make conversation with me by asking about my music taste. I clammed up. How could I answer such a complicated question convincingly? There are so many things! I became annoyed with myself for not knowing what to say, not even any genre names. How uncool. My frustration leaked out, and my then-girlfriend later told me someone had asked what was wrong with me. (Hi, Sam.)

But now when someone asks, I have the perfect answer. "Would you like to see my data analysis!?" Enjoy lol.

How to Use

My first pass at this depended on Watsonbox's Exportify, but bugs and inadequate output detail put me off his version. So I forked it, cleaned up the code, and hosted it myself.

As such, the code here depends on .csv inputs in the format output by my version.

  1. To get started, hop on over there, sign in to Spotify to give the app access to your playlists, and export whatever you like.
  2. Next, either download this .ipynb file and run the notebook yourself or launch it in Binder.
  3. Either put the downloaded .csv in the same directory as the notebook, or upload it in Binder.
  4. Open the .ipynb through your browser, update the filename variable in the first code cell to point to your playlist instead, and shift+enter in each following code cell to generate the corresponding plot. (Or select Cell -> Run All from the menu to make all graphs at once.) If you want to sanity-check your export first, see the sketch just below.
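If you want to sanity-check an export before running everything, here's a minimal sketch that confirms the .csv has the columns this notebook relies on. The filename is a placeholder; substitute your own exported file.

import pandas

# Optional sanity check: confirm the export contains the columns used below.
# 'my_playlist.csv' is a placeholder -- point it at your own exported file.
df = pandas.read_csv('my_playlist.csv')
expected = ['Track Name', 'Artist Name(s)', 'Genres', 'Release Date',
            'Duration (ms)', 'Popularity', 'Added At', 'Valence', 'Tempo']
missing = [c for c in expected if c not in df.columns]
print('missing columns:', missing if missing else 'none')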

Read the Data

For years I've been accumulating my favorite songs in a single master playlist called "music that tickles my fancy". It's thousands of songs. This is what I'll be analyzing. Let's take a look at the first few rows to get a sense of what we're dealing with.

In [1]:
filename = 'music_that_tickles_my_fancy.csv'

from matplotlib import pyplot
import seaborn
import pandas
from collections import defaultdict
from scipy.stats import pareto, gamma
from datetime import date

# read the data
data = pandas.read_csv(filename)
print("total songs:", data.shape[0])
print(data[:3])
total songs: 5256
               Spotify ID                                     Artist IDs  \
0  3T9HSgS5jBFdXIBPav51gj  0nJvyjVTb8sAULPYyA1bqU,5yxyJsFanEAuwSM5kOuZKc   
1  2bdZDXDoFLzazaomjzoER8                         1P6U1dCeHxPui5pIrGmndZ   
2  1fE3ddAlmjJ99IIfLgZjTy                         0id62QV2SZZfvBn9xpmuCl   

                   Track Name  \
0  Fanfare for the Common Man   
1            Highschool Lover   
2             I Need a Dollar   

                                          Album Name  \
0  Copland Conducts Copland - Expanded Edition (F...   
1                                    Virgin Suicides   
2                                    I Need A Dollar   

                            Artist Name(s) Release Date  Duration (ms)  \
0  Aaron Copland,London Symphony Orchestra         1963         196466   
1                                      Air         2000         162093   
2                               Aloe Blacc   2010-03-16         244373   

   Popularity              Added By              Added At  ... Key  Loudness  \
0          32  spotify:user:pvlkmrv  2014-12-28T00:57:17Z  ...  10   -15.727   
1           0  spotify:user:pvlkmrv  2014-12-28T00:59:35Z  ...   1   -15.025   
2           0  spotify:user:pvlkmrv  2014-12-28T01:03:38Z  ...   8   -11.829   

   Mode  Speechiness  Acousticness  Instrumentalness  Liveness  Valence  \
0     1       0.0381         0.986             0.954    0.0575   0.0377   
1     0       0.0302         0.952             0.959    0.2520   0.0558   
2     0       0.0387         0.178             0.000    0.0863   0.9620   

     Tempo  Time Signature  
0  104.036               4  
1  130.052               4  
2   95.509               4  

[3 rows x 23 columns]

Artist Bar Chart

Number of songs binned by artist.

In [2]:
# count songs per artist
artists = defaultdict(int)
for i, song in data.iterrows():
    if isinstance(song['Artist Name(s)'], str):
        for musician in song['Artist Name(s)'].split(','):
            artists[musician] += 1

# sort for chart
artists = pandas.DataFrame(artists.items(), columns=['Artist', 'Num Songs']
                          ).sort_values('Num Songs', ascending=False).reset_index(drop=True)
print("number of unique artists:", artists.shape[0])

pyplot.figure(figsize=(18, 6))
pyplot.bar(artists['Artist'], artists['Num Songs'])
pyplot.xticks(visible=False)
pyplot.xlabel(artists.columns[0])
pyplot.ylabel(artists.columns[1])
pyplot.title('everybody')
pyplot.show()
number of unique artists: 2612

Note that I've attributed songs with multiple artists to multiple bars, so the total across all bars is the number of song-artist pairs, not the number of songs.

It seems to follow a Pareto distribution. Let's try to fit one.

In [3]:
# Let's find the best parameters. Need x, y data 'sampled' from the distribution for
# parameter fit.
y = []
for i in range(artists.shape[0]):
    for j in range(artists['Num Songs'][i]):
        y.append(i)  # just let y have index[artist] repeated for each song

# sanity check. If the dataframe isn't sorted properly, y isn't either.
#pyplot.figure()
#pyplot.hist(y, bins=30)
        
# The documentation is pretty bad, but this is okay:
# https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-
# ones-with-scipy-python
param = pareto.fit(y, 100)
pareto_fitted = len(y)*pareto.pdf(range(artists.shape[0]), *param)
# param = gamma.fit(y) # gamma fits abysmally; see for yourself by uncommenting
# gamma_fitted = len(y)*gamma.pdf(range(artists.shape[0]), *param)

pyplot.figure(figsize=(18, 6))
pyplot.bar(artists['Artist'], artists['Num Songs'])
pyplot.plot(pareto_fitted, color='r')
#pyplot.plot(gamma_fitted, color='g')
pyplot.xticks(visible=False)
pyplot.xlabel(artists.columns[0])
pyplot.ylabel(artists.columns[1])
pyplot.title('everybody');

The best fit is still too sharp for the data, and I tried for a good long while to get it to fit better, so I conclude the artist counts don't quite follow a power law.

Let's plot the top 50 artists so we can actually read who they are.

In [4]:
pyplot.figure(figsize=(18, 10))
pyplot.bar(artists['Artist'][:50], artists['Num Songs'][:50])
pyplot.xticks(rotation=80)
pyplot.xlabel(artists.columns[0])
pyplot.ylabel(artists.columns[1])
pyplot.title('top 50');

Volume Added Over Time

My proclivity to add songs to this playlist is a proxy for my interest in listening to music generally. How has it waxed and waned over time?

In [5]:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters() # to suppress warning

# Plot of added volume over time
parse_date = lambda d:(int(d[:4]), int(d[5:7]), int(d[8:10]))
pyplot.figure(figsize=(10, 6))
pyplot.hist([date(*parse_date(d)) for d in data['Added At']], bins=30)
pyplot.title('volume added over time');

The initial spike is from when I first started using Spotify as the home for this collection and manually added hundreds of songs from my previous list.

Eclecticness Measure (Frequency Transform)

This one is a personal favorite. I want to know how many of my songs are one-offs from their artist (individual pieces I found fantastic and ended up adding after a few listens), how many are two-offs, et cetera. I know it must be heavily skewed toward the low numbers.

In [6]:
# bar chart of first bar chart == hipster diversity factor
frequency = defaultdict(int)
for n in artists['Num Songs']:
    frequency[n] += n
frequency = pandas.DataFrame(frequency.items(), columns=['Unique Count', 'Volume']
                           ).sort_values('Volume', ascending=False)
print("number of song-artist pairs represented in the eclecticness chart:",
      sum(frequency['Volume']))

pyplot.figure(figsize=(10, 6))
pyplot.bar(frequency['Unique Count'].values, frequency['Volume'].values)
pyplot.title('volume of songs binned by |songs from that artist|')
pyplot.xlabel('quasi-frequency domain')
pyplot.ylabel(frequency.columns[1]);
number of song-artist pairs represented in the eclecticness chart: 5973

So, yes, it's much more common for an artist to make it into my list a few times than many times. In fact, a plurality of my songs come from artists who appear only once.

Conversely, this view also makes stark those few musicians from whom I've collected dozens.

Note that here, as in the artist bar charts, some songs are counted more than once, because where artists collaborated I credited the song to each of them.
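If the double counting bothers you, a quick alternative (sketched below) is to credit each track only to its first-listed artist; treating that as the primary artist is my assumption, not something the export guarantees.

# Alternative tally that counts each song exactly once, credited to its
# first-listed artist. Not used in the charts above; just for comparison.
primary = defaultdict(int)
for names in data['Artist Name(s)'].dropna():
    primary[names.split(',')[0]] += 1
print('songs counted once, by primary artist:', sum(primary.values()))
print('unique primary artists:', len(primary))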

Genres Bar Chart

Alright, enough messing around. All of the above was possible with the output from Watsonbox's Exportify. Let's get to the novel stuff you came here for.

People describe music by genre. As we'll see, genre names are flippin' hilarious and extremely varied, but in theory if I cluster around a few, that should give you a flavor of my tastes.

In [7]:
# count songs per genre
genres = defaultdict(int)
for i,song in data.iterrows():
    if isinstance(song['Genres'], str): # sometimes there aren't any, and this is NaN
        for genre in song['Genres'].split(','):
            if len(genre) > 0: # empty string seems to be a legit genre
                genres[genre] += 1

# sort for chart
genres = pandas.DataFrame(genres.items(), columns=['Genre', 'Num Songs']
                          ).sort_values('Num Songs', ascending=False).reset_index(drop=True)
print("number of unique genres:", genres.shape[0])

pyplot.figure(figsize=(18, 6))
pyplot.bar(genres['Genre'], genres['Num Songs'])
pyplot.xticks(visible=False)
pyplot.xlabel(genres.columns[0])
pyplot.ylabel(genres.columns[1])
pyplot.title('All the genera');
number of unique genres: 1138

So many! Let's do the same thing as with the artists and for giggles see if it fits a power law.

In [8]:
y = []
for i in range(genres.shape[0]):
    for j in range(genres['Num Songs'][i]):
        y.append(i)

# sanity check
#pyplot.figure()
#pyplot.hist(y, bins=30)

param = pareto.fit(y, 100)
pareto_fitted = len(y)*pareto.pdf(range(genres.shape[0]), *param)

pyplot.figure(figsize=(18, 6))
pyplot.bar(genres['Genre'], genres['Num Songs'])
pyplot.plot(pareto_fitted, color='r')
pyplot.xticks(visible=False)
pyplot.xlabel(genres.columns[0])
pyplot.ylabel(genres.columns[1])
pyplot.title('All the genera');

Still too sharp, but fits better than with the artists.

Let's look at the top 50 so we can read the names.

In [9]:
pyplot.figure(figsize=(18, 10))
pyplot.bar(genres['Genre'][:50], genres['Num Songs'][:50])
pyplot.xticks(rotation=80)
pyplot.xlabel(genres.columns[0])
pyplot.ylabel(genres.columns[1])
pyplot.title('top 50');

"Indie poptimism" lol. wtf? "Dreamo", "Vapor soul", "Freak folk", "Tropical house", "Post-grunge", "Hopebeat", "Noise pop", "Mellow gold"

These are too good. Next time someone asks me my music taste, I'm definitely using these.

If these are the most popular names, what are the really unique ones at the bottom of the chart?

In [10]:
pyplot.figure(figsize=(18, 1))
pyplot.bar(genres['Genre'][-50:], genres['Num Songs'][-50:])
pyplot.xticks(rotation=80)
pyplot.xlabel(genres.columns[0])
pyplot.ylabel(genres.columns[1])
pyplot.title('bottom 50');

"hauntology", "psychadelic folk", "stomp and whittle", "dark trap", "filthstep", "shamanic", "deep underground hip hop", "future garage"

That was fun.

Release Dates

Which era of music do I prefer?

In [11]:
years = defaultdict(int)
for i,song in data.iterrows():
    if isinstance(song['Release Date'], str): #  somebody found a NaN release date!
        years[song['Release Date'][:4]] += 1

years = pandas.DataFrame(years.items(), columns=['Year', 'Num Songs']
                          ).sort_values('Year')

pyplot.figure(figsize=(10, 6))
pyplot.bar(years['Year'], years['Num Songs'])
pyplot.xticks(years['Year'], [y if i % 2 == 0 else '' for i,y in enumerate(years['Year'])], rotation=80)
pyplot.xlabel(years.columns[0])
pyplot.ylabel(years.columns[1])
pyplot.title('Songs per year');

It seems to follow a Gamma distribution! This makes sense because I'm more likely to have heard things that are nearer me in time, and it takes a while for them to get through my process and become favorites.

Let's fit that gamma to the time-reversed data.

In [12]:
# Some years are missing, so transform to a dataframe that covers full time period.
eldest = int(years['Year'].values[0])
youngest = int(years['Year'].values[-1])
missing_years = [str(x) for x in range(eldest+1, youngest) if
                 str(x) not in years['Year'].values]
ago = pandas.concat((years, pandas.DataFrame.from_dict(
    {'Year': missing_years, 'Num Songs': [0 for x in range(len(missing_years))]})
                  )).sort_values('Year', ascending=False).reset_index(drop=True)

y = []
for i in range(ago.shape[0]):
    for j in range(int(ago['Num Songs'][i])):
        y.append(i)

# sanity check histogram to make sure I'm constructing y properly
#pyplot.figure()
#pyplot.hist(y, bins=30)
        
param = gamma.fit(y, 10000)
gamma_fitted = len(y)*gamma.pdf(range(ago.shape[0]), *param)

pyplot.figure(figsize=(10, 6))
pyplot.bar(range(len(ago['Year'])), ago['Num Songs'])
pyplot.plot(gamma_fitted, color='g')
pyplot.xlabel('Years Ago')
pyplot.ylabel(ago.columns[1])
pyplot.title('Songs per year (in absolute time)');

print('Oldest Hall of Fame')
print(data[['Track Name', 'Artist Name(s)', 'Release Date']].sort_values(
    'Release Date')[:10])
Oldest Hall of Fame
                                             Track Name  \
2990                                       That's Amore   
2950                                    Autumn Nocturne   
3748        The Elements (Music By Sir Arthur Sullivan)   
2421                                          Take Five   
3136                            Skating In Central Park   
3105  I Guess I'll Hang My Tears Out To Dry - Rudy V...   
4262                                        Oye Cómo Va   
2630                                        Stand By Me   
0                            Fanfare for the Common Man   
3188                              In A Sentimental Mood   

                                  Artist Name(s) Release Date  
2990  Dean Martin,Dick Stabile And His Orchestra         1954  
2950                               Lou Donaldson         1958  
3748                                  Tom Lehrer   1959-01-01  
2421                    The Dave Brubeck Quartet   1959-12-14  
3136                         Bill Evans,Jim Hall         1962  
3105                               Dexter Gordon         1962  
4262                                 Tito Puente   1962-01-01  
2630                                 Ben E. King   1962-08-20  
0        Aaron Copland,London Symphony Orchestra         1963  
3188                Duke Ellington,John Coltrane      1963-02  

Pretty good fit! I seem to be extra partial to music from about 5 years ago. We'll see whether the present or maybe the further past catches up.
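As a rough check on that eyeball estimate, the peak can be read off the fitted curve directly. This sketch uses the param tuple from the previous cell and assumes the fitted shape parameter came out greater than 1 (it has to, for the curve to have an interior peak).

# The mode of a gamma distribution is loc + (a - 1) * scale when a > 1.
a, loc, scale = param
if a > 1:
    print('fitted peak, in years ago:', loc + (a - 1) * scale)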

Popularity Contest

I was happy to find popularity listed as a field in Spotify's track JSON. It's a relative score from 0 to 100 rather than an absolute number of plays, but it still gives a notion of how hipster I am.

In [13]:
popularity = defaultdict(int)
for i,song in data.iterrows():
    popularity[song['Popularity']] += 1

popularity = pandas.DataFrame(popularity.items(), columns=['Popularity', 'Num Songs']
                          ).sort_values('Popularity')

pyplot.figure(figsize=(10, 6))
pyplot.bar(popularity['Popularity'].values, popularity['Num Songs'].values)
pyplot.xlabel(popularity.columns[0])
pyplot.ylabel(popularity.columns[1])
pyplot.title('popularity distribution');

print("Average song popularity: ", popularity['Popularity'].mean())
print("Median song popularity: ", popularity['Popularity'].median())
print("Max song popularity: ", popularity['Popularity'].max())
Average song popularity:  44.0
Median song popularity:  44.0
Max song popularity:  88

Damn, I'm a hipster.

Track Duration

Do I prefer long songs or short ones?

In [14]:
pyplot.figure(figsize=(10,6))
pyplot.hist(data['Duration (ms)']/1000, bins=50);
pyplot.xlabel('Duration (s)')
pyplot.ylabel('Num Songs')
pyplot.title('Histogram of song lengths')

mean = data['Duration (ms)'].mean()/1000
median = data['Duration (ms)'].median()/1000
print("Average song length: " + str(int(mean//60)) + (":" if mean%60 >=10 else ":0")
      + str(mean%60))
print("Median song length: " + str(int(median//60)) + (":" if median%60 >=10 else ":0")
      + str(median%60))
Average song length: 4:02.6681554414003017
Median song length: 3:52.64099999999999

The median is lower than the mean, so the distribution is skewed right. That is, I like a few really long songs. What are they?
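Before looking at the outliers, here's a one-line check that puts a number on the skew; a positive sample skewness confirms the long right tail. (A quick supplement, not part of the original analysis.)

# Positive skewness = long right tail, i.e. a handful of very long songs.
print('duration skewness:', data['Duration (ms)'].skew())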

In [15]:
print("Longest Hall of Fame:")
print(data[['Track Name', 'Artist Name(s)', 'Release Date', 'Duration (ms)']].sort_values(
    'Duration (ms)', ascending=False)[:10])
Longest Hall of Fame:
                                             Track Name  \
5244                                             Echoes   
3155                              Concierto De Aranjuez   
691                                               Irene   
1912  The Return of the King (From The Lord of the R...   
4232                                     Boléro (Ravel)   
460                                   The Cure For Pain   
2349              Shine On You Crazy Diamond (Pts. 1-5)   
140   Two Step - Live At Piedmont Park, Atlanta, GA ...   
5062                                             Rivers   
3474      Má vlast (My Country): No. 2, Vltava [Moldau]   

                                         Artist Name(s) Release Date  \
5244                                         Pink Floyd   1971-11-11   
3155                                           Jim Hall         1974   
691                                         Beach House   2012-05-15   
1912          The City of Prague Philharmonic Orchestra   2004-01-01   
4232                          London Symphony Orchestra         1995   
460                                        mewithoutYou   2002-01-01   
2349                                         Pink Floyd   1975-09-12   
140                                  Dave Matthews Band   2007-12-11   
5062                                         Tarek Musa   2010-01-30   
3474  Bedřich Smetana,Polish National Radio Symphony...   1994-08-05   

      Duration (ms)  
5244        1412451  
3155        1154040  
691         1017013  
1912         976893  
4232         934067  
460          908840  
2349         811077  
140          808226  
5062         807437  
3474         794000  

Musical Features

In the interest of understanding user tastes and providing the best possible music recommendations, Spotify has done some really sophisticated analysis of actual track content, which has only gotten more extensive in recent years. Music is a time series, but most similarity metrics (and most ML methods generally) require inputs to be vectors, that is, points in some feature space. So they've transformed the tracks into numerical metrics like Energy and Valence (continuous) and Key (discrete).

For the continuous metrics, here are distributions for my songs.

In [16]:
pyplot.figure(figsize=(20,20))

for i,category in enumerate(['Tempo', 'Acousticness', 'Instrumentalness', 'Liveness',
                            'Valence', 'Speechiness', 'Loudness', 'Energy', 'Danceability']):
    pyplot.subplot(3, 3, i+1)
    pyplot.hist(data[category], bins=30)
    pyplot.text(pyplot.xlim()[1] - (pyplot.xlim()[1] - pyplot.xlim()[0])*0.3,
                pyplot.ylim()[1]*0.9, r'$\mu=$'+str(data[category].mean())[:7], fontsize=12)
    pyplot.xlabel('Value')
    pyplot.ylabel('Num Songs')
    pyplot.title(category)

pyplot.tight_layout(h_pad=2)

My Valence distribution leans toward the low end; do I have an affinity for sadder songs?
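To put a rough number on that lean, here's a quick check of how many songs sit below the midpoint of Valence's 0-to-1 range; the 0.5 threshold is my own ad hoc choice, not anything Spotify defines.

# Fraction of songs on the "sadder" half of the valence scale.
print('fraction with Valence < 0.5:', (data['Valence'] < 0.5).mean())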

Now let's look at the discrete music features.

In [17]:
pyplot.figure(figsize=(15,4))

pyplot.subplot(1, 3, 1)
seaborn.countplot(data, x='Time Signature', hue='Time Signature', legend=False)
pyplot.xlabel('Beats per bar')
pyplot.ylabel('Num Songs')
pyplot.title('Time Signature')

pyplot.subplot(1, 3, 2)
seaborn.countplot(data, x='Key', hue='Key', palette='husl', legend=False)
pyplot.xticks(ticks=pyplot.xticks()[0], labels=['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B'])
pyplot.ylabel('Num Songs')
pyplot.title('Key')

pyplot.subplot(1, 3, 3)
seaborn.countplot(data, x='Mode', hue='Mode', legend=False)
pyplot.xticks(ticks=pyplot.xticks()[0], labels=['minor', 'major'])
pyplot.ylabel('Num Songs')
pyplot.title('Major vs Minor Key');

pyplot.tight_layout(w_pad=2)

Musicians seem to favor C and eschew D#. More than a third of my songs are in a minor key; I don't have a baseline to compare against here, but this might contribute to my lower Valence.
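Here's a quick check of that "more than a third" figure; it relies on Mode being coded 0 for minor and 1 for major, as in the plot above.

# Fraction of songs in a minor key (Mode == 0).
print('fraction in a minor key:', (data['Mode'] == 0).mean())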

Looks like the vast majority of my music is 4/4 time with a good few in 3/4. I wasn't even aware there were any with 5 beats. What are those?

In [18]:
print('5:\n', data.loc[data['Time Signature']==5][
    ['Track Name', 'Artist Name(s)', 'Release Date']][:20])
5:
                           Track Name  \
76    Yachts - A Man Called Adam mix   
120          Good Morning Fire Eater   
223                         Carry On   
233                  Vanishing Grace   
244                          Elysium   
273                           Lately   
386                         Evenstar   
447                      Make A Fist   
459                              (B)   
567                          Animals   
726                 All That Remains   
734                 Crush The Camera   
1061                     Cold Sparks   
1162               You Are Gonna Die   
1188   Everything in Its Right Place   
1193                     The Tourist   
1194             I Am Citizen Insane   
1835        Have I Always Loved You?   
1973                       Resonance   
2135                            Pray   

                                         Artist Name(s) Release Date  
76                                  Coco Steel Lovebomb   2000-10-31  
120                                            Copeland   2008-01-01  
223                                                fun.   2012-02-21  
233                                 Gustavo Santaolalla   2013-06-07  
244   Klaus Badelt,Lisa Gerrard,Gavin Greenaway,The ...   2000-04-25  
273                                         Memoryhouse   2011-09-13  
386                                        Howard Shore   2002-12-02  
447                                          Phantogram         2011  
459                                        mewithoutYou   2002-01-01  
567                                                Muse   2012-09-24  
726                                          Rogue Wave         2010  
734                                          Rogue Wave   2005-08-23  
1061                                           Mutemath   2011-09-30  
1162                                  Marc Streitenfeld         2011  
1188                                          Radiohead   2000-10-02  
1193                                          Radiohead   1997-06-17  
1194                                          Radiohead   2003-06-09  
1835                                           Copeland   2014-11-17  
1973                                               Home   2014-07-01  
2135                                          Sam Smith   2017-10-06  

Make A Fist is totally 5/4, and so is Animals. Funny how I didn't notice the strange energetic time signature until now. But Carry On is definitely 4/4, as is Yachts, and Pray is 6/8. So Spotify's algorithm isn't perfect at this, which is expected.

What are 0 and 1?

In [19]:
print('0:\n', data.loc[data['Time Signature']==0][
    ['Track Name', 'Artist Name(s)', 'Release Date']][:10])
print('\n1:\n', data.loc[data['Time Signature']==1][
    ['Track Name', 'Artist Name(s)', 'Release Date']][:20])
0:
         Track Name Artist Name(s) Release Date
1364  Small Memory    Jon Hopkins   2009-05-05

1:
                                              Track Name  \
71                                        Clair De Lune   
119                                     Top Of The Hill   
227                     I Am the Very Model of a Modern   
239                         The Last of Us (You and Me)   
362                                              Bowery   
503                                    The Eternal City   
564                                             Prelude   
601                                       Þú ert jörðin   
604                                               Raein   
1278                                 Campfire Song Song   
1330                                        Mylo Xyloto   
1370                                            Anagram   
1915  The Fellowship (From The Lord of the Rings: Th...   
1955                                            Monsoon   
1999                               Meet Me in the Woods   
2037                                         Only Songs   
2181                                         Old Casino   
2194                                     Work This Time   
2591                                   I Don't Think So   
2670                                       Other Worlds   

                                 Artist Name(s) Release Date  
71                               Claude Debussy   2014-10-13  
119                                    Conduits   2013-04-16  
227                     The Pirates Of Penzance   1983-02-18  
239            Gustavo Santaolalla,Alan Umstead   2013-06-07  
362                               Local Natives   2013-01-29  
503                          Michele McLaughlin   2007-12-04  
564                                        Muse   2012-09-24  
601                              Ólafur Arnalds   2010-05-07  
604                              Ólafur Arnalds   2009-08-28  
1278                      Spongebob Squarepants   2009-07-14  
1330                                   Coldplay   2011-10-24  
1370                            Young the Giant   2014-01-17  
1915  The City of Prague Philharmonic Orchestra   2004-01-01  
1955                               Hippo Campus   2017-02-24  
1999                                 Lord Huron   2015-04-07  
2037                             The Wild Reeds   2017-04-07  
2181                                 Coastgaard   2016-02-26  
2194           King Gizzard & The Lizard Wizard   2014-03-07  
2591                                 Ben Phipps   2016-09-30  
2670                      Bassnectar,Dorfex Bos   2017-12-01  

Looks like there is only one song with a time signature of 0. It's a piano piece with a tempo that rises and falls. This category might be for variable tempo, or unknown.

Clair De Lune is in 9/8 time, so sort of waltzish but not really.

The Major General's Song is 4/4, but there are some stops in there and a lot of speaking, so I understand how that might be difficult to pick out. Same with Campfire Song Song lol.

Top of the Hill really sounds like 7/4 to me (1-2-123 sort of beat).

Þú ert jörðin is actually properly 1/4 time according to the internet, and relistening I understand how that could be the case. It's like there are little riffs each bar following a quadruplet pattern, but the major beats really only come every bar.

The Last of Us (You and Me) seems similar. It might be properly 1/4 time.

So it looks like this category is for actual single beats and unusual time signatures that Spotify isn't sure what to do with.

Joint Analysis

I mostly just want to showcase what's possible. Let's plot Energy and Popularity together to see whether there is a relationship.

In [20]:
x = 'Energy'
y = 'Popularity'

axes = seaborn.jointplot(x=data[x], y=data[y], kind='hex', color='r')
axes.set_axis_labels(x, y, fontsize=20);

The density is pretty scattered across the whole plot, meaning the relationship here is actually pretty weak. Surprising.
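To quantify "pretty weak", a correlation coefficient does the job. This is just a supplementary check, not part of the original plot.

# Pearson correlation between Energy and Popularity across all songs.
print('correlation:', data['Energy'].corr(data['Popularity']))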

The Final Frontier

Finally, I'm going to follow this guy's example and feed the dimension-reduced data to a one-class SVM to get a sense of what the frontier of my normal taste looks like in that space, heat-map-of-the-universe-style.

t-SNE is a method for visualizing high-dimensional data in a low-dimensional space. Songs which are more alike will be nearer each other in the feature space, but we can't visualize a space with that many dimensions. What we can do is reconstitute the points in 2D, attempting to preserve the pairwise distances (the notions of similarity) between songs.

In [23]:
show_percent = 2

from sklearn.manifold import TSNE
from random import random
from sklearn.svm import OneClassSVM
import numpy

# Create a dataframe of only the numerical features, all normalized so embedding
# doesn't get confused by scale differences
numerical_data = data.drop(['Spotify ID', 'Artist IDs', 'Track Name', 
        'Album Name', 'Artist Name(s)', 'Added By', 'Added At',
        'Genres'], axis=1)
numerical_data['Release Date'] = pandas.to_numeric(
    numerical_data['Release Date'].str.slice(0,4))
numerical_data = (numerical_data - numerical_data.mean())/numerical_data.std()
print('using:', list(numerical_data.columns))

# If you like, only include a subset of these, because the results with all
# is really hard to interpret
#tsne_data = numerical_data[['Popularity', 'Energy', 'Acousticness',
#                                'Valence', 'Loudness']]
#print("\nConsidering similarity with respect to the following features:")
#print(tsne_data.dtypes)

# Takes a 2D data embedding and an svm trained on it and plots the decision boundary
def plotFrontier(embedded, svm, technique_name, scale):
    # get all the points in the space, and query the svm on them
    xx, yy = numpy.meshgrid(numpy.linspace(min(embedded[:,0])*scale,
                                           max(embedded[:,0])*scale, 500),
                            numpy.linspace(min(embedded[:,1])*scale,
                                           max(embedded[:,1])*scale, 500))
    Z = svm.decision_function(numpy.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape) # positive Z means yes. negative means outliers.

    pyplot.figure(figsize=(20,20))
    pyplot.title('Decision boundary of One-class SVM in '+technique_name+' space')
    pyplot.contourf(xx, yy, Z, levels=numpy.linspace(Z.min(), 0, 7), cmap=pyplot.cm.Blues_r)
    pyplot.contour(xx, yy, Z, levels=[0], linewidths=2, colors='green') # the +/- boundary
    pyplot.contourf(xx, yy, Z, levels=[0, Z.max()], colors='lightgreen')

    pyplot.scatter(embedded[:, 0], embedded[:, 1], s=10, c='grey')
    for i,song in data.iterrows():
        if random() < show_percent*0.01: # randomly label % of points
        #if song['Artist Name(s)'] in ['Coldplay']:
            x, y = embedded[i]
            pyplot.annotate(song['Track Name'], (x,y), size=10,
                xytext=(-30,30), textcoords='offset points',
                ha='center',va='bottom',
                arrowprops={'arrowstyle':'->', 'color':'red'})

tsne_embedded = TSNE(n_components=2).fit_transform(numerical_data)

svm_tsne = OneClassSVM(gamma='scale')
svm_tsne.fit(tsne_embedded)

plotFrontier(tsne_embedded, svm_tsne, 't-SNE', 1.2)
using: ['Release Date', 'Duration (ms)', 'Popularity', 'Danceability', 'Energy', 'Key', 'Loudness', 'Mode', 'Speechiness', 'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Time Signature']

The point scatter looks really different every time this runs, because it's stochastic. The clusters don't necessarily have sensible interpretations, though you might be able to label a few of them. It's good to see some notionally similar pieces ending up near each other. You can try this with a subset of these dimensions to try to make the result more interpretable.

Modifying the parameters of the SVM changes its fit significantly, so I'm not sure this is the best model. Too large a gamma clearly overfits the data; too small a gamma makes the decision boundary a boring ellipse. Using gamma='scale' as the docs recommend is a more interesting middle ground, but the SVM still seems to believe that a great many of the songs I love fall outside the boundary.
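If you want to see that sensitivity for yourself, here's a sketch that refits the one-class SVM on the same t-SNE embedding with a few other settings and redraws the frontier. The particular gamma and nu values are arbitrary; nu is an upper bound on the fraction of training points treated as outliers, and its default of 0.5 tends to leave a large share of songs outside the boundary no matter what gamma does.

# Exploratory: redraw the frontier for a few SVM settings on the same
# t-SNE embedding. gamma controls how wiggly the boundary can be; nu is an
# upper bound on the fraction of training points treated as outliers.
# These particular values are arbitrary.
for g, nu in [(0.0001, 0.1), ('scale', 0.1), (1.0, 0.1)]:
    svm_alt = OneClassSVM(gamma=g, nu=nu)
    svm_alt.fit(tsne_embedded)
    plotFrontier(tsne_embedded, svm_alt,
                 't-SNE (gamma={}, nu={})'.format(g, nu), 1.2)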

I'll try a different dimensionality reduction technique. The original author uses Principal Component Analysis to feed the SVM.

In [24]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_embedded = pca.fit_transform(numerical_data)
print("% variance explained by successive PCA dimensions:",
      pca.explained_variance_ratio_)

svm_pca = OneClassSVM(gamma='scale')
svm_pca.fit(pca_embedded)

plotFrontier(pca_embedded, svm_pca, 'PCA', 1)
% variance explained by successive PCA dimensions: [0.21916295 0.09249043]

Ideally, songs falling nearer the center here, like Cheeseburger in Paradise and RAC's We Belong, are those that most characterize my taste numerically, and the odd ones, like Pink Floyd's Comfortably Numb and The Fellowship of the Ring orchestral suite, fall on the outside.

So in the end my music taste is a blob that doesn't even fit the data very well. And that's the point: Like many things, it's too complicated to boil down. You can't answer the question fully. But understanding elements of the answer can aid the process of discovery, and that's valuable. It's why Spotify is such a force at music recommendation. It's why Data Science.