Analytics at Spotify
At the heart of Spotify lives a massive and growing data-set. Most of the data is user-centric and allows us to provide music recommendations, choose the next song you hear on radio, and much more. We do our best to base every decision, programmatic and managerial, on data, and this extends into the culture.
At my previous job, I developed software for ad agencies in the digital asset management space, so I was relatively new to “Big Data.” New engineers at Spotify will notice that the culture has a way of engulfing you in a data-driven mindset. After working at Spotify for only a few months, I was talking about term weighting and signing up for internal courses on the R programming language.
I also participated in a hackathon where I developed a Spotify App, code-named Genderify, that tapped into our massive data-set to determine exactly how “manly” a playlist is. It was mostly a joke, but it used listening data to build an accurate statistical map of a playlist and displayed a score from 0 to 100, with 100 representing the extreme edge case in which no person registered as female had ever listened to any track on your playlist.
Our Analytics Pipeline powers far more than satirical apps. It allows us to recognize trends, discover bugs, and analyze the effect of an event on a user and the entire ecosystem.
Internally, everyone (not just engineers) has access to three tools: Dashboards, Data Warehouse, and Luigi. Dashboards provides an interface similar to Google Analytics and allows users to create their own custom screens containing data they are interested in from our pipeline. For instance, we have dashboards that show us user growth in particular regions, or user engagement, or even the number of emails we deliver.
Data Warehouse is a more complex system that allows you to access our data-set directly. You can query the data, create map/reduce jobs using Hive, and even create mini data pipelines if that’s the kind of thing you’re into. For more complex operations, we have Luigi at our disposal, governing a zoo of Python, Pig, and other animals, which can be made to talk to any storage system, run machine learning algorithms, and even produce daily reports.
So what do we do with all this data? Pretty much everything. An example of an entirely data-driven decision would be our choice of a music recommendation algorithm that powers Spotify Radio.
Most of our recurring data is added to our analytics pipeline by a set of daemons that constantly parse the syslog on production machines looking for messages we have defined along with the associated data for each message. Matching data is compressed and periodically synced to HDFS. Typically data is available in our Data Warehouse and Dashboards within 24 hours, but in some cases data is available within a few hours or even instantly through tools like Storm.
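To make the parsing step a little more concrete, here is a minimal sketch in Python of what such a daemon might do with each syslog line. The message names, the tab-separated layout, and the helper names (MESSAGE_DEFINITIONS, parse_line, compress_matches) are all assumptions made for illustration; they are not our actual format, and the sync to HDFS is left out entirely.

```python
import gzip

# Hypothetical registry of defined messages: name -> ordered field names.
MESSAGE_DEFINITIONS = {
    "EmailSent": ("username", "timestamp", "campaign", "version"),
    "TrackPlayed": ("username", "timestamp", "track_id"),
}


def parse_line(line):
    """Return (name, fields) for a defined message, or None otherwise.

    Assumes the payload follows the last ': ' in the syslog line and is
    tab-separated, with the message name as the first value.
    """
    _, sep, payload = line.rstrip("\n").rpartition(": ")
    if not sep:
        return None  # no syslog-style prefix at all
    name, *values = payload.split("\t")
    field_names = MESSAGE_DEFINITIONS.get(name)
    if field_names is None or len(values) != len(field_names):
        return None  # not a message we have defined, or malformed
    return name, dict(zip(field_names, values))


def compress_matches(lines):
    """Keep only the lines that parse, and gzip them as one batch."""
    kept = [line.rstrip("\n") for line in lines if parse_line(line) is not None]
    return gzip.compress("\n".join(kept).encode("utf-8"))
```

In this sketch, adding a new kind of data means adding one entry to the registry; everything that does not match a defined message is dropped before compression.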
All this sounds… complicated. And I assure you, building a pipeline and infrastructure like ours is. But making use of it is actually really easy. Engineers can add data to our analytics pipeline by defining a new message in our log parser and logging information to syslog in the correct format.
Becoming Data Driven
My experience at Spotify is a perfect example of how simple this is and shows how any engineer can make a meaningful impact.
Shortly after I joined Spotify, we decided as a company that we wanted to send users emails telling them when their friends joined and when new songs were added to a playlist they subscribed to. The hypothesis we wanted to test was that these emails would have a positive impact on user engagement and bring more users back to the app more often.
So… we needed a transactional email system. I took this project on as an opportunity to learn Python. With the help of a few other engineers, we built a fairly simple system that had the ability to deliver a lot of emails and also provided a way for people to create new email templates and A/B test different versions of an email template.
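For readers curious how an A/B split over template versions can work, here is a minimal sketch in Python. It assumes a hash-based assignment, which is one common technique and not necessarily what our system did; the function name assign_version and the version labels are made up for the example.

```python
import hashlib


def assign_version(username, campaign, versions=("A", "B")):
    """Deterministically pick a template version for a user.

    Hashing campaign and username together means the same user always
    gets the same version within a campaign, while different campaigns
    split users independently. (Illustrative assumption only.)
    """
    digest = hashlib.sha256(f"{campaign}:{username}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(versions)
    return versions[bucket]
```

Because the assignment is a pure function of the inputs, nothing has to be stored to remember which version a user received, though logging the version with every send is still what makes the later analysis possible.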
Within a few weeks we knew which email templates worked best and, more importantly, we could see the impact these email campaigns had on our users. We could clearly see that these emails were having a positive effect on user engagement.
So, how did we know the effect these emails had on users?
Every time it sent an email, the backend system simply logged a message with the fields (username, timestamp, email-campaign, campaign-version).
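As a rough illustration, logging such a message from Python could look like the sketch below. The four fields come from the description above; the "EmailSent" message name, the tab-separated layout, and both function names are assumptions, and the syslog socket path differs between operating systems.

```python
import logging
import logging.handlers
import time


def format_email_sent(username, campaign, version, timestamp=None):
    """Build one tab-separated 'EmailSent' record (hypothetical layout)."""
    if timestamp is None:
        timestamp = int(time.time())
    fields = (username, str(timestamp), campaign, version)
    for value in fields:
        if "\t" in value or "\n" in value:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(("EmailSent",) + fields)


def make_syslog_logger(address="/dev/log"):
    """Return a logger that writes to the local syslog socket.

    '/dev/log' is typical on Linux; other systems use a different path.
    """
    logger = logging.getLogger("email-backend")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address=address))
    return logger
```

A send would then be recorded with something like make_syslog_logger().info(format_email_sent("alice", "friend-joined", "A")), creating the logger once at startup and not on every send.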
Once this data made its way into HDFS, we had all the data we needed to determine the best performing email template for a campaign and we could track the effect a single email had on a user’s experience. We were able to see if an email had any effect on your listening habits, your account status and so on.
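The kind of comparison this enables can be sketched in a few lines of Python. This is a simplified illustration, not the analysis we actually ran: it assumes the email log has already been joined with play events, counts a user as "returned" if they played anything within a week of the email, and uses a hypothetical function name. A real evaluation would also compare against users who received no email.

```python
from collections import defaultdict

WEEK_SECONDS = 7 * 24 * 3600


def return_rate_by_version(emails, plays, window_seconds=WEEK_SECONDS):
    """Fraction of emails followed by a play, per (campaign, version).

    emails: iterable of (username, timestamp, campaign, version)
    plays:  iterable of (username, timestamp)
    Timestamps are assumed to be integer seconds since the epoch.
    """
    plays_by_user = defaultdict(list)
    for username, played_at in plays:
        plays_by_user[username].append(played_at)

    sent = defaultdict(int)
    returned = defaultdict(int)
    for username, sent_at, campaign, version in emails:
        key = (campaign, version)
        sent[key] += 1
        deadline = sent_at + window_seconds
        # Count the email once if any play falls inside the window.
        if any(sent_at < t <= deadline for t in plays_by_user[username]):
            returned[key] += 1
    return {key: returned[key] / sent[key] for key in sent}
```

Comparing the resulting rates across versions of the same campaign gives a first indication of which template performs better.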
Powerful stuff. This data is very much still in use today.
Remove Bias, Acquire Data
Spotify strives to be entirely data driven. We are a company full of ambitious, highly intelligent, and highly opinionated people, and yet, as often as possible, decisions are made using data. Decisions that cannot be made on data alone are meticulously tracked and fed back into the system so that future decisions can be based on them.
How fantastic is that? Sounds robotic, but humans cannot be trusted so it’s cool.
So the conclusion is to rely on data whenever possible. Don’t have enough data? Get more. Make data the most important asset you have because it is the only reliable decision maker that can scale your company.