A platform for our platform

In June this year (2016), a new team was formed within Bonnier Broadcasting with a mission to build a realtime data platform. The purpose being to address a set of high priority data needs we have identified, but also to support the growing data driven culture we see in both C More and TV4. A realtime data platform has of course no formal definition, and it can mean slightly different things to different people. To us, it means an accessible data warehouse where all our data sources are consolidated and integrated. These data sources range from relatively static sources, such as user or video databases, to event streams from apps and services, ingested in realtime. From all these sources we need to create data models, analysis, and visualizations that help the rest of the organization do a better job.

So, when starting out on this journey, one of the first questions to arise was, where should we build this platform? What platform should we choose as the foundation for our platform? This post recounts the reasoning behind our choice, and our experiences with the result so far.

TL;DR

We went with Google Cloud Platform for our data platform and we are mostly happy with it.

A cloud platform

For the majority of our system components (infrastructure services, databases, video platform) we have already made the move to the cloud. Several years ago we started using AWS and Heroku (and we are still migrating legacy services). At the time there were not many options, and we are still happy with AWS for our core needs.

This year, however, when discussing our future data platform, the cloud landscape has evolved substantially from when we picked AWS, and we saw an opportunity to take a step back and evaluate our options. And there are a host of options these days, at various abstractions levels, and with various levels of managed offerings.

Apart from pure functional requirements, of which none is particularly unique, we have a few non-functional constraints that effectively narrow us down a bit.

First, we want as much control as possible. We want to avoid, or at least minimize, vendor lock-in and dependencies to third-party products and solutions. This rules out a number of packaged solutions, such as Cloudera, Databricks, MapR, and others.

Second, we are a small team with an ambitous task. We want to minimize, as much as possible, the cost of operation and maintenance that comes with doing too much ourselves. The most obvious path from there would perhaps be to go in the Apache direction, where Hadoop, Spark, and an abundance of other tools, await the willing, and set them them up on our AWS account. Much, if not all, of what we want to achieve has already been done with these tools, and many well-known companies have contributed to the toolbox over the years.

However, a third constraint is that we have picked Go to be our language of choice. This is a conscious choice, made from a number of reasons (which I would be happy to share in a separate post). As a consequence, we have invalidated (or at least made awkward to use) from our list of options the majority of open source data processing tools and frameworks, which are available to your everyday Java, or Scala based shop.

GCP

Google Cloud Platform (GCP) has been moving forward at tremendous speed during the last year(s) to catch up with AWS, trying to establish itself as a serious contender in the cloud market. And with Spotify announcing early that they are migrating their data processing platform to GCP, the risk associated with making the same move was substantially lowered.

As it turns out, what Google Cloud wants to be matches our needs and constraints to a very large extent. Most, if not all, of the infrastruture we need is there. The following four components are what we use the most today.

  • BigQuery is the managed data warehouse we needed but did not find in Redshift.
  • Container Engine (GKE) is the managed Kubernetes platform we don’t have to waste our own time on setting up and maintaining.
  • Dataflow is the managed data processing service that automatically runs and scales our stream and batch jobs, that we write with Apache Beam/Dataflow.
  • PubSub is the managed message bus that we use to pass data between the services above.

A key word here is managed, and it translates to a huge amount of time and effort that our small team does not have to spend on stuff (e.g. setting up, configure, monitor) that is not pushing our data platform forward.

Many people I talk to seem to look at GCP as simply a Google version (albeit less mature) of AWS or Azure, that aside from subjective bias and pricing, they have more or less the same offering. Having started out with AWS for our platform, we have found this assumption to be false, at least when it comes to our use case. The managed aspect of the GCP services is a differentiator.

What about Go?

The third constraint I mentioned above was the Go language. These days, the JVM is the ruler of data engineering, and to a large extent data science as well. As a consequence, we are not completely free from the JVM dependency, as much as we want it. What’s holding us back is Beam/Dataflow, where the stream processing SDK is still only available in Java. The Python SDK is moving along, however, and there are rumors that a Go SDK is in the works somewhere inside Google. Hopefully we can contribute to that once it is released to the community.

However, since all Google Cloud API’s have official Go client libraries (of which most are hand crafted and idiomatic, and not generated from spec), we can do most of our plumbing and infrastructure in the language we feel gives us most speed and quality.

The flip side

When I earlier wrote about what Google Cloud wants to be, I was hinting at the major downside we have experienced with our choice of cloud platform. Several must-have’s for us in terms of functionality are still in beta, or even alpha, and one or two GA components are still lacking (I’m looking at you, Stackdriver). Each component is moving at a very high speed, and one shouldn’t invest too much effort in trying to work around a current limitation, as it may well be solved two months later.

So, while moving in the right direction at high speed generally is perceived as a good thing, it is also frustrating when you try to build something real on top of it. We wouldn’t want it the opposite way though, and all in all we are happy with the choice we made.

Läs om vår kommentarspolicy