What does BigQuery Omni mean for the future of data?

GCP recently announced the private alpha of BigQuery Omni, a service that lets you run BigQuery (Dremel) on data that resides in another cloud (S3 on AWS at the moment, Azure soon) using Anthos. It’s an interesting foray into what looks like one of the earliest approaches to multi-cloud analytics.

In this post we’ll explore not what Omni is (see the link above), but what Omni might signal for the future.

The problem

In 1956 transporting 5 MB of data storage looked like this.

IBM state of the art refrigeration unit (with attached storage), 1956

The IBM 350 disk storage unit contained fifty 60 cm spinning metal disks and weighed in at just over a tonne. 

Sixty-odd years on, the technology has changed a bit, but we’re still moving large amounts of data via trucks. Despite massive improvements in networks and communications infrastructure, it’s still tricky to match the information density of a truck filled with hard disks.

At AWS re:Invent in 2016, to demonstrate this visually, they drove a semi-trailer, the Snowmobile, into the convention centre: a transfer appliance the size of a shipping container and the much larger sibling of the Snowball (which also moves data on spinning metal disks).

The AWS Snowmobile – because you’re never too old to play with trucks.

Admittedly it’s a larger truck, but we’ve gone from transporting a few megabytes of data to 100 petabytes in a somewhat comparable amount of physical space.
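To see why the trucks persist, a quick back-of-the-envelope calculation helps. The 100 PB figure is Snowmobile’s quoted capacity; the link speeds and the perfectly sustained throughput are simplifying assumptions of mine:

```python
# Rough transfer time for 100 PB over a network link.
# Assumes a perfectly saturated, sustained link with no protocol overhead,
# which is generous -- real-world throughput would be lower.

def transfer_days(petabytes: float, gbps: float) -> float:
    """Days needed to move `petabytes` of data over a `gbps` link."""
    bits = petabytes * 1e15 * 8       # decimal petabytes -> bits
    seconds = bits / (gbps * 1e9)     # divide by link rate in bits/second
    return seconds / 86_400           # seconds per day

print(f"{transfer_days(100, 1):,.0f} days at 1 Gbps")    # roughly 25 years
print(f"{transfer_days(100, 10):,.0f} days at 10 Gbps")  # still ~2.5 years
```

Even at a sustained 100 Gbps the move would take around three months, which is why a truck full of disks still wins on raw throughput.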

AWS isn’t alone in offering this as a service, GCP has its own Transfer Appliance and Azure has the imaginatively named Data Box.

Most of these appliances exist to migrate on-premises data centres to the cloud, but why on earth do we need to drive semi-trailers full of hard drives around in 2020?

There’s more data than ever, but doing anything useful with that data has relied on having your compute co-located with your data. Whether in the same server, rack or data centre, the interfaces between your data and your computation have typically been kept in close proximity, because the tyranny of distance makes everything slow. Omni starts to hint at a world where maybe this doesn’t have to be the case.

It’s in – now what?

Getting your data into one of the public clouds is one problem, but if you want portability to other clouds, or want to move it out of the cloud for some reason, now you’ve got 99 problems. And those 99 problems are egress charges: the costs that accumulate for data transferred out of a cloud (and sometimes even between services or regions within the same cloud).

All the major clouds charge for egress. Here’s a brilliant diagram from the Duckbill Group that highlights some of this pricing on AWS.

Corey Quinn, Cloud Economist at the Duckbill Group and author of Last Week in AWS, a newsletter you should sign up for.

As you can see, calculating egress costs is easy if you have a Master’s in Forensic Accounting and traits of cloud-induced masochism. Sadly, I only tick one of those boxes.

Egress is not only complicated, but often the economics don’t add up.
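As a sketch of why the economics get murky, here’s what a tiered egress bill looks like. The tier boundaries and per-GB rates below are purely illustrative placeholders, not any provider’s actual price sheet:

```python
# Tiered egress cost calculator. The tiers and rates are hypothetical --
# real pricing varies by provider, region and destination.

TIERS = [             # (tier size in GB, price per GB in USD)
    (10_240, 0.090),  # first 10 TB of the month
    (40_960, 0.085),  # next 40 TB
    (102_400, 0.070), # next 100 TB
]
OVERFLOW_RATE = 0.050  # anything beyond the listed tiers

def egress_cost(gb_out: float) -> float:
    """Monthly egress bill for `gb_out` gigabytes transferred out."""
    cost, remaining = 0.0, gb_out
    for size, rate in TIERS:
        chunk = min(remaining, size)  # fill each tier before moving on
        cost += chunk * rate
        remaining -= chunk
    return cost + max(remaining, 0.0) * OVERFLOW_RATE

print(f"Moving 50 TB out: ${egress_cost(50_000):,.2f}")  # $4,301.20
```

Even under these made-up rates, moving 50 TB out costs over four thousand dollars, and that’s before anyone has queried a single row.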

This doesn’t come as a surprise to vendors, who lean on the analogy of “data gravity”, coined by Dave McCrory (then at Hyper9, now at Digital Realty), to describe the effect: where data lives, all else follows. The gravitational pull of housing data in a certain place inevitably attracts additional services, applications and data to it. This, in itself, is not a bad thing, but it makes switching clouds, or hedging your bets across multiple clouds, far more difficult. Enter the escape velocity of a cloud.

GCP is aware of this, and projects like Anthos and Omni are attempts to subvert this very real phenomenon by allowing data to reside in multiple clouds, through the witchcraft of quantum entanglement (well, not that part yet, at least).

Is the traditional data lake dead?

No, absolutely not.

Calling for the death of the data lake / pond / swamp would be premature, but this is a sign of shifting times. Unlike many industries, data and analytics aren’t built on bedrock but on tectonic plates, and that’s okay, if a little worrying at times.

The concepts of data warehousing still move extraordinarily slowly compared to the technology itself. Data centralisation still has its place, and is such a central tenet of data warehousing that the idea of distributing data across multiple clouds and regions will, for many, induce Edvard Munch’s ‘The Scream’ levels of horror.

“We can’t do what now?” – Collective cry of marketing departments, May 25 2018.

New legislation is forcing attitudes to change, with the introduction of GDPR, ePrivacy, CCPA and other frameworks threatening a long-overdue redress of the power imbalances that exist between consumers and the companies that collect and process data about them.

Many companies are wrestling with the realisation that data was sold to them as the new oil in a business case that failed to warn them of the inevitable risk, and actuality, of oil spills. Reactionary approaches to data management aren’t going to stand the test of time (there’s a fossil fuel joke in here somewhere).

Omni is a looking glass to the future

When multi-cloud entered the vocabulary of vendors, it wasn’t met with nearly the same universal fanfare as big data. Much of that is down to the realisation that multi-cloud has long sounded fantastic on paper while being fantastically difficult to achieve in reality.

The history of GCP demonstrates a serious attempt to make this whole multi-cloud thing work. Sure, there’s always going to be a degree of inevitable vendor lock-in with almost any PaaS / IaaS choice. But in an industry where Google isn’t the incumbent, they’ve been forced to develop software and services aimed at building gates into the walled gardens of their competitors.

Through internal development (S3-compatible APIs for GCS, Anthos), acquisition (Stackdriver, Looker, Alooma) and, most recently, Omni, there is tangible investment in figuring out what the future of cloud bridging looks like.

Omni is a decisively more aggressive strategy than we’ve seen so far. It’s one thing to offer tools that interoperate with rival clouds, but it’s another thing entirely to start running your proprietary tech (Dremel) there.

This flex isn’t accidental. It’s a proactive recognition that for nearly 15 years the industry has continued to build open-source software on the backs of giants, Google being one of those giants. Building these fundamental data movement services increases not just data gravity but also engagement with additional high-value services.

These decisions are increasingly symbolic of a strategy that identifies not just what customers need but where they are in their cloud journey, and how Google can continue to capitalise on its own IP.

Does this mean others will follow suit?

It’s too early to tell. For many, multi-cloud strategies are either non-existent or in their infancy. With each major cloud holding strong unique selling points, it may simply no longer be competitive for vendors to aim at capturing 100% of a customer’s cloud spend.

Ultimately, customers might chalk up a partial win when the big cloud vendors decide to play nicely with each other. However, that requires a shift from a combative environment to a more collaborative one, and that’s not always an easy sell.

Who knows, closer clouds might even yield a rainbow*?

* or it could be a thunderstorm, my metaphors are about as good as my meteorological understanding.

A Colorado rainbow and rain shaft observed while on College of DuPage’s Storm Chasing Trip 3.
Credit: Jared Rackley for the NOAA Weather in Focus Photo Contest 2015

Published by Mike Robins

CTO at Poplin Data
