When evaluating data collection platforms, many business owners face a strategic technical choice without even knowing it. By the time you are comparing the value props of services like Segment, Mixpanel, or Pendo to see whether their reports and visualizations fit your business, you have already made what is perhaps the most important choice: to store your data with a 3rd party provider.

The choices

Whether or not to store your data with a 3rd party provider is a big decision. Although it is a technological one, leadership needs to be well aware of the tradeoffs so that, once the choice is made, they can prepare for its short and long term consequences.

Choice 1: go with a 3rd party SaaS

Choosing a 3rd party SaaS is choosing to have someone else outside of your organization store and control access to your data.

This has been the default choice for startups and, to a lesser extent, for corporations for some time now. The value props and pricing are quite consistent and follow a pattern of:

  1. Cheap initial pricing
  2. Easy setup (minutes or hours of engineering time to get started)
  3. Scalable

Then, as your use of the platform increases, the price increases. In the top payment tiers, providers fall into one of two categories. The nice ones set their pricing below the cost of hiring a few people to maintain an in-house version. The not-so-nice ones charge huge amounts in the hope that, should you ever want to migrate, doing so will be more trouble than it is worth.

On the privacy front, if you hold personally identifiable data about your users, you will be trusting a 3rd party to guard it. This is not necessarily bad: their incentives are well aligned with yours, since a data breach would destroy their credibility and revenue. However, the more 3rd parties that have access to your data, the more potential vulnerabilities exist.

The main operational benefits are ease of use and, potentially, cost. You might be able to get, for a fraction of the cost, functionality that you would otherwise need to hire an entire team to build. Plus, should they choose to, the 3rd party can pass the savings from economies of scale on to you.

The main operational hazard, however, is very serious: you don't have access to the raw data. Most providers will claim that you own the data but, in my humble opinion, there is no true data ownership without unfettered access to your raw data. Providers will typically give you access through web apps that show aggregations, summaries, or filtered versions of the entire dataset. Should you want to do anything with your data outside of their platform, the incentives are not aligned: the growth of their business depends on you relying more and more on their platform to access your data.

Choice 2: self host on the cloud

There is another choice: build what you need for yourself. It is more manpower intensive but can provide significant savings and flexibility. In previous years, this was not a realistic option. In recent years, increasingly easy-to-use cloud providers such as AWS, GCP, and Azure, paired with open source ETL and dashboarding tools, have made it no longer prohibitively expensive to build exactly what you need in a way that gives you true ownership.

The first thing you might ask is: aren't AWS, GCP, and Azure 3rd parties? What's the difference between sending your data to Mixpanel compared to Amazon? The answer is twofold:

  • If properly configured, the data stored in your cloud account won't be accessible to anybody who doesn't have your login credentials to the cloud infrastructure, and you get to choose who that is. A malicious government or hacker is much more likely to get into your account through someone within your organization than through your cloud provider, as they don't even give themselves access to it.
  • Incentives are aligned. As long as you pay your cloud bill, they don't care what you are actually doing with your data. You can access it, move it, transform it, share it with another service, etc. without having to worry about whether it collides with their business model.

Another question you might have is: can you actually build that? The answer is "usually". With open source tools such as scikit-learn, pandas, Airflow, and Metabase running on your unopinionated cloud infrastructure, you won't be able to clone all of the functionality of, for example, Mixpanel, but you can build the subset your business actually needs.

The benefits are huge and can save you in situations that may not occur for years and are impossible to predict. Plus, it doesn't impose any restrictions! For example, if you store your data in your own AWS account, it is trivial to pipe it out to Mixpanel. That way you get the best of both worlds: your raw data lives on infrastructure that you own, and you still reap the benefits of a 3rd party offering.

How to make the choice

Like most worthwhile decisions in life, it depends. If you decide that it is important for you to own your data, you'll have some costs involved in building what you need in-house. The good news is that, unlike the SaaS offerings, it can be cheaper in the long term, as the cost will scale with the amount of data you collect rather than with an arbitrary metric like number of users or number of logins.

To be a bit more concrete, here's a series of questions that you can ask yourself:

  • How important is it for me to fundamentally own my data?
  • How much would it cost in the short term to use a 3rd party?
  • How much will it cost in a year to use a 3rd party?
  • Do the 3rd party offerings actually meet my needs?
  • How much are my initial costs in building it in-house?
  • How much will the in-house infrastructure itself cost monthly?
  • How much will in-house maintenance cost in labour?

With all of these in a spreadsheet, a few scenarios might arise:

  1. In-house is cheaper than SaaS in every dimension. Easy decision to build it in-house.
  2. In-house is more expensive in the short term but maintenance is cheaper than SaaS. Your decision might depend on how long it would take to realize an ROI, weighed against the importance of owning your own data.
  3. In-house is more expensive than SaaS in every dimension. In this case, your decision comes down to the importance of data ownership, as you would be paying extra to have it.
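
If you prefer code to a spreadsheet, the comparison above can be roughed out in a few lines of Python. This is only a minimal sketch under simplifying assumptions: each option is reduced to one upfront cost and one flat monthly cost, and the names (CostEstimate, scenario, break_even_months) and all of the figures are hypothetical, made up purely for illustration. Real SaaS pricing usually changes with usage, so treat the result as a starting point rather than an answer.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostEstimate:
    """One option from the spreadsheet. All figures are hypothetical inputs."""
    upfront: float  # one-off setup or build cost
    monthly: float  # recurring cost (subscription, or infrastructure plus labour)


def scenario(saas: CostEstimate, in_house: CostEstimate) -> int:
    """Return 1, 2, or 3 for the scenarios listed above, or 0 for any other mix."""
    upfront_cheaper = in_house.upfront <= saas.upfront
    monthly_cheaper = in_house.monthly <= saas.monthly
    if upfront_cheaper and monthly_cheaper:
        return 1
    if not upfront_cheaper and monthly_cheaper:
        return 2
    if not upfront_cheaper and not monthly_cheaper:
        return 3
    return 0  # cheaper upfront but more expensive monthly


def break_even_months(saas: CostEstimate, in_house: CostEstimate) -> Optional[int]:
    """First whole month at which cumulative in-house cost is no higher than SaaS.

    Returns None if in-house does not end up cheaper in the long run.
    """
    extra_upfront = in_house.upfront - saas.upfront
    monthly_savings = saas.monthly - in_house.monthly
    if extra_upfront <= 0 and monthly_savings >= 0:
        return 0  # in-house is never behind
    if monthly_savings <= 0:
        return None  # no monthly savings to recoup the extra upfront cost
    return math.ceil(extra_upfront / monthly_savings)


# Made-up example figures, not real quotes from any provider.
saas = CostEstimate(upfront=2_000, monthly=3_000)
in_house = CostEstimate(upfront=40_000, monthly=1_000)

print(scenario(saas, in_house))           # 2
print(break_even_months(saas, in_house))  # 19
```

With these example numbers you land in scenario 2: the in-house build pays for itself after 19 months, and the remaining question is whether owning your data is worth waiting that long.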

Let us know if we can help

Through our strategic consulting, we regularly help clients with these cost-benefit analyses, and we are happy to do so for you. Please contact me at sam@daredata.engineering to set up a free 1-hour strategic call.