How to introduce Data Science at your company

Machine Learning, Data Science and Artificial Intelligence are three terms that have been intertwined and used in countless conversations over the past decade. In the business world, probably no other theme has caused so many questions, doubts, raised eyebrows and El Dorado hopes.

If you are reading this post you might have some level of interest in understanding what Data Science, Machine Learning or Artificial Intelligence are and, trust me, you are not alone in the world. Or better yet, let's trust Google Trends - the search interest in these terms has been on the rise over the past five years:

There are multiple opinions regarding the use of algorithms in business, and they span a spectrum:

  • On one end of the spectrum, people do not believe in the return on investment of A.I. and are convinced that most companies are driven by a huge FOMO (fear of missing out).
  • On the opposite end we see a kind of cult-like, almost religious following of A.I. systems and a deep belief that machine learning algorithms will magically solve every problem across the economy - sometimes due to technical ignorance, sometimes due to hidden agendas.

At DareData we sit between these two extremes - we believe data science is an excellent way to enhance your company's decision-making processes - but, just like with any other set of technologies, you will only see a proper return on investment by thinking thoroughly, iterating quickly and listening (really listening) to the stakeholders.

You've probably thought about several challenges that your company faces that could be solved by using algorithms - and if you are like most business leaders, you immediately ask yourself a question:

"How can I introduce Data Science at my company with the lowest risk possible?"

Let's find out.


The importance of measuring 🎯

Start simple. Pick one thing that several relevant stakeholders within your company would see clear benefit from automating or predicting.

No matter how big your company, industry or revenue is, the first thing you should define when developing an algorithm is:

What should my algorithm aim to achieve?

This seems like a simple, straightforward question, but there is more to it than meets the eye.

Breaking it down, to answer this question you have to:

  • Know what processes the algorithm will impact.
  • Know what people the algorithm will impact.
  • Know what data you will need to produce meaningful impact.

And to make it even harder, some of the metrics you arrive at by gathering the answers to the points above are neither straightforward nor easily quantifiable. If you want to start experimenting with data science, it's important to choose a problem that is relatively easy to quantify: you will probably spend less money, less time and involve fewer people in finding short-term gains.

But how can you do that without exposing yourself to too much risk, i.e., throwing money away?


Proof of Concept Approach

One thing that you might (or might not) find surprising is that, at DareData, we start most of our customer relationships with a small project first (small in time or in scale). This ensures that:

  • The customer feels safe about working with a Data Science company and does not feel that there is a huge investment at stake if things don't work out according to their expectations.
  • There's a clear path to walk the talk. In larger projects there is sometimes a temptation to drift away - stakeholders change or priorities change - and this is not beneficial for a company transitioning its processes to data-driven ones. Starting small, the company can answer two questions: do I really have a problem that I want to solve with data? And if so, can I trust this partner to help me with it?
  • It's a good way to embrace an "experiment" mindset. Data Science has science in its name - and that's no coincidence. Data Science is built on experimental concepts, and some organisations are not ready to embrace that mentality - it takes time, resources and an overall acceptance of intangible benefits that aren't born overnight. Small proof of concept projects make it possible to show the benefits of embracing this mentality - for instance, building more flexible algorithms and data pipelines that are ready for and embrace change (something that more traditional data projects can't do - imagine changing the schema of a database every day!).

"Please, be specific!"

Let's imagine two companies - one that develops mobile apps (Company A), born in 2019, and another that provides electrical power to numerous cities (Company B), born in 1940.

How can they, being so different, benefit from a POC approach?

The Startup Case

Let's pick Company A first. They probably know what data science is - they were born in the data science era. There's a bunch of data sitting around that no one in the company knows how to use, and their budget is tied to their VCs' will.

The CEO thinks that there is a huge churn problem with some of their apps - 80% of the users that download certain apps use them only once.

He wants to understand why, but he lacks the resources to do that. Ideally, he would like an internal data science team with people who have some experience and know the right questions to ask, but this is costly and completely out of his budget, so he starts looking for external companies that might help him.

In his mind, two questions and fears might come into play:

"How can I make sure that I choose the right company to help me?"

and

"How can I be sure that I'm not throwing money away?"

In both questions, what he wants is to minimize risk. He wants to be sure that if things go well, he chose the right company to help him start his journey of digging through all its data, and if things don't turn out as expected, he doesn't want to be tied to a project that will drain a lot of his money and patience.

Introducing data science with a small project first is an excellent way to tackle this problem. The CEO clearly wants to solve a problem - lower churn. If we approach this the POC way, we start by defining a simple problem and a metric of success:

  • Choose one of his apps and study, over one or two months, the churn pattern of its customers - which variables explain why users stop using the app?
  • The company defines that an algorithm that identifies 85% of the customers who churn is good enough. There is also a need for some kind of explainability, because the CEO's objective is not only to understand which customers are more prone to churn but also to change the underlying processes that may cause that churn (a long waiting time between menus, for instance). A small sketch of what such a model could look like follows this list.
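To make this concrete, here is a minimal sketch of the kind of model such a POC could produce. The file name, the feature columns and the churn label are hypothetical assumptions used purely for illustration; logistic regression is one possible choice that satisfies the explainability requirement, since its coefficients show which behaviours push users towards churn.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical extract of per-user app usage; "churned" is 1 if the user
# stopped using the app after the first session, 0 otherwise.
df = pd.read_csv("app_usage.csv")
features = ["time_between_menus", "crashes_first_session", "onboarding_steps_done"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# "Finds 85% of the customers who churn" maps naturally to recall on the churn class.
print("recall on churners:", recall_score(y_test, model.predict(X_test)))

# The coefficients give the explainability the CEO asks for:
# positive values mean the behaviour is associated with more churn.
print(dict(zip(features, model.coef_[0].round(2))))
```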

With a problem and a metric defined, there is a clear path to success. Fast-forwarding to the end of the project, several outcomes are possible - from most to least successful:

  • The accuracy target of the project was reached and the CEO was very happy with the variables shown to explain churn. He can now make a couple of changes in the app that will probably result in lower churn in the future.
  • The accuracy target of the project was not reached, but even so, some variables still showed some explanatory power and the CEO can now act upon them.
  • The accuracy target was not met and no variables showed meaningful predictive power.

So what's the downside here? Basically, spending a couple of thousand dollars on a project to learn that the data the company has available can't explain churn. But there are some positive externalities in doing this project - think of the data pipeline that was probably built, for example.

What if the project really works? It can basically be scaled across the company's other apps, and the CEO is more confident that he can trust the company that is helping him - and can probably rely on them if he wants to recruit an in-house team.

There's a clear gain for smaller companies in using a POC approach - they can minimize the risk if things go amiss - so how can big companies (which probably have better cash flow) benefit from this approach?

The 80-Year-Old Company Case

Let's put ourselves in the shoes of another person - the director of a company that distributes electricity across a region of the country. This director has been to countless conferences that basically scream the same thing:

You should be doing analytics and data science!!

But he also understands that there is strong resistance to automation and algorithms at the company. The commercial, marketing and finance departments have huge accumulated experience of the business, and it's certainly hard to top that.

... and most analytics and data projects done at the company have missed their deadlines, their budget or both.

Adding to that, there is a whole hype cycle that big companies face - countless technologies, methodologies and innovations regarding data come and go, most of them passing through a kind of funnel where only the fittest (or the luckiest) survive - SQL stuck around, Hadoop not so much.

So, for a bigger company (and this is a generalisation, for sure) the main problem is twofold: one part related to people and stakeholders, and one related to technology.

  • People-wise, the biggest risk is developing an algorithm that will be poorly received by the primary stakeholders. In every project there is resistance from people to seeing the benefit of using algorithms or data - a natural human resistance that is almost as old as data collection itself.
  • Technology-wise, the biggest risk is investing a whole lot of money in a huge project that ends up not bringing much value and becomes a kind of zombie project.

So, how exactly does a POC approach minimize these risks for a big company?

As with the startup case, let's define a specific problem:

  • One of the first problems the company identified that might be solved with data was the failure of some components of their power grid. They want to understand whether they can predict these failures before they actually happen - currently, replacement is driven only by a rule based on the age of the components.
  • Predicting the failure of a grid component before it actually happens will probably prevent neighbouring components from also becoming faulty and requiring replacement - not only does this save the company money in the long run, it also keeps the downtime of the grid much lower.

And the metric of success:

  • The company thinks that lowering the total cost of replacing faulty components is a good financial metric for the success of the project - but, as replacing these probably-faulty components may take time, they also want to track the total downtime of the power network.

In this case, what's the best way to kickstart this project? You've guessed it - a POC approach.

During the first two months, tests can be done in just one specific region of the country - and, within the same region, a part of the grid should be left as-is to have a ceteris paribus (all other things being constant) comparison. With a POC approach, here is the timeline:

  • Data collection: two to three weeks.
  • Development of the algorithm: four weeks.
  • Testing it in the real world and comparing results: one to two months.

Now, the last step is a tricky one. It requires the participation of several stakeholders, such as the IT department and the support team. But the good thing is that, with a POC approach, after developing the solution the company can test it when they feel ready and have time to get everyone on board.
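When the comparison step arrives, the analysis itself can stay simple. Here is a minimal sketch of how the two success metrics could be compared at the end of the pilot, assuming a hypothetical log with one row per region and month and a "group" column marking whether the row belongs to the POC or the control part of the grid; the file and column names are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical pilot log: one row per region-month with the group label
# ("poc" for predictive maintenance, "control" for the age-based rule),
# the money spent replacing components and the downtime observed.
log = pd.read_csv("grid_pilot_log.csv")

summary = log.groupby("group")[["replacement_cost", "downtime_hours"]].sum()
print(summary)

# The POC is a clear win only if both metrics are lower than in the
# control part of the grid that kept the previous process.
poc, control = summary.loc["poc"], summary.loc["control"]
print("cost saved:", control["replacement_cost"] - poc["replacement_cost"])
print("downtime avoided (hours):", control["downtime_hours"] - poc["downtime_hours"])
```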

At the end of the project, again, three things might happen:

  • The company lowered the cost of maintaining the power grid and even had lower downtime compared to the control group, where the previous process kept running. Management is extremely happy and wants to scale the solution to the whole country.
  • The company lowered the cost of the power grid, but the downtime was higher - the company probably wants to run the test a couple more times to understand whether there was a problem with the development or just poor communication with the support team.
  • The company had a higher cost and higher downtime in the part of the grid where predictive maintenance was deployed. They were not happy, so they will probably not roll the solution out further.

Now let's return to our talk about risk. What was the downside here if the project fails? Let's run through some common blunders of data projects:

  • Did the company deploy a solution to the whole country? No!
  • Did the company change the way hundreds of their employees work? No!
  • Did the company spend a lot of money and resources to learn that predictive maintenance has a high chance of not working? No!
  • Did they convince employees that data was their saviour, only for it to turn out that it is not? No!

It's not all roses

You are probably thinking: if there seems to be so much upside in using a POC approach in data science projects, why aren't all of them approached like this?

  • First, a POC approach is an experimental way of doing things - you test, you try, and if it doesn't work out you clearly say that it doesn't work out. Many organisations aren't ready for this (no judgement), as they are more inclined towards certainty due to political, economic, cultural or other factors. It takes guts and courage to tell stakeholders: "we did a project and it failed!"
  • Failing is sometimes perceived as weakness. Some POCs fail. Others don't. There is still a misconception about failure that is not beneficial if you want to approach some of your projects the POC way.
  • Need to think probabilistically. Most humans don't think probabilistically, and that's fine. But when it comes to science in general, you will benefit from approaching it that way.
  • Culture, culture, culture. In the end, culture is the most important factor for a POC approach to work. If your culture enables it, you will benefit greatly from introducing innovation at your company using POCs.

At DareData we believe that the POC approach is the Bayesian way of introducing data science at your company. Bayesians tend to look at the world probabilistically and update their belief that something is true as they see more evidence of it.

Let me dive into this concept without getting too technical, boring or both. Bayes' Theorem tells us that, starting from a baseline probability, we can update our probability that something is true given new evidence. Imagine this for your company (a small sketch of the updating mechanics follows the list):

  • As a baseline, comparing with competitors, you believe that the data you have or may acquire has a 30% probability of adding value to your business/customers/employees.
  • You do a first POC - it goes extremely well, so in your mind this probability doubles, jumping to 60%.
  • You do a second POC - it goes alright, but some targets were not met, so you revise your estimate of the data's value down to 50%.
  • You do a third POC - it goes extremely well again; you decide to be more cautious and raise your probability by 40% instead of doubling it, reaching 70%.
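To see the mechanics behind these hand-picked numbers, here is a minimal sketch of a single Bayesian update. The likelihoods (how often a POC succeeds when the data is genuinely valuable versus when it is not) are assumptions chosen only for illustration.

```python
# Bayes' rule: P(valuable | POC succeeded) =
#   P(POC succeeded | valuable) * P(valuable) / P(POC succeeded)
def update_belief(prior, p_success_if_valuable, p_success_if_not):
    evidence = p_success_if_valuable * prior + p_success_if_not * (1 - prior)
    return p_success_if_valuable * prior / evidence

belief = 0.30                             # baseline belief that your data adds value
belief = update_belief(belief, 0.8, 0.3)  # a successful first POC raises it
print(f"belief after one successful POC: {belief:.0%}")  # ~53%
```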

Bottom line: if I ask you again, what do you prefer - scaling a project to the entire organisation with a 30% belief or a 70% belief that it will add value?


As a final note, thank you for taking the time to read this! If you are interested in talking with us, even for just a quick chat (no strings attached), visit our website (https://daredata.engineering/home) and hit the Scope a Project button - we would love to talk with you!