Tips on How to Manage Large-Scale Data Science Projects

Use these tips to maximize the success of your data science project

Managing large-scale data science and machine learning projects is challenging because they differ significantly from software engineering. Since we aim to discover patterns in data without explicitly coding them, there is more uncertainty involved, which can lead to various issues such as:

  • Stakeholders’ high expectations may go unmet
  • Projects can take longer than initially planned

The uncertainty arising from ML projects is a major cause of setbacks. And when it comes to large-scale projects, which normally have higher expectations attached to them, these setbacks can be amplified and have catastrophic consequences for organizations and teams.

This blog post grew out of my experience managing large-scale data science projects with DareData. I’ve had the opportunity to manage diverse projects across various industries, collaborating with talented teams who’ve contributed to my growth and success along the way. It’s thanks to them that I could gather these tips and lay them out in writing.

Below are some core principles that have guided me in making many of my projects successful. I hope you find them valuable for your own projects as well!

Analysis / Statistics and Prediction

It’s very important to categorize the project you are about to start, as the most confusing topic for stakeholders is the difference between AI, ML and DS. These topics are also everywhere in the news and media, and people use the terms interchangeably (I don’t blame them).

The most important thing every stakeholder needs to understand is whether the project involves machine learning or not. Some projects are “data science” projects but do not contain any prediction feature, which significantly reduces the uncertainty of the project.

I normally group projects into the following types:

  • analysis for insight projects involve examining data from current or historical sources to derive actionable insights. These projects typically focus on understanding trends or patterns based on data that has already been collected. Common examples include reporting and business intelligence (BI) initiatives.
  • statistics for causality projects may seem like machine learning projects but are a bit different. They analyse data in the context of statistical hypotheses, without trying to predict the future. Good examples are all types of A/B tests and other treatment/control analyses.
  • machine learning or clustering: these are the real machine learning projects, and they can use supervised, unsupervised or reinforcement learning (or a mix of them).
  • I view GenAI projects as a subset of machine learning, as they involve prediction and error handling. They require management strategies similar to those of traditional ML projects.

Projects are rarely limited to just one type. For instance, ML projects often feature dashboards that display both historical data and predictions. While this combination is beneficial, it’s important for stakeholders to understand that predictions typically come with more uncertainty and take longer to develop than analyses of past data.

Managing Expectations

ML projects require that you manage uncertainty very well.

Users and stakeholders, particularly under the GenAI hype, expect very high performance from AI systems, and those expectations are sometimes unrealistic.

Managing expectations about the speed and accuracy of algorithms is absolutely crucial. Don’t promise a high accuracy, F-score or any other metric without seeing the data.

Don’t promise high speed without assessing the company’s systems and ability to scale. In essence, don’t overpromise.

Also, understand how your ML project adds value. Are you working for an organization that wants to use ML to reduce costs? Or are you looking to increase sales and revenue? Try to translate the main goal of the project from technical to business performance, as this will make the goal of the project much easier to understand.
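To make the translation from technical to business performance concrete, here is a minimal Python sketch for a binary classifier. Everything in it is an assumption for illustration: the `ConfusionCounts` type, the `net_business_value` function, the counts and the dollar values are hypothetical, and a real project would agree on those numbers with stakeholders.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from evaluating a binary classifier on a held-out set."""
    true_positives: int
    false_positives: int
    false_negatives: int
    true_negatives: int


def recall(counts: ConfusionCounts) -> float:
    """Technical metric: share of actual positives the model caught."""
    return counts.true_positives / (counts.true_positives + counts.false_negatives)


def net_business_value(
    counts: ConfusionCounts,
    value_per_true_positive: float,
    cost_per_false_positive: float,
    cost_per_false_negative: float,
) -> float:
    """Business metric: gains from correct alerts minus the cost of mistakes."""
    gains = counts.true_positives * value_per_true_positive
    losses = (
        counts.false_positives * cost_per_false_positive
        + counts.false_negatives * cost_per_false_negative
    )
    return gains - losses


# Hypothetical evaluation results and dollar values, for illustration only.
counts = ConfusionCounts(
    true_positives=80, false_positives=40, false_negatives=20, true_negatives=860
)
value = net_business_value(
    counts,
    value_per_true_positive=100.0,
    cost_per_false_positive=10.0,
    cost_per_false_negative=50.0,
)

print(f"recall: {recall(counts):.2f}")
print(f"net business value: ${value:,.2f}")
```

The same model can then be discussed with stakeholders in terms of money saved rather than recall alone, which tends to be an easier conversation.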

And that leads us to…

Success Metric

A project’s success metric is often the most important part of the project — Image by tinkerman @ Unsplash.com

No ML project should start without a success metric. Is it speed of prediction? Is it technical performance? Or is it dollars saved?

For stakeholders, faster and more accurate is always better. If you don’t address these expectations via the success metric definition right from the start, you may end up in a scenario where stakeholders expect 100% accuracy from an ML system.

The success metric definition is your friend. Right from the start, define the playing field with your stakeholders and make them agree, verbally and in writing, on a certain level of performance (technical, business, or other) and/or business impact.

Ideally, you should focus on optimizing a single metric to guide the project, because trade-offs will show up. If that’s not feasible, establish a hierarchy of metrics to prioritize, allowing you to make informed choices when necessary.
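One possible way to encode a hierarchy of metrics is sketched below in Python. This is an illustrative assumption rather than a standard method: the `Metric` type, the `pick_better` function, the metric names, the tolerances and the two candidate models are all hypothetical. The idea is to compare candidates on the top-priority metric first and only fall back to the next metric when the difference is too small to matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    higher_is_better: bool
    tolerance: float  # differences at or below this value count as a tie


def pick_better(a: dict, b: dict, hierarchy: list) -> dict:
    """Return the preferred candidate, walking the metrics in priority order."""
    for metric in hierarchy:
        diff = a[metric.name] - b[metric.name]
        if not metric.higher_is_better:
            diff = -diff  # flip the sign so that a positive diff always favours `a`
        if abs(diff) > metric.tolerance:
            return a if diff > 0 else b
    return a  # tie on every metric: keep the first candidate


# Hypothetical hierarchy agreed with stakeholders: business impact first,
# then technical performance, then prediction speed.
hierarchy = [
    Metric("dollars_saved", higher_is_better=True, tolerance=1000.0),
    Metric("recall", higher_is_better=True, tolerance=0.01),
    Metric("latency_ms", higher_is_better=False, tolerance=5.0),
]

# Hypothetical candidates, for illustration only.
model_a = {"name": "A", "dollars_saved": 50_000.0, "recall": 0.80, "latency_ms": 120.0}
model_b = {"name": "B", "dollars_saved": 50_400.0, "recall": 0.85, "latency_ms": 200.0}

winner = pick_better(model_a, model_b, hierarchy)
print(f"preferred model: {winner['name']}")
```

Here the two models are treated as tied on dollars saved (the gap is inside the tolerance), so the decision falls to recall, and latency is never consulted. Writing the order and tolerances down like this makes the trade-offs explicit before results come in.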

Organizational Stage

When working on a project, do you know the organization’s AI and Data Maturity stage? One common mistake data scientists make is to disregard the data maturity and context of the organization they are working in.

Some organizations are able to deploy machine learning models quickly, while others have more trouble doing so. Some organizations already work under MLOps best practices, while others are having trouble keeping track of the results of their ML models. These nuances are extremely important to ensure the success of your project.

Answering these questions will set the stage for several things:

  • How much data will you be able to feed into your ML model? For example, if you need more features, how likely is it that you will get them quickly?
  • How will you deploy the model or project?
  • Will the model be continuously monitored? Or does it need retraining?

These questions can only be answered by exploring the organization’s data processes. Engage with people (peers and leaders) to understand whether a clear data vision exists within the organization and how this vision aligns with the project you are developing.

Agile vs. CRISP-DM

Although Agile is applied by many organizations in Software Engineering projects, it should be used with caution in the context of ML.

For example, this paper compared the usage of different project methodologies in an ML context. It concluded that Agile mixed with CRISP-DM (Cross Industry Standard Process for Data Mining) may be a good combination to achieve positive outcomes (without leading to team frustration).

In my experience, some flexibility in sprint planning and task assignment is often needed with ML projects. It’s quite normal to spend 2 or 3 weeks without any major breakthroughs, and tasks may need to be completely changed due to new discoveries. If this is not taken into consideration, the team may feel disproportionate pressure to deliver results (that may be subpar or, even worse, solve a problem that doesn’t match stakeholders’ expectations).

This study details the inherent conflicts between traditional PM and AI workflow logics. The table below (taken from the study) highlights some of them:

[Table not reproduced here. Source: Managing artificial intelligence projects: Key insights from an AI consulting firm]

Team Management

Team management is more of an art than a science, and every team is different and has its own nuances.

The most important tip I can give on team management and leadership is to adhere as much as possible to the Manager-Maker schedule. Did you know that it takes around 10–15 minutes to get back “in the zone” after you are interrupted? This estimate can be even higher for complex tasks that need high levels of focus.

Protect your team from unwanted distractions (random pings on Teams / Slack or meetings that operate on a manager’s schedule) and they will appreciate it. This is the best tip that I can give for a happy and productive team.

Thank you for taking the time to read this post. I hope you’ve enjoyed these tips and can use them in your day-to-day work.

Let me know if there’s something that you would like to add! I’m always looking for fresh new perspectives on AI/ML/DS project management. Knowing how to handle these projects is definitely a rare skill, and project and product managers often determine a large portion of an initiative’s success.

This post is likely more relevant to consultancy-based projects or those in non-tech businesses, where data science and machine learning are crucial to day-to-day operations. But, hopefully, some of the tips will also be applicable to those organizations.

Want to know more about us? Visit daredata.ai or send me a message on LinkedIn.