Data science team sizing and allocation

An algorithm

Jul 30, 2019

This one is for the crawlers and the robots.

There are many ways to organize a data science team within a company. One of the most effective is the hybrid model, as I explain verbosely in a post and briefly in a thread:

Models for integrating data science teams within organizations

A comparative analysis

Q. Embedded or centralized?
A. Both.
Embedded for context, relevance, communication efficiency, and to be in sync; centralized for hiring and promotion purposes, for peer review, and for sharing and maintaining best practices.
Pardis Noorzad on Twitter

Centralized management of a team is not without its challenges, the most prominent being team sizing and data scientist allocation. With the hybrid model, however, there's an easy-to-follow procedure that leaders can use as they estimate hiring budgets.

In this post, I present a procedure for allocating data scientists to product teams. For this procedure to succeed, I assume that the following conditions hold for the organization under study.

Assumptions

Assumption 1. Long-term ownership is valuable.

Data scientists, like engineers, can produce high-quality, impactful results only when they have long-term ownership over the product and their work. Good products can only be created with care. Yet many still see data science as a series of disparate projects, defined and requested by stakeholders.

Data science on a team isn’t a project, it doesn’t have a start and an end, it’s an ongoing process. Data changes as the product changes.
Pardis Noorzad on Twitter

Long-term ownership over one product also leads to strong team dynamics. As Will Larson explains in this post, disassembling necessary teams for the sake of short-term projects is counterproductive. Shift scope, don't break teams.

Recently had several discussions around whether it makes sense to shift folks onto higher priority teams after you’ve repaid a team’s technical debt. In general I think you probably should *not* do that, and wrote up my thinking why. https://lethain.com/case-against-top-down-global-optimization/
Will Larson on Twitter

Assumption 2. In a software company, the engineering teams and their size determine the company strategy and objectives.

The engineering teams are created in such a way as to be able to tackle the various strategic bets and existing value areas of the company. If this is not the case, the existing engineering teams are the unannounced strategic bets and value areas of the company.

Assumption 3. More engineers on a team lead to more moving parts, more experiments, and more meetings.

The bigger an engineering team, the more time must be allocated to meetings and correspondence for coordination and planning. In addition, more engineers result in more projects and more experiments.

Assumption 4. Teams with more mature machine learning capabilities have better instrumentation, better quality data, and more data sources that are useful to the data science team.

Machine learning models require upfront effort in data preparation to produce feature vectors. Further, ML models are highly sensitive to data shortcomings. And so teams with mature ML models tend to have higher-quality data and aggregate data sets than other teams. If this is not the case, stop what you're doing and fix that tire fire.

Assumption 5. Teams with a user-facing aspect run additional client-facing experiments. These experiments sometimes double the number of experiments on an engineering team.

Assumption 6. Having just one data scientist on a team significantly improves the quality of data (leading to less buggy data products) and the speed of decision making (leading to faster product iterations).

Procedure

Based on the assumptions above, my proposal is to assign one point to every engineer on every team (client + backend). Each point contributes to a per-team sum that represents the relative amount of data science work the team requires.

Sum up the points for each team and sort the teams in descending order of points. Break ties by a team's data maturity, i.e., the higher the maturity, the lower on the list.

Assign one data scientist to every team on the list, starting from the top. Note that this is not per project but rather per team (with cross-functional membership). This will help get all teams from 0 to 1.

At this phase, if you still have data scientists left to assign, take 3 points off every team (we are taking 1:3 to be the ideal data scientist to engineer ratio as a start). Then assign another data scientist per team, starting at the top, and repeat.
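The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the `Team` class, the `data_maturity` score, and the `allocate` function are hypothetical names introduced here. One assumption goes beyond the text: in every pass after the first, a team whose remaining points have dropped to zero or below is skipped. The post does not state this rule explicitly, but without it the subtraction would never change who gets the next data scientist.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    engineers: int      # client + backend engineers; one point each
    data_maturity: int  # hypothetical score; higher means more mature


def allocate(teams: list[Team], available: int, ratio: int = 3) -> dict[str, int]:
    """Return the number of data scientists assigned to each team."""
    # Sort by points, descending. On ties, the more mature team goes lower.
    ranked = sorted(teams, key=lambda t: (-t.engineers, t.data_maturity))
    points = {t.name: t.engineers for t in ranked}
    assigned = {t.name: 0 for t in ranked}

    first_pass = True
    while available > 0:
        # First pass: every team is eligible, to get all teams from 0 to 1.
        # Later passes: ASSUMPTION (not in the post), only teams that
        # still have points left are eligible.
        eligible = [t for t in ranked if first_pass or points[t.name] > 0]
        if not eligible:
            break
        for team in eligible:
            if available == 0:
                break
            assigned[team.name] += 1
            available -= 1
        # Take `ratio` points off every team before the next pass.
        for name in points:
            points[name] -= ratio
        first_pass = False
    return assigned


teams = [
    Team("A", engineers=10, data_maturity=1),
    Team("B", engineers=4, data_maturity=2),
    Team("C", engineers=4, data_maturity=1),
    Team("D", engineers=2, data_maturity=0),
]
print(allocate(teams, available=6))
```

With six data scientists, every team gets one in the first pass. The remaining two go to A and C: C is ranked above B because the two are tied on points and C is the less mature team. The result is `{'A': 2, 'C': 2, 'B': 1, 'D': 1}`. If there are more data scientists than the point totals call for, the loop stops early and the surplus stays unassigned.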

Conclusion

The approach presented above gives a clear, fair, safe, and effective strategy for allocating data scientists to product teams.

References

On Sizing Your Engineering Organizations by kellan

Please let me know your thoughts in the comments or the Tweets.