Monorepos for Data Teams

Content from this discussion, adapted for a longer format.

Even within “it depends on your exact case”, I’m still pro mono-repo. The cases I’ve seen where it might make sense not to use a mono-repo came down to security (restricting the repo for PRs the way you need to) or function (completely disparate work).

Broadly, there are 3 things I like about mono-repos:

Workflows and tools (e.g., linters, bots, tests) get defined once
You can do all of the work to deploy something in one PR
Searchability - variable names, references all in one repo

GitLab’s data team handbook is one of my favorites, but they’re not showing you their actual code, if I remember correctly.

If I were starting again from scratch, here’s what I’d focus on with a mono-repo:

Structure: it’s helpful to think about an IDE’s sidebar. How will you navigate the code? It’s okay to have to “travel” to other parts of the repo, but you want to make it easy to get where you’re going.
Documentation: I love the strategy where everything starts with a merge request. You streamline how you communicate “big things” - code & docs. It’s helpful to think about strategy as you get started (e.g., does each project get a README?)
Contribution: how do you make it easy to contribute? This is everything from automation (e.g., setting an EditorConfig) to having a CONTRIBUTING file so people can easily understand what you want them to do.
References: how do you connect code to your request tracking (e.g., what you get from an Intake Form)? Same for bugs/reports that “the data don’t look right”? Tight integration makes it easy to reference changes the next time something comes up
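
To make the "does each project get a README?" question concrete, here is a minimal sketch of a check you could define once and run for the whole repo. It assumes a hypothetical layout where every project lives in its own folder under a top-level `projects/` directory; the function name and that directory name are mine for illustration, not a convention from any particular team.

```python
from pathlib import Path


def projects_missing_readme(repo_root: Path, projects_dir: str = "projects") -> list[str]:
    """Return the names of project folders that have no README file.

    Assumes (hypothetically) that each project is a direct subfolder of
    repo_root / projects_dir. Adjust to match your own repo structure.
    """
    base = repo_root / projects_dir
    if not base.is_dir():
        # Nothing to check if the assumed layout is not present.
        return []

    missing = []
    for project in sorted(p for p in base.iterdir() if p.is_dir()):
        # Skip hidden folders such as tool caches.
        if project.name.startswith("."):
            continue
        # Accept README, README.md, README.rst, etc., in any letter case.
        has_readme = any(
            f.is_file() and f.name.lower().startswith("readme")
            for f in project.iterdir()
        )
        if not has_readme:
            missing.append(project.name)
    return missing
```

Calling `projects_missing_readme(Path.cwd())` from a CI job or a pre-merge hook and failing when the list is non-empty is one way to make the documentation strategy enforce itself, rather than relying on reviewers to remember it.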

Last thought: it won’t be perfect immediately. Mono-repos are like homes: you spend a lot of time in them, and they take tweaking to get exactly how you want them. Also, the great ones are loved and get regular maintenance.