Architecture: monolith works – Doctolib Engineering – Medium
How a pragmatic approach shapes Doctolib, a system handling 1B monthly requests from 20M patients
Doctolib helps doctors and patients simplify their healthcare path. Currently, 60K practitioners use our service to manage their appointments, and 20M patients use our public website every month to find a practitioner. Our company started from scratch 5 years ago, and today Doctolib is the leader in France and the biggest player in this market in Europe.
In technical terms, Doctolib’s usage represents more than 1B monthly requests to our servers with more than 25K active users throughout the day and nearly 3TB of data in our databases. This is just the beginning of our exciting adventure and we expect our traffic to double in the coming months, as has happened regularly since the beginning.
Throughout this article you will see that we’ve managed to keep our architecture rather simple; the reasons behind this stem primarily from our methodology. We are not here to praise monolith architecture, but rather to recognize the power of pragmatism. Achieving technical simplicity is at the core of our philosophy, and when you break it down, two fundamental elements help us get there: pragmatism (YAGNI*) and short feedback loops.
*YAGNI: You Ain’t Gonna Need It
Under the hood
Doctolib’s software is essentially a monolith written with Ruby on Rails and React/RxJS. It runs on Unicorn web servers, with a PostgreSQL database and Redis for caching and Resque job queuing.
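For readers unfamiliar with Resque, here is a minimal sketch of what a background job in its style looks like. The class and queue names are made up for illustration; in a real app, `Resque.enqueue` would push the job onto a Redis-backed queue and a worker process would later call `.perform` with the same arguments.

```ruby
# A job class in Resque's shape: a queue name plus a class-level perform.
# In production, Resque.enqueue(AppointmentReminderJob, appointment_id)
# serializes the arguments into Redis; a worker picks them up and runs
# AppointmentReminderJob.perform(appointment_id) asynchronously.
class AppointmentReminderJob
  @queue = :notifications  # Resque reads this class instance variable

  def self.perform(appointment_id)
    # In a real app this would load the record and send an SMS/email;
    # here we just return a message so the sketch stays self-contained.
    "reminder sent for appointment #{appointment_id}"
  end
end
```

Calling `AppointmentReminderJob.perform(42)` directly (as a worker would) returns the message synchronously; the enqueue step is what makes it asynchronous.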
The same backend application houses the public website, practitioner services, staff features, and data integration with healthcare systems, for both mobile and desktop. It is currently deployed on 13 virtual machines with 30 Unicorn workers each, which serve more than 1B monthly requests with a 95th-percentile response time under 400 ms.
Each worker is dedicated to either public or pro traffic, with HAProxy routing requests and managing load balancing; workers can also be reallocated easily and without any downtime if necessary.
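A routing setup like this can be sketched in HAProxy configuration. This is a hedged illustration only, not Doctolib’s actual config: the hostnames, addresses, and pool sizes are all made up.

```
# Illustrative HAProxy fragment: route pro traffic to its own worker pool
# by hostname, everything else to the public pool. All names are invented.
frontend www
    bind *:80
    acl is_pro hdr(host) -i pro.example.com   # classify pro traffic
    use_backend pro_workers if is_pro
    default_backend public_workers

backend public_workers
    balance roundrobin
    server web1 10.0.0.1:8080 check
    server web2 10.0.0.2:8080 check

backend pro_workers
    balance roundrobin
    server web3 10.0.0.3:8080 check
```

Moving a server line from one backend to the other (and reloading HAProxy) is one way such a reallocation could happen without downtime.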
We use PostgreSQL for storage, with data replication to a secondary instance for failover. Our database is about 3TB in size, and it needs to handle 1M database requests per minute at peak.
As it turns out, we are quite comfortable with this monolith, despite the fact that at this point, we have more than 30 engineers working on the same codebase.
Fine-tuned architecture
To deal with such huge volumes, we currently rely mostly on vertical scaling of our database and a lot of fine tuning: critical requests are heavily optimized, from SQL queries to JSON rendering, and needless N+1 queries are hunted down. Yes, this is a rather classic approach, miles from fashionable distributed microservices and NoSQL architectures. That doesn’t mean we won’t ever explore these options; we simply believe we do not need them at this time. As far as our performance and scaling needs go, we can still achieve high availability with this classic approach for most of our features.
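The N+1 pattern mentioned above can be shown in plain Ruby, without a database. This sketch uses a tiny fake datastore that counts the “queries” it receives; in an actual Rails codebase the equivalent fix is typically eager loading, e.g. `Appointment.includes(:patient)`.

```ruby
# A fake datastore that counts queries, to contrast N+1 vs batched access.
class FakeDB
  attr_reader :query_count

  def initialize
    @patients = { 1 => "Alice", 2 => "Bob", 3 => "Carol" }
    @query_count = 0
  end

  def patient(id)       # one query per call
    @query_count += 1
    @patients[id]
  end

  def patients(ids)     # one query for the whole batch
    @query_count += 1
    @patients.values_at(*ids)
  end
end

appointment_patient_ids = [1, 2, 3]

# N+1 style: one query per appointment's patient.
db = FakeDB.new
names = appointment_patient_ids.map { |id| db.patient(id) }
n_plus_one_queries = db.query_count   # => 3

# Batched style: a single query fetches all patients at once.
db = FakeDB.new
names = db.patients(appointment_patient_ids)
batched_queries = db.query_count      # => 1
```

On a hot endpoint, the difference between N queries and 1 is exactly the kind of fine tuning the classic approach relies on.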
Our clients use Doctolib as their sole scheduling tool at work, so reaching the best possible service availability is critical. To achieve this, we implemented several features that let us face heavy loads and mitigate incidents:
- We use feature toggles and circuit breakers at the application level. This gives us the ability to completely disable non-critical endpoints, so the critical ones are shielded from heavy load, and to disable external services (like SMS sending) whenever a provider is down.
- Our infrastructure is fully replicated in two datacenters hosted in France by HADS-certified providers; in case of a major outage, we can switch to our passive datacenter in less than 10 minutes.
- Our secondary PG server is used for failover in active/passive mode, but we have started adding some horizontal scaling capability with dynamic distribution of read requests at the database level, to mitigate risk should our primary server be unable to handle traffic peaks.
- Some specific use cases require different technologies; for example, ElasticSearch lets us give health institutions instantaneous autocomplete results when searching for a patient among hundreds of thousands.
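The circuit breaker idea from the first bullet can be sketched in a few lines of Ruby. This is a simplified illustration, not Doctolib’s actual implementation; the class name, threshold, and fallback mechanism are all made up.

```ruby
# A minimal circuit breaker: after `threshold` consecutive failures it
# "opens" and short-circuits further calls to a fallback value instead
# of hitting the (presumably down) external service again.
class CircuitBreaker
  def initialize(threshold: 3)
    @threshold = threshold
    @failures = 0
  end

  def open?
    @failures >= @threshold
  end

  def call(fallback:)
    return fallback if open?   # skip the real call entirely once open
    begin
      result = yield
      @failures = 0            # a success closes the breaker again
      result
    rescue StandardError
      @failures += 1
      fallback
    end
  end
end

# Example: protect SMS sending while the provider is down.
breaker = CircuitBreaker.new(threshold: 2)
3.times { breaker.call(fallback: :skipped) { raise "SMS gateway down" } }
breaker.open?   # => true: further SMS calls return :skipped immediately
```

Combined with a feature toggle (an on/off switch flipped by an operator rather than by failures), this keeps an outage in a non-critical dependency from cascading into the critical path.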
A matter of mindset
As mentioned above, among the different fundamentals described in our Engineering Manifesto, two are especially important in the way we build software:
- pragmatism, so we can keep our technical stack simple
- the gathering of widespread, short-term feedback, which gives us the space to make these pragmatic decisions.
Simplicity and pragmatism
We take our philosophy of pragmatism seriously at Doctolib and use it to avoid falling into the traps of fashionable hype tech.
Think simple. Phrases like “if ever” and “if one day” are prohibited, unless “one day” has a real date in the near future. “This is cleaner” or “This is more generic” are not good arguments on their own either. We don’t want to introduce useless complexity just in case we need it, or merely to achieve a perfect code design.
When someone proposes a solution outside of our technical stack to solve a particular problem, we always first ask ourselves whether the solution could be found within our current stack, perhaps through something we hadn’t thought of or were not aware of.
Let’s look at some examples:
- Our Ruby code mostly sticks to the standard Rails way, while using tricky features like ActiveRecord callbacks sparingly. This allows new joiners to jump into this part of the codebase without too much additional technical learning. It doesn’t mean we do no code design at all, as frameworks like Rails leave most of the design work in pure business code.
- We also ended up with such an excess of technical complexity in our React/RxJS front-end stack that many developers felt lost in the codebase. We decided to refactor it towards something that lets us focus on business code rather than wiring technical code. This article about how we treated this technical debt gives a glimpse of the problem.
- We introduced the Rust language to develop a low-level desktop application. Rust is a great language, but it quickly became evident that this tiny repository was difficult to maintain, as everybody works with JS and Ruby 99% of the time. So, when we found out that we could do the same using pkg in JS, we rewrote it to reduce complexity.
- We have introduced ElasticSearch, but before doing so we asked ourselves whether instant search could be achieved with Postgres features, since we already use PG quite intensively. We couldn’t manage it at the time; however, this is something we may revisit as PostgreSQL improves over the years, as we’d love to reduce the complexity and heterogeneity of our stack.
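The first example, keeping logic out of ActiveRecord callbacks, can be sketched in plain Ruby. This is an illustrative pattern, not code from the Doctolib codebase; every name here is hypothetical. Instead of a model `after_create` callback that implicitly sends a confirmation, the side effect lives in an explicit service object the caller invokes on purpose.

```ruby
# A plain-Ruby service object: the appointment is created and the
# notification triggered in one visible place, rather than hidden
# inside an ActiveRecord callback chain. All names are invented.
Appointment = Struct.new(:id, :patient_name)

class BookAppointment
  def initialize(notifier:)
    @notifier = notifier   # injected, so the service is easy to test
  end

  # Creates the appointment and triggers side effects explicitly,
  # so a reader sees everything that happens in one method.
  def call(id, patient_name)
    appointment = Appointment.new(id, patient_name)
    @notifier.call("confirmation for #{patient_name}")
    appointment
  end
end

sent = []
service = BookAppointment.new(notifier: ->(msg) { sent << msg })
appointment = service.call(1, "Alice")
sent.first   # => "confirmation for Alice"
```

A new joiner reading `BookAppointment#call` sees the whole story, which is the “standard Rails way without too much additional technical learning” the bullet above describes.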
Pragmatism means that we have to be careful to always challenge our choices. We’re strongly opinionated about the way we move forward, not about where we end up. We don’t know exactly what we’ll need in the coming years, but we know that these choices work for the time being and certainly for the next six months; each choice brings us closer to figuring out how to jump over the next hurdle.
Short feedback loops
In order to be able to decide quickly what our next move will be, short feedback loops are completely mandatory.
To begin with, we wouldn’t all be able to work together in the same codebase without a certain number of survival practices:
- A new version of the product is released at least once a day; we couldn’t do it without relying heavily on continuous integration, code reviews, and our harness of 8,000 tests, many of them written using TDD.
- We have a big monolith, but we work in feature teams with well-delineated functional scopes and all the required skills, so they can make product decisions quickly and autonomously. All of these teams, of course, use practices such as close collaboration between tech and product, daily standup meetings, and team retrospectives to enforce early feedback and continuous improvement.
- Almost thirty developers working on the same codebase could be painful; however, short-lived pull requests (75% live under 3 days) and a lot of feature toggles allow us to keep shared ownership of the different parts of the codebase without getting in each other’s way.
Another important source of feedback is the application itself: feature usage and KPIs on the product side, and performance and system monitoring with tools like New Relic, Grafana, or Sensu on the technical side. With all this we get instant feedback to react to incidents, and also aggregated long-term data so we can analyze progression and anticipate our next moves.
Can we become a giant?
So, everything is perfect! Of course not. Parts of our code grow quickly and become very complex, so we must be careful and refactor them before they become too difficult to maintain. Outages happen, and we have to learn from them in order to anticipate and better adapt to the fast growth of our business.
As in every company, sometimes we invest too little in a subject, other times too much. This is probably the hardest part: because our business has grown, we’re no longer a small startup that can rely only on short-term decisions. We have to adapt and find new solutions to climb to the next step. It is a delicate balance: moving too fast risks introducing a big change too early. Engineers at web giants like Facebook or Google could never have imagined what their architecture would look like five or ten years after they started; and even if they could have, they would never be where they are now if they had tried to build it from day one.
Despite its provocative title, this article is not praise for the monolith: we don’t know what our software will look like in a few years; maybe it won’t even be a monolith anymore. It is praise for pragmatism. We strongly believe that being pragmatic is a large part of what has made us successful, and that it will continue to do so in the years to come.
And here is the great news: our codebase is a monolith, built on rather simple architectural principles and an infrastructure relying mostly on vertical scaling, and we have only just started exploring different paths for specific cases, like horizontal scaling or caching. But this is just the beginning for us: there are still plenty of not-silver-bullets out there, and that’s a very good thing to know.