The One Line Traefik/Docker Config That Broke My Routing

2024/04/20 Justin Hammond 3 minutes Software

I've used Traefik to route traffic to my Docker containers for years without a problem. Containers that needed to connect to a database would run on their own "network" so the DB could talk to the UI. There would then be a global "traefik" network that all frontend containers joined so that Traefik would know how to route traffic to the UI. Up until a few months ago, this worked perfectly well with very little config (see my laravel-template project for an example of how little config this actually takes). Then one day, disaster struck.

Some of my deploys started failing. Well, rather, after a new deploy, some of the containers were unreachable, eventually resulting in 503 and 504 status code errors. I always run two replicas of every UI container in case something like this happens. I started noticing odd behavior where sometimes one of the two replicas would only return 5xx errors while the other worked correctly. I couldn't pin down whether this was some odd change I had made in my deployment system, Harvey, or whether it had something to do with Traefik or Docker.

labels:
  - traefik.docker.network=traefik

After many long evenings of research, I finally came across this small piece of documentation (which, naturally, I first found by way of many forums) that details how Traefik can choose the wrong network to route traffic to, based on the order in which Docker returns them. For years, it seemed that Docker returned (at least in my case) the two networks per service (the internal one for the DB and the "traefik" one to connect to) in an order that led Traefik to always use the "traefik" network to route traffic to the UI. A recent change must have been made that lets the two networks come back in a different order each time, which would explain the sporadic 5xx errors. The only fix was redeploying the project, which could lead to the same problem depending on your luck, which would necessitate another deploy. During this time, there were deploys I had to run a half-dozen times before both replicas were properly returning 200s because Traefik finally got the right network to connect to.
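To show where that label sits in context, here is a minimal sketch of a Compose file with one UI service attached to two networks. Everything except the `traefik.docker.network=traefik` label is an assumption for illustration: the `app` and `db` service names, the images, the hostname, the `internal` network name, and the choice to declare `traefik` as an external network are all hypothetical and not taken from my actual setup.

```yaml
# Illustrative sketch only; service names, images, and hostname are placeholders.
services:
  app:
    image: example/app:latest        # hypothetical UI image
    networks:
      - internal                     # private network shared with the database
      - traefik                      # shared network that Traefik is attached to
    labels:
      - traefik.enable=true
      - traefik.http.routers.app.rule=Host(`app.example.com`)
      # Tell Traefik which of the two networks to use when reaching this container,
      # instead of leaving it to whichever one Docker happens to list first.
      - traefik.docker.network=traefik

  db:
    image: postgres:16               # hypothetical database image
    networks:
      - internal                     # not reachable from the traefik network

networks:
  internal:
  traefik:
    external: true                   # assumes the network was created outside this file
```

One thing to watch for: the label's value has to match the network's real name as Docker sees it. With an external network that is simply `traefik`, but a network defined inside the Compose file usually gets the project name prefixed to it. Traefik's Docker provider also has a `network` option that sets a default for all containers, which may be worth considering if every service shares the same routing network.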

TLDR: RTFM, and set your Traefik network explicitly so you don't cry for weeks like me.

Justin Hammond
I love all things tech. I've been programming since the age of 12, repairing iPhones since 16, and founding tech companies since 20. I'm an open source fanatic, Apple fanboy, and love to explore new tech. I spend my time coding open source projects, tinkering with electronics and new tech products, and consulting teams on how to get things done.