Transitioning from Ingress-NGINX at Stack Overflow

Facing the Ingress-NGINX Retirement

When the retirement of Ingress-NGINX was announced in November, it was a wake-up call for many users, including us at Stack Overflow. Ingress-NGINX had been the backbone of our traffic management since we moved to Kubernetes. Its retirement left us scrambling for alternatives, especially because we had not fully engaged with the discussions around the **Gateway API**, a potential upgrade, on the theory that “if it isn’t broken, don’t fix it.” Now we needed a plan, and fast.

Crafting the Replacement Strategy

With so many traffic routing solutions available, we needed to narrow our options quickly. We wanted to move to the **Gateway API**, which promised richer features and better role separation, rather than adopt yet another Ingress controller. Time was limited, though, so if we hit obstacles adopting the Gateway API, switching to another Ingress implementation remained on the table. Our initial research gave us criteria to home in on three Gateway implementations, with a couple of Ingress alternatives as backups.

Gateway API Candidates

  • NGINX Gateway Fabric
  • Traefik
  • Istio

Ingress Alternatives

  • F5 NGINX Ingress
  • Traefik

The first filter we applied was that any replacement had to appear on the list of fully conformant Gateway API implementations. This gave us a solid framework for evaluating options. Because we operate primarily in Google Cloud Platform (GCP) and Azure, cloud-specific solutions were ruled out immediately. We then worked through the **1.4 feature matrix** and third-party benchmarks, which left the three contenders listed above. Sadly, **HAProxy**, a reliable ally from our data center days, was not a viable option: at the time it sat on the list of stale implementations, although it has since made strides toward conformance.

Among our fallback options, **Traefik** initially seemed promising because of its compatibility with NGINX annotations. As we dug deeper, however, we found that many of our critical annotations weren’t supported, which made it less appealing than anticipated. F5’s NGINX Ingress implementation showed some potential, but its reliance on proprietary resources rather than standard Kubernetes types raised concerns about future compatibility with other controllers. Ultimately, we dropped the Ingress alternatives from consideration early in the process.

Analyzing Our Current Usage

To test our candidates effectively, we first exported all Ingress objects from our main production clusters into YAML files. With the help of Claude, we sorted these objects by use case. It turned out that most of our routing needs were straightforward, which gave us about half a dozen key tests to define, along with two scalability benchmarks.
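
The sorting step described above can be approximated in a few lines of Python. This is only a rough sketch, not the process we actually ran: it assumes the exported objects have already been parsed into dictionaries, and the function names and the use-case labels (`tls`, `multi-path`, `plain-host-routing`) are hypothetical choices made for illustration.

```python
from collections import Counter

# Prefix used by Ingress-NGINX annotations.
NGINX_PREFIX = "nginx.ingress.kubernetes.io/"


def use_cases(ingress):
    """Return coarse use-case labels for one parsed Ingress object (a dict)."""
    labels = set()

    # Every Ingress-NGINX annotation is a behavior the replacement must cover.
    annotations = ingress.get("metadata", {}).get("annotations") or {}
    for key in annotations:
        if key.startswith(NGINX_PREFIX):
            labels.add("annotation:" + key[len(NGINX_PREFIX):])

    spec = ingress.get("spec", {})
    if spec.get("tls"):
        labels.add("tls")

    # A rule with more than one path implies path-based routing.
    for rule in spec.get("rules") or []:
        paths = (rule.get("http") or {}).get("paths") or []
        if len(paths) > 1:
            labels.add("multi-path")

    # Hypothetical catch-all label for objects with nothing special about them.
    return labels or {"plain-host-routing"}


def summarize(ingresses):
    """Count how many Ingress objects fall under each use-case label."""
    counts = Counter()
    for ingress in ingresses:
        counts.update(use_cases(ingress))
    return counts
```

Calling `summarize()` on the full list of exported objects and reading `most_common()` on the result gives a quick view of which behaviors deserve a dedicated test.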

Our Testing Setup

For our testing environment, we deployed **HTTPBin** as the primary backend. It is useful for inspecting HTTP requests and responses, particularly in scenarios where we need to alter headers dynamically. Its `/headers` endpoint returns the request headers it received, which let us check that each test produced the expected output for a given input. We also set up a Go web server, stationed at `perf.`, designed to handle a high volume of requests rapidly so that we could stress-test the gateways thoroughly. To simulate varied response times, we adjusted its latency parameters so that our tests reflected high-stress conditions.
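
A header check against the `/headers` endpoint might look like the following. Treat it as a minimal sketch rather than our actual test harness: the function names and the example header are hypothetical, and it assumes the backend returns the usual httpbin-style JSON body of the form `{"headers": {...}}`.

```python
import json
import urllib.request


def fetch_headers_body(url, extra_headers=None):
    """GET an httpbin-style /headers URL through the gateway and return the raw body."""
    request = urllib.request.Request(url, headers=extra_headers or {})
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")


def header_mismatches(body, expected):
    """Compare the headers the backend saw with the headers we expected.

    Returns {header: (expected, actual)} for every mismatch; an empty
    dict means the gateway delivered exactly what the test expected.
    """
    # Header names are case-insensitive, so normalize before comparing.
    seen = {name.lower(): value for name, value in json.loads(body)["headers"].items()}
    problems = {}
    for name, want in expected.items():
        got = seen.get(name.lower())
        if got != want:
            problems[name] = (want, got)
    return problems
```

Keeping the comparison separate from the network call means the same check can run against a saved response body, which is handy when comparing the output of several gateways side by side.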

Performance Comparisons

While the initial setup for the three implementations was relatively straightforward, we did hit some snags. For instance, Traefik required us to configure an “entrypoint,” a Traefik-specific element that establishes TCP listeners. This unexpected extra step slightly disrupted our gateway setup. Despite these issues, all three candidates handled our use cases. On Gateway API feature support, Istio performed best, while Traefik fell short.

However, it became apparent that some of the touted advantages of the Gateway API lacked the depth our more complex scenarios required. Some features, such as header modification in the **HTTPRoute**, support only static values. This limitation forced us to fall back on each implementation’s own extension points, which, although flexible, complicated the evaluation. We often had to tweak our applications to adapt to implementation-specific behavior. For example, moving away from **ngx_http_auth_request_module** for authentication was challenging because Istio behaves very differently.

Ultimately, we settled on Istio. Its performance and stability throughout our testing were the deciding factors, and it is reassuring that its advanced features may prove useful down the line. In the coming weeks, we’ll begin the migration, and if we run into any noteworthy obstacles, I’ll share an update on our findings.
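
To make the static-value limitation of **HTTPRoute** header modification concrete, here is a small sketch that builds such a route as a Python dictionary and prints it as JSON. The route, gateway, backend, and header names are all hypothetical; the field layout follows the Gateway API `RequestHeaderModifier` filter, whose `value` is a literal string.

```python
import json


def http_route_with_static_header(name, gateway, backend, port, header, value):
    """Build an HTTPRoute manifest that sets one request header to a fixed value."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name},
        "spec": {
            # Attach the route to an existing Gateway (name is hypothetical).
            "parentRefs": [{"name": gateway}],
            "rules": [
                {
                    "filters": [
                        {
                            "type": "RequestHeaderModifier",
                            "requestHeaderModifier": {
                                # The value is a plain literal; the filter has no
                                # way to compute it from the incoming request.
                                "set": [{"name": header, "value": value}],
                            },
                        }
                    ],
                    "backendRefs": [{"name": backend, "port": port}],
                }
            ],
        },
    }


manifest = http_route_with_static_header(
    "httpbin-route", "test-gateway", "httpbin", 80, "X-Request-Source", "gateway"
)
print(json.dumps(manifest, indent=2))
```

Anything that needs a header derived from the request itself falls outside this filter, which is where the implementation-specific extension points come in.
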
Source: Michael Frank · stackoverflow.blog