Web Page Broken on GKE only?


Everybody was confused when we found that a web page served by Tomcat in a Docker container was broken only on GKE (Google Kubernetes Engine). It worked in every other environment: Kubernetes built on GCE (Google Compute Engine) VMs (OS: CentOS 7) or on local Oracle VirtualBox VMs, Docker Engine on macOS, and standalone Tomcat outside of a Docker container. Shouldn’t a container be OS-agnostic? And how could the host affect a web page, which seems the least likely thing to be affected?

There are a few things going on:

  1. The web page is rendered by MyFaces. Several elements on the JSF page carry a ‘rendered’ attribute, and their conditions can’t be true at the same time. When the page works correctly, only one of those elements renders;
  2. We had a myfaces-shaded JAR on our classpath, which duplicates the ‘real’ MyFaces implementation JARs. If myfaces-shaded is loaded, it throws some errors and enters an incorrect state. In this state, the ‘rendered’ attribute on the page from #1 is ignored, so all of the elements render. This is the direct cause of the broken web page;
  3. In our ‘working’ environments other than GKE, myfaces-shaded is normally not picked up by the classloader, which is why we had not tracked this problem down before. Developers did run into it from time to time, but it disappeared after restarting Tomcat, so nobody followed it through;
  4. On GKE, myfaces-shaded gets picked up by the classloader for some reason. Once we removed myfaces-shaded from the classpath, the page started to work.
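
One way to confirm which copy of a class the classloader actually picked up is to ask the class where it came from. The following is a minimal sketch, not the code we used at the time: the class name `WhereLoadedFrom` and the helper `locationOf` are made up for illustration, and it relies only on the standard `ProtectionDomain` and `CodeSource` APIs.

```java
import java.net.URL;
import java.security.CodeSource;

public class WhereLoadedFrom {

    /** Returns the JAR or directory URL a class was loaded from, or a marker if none is recorded. */
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null) {
            // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no code source.
            return "(bootstrap or unknown)";
        }
        URL location = source.getLocation();
        return location == null ? "(unknown)" : location.toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Hypothetical usage: pass a fully qualified class name as the first argument.
        String className = args.length > 0 ? args[0] : WhereLoadedFrom.class.getName();
        System.out.println(className + " loaded from " + locationOf(Class.forName(className)));
    }
}
```

To be meaningful for a web application, the same `locationOf` call would have to run inside the application itself (for example in a temporary log statement), because the web application's classloader is the one that chooses between the duplicate JARs.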


I suspect that GKE’s host OS, Google’s Container-Optimized OS, optimizes file system behaviour in a way that changes the order in which the JVM classloader loads a list of JAR files. Imagine the JVM asking for a list of inodes for a list of file paths, and the OS deciding to optimize the response time by minimizing mechanical movement. Ideally, all the data could be retrieved in a single trip of the disk’s magnetic head. But the files could be widely scattered across the disk. If the head is not to go back and forth, the order in which it picks up the files can’t be the same as the order of the parameters in the API call. I think this might not be an issue if the API clearly states that the order is not guaranteed. And if there is an asynchronous mechanism, the JVM process might already be receiving callbacks and loading JAR files as the head moves across the disc radius.
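
Whatever the underlying cause, the order of a directory listing is not something Java promises: the Javadoc for `java.io.File.list()` states that there is no guarantee the names appear in any specific order. The sketch below only illustrates that point and how sorting makes the order deterministic. The class `JarListingOrder` and its methods are hypothetical, and whether a given Tomcat version lists JARs this way is an assumption, not something verified here.

```java
import java.io.File;
import java.util.Arrays;

public class JarListingOrder {

    /** JAR names exactly as the file system reports them; this order is unspecified. */
    static String[] rawOrder(File dir) {
        String[] names = dir.list((d, name) -> name.endsWith(".jar"));
        // File.list returns null if dir is not a directory or an I/O error occurs.
        return names == null ? new String[0] : names;
    }

    /** The same JAR names, sorted so that the order is identical on every host. */
    static String[] sortedOrder(File dir) {
        String[] names = rawOrder(dir);
        Arrays.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical usage: pass the directory to inspect, e.g. a WEB-INF/lib folder.
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println("as listed: " + Arrays.toString(rawOrder(dir)));
        System.out.println("sorted:    " + Arrays.toString(sortedOrder(dir)));
    }
}
```

Running this against the same directory on two different hosts is a cheap way to see whether the "as listed" order differs between them.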

Nevertheless, an application should never have two versions of the same class on its classpath. The loading order, which shouldn’t be relied on in the first place, hid this bug by accident in the other environments.
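
A check for this kind of duplication can be automated. The following is a rough sketch, assuming all the JARs sit in one directory such as `WEB-INF/lib`. The class `DuplicateClassFinder` and the default path are illustrative only, and it compares class entry names, not class contents or versions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class DuplicateClassFinder {

    /** Maps each .class entry name to the JARs containing it, keeping only names found in 2+ JARs. */
    static Map<String, List<String>> findDuplicates(Path libDir) throws IOException {
        Map<String, List<String>> owners = new TreeMap<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    Enumeration<JarEntry> entries = jarFile.entries();
                    while (entries.hasMoreElements()) {
                        String name = entries.nextElement().getName();
                        // module-info.class legitimately appears in many JARs, so skip it.
                        if (name.endsWith(".class") && !name.endsWith("module-info.class")) {
                            owners.computeIfAbsent(name, k -> new ArrayList<>())
                                  .add(jar.getFileName().toString());
                        }
                    }
                }
            }
        }
        // Drop every class that lives in exactly one JAR.
        owners.values().removeIf(jarsForClass -> jarsForClass.size() < 2);
        return owners;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; pass the real library directory as the first argument.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "WEB-INF/lib");
        if (!Files.isDirectory(libDir)) {
            System.err.println("Not a directory: " + libDir);
            return;
        }
        findDuplicates(libDir).forEach((cls, jars) -> System.out.println(cls + " -> " + jars));
    }
}
```

Any non-empty output would be worth investigating before the loading order gets a chance to decide which copy wins.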

Categories: Tech
