Recently, one of our customers reported that they couldn’t access any of their UAT apps from their Melbourne office, although access worked fine from their other offices. When they tried to browse the UAT app domains, they got the error below: “The request service is temporarily unavailable. It is either overloaded or under maintenance. Please try later.”
WAF error
Because of the UAT environment’s IP restrictions on the WAF, it is normal for me to get this error message: our Kloud office’s public IPs are not in the WAFs’ whitelist. The error proved that the web traffic did hit the WAFs. When I pinged the URL hostname, it returned the correct IP, so there was no DNS problem. This matters because the customer has a couple of other WAF farms in other countries, and it meant the web traffic was going to the correct WAF farm. So we could focus the troubleshooting on the AU WAFs.
I pulled all the WAF access logs and planned to go through them to verify whether the web traffic was hitting the AU WAFs or going somewhere else. I searched the logs for the public IPs provided by the customer, and no results were returned for the last 7 days.
Search Result 1
Interesting. Did it mean no traffic from the Melbourne office came in? I did another search based on my own public IPs, and it clearly returned a couple of access log entries from my testing: correct time, correct client IP, correct WAF domain hostname, method GET, and status 503, which is expected because my office IP is restricted.
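The two searches above can be sketched roughly as follows. This is only an illustration: the log entries, field names, hostname and IPs (taken from the documentation address ranges) are all made up, and a real WAF has its own log format and search tooling.

```python
from ipaddress import ip_address

# Hypothetical, simplified access log entries (not the customer's real data).
SAMPLE_LOGS = [
    {"time": "2024-01-01T10:15:02", "client_ip": "198.51.100.7",
     "host": "uat.example.com", "method": "GET", "status": 503},
    {"time": "2024-01-01T10:16:40", "client_ip": "192.0.2.44",
     "host": "uat.example.com", "method": "GET", "status": 200},
]

def search_by_client_ip(entries, wanted_ips):
    """Return the log entries whose client IP is one of wanted_ips."""
    wanted = {ip_address(ip) for ip in wanted_ips}
    return [e for e in entries if ip_address(e["client_ip"]) in wanted]

# Assumed stand-ins for the Melbourne office IPs and my own office IP.
melbourne_hits = search_by_client_ip(SAMPLE_LOGS, ["203.0.113.10", "203.0.113.11"])
my_hits = search_by_client_ip(SAMPLE_LOGS, ["198.51.100.7"])
```

An empty result for the expected source IPs, next to a non-empty result for a known test IP, suggests the log search itself works and the traffic is arriving from some other address.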
Search Result 2
Since the customer mentioned that all other offices had no problem accessing the UAT app environment, I asked them for one public IP from another office. We tested again and verified that people in the India office could open the web app successfully, and I could see their web traffic in the WAF logs as well. I believed that when Melbourne staff browsed the web app, the traffic should go to the same WAF farm, because the DNS hostname resolved to the same IP whether in Melbourne or in India.
So the question was: what exactly happened, and what was the root cause? :/
To capture another good sample, I noted the time and asked the customer to browse the website again. This time I searched the access logs by time instead of by the Melbourne public IPs. I got a couple of results with some unknown IPs.
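Searching by time rather than by source IP might look like the sketch below. The entries, field names, timestamps and the two-minute window are assumptions for illustration only, not the real WAF log schema.

```python
from datetime import datetime, timedelta

# Hypothetical log entries; IPs come from the documentation address ranges.
SAMPLE_LOGS = [
    {"time": "2024-01-01T09:00:00", "client_ip": "192.0.2.44", "status": 200},
    {"time": "2024-01-01T14:30:05", "client_ip": "198.51.100.20", "status": 503},
    {"time": "2024-01-01T14:30:40", "client_ip": "198.51.100.21", "status": 503},
]

def search_by_time(entries, around, window_minutes=2):
    """Return entries logged within window_minutes either side of 'around'."""
    start = around - timedelta(minutes=window_minutes)
    end = around + timedelta(minutes=window_minutes)
    return [e for e in entries
            if start <= datetime.fromisoformat(e["time"]) <= end]

def unknown_sources(entries, known_ips):
    """Return the client IPs in entries that are not in known_ips."""
    return sorted({e["client_ip"] for e in entries} - set(known_ips))

# The time noted when the customer retried, and the IPs they said they use.
hits = search_by_time(SAMPLE_LOGS, datetime(2024, 1, 1, 14, 30, 0))
suspects = unknown_sources(hits, known_ips=["203.0.113.10"])
```

Filtering on the time window first avoids assuming which source IP the traffic arrives from, which was the assumption that turned out to be wrong here.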
Search result 3
I googled the unknown IPs, and it turned out they were Microsoft Australian data centre IPs. Now I suspected a routing or NAT issue in the customer network. I contacted the customer and gave them the unknown IPs. They investigated and advised that those were the public IPs of their Azure ExpressRoute interfaces. It made sense now. The customer hadn’t whitelisted their new Azure public IPs, so when web traffic came from those source IPs, the WAF didn’t recognise them and blocked it all, just as it blocked my traffic. Once I added the new Azure IPs to the app’s IP whitelist, all the access issues were resolved.
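The whitelist behaviour behind this can be illustrated with a small check. This is a sketch of the general idea only, not how the WAF implements it. The ranges below are documentation addresses standing in for the office and Azure ExpressRoute IPs.

```python
from ipaddress import ip_address, ip_network

def is_whitelisted(source_ip, whitelist):
    """Return True if source_ip falls inside any whitelisted IP or CIDR range."""
    addr = ip_address(source_ip)
    return any(addr in ip_network(entry, strict=False) for entry in whitelist)

# Assumed whitelist holding only the original office range.
whitelist = ["203.0.113.0/24"]

# Traffic NATed behind a new, unlisted public IP is rejected.
blocked_before = is_whitelisted("198.51.100.20", whitelist)

# After adding the new range (hypothetical), the same source is allowed.
whitelist.append("198.51.100.16/28")
allowed_after = is_whitelisted("198.51.100.20", whitelist)
```

The client machines and the DNS answer are unchanged in both cases. Only the source IP the WAF sees differs, which is why the original log search by office IP returned nothing.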