I was recently speaking with a colleague after they returned from visiting their college freshman for parents weekend. The student was frustrated that their smartphone wasn’t working as well as they expected and was blaming the Wi-Fi. The colleague pointed out to their child that they were in an area where there wasn’t any Wi-Fi and that the cause of the poor performance was a lack of carrier service to the smartphone. This brings up a good point. Some users aren’t even aware of how to identify whether their device is using carrier service or Wi-Fi. We’ve really got our work cut out for us, don’t we?
Many of us who support Wi-Fi are familiar with how often users blame the Wi-Fi when the Wi-Fi isn’t actually the problem. Having a deeper understanding of the systems and protocols that support the Wi-Fi is important when you’re administering Wi-Fi networks. I intend to highlight some of the more common problems that end up being blamed on “bad Wi-Fi.” Depending on the organization you support, these services may be your responsibility or they may fall under the responsibility of another individual or group. In the end, we all strive for a good user experience, so knowing how to quickly identify and resolve these problems or escalate them to the responsible individuals is important.
The goal of this entry is not to attempt to cover every problem that can occur, but to give some guidance on where to begin. While the Wi-Fi most often takes the blame, the true culprit is sometimes a client issue or an upper layer protocol.
As with all good network-related troubleshooting, we should start at Layer 1 and work our way up the stack of the OSI model. This is sometimes challenging for folks new to Wi-Fi as there isn’t a Physical Layer connection that can be seen like with Ethernet. It’s there, you just can’t see it. You can use the Wi-Fi status indicator on the device you are using, which usually consists of a series of bars indicating the signal strength of the connection. The location and function of these indicators vary across operating systems but are similar enough in nature. If you’re looking for additional information, I recommend becoming familiar with tools like ‘lswifi’ and ‘netsh.’ Both are available on Windows and provide a deeper level of connection information. I’ve included a link at the bottom to a previous entry I did about ‘lswifi’. I’ve also included a link to GitHub where ‘lswifi’ can be downloaded. Using these tools will allow you to determine the current state of the user’s device.
When troubleshooting issues of this nature, I find it helpful to always compare results to other devices in the area. Depending on the issue, comparing to like devices is even better.
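As a rough illustration of pulling connection state out of ‘netsh’, here is a minimal Python sketch that runs ‘netsh wlan show interfaces’ and turns its “key : value” lines into a dictionary. The function names are my own, and the field names it looks for (such as “State” and “Signal”) assume an English-language Windows install; treat it as a starting point rather than a finished tool.

```python
import subprocess


def parse_netsh_interfaces(text):
    """Turn the 'key : value' lines of netsh output into a dict.

    If several wireless interfaces are listed, later values overwrite
    earlier ones; this sketch assumes a single Wi-Fi adapter.
    """
    info = {}
    for line in text.splitlines():
        # Split on the first ' : ' so values containing colons
        # (for example a BSSID) stay intact.
        key, sep, value = line.partition(" : ")
        if sep:
            info[key.strip()] = value.strip()
    return info


def signal_percent(info):
    """Return the reported signal quality as an int, or None if absent."""
    raw = info.get("Signal")
    if raw is None:
        return None
    return int(raw.rstrip("%"))


def get_wifi_status():
    """Run netsh (Windows only) and return the parsed interface details."""
    result = subprocess.run(
        ["netsh", "wlan", "show", "interfaces"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_netsh_interfaces(result.stdout)
```

On a Windows client, calling get_wifi_status() and looking at the “State”, “SSID”, and “Signal” entries gives a quick read on whether the device is associated at all before moving up the stack.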
Client/Profile Issues
If the device is not connected to Wi-Fi, you’ll want to check if there are any available wireless networks in the area. If there are and the device won’t connect, you can begin to investigate why. It could be as simple as adding a passphrase or utilizing a captive portal to get connected. Other times it can be helpful to completely remove the profile, or “forget” the wireless network and then attempt to re-join/re-add it. Wireless profiles do get corrupted and you’d be amazed how many times removing and re-adding is the solution.
Another possibility is that the client device is experiencing a driver issue. What driver version is the device using? Does the vendor recommend a particular driver version? Are similar devices experiencing the same issue? The answers can indicate that a driver needs updating or reinstallation. This may or may not fall under your jurisdiction. When necessary, escalate to the appropriate individual or group.
DHCP Issues
If the device shows as connected, check to see if it has an IP address. Without an IP address, the device won’t be able to do anything. In high density areas, DHCP address pool exhaustion can be a frequent occurrence. If the device cannot get an IP address, you’ll need some ideas as to why. Can other devices get IP addresses? If DHCP is not functioning, you’ll want to check the DHCP server scope and ensure it is functioning properly and has available addresses to hand out. Again, this task may not fall under your responsibilities and may require escalating.
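One quick check worth automating is whether the address the client reports is a real lease. When DHCP fails, many clients fall back to a self-assigned IPv4 address in 169.254.0.0/16, which can look like “has an IP” at a glance. The sketch below uses Python’s standard ‘ipaddress’ module to sort an address into three buckets; the function name and the labels it returns are my own.

```python
import ipaddress


def classify_client_address(addr):
    """Classify a client's address string for DHCP triage.

    Returns 'no address', 'self-assigned', or 'assigned'.
    """
    if not addr:
        return "no address"
    ip = ipaddress.ip_address(addr)
    if ip.is_unspecified:
        # 0.0.0.0 (or ::) means the interface has nothing configured.
        return "no address"
    if ip.version == 4 and ip.is_link_local:
        # 169.254.0.0/16: the client gave itself an address, which
        # usually means no DHCP server answered.
        return "self-assigned"
    return "assigned"
```

If several devices in the same area come back as “self-assigned”, that points toward the DHCP scope or server rather than any one client.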
DNS Issues
If the device has an IP address but is still not able to access any resources, check whether DNS is working properly. Check to see if the device acquired a DNS server address via DHCP. If it did, are you able to perform name lookups on that server? I encourage you to spend some time getting familiar with using ‘nslookup’ to test DNS servers if you aren’t already. If the server you’ve acquired via DHCP will not perform name resolution, escalating to the individual/group managing DNS may be necessary.
If any resources are available to test via IP, try accessing them. If they are accessible but name-based resources are not, this can be another indication of a malfunctioning DNS server.
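The IP-versus-name comparison can be sketched in a few lines of Python using only the standard ‘socket’ module. The helper names, the default port of 443, and the wording of the results are assumptions for illustration; substitute an IP address and hostname that make sense in your environment.

```python
import socket


def can_connect(ip, port=443, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(name):
    """Return True if the system resolver returns an answer for name."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def diagnose(ip_reachable, name_resolves):
    """Map the two test results to a likely next step."""
    if ip_reachable and not name_resolves:
        return "DNS suspected"
    if not ip_reachable and not name_resolves:
        return "check connectivity below DNS first"
    if not ip_reachable and name_resolves:
        return "DNS answers; check routing or access control"
    return "IP and name resolution both working"
```

For example, diagnose(can_connect(some_ip), can_resolve(some_name)) returns “DNS suspected” when the IP test passes and the name lookup fails, which matches the reasoning above.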
DNS may be working but experiencing delays in response. I’ve included a link at the bottom to DNS Benchmark. This tool is helpful in testing DNS server response time. The tool allows adding and removing servers to give you a bit of a customized view. I’ve worked in environments where this tool has helped identify an overloaded DNS server that was the root cause of user complaints of “bad Wi-Fi.” If you’ve moved on to packet/frame captures, you can analyze DNS response times in Wireshark, but I think those details may be best left to a dedicated blog entry.
I wrote a very rudimentary Python script with a list of the organization’s DNS servers and a list of URLs. The script iterates over each DNS server IP and performs an ‘nslookup’ for each URL in the list. It’s a quick way to check a group of DNS servers for timeouts when DNS servers are overloaded or not responding. Using the above-mentioned tools or developing your own for your environment is a great way to get better at supporting your networks while limiting the cycles spent on testing.
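For anyone who wants to build something similar, here is a minimal sketch of that idea. It is not my original script; the function names, the five-second timeout, and the “ok”/“timeout”/“error” labels are choices made for this example. Note that ‘nslookup’ takes hostnames rather than full URLs, and that its exit codes and output wording differ between platforms, so the timeout detection is a heuristic.

```python
import subprocess
import time


def nslookup_once(name, server, timeout=5.0):
    """Look up `name` against `server`; return 'ok', 'timeout', or 'error'."""
    try:
        proc = subprocess.run(
            ["nslookup", name, server],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    output = (proc.stdout + proc.stderr).lower()
    # Heuristic: the exact message varies by platform.
    if "timed out" in output:
        return "timeout"
    # Exit codes are not consistent across platforms; a zero exit
    # does not guarantee that the name actually resolved.
    if proc.returncode != 0:
        return "error"
    return "ok"


def sweep(servers, names, lookup=nslookup_once):
    """Query every name against every server, timing each lookup.

    Returns a list of (server, name, status, seconds) tuples.
    """
    results = []
    for server in servers:
        for name in names:
            start = time.perf_counter()
            status = lookup(name, server)
            elapsed = time.perf_counter() - start
            results.append((server, name, status, elapsed))
    return results


def failures(results):
    """Keep only the lookups that did not come back 'ok'."""
    return [row for row in results if row[2] != "ok"]
```

Calling sweep() with your own server IPs and a handful of hostnames, then printing failures(), gives a quick view of which servers are timing out. The elapsed time in each row also helps spot a server that answers but answers slowly.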
Access Control Issues
Once you’ve ruled out all of the above, the issue could be related to access control. This is a complicated area that can involve many things, and it is often difficult to debug if you aren’t familiar with the architecture of the organization. Access control can be user-based, network-based, resource-based, time-based, or a combination of any of these as well as others. An example that comes to mind is an issue I worked on where access to streaming services was limited because a large amount of bandwidth and Wi-Fi capacity was being consumed by users streaming videos and downloading videos to watch later. The organization I supported implemented rate limiting to effectively ‘break’ streaming services on a specific SSID. Users complained about the Wi-Fi not working, but it was by design and only for certain resources. While the impact is certainly seen as a Wi-Fi problem by the user, as support personnel we need to hold ourselves to a higher standard so we can get to the root cause of the complaint.
In summary, when you are debugging user-sourced complaints of ‘bad Wi-Fi’, be sure to use a systematic approach to troubleshooting and work your way up the stack.
Links
Lswifi blog entry
Lswifi download
https://github.com/joshschmelzle/lswifi
DNS Benchmark