Managing large networks comes with a considerable number of complexities. Most of these complexities can be managed, but considerable care must be taken because seemingly minor changes can have a major impact. The network I manage currently has more than 14,000 wireless access points deployed across more than 200 sites, supporting more than 150,000 users. Below I outline several aspects of managing a large network.
Firmware Updates
Something as simple as a device firmware upgrade must be managed carefully to avoid introducing a bug that could have a dramatic impact on user connectivity and device performance. At one point or another we all have experienced, or will experience, an issue with a faulty firmware or driver version. Let’s face it, bugs happen. To avoid being badly hurt by a firmware bug, proceed carefully when updating your wireless access points (or any other device, for that matter). Pushing a firmware update to over 14,000 access points at one time could have severe consequences if a bug is discovered afterwards.
I like to use worst-case scenario planning when deploying firmware updates. What’s the worst that could happen? The firmware update fails and every device has to be touched to perform a factory reset. With over 14,000 devices deployed across almost 200 buildings, and a limited support staff, this could take weeks if not months to resolve.
Currently the smallest scope we can update is one network/site, given the constraints imposed by the cloud-managed access points we have deployed. When a stable firmware version becomes available, a smaller, lower-impact site is chosen as the first site for testing. Once the firmware version has been installed, a burn-in period allows any negative impact to be noted. This burn-in period is usually weeks long. Once we are comfortable with the results, the firmware is rolled out to a larger site and another burn-in period begins. If no issues are noted after the second burn-in period, the deployment schedule is increased to multiple sites per week, with special attention still given to observing any negative impact. If there are still no issues, the schedule is increased again to multiple sites per night/maintenance window until the remaining sites have been upgraded. Managing upgrades in this fashion minimizes the risk of a major issue and means that, if something does go wrong, only a small number of sites need to be rolled back.
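The staged schedule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling we actually use: the function names, the site names, and the wave sizes (one pilot site, one larger site, two small weekly batches, then larger nightly batches) are all assumptions chosen for the example.

```python
def plan_waves(sites, wave_sizes=(1, 1, 3, 3, 10)):
    """Split an ordered list of sites into rollout waves.

    `sites` is ordered by the operator: low-impact pilot site first,
    a larger site second, then everything else.
    `wave_sizes` is a hypothetical ramp; the final size repeats until
    every site has been scheduled.
    """
    waves = []
    start = 0
    step = 0
    while start < len(sites):
        size = wave_sizes[min(step, len(wave_sizes) - 1)]
        waves.append(sites[start:start + size])
        start += size
        step += 1
    return waves


def next_wave(waves, completed, issues_noted):
    """Return the next wave to upgrade, or None to pause or finish.

    `completed` is the number of waves already upgraded and burned in.
    Any issue noted during burn-in pauses the rollout.
    """
    if issues_noted or completed >= len(waves):
        return None
    return waves[completed]


# Hypothetical site names, purely for illustration.
sites = [f"site-{n:03d}" for n in range(1, 21)]
waves = plan_waves(sites)
print([len(w) for w in waves])  # [1, 1, 3, 3, 10, 2]
```

The point of keeping the "go/no-go" check separate from the schedule is that a single issue during any burn-in period stops the rollout while the number of affected sites is still small.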
What types of firmware issues can occur? The worst case I’ve experienced to date was rolling out beta firmware that was intended to resolve a known MU-MIMO bug that significantly degraded performance for users (these details are better suited for another blog entry). The new firmware did in fact resolve the MU-MIMO bug, but it introduced a new bug that caused the access points to reboot multiple times per day. The new bug only impacted high-density sites where client load-balancing was enabled; alongside the MU-MIMO fix, the new firmware version had introduced 802.11v as part of client load-balancing management. The two workarounds were to downgrade the firmware version or to disable client load-balancing. Both workarounds provided stability, but downgrading reintroduced the previous MU-MIMO bug.
When selecting new firmware, I recommend staying with proven versions. Most vendors mark their proven versions as ‘stable’, use a gold star, or have some other rating system. I recommend understanding the system used by your vendor or vendors so you can select properly vetted firmware. To expand on this idea a little further: only use beta firmware versions if you need them to resolve an already discovered bug or if there is a new feature you absolutely cannot live without. As the example above shows, beta versions almost always have bugs, or they wouldn’t be beta. It’s just a matter of whether the bugs affect your environment or not.
Rogue AP Detection
Many wireless access points have built-in WIDS/WIPS capabilities. Rogue AP detection is an important feature that can help detect rogues that have been introduced into the environment. Detection is the important first step toward removal or mitigation. Many management systems also support email alerts to notify admins when rogues are detected. In a network with a large number of access points and many surrounding networks, a great deal of care must be taken when enabling alerts of this nature. No one on an email distribution list wants hundreds or thousands of emails for false positives. Unfortunately, in my environment, given the number of access points, the proximity to residential areas, and the configuration limitations imposed by the cloud management system, I was unable to classify detections thoroughly enough to justify enabling email alerts for rogue AP detection. Ultimately, a better solution was to develop a script that uses the vendor API to provide first-level detection of rogue APs, which can then be further scrutinized for classification, mitigation, and/or removal. I hope to do another entry on the API script in the future.
A trial was run using around 10% of my sites. Over a single holiday weekend, 25 email alerts were generated. If that rate held across all sites, it would mean roughly 250 alerts for one quiet weekend. Obviously, this does not scale well, and the ultimate decision was made to use the API script rather than email alerts.
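To give a feel for what first-level detection in such a script might look like, here is a minimal sketch. It is not the actual script and does not call any real vendor API: the record fields (`ssid`, `rssi_dbm`, `seen_on_wire`), the SSID names, and the signal threshold are all hypothetical, and the input is assumed to be a list of detected APs that are not part of our own managed network.

```python
# Hypothetical names of the SSIDs our own network broadcasts.
OUR_SSIDS = {"CorpWiFi", "CorpGuest"}


def triage(obs, our_ssids=OUR_SSIDS, rssi_floor_dbm=-75):
    """Give one unmanaged-AP observation a first-level label.

    `obs` is a dict with hypothetical keys:
      ssid         - SSID broadcast by the detected AP
      rssi_dbm     - signal strength as heard by our AP
      seen_on_wire - True if the device was also seen on our wired LAN
    """
    if obs.get("seen_on_wire"):
        return "wired"      # plugged into our network: highest priority
    if obs.get("ssid") in our_ssids:
        return "spoof"      # unmanaged AP advertising one of our SSIDs
    if obs.get("rssi_dbm", -100) < rssi_floor_dbm:
        return "neighbor"   # weak signal, likely a nearby residence
    return "review"         # strong and unknown: needs a human look


def summarize(observations):
    """Count observations per label so one report replaces many emails."""
    counts = {}
    for obs in observations:
        label = triage(obs)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

A filter like this lets weak residential neighbors drop out of the report entirely, so that only the handful of detections worth a closer look reach a person.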
Alerts
This section builds on the email alerts mentioned in the Rogue AP Detection section. Many vendors offer a variety of alerts. One that can be extremely helpful is an alert when devices go offline. When managing thousands of devices, many of which don’t have battery backup systems, care must be taken when choosing the alert interval for down devices. One widespread power outage could result in thousands of emails being sent to admins, with critical emails getting lost in “the noise”. Our current cloud platform allows alerts to be configured based on how long a device has been down. We set switches to alert after being down for 30 minutes and access points after 60 minutes. This minimizes notifications for brief power outages but still allows emails to be sent for devices that may need attention.
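Our platform handles these thresholds natively, but the logic is simple enough to sketch for anyone who has to build it themselves. The example below uses the 30 and 60 minute values from above; the device record layout (`name`, `type`, `down_since`) and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Downtime required before an alert is sent, per device type.
ALERT_AFTER = {
    "switch": timedelta(minutes=30),
    "ap": timedelta(minutes=60),
}


def devices_to_alert(devices, now):
    """Return names of devices that have been down long enough to alert on.

    Each device is a dict with hypothetical keys:
      name       - device name
      type       - "switch" or "ap"
      down_since - datetime the device went offline, or None if it is up
    """
    due = []
    for dev in devices:
        if dev["down_since"] is None:
            continue  # device is online
        if now - dev["down_since"] >= ALERT_AFTER[dev["type"]]:
            due.append(dev["name"])
    return due
```

With these thresholds, a switch that has been down for 35 minutes triggers an alert, while an access point that has been down for 50 minutes does not yet.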
In conclusion, when you are managing large-scale networks, be sure to spend time considering the impact of any configuration change or firmware update you make. The impacts may be far-reaching. To sum things up in one sentence: always have a plan!