A tricky networking task…

By Geraint

We were recently handed a rather unusual cloud connectivity task. In this article I will discuss how we designed and delivered a working solution that allowed our client to maintain near-100% connectivity to their AWS environments from a widely spread site with volatile electrical and Internet connections.

This piece of work was an absolute pleasure to design and implement. In recent years I’ve mainly been focused on AWS and GCP architecture using code rather than hardware, so having the opportunity to brush the dust off my networking skills and put them to good use in a cloud context was great fun. I hope to do more crazy networking like this in future.

I’ll apologise in advance: this is a fairly technical post, but I have made a little effort to make it palatable for a wider audience – not that much effort, but a little all the same.

Author: Geraint Lee

Position: Director, Chief Technology Officer

Email: geraint@imagi.cloud

Phone: +44 3300 577 156

LinkedIn: linkedin.com/in/geraint-lee

Requirements

As with all projects, we started with gathering requirements from the client, and this is what we established:

  1. Connectivity to Amazon Web Services (AWS) must be maintained at all times.
  2. The application has been designed to tolerate up to 10 seconds of downtime; however, any downtime will have direct human impacts and may cause significant management challenges.
  3. Electrical supplies may fail at any time and restoration could take up to 30 minutes.
  4. At some sites, only 4G connections will be available.
  5. All network cell sites will be highly populated and network contention is highly probable.
  6. Connectivity is required for Internal (AWS) HTTP endpoints.
  7. The average request size is 1 kB and responses must be received within 100 ms. Higher response times will work but are not optimal and will incur a financial impact for the client.
  8. Some responses to HTTP requests can be cached; cache headers are available.
  9. Four independent areas will require connectivity, one of which may not have an electrical supply for up to 8 hours.
  10. Some devices will require Wifi connectivity and cannot be hardwired.
  11. Some devices will be deployed in locations separate from the main network. The redundancy requirements are lower for these devices, and any device should be able to function in this way without configuration changes.
  12. Large numbers of humans will be moving and behaving in unpredictable ways.
  13. Equipment is at risk of water damage & human interference.
  14. Vehicles may drive over cables.
  15. The distance between each deployment location will range from 50 meters to several kilometres.
  16. All areas must be able to communicate with AWS and each other.
  17. All equipment must be removable quickly, will be moved on a weekly basis and may not be handled with care.
  18. Overall equipment weight matters.
  19. The delivered solution should be supportable both on site and remotely without advanced networking knowledge.
  20. Remote VPN access will be required to all devices that form the network both in AWS and on site.

Sounds fun...

Quite the challenge. Thankfully I had already put a lot of thought into a solution such as this, having spent many hours on the train wondering why on earth, in 2020, a reliable Wifi connection on a vehicle moving along a known, fixed path is still not something that’s been properly achieved…

We spent many hours in front of a whiteboard bouncing around ideas, each idea taking us closer to a workable solution… I eventually came up with an architecture which satisfied all requirements and set to work experimenting.

Hardware

We knew the type of devices we’d be needing but didn’t have much luck finding any similar projects online so we decided the best approach would be to buy a number of different devices and see which ones worked best.

4G Routers

We picked up a few different 4G routers: a ProRoute GEM420, a ProRoute 685, a generic unbranded one, some dongles with RJ45 connectors, and the cheapest decent-looking device Amazon had to offer.

I’ve used the ProRoute equipment previously for less impressive projects and so already had a good benchmark to work from.

All devices worked well; however, the automated failover on every one of them was nothing short of horrific, taking anywhere from 30 seconds to 5 minutes to reconnect or switch SIM card. This immediately ruled out using any of the routers’ own failover features to maintain connectivity; instead we would use them as simple routers…

During testing we noticed that some mobile networks force reconnections during normal operation and so it is likely we would lose connectivity at some point.

4G Sim Cards

We purchased business data SIM cards from EE, Vodafone, O2 and Three. EE were by far the most helpful and called me unprompted – we attempted to contact the other providers but none would put us through to someone who could talk to us at a technical level.

The EE sales representative was surprisingly clued up on the technical aspects of their network. He informed us that business users already have higher network priority, and that with a high-gain antenna we would probably find we wouldn’t lose connectivity, though they couldn’t promise anything. He also told us that if we were in a 5G area and contacted him ahead of time, he could arrange for a portion of the band to be reserved for our exclusive use, ensuring low latency and high bandwidth. Good to know.

4G Antennas

We considered the use of directional antennas but decided against them as the equipment would be moving on a weekly basis and so establishing the direction every time would be too much of a management overhead.

To test, we switched our office network over to 4G for a month. From our offices we have direct line of sight to all of the network cell towers, each a similar distance away. The best performance was realised with Vodafone; it was in fact faster than our BT “Fibre” connection. However, the network disconnected once every 24 hours. This occurred on all devices but only on certain networks: O2 and Vodafone both regularly disconnected, EE and Three did not.

For the antennas, we settled on some dome-shaped 12dBi gain omnidirectional antennas designed for vehicles, which can be pole mounted. The decision was primarily based on the fact that the others we tested were 4dBi and 8dBi gain, and this was 12dBi… There was no noticeable difference between them, but that’s probably because we had direct line of sight to the cell sites; we will head somewhere remote and re-test at some point.

Long Range Wifi

We purchased some Ubiquiti 15km-range Wifi devices operating in the 5GHz band and tested a connection across Swansea Bay from our office to Mumbles (roughly 5km), which went well: we were able to connect to our office and browse the Internet. We haven’t tried them in poor weather conditions yet, so they may not work longer term, but we figured for a 500m to 1km distance we’ll probably be fine… unless there’s an electrical storm… or extremely dense fog… or heavy rain… There’s a lot that could affect performance, but nothing has so far.

For longer distances, devices that operate in the 3GHz backhaul band are available. We have not tested these yet, but we have been offered a pair to play with which are supposedly capable of 200km point-to-point links.

Switches

We decided that each area would have a Layer 3 switch with LACP capabilities. We chose Netgear switches for testing: given the requirement for easy management, we felt a web interface would be more palatable than a command line interface for both our junior staff and our client.

Servers

We purchased some low-power 1U network appliance servers, each with four gigabit Ethernet ports. One of these appliances would be installed at each deployment location along with a switch.

The low-power element was important because we would be relying on UPS power during outages; consuming as little as possible would allow us to use lower-capacity UPS devices, which are less expensive and, importantly, lighter.

SSDs were chosen, not so much for the higher throughput and lower latency, but because moving magnetic drives around on a weekly basis would probably not end well.

Electricity

Knowing we were probably going to lose electricity regularly, the obvious choice was a UPS sized to power the network for at least 30 minutes, with recharge time considered. We purchased a small 1U UPS to get us started and see how it performed. Well, is the answer: it provided 50 minutes of battery backup, which would be plenty of time to get a generator moved or a cable run from another power source while the system continued to operate on battery power.

For sites where no electrical supply would be available we decided to use:

  • 1 x 12V DC 250AH AGM Lorry Battery
  • 1 x 1000W Solar Charge Controller
  • 2 x 150W Solar Panels
  • 1 x 1500W 240V AC Pure Sine Wave Inverter

Tests with all equipment running showed we would be able to run basically indefinitely, provided the sun doesn’t disappear entirely – and I’m pretty sure we’ll have bigger things to worry about if that happens.
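
For a rough sense of the numbers, here is a back-of-the-envelope check based on the parts list above. The battery and panel sizes come from the list; the load draw, usable battery fraction and effective sun hours are illustrative assumptions rather than measured figures, and charging/inverter losses are ignored.

```python
# Rough runtime estimate for the off-grid kit above. Battery and panel sizes
# are from the parts list; the load, usable fraction and sun hours are
# assumptions for illustration only (charging and inverter losses ignored).

BATTERY_WH = 12 * 250        # 12V x 250Ah AGM battery = 3000 Wh nominal
USABLE_FRACTION = 0.8        # assumed usable depth of discharge
SOLAR_W = 2 * 150            # two 150W panels
ASSUMED_LOAD_W = 40          # assumed draw of server, switch, 4G and Wifi kit
ASSUMED_SUN_HOURS = 4        # assumed effective full-sun hours per day

runtime_h = BATTERY_WH * USABLE_FRACTION / ASSUMED_LOAD_W
daily_balance_wh = SOLAR_W * ASSUMED_SUN_HOURS - ASSUMED_LOAD_W * 24

print(f"Battery alone: ~{runtime_h:.0f} hours of runtime")              # ~60 hours
print(f"Daily energy balance with solar: {daily_balance_wh:+.0f} Wh")   # +240 Wh
```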

Equipment Enclosure

Many options were considered but it was clear that a flight case with a rack built in would be the best option offering protection from mishandling and ease of movement with wheels.

We established a 10U rack would be sufficient to house all of the networking equipment including Wifi and 4G devices, spare parts, tools and cables.

We installed a 16A inlet to feed the UPS with electricity and a 16A socket to power devices outside the enclosure, such as 4G routers that need to be placed further away from crowds to establish a better connection.

Each deployment location has a full set of equipment housed in one of these boxes.

Cables

Reels were purchased for the 100m twin CAT6 cables to live on, allowing quick deployment and re-use of the cables each time the solution is moved. Where possible, deployment locations would be connected to each other using dual CAT6 links with mirrored ports in a LAG configuration to provide some level of redundancy. Longer distances would need to be covered either with switches placed every 100 meters (up to a maximum of 300 meters between locations) or with Wifi connections.

The majority of the time electrical distribution would be handled by a third party, but there would be occasions where this would be a DIY activity. A set of electrical cables and splitters was therefore produced for these occasions, and for when 4G routers need to be placed some distance away from the local server and PoE is not possible.

Software

With the hardware requirements settled, we now needed to get the software working. How would we stay connected to AWS in this challenging environment? What would we use for caching, and what could we cache? Where would DNS come from? How would devices not on the main network know that they are different and need to establish their own connection to AWS? What about monitoring?

The following applications / protocols / tools were tabled:

  • OpenVPN
  • IPSec
  • LAG / LACP
  • VLANs
  • Heartbeat
  • Squid Proxy
  • Memcached
  • Redis
  • DHCPD
  • BIND
  • NGINX
  • Prometheus
  • Grafana
  • Python
  • Terraform
  • Ansible

I suspected some custom tools would be needed, and I wasn’t wrong… Python was firmly on the table as it’s easy to understand and so would be good for the manageability aspect – and all of us at Imagicloud code in Python.

The Network

Internet

We decided to aim for at least three active Internet connections for each deployment location, with two 4G routers at each. The third (and potentially fourth, fifth, sixth, etc.) would be reached via hard-wired links or Wifi bridges to other deployment locations.

This configuration provides protection if a particular part of the site is heavily congested with 4G traffic – both local connections may drop at a single location, so two alone would not provide sufficient redundancy. Having a diverse location with potentially less congested airwaves gives a better chance of maintaining connectivity.

A Python script was produced to monitor every connection a server can reach from any deployment location, with an algorithm designed to ensure the most reliable connections are always used; this can result in a connection from another deployment location being chosen over a local one. It also means each deployment location is entirely independent of the others, so site-wide failures are much less likely. The script considers the following, and a rough sketch of the scoring approach follows the list:

  • Latency – how quickly do packets travel across the link
  • Connection reliability – how many times has it lost connection
  • Bandwidth availability – how much throughput can we achieve
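
The production algorithm is part of the trickery we keep to ourselves, but a minimal sketch of the kind of per-link scoring involved might look like the following. The interface names, probe target, weights and thresholds are illustrative assumptions, not the real values.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class LinkStats:
    name: str             # e.g. "wan-4g-1" (illustrative)
    latency_ms: float     # smoothed round-trip time of recent probes
    drops_last_hour: int  # how many times the link lost connectivity
    bandwidth_mbps: float # result of a periodic throughput test

def probe_latency(iface: str, target: str = "1.1.1.1") -> float:
    """Ping the target out of a specific interface and return the RTT in ms."""
    out = subprocess.run(
        ["ping", "-I", iface, "-c", "1", "-W", "1", target],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return float("inf")  # unreachable
    for token in out.stdout.split():       # parse "time=12.3" from ping output
        if token.startswith("time="):
            return float(token[5:])
    return float("inf")

def score(link: LinkStats) -> float:
    """Lower is better: penalise latency and flakiness, reward bandwidth."""
    return link.latency_ms + 50 * link.drops_last_hour - 0.5 * link.bandwidth_mbps

def best_link(links: list[LinkStats]) -> LinkStats:
    return min(links, key=score)
```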

The connection failover time is 4 seconds. However, a bond device over two independent OpenVPN tap links can be enabled if connections are extremely unreliable, allowing packets to be sent over two or more independent mobile networks and reducing the impact of packet loss. It is not enabled by default as it increases latency, data transfer costs and processing load at both the local and remote ends, but if needed it can be used with up to four independent Internet connections forming a single virtual bonded connection.
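
The post doesn’t detail how the bond is assembled, so the following is only a sketch of one way to do it on Linux with iproute2: enslave two OpenVPN tap interfaces into a round-robin bond. The interface names, bonding mode and addressing are assumptions for illustration, and the tap devices are assumed to have already been created by OpenVPN.

```python
import subprocess

def sh(*cmd: str) -> None:
    """Run an iproute2 command, raising if it fails."""
    subprocess.run(cmd, check=True)

def build_bond(taps=("tap0", "tap1"), bond="bond0", addr="10.8.0.2/24"):
    # Create the bond; balance-rr spreads packets across both tunnels.
    sh("ip", "link", "add", bond, "type", "bond", "mode", "balance-rr")
    for tap in taps:
        sh("ip", "link", "set", tap, "down")          # slaves must be down to enslave
        sh("ip", "link", "set", tap, "master", bond)  # attach the tap to the bond
    sh("ip", "addr", "add", addr, "dev", bond)        # illustrative addressing
    sh("ip", "link", "set", bond, "up")
    for tap in taps:
        sh("ip", "link", "set", tap, "up")

if __name__ == "__main__":
    build_bond()
```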

Cross Site Connectivity

In an environment such as this, hard-wired connections are the ideal; however, this isn’t always possible, as the distance between deployment locations could exceed 300 meters, or even 1km, so Wifi bridges would be used instead in those scenarios. Long-distance fibre connections didn’t seem feasible given the risk of damage – a fibre link is far more likely to break than a CAT6 cable, and transporting protective measures for fibre would be costly in both transportation weight and deployment time.

Where cables are in place, the Wifi bridges are deployed as a backup to cater for the event of both cables breaking. I’ve had a lot of personal experience running CAT5/CAT6 in unintended and harsh environments with great success, so I believe it to be a good choice; if anyone has any other ideas, please do let me know!

We built a custom Python script to monitor both the wired and wireless links and perform failover within 2 seconds. This may be changed in future to use AS-path-based routing to further reduce our failover time, but for now this is acceptable.
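
The script itself isn’t published, but the general idea of sub-2-second failover between the cable and Wifi-bridge paths can be sketched as below: probe each next hop and rewrite the route to the remote location when the preferred path dies. The subnet, gateway addresses and probe interval are illustrative assumptions.

```python
import subprocess
import time

# Illustrative values - the real peers and addressing are site-specific.
PEER_SUBNET = "10.20.2.0/24"                   # the other deployment location
WIRED_GW, WIFI_GW = "10.20.0.1", "10.20.0.5"   # next hops via cable and Wifi bridge

def alive(gateway: str) -> bool:
    """One quick ping; a dead next hop means that path is unusable."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", gateway],
        stdout=subprocess.DEVNULL,
    ).returncode == 0

def set_route(gateway: str) -> None:
    """Point the route for the remote location at the chosen next hop."""
    subprocess.run(["ip", "route", "replace", PEER_SUBNET, "via", gateway], check=True)

current = None
while True:
    preferred = WIRED_GW if alive(WIRED_GW) else WIFI_GW
    if preferred != current and alive(preferred):
        set_route(preferred)
        current = preferred
    time.sleep(1)  # probe every second to stay inside the 2 second target
```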

Should both of the on-site cross-site connectivity provisions fail (Wifi bridge and physical cables), the OpenVPN connections from each location back to AWS via 4G will still enable location-to-location communication.

Those of you in the know may realise that there are some missing pieces to this puzzle… a great deal of additional networking trickery was employed, but that’s our secret 😉

Wifi Bridges

The Ubiquiti Wifi bridges are capable of carrying 802.1Q-tagged frames, so VLANs extend between deployment locations. A lot of Wifi bridges are incapable of this and would require the introduction of a routed network, which I was hoping to avoid to reduce complexity – situation successfully avoided. Where possible, all deployment locations are connected to each other using Wifi bridges as well as cables.

I’m quite impressed by the bridges; however, the latency is not insignificant: a bridge spanning 10 meters indoors incurs 7ms of network latency.

We’ll release another post on the Wifi bridges in future; we have ordered some different brands to see how they compare.

AWS Connectivity

OpenVPN and IPSec were considered, with IPSec ruled out rapidly due to the lack of static IP addressing available on 4G without paying an absolute fortune, and, well, I knew OpenVPN would do the job without any learning necessary.

We deployed an OpenVPN server on an EC2 instance in our Transit VPC, which is peered with our client’s VPCs, and OpenVPN clients on each of the servers that form the on-site network. This results in up to four operational connections to AWS available to the site network, with each connection capable of routing to the other servers. To ensure a healthy link is always available, a dynamic routing script was created so that VPN clients are always able to reach any area of the deployment remotely, and servers are always able to reach AWS and each other via any available connection.

The OpenVPN element raises our failover time to between 4 and 6 seconds, still within our 10 second target.

DHCP, DNS & Magic

Each server acts as a DHCP and DNS server and, through network monitoring, is capable of taking over from any other server that suffers a hardware failure. In the event of a failure, the lowest-numbered available server takes control of the failed server’s deployment location. This of course reduces the remaining redundancy, but a replacement server can be installed without further downtime to restore the higher level of redundancy.
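
The takeover mechanics aren’t described in detail in the post, but the election itself can be as simple as the sketch below: each server pings its peers, and only the lowest-numbered surviving server claims a failed peer’s duties. The server numbering, addresses, interface and service restart are placeholders; a real takeover also has to load the failed location’s DHCP scopes and DNS zones, which is glossed over here.

```python
import subprocess

# Illustrative inventory - server N serves deployment location N by default.
SERVERS = {1: "10.20.1.2", 2: "10.20.2.2", 3: "10.20.3.2", 4: "10.20.4.2"}
MY_ID = 1  # set per server

def alive(ip: str) -> bool:
    return subprocess.run(["ping", "-c", "2", "-W", "1", ip],
                          stdout=subprocess.DEVNULL).returncode == 0

def take_over(failed_id: int) -> None:
    """Placeholder takeover: claim the failed server's service address (the
    VLANs stretch across the site) and restart dhcpd with its scopes loaded."""
    subprocess.run(["ip", "addr", "add", SERVERS[failed_id] + "/24", "dev", "eth0"],
                   check=True)
    subprocess.run(["systemctl", "restart", "isc-dhcp-server"], check=True)

def election() -> None:
    up = {sid for sid, ip in SERVERS.items() if sid == MY_ID or alive(ip)}
    for failed in set(SERVERS) - up:
        if MY_ID == min(up):  # only the lowest-numbered survivor acts
            take_over(failed)

if __name__ == "__main__":
    election()  # run on a short interval from the monitoring loop
```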

ISC’s DHCP Daemon and BIND were chosen for use due to their long term reliability, simplicity of configuration and low maintenance.

The requirement for client devices to know whether they are on a managed network or directly on the Internet was catered for using a magic split DNS zone which exists both on the Internet and on site. Using TXT records, we configured Route 53 on the Internet to respond with “INTERNET”, while the local DNS servers respond with the name of the network the device is connected to, for example:

Internet Response

whereami.imagi.cloud. TXT “INTERNET”

Local Responses

whereami.imagi.cloud. TXT “NETWORK1”

whereami.imagi.cloud. TXT “NETWORK2”

The client was then able to update their application to perform a DNS TXT lookup and modify its behaviour if it detects it is on a managed network (for example, using a shared memcached instance running on a local server instead of one local to the device), and also to decide whether or not it needs to establish its own VPN link to AWS.
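
On the device side the check is just a TXT lookup against whichever resolver DHCP handed out. A minimal sketch using dnspython is shown below; the record name is the one described above, while the cache hostname and the behaviour switches are illustrative.

```python
import dns.exception
import dns.resolver  # pip install dnspython

def where_am_i() -> str:
    """Ask the local resolver for the whereami TXT record."""
    try:
        answer = dns.resolver.resolve("whereami.imagi.cloud", "TXT")
    except dns.exception.DNSException:
        return "INTERNET"  # no answer at all: assume we are on our own
    for rdata in answer:
        return rdata.strings[0].decode()
    return "INTERNET"

location = where_am_i()
if location == "INTERNET":
    # Off-site: behave as a standalone device and bring up our own VPN to AWS.
    memcached_host = "127.0.0.1"
    needs_own_vpn = True
else:
    # On a managed network such as "NETWORK1": use the shared on-site cache.
    memcached_host = "cache.local"  # illustrative hostname
    needs_own_vpn = False
```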

The magic element was to configure the 169.254.169.254 IP address on each server, which, as those of you in the know will know, is the address most cloud providers use to provide useful metadata about a resource. The IP is only reachable within the deployment location network, with NGINX serving a JSON file mimicking the AWS metadata endpoint, so that when boto3 attempts to get IAM credentials it is returned a valid set. This allows devices to be deployed without hard-coded AWS IAM credentials for accessing SQS/SNS, improving the client application’s security and removing the risk of a stolen device compromising elements of the wider AWS platform.
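
The post doesn’t show how that JSON is produced or exactly what it contains, so treat the following as a sketch under assumptions: a small job on each server assumes a narrowly scoped IAM role via STS and writes the result out in the instance-metadata style layout that NGINX then serves on 169.254.169.254. The role ARN, role name and web root are invented for illustration.

```python
import json
import os
import boto3  # pip install boto3

# Illustrative values - the real role and paths are specific to the deployment.
ROLE_ARN = "arn:aws:iam::123456789012:role/site-device-role"
ROLE_NAME = "site-device-role"
WEB_ROOT = "/var/www/metadata"  # nginx serves this tree on 169.254.169.254

def refresh_credentials() -> None:
    """Assume the device role and write it out in the metadata document format."""
    creds = boto3.client("sts").assume_role(
        RoleArn=ROLE_ARN, RoleSessionName="site-devices"
    )["Credentials"]

    doc = {
        "Code": "Success",
        "Type": "AWS-HMAC",
        "AccessKeyId": creds["AccessKeyId"],
        "SecretAccessKey": creds["SecretAccessKey"],
        "Token": creds["SessionToken"],
        "Expiration": creds["Expiration"].strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

    base = f"{WEB_ROOT}/latest/meta-data/iam/security-credentials"
    os.makedirs(base, exist_ok=True)
    # boto3 first asks for the list of role names, then for the named document.
    with open(f"{base}/index.html", "w") as f:
        f.write(ROLE_NAME)
    with open(f"{base}/{ROLE_NAME}", "w") as f:
        json.dump(doc, f)

if __name__ == "__main__":
    refresh_credentials()  # run periodically (e.g. from cron) before expiry
```

One caveat worth flagging: newer boto3/botocore releases try the IMDSv2 token handshake (a PUT to /latest/api/token) before falling back to plain requests, so the mock endpoint needs to tolerate that request for the fallback to kick in.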

Monitoring

Not much thought went into the choice of tooling – we’ve used Grafana with Prometheus and OpsGenie many times previously and know the combination can monitor everything we need, so it made best business sense to use this toolset.

Prometheus is deployed on each server with each server configured to monitor devices local to it, including:

  • Server CPU, Memory, Network, Temperature, logs and running processes.
  • Client device CPU, Memory, temperature, logs and running processes.
  • Network throughput, switch port health and LAG groups.
  • Internet connections – latency, bandwidth, packet loss and packet retransmission.
  • UPS Power Consumption & Battery Capacity

As each client device could be connected to any of the networks on site, we needed a way to create a dynamic inventory for Prometheus. To do this we used a combination of Python, nmap and dhcpd, with MAC address device classing in a known CIDR.
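
As a flavour of what that looks like, here is a simplified sketch that turns dhcpd’s lease database into a Prometheus file_sd target list. The lease-file path, output path, exporter port and labels are assumptions; it also treats every lease block as active, and the real inventory additionally folds in nmap scan results and the MAC-based device classes.

```python
import json
import re

# Illustrative paths and port - adjust for the real exporter and lease file.
LEASES_FILE = "/var/lib/dhcp/dhcpd.leases"
TARGETS_FILE = "/etc/prometheus/targets/devices.json"
EXPORTER_PORT = 9100

def leases(path: str = LEASES_FILE) -> dict[str, str]:
    """Map IP -> MAC for every lease block in dhcpd's lease database
    (simplified: expired leases are not filtered out)."""
    text = open(path).read()
    found = {}
    for block in re.finditer(r"lease ([\d.]+) \{(.*?)\}", text, re.S):
        ip, body = block.group(1), block.group(2)
        mac = re.search(r"hardware ethernet ([0-9a-f:]+);", body)
        if mac:
            found[ip] = mac.group(1)
    return found

def write_targets() -> None:
    """Emit the file_sd_config JSON that Prometheus watches for changes."""
    targets = [
        {"targets": [f"{ip}:{EXPORTER_PORT}"],
         "labels": {"mac": mac, "source": "dhcpd"}}
        for ip, mac in leases().items()
    ]
    with open(TARGETS_FILE, "w") as f:
        json.dump(targets, f, indent=2)

if __name__ == "__main__":
    write_targets()  # run from cron or a systemd timer
```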

An EC2 instance was deployed in AWS with Prometheus and Grafana installed. This instance monitors devices outside of the known on-site networks that establish their own VPN connections; again, we used Python to generate a dynamic inventory of the hosts connected to the VPN.

OpsGenie was chosen as the alerting mechanism so that both on-site and remote engineers are notified of anomalies and can act accordingly, ensuring the layers of redundancy are maintained at all times.

Caching

Squid proxy is installed on each server, with iptables configured to transparently ‘hijack’ all HTTP requests so that assets such as CSS, JavaScript and images can be cached locally without being delivered over 4G multiple times and without significant changes to client device configuration.

Squid also caches responses with appropriate cache headers sent by the client application endpoints thus reducing round trip times and increasing performance across each deployment location.

Each server is automatically configured with the other on-site servers as cache neighbours to further reduce 4G bandwidth usage: only one server needs to fetch an object over 4G, and the others can ask each other before using bandwidth unnecessarily. This is automatically disabled if cross-site connectivity is detected to be turbulent.
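
The enable/disable mechanism isn’t spelled out in the post; one simple way to do it, sketched below, is to rewrite an included Squid configuration fragment and ask Squid to reconfigure itself. The peer addresses, include path and loss threshold are illustrative, and the health check is a placeholder for the link monitoring described earlier.

```python
import subprocess

# Illustrative values - the real peer list and thresholds come from site config.
PEERS = ["10.20.2.2", "10.20.3.2", "10.20.4.2"]  # the other on-site servers
PEERS_CONF = "/etc/squid/conf.d/peers.conf"      # pulled into squid.conf via include
LOSS_THRESHOLD = 5.0                             # % packet loss considered 'turbulent'

def cross_site_packet_loss() -> float:
    """Placeholder: wire this up to the cross-site link monitoring."""
    return 0.0

def write_peers(enabled: bool) -> None:
    with open(PEERS_CONF, "w") as f:
        if enabled:
            for peer in PEERS:
                # sibling peers on the standard HTTP and ICP ports
                f.write(f"cache_peer {peer} sibling 3128 3130 proxy-only\n")
        # when disabled the file is left empty, so squid has no neighbours
    subprocess.run(["squid", "-k", "reconfigure"], check=True)

def reconcile() -> None:
    write_peers(enabled=cross_site_packet_loss() < LOSS_THRESHOLD)

if __name__ == "__main__":
    reconcile()
```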

Automation

Ansible was chosen to automate the deployment of the servers so that they can be easily restored in the event of a failure. Ansible configures all elements of the server including networking, application configuration and user access and allows us to easily roll out updates should the need arise.

Spare pre-installed SSDs are also kept on site so that a failed server SSD can be swapped out quickly.

We used Terraform to automate the AWS elements such as the deployment of OpenVPN, Prometheus, Cross Account VPC Peering and any necessary route tables.

More please.

Although this was the least cloudy piece of cloud work we’ve delivered, it was certainly one of the most enjoyable.

Cloud computing has taken the physical elements away from many people entering the industry these days, which is especially evident in the networking space. This was an excellent piece of work for introducing the team to the ‘old school’ way of doing things in a cloud computing context, giving them a deeper understanding of networking that transfers nicely to cloud-based environments.

As a company we’d love to undertake more complex networking tasks such as this, whether connecting more volatile sites to the cloud or installing AWS Direct Connect links to fixed locations. We (possibly primarily I…) love a bit of complex networking!