A Postmortem of Our January DNS Incident

Last Updated: Apr 22, 2022 | Author: Ryan Reiffenberger

Author Info

Ryan is our Lead Web Architect here at Falls Technology Group. He started in 1999 and has been building websites, computers, and servers for over 20 years.

First and foremost, we want to personally thank you for choosing Falls Technology Group, LLC. We are proud to host your projects and look forward to continuing to work with you. Second, we want to shed more light on an outage that occurred within FTG Hosting.

What Happened

On January 17th at 6:00 AM CT, two of our three DNS servers were shut down unexpectedly at one of our vendor sites. While they were offline, DNS resolution for our hosted domains failed in several parts of the United States and in Europe. These resolution errors took a large number of our websites, and the services associated with them, offline.
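An outage like this can be seen from the outside by asking each authoritative nameserver directly whether it still answers for a domain. The sketch below is only an illustration and not the tooling our Network Operations Center uses: it builds a minimal SOA query by hand with the Python standard library and sends it over UDP. The function names, the fixed query ID, and the two-second timeout are assumptions made for the example.

```python
import socket
import struct


def build_query(name: str, qtype: int = 6, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (RFC 1035 layout). qtype 6 is SOA."""
    # Header: ID, flags (all zero: standard query, no recursion),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length, ending with a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS (1 = IN)
    return header + qname + struct.pack("!HH", qtype, 1)


def nameserver_answers(server_ip: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if the server replies to an SOA query with RCODE 0 (no error)."""
    query_id = 0x1234  # fixed ID for the sketch; a real probe would randomize it
    query = build_query(name, query_id=query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    if len(reply) < 12:  # shorter than a DNS header, so not a valid reply
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)  # QR bit set means "response"
    rcode = flags & 0x000F              # 0 means the server reported no error
    return reply_id == query_id and is_response and rcode == 0
```

Calling `nameserver_answers` once per nameserver IP address, with a hosted domain as `name`, shows which servers are still responding. In a situation like the one described above, two of the three calls would be expected to return `False`.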

Services Affected

After running our full post-mortem audit, we have determined that the following services were affected.
Please note that we are still restoring access to some services.

  • DNS Hosting
  • Website Hosting
  • Email Hosting
  • SSL Certificate Generation
  • FTP/SSH Connectivity

Additionally, we have determined that approximately 85-90% of our clients were affected by this outage. If you are experiencing problems with your website, email, or any other service listed above and are not already in communication with us, please contact our Hosting Team immediately: https://www.fallstech.group/help

How We Responded

As soon as our Network Operations Center detected the DNS outage, we deployed three additional nodes to provide DNS routing services for our clients and their visitors. These nodes are located in San Francisco, USA; New York City, USA; and London, England. At 11:00 PM on January 17th, our node deployment was completed and we were able to begin restoring access to the affected resources.

Since the deployment, we have interconnected our infrastructure to provide several layers of redundancy beyond our existing backup technology. We have also temporarily shut off CDN delivery for all client sites until we can correct the remaining configuration data for this system and bring it back online with confidence. Our goal is to restore services as quickly as possible with as little damage as possible, and we determined that the CDN would have caused more harm than good during this transition. We will be reactivating this system within the next week.

Was Any Data Lost?

After further analysis, we have determined that the only losses during this outage were visitors who were unable to reach your site and emails that may have been lost in transit while your domain was not responding. We have further confirmed that no DNS records, inboxes, website data, or other on-server resources sustained any data loss.

What Are We Doing to Prevent This in the Future?

First and foremost, we are diversifying our portfolio of providers. This gives us redundancy and interconnectivity between our systems that we did not previously have, so that a problem at one vendor cannot fully disrupt our services. Our team has worked hard over the last 72 hours to build in further protections that prevent nodes from being taken offline and allow us to respond faster to issues arising in our system.

Steps Taken:

  • We have layered our DNS cluster so that data transfers between servers more quickly and a single DNS node failure no longer takes its region offline
  • We have increased the density and quantity of our DNS nodes so that global connectivity can remain uninterrupted even if a web server fails
  • We have updated our business continuity policies to allow for quicker delegation of technical tasks between our team members
  • We have audited each individual website on our infrastructure to verify connectivity, so that outages cannot occur without proper monitoring
  • We have finished redeploying our CDN resources to provide better global access to client sites even if the host server goes offline
  • We have finished implementing a faster recovery solution so that we can immediately roll back any future configuration change that causes an outage
  • We have finished validating our full archive of backup deployments
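The goal behind the first two steps, that no region depends on a single DNS node, can be expressed as a simple check. The following is a minimal sketch under stated assumptions and not our production monitoring: the node names, the region labels, and the threshold of two healthy nodes per region are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


def regions_at_risk(
    nodes: Dict[str, str], healthy: Set[str], minimum: int = 2
) -> Dict[str, int]:
    """Map each region with fewer than `minimum` healthy nodes to its healthy count."""
    counts = defaultdict(int)
    for node, region in nodes.items():
        counts[region] += 0  # register the region even if none of its nodes are healthy
        if node in healthy:
            counts[region] += 1
    return {region: n for region, n in sorted(counts.items()) if n < minimum}


# Hypothetical inventory: node name -> region
example_nodes = {
    "dns-sfo-1": "us-west",
    "dns-sfo-2": "us-west",
    "dns-nyc-1": "us-east",
    "dns-nyc-2": "us-east",
    "dns-lon-1": "europe",
}
# Hypothetical health-check result: the nodes that are currently answering
example_healthy = {"dns-sfo-1", "dns-nyc-1", "dns-nyc-2"}

at_risk = regions_at_risk(example_nodes, example_healthy)
print(at_risk)  # {'europe': 0, 'us-west': 1}
```

A region that appears in the result either has no healthy node at all or would go offline if one more node failed, which is the condition the layered cluster is meant to rule out.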

We want to personally offer our most sincere apology for this outage. We are fortunate to have a team that can come together quickly to resolve issues like these. However, we can always improve our hosting experience and do better by our clients each and every day.

If you have any questions or concerns about this outage, please contact our Hosting Support Team and they will be happy to assist you. You can do so here: https://www.fallstech.group/help
