On Wednesday afternoon a small percentage of WPEngine websites using a paid version of Wordfence experienced a 500 Internal Server Error or white screen due to an erroneous firewall rule that we released. If you have experienced this issue, please check your email, which contains instructions for fixing it. You can also find guidance on our Twitter account and on our forums, where we have posted a solution. We have also been hard at work in our ticketing system answering support requests from affected customers. You can open a ticket by signing into this site, visiting your Licenses page, and clicking Get Help on the applicable license. You can also find instructions for the fix in this longer, more detailed post.
Please keep in mind only a small percentage of WPEngine users using Wordfence were affected, and these were limited to paid users only due to the way we release firewall rules.
The rest of this post contains an after-action report of the root cause and what we’re doing about the issue.
On Wednesday at 2pm we released a new, low-priority firewall rule to our Premium, Care and Response customers. It was set to propagate to all our paid customer sites over the following 24 hours.
Most free and paid Wordfence sites (approximately 95% or more) use a file to store the firewall rules. On some hosts we were not able to implement that, so a few years ago we added a backup method for compatibility which stores the firewall rules in MySQL. WPEngine is one of the rare hosts where this is used. Apart from an isolated report on Pantheon, we have not heard of this storage method being used anywhere else.
Sites that use this storage method for firewall rules started whitescreening or producing a 500 Internal Server Error as they received the new rule on Wednesday. That is a catastrophic failure and leaves a site in a non-functioning state. It’s the worst case scenario for us and the customer.
By 5:15pm EST on Wednesday evening there were enough reports that one of our CS team members was able to recognize that a major issue was underway. They posted a list of the issues we had received in Slack and it immediately received attention from a wide range of senior team members including our head of security, operations staff, head of products, executive team, dev team and QA team.
From 5:15pm until 5:45pm the team worked together to:
Confirm there was a common issue among these sites.
Investigate whether it was a WPEngine operations issue, which it wasn’t.
Analyze the error logs we had received to isolate the issue to a new firewall rule.
Confirm that it was related to a new firewall rule, and identify which specific rule.
Propose pulling the rule, analyze the risk/benefit of doing that, and post action steps for our customers.
At 5:45pm Chloe, our head of product for Wordfence Intelligence, pulled the offending rule from production.
We then split into separate teams, one each for customer communication, root cause analysis, developing an automated fix, and developing an immediate fix for affected customers. Some team members worked across more than one team.
After trying and testing various approaches, we quickly determined that we could not develop an automated fix. We confirmed that the fix for affected customers was to delete the firewall rules in their MySQL database, which would remove the offending rule, bring the site back up and cause Wordfence to fetch fresh rules. The SQL for this is:
DELETE FROM wp_wfconfig WHERE name='wafRules'
We recommend taking a database backup before you run this. Your table prefix may be something other than ‘wp_’, and if you have been using the plugin for a long time the table name may have an upper-case C in the word ‘config’ (that is, ‘wfConfig’ rather than ‘wfconfig’).
If you need help running a query on your WPEngine site, you can find instructions on this page on WPEngine: https://wpengine.com/support/run-query-phpmyadmin/
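Because the table name depends on the prefix and on the age of the install, the statement above can be sketched as a small helper. This is only an illustration: the helper name `waf_rules_delete_sql`, the prefix `site7_`, the `val` column and the `otherSetting` row are all hypothetical, and an in-memory SQLite database stands in for MySQL so the example runs on its own.

```python
import re
import sqlite3


def waf_rules_delete_sql(prefix="wp_", legacy_case=False):
    """Build the DELETE statement for a given table prefix.

    legacy_case=True targets the older table spelling with an upper-case C.
    """
    # Only allow characters that are safe in an unquoted table name.
    if not re.fullmatch(r"[A-Za-z0-9_]*", prefix):
        raise ValueError(f"unexpected table prefix: {prefix!r}")
    table = prefix + ("wfConfig" if legacy_case else "wfconfig")
    return f"DELETE FROM {table} WHERE name='wafRules'"


# Demonstration on SQLite (a stand-in for MySQL; schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site7_wfconfig (name TEXT PRIMARY KEY, val BLOB)")
conn.executemany(
    "INSERT INTO site7_wfconfig VALUES (?, ?)",
    [("wafRules", b"bad rule set"), ("otherSetting", b"1")],
)

statement = waf_rules_delete_sql(prefix="site7_")
conn.execute(statement)
remaining = [row[0] for row in conn.execute("SELECT name FROM site7_wfconfig")]

print(statement)   # DELETE FROM site7_wfconfig WHERE name='wafRules'
print(remaining)   # ['otherSetting']
```

Only the `wafRules` row is removed; every other setting in the table is left alone.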
Our communications team shared the fix on Twitter, our forums, and via tickets, and immediately started getting confirmation that this fix worked.
Today is Friday, April 14th. This morning we held an after-action meeting to discuss the issue, its root causes, the fixes we will be implementing longer term, and the controls we will put in place to prevent a recurrence.
What Caused It?
As with most failures, this was a chain of events. Our firewall rules go through a rigorous QA process. That process tested firewall rules on the ubiquitous file based storage system, and it also had a test in place for the MySQL based storage system. A while back that test for the MySQL stored rules was decommissioned and we inadvertently did not replace it. That was root cause one.
The second item in the chain was that the exception handling in the plugin, which runs when we execute firewall rules, was not robust enough to handle the exception it encountered with this rule.
A third item in the chain is that about a year and a half ago we added new functionality to the firewall syntax in Wordfence to make it more powerful. We had not used that functionality until now because our internal threat intelligence platform had not yet been updated to support it. Recently we added that support. So for the first time we rolled out a rule using this new functionality, which resulted in a higher likelihood of an exception being generated.
How Are We Preventing Similar Issues?
To prevent a recurrence we are taking several steps.
First, we’re immediately putting a process in place to verify that all rules run on MySQL based rule storage systems.
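One way to picture this control is a check that loops over every storage backend, so that no backend can silently drop out of the test run. The sketch below is not our QA tooling: the class names, the JSON rule format and the round-trip check are assumptions made for illustration, and SQLite stands in for MySQL.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


class FileRuleStore:
    """Stores the serialized rule set in a file."""

    def __init__(self, path):
        self.path = Path(path)

    def save(self, rules):
        self.path.write_text(json.dumps(rules))

    def load(self):
        return json.loads(self.path.read_text())


class DatabaseRuleStore:
    """Stores the serialized rule set in a database row (SQLite stand-in)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE wfconfig (name TEXT PRIMARY KEY, val TEXT)")

    def save(self, rules):
        self.conn.execute(
            "REPLACE INTO wfconfig (name, val) VALUES ('wafRules', ?)",
            (json.dumps(rules),),
        )

    def load(self):
        row = self.conn.execute(
            "SELECT val FROM wfconfig WHERE name='wafRules'"
        ).fetchone()
        return json.loads(row[0])


def verify_on_all_stores(rules, stores):
    """Return the names of stores where the rules did not round-trip intact."""
    failures = []
    for name, store in stores.items():
        try:
            store.save(rules)
            if store.load() != rules:
                failures.append(name)
        except Exception:
            # Any error while saving or loading counts as a failed backend.
            failures.append(name)
    return failures


with tempfile.TemporaryDirectory() as tmp:
    stores = {
        "file": FileRuleStore(Path(tmp) / "rules.json"),
        "database": DatabaseRuleStore(),
    }
    new_rules = [{"id": 1, "action": "block", "match": "example"}]
    failures = verify_on_all_stores(new_rules, stores)

print(failures)  # [] when every backend handles the rule
```

A real check would also execute the rule after loading it, since a rule can be stored correctly and still fail when it runs.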
Next, we’re implementing a long term solution by revamping our testing process to add a testing layer on external systems using our production infrastructure. From now on, before a rule is deployed it will go into ‘alpha’ mode and only be deployed to servers we own across a wide range of configurations and hosts. This will allow us to test every rule we deploy on real infrastructure running on real hosts, via our production infrastructure (as opposed to staging), as a final step before we deploy to our free or paid customers.
Either of the above two controls would have caught the issue we experienced on Wednesday.
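The ‘alpha’ step can be thought of as a gate: a rule leaves alpha only once every test host has reported success. The following is a rough sketch under assumptions; the stage names after ‘alpha’, the `HostResult` type and the host names are hypothetical and do not describe our actual release system.

```python
from dataclasses import dataclass

# Assumed stage order for illustration only.
STAGES = ["alpha", "paid", "free"]


@dataclass
class HostResult:
    host: str      # hypothetical test host name
    storage: str   # "file" or "mysql" rule storage
    ok: bool       # True if the rule ran without error on this host


def next_stage(current, alpha_results):
    """Promote a rule one stage, leaving alpha only if every host passed."""
    if current == "alpha":
        if not alpha_results or not all(r.ok for r in alpha_results):
            return "alpha"
    index = STAGES.index(current)
    return STAGES[min(index + 1, len(STAGES) - 1)]


results = [
    HostResult("host-a", "file", True),
    HostResult("host-b", "mysql", False),
]
held = next_stage("alpha", results)      # one host failed, rule stays in alpha
results[1].ok = True
promoted = next_stage("alpha", results)  # all hosts passed, rule moves on

print(held)      # alpha
print(promoted)  # paid
```

An empty result list also holds the rule in alpha, so a rule cannot be promoted before any host has reported.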
In addition, we’re adding more robust exception handling to the plugin as it executes firewall rules. If a rule throws an exception, the exception will be caught and handled gracefully. The rule will then be disabled and we will be notified. This will avoid errors on customer sites, and reduce our time to respond from 3 hours to minutes.
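The pattern described above (catch, disable, notify) can be sketched in a few lines. The plugin itself is written in PHP, so this Python version only illustrates the idea; `run_rules`, the rule ids and both example rules are hypothetical.

```python
def run_rules(rules, request, disabled, notify):
    """Evaluate each rule; disable and report any rule that raises."""
    for rule_id, rule in rules.items():
        if rule_id in disabled:
            continue  # a rule that failed earlier is skipped
        try:
            if rule(request):
                return "block"
        except Exception as exc:
            # The request keeps being processed; only the faulty rule is dropped.
            disabled.add(rule_id)
            notify(rule_id, exc)
    return "allow"


def broken_rule(request):
    # Stands in for a rule that fails at execution time.
    raise KeyError("unsupported syntax")


def injection_rule(request):
    return "union select" in request.lower()


rules = {"rule-1": broken_rule, "rule-2": injection_rule}
disabled = set()
reports = []


def notify(rule_id, exc):
    reports.append((rule_id, type(exc).__name__))


first = run_rules(rules, "GET /?q=hello", disabled, notify)
second = run_rules(rules, "GET /?q=1 UNION SELECT 1", disabled, notify)

print(first)             # allow
print(second)            # block
print(sorted(disabled))  # ['rule-1']
print(reports)           # [('rule-1', 'KeyError')]
```

The faulty rule costs one notification instead of a site outage, and the remaining rules keep protecting the site.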
I’d like to sincerely apologize to the customers that were affected. Our records show that fewer than 200 were affected in total, based on reports received through our support channels, email, forums and social media. Our team works hard to avoid issues like these. Deploying software to over 4 million sites every few weeks, along with firewall rules in real-time, presents a unique operational challenge, and we are generally very good at keeping sites secure and producing rock solid software. This time we failed you and I’m sorry.
I created this business because a hacker took my own WordPress site offline in 2011. Keeping you online is at the very foundation of what we do and what we are built on. We have worked this week to do better and we will continue to do so.
Mark Maunder – Wordfence & Defiant Founder and CEO.
The post Post Action Report: Bad Firewall Rule Released to WPEngine Customers Wednesday appeared first on Wordfence.