This Week’s Downtime (and an update!)

Posted 3 years, 5 months ago by lulu

Hi! I'm here to address the current state of the site -- what happened this week, and plans moving forward.

First: What was up with the site this week?

I’ll start with an apology for the disruptive site downtime and poor performance this week. Here’s a breakdown of everything that’s happened:

  1. On the 23rd of October (Friday), around 6AM site time, writes to one of our servers went down. This resulted in a lot of buggy behaviour: image uploads were unavailable, and new profiles, posts, and messages sporadically failed to save. This issue was fixed at 9PM site time the same day, when the server was recovered.
  2. Due to the downtime on Friday, I bumped the priority on a profile history feature in case anyone needed to recover their profiles. I didn’t test the feature adequately before launching, which resulted in a bug where the front page of user profiles would show an error when accessed by a logged-out user. This bug was introduced at 6AM on the 25th of October (Sunday), and was not patched until 10PM on the 27th of October (Tuesday).
  3. Beginning in the afternoon of the 28th of October (Wednesday) (timing is to the best of my understanding, based on user reports), TH started getting hit by bots, resulting in a full site slowdown and sporadic timeouts throughout the evening. This was fixed at 3AM the next morning. We’ve turned on captchas and bot challenges for high-load pages, which I’m still reviewing and adjusting, so please bear with us while the settings get tuned.

Due to a mix of this and a lack of admin updates, there have been understandable fears that the site's on its last legs, so the most important clarification: TH isn't dying and it's not an abandoned project. Even when I'm not posting, I love the site and I love the user base, and as long as there's a single person still logging on I'll keep it trucking along to the best of my capability. I understand that rumours have circulated entirely as a result of my radio silence, and I apologise for the uncertainty and anxiety.

We've had a bad week with unacceptable downtime and bugs which could have been prevented or mitigated. I'll be installing monitoring which will help me respond more rapidly to unexpected server issues. The problems from the past few days have been diagnosed and addressed, and long-term site health is not damaged or deteriorating. We continue to host daily database backups for all user activity, and backups of image files over several servers, so in the event of downtime or slowdowns your profiles are still safe.

Regarding the site’s future and user concerns:

Since I’m already writing a post, I’d like to take the chance to address concerns I’ve seen from users regarding site management, and talk a bit about site status and where we currently are. 

There’s been some public speculation, so I'll confirm I’m the only admin/mod/dev actively working on TH. This is just clarification, not an excuse; the site has been solely my responsibility from the beginning and I understand it's my job as site admin to either manage the site's issues or hire staff where I can't.

This is a summary of my understanding of primary user concerns:

  • Lack of support: HelpDesk tickets, PMs, emails. I try to do a chunk of tickets every day, but a lot of these go unseen and end up getting buried. We also don't do proactive forum/character moderation (e.g. watching new character uploads / recent posts for misconduct instead of waiting for a ticket to be filed), which we should be doing but won't be able to for a while due to the resource commitment.
  • Lack of development work: New features, bug fixes, site improvements. This is where I'd love to be working, but the bulk of my time is going into the HelpDesk currently.
  • Lack of communication and awareness: Announcements, status updates, changelog, no admin/mod presence in the community. Since I don't use the forums or hang out on Discord, I'm blind to community concerns or moods unless directly informed.

I understand an active mod team would fix everything on this list; it would raise responsiveness on tickets, free me up to do development work, and open up more points of contact between site management and the user base. The bottlenecks for this are:

  • Funding: I want to be able to compensate mods for their time; I've recently started a new job which should help with stable funding (yay) but still need to figure out a budget.
  • Security: Hiring a bad mod is my number one fear for the site. HelpDesk mods will have access to personal user information and control over user accounts. The community can get quite insular and conflicts of interest over ticket handling would be a nightmare. I need to figure out how to identify people that the user base can trust, and I need to put together a hiring post that can find them, which has been a struggle.
  • Infrastructure: The infrastructure for a mod team is missing right now. Internal guidelines need to be ironed out before we can take new hires on, and proper mod tools need to be built since we can't give new mods access to the admin account or databases.

So, I want to hire mods to provide better support, but I need to set up infrastructure first, but my time is going to tickets so I don't have time to set up the infrastructure for new mods... etc. That's not to say work is stopped or that I’ve given up, but it's been slow.

I'll be honest and say I'm not in a position where I can meet user expectations yet. The reason I kept putting this post off is that I've been telling people the same thing for months and haven't been able to deliver, and I know it's shameful. I understand that users are frustrated and have been waiting a long time. I know I can't just keep making promises or saying I'll try harder; I need to start doing better and turning up results, and I’m sorry I still don’t have those results ready. I'm grateful to both our veteran and new users for sticking around through everything and really appreciate you all a lot.

What I’m currently working on, aside from shovelling tickets:

  • My first priority is buying monitoring software for the site to prevent long outages like the one on Friday.
  • The anti-bot features that got turned on need to be tuned so they stop bothering regular people who're just browsing the site.
  • Then back to mod tools...