Life Following an App Release (Part 2)

Nigel Hing
4 min read · Jul 4, 2021

This is part 2 of Life Following an App Release.

Our first attempt to go live ended badly: we were forced to revert to the previous application on the same day, before lunch. There were multiple issues at hand: pages going blank on the frontend for certain modules, errors being thrown when certain operations were performed, and incorrect calculations. Worst of all were the constant SQL timeout issues that happened periodically and caused locks on the database. We weren’t sure what was causing these timeouts since they happened randomly, and when they did, the admins were unable to work. We didn’t really know how to remove these locks at the time, so that summed up the first disastrous attempt.

We took 2 weeks to fix whatever issues we could because we were pressed for time to release the app. Management wanted it released before the end of the year, and there was pressure. One of the senior devs told me he wasn’t confident we could release the app in 2 weeks’ time; that it would probably take at least another month. Nevertheless, the team tried their best, working nights for the next 2 weeks to fix the issues. We fixed everything based on the complaints we got from the first launch (and added a few extra features, of course, because users wanted them), and we thought we were ready to launch.

We were definitely wrong, because the 2nd attempt failed on the same day as well. It did last longer than the first attempt; this time we managed to keep it going until about 2pm before we were forced to revert. The biggest issue was again the timeouts, and we still didn’t really know what was causing them. Of course, there were the usual accompanying issues like pages going blank, calculation errors, and error messages as users dug deeper into the app on the 2nd attempt. This time, though, we figured out we could monitor the locks happening on the database and work out where they were coming from. One of the breakthroughs we had was realizing that we could kill these locks, thus freeing up the database. We took another 2 weeks, refactoring parts of our codebase that we believed were causing the timeouts, fixed a bunch of other bugs, and relaunched for the 3rd time. Looking back, the 3rd attempt was pretty much a do-or-die scenario because it was November and we had to get the app stabilized by December.
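Roughly speaking, the monitoring boiled down to querying the database’s blocking views and killing the head blocker. The sketch below is illustrative only — it assumes the database was SQL Server and uses a placeholder pyodbc connection; it isn’t our actual script:

```python
# Illustrative only — assumes SQL Server; connection details are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=db-host;DATABASE=erp;UID=monitor;PWD=secret"
)
conn.autocommit = True  # KILL can't run inside a user transaction
cur = conn.cursor()

# Find requests that are currently blocked and the session blocking them.
cur.execute("""
    SELECT session_id, blocking_session_id, wait_type, wait_time
    FROM sys.dm_exec_requests
    WHERE blocking_session_id <> 0
""")
blocked = cur.fetchall()
for row in blocked:
    print(f"session {row.session_id} blocked by {row.blocking_session_id} "
          f"({row.wait_type}, {row.wait_time} ms)")

# Killing the head blocker frees everyone queued behind it.
# KILL can't be parameterized, so cast to int before formatting it in.
if blocked:
    cur.execute(f"KILL {int(blocked[0].blocking_session_id)}")
```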

The 3rd attempt started off very rocky, but we were able to respond to issues better. As long as we could kill the locks that were causing these timeouts, we knew we were fine. Users weren’t completely halted in their work, but we had to immediately kill the locks popping up on our database. We had a system where we would screenshot the locks appearing on the database and send them to our senior backend developer, who would identify the origin of the locks and revamp the code. Even now, I’m not really able to explain what the root cause was, but I remember it was something about concurrency, async operations, and the Linux server not handling async operations well. I was just too busy fixing all the frontend bugs as well as any incorrect calculations on the backend.
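The screenshot-and-ask step could also have been automated: the same views will hand back the SQL text behind both the blocked request and its blocker, which is essentially what my senior colleague was reconstructing by hand. Again just a sketch, under the same SQL Server assumption as above:

```python
# Sketch, same SQL Server assumption as above; the DSN is a placeholder.
import pyodbc

conn = pyodbc.connect("DSN=erp_monitor")
cur = conn.cursor()

# For every blocked request, pull its statement and the blocker's last statement.
cur.execute("""
    SELECT r.session_id          AS blocked_session,
           r.blocking_session_id AS blocker_session,
           blocked_sql.text      AS blocked_statement,
           blocker_sql.text      AS blocker_last_statement
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS blocked_sql
    JOIN sys.dm_exec_connections AS c
        ON c.session_id = r.blocking_session_id
    CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) AS blocker_sql
    WHERE r.blocking_session_id <> 0
""")
for row in cur.fetchall():
    print(f"{row.blocked_session} blocked by {row.blocker_session}")
    print("  waiting on :", row.blocked_statement)
    print("  blocker ran:", row.blocker_last_statement)
```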

The calculations were particularly stressful for me because there were over 30 different possible scenarios, with the system allowing users to input both percentages and lump sum figures. For any errors, I had to debug the code on staging and figure out the issue fast. The calculations were important because they’re how we make payments, so they had to be accurate to the cent and fixed in a timely manner, meaning if there was an issue, it had to be resolved the same day. I was probably working from 9am to 2–3am on most days, and even on weekends, to fix bugs. My team lead would also jump in to help me with the calculation issues because those were the next most urgent problem after the timeouts. It was really tough as well because, since the app accepts both percentages and lump sums for user inputs, I had developed the calculations a certain way to accommodate this approach. I got approval that the methods I was using were correct but, lo and behold, a week after the app went live, I was informed that the method for lump sum calculations was incorrect.
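For a sense of why to-the-cent accuracy was fiddly: the same payment line could arrive either as a percentage of a base amount or as a lump sum, and both had to be normalised into an exact cent value. The snippet below is a hypothetical illustration — none of these names or rules come from the real codebase — using Python’s decimal module so float rounding can’t creep in:

```python
# Hypothetical illustration — names and rules are not from the real codebase.
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def payment_amount(base: Decimal, value: Decimal, is_percentage: bool) -> Decimal:
    """Return one payment line rounded to the cent.

    base  -- the amount a percentage applies to (ignored for lump sums)
    value -- either a percentage like Decimal("12.5") or a lump sum
    """
    amount = base * value / Decimal("100") if is_percentage else value
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)

print(payment_amount(Decimal("1033.33"), Decimal("12.5"), True))    # 129.17
print(payment_amount(Decimal("1033.33"), Decimal("250.00"), False)) # 250.00
```

Concentrating the conversion and rounding in one place like this is the kind of thing that would have made the later change to the lump sum rule far less painful.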

I had to change the logic within the day and, of course, that meant spaghetti code. Even today, it’s still spaghetti code that I have to revisit from time to time as I add more scenarios. I wish I had the time to go back and rewrite that entire module knowing what I know now. Over time, the issues became more and more spaced out to a manageable level as we kept firefighting daily. In about 4 weeks, the base version of the app was somewhat stable and we could finally take a breather.

Looking back, it was definitely a very challenging experience to go through, but I’m glad I went through it with my team under the guidance of my team lead. There were 4 of us: 1 team lead/database/backend/PM, 1 senior backend dev, 1 junior backend dev, and me, an intermediate full-stack dev plus minor PM (I had to gather my own requirements for some of the issues I had to fix on modules I was in charge of). I think we did a pretty darn good job considering it took us only 3 attempts and 4 weeks after the 3rd attempt to get it somewhat stable. It was an ERP project and there were only 4 of us, so I’m pretty amazed we pulled it off. I’ve heard nightmare stories from friends who had to deal with an unstable app for a year!

What are some other nightmare stories you’ve encountered?
