The past 60 days at MailerLite have been a whirlwind of excitement and anticipation as we embarked on the largest and most complex project in our history—designing, building and implementing a new infrastructure for MailerLite.
This new infrastructure will eliminate the unexpected issues that we were susceptible to in the past and also provide our customers with a faster, more reliable marketing platform.
Our two months of preparation culminated this past weekend (June 27), as we upgraded MailerLite and implemented the new infrastructure. The work was completed within the three hours allotted for scheduled maintenance, but we soon discovered an unexpected data processing issue that affected campaign sending and data visualization for many customers.
We worked tirelessly around the clock to identify and fix this interruption in service. Today, we’re relieved that MailerLite is fully operational, and we are once again excited and optimistic about the improved MailerLite. The infrastructure will open up new possibilities for more advanced features to help you grow.
For those of you who are interested, we’ve detailed our thought process and what happened during our two-month journey to revamp MailerLite and make it better for our customers.
In the spring of 2020, our network and data center provider experienced a major DDoS attack, which caused connectivity issues for our app. As a result, MailerLite suffered major downtime and we had to rely on our provider to fix the issue.
We experienced the same problem 10 days after this first incident, which was unacceptable for us. From that moment, we knew that we needed to take more control of our app, which meant changing our infrastructure.
As we mentioned at the time in April, we were already working on a rewrite of MailerLite, but that was not going to solve our immediate concerns. We decided that we needed to move the current version to a new and more reliable data center and improve our processes. And we had to do it in less than two months!
To achieve our goals within the tight timeline, we faced two initial challenges: create a new infrastructure plan and then change our codebase so that it would be supported by the new infrastructure. From that moment, our development and DevOps teams started working 7 days per week to complete this ambitious migration plan.
Within the first month, we had the codebase and the new setup of the infrastructure. We partnered with Google Cloud to host our new infrastructure. This new setup would allow us to make the application even more scalable and we liked the multi-region server setup because it would help us avoid major downtimes that we had experienced in the past.
The next month was dedicated to the migration to Google Cloud. This was an enormous undertaking as MailerLite is a complex application that uses a variety of different technologies. We send millions of emails every day, which generates millions of different files that we have to transfer with no downtime. MailerLite also serves hundreds of thousands of landing pages and subscribe forms all across the internet, and we store huge amounts of data in different databases and clusters. These clusters are responsible for all the data that we collect (your subscribers, statistics, etc.). Each database has more than one copy, which allows us to ensure that your data is always safe and in place.
Despite the size and complexity, the migration plan went quite well as we transferred our data.
Three weeks before the big day (June 27th), we had everything synced and in place, which gave us time to test everything. One of the main tasks was to keep our customers’ landing pages and websites up with no downtime. We were happy with our results here.
We scheduled the maintenance for June 27th (Saturday) from 2 to 5 AM EDT and announced the timing to our customers via email, in-app notifications and on social media. All messages included a link to our status page, where customers could subscribe to real-time updates. On June 26th, we sent a final reminder to our customers and we went through our testing plan one last time. The team was prepared and ready to go.
Saturday, June 27th: Implementation started on schedule and everything was migrated to Google Cloud within the three-hour timeframe. We spent some extra time testing for quality assurance. One hour after the migration, the application was fully operational with no major interruption during testing.
Once we were ready to go full power and enable the sendings, we noticed strange behavior with the connection to our sending servers. It took some time to fix and re-enable the sendings. We noticed significant improvements in sending speed, which made us confident that the migration was a success. Although the whole migration process took a few hours more than we scheduled, we were up and running by Saturday afternoon EDT.
Later that day, we noticed that the data cluster that is responsible for data visualization in subscriber management was slower than usual. We monitored it for a few hours and determined it was too slow for our standards. By the time we noticed this issue, our migration team had already worked more than 17 hours without breaks. Because the cluster was still operational, we postponed the fix to the next day.
Sunday, June 28th: The next day we noticed that the cluster was even slower. We contacted the cluster software support engineers to help us investigate the issue. After 8 hours of debugging, we decided to redo the data processing part of the migration because we couldn’t identify why it was taking so long to transfer the data from our main databases to the subscriber management cluster. During this time, we were working in real time with both Google engineers and the data cluster engineers. After 15 hours of work revising the code, we managed to change the way the data is processed and the data transfer speed was good.
Monday, June 29th: When we checked the processing speed on Monday morning, the data was not being transferred as quickly as we wanted. At this point, the solution was to set up a whole new cluster in a different network and revert the code changes that we had made on June 28th. This decision would require reindexing all the data in order to have it in one place. We spent the entire day carefully planning and creating the new cluster, and to our delight, it started responding very quickly. We were satisfied with the results. It took us a few more hours to start processing the data, but now everything was working and we were monitoring the servers. Unfortunately, when we initiated reindexing, we had to disable new campaign sendings.
Tuesday - Wednesday, June 30th - July 1st: During the two days that followed, subscriber indexing took place. Indexing is the process where we transfer subscriber data from our main databases to a different cluster. This process includes all the historical data about every subscriber. It's important to note that during indexing, all the data that we received was still saved in our main database. The majority of features were not affected, including automation, subscribe forms, landing pages and websites. All webforms, including double opt-in, remained fully operational as well.
The main customer-facing issues during this time were:
Email campaign sendings were stopped for non-indexed accounts. This did not apply to automated emails, which continued to work as usual.
Subscriber group numbers and email campaign statistics (such as open rates, clicks) didn’t display accurately during this time, but data was not affected. All data remained safe and the stats were restored.
Issues with third-party integrations might have taken longer to resolve for some customers because the other service providers had to complete work on their end. We informed these companies about the upgrade in advance.
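For readers who want a more concrete picture of indexing and why campaign sending was paused for non-indexed accounts, here is a minimal sketch in Python. It is an illustration only and not our production code: the names `SearchCluster`, `reindex_account` and `can_send_campaign`, the in-memory dictionaries and the batch size are all assumptions made for this example. It shows subscriber records being copied in batches from a main database to a separate cluster, with campaign sending unlocked for an account only once all of its data has been copied.

```python
from dataclasses import dataclass, field


@dataclass
class SearchCluster:
    """Hypothetical stand-in for the subscriber management cluster."""
    documents: dict = field(default_factory=dict)        # account_id -> copied records
    indexed_accounts: set = field(default_factory=set)   # accounts fully indexed


def reindex_account(account_id, main_db, cluster, batch_size=2):
    """Copy one account's subscriber records from the main database in batches."""
    rows = main_db.get(account_id, [])
    copied = []
    for start in range(0, len(rows), batch_size):
        copied.extend(rows[start:start + batch_size])
    cluster.documents[account_id] = copied
    # Mark the account as indexed only after every batch has been copied.
    cluster.indexed_accounts.add(account_id)


def can_send_campaign(account_id, cluster):
    """Campaign sending stays paused until the account is fully indexed."""
    return account_id in cluster.indexed_accounts


# The main database is only read from, so it remains the source of truth.
main_db = {
    "acct-1": [
        {"email": "a@example.com"},
        {"email": "b@example.com"},
        {"email": "c@example.com"},
    ],
    "acct-2": [{"email": "d@example.com"}],
}
cluster = SearchCluster()
reindex_account("acct-1", main_db, cluster)

print(can_send_campaign("acct-1", cluster))  # True
print(can_send_campaign("acct-2", cluster))  # False
```

In this sketch, the second account cannot send campaigns yet because it has not been indexed, while its data in the main database is untouched.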
These disruptions in service caused a flurry of customer support inquiries, which overloaded our capacity and resulted in longer-than-usual response times. While it was sometimes hard to estimate timing, we did our best to keep everyone in the loop by posting status updates through our in-app notifications, FB Community page, Twitter and status.mailerlite.com. We also proactively emailed certain customers who needed to update their campaign setup.
Thursday, July 2nd: Subscriber indexing continued and was finally completed by the end of the day. All of the issues were resolved for both premium and free MailerLite users.
As an added precaution, we’re monitoring the servers carefully and are making sure we have fundamentally resolved the issue. We do not expect similar failures in the future.
At MailerLite, we’re constantly improving our tools so that you can grow your audience and achieve your goals. Growth is good! But when you grow to a certain point, you’ll inevitably experience growing pains.
Our enthusiasm for delivering a faster and more reliable application was temporarily overshadowed by these unexpected issues. Our growing pains were on full display.
Today, we are happy to announce that the improved MailerLite is fully operational. The upgraded infrastructure will open up new possibilities for more advanced features to help you grow.
We will continue to learn from these experiences so when our next growth spurt comes, we’ll be ready!