2020-04-30T08:14:07.460765+00:00
Incident impacting Portal

2020-04-30T13:45:46.506210+00:00

Last night we deployed an update to improve MongoDB index usage, and as part of it we deleted unused MongoDB indexes. Indexes make sure data is loaded quickly and efficiently; the ones we removed were no longer being used.
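
For context on how "unused" is determined: MongoDB reports per-index usage counters through the $indexStats aggregation stage, which is the usual way to identify candidates for removal. A minimal sketch, assuming the official Node.js driver (the database and collection names are placeholders, not our actual schema):

```typescript
// Sketch: list indexes with zero recorded usage, assuming the official
// MongoDB Node.js driver. Names below are placeholders.
import { MongoClient } from "mongodb";

async function listUnusedIndexes(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    const coll = client.db("app").collection("conversations");
    // $indexStats reports per-index usage counters since the last restart.
    const stats = await coll.aggregate([{ $indexStats: {} }]).toArray();
    for (const idx of stats) {
      if (idx.name !== "_id_" && Number(idx.accesses.ops) === 0) {
        console.log(`never used since restart: ${idx.name}`);
        // await coll.dropIndex(idx.name); // drop only after checking every node
      }
    }
  } finally {
    await client.close();
  }
}
```

Note that these counters reset on server restart and differ per replica set member, so an index should be checked on all nodes before it is dropped.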

This morning, when we scaled up the containers of our legacy chat platform, the new containers triggered the recreation of the indexes we had deleted last night. This went unnoticed by us, and shortly after we started getting complaints about the performance of the dashboard and the chat. It took us quite a while to figure out that last night's deployment of the update on the legacy chat platform had failed, causing the old indexes to be recreated.

The reindexing caused our MongoDB performance to drop and memory to swap, which resulted in a couple of issues: the inbox loading slowly, inbound calls not being visible, and the chat having difficulty loading. We did an emergency upgrade of the MongoDB cluster and cancelled the indexing operations for the unused indexes.
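
For the curious: cancelling in-flight index builds is done with MongoDB's currentOp and killOp admin commands (the same can be done from the mongo shell). A minimal sketch, assuming the official Node.js driver:

```typescript
// Sketch: find running index builds via currentOp and cancel them with
// killOp. The connection string is a placeholder.
import { MongoClient } from "mongodb";

async function killIndexBuilds(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    const admin = client.db("admin");
    // currentOp lists in-progress operations; index builds carry a
    // createIndexes command document.
    const { inprog } = await admin.command({ currentOp: true });
    for (const op of inprog) {
      if (op.command?.createIndexes) {
        console.log(`killing op ${op.opid} on ${op.command.createIndexes}`);
        await admin.command({ killOp: 1, op: op.opid });
      }
    }
  } finally {
    await client.close();
  }
}
```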

Performance was back to normal, but at that point two of the MongoDB replica sets were out of sync and had to resync, which caused another slowdown of our platform around 12:00. This time the slowdown had cascading effects on the new livechat platform. Incoming requests couldn't be handled fast enough, causing the background queue to fill up and overload MongoDB with queries. This in turn caused the resync of the replica sets to take much longer than expected.
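
For those wondering what "out of sync" means here: each replica set member applies the primary's oplog, and how far a member has fallen behind can be read from the replSetGetStatus admin command. A minimal sketch, assuming the official Node.js driver (connection string is a placeholder):

```typescript
// Sketch: approximate replica set lag from replSetGetStatus.
import { MongoClient } from "mongodb";

async function printReplicationLag(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    const status = await client.db("admin").command({ replSetGetStatus: 1 });
    const primary = status.members.find((m: any) => m.stateStr === "PRIMARY");
    if (!primary) return; // no primary elected right now
    for (const member of status.members) {
      // optimeDate is the last oplog entry a member has applied; the gap
      // to the primary approximates how far behind it is.
      const lagMs = primary.optimeDate.getTime() - member.optimeDate.getTime();
      console.log(`${member.name} [${member.stateStr}] lag ~ ${lagMs} ms`);
    }
  } finally {
    await client.close();
  }
}
```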

There are a couple of things we will do to prevent this from happening in the future.

1. The failed deployment went unnoticed; we will add extra notifications about failed deployments so this cannot happen silently again.

2. The reindexing of the unused indexes shouldn't have been triggered automatically; we will make sure indexing won't be triggered on container startup (see the sketch after this list).

3. We will discuss with our hosting provider whether we can add extra monitoring to our MongoDB cluster, so we are informed about issues sooner, saving time when investigating possible causes.

4. During the performance issues we got a lot of questions about what was going on. While we would love to update everybody personally, it is very hard to answer everybody in a timely manner while working on the issues and receiving many similar queries. In addition to our status page, we will add an extra notification banner to our website and application when serious issues are affecting the platform.
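
To illustrate item 2: we won't detail our stack here, but as an example, Mongoose-style ODMs call createIndex for every index declared in a schema when the application boots, which silently recreates indexes that were dropped by hand. A sketch of disabling that, assuming Mongoose (the schema below is hypothetical):

```typescript
// Sketch: stop index builds from running as a container-startup side
// effect, assuming a Mongoose-style ODM. The schema is hypothetical.
import mongoose from "mongoose";

const messageSchema = new mongoose.Schema(
  { conversationId: String, body: String, sentAt: Date },
  { autoIndex: false } // never build this schema's indexes at boot
);
messageSchema.index({ conversationId: 1, sentAt: -1 });

async function connect(uri: string): Promise<void> {
  // Disabling autoIndex globally turns index creation into an explicit
  // deploy/migration step instead of an automatic startup action.
  await mongoose.connect(uri, { autoIndex: false });
}
```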

We are deeply sorry for all the inconvenience this has caused you today.

Status remaining
Fully Operational
2020-04-30T13:24:07.793092+00:00

Syncing is done; everything should be back to normal now.

Later we will publish our post-mortem, in which we will explain what happened and what we will do in the future to handle this kind of issue.

Status changed to
Fully Operational
2020-04-30T12:18:28.111438+00:00

Still syncing. Unfortunately we cannot force this to go any faster.

Please hold tight. Hopefully it is done in a jiffy. Again, very sorry for the massive inconvenience.

Status remaining
Degraded Performance
2020-04-30T11:58:42.603652+00:00

Our MongoDB cluster is almost done resyncing. Performance should be back to normal in 10-15 minutes.

Status remaining
Degraded Performance
2020-04-30T11:10:44.376180+00:00

One of our MongoDB databases is having trouble syncing all data because of the outage this morning. We are currently updating our nodes to minimize the impact.

Status remaining
Degraded Performance
2020-04-30T10:54:44.864412+00:00

We are experiencing issues again; we are currently in contact with our provider to investigate the problem.

Status changed to
Degraded Performance
2020-04-30T10:54:28.289787+00:00

Back to normal. We will be monitoring performance over the next few hours.

Status changed to
Fully Operational
2020-04-30T08:40:36.271009+00:00

Yes, we found the issue. An update yesterday triggered the recreation of old indexes in the database, which overloaded our MongoDB cluster this morning. The heavy load on our cluster caused issues with all our running apps: chat, portal, etc. We are preparing a fix; hopefully performance will be back soon. Sorry for the inconvenience.

Status remaining
Degraded Performance
2020-04-30T08:28:35.777571+00:00

You may experience issues with loading conversations, creating new messages, and switching between conversations.

Status remaining
Degraded Performance
2020-04-30T08:16:36.648124+00:00

Currently our customers are experiencing degraded performance of the Belco portal.

We are investigating the cause of this issue. When we know more, we will post it here.

Status changed to
Degraded Performance