Wednesday, May 6, 2020

Cluster offline due to secondary in "RECOVERING" state and primary completely lost!

This is an account of what I faced when called in, during a COVID-19 "stay-at-home" restriction, to fix a MongoDB cluster that was essentially unusable.  What happened, and this is rare, is that our cluster suffered an extended network partition on one of our shards, which I shall refer to as "shard2".  This was followed by a total loss of data on the primary due to a catastrophic storage failure, which forced us to rebuild that server from scratch.

We use a "PSA" architecture - that is, Primary, Secondary, Arbiter.  We use an arbiter instead of a second secondary because one of our data centers can't meet the demands of a data-bearing node, which rules out a Primary, Secondary, Secondary configuration.  Additionally, we were using the default oplog size of 50GB.

Because I was out due to COVID-19, the partitioned secondary sat unattended, and once the network partition was resolved it moved to the "RECOVERING" state after determining that its current optime was less than the minimum optime available in the primary's oplog.  This is an odd name for the state, because it means that the secondary's data is STALE and has no hope of catching up to the primary: the oplog entries it needs have already been deleted from the primary's oplog collection.

The only way to "recover" a secondary in "RECOVERING" state is to shut it down, remove its data, start it back up, and let it do an "initial sync".  For that, you need either another secondary that is not "RECOVERING", or a primary.  In our case, we never have another secondary, because we use the "PSA" architecture I mentioned earlier.  That leaves a primary as the only option for recovering the secondary.  However, as I mentioned before, the primary was lost to the storage failure.  So I was left with a secondary in "RECOVERING" state and an arbiter.

The first thing I tried was to shut down the secondary, copy its files to the newly rebuilt primary, and start the primary up.
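For reference, resyncing a stale secondary by initial sync normally looks like the following. This is a sketch, not our exact commands - the dbpath, port, and replica set name are illustrative - and, crucially, it only works when a healthy data-bearing member is still available to sync from, which we did not have:

```shell
# On the stale secondary: shut the node down cleanly
# (port and paths here are illustrative, not our real config).
mongo --port 27018 --eval 'db.getSiblingDB("admin").shutdownServer()'

# Remove the data files so the node starts with an empty dbpath.
rm -rf /data/shard2/*

# Restart with the usual replica set / shard options.  With an empty
# dbpath, the node performs an initial sync, copying all data from an
# available data-bearing member (a primary or a healthy secondary).
mongod --replSet shard2 --shardsvr --dbpath /data/shard2 \
       --port 27018 --fork --logpath /data/shard2/mongod.log
```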
Now, instead of having just one node in "RECOVERING" state, I had two.  That led me to conclude that something must be telling each node that it didn't have the latest optime, and so it couldn't become primary.

So I started digging around in the "local" database, and I found a collection containing a document with a timestamp that I believed was preventing the servers from becoming primaries.  The collection is "replset.minvalid".  It has one document, and that document has a "ts" attribute that, in my case, was later (higher) than the current optime shown in rs.status() for each node.

When you are working with a replica set without a primary, you cannot modify the data on any node.  So what I did was restart each node without the configuration options that specify the replica set name or the shard server role.  This brings each node up in "stand-alone" mode and allows you to alter the data it contains.  I then set the "ts" attribute in the aforementioned document to the value of the "optime" shown by rs.status().  I was then able to restart with the replication and sharding configuration and, voila, I had a working primary and secondary.

There is a big caveat here, though.  When the primary was lost, whatever transactions it had processed since the network partition began were lost with it.  Still, I was able to bring the cluster back online with minimal loss of data, and that was a success.
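Concretely, the procedure on each node looked roughly like this.  Treat it as a sketch: the port, dbpath, and replica set name are illustrative, and the Timestamp value is a placeholder - you must use the actual optime reported by rs.status() on your own node:

```shell
# 1. Shut the node down, then restart WITHOUT --replSet / --shardsvr so
#    it comes up stand-alone and its data is writable (paths illustrative).
mongod --dbpath /data/shard2 --port 27018 \
       --fork --logpath /data/shard2/mongod.log

# 2. In the "local" database, overwrite the minvalid "ts" with the node's
#    own optime from rs.status().  Timestamp(1588780000, 1) is a placeholder.
mongo --port 27018 --eval '
  db.getSiblingDB("local").replset.minvalid.updateOne(
    {},
    { $set: { ts: Timestamp(1588780000, 1) } }
  )'

# 3. Shut down again and restart with the original --replSet / --shardsvr
#    options; the node no longer believes it is missing operations and
#    becomes eligible for election as primary.
mongod --replSet shard2 --shardsvr --dbpath /data/shard2 \
       --port 27018 --fork --logpath /data/shard2/mongod.log
```

The order matters: the minvalid edit must happen while the node is stand-alone, because a replica set member with no primary will reject the write.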