Singular Incident postmortem 03.09.2022

On 02.09.2022 we had to stop sales on the Singular NFT marketplace for 24 hours due to a series of events described below. To explain what happened, we first need a basic understanding of the different implementations of the RMRK standard.

Which implementation is currently in use?

At the time of writing this incident report, the Singular marketplace only supports the system.remark implementation of the RMRK standard. The system.remark implementation is very limiting and partially centralised; you can find out more about how it works here. Here's a very rough representation of how this implementation works.

Why does the "system.remark" implementation suck?

Once we receive new RMRK events from the blockchain, we have to pass them to our "consolidator", but for it to output the correct latest state of the NFTs it needs every previous RMRK event ever created. This was fine in the beginning, but every day it creates more issues for the RMRK team: the number of events keeps growing, and because we have to consolidate them off-chain, this is very hard to maintain. All RMRK events combined are now almost 2GB in size, which causes problems with JSON parsing, file transfer and IPFS pinning, and consolidation itself keeps getting slower.
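To illustrate why the consolidator needs the full history, here is a minimal sketch of consolidation as a replay of events. The event shape (MINT / SEND / BURN with an nftId) and the consolidateEvents function are simplified assumptions made up for this example, not the actual RMRK remark format or our consolidator code.

```typescript
// Hypothetical, simplified event shape; real RMRK remarks carry more fields.
type SketchEvent =
  | { block: number; op: "MINT"; nftId: string; owner: string }
  | { block: number; op: "SEND"; nftId: string; to: string }
  | { block: number; op: "BURN"; nftId: string };

type NftState = { owner: string; burned: boolean };

// Replays events in block order and returns the latest state of every NFT.
function consolidateEvents(events: SketchEvent[]): Record<string, NftState> {
  const state: Record<string, NftState> = {};
  const ordered = events.slice().sort((a, b) => a.block - b.block);
  for (const ev of ordered) {
    const current: NftState | undefined = state[ev.nftId];
    switch (ev.op) {
      case "MINT":
        if (!current) state[ev.nftId] = { owner: ev.owner, burned: false };
        break;
      case "SEND":
        // A SEND only makes sense if the NFT was minted earlier and not burned.
        if (current && !current.burned) current.owner = ev.to;
        break;
      case "BURN":
        if (current) current.burned = true;
        break;
    }
  }
  return state;
}

const sampleEvents: SketchEvent[] = [
  { block: 1, op: "MINT", nftId: "nft-1", owner: "alice" },
  { block: 2, op: "SEND", nftId: "nft-1", to: "bob" },
  { block: 3, op: "SEND", nftId: "nft-1", to: "carol" },
];

// With the full history the NFT ends up owned by "carol".
const fullReplay = consolidateEvents(sampleEvents);
// Without the MINT event the later SENDs cannot be applied at all.
const partialReplay = consolidateEvents(sampleEvents.slice(1));
```

In this sketch, dropping even one early event changes the result, which is why the input to consolidation can only grow over time.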

What happened during this incident?

One of the biggest collections on RMRK 2.0 was sending a huge number of events over the course of a day, suddenly increasing the size of the events dump and putting load on our off-chain consolidator process. This happened while several big collections were also migrating from RMRK 1.0 to RMRK 2.0, further increasing the load on the servers where consolidation and indexing happen. As a result, three things happened:

Redis/bullmq (in-memory data caching and event queueing system) choked with an Out Of Memory error.
Event dumps stopped updating because the JSON file got too big for Node.js to parse, and the node process hit an OOM error while processing it.
When re-indexing the events missed since the Redis error, node-fetch could not parse the fetched dumps, again because the file was too big.

What was done as an immediate fix

We immediately increased the amount of memory available on this machine.
We switched to JSON streams for parsing the event dumps.
JSON streaming was likewise added to the re-indexing step, so that large JSON is processed in chunks.
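The idea behind processing large JSON in chunks can be sketched as follows. This is not the library we use in production, just a hand-rolled illustration that assumes the dump is a single top-level JSON array whose elements are objects; createJsonArrayStreamer and all other names here are hypothetical.

```typescript
// Assumption: the dump is one top-level JSON array whose elements are objects.
type OnItem = (item: unknown) => void;

// Returns a function that accepts text chunks of any size and calls onItem
// once for every complete array element seen so far. Only the element that is
// currently being read is kept in memory, never the whole dump.
function createJsonArrayStreamer(onItem: OnItem): (chunk: string) => void {
  let depth = 0; // nesting depth inside the current element (0 = between elements)
  let inString = false;
  let escaped = false;
  let buffer = "";

  return (chunk: string): void => {
    for (let i = 0; i < chunk.length; i++) {
      const ch = chunk[i];
      if (depth === 0) {
        // Between elements: skip the outer brackets, commas and whitespace.
        if (ch !== "{") continue;
        depth = 1;
        buffer = ch;
        continue;
      }
      buffer += ch;
      if (inString) {
        if (escaped) escaped = false;
        else if (ch === "\\") escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') inString = true;
      else if (ch === "{" || ch === "[") depth++;
      else if (ch === "}" || ch === "]") {
        depth--;
        if (depth === 0) {
          // The element is complete: parse just this small piece.
          onItem(JSON.parse(buffer));
          buffer = "";
        }
      }
    }
  };
}

// Usage: feed a dump in 7-character chunks, as a network stream might.
const dumpText = JSON.stringify([
  { id: 1, memo: 'tricky "}]" text' },
  { id: 2, nested: { tags: ["x", "y"] } },
  { id: 3 },
]);
const parsedItems: any[] = [];
const feed = createJsonArrayStreamer((item) => parsedItems.push(item));
for (let pos = 0; pos < dumpText.length; pos += 7) {
  feed(dumpText.slice(pos, pos + 7));
}
console.log(parsedItems.length); // 3
```

The point of the design is that JSON.parse only ever sees one event at a time, so memory use depends on the size of the largest event and not on the size of the dump.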

There are still a few events that got lost in the process, and we will re-index them soon.

What can be done short-term?

We have several short-term solutions in mind for optimising the RMRK "system.remark" consolidator.

Generating separate event dumps per collection and consolidating each collection into a separate consolidated dump.

This will make the dumps smaller and easier to manage, and we could easily exclude certain problematic or big collections.

Improve ease of re-indexing (currently largely manual labour).
Optimise the consolidator (rewrite in Rust / Golang).
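The per-collection dump idea from the list above could look roughly like the sketch below. It assumes every event can be attributed to exactly one collection through a collectionId field; that field, the payload field and the splitByCollection function are hypothetical names used only for illustration.

```typescript
// Hypothetical event shape: each event is tagged with the collection it belongs to.
type CollectionEvent = { collectionId: string; block: number; payload: string };

// Groups events into one dump per collection, skipping excluded collections.
function splitByCollection(
  events: CollectionEvent[],
  excluded: string[] = []
): Record<string, CollectionEvent[]> {
  const dumps: Record<string, CollectionEvent[]> = {};
  for (const ev of events) {
    if (excluded.indexOf(ev.collectionId) !== -1) continue;
    if (!dumps[ev.collectionId]) dumps[ev.collectionId] = [];
    dumps[ev.collectionId].push(ev);
  }
  return dumps;
}

const mixedEvents: CollectionEvent[] = [
  { collectionId: "birds", block: 1, payload: "mint bird-1" },
  { collectionId: "items", block: 2, payload: "mint item-1" },
  { collectionId: "birds", block: 3, payload: "send bird-1" },
  { collectionId: "spam", block: 4, payload: "mint spam-1" },
];

// Each collection gets its own small dump; "spam" is left out entirely.
const perCollection = splitByCollection(mixedEvents, ["spam"]);
console.log(Object.keys(perCollection)); // [ 'birds', 'items' ]
```

Each of these smaller dumps could then be consolidated on its own, so a sudden burst of events in one collection would not slow down the others.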

What is being done mid-term?

We have always been open and vocal about this scalability problem of the system.remark implementation, and that's why we started writing our Solidity contracts and Substrate pallets right away. Once they are fully ready, we will allow users to mint and interact with these implementations from Singular. In fact, we are working on this Singular implementation right now, as you can see on our public dev Roadmap. You can hear more about the current state of our Solidity smart contracts on the upcoming Crowdcast that we are hosting this week.