Uninterrupted payments – who doesn't want that?

Some software systems are so critical that they must meet the very highest availability requirements. Admittedly, in the financial sector, the matter is not one of life or death. But the requirements for real-time payments or authorisation processes are constantly increasing, and maintenance windows in particular are no longer acceptable. And rightly so: if a maintenance window means that you get your account statement an hour later or that the portal is not accessible for an hour, it may be annoying but it is not much of a loss. However, if the bank customer suddenly can no longer pay at the point of sale or cannot authorise a payment in real time, the unavailability becomes extremely relevant.

After the authorisation process for card payments, the introduction of instant payments in Europe has therefore made real-time credit transfers a further field of application for interruption-free systems.
Let us first discuss freedom from interruptions. Uninterrupted operation is defined by two different characteristics:

  1. Avoidance of planned unavailability
    The system is permanently operational during normal operation. It has no periodic times of limited functionality such as end of day or reorganisation.
    The system is designed so that even release changes can be carried out during operation without causing downtime.
  2. Reduction of unplanned unavailability
    The system is highly available even in error scenarios. The probability that it is operational therefore remains high even when individual components fail. This availability is calculated or measured as the ratio of production time to total runtime, i.e. production time plus downtime, for example 99.99 percent.
    Robustness in overload scenarios is of particular interest here. Although every system has its limits, it does make a difference whether everything collapses beyond the load limit or whether only the additional load cannot be processed as per specifications.
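
To make the availability ratio from point 2 concrete, here is a minimal Python sketch. The function names are my own illustrative assumptions; the only thing taken from the text is the definition of availability as production time divided by production time plus downtime.

```python
def availability(production_seconds: float, downtime_seconds: float) -> float:
    """Ratio of production time to total runtime (production plus downtime)."""
    total = production_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return production_seconds / total


def allowed_downtime_minutes(target: float, period_days: float = 365.0) -> float:
    """Downtime budget in minutes that a target availability leaves per period."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_days * 24 * 60


if __name__ == "__main__":
    # 99.99 percent over a 365-day year leaves roughly 52.6 minutes of downtime.
    print(round(allowed_downtime_minutes(0.9999), 2))
```

Seen this way, 99.99 percent leaves less than an hour of unplanned downtime per year, which is why planned maintenance windows alone would already use up the budget.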

The enthusiasm for the topic usually drops considerably when looking at the costs. It is therefore worthwhile to find architectural solutions and not just shift everything onto the infrastructure. However, even the best software will only work if the system environment is available. I won't go into more detail on high-availability infrastructure, operating systems, database systems and message brokers – all of which are prerequisites for an uninterrupted overall system. Instead, I would like to focus on the software architecture, which makes it possible to implement availability requirements in a targeted way while keeping costs under control.

Since high availability is expensive, the critical processes must first be identified. We must therefore answer the question of which processes really have to work all the time and which can be caught up on later. In real-time payments, for example, bulk processes are less critical than individual payments.

If large components are part of the critical processes, it should be analysed whether they can be bridged. Can an alternative component take over the critical tasks of a large, non-highly available component for the duration of its downtime? In payments, for example, the booking system can be such a large, non-highly available system, and the online balance check can be the critical process that must be bridged.
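
The bridging idea can be sketched in a few lines of Python. Everything named here is hypothetical: `BalanceBridge`, the `get_balance` method of the booking system client and the `BookingSystemUnavailable` exception are assumptions made for illustration, and how reserved amounts are later posted to the booking system is deliberately left out.

```python
from dataclasses import dataclass, field


class BookingSystemUnavailable(Exception):
    """Raised by the (hypothetical) booking system client during downtime."""


@dataclass
class BalanceBridge:
    """Answers balance checks from a snapshot while the booking system is down."""

    booking_system: object  # assumed to offer get_balance(account) -> int (cents)
    snapshot: dict = field(default_factory=dict)  # last known balance per account
    reserved: dict = field(default_factory=dict)  # authorised but not yet booked

    def authorise(self, account: str, amount: int) -> bool:
        try:
            balance = self.booking_system.get_balance(account)
            self.snapshot[account] = balance  # refresh the snapshot while we can
        except BookingSystemUnavailable:
            if account not in self.snapshot:
                return False  # nothing to bridge with, so decline
            balance = self.snapshot[account]
        available = balance - self.reserved.get(account, 0)
        if amount > available:
            return False
        # Remember the amount so that money is not spent twice during downtime.
        self.reserved[account] = self.reserved.get(account, 0) + amount
        return True
```

The point of the design is that only the small bridge has to be highly available; the large booking system behind it can keep its maintenance windows.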

Of course, payments processing as a whole does not work without state: unfortunately, money can only be spent once, so the account balance is a relevant piece of state, and banking software must of course be able to reflect it accurately. In our case, this always leads to the use of databases and the need for persistence before and after each relevant status change. It is the design of the database model that determines whether or not we achieve our goal. Highly available processes should work with stable and/or migration-free data structures. This is the only way to avoid having to shut down critical processes in order to change the database schema.
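
One possible reading of "migration-free" is that new releases only add optional fields and that readers tolerate their absence, so old and new records can coexist without a schema change. The following Python sketch illustrates that reading only; the record layout and the field names (`payment_id`, `status`, `amount_cents`, `channel`) are invented for the example.

```python
import json


def read_payment_state(raw: str) -> dict:
    """Read a stored payment record, tolerating missing optional fields."""
    record = json.loads(raw)
    return {
        # Fields that have existed since the first release are mandatory.
        "payment_id": record["payment_id"],
        "status": record["status"],
        "amount_cents": record["amount_cents"],
        # Hypothetical field added in a later release; older records lack it,
        # so the reader supplies a default instead of requiring a migration.
        "channel": record.get("channel", "unknown"),
    }
```

A record written by an older release and one written by a newer release can then be processed by the same running code, without stopping the critical process in between.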

The remaining topic is robustness. The literature speaks of resilience when disruptions or partial failures of technical systems do not lead to complete failure. In payments, such disruptions can be peaks in the load above agreed limits or surrounding systems that do not respond as quickly as agreed. Downtime of business partner systems and missing acknowledgements of large amounts can also cause failures. In reactive programming, we have found a paradigm that provides the desired robustness by organising processing around data flows. An overload can thus be contained within the affected areas, and nothing stands in the way of uninterrupted operation for the remaining data – in our case payments.
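
The containment of overload can be illustrated without a reactive library. The Python sketch below is not reactive programming itself but a simplified stand-in: one bounded buffer per business partner, so that a slow partner fills only its own buffer. The class names `BoundedLane` and `Router` and the per-partner capacity are assumptions for the example.

```python
import queue


class BoundedLane:
    """Bounded buffer for one downstream partner; excess load is rejected."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, payment) -> bool:
        try:
            self._queue.put_nowait(payment)
            return True
        except queue.Full:
            self.rejected += 1  # only the additional load is turned away
            return False

    def poll(self):
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


class Router:
    """Routes payments into per-partner lanes so overload stays local."""

    def __init__(self, capacity_per_partner: int):
        self._capacity = capacity_per_partner
        self._lanes = {}

    def submit(self, partner: str, payment) -> bool:
        if partner not in self._lanes:
            self._lanes[partner] = BoundedLane(self._capacity)
        return self._lanes[partner].offer(payment)
```

If one partner stops acknowledging, its lane fills up and rejects further payments, while payments for all other partners continue to be accepted.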


Author: Thomas Riedel
