I have always found it difficult to talk about what reliability is and how to achieve it; sometimes, I have to admit, I was not aware of certain problems or I wasn’t sure how to address them.
That is why I thought it was a good idea to collect the solutions MuleSoft offers to the various reliability problems in one place, giving a single view of the different options.
In this article, I do not intend to give low-level technical details of each solution; it is better to link the related MuleSoft documentation for that. However, I am happy to expand on any of these if required.
Overview
Reliability is defined in Wikipedia as “the quality of being trustworthy or of performing consistently well”.
Reliability is one of the most important non-functional requirements in IT; however, it is often left out of requirements discussions because it is considered:
- not important for the customer (until everything goes to hell)
- not worth discussing, because why should things go wrong?
- too technical, and therefore difficult for business people to understand
- or, sometimes, because the Integration Architect does not emphasize the importance of understanding it
Unfortunately, if these requirements are not agreed upon up front, when issues come (and they always will) it will be difficult to identify them, and it will not be easy for the client team to convince the business to invest money in fixing them without a clear business value.
Reliability Approaches
Requirements
Reliability means no message data loss during processing, whether due to errors or to a stop or crash of the server(s) processing the requests.
Various reliability patterns can be implemented to achieve reliability goals for synchronous and asynchronous flows.
Ways to achieve a Reliable API
Reliability in Mule applications can be achieved using:
Reconnection strategies, see here
- When connecting to a system such as a DB or SFTP server where a connection pool is used, Mule opens these connections at application startup and reuses them as needed
- If for any reason (remote system down or connectivity issues) this pool is not properly populated, or the connections go down at any time, by default Mule keeps running and all the flows using these connections will keep failing
- By using this feature, it is possible to instruct Mule to reconnect and repopulate the pool at defined intervals.
- I always configure a Reconnection strategy, for instance with the (S)FTP or DB connectors (a minimal sketch follows below)
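A minimal sketch of what this could look like on a DB connection; the connection type, host, and property names are placeholders, not from a real project:

```xml
<!-- Hypothetical DB config: retry the connection every 5 seconds, up to 10 times -->
<db:config name="Database_Config">
  <db:my-sql-connection host="db.example.com" port="3306"
                        user="${db.user}" password="${db.password}" database="orders">
    <reconnection>
      <reconnect frequency="5000" count="10"/>
    </reconnection>
  </db:my-sql-connection>
</db:config>
```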
Until Successful scope, see here
- It executes a sequence of Mule processors, retrying up to a defined number of times until everything succeeds.
- It is very useful for HTTP requests where the connection is unstable (see the sketch below).
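A minimal sketch, assuming a hypothetical HTTP request configuration and path:

```xml
<!-- Retry the wrapped request up to 5 times, 2 seconds apart -->
<until-successful maxRetries="5" millisBetweenRetries="2000">
  <http:request method="GET" config-ref="HTTP_Request_config" path="/status"/>
</until-successful>
```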
Redelivery policy, see here
- The Redelivery policy in Mule 4 is similar to the Until Successful scope, but it is always applied to the source of the flow.
- For the developer it is just configuration; in the background, Mule saves the received message in a default cache and increments a counter each time the message is resubmitted after an error occurs
- When using this policy, it is a good practice to implement an error handler for the REDELIVERY_EXHAUSTED exception, as in the sketch below
- The Mule 4 Redelivery policy can be applied to any flow source, but a better practice is to rely on the external system's redelivery when supported (such as for JMS consume operations).
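A minimal sketch, assuming a hypothetical VM configuration, queue name, and processing sub-flow:

```xml
<flow name="process-order-flow">
  <!-- retry the same message up to 5 times before giving up -->
  <vm:listener config-ref="VM_Config" queueName="ordersQueue">
    <redelivery-policy maxRedeliveryCount="5"/>
  </vm:listener>
  <flow-ref name="process-order"/>
  <error-handler>
    <!-- best practice: handle the exhausted case explicitly -->
    <on-error-propagate type="REDELIVERY_EXHAUSTED">
      <logger level="ERROR" message="Redelivery exhausted for message #[attributes]"/>
    </on-error-propagate>
  </error-handler>
</flow>
```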
RETRY_EXHAUSTED, see here
- As per best practice, a handler for this exception should always be implemented
- In Mule 4 it can commonly be thrown by any connector; a minimal handler is sketched below
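A minimal sketch of such a handler (the logger message is only illustrative):

```xml
<error-handler>
  <!-- raised, for example, when a connector runs out of reconnection attempts -->
  <on-error-propagate type="RETRY_EXHAUSTED">
    <logger level="ERROR" message="Retries exhausted: #[error.description]"/>
  </on-error-propagate>
</error-handler>
```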
Transactions (Try Scope), see here
- If a series of steps in a Mule flow must succeed or fail as one unit, a good practice is to use a transaction to demarcate that unit.
- A transaction can start at the source of the flow or can be demarcated via a Try scope.
- If only a single system is involved in the transaction, a Mule local transaction is sufficient to achieve the goal
- If more than one system has to be involved, an XA transaction is recommended: it supports two-phase commit, making sure that either all systems involved commit or all roll back
- The default behaviour in Mule 4 is that, in the happy path, the (XA or local) transaction gets COMMITTED at the end, while in case of errors:
- within an On-Error-Propagate handler the transaction gets ROLLED BACK
- within an On-Error-Continue handler the transaction gets COMMITTED (see the Try scope sketch below)
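A minimal sketch of a local transaction demarcated with a Try scope; the DB configuration and SQL are placeholders:

```xml
<try transactionalAction="ALWAYS_BEGIN" transactionType="LOCAL">
  <db:insert config-ref="Database_Config" transactionalAction="ALWAYS_JOIN">
    <db:sql>INSERT INTO orders (id, status) VALUES (:id, 'NEW')</db:sql>
    <db:input-parameters>#[{ id: payload.orderId }]</db:input-parameters>
  </db:insert>
  <db:update config-ref="Database_Config" transactionalAction="ALWAYS_JOIN">
    <db:sql>UPDATE stock SET qty = qty - 1 WHERE sku = :sku</db:sql>
    <db:input-parameters>#[{ sku: payload.sku }]</db:input-parameters>
  </db:update>
  <error-handler>
    <!-- any error propagates and both statements are rolled back -->
    <on-error-propagate/>
  </error-handler>
</try>
```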
Message persistence might be required when state has to be recovered after application downtime or a crash
- Persistence can be implemented in Mule via VM, JMS, DB, Cache/Object Store, File
- When using JMS/ActiveMQ (besides transaction-based approaches), the manual message ACK approach can be used, as sketched below
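A minimal sketch of the manual ACK approach, assuming a hypothetical JMS configuration, destination, and processing sub-flow:

```xml
<flow name="consume-order-flow">
  <!-- the message stays on the broker until it is explicitly acknowledged -->
  <jms:listener config-ref="JMS_Config" destination="orders" ackMode="MANUAL"/>
  <flow-ref name="process-order"/>
  <!-- acknowledge only after processing has completed successfully -->
  <jms:ack ackId="#[attributes.ackId]"/>
</flow>
```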
Reliability Pattern in async scenarios, see here
- I find it useful in push scenarios triggered by changes to a source system (often used for system synchronisation). Some time ago I would have referred to this as Change Data Capture (CDC), while today these use cases are usually called webhooks.
- The main options for implementing the communication between the reliable acquisition flow and the processing flow are MuleSoft VM or JMS/ActiveMQ; however, some differences have to be noted (a minimal sketch follows at the end of this section):
- Non-Persistent VM: this is the only VM option available in CloudHub 2.0; the queues are in-memory only and therefore cannot really be reliable, since if the Mule server goes down the messages will be lost
- Persistent VM, CloudHub 1.0 only (based on Amazon SQS standard queues, see here and here):
- “at-least-once delivery” is guaranteed; there is a chance the same message could be processed more than once, therefore the flow has to call idempotent operations or use an idempotent filter
- “message ordering” is not supported
- It can be used for this pattern only when the communication is between MuleSoft APIs
- It is recommended to configure a Redelivery policy (see https://docs.mulesoft.com/connectors/vm/vm-reference#listener) and implement the error handler for REDELIVERY_EXHAUSTED so as not to lose track of lost messages
- JMS/ActiveMQ
- “exactly-once delivery” and “message ordering” are supported
- It is recommended to configure a FIFO queue when message ordering is needed, together with a Dead Letter Queue (see here), and to implement some monitoring on the Dead Letter Queue so as not to lose track of lost messages
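To close this section, a minimal sketch of the pattern over a persistent VM queue; all names, paths, and the target sub-flow are placeholders:

```xml
<vm:config name="VM_Config">
  <vm:queues>
    <!-- persistent queue so messages survive an application restart -->
    <vm:queue queueName="changesQueue" queueType="PERSISTENT"/>
  </vm:queues>
</vm:config>

<!-- acquisition flow: accept the webhook call and return quickly -->
<flow name="acquisition-flow">
  <http:listener config-ref="HTTP_Listener_config" path="/webhook"/>
  <vm:publish config-ref="VM_Config" queueName="changesQueue"/>
</flow>

<!-- processing flow: consume from the queue and do the heavy work asynchronously -->
<flow name="processing-flow">
  <vm:listener config-ref="VM_Config" queueName="changesQueue">
    <redelivery-policy maxRedeliveryCount="3"/>
  </vm:listener>
  <flow-ref name="synchronise-target-system"/>
</flow>
```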