The concept of Site Reliability Engineering (SRE) was first introduced at Google, where it was summed up as follows: "SRE is what happens when you ask a software engineer to design an operations team." Whilst the term is not new, its adoption within financial technology is a more recent development.
The continued trend towards Managed Services and Software-as-a-Service models places an ever-increasing responsibility on technology vendors to provide a high standard of production operations. Deep technical knowledge and detailed monitoring are no longer enough when they remain within their traditional silos. At Vela, we believe the Site Reliability Engineering function is critical to bringing together operational excellence with the latest developments in technology. This makes a tangible difference to our fintech products and platforms whilst also benefiting clients who prefer to deploy and manage software themselves.
It’s well known that trading firms have been under huge pressure to do more with less. This continued pressure on margins and resources means that, now more than ever, selecting the right strategic technology partner is key to filling any void created by a lack of internal expertise, whilst also staying abreast of the latest technology developments.
As an SRE team at Vela we have focused on the following six key areas:
In other words, the right path should be the easiest one. Standardisation and consistency of our deployments are key, and we work on everything from default application configuration to system tuning. Using Ansible and other open-source technologies, we aim to deliver an immediate out-of-the-box solution that is optimised for the environment where it is used. With this model we have reduced the time taken to deploy new releases in production by 60% and reduced our new deployment setup time by 75%.
The speed at which technology advances provides opportunities to find improvements that can be applied to existing setups with minimal change or cost, thereby improving performance, reducing footprint, and increasing lifespan. Identifying, testing, and publicising these opportunities internally, as well as to our clients, allows for an even better return on investment. Over the last six months we have identified changes that have brought a 30% decrease in latency to some applications deployed on Red Hat 7. This reduces the number of cores required to handle the traffic, lowering both footprint and cost.
As SRE teams bridge operations and engineering, they are well positioned to test and develop new concepts. Recent examples of this for us include FPGA-based solutions, overclocked CPU analysis, high-density custom-built chassis, assessing the latest Intel chipset, and defining cloud configurations for processing market data. Ultimately this is about creating short feedback loops and being able to quickly validate design and influence architecture.
We are focused on creating automation and workflows that make it easier and faster to complete changes and upgrades. By developing best practices and incorporating them into tools and workflows the possibility for error is significantly reduced. We are also working on sharing these tools so that clients managing software themselves can benefit from this too, as well as delivering onsite reviews where possible.
Site Reliability Engineering is similar in some respects to an architect’s role in that it requires a depth of experience on the product as well as a wider understanding of the environment such as network, virtualisation, hardware, etc. Working with engineering, sales, and product teams during the design of applications and deployments is therefore a key part of our effort in ensuring clients get the best possible solution based on their requirements and infrastructure.
Increasing data rates, regulation, and a wealth of competition demand that this data not only be available but also drive a whole new level of observability. We believe accuracy and transparency are vital and have invested heavily in clear, honest, consistent monitoring and reporting. During latency testing, for example, we utilise hardware-based frameworks external to our software and include every tick, replicating the way in which many clients measure in their own environments. We can therefore guarantee that our latency numbers are achievable by clients when deployed following our recommendations. In the area of monitoring, we are leveraging open-source technologies such as Prometheus, Loki and Grafana to build consistent monitoring stacks across products. A subset of the data these stacks collect is used for trend analysis and capacity planning.
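To make the idea of trend analysis for capacity planning concrete, here is a minimal sketch in Python. It is an illustration only, not a description of our tooling: the function names, the 80% threshold and the sample values are all hypothetical, and in practice the samples would be exported from the monitoring stack rather than typed in by hand. It fits a straight line to daily peak utilisation and estimates how many days remain before a chosen limit is reached.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of values against their index (0, 1, 2, ...).

    Returns (slope, intercept). Hypothetical helper for illustration.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(values: Sequence[float], threshold: float) -> Optional[float]:
    """Estimate days from the last sample until the fitted trend reaches threshold.

    Returns None when the trend is flat or falling, i.e. no crossing is forecast.
    """
    slope, intercept = fit_linear_trend(values)
    if slope <= 0:
        return None
    last_day = len(values) - 1
    crossing_day = (threshold - intercept) / slope
    # Clamp at zero: the trend line may already be at or above the threshold.
    return max(0.0, crossing_day - last_day)


# Made-up daily peak CPU utilisation (percent), one sample per day.
daily_peak_cpu = [52.0, 54.0, 56.0, 58.0, 60.0]
remaining = days_until_threshold(daily_peak_cpu, threshold=80.0)
print(f"Estimated days until 80% utilisation: {remaining}")
```

A straight-line fit is the simplest possible model and ignores seasonality and bursts in market data volumes, so a forecast like this is best treated as an early warning rather than a precise date.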
The work we are doing in all six areas is about making sure clients get the most out of our solutions, whether it’s from an efficiency or performance perspective or to maximise their returns whilst reducing and removing frictional, operational barriers. Ultimately, we’re guided by a simple principle: It’s in our interest to make sure our clients are as successful as possible.