When customers move to the cloud, they often assume that copying their existing infrastructure to the cloud is all the effort required. This approach is commonly referred to as Lift-and-Shift and is generally possible. Still, the cloud often presents an opportunity to make things better, or it makes certain aspects of the transition more difficult.

This document shows various concepts, technologies, and architectural decisions that you can encounter during the process of onboarding infrastructure components to the cloud.

A prerequisite for this handbook is the Exoscale Certified Sales Professional training and certification and sufficient knowledge in setting up Linux- or Windows-based systems.

It is also recommended to acquire the skills to put the architectures described here into practice. Practical knowledge of these systems helps you in your work with customers.

Basics

Let’s clarify the basics you always need to keep in mind: no machine has 100% uptime. Disks and power supply units (PSU) break. Even on a larger scale, a data center can suffer a catastrophic failure, rendering all the machines inoperable and data on them potentially lost.

Moving to the cloud does not remove the responsibility to design a redundant, failure-tolerant system or to create a disaster recovery plan. The cloud is not a silver bullet that makes operations a dreamland where you no longer need to worry about something failing.

As a cloud solution architect, you need to account for these possibilities and make sure that your systems are tolerant of all kinds of failures, from a single hypervisor to a whole zone disappearing.

Concepts

Before we start, let’s clarify some concepts that we need later.

Networking

Networking on Exoscale is split into two parts: the public Internet and the private networks.

By default, all instances on Exoscale get a public IPv4 address, and on request, also a public IPv6 address. The traffic over these addresses is subject to firewalling using security groups.

Security groups work in two ways:

The first method is to provide firewall rules based on the source/destination IP address in CIDR (Classless Inter-Domain Routing) notation. Any packet coming in or going out is then evaluated based on its IP address.

The second method is to provide a security group name, disregard the IP address, and only consider the security group for filtering.

Other than this, security groups are conventional, packet-based firewalls, as used in other places.
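
The two matching methods can be illustrated with a small model. The following is a minimal sketch and not Exoscale's implementation: the IngressRule class, the is_allowed function, the example rules, and the "deny unless a rule matches" default are all assumptions made for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class IngressRule:
    """One hypothetical ingress rule; set either cidr or source_group."""
    port: int
    cidr: Optional[str] = None          # method 1: match on the source IP range
    source_group: Optional[str] = None  # method 2: match on the sender's group


def is_allowed(rules: Iterable[IngressRule], port: int,
               source_ip: str, source_groups: Set[str]) -> bool:
    """Return True if any rule admits the packet (assumed default: deny)."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        if rule.port != port:
            continue
        # Method 1: the source address falls inside the CIDR range.
        if rule.cidr is not None and address in ipaddress.ip_network(rule.cidr):
            return True
        # Method 2: the IP address is ignored, only group membership counts.
        if rule.source_group is not None and rule.source_group in source_groups:
            return True
    return False


EXAMPLE_RULES = [
    IngressRule(port=443, cidr="0.0.0.0/0"),     # HTTPS from anywhere
    IngressRule(port=5432, source_group="web"),  # database only from group "web"
]
```

In this model, a packet to port 5432 is admitted only if the sending instance belongs to the "web" group, regardless of its IP address.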

Private Networks, on the other hand, are entirely different. A machine attached to a private network does not get an IP address assigned automatically, and firewall rules are also not applied. If you want to create a firewall, you have to do so yourself inside your instance. Alternatively, you could also create one private network per security zone.

You can connect private networks to on-premises networks with a little extra legwork. If a customer is willing to create a VPN or MPLS tunnel to an Exoscale data center, that tunnel can be piped directly into an Exoscale Private Network using Exoscale Private Connect.

VPN

A VPN, or Virtual Private Network, is an encrypted connection across the Internet that simulates a very long network cable. It is instrumental in connecting two remote locations, such as an office network and an Exoscale Private Network, where the traffic cannot go over the public Internet unencrypted.

There are many solutions for building a VPN, ranging from free and open-source software, such as OpenVPN, to commercial offerings, such as Cisco AnyConnect or Checkpoint VPN, and standardized, cross-platform protocols supported by many software products, such as L2TP, IPsec, or PPTP. Some of these technologies are available as a supported template on Exoscale, and even licenses may be purchased through Exoscale, making the deployment easier.

When moving a customer to Exoscale, VPNs are handy because they let you connect the customer’s office network with an Exoscale private network - ensuring secure communication between the servers located in the cloud and on-premises. It may also be useful if the customer wants to move their Windows server and wants their office workstations to securely connect to that server.

A VPN is also essential when you want to connect private networks in multiple Exoscale zones. A VPN gateway is deployed on a single instance in each zone, and the traffic is routed through the VPN gateway in each zone.

It is important to note that Exoscale does not automatically take care of sending all the traffic via the VPN gateway. Each instance must be configured to route its traffic via the VPN gateway over the private network individually. To ensure no unencrypted traffic is accidentally sent over the Internet, security groups can be deployed, limiting the internet access for each instance.

Deploying Code

When it comes to getting a piece of software into the cloud, developers are progressively moving to a faster development/deployment cycle. After all, Time-To-Market has become the key metric over the past few years.

The development tools used need to adapt to the new fast-paced deployment cycles. CI/CD systems such as Jenkins are a staple among developers. When a developer pushes a piece of code into a code versioning system such as git, the CI/CD system is automatically notified about the change and can immediately run a battery of tests to ensure that the software quality is up to spec. After testing, the CI/CD system can automatically push the code to the production environment.

In recent years, a new trend has also emerged: instead of installing everything on servers directly and then deploying the code, the CI/CD system builds a so-called Docker image, which is then sent to a Docker Registry. The image can then be used to launch a container (lightweight virtual machine). Instead of continually updating a machine, we replace it when we need an update.

For more details on containers, see the Container Architecture Handbook.

CI/CD

CI/CD stands for Continuous Integration and Continuous Delivery. It is most commonly used in software development companies and aims to speed up software releases. Instead of shipping a huge update where many changes happen at once, developers now ship software in tiny, granular increments. The reasoning is that if only a few things change, errors are easier to track down.

Shipping software daily, hourly, or even more frequently requires excellent automation. Software installation by hand on a production system is a waste of time and reduces the number of release cycles.

Software like Jenkins, or online services like Travis CI or CircleCI, make this build and release process much more manageable.

The software developer puts their code into a code versioning system like git, and then the CI system automatically takes the source code and builds the software package. A large variety of automated tests ensures that the shipped version contains as few bugs as possible. These tests are particularly important because the frequency of delivered software packages does not allow for manual testing.

Depending on the rules set up in the CI system, the software package is then, again, automatically delivered to the production environment. This step represents the “D” as in “Delivery” of the “CD” (Continuous Delivery) process.

CI/CD is a crucial part of a “cloud-native” strategy, as there are many components to ship, and without automation, this would not be possible.

Microservices

Another trend that emerged in the past couple of years is the concept of microservices. Microservices mean breaking up a big, complex application into multiple smaller services that talk to each other.

The benefit is that it’s easier to debug each of the smaller services or to scale them under load. In theory, this sounds like a clear win, but in practice it introduces complexity as a tradeoff. Usually, only larger projects move to microservices, as the overhead is too substantial for smaller projects.

Kubernetes is one of the enablers for microservices since it serves as the basis for, e.g., Istio, a so-called service mesh that makes connecting microservices easier.

Windows Domain Controller

When migrating Windows-based systems, especially Windows servers in an office environment, you may come across the term “Windows Domain Controller” or DC.

A DC is a Windows server that holds a copy of the Windows user database and manages the policies that govern all office machines connected to it. It is an easy way to manage a large number of Windows machines centrally.

One of the most critical features of a DC is that it lets users log in to any machine and preserve their working environment (desktop configuration, file access, …).

Moving a DC to the cloud, however, brings some challenges. First of all, the DC’s centralized file storage may cause quite a bit of network traffic if office workers regularly upload and download files from the cloud instead of the local network. Second, since the DC is no longer on the local network, the data has to go over the Internet, so opening files and working with them is slower than before. It is essential to look at what (Internet) connection bandwidth the customer has, and the recommendation is to pick the closest Exoscale zone.

Note that Windows DCs have quite advanced failover functionality, so if one DC fails, clients can automatically switch over to a secondary DC. The recommended configuration is to deploy DCs in a failover setup.

Basic Web Architecture

As a first example, let’s look at a typical architecture for a web service. The most straightforward setup consists of two components: a web server (including the application code) and a database. The application code itself receives the requests from the users and stores the data in the database server.

Schematically it looks like this:

One crucial aspect is that a web server has a public IP address, and using the DNS service, you can point a domain to it (example.com).

This setup is the same both on-premises and in the cloud. Generally, it needs a DNS service plus one or two servers.

Redundancy/Failover

As you can see, this architecture has one major shortcoming: each server role has only one server. If a server breaks, no matter if it’s the webserver or the database server, the service goes down. So, how do we make this service redundant?

Let’s look at the web server first. Let’s change the setup to have two web servers:

In this setup, the DNS service returns the IP addresses of both web servers or of just one of them. In the case of a web server outage, a switchover is triggered. The pitfall in this scenario is that the switchover takes effect slowly because the connected clients cache DNS responses. To speed up this process, we have to set a low TTL on our DNS records.

Alternatively, we could also use an Elastic IP, which provides the functionality of switching addresses between servers with immediate effect and the ability for the running server to continue serving traffic.

Both solutions can be automated using the Exoscale APIs, but with Elastic IPs, the traffic flows only over a single server while the other is idle.
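
The effect of the TTL on a DNS-based switchover can be estimated with simple arithmetic. The following is a rough sketch under stated assumptions: the health check model (a fixed check interval and a number of failed checks before a server counts as down) and the function name are hypothetical, and the time needed to update the DNS record itself is ignored.

```python
def worst_case_dns_failover_seconds(check_interval_s: float,
                                    failed_checks_required: int,
                                    dns_ttl_s: float) -> float:
    """Upper bound on how long a client may keep hitting a dead server.

    Detection takes up to check_interval_s * failed_checks_required seconds.
    After the record is changed, a client that cached the old answer just
    before the change may keep using it for one full TTL.
    """
    detection_s = check_interval_s * failed_checks_required
    return detection_s + dns_ttl_s


# With a one-hour TTL, the DNS cache dominates the outage time.
slow = worst_case_dns_failover_seconds(10, 3, 3600)
# With a 60-second TTL, the same failure is bridged much faster.
fast = worst_case_dns_failover_seconds(10, 3, 60)
```

With an Elastic IP, the TTL term disappears because the address the clients use does not change. Only the detection time and the time to move the address remain.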

Load Balancing

While redundancy takes care of outages by failover options, distributing heavy loads to utilize all servers in operation is also desirable. That can be done using a load balancer.

Load balancers are available in many flavors; in general, they are installed as a software package inside of an Exoscale instance. The load balancer’s configuration ensures that web traffic is distributed evenly to all running (backend) web servers.
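
To illustrate what even distribution to running backends means, here is a minimal round-robin sketch. It is not the algorithm of any particular load balancer product: the RoundRobinBalancer class and the server names are made up for this example, and real load balancers determine backend health with their own health checks.

```python
from typing import List


class RoundRobinBalancer:
    """Hand out backends in turn, skipping those marked as down."""

    def __init__(self, backends: List[str]) -> None:
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0  # index of the backend to try next

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        """Return the next healthy backend, or raise if none is left."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


balancer = RoundRobinBalancer(["web1", "web2"])
first_four = [balancer.pick() for _ in range(4)]  # alternates web1, web2
balancer.mark_down("web1")
after_outage = [balancer.pick() for _ in range(2)]  # only web2 remains
```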

Load balancers need to be made redundant, too. The above failover configuration with Elastic IPs can be used for this.

However, two potential issues may prevent scaling web servers out.

The Session Problem

Web servers running web applications typically want to store a small amount of information about a user currently using the site. The stored data could be anything from the contents of their shopping cart to their login credentials. This data is commonly referred to as a session.

Usually, storing the session happens on the webserver that runs the web application. However, if multiple servers are in operation, the session data must be synchronized.

Well-written web applications give you the ability to store session data in a database or contain built-in session synchronization. Such is the case, for example, with the well-known Java web server Tomcat.
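
The idea of keeping sessions outside the individual web server can be sketched as follows. This is an illustration only: SharedSessionStore stands in for a database or cache that every web server can reach, and all class and method names are assumptions, not the API of Tomcat or any other product.

```python
import copy
from typing import Any, Dict, List


class SharedSessionStore:
    """Stand-in for a database or cache reachable by every web server."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Any]] = {}

    def load(self, session_id: str) -> Dict[str, Any]:
        # Return a copy so callers must save changes explicitly.
        return copy.deepcopy(self._data.get(session_id, {}))

    def save(self, session_id: str, session: Dict[str, Any]) -> None:
        self._data[session_id] = copy.deepcopy(session)


class WebServer:
    """A web server that keeps no session state of its own."""

    def __init__(self, name: str, store: SharedSessionStore) -> None:
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> None:
        session = self.store.load(session_id)
        session.setdefault("cart", []).append(item)
        self.store.save(session_id, session)

    def cart(self, session_id: str) -> List[str]:
        return self.store.load(session_id).get("cart", [])


store = SharedSessionStore()
web1 = WebServer("web1", store)
web2 = WebServer("web2", store)
web1.add_to_cart("session-1", "book")  # request handled by web1
seen_by_web2 = web2.cart("session-1")  # next request lands on web2
```

Because both servers read and write the same store, it does not matter which server the load balancer picks for the next request.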

The File Storage Problem

Another problem that prevents scaling occurs when the web application stores files (e.g., uploaded images) directly on the local disk. Well-written web applications can use a different storage backend, such as a database or object storage. Since the object storage contains a web server, it can also serve files directly to the user.

Database Redundancy

As you can see, so far, we have only used a single database. Database redundancy is discussed in detail in the Database Architecture Handbook.

Queue Processing Architecture

In this example, we take a look at a basic queue processing architecture, specifically a video converting application. Large video files are deposited on Exoscale Object Storage, and an entry is made in a queue server. Queue processors running on Exoscale instances fetch tasks from the queue server, download the source video files from the object storage, process them, and upload the results to a different bucket.

This setup, at its core, is an ideal workload for the cloud. Queue processors run as needed: if no tasks are present, servers can be shut down, and under high workloads, multiple servers share the work. If a queue processor crashes, no information is lost.

However, there is a problem: the queue server holds information about all the jobs that need processing. If this instance crashes, the information about the jobs is lost. So the queue server needs to be made redundant to handle failures.

You can imagine queue servers as something that holds messages that need to be delivered. The queue processors read the messages. When they have processed the corresponding video, they mark the message as “delivered,” so the queue server does not try to give it to a different queue processor.

In general, there are three types of redundant queues. All types have to do with how many times a message is delivered. For a bit of context, imagine that a queue processor has a bug and never finishes processing a video.

In this case, the queue server delivers the video job once to one queue processor. The queue processor gets stuck and never gets back to the queue server, so the job remains marked as pending. After a while, say, 30 minutes, you have a choice: do you try again, or do you mark the job as failed?

At-most-once

With this type of queue server, “at-most-once” message delivery is ensured: a message may not be delivered at all, but it is guaranteed that the message is never delivered twice. The best use case for this queue type is where a double delivery would cause massive problems.

At-least-once

With this type of queue server, “at-least-once” message delivery is ensured: a message is always delivered, but it can be delivered more than once. The best use case for this queue type is where a double delivery causes no problem. It is the right choice for the video processing case because double processing a video is not an issue.
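
The difference between the two delivery guarantees described so far can be shown with a toy queue. This is a minimal in-memory sketch and not the behavior of any specific queue server: the InMemoryQueue class, the redelivery timeout, and the use of a caller-supplied clock value (in minutes) are assumptions for illustration. Messages must be hashable, for example strings.

```python
from typing import Dict, List, Optional


class InMemoryQueue:
    """Toy queue supporting at-most-once or at-least-once delivery."""

    def __init__(self, mode: str, redelivery_timeout: int = 30) -> None:
        if mode not in ("at-most-once", "at-least-once"):
            raise ValueError("unknown delivery mode: " + mode)
        self.mode = mode
        self.redelivery_timeout = redelivery_timeout
        self._pending: List[str] = []        # waiting for delivery (FIFO)
        self._in_flight: Dict[str, int] = {}  # delivered, not yet acknowledged

    def put(self, message: str) -> None:
        self._pending.append(message)

    def receive(self, now: int) -> Optional[str]:
        """Hand out the next message, or None if nothing is available."""
        # At-least-once: unacknowledged messages whose deadline has passed
        # go back into the queue and will be delivered again.
        for message, deadline in list(self._in_flight.items()):
            if now >= deadline:
                del self._in_flight[message]
                self._pending.append(message)
        if not self._pending:
            return None
        message = self._pending.pop(0)
        if self.mode == "at-least-once":
            self._in_flight[message] = now + self.redelivery_timeout
        # At-most-once: the message is forgotten as soon as it is handed out.
        return message

    def ack(self, message: str) -> None:
        """Mark a message as processed so it is never redelivered."""
        self._in_flight.pop(message, None)
```

In at-least-once mode, a job taken by a stuck queue processor is handed out again once the timeout expires. In at-most-once mode, the same job would simply be lost.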

Exactly-once

With this type of queue server, “exactly-once” message delivery is ensured: a message is always delivered exactly once, no more and no less. However, this comes with some caveats. First, this queue type is the slowest due to the amount of synchronization needed. Second, the application processing the data can only work with databases that support two-phase commits (XA transactions), and the processing application needs to be aware of these. A good example of a system like this is an integration with the Java Message Service (JMS).

Storage Systems

Applications need to store data somehow, and there are several possibilities. It is essential to pick the right storage solution for the task.

Direct-attached Storage

The default supported model on Exoscale is direct-attached storage, which means that a physical hard drive is directly attached to the virtual machine that is using it. The main benefit of such storage is the excellent speed the instance gets. The drawback is that this type of storage cannot be shared between multiple instances unless it uses some high-level protocol.

Network Block Storage

The sharing model of network block storage is still limited to a single instance. Nevertheless, if this instance goes down, this type of storage can be easily attached to a different instance because the storage is attached over the network. The benefit is that this storage survives the death of an instance. The drawback is that it can still only be used by one instance at any given time. Performance-wise, it does not match direct-attached storage because the data is sent over the network. Exoscale does not offer Network Block Storage as a service at the time of writing, but customers can use solutions like Ceph RBD or OpenEBS to provide such a storage solution.

NFS

Network File System (NFS) lets multiple instances use the files stored on a single instance. NFS itself does not provide redundancy. In addition, NFS performs worse than any of the above solutions when reading a list of files in a directory.

GlusterFS

GlusterFS is a cluster filesystem that redundantly stores files across multiple instances, and lets applications on multiple instances use the files simultaneously. However, the drawback is that due to the lack of a dedicated metadata server, GlusterFS can be slow when it comes to directory listings.

CephFS

CephFS is a Ceph option that uses a dedicated metadata server to provide a shared filesystem, which redundantly stores files and lets multiple instances use them. It offers high-performance file access at the cost of having to operate a large metadata server instance and of hugely complex operations. It is not recommended to run CephFS unless a dedicated team is available to run it.

Object Storage

An alternative to filesystems and block storage is to store data in object storage. However, object storage requires, for the most part, that applications running on instances “know” how to deal with it. The reason is that object storage does not provide all the mechanisms a filesystem does. For example, on object storage, the application cannot “lock” a file against modifications while it carries out a critical operation. Therefore, object storage is not the right solution for databases. Exoscale provides object storage, but the customer can also use their own object storage solution, such as Ceph Object Storage.

Backup & Disaster Recovery

Backup and disaster recovery are vital for any infrastructure. How the backups are taken, however, matters.

Usually, when people move to the cloud for the first time, the natural thought that comes to mind is that a snapshot of the entire machine is the right way to take backups. Yes, you can do that, and, yes, Exoscale provides snapshot functionality, but a backup approach like this also comes with severe constraints.

If you take a snapshot of a running machine, you cannot ensure the backup’s consistency. A consistent snapshot backup requires shutting down the machine.

A better way to create backups is on the application level. For example, if you desire a backup of a database, the backup tool of that database can be used in conjunction with the backup software agent. This way, the backups stay consistent and can be imported without a consistency problem.
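
As a small illustration of an application-level backup, the following sketch uses SQLite as a stand-in database, because Python's standard library exposes its online backup facility directly. The choice of SQLite, the function name, and the file paths are assumptions for this example. For other databases, the database's own backup or dump tool plays the same role.

```python
import sqlite3


def backup_database(source_path: str, backup_path: str) -> None:
    """Copy a live SQLite database using the database's own backup API.

    Unlike copying the file or snapshotting the disk underneath it,
    the backup API produces a consistent copy even while the source
    database is in use.
    """
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)  # available since Python 3.7
    finally:
        target.close()
        source.close()
```

A call such as backup_database("live.db", "backup.db") (hypothetical file names) yields a file that can be opened like any other database, and the backup software agent can then pick it up.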

The drawback is that if only the data is backed up, the machine configuration needs to be restored by another method. The best practice for this is “Infrastructure as Code” (see below).

Infrastructure as Code

Imagine a complex infrastructure. It took hundreds, if not thousands of clicks to configure the cloud account to launch the virtual machines and install the software in these virtual machines.

If the colleague who configured these leaves the company, all the knowledge is lost. Documentation is usually not kept up to date and is also lacking focus because everything else is more important.

It is also worth considering how recovery happens if the cloud account is compromised. You would have to reinstall all the machines from scratch, by hand, since you cannot be sure the attacker did not gain access to them.

That is where Infrastructure as Code comes in to help. Let’s look at an example with a tool called Terraform:

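resource "exoscale_compute" "test" {
  display_name = "test.example.com"
  template     = "Linux Ubuntu 18.04 64-bit"
  # Template placeholder: replace with the Exoscale zone to deploy into.
  zone         = "{{ default_zone }}"
  size         = "Medium"
  disk_size    = 50
  # Name of the SSH key pair to install on the instance (left empty here).
  key_pair     = ""
}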

The code snippet above, if executed, creates an instance on Exoscale with the defined parameters. It represents only a basic example; writing more complex solutions is, of course, possible.

Usually, a code versioning system such as “git” is used for storing these code snippets. If a code creator or maintainer leaves the company, a new colleague can easily pick up the infrastructure as code definitions where the other has left off.

Using tools like Terraform, you can automate the creation of instances and install software inside the instances. In the creation process, the user-data field offers you the ability to inject a program into the instance, which runs the first time the instance starts.

For Ubuntu Linux, this is processed using cloud-init.

Windows images currently have no facilities to process user data, so any script entered is ignored. Instead, WinRM can be used to configure instances automatically using PowerShell/PowerShell DSC.

With these tools used properly, only the data needs to be backed up, while infrastructure as code takes care of reinstalling the machines.

Since the whole structure’s description is in code, it is relatively easy to create a replica of the live system for testing purposes. Since Exoscale charges by the second, the testing platform can be spun up as needed, for only as much time as required for testing. This way, developers can showcase new versions without a fixed commitment to testing environments.

Cloud Challenges

The Special Snowflake Issue

The Special Snowflake Issue refers to a setup where the customer has installed a single server by hand in their old installation. This setup has neither a redundant pair nor documentation of how it is set up. It is also most often true that if this kind of server goes down, bad things happen. Quite often, these servers have pet names, too.

It is, in general, a bad idea to treat servers like pets. Treat your servers like cattle. They should be replaceable at any time, and they should not be unique. Take great care when moving such a server to the cloud. Often these servers have many skeletons in the closet: finding the right configuration files to implement changes can be difficult, and even something as simple as an IP address change may break everything.

Under no circumstances should such servers be migrated in a limited window of announced maintenance time with no rollback scenario. The maintenance window is almost sure to be exceeded. Instead, the setup should be tested in the new constellation as much as possible before attempting a failover. If the new configuration is not adequate, move the production operations back to the original server as soon as possible.

The Large Disk Issue

Another concern that usually plagues lift-and-shift projects is the issue of overly large disks. Customers with on-premises systems like VMware have frequently bought massive, Fibre Channel-attached storage systems that allow the customer to create massive disks.

The problem is not making this amount of storage available, but accounting for the transfer time of backups and snapshots over the network in your performance and availability parameters. We are dealing here with a critical balance between storage amount and performance assurances. There are special-purpose instances available to support larger disk sizes for dedicated workloads, but these instances have limitations compared to the standard instances on Exoscale.

That is why moving large-disk instances to the cloud is extremely tricky and has to be considered carefully. Most often, the best decision is to leave them in the on-premises data center.
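
A quick estimate shows why transfer time matters for large disks. The following is a back-of-the-envelope sketch: the function name and the 80% link efficiency default are assumptions, and real transfers are also affected by disk speed, protocol overhead, and concurrent traffic.

```python
def transfer_time_hours(size_gb: float, throughput_mbit_s: float,
                        efficiency: float = 0.8) -> float:
    """Estimate how long it takes to move size_gb over the network.

    size_gb uses decimal units (1 GB = 1000 MB = 8000 Mbit).
    efficiency is the assumed usable fraction of the nominal link speed.
    """
    if throughput_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("throughput must be positive, efficiency in (0, 1]")
    size_mbit = size_gb * 8 * 1000
    seconds = size_mbit / (throughput_mbit_s * efficiency)
    return seconds / 3600


# A 10 TB disk over a 1 Gbit/s link at 80% efficiency:
ten_tb_hours = transfer_time_hours(10_000, 1_000)
```

Under these assumptions, a single 10 TB disk takes roughly 28 hours to transfer, which has to fit into the backup window and the recovery time the customer expects.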

The Backup Issue

In on-premises scenarios, backups are often solved under the VM layer by taking backups as snapshots of the running virtual machine. As discussed above, this may not be a great idea if you want consistent backups. Nevertheless, some customers request it when moving to the cloud.

The only out-of-the-box setup Exoscale offers at this time is VM snapshots, but these cannot be considered full backups. It is possible, however, to solve the problem with third-party software.

The Network Throughput Issue

Guaranteeing network throughput in an on-premises setting is manageable. Solutions that require guaranteed throughput, most notably storage systems such as Ceph, need dedicated bandwidth when they run in the cloud. Supporting this scenario with typical VM sizes requires dedicated hypervisors providing a sizable amount of bandwidth, and the costs associated with this setup are prohibitive for smaller customers.

Network throughput can also become an issue when customers move their on-premises systems, such as a Windows Domain Controller or file server, to the cloud. If their Internet line is not fast enough, the colleagues can no longer access their tools with adequate speed.

The Distance Issue

Apart from network throughput, network latency can also become an issue. Usually, but not always, this is connected to physical distance. The information needs to travel over the physical wire or fiber optic cable, so even at the speed of light, it takes a non-zero time to move across the network. If the data center is far away from where the data is used, data access can be slow. In the case of real-time applications such as teleconferencing, online gaming, collaborative editing, remote desktop, and many more, this has a measurable impact on service quality.

With services that require many round trips over the network to complete single actions, this is especially bad. Therefore, application and database servers should not be hosted far apart or with different providers.
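
The effect of distance and round trips can be estimated from the propagation delay alone. The following sketch gives a lower bound only: it assumes that signals in fiber travel at roughly two thirds of the speed of light in a vacuum and ignores routing detours, switching, and processing time. The function names and example figures are illustrative.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBER_FACTOR = 2 / 3  # assumption: signal speed in fiber relative to vacuum


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound for one network round trip over the given distance."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


def min_action_time_ms(distance_km: float, round_trips: int) -> float:
    """Lower bound for an action that needs several sequential round trips."""
    return round_trips * min_round_trip_ms(distance_km)


# One round trip over 1000 km costs at least about 10 ms ...
single = min_round_trip_ms(1000)
# ... so a page that issues 50 sequential database queries to a server
# 1000 km away spends at least about half a second just waiting.
page = min_action_time_ms(1000, 50)
```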

The License Issue

Some software vendors have decided to issue software licenses that may not be compatible with the cloud. Others choose to allow their software to run only on certified cloud or virtualization providers. In this case, it may not be possible to run the software on Exoscale. A hybrid cloud solution can address this issue: one part of the software/application in question runs on a physical server, and the other part connects to the cloud architecture on Exoscale.