Multi-service Approach #16748
It looks interesting to most people; however, it has been proposed multiple times before - some people have even supposedly worked on such a feature, but nothing ever came of it because it takes a lot of effort to overhaul the core to support it. The idea of having dungeons and raids on a separate system, maybe even having different continents on different servers, is a concept that would be great if implemented properly. Letting the dungeon (or instance) server crash without people questing in Outland or Kalimdor suffering for it sounds great, doesn't it? Obviously something like this is required either way, or cross-realm wouldn't be possible.

Having multiple worldservers running just so one can take over if another crashes doesn't bother me, at least not more than splitting the load between different processes for different areas, possibly running on separate systems. I know this kind of thing is good, don't get me wrong.

I would say I'd like to contribute, but I can't for two simple reasons. Go? I don't even know what that is. Docker? I'm sorry to say I'm not touching that. I just wanted to give my two cents on the matter, really. I do hope you get to where you want to go with it, not only for you but for the good of others.
Hi there! I'm the author of the ToCloud9 project, and I would like to discuss with you the multi-service/clustering/sharding/layering/distributed system approach in WoW emulation, specifically focusing on AzerothCore.
Firstly, I'd like to describe the current approach that AzerothCore and most emulators are using.

In the simplified diagram, you can see that after realm selection, all game clients are connected to a single instance of the worldserver, which handles all game logic.
This approach has its advantages and disadvantages. On the one hand, it makes development easier and faster, and it also simplifies the setup and maintenance of the infrastructure for AzerothCore users.
However, adopting a distributed architecture opens up new possibilities.
1. Horizontal Scalability
Nowadays, we have access to incredibly powerful hardware. With high-performance CPUs and optimizations, it is possible to handle between 4,000 and 10,000 simultaneous connections, which is a remarkable achievement. However, wouldn't it be even better if we could handle 20,000 or more connections? I believe that with a distributed architecture, this is indeed possible.
Let's consider an example. Imagine you are a server owner with an average online player count of 4,000. However, you have a highly anticipated event coming up, such as the opening of the Dark Portal, and you estimate a maximum online count of 8,000. Several weeks before the event, you invest in new hardware capable of handling that load and spend time migrating to it. Finally, everything is set for the event. However, the event turns out to be even more popular than expected, attracting 10,000 players who want to join your realm on that day. You enable a queue, and 2,000 players are left disappointed, spending the entire evening waiting. After the event, your average online count returns to around 4,500. Now, you face a decision: should you order cheaper hardware and migrate to it, or should you continue using the expensive hardware you purchased for the event? Although the more cost-effective option would be to switch to cheaper hardware, server owners often choose to stick with the more powerful hardware, anticipating future spontaneous growth in online player count or similar events. This decision may result in higher expenses but is deemed worthwhile.
Now, let's imagine how this situation could be handled with a scalable distributed system. Suppose you are running your server in a cloud provider that offers pay-as-you-go options, billing you on an hourly basis. You have a cluster of, let's say, three average VPS instances. For the event, you add three more instances to your cluster. Using an orchestration tool like Kubernetes, you can easily scale your system components, including the worldserver. As the online count reaches its peak during the event, you add two more instances to handle the load effectively, enabling a smooth gameplay experience for all 10,000 players without the need for a queue. Once the event is over, you scale down your cluster back to three or four instances, allowing you to pay only for the few additional hours during which the extra instances were utilized. This flexible approach ensures that you pay for the actual resources consumed, providing cost efficiency.
2. Availability
With a distributed architecture, availability can be significantly improved. For example, if the worldserver responsible for one set of maps crashes, only the players on those maps are affected; everyone else keeps playing, and the failed instance can be restarted or replaced independently of the rest of the cluster.
3. Crossrealms
I have already seen some requests in the AzerothCore community for cross-realm battlegrounds. If you design a distributed system with multi-realm support, then your system is roughly 80% ready for cross-realm functionality.
4. Composability
This option is more relevant to a microservice architecture. In a microservice architecture, you would divide the logical components into different microservices. Communication between these services requires exposing APIs for most of the services.
To illustrate composability, let's consider the example of an auction house service. You have implemented an auction house service that handles in-game auctions. However, you then decide that it would be beneficial to provide players with the opportunity to interact with the auction house from a website. If you have designed the API for the auction house service with extensibility in mind, you can likely reuse it without making any changes. You would simply need to create a new service, such as a gateway service, to sit in front of the auction house service.
This gateway service would handle requests from the website and communicate with the auction house service using the existing API. It acts as an interface between the website and the auction house service, allowing players to interact with the auction house seamlessly from both the game client and the website.
By designing services with composability in mind and leveraging a microservice architecture, you can achieve reusability and flexibility in implementing new features and integrations.
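To make the composability point a bit more concrete, here is a minimal sketch of such a gateway in Go. Everything in it is an assumption for illustration: the AuctionClient interface, its ListAuctions method, and the JSON shape stand in for whatever API a real auction house service would expose; this is not ToCloud9 code.

```go
// Hypothetical sketch: a tiny HTTP gateway sitting in front of an auction
// house microservice. The AuctionClient interface stands in for a generated
// gRPC client; all names are illustrative only.
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
)

// Auction is a minimal view of an in-game auction exposed to the website.
type Auction struct {
	ID         uint64 `json:"id"`
	ItemID     uint32 `json:"itemId"`
	BuyoutGold uint32 `json:"buyoutGold"`
}

// AuctionClient stands in for the generated gRPC client of the auction house service.
type AuctionClient interface {
	ListAuctions(ctx context.Context, realmID uint32) ([]Auction, error)
}

// gatewayHandler translates website HTTP requests into service calls.
type gatewayHandler struct {
	auctions AuctionClient
}

func (g *gatewayHandler) listAuctions(w http.ResponseWriter, r *http.Request) {
	auctions, err := g.auctions.ListAuctions(r.Context(), 1 /* realm ID */)
	if err != nil {
		http.Error(w, "auction house unavailable", http.StatusBadGateway)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(auctions)
}

// fakeAuctionClient is a stand-in implementation so the sketch runs on its own.
type fakeAuctionClient struct{}

func (fakeAuctionClient) ListAuctions(_ context.Context, _ uint32) ([]Auction, error) {
	return []Auction{{ID: 1, ItemID: 2589, BuyoutGold: 5}}, nil
}

func main() {
	h := &gatewayHandler{auctions: fakeAuctionClient{}}
	http.HandleFunc("/api/auctions", h.listAuctions)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The point is that the website-facing gateway only talks to the auction house service through its API, so the same service keeps serving the game client unchanged.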
Sounds interesting?
If this sounds interesting to you, then I have both good and bad news. The good news is that there are some prototypes/PoCs that attempt to implement it. The bad news is that I'm not aware of any public and complete ones.
But I would like to promote the project I'm currently working on - ToCloud9. My goal is to bring this project to a complete state. The purpose of this post is to seek assistance with its development, but I'll elaborate on that later.
What is this ToCloud9 project about?
The main goal of this project is to make TrinityCore, AzerothCore, and their forks distributed and cloud-native with minimal changes to the core itself. As a result, we can take advantage of the possibilities mentioned at the beginning.
I have chosen to adopt a microservice architecture and have built a set of microservices using the Go language. I opted for Go because it is well-suited for microservices and accelerates development in my specific case. However, since ToCloud9 follows a microservice architecture, each microservice can be written or rewritten in nearly any programming language. The key requirement is that a new microservice must comply with the predefined API protocol and be capable of being containerized.
There is a small demo that demonstrates some capabilities of this project.
ToCloud9 Architecture
In this section, I would like to discuss the architectural pillars of ToCloud9. Let's begin with the approach to distributing the load between worldservers.
To achieve this, we can introduce a server that sits between the game clients (players) and the worldservers. This server needs to be intelligent enough to understand the WoW protocol and to be able to switch players from one worldserver to another. For now, let's refer to this server as the "Proxy" (although I will introduce a better name later). For each game client, this "Proxy" establishes a new connection to one of the worldservers. Here is a diagram illustrating this concept.

Now, let's delve into the details of how this newly introduced server can facilitate a switch from one worldserver to another. The most straightforward approach is to divide the worldservers based on maps. For instance, we can utilize "worldserver1" for Kalimdor and Eastern Kingdoms, while assigning "worldserver2" for the remaining maps. This simplified approach is what ToCloud9 currently employs, although it would be nice to have the capability to divide worldservers based on areas/zones in the future.
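As a minimal sketch of this static split (my own illustration, not ToCloud9 code), the routing decision can be as simple as a lookup table keyed by map ID; 0 and 1 are the Eastern Kingdoms and Kalimdor map IDs, while the addresses are placeholders:

```go
// Illustrative only: a static map→worldserver routing table matching the
// split described above.
package routing

// worldserverByMap assigns the "classic" continents to worldserver1 and
// leaves everything else to a default worldserver.
var worldserverByMap = map[uint32]string{
	0: "worldserver1:8085", // Eastern Kingdoms
	1: "worldserver1:8085", // Kalimdor
}

const defaultWorldserver = "worldserver2:8085"

// WorldserverForMap returns the address of the worldserver responsible for
// the given map ID.
func WorldserverForMap(mapID uint32) string {
	if addr, ok := worldserverByMap[mapID]; ok {
		return addr
	}
	return defaultWorldserver
}
```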

We have decided to distribute maps between worldservers, but we need to determine when and how the switch actually happens. As mentioned before, the new server needs to understand the WoW protocol. It can intercept certain packets and handle them in a special way, since it knows what the player and the worldserver want to send to each other. To trigger the switch from one worldserver to another, the "Proxy" server should intercept and handle the SMsgNewWorld/MsgMoveWorldPortAck opcodes. These opcodes inform the client that the player is being teleported to a new map.
To explain how ToCloud9 handles this opcode, I need to introduce a new component called the servers-registry. When the “Proxy” server needs to decide which worldserver to use for a given map, it sends a request to the servers-registry server. The servers-registry is a gRPC server that is aware of all available worldservers. To be visible to the servers-registry, a worldserver needs to make a gRPC call to the servers-registry (using libsidecar) with a list of maps that it can theoretically handle. The servers-registry then performs healthcheck requests to the worldserver to keep an up-to-date list of healthy servers. With knowledge of the worldservers and the maps they can handle, the servers-registry can dynamically distribute all the maps between them.
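The actual servers-registry API is defined in the ToCloud9 repository with gRPC/protobuf; the Go interface below is only my approximation of the responsibilities described above (registration with a list of supported maps, health tracking, and map assignment), with illustrative names:

```go
// Approximation of the servers-registry responsibilities; the real ToCloud9
// API is defined with gRPC/protobuf, and every name here is illustrative.
package registry

import (
	"context"
	"time"
)

// WorldServerInfo is what a worldserver reports when it registers
// (ToCloud9 does this through libsidecar).
type WorldServerInfo struct {
	Address       string   // host:port the gateway should connect to
	SupportedMaps []uint32 // maps this worldserver can theoretically handle
	RegisteredAt  time.Time
}

// ServersRegistry keeps an up-to-date view of healthy worldservers and
// decides which one owns which map.
type ServersRegistry interface {
	// RegisterWorldServer is called by a worldserver on startup.
	RegisterWorldServer(ctx context.Context, info WorldServerInfo) error

	// WorldServerForMap is what the "Proxy" calls when it intercepts
	// SMsgNewWorld/MsgMoveWorldPortAck and needs a target worldserver.
	WorldServerForMap(ctx context.Context, mapID uint32) (WorldServerInfo, error)

	// HealthyWorldServers returns the servers that passed the latest healthcheck.
	HealthyWorldServers(ctx context.Context) ([]WorldServerInfo, error)
}
```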
Let's summarise this with diagrams.


Up to this point, the diagrams have shown only one "Proxy" server. However, a single instance would be unscalable and a single point of failure. Because of this, ToCloud9 supports scaling the "Proxy" server as well. Now we need to address the question of how to distribute players among the "Proxy" servers.
The answer to this question can be found in the auth and servers-registry servers.
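As a rough sketch of one plausible strategy (an assumption on my part, not necessarily the exact ToCloud9 behaviour), the authserver could ask the servers-registry for the healthy "Proxy" instance with the fewest active connections and return its address to the client after login:

```go
// Rough sketch, not necessarily the exact ToCloud9 flow: after successful
// authentication, pick the least-loaded healthy gateway ("Proxy") for the player.
package authflow

import (
	"context"
	"errors"
)

// Gateway describes one running "Proxy"/gateway instance.
type Gateway struct {
	Address           string
	ActiveConnections int
}

// GatewayRegistry is the slice of the servers-registry API the authserver needs.
type GatewayRegistry interface {
	HealthyGateways(ctx context.Context) ([]Gateway, error)
}

// PickGatewayForLogin chooses the least-loaded healthy gateway; its address is
// then handed to the game client so it connects there instead of to a worldserver.
func PickGatewayForLogin(ctx context.Context, reg GatewayRegistry) (Gateway, error) {
	gws, err := reg.HealthyGateways(ctx)
	if err != nil {
		return Gateway{}, err
	}
	if len(gws) == 0 {
		return Gateway{}, errors.New("no healthy gateways available")
	}
	best := gws[0]
	for _, gw := range gws[1:] {
		if gw.ActiveConnections < best.ActiveConnections {
			best = gw
		}
	}
	return best, nil
}
```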
Let’s describe this in the diagram.

At this point, all the load is distributed. However, it is far from being complete. If you were to log into the game with the described architecture, you would notice that some functionality is broken. For instance, you wouldn't be able to whisper to a player who is on another worldserver. Additionally, your guild tab would only display online players who are on the same worldserver as you. To address this issue, ToCloud9 moves such functionalities from the worldserver to new microservices.
Let's take the guild functionality as an example. There is a separate microservice dedicated to guilds, which exposes a gRPC API for guild operations.
The most important client of this API is our "Proxy" server, which plays a crucial role in handling guild-related packets from the game client. When the "Proxy" server receives a guild-related packet, it reads the packet's content and generates a gRPC call to the guilds microservice. Upon receiving a response from the guilds microservice, the "Proxy" server generates a WoW packet based on this response and sends it back to the game client. Notably, there is no interaction with the worldserver involved in this functionality.
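The real guild service definitions live in the ToCloud9 repository; the sketch below only illustrates the flow just described, with made-up names for the gRPC client, its method, and the opcode constant: the gateway reads a guild roster request, calls the guilds service, and builds the reply packet itself, never touching the worldserver.

```go
// Sketch of the flow described above; the GuildServiceClient interface, its
// method, and the opcode value are made up for illustration.
package gateway

import (
	"context"
	"fmt"
)

// GuildMember is a trimmed-down roster entry.
type GuildMember struct {
	Name   string
	Level  uint8
	Online bool
}

// GuildServiceClient stands in for the generated gRPC client of the guilds microservice.
type GuildServiceClient interface {
	GetRoster(ctx context.Context, guildID uint64) ([]GuildMember, error)
}

// Packet is a minimal abstraction over a WoW packet: an opcode plus payload.
type Packet struct {
	Opcode  uint16
	Payload []byte
}

// HandleGuildRosterRequest is what the gateway would run when the client asks
// for its guild roster. Note that the worldserver is never involved.
func HandleGuildRosterRequest(ctx context.Context, guilds GuildServiceClient, guildID uint64) (Packet, error) {
	roster, err := guilds.GetRoster(ctx, guildID)
	if err != nil {
		return Packet{}, fmt.Errorf("guilds service: %w", err)
	}

	// Serialize the roster into the wire format the client expects.
	// Real code would follow the exact SMSG layout; this is a placeholder.
	payload := make([]byte, 0, len(roster)*16)
	for _, m := range roster {
		payload = append(payload, []byte(m.Name)...)
		payload = append(payload, 0, m.Level, boolToByte(m.Online))
	}

	const smsgGuildRoster = 0x008A // illustrative opcode value
	return Packet{Opcode: smsgGuildRoster, Payload: payload}, nil
}

func boolToByte(b bool) byte {
	if b {
		return 1
	}
	return 0
}
```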
Since we have reached this point, it is necessary to clarify the naming for this "Proxy" server. As mentioned earlier, I believe that a more suitable name for it is "API Gateway". This name better reflects its role and purpose.
Now let's reflect this in the diagrams.

In the diagram above, you will notice the inclusion of a new component called NATS Message Bus/Message Broker/PubSub. This component serves as a message bus and allows for publish-subscribe functionality. Certain microservices have the capability to produce events, such as GuildEventNewMessage, LBEventCharacterLoggedIn, MailEventIncomingMail, and more. Any microservice can subscribe to specific events using NATS and handle them accordingly.
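As a minimal example of how a microservice could consume one of these events with the official Go NATS client (github.com/nats-io/nats.go) - the subject name and payload shape for GuildEventNewMessage are assumptions of mine:

```go
// Minimal consumer of a guild chat event over NATS using github.com/nats-io/nats.go.
// The subject name and payload shape are assumptions for illustration.
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// guildNewMessageEvent is an assumed payload for GuildEventNewMessage.
type guildNewMessageEvent struct {
	GuildID    uint64 `json:"guildId"`
	SenderName string `json:"senderName"`
	Message    string `json:"message"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatalf("connect to NATS: %v", err)
	}
	defer nc.Close()

	// Any interested microservice (the gateway, for example) can subscribe and react.
	_, err = nc.Subscribe("guild.event.new-message", func(msg *nats.Msg) {
		var ev guildNewMessageEvent
		if err := json.Unmarshal(msg.Data, &ev); err != nil {
			log.Printf("bad event payload: %v", err)
			return
		}
		log.Printf("guild %d: %s says %q", ev.GuildID, ev.SenderName, ev.Message)
	})
	if err != nil {
		log.Fatalf("subscribe: %v", err)
	}

	select {} // block forever; a real service would manage its own lifecycle
}
```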
Congratulations! Now you know the pillars on which ToCloud9 stands.
Now, let me list all the implemented microservices and outline the remaining work that needs to be done.
And the current (simplified) architecture looks like this:

Things to be done (ordered by my priority):
Epilogue
So what now? Should AzerothCore replace the current architecture?
In my opinion, no. The current architecture works well for approximately 90% of AzerothCore users, especially those who are using it to play with a small group of friends. However, that remaining ~10% of users can be the most important for your project.
Ideally, AzerothCore should provide an option to run itself in a cluster mode. This would allow users with specific requirements, such as high scalability or fault tolerance, to run AzerothCore in a distributed architecture.
How can you help the ToCloud9 project?
So, what are your thoughts? Does this new approach look interesting to you?