Cluster configuration and node management
Environment configuration
File descriptors
To handle a large amount of traffic, some system variables need to be tuned. The most important one is the maximum number of open file descriptors, which often defaults to 1024. Each MongooseIM connection consumes roughly one file descriptor, so the default value will not suffice for larger installations - when it is exceeded, `emfile` errors will appear in the logs.
To check the current limit, execute `ulimit -n`.
To list all limits, execute `ulimit -a`.
In the example below we set the limits for a `mongooseim` user.
To increase the limit, add the following entries to `/etc/security/limits.conf`:

```
mongooseim    soft    nofile    1000000
mongooseim    hard    nofile    1000000
```
If you are using Ubuntu, all `/etc/pam.d/common-session*` files should include `session required pam_limits.so`.
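As a sanity check, you can read the limits back from a shell; the values printed depend on your system, and remember that PAM applies new limits only to fresh login sessions:

```shell
# Print the soft and hard limits for open file descriptors
# in the current shell session.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft=$soft hard=$hard"
```

Run this in a fresh session as the `mongooseim` user after editing `limits.conf` to confirm the change took effect.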
vm.args file
This file contains Erlang options used when starting the VM.
It is located in `REL_ROOT/etc/vm.args`, where `REL_ROOT` is the path to a MongooseIM release (i.e. `_build/prod/rel/mongooseim` if you build MongooseIM from source).
When using an SSL/TLS connection, we advise increasing `ERL_MAX_PORTS` to 350000.
This value specifies how many ports (files, drivers, sockets etc.) can be used by the Erlang VM.
Be cautious - it preallocates some structures inside the VM and has an impact on memory usage.
We suggest 350000 for 100 k users when using an SSL/TLS connection, or 250000 otherwise.
To check how memory consumption changes depending on `ERL_MAX_PORTS`, use a command like the following:

```
erl -env ERL_MAX_PORTS 350000 -noshell -eval 'io:format("~p~n", [erlang:memory(system)]), halt().'
```
Another change you need to make when building a MongooseIM cluster is setting the `-sname`.
To do it, set the `-sname` option in `vm.args` to the node's name, which must be resolvable from the other nodes in the cluster.
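A quick resolvability check can be sketched as follows (`hostname -s` and `getent` are assumed to be available on your system):

```shell
# Check whether this machine's short hostname resolves,
# as required for -sname based Erlang distribution.
host=$(hostname -s)
if getent hosts "$host" > /dev/null; then
  echo "resolvable: $host"
else
  echo "not resolvable: $host - add it to /etc/hosts or DNS"
fi
```

Run the same check on every node, using each peer's name, to confirm the whole cluster can resolve each other.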
Port range
To connect to other nodes, a freshly started node uses a port from the range `inet_dist_listen_min` to `inet_dist_listen_max`.
To enable this, add the following line (adjusting the port values to your needs) to the `vm.args` file:

```
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9200
```
Make sure that the range you set provides enough ports for all the nodes in the cluster.
Remember to keep the epmd port (4369) open if any firewall restrictions are in place. epmd keeps track of which Erlang node is using which ports on the local machine.
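Each Erlang node on a machine needs its own distribution port, so the range must be at least as large as the number of nodes per host. A minimal sketch of the arithmetic (the port values and node count are assumptions, not recommendations):

```shell
# Assumed example values - match them to your vm.args and deployment.
INET_DIST_LISTEN_MIN=9100
INET_DIST_LISTEN_MAX=9200
NODES_PER_HOST=3

# Number of ports the configured range provides.
RANGE=$((INET_DIST_LISTEN_MAX - INET_DIST_LISTEN_MIN + 1))
if [ "$RANGE" -ge "$NODES_PER_HOST" ]; then
  echo "ok: $RANGE ports available for $NODES_PER_HOST nodes"
else
  echo "too small: only $RANGE ports for $NODES_PER_HOST nodes"
fi
```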
Connecting nodes
Checklist:

- working directory `rel/mongooseim` (root of a MongooseIM release or installation)
- the same cookie across all nodes (`vm.args` `-setcookie` parameter)
- each node should be able to ping other nodes using its sname (ex. `net_adm:ping('mongoose@localhost')`)
- RDBMS backend is configured, so CETS can discover nodes
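The cookie item from the checklist can be verified mechanically. The sketch below simulates two `vm.args` files and compares their `-setcookie` values; the file names and the cookie are made up for the example - in practice, read the real `REL_ROOT/etc/vm.args` from each machine:

```shell
# Simulate vm.args files from two nodes.
printf -- '-sname mongooseim@host1\n-setcookie secretcookie\n' > vm1.args
printf -- '-sname mongooseim@host2\n-setcookie secretcookie\n' > vm2.args

# Extract the cookie value from each file and compare.
c1=$(awk '/^-setcookie/ {print $2}' vm1.args)
c2=$(awk '/^-setcookie/ {print $2}' vm2.args)

if [ "$c1" = "$c2" ]; then
  echo "cookies match"
else
  echo "cookie mismatch: $c1 vs $c2"
fi
```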
Initial node
Clustering is automatic and there is no difference between nodes, so no action is required on the initial node.
Just start MongooseIM using `mongooseim start` or `mongooseim live`.
New node - joining cluster
Clustering is automatic.

```
mongooseimctl start
mongooseimctl started
mongooseimctl join_cluster ClusterMember
```

`ClusterMember` is the name of a running node set in the `vm.args` file, for example `mongooseim@localhost`.
This node has to be part of the cluster we'd like to join.
First, MongooseIM will display a warning and ask whether the operation should proceed:

```
Warning. This will drop all current connections and will discard all persistent data from Mnesia. Do you want to continue? (yes/no)
```
If you type `yes`, MongooseIM will start joining the cluster.
Successful output may look like the following:

```
You have successfully joined the node mongooseim2@localhost to the cluster with node member mongooseim@localhost
```
In order to skip the question, you can add the `-f` option, which performs the action without displaying the warning and waiting for confirmation.
Leaving cluster
Stopping the node is enough to leave the cluster.
If you want to prevent the node from joining the cluster again, you have to specify a different `cluster_name` option in the CETS backend configuration. A different Erlang cookie is a good idea too.
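If you configure CETS through the TOML file, a sketch of where the option lives (the cluster name value here is made up; check the CETS backend documentation for the full set of options):

```toml
# mongooseim.toml - assumed fragment
[internal_databases.cets]
  cluster_name = "mongooseim-staging"
```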
To make a running node leave the cluster, call:

```
mongooseimctl leave_cluster
```
It only makes sense to use it if the node is part of a cluster, e.g. `join_cluster` was called on that node before.
Similarly to `join_cluster`, a warning and a question will be displayed unless the `-f` option is added to the command.
The successful output from the above command may look like the following:

```
The node mongooseim2@localhost has successfully left the cluster
```
Removing a node from the cluster
A stopped node is automatically removed from the node discovery table in the RDBMS database after some time, so that other nodes do not try to connect to it.
To remove another node from the cluster, call the following command from one of the cluster members:

```
mongooseimctl remove_from_cluster RemoteNodeName
```

where `RemoteNodeName` is the name of the node that we'd like to remove from our cluster.
This command is useful when a node is dead and unresponsive, and we'd like to remove it remotely.
The successful output from the above command may look like the following:

```
The node mongooseim2@localhost has been removed from the cluster
```
Cluster status
Run the command:

```
mongooseimctl cets systemInfo
```
`joinedNodes` should contain a list of properly joined nodes:

```
"joinedNodes" : [
  "mongooseim@node1",
  "mongooseim@node2"
]
```

It should generally be equal to the list of `discoveredNodes`.
If it is not equal, you could have some configuration or networking issues.
You can check the `unavailableNodes`, `remoteNodesWithUnknownTables`, and `remoteNodesWithMissingTables` lists for more information (generally, these lists should be empty).
You can read the description of the other `systemInfo` fields in the GraphQL API reference.
For a properly configured 2-node cluster, the output would show something like this (abbreviated):

```
{
  "data" : {
    "cets" : {
      "systemInfo" : {
        "discoveredNodes" : ["mongooseim@node1", "mongooseim@node2"],
        "joinedNodes" : ["mongooseim@node1", "mongooseim@node2"],
        "unavailableNodes" : [],
        "remoteNodesWithUnknownTables" : [],
        "remoteNodesWithMissingTables" : [],
        ...
      }
    }
  }
}
```
You can use the following commands on any of the running nodes to examine the cluster or to see if a newly added node is properly clustered:

```
mongooseimctl mnesia info | grep "running db nodes"
```

This command shows all running nodes. A healthy cluster should list all nodes here. For example:

```
running db nodes = [mongooseim@node1, mongooseim@node2]
```

```
mongooseimctl mnesia info | grep "stopped db nodes"
```

This command shows which nodes are considered stopped. This does not necessarily indicate that they are down, but might be a symptom of a communication problem.
Load Balancing
Elastic Load Balancer (ELB)
When using ELB, please be advised that some warm-up time may be needed before the load balancer works efficiently under a big load.
Software load balancer
Good examples of application-layer load balancers are HAProxy and Nginx.
DNS-based load balancing
Load balancing can also be performed at the DNS level: a DNS response can contain a number of IP addresses, returned to the client in random order.
On the AWS stack, this type of balancing is provided by Route53. The description of the service can be found in the Route53 Developer's Guide.
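On the client side, the effect is that the resolver hands back several addresses and the client picks one. A toy sketch of that selection (the IP list is made up):

```shell
# Hypothetical set of A records returned for one DNS name.
ips="10.0.0.1 10.0.0.2 10.0.0.3"

# Pick one at random, as a naive client would.
pick=$(echo "$ips" | tr ' ' '\n' | shuf -n 1)
echo "connecting to $pick"
```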
Other
The approaches described above can be mixed: we can use DNS load balancing to pick a software load balancer, which will then select one of the nodes.