Introduction
Description
Rai MS is a Link State based protocol for the construction of Pub/Sub messaging systems which allows for loops and redundancies in the network connections between peers. It has 4 different types of network transports:
- OpenPGM based multicast, with a unicast inbox protocol.
- TCP point to point connections.
- Mesh TCP all to all connections.
- Local bridging compatible with RV, NATS, Redis.
The first 3 transports may be interconnected with redundancies. The local bridging transport strips or adds the meta-data of the message that allows for routing through the network, so it can’t be looped.
It uses a best effort delivery system. It serializes messages based on subject so that streams are delivered in order, discarding duplicates, but messages which are lost in transit because of node or network failures are not retransmitted.
Architecture
Authentication
An ECDSA key pair is generated for a service and for each user that is pre-configured. An ECDH key is generated by each peer on startup for a key exchange that establishes a 32 byte session key. This session key is used to authenticate messages sent and received. Each peer in the system has a unique session key so that a message from any one of them can be authenticated. This is described further in Authentication.
Console Interface
The model that a node implements in the base client is close to that of a router. The command line resembles a cisco style interface with the ability to bring transports up and down at run time, examine their state, ping other nodes, traceroute, get help on commands with the ? key, use command line completion, and telnet into the node. More in Console.
Networking
A node consists of a router with several transports. The term "transport" is modeled as a switch, where other nodes on the transport are attached to the switch and one port of the switch is attached to the router. All of the nodes plugged into the switch can communicate without going through the router. This facilitates a multicast style transport, where a single multicast send reaches multiple nodes within the switch. It also allows a listener to accept multiple local connections which use a protocol like RV, NATS, or Redis and communicate without regard to the other nodes attached to other switches or transports through the router.
The subscription mechanism has three layers: the router, the switch, and the connection. The router uses bloom filters to route subjects, the switch uses 32 bit mac addresses based on the subject/wildcard hash, and the connection uses a btree of subjects:
```
router <-> bloom filter 1 <-> switch 1 <-> mac 1 <-> connection 1 <-> btree entry 1
           bloom filter 2     switch 2     mac 2     connection 2     btree entry 2
```
More about this in the Networking section.
Link-State Database
The first thing a node does after authentication is download the peer's LSDB (Link-State DB), which first consists of records for every other peer:
{ bridge id, session key, peer name, sub seqno, link seqno, bloom }
The seqno values allow for delta updates of the LSDB, which can add/remove a link or add/remove a subscription from the bloom filter. The bloom filter contains everything needed to filter the subscriptions that the peer has interest in. It generally uses about 2 bytes per subscription for a false positive error rate of about 0.05% (1 in 2000 subjects), so if a peer has 10,000 subjects or wildcards open, it will be about 20,000 bytes in size.
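As a rough sizing sketch of the figures above (the 2 bytes per subscription constant is the approximate overhead quoted, not an exact filter parameter):

```python
# Approximate bloom filter size from the figures above:
# ~2 bytes per subscription for a ~0.05% false positive rate.
def approx_bloom_bytes(num_subscriptions, bytes_per_sub=2):
    return num_subscriptions * bytes_per_sub

print(approx_bloom_bytes(10_000))  # -> 20000 bytes, matching the example above
```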
Then for each bridge id, it downloads the links that the peer is connected to for each transport/switch. The local bridging that occurs for foreign protocols like RV, NATS and Redis is directly attached to the peer, and those subscriptions are considered the peer's subscriptions. In other words, the bloom filter for a peer has all of the subscriptions for every RV, NATS or Redis client connected to it.
The link records are for nodes which are directly attached to the peer via a transport. There may be many nodes using the same link attached to the peer, and a node may be reachable via multiple transports. The unique identifier for a link is the (bridge id, tport id) pair.
{ bridge id, tport name, peer name, tport id, cost }
A delta update of the LSDB, whether a link change or a subscription change, is broadcast to all of the nodes. If a network split occurs and some nodes are orphaned from the network for a period before rejoining, then synchronization of the LSDB with a peer occurs when the sub seqno or the link seqno has advanced. Any peer is capable of updating any other peer since the LSDB is the same in every one. The primary means of watching the seqno changes is with a transport heartbeat sent on a 10 second (default) interval between directly connected peers. In addition, each peer randomly chooses another peer to ping at a random interval based on the heartbeat interval.
When a transport becomes too congested, heartbeats are missed, the link is dropped, and it is rejoined at the next heartbeat. The effect of this is that 10 seconds of traffic is rerouted, or lost if there are no other routes to the peers on the other side.
More in the Link State section.
Multicast routing
Any time a link is added to the LSDB, the routing is recalculated using a Dijkstra path finding algorithm. The shortest path is chosen, and if multiple equal paths exist, then the link with the lowest weight is chosen. Load balancing can occur when there are two or more equal paths to a peer based on the subject mac of the destination. The LSDB is considered "consistent" when all peers agree that a link exists. If peer A has an outgoing link to peer B, then peer B must have a link to peer A. If this is not the case, then LSDB synchronization requests to the closest peer along the path are performed until the network converges to a consistent state.
All peers will choose the same route for a subject when the LSDB is synchronized. If the LSDB is not synchronized, then messages may be duplicated to alternative routes, or a peer may decide that routing is not necessary for a message when it is. For this reason, keeping the LSDB synchronized as fast as possible is a top priority of a node.
A technique called reverse path forwarding is used for multicast messages. If a destination is unicast to a peer, which is the case for inbox style messaging, then there is only one path for the message, the shortest path. With multicast, there are multiple paths that a message may take, each being the shortest path to a subscriber. Reverse path forwarding uses the source of the message to route it. The algorithm increments the distance from the source to compute a set of nodes that are possible for a message at each hop, then chooses the best traversal of the network graph so that the entire network is covered with a minimal set of transmissions. Once this is calculated, it can be reused until a link in the LSDB is updated again. This set of paths is augmented with bloom filters from the peers, so that a router will forward a message only if it passes through the reverse path forwarding algorithm and it passes through the bloom filters attached to the path.
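A minimal sketch of the idea, assuming uniform link costs and using a BFS shortest-path tree rooted at the source (the real implementation uses Dijkstra over the LSDB costs and adds the bloom filter checks along each path):

```python
from collections import deque

def forwarding_tree(adjacency, source):
    """Compute the edges a message from `source` should traverse so the
    whole graph is covered once (the reverse path forwarding idea)."""
    parent = {source: None}
    order = deque([source])
    while order:
        node = order.popleft()
        for peer in adjacency[node]:
            if peer not in parent:          # first (shortest) arrival wins
                parent[peer] = node
                order.append(peer)
    # forward on an edge only if it is part of the shortest-path tree
    return {(parent[n], n) for n in parent if parent[n] is not None}

# example: an A-B-C triangle; messages from A are not echoed around the loop
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
print(forwarding_tree(adj, "A"))  # {('A', 'B'), ('A', 'C')}
```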
Wildcard Matching
A generic PCRE based conversion is used to enable multiple wildcard styles to coexist between peers. The bloom filter contains both a prefix and suffix matching filter, so that A.*.B is matched with both ends of the wildcard. When a subject is passed through a bloom filter the prefix of the subject is hashed with different seeds based on the prefix lengths used. If a peer is interested in subject prefix lengths of 3, 5, 10, 20, as well as the subject itself, these lengths are noted in the bloom filter and the hash set is calculated as
```
hs  = hash( subject,        seed = 0 )
h3  = hash( subject[1..3],  seed = 3 )
h5  = hash( subject[1..5],  seed = 5 )
h10 = hash( subject[1..10], seed = 10 )
h20 = hash( subject[1..20], seed = 20 )
```
If any of these hash values are present in the bloom filter, then a check for the suffix matches is done. The hash set is computed in groups, before any routing based on the entire set of hashes is done, in order to take advantage of instruction parallelism, computing several hashes for each iteration over the subject length.
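A sketch of building the prefix hash set, assuming a generic hash function and an illustrative set of prefix lengths (zlib.crc32 stands in for the real hash and seeding scheme):

```python
import zlib

def prefix_hash_set(subject, prefix_lengths):
    """Hash the full subject plus each prefix length a peer noted in its
    bloom filter; the prefix length is mixed in as the seed."""
    hashes = {0: zlib.crc32(subject.encode())}               # full subject, seed 0
    for n in prefix_lengths:
        if n <= len(subject):
            hashes[n] = zlib.crc32(subject[:n].encode(), n)   # prefix, seeded by length
    return hashes

print(prefix_hash_set("ORDERS.NYSE.IBM", [3, 5, 10]))
```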
Anycast and Shardcast
An anycast route is a single match of many. A set of peers interested in a subject can be computed because the LSDB contains filters for all of them. One peer from this set of interested peers is randomly chosen and the message is unicast routed to it. If the peer has a false match, or the interest in the subject is lost, then that peer can choose another from the set and forward it.
A shardcast is a set of peers interested in the prefix of a subject, but only a shard of the subject space. The bloom filter contains enough info to filter by both the prefix hash and the subject space that a peer is interested in. In this case, the peers have predetermined how many shards there should be and how the shards are split between them. If A subscribes to X.* using shard 1/2 and B subscribes to X.* using shard 2/2, then the subjects X.Y and X.Z are split between A and B based on the hash of X.Y and the hash of X.Z. This is a variation of suffix matching where the hash of the subject is used to discriminate the route of the message.
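A minimal sketch of selecting a shard from the subject hash, assuming two shards as in the example above (the hash function here is a stand-in, not the one used on the wire):

```python
import zlib

def shard_owner(subject, peers):
    """Pick the peer whose shard the subject hash falls into.
    With peers [A, B], each owns half of the subject space."""
    shard = zlib.crc32(subject.encode()) % len(peers)
    return peers[shard]

for subj in ("X.Y", "X.Z"):
    print(subj, "->", shard_owner(subj, ["A", "B"]))
```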
Why use it?
Distributed systems are more often crossing network boundaries. Traditional broker based systems or multicast based systems have difficulty expanding beyond these boundaries. To remedy this, network designs may deploy application specific routers, shard the messaging system, or use other protocols like mesh or gossip based systems. All of these solutions have advantages and drawbacks.
The aims of this system are:
- Flexible transports and networking.
- Fast message authentication.
- Fast network convergence.
- Distribute messages only when interest is present.
- Utilize redundant links.
- Flexible message distribution: inbox, multicast, anycast, shardcast.
- Flexible wildcarding mechanism.
- Ability to recover subscription interest at the endpoints.
Building
There are a lot of submodules and dependencies, so at present, building using the build Makefile is the easiest way to compile everything. Clone it, install the dependencies, clone all of the modules, build everything. The rpm dependencies will probably need the EPEL repo installed when using an enterprise RedHat, CentOS, or derivative for the liblzf-devel package (and maybe others).
```
$ git clone https://github.com/raitechnology/build
$ cd build
$ make install_rpm_deps
$ make clone
$ make
```
If this completes, there will be a static binary at raims/OS/bin/ms_server, where OS is something like RH8_x86_64. If you set the env var for debugging, then the RH8_x86_64-g directory will be populated without optimization and with the -g flag.
```
$ export port_extra=-g
$ make
```
Running the MS server
The first task is to create the authentication keys for a service "test". The ms_gen_key program creates and updates the configuration. The user keys are stored in the user_X_svc_test.yaml files and contain ECDH key pairs. The service is an ECDSA key pair and signs each user, storing the signatures in the svc_test.yaml file. The startup.yaml contains the startup config. The config.yaml file includes all of the files in the config directory.
```
$ cd build/raims
$ ms_gen_key -u A B C -s test
create dir config -- the configure directory
create file config/.salt -- generate new salt
create file config/.pass -- generated a new password
create file config/config.yaml -- base include file
create file config/param.yaml -- parameters file
create file config/svc_test.yaml -- defines the service and signs users
create file config/user_A_svc_test.yaml -- defines the user
create file config/user_B_svc_test.yaml -- defines the user
create file config/user_C_svc_test.yaml -- defines the user
OK? y
done
```
This creates the keys for users A, B, and C. These keys are encrypted with the .pass and .salt files.
More about this in the [key config guide](keys.md).
Run the ms_server program and configure it. The -u option specifies the user and service. The -c option starts the command line interface, where the networks can be defined and connected. The following defines a mesh endpoint and saves it to the startup config.
```
$ ms_server -u B.test -c
05:54:26.267 session A.test[RthXjJscfuvnG2+J1/PJ1w] started, start time 1644818066.265990830
A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[249]> configure transport mytran
A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[250](mytran)> type mesh
A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[251](mytran)> listen *
A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[252](mytran)> port 5000
A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[253](mytran)> show
tport: mytran
type: mesh
route:
  listen: "*"
  port: 5000
A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[254](mytran)> exit
A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[255]> listen mytran
transport "mytran" started listening
05:55:09.934 listening on [::]:5000
05:55:09.937 network converges 0.003 secs, 0 uids authenticated, add_tport
A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[256]> save
config saved
05:55:12.790 update file A/param.yaml -- parameter config
05:55:12.790 create file A/startup.yaml -- startup config
05:55:12.790 create file A/tport_mytran.yaml -- transport
```
The files are described in the [configuration] section and the transports are described in the Networking section. The authentication keys need to be distributed to all the nodes, but the networking config will be somewhat unique to each node.
Configuration
Key Configuration
The key configuration files are necessary to join the network. They authenticate peers and the message traffic that flows between peers. It does not authenticate the local bridging protocols RV, NATS, or Redis.
Generating a master config is done with the ms_gen_key program. The default location for the config directory is ./config; other locations are specified with the -d option.
Initially, the config directory is empty. Initialize with some users and a service name.
```
$ ms_gen_key -u A B C -s test
create dir config -- the configure directory
create file config/.salt -- generate new salt
create file config/.pass -- generated a new password
create file config/config.yaml -- base include file
create file config/param.yaml -- parameters file
create file config/svc_test.yaml -- defines the service and signs users
create file config/user_A_svc_test.yaml -- defines the user
create file config/user_B_svc_test.yaml -- defines the user
create file config/user_C_svc_test.yaml -- defines the user
OK? y
done
```
Exporting the keys for each of the nodes causes the .pass file to change and the unnecessary private keys to be removed. The only private key that remains is for the peer. This trimmed configuration allows the peer to run, but not generate new peers, because the private key of the service is not present.
```
$ ms_gen_key -x A B C -s test
- Loading service "test"
- Signatures ok
create dir A -- exported configure directory
create file A/.salt -- a copy of salt
create file A/.pass -- generated a new password
create file A/param.yaml -- a copy of param
create file A/config.yaml -- base include file
create file A/svc_test.yaml -- defines the service and signs users
create file A/user_A_svc_test.yaml -- defines the user
create file A/user_B_svc_test.yaml -- defines the user
create file A/user_C_svc_test.yaml -- defines the user
create dir B -- exported configure directory
create file B/.salt -- a copy of salt
create file B/.pass -- generated a new password
create file B/param.yaml -- a copy of param
create file B/config.yaml -- base include file
create file B/svc_test.yaml -- defines the service and signs users
create file B/user_A_svc_test.yaml -- defines the user
create file B/user_B_svc_test.yaml -- defines the user
create file B/user_C_svc_test.yaml -- defines the user
create dir C -- exported configure directory
create file C/.salt -- a copy of salt
create file C/.pass -- generated a new password
create file C/param.yaml -- a copy of param
create file C/config.yaml -- base include file
create file C/svc_test.yaml -- defines the service and signs users
create file C/user_A_svc_test.yaml -- defines the user
create file C/user_B_svc_test.yaml -- defines the user
create file C/user_C_svc_test.yaml -- defines the user
OK? y
done
```
Copy the A config to the A node/config, the B config directory to the B node/config, etc. The .pass file is unique for each peer so that it can be removed after running the server, rendering the configured keys unreadable until the .pass file is restored or the peer's config is regenerated from the master config.

The copy of the master config includes a copy of the param.yaml, as that can contain global configuration, but doesn't copy any local configuration such as startup and network configuration.
The master config will also work, so just copying it to the peers will allow them to run if this type of security is unnecessary.
Single File Configuration
The ms_gen_key option -o will concatenate the configuration into a single file:
```
$ ms_gen_key -s test -o test.yaml
create dir config -- the configure directory
create file config/.salt -- generate new salt
create file config/.pass -- generated a new password
create file config/config.yaml -- base include file
create file config/param.yaml -- parameters file
create file config/svc_test.yaml -- defines the service and signs users
OK? y
done
- Output config to "test.yaml"
```
Running ms_server -d config will load the configuration from a directory and running ms_server -d test.yaml will load the configuration from a file. In both cases, the configuration loaded is the same.
A test network can be set up using only the loopback interface by describing the network using a format output by the show_graph command. The format of this is:
```
node A B C D
tcp_link_ab A B : 200
tcp_link_bc B C : 100
tcp_link_ac A C : 200
tcp_link_bd B D : 200
tcp_link_dc D C : 300
```
The node line declares all of the users. The tcp_ lines describe how the users are connected. The number following the : is the cost of the transport.
Running the ms_test_adj program with this description will generate a configuration, saved in a file called "graph.txt" and output to "graph.yaml":
$ ms_test_adj -l graph.txt > graph.yaml
The -l option causes the links to be resolved by exchanging messages over the loopback interface. At the bottom of the "graph.yaml" created, there will be commands in comments to run this configuration. Running these will create 4 users in a network described by the graph. The following uses those commands with the first three running in the background and the last with a console attached to it, but you could run each in a different terminal with consoles attached in order to use the sub and trace commands to test how messages would be routed through the network.
$ ms_server -d graph.yaml -u A -t link_ab.listen link_ac.listen &
$ ms_server -d graph.yaml -u B -t link_ab.connect link_bc.listen link_bd.listen &
$ ms_server -d graph.yaml -u C -t link_ac.connect link_bc.connect link_dc.listen &
$ ms_server -d graph.yaml -u D -t link_bd.connect link_dc.connect -c
In addition to "tcp" type links, you could also define "mesh" and "pgm" types, but the pgm type would require a non-loopback interface that has multicast, like a Linux bridge.
```
node A B C D
mesh_link_abcd A B C D : 100 1000 100 1000
mesh_link_abcd2 A B C D : 1000 100 1000 100
```
The above graph would create two meshes, with different costs for some of the paths. This would route messages over both meshes by sharding the subject space and using one mesh for half of the subject space and the other mesh for the other half.
There is a graphical interface to view the network using the cytoscape package.
$ ms_server -c
chex.test[0vEvE73U78HkGZUgBK94mQ]@chex[10]> configure transport web type web port 8080 listen 127.0.0.1
Transport (web) updated
chex.test[0vEvE73U78HkGZUgBK94mQ]@chex[11]> listen web
Transport (web) started listening
0209 22:54:25.382 web: web start listening on 127.0.0.1:8080
0209 22:54:25.382 http_url http://127.0.0.1:8080
Connect to the url http://127.0.0.1:8080/graph_nodes.html with a web browser and paste the graph text into the text box after erasing the existing text, then click "show graph".
Parameters
The parameters section of the configuration is used to look up values that can alter the behavior of the server. These fields can be set anywhere in the config files, but are usually in the "param.yaml" or "startup.yaml" files. Since the "config.yaml" includes "*.yaml", any yaml file in the config directory can contain parameters. Any field value pair which is not in a service, user, transport, or group section is added to the parameters section.
This configuration is a list of parameters:
```yaml
parameters:
  pass: .pass
  salt: .salt
  heartbeat: 5 secs
  reliability: 10 secs
  tcp_noencrypt: true
```
The "parameters:" structure is optional; the fields can be defined without it.
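For example, the same fields can appear at the top level of any yaml file in the config directory (an illustrative snippet):

```yaml
# equivalent to the parameters block above, without the "parameters:" wrapper
pass: .pass
salt: .salt
heartbeat: 5 secs
reliability: 10 secs
tcp_noencrypt: true
```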
Name | Type | Default | Description
---|---|---|---
salt | filename | none | File to find encryption salt
pass | filename | none | File to find encryption password
salt_data | string | none | Base 64 encoded encryption salt
pass_data | string | none | Base 64 encoded encryption password
listen | array | none | Startup listen transports
connect | array | none | Startup connect transports
pub_window_size | bytes | 4 MB | Size of publish window
sub_window_size | bytes | 8 MB | Size of subscribe window
pub_window_time | time | 10 secs | Time of publish window
sub_window_time | time | 10 secs | Time of subscribe window
heartbeat | time | 10 secs | Interval of heartbeat
reliability | time | 15 secs | Time of publish reliability
timestamp | string | LOCAL | Log using local time or GMT
pid_file | string | none | Daemon pid file
map_file | string | none | Use for key value storage
db_num | string | none | Default db number for key value
ipc_name | string | none | Connect to IPC sockets
tcp_timeout | time | 10 secs | Default timeout for TCP/mesh connect
tcp_ipv4only | boolean | false | Use IPv4 addressing only
tcp_ipv6only | boolean | false | Use IPv6 addressing only
tcp_noencrypt | boolean | false | Default for TCP/mesh encryption
tcp_write_timeout | time | 10 secs | Timeout for TCP write
tcp_write_highwater | bytes | 1 MB | TCP write buffer size
idle_busy | count | 16 | Busy wait loop count
working_directory | dirname | none | Switch to directory when in daemon mode
- salt, pass, salt_data, pass_data — The salt, pass or salt_data, pass_data are required for startup. The keys defined in the configuration are encrypted with these values. Any key derived during execution is mixed with the salt and must be the same in all peers.
- listen, connect — The startup transports. They are started before any other events are processed. If a listen fails, then the program exits. A connect failure will not cause an exit, since it retries.
- pub_window_size, sub_window_size, pub_window_time, sub_window_time — These track the sequence numbers of messages sent and received. They are described in Publish sequence window.
- heartbeat — The interval at which heartbeats are published to directly connected peers. A link is not active when a heartbeat is missed for 1.5 times this interval. The link is reactivated when a heartbeat is received.
- timestamp — When set to GMT, the time stamps are not offset by the local timezone.
- pid_file — A file that contains the process id when forked in rvd mode.
- map_file — If a Redis transport is used, this is where the data is stored. If no map is defined, then storing data will fail and data retrieved will be zero. The kv_server command will initialize a map file.
- db_num — The default database number for the Redis transport.
- ipc_name — When set, allows IPC processes to connect through Unix sockets and subscription maps using the same name. If the processes are shut down, they will restart or stop the subscriptions using the maps.
- tcp_timeout — The default retry timeout for TCP and mesh connections.
- tcp_ipv4only — Resolve DNS hostnames to IPv4 addresses only.
- tcp_ipv6only — Resolve DNS hostnames to IPv6 addresses only.
- tcp_noencrypt — When true, the default for TCP and mesh connections is to not encrypt the traffic.
- tcp_write_timeout — Amount of time to wait for TCP write progress when the write buffer is full. After this time, the socket is disconnected and messages are lost. When a TCP write buffer holds tcp_write_highwater bytes or more, backpressure can be applied to the sockets that are forwarding data, causing them to add latency while waiting for the writer to have space available.
- tcp_write_highwater — Amount of data to buffer for writing before applying back pressure to forwarding sockets.
- idle_busy — Number of times to loop while no activity is present. More looping while idle keeps the process on a CPU for lower latency at the expense of wasted CPU cycles.
- working_directory — When running in the background in daemon mode, which is without a console, using RVD mode without the -foreground argument or with the -b argument, switch to this directory after forking and detaching from the terminal. This directory can be used to store the .console_history files or other files that are saved using console subscription commands. If the command line with telnet is not used, then no files are created.
Startup
The startup section can be used to start transports during initialization. This syntax is used by the save console command, but can also be edited. The following causes the transport named myweb to start with listen, then start mymesh and mytcp with connect. The listeners are always started before the connectors.
```yaml
startup:
  listen:
    - myweb
  connect:
    - mymesh
    - mytcp
```
Hosts
The hosts section can be used to assign address strings to names, similar to an /etc/hosts configuration. The values assigned to the names are substituted in any connect or listen configuration of a transport. For example, the following hosts are used in the connect and listen portions of the net transport.
```yaml
hosts:
  chex: 192.168.0.16
  dyna: 192.168.0.18
transports:
  - tport: net
    type: mesh
    route:
      connect: chex:5001
      listen: dyna:5000
startup:
  connect:
    - net
```
A mesh type transport with connect uses both the listen and the connect addresses defined, since all peers can both connect and accept connections.
Authentication
Authentication has two parts: the initial key exchange that sets up a unique session key for each peer, and message authentication that verifies that a peer sent a message. The key exchange protocol uses an Elliptic Curve Diffie Hellman (ECDH) exchange that is signed by an Elliptic Curve Digital Signature (ECDSA). The message authentication uses an HMAC digest computed by enveloping the message with a peer's session key and computing the hash, along with sequencing by subject to prevent a replay of messages.
Key Exchange
Two peers authenticate with each other by signing a message with a configured ECDSA key. This message includes a generated ECDH public key. The ECDH key is used by each side to compute the secret using the corresponding ECDH private key. The secret, along with a unique nonce, a time stamp, and a sequence number, is used to create a temporary key that encrypts a random session key.
For peers A and B to complete the key exchange, there are 4 messages:
- HELLO/HB from peer A sent to peer B — Includes a seqno, a time, a nonce, and an ECDH public key. Since these are unique for each side, call these A_seqno, A_time, A_nonce, A_ECDH_pub.
- AUTH from peer B sent to peer A — Includes B_ECDSA_sig, B_seqno, B_time, B_nonce, B_ECDH_pub, B_auth_key, A_seqno, and A_time. The A_seqno and A_time allow peer A to match the unique A_nonce which corresponds to the HELLO message sent previously. The last two HELLO messages are tracked, so it must match one of these. The B_auth_key contains an AES encrypted session key which must be decrypted by computing a temporary key using the data from B as well as the ECDH secret computed from B_ECDH_pub and A_ECDH_pri. Peer A trusts peer B if the decrypted session key in B_auth_key authenticates the message using the HMAC computation and the HMAC computation is also signed by B_ECDSA_sig.
- AUTH from peer A sent to peer B — The reverse of the above; includes A_ECDSA_sig, A_seqno, A_time, A_nonce, A_ECDH_pub, A_auth_key, B_seqno, and B_time. The B_seqno and B_time are used to match the B_nonce included in the previous AUTH and used by peer A to create the temporary key which encrypts the A_auth_key session key for A. Peer B trusts peer A if the decrypted session key in A_auth_key authenticates the message using the HMAC computation and the HMAC computation is also signed by A_ECDSA_sig.
- AUTH_OK from peer B sent to peer A — This notifies peer A that authentication was successful.
If either AUTH fails the HMAC computation, then the authentication fails and one or both peers are ignored for 20 seconds (or 2 times the heartbeat interval). It is possible that the latency of the key exchange is greater than 2 HELLO/HB messages, so the nonce associated with the seqno/time pair is too old and the authentication must restart.
The ECDSA private key used to sign the authentication messages is either the configured key pair from the service or the configured key pair from the user. A configuration may not include the service private key in the case that a user has fewer privileges than the service, which has admin privileges. The service's private key is able to sign new users which are added to the system, but a user's private key can only authenticate itself.
The following is from the Example Message Flow. This shows the HELLO/HB part of the key exchange, where peer A is ruby and peer B is dyna.
```
_X.HELLO ... ruby -> dyna
bridge_16    [1027] : xq6vl+2HcoDxtt+7lC7dGQ
digest_16    [1029] : mB1uDQ7fsGmYScIGU0kt6Q
sub_s2       [1792] : "_X.HB"
user_hmac_16 [1028] : TQO1sorP9oD+smMOrnvzuQ
seqno_2      [273]  : 1
time_8       [787]  : 1663967268385616894
uptime_8     [788]  : 17982050574
start_8      [794]  : 1663967250404676993
interval_2   [277]  : 10
cnonce_16    [1034] : IG45ISINnT0bX2Td6Ovivw
pubkey_32    [1357] : +A2dlZCcDo8vS/XsWApNNfJwQH8ApmFIRTOcS+cPuAk
sub_seqno_2  [274]  : 0
user_s2      [1836] : "ruby"
create_s2    [1838] : "1663967250.404513467"
link_state_2 [281]  : 0
converge_8   [839]  : 1663967250404676993
uid_cnt_2    [292]  : 0
uid_csum_16  [1036] : xq6vl+2HcoDxtt+7lC7dGQ
version_s2   [1840] : "1.12.0-42"
pk_digest_16 [1091] : SMnBqzoh/w6IFi2c7zoxMw
```
The seqno_2, time_8, cnonce_16, pubkey_32 are the A_seqno, A_time, A_nonce, and A_ECDH_pub. The user_hmac_16, start_8, and service ECDSA_pub are combined to create a hello_key which is used to authenticate the HELLO message stored in pk_digest_16, since the session key that is the product of the key exchange is not yet known by dyna. The service ECDSA_pub is never sent over the wire, so it is used as a pre-shared key in this instance. There is another pre-shared key used by the Key Derivation Function (KDF) to generate keys from secrets, nonces, seqnos, and time stamps. The KDF is seeded by a 640 byte salt and shared along with the service ECDSA_pub key in all of the peers that need to communicate.
The first AUTH message from peer B (dyna) to peer A (ruby):
```
_I.xq6vl+2HcoDxtt+7lC7dGQ.auth ... dyna -> ruby
bridge_16    [1027] : wwEnbQEY2FMuwZGSjpi3jQ
digest_16    [1029] : 3UY+SJQYy3wGN0dW3zc4fg
sub_s2       [1792] : "_I.xq6vl+2HcoDxtt+7lC7dGQ.auth"
user_hmac_16 [1028] : PYv43FUBG3N8ok+jn4nBPQ
seqno_2      [273]  : 1
time_8       [787]  : 1663967268386849657
uptime_8     [788]  : 63309580030
interval_2   [277]  : 10
sub_seqno_2  [274]  : 0
link_state_2 [281]  : 0
auth_seqno_2 [285]  : 1
auth_time_8  [798]  : 1663967268385616894
auth_key_64  [1542] : AdM61M2DqR6hXdVnPnp716n5lQwcBAyx0N1jzGtzIM9OmAF4txsoZRd1YMOySIcxkyydHELJHfgVflEtnLg9Fg
cnonce_16    [1034] : TEbM+MfLCp66ds36xh0JAA
pubkey_32    [1357] : PyEHl7Y3IxAkK5OQMnJzggmlKlUo8+RiBif0P7h+8kg
auth_stage_2 [305]  : 1
user_s2      [1836] : "dyna"
create_s2    [1838] : "1663967205.077153809"
expires_s2   [1839] : null
start_8      [794]  : 1663967205077372910
version_s2   [1840] : "1.12.0-42"
pk_sig_64    [1610] : gR2ovdrI4yfxdc7ZAR+ID00hj2HDYEcEexU/ib4CDAU4t2E/nzC6c1dK0s14RiZIWzHHxRFR6D2uJ/ZaHHwaDw
```
The auth_seqno_2 and auth_time_8 are the A_seqno and A_time values from ruby, used to find the A_nonce (cnonce_16) in the HELLO message. These, along with seqno_2, time_8, cnonce_16, and pubkey_32, are used to construct the temporary key to decrypt the auth_key_64, which is the session key used by dyna in the HMAC computation that authenticates messages, comparing the result to digest_16. The pk_sig_64 is the ECDSA signature of the message, signed either by the service's private key or by the user dyna's private key.
After this succeeds, ruby trusts messages from dyna that have an HMAC computation digest_16 included with each message, along with a seqno and time stamp to prevent replays.
The second AUTH message from peer A (ruby) to peer B (dyna):
```
_I.wwEnbQEY2FMuwZGSjpi3jQ.auth ... ruby -> dyna
bridge_16    [1027] : xq6vl+2HcoDxtt+7lC7dGQ
digest_16    [1029] : h81umkyeNoYJAbomEWE+ng
sub_s2       [1792] : "_I.wwEnbQEY2FMuwZGSjpi3jQ.auth"
user_hmac_16 [1028] : TQO1sorP9oD+smMOrnvzuQ
seqno_2      [273]  : 1
time_8       [787]  : 1663967268387280755
uptime_8     [788]  : 17982688972
interval_2   [277]  : 10
sub_seqno_2  [274]  : 0
link_state_2 [281]  : 0
auth_seqno_2 [285]  : 1
auth_time_8  [798]  : 1663967268386849657
auth_key_64  [1542] : v4mYze2OruL2L02gODDt7Fd9FHTDPLO0UD/auhab+FJiGgbD473osbwlYKfYBVgwvZMFqbLpVnLiGPHD+MXPtw
cnonce_16    [1034] : zUYBUCh9n0L4F0dltxxtyg
pubkey_32    [1357] : +A2dlZCcDo8vS/XsWApNNfJwQH8ApmFIRTOcS+cPuAk
auth_stage_2 [305]  : 2
user_s2      [1836] : "ruby"
create_s2    [1838] : "1663967250.404513467"
expires_s2   [1839] : null
start_8      [794]  : 1663967250404676993
version_s2   [1840] : "1.12.0-42"
pk_sig_64    [1610] : 6lU9Yz3cvW178goVHwakHsFR55TYid9SHDwjIl/fPrxFVCkCujLxK2HQXNtw3zeVRgmi01pGEqemBUW59YuNDA
```
The same exchange from the first AUTH message is used in order for dyna to trust ruby.
System Compromise
If a host is compromised and the KDF pre-shared key and service ECDSA_pub key are discovered along with a user ECDSA_pri key, then an unauthorized party could masquerade as that user.
One way to prevent this is to remove the pre-shared 640 byte salt file after starting a server or the unique password file used to encrypt the ECDSA keys in the configuration files. Both the salt and password are needed to decrypt the keys.
Another option is to use stdin for reading the configuration so that no secrets are stored in the filesystem. For example, this will configure ms_server by sending a configuration through ssh to a remote host:
$ cat config.yaml | ssh host "bash -c '$( nohup /usr/bin/ms_server -d - -b > /dev/null 2> /dev/null )'"
The ms_server running on host will read the configuration from stdin (-d - argument) and then fork itself to run in the background (-b argument).
Message Authentication
The function of the key exchange protocol is to initialize each peer with a random 32 byte session key. The function of this key is to authenticate messages. An HMAC calculation of the message is done by enveloping the message data with the key and hashing it using an AES based hash that results in an 8 byte digest:
AES( IV = 8 bytes key )( [ message ] [ 24 bytes key ] )
Note that HMAC is traditionally performed as MD5( key.opad + MD5( key.ipad + message ) ) or SHA3( message + key ). The above AES construction is chosen purely for speed, since AES instructions are widely available and an order of magnitude faster than the other hashes. This may change in the near future with the addition of SHA instructions.
The header of every message contains these 5 fields which identify the source of the message, the HMAC digest of the message, the subject, a seqno and a time stamp:
```
bridge_16 [1027] : h783olFEb9ve8K07E7PoQg
digest_16 [1029] : FKZxGPHiC7e5GXVKh2PWLg
sub_s2    [1792] : "_I.xq6vl+2HcoDxtt+7lC7dGQ.ping"
seqno_2   [273]  : 4
stamp_8   [838]  : 1663967313973571299
```
This header ensures that a message never contains the same bits and is always unique. It also allows the receivers to check that a replay has not occurred by tracking the sequences and time stamps for the subjects that they have seen previously. If the subject has never been seen before, then the time stamp is checked against the last network convergence time stamp, described more thoroughly in Message Loss. The bridge_16 identifies the source of the message and the digest_16 is computed with the source's session key.
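A sketch of the receiver-side replay check, assuming per-subject tracking of the last seqno and stamp (illustrative only; the actual window tracking and the convergence comparison are described in Message Loss):

```python
class ReplayCheck:
    """Track the last (seqno, stamp) seen per subject and reject replays."""
    def __init__(self, converge_stamp):
        self.converge_stamp = converge_stamp   # last network convergence time
        self.last = {}                         # subject -> (seqno, stamp)

    def accept(self, subject, seqno, stamp):
        if subject not in self.last:
            # never seen: compare the stamp against the last convergence time
            ok = stamp >= self.converge_stamp
        else:
            prev_seqno, prev_stamp = self.last[subject]
            ok = seqno > prev_seqno and stamp >= prev_stamp
        if ok:
            self.last[subject] = (seqno, stamp)
        return ok

rc = ReplayCheck(converge_stamp=1000)
print(rc.accept("_I.x.ping", 4, 1663967313973571299))  # True, first time seen
print(rc.accept("_I.x.ping", 4, 1663967313973571299))  # False, duplicate seqno
```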
Subjects
Wildcarding Subscriptions
The subject schema used by the external bridged transports may introduce some incompatibilities when routing from one to another. The subscriptions and the patterns are separate operators internally. A subscription using wildcarding characters is allowed, but not interpreted differently than any other subject. A pattern subscription includes a field which causes the pattern to be evaluated with different syntax rules, Redis GLOB or NATS/RV. A publish is not interpreted as a wildcard, even when it contains wildcard syntax. Any string of bytes can be subscribed or published, but the wildcarding follows the syntax of the pattern type and uses a different subscription operator internally, as Redis does (sub, unsub, psub, punsub).
The _INBOX subject
There is a special rule for subjects that begin with the prefix _INBOX.
, it
is interpreted as a point to point message. This subject format finds the
peers which are subscribers, typically just one, and sends the message point to
point for each one. The subject and message are put into an envelope addressed
for each peer. The peers that forward this message along the path to the
recipient recognize this as using a different forwarding rule than normal
multicast subjects. For example, the point to point rules for forwarding will
use a UDP inbox protocol when OpenPGM is deployed. The point to point rule
will still forward to all subscriptions of an inbox subject, but it is
optimized for the case that there is only one subscription.
RV subject rules
- A subject segment is separated by . and cannot start with a period, end with a period, or have two periods appear within a subject without characters in between.
- A wildcard can substitute the segments with a * character or a trailing >.
- A publish to a wildcard causes it to match the subjects subscribed. This is not supported by Rai MS since the bloom filters are not indexed by segments. Instead, Rai MS will route the wildcard publish as a normal subject.
- An _INBOX. prefix implies a point to point publish which translates to an anycast Rai MS publish.
NATS subject rules
- Same subject segmentation as RV.
- Same wildcarding as RV.
- It is not possible to publish to a wildcard.
- No inbox point to point messaging.
- A queue group publish translates to a Rai MS anycast publish.
Redis subject rules
- There are no limitations for the characters used in a subject.
- A wildcard is subscribed using a psub operator, so the characters are interpreted using wildcard rules. A * character matches zero or more, a ? matches 0 or 1 characters. A [ and ] match any of the characters enclosed. A \ character escapes the wildcard characters. It is similar to a shell glob wildcard.
- A publish to a wildcard is the same as publishing to a subject.
- No inbox point to point messaging, and no syntax for request/reply semantics.
Networking
Description of Transports
A Rai MS transport's function is to join all of the peers connected through a node into one virtual overlay network that provides basic pub/sub multicast.
A transport has two primary roles: the routing of messages between peers, and the managing of protocol dependent subscription management and message framing. The internal transports (PGM, TCP, Mesh) all use the internal protocol semantics for messaging. The external bridged transports (RV, NATS, Redis) have protocols with similarities, but they have unique behaviors that make them more complicated than the internal transports.
The design of the internal transports allows them to be used by any of the external transports, so RV can use a TCP mesh or PGM multicast or some combination of them interconnected. Similarly for NATS and Redis, they can also use PGM multicast as well as a TCP mesh. The routing of messages between peers is agnostic to the type of protocol that the endpoint clients are using. It is possible to use the Rai MS protocol directly as well. The ms_server console contains the ability to publish and subscribe without using an external client. The Telnet transport uses the console protocol. The Web transport serves builtin html pages that interface with the console protocol through the websocket protocol.
There are two sides to transport configuration, the listener and the connector. Only the internal transports (PGM, mesh, TCP) support the connecting side; the client side transports (RV, NATS, Redis, Telnet, Web) only use listeners and do not have a cost. The device option will auto-discover a connector or listener via multicast through a device. This requires that the connector and listener are on the same broadcast domain or have multicast routing configured.
The config file format is a JSON or YAML with a record that can have these fields:
```
tport: <name>
type: <pgm | mesh | tcp | rv | nats | redis | telnet | web | name>
route:
  listen: <address>
  connect: <address>
  device: <address>
  port: <number>
  cost: <number>
  <parm>: <value>
```
The name identifies the transport so that it can be referenced for starting and stopping in the console and the command line. It is also used by auto discovery to match transports and it is sent to other peers so that it can be read in log files and diagnostic output. It has no protocol implications beyond auto discovery; a misspelling won't cause it to stop working.
Services and Networks
The endpoint protocols (RV, NATS, Redis) all have a service defined to separate the data flows from one another. Using the same service name allows these endpoints to share the same namespace. The underlay network that connects the namespaces can be configured using the YAML files or the console, and can also be specified by the connecting clients. The clients can specify a network with PGM multicast or with TCP endpoints and meshes. All networks specified by a client that use TCP will still use multicast to resolve the endpoints by service name, using the name protocol.
Networks use a device name and a protocol or a multicast address. When a network is not specified by a client or configuration, then the links between services have to be configured by the YAML files and/or in the console.
Example networks and how they are interpreted. All of these have a service name associated with the network, which must match for namespaces to communicate.
- eth0;239.1.2.3 — Connect a PGM protocol to eth0, joining the multicast address 239.1.2.3 for communicating with other peers.
- eth0;tcp.listen — Connect a name protocol to the eth0 interface and advertise a TCP listen endpoint.
- eth0;tcp.connect — Connect a name protocol to eth0, and advertise a TCP connection endpoint. These resolve to a connection when listen endpoints appear from clients that use the above.
- eth0;mesh — Connect a name protocol to eth0, and advertise a TCP mesh endpoint. This creates connections to all other mesh endpoints advertised.
- eth0;any — Connect a name protocol to eth0, and connect to a listen or a mesh advertised.
The device name eth0 can be substituted with an IPv4 address, like 192.168.1.0;tcp.listen, or a hostname that resolves to an IPv4 address. If a network is specified without a name, like ;tcp.listen, then the machine's hostname is used to find the device.
The configuration for the PGM, name, and TCP protocols is generated as needed by the client if it does not exist. When a service is already configured, then it is used instead and the network parameters are ignored.
Cost
All of the internal transports have a cost assigned to their links. The routing from peer to peer uses this cost to find a path that minimizes the cost. Equal cost links are utilized by each peer by encoding a path into the message header. This path is enumerated from 0 → 3, so there is a maximum of 4 equal cost paths possible between any 2 peers in the network. The per path cost can be configured by using different cost metrics for each link. The default cost is 1000 so that a configured cost can be less or greater than 1000. These configured metrics are replicated throughout the network so that every peer agrees on the cost of every path that exists. A case where lowering the cost is useful is when some of the links have higher performance than others, as is the case when all peers exist within a host or within a data center. A case for configuring a different cost for each of the 4 paths is to load balance multiple links with equal performance.
Example of configuring a lower cost mesh on a bridge:
```yaml
tport: rv_7500
type: mesh
route:
  device: docker0
  cost: 10
```
If every container within this host has an RV client that connects with a network and service of -network eth0;mesh -service 7500, then the cost of 10 is discovered through the docker0 bridge. The name protocols used will use the name of the device as their tport name.
Example of configuring a load balanced cost for links through a data center:
```yaml
transports:
  - tport: a_mesh
    type: mesh
    route:
      listen: *
      connect: [ host, host2, host3, host4 ]
      port: 5000
      cost: [ 100, 1000, 1000, 1000 ]
  - tport: b_mesh
    type: mesh
    route:
      listen: *
      connect: [ host, host2, host3, host4 ]
      port: 5001
      cost: [ 1000, 100, 1000, 1000 ]
  - tport: c_mesh
    type: mesh
    route:
      listen: *
      connect: [ host, host2, host3, host4 ]
      port: 5002
      cost: [ 1000, 1000, 100, 1000 ]
  - tport: d_mesh
    type: mesh
    route:
      listen: *
      connect: [ host, host2, host3, host4 ]
      port: 5003
      cost: [ 1000, 1000, 1000, 100 ]
```
This creates 4 equal mesh networks, each of which is preferred for part of the subject space. The connect and cost can be enumerated as connect, connect2, connect3, connect4 and cost, cost2, cost3, cost4 as well as an array.
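For example, the first mesh above could be written equivalently with enumerated fields instead of arrays (an illustrative rewrite of a_mesh; the field placement under route follows the other examples):

```yaml
tport: a_mesh
type: mesh
route:
  listen: "*"
  connect: host
  connect2: host2
  connect3: host3
  connect4: host4
  port: 5000
  cost: 100
  cost2: 1000
  cost3: 1000
  cost4: 1000
```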
TCP Encryption
The TCP type and mesh type links are encrypted using AES 128 bit in counter mode. The protocol above the link layer handles the authentication for trusting the peer and the messages that are received, described in Authentication. The encryption is set up by an ECDH exchange. Each side generates an ECDH keypair and sends the public key with a checksum and a 128 bit nonce value. Each side computes the secret key and uses the KDF to mix the secret with the nonce value to arrive at a 128 bit key and a 128 bit counter for sending and receiving. These are used to encrypt and decrypt the other side's bytes.
```
alice -> bob  [ 8 bytes checksum ] [ 32 bytes pub key ] [ 16 bytes nonce ]
bob -> alice  [ 8 bytes checksum ] [ 32 bytes pub key ] [ 16 bytes nonce ]

alice.secret = ECDH( bob public key, alice private key )
bob.secret   = ECDH( alice public key, bob private key )

alice.recv key+counter = KDF( secret[32] + bob.nonce[16] )   -> 64 bytes
alice.send key+counter = KDF( secret[32] + alice.nonce[16] ) -> 64 bytes
bob.recv key+counter   = KDF( secret[32] + alice.nonce[16] ) -> 64 bytes
bob.send key+counter   = KDF( secret[32] + bob.nonce[16] )   -> 64 bytes
```
The 32 byte secret will be the same on both ends. The nonce is a random 16 byte value. The KDF function mixes into the keys a preshared salt value, generated by ms_gen_key in a "config/.salt" file described in Configuration. Without this salt value, the key exchange will compute incorrect keys even though the secret is computed correctly.
The 8 byte checksum is a CRC of the pub key and the nonce, stored in big endian, so the first 4 bytes will be zero. The zero bytes cause an encrypted connection to an unencrypted endpoint to fail.
The 64 byte result of the KDF computation is folded with XOR to arrive at the 16 byte AES key and the 16 byte counter value.
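A sketch of the fold, assuming the 64 bytes are XOR-folded in half to 32 bytes, which are then split into the key and the counter (the exact fold order is an assumption):

```python
def fold_kdf_output(kdf64: bytes):
    """XOR-fold 64 bytes down to 32, then split into a 16 byte AES key
    and a 16 byte counter value."""
    assert len(kdf64) == 64
    folded = bytes(a ^ b for a, b in zip(kdf64[:32], kdf64[32:]))
    return folded[:16], folded[16:]   # (aes_key, counter)

key, ctr = fold_kdf_output(bytes(range(64)))
print(key.hex(), ctr.hex())
```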
Open PGM
PGM is a multicast protocol, which layers reliability on the native UDP multicast. The parameters for it declare the amount of memory used for buffering data and control the timers when retransmitting is necessary.
The type of PGM used is UDP encapsulated, using the port specified. The address specification has a network, a send address, and multiple receive addresses, formatted as network;recv1,..;send, so this is a valid address: 192.168.1.0;224.4.4.4,225.5.5.5;226.6.6.6, where the send address is the last part and the middle addresses are where packets are received. If the network part is unspecified, then the hostname is used to find the interface. If there is only one multicast address, then that is used for both sending and receiving.
Example tport_mypgm.yaml:
```yaml
tport: mypgm
type: pgm
route:
  listen: 192.168.1.0;224.4.4.4
  port: 4444
  cost: 100
```
Field | Default | Description
---|---|---
listen | ;239.192.0.1 | Multicast address
connect | ;239.192.0.1 | Multicast address
port | 9000 | UDP port
cost | 1000 | Cost of PGM network
mtu | 16384 | Maximum UDP packet size
txw_sqns | 4096 | Send window size
rxw_sqns | 4096 | Receive window size
txw_secs | 15 | Send window in seconds
mcast_loop | 2 | Loop through the host
The transmit and receive window sizes expand to the reliability time or the txw_secs parameter. When txw_secs is not set, then the reliability passed on the command line or as a configuration parameter is used. The receive window memory is not used until there is packet loss and a retransmission occurs. Unrecoverable packet loss occurs when the transmission window no longer has the sequences that are lost. The mcast_loop, when set to 2, allows two peers to share the same network on the same host. This causes packets to loop back through the interface and allows multiple PGM networks to coexist on the same multicast group.
In addition to the multicast networking, an inbox protocol is used for point to point messages. The network specified in the multicast address is used as the inbox network, with a random port.
The listen and connect addresses act similarly, two peers using different methods will communicate if the multicast send address matches one of the receive addresses and the inboxes are connected.
TCP Mesh
A TCP mesh is a group of peers which automatically maintain connections with every other peer. When a new peer joins the mesh, it opens a connection with all the other peers which are currently members of the mesh.
The timeout parameter causes the connecting peer to retry for this amount of time. When the timeout expires, the transport will not try to connect until told to do so again.
Multiple connect addresses are normally specified so that some connection likely succeeds if that network is running. All peers can specify multiple connect addresses since they use both listen and connect methods to join a network. After one connection succeeds, all the other connections in progress are stopped, and the list of mesh members is downloaded from the peers and those are connected.
Example tport_mymesh.yaml:
```yaml
tport: mymesh
type: mesh
route:
  listen: *
  connect: [ host, host2, host3, host4 ]
  port: 9000
  timeout: 0
  noencrypt: true
```
Field | Default | Description
---|---|---
listen | * | Passive listener
connect | localhost | Active joiner
device | | Use peer discovery
port | random | Listener or connect port
timeout | 15 | Active connect timeout
cost | 1000 | Cost of mesh links
noencrypt | false | Disable encryption
If the mesh is a stable network, then setting the timeout to a larger value or zero can prevent a network split where some parts of the network are isolated for a period of time. A restarted host is not affected as much by a timeout since it is rejoining an existing network. If a timeout expires, then an admin request to rejoin the network is possible through the console.
When a device parameter is used, then multicast is used through the name protocol to discover peers that are joining the same mesh, matching using the tport name. After discovering the peer, a TCP connection is used to join the mesh. The port can be random with a device, since the address is discovered rather than connected. Both the device and connect methods can be used.
The noencrypt parameter set to true disables tcp link encryption. Both the listener and connector must match this setting, otherwise they will close the connection after receiving the first bytes sent.
TCP Point-to-point
A TCP point to point connection to another peer. This is useful to create ad-hoc topologies at the network boundaries.
Example tport_mytcp.yaml:
```yaml
tport: mytcp
type: tcp
route:
  listen: eth0
  connect: host
  port: 9001
  timeout: 0
```
Field | Default | Description
---|---|---
listen | * | Passive listener
connect | localhost | Active joiner
device | | Use peer discovery
port | random | Listener or connect port
timeout | 15 | Active connect timeout
cost | 1000 | Cost of the TCP link
edge | false | A peer at the edge
noencrypt | false | Disable encryption
A TCP protocol is either a listener or a connector; the appropriate config is used at run time when a connect or listen is used to activate the port. When device is used to discover the peers through the multicast name protocol, the listeners are matched with the connectors. When more than one listener is discovered by a connector, then connections are made to each one.
Whether a configuration is used to connect or listen is specified by a listen or connect command or configuration. If multiple connections are specified by the connect parameter, then the first connection that is successful will cause the others to stop trying to connect.
The edge parameter set to true causes the passive peer to pool the connections on a single transport, similar to a multicast transport where the traffic is primarily through a gateway peer. The noencrypt parameter set to true disables tcp link encryption. Both the listener and connector must match this setting, otherwise they will close the connection after receiving the first bytes sent.
If the listen or connect parameters specify a port, as in "localhost:8000", then that port overrides the port parameter configured. A device name is resolved before the hostname DNS resolver is tried, so "eth0:8000" will resolve to the address configured on the eth0 device.
Tib RV
The RV protocol supports both the RV5 and RV6+ styles of clients. The RV6+ clients use the daemon for the inbox endpoint and don’t create sessions, the RV5 clients use a unique session for each connection and allow an inbox reply in the subscription start. These differences cause decades old software incompatibilities and pressure to re-engineer legacy messaging systems.
These clients usually specify the network and service they want to connect to, which is different from the other clients. When a client requests to connect to a multicast network, the Rai MS ms_server will start a PGM transport for it, unless an existing transport is already defined, named with an rv_ prefix and a service numbered suffix.
When the rv_7500 transport exists as a TCP mesh, then this network is remapped to the predefined transport when an RV client uses the service 7500, and the multicast network specified by the client is ignored. When no multicast network is specified, then no Rai MS transport is created and the existing transports are used.
Example tport_myrv.yaml:
```yaml
tport: myrv
type: rv
route:
  listen: *
  port: 7500
```
Field | Default | Description
---|---|---
listen | * | Passive listener
port | random | Listener port
use_service_prefix | true | Use a service namespace
no_permanent | false | Exit if no connections
no_mcast | false | Ignore multicast networking
no_fakeip | false | Use IPv4 address for session
Unless use_service_prefix is false, the traffic is segregated to the _rv_7500 namespace, where the service is 7500. If it is false, then all services that also have use_service_prefix set to false will share the same namespace. Without no_fakeip set to true, the session and inbox values are random and not based on the IPv4 address of the host. This allows RV networks to work without a routable IPv4 network across private address spaces that are common with NATs, VMs, and/or container networks.
NATS
NATS is a pub/sub system that is similar to RV with respect to subject schema with some extensions for queue groups and optionally persistent message streaming. The protocol support does not include the streaming components, only the pub/sub and queue groups. NATS does not have an inbox point-to-point publish scheme, it relies on the client to create a unique subject for this functionality.
Example tport_mynats.yaml:
```yaml
tport: mynats
type: nats
route:
  listen: *
  port: 4222
```
Field | Default | Description
---|---|---
listen | * | Passive listener
port | random | Listener port
service | _nats | Service namespace
network | none | Join a network
If the network is specified, then starting the NATS service will also join the network. A network format is as described in Services and Networks.
Redis
Redis has a pub/sub component with slightly different semantics: there is no reply subject for request/reply, and it uses the term channel to refer to a subscription. A pattern subscription uses a separate psub operator, and subscriptions and publishes are allowed on any series of bytes.
Example tport_myredis.yaml:
tport: myredis
type: redis
route:
  listen: *
  port: 6379
Field | Default | Description
---|---|---
listen | * | Passive listener
port | random | Listener port
service | _redis | Service namespace
network | none | Join a network
The data operators that work on cached structures like lists, sets, etc. are available when a shared memory key value segment is created and passed as a command line argument to the server (example: -m sysv:raikv.shm), or defined as a value in the config files (example: map: "sysv:raikv.shm").
If the network is specified, then starting the Redis service will also join the network. A network format is as described in Services and Networks.
Telnet
Telnet is a way to get a console prompt, but it doesn’t start by default. It uses the same transport config as the pub/sub protocols. It also can be used by network configuration tools to install a configuration remotely. A telnet client signals the service that it has terminal capabilities which enables command line editing.
Example tport_mytelnet.yaml:
tport: mytelnet
type: telnet
route:
  listen: *
  port: 22
Field | Default | Description
---|---|---
listen | * | Passive listener
port | random | Listener port
Web
Web handles http requests and websocket endpoints and integrates a web application that can be used to graph activity and show internal tables. The web application is compiled into the server, so no external file access is necessary.
Example tport_myweb.yaml:
tport: myweb
type: web
route:
  listen: *
  port: 80
  http_dir: "./"
  http_username: myuser
  http_password: mypassword
Field | Default | Description
---|---|---
listen | * | Passive listener
port | random | Listener port
http_dir | none | Serve files from this directory
http_username | none | Adds username to digest auth
http_password | none | Sets password for username
http_realm | none | Sets realm for username
htdigest | none | Load digest file for auth
If http_dir is not set, then this service does not access the filesystem for processing http get requests. It has a set of html pages compiled into the binary that it uses for viewing the server state.
If http_dir is set, then the files located in the directory will override the internal files. The html files and websocket requests also have a templating system which substitutes values from the server. If @(show ports) appears in an html page, it is replaced with an html <table> of ports. If the template "res" : @{show ports} is sent using a websocket, it expands to a JSON array of ports and the reply is "res" : [ports...].
Any of the commands from the console interface are now exposed on the http endpoint. Requesting "show ports" will respond with a JSON array of transports with the current totals of messages and bytes:
$ wget --http-user=myuser --http-password=mypassword -q -O - "http://localhost:80/?show ports"
[{"tport":"rv.0", "type":"rv", "cost":1000, "fd":13, "bs":"", "br":"", "ms":"", "mr":"", "lat":"", "fl":"SLI", "address":"rv://127.0.0.1:7500"},
 {"tport":"mesh4.1", "type":"mesh", "cost":1000, "fd":16, "bs":"", "br":"", "ms":"", "mr":"", "lat":"", "fl":"SLX", "address":"mesh://10.4.4.18:19500"},
 {"tport":"primary.2", "type":"tcp", "cost":1000, "fd":18, "bs":29500, "br":47324, "ms":229, "mr":355, "lat":"26.5ms", "fl":"C", "address":"robotron.1@tcp://209.237.252.104:18500"},
 {"tport":"secondary.3", "type":"tcp", "cost":1000, "fd":20, "bs":23276, "br":39134, "ms":181, "mr":311, "lat":"29.4ms", "fl":"C", "address":"edo.2@tcp://209.237.252.98:18500"}]
The websocket endpoint can also be used to subscribe subjects. When a message is published to the websocket, the format used is:
"subject" : { "field" : "value" }
This requires that the messages published can be converted to JSON or is already in JSON format.
The http_username / http_password or htdigest will cause http digest authentication to be used and require them for access. The above wget is used with the example configuration.
A htdigest file contains a list of users and can be created by the htdigest program distributed with the Apache packages.
$ htdigest -c .htdigest realm@raims myuser
Adding password for myuser in realm realm@raims.
New password: mypassword
Re-type new password: mypassword
$ cat .htdigest
myuser:realm@raims:56f52efe43dcf419e991ea6452ae6f06
Then tport_myweb.yaml is configured like this:
tport: myweb
type: web
route:
  listen: *
  port: 80
  htdigest: ./.htdigest
Only one realm can be used by the service. If http_realm is configured then that realm is used, otherwise the first realm in the htdigest file is used. If no realm is specified but a user and password are specified, then "realm@raims" is used.
Link State
The Forwarding Set
Each node in a network must construct a forwarding set for any message sent by any peer. A forwarding set instructs the node where to send a message so that all subscribers of it will see the message exactly one time, when the network is converged and stable.
A "converged network" is one where all peers agree that a link exists. If peer A has in it’s database a link to peer B, then peer B must also have a link to peer A. If a link is missing, then the network tries to resolve the difference by asking the peers with the discrepancy which is correct.
Every peer has a bloom filter that contains all of the subscriptions currently active. The links database tells each peer how the network can be traversed for full coverage and the bloom filter prunes the coverage by dropping the message when there are no subscriptions active that match the subject on the other side of the link.
A simple redundant network is a circle:
dyna -- ruby
 |       |
bond -- chex
If the cost of each of the links is set to the default 1000, then the forwarding set for dyna is the link to ruby and bond. When ruby and bond receive a message from dyna, only one of them will forward the message to chex. The path cost from dyna → ruby → chex is equal to the path cost from dyna → bond → chex. The forwarding algorithm tracks the equal cost paths and ranks them in order of peer age. In the case that ruby is older than bond, the ranking of these routes would be 1. dyna → ruby → chex and 2. dyna → bond → chex. The top 4 ranked routes are saved as the forwarding sets, and selected by the hash of the message subject. In this case, half of the subjects subscribed by chex and published from dyna would take the first path and the other half would take the second path.
The method of ranking the paths by peer age is used because the stability of the network is less affected when more transient peers are added and subtracted from the link state database.
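A minimal sketch of this selection, assuming a forwarding set of up to 4 precomputed equal cost routes ranked by peer age; the structure and function names below are illustrative and not the actual Rai MS types:

#include <stdint.h>

/* one ranked route of an equal cost forwarding set (illustrative types) */
struct fwd_route { uint32_t peer_uid; uint32_t tport_id; };

struct fwd_set {
  struct fwd_route route[ 4 ]; /* ranked equal cost routes, oldest peer first */
  uint32_t         count;      /* number of equal cost routes available, 1..4 */
};

/* pick one of the equal cost routes with the subject hash, so that the same
 * subject always travels the same path while different subjects spread over
 * the redundant links */
static const struct fwd_route *
select_route( const struct fwd_set *set, uint32_t subject_hash )
{
  return &set->route[ subject_hash % set->count ];
}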
Message Loss
Under normal conditions, the sequence of a message is one greater than the last sequence received. The sequence numbers are 64 bits, so they will never cycle to zero. The following conditions are possible when a sequence does not arrive in incrementing order from the last message received (a sketch of the check follows the list):
-
Publisher includes a time stamp
This causes the subscriber to synchronize the sequence numbers. The publisher will always include a time stamp when the first message of a subject is published, or when the last sequence is old enough to be cycled from the publisher sequence window.
-
The first message received
When a subscription start occurs it will usually not contain a time stamp, unless it is the first message published.
-
The message sequence is repeated
A sequence is less than or equal to the last sequence received. This indicates the message was already processed. The message is dropped.
-
The message sequence skips ahead
Some sequences are missing, indicating messages were lost. Notification of message loss is propagated to the subscriptions.
-
The message subject is not subscribed
The subscription may have dropped and the publisher has not yet seen the unsubscribe.
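A minimal sketch of this per subject check, assuming a structure that only tracks the last sequence received; the names are illustrative and not the actual Rai MS implementation:

#include <stdint.h>

enum seq_status { SEQ_IN_ORDER, SEQ_REPEATED, SEQ_LOSS, SEQ_RESYNC };

struct sub_seq { uint64_t last_seqno; };

static enum seq_status
check_seqno( struct sub_seq *s, uint64_t seqno, int has_time_stamp,
             uint64_t *lost_count )
{
  if ( has_time_stamp ) {        /* publisher restarted the stream or the   */
    s->last_seqno = seqno;       /* subject aged out of its sequence window */
    return SEQ_RESYNC;
  }
  if ( seqno <= s->last_seqno )  /* already processed, drop the message */
    return SEQ_REPEATED;
  if ( seqno != s->last_seqno + 1 ) {
    *lost_count   = seqno - s->last_seqno - 1; /* notify subscribers of loss */
    s->last_seqno = seqno;
    return SEQ_LOSS;
  }
  s->last_seqno = seqno;         /* normal case, next message in the stream */
  return SEQ_IN_ORDER;
}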
Multicast sequence numbers
The sequence numbers include a time frame when the publisher starts the message stream. This is the computation that creates a new sequence stream.
nanosecond time stamp = 1659131646 * 1000000000 = 0x17066b710b706c00

 1               8               16              24
 |-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-
 |0 0 0 1 0 1 1 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1
 |-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-
 32              40              48              56              64
 -+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|
 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0|
 -+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|

message sequence number = ( nano time >> 33 << 35 ) + 1 = 0x5c19adc000000001

 1               8               16              24
 |-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-
 |0 1 0 1 1 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 0 0 0
 |-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-
 32              40              48              56              64
 -+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1|
 -+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|
This truncates the nanosecond time stamp to approximately 10 second intervals, so a new time frame can only occur after 10 seconds. The time frame is stored in the upper 29 bits and will be valid until the year 2115. The sequence resolution within a time frame is 35 bits, or 34 billion sequences. These are rotated to new time frames when the sequence number is zero.
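The same computation written out in C, using the constants from the example above (the helper name is illustrative):

#include <stdint.h>

static uint64_t
new_sequence_stream( uint64_t nanosecond_time )
{
  /* keep the upper 29 bits as the ~10 second time frame, leave 35 bits of
   * sequence resolution, then start the stream at sequence 1 */
  return ( ( nanosecond_time >> 33 ) << 35 ) + 1;
}

/* from the example: new_sequence_stream( 0x17066b710b706c00 ) returns
 * 0x5c19adc000000001 */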
These are properties of the time frame encoded in the message sequence numbers:
-
A start of a new multicast stream sequence will use the current time, which is always after the last convergence time stamp. The current time is also used as needed when memory limitations prevent caching of the last sequence published. When the sequence is cached, the additional messages won’t change the time frame but will increment the sequence number.
-
A new subscription start or uncached sequence publish can verify that the first message received is greater than the network convergence time. This is used to validate that the message stream is uninterrupted to the start of the time frame, since message loss has not occurred since before the network convergence.
All of the transports are stream oriented, so a loss of unrecoverable network packets will cause connections to drop and a new convergence state to be reached by pruning the lost routes. All peers will agree on a time that convergence is reached. New time frames are created for all messages published so that the time frame constructed in any one peer is greater than the convergence time in all peers.
When routes are added to or subtracted from the network, the message routing is not stable until all peers have finished adjusting their view of the network. The peer that publishes a message may use a sub-optimal forwarding path to the recipients until they are notified that better paths are available with link state exchanges.
Publish sequence window
A map of subject to sequence numbers for published multicast messages is
maintained by each peer. This map rotates when a configured memory limit is
reached, pub_window_size
, and the window time interval is greater than a
configured time, pub_window_time
, which must be at least 10 seconds. When a
subject is rotated out of the window, the sequence number is restarted with a
new time frame.
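A sketch of that rotation rule with illustrative field names; the window rotates only when both the configured memory limit and the minimum window time have been exceeded:

#include <stdint.h>

struct pub_window {
  uint64_t mem_used;        /* bytes used by the subject -> seqno map */
  uint64_t window_start_ns; /* when the current window was started    */
};

static int
should_rotate_pub_window( const struct pub_window *w, uint64_t now_ns,
                          uint64_t pub_window_size,   /* e.g. 4 MB default   */
                          uint64_t pub_window_time_ns /* at least 10 seconds */ )
{
  return w->mem_used >= pub_window_size &&
         now_ns - w->window_start_ns > pub_window_time_ns;
}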
Subscription sequence window
A map of subject to sequence numbers for the subscriptions that a peer has is also maintained. This validates that the messages are processed in order, allows notification of message loss when the sequences skip, and does not allow a message to be processed twice. The memory limit for this is sub_window_size and the time limit is sub_window_time. When a subject is rotated out of the window, the publisher did not update it within the window time and the next sequence is treated as if a new subscription was created.
Message duplicates are avoided by discarding messages that are older than the
trailing edge of the subscription sequence window. The clock skew between
systems is estimated. The console command show skew
will display the
calculated clock skew between systems.
C.test[Jl8gk4f+gVaf60LxKtsaMg]@dyna[560]> show skew
 user | lat    | hb     | ref | ping   | pong    | time
 -----+--------+--------+-----+--------+---------+-------------
 A.1  | 187us  | 451us  | 0   | 104us  | -2.22us | 01:32:56.384
 B.2  | 304us  | 1.25ms | 1   | 207us  | -18.9us | 01:32:56.384
 D.3  | 174us  | 690us  | 0   | 77.2us | -3.73us | 01:32:56.384
 G.4  | 25.8ms | 4.5se  | 1   | 4.5se  | 4.49se  | 01:32:51.897
The pong calculation subtracts the round trip time and is the most accurate; the others disregard the latency of the network. The HB values are from time differences of directly attached peers using heartbeats and are shared with those not directly attached. The ref is the peer (0 = self, 1 = A.1) that originated the HB difference. The time is the estimated clock setting of the remote peer in the current timezone.
Configuration for sequence windows
The sizes and windows are in the parameters section of the config file and default to 4 megabyte (about 60,000 subjects for publishers and 20,000 for subscribers) and 10 seconds. The size of the windows will have an overhead of 48 bytes for publishers and 128 bytes for subscribers in addition to the subject size. The 10 second rotate timer could cause more memory to be used if lots of new subjects are published or lots of new subjects are subscribed within 10 seconds.
$ cat config/param.yaml
parameters:
  pub_window_size: 10 mb
  pub_window_time: 60 sec
  sub_window_size: 10 mb
  sub_window_time: 60 sec
Show loss
The show loss
console command displays the messaging statistics for the
peers.
A.test[XftVokMK+WK12CNuEaRFuA]@dyna[545]> show loss
 user | repeat | rep time | not sub | not time | msg loss | loss time    | ibx loss | ibx time
 -----+--------+----------+---------+----------+----------+--------------+----------+---------
 B.1  | 0      |          | 0       |          | 0        |              | 0        |
 D.3  | 0      |          | 0       |          | 766      | 20:42:24.431 | 0        |
 C.4  | 0      |          | 0       |          | 0        |              | 0        |
-
repeat — count of multicast messages received more than one time
-
rep time — last time of repeated messages
-
not sub — count of multicast messages received which were not subscribed
-
not time — last time of not subscribed
-
msg loss — number of multicast messages which were lost
-
loss time — last time of multicast message loss
-
ibx loss — number of messages which were lost from the inbox stream
-
ibx time — last time of inbox message loss
An inbox message loss is not unusual since the point to point messages are often used for link state exchanges and other network convergence functions. Inbox message loss is usually not as problematic as multicast message loss since there are often timers and retries associated with their usage.
Multicast message loss is much more difficult to recover from, since there are usually many multicast streams and tracking the state of each one is a problem solved by persistent message queues. This requires clients which track the state of the messages they consume and notify the queue when they are finished with processing them.
Notification of message loss
If a message arrives with a sequence which is not in order, it is forwarded with state indicating how many messages are missing, if that can be determined. The protocol handling of this notification is to publish a message indicating how many messages were lost.
RV protocol
The RV protocol publishes a message to the _RV.ERROR.SYSTEM.DATALOSS.INBOUND.BCAST subject with a count of lost messages. These are throttled so that only one is published per second after the first one is published.
Example:
mtype : "A" sub : _RV.ERROR.SYSTEM.DATALOSS.INBOUND.BCAST data : { ADV_CLASS : "ERROR" ADV_SOURCE : "SYSTEM" ADV_NAME : "DATALOSS.INBOUND.BCAST" ADV_DESC : "lost msgs" lost : 7 sub_cnt : 7 sub1 : "RSF.REC.PAC.NaE" lost1 : 1 sub2 : "RSF.REC.MTC.NaE" lost2 : 1 sub3 : "RSF.REC.MCD.NaE" lost3 : 1 sub4 : "RSF.REC.MCD.N" lost4 : 1 sub5 : "RSF.REC.SPM4.NaE" lost5 : 1 sub6 : "RSF.REC.MER.NaE" lost6 : 1 sub7 : "RSF.REC.MER.N" lost7 : 1 scid : 7500 }
Internal Protocol
The protocol is asynchronous, with timers to timeout RPCs and to throttle the rate at which peers back off retries. As a result of this, the message flow for a network configuration is variable and can change with different conditions.
The function of each message is encoded in the subjects with the arguments passed as field values with some common flags and options encoded in the message header.
Each message is authenticated with a session key using a message HMAC. The initial key exchange is signed by either the service private key or a configured user private key. The heartbeat messages are also authenticated with a hello key message HMAC derived from the service public key and the start time. These are messages that set up the initial key exchange before a session key is established, but they can be weakly authenticated since the service public key is encrypted at rest in the configuration and not shared over the network.
Any message that fails authentication is ignored.
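A sketch of the receive side check; lookup_peer() and hmac_128() are placeholders for the peer table lookup and the keyed MAC in use, only the flow of looking up the session key by bridge id, computing the HMAC over the frame, and comparing it to the Message Digest comes from the text:

#include <stdint.h>
#include <string.h>

struct peer_entry { uint8_t session_key[ 64 ]; /* looked up by bridge id */ };

extern struct peer_entry *lookup_peer( const uint8_t bridge_id[ 16 ] );
extern void hmac_128( const uint8_t *key, size_t key_len,
                      const uint8_t *msg, size_t msg_len, uint8_t out[ 16 ] );

static int
authenticate_frame( const uint8_t bridge_id[ 16 ], const uint8_t digest[ 16 ],
                    const uint8_t *frame, size_t frame_len )
{
  struct peer_entry *peer = lookup_peer( bridge_id );
  uint8_t calc[ 16 ];
  if ( peer == NULL )
    return 0;                              /* unknown sender, ignore         */
  hmac_128( peer->session_key, sizeof( peer->session_key ),
            frame, frame_len, calc );
  return memcmp( calc, digest, 16 ) == 0;  /* mismatched HMAC is ignored too */
}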
Field Values
Each field in a message is encoded with a type and length. This allows new fields to be added without disrupting the message parsing. The first 16 bits encode the type, length, and field id. The rest of the field encodes the value. All integers are encoded in big endian.
fid = BRIDGE(3), type = OPAQUE_16(4) ( opaque 16 bytes )

                                                                  144
 |-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+.. +
 |1 1 x x 0 1 0 0 0 0 0 0 0 0 1 1|                                     |
  ^ ^     ^.....^ ^.............^ ^....................................
  | |        |           |                        |
  | primitive type(4)   fid(3)            128 bit bridge
  fixed
The types defined are bool (size:1), unsigned int (size:2,4,8), opaque (size:16,32,64), string (max size:64k), long opaque (max size:4G).
The first two bits, fixed and primitive, indicate whether the type has a fixed length, and whether the value is a field (primitive) or a message (not primitive). A message is another group of fields and is always encoded as a long opaque with the primitive bit set to 0. A message payload is always encoded as a long opaque with the primitive bit set to 1.
The types are enumerated as:
Type | Value | Size
---|---|---
bool | 0 | 1 byte
unsigned short | 1 | 2 bytes
unsigned int | 2 | 4 bytes
unsigned long | 3 | 8 bytes
opaque 16 | 4 | 16 bytes
opaque 32 | 5 | 32 bytes
opaque 64 | 6 | 64 bytes
string | 7 | 16 bit length + up to 64K bytes
long opaque | 8 | 32 bit length + up to 4G bytes
The field values are aligned on 2 byte boundaries, so the value is padded one
byte when the value size is odd. There are currently 76 different field ids
(fid) and a maximum of 256 (defined in the header file msg.h
).
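A rough sketch of the 16 bit field header shown in the diagram above, assuming the layout fixed(1), primitive(1), two unlabeled bits, type(4), fid(8); the two bits marked x are treated as reserved here:

#include <stdint.h>

#define FID_BRIDGE    3
#define TYPE_OPAQUE16 4

static uint16_t
encode_field_hdr( int is_fixed, int is_primitive, uint8_t type, uint8_t fid )
{
  return (uint16_t) ( ( ( is_fixed     & 1 )   << 15 ) |
                      ( ( is_primitive & 1 )   << 14 ) |
                      ( ( type         & 0xf ) << 8  ) |
                          fid );
}

/* encode_field_hdr( 1, 1, TYPE_OPAQUE16, FID_BRIDGE ) yields 0xc403, the bit
 * pattern 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 when the x bits are zero */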
Message Framing
A message frame has 5 fixed length sections and 3 fields that are always present and use two bytes.
These header fields are:
Field | Size
---|---
Version | 1 bit
Message Type | 2 bits
Message Option | 5 bits
Message Size | 3 bytes
Subject Hash | 4 bytes
Bridge | 2 byte type + 16 bytes
Message Digest | 2 byte type + 16 bytes
Subject | 2 byte type + 16 bit length + up to 64K
The first 4 bytes are encoded as:
bytes 0 -> 3 are ver(1), type(2), opt(5), message size (24)

 1               8               16              24              32
 |-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|
 |1|0 0|0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0|
  ^ ^.^ ^.......^ ^.............................................^
  |  \      |                         |
  ver(1) opt(0)                24 bit size(160)
    type(0)
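A sketch of packing these 4 bytes, with ver(1 bit), type(2 bits), opt(5 bits) and the 24 bit message size laid out as above:

#include <stdint.h>

static uint32_t
encode_frame_word( uint8_t ver, uint8_t type, uint8_t opt, uint32_t size )
{
  return ( (uint32_t) ( ver  & 1    ) << 31 ) |
         ( (uint32_t) ( type & 3    ) << 29 ) |
         ( (uint32_t) ( opt  & 0x1f ) << 24 ) |
           ( size & 0xffffff );
}

/* the example above: ver 1, type 0 (Mcast), opt 0, size 160 gives
 * encode_frame_word( 1, 0, 0, 160 ) == 0x800000a0, written big endian */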
The Message Type encodes 4 classes of messages:
Type | Value | Description
---|---|---
Mcast | 0 | Multicast message with routeable payload
Inbox | 1 | Point to point message
Router Alert | 2 | System link state or subscription update
Heartbeat | 3 | Neighbor link keep alive
A message that has routeable data always has the Multicast or Inbox type set. The Inbox type message is also used for RPC style communication between peers. The Router Alert type message alters the routing database by modifying the link state or the subscription state. A Heartbeat type is a periodic presence update. The peers which are directly connected are responsible for detecting link failures.
The Option Flags field is a bit mask that encodes options for messages with Multicast and Inbox types that are routing data payloads to endpoints. These are:
Option | Value | Description
---|---|---
Ack | 1 | Endpoints ack the reception
Trace | 2 | All peers along the route ack the reception
Any | 4 | Message is an anycast, destination is one endpoint of many
MC0 | 0 | Message is using multicast path 0
MC1 | 8 | Message is using multicast path 1
MC2 | 16 | Message is using multicast path 2
MC3 | 24 | Message is using multicast path 3
The message size does not include the first 8 bytes, so the message frame size is 8 + the message size field. If the size is greater than 24 bits, then the next 32 bits are used to encode the size and the subject hash is calculated from the subject.
The Bridge, Message Digest and Subject are encoded in Type Length Value format. The Bridge is a 128 bit identity of the sender. The Message Digest is the authentication field. The receiving peer will authenticate that the message is valid by using the Bridge to look up the 512 bit session key of the sender and calculate an HMAC using the message data with the session key and compare it to the value contained in the Message Digest. In addition, there are sequence numbers and time stamps present that prevent the replay of each message frame.
The 4 multicast path options will select one of the equal cost paths calculated from the current link state. Every peer can calculate these paths using the same replicated link state database; this results in 4 forwarding trees to the same destinations if there are enough redundant links.
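Since MC0 through MC3 occupy the upper two bits of the 5 bit option field (values 0, 8, 16, 24 in the table above), the path number can be recovered from the options with a shift, as in this sketch:

#include <stdint.h>

#define OPT_ACK   1
#define OPT_TRACE 2
#define OPT_ANY   4

static uint8_t
multicast_path( uint8_t opt )
{
  return ( opt >> 3 ) & 3;  /* MC0 -> 0, MC1 -> 1, MC2 -> 2, MC3 -> 3 */
}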
System Subjects
The peers exchange messages to authenticate new peers, synchronize the link state of the network, subscription updates, and heartbeats to maintain neighbor links. These types of messages have unique subject prefixes as well as bits in the message type header indicating whether it is special.
There are 7 classes of subject prefixes used:
Prefix | Description
---|---
_I. | Inbox point to point
_M. | Generic multicast message
_X. | Heartbeat link presence message
_Z. | Link state broadcast message
_S. | Normal subscription multicast message
_P. | Pattern subscription multicast message
_N. | Peer statistics multicast message
A broadcast style of forwarding, used by the _Z. subjects, is different from multicast forwarding. It will flood the authenticated peers in the network, adjusting each peer’s routing database as it is received. It uses this type of forwarding because this kind of update may cause the multicast forwarding to be temporarily incomplete until the network converges again.
The forwarding path for the Inbox, Heartbeat and broadcast subjects does not follow the multicast forwarding path, so they can’t be subscribed.
There is a separate sequence number domain defined for these because of the idempotent nature of maintaining the replicated state of the network. If a peer misses messages for delta changes in the subscriptions or links database, the state is reinitialized by replicating it from an up to date peer.
The multicast subjects follow normal forwarding rules. The _M prefix is used for a multicast ping and a multicast link state sync.
The _N prefix has unique subjects for link and peer statistics like messages sent or received, bytes sent or received, as well as adjacency notifications. These are used to monitor an individual node or a group of them with pattern subscriptions. These stats are not sent unless there are subscriptions open.
Heartbeat Subjects
These are sent on a link between directly connected peers.
Subject | Description
---|---
_X.HELLO | First message sent
_X.HB | Periodic message
_X.BYE | Last message sent
_X.NAME | Link discovery message
-
_X.HELLO and _X.HB messages have two functions: the first is to initiate the authentication key exchange, the second is to keep a peer up to date with the last sequence numbers used by the subscription and link state. When heartbeats are not received within 1.5 intervals, the link is deactivated; the default interval is 10 seconds, so a heartbeat expected at :10 deactivates the link at :15 (a sketch of this check follows the list). When all of the direct links to a peer are inactive, then the peer is unauthenticated and marked as a zombie. The heartbeat timeout does not depend on a transport timeout, like a TCP reset. The result of this behavior is that overloaded or congested links that delay messages for longer than 1.5 times the heartbeat interval may incur message loss. This puts an upper bound on the link latency and alleviates back pressure to the publisher.
-
_X.BYE causes the peer to be unauthenticated and dropped from the peer db.
-
_X.NAME messages are multicast to a device for presence detection. Links between peers are only established when the type and name of a transport is matched within a service.
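A sketch of the 1.5 interval expiry check from the _X.HB description above, with illustrative field names:

#include <stdint.h>

struct link_state {
  uint64_t last_hb_ns;     /* time the last _X.HELLO or _X.HB was received */
  uint64_t hb_interval_ns; /* heartbeat interval, default 10 seconds       */
};

/* a heartbeat expected at :10 that never arrives deactivates the link at :15 */
static int
link_expired( const struct link_state *l, uint64_t now_ns )
{
  return now_ns - l->last_hb_ns > l->hb_interval_ns * 3 / 2;
}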
Link State Subjects
These are broadcast flooded to authenticated peers.
Subject | Description
---|---
_Z.ADD | New peer added to peer db
_Z.DEL | Dropped peer from peer db
_Z.BLM | Subscription bloom filter resized
_Z.ADJ | Adjacency changed, link added or removed
-
_Z.ADD is broadcast when a new peer is added to the peer db, usually as a result of authentication, and also in the case when the network splits and the peers are joined again.
-
_Z.DEL is broadcast when a peer sent a _X.BYE or if it is no longer reachable because all routes to it are down.
-
_Z.BLM is broadcast when a peer resizes the bloom filter associated with the subscriptions and patterns it has open, this occurs approximately when crossing powers of two subscription counts (currently at 31, 62, 124, 248, …).
-
_Z.ADJ notifies when a peer adds or subtracts a link to another peer. It increments the link state sequence number so that peers apply this update only when the link state reflects the current state, otherwise an RPC synchronization request is used (_I.[bridge].sync_req) to resync.
Subscription Subjects
These are multicast to authenticated peers. They are updates to the bloom filter that can be missed and resynchronized with _Z.BLM or a resync RPC request.
Subject | Description
---|---
_S.JOIN | Start a subscription
_S.LEAV | Stop a subscription
_P.PSUB | Start a pattern subscription
_P.STOP | Stop a pattern subscription
-
_S.JOIN and _S.LEAV add and subtract subscriptions to a subject.
-
_P.PSUB and _P.STOP add and subtract pattern subscriptions. These contain a pattern type as well as the pattern string. The pattern types currently supported are either a RV style wildcard or a Redis glob style wildcard.
Inbox Subjects
The format of a subject with an _I. prefix also encodes the destination of the message by appending the 128 bit bridge id in base64.
Example:
_I.duBVZZwXfwBVlYgGNUZQTw.auth
All of the peers along the path to the destination use this bridge id to forward the message using the rules for the point to point route of the destination peer. This may be a TCP link or it may be a UDP Inbox link in the case of a multicast PGM transport. The suffix of the inbox subject indicates the type of request or reply it is. If the suffix is an integer, then the endpoint is not a system function, but information requested by the console session or a web interface that is usually converted to text and displayed.
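A sketch of building such a subject; base64_nopad() stands in for any base64 encoder that omits padding, since the 128 bit bridge id appears as 22 base64 characters:

#include <stdint.h>
#include <stdio.h>

extern size_t base64_nopad( const uint8_t *in, size_t in_len,
                            char *out, size_t out_len );

static void
make_inbox_subject( const uint8_t bridge_id[ 16 ], const char *suffix,
                    char *subject, size_t subject_len )
{
  char   b64[ 24 ];
  size_t n = base64_nopad( bridge_id, 16, b64, sizeof( b64 ) );
  b64[ n ] = '\0';
  /* e.g. "_I.duBVZZwXfwBVlYgGNUZQTw.auth" */
  snprintf( subject, subject_len, "_I.%s.%s", b64, suffix );
}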
These suffixes are currently recognized:
Suffix | Description
---|---
auth | Request authentication, peer verifies with user or service pub key
subs | Request and match active subscriptions strings with a pattern
ping | Request a pong reply, also has seqnos for maintaining state
pong | A reply to a ping, has latency information and update clock skew
rem | Remote admin request, run a console command from another peer
add_rte | After authenticated with peer, it will add other peers it knows
sync_req | Peer sends when it finds an old peer db or subscription state
sync_rpy | Response to a sync_req, includes current state if it is out of date
bloom_req | Peer requests bloom state, currently peers use adj_req instead
bloom_rpy | Response to a bloom_req, contains the bloom map of the subscriptions
adj_req | Peer requests when it finds an old link state or subscription state
adj_rpy | Response to an adj_req, contains an up to date link state and bloom map for peer
mesh_req | Peer requests when it detects a missing mesh member
mesh_rpy | Response to a mesh_req, contains missing link URLs
trace | Response to messages which have the Trace option flag in header
ack | Response to messages which have the Ack option flag in header
any | Encapsulates a peer _INBOX message, for point to point routing
-
Auth does a key exchange between two peers. After completing successfully, each peer has a session key for the other. This allows messages sent by the other peer to be authenticated using the Message Digest field.
-
Subs is a request for the open subscriptions. It is used by the console and the web interface for examining the network. The RPC reply is always a numeric string to forward to the terminal or web page that requested it.
-
Ping and pong are latency gathering functions for any two peers in the network, not necessarily directly connected. The current sequence numbers for the link state and subscription state are also exchanged for synchronizing peers which are not directly connected.
-
Rem is a remote console command execution, used in the console and web interfaces.
-
Add_rte is used after the auth key exchange to replicate the peer db to a new peer. This initial peer db only contains the names and bridge ids, so the new peer must request session keys, link state and subscription state for peers it does not already know about.
-
Sync_req and sync_rpy are used to replicate the session keys. If a new peer does not have the session info from a _Z.ADD or an add_rte, it will request it from the peer that notified of the unknown peer session. This will often be the case after authentication occurs and the new peer receives an add_rte from an older peer that has a db with the current state of the network. This is the only other way that the unique session keys for each peer are distributed besides directly authenticating with a key exchange. The sync_rpy also includes the link state and subscription bloom filter of the requested peer.
-
Bloom_req and bloom_rpy are RPCs for the subscription bloom filter. The adj_req and adj_rpy are used instead for this info.
-
Adj_req and adj_rpy are the main method by which peers recover the current link state and subscription state. They work in an RPC request/response style. The request contains the sequence numbers that the source peer has in its db. The destination peer compares these numbers with its own db and replies when a sequence needs updating. Usually the destination peer is the one that the source needs synchronized, but a closer peer can be queried as well. This occurs when a lot of peers need to resynchronize as a result of a network split and reconnect.
-
Mesh_req and mesh_rpy are RPCs for distributing URLs for peers in the same mesh network. When a peer connects to a mesh, it uses the initial connection to find the addresses of all the other peers in the mesh with this RPC.
-
Trace and ack are sent as a multicast message is forwarded with the Message Options set in the header. These can be requested from a console publish using the "trace" or "ack" commands.
-
Any encapsulates an _INBOX point to point message and forwards it to the correct peer. An _INBOX publish does not have a destination other than a unique subject that another peer has subscribed, for example "_INBOX.7F000001.2202C25FE975070A48320.>". The peer that encapsulates this message finds the possible destinations by testing the bloom filters it has and then forwards to the matching peers. The usual case is that there is only one matching destination.
Example Message Flow
Two peers key exchange, ruby connecting to dyna:
Packet | Subject | Source | Destination | Description
---|---|---|---|---
ruby.1 | _X.HELLO | ruby | dyna | initial hello message after connection
dyna.1 | _I.xq6vl+2HcoDxtt+7lC7dGQ.auth | dyna | ruby | dyna authenticates with ruby
ruby.2 | _I.wwEnbQEY2FMuwZGSjpi3jQ.auth | ruby | dyna | ruby authenticates with dyna
ruby.2 | _Z.ADD | ruby | dyna | ruby adds dyna to peer db
ruby.2 | _Z.ADJ | ruby | dyna | ruby adds link to dyna
dyna.2 | _Z.ADJ | dyna | ruby | dyna adds link to ruby
dyna.2 | _I.xq6vl+2HcoDxtt+7lC7dGQ.auth | dyna | ruby | dyna confirms authentication
dyna.2 | _Z.ADD | dyna | ruby | dyna adds ruby to peer db
Ruby connecting to dyna, a member of a network of 4 nodes: dyna, zero, one, and two. This is the message flow between ruby and dyna, which completes the initial synchronization of ruby.
Packet | Subject | Source | Destination | Description
---|---|---|---|---
ruby.1 | _X.HELLO | ruby | dyna | initial hello message after connection
dyna.1 | _I.q6pEpnzNyANEZKKp29532Q.auth | dyna | ruby | dyna authenticates with ruby
ruby.2 | _I.tXB702RHKF0M69dl7K7vrw.auth | ruby | dyna | ruby authenticates with dyna
ruby.2 | _Z.ADD | ruby | dyna | ruby adds dyna to peer db
ruby.2 | _Z.ADJ | ruby | dyna | ruby adds link to dyna
ruby.2 | _I.tXB702RHKF0M69dl7K7vrw.adj_req | ruby | dyna | ruby requests adjacency of dyna
dyna.2 | _Z.ADJ | dyna | ruby | dyna adds link to ruby
dyna.2 | _I.q6pEpnzNyANEZKKp29532Q.auth | dyna | ruby | dyna confirms authentication
dyna.2 | _Z.ADD | dyna | ruby | dyna adds ruby to peer db
dyna.2 | _I.q6pEpnzNyANEZKKp29532Q.add_rte | dyna | ruby | dyna populates ruby peer db of other peers
dyna.2 | _I.q6pEpnzNyANEZKKp29532Q.adj_rpy | dyna | ruby | dyna replies to adj_req, links to other peers
ruby.3 | _I.tXB702RHKF0M69dl7K7vrw.sync_req | ruby | dyna | ruby requests sync of peer zero from dyna
ruby.3 | _I.tXB702RHKF0M69dl7K7vrw.sync_req | ruby | dyna | ruby requests sync of peer one from dyna
ruby.3 | _I.tXB702RHKF0M69dl7K7vrw.sync_req | ruby | dyna | ruby requests sync of peer two from dyna
dyna.3 | _I.q6pEpnzNyANEZKKp29532Q.sync_rpy | dyna | ruby | dyna replies key, links, bloom for peer zero
dyna.3 | _I.q6pEpnzNyANEZKKp29532Q.sync_rpy | dyna | ruby | dyna replies key, links, bloom for peer one
dyna.3 | _I.q6pEpnzNyANEZKKp29532Q.sync_rpy | dyna | ruby | dyna replies key, links, bloom for peer two
There is also message flow between dyna and zero, one, two. This is the flow between dyna and zero. The message flow between dyna and one, dyna and two is the same as dyna and zero.
Packet | Subject | Source | Destination | Description
---|---|---|---|---
dyna.1 | _Z.ADJ | dyna | zero | dyna notifies the new link from dyna to ruby
dyna.1 | _Z.ADD | dyna | zero | dyna notifies the add of ruby to peer db
dyna.1 | _Z.ADJ | ruby | zero | forward from ruby for new link from ruby to dyna
zero.1 | _I.tXB702RHKF0M69dl7K7vrw.sync_req | zero | dyna | zero requests sync of peer ruby from dyna
dyna.2 | _I.ia988C6TtC6/L3JC6D3GqA.sync_rpy | dyna | zero | dyna replies key, links, bloom for peer ruby
zero.2 | _Z.ADD | zero | dyna | zero notifies the add of ruby to peer db
Adding ruby to the network ripples through the directly connected peers, which discover the new peer from the broadcasting of the _Z.ADD messages and then synchronize with each other to merge the ruby session key, the link state, and the subscription bloom state into the network state.
rvd Compatibility
rvd Arguments
If ms_server
is started in rvd
compatible mode, it will automatically start
a rv protocol on 7500 and a web service on 7580 unless arguments are present
that modify this. The protocol that is used between daemons is not compatible
with rvd
, but it does allow rv
clients to connect and communicate. In
other words, the client side is compatible, but the network side is not.
These arguments are recognized:
$ ms_server -help
 -cfg               : config dir/file (default: exe_path/rv.yaml)
 -reliability       : seconds of reliability (default: 15)
 -user user.svc     : user name (default: hostname)
 -log               : log file
 -log-rotate        : rotate file size limit
 -log-max-rotations : max log file rotations
 -no-permanent      : exit when no clients
 -foreground        : run in foreground
 -listen            : rv listen port
 -no-http           : no http service
 -http              : port for http service (default: listen + 80)
 -no-mcast          : no multicast
 -console           : run with console
Service Key Configuration
Without any arguments, the config file rv.yaml
is loaded from the directory
that ms_server
is installed. This config file can be generated with the
ms_gen_key
program. It should be the same for each instance that is joining
the same network and service, since it contains the service key pair that
authenticates the daemon with other daemons located on the network.
If ms_server
is installed in /usr/local/bin
then this can generate the
default config file for it in rvd
mode:
$ ms_gen_key -y -s rvd -o /usr/local/bin/rv.yaml
create dir config               -- the configure directory
create file config/.salt        -- generate new salt
create file config/.pass        -- generated a new password
create file config/config.yaml  -- base include file
create file config/param.yaml   -- parameters file
create file config/svc_rvd.yaml -- defines the service and signs users
done - Output config to "/usr/local/bin/rv.yaml"
The /usr/local/bin/rv.yaml
file must be installed on every machine that
connects to the network and expects to communicate with the initial machine.
The contents define the service key pair:
$ cat /usr/local/bin/rv.yaml
services:
  - svc: rvd
    create: 1663653977.579093187
    pri: QQ5FR17BZktlJnxW/Ln3YExIoq12rf725FEysQwjGJRSNmgskzUA70fQCivq...
    pub: IskYDB7cvb1TIiaGZQ7ZAtWAlwhvGa/7rEfyiRKVp2U10sH3Yl6Eo19c0J1V...
parameters:
  salt_data: hDqyoJ9JSXEEBpiueoNPDEqxy3nsEOt7uoDrSvn4DlSvrLZDNQKG3fmK...
  pass_data: M+ALrLzVLaf/2OlRd7FTstX6pzAF66KQR86EhCxlwXY
The above service key pair is unique for every ms_gen_key
execution. The
private key is used to sign the authentication messages exchanged between
daemons, and the public key is used to verify that the peer is allowed to
exchange messages on the network. Unauthenticated peers will be ignored.
Starting in rvd Compatibility Mode
If the ms_server
is linked to rvd
and run that way, it will run in
compatibility mode:
$ ln -s /usr/local/bin/ms_server /usr/local/bin/rvd
$ /usr/local/bin/rvd
rvd running at [::]:7500
web running at [::]:7580
moving to background daemon
Unless the -foreground or the -console options are used, it forks itself to detach from the terminal it was started in. ms_server will also run in compatibility mode when one of the arguments above is used; for example, ms_server -listen 7501 -http 7581 -reliability 60 will run in compatible mode.
If there is already a rvd
running on port 7500, it will fail to start and
exit:
$ rvd
0919 23:13:08.635! rvd.0 listen *:7500 failed
0919 23:13:08.635! web: failed to start web at *.7580
A HUP signal will cause it to exit:
$ killall -HUP rvd
Connecting to Networks
The network parameter that the client specifies controls which network that
the ms_server
joins. It can specify a multicast address, TCP connections, or
a TCP mesh. Only daemons which connect to the same network will communicate.
The formats of these are:
Network | Description
---|---
eth0;239.192.0.1 | PGM multicast address
eth0;mesh | Mesh network
eth0;tcp.listen | TCP listen
eth0;tcp.connect | TCP connect
eth0 | ANY connect
(empty) | no network
A mesh network causes all the daemons to connect with one another by listening to a random port and multicasting that port to eth0. When other daemons receive this message, they will establish TCP connections with each other daemon.
A TCP network causes the listeners to multicast their random ports to eth0. When daemons that have tcp.connect as a network receive this message, they will connect to the listener. Multiple TCP listeners can exist on the same network. The result of having two "eth0;tcp.listen" specifications and two "eth0;tcp.connect" would be that both connectors will establish connections to both of the listeners.
The PGM multicast address uses UDP encapsulated multicast on the service port using OpenPGM and a UDP point to point protocol for inbox messaging.
The sockets will be bound to the eth0 interface with random ports, except for the PGM socket, which uses a wildcard address for joining the multicast and the service port for sending messages. Multiple services can join the same network, so -service 7500 and -service 7600 can coexist using the same network specification.
When two ms_server
instances are using the network "eth0;mesh" on service
7500 and service 7600, the ports console command will show these networks:
host1_7500.rv[+u7D0t7Cf5MP2USlooBtyA]@host1[632]> ports
tport     | type | cost | fd | ... | fl   | address
----------+------+------+----+-----+------+-------------------------------------------
rvd.0     | rv   |      | 13 |     | SLI  | rv://[::]:7500
rv_7500.1 | mesh | 1000 | 19 |     | SLXD | mesh://10.88.0.2:37277
rv_7500.2 | mesh | 1000 | 21 |     | X    | host2_7500.1@mesh://10.88.0.3:37720
rv_7600.3 | mesh | 1000 | 24 |     | SLXD | mesh://10.88.0.2:37109
rv_7600.4 | mesh | 1000 | 26 |     | X    | host2_7600.1@mesh://10.88.0.3:42620
web       | web  |      | 14 |     | S    | web://[::]:7580
10.88.0.2 | name |      | 17 |     | S    | name://10.88.0.2:59432;239.23.22.217:8327
The ANY specifier can either connect to a mesh or a TCP listener, depending which is present.
The empty network does not attempt to connect to anything, but it will find other services through existing connections.
If there exists a rv_7500 transport in the configuration (configured in rv.yaml or the -cfg argument), this overrides any client specified network connection for service 7500, so the client network argument is ignored.
The Peer Names
Each ms_server
instance uses the hostname of the machine to identify itself
unless the -user argument is used to specify another name. The daemon port
is appended to the user name so that multiple daemons appear as hostname_7500
and hostname_7600 when -listen 7500 and -listen 7600 are used for two different
daemon instances.
Console
Description of the Console
The console of ms_server
is available when the -c option is used or when a
Telnet protocol is defined. It offers command line editing and
completions. It can be used to define, start, or stop connections between
instances and also modify which IPC protocols are running for clients to use.
It also has many ways to examine and debug the network.
The output is usually colorized if the terminal supports it, with green and red used for log messages (normal and error) and white used for cli command execution results. Messages received are also printed colorized: green for field names, blue for field types, white for field values.
The user names and the transport names usually have an integer number appended
to them, for example lex_a2.3
is the user lex_a2
that has a uid
of 3.
This indicates either the uid
or the tport_id
of the identifier. The
string identifiers of users and transports can contain duplicates, since they
are identified using the bridge id. The bridge id is a unique random 128 bit
nonce, the strings attached to the users and transports are tags which usually
are unique, but not necessarily. The users and transports are kept in their respective tables, and the uid and tport_id are indexes into these tables.
The *
is often used for uid 0 so that it stands out, since it is the peer
that the console is attached to. The tport_id
of 0 is also special, that is
where the client protocols are attached through local IPC, for example, a TCP
connection to 127.0.0.1:7500.
The command string entered into the cli will execute if it has enough characters
to distinguish it from the prefixes of other commands. If the string pi is
entered, then the command ping
will run, since pi is a unique prefix of
ping
. The show
prefix is optional when the command matches the second part
of the show
command, so pe will match and run the show peers
command.
The shortened command run t test
will match and run the show running
transport test
command.
Help Screen
The following is the help screen, displayed when "help" is entered at the cli.
Command | Description
---|---
ping [U] | Ping peers and display latency of return
tping [U] | Ping peers with route trace flag
mping [P] | Multicast ping all peers using path P
remote U C | Run command C remotely on peer U
connect T | Start tport connect
listen T | Start tport listener
shutdown T | Shutdown tport
network S N | Configure service and join network
configure transport T | Configure tport T
configure parameter P V | Configure parameter P = V
save | Save current config as startup
show subs [U] [W] | Show subscriptions of peers
show seqno [W] | Show subject seqno values for pub and sub
show adjacency | Show the adjacency links
show peers | Show active peers
show ports [T] | Show the active ports
show cost [T] | Show the port costs
show status [T] | Show the port status with any errors
show routes [P] | Show the route for each peer for path P (0-3)
 | Show urls of connected peers
show tport [T] | Show the configured tports
show user [U] | Show the configured users
 | Show event recorder
 | Show current log buffer
 | Show system seqno and time values
 | Show system seqno and sums
show pubtype | Show system publish type received
show inbox [U] | Show inbox sequences
show loss | Show message loss counters and time
show skew | Show peer system clock skews
 | Show reachable peers through active tports
show tree [U] | Show multicast tree from me or U
show path [P] | Show multicast path P (0→3)
show forward [P] | Show forwarding P (0→3)
 | Show fd statistics
 | Show fd buffer memory usage
show windows | Show pub and sub window memory usage
show blooms [P] | Show bloom centric routes for path P (0-3)
 | Show users which have a bloom that match sub
 | Show network description for node graph
 | Show routing cache geom, hits and misses
 | Show poll dispatch latency
 | Show rv hosts and services
 | Show rv subscriptions
 | Show rpcs and subs running
show running | Show current config running
show running transport T | Show transports running, T or all
show running service S | Show services running config, S or all
show running user U | Show users running config, U or all
show running group G | Show groups running config, G or all
show running parameter P | Show parameters running config, P or all
show startup | Show startup config
show startup transport T | Show transports startup, T or all
show startup service S | Show services startup config, S or all
show startup user U | Show users startup config, U or all
show startup group G | Show groups startup config, G or all
show startup parameter P | Show parameters startup config, P or all
sub S [F] | Subscribe subject S, output to file F
unsub S [F] | Unsubscribe subject S, stop output file F
psub W [F] | Subscribe rv-wildcard W, output to file F
punsub W [F] | Unsubscribe rv-wildcard W, stop output file F
gsub W [F] | Subscribe glob-wildcard W, output to file F
gunsub W [F] | Unsubscribe glob-wildcard W, stop output file F
snap S [F] | Publish to subject S with inbox reply
pub S M | Publish msg string M to subject S
trace S M | Publish msg string M to subject S, with reply
ack S M | Publish msg string M to subject S, with ack
rpc S M | Publish msg string M to subject S, with return
any S M | Publish msg string M to any subscriber of S
cancel | Cancel and show incomplete (ping, show subs)
 | Mute the log output
 | Unmute the log output
 | Reseed bloom filter
debug I | Set debug flags to ival I
wevents F | Write events to file
die [I] | Exit without cleanup, with status 1 or I
 | Exit console
The arguments in square brackets are optional, the letters used above are:
-
U — User, the name of an ms_server instance, which is often the hostname of the machine.
-
P — Path, a multicast path, numbered 0 to 3. This selects a precomputed path that all ms_server instances use to forward messages. It will only be different when there are redundant links with a cost that is less or equal to the primary path 0.
-
T — Transport, the name of a connection endpoint that messages are routed through.
-
S — Service or Subject depending on context. The name or number of a service, for example 7500 is the default RV service. A subject is any string of characters.
-
N — Network, format described in Connecting to Networks.
-
G — Group, defines a group of users, not currently used.
-
F — File, a path in the file system.
-
M — Message, a string of characters, as the console is limited to message formats that can be typed into the cli (string and json).
-
I — Integer
Testing Connectivity with Ping
-
ping [U]
-
tping [U]
-
mping [P]
These commands send a message to a peer and display the message returned. The
tping
command also sets the trace flag in the message sent so that all peers
along the path will also send a message back. This is useful in the way that
traceroute is useful, to find an unusual latency report or dropped messages.
The ping
and tping
optionally have an argument that specifies the name of
the peer to send the message. If no argument is used, then every peer
currently active will be sent a message. These messages are sent over the link
that is handling the inbox point to point messages. The subject of a ping
message uses the inbox format _I.<nonce>.ping
, where the nonce identifies the
destination peer. The return uses the _I.<nonce>.N
inbox subject, where
nonce identifies the peer of the sending console. The N part of the subject is
setup by the console to identify what the sending operation was and is used in
the reply field of the original message.
The mping command uses a multicast path instead of an inbox path. The multicast path
is numbered and is added to the message header so that all peers which receive
and route this message will use the same path. All peers that receive it will
send an inbox reply message, similar to ping
. The subject used by the sender
is _M.ping
, which all peers are subscribed to. The multicast paths are
numbered 0 to 3, so mping 0
will use the first path, and mping 3
will use
the last path. Using different paths can be useful to check that all redundant
links in use are active and forwarding. The reply also includes which port the
message was received on, which will match the path 3 network path. The path 0
is often the same as the inbox path, except in the case of PGM, where inbox is
a UDP point to point protocol.
If the network is not yet stable, sometimes a ping operation will not complete.
When this occurs, use the cancel
command to show the completed and the
incomplete values. When a ping operation is started, the console estimates
the number of replies that are expected and waits for these to complete before
displaying the results. The tping
will display the acks of the message
as they are received but waits for the final results.
Example ping
.
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[0]> ping
user | cost | lat | tport | peer_tport
----------+------+-------+-------------+-------------
pic_a2.1 | 1000 | 189us | pic_amesh.2 | pic_amesh.2
pic_a4.3 | 1000 | 184us | pic_amesh.4 | pic_amesh.4
pic_a3.2 | 1000 | 214us | pic_amesh.3 | pic_amesh.3
pic_a.4 | 1000 | 219us | pic_amesh.5 | pic_amesh.6
lex_a.29 | 2000 | 296us | pic_amesh.5 | fo_mesh.12
lee_a.26 | 2000 | 340us | pic_amesh.5 | fo_mesh.12
lex_a4.17 | 3000 | 389us | pic_amesh.5 | lex_amesh.5
...
Example mping
.
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[1]> mping 1
user | cost | lat | tport | peer_tport
----------+------+--------+-------------+-------------
pic_a.4 | 1000 | 146us | pic_amesh.5 | pic_amesh.6
pic_a2.1 | 1000 | 158us | pic_amesh.2 | pic_amesh.2
pic_a4.3 | 1000 | 199us | pic_amesh.4 | pic_amesh.4
pic_a3.2 | 1000 | 245us | pic_amesh.3 | pic_amesh.3
edo_a.9 | 2000 | 265us | pic_amesh.5 | fo_mesh.12
lex_a.29 | 2000 | 278us | pic_amesh.5 | fo_mesh.12
lee_a.26 | 2000 | 279us | pic_amesh.5 | fo_mesh.12
...
The tport
field is where the reply inbox message was received, the
peer_tport
is where the ping
message was received at the peer.
Remote Command Execution
-
remote U C
Remote
will message a command to another peer, run it in its console and
return the result. This is useful because most often, a peer will not have a
console, a web interface, or a telnet protocol active. Without remote
, the
peer would need to be restarted in order to change the configuration or start a
console. With remote
, you could connect a peer with authentication, encryption
and a console to the network temporarily, make a change, then disconnect the
peer.
Example of remote
.
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[4]> rem lee_a1 show pubtype
from lee_a1.19:
type | recv_count | send_count
-----------------+------------+-----------
u_session_hello | 0 | 1
u_session_hb | 16217 | 16218
u_peer_add | 113 | 31
u_bloom_filter | 39 | 3
u_adjacency | 67 | 4
...
Update and Show the Configuration
-
configure transport T
-
configure parameter P V
-
save
-
show running
-
show running transport T
-
show running service S
-
show running user U
-
show running group G
-
show running parameter P
-
show startup
-
show startup transport T
-
show startup service S
-
show startup user U
-
show startup group G
-
show startup parameter P
These commands show and modify the running configuration. The save
command
writes the running config to the startup config, when the directory and files
are writable.
The show running
and show startup
will print the config tree in yaml
to the console. The running configuration may have some dynamically created
users and protocols which are created as a result of the startup config. A
dynamically created user that is not preconfigured is one of these. These
will show in running
, but will not save to startup
.
Using the configure transport
command is the most often used command of
these. It will update the currently running transports as well as add new
ones. If it is used to modify an existing transport that is already running,
the new settings won’t change the active transport until it is restarted
with shutdown
and connect
or listen
. The configuration details of
transports are described in Networking, and the details of the parameters
are described in Parameters. Most of the parameters are only applied
at startup, so changing them will have an effect only when saved and the
process restarted.
Example of configure transport
and show running transport
.
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[110]> configure transport mesh
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[111](mesh)> type mesh
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[112](mesh)> port 9000
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[113](mesh)> connect host1
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[114](mesh)> connect2 host2
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[115](mesh)> listen *
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[116](mesh)> q
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[117]> show running transports mesh
transports:
- tport: mesh
type: mesh
route:
port: 9000
connect: host1
connect2: host2
listen: "*"
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[123]> configure transport test type tcp port 9000 connect host1
Transport (test) updated
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[124]> show running transports test
transports:
- tport: test
type: tcp
route:
port: 9000
connect: host1
The first configure
command enters into a cli sub command mode where only the
fields of the transport can be entered. The second configure
command sets
all of the fields on one line.
The commands show service
and show group
have limited usefulness in the
current implementation, since only one service is used per ms_server
instance
and groups do not have operational functionality yet, eventually they will be
used for access control lists.
Transport Start and Stop
-
connect T
-
listen T
-
shutdown T
-
network S N
The transport T is defined before using the connect
, listen
, shutdown
commands. The network
command configures the transport if not already
configured, runs it, and also attaches a service to it. The configuration of
the transports is described in Networking.
Example of connect
, listen
, shutdown
.
chex.rvd[L+jUn266ADoL2fBschoqUg]@chex[108]> configure transport test type tcp port 9000 connect lexx.rai
Transport (test) updated
chex.rvd[L+jUn266ADoL2fBschoqUg]@chex[109]> connect test
Transport (test) started connecting
chex.rvd[L+jUn266ADoL2fBschoqUg]@chex[110]> shutdown test
Transport (test) is running tport 1
Transport (test) shutdown (1 instances down)
The Show Commands
-
show subs [U] [W]
Show the subscriptions active for user or for all users. The W
is a substring
for partial matches. This command uses inbox RPC calls to _I.<nonce>.subs
for all users which U
specifies. The *
user matches all users, so the W
argument can be specified.
Example, show all subscriptions for every user:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[38]> show subs
user | subject
----------+-------------------------------------------------------------------
pic_a1.* | _7603._INBOX.0AB98FB4.DAEMON
| (p) _7603._INBOX.0AB98FB4.763E17AA51E2DEF0.>
| test
----------+-------------------------------------------------------------------
pic_a2.1 | _7606._INBOX.173D29A5.DAEMON
| (p) _7606._INBOX.173D29A5.763E17AA5271FEF0.>
----------+-------------------------------------------------------------------
pic_a3.2 | _7500._INBOX.0072DD0A.DAEMON
| (p) _7500._INBOX.0072DD0A.663E17AA514B7DD0.>
| _7500.RSF3.REC.MOT.B
----------+-------------------------------------------------------------------
pic_a4.3 | _7500._INBOX.68AD2F1B.DAEMON
| (p) _7500._INBOX.68AD2F1B.763E17AA50777DD0.>
| _7500.RSF4.REC.DEM=.NaE
| _7500.RSF4.REC.NAI.NaE
...
The (p) string before the subject indicates that the subject was subscribed as a pattern.
Example, show all subscriptions which have the substring DAEMON:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[41]> show subs * DAEMON
user | subject
----------+-----------------------------
pic_a1.* | _7603._INBOX.0AB98FB4.DAEMON
----------+-----------------------------
pic_a2.1 | _7606._INBOX.173D29A5.DAEMON
----------+-----------------------------
pic_a3.2 | _7500._INBOX.0072DD0A.DAEMON
----------+-----------------------------
pic_a4.3 | _7500._INBOX.68AD2F1B.DAEMON
...
Example, show subscriptions active at user edo_a3:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[44]> show subs edo_a3
user | subject
----------+---------------------------------------------
edo_a3.13 | _7500._INBOX.C6AD7566.DAEMON
| (p) _7500._INBOX.C6AD7566.763E17AA40C28DD0.>
| _7500.RSF5.REC.DD.N
| _7500.RSF5.REC.BBN.N
...
-
show seqno [W]
Show the sequences of the subjects received and published. The peers with IPC or console subscribers or publishers track the sequences of the subjects to ensure the stream is completely serialized, and notify of a data loss error when it is not in sequence. The details of how this works are described in Message Loss. This command only operates on the local sequence windows; the show windows command shows the memory usage of these.
The W argument is a substring that matches the subject so that the subjects in the window can be filtered. Without W, all of the subjects are printed.
Example, show the sequences of the subjects which contain ORCL:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[4]> show seqno ORCL
source | seqno | start | time | subject
-----------+--------+-------------------+-------------------+---------------------------
ipc | 52581 | 0207 10:16:16.108 | 0207 23:51:11.441 | _7500.RSF4.REC.ORCL.O
ipc | 145911 | 0207 10:20:50.986 | 0208 00:07:24.401 | _7500.RSF9.REC.ORCL.O
ipc | 128244 | 0207 10:25:25.864 | 0208 00:17:18.041 | _7500.RSF7.REC.ORCL.O
dex_a2.21 | 542769 | 0207 10:03:05.834 | 0208 00:22:42.401 | _7605._TIC.RSF5.REC.ORCL.O
dex_a1.20 | 542769 | 0207 10:03:05.834 | 0208 00:22:42.281 | _7602._TIC.RSF2.REC.ORCL.O
...
The source is the publisher: ipc indicates that the client attached to lex_a1 has published these messages, and dex_a2, dex_a1 indicate that these messages were received from clients attached to those peers (or the console). The start is the first time in the time frame that the subject was seen, and the time is the last time it was seen. New time frames occur when the network link state database changes, since the sequence numbers reference a time frame and the seqno base is linear within it, so the reference jumps between the old and new time frames.
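A minimal sketch of the sequence tracking idea, in Python, assuming each (source, subject) pair carries a monotonically increasing seqno within a time frame; the names are hypothetical and the actual windowing and time frame handling is described in Message Loss:
# hypothetical illustration of per-subject sequence tracking
last_seqno = {}   # (source, subject) -> last seqno seen

def on_publish(source, subject, seqno):
    key = (source, subject)
    prev = last_seqno.get(key)
    last_seqno[key] = seqno
    if prev is None:
        return "first"            # start of a new window for this subject
    if seqno == prev + 1:
        return "in sequence"
    if seqno <= prev:
        return "repeat, discard"  # duplicate or old message
    return "loss of %d messages" % (seqno - prev - 1)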
-
show adjacency
Show the adjacency tables. This command dumps the current link state database. It shows which peer has a link to another peer through which tport and the cost of the link (of path 0).
Example:
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[127]> show adj
user | adj | tport | type | cost
-----------+------------+--------------+------+-----
chex.* | | ipc.0 | ipc | 1000
| lex_a.1 | test.1 | tcp | 1000
-----------+------------+--------------+------+-----
lex_a.1 | edo_a.2 | fo_mesh.4 | mesh | 1000
| lex_a2.3 | lex_amesh.5 | mesh | 1000
| lex_a1.4 | lex_amesh.6 | mesh | 1000
| lex_a3.5 | lex_amesh.7 | mesh | 1000
| lex_a4.6 | lex_amesh.8 | mesh | 1000
| robo_a.7 | fo_mesh.9 | mesh | 1000
| lee_a.16 | fo_mesh.10 | mesh | 1000
| dex_a.21 | fo_mesh.11 | mesh | 1000
| pic_a.26 | fo_mesh.12 | mesh | 1000
| chex.* | lex_tcp.13 | tcp | 1000
-----------+------------+--------------+------+-----
edo_a.2 | edo_a4.8 | edo_amesh.4 | mesh | 1000
| edo_a3.9 | edo_amesh.5 | mesh | 1000
...
The user is the peer that is maintaining the links that follow. It sends a link state update message when a link is added, dropped, or its cost is changed. The adj field is the peer which is directly attached to user through the tport. The tport is the name that user labels this link with. The tport_id number that follows the name (the .4 in fo_mesh.4) is the index into the user's transport table. The type and cost fields are also sent by user in the link state update.
-
show peers
Shows info about the peers in the network that are active.
Example:
user | bridge | sub | seq | link | lat | max | avg | time | tport | cost
-----------+------------------------+-----+------+------+--------+--------+--------+-------------------+-----------+-----
chex.* | VCr9OQDldBjnGLnOXVF7gA | 3 | 3 | 4 | | | | 0320 18:37:34.182 | |
pic_a.1 | YdUS3pecw5BYzlj1Qns0uQ | 2 | 0 | 14 | 4.61ms | 6.55ms | 5.01ms | 0320 11:48:25.118 | pic_tcp.1 | 1000
edo_a.2 | KD28fBfgf6SpwPwH7QpwMA | 2 | 0 | 20 | 5.97ms | 7.92ms | 6.3ms | 0320 11:37:32.198 | edo_tcp.2 | 1000
pic_a3.3 | x+McKSRvAaAfOuOQEsvX9Q | 81 | 7923 | 8 | 5.57ms | 7.69ms | 5.43ms | 0320 11:48:25.066 | pic_tcp.1 | 2000
robo_a.4 | gIBRgIKDPjvTwVVuLxE8vg | 2 | 0 | 16 | 6.74ms | 8.67ms | 6.68ms | 0320 01:24:50.489 | edo_tcp.2 | 2000
dex_a.5 | t2M47zbouWPRJHwFFjVROg | 2 | 0 | 12 | 9.84ms | 9.84ms | 6.62ms | 0320 11:47:17.389 | edo_tcp.2 | 2000
...
The bridge is the 128 bit random nonce created on startup by each peer. It uniquely identifies the peer instance.
The sub field is the number of subscriptions that are active. This number is a counter in the bloom filter that is updated by the peer when subjects and patterns are added or removed. It always contains at least 2 entries, one for the _I.<nonce>.> inbox pattern and one for the _M.> multicast pattern.
The seq field is the sequence number for each subscription operation. It is serialized so that all subscriptions are applied in the same order as they occurred at the peer.
The link field is the sequence number for each link state update. It is also serialized so that adjacency table modifications occur in order.
The lat, max, and avg fields are ping round trip times; a ping is sent at 1.5x the heartbeat interval to a random peer. They are tracked for at least an hour before being rotated.
The time is the start time of the peer.
The tport and cost reference the inbox route to the peer.
The order in the table is by uid. show peers nonce orders the table by bridge nonce, show peers start orders the table by start time, and show peers user orders the table by user name. show peers host shows the first 4 bytes of the bridge used as the host id, and show peers ip shows the first 4 bytes of the bridge in IPv4 dotted quad format. Using show peers zombie, the dead peers are displayed.
-
show ports [T]
Show info about transports that are active on the network.
Example:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[47]> show ports
tport | type | cost | fd | bs | br | ms | mr | lat | idle | fl | address
------------+--------+------+----+-------------+----------+-----------+--------+-------+--------+------+-------------------------------------------
rv.0 | rv | | 12 | | | | | | 27.8hr | LI | rv://127.0.0.1:7500
pic_amesh.1 | mesh | 1000 | 18 | | | | | | 27.8hr | LXCD | mesh://172.18.0.2:34344
pic_amesh.2 | mesh | 1000 | 19 | 3250008 | 3248028 | 10747 | 10747 | 173us | 1.99se | X | pic_a2.1@mesh://172.18.0.3:39340
pic_amesh.3 | mesh | 1000 | 21 | 3248424 | 5785922 | 10733 | 32929 | 240us | 1.39se | X | pic_a3.2@mesh://172.18.0.4:41320
pic_amesh.4 | mesh | 1000 | 23 | 3355474 | 5801830 | 10822 | 33084 | 225us | 835ms | X | pic_a4.3@mesh://172.18.0.5:43846
pic_amesh.5 | mesh | 1000 | 25 | 36957142584 | 29991114 | 100159342 | 245786 | 166us | 1.06ms | X | pic_a.4@mesh://172.18.0.1:57204
The tport and type are configured, and the cost is either configured or advertised by the peer in its link state message. If a transport is internal, like an IPC transport, then it doesn't have a cost associated with it.
The fd field is the endpoint for the transport, usually a listener or an fd assigned to the transport. There are usually one or more fds within the transport that carry out the reading and writing of data to a network endpoint.
The bs, br, ms, and mr fields are the bytes and messages sent and received, collected from all the fds within the transport.
The idle field is the time since the last message event occurred.
The fl field is a set of flags on the transport. Each character is a different flag:
-
L — has a TCP listener
-
M — is a PGM multicast transport
-
X — is a mesh transport
-
C — is or was actively connecting the link
-
T — was accepted from a TCP listener
-
E — is marked as an edge link, there is no routing on the other side
-
I — is an IPC transport, which are client endpoints
-
D — resolves the link using a multicast device
-
- — is shutdown
-
* — connecting in progress
The address field is the address at the peer when TCP is used and the multicast address when PGM is used.
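For example, in the output above, pic_amesh.1 shows the flags LXCD: it has a listener (L), is a mesh transport (X), is or was actively connecting (C), and resolves the link using a multicast device (D).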
-
show cost [T]
This is similar to show ports except that all 4 costs are printed for each transport.
Example:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[49]> show cost pic_amesh
tport | type | cost | cost2 | cost3 | cost4 | fd | fl | address
------------+--------+------+-------+-------+-------+----+------+-------------------------------------------
pic_amesh.1 | mesh | 1000 | 1000 | 1000 | 1000 | 18 | LXCD | mesh://172.18.0.2:34344
pic_amesh.2 | mesh | 1000 | 1000 | 1000 | 1000 | 19 | X | pic_a2.1@mesh://172.18.0.3:39340
pic_amesh.3 | mesh | 1000 | 1000 | 1000 | 1000 | 21 | X | pic_a3.2@mesh://172.18.0.4:41320
pic_amesh.4 | mesh | 1000 | 1000 | 1000 | 1000 | 23 | X | pic_a4.3@mesh://172.18.0.5:43846
pic_amesh.5 | mesh | 1000 | 1000 | 1000 | 1000 | 25 | X | pic_a.4@mesh://172.18.0.1:57204
...
-
show status [T]
Similar to show ports with a status errno if the system reported an error on a link. When everything is normal, the address is printed instead.
Example:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[50]> show status pic_amesh
tport       | type | fd | fl   | status
------------+------+----+------+---------------------------------
pic_amesh.1 | mesh | 18 | LXCD | mesh://172.18.0.2:34344
pic_amesh.2 | mesh | 19 | X    | pic_a2.1@mesh://172.18.0.3:39340
pic_amesh.3 | mesh | 21 | X    | pic_a3.2@mesh://172.18.0.4:41320
pic_amesh.4 | mesh | 23 | X    | pic_a4.3@mesh://172.18.0.5:43846
pic_amesh.5 | mesh | 25 | X    | pic_a.4@mesh://172.18.0.1:57204
...
-
show routes [P]
Show the routes. This shows how all the peers are connected and which port would be used to send and receive messages to/from the peer. It also displays which transports have been used in order to reach the peer.
Example:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[52]> show routes
user | tport | state | cost | path | lat | fd | route
----------+-------------+---------------+------+---------+--------+----+---------------------------------
pic_a2.1 | pic_amesh.2 | inbox,mesh,hb | 1000 | 0,1,2,3 | 143us | 19 | pic_a2.1@mesh://172.18.0.3:39340
| pic_amesh.3 | | 2000 | | | 21 | pic_a3.2@mesh://172.18.0.4:41320
| pic_amesh.4 | | 2000 | | | 23 | pic_a4.3@mesh://172.18.0.5:43846
| pic_amesh.5 | | 2000 | | | 25 | pic_a.4@mesh://172.18.0.1:57204
...
This shows that messages for user pic_a2 have been received or sent through these transports. The secondary transports are often used on startup when the other links are not yet active, or when a link fails.
The state of the transport has these values:
-
inbox — transport is the path for the inbox route
-
mesh — transport is part of a mesh
-
hb — transport is directly connected and has a heartbeat
-
ucast — transport has a point to point UDP protocol
-
usrc — transport uses a point to point UDP protocol to reach another peer
The cost is the link cost for the path given by the P argument, or path 0 when it is not specified.
The path field enumerates which transport is used to reach the peer for each path.
The lat and fd are the same as in show ports.
The route is the directly connected peer address that a message is sent to or received from.
-
show urls
Show the local and peer addresses as well as the url used to resolve the address of the peer. This is useful for mesh and multicast type networks since the endpoints are sometimes resolved through exchanging messages with the network. In the case of a mesh transport, a mesh url database is exchanged and links are established with all the peers that are in the mesh. The multicast PGM transport exchanges the unicast UDP endpoints for all the peers that are on the transport.
Example:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[54]> show urls
user | tport | state | cost | mesh | fd | url | local | remote
----------+-------------+---------------+------+-----------+----+-------------------------+-------------------------+------------------------
| ipc.0 | LI | | | 11 | | ipc://127.0.0.1:7500 | ipc://127.0.0.1:43992
| pic_amesh.1 | LXCD | | pic_amesh | 17 | mesh://172.18.0.2:34344 | |
pic_a2.1 | pic_amesh.2 | X | | pic_amesh | 20 | mesh://172.18.0.3:44108 | mesh://172.18.0.2:34344 | mesh://172.18.0.3:39340
pic_a3.2 | pic_amesh.3 | X | | pic_amesh | 22 | mesh://172.18.0.4:42851 | mesh://172.18.0.2:34344 | mesh://172.18.0.4:41320
pic_a4.3 | pic_amesh.4 | X | | pic_amesh | 24 | mesh://172.18.0.5:45836 | mesh://172.18.0.2:34344 | mesh://172.18.0.5:43846
pic_a.4 | pic_amesh.5 | X | | pic_amesh | 26 | mesh://172.18.0.1:36262 | mesh://172.18.0.2:34344 | mesh://172.18.0.1:57204
----------+-------------+---------------+------+-----------+----+-------------------------+-------------------------+------------------------
pic_a2.1 | pic_amesh.2 | inbox,mesh,hb | 1000 | pic_amesh | 19 | mesh://172.18.0.3:44108 | mesh://172.18.0.2:34344 | mesh://172.18.0.3:39340
| pic_amesh.3 | | 2000 | pic_amesh | 21 | | |
| pic_amesh.4 | | 2000 | pic_amesh | 23 | | |
| pic_amesh.5 | | 2000 | pic_amesh | 25 | | |
The top section is similar to show ports with the addition of the urls. The following sections are similar to show routes with the addition of the urls for each user.
The url field is resolved by exchanging messages. The local and remote are the addresses assigned to the connection. A mesh may be actively connected from either side, since all peers have passive listeners and some have active connections. The newer peers will usually have the active connections and the older peers will have the accepted connections. The local and remote addresses reflect that, since the accepted peers are assigned an address by the system and the connecting peers use the url address to connect.
-
show tport [T]
Show the state of the transports. This prints the configured transport and
whether it is active or not. The other transport show
commands will only
show the active transports. This will show the ones configured but not active
as well.
Example:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[55]> show tport
tport | type | state | listen | connect | device
----------+--------+-----------+---------------------+--------------------------------+------------
pic_amesh | mesh | accepting | | | mesh://eth0
rv | rv | accepting | rv://127.0.0.1:7500 | |
tel | telnet | accepting | telnet://*:2222 | |
ipc | ipc | ipc | | |
rvd.ipc | ipc | - | | |
eth0 | name | - | | name://eth0;239.23.22.217:8327 |
test | tcp | - | | tcp://robotron.rai:9000 |
The listen
, connect
, and device
fields show how the transport is
configured to resolve the connections.
-
show user [U]
Show the users configured.
Example:
chex.test[OsGpIaCbYCJbhnUVEp19Uw]@chex[135]> show users
uid | user | svc | create | expires
----+------+------+----------------------+--------
0 | chex | test | 1675847381.440084399 |
| dyna | test | 1675847381.440129724 |
| ruby | test | 1675847381.440176492 |
| zero | test | 1675847419.072423168 |
-
show events
The system tracks the authentication, transport, and link state events in a buffer that rotates every 4096 entries. This is a compact table with 6 integer fields that map to a time stamp, uids, transports, and enumerated values depending on the event type. These events are useful for resolving what happened to the network after something went wrong.
Example of an event log:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[59]> show events
stamp | tport | user | peer | event | data
------------------+-------------+-----------+-----------+-----------------+--------------------
0206 22:09:22.606 | | pic_a1.* | | startup |
0206 22:09:22.607 | ipc.0 | pic_a1.* | | on_connect | listen
0206 22:09:22.607 | pic_amesh.1 | pic_a1.* | (aes) | on_connect | listen
0206 22:09:22.607 | (mcast) | pic_a1.* | | send_hello |
0206 22:09:23.301 | pic_amesh.2 | pic_a1.* | (aes) | on_connect | mesh_accept
0206 22:09:23.327 | | pic_a1.* | | converge | add_tport
0206 22:09:23.340 | pic_amesh.2 | pic_a2.1 | pic_a1.* | add_user_route | neighbor
0206 22:09:23.340 | pic_amesh.2 | pic_a1.* | pic_a2.1 | send_challenge | hello
0206 22:09:23.342 | pic_amesh.2 | pic_a2.1 | | recv_challenge | handshake
0206 22:09:23.342 | | pic_a2.1 | (ecdh) | auth_add | handshake
0206 22:09:23.342 | (mcast) | pic_a1.* | pic_a2.1 | send_adj_change | add
0206 22:09:23.342 | pic_amesh.2 | pic_a2.1 | | send_trust | in_mesh
0206 22:09:23.342 | pic_amesh.2 | pic_a2.1 | | recv_peer_db | add_route
0206 22:09:23.342 | pic_amesh.2 | pic_a2.1 | pic_a1.* | recv_adj_change | update_adj
0206 22:09:23.367 | | pic_a1.* | | converge | adj_change
0206 22:09:23.889 | pic_amesh.3 | pic_a1.* | (aes) | on_connect | mesh_accept
0206 22:09:23.927 | | pic_a1.* | | converge | add_tport
0206 22:09:23.928 | pic_amesh.3 | pic_a3.2 | pic_a1.* | add_user_route | neighbor
The events that are logged are:
Event | Description
---|---
startup | Initial event, time of start
on_connect | Transport listen, connect, or accept occurred
on_shutdown | Transport connection was closed or shutdown
on_timeout | Transport connection timed out
auth_add | Peer was authenticated and is now trusted
auth_remove | Peer authentication is dropped
send_challenge | An authentication challenge is sent to peer
recv_challenge | An authentication challenge is received from peer
send_trust | Authentication was successful, sent trust message
recv_trust | Peer notified that my node is now authenticated
add_user_route | Route to peer is found and the transport is labeled
hb_queue | Peer is added to the heartbeat timeout queue
hb_timeout | Peer heartbeat was not received within its interval
send_hello | Transport is initialized by sending a hello message
recv_bye | Peer intends to leave the network and sends a bye message
recv_add_route | Received a message that a peer was added to the network
recv_peer_db | All the peers that are known are exchanged with a new peer
send_add_route | Send a message when a peer is added to the network
send_peer_del | Send a message when a peer is removed from the network
sync_result | Peer sync message was received, initialize peer state
send_sync_req | Request a peer sync after a new peer is notified
recv_sync_req | Receive a sync request for my node or another peer
recv_sync_fail | Receive a sync request for an unknown peer
send_adj_change | Send a link state update message, add or remove link
recv_adj_change | Received a link state update message
send_adj_req | Link state for peer is stale, request the current link state
recv_adj_req | Receive a request for the current link state
send_adj | Send the current link state to a peer
recv_adj_result | Receive the current link state from a peer
resize_bloom | Resize my peer's bloom filter and send it to the network
recv_bloom | Received a peer's bloom filter
converge | The network has no missing link states and is completely connected
-
show logs
The last 64K bytes of the log are buffered in the process. This command shows this buffer.
-
show counters
Show the counters of heartbeat, inbox, and ping subjects.
Example:
pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[60]> show counters
user | start | hb seqno | hb time | snd ibx | rcv ibx | ping snd | ping stime | pong rcv | ping rcv
----------+-------------------+----------+-------------------+---------+---------+----------+-------------------+----------+---------
pic_a1.* | 0206 22:09:22.606 | | | | | | | |
pic_a2.1 | 0206 22:09:23.219 | 17021 | 0208 20:52:00.940 | 19 | 23 | 454 | 0208 20:50:22.608 | 454 | 442
pic_a3.2 | 0206 22:09:23.806 | 17021 | 0208 20:51:51.687 | 18 | 149 | 438 | 0208 20:50:43.808 | 438 | 444
pic_a4.3 | 0206 22:09:24.401 | 17020 | 0208 20:51:52.241 | 29 | 125 | 427 | 0208 20:51:00.008 | 427 | 438
pic_a.4 | 0206 22:09:24.433 | 17020 | 0208 20:51:52.275 | 35 | 37 | 422 | 0208 20:51:21.608 | 422 | 426
robo_a3.5 | 0206 22:09:06.260 | 0 | | 11 | 98 | 427 | 0208 20:51:40.528 | 427 | 421
robo_a2.6 | 0206 22:09:05.371 | 0 | | 11 | 15 | 424 | 0208 20:51:50.168 | 424 | 423
robo_a4.7 | 0206 22:09:07.183 | 0 | | 11 | 95 | 420 | 0208 20:41:30.568 | 420 | 418
robo_a1.8 | 0206 22:09:04.452 | 0 | | 11 | 15 | 423 | 0208 20:41:48.848 | 423 | 424
edo_a.9 | 0206 22:09:12.993 | 0 | | 2 | 20 | 422 | 0208 20:42:05.808 | 422 | 419
...
The start field is when the process started. The hb seqno and hb time track the last heartbeat received from the peer when it is directly connected.
The snd ibx and rcv ibx fields are counters for many of the _I.<nonce>. subjects, which guard against repeats. These are point to point messages; the peer has the same counters, which should match these. The show inbox command will show the last 32 of these sequences. The ping and pong sequences have their own counters, since these are used to check connectivity between peers and are expected to have loss when the network is unstable.
-
show sync
Show the link state seqno and sub seqno sums.
Example:
user | start | link_seqno | link_sum | sub_seqno | sub_sum | hb_diff | mc_req | mc_res | req_adj | res_adj | ping_adj
-----------+-------------------+------------+----------+-----------+---------+---------+--------+--------+---------+---------+---------
chex.* | 0225 01:38:14.590 | 5 | 1447 | 0 | 81677 | | | | | |
edo_a.1 | 0224 17:07:32.126 | 25 | 1447 | 0 | 81653 | 0 | 0 | 0 | 0 | 0 | 0
edo_a2.3 | 0224 17:07:29.173 | 8 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0
edo_a1.4 | 0224 17:07:27.696 | 8 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0
edo_a3.5 | 0224 17:07:30.591 | 8 | 0 | 6673 | 0 | 0 | 0 | 0 | 0 | 0 | 0
edo_a4.6 | 0224 17:07:32.052 | 7 | 0 | 6874 | 0 | 0 | 0 | 0 | 0 | 0 | 0
robo_a.7 | 0224 17:07:26.471 | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
...
The start field is when the process started. The link_seqno and link_sum are the link state seqno and the sum of all of the peers' link state seqnos. The sub_seqno and sub_sum are the subscription seqno and the sum of all peers' subscription seqnos. These sums will only appear when the node is directly connected to the peer, since they are the values last seen in the heartbeat messages.
The sequence numbers are always increasing after a change in the link state or subscription state, so the sums of these seqnos are unique for the current network state and provide a way for peers to check whether they are in sync with the network.
These are exchanged with the heartbeat messages. When a difference is detected, the hb_diff is incremented and a _M.sync message is multicast to the network. When a peer receives the sync message, it checks that its sums match those of the sending peer. If they do not match, then it replies with its current link state and subscription seqno values in a _I.<nonce>.sync point to point message. When a peer receives the sync reply, it checks that these are in sync and requests the adjacency with _I.<nonce>.sync_req if they are not.
The hb_diff may not always indicate an actual difference with the network, since it is possible that a subscription or a link state message is received and applied to the peer at a different rate than the heartbeat is received, but the reply of the current sequence numbers at the peer will most likely be less than or equal to the state of the network when the peer is in sync.
The mc_req is the number of _M.sync messages received, and mc_res is the number of _I.<nonce>.sync messages received. The req_adj is the number of adjacency requests made as a result of the _M.sync messages, res_adj is the number of adjacency requests made as a result of the _I.<nonce>.sync messages, and ping_adj is the number of adjacency requests made as a result of _I.<nonce>.ping messages.
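A minimal sketch of the sum comparison, in Python, assuming the local link state database is a map of peers to their link state and subscription seqnos; the names are hypothetical, not the actual implementation:
# hypothetical illustration of the seqno sum check
lsdb = {
    "edo_a":  {"link_seqno": 25, "sub_seqno": 40},
    "robo_a": {"link_seqno": 18, "sub_seqno": 12},
}

def my_sums(db):
    link_sum = sum(p["link_seqno"] for p in db.values())
    sub_sum  = sum(p["sub_seqno"]  for p in db.values())
    return link_sum, sub_sum

def on_heartbeat(hb_link_sum, hb_sub_sum, db):
    # if the sums advertised in the heartbeat differ from the local sums,
    # count a hb_diff and multicast a _M.sync message to reconcile
    if (hb_link_sum, hb_sub_sum) != my_sums(db):
        return "hb_diff: send _M.sync"
    return "in sync"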
-
show pubtype
When a message header is created or unpacked, a counter for the subject class is incremented. This shows these counters. Only messages that are processed by the network are counted; it is possible that two clients within the IPC transport are exchanging messages, and these are not counted.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[7]> show pubtype
type | recv_count | send_count
-----------------+------------+-----------
u_session_hello | 0 | 1
u_session_hb | 68761 | 68765
u_peer_add | 134 | 35
u_peer_del | 16 | 4
u_bloom_filter | 39 | 3
u_adjacency | 115 | 4
u_sub_join | 224621 | 24
u_sub_leave | 223689 | 0
u_psub_start | 110 | 89
u_inbox_auth | 4 | 8
u_inbox_subs | 10 | 0
u_inbox_ping | 12476 | 12529
u_inbox_pong | 12529 | 12481
u_inbox_rem | 1 | 0
u_inbox_resub | 0 | 202
u_inbox_add_rte | 4 | 4
u_inbox_sync_req | 2 | 30
u_inbox_sync_rpy | 29 | 0
u_inbox_adj_req | 3 | 10
u_inbox_adj_rpy | 21 | 6
u_inbox_ack | 0 | 1
u_inbox_any | 0 | 224476
u_inbox | 0 | 1
u_mcast_ping | 5 | 0
u_inbox_any_rte | 80 | 0
mcast_subject | 1528812397 | 0
-
show inbox [U]
Show the types of the last 32 system RPC messages sent and received for each peer. Some peers may not have any of these if they are not directly connected.
This is an example of a peer attached to the console connecting to a larger network:
chex.rvd[xpO5ODZvoOcUMJ60QVaSBg]@chex[139]> inbox
user | send seqno | send type | recv seqno | recv type
--------+------------+------------------+------------+-----------------
lex_a.1 | 1 | u_inbox_auth | 1 | u_inbox_sync_rpy
| 2 | u_inbox_add_rte | 2 | u_inbox_auth
| 3 | u_inbox_adj_req | 3 | u_inbox_add_rte
| 4 | u_inbox_sync_req | 4 | u_inbox_adj_rpy
| 5 | u_inbox_sync_req | 5 | u_inbox_sync_rpy
| 6 | u_inbox_sync_req | 6 | u_inbox_sync_rpy
| 7 | u_inbox_sync_req | 7 | u_inbox_sync_rpy
| 8 | u_inbox_sync_req | 8 | u_inbox_sync_rpy
...
The first 3 sequences are the result of authentication, which causes both peers
to exchange all their known peers. The following u_inbox_sync_req
and
u_inbox_sync_rpy
pairs are used to request the peers which are not yet
authenticated. In this case, the connecting peer has no peers and the peer
attached to the network has lots of peers that need synchronizing.
-
show loss
Show the counters of repeated messages (old message sequences), messages not subscribed, message loss, and inbox loss.
When a message is repeated or not subscribed, a counter is incremented and the message is tossed. These types of events can occur through normal operation and don't have an impact on clients.
The repeated messages can occur during network instability, and not subscribed messages can occur because an unsubscribe has not yet reached the publisher or because the bloom filter did not filter the subject.
The message loss counters are more critical to correct behavior, since they indicate that messages did not reach all subscriptions. The inbox message loss can occur normally, since these messages are used to synchronize peers during network instability and to stabilize the network.
The point to point messages using the _INBOX prefix will also use the inbox sequences, but even these are not as critical since clients will have timeouts and retry the operation that uses an _INBOX subject.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[11]> show loss
user | repeat | rep time | not sub | not time | msg loss | loss time | ibx loss | ibx time
-----------+--------+----------+---------+----------+----------+-------------------+----------+------------------
lex_a2.1 | 0 | | 0 | | 0 | | 0 |
lex_a3.2 | 0 | | 0 | | 0 | | 0 |
lex_a4.3 | 0 | | 0 | | 0 | | 0 |
edo_a.5 | 0 | | 0 | | 0 | | 0 |
robo_a.6 | 0 | | 0 | | 0 | | 0 |
edo_a4.7 | 0 | | 0 | | 0 | | 1 | 0209 08:22:25.120
edo_a3.8 | 0 | | 0 | | 0 | | 1 | 0209 08:22:25.120
edo_a1.9 | 0 | | 0 | | 640 | 0209 08:24:31.960 | 0 |
edo_a2.10 | 0 | | 0 | | 655 | 0209 08:24:32.080 | 0 |
robo_a3.11 | 0 | | 0 | | 0 | | 1 | 0209 08:22:25.120
robo_a2.12 | 0 | | 0 | | 630 | 0209 08:24:31.761 | 0 |
robo_a4.13 | 0 | | 0 | | 0 | | 1 | 0209 08:22:25.120
robo_a1.14 | 0 | | 0 | | 647 | 0209 08:24:23.841 | 0 |
lee_a1.15 | 0 | | 0 | | 1 | 0209 08:22:27.841 | 0 |
...
The user is the sender of the message. The repeat and rep time are the count and the time stamp of the last instance. The not sub and not time are for the not subscribed messages. The msg loss and loss time are for the multicast message loss. The ibx loss and ibx time are for the point to point inbox message loss.
-
show skew
Show the system time skew between peers. There are several messages that include a time stamp which can be used to estimate the system clock skew between peers. This is useful to guard against message replays. If a peer message arrives and the time + skew is older than the subscription window, then it is treated as a repeated message. When the time is within the subscription window, then a sequence will be associated with the last message received from the peer. The subscription window rotate time is configurable, described in Parameters of the config section. The details of the loss calculation are described in Message Loss.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[11]> show skew
user | lat | hb | ref | ping | pong | time
-----------+-------+---------+-----+----------+---------+------------------
lex_a2.1 | 241us | 63.5us | 0 | 33.7us | -32.5us | 0209 08:47:48.395
lex_a3.2 | 119us | 76.9us | 0 | 31.4us | -7.15us | 0209 08:47:48.395
lex_a4.3 | 157us | 236us | 0 | 32.7us | -15.1us | 0209 08:47:48.395
edo_a.5 | 302us | -483us | 4 | -0.161us | -26.1us | 0209 08:47:48.395
robo_a.6 | 291us | -1.09ms | 4 | 0.154us | -1.41ms | 0209 08:47:48.397
edo_a4.7 | 521us | 282us | 4 | 31.6us | -131us | 0209 08:47:48.395
edo_a3.8 | 512us | 250us | 4 | -5.1us | -14.7us | 0209 08:47:48.395
edo_a1.9 | 308us | 1.26ms | 4 | -12.8us | 72.8us | 0209 08:47:48.395
edo_a2.10 | 452us | 1.02ms | 4 | -13.2us | -222us | 0209 08:47:48.395
robo_a3.11 | 528us | 314us | 4 | 28us | -1.44ms | 0209 08:47:48.397
robo_a2.12 | 468us | 477us | 4 | -3.79us | -1.47ms | 0209 08:47:48.397
robo_a4.13 | 633us | 571us | 4 | -8.7us | -1.5ms | 0209 08:47:48.397
...
The first messages a peer will see when connecting are the heartbeat and authentication messages. These have a time attached to them, and this is the first time skew calculation that a peer will have. The hb column contains this value, and the ref is the uid of the peer that is attached and calculated the skew. The ping and pong values are calculated later, when a ping pong sequence of messages is exchanged. These are more accurate because there is a larger sample size as the uptime increases. The time is the last time a skew was calculated.
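As a rough illustration of how a timestamped ping/pong exchange can estimate clock offset, here is a generic NTP style calculation in Python. This is only a sketch of the general technique, not necessarily the exact formula Rai MS uses:
# t1: ping send time (local clock), t2: ping receive time (peer clock)
# t3: pong send time (peer clock),  t4: pong receive time (local clock)
def estimate(t1, t2, t3, t4):
    rtt  = (t4 - t1) - (t3 - t2)        # round trip latency
    skew = ((t2 - t1) + (t3 - t4)) / 2  # peer clock minus local clock
    return rtt, skew

# more samples over a longer uptime give a better estimate, which is why the
# ping/pong values above are more accurate than the initial hb value
print(estimate(0.000, 0.00105, 0.00110, 0.00100))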
-
show reachable
Show which transport links can be used to reach a peer. This table associates
a connection fd
with a list of peers that are using it. If this connection
is lost, then these are peers that may be affected by this event.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[12]> show reachable
user | path | fd | tport
-----------+---------+----+------------
lex_a2.1 | 0,1,2,3 | 19 | lex_amesh.2
lex_a3.2 | | |
lex_a4.3 | | |
dex_a.24 | | |
pic_a.29 | | |
lee_a.18 | | |
robo_a.6 | | |
edo_a.5 | | |
-----------+---------+----+------------
lex_a3.2 | 0,1,2,3 | 21 | lex_amesh.3
lex_a2.1 | | |
lex_a4.3 | | |
dex_a.24 | | |
pic_a.29 | | |
robo_a.6 | | |
edo_a.5 | | |
...
The user
is the peer, the path
is a list of paths used with the connection
fd
, and the tport
is the transport that contains the connection.
-
show tree [U]
Show the multicast tree for a user or self. This iterates through the adjacency tables by cost and shows which peers will be reached after each step. The cost increases until all the peers are exhausted. If a U argument is present, then the multicast tree starts from that peer instead of the peer attached to the console.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[14]> show tree
cost | set | alt | source | tport | dest
-----+-----+-----+----------+--------------+--------
1000 | 0 | 0 | lex_a1.* | lex_amesh.2 | lex_a2
1000 | 1 | 0 | lex_a1.* | lex_amesh.3 | lex_a3
1000 | 2 | 0 | lex_a1.* | lex_amesh.4 | lex_a4
1000 | 3 | 0 | lex_a1.* | lex_amesh.5 | lex_a
-----+-----+-----+----------+--------------+--------
2000 | 0 | 0 | lex_a.33 | fo_mesh.7 | edo_a
2000 | 2 | 0 | lex_a.33 | fo_mesh.9 | robo_a
2000 | 1 | 0 | lex_a.33 | fo_mesh.8 | lee_a
2000 | 4 | 0 | lex_a.33 | fo_mesh.11 | dex_a
2000 | 3 | 0 | lex_a.33 | fo_mesh.10 | pic_a
-----+-----+-----+----------+--------------+--------
3000 | 0 | 0 | edo_a.5 | edo_amesh.4 | edo_a4
3000 | 1 | 0 | edo_a.5 | edo_amesh.5 | edo_a3
3000 | 2 | 0 | edo_a.5 | edo_amesh.6 | edo_a1
3000 | 3 | 0 | edo_a.5 | edo_amesh.7 | edo_a2
3000 | 4 | 0 | robo_a.6 | robo_amesh.4 | robo_a3
3000 | 5 | 0 | robo_a.6 | robo_amesh.5 | robo_a2
...
The set is an index into the table used for the next hop; it is calculated by transitioning across the transport links. Since the uids are displayed in order, the set may jump back and forth through the table. The alt counter is an alternate path counter. Only the 0 alt path is used, but the others are displayed.
The source is the forwarding peer that sends the message, the tport is the transport local to the source, and dest is the receiver.
-
show path [P]
Show the transports used to reach a peer for a path. This is the forwarding table that is used to send a message from the local peer to other peers.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[15]> show path
tport | cost | path_cost | dest
------------+------+-----------+----------
lex_amesh.2 | 1000 | 1000 | lex_a2.1
lex_amesh.3 | 1000 | 1000 | lex_a3.2
lex_amesh.4 | 1000 | 1000 | lex_a4.3
lex_amesh.5 | 1000 | 2000 | edo_a.5
lex_amesh.5 | 1000 | 3000 | edo_a4.7
lex_amesh.5 | 1000 | 3000 | edo_a3.8
...
The tport is used for sending a message to dest. The cost is the first hop cost, the path_cost is the total cost through all hops.
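The path_cost accumulates the per link costs along the chosen route. A minimal sketch in Python, using Dijkstra over a hypothetical adjacency map (the adj debug flag later in this section refers to the link state Dijkstra algorithm); the peer names and costs here are illustrative only:
import heapq

# hypothetical adjacency map: peer -> list of (neighbor, link cost)
adj = {
    "lex_a1": [("lex_a2", 1000), ("lex_a", 1000)],
    "lex_a2": [("lex_a1", 1000), ("lex_a", 1000)],
    "lex_a":  [("lex_a1", 1000), ("lex_a2", 1000), ("edo_a", 1000)],
    "edo_a":  [("lex_a", 1000), ("edo_a4", 1000)],
    "edo_a4": [("edo_a", 1000)],
}

def path_costs(src):
    # total cost through all hops from src to every other peer
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, cost in adj[u]:
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return dist

print(path_costs("lex_a1"))   # edo_a4 gets a path_cost of 3000 via lex_a -> edo_a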
-
show forward [P]
Show the forwarding table for a message received from each of the peers. When a message is received from a peer, it may need to be forwarded to other peers to completely cover the network. This shows the forwarding tables for each peer.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[16]> show forward
source | tport | cost
----------+-------------+-----
lex_a1.* | lex_amesh.2 | 1000
| lex_amesh.3 | 1000
| lex_amesh.4 | 1000
| lex_amesh.5 | 1000
----------+-------------+-----
lex_a2.1 | |
----------+-------------+-----
lex_a3.2 | |
...
The source indexes the forwarding table used, and the tport is the transport used to forward the message.
-
show fds
Show what each fd is used for. This iterates the fd
tables and shows what
each fd
is doing.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[17]> show fds
fd | rid | bs | br | ms | mr | ac | rq | wq | fl | type | kind | name | address
---+-----+-------------+--------------+-----------+------------+----+----+----+----+---------------------+---------------+--------------------------+--------------------------
3 | -1 | 0 | 15321 | 0 | 0 | | 0 | 0 | | logger | stdout | |
5 | -1 | | | | | | | | | timer_queue | timer | |
7 | -1 | 0 | 717092458452 | 0 | 1943883309 | | | | | ipc_route | ipc | rvd.ipc |
8 | -1 | 0 | 4235 | 0 | 0 | | 0 | 0 | | logger | stderr | |
9 | -1 | | | | | | | | | console_route | console | rvd.console |
10 | -1 | 0 | 99146804 | 0 | 690776 | | | | | session_mgr | session | rvd.session |
11 | 0 | 0 | 64848767199 | 0 | 261901166 | | | | | transport_route | tport | rvd.ipc.tport.0 |
12 | 0 | | | | | 12 | | | | rv_listen | rv_listen | rvd.ipc.rv.list.0 | 127.0.0.1:7500
13 | -1 | | | | | 1 | | | | telnet_listen | telnet_listen | telnet.tel | 0.0.0.0:2222
14 | -1 | 210 | 0 | 1 | 0 | | | | | name_connect | mcast_send | name.eth0.send | 239.23.22.217:8327
15 | -1 | 1000 | 1260 | 5 | 6 | | | | | name_listen | mcast_recv | name.eth0.recv | 239.23.22.217:8327
16 | -1 | | | | | | | | | name_listen | ucast_recv | name.eth0.inbox | 172.18.0.2:33643
17 | 1 | | | | | | | | | transport_route | tport | rvd.lex_amesh.tport.1 |
18 | 1 | | | | | 5 | | | | ev_tcp_tport_listen | tcp_listen | rvd.lex_amesh.tcp_list.1 | 172.18.0.2:42341
19 | 2 | 9458891878 | 28871168 | 27427393 | 121986 | | 0 | 0 | | ev_tcp_tport | tcp_accept | rvd.lex_amesh.tcp_acc.1 | lex_a2.1@172.18.0.3:41708
20 | 2 | 0 | 16338022 | 0 | 50587 | | | | | transport_route | tport | rvd.lex_amesh.tport.2 |
21 | 3 | 9548489486 | 28505122 | 27617221 | 120205 | | 0 | 0 | | ev_tcp_tport | tcp_accept | rvd.lex_amesh.tcp_acc.1 | lex_a3.2@172.18.0.4:44630
...
The fields are:
Field | Description
---|---
fd | File descriptor
rid | Transport id that fd belongs to
bs | Bytes sent
br | Bytes received
ms | Messages sent
mr | Messages received
ac | Listener accept count
rq | Bytes in the receive queue
wq | Bytes in the send queue
fl | Socket flags, R,r,<: reading, W,w,>: writing, +: processing
type | What type of fd
kind | What class of fd
name | The name associated with fd
address | The local address
-
show buffers
Show the buffer usage of each connection. These buffers expand to contain an entire message, since there is no streaming of large messages.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[18]> show buffers
fd | wr | wmax | rd | rmax | zref | send | recv | mall | pall | name
---+-------+-------+-------+-------+------+-----------+-----------+------+------+------------------------
3 | 32768 | 32768 | 16384 | 16384 | 0 | 0 | 124 | 0 | 0 |
8 | 32768 | 32768 | 16384 | 16384 | 0 | 0 | 74 | 0 | 0 |
19 | 32768 | 32768 | 16384 | 16384 | 0 | 27189290 | 67973 | 0 | 0 | rvd.lex_amesh.tcp_acc.1
21 | 32768 | 32768 | 16384 | 16384 | 0 | 27224765 | 66485 | 0 | 0 | rvd.lex_amesh.tcp_acc.1
23 | 32768 | 32768 | 16384 | 16384 | 0 | 30118303 | 68727 | 0 | 0 | rvd.lex_amesh.tcp_acc.1
25 | 32768 | 32768 | 16384 | 16384 | 0 | 5498186 | 38629165 | 0 | 0 | rvd.lex_amesh.tcp_acc.1
...
The fields are:
Field | Description
---|---
fd | File descriptor
wr | Write buffer size
wmax | The largest write buffer used
rd | Read buffer size
rmax | The largest read buffer used
zref | Counter incremented after zero copy sends
send | Bytes sent
recv | Bytes received
mall | Counter incremented when malloc() is used to make a buffer
pall | Counter incremented when a buffer is borrowed from the buffer pool
name | Name associated with fd
-
show windows
Show the size and counts of the subject publish and subscribe windows as well as the size of subscription tables and bloom filters.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[19]> show windows
tab | count | size | win_size | max_size | rotate_time | interval
--------+-------+---------+----------+----------+-------------------+---------
sub | 22515 | 5534080 | 8388608 | 5534080 | 0208 13:23:11.393 | 10
sub_old | 0 | 0 | | | 0208 13:23:01.393 |
pub | 3737 | 344112 | 4194304 | 344112 | 0208 13:23:11.393 | 10
pub_old | 0 | 0 | | | 0208 13:23:01.393 |
inbox | 2724 | 817824 | | | 0209 09:52:42.761 |
route | 137 | 58848 | | | |
bloom | 1135 | 18392 | | | |
rv | 102 | 1290420 | | | |
The first two are the subscription and publish windows. These tables are rotated to old when they reach win_size and at least interval seconds have passed. The max_size is the largest size of this window.
The inbox entry is a route cache for subjects that have an _INBOX prefix. The route entry is a cache for routes, indexed by subject hash. The bloom entry is the sum of the sizes of the bloom filters for every peer in the network. The rv entry is the subscription table for the RV clients attached.
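A minimal sketch of the rotation rule in Python, under one plausible reading of the behavior above: the current table becomes the old table once its size reaches win_size and at least interval seconds have elapsed since the last rotation. The names are hypothetical:
import time

# hypothetical illustration of window rotation
def maybe_rotate(win, win_size, interval):
    now = time.time()
    if win["size"] >= win_size and now - win["rotate_time"] >= interval:
        win["old"], win["cur"] = win["cur"], {}   # cur becomes sub_old / pub_old
        win["size"], win["rotate_time"] = 0, now

win = {"cur": {}, "old": {}, "size": 0, "rotate_time": time.time()}
maybe_rotate(win, 8388608, 10)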
-
show blooms [P]
Show where the bloom filters are used for a path. The forwarding table has only one transport entry for each peer, path combination. If a message is forwarded on more than one transport, it is because there are multiple peers that are subscribed across multiple transports for the path. The receiving side also filters the messages through the bloom filters by calculating the ports that are needed for the path to completely cover the network. There may be redundant transports that are inactive for each path either because the cost is more or the path selection prefers one transport over the other.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[20]> show blooms
fd | dest | tport | bloom | prefix | detail | subs | total
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
9 | console | ipc.0 | (console) | 0 | 0 | 0 | 0
11 | route | ipc.0 | (all-peers) | 0 | 0 | 0 | 0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
7 | ipc | lex_amesh.1 | (ipc) | 0x000061DF00C38000 | 0 | 24 | 113
10 | session | lex_amesh.1 | (console), (sys) | 0x04000108 | 0 | 7 | 15
17 | route | lex_amesh.1 | (all-peers) | 0 | 0 | 0 | 0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
7 | ipc | lex_amesh.2 | (ipc) | 0x000061DF00C38000 | 0 | 24 | 113
10 | session | lex_amesh.2 | (console), (sys) | 0x04000108 | 0 | 7 | 15
19 | lex_a2.1 | lex_amesh.2 | (peer), lex_a2 | 0x0000008004000108 | 0 | 84 | 91
20 | route | lex_amesh.2 | (all-peers) | 0 | 0 | 0 | 0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
7 | ipc | lex_amesh.3 | (ipc) | 0x000061DF00C38000 | 0 | 24 | 113
10 | session | lex_amesh.3 | (console), (sys) | 0x04000108 | 0 | 7 | 15
21 | lex_a3.2 | lex_amesh.3 | (peer), lex_a3 | 0x0000008004000108 | 0 | 98 | 105
22 | route | lex_amesh.3 | (all-peers) | 0 | 0 | 0 | 0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
7 | ipc | lex_amesh.4 | (ipc) | 0x000061DF00C38000 | 0 | 24 | 113
10 | session | lex_amesh.4 | (console), (sys) | 0x04000108 | 0 | 7 | 15
23 | lex_a4.3 | lex_amesh.4 | (peer), lex_a4 | 0x0000008004000108 | 0 | 89 | 96
24 | route | lex_amesh.4 | (all-peers) | 0 | 0 | 0 | 0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
7 | ipc | lex_amesh.5 | (ipc) | 0x000061DF00C38000 | 0 | 24 | 113
10 | session | lex_amesh.5 | (console), (sys) | 0x04000108 | 0 | 7 | 15
25 | lex_a.33 | lex_amesh.5 | (peer), lex_a, pic_a, edo_a, lee_a4, lee_a3, lee_a1, lee_a2, edo_a4, edo_a2 | 0x000061DF04C38108 | 0 | 482 | 636
| | | edo_a1, edo_a3, dex_a1, dex_a2, dex_a3, dex_a4, pic_a4, pic_a1, pic_a2 | | | |
| | | pic_a3, lee_a, dex_a | | | |
26 | route | lex_amesh.5 | (all-peers) | 0 | 0 | 0 | 0
Every peer has a bloom filter associated with it. The console, ipc, and sys filters are the local bloom filters, which are combined into one filter at other peers. They are split in the local peer so that traffic can be directed to the separate processing functions. The sys filter only matches the subjects that are used for the system, namely the _I.<nonce>.> subject and the _M.> subject. The console filter contains the subjects subscribed by the console. The ipc filter contains the subjects subscribed by clients. The all-peers filter is the combination of all the peers' subscriptions; it is used for receiving messages. The individual peer bloom filters are used for forwarding messages.
The fields are:
Field | Description
---|---
fd | File descriptor for the connection
dest | Where the message would go
tport | The transport that is used
bloom | The bloom filters
prefix | A bit mask of the prefix match length
detail | A bit mask of the prefix when a suffix is matched or sharded
subs | The subscription count, not including the patterns
total | The subscription count including the patterns
-
show match S
Show which peer bloom filters match a subject. If a message was published
with subject S
, this shows which peer’s bloom filter would match it. This
doesn’t match against the local filters.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[33]> show match _7500.RSF.REC.AVP.N
user
---------
lee_a2.16
-
show graph
Show the graph description of the network. This creates a description of the network by matching the names of the transports with the names that the peers use. It doesn't use any network probing; it uses the link state database to calculate the network connectivity. The link state database doesn't have connection IP addresses associated with it, but it does have a link name and link type. The names and types are enough to describe the network, but don't show how the links are connected to hosts with IP addresses.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[34]> show graph
start lex_a1
node edo_a1 edo_a2 edo_a3 lex_a1 lex_a2 edo_a4 edo_a lex_a3 lex_a4 lee_a1 lee_a2 lee_a3 dex_a1 lee_a4 lee_a dex_a2 dex_a3 dex_a4 dex_a pic_a1 pic_a2 pic_a3 pic_a4 pic_a lex_a
mesh_lex_amesh lex_a1 lex_a2 lex_a3 lex_a4 lex_a
mesh_edo_amesh edo_a edo_a4 edo_a3 edo_a1 edo_a2
mesh_fo_mesh edo_a lee_a dex_a pic_a lex_a
mesh_lee_amesh lee_a1 lee_a2 lee_a4 lee_a lee_a3
mesh_dex_amesh dex_a1 dex_a2 dex_a3 dex_a4 dex_a
mesh_pic_amesh pic_a3 pic_a4 pic_a1 pic_a2 pic_a
The start is the peer attached to the console. The node is the list of peers in the network ordered by age. The following lines have a prefix which is the type of transport used, which is either mesh, tcp, or pgm. The suffix of the type is the name of the transport. Following the "type_name" are the peers which are connected using this transport. If the cost is not the default of 1000, then there will be a : followed by the cost of the transport.
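As a hypothetical illustration of the cost suffix, a transport whose cost is 500 instead of the default might appear as:
mesh_fo_mesh edo_a lee_a dex_a pic_a lex_a : 500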
-
show cache
Show the route cache hit and miss statistics. To reduce the number of bloom filters and hash tables that a message must flow through to match the subject, the route for the subject is cached. This cache needs to be updated when a subscription operation occurs, so such operations purge the entries which are affected, reducing the cache effectiveness. Publishing a new subject will also cause a miss. The cache size has a maximum of 256K entries, and when this is hit, the cache is purged and recreated.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[35]> show cache
tport | hit_pct | hit | miss | max_cnt | max_size
----------------------+---------+-------------+------------+---------+---------
rvd.ipc.tport.0 | 86.70 | 14600408979 | 2239005394 | 24576 | 130
rvd.lex_amesh.tport.1 | 0.00 | 0 | 0 | 0 | 0
rvd.lex_amesh.tport.2 | 84.16 | 1513720684 | 284704081 | 1536 | 447
rvd.lex_amesh.tport.3 | 84.17 | 1513725449 | 284673772 | 1536 | 453
rvd.lex_amesh.tport.4 | 84.16 | 1513723831 | 284727847 | 1536 | 444
rvd.lex_amesh.tport.5 | 88.06 | 16786195897 | 2275244513 | 24576 | 209
Each tport has a route cache. The hit_pct is a percentage, hit * 100 / total. The hit is how many times an entry was present in the cache; a miss is how many times it was not. The max_cnt is the maximum number of cache entries that have occurred since the transport was created. The max_size is the maximum data size of the entries, which are fds. Some of the entries will have zero size, when there is no route for the subject.
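As a worked example using the first row above: hit_pct = 14600408979 * 100 / (14600408979 + 2239005394) ≈ 86.70, which matches the rvd.ipc.tport.0 entry.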
-
show poll
Show the latency of poll states, the average time used for processing timers, read, write, and routing events.
Example:
lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[36]> show poll
timer_lat | timer_cnt | read_lat | read_cnt | rd_lo | route_lat | route_cnt | write_lat | write_cnt | wr_poll | wr_hi
----------+-----------+----------+------------+------------+-----------+------------+-----------+------------+---------+------
2.52us | 5548967 | 4.77us | 4936398767 | 2053110538 | 11.4us | 1434068849 | 15.1us | 2053110184 | 0 | 66
In a busy router, the read, route, write operations will process multiple messages at a time, depending on how many fit inside of a read buffer. A read buffer is 16KB and is resized only when a large message requires more memory. The sum of these is close to the average latency used by the router per message, even if the time used per message is a fraction of that, since the messages are processed in batches.
The read_cnt is the sum of the counts in the rd_lo and read states, the write_cnt is the sum of the counts in the write, wr_hi, wr_poll states. The difference between rd_lo and read is that the rd_lo state occurs after the read buffer is full or the fd has no more data to read. The wr_hi are the number of times that the write buffer is full. The wr_poll state is the number of times that the fd is part of the poll set because there is back pressure on the connection.
-
show hosts
Show the RV host services.
Example:
chex.rvd[VCr9OQDldBjnGLnOXVF7gA]@chex[229]> show hosts
svc | session | user | port | start | cl | bs | br | ms | mr | idl | odl
-----+------------------------+--------+------+-------------------+----+---------+------+------+----+-----+----
7500 | 542AFD39.5F75F9F9BFDED | chex | 7500 | 0320 19:15:55.308 | 1 | 2670095 | 1593 | 2438 | 17 | 0 | 0
7500 | 542AFD39.5F763014D6394 | nobody | 7500 | 0320 19:15:55.308 | 1 | 2670095 | 1593 | 2438 | 17 | 0 | 0
7501 | 542AFD39.5F76301B06616 | nobody | 7500 | 0320 23:03:15.414 | 1 | 0 | 1572 | 0 | 16 | 0 | 0
The svc is the service number, session is the session identifier, user is the user name associated with the session, port is the daemon port number, and start is when the host started. The cl is the number of active clients. If the number of clients is zero, then the host service is not active and it doesn't publish any _RV system subjects. The bs, br, ms, mr, idl, and odl fields are the same stats published with the _RV.INFO.SYSTEM.HOST.STATUS.5230FA7C message.
Field | Description
---|---
svc | Service number
session | Session identifier
session ip | Session identifier in IPv4 address format
port | Daemon port number
start | Start time of the host
cl | Number of clients connected to service
bs | Bytes sent
br | Bytes received
ms | Messages sent
mr | Messages received
idl | Inbound data loss, messages lost by subscriptions
odl | Outbound data loss, messages lost by publishers
The session ip
will be a random address unless configured with the
no_fakeip
setting, described in Tib RV.
-
show rvsub
Show the RV subscriptions, which are any subscriptions that use a service number. A service name used by another protocol that is not a valid RV service will not have RV subscriptions.
Example:
chex.rvd[VCr9OQDldBjnGLnOXVF7gA]@chex[228]> show rvsub
svc | session | user | p | subject
-----+------------------------+--------+---+--------------------------------
7500 | 542AFD39.5F75F9F9BFDED | chex | |
7500 | 542AFD39.5F762C8EF4C3E | nobody | | RSF5.REC.EK.N
| | | | RSF5.REC.ITT.NaE
| | | | RSF5.REC.PPW.NaE
| | | p | _INBOX.542AFD39.5F762C8EF4C3E.>
7501 | 542AFD39.5F762CC9F8385 | nobody | | RSF.REC.TMX.N
| | | | RSF.REC.GLK.NaE
| | | p | _INBOX.542AFD39.5F762CC9F8385.>
The svc field is the service number. The session is an identifier for the connection, which in this case uses the host prefix and a nanosecond resolution timestamp as the unique identifier. There are other methods used, but they usually have a host prefix, a timestamp, and/or a process id.
The user is derived from the protocol's method of attaching a user name to the session. The user is often a login name when using RV. The p is set when the subscription is a pattern. The subject is the subscription string.
-
show rpcs
Show the console rpcs that are currently running. These are created with commands entered into the console or the web interface. These are: "ping", "remote", "show subs", "sub <subject>", "psub <subject>", and "snap <subject>".
Example:
chex.rvd[VCr9OQDldBjnGLnOXVF7gA]@chex[234]> show rpcs
type | arg | recv | count
-----+----------------------+------+------
snap | _7500.RSF.REC.IBM.N | 0 | 1
sub | _7500.RSF.REC.TEST.X | 1 |
The type is the command, and the arg is a subject or a peer name. The recv is the number of messages received, and count is the number expected if it is not a subscription type. The cancel command will stop the non-subscription type commands; the unsub or punsub commands will stop the subscription type commands.
Test Pub Sub
These commands do pub/sub through the console. The messages have a format attached to them, which is an integer value mapped to decoding methods. If the format is matched with a decoder, then it is decoded to field/value pairs and printed. If a method is not matched, then the value is an opaque string of bytes and that is displayed.
-
sub S [F]
Subscribe to subject S
. If a file is present, then the publishes are sent to
the file instead of printed to the console.
-
unsub S [F]
Unsubscribe to subject S
. If a file is present, then stop the publishes sent
to the file. If only unsub
is used, then all subjects are unsubscribed.
-
psub W [F]
Subscribe to RV style wildcard W
. If a file is present, then the publishes
are sent to the file instead of printed to the console.
-
punsub W [F]
Unsubscribe to RV style wildcard W
. If a file is present, then stop the
publishes sent to the file. If only punsub
is used, then all patterns are
unsubscribed.
-
gsub W [F]
Subscribe to glob style wildcard W
. If a file is present, then the publishes
are sent to the file instead of printed to the console.
-
gunsub W [F]
Unsubscribe to glob style wildcard W
. If a file is present, then stop the
publishes sent to the file.
-
snap S [F]
Publish an empty message to subject S with an _INBOX reply, then wait for the _INBOX subject and print the message received. The _INBOX used is assigned and subscribed by the console automatically.
-
pub S M
Send a message M to subscriptions S.
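For example, a simple loopback test from the console might subscribe to a subject, publish to it, and then unsubscribe. A hypothetical sequence, with prompts and the decoded output omitted:
sub TEST.SUBJECT
pub TEST.SUBJECT hello
unsub TEST.SUBJECT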
-
trace S M
Send a message M
to subscriptions S
with the trace flag set, which causes
any of the intermediate hops as well as the final destination to send an ack
reply.
-
ack S M
Send a message M
to subscription S
with the ack flag set, which causes
the destinations to send an ack reply.
-
rpc S M
Send a message M
to subscription S
with a return inbox.
-
any S M
Randomly choose a subscription match for S
and forward message M
to that
endpoint. This would include both wildcard subscriptions and normal ones.
-
cancel
A cancel
command stops any console subscription or RPC, such as ping
. This
marks the endpoint as canceled, so if results are returned after a cancel
,
they will be discarded.
-
reseed
This alters the local bloom filter to use a different seed. Changing the bloom filter seed will alter the bits in the hash such that collisions occur at different positions. If a low rate subscription has a collision with a high rate subscription, this would cause unnecessary traffic that can be avoided by altering the bloom filter seed. This doesn't help when the 32 bit hashes themselves collide, but that is much less likely than a bloom filter collision.
Mute the Logging
-
mute
The log messages are normally printed to the console; this mutes them. The log is still present: the log command will show them, and the log file, if active, will still be appended. If messages are being printed to the console faster than the terminal can display them, mute will turn on automatically.
-
unmute
This removes the mute
for printing log messages to the console.
Turn On/Off Debug Logging
-
debug I
The argument is either an integer mask or a list of strings that turn the debug logging on or off. When debug 0 is used, this turns off the debug messages.
Name | Value | Description
---|---|---
tcp | 0x1 | Print the subjects as they are sent or received on a TCP connection
pgm | 0x2 | Print the subjects as they are sent or received on a PGM connection
ibx | 0x4 | The inbox UDP protocol debugging
transport | 0x8 | Show the message route forwarding
user | 0x10 | User updates debugging, when changes are made to a user state
link_state | 0x20 | Link state message updates are printed
peer | 0x40 | Peer synchronization messages are printed
auth | 0x80 | Authentication messages are printed
session | 0x100 | System message dispatching, IPC message forwarding
hb | 0x200 | Heartbeat and ping messages
sub | 0x400 | Subscription starts and stops
msg_recv | 0x800 | Print system messages when they are received
msg_hex | 0x1000 | Dump the system messages in hex when they are received
telnet | 0x2000 | Show the telnet protocol states
name | 0x4000 | Display name transport update messages
repeat | 0x8000 | Print when repeated subjects are received
not_sub | 0x10000 | Print when not subscribed subjects are received
loss | 0x20000 | Print debugging when message loss occurs
adj | 0x40000 | Print debugging when the link state Dijkstra algo runs
conn | 0x80000 | Show debugging about connections, when established or dropped or timers expire
stats | 0x100000 | Print when forwarding stats, when there are subs to _N.> subjects
dist | | This causes the Dijkstra algo to run once
kvpub | | Turns on debugging when any message is processed
kvps | | Turns on debugging when kv pubsub messages are processed
rv | | Turns on debugging when an rv message is processed
The last 4 don't have an integer mask because they use different debug variables than the others.
Write Events to File
-
wevents F
Dumps the current events to a log file for examining later. Useful when a networking problem occurs and is hard to reproduce.
Stop the Server
-
die [I]
Exit the process immediately, without shutting down existing connections or sending bye messages to the network.
-
quit/exit
Normal shutdown. Existing connections will stop reading new messages, send bye messages to connected peers, and flush the data in the write queues.
Monitoring
Monitoring.