Introduction

Description

Rai MS is a link-state based protocol for constructing Pub/Sub messaging systems that allows for loops and redundancies in the network connections between peers. It has 4 different types of network transports:

  1. OpenPGM based multicast, with a unicast inbox protocol.

  2. TCP point to point connections.

  3. Mesh TCP all to all connections.

  4. Local bridging compatible with RV, NATS, Redis.

The first 3 transports may be interconnected with redundancies. The local bridging transport strips or adds the message metadata that allows for routing through the network, so it can't be looped.

It uses a best effort delivery system. It serializes messages based on subject so that streams are delivered in order, discarding duplicates, but messages lost in transit because of node or network failures are not retransmitted.

Architecture

Authentication

An ECDSA key pair is generated for a service and for each user that is pre-configured. An ECDH key is generated by each peer on startup for a key exchange that establishes a 32 byte session key. This session key is used to authenticate messages sent and received. Each peer in the system has a unique session key, so a message from any one of them can be authenticated. This is described further in Authentication.

Console Interface

The model that a node implements in the base client is close to that of a router. The command line resembles a Cisco-style interface, with the ability to bring transports up and down at run time, examine their state, ping other nodes, traceroute, get help on commands with the ? key, use command line completion, and telnet into the node. More in Console.

Networking

A node consists of a router with several transports. The term "transport" is modeled as a switch, where other nodes on the transport are attached to the switch and one port of the switch is attached to the router. All of the nodes plugged into the switch can communicate without going through the router. This facilitates a multicast style transport, where a single multicast send reaches multiple nodes within the switch. It also allows a listener to accept multiple local connections which use a protocol like RV, NATS, or Redis and communicate without regard to the other nodes attached to other switches or transports through the router.

The subscription mechanism has three layers: the router, the switch, and the connection. The router uses bloom filters to route subjects, the switch uses 32 bit MAC addresses derived from the subject/wildcard hash, and the connection uses a btree of subjects:

  router <-> bloom filter 1 <-> switch 1 <-> mac 1 <-> connection 1 <-> btree entry 1
             bloom filter 2     switch 2     mac 2     connection 2     btree entry 2

More about this in the Networking section.

The first thing a node does after authentication is download the peer's LSDB (Link-State DB), which first consists of records for every other peer:

  { bridge id, session key, peer name, sub seqno, link seqno, bloom }

The seqno values allow for delta updates of the LSDB, which can add/remove a link or add/remove a subscription from the bloom filter. The bloom filter contains everything needed to filter the subscriptions that the peer has interest in. It generally uses about 2 bytes per subscription for a false positive rate of about 0.05% (1 in 2000 subjects), so if a peer has 10,000 subjects or wildcards open, it will be about 20,000 bytes in size.
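
As a sanity check on these numbers, the textbook Bloom filter sizing formula gives bits per entry as -ln(p) / (ln 2)^2; a quick sketch (this is not the actual Rai MS filter implementation, just the standard estimate):

```python
import math

def bloom_bits_per_entry(p):
    """Bits per entry for a Bloom filter with false positive rate p."""
    return -math.log(p) / (math.log(2) ** 2)

bits = bloom_bits_per_entry(0.0005)     # 0.05% = 1 in 2000
print(round(bits, 1))                   # ~15.8 bits, about 2 bytes
print(round(10_000 * bits / 8))         # ~19,800 bytes for 10,000 subjects
```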

Then for each bridge id, it downloads the links that the peer is connected to for each transport/switch. The local bridging connections for foreign protocols like RV, NATS, and Redis are directly attached to the peer and are considered the peer's subscriptions. In other words, the bloom filter for a peer has all of the subscriptions for every RV, NATS, or Redis client connected to it.

The link records are for nodes which are directly attached to the peer via a transport. There may be many nodes using the same link attached to the peer and a node may be reachable via multiple transports. The unique feature identifying this link is the bridge id, tport id pair.

  { bridge id, tport name, peer name, tport id, cost }

A delta update of the LSDB, whether a link change or a subscription change, is broadcast to all of the nodes. If a network split occurs and some nodes are orphaned from the network for a period before rejoining, then synchronization of the LSDB with a peer occurs when the sub seqno or the link seqno has advanced. Any peer is capable of updating any other peer since the LSDB is the same in every one. The primary means of watching the seqno changes is with a transport heartbeat sent on a 10 second (default) interval between directly connected peers. In addition, each peer randomly chooses another peer to ping at a random interval based on the heartbeat interval.
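
The seqno comparison that drives synchronization can be sketched as follows; the record fields mirror the LSDB record above, but the function and field names are illustrative, not the actual Rai MS implementation:

```python
# Hypothetical sketch of the LSDB seqno check: a heartbeat advertises the
# sender's current seqnos, and if either has advanced past our copy, a
# delta update of the LSDB is requested.
from dataclasses import dataclass

@dataclass
class PeerRecord:
    bridge_id: str
    sub_seqno: int    # advances on subscription add/remove
    link_seqno: int   # advances on link add/remove

def needs_sync(local: PeerRecord, hb_sub_seqno: int, hb_link_seqno: int) -> bool:
    return (hb_sub_seqno > local.sub_seqno or
            hb_link_seqno > local.link_seqno)

rec = PeerRecord("ruby", sub_seqno=41, link_seqno=7)
print(needs_sync(rec, 42, 7))   # True: subscriptions changed
print(needs_sync(rec, 41, 7))   # False: in sync
```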

The behavior of a transport which becomes too congested is that heartbeats are missed, so the link is dropped and rejoined at the next heartbeat. The effect of this is that 10 seconds of traffic is rerouted, or lost if there are no other routes to the peers on the other side.

More in the Link State section.

Multicast routing

Any time a link is added to the LSDB, the routing is recalculated using a Dijkstra path finding algorithm. The shortest path is chosen, and if multiple equal paths exist, then the link with the lowest weight is chosen. Load balancing can occur when there are two or more equal paths to a peer based on the subject mac of the destination. The LSDB is considered "consistent" when all peers agree that a link exists. If peer A has an outgoing link to peer B, then peer B must have a link to peer A. If this is not the case, then LSDB synchronization requests to the closest peer along the path are performed until the network converges to a consistent state.
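
The recalculation described above can be sketched with a standard Dijkstra implementation. The adjacency map and costs below match the example graph used later in this document; the code itself is illustrative, not the Rai MS internals (which also apply the equal-path tie-breaking described above):

```python
import heapq

def shortest_paths(links, src):
    """links: {node: [(neighbor, cost), ...]}; returns {node: total_cost}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale entry, already improved
        for v, cost in links.get(u, ()):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

links = {
    "A": [("B", 200), ("C", 200)],
    "B": [("A", 200), ("C", 100), ("D", 200)],
    "C": [("A", 200), ("B", 100), ("D", 300)],
    "D": [("B", 200), ("C", 300)],
}
print(shortest_paths(links, "A"))  # {'A': 0, 'B': 200, 'C': 200, 'D': 400}
```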

All peers will choose the same route for a subject when the LSDB is synchronized. If the LSDB is not synchronized, then messages may be duplicated on alternative routes, or a node may decide that routing is not necessary for a message when it is. For this reason, keeping the LSDB synchronized as fast as possible is a top priority of a node.

A technique called reverse path forwarding is used for multicast messages. If a message is unicast to a peer, which is the case for inbox style messaging, then there is only one path for the message: the shortest path. With multicast, there are multiple paths that a message may take, each the shortest path to a subscriber. Reverse path forwarding uses the source of the message to route it. The algorithm increments the distance from the source to compute the set of nodes that are possible for a message at each hop, then chooses the best traversal of the network graph so that the entire network is covered with a minimal set of transmissions. Once this is calculated, it can be reused until a link in the LSDB is updated again. This set of paths is augmented with bloom filters from the peers, so that a router will forward a message only if it passes through the reverse path forwarding algorithm and it passes through the bloom filters attached to the path.
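
The per-hop check at the heart of this technique can be sketched as the classic reverse-path test: forward a multicast message only if it arrived on the link we would ourselves use to reach its source. The table and names below are illustrative; the bloom filter pruning that Rai MS layers on top is omitted:

```python
# Classic reverse-path-forwarding check at a single node. next_hop_to is
# the next-hop link table produced by the shortest path calculation.
def rpf_accept(next_hop_to, source, arrival_link):
    return next_hop_to.get(source) == arrival_link

next_hop_to = {"A": "link_ab", "C": "link_bc", "D": "link_bd"}  # at node B
print(rpf_accept(next_hop_to, "A", "link_ab"))  # True: on shortest path
print(rpf_accept(next_hop_to, "A", "link_bc"))  # False: duplicate, drop
```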

Wildcard Matching

A generic PCRE based conversion is used to enable multiple wildcard styles to coexist between peers. The bloom filter contains both a prefix and a suffix matching filter, so that A.*.B is matched with both ends of the wildcard. When a subject is passed through a bloom filter, the prefix of the subject is hashed with different seeds based on the prefix lengths used. If a peer is interested in subject prefix lengths of 3, 5, 10, and 20, as well as the subject itself, these lengths are noted in the bloom filter and the hash set is calculated as

  hs = hash( subject, seed = 0 )
  h3 = hash( subject[1..3], seed = 3 )
  h5 = hash( subject[1..5], seed = 5 )
  h10 = hash( subject[1..10], seed = 10 )
  h20 = hash( subject[1..20], seed = 20 )

If any of these hash values are present in the bloom filter, then a check for suffix matches is done. The hash set is computed in groups before any routing, based on the entire set of hashes needed, in order to take advantage of instruction parallelism, computing several hashes per iteration of the subject length.
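
The hash set above can be made runnable with a stand-in hash; here zlib.crc32 seeded with the prefix length substitutes for the hash function Rai MS actually uses internally:

```python
import zlib

def prefix_hashes(subject: str, prefix_lens):
    """Hash set for a subject: one hash of the full subject (seed 0) plus
    one per prefix length a peer has expressed interest in. zlib.crc32
    with the seed as the initial value stands in for the real hash."""
    hashes = {0: zlib.crc32(subject.encode())}
    for n in prefix_lens:
        if n <= len(subject):
            hashes[n] = zlib.crc32(subject[:n].encode(), n)  # seed = n
    return hashes

hs = prefix_hashes("ORDERS.NYSE.IBM", [3, 5, 10, 20])
print(sorted(hs))   # [0, 3, 5, 10] -- 20 exceeds the subject length
```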

Anycast and Shardcast

An anycast route is a single match out of many. A set of peers interested in a subject can be computed because the LSDB contains filters for all of them. A peer can be randomly chosen from this set and the message unicast routed to it. If that peer has a false match, or its interest in the subject is lost, it can choose another peer from the set and forward the message.

A shardcast targets a set of peers interested in the prefix of a subject, but each for only a shard of the subject space. The bloom filter contains enough info to filter by both the prefix hash and the part of the subject space that a peer is interested in. In this case, the peers have predetermined how many shards there should be and how the shards are split between them. If A subscribes to X.* using shard 1/2 and B subscribes to X.* using shard 2/2, then the subjects X.Y and X.Z are split between A and B based on the hash of X.Y and the hash of X.Z. This is a variation of suffix matching where the hash of the subject is used to discriminate the route of the message.
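
The shard selection can be sketched as a hash of the subject modulo the agreed shard count; zlib.crc32 again stands in for the real hash function, so the actual shard assignments in Rai MS will differ:

```python
import zlib

def shard_of(subject: str, num_shards: int) -> int:
    """Pick the shard for a subject. Every peer computes the same value,
    so the subject space splits deterministically across subscribers."""
    return zlib.crc32(subject.encode()) % num_shards

# A handles shard 0, B handles shard 1 of the X.* wildcard:
for subj in ("X.Y", "X.Z"):
    print(subj, "-> shard", shard_of(subj, 2))
```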

Why use it?

Distributed systems increasingly cross network boundaries. Traditional broker based or multicast based systems have difficulty expanding beyond these boundaries. To remedy this, network designs may deploy application specific routers, shard the messaging system, or use other protocols such as mesh or gossip based systems. All of these solutions have advantages and drawbacks.

The aims of this system are:

  1. Flexible transports and networking.

  2. Fast message authentication.

  3. Fast network convergence.

  4. Distribute messages only when interest is present.

  5. Utilize redundant links.

  6. Flexible message distribution: inbox, multicast, anycast, shardcast.

  7. Flexible wildcarding mechanism.

  8. Ability to recover subscription interest at the endpoints.

Building

There are a lot of submodules and dependencies, so at present, building with the build Makefile is the easiest way to compile everything: clone it, install the dependencies, clone all of the modules, and build everything. The RPM dependencies will probably need the EPEL repo installed when using enterprise RedHat, CentOS, or a derivative, for the liblzf-devel package (and maybe others).

  $ git clone https://github.com/raitechnology/build
  $ cd build
  $ make install_rpm_deps
  $ make clone
  $ make

If this completes, there will be a static binary at raims/OS/bin/ms_server where OS is something like RH8_x86_64.

If you set the port_extra env var for debugging, then the RH8_x86_64-g directory will be populated, built without optimization and with the -g flag.

  $ export port_extra=-g
  $ make

Running the MS server

The first task is to create the authentication keys for a service "test". The ms_gen_key program creates and updates the configuration. The user keys are stored in the user_X_svc_test.yaml files and contain ECDH key pairs. The service key is an ECDSA key pair that signs each user; the signatures are stored in the svc_test.yaml file. The startup.yaml file contains the startup config. The config.yaml file includes all of the files in the config directory.

  $ cd build/raims
  $ ms_gen_key -u A B C -s test
  create dir  config                          -- the configure directory
  create file config/.salt                    -- generate new salt
  create file config/.pass                    -- generated a new password
  create file config/config.yaml              -- base include file
  create file config/param.yaml               -- parameters file
  create file config/svc_test.yaml            -- defines the service and signs u
  create file config/user_A_svc_test.yaml     -- defines the user
  create file config/user_B_svc_test.yaml     -- defines the user
  create file config/user_C_svc_test.yaml     -- defines the user
  OK? y
  done

This creates the keys for users A, B, and C. These keys are encrypted with the .pass and .salt files.

More about this in the [key config guide](keys.md).

Run the ms_server program and configure it. The -u option specifies the user and service. The -c option starts the command line interface, where the networks can be defined and connected. The following defines a mesh endpoint and saves it to the startup config.

  $ ms_server -u A.test -c
  05:54:26.267  session A.test[RthXjJscfuvnG2+J1/PJ1w] started, start time 1644818066.265990830
  A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[249]> configure transport mytran
  A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[250](mytran)> type mesh
  A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[251](mytran)> listen *
  A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[252](mytran)> port 5000
  A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[253](mytran)> show
  tport: mytran
  type: mesh
  route:
    listen: "*"
    port: 5000
  A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[254](mytran)> exit
  A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[255]> listen mytran
  transport "mytran" started listening
  05:55:09.934  listening on [::]:5000
  05:55:09.937  network converges 0.003 secs, 0 uids authenticated, add_tport
  A.test[RthXjJscfuvnG2+J1/PJ1w]@tracy[256]> save
  config saved
  05:55:12.790  update file A/param.yaml            -- parameter config
  05:55:12.790  create file A/startup.yaml          -- startup config
  05:55:12.790  create file A/tport_mytran.yaml     -- transport

The files are described in the Configuration section and the transports are described in the Networking section. The authentication keys need to be distributed to all the nodes, but the networking config will be somewhat unique to each node.

Configuration

Key Configuration

The key configuration files are necessary to join the network. They authenticate peers and the message traffic that flows between peers. They do not authenticate the local bridging protocols RV, NATS, or Redis.

Generating a master config is done with the ms_gen_key program. The default location for the config directory is ./config, other locations are specified with the -d option.

Initially, the config directory is empty. Initialize with some users and a service name.

$ ms_gen_key -u A B C -s test
create dir  config                          -- the configure directory
create file config/.salt                    -- generate new salt
create file config/.pass                    -- generated a new password
create file config/config.yaml              -- base include file
create file config/param.yaml               -- parameters file
create file config/svc_test.yaml            -- defines the service and signs users
create file config/user_A_svc_test.yaml     -- defines the user
create file config/user_B_svc_test.yaml     -- defines the user
create file config/user_C_svc_test.yaml     -- defines the user
OK? y
done

Exporting the keys for each of the nodes causes the .pass file to change and the unnecessary private keys to be removed. The only private key that remains is the peer's own. This trimmed configuration allows the peer to run, but not to generate new peers, because the private key of the service is not present.

$ ms_gen_key -x A B C -s test
- Loading service "test"
- Signatures ok
create dir  A                          -- exported configure directory
create file A/.salt                    -- a copy of salt
create file A/.pass                    -- generated a new password
create file A/param.yaml               -- a copy of param
create file A/config.yaml              -- base include file
create file A/svc_test.yaml            -- defines the service and signs users
create file A/user_A_svc_test.yaml     -- defines the user
create file A/user_B_svc_test.yaml     -- defines the user
create file A/user_C_svc_test.yaml     -- defines the user
create dir  B                          -- exported configure directory
create file B/.salt                    -- a copy of salt
create file B/.pass                    -- generated a new password
create file B/param.yaml               -- a copy of param
create file B/config.yaml              -- base include file
create file B/svc_test.yaml            -- defines the service and signs users
create file B/user_A_svc_test.yaml     -- defines the user
create file B/user_B_svc_test.yaml     -- defines the user
create file B/user_C_svc_test.yaml     -- defines the user
create dir  C                          -- exported configure directory
create file C/.salt                    -- a copy of salt
create file C/.pass                    -- generated a new password
create file C/param.yaml               -- a copy of param
create file C/config.yaml              -- base include file
create file C/svc_test.yaml            -- defines the service and signs users
create file C/user_A_svc_test.yaml     -- defines the user
create file C/user_B_svc_test.yaml     -- defines the user
create file C/user_C_svc_test.yaml     -- defines the user
OK? y
done

Copy the A config directory to the A node/config, the B config directory to the B node/config, etc. The .pass file is unique for each peer so that it can be removed after running the server, rendering the configured keys unreadable until the .pass file is restored or the peer's config is regenerated from the master config.

The copy of the master config includes a copy of param.yaml, since that can contain global configuration, but doesn't copy any local configuration such as the startup and network configuration.

The master config will also work, so if this type of security is unnecessary, just copying it to the peers will allow them to run.

Single File Configuration

The ms_gen_key option -o will concatenate the configuration into a single file:

$ ms_gen_key -s test -o test.yaml
create dir  config                   -- the configure directory
create file config/.salt             -- generate new salt
create file config/.pass             -- generated a new password
create file config/config.yaml       -- base include file
create file config/param.yaml        -- parameters file
create file config/svc_test.yaml     -- defines the service and signs users
OK? y
done
- Output config to "test.yaml"

Running ms_server -d config loads the configuration from a directory and running ms_server -d test.yaml loads the configuration from a file. In both cases, the configuration loaded is the same.

A test network can be set up using only the loopback interface by describing the network using a format output by the show_graph command. The format of this is:

node A B C D
tcp_link_ab A B : 200
tcp_link_bc B C : 100
tcp_link_ac A C : 200
tcp_link_bd B D : 200
tcp_link_dc D C : 300

The node line declares all of the users. The tcp_ lines describe how the users are connected. The number following the : is the cost of the transport.
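
A minimal parser for this format might look like the following; ms_test_adj is the real consumer of this format, so the handling details and the fallback cost here are assumptions for illustration:

```python
def parse_graph(text):
    """Parse 'node A B C ...' and '<link> <members> : <costs>' lines into
    a node list and a {link_name: (members, costs)} map."""
    nodes, links = [], {}
    for line in text.strip().splitlines():
        head, _, cost_part = line.partition(":")
        fields = head.split()
        if fields[0] == "node":
            nodes = fields[1:]
        else:
            name, members = fields[0], fields[1:]
            # assumed fallback cost when none is given
            costs = [int(c) for c in cost_part.split()] or [1000]
            links[name] = (members, costs)
    return nodes, links

text = """node A B C D
tcp_link_ab A B : 200
tcp_link_bc B C : 100"""
nodes, links = parse_graph(text)
print(nodes)                  # ['A', 'B', 'C', 'D']
print(links["tcp_link_ab"])   # (['A', 'B'], [200])
```

Multi-cost lines such as the mesh examples later in this section (one cost per member) also parse, since the costs are simply split on whitespace.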

Running the ms_test_adj program with this description, saved in a file called "graph.txt", will generate a configuration, output to "graph.yaml":

$ ms_test_adj -l graph.txt > graph.yaml

The -l option causes the links to be resolved by exchanging messages over the loopback interface. At the bottom of the generated "graph.yaml", there will be commands in comments to run this configuration. Running these will create 4 users in a network described by the graph. The following uses those commands with the first three running in the background and the last with a console attached, but you could run each in a different terminal with consoles attached, in order to use the sub and trace commands to test how messages would be routed through the network.

$ ms_server -d graph.yaml -u A -t link_ab.listen link_ac.listen &
$ ms_server -d graph.yaml -u B -t link_ab.connect link_bc.listen link_bd.listen &
$ ms_server -d graph.yaml -u C -t link_ac.connect link_bc.connect link_dc.listen &
$ ms_server -d graph.yaml -u D -t link_bd.connect link_dc.connect -c

In addition to "tcp" type links, you can also define "mesh" and "pgm" types, but pgm requires a non-loopback interface that has multicast, like a Linux bridge.

node A B C D
mesh_link_abcd A B C D : 100 1000 100 1000
mesh_link_abcd2 A B C D : 1000 100 1000 100

The above graph would create two meshes, with different costs for some of the paths. Messages would be routed over both meshes by sharding the subject space, using one mesh for half of the subject space and the other mesh for the other half.

There is a graphical interface to view the network using the cytoscape package.

$ ms_server -c
chex.test[0vEvE73U78HkGZUgBK94mQ]@chex[10]> configure transport web type web port 8080 listen 127.0.0.1
Transport (web) updated
chex.test[0vEvE73U78HkGZUgBK94mQ]@chex[11]> listen web
Transport (web) started listening
0209 22:54:25.382  web: web start listening on 127.0.0.1:8080
0209 22:54:25.382  http_url http://127.0.0.1:8080

Connect to the URL http://127.0.0.1:8080/graph_nodes.html with a web browser, erase the existing text in the text box, paste the graph text, then click "show graph".

Parameters

The parameters section of the configuration is used to look up values that can alter the behavior of the server. These fields can be set anywhere in the config files, but are usually in the "param.yaml" or "startup.yaml" files. Since "config.yaml" includes "*.yaml", any yaml file in the config directory can contain parameters. Any field value pair which is not in a service, user, transport, or group section is added to the parameters section.

This configuration is a list of parameters:

parameters:
  pass: .pass
  salt: .salt
heartbeat: 5 secs
reliability: 10 secs
tcp_noencrypt: true

The "parameters:" wrapper is optional and not necessary to define these fields.

Name                 Type      Default   Description
salt                 filename  none      File to find encryption salt
pass                 filename  none      File to find encryption password
salt_data            string    none      Base 64 encoded encryption salt
pass_data            string    none      Base 64 encoded encryption password
listen               array     none      Startup listen transports
connect              array     none      Startup connect transports
pub_window_size      bytes     4 MB      Size of publish window
sub_window_size      bytes     8 MB      Size of subscribe window
pub_window_time      time      10 secs   Time of publish window
sub_window_time      time      10 secs   Time of subscribe window
heartbeat            time      10 secs   Interval of heartbeat
reliability          time      15 secs   Time of publish reliability
timestamp            string    LOCAL     Log using local time or GMT
pid_file             string    none      Daemon pid file
map_file             string    none      Use for key value storage
db_num               string    none      Default db number for key value
ipc_name             string    none      Connect to IPC sockets
tcp_timeout          time      10 secs   Default timeout for TCP/mesh connect
tcp_ipv4only         boolean   false     Use IPv4 addressing only
tcp_ipv6only         boolean   false     Use IPv6 addressing only
tcp_noencrypt        boolean   false     Default for TCP/mesh encryption
tcp_write_timeout    time      10 secs   Timeout for TCP write
tcp_write_highwater  bytes     1 MB      TCP write buffer size
idle_busy            count     16        Busy wait loop count
working_directory    dirname   none      Switch to directory when in daemon mode

  • salt, pass, salt_data, pass_data — Either salt and pass or salt_data and pass_data are required for startup. The keys defined in the configuration are encrypted with these values. Any key derived during execution is mixed with the salt, which must be the same in all peers.

  • listen, connect — The startup transports. They are started before any other events are processed. If a listen fails, then the program exits. A connect failure will not cause an exit, since it retries.

  • pub_window_size, sub_window_size, pub_window_time, sub_window_time — These track the sequence numbers of messages sent and received. They are described in Publish sequence window.

  • heartbeat — The interval at which heartbeats are published to directly connected peers. A link is considered inactive when a heartbeat is missed for 1.5 times this interval. The link is reactivated when a heartbeat is received.

  • timestamp — When set to GMT, the time stamps are not offset by the local timezone.

  • pid_file — A file that contains the process id when forked in rvd mode.

  • map_file — If a Redis transport is used, this is where the data is stored. If no map is defined, then stores will fail and retrieved data will be zero. The kv_server command will initialize a map file.

  • db_num — The default database number for the Redis transport.

  • ipc_name — When set, allows IPC processes to connect through Unix sockets and subscription maps using the same name. If the processes are shut down, they will restart or stop the subscriptions using the maps.

  • tcp_timeout — The default retry timeout for TCP and mesh connections.

  • tcp_ipv4only — Resolve DNS hostnames to IPv4 addresses only.

  • tcp_ipv6only — Resolve DNS hostnames to IPv6 addresses only.

  • tcp_noencrypt — When true, the default for TCP and mesh connections is to not encrypt the traffic.

  • tcp_write_timeout — Amount of time to wait for TCP write progress when the write buffer is full. After this time, the socket is disconnected and messages are lost. When a TCP write buffer holds tcp_write_highwater bytes or more, backpressure can be applied to the sockets that are forwarding data, causing them to add latency while waiting for the writer to have space available.

  • tcp_write_highwater — Amount of data to buffer for writing before applying back pressure to forwarding sockets.

  • idle_busy — Number of times to loop while no activity is present. More looping while idle keeps the process on a CPU for lower latency at the expense of wasted CPU cycles.

  • working_directory — When running in the background in daemon mode (RVD mode without the -foreground argument, or with the -b argument), switch to this directory after forking and detaching from the terminal. This directory can be used to store the .console_history file or other files that are saved using console subscription commands. If the command line via telnet is not used, then no files are created.

Startup

The startup section can be used to start transports during initialization. This syntax is used by the save console command, but can also be edited by hand. The following causes the transport named myweb to start with listen, then mymesh and mytcp to start with connect. The listeners are always started before the connectors.

startup:
  listen:
    - myweb
  connect:
    - mymesh
    - mytcp

Hosts

The hosts section can be used to assign address strings to names, similar to an /etc/hosts configuration. The values assigned to the names are substituted in any connect or listen configuration of a transport. For example, the following hosts are used in the connect and listen portions of the net transport.

hosts:
  chex: 192.168.0.16
  dyna: 192.168.0.18
transports:
  - tport: net
    type: mesh
    route:
      connect: chex:5001
      listen: dyna:5000
startup:
  connect:
    - net

A mesh type transport with connect uses both the listen and the connect addresses defined, since all peers can both connect and accept connections.
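
The substitution described above can be sketched as a simple name lookup on the host part of the address; the function name is hypothetical, for illustration only:

```python
def resolve_host(addr: str, hosts: dict) -> str:
    """Replace the name before the ':' with its hosts entry, if any."""
    name, sep, port = addr.partition(":")
    return hosts.get(name, name) + sep + port

hosts = {"chex": "192.168.0.16", "dyna": "192.168.0.18"}
print(resolve_host("chex:5001", hosts))   # 192.168.0.16:5001
print(resolve_host("dyna:5000", hosts))   # 192.168.0.18:5000
print(resolve_host("*:5000", hosts))      # *:5000 (unchanged)
```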

Authentication

Authentication has two parts: the initial key exchange that sets up a unique session key for each peer, and message authentication that verifies that a peer sent a given message. The key exchange protocol uses an Elliptic Curve Diffie-Hellman (ECDH) exchange that is signed by an Elliptic Curve Digital Signature Algorithm (ECDSA) key. Message authentication uses an HMAC digest, computed by enveloping the message with a peer's session key and computing the hash, along with sequencing by subject to prevent a replay of messages.
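
The per-message HMAC can be sketched as follows, using Python's hmac with SHA-256 as a stand-in for the actual digest algorithm. The envelope layout and seqno encoding are assumptions; the 16-byte truncation mirrors the digest_16 field visible in the message dumps below:

```python
import hmac, hashlib, os

session_key = os.urandom(32)   # established by the key exchange

def sign(subject: bytes, seqno: int, payload: bytes) -> bytes:
    """Envelope the message with the session key and compute a truncated
    HMAC digest; the seqno guards against replay of captured messages."""
    mac = hmac.new(session_key, digestmod=hashlib.sha256)
    mac.update(subject)
    mac.update(seqno.to_bytes(8, "big"))
    mac.update(payload)
    return mac.digest()[:16]

d1 = sign(b"_X.HB", 1, b"hello")
assert d1 == sign(b"_X.HB", 1, b"hello")   # receiver can recompute it
assert d1 != sign(b"_X.HB", 2, b"hello")   # replayed seqno won't match
```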

Key Exchange

Two peers authenticate with each other by signing a message with a configured ECDSA key. This message includes a generated ECDH public key. The ECDH public key is used by each side, together with the corresponding ECDH private key, to compute the shared secret. The secret, along with a unique nonce, a time stamp, and a sequence number, creates a temporary key that is used to encrypt a random session key.

For peers A and B to complete the key exchange, there are 4 messages:

  • HELLO/HB from peer A sent to peer B — Includes a seqno, a time, a nonce, and an ECDH public key. Since these are unique for each side, call these A_seqno, A_time, A_nonce, A_ECDH_pub.

  • AUTH from peer B sent to peer A — Includes B_ECDSA_sig, B_seqno, B_time, B_nonce, B_ECDH_pub, B_auth_key, A_seqno, and A_time. The A_seqno and A_time allow peer A to match the unique A_nonce which corresponds to the HELLO message sent previously. The last two HELLO messages are tracked so it must match one of these. The B_auth_key contains an AES encrypted session key which must be decrypted by computing a temporary key using the data from B as well as the ECDH secret computed from B_ECDH_pub and A_ECDH_pri. Peer A trusts peer B if the decrypted session key in B_auth_key authenticates the message using the HMAC computation and the HMAC computation is also signed by B_ECDSA_sig.

  • AUTH from peer A sent to peer B — The reverse of above, includes A_ECDSA_sig, A_seqno, A_time, A_nonce, A_ECDH_pub, A_auth_key, B_seqno, and B_time. The B_seqno and B_time are used to match the B_nonce included in the previous AUTH and used by peer A to create the temporary key which encrypts the A_auth_key session key for A. B trusts peer A if the decrypted session key in A_auth_key authenticates the message using the HMAC computation and the HMAC computation is also signed by A_ECDSA_sig.

  • AUTH_OK from peer B sent to peer A — This notifies peer A that authentication was successful.

If either AUTH fails the HMAC computation, then the authentication fails and one or both peers are ignored for 20 seconds (or 2 times the heartbeat interval). It is possible that the latency of the key exchange is greater than 2 HELLO/HB messages, so that the nonce associated with the seqno/time pair is too old and the authentication must restart.

The ECDSA private key used to sign the authentication messages is either the configured key pair from the service or the configured key pair from the user. A configuration may not include the service private key in the case that a user has fewer privileges than the service, which has admin privileges. The service's private key is able to sign users which don't yet exist so they can be added to the system, but a user's private key can only authenticate itself.

The ECDH algorithm used is EC25519. The ECDSA algorithm used is ED25519.

The following is from the Example Message Flow. This shows the HELLO/HB part of the key exchange, where peer A is ruby and peer B is dyna.

_X.HELLO ... ruby -> dyna
   bridge_16 [1027]   : xq6vl+2HcoDxtt+7lC7dGQ
   digest_16 [1029]   : mB1uDQ7fsGmYScIGU0kt6Q
   sub_s2 [1792]      : "_X.HB"
   user_hmac_16 [1028] : TQO1sorP9oD+smMOrnvzuQ
   seqno_2 [273]      : 1
   time_8 [787]       : 1663967268385616894
   uptime_8 [788]     : 17982050574
   start_8 [794]      : 1663967250404676993
   interval_2 [277]   : 10
   cnonce_16 [1034]   : IG45ISINnT0bX2Td6Ovivw
   pubkey_32 [1357]   : +A2dlZCcDo8vS/XsWApNNfJwQH8ApmFIRTOcS+cPuAk
   sub_seqno_2 [274]  : 0
   user_s2 [1836]     : "ruby"
   create_s2 [1838]   : "1663967250.404513467"
   link_state_2 [281] : 0
   converge_8 [839]   : 1663967250404676993
   uid_cnt_2 [292]    : 0
   uid_csum_16 [1036] : xq6vl+2HcoDxtt+7lC7dGQ
   version_s2 [1840]  : "1.12.0-42"
   pk_digest_16 [1091] : SMnBqzoh/w6IFi2c7zoxMw

The seqno_2, time_8, cnonce_16, pubkey_32 are the A_seqno, A_time, A_nonce, and A_ECDH_pub. The user_hmac_16, start_8, and service ECDSA_pub are combined to create a hello_key, which is used to authenticate the HELLO message stored in pk_digest_16, since the session key that is the product of the key exchange is not yet known by dyna. The service ECDSA_pub is never sent over the wire, so it is used as a pre-shared key in this instance. There is another pre-shared key used by the Key Derivation Function (KDF) to generate keys from secrets, nonces, seqnos, and time stamps. The KDF is seeded by a 640 byte salt, shared along with the service ECDSA_pub key in all of the peers that need to communicate.

The first AUTH message from peer B (dyna) to peer A (ruby):

_I.xq6vl+2HcoDxtt+7lC7dGQ.auth ... dyna -> ruby
   bridge_16 [1027]   : wwEnbQEY2FMuwZGSjpi3jQ
   digest_16 [1029]   : 3UY+SJQYy3wGN0dW3zc4fg
   sub_s2 [1792]      : "_I.xq6vl+2HcoDxtt+7lC7dGQ.auth"
   user_hmac_16 [1028] : PYv43FUBG3N8ok+jn4nBPQ
   seqno_2 [273]      : 1
   time_8 [787]       : 1663967268386849657
   uptime_8 [788]     : 63309580030
   interval_2 [277]   : 10
   sub_seqno_2 [274]  : 0
   link_state_2 [281] : 0
   auth_seqno_2 [285] : 1
   auth_time_8 [798]  : 1663967268385616894
   auth_key_64 [1542] : AdM61M2DqR6hXdVnPnp716n5lQwcBAyx0N1jzGtzIM9OmAF4txsoZRd1YMOySIcxkyydHELJHfgVflEtnLg9Fg
   cnonce_16 [1034]   : TEbM+MfLCp66ds36xh0JAA
   pubkey_32 [1357]   : PyEHl7Y3IxAkK5OQMnJzggmlKlUo8+RiBif0P7h+8kg
   auth_stage_2 [305] : 1
   user_s2 [1836]     : "dyna"
   create_s2 [1838]   : "1663967205.077153809"
   expires_s2 [1839]  : null
   start_8 [794]      : 1663967205077372910
   version_s2 [1840]  : "1.12.0-42"
   pk_sig_64 [1610]   : gR2ovdrI4yfxdc7ZAR+ID00hj2HDYEcEexU/ib4CDAU4t2E/nzC6c1dK0s14RiZIWzHHxRFR6D2uJ/ZaHHwaDw

The auth_seqno_2 and auth_time_8 fields are the A_seqno and A_time values from ruby, used to find the A_nonce (cnonce_16) in the HELLO message. These, along with seqno_2, time_8, cnonce_16, and pubkey_32, are used to construct the temporary key that decrypts auth_key_64, which is the session key used by dyna in the HMAC computation that authenticates messages, compared against digest_16. The pk_sig_64 is the ECDSA signature of the message, signed either by the service's private key or by the user dyna's private key.

After this succeeds, ruby trusts messages from dyna that include an HMAC computation digest_16 with each message, along with a seqno and time stamp to prevent replays.

The second AUTH message from peer A (ruby) to peer B (dyna):

_I.wwEnbQEY2FMuwZGSjpi3jQ.auth ... ruby -> dyna
   bridge_16 [1027]   : xq6vl+2HcoDxtt+7lC7dGQ
   digest_16 [1029]   : h81umkyeNoYJAbomEWE+ng
   sub_s2 [1792]      : "_I.wwEnbQEY2FMuwZGSjpi3jQ.auth"
   user_hmac_16 [1028] : TQO1sorP9oD+smMOrnvzuQ
   seqno_2 [273]      : 1
   time_8 [787]       : 1663967268387280755
   uptime_8 [788]     : 17982688972
   interval_2 [277]   : 10
   sub_seqno_2 [274]  : 0
   link_state_2 [281] : 0
   auth_seqno_2 [285] : 1
   auth_time_8 [798]  : 1663967268386849657
   auth_key_64 [1542] : v4mYze2OruL2L02gODDt7Fd9FHTDPLO0UD/auhab+FJiGgbD473osbwlYKfYBVgwvZMFqbLpVnLiGPHD+MXPtw
   cnonce_16 [1034]   : zUYBUCh9n0L4F0dltxxtyg
   pubkey_32 [1357]   : +A2dlZCcDo8vS/XsWApNNfJwQH8ApmFIRTOcS+cPuAk
   auth_stage_2 [305] : 2
   user_s2 [1836]     : "ruby"
   create_s2 [1838]   : "1663967250.404513467"
   expires_s2 [1839]  : null
   start_8 [794]      : 1663967250404676993
   version_s2 [1840]  : "1.12.0-42"
   pk_sig_64 [1610]   : 6lU9Yz3cvW178goVHwakHsFR55TYid9SHDwjIl/fPrxFVCkCujLxK2HQXNtw3zeVRgmi01pGEqemBUW59YuNDA

The same exchange from the first AUTH message is used in order for dyna to trust ruby.

System Compromise

If a host is compromised and the KDF pre-shared key and service ECDSA_pub key are discovered along with a user ECDSA_pri key, then an unauthorized party could masquerade as that user.

One way to prevent this is to remove the pre-shared 640 byte salt file after starting a server or the unique password file used to encrypt the ECDSA keys in the configuration files. Both the salt and password are needed to decrypt the keys.

Another option is to use stdin for reading the configuration so that no secrets are stored in the filesystem. For example, this will configure ms_server through sending a configuration through ssh to a remote host:

$ cat config.yaml | ssh host "nohup /usr/bin/ms_server -d - -b > /dev/null 2> /dev/null"

The ms_server running on host will read the configuration from stdin (-d - argument) and then fork itself to run in the background (-b argument).

Message Authentication

The function of the key exchange protocol is to initialize each peer with a random 32 byte session key, which is used to authenticate messages. An HMAC calculation of the message is done by enveloping the message data with the key and hashing it using an AES based hash that results in an 8 byte digest:

  AES( IV = 8 bytes key )( [ message ] [ 24 bytes key ] )

Note that HMAC is traditionally performed as MD5( key.opad + MD5( key.ipad + message ) ) or SHA3( message + key ). The above AES construction is chosen purely for speed, since AES instructions are widely available and an order of magnitude faster than the other hashes. This may change in the near future with the addition of SHA instructions.
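For contrast, the traditional HMAC construction named above can be computed with Python's standard library. This is a sketch only; the session key and message bytes are placeholders.

```python
import hashlib, hmac

session_key = b"\x00" * 32           # placeholder 32 byte session key
message = b"_X.HB example payload"   # placeholder message bytes

# Traditional HMAC: H( (key^opad) + H( (key^ipad) + message ) ),
# here instantiated with MD5 as named in the text above.
digest = hmac.new(session_key, message, hashlib.md5).digest()
```

The AES based construction in the text trades this portability for speed; it is not reproduced here since it requires AES primitives outside the standard library.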

The header of every message contains these 5 fields which identify the source of the message, the HMAC digest of the message, the subject, a seqno and a time stamp:

   bridge_16 [1027]   : h783olFEb9ve8K07E7PoQg
   digest_16 [1029]   : FKZxGPHiC7e5GXVKh2PWLg
   sub_s2 [1792]      : "_I.xq6vl+2HcoDxtt+7lC7dGQ.ping"
   seqno_2 [273]      : 4
   stamp_8 [838]      : 1663967313973571299

This header ensures that a message never contains the same bits and is always unique. It also allows receivers to check that a replay has not occurred by tracking the sequences and time stamps for the subjects seen previously. If the subject has never been seen before, then the time stamp is used to check that the message is no older than the last network convergence time stamp, described more thoroughly in Message Loss. The bridge_16 identifies the source of the message and the digest_16 is computed with the source's session key.
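The receiver-side replay check described above can be sketched as follows. This is a minimal sketch under stated assumptions: accept, last_seen, and CONVERGE_TIME are hypothetical names, and the convergence rule is simplified.

```python
# Hypothetical receiver-side replay check: track the last seqno and time
# stamp seen per (bridge, subject); reject anything that does not advance.
last_seen: dict = {}
CONVERGE_TIME = 1663967250404676993  # last network convergence time stamp

def accept(bridge: str, subject: str, seqno: int, stamp: int) -> bool:
    key = (bridge, subject)
    if key not in last_seen:
        # First message on this subject: must be newer than convergence.
        if stamp < CONVERGE_TIME:
            return False
    else:
        prev_seqno, prev_stamp = last_seen[key]
        if seqno <= prev_seqno or stamp < prev_stamp:
            return False  # duplicate or replayed message
    last_seen[key] = (seqno, stamp)
    return True
```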

Subjects

Wildcarding Subscriptions

The subject schema used by the external bridged transports may introduce some incompatibilities when routing from one to another. Subscriptions and patterns are separate operators internally. A subscription using wildcard characters is allowed, but is not interpreted differently than any other subject. A pattern subscription includes a field which causes the pattern to be evaluated with different syntax rules, Redis GLOB or NATS/RV. A publish is not interpreted as a wildcard, even when it contains wildcard syntax. Any string of bytes can be subscribed or published, but wildcarding follows the syntax of the pattern type and uses a different subscription operator internally, as Redis does (sub, unsub, psub, punsub).

The _INBOX subject

There is a special rule for subjects that begin with the prefix _INBOX.: they are interpreted as point to point messages. This subject format finds the peers which are subscribers, typically just one, and sends the message point to point to each one. The subject and message are put into an envelope addressed to each peer. The peers that forward this message along the path to the recipient recognize that it uses a different forwarding rule than normal multicast subjects. For example, the point to point forwarding rules will use a UDP inbox protocol when OpenPGM is deployed. The point to point rule will still forward to all subscriptions of an inbox subject, but it is optimized for the case that there is only one subscription.

RV subject rules

  1. Subject segments are separated by . and a subject cannot start with a period, end with a period, or have two periods appear without characters in between.

  2. A wildcard can substitute a segment with a * character, or match all remaining segments with a trailing >.

  3. A publish to a wildcard causes it to match the subjects subscribed. This is not supported by Rai MS since the bloom filters are not indexed by segments. Instead, Rai MS will route the wildcard publish as a normal subject.

  4. An _INBOX. prefix implies a point to point publish which translates to an anycast Rai MS publish.

NATS subject rules

  1. Same subject segmentation as RV.

  2. Same wildcarding as RV.

  3. It is not possible to publish to a wildcard.

  4. No inbox point to point messaging.

  5. A queue group publish translates to a Rai MS anycast publish.

Redis subject rules

  1. There are no limitations for the characters used in a subject.

  2. A wildcard is subscribed using a psub operator, so the characters are interpreted using wildcard rules. A * character matches zero or more characters, a ? matches exactly one character, and [ and ] match any of the characters enclosed. A \ character escapes the wildcard characters. It is similar to a shell glob wildcard.

  3. A publish to a wildcard is the same as publishing to a subject.

  4. No inbox point to point messaging and does not have syntax for request/reply semantics.
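The glob rules in item 2 above resemble shell globbing, so Python's fnmatch module approximates the same semantics. This is an approximation only, not byte-exact Redis matching.

```python
from fnmatch import fnmatchcase

# Approximate Redis GLOB semantics with shell-style fnmatch:
# '*' matches any run of characters, '?' matches exactly one character,
# and '[xy]' matches either enclosed character.
assert fnmatchcase("hello", "h?llo")
assert fnmatchcase("hxllo", "h[xy]llo")
assert fnmatchcase("heeello", "h*llo")
assert not fnmatchcase("hllo", "h?llo")  # '?' needs exactly one character
```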

Networking

Description of Transports

A Rai MS transport's function is to join all of the peers connected through a node together into one virtual overlay network that provides basic pub/sub multicast.

A transport has two primary roles: routing messages between peers, and managing protocol dependent subscriptions and message framing. The internal transports (PGM, TCP, mesh) all use the internal protocol semantics for messaging. The external bridged transports (RV, NATS, Redis) have protocols with similarities, but they have unique behaviors that make them more complicated than the internal transports.

The design of the internal transports allows them to be used by any of the external transports, so RV can use a TCP mesh or PGM multicast or some combination of them interconnected. Similarly, NATS and Redis can use PGM multicast as well as a TCP mesh. The routing of messages between peers is agnostic to the type of protocol that the endpoint clients are using. It is possible to use the Rai MS protocol directly as well: the ms_server console can publish and subscribe without using an external client. The Telnet transport uses the console protocol. The Web transport serves builtin html pages that interface with the console protocol through the websocket protocol.

There are two sides to transport configuration, the listener and the connector. Only the internal transports (PGM, mesh, TCP) support the connecting side; the client side transports (RV, NATS, Redis, Telnet, Web) use only listeners and do not have a cost. The device option will auto-discover a connector or listener via multicast through a device. This requires that the connector and listener are on the same broadcast domain or have multicast routing configured.

The config file format is JSON or YAML with a record that can have these fields:

  tport: <name>
  type: <pgm | mesh | tcp | rv | nats | redis | telnet | web | name>
  route:
    listen: <address>
    connect: <address>
    device: <address>
    port: <number>
    cost: <number>
    <parm>: <value>

The name identifies the transport so that it can be referenced for starting and stopping in the console and on the command line. It is also used by auto discovery to match transports, and it is sent to other peers so that it can be read in log files and diagnostic output. It has no protocol implications beyond auto discovery; a misspelling won't cause it to stop working.

Services and Networks

The endpoint protocols (RV, NATS, Redis) all have a service defined to separate the data flows from one another. Using the same service name allows these endpoints to share the same namespace. The underlay network that connects the namespaces can be configured using the YAML files or the console, and can also be specified by the connecting clients. The clients can specify a network with PGM multicast or with TCP endpoints and meshes. All networks specified by a client that use TCP will still use multicast to resolve the endpoints by service name, using the name protocol.

Networks use a device name and a protocol or a multicast address. When a network is not specified by a client or configuration, then the links between services have to be configured by the YAML files and/or in the console.

Example networks and how they are interpreted. All of these have a service name associated with the network, which must match for the namespaces to communicate.

  • eth0;239.1.2.3 — Connect a PGM protocol to eth0 joining the multicast address of 239.1.2.3 for communicating with other peers.

  • eth0;tcp.listen — Connect a name protocol to the eth0 interface and advertise a TCP listen endpoint.

  • eth0;tcp.connect — Connect a name protocol to eth0, and advertise a TCP connection endpoint. These resolve to a connection when listen endpoints, advertised by clients using tcp.listen above, appear.

  • eth0;mesh — Connect a name protocol to eth0, and advertise a TCP mesh endpoint. This creates connections to all other mesh endpoints advertised.

  • eth0;any — Connect a name protocol to eth0, and connect to any listen or mesh endpoint advertised.

The device name eth0 can be substituted with an IPv4 address, like 192.168.1.0;tcp.listen, or a hostname that resolves to an IPv4 address. If a network is specified without a name, like ;tcp.listen, then the machine’s hostname is used to find the device.

The configurations for the PGM, name, and TCP protocols are generated as needed by the client if they do not exist. When a service is already configured, then it is used instead and the network parameters are ignored.

Cost

All of the internal transports have a cost assigned to their links. The routing from peer to peer uses this cost to find a path that minimizes the total cost. Equal cost links are utilized by each peer by encoding a path into the message header. This path is enumerated from 0 → 3, so there is a maximum of 4 equal cost paths possible between any 2 peers in the network. The per path cost can be configured by using different cost metrics for each link. The default cost is 1000, so that a configured cost can be less or greater than 1000. These configured metrics are replicated throughout the network so that every peer agrees on the cost of every path that exists. Lowering the cost is useful when some of the links have higher performance than others, as is the case when all peers exist within a host or within a data center. Configuring a different cost for each of the 4 paths is useful to load balance multiple links with equal performance.
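Spreading the subject space across the equal cost paths can be sketched as below. The actual subject hash is internal to Rai MS; crc32 is only a stand-in for any stable hash.

```python
import zlib

NUM_PATHS = 4  # paths are enumerated 0 -> 3

def select_path(subject: str) -> int:
    # Stand-in for the internal subject hash: any stable hash spreads
    # the subject space evenly over the equal cost paths.
    return zlib.crc32(subject.encode()) % NUM_PATHS

p = select_path("EXAMPLE.SUBJECT")           # hypothetical subject
assert 0 <= p < NUM_PATHS
assert p == select_path("EXAMPLE.SUBJECT")   # deterministic per subject
```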

Example of configuring a lower cost mesh on a bridge:

  tport: rv_7500
  type: mesh
  route:
    device: docker0
    cost: 10

If every container within this host has a RV client that connects with a network and service of -network eth0;mesh -service 7500, then the cost of 10 is discovered through the docker0 bridge. The name protocols used will use the name of the device as their tport name.

Example of configuring a load balanced cost for links through a data center:

  transports:
    - tport: a_mesh
      type: mesh
      route:
        listen: *
        connect: [ host, host2, host3, host4 ]
        port: 5000
        cost: [ 100, 1000, 1000, 1000 ]
    - tport: b_mesh
      type: mesh
      route:
        listen: *
        connect: [ host, host2, host3, host4 ]
        port: 5001
        cost: [ 1000, 100, 1000, 1000 ]
    - tport: c_mesh
      type: mesh
      route:
        listen: *
        connect: [ host, host2, host3, host4 ]
        port: 5002
        cost: [ 1000, 1000, 100, 1000 ]
    - tport: d_mesh
      type: mesh
      route:
        listen: *
        connect: [ host, host2, host3, host4 ]
        port: 5003
        cost: [ 1000, 1000, 1000, 100 ]

This creates 4 equal mesh networks, each of which is preferred for part of the subject space. The connect and cost values can be enumerated as connect, connect2, connect3, connect4 and cost, cost2, cost3, cost4, as well as arrays.

TCP Encryption

The TCP type and mesh type links are encrypted using AES 128 bit in counter mode. The protocol above the link layer handles the authentication for trusting the peer and the messages that are received, described in Authentication. The encryption is set up by an ECDH exchange. Each side generates an ECDH keypair and sends the public key with a checksum and a 128 bit nonce value. Each side computes the secret key and uses the KDF to mix the secret with the nonce values to arrive at a 128 bit key and a 128 bit counter for sending and receiving. These are used to encrypt and decrypt the other side's bytes.

alice -> bob [ 8 bytes checksum ] [ 32 bytes pub key ] [ 16 bytes nonce ]
bob -> alice [ 8 bytes checksum ] [ 32 bytes pub key ] [ 16 bytes nonce ]
alice.secret = ECDH( bob public key, alice private key )
bob.secret = ECDH( alice public key, bob private key )
alice.recv key+counter = KDF( secret[32] + bob.nonce[16] ) -> 64 bytes
alice.send key+counter = KDF( secret[32] + alice.nonce[16] ) -> 64 bytes
bob.recv key+counter = KDF( secret[32] + alice.nonce[16] ) -> 64 bytes
bob.send key+counter = KDF( secret[32] + bob.nonce[16] ) -> 64 bytes

The 32 byte secret will be the same on both ends. The nonce is a random 16 byte value. The KDF function mixes into the keys a preshared salt value, generated by ms_key_gen in a "config/.salt" file described in Configuration. Without this salt value, the key exchange will compute incorrect keys even though the secret is computed correctly.

The 8 byte checksum is a CRC of the pub key and the nonce, stored in big endian, so the first 4 bytes will be zero. The zero bytes cause an encrypted connection to an unencrypted endpoint to fail.
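The framing property above can be sketched as follows, assuming a 32 bit CRC (zlib.crc32 stands in for the actual CRC) stored in the 8 byte big endian checksum field.

```python
import struct, zlib

pub_key = b"\x11" * 32   # placeholder ECDH public key
nonce = b"\x22" * 16     # placeholder 128 bit nonce

# A 32 bit CRC stored in an 8 byte big endian field leaves the first
# 4 bytes zero; an unencrypted endpoint would not produce this pattern.
checksum = struct.pack(">Q", zlib.crc32(pub_key + nonce))
assert checksum[:4] == b"\x00\x00\x00\x00"

hello = checksum + pub_key + nonce  # 8 + 32 + 16 bytes on the wire
assert len(hello) == 56
```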

The 64 byte result of the KDF computation is folded with XOR to arrive at the 16 byte AES key and the 16 byte counter value.
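The fold can be sketched as below, under the assumption (not confirmed by the text) that folding means XOR-ing the two 32 byte halves of the KDF output and splitting the result into the key and counter.

```python
def fold_kdf(out64: bytes):
    # Assumption: "folded with XOR" means XOR the two 32 byte halves of
    # the 64 byte KDF output, then split into AES key and counter.
    assert len(out64) == 64
    folded = bytes(a ^ b for a, b in zip(out64[:32], out64[32:]))
    return folded[:16], folded[16:]

key, counter = fold_kdf(bytes(range(64)))
assert len(key) == 16 and len(counter) == 16
```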

Open PGM

PGM is a multicast protocol, which layers reliability on the native UDP multicast. The parameters for it declare the amount of memory used for buffering data and control the timers when retransmitting is necessary.

The type of PGM used is UDP encapsulated using the port specified. The address specification has a network, a send address, and multiple receive addresses, formatted as network;recv1,..;send, so this is a valid address: 192.168.1.0;224.4.4.4,225.5.5.5;226.6.6.6 where the send address is the last part and the middle addresses are where packets are received. If the network part is unspecified, then the hostname is used to find the interface. If there is only one multicast address, then that is used for both sending and receiving.
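The address format above can be parsed as sketched here. parse_pgm_address is a hypothetical helper; the fallback rules follow the text.

```python
import socket

def parse_pgm_address(spec: str):
    # Format: "network;recv1,recv2,...;send". Per the text: a missing
    # network falls back to the hostname, and a single multicast
    # address is used for both sending and receiving.
    parts = spec.split(";")
    network = parts[0] or socket.gethostname()
    mcast = parts[1:]
    recv = mcast[0].split(",")
    send = mcast[-1] if len(mcast) > 1 else recv[0]
    return network, recv, send

net, recv, send = parse_pgm_address("192.168.1.0;224.4.4.4,225.5.5.5;226.6.6.6")
assert net == "192.168.1.0" and send == "226.6.6.6"
assert recv == ["224.4.4.4", "225.5.5.5"]
```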

Example tport_mypgm.yaml:

  tport: mypgm
  type: pgm
  route:
    listen: 192.168.1.0;224.4.4.4
    port: 4444
    cost: 100

  Field        Default        Description
  listen       ;239.192.0.1   Multicast address
  connect      ;239.192.0.1   Multicast address
  port         9000           UDP port
  cost         1000           Cost of PGM network
  mtu          16384          Maximum UDP packet size
  txw_sqns     4096           Send window size
  rxw_sqns     4096           Receive window size
  txw_secs     15             Send window in seconds
  mcast_loop   2              Loop through the host

The transmit and receive window sizes expand to the reliability time or the txw_secs parameter. When the txw_secs is not set, then the reliability passed on the command line or as a configuration parameter is used. The receive window memory is not used until there is packet loss and a retransmission occurs. Unrecoverable packet loss occurs when the transmission window no longer has the sequences that are lost. The mcast_loop, when set to 2, allows two peers to share the same network on the same host. This causes packets to loop back through the interface and allows multiple PGM networks to coexist on the same multicast group.

In addition to the multicast networking, an inbox protocol is used for point to point messages. The network specified in the multicast address is used as the inbox network, with a random port.

The listen and connect addresses act similarly, two peers using different methods will communicate if the multicast send address matches one of the receive addresses and the inboxes are connected.

TCP Mesh

A TCP mesh is a group of peers which automatically maintain connections with every other peer. When a new peer joins the mesh, it opens a connection with all the other peers which are currently members of the mesh.

The timeout parameter causes the connecting peer to retry for this amount of time. When the timeout expires, the transport will not try to connect until told to do so again.

Multiple connect addresses are normally specified so that some connection likely succeeds if that network is running. All peers can specify multiple connect addresses since they use both listen and connect methods to join a network. After one connection succeeds, all the other connections in progress are stopped, the list of mesh members is downloaded from the peers, and those members are connected.

Example tport_mymesh.yaml:

  tport: mymesh
  type: mesh
  route:
    listen: *
    connect: [ host, host2, host3, host4 ]
    port: 9000
    timeout: 0
    noencrypt: true

  Field       Default     Description
  listen      *           Passive listener
  connect     localhost   Active joiner
  device                  Use peer discovery
  port        random      Listener or connect port
  timeout     15          Active connect timeout
  cost        1000        Cost of mesh links
  noencrypt   false       Disable encryption

If the mesh is a stable network, then setting the timeout to a larger value or zero can prevent a network split where some parts of the network are isolated for a period of time. A host that is restarted is not affected as much by a timeout since it is rejoining an existing network. If a timeout expires, then an admin request to rejoin the network is possible through the console.

When a device parameter is used, then multicast is used through the name protocol to discover peers that are joining the same mesh, matched using the tport name. After discovering a peer, a TCP connection is used to join the mesh. The port can be random with a device, since the address is discovered rather than connected. Both the device and connect methods can be used.

The noencrypt parameter set to true disables TCP link encryption. Both the listener and connector must match this setting, otherwise they will close the connection after receiving the first bytes sent.

TCP Point-to-point

A TCP point to point transport connects directly to another peer. This is useful to create ad-hoc topologies at the network boundaries.

Example tport_mytcp.yaml:

  tport: mytcp
  type: tcp
  route:
    listen: eth0
    connect: host
    port: 9001
    timeout: 0

  Field       Default     Description
  listen      *           Passive listener
  connect     localhost   Active joiner
  device                  Use peer discovery
  port        random      Listener or connect port
  timeout     15          Active connect timeout
  cost        1000        Cost of the TCP link
  edge        false       A peer at the edge
  noencrypt   false       Disable encryption

A TCP protocol is either a listener or a connector, the appropriate config is used at run time when a connect or listen is used to activate the port. When device is used to discover the peers through the multicast name protocol, the listeners are matched with the connectors. When more than one listener is discovered by a connector, then connections are made to each one.

Whether a configuration is used to connect or listen is specified by a listen or connect command or configuration. If multiple connections are specified by the connect parameter, then the first connection that is successful will cause the others to stop trying to connect.

The edge parameter set to true causes the passive peer to pool the connections on a single transport, similar to a multicast transport where the traffic is primarily through a gateway peer. The noencrypt parameter set to true disables TCP link encryption. Both the listener and connector must match this setting, otherwise they will close the connection after receiving the first bytes sent.

If the listen or connect parameters specify a port, as in "localhost:8000", then that port overrides the configured port parameter. A device name is resolved before the hostname DNS resolver is tried, so "eth0:8000" will resolve to the address configured on the eth0 device.
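The port override rule can be sketched as below. resolve_endpoint is a hypothetical helper, and device name resolution is omitted.

```python
def resolve_endpoint(spec: str, default_port: int):
    # "host:port" overrides the configured port; a bare host keeps it.
    host, sep, port = spec.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return spec, default_port

assert resolve_endpoint("localhost:8000", 9001) == ("localhost", 8000)
assert resolve_endpoint("eth0", 9001) == ("eth0", 9001)
```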

Tib RV

The RV protocol supports both the RV5 and RV6+ styles of clients. The RV6+ clients use the daemon for the inbox endpoint and don’t create sessions, the RV5 clients use a unique session for each connection and allow an inbox reply in the subscription start. These differences cause decades old software incompatibilities and pressure to re-engineer legacy messaging systems.

These clients usually specify the network and service they want to connect to, which is different from the other clients. When a client requests to connect to a multicast network, the Rai MS ms_server will start a PGM transport for it, unless an existing transport is already defined named with a rv_ prefix and a service numbered suffix.

When the rv_7500 transport exists as a TCP mesh, then the network is remapped to that predefined transport when a RV client uses the service 7500, and the multicast network specified by the client is ignored. When no multicast network is specified, then no Rai MS transport is created and the existing transports are used.

Example tport_myrv.yaml:

  tport: myrv
  type: rv
  route:
    listen: *
    port: 7500

  Field                Default   Description
  listen               *         Passive listener
  port                 random    Listener port
  use_service_prefix   true      Use a service namespace
  no_permanent         false     Exit if no connections
  no_mcast             false     Ignore multicast networking
  no_fakeip            false     Use IPv4 address for session

Unless use_service_prefix is false, the traffic is segregated to the _rv_7500 namespace, where 7500 is the service. If it is false, then all services that also have use_service_prefix set to false will share the same namespace. Unless no_fakeip is set to true, the session and inbox values are random and not based on the IPv4 address of the host. This allows RV networks to work without a routable IPv4 network across the private address spaces that are common with NATs, VMs, and/or container networks.

NATS

NATS is a pub/sub system that is similar to RV with respect to subject schema, with some extensions for queue groups and optionally persistent message streaming. The protocol support does not include the streaming components, only the pub/sub and queue groups. NATS does not have an inbox point-to-point publish scheme; it relies on the client to create a unique subject for this functionality.

Example tport_mynats.yaml:

  tport: mynats
  type: nats
  route:
    listen: *
    port: 4222

  Field     Default   Description
  listen    *         Passive listener
  port      random    Listener port
  service   _nats     Service namespace
  network   none      Join a network

If the network is specified, then starting the NATS service will also join the network. A network format is as described in Services and Networks.

Redis

Redis has a pub/sub component with slightly different semantics, without a reply subject for request/reply. It also uses the term channel to refer to a subscription. A pattern subscription uses a separate psub operator, allowing subscriptions and publishes to any series of bytes.

Example tport_myredis.yaml:

  tport: myredis
  type: redis
  route:
    listen: *
    port: 6379

  Field     Default   Description
  listen    *         Passive listener
  port      random    Listener port
  service   _redis    Service namespace
  network   none      Join a network

The data operators work on cached structures like lists, sets, etc. These commands are available when a shared memory key value segment is created and passed as a command line argument to the server (example: -m sysv:raikv.shm), or defined as a value in the config files (example: map: "sysv:raikv.shm").

If the network is specified, then starting the Redis service will also join the network. A network format is as described in Services and Networks.

Telnet

Telnet is a way to get a console prompt, but it doesn't start by default. It uses the same transport config as the pub/sub protocols. It can also be used by network configuration tools to install a configuration remotely. A telnet client signals the service that it has terminal capabilities, which enables command line editing.

Example tport_mytelnet.yaml:

  tport: mytelnet
  type: telnet
  route:
    listen: *
    port: 22

  Field    Default   Description
  listen   *         Passive listener
  port     random    Listener port

Web

The Web transport handles http requests and websocket endpoints and integrates a web application that can be used to graph activity and show internal tables. The web application is compiled into the server, so no external file access is necessary.

Example tport_myweb.yaml:

  tport: myweb
  type: web
  route:
    listen: *
    port: 80
    http_dir: "./"
    http_username: myuser
    http_password: mypassword

  Field           Default   Description
  listen          *         Passive listener
  port            random    Listener port
  http_dir        none      Serve files from this directory
  http_username   none      Adds username to digest auth
  http_password   none      Sets password for username
  http_realm      none      Sets realm for username
  htdigest        none      Load digest file for auth

If http_dir is not set, then this service does not access the filesystem for processing http get requests. It has a set of html pages compiled into the binary that it uses for viewing the server state.

If http_dir is set, then the files located in the directory will override the internal files. The html files and websocket requests also have a templating system which substitutes values from the server. If @(show ports) appears in a html page, it is replaced with an html <table> of ports. If the template "res" : @{show ports} is sent using a websocket, it expands to a JSON array of ports and the reply is "res" : [ports...].
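The @(command) substitution can be sketched with a regular expression. render and run_command are hypothetical stand-ins for the template engine and the console interface.

```python
import re

def render(page: str, run_command) -> str:
    # Replace each @(command) template with the command's html output;
    # run_command is a hypothetical stand-in for the console interface.
    return re.sub(r"@\(([^)]*)\)", lambda m: run_command(m.group(1)), page)

html = render("<h1>Ports</h1>@(show ports)",
              lambda cmd: "<table>" + cmd + "</table>")
assert html == "<h1>Ports</h1><table>show ports</table>"
```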

Any of the commands from the console interface are also exposed on the http endpoint. Requesting "show ports" will respond with a JSON array of transports with the current totals of messages and bytes:

$ wget --http-user=myuser --http-password=mypassword -q -O - "http://localhost:80/?show ports"
[{"tport":"rv.0", "type":"rv", "cost":1000, "fd":13, "bs":"", "br":"", "ms":"", "mr":"", "lat":"", "fl":"SLI", "address":"rv://127.0.0.1:7500"},
{"tport":"mesh4.1", "type":"mesh", "cost":1000, "fd":16, "bs":"", "br":"", "ms":"", "mr":"", "lat":"", "fl":"SLX", "address":"mesh://10.4.4.18:19500"},
{"tport":"primary.2", "type":"tcp", "cost":1000, "fd":18, "bs":29500, "br":47324, "ms":229, "mr":355, "lat":"26.5ms", "fl":"C", "address":"robotron.1@tcp://209.237.252.104:18500"},
{"tport":"secondary.3", "type":"tcp", "cost":1000, "fd":20, "bs":23276, "br":39134, "ms":181, "mr":311, "lat":"29.4ms", "fl":"C", "address":"edo.2@tcp://209.237.252.98:18500"}]

The websocket endpoint can also be used to subscribe subjects. When a message is published to the websocket, the format used is:

"subject" : { "field" : "value" }

This requires that the messages published can be converted to JSON or is already in JSON format.

The http_username / http_password or htdigest settings cause http digest authentication to be used and required for access. The wget example above uses the example configuration.

A htdigest file contains a list of users and can be created by the htdigest program distributed with the Apache packages.

$ htdigest -c .htdigest realm@raims myuser
Adding password for myuser in realm realm@raims.
New password: mypassword
Re-type new password: mypassword

$ cat .htdigest
myuser:realm@raims:56f52efe43dcf419e991ea6452ae6f06

Then tport_myweb.yaml is configured like this:

  tport: myweb
  type: web
  route:
    listen: *
    port: 80
    htdigest: ./.htdigest

Only one realm can be used by the service. If http_realm is configured then that realm is used, otherwise the first realm in the htdigest file is used. If no realm is specified but a user and password are specified, then "realm@raims" is used.

Link State

The Forwarding Set

Each node in a network must construct a forwarding set for any message sent by any peer. A forwarding set instructs the node where to send a message so that all subscribers of it will see the message exactly one time, when the network is converged and stable.

A "converged network" is one where all peers agree that a link exists. If peer A has in it’s database a link to peer B, then peer B must also have a link to peer A. If a link is missing, then the network tries to resolve the difference by asking the peers with the discrepancy which is correct.

Every peer has a bloom filter that contains all of the subscriptions currently active. The links database tells each peer how the network can be traversed for full coverage and the bloom filter prunes the coverage by dropping the message when there are no subscriptions active that match the subject on the other side of the link.

A simple redundant network is a circle:

dyna  --  ruby

  |        |

bond  --  chex

If the cost of each of the links is set to the default 1000, then the forwarding set for dyna is the link to ruby and bond. When ruby and bond receive a message from dyna, only one of them will forward the message to chex. The path cost from dyna → ruby → chex is equal to the path cost from dyna → bond → chex. The forwarding algorithm tracks the equal cost paths and ranks them in order of peer age. In the case that ruby is older than bond, the ranking of these routes would be 1. dyna → ruby → chex and 2. dyna → bond → chex. The top 4 ranked routes are saved as the forwarding sets, and selected by the hash of the message subject. In this case, half of the subjects subscribed by chex and published from dyna would take the first path and the other half would take the second path.

The method of ranking the paths by peer age is used because the stability of the network is less affected when more transient peers are added and subtracted from the link state database.
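The ranking and selection described above can be sketched as follows. The tuples and the crc32 subject hash are illustrative stand-ins, not the real structures or hash function:

```python
# Sketch: equal cost paths ranked oldest peer first, top 4 kept, and the
# subject hash selects one of them. Data and hash are illustrative.
import zlib

paths = [                     # (next_hop, total_cost, peer_start_time)
    ("bond", 2000, 250),      # dyna -> bond -> chex
    ("ruby", 2000, 100),      # dyna -> ruby -> chex, ruby started earlier
]
ranked = sorted(paths, key=lambda p: p[2])[:4]   # rank by peer age

def next_hop(subject: str) -> str:
    idx = zlib.crc32(subject.encode()) % len(ranked)
    return ranked[idx][0]
```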

Message Loss

Under normal conditions, the sequence of a message is one greater than the last sequence received. Sequence numbers are 64 bits, so they will never be zero. The following conditions are possible when a sequence is not one greater than the last message received:

  • Publisher includes a time stamp

This causes the subscriber to synchronize the sequence numbers. The publisher will always include a time stamp when the first message of a subject is published, or when the last sequence is old enough to be cycled from the publisher sequence window.

  • The first message received

When a subscription start occurs it will usually not contain a time stamp, unless it is the first message published.

  • The message sequence is repeated

A sequence is less than or equal to the last sequence received. This indicates the message was already processed. The message is dropped.

  • The message sequence skips ahead

Some sequences are missing, indicating messages were lost. Notification of message loss is propagated to the subscriptions.

  • The message subject is not subscribed

The subscription may have dropped and the publisher has not yet seen the unsubscribe.

Multicast sequence numbers

The sequence numbers include a time frame when the publisher starts the message stream. This is the computation that creates a new sequence stream.

nanosecond time stamp = 1659131646 * 1000000000 = 0x17066b710b706c00

 1               8               16              24
|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-
|0 0 0 1 0 1 1 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1
|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-
 32              40              48              56              64
 -+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|
 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0|
 -+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|

message sequence number = ( nano time >> 33 << 35 ) + 1 = 0x5c19adc000000001

 1               8               16              24
|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-
|0 1 0 1 1 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 0 0 0
|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-
 32              40              48              56              64
 -+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1|
 -+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|

This truncates the nanosecond time stamp to approximately 10 second intervals, so a new time frame can only occur after 10 seconds. The time frame, stored in the upper 29 bits, will be valid until the year 2115. The sequence resolution within a time frame is 35 bits, or 34 billion sequences. These are rotated to new time frames when the sequence number is zero.
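The computation above can be worked through directly; truncating by 33 bits gives time frames of 2^33 nanoseconds (about 8.6 seconds), and the left shift by 35 bits leaves 35 bits of sequence space within each frame:

```python
# Worked example of the sequence computation shown above.
nano = 1659131646 * 1000000000       # nanosecond time stamp
assert nano == 0x17066b710b706c00
seq = ((nano >> 33) << 35) + 1       # first sequence of a new time frame
assert seq == 0x5c19adc000000001
```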

These are properties of the time frame encoded in the message sequence numbers:

  1. A start of a new multicast stream sequence will use the current time; this is always after the last convergence time stamp. The current time is also used as needed when memory limitations prevent caching of the last sequence published. When the sequence is cached, the additional messages won’t change the time frame but will increment the sequence number.

  2. A new subscription start or uncached sequence publish can verify that the first message received is greater than the network convergence time. This is used to validate that the message stream is uninterrupted to the start of the time frame, since message loss has not occurred since before network convergence.

All of the transports are stream oriented, so a loss of unrecoverable network packets will cause connections to drop and a new convergence state by pruning the lost routes. All peers will agree on a time that convergence is reached. New time frames are created for all messages published so that the time frame constructed in any one peer is greater than the convergence time in all peers.

When routes are added to or subtracted from the network, the message routing is not stable until all peers have finished adjusting their view of the network. The peer that publishes a message may use a sub-optimal forwarding path to the recipients until they are notified that better paths are available with link state exchanges.

Publish sequence window

A map of subject to sequence numbers for published multicast messages is maintained by each peer. This map rotates when a configured memory limit is reached, pub_window_size, and the window time interval is greater than a configured time, pub_window_time, which must be at least 10 seconds. When a subject is rotated out of the window, the sequence number is restarted with a new time frame.

Subscription sequence window

A map of subject to sequence numbers for the subscriptions that a peer has is also maintained. This validates that the messages are processed in order and allows notification of message loss when the sequences skip and does not allow a message to be processed twice. The memory limit for this is sub_window_size and time limit is sub_window_time. When a subject is rotated out of the window, then the publisher did not update for the window time and the next sequence is treated as if a new subscription was created.
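The per-subject decisions made with this window, following the cases listed under Message Loss, can be sketched minimally. The structures here are illustrative, not the actual implementation:

```python
# Sketch of the subscription sequence window decision: repeats are
# dropped, skips are reported as loss. Illustrative structures only.
last_seq = {}   # subject -> last sequence processed

def on_msg(subject: str, seq: int) -> str:
    last = last_seq.get(subject)
    if last is not None and seq <= last:
        return "repeat"                       # already processed, drop
    lost = 0 if last is None else seq - last - 1
    last_seq[subject] = seq
    return f"loss {lost}" if lost else "ok"

assert on_msg("DEMO", 1) == "ok"
assert on_msg("DEMO", 2) == "ok"
assert on_msg("DEMO", 2) == "repeat"
assert on_msg("DEMO", 5) == "loss 2"          # sequences 3 and 4 missing
```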

Message duplicates are avoided by discarding messages that are older than the trailing edge of the subscription sequence window. The clock skew between systems is estimated. The console command show skew will display the calculated clock skew between systems.

C.test[Jl8gk4f+gVaf60LxKtsaMg]@dyna[560]> show skew
user |   lat  |   hb   | ref |  ping  |   pong  |     time
-----+--------+--------+-----+--------+---------+-------------
 A.1 |  187us |  451us |   0 |  104us | -2.22us | 01:32:56.384
 B.2 |  304us | 1.25ms |   1 |  207us | -18.9us | 01:32:56.384
 D.3 |  174us |  690us |   0 | 77.2us | -3.73us | 01:32:56.384
 G.4 | 25.8ms |  4.5se |   1 |  4.5se |  4.49se | 01:32:51.897

The pong calculation subtracts the round trip time and is the most accurate; the others disregard the latency of the network. The hb values are from time differences of directly attached peers using heartbeats and are shared with peers that are not directly attached. The ref is the peer (0 = self, 1 = A.1) that originated the hb difference. The time is the estimated clock setting of the remote peer in the current timezone.
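An illustrative version of the pong calculation: subtracting half the round trip time approximates the one way latency, so only the offset between the clocks remains:

```python
# Sketch of a clock skew estimate from a ping/pong exchange, as in the
# "pong" column above. Times are in seconds; names are illustrative.
def pong_skew(t_send: float, t_remote: float, t_recv: float) -> float:
    rtt = t_recv - t_send
    # compare the remote clock reading to the midpoint of the exchange
    return t_remote - (t_send + rtt / 2.0)

# remote clock 5us ahead, 100us one-way latency in each direction
assert abs(pong_skew(0.0, 105e-6, 200e-6) - 5e-6) < 1e-12
```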

Configuration for sequence windows

The sizes and windows are in the parameters section of the config file and default to 4 megabytes (about 60,000 subjects for publishers and 20,000 for subscribers) and 10 seconds. The size of the windows will have an overhead of 48 bytes for publishers and 128 bytes for subscribers in addition to the subject size. The 10 second rotate timer could cause more memory to be used if lots of new subjects are published or lots of new subjects are subscribed within 10 seconds.

$ cat config/param.yaml
parameters:
  pub_window_size: 10 mb
  pub_window_time: 60 sec
  sub_window_size: 10 mb
  sub_window_time: 60 sec

Show loss

The show loss console command displays the messaging statistics for the peers.

A.test[XftVokMK+WK12CNuEaRFuA]@dyna[545]> show loss
user | repeat | rep time | not sub | not time |  msg loss |   loss time  | ibx loss | ibx time
-----+--------+----------+---------+----------+-----------+--------------+----------+---------
 B.1 |      0 |          |       0 |          |         0 |              |        0 |
 D.3 |      0 |          |       0 |          |       766 | 20:42:24.431 |        0 |
 C.4 |      0 |          |       0 |          |         0 |              |        0 |
  • repeat — count of multicast messages received more than one time

  • rep time — last time of repeated messages

  • not sub — count of multicast messages received which were not subscribed

  • not time — last time of not subscribed

  • msg loss — number of multicast messages which were lost

  • loss time — last time of multicast message loss

  • ibx loss — number of messages which were lost from the inbox stream

  • ibx time — last time of inbox message loss

An inbox message loss is not unusual since the point to point messages are often used for link state exchanges and other network convergence functions. Inbox message loss is usually not as problematic as multicast message loss since there are often timers and retries associated with their usage.

Multicast message loss is much more difficult to recover from, since there are usually many multicast streams and tracking the state of each one is a problem solved by persistent message queues. This requires clients which track the state of the messages they consume and notify the queue when they are finished with processing them.

Notification of message loss

If a message arrives with a sequence which is not in order, it is forwarded with state indicating how many messages are missing, if that can be determined. The protocol handling of this notification is to publish a message indicating how many messages were lost.

RV protocol

The RV protocol publishes a message to the _RV.ERROR.SYSTEM.DATALOSS.INBOUND.BCAST subject with a count of lost messages. These are throttled so that only one is published per second after the first one is published.

Example:

   mtype : "A"
     sub : _RV.ERROR.SYSTEM.DATALOSS.INBOUND.BCAST
    data : {
   ADV_CLASS : "ERROR"
  ADV_SOURCE : "SYSTEM"
    ADV_NAME : "DATALOSS.INBOUND.BCAST"
    ADV_DESC : "lost msgs"
        lost : 7
     sub_cnt : 7
        sub1 : "RSF.REC.PAC.NaE"
       lost1 : 1
        sub2 : "RSF.REC.MTC.NaE"
       lost2 : 1
        sub3 : "RSF.REC.MCD.NaE"
       lost3 : 1
        sub4 : "RSF.REC.MCD.N"
       lost4 : 1
        sub5 : "RSF.REC.SPM4.NaE"
       lost5 : 1
        sub6 : "RSF.REC.MER.NaE"
       lost6 : 1
        sub7 : "RSF.REC.MER.N"
       lost7 : 1
        scid : 7500
  }

Internal Protocol

The protocol is asynchronous with timers to timeout RPCs and to throttle the rate at which peers back off retries. As a result of this, the message flow for a network configuration is variable and can change with different conditions.

The function of each message is encoded in the subjects with the arguments passed as field values with some common flags and options encoded in the message header.

Each message is authenticated with a session key using a message HMAC. The initial key exchange is signed by either the service private key or a configured user private key. The heartbeat messages are also authenticated with a hello key message HMAC derived from the service public key and the start time. These are messages that set up the initial key exchange before a session key is established, but can be weakly authenticated since the service public key is encrypted at rest in the configuration and not shared over the network.

Any message that fails authentication is ignored.

Field Values

Each field in a message is encoded with a type and length. This allows new fields to be added without disrupting the message parsing. The first 16 bits encode the type, length and field id. The rest of the field encodes the value. All integers are encoded in big endian.

fid = BRIDGE(3), type = OPAQUE_16(4) ( opaque 16 bytes )            144
|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+.. +
|1 1 x x 0 1 0 0 0 0 0 0 0 0 1 1|                                     |
 ^ ^     ^.....^ ^.............^ ^....................................
 | |         |        |                        |
 | primitive type(4)  fid(3)               128 bit bridge
 fixed

The types defined are bool (size:1), unsigned int (size:2,4,8), opaque (size:16,32,64), string (max size:64k), long opaque (max size:4G).

The first two bits, fixed and primitive, indicate whether the type has a fixed length, and whether the value is a field (primitive) or a message (not primitive). A message is another group of fields and is always encoded as a long opaque with the primitive bit set to 0. A message payload is always encoded as a long opaque with the primitive bit set to 1.
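Reading the layout off the diagram above (fixed bit, primitive bit, 4 bit type, 8 bit fid, with the unused 'x' bits assumed zero), the 16 bit field header can be sketched as:

```python
# Sketch of the 16 bit field header shown in the diagram above. The bit
# positions are read off the diagram; the 'x' bits are assumed zero.
def field_header(fixed: int, primitive: int, ftype: int, fid: int) -> int:
    return (fixed << 15) | (primitive << 14) | (ftype << 8) | fid

OPAQUE_16 = 4
BRIDGE = 3
hdr = field_header(1, 1, OPAQUE_16, BRIDGE)
assert hdr == 0xC403   # bits: 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1
```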

The types are enumerated as:

Type           | Value | Size
---------------+-------+---------------------------------
bool           |     0 | 1 byte
unsigned short |     1 | 2 bytes
unsigned int   |     2 | 4 bytes
unsigned long  |     3 | 8 bytes
opaque 16      |     4 | 16 bytes
opaque 32      |     5 | 32 bytes
opaque 64      |     6 | 64 bytes
string         |     7 | 16 bit length + up to 64K bytes
long opaque    |     8 | 32 bit length + up to 4G bytes

The field values are aligned on 2 byte boundaries, so the value is padded one byte when the value size is odd. There are currently 76 different field ids (fid) and a maximum of 256 (defined in the header file msg.h).

Message Framing

A message frame has 5 fixed length sections and 3 fields that are always present, each with a two byte type prefix.

These header fields are:

Field          | Size
---------------+-----------------------------------------
Version        | 1 bit
Message Type   | 2 bits
Message Option | 5 bits
Message Size   | 3 bytes
Subject Hash   | 4 bytes
Bridge         | 2 byte type + 16 bytes
Message Digest | 2 byte type + 16 bytes
Subject        | 2 byte type + 16 bit length + up to 64K

The first 4 bytes are encoded as:

bytes 0 -> 3 are ver(1), type(2), opt(5), message size (24)
 1               8               16              24              32
|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|-+-+-+-+-+-+-+-|
|1|0 0|0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0|
 ^ ^.^ ^.......^ ^.............................................^
 |    \    |                         |
ver(1)|   opt(0)                24 bit size(160)
     type(0)
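Packing the first 32 bits exactly as diagrammed (ver 1 bit, type 2 bits, opt 5 bits, 24 bit size, big endian on the wire) can be sketched as:

```python
# Sketch: packing ver(1), type(2), opt(5), size(24) into the first 4
# bytes of a frame, big endian, following the diagram above.
import struct

def frame_word(ver: int, mtype: int, opt: int, size: int) -> bytes:
    word = (ver << 31) | (mtype << 29) | (opt << 24) | size
    return struct.pack(">I", word)

assert frame_word(1, 0, 0, 160) == bytes([0x80, 0x00, 0x00, 0xA0])
```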

The Message Type encodes 4 classes of messages:

Type         | Value | Description
-------------+-------+------------------------------------------
Mcast        |     0 | Multicast message with routeable payload
Inbox        |     1 | Point to point message
Router Alert |     2 | System link state or subscription update
Heartbeat    |     3 | Neighbor link keep alive

A message that has routeable data always has the Multicast or Inbox type set. The Inbox type message is also used for RPC style communication between peers. The Router Alert type message alters the routing database by modifying the link state or the subscription state. A Heartbeat type is a periodic presence update. The peers which are directly connected are responsible for detecting link failures.

The Option Flags field is a bit mask that encodes options for messages with Multicast and Inbox types that are routing data payloads to endpoints. These are:

Option | Value | Description
-------+-------+-----------------------------------------------------------
Ack    |     1 | Endpoints ack the reception
Trace  |     2 | All peers along the route ack the reception
Any    |     4 | Message is an anycast, destination is one endpoint of many
MC0    |     0 | Message is using multicast path 0
MC1    |     8 | Message is using multicast path 1
MC2    |    16 | Message is using multicast path 2
MC3    |    24 | Message is using multicast path 3

The message size does not include the first 8 bytes, so the message frame size is 8 + the message size field. If the size is greater than 24 bits, then the next 32 bits are used to encode the size and the subject hash is calculated from the subject.

The Bridge, Message Digest and Subject are encoded in Type Length Value format. The Bridge is a 128 bit identity of the sender. The Message Digest is the authentication field. The receiving peer will authenticate that the message is valid by using the Bridge to look up the 512 bit session key of the sender and calculate an HMAC using the message data with the session key and compare it to the value contained in the Message Digest. In addition, there are sequence numbers and time stamps present that prevent the replay of each message frame.
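The verification step above can be sketched conceptually. The session key lookup and HMAC comparison follow the text; the hash primitive (sha512, truncated to 16 bytes here) is an assumption for illustration, not the documented choice:

```python
# Conceptual sketch: look up the sender's session key by Bridge id and
# compare an HMAC of the message data to the Message Digest field.
# The sha512/16-byte truncation is illustrative, not the real primitive.
import hmac, hashlib

session_keys = {}   # bridge id (16 bytes) -> 64 byte session key

def authenticate(bridge: bytes, msg_data: bytes, digest: bytes) -> bool:
    key = session_keys.get(bridge)
    if key is None:
        return False            # unknown sender, message ignored
    mac = hmac.new(key, msg_data, hashlib.sha512).digest()[:16]
    return hmac.compare_digest(mac, digest)
```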

The 4 multicast path options select one of the equal cost paths calculated from the current link state. Every peer can calculate these paths using the same replicated link state database; this results in 4 forwarding trees to the same destinations if there are enough redundant links.

System Subjects

The peers exchange messages to authenticate new peers, synchronize the link state of the network, update subscriptions, and send heartbeats to maintain neighbor links. These types of messages have unique subject prefixes as well as bits in the message type header indicating that they are special.

There are 7 classes of subject prefixes used:

Prefix | Description
-------+---------------------------------------
_I.    | Inbox point to point
_M.    | Generic multicast message
_X.    | Heartbeat link presence message
_Z.    | Link state broadcast message
_S.    | Normal subscription multicast message
_P.    | Pattern subscription multicast message
_N.    | Peer statistics multicast message

A broadcast style forwarding used by _Z. subjects is different from multicast forwarding. It will flood the authenticated peers in the network, adjusting each peer’s routing database as it is received. It uses this type of forwarding because this kind of update may cause the multicast forwarding to be temporarily incomplete until the network converges again.

The forwarding path for the Inbox, Heartbeat and broadcast subjects does not follow the multicast forwarding path, so they can’t be subscribed.

There is a separate sequence number domain defined for these because of the idempotent nature of maintaining the replicated state of the network. If a peer misses messages for delta changes in the subscriptions or links database, the state is reinitialized by replicating it from an up to date peer.

The multicast subjects follow normal forwarding rules. The _M prefix is used for a multicast ping and a multicast link state sync.

The _N prefix has unique subjects for link and peer statistics like messages sent or received, bytes sent or received, as well as adjacency notifications. These are used to monitor an individual node or a group of them with pattern subscriptions. These stats are not sent unless there are subscriptions open.

Heartbeat Subjects

These are sent on a link between directly connected peers.

Subject  | Description
---------+-------------------------
_X.HELLO | First message sent
_X.HB    | Periodic message
_X.BYE   | Last message sent
_X.NAME  | Link discovery message

  • _X.HELLO and _X.HB messages have two functions: the first is to initiate the authentication key exchange, and the second is to keep a peer up to date with the last sequence numbers used by the subscription and link state. When heartbeats are not received within 1.5 intervals, the link is deactivated; with the default interval of 10 seconds, a link is deactivated at :15 when a HB was expected at :10. When all of the direct links to a peer are inactive, the peer is unauthenticated and marked as a zombie. The heartbeat timeout does not depend on a transport timeout, like a TCP reset. The result of this behavior is that overloaded or congested links that delay messages for longer than 1.5 times the heartbeat interval may incur message loss. This puts an upper bound on the link latency and alleviates back pressure to the publisher.

  • _X.BYE causes the peer to be unauthenticated and dropped from the peer db.

  • _X.NAME messages are multicast to a device for presence detection. Links between peers are only established when the type and name of a transport is matched within a service.
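The heartbeat deadline described above can be sketched as a simple deadline test (times in seconds, illustrative names):

```python
# Sketch of the deactivation rule: a link is considered down when no
# heartbeat arrives within 1.5 intervals of the last one received.
HB_INTERVAL = 10.0   # default heartbeat interval in seconds

def link_active(last_hb: float, now: float,
                interval: float = HB_INTERVAL) -> bool:
    return (now - last_hb) <= 1.5 * interval

assert link_active(0.0, 14.9)        # HB at :00, next expected at :10
assert not link_active(0.0, 15.1)    # deactivated at :15
```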

These are broadcast flooded to authenticated peers.

Subject | Description
--------+------------------------------------------
_Z.ADD  | New peer added to peer db
_Z.DEL  | Dropped peer from peer db
_Z.BLM  | Subscription bloom filter resized
_Z.ADJ  | Adjacency changed, link added or removed

  • _Z.ADD is broadcast when a new peer is added to the peer db, usually as a result of authentication, and also in the case when the network splits and peers are joined again.

  • _Z.DEL is broadcast when a peer sends a _X.BYE or when it is no longer reachable because all routes to it are down.

  • _Z.BLM is broadcast when a peer resizes the bloom filter associated with the subscriptions and patterns it has open, this occurs approximately when crossing powers of two subscription counts (currently at 31, 62, 124, 248, …​).

  • _Z.ADJ notifies when a peer adds or subtracts a link to another peer. It increments the link state sequence number so that peers apply this update only when the link state reflects the current state, otherwise an RPC synchronization request (_I.[bridge].sync_req) is used to resync.

Subscription Subjects

These are multicast to authenticated peers. They are updates to the bloom filter that can be missed and resynchronized with _Z.BLM or a resync RPC request.

Subject | Description
--------+-------------------------------
_S.JOIN | Start a subscription
_S.LEAV | Stop a subscription
_P.PSUB | Start a pattern subscription
_P.STOP | Stop a pattern subscription

  • _S.JOIN and _S.LEAV add and subtract subscriptions to a subject.

  • _P.PSUB and _P.STOP add and subtract pattern subscriptions. These contain a pattern type as well as the pattern string. The pattern types currently supported are either a RV style wildcard or a Redis glob style wildcard.

Inbox Subjects

The format of a subject with an _I. prefix also encodes the destination of the message by appending the 128 bit bridge id in base64.

Example:

_I.duBVZZwXfwBVlYgGNUZQTw.auth

All of the peers along the path to the destination use this bridge id to forward the message using the rules for the point to point route of the destination peer. This may be a TCP link or it may be a UDP Inbox link in the case of a multicast PGM transport. The suffix of the inbox subject indicates the type of request or reply it is. If the suffix is an integer, then the endpoint is not a system function, but information requested by the console session or a web interface, usually converted to text and displayed.
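Forming such a subject is straightforward: a 128 bit bridge id renders as 22 base64 characters once padding is stripped, matching the 22 character id in the example above. A sketch:

```python
# Sketch: forming an _I. subject from a 128 bit bridge id, base64 with
# padding stripped (16 bytes -> 22 characters), suffix appended.
import base64

def inbox_subject(bridge: bytes, suffix: str) -> str:
    b64 = base64.b64encode(bridge).decode().rstrip("=")
    return f"_I.{b64}.{suffix}"

assert inbox_subject(bytes(16), "auth") == "_I.AAAAAAAAAAAAAAAAAAAAAA.auth"
```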

These suffixes are currently recognized:

Suffix    | Description
----------+--------------------------------------------------------------------
auth      | Request authentication, peer verifies with user or service pub key
subs      | Request and match active subscription strings with a pattern
ping      | Request a pong reply, also has seqnos for maintaining state
pong      | A reply to a ping, has latency information and updates clock skew
rem       | Remote admin request, run a console command from another peer
add_rte   | After authenticated with peer, it will add other peers it knows
sync_req  | Peer sends when it finds an old peer db or subscription state
sync_rpy  | Response to a sync_req, includes current state if it is out of date
bloom_req | Peer requests bloom state, currently peers use adj_req instead
bloom_rpy | Response to a bloom_req, contains the bloom map of the subscriptions
adj_req   | Peer requests when it finds an old link state or subscription state
adj_rpy   | Response to an adj_req, contains up to date link state and bloom map
mesh_req  | Peer requests when it detects a missing mesh member
mesh_rpy  | Response to a mesh_req, contains missing link URLs
trace     | Response to messages which have the Trace option flag in header
ack       | Response to messages which have the Ack option flag in header
any       | Encapsulates a peer _INBOX message, for point to point routing

  • Auth does a key exchange between two peers. After completing successfully, each peer has a session key for the other. This allows messages sent by the other peer to be authenticated using the Message Digest field.

  • Subs is a request for the open subscriptions. It is used by the console and the web interface for examining the network. The RPC reply is always a numeric string to forward to the terminal or web page that requested it.

  • Ping and pong are latency gathering functions for any two peers in the network, not necessarily directly connected. The current sequence numbers for the link state and subscription state are also exchanged for synchronizing peers which are not directly connected.

  • Rem is a remote console command execution, used in the console and web interfaces.

  • Add_rte is used after the auth key exchange to replicate the peer db to a new peer. This initial peer db only contains the names and bridge ids, so the new peer must request session keys, link state and subscription state for peers it does not already know about.

  • Sync_req and sync_rpy are used to replicate the session keys. If a new peer does not have the session info from a _Z.ADD or an add_rte, it will request it from the peer that notified of the unknown peer session. This will often be the case after authentication occurs and the new peer receives an add_rte from an older peer that has a db with the current state of the network. This is the only other way that the unique session keys for each peer are distributed besides directly authenticating with a key exchange. The sync_rpy also includes the link state and subscription bloom filter of the requested peer.

  • Bloom_req and bloom_rpy are RPCs for the subscription bloom filter. The adj_req and adj_rpy are used instead for this info.

  • Adj_req and adj_rpy are the main method that peers recover the current link state and subscription state. They work in an RPC request/response style. The request contains the sequence numbers that the source peer has in its db. The destination peer compares these numbers with its own db and replies when a sequence needs updating. Usually the destination peer is the one that the source needs synchronized, but a closer peer can be queried as well. This occurs when a lot of peers need to resynchronize as a result of a network split and reconnect.

  • Mesh_req and mesh_rpy are RPCs for distributing URLs for peers in the same mesh network. When a peer connects to a mesh, it uses the initial connection to find the addresses of all the other peers in the mesh with this RPC.

  • Trace and ack are sent as a multicast message is forwarded with the Message Options set in the header. These can be requested from a console publish using the "trace" or "ack" commands.

  • Any encapsulates an _INBOX point to point message and forwards it to the correct peer. An _INBOX publish does not have a destination other than a unique subject that another peer has subscribed, for example "_INBOX.7F000001.2202C25FE975070A48320.>". The peer that encapsulates this message finds the possible destinations by testing the bloom filters it has and then forwards to the matching peers. The usual case is that there is only one matching destination.
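The forwarding decision for an encapsulated _INBOX message can be sketched as below. A plain set stands in for the real bloom filter, so unlike a bloom filter this sketch has no false positives; names and data are illustrative:

```python
# Sketch of the "any" forwarding step: test each peer's subscription
# filter for the _INBOX subject and forward to the matchers. A set
# stands in for the real bloom filter (no false positives here).
peer_subs = {
    "dyna": {"_INBOX.7F000001.2202C25FE975070A48320.>"},
    "ruby": set(),
}

def match_peers(subject: str):
    return [peer for peer, subs in peer_subs.items() if subject in subs]

assert match_peers("_INBOX.7F000001.2202C25FE975070A48320.>") == ["dyna"]
```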

Example Message Flow

Two peers key exchange, ruby connecting to dyna:

Packet | Subject                        | Source | Destination | Description
-------+--------------------------------+--------+-------------+----------------------------------------
ruby.1 | _X.HELLO                       | ruby   | dyna        | initial hello message after connection
dyna.1 | _I.xq6vl+2HcoDxtt+7lC7dGQ.auth | dyna   | ruby        | dyna authenticates with ruby
ruby.2 | _I.wwEnbQEY2FMuwZGSjpi3jQ.auth | ruby   | dyna        | ruby authenticates with dyna
ruby.2 | _Z.ADD                         | ruby   | dyna        | ruby adds dyna to peer db
ruby.2 | _Z.ADJ                         | ruby   | dyna        | ruby adds link to dyna
dyna.2 | _Z.ADJ                         | dyna   | ruby        | dyna adds link to ruby
dyna.2 | _I.xq6vl+2HcoDxtt+7lC7dGQ.auth | dyna   | ruby        | dyna confirms authentication
dyna.2 | _Z.ADD                         | dyna   | ruby        | dyna adds ruby to peer db

Ruby connecting to dyna, a member of a network of 4 nodes: dyna, zero, one, and two. This is the message flow between ruby and dyna, which completes the initial synchronization of ruby.

Packet | Subject                            | Source | Destination | Description
-------+------------------------------------+--------+-------------+----------------------------------------------
ruby.1 | _X.HELLO                           | ruby   | dyna        | initial hello message after connection
dyna.1 | _I.q6pEpnzNyANEZKKp29532Q.auth     | dyna   | ruby        | dyna authenticates with ruby
ruby.2 | _I.tXB702RHKF0M69dl7K7vrw.auth     | ruby   | dyna        | ruby authenticates with dyna
ruby.2 | _Z.ADD                             | ruby   | dyna        | ruby adds dyna to peer db
ruby.2 | _Z.ADJ                             | ruby   | dyna        | ruby adds link to dyna
ruby.2 | _I.tXB702RHKF0M69dl7K7vrw.adj_req  | ruby   | dyna        | ruby requests adjacency of dyna
dyna.2 | _Z.ADJ                             | dyna   | ruby        | dyna adds link to ruby
dyna.2 | _I.q6pEpnzNyANEZKKp29532Q.auth     | dyna   | ruby        | dyna confirms authentication
dyna.2 | _Z.ADD                             | dyna   | ruby        | dyna adds ruby to peer db
dyna.2 | _I.q6pEpnzNyANEZKKp29532Q.add_rte  | dyna   | ruby        | dyna populates ruby peer db of other peers
dyna.2 | _I.q6pEpnzNyANEZKKp29532Q.adj_rpy  | dyna   | ruby        | dyna replies to adj_req, links to other peers
ruby.3 | _I.tXB702RHKF0M69dl7K7vrw.sync_req | ruby   | dyna        | ruby requests sync of peer zero from dyna
ruby.3 | _I.tXB702RHKF0M69dl7K7vrw.sync_req | ruby   | dyna        | ruby requests sync of peer one from dyna
ruby.3 | _I.tXB702RHKF0M69dl7K7vrw.sync_req | ruby   | dyna        | ruby requests sync of peer two from dyna
dyna.3 | _I.q6pEpnzNyANEZKKp29532Q.sync_rpy | dyna   | ruby        | dyna replies key, links, bloom for peer zero
dyna.3 | _I.q6pEpnzNyANEZKKp29532Q.sync_rpy | dyna   | ruby        | dyna replies key, links, bloom for peer one
dyna.3 | _I.q6pEpnzNyANEZKKp29532Q.sync_rpy | dyna   | ruby        | dyna replies key, links, bloom for peer two

There is also message flow between dyna and zero, one, and two. This is the flow between dyna and zero; the message flow between dyna and one, and between dyna and two, is the same.

Packet | Subject                            | Source | Destination | Description
-------+------------------------------------+--------+-------------+--------------------------------------------------
dyna.1 | _Z.ADJ                             | dyna   | zero        | dyna notifies the new link from dyna to ruby
dyna.1 | _Z.ADD                             | dyna   | zero        | dyna notifies the add of ruby to peer db
dyna.1 | _Z.ADJ                             | ruby   | zero        | forward from ruby for new link from ruby to dyna
zero.1 | _I.tXB702RHKF0M69dl7K7vrw.sync_req | zero   | dyna        | zero requests sync of peer ruby from dyna
dyna.2 | _I.ia988C6TtC6/L3JC6D3GqA.sync_rpy | dyna   | zero        | dyna replies key, links, bloom for peer ruby
zero.2 | _Z.ADD                             | zero   | dyna        | zero notifies the add of ruby to peer db

Adding ruby to the network ripples through the directly connected peers, which discover the new peer from the broadcasting of the _Z.ADD messages and then synchronize with each other to merge the ruby session key, the link state, and the subscription bloom state into the network state.

rvd Compatibility

rvd Arguments

If ms_server is started in rvd compatible mode, it will automatically start an rv protocol listener on port 7500 and a web service on port 7580, unless arguments are present that modify this. The protocol used between daemons is not compatible with rvd, but it does allow rv clients to connect and communicate. In other words, the client side is compatible, but the network side is not.

These arguments are recognized:

$ ms_server -help
   -cfg               : config dir/file (default: exe_path/rv.yaml)
   -reliability       : seconds of reliability (default: 15)
   -user user.svc     : user name (default: hostname)
   -log               : log file
   -log-rotate        : rotate file size limit
   -log-max-rotations : max log file rotations
   -no-permanent      : exit when no clients
   -foreground        : run in foreground
   -listen            : rv listen port
   -no-http           : no http service
   -http              : port for http service (default: listen + 80)
   -no-mcast          : no multicast
   -console           : run with console

Service Key Configuration

Without any arguments, the config file rv.yaml is loaded from the directory where ms_server is installed. This config file can be generated with the ms_gen_key program. It should be the same for each instance that is joining the same network and service, since it contains the service key pair that authenticates the daemon with other daemons located on the network.

If ms_server is installed in /usr/local/bin then this can generate the default config file for it in rvd mode:

$ ms_gen_key -y -s rvd -o /usr/local/bin/rv.yaml
create dir  config                  -- the configure directory
create file config/.salt            -- generate new salt
create file config/.pass            -- generated a new password
create file config/config.yaml      -- base include file
create file config/param.yaml       -- parameters file
create file config/svc_rvd.yaml     -- defines the service and signs users
done
- Output config to "/usr/local/bin/rv.yaml"

The /usr/local/bin/rv.yaml file must be installed on every machine that connects to the network and expects to communicate with the initial machine. The contents define the service key pair:

$ cat /usr/local/bin/rv.yaml
services:
  - svc: rvd
    create: 1663653977.579093187
    pri: QQ5FR17BZktlJnxW/Ln3YExIoq12rf725FEysQwjGJRSNmgskzUA70fQCivq...
    pub: IskYDB7cvb1TIiaGZQ7ZAtWAlwhvGa/7rEfyiRKVp2U10sH3Yl6Eo19c0J1V...
parameters:
  salt_data: hDqyoJ9JSXEEBpiueoNPDEqxy3nsEOt7uoDrSvn4DlSvrLZDNQKG3fmK...
  pass_data: M+ALrLzVLaf/2OlRd7FTstX6pzAF66KQR86EhCxlwXY

The above service key pair is unique for every ms_gen_key execution. The private key is used to sign the authentication messages exchanged between daemons, and the public key is used to verify that the peer is allowed to exchange messages on the network. Unauthenticated peers will be ignored.

Starting in rvd Compatibility Mode

If ms_server is linked to the name rvd and run that way, it will run in compatibility mode:

$ ln -s /usr/local/bin/ms_server /usr/local/bin/rvd
$ /usr/local/bin/rvd
rvd running at [::]:7500
web running at [::]:7580
moving to background daemon

Unless the -foreground or the -console options are used, it forks itself to detach from the terminal from which it was started. ms_server will also run in compatibility mode when any of the arguments above is used; for example, ms_server -listen 7501 -http 7581 -reliability 60 will run in compatible mode.

If there is already an rvd running on port 7500, it will fail to start and exit:

$ rvd
0919 23:13:08.635! rvd.0 listen *:7500 failed
0919 23:13:08.635! web: failed to start web at *.7580

A HUP signal will cause it to exit:

$ killall -HUP rvd

Connecting to Networks

The network parameter that the client specifies controls which network ms_server joins. It can specify a multicast address, TCP connections, or a TCP mesh. Only daemons that connect to the same network will communicate.

The formats of these are:

Network            Description
-----------------  ---------------------
eth0;239.192.0.1   PGM multicast address
eth0;mesh          Mesh network
eth0;tcp.listen    TCP listen
eth0;tcp.connect   TCP connect
eth0               ANY connect
(empty)            no network
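The network strings above split into a device part and a type part. A small sketch, illustrative only and not ms_server code, of how such a string could be parsed:

```python
# Illustrative parser for the network strings in the table above:
# "<device>;<type>" where type is a multicast address, "mesh",
# "tcp.listen", "tcp.connect", or absent (ANY connect).

def parse_network(spec):
    if not spec:
        return None                      # (empty): no network
    device, _, kind = spec.partition(";")
    if not kind:
        return (device, "any")           # bare device: ANY connect
    if kind in ("mesh", "tcp.listen", "tcp.connect"):
        return (device, kind)
    return (device, "pgm", kind)         # otherwise a PGM multicast address

print(parse_network("eth0;239.192.0.1"))  # ('eth0', 'pgm', '239.192.0.1')
print(parse_network("eth0;mesh"))         # ('eth0', 'mesh')
print(parse_network("eth0"))              # ('eth0', 'any')
```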

A mesh network causes all the daemons to connect with one another: each listens on a random port and multicasts that port to eth0. When other daemons receive this message, they establish TCP connections with each other daemon.

A TCP network causes the listeners to multicast their random ports to eth0. When daemons that have tcp.connect as a network receive this message, they connect to the listener. Multiple TCP listeners can exist on the same network; with two "eth0;tcp.listen" specifications and two "eth0;tcp.connect" specifications, both connectors establish connections to both listeners.
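With two listeners and two connectors on the same network, four connections form. This can be modeled with a toy sketch, where the names and ports are made up:

```python
# Illustrative model (not the actual ms_server implementation) of the
# tcp.listen / tcp.connect discovery: every listener multicasts its
# random port, and every connector dials every announced listener.

def tcp_connections(listeners, connectors):
    """Return the set of (connector, listener_port) links that form."""
    links = set()
    for port in listeners:          # each listener announces its port
        for c in connectors:        # each connector dials each announcement
            links.add((c, port))
    return links

# Two listeners and two connectors yield four connections.
links = tcp_connections(listeners=[37277, 37720], connectors=["a", "b"])
print(len(links))  # 4
```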

The PGM multicast address uses UDP encapsulated multicast on the service port using OpenPGM and a UDP point to point protocol for inbox messaging.

The sockets will be bound to the eth0 interface with random ports, except for the PGM socket, which uses a wildcard address for joining the multicast and the service port for sending messages. Multiple services can join the same network, so -service 7500 and -service 7600 can coexist using the same network specification.

When two ms_server instances are using the network "eth0;mesh" on service 7500 and service 7600, the ports console command will show these networks:

host1_7500.rv[+u7D0t7Cf5MP2USlooBtyA]@host1[632]> ports
   tport  | type | cost | fd | ... |  fl  |                   address
----------+------+------+----+-----+------+-------------------------------------------
    rvd.0 |   rv |      | 13 |     |  SLI |                             rv://[::]:7500
rv_7500.1 | mesh | 1000 | 19 |     | SLXD |                     mesh://10.88.0.2:37277
rv_7500.2 | mesh | 1000 | 21 |     |    X |        host2_7500.1@mesh://10.88.0.3:37720
rv_7600.3 | mesh | 1000 | 24 |     | SLXD |                     mesh://10.88.0.2:37109
rv_7600.4 | mesh | 1000 | 26 |     |    X |        host2_7600.1@mesh://10.88.0.3:42620
      web |  web |      | 14 |     |    S |                            web://[::]:7580
10.88.0.2 | name |      | 17 |     |    S |  name://10.88.0.2:59432;239.23.22.217:8327

The ANY specifier can connect either to a mesh or to a TCP listener, depending on which is present.

The empty network does not attempt to connect to anything, but it will find other services through existing connections.

If an rv_7500 transport exists in the configuration (configured in rv.yaml or via the -cfg argument), it overrides any client specified network connection for service 7500, so the client network argument is ignored.

The Peer Names

Each ms_server instance uses the hostname of the machine to identify itself unless the -user argument is used to specify another name. The daemon port is appended to the user name so that multiple daemons appear as hostname_7500 and hostname_7600 when -listen 7500 and -listen 7600 are used for two different daemon instances.

Console

Description of the Console

The console of ms_server is available when the -console option is used or when a telnet protocol is defined. It offers command line editing and completion. It can be used to define, start, or stop connections between instances and to modify which IPC protocols are running for clients to use. It also has many ways to examine and debug the network.

The output is usually colorized if the terminal supports it: green and red for log messages (normal and error) and white for cli command execution results. Received messages that are printed are also colorized: green for field names, blue for field types, white for field values.

The user names and the transport names usually have an integer number appended to them; for example, lex_a2.3 is the user lex_a2 with a uid of 3. This number is either the uid or the tport_id of the identifier. The string identifiers of users and transports can contain duplicates, since peers are identified using the bridge id. The bridge id is a unique random 128 bit nonce; the strings attached to users and transports are tags which are usually, but not necessarily, unique. The users and transports are kept in their respective tables, and the uid and tport_id are indexes into these tables. The * is often used for uid 0 so that it stands out, since it is the peer that the console is attached to. The tport_id of 0 is also special: that is where the client protocols are attached through local IPC, for example, a TCP connection to 127.0.0.1:7500.

The command string entered into the cli will execute if it has enough characters to distinguish it from the prefixes of other commands. If the string pi is entered, then the command ping will run, since pi is a unique prefix of ping. The show prefix is optional when the command matches the second part of the show command, so pe will match and run the show peers command. The shortened command run t test will match and run the show running transport test command.
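The prefix matching described above can be sketched as follows. This is an illustrative model, not the actual console code, and the command list is abridged:

```python
# A string runs a command when it is a unique prefix of exactly one
# command name; an ambiguous or unknown prefix resolves to nothing.

COMMANDS = ["ping", "tping", "mping", "remote", "connect", "listen",
            "shutdown", "network", "configure", "save", "show", "sub",
            "unsub", "pub", "quit"]

def resolve(prefix):
    matches = [c for c in COMMANDS if c.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None

print(resolve("pi"))   # 'ping'  -- unique prefix
print(resolve("s"))    # None    -- ambiguous: save, show, shutdown, sub
```

The real console also treats the show prefix as optional, so a string like pe matches show peers; that fallback is omitted here.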

Help Screen

The following is the help screen, displayed when "help" is entered at the cli.

Command                     Description
--------------------------  ----------------------------------------------
ping [U]                    Ping peers and display latency of return
tping [U]                   Ping peers with route trace flag
mping [P]                   Multicast ping all peers using path P
remote U C                  Run command C remotely on peer U
connect T                   Start tport connect
listen T                    Start tport listener
shutdown T                  Shutdown tport
network S N                 Configure service and join network
configure transport T       Configure tport T
configure parameter P V     Configure parameter P = V
save                        Save current config as startup
show subs [U] [W]           Show subscriptions of peers
show seqno [W]              Show subject seqno values for pub and sub
show adjacency              Show the adjacency links
show peers                  Show active peers
show ports [T]              Show the active ports
show cost [T]               Show the port costs
show status [T]             Show the port status with any errors
show routes [P]             Show the route for each peer for path P (0-3)
show urls                   Show urls of connected peers
show tport [T]              Show the configured tports
show user [U]               Show the configured users
show events                 Show event recorder
show logs                   Show current log buffer
show counters               Show system seqno and time values
show sync                   Show system seqno and sums
show pubtype                Show system publish type received
show inbox [U]              Show inbox sequences
show loss                   Show message loss counters and time
show skew                   Show peer system clock skews
show reachable              Show reachable peers through active tports
show tree [U]               Show multicast tree from me or U
show path [P]               Show multicast path P (0-3)
show forward [P]            Show forwarding P (0-3)
show fds                    Show fd statistics
show buffers                Show fd buffer memory usage
show windows                Show pub and sub window memory usage
show blooms [P]             Show bloom centric routes for path P (0-3)
show match S                Show users which have a bloom that match sub
show graph                  Show network description for node graph
show cache                  Show routing cache geom, hits and misses
show poll                   Show poll dispatch latency
show hosts                  Show rv hosts and services
show rvsub                  Show rv subscriptions
show rpcs                   Show rpcs and subs running
show running                Show current config running
show running transport T    Show transports running, T or all
show running service S      Show services running config, S or all
show running user U         Show users running config, U or all
show running group G        Show groups running config, G or all
show running parameter P    Show parameters running config, P or all
show startup                Show startup config
show startup transport T    Show transports startup, T or all
show startup service S      Show services startup config, S or all
show startup user U         Show users startup config, U or all
show startup group G        Show groups startup config, G or all
show startup parameter P    Show parameters startup config, P or all
sub S [F]                   Subscribe subject S, output to file F
unsub S [F]                 Unsubscribe subject S, stop output file F
psub W [F]                  Subscribe rv-wildcard W, output to file F
punsub W [F]                Unsubscribe rv-wildcard W, stop output file F
gsub W [F]                  Subscribe glob-wildcard W, output to file F
gunsub W [F]                Unsubscribe glob-wildcard W, stop output file F
snap S [F]                  Publish to subject S with inbox reply
pub S M                     Publish msg string M to subject S
trace S M                   Publish msg string M to subject S, with reply
ack S M                     Publish msg string M to subject S, with ack
rpc S M                     Publish msg string M to subject S, with return
any S M                     Publish msg string M to any subscriber of S
cancel                      Cancel and show incomplete (ping, show subs)
mute                        Mute the log output
unmute                      Unmute the log output
reseed                      Reseed bloom filter
debug I                     Set debug flags to ival I
wevents F                   Write events to file
die [I]                     Exit without cleanup, with status 1 or I
quit/exit                   Exit console

The arguments in square brackets are optional, the letters used above are:

  • U — User, the name of an ms_server instance, which is often the hostname of the machine.

  • P — Path, a multicast path, numbered 0 to 3. This selects a precomputed path that all ms_server instances use to forward messages. It will only be different when there are redundant links with a cost that is less or equal to the primary path 0.

  • T — Transport, the name of a connection endpoint that messages are routed through.

  • S — Service or Subject, depending on context. The name or number of a service; for example, 7500 is the default RV service. A subject is any string of characters.

  • N — Network, in the format described in Connecting to Networks.

  • G — Group, defines a group of users, not currently used.

  • F — File, a path in the file system.

  • M — Message, a string of characters, as the console is limited to message formats that can be typed into the cli (string and json).

  • I — Integer

Testing Connectivity with Ping

  • ping [U]

  • tping [U]

  • mping [P]

These commands send a message to a peer and display the message returned. The tping command also sets the trace flag in the sent message so that all peers along the path also send a message back. This is useful in the same way traceroute is useful: to find an unusual latency or dropped messages.

The ping and tping commands optionally take an argument that specifies the name of the peer to send the message to. If no argument is used, then every currently active peer is sent a message. These messages are sent over the link that handles the inbox point to point messages. The subject of a ping message uses the inbox format _I.<nonce>.ping, where the nonce identifies the destination peer. The return uses the _I.<nonce>.N inbox subject, where the nonce identifies the peer of the sending console. The N part of the subject is set up by the console to identify the sending operation and is used in the reply field of the original message.
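The inbox subject layout can be sketched as simple string construction. The nonce value below is illustrative, taken from the sync example earlier in this document:

```python
# Build the inbox subjects described above: the ping is addressed to the
# destination peer's nonce, the reply to the sender's nonce plus an
# operation id that lands in the reply field of the original message.

def ping_subject(dest_nonce):
    return "_I." + dest_nonce + ".ping"

def reply_subject(src_nonce, op_id):
    return "_I." + src_nonce + "." + str(op_id)

print(ping_subject("tXB702RHKF0M69dl7K7vrw"))
# _I.tXB702RHKF0M69dl7K7vrw.ping
```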

The mping command uses a multicast path instead of an inbox path. The multicast path is numbered and is added to the message header so that all peers which receive and route this message use the same path. All peers that receive it send an inbox reply message, similar to ping. The subject used by the sender is _M.ping, which all peers are subscribed to. The multicast paths are numbered 0 to 3, so mping 0 uses the first path and mping 3 uses the last path. Using different paths can be useful to check that all redundant links in use are active and forwarding. The reply also includes which port the message was received on, which will match the network route for the selected path. Path 0 is often the same as the inbox path, except in the case of PGM, where inbox is a UDP point to point protocol.

If the network is not yet stable, sometimes a ping operation will not complete. When this occurs, use the cancel command to show the completed and incomplete values. When a ping operation is started, the console estimates the number of replies that are expected and waits for these to complete before displaying the results. The tping command displays the acks of the message as they are received, but waits for the final results.

Example ping.

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[0]> ping
   user   | cost |   lat |     tport   |  peer_tport
----------+------+-------+-------------+-------------
 pic_a2.1 | 1000 | 189us | pic_amesh.2 |  pic_amesh.2
 pic_a4.3 | 1000 | 184us | pic_amesh.4 |  pic_amesh.4
 pic_a3.2 | 1000 | 214us | pic_amesh.3 |  pic_amesh.3
  pic_a.4 | 1000 | 219us | pic_amesh.5 |  pic_amesh.6
 lex_a.29 | 2000 | 296us | pic_amesh.5 |   fo_mesh.12
 lee_a.26 | 2000 | 340us | pic_amesh.5 |   fo_mesh.12
lex_a4.17 | 3000 | 389us | pic_amesh.5 |  lex_amesh.5
...

Example mping.

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[1]> mping 1
   user   | cost |   lat  |     tport   |  peer_tport
----------+------+--------+-------------+-------------
  pic_a.4 | 1000 |  146us | pic_amesh.5 |  pic_amesh.6
 pic_a2.1 | 1000 |  158us | pic_amesh.2 |  pic_amesh.2
 pic_a4.3 | 1000 |  199us | pic_amesh.4 |  pic_amesh.4
 pic_a3.2 | 1000 |  245us | pic_amesh.3 |  pic_amesh.3
  edo_a.9 | 2000 |  265us | pic_amesh.5 |   fo_mesh.12
 lex_a.29 | 2000 |  278us | pic_amesh.5 |   fo_mesh.12
 lee_a.26 | 2000 |  279us | pic_amesh.5 |   fo_mesh.12
...

The tport field is where the reply inbox message was received, the peer_tport is where the ping message was received at the peer.

Remote Command Execution

  • remote U C

The remote command sends a command to another peer, runs it in that peer’s console, and returns the result. This is useful because, most often, a peer will not have a console, a web interface, or a telnet protocol active. Without remote, the peer would need to be restarted in order to change the configuration or start a console. With remote, you can connect a peer with authentication, encryption, and a console to the network temporarily, make a change, then disconnect the peer.

Example of remote.

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[4]> rem lee_a1 show pubtype
from lee_a1.19:
      type       | recv_count | send_count
-----------------+------------+-----------
 u_session_hello |          0 |          1
    u_session_hb |      16217 |      16218
      u_peer_add |        113 |         31
  u_bloom_filter |         39 |          3
     u_adjacency |         67 |          4
...

Update and Show the Configuration

  • configure transport T

  • configure parameter P V

  • save

  • show running

  • show running transport T

  • show running service S

  • show running user U

  • show running group G

  • show running parameter P

  • show startup

  • show startup transport T

  • show startup service S

  • show startup user U

  • show startup group G

  • show startup parameter P

These commands show and modify the running configuration. The save command writes the running config to the startup config, when the directory and files are writable.

The show running and show startup commands print the config tree in yaml to the console. The running configuration may contain dynamically created users and protocols that result from the startup config; a dynamically created user that is not preconfigured is one example. These appear in running, but will not be saved to startup.

The configure transport command is the most frequently used of these. It updates the currently running transports as well as adding new ones. If it is used to modify an existing transport that is already running, the new settings won’t change the active transport until it is restarted with shutdown and connect or listen. The configuration details of transports are described in Networking, and the details of the parameters are described in Parameters. Most of the parameters are only applied at startup, so changing them takes effect only when saved and the process restarted.

Example of configure transport and show running transport.

chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[110]> configure transport mesh
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[111](mesh)> type mesh
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[112](mesh)> port 9000
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[113](mesh)> connect host1
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[114](mesh)> connect2 host2
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[115](mesh)> listen *
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[116](mesh)> q
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[117]> show running transports mesh
transports:
  - tport: mesh
    type: mesh
    route:
      port: 9000
      connect: host1
      connect2: host2
      listen: "*"
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[123]> configure transport test type tcp port 9000 connect host1
Transport (test) updated
chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[124]> show running transports test
transports:
  - tport: test
    type: tcp
    route:
      port: 9000
      connect: host1

The first configure command enters a cli sub command mode where only the fields of the transport can be entered. The second configure command sets all of the fields on one line.

The commands show service and show group have limited usefulness in the current implementation, since only one service is used per ms_server instance and groups do not have operational functionality yet; eventually they will be used for access control lists.

Transport Start and Stop

  • connect T

  • listen T

  • shutdown T

  • network S N

The transport T must be defined before using the connect, listen, and shutdown commands. The network command configures the transport if it is not already configured, runs it, and attaches a service to it. The configuration of the transports is described in Networking.

Example of connect, listen, shutdown.

chex.rvd[L+jUn266ADoL2fBschoqUg]@chex[108]> configure transport test type tcp port 9000 connect lexx.rai
Transport (test) updated
chex.rvd[L+jUn266ADoL2fBschoqUg]@chex[109]> connect test
Transport (test) started connecting
chex.rvd[L+jUn266ADoL2fBschoqUg]@chex[110]> shutdown test
Transport (test) is running tport 1
Transport (test) shutdown (1 instances down)

The Show Commands

  • show subs [U] [W]

Show the subscriptions active for a user or for all users. The W is a substring for partial matches. This command uses inbox RPC calls to _I.<nonce>.subs for all users that U specifies. The * user matches all users, which allows the W argument to be specified without naming a particular user.

Example, show all subscriptions for every user:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[38]> show subs
   user   |                               subject
----------+-------------------------------------------------------------------
 pic_a1.* |                                       _7603._INBOX.0AB98FB4.DAEMON
          |                       (p) _7603._INBOX.0AB98FB4.763E17AA51E2DEF0.>
          |                                                               test
----------+-------------------------------------------------------------------
 pic_a2.1 |                                       _7606._INBOX.173D29A5.DAEMON
          |                       (p) _7606._INBOX.173D29A5.763E17AA5271FEF0.>
----------+-------------------------------------------------------------------
 pic_a3.2 |                                       _7500._INBOX.0072DD0A.DAEMON
          |                       (p) _7500._INBOX.0072DD0A.663E17AA514B7DD0.>
          |                                               _7500.RSF3.REC.MOT.B
----------+-------------------------------------------------------------------
 pic_a4.3 |                                       _7500._INBOX.68AD2F1B.DAEMON
          |                       (p) _7500._INBOX.68AD2F1B.763E17AA50777DD0.>
          |                                            _7500.RSF4.REC.DEM=.NaE
          |                                             _7500.RSF4.REC.NAI.NaE
...

The (p) string before the subject indicates that the subject was subscribed as a pattern.

Example, show all subscriptions which have the substring DAEMON:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[41]> show subs * DAEMON
   user   |            subject
----------+-----------------------------
 pic_a1.* | _7603._INBOX.0AB98FB4.DAEMON
----------+-----------------------------
 pic_a2.1 | _7606._INBOX.173D29A5.DAEMON
----------+-----------------------------
 pic_a3.2 | _7500._INBOX.0072DD0A.DAEMON
----------+-----------------------------
 pic_a4.3 | _7500._INBOX.68AD2F1B.DAEMON
...

Example, show subscriptions active at user edo_a3:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[44]> show subs edo_a3
   user   |                    subject
----------+---------------------------------------------
edo_a3.13 |                 _7500._INBOX.C6AD7566.DAEMON
          | (p) _7500._INBOX.C6AD7566.763E17AA40C28DD0.>
          |                          _7500.RSF5.REC.DD.N
          |                         _7500.RSF5.REC.BBN.N
...
  • show seqno [W]

Show the sequences of the subjects received and published. Peers with IPC or console subscribers or publishers track the sequences of the subjects to ensure the stream is completely serialized, and notify with a data loss error when it is not in sequence. The details of how this works are described in Message Loss. This command only operates on the local sequence windows; the show windows command shows the memory usage of these.

The W is a substring that matches the subject so that the subjects in the window can be filtered. Without W, all of the subjects are printed.

Example, show the sequences of the subjects which contain ORCL:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[4]> show seqno ORCL
  source   |  seqno |        start      |        time       |         subject
-----------+--------+-------------------+-------------------+---------------------------
       ipc |  52581 | 0207 10:16:16.108 | 0207 23:51:11.441 |      _7500.RSF4.REC.ORCL.O
       ipc | 145911 | 0207 10:20:50.986 | 0208 00:07:24.401 |      _7500.RSF9.REC.ORCL.O
       ipc | 128244 | 0207 10:25:25.864 | 0208 00:17:18.041 |      _7500.RSF7.REC.ORCL.O
 dex_a2.21 | 542769 | 0207 10:03:05.834 | 0208 00:22:42.401 | _7605._TIC.RSF5.REC.ORCL.O
 dex_a1.20 | 542769 | 0207 10:03:05.834 | 0208 00:22:42.281 | _7602._TIC.RSF2.REC.ORCL.O
 ...

The source is the publisher, so ipc indicates that a client attached to lex_a1 published these messages, and dex_a2, dex_a1 indicate that these messages were received from clients attached to those peers (or the console). The start is the first time in the time frame that the subject was seen; the time is the last time it was seen. New time frames occur when the network link state database changes, since the sequence numbers reference a time frame: they jump between old and new time frames, while the seqno base within a frame is linear.
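The per-subject sequence check described above and in Message Loss can be sketched as follows. The semantics here are assumed for illustration, not taken from the actual implementation:

```python
# Each subject's seqno must advance by one: a repeat is a duplicate to
# discard, a larger jump means messages were dropped in transit.

def check(window, subject, seqno):
    last = window.get(subject)
    window[subject] = max(seqno, last or 0)
    if last is None or seqno == last + 1:
        return "ok"
    if seqno <= last:
        return "duplicate"
    return "loss"

w = {}
print(check(w, "RSF4.REC.ORCL.O", 1))  # ok
print(check(w, "RSF4.REC.ORCL.O", 2))  # ok
print(check(w, "RSF4.REC.ORCL.O", 2))  # duplicate
print(check(w, "RSF4.REC.ORCL.O", 5))  # loss
```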

  • show adjacency

Show the adjacency tables. This command dumps the current link state database. It shows which peer has a link to another peer through which tport and the cost of the link (of path 0).

Example:

chex.rvd[LQ9YfNwX/KtuiniQNvVkQg]@chex[127]> show adj
   user    |     adj    |     tport    | type | cost
-----------+------------+--------------+------+-----
    chex.* |            |        ipc.0 |  ipc | 1000
           |    lex_a.1 |       test.1 |  tcp | 1000
-----------+------------+--------------+------+-----
   lex_a.1 |    edo_a.2 |    fo_mesh.4 | mesh | 1000
           |   lex_a2.3 |  lex_amesh.5 | mesh | 1000
           |   lex_a1.4 |  lex_amesh.6 | mesh | 1000
           |   lex_a3.5 |  lex_amesh.7 | mesh | 1000
           |   lex_a4.6 |  lex_amesh.8 | mesh | 1000
           |   robo_a.7 |    fo_mesh.9 | mesh | 1000
           |   lee_a.16 |   fo_mesh.10 | mesh | 1000
           |   dex_a.21 |   fo_mesh.11 | mesh | 1000
           |   pic_a.26 |   fo_mesh.12 | mesh | 1000
           |     chex.* |   lex_tcp.13 |  tcp | 1000
-----------+------------+--------------+------+-----
   edo_a.2 |   edo_a4.8 |  edo_amesh.4 | mesh | 1000
           |   edo_a3.9 |  edo_amesh.5 | mesh | 1000
...

The user is the peer that maintains the links that follow. It sends a link state update message when a link is added, dropped, or its cost is changed.

The adj field is the peer directly attached to user through the tport. The tport is the name that user labels this link with. The tport_id number that follows the name (fo_mesh + .4) is the index into the user’s transport table. The type and cost fields are also sent by user in the link state update.

  • show peers

Shows info about the active peers in the network.

Example:

   user    |         bridge         | sub |  seq | link |   lat  |   max  |   avg  |        time       |    tport  | cost
-----------+------------------------+-----+------+------+--------+--------+--------+-------------------+-----------+-----
    chex.* | VCr9OQDldBjnGLnOXVF7gA |   3 |    3 |    4 |        |        |        | 0320 18:37:34.182 |           |
   pic_a.1 | YdUS3pecw5BYzlj1Qns0uQ |   2 |    0 |   14 | 4.61ms | 6.55ms | 5.01ms | 0320 11:48:25.118 | pic_tcp.1 | 1000
   edo_a.2 | KD28fBfgf6SpwPwH7QpwMA |   2 |    0 |   20 | 5.97ms | 7.92ms |  6.3ms | 0320 11:37:32.198 | edo_tcp.2 | 1000
  pic_a3.3 | x+McKSRvAaAfOuOQEsvX9Q |  81 | 7923 |    8 | 5.57ms | 7.69ms | 5.43ms | 0320 11:48:25.066 | pic_tcp.1 | 2000
  robo_a.4 | gIBRgIKDPjvTwVVuLxE8vg |   2 |    0 |   16 | 6.74ms | 8.67ms | 6.68ms | 0320 01:24:50.489 | edo_tcp.2 | 2000
   dex_a.5 | t2M47zbouWPRJHwFFjVROg |   2 |    0 |   12 | 9.84ms | 9.84ms | 6.62ms | 0320 11:47:17.389 | edo_tcp.2 | 2000
...

The bridge is the 128 bit random nonce created on startup by each peer. It uniquely identifies the peer instance.

The sub field is the number of subscriptions that are active. This number is a counter in the bloom filter that is updated by the peer when subjects and patterns are added or removed. It always contains at least 2 entries: one for the _I.<nonce>.> inbox pattern and one for the _M.> multicast pattern.

The seq field is the sequence number for each subscription operation. It is serialized so that all subscriptions happen in the same order as the peer.

The link field is the sequence number for each link state update. It is also serialized so that adjacency table modifications occur in order.

The lat, max, avg fields are ping round trip times; pings are sent at 1.5x the heartbeat interval to a random peer. They are tracked for at least an hour before being rotated.

The time is the start time of the peer.

The tport and cost reference the inbox route to peer.

The order in the table is by uid. The show peers nonce command orders the table by bridge nonce, show peers start orders it by start time, and show peers user orders it by user name. The show peers host command shows the first 4 bytes of the bridge used as the host id, and show peers ip shows the first 4 bytes of the bridge in IPv4 dotted quad format.

Using show peers zombie, the dead peers are displayed.

  • show ports [T]

Show info about transports that are active on the network.

Example:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[47]> show ports
    tport   |  type  | cost | fd |      bs     |    br    |     ms    |   mr   |   lat |  idle  |  fl  |                   address
------------+--------+------+----+-------------+----------+-----------+--------+-------+--------+------+-------------------------------------------
       rv.0 |     rv |      | 12 |             |          |           |        |       | 27.8hr |   LI |                        rv://127.0.0.1:7500
pic_amesh.1 |   mesh | 1000 | 18 |             |          |           |        |       | 27.8hr | LXCD |                    mesh://172.18.0.2:34344
pic_amesh.2 |   mesh | 1000 | 19 |     3250008 |  3248028 |     10747 |  10747 | 173us | 1.99se |    X |           pic_a2.1@mesh://172.18.0.3:39340
pic_amesh.3 |   mesh | 1000 | 21 |     3248424 |  5785922 |     10733 |  32929 | 240us | 1.39se |    X |           pic_a3.2@mesh://172.18.0.4:41320
pic_amesh.4 |   mesh | 1000 | 23 |     3355474 |  5801830 |     10822 |  33084 | 225us |  835ms |    X |           pic_a4.3@mesh://172.18.0.5:43846
pic_amesh.5 |   mesh | 1000 | 25 | 36957142584 | 29991114 | 100159342 | 245786 | 166us | 1.06ms |    X |            pic_a.4@mesh://172.18.0.1:57204

The tport and type are configured, and the cost is either configured or advertised by the peer in its link state message. If a transport is internal, like an IPC transport, then it doesn’t have a cost associated with it.

The fd field is the endpoint for the transport, usually a listener or a fd assigned to the transport. There are usually one or more fds within the transport that carry out the reading and writing of data to a network endpoint.

The bs, br, ms, and mr fields are bytes sent, bytes received, messages sent, and messages received, collected from all the fds within the transport.

The idle field is the elapsed time since the last message event occurred.

The fl field are flags that are set on the transport. Each character is a different flag:

  • L — has a TCP listener

  • M — is a PGM multicast transport

  • X — is a mesh transport

  • C — is or was actively connecting the link

  • T — was accepted from a TCP listener

  • E — is marked as an edge link, there is no routing on the other side

  • I — is an IPC transport, which carries client endpoints

  • D — resolves the link using a multicast device

  • - — is shutdown

  • * — connecting in progress

The address field is the address at the peer when TCP is used and the multicast address when PGM is used.
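A small decoder for the fl column, using only the flag characters listed above. Characters not in the list, such as the S seen in the example output, are passed through unchanged rather than guessed at:

```python
# Map each fl character to the meaning given in the flag list above.

def decode(fl):
    names = {"L": "listener", "M": "pgm", "X": "mesh", "C": "connecting",
             "T": "accepted", "E": "edge", "I": "ipc", "D": "device",
             "-": "shutdown", "*": "in progress"}
    return [names.get(ch, ch) for ch in fl]

print(decode("LXCD"))  # ['listener', 'mesh', 'connecting', 'device']
```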

  • show cost [T]

This is similar to show ports except that all 4 costs are printed for each transport.

Example:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[49]> show cost pic_amesh
    tport   |  type  | cost | cost2 | cost3 | cost4 | fd |  fl  |                   address
------------+--------+------+-------+-------+-------+----+------+-------------------------------------------
pic_amesh.1 |   mesh | 1000 |  1000 |  1000 |  1000 | 18 | LXCD |                    mesh://172.18.0.2:34344
pic_amesh.2 |   mesh | 1000 |  1000 |  1000 |  1000 | 19 |    X |           pic_a2.1@mesh://172.18.0.3:39340
pic_amesh.3 |   mesh | 1000 |  1000 |  1000 |  1000 | 21 |    X |           pic_a3.2@mesh://172.18.0.4:41320
pic_amesh.4 |   mesh | 1000 |  1000 |  1000 |  1000 | 23 |    X |           pic_a4.3@mesh://172.18.0.5:43846
pic_amesh.5 |   mesh | 1000 |  1000 |  1000 |  1000 | 25 |    X |            pic_a.4@mesh://172.18.0.1:57204
...
  • show status [T]

Similar to show ports, with a status errno if the system reported an error on a link. When everything is normal, the address is printed instead.

Example:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[50]> show status pic_amesh
    tport   | type | fd |  fl  |              status
------------+------+----+------+---------------------------------
pic_amesh.1 | mesh | 18 | LXCD | mesh://172.18.0.2:34344
pic_amesh.2 | mesh | 19 |    X | pic_a2.1@mesh://172.18.0.3:39340
pic_amesh.3 | mesh | 21 |    X | pic_a3.2@mesh://172.18.0.4:41320
pic_amesh.4 | mesh | 23 |    X | pic_a4.3@mesh://172.18.0.5:43846
pic_amesh.5 | mesh | 25 |    X | pic_a.4@mesh://172.18.0.1:57204
...
  • show routes [P]

Show the routes. This shows how all the peers are connected and which port would be used to send and receive messages to/from the peer. It also displays which transports have been used in order to reach the peer.

Example:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[52]> show routes
   user   |     tport   |      state    | cost |   path  |   lat  | fd |               route
----------+-------------+---------------+------+---------+--------+----+---------------------------------
 pic_a2.1 | pic_amesh.2 | inbox,mesh,hb | 1000 | 0,1,2,3 |  143us | 19 | pic_a2.1@mesh://172.18.0.3:39340
          | pic_amesh.3 |               | 2000 |         |        | 21 | pic_a3.2@mesh://172.18.0.4:41320
          | pic_amesh.4 |               | 2000 |         |        | 23 | pic_a4.3@mesh://172.18.0.5:43846
          | pic_amesh.5 |               | 2000 |         |        | 25 |  pic_a.4@mesh://172.18.0.1:57204
...

This shows that user pic_a2 messages have been received or sent through these transports. The secondary transports are often used on startup when the other links are not yet active or when a link fails.

The state of the transport has these values:

  • inbox — transport is the path for the inbox route

  • mesh — transport is part of a mesh

  • hb — transport is directly connected and has a heartbeat

  • ucast — transport has a point to point UDP protocol

  • usrc — transport uses a point to point UDP protocol to reach another peer

The cost is the link cost for the path given by the P argument, or path 0 when not specified.

The path field enumerates which transport is used to reach the peer for each path.

The lat and fd fields are the same as in show ports.

The route is the directly connected peer address through which a message is sent or received.
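
The primary/backup behavior described above can be sketched as follows; the function name and data are illustrative, not part of the actual implementation:

```python
# Sketch of primary-route selection: among all transports that can reach a
# peer, the one with the lowest total cost carries the traffic; the others
# remain as backups used during startup or after a link failure.

def select_route(routes):
    """routes: list of (tport, cost) pairs reaching the same peer."""
    primary = min(routes, key=lambda r: r[1])
    backups = [r for r in routes if r is not primary]
    return primary, backups

# Costs taken from the show routes example above.
routes = [("pic_amesh.2", 1000),
          ("pic_amesh.3", 2000),
          ("pic_amesh.4", 2000)]
primary, backups = select_route(routes)
```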

  • show urls

Show the local and peer addresses as well as the url used to resolve the address of the peer. This is useful for mesh and multicast type networks since the endpoints are sometimes resolved through exchanging messages with the network. In the case of a mesh transport, a mesh url database is exchanged and links are established with all the peers that are in the mesh. The multicast PGM transport exchanges the unicast UDP endpoints for all the peers that are on the transport.

Example:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[54]> show urls
   user   |     tport   |      state    | cost |    mesh   | fd |            url          |           local         |          remote
----------+-------------+---------------+------+-----------+----+-------------------------+-------------------------+------------------------
          |       ipc.0 |            LI |      |           | 11 |                         |    ipc://127.0.0.1:7500 |   ipc://127.0.0.1:43992
          | pic_amesh.1 |          LXCD |      | pic_amesh | 17 | mesh://172.18.0.2:34344 |                         |
 pic_a2.1 | pic_amesh.2 |             X |      | pic_amesh | 20 | mesh://172.18.0.3:44108 | mesh://172.18.0.2:34344 | mesh://172.18.0.3:39340
 pic_a3.2 | pic_amesh.3 |             X |      | pic_amesh | 22 | mesh://172.18.0.4:42851 | mesh://172.18.0.2:34344 | mesh://172.18.0.4:41320
 pic_a4.3 | pic_amesh.4 |             X |      | pic_amesh | 24 | mesh://172.18.0.5:45836 | mesh://172.18.0.2:34344 | mesh://172.18.0.5:43846
  pic_a.4 | pic_amesh.5 |             X |      | pic_amesh | 26 | mesh://172.18.0.1:36262 | mesh://172.18.0.2:34344 | mesh://172.18.0.1:57204
----------+-------------+---------------+------+-----------+----+-------------------------+-------------------------+------------------------
 pic_a2.1 | pic_amesh.2 | inbox,mesh,hb | 1000 | pic_amesh | 19 | mesh://172.18.0.3:44108 | mesh://172.18.0.2:34344 | mesh://172.18.0.3:39340
          | pic_amesh.3 |               | 2000 | pic_amesh | 21 |                         |                         |
          | pic_amesh.4 |               | 2000 | pic_amesh | 23 |                         |                         |
          | pic_amesh.5 |               | 2000 | pic_amesh | 25 |                         |                         |

The top section is similar to show ports with the addition of the urls.

The following section is similar to show routes with the addition of the urls for each user.

The url field is resolved by exchanging messages. The local and remote are the addresses assigned to the connection. Since a mesh may be actively connected by either peer, all peers have passive listeners and some have active connections. The newer peers will usually have the active connections and the older peers will have accepted connections. The local and remote addresses reflect that, since the accepted peers are assigned an address by the system and the connecting peers use the url address to connect.

  • show tport [T]

Show the state of the transports. This prints the configured transport and whether it is active or not. The other transport show commands will only show the active transports. This will show the ones configured but not active as well.

Example:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[55]> show tport
   tport  |  type  |    state  |        listen       |             connect            |    device
----------+--------+-----------+---------------------+--------------------------------+------------
pic_amesh |   mesh | accepting |                     |                                | mesh://eth0
       rv |     rv | accepting | rv://127.0.0.1:7500 |                                |
      tel | telnet | accepting |     telnet://*:2222 |                                |
      ipc |    ipc |       ipc |                     |                                |
  rvd.ipc |    ipc |         - |                     |                                |
     eth0 |   name |         - |                     | name://eth0;239.23.22.217:8327 |
     test |    tcp |         - |                     |        tcp://robotron.rai:9000 |

The listen, connect, and device fields show how the transport is configured to resolve the connections.

  • show user [U]

Show the configured users.

Example:

chex.test[OsGpIaCbYCJbhnUVEp19Uw]@chex[135]> show users
uid | user |  svc |        create        | expires
----+------+------+----------------------+--------
  0 | chex | test | 1675847381.440084399 |
    | dyna | test | 1675847381.440129724 |
    | ruby | test | 1675847381.440176492 |
    | zero | test | 1675847419.072423168 |
  • show events

The system tracks authentication, transport, and link state events in a buffer that rotates every 4096 entries. This is a compact table with 6 integer fields that map to a time stamp, uids, transports, and enumerated values depending on the event type. These events are useful for working out what happened to the network after something went wrong.
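
A rotating buffer like this can be sketched in a few lines; the record layout here is simplified (tuples instead of the 6 integer fields) and the names are hypothetical:

```python
# Minimal sketch of a rotating event buffer: a fixed array of 4096 compact
# records that wraps around, overwriting the oldest entries.
class EventLog:
    SIZE = 4096

    def __init__(self):
        self.buf = [None] * self.SIZE
        self.count = 0                      # total events ever logged

    def append(self, event):
        self.buf[self.count % self.SIZE] = event
        self.count += 1

    def recent(self, n):
        """Return up to n most recent events, oldest first."""
        start = max(0, self.count - n)
        return [self.buf[i % self.SIZE] for i in range(start, self.count)]

log = EventLog()
for i in range(5000):                       # wraps past 4096 entries
    log.append(("event", i))
```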

Example of an event log:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[59]> show events
       stamp      |     tport   |    user   |    peer   |       event     |         data
------------------+-------------+-----------+-----------+-----------------+--------------------
0206 22:09:22.606 |             |  pic_a1.* |           |         startup |
0206 22:09:22.607 |       ipc.0 |  pic_a1.* |           |      on_connect |              listen
0206 22:09:22.607 | pic_amesh.1 |  pic_a1.* |     (aes) |      on_connect |              listen
0206 22:09:22.607 |     (mcast) |  pic_a1.* |           |      send_hello |
0206 22:09:23.301 | pic_amesh.2 |  pic_a1.* |     (aes) |      on_connect |         mesh_accept
0206 22:09:23.327 |             |  pic_a1.* |           |        converge |           add_tport
0206 22:09:23.340 | pic_amesh.2 |  pic_a2.1 |  pic_a1.* |  add_user_route |            neighbor
0206 22:09:23.340 | pic_amesh.2 |  pic_a1.* |  pic_a2.1 |  send_challenge |               hello
0206 22:09:23.342 | pic_amesh.2 |  pic_a2.1 |           |  recv_challenge |           handshake
0206 22:09:23.342 |             |  pic_a2.1 |    (ecdh) |        auth_add |           handshake
0206 22:09:23.342 |     (mcast) |  pic_a1.* |  pic_a2.1 | send_adj_change |                 add
0206 22:09:23.342 | pic_amesh.2 |  pic_a2.1 |           |      send_trust |             in_mesh
0206 22:09:23.342 | pic_amesh.2 |  pic_a2.1 |           |    recv_peer_db |           add_route
0206 22:09:23.342 | pic_amesh.2 |  pic_a2.1 |  pic_a1.* | recv_adj_change |          update_adj
0206 22:09:23.367 |             |  pic_a1.* |           |        converge |          adj_change
0206 22:09:23.889 | pic_amesh.3 |  pic_a1.* |     (aes) |      on_connect |         mesh_accept
0206 22:09:23.927 |             |  pic_a1.* |           |        converge |           add_tport
0206 22:09:23.928 | pic_amesh.3 |  pic_a3.2 |  pic_a1.* |  add_user_route |            neighbor

The events that are logged are:

  • startup — Initial event, time of start

  • on_connect — Transport listen, connect, or accept occurred

  • on_shutdown — Transport connection was closed or shutdown

  • on_timeout — Transport connection timed out

  • auth_add — Peer was authenticated and is now trusted

  • auth_remove — Peer authentication is dropped

  • send_challenge — An authentication challenge is sent to the peer

  • recv_challenge — An authentication challenge is received from the peer

  • send_trust — Authentication was successful, sent trust message

  • recv_trust — Peer notified that my node is now authenticated

  • add_user_route — Route to peer is found and the transport is labeled

  • hb_queue — Peer is added to the heartbeat timeout queue

  • hb_timeout — Peer heartbeat was not received within its interval

  • send_hello — Transport is initialized by sending a hello message

  • recv_bye — Peer intends to leave the network and sends a bye message

  • recv_add_route — Received a message that a peer was added to the network

  • recv_peer_db — All the peers that are known are exchanged with a new peer

  • send_add_route — Send a message when a peer is added to the network

  • send_peer_del — Send a message when a peer is removed from the network

  • sync_result — Peer sync message was received, initialize peer state

  • send_sync_req — Request a peer sync after a new peer is notified

  • recv_sync_req — Receive a sync request for my node or another peer

  • recv_sync_fail — Receive a sync request for an unknown peer

  • send_adj_change — Send a link state update message, add or remove link

  • recv_adj_change — Received a link state update message

  • send_adj_req — Link state for peer is stale, request the current link state

  • recv_adj_req — Receive a request for the current link state

  • send_adj — Send the current link state to a peer

  • recv_adj_result — Receive the current link state from a peer

  • resize_bloom — Resize my peer’s bloom filter and send it to the network

  • recv_bloom — Received a peer’s bloom filter

  • converge — The network has no missing link states and is completely connected

  • show logs

The last 64K bytes of the log are buffered in the process. This command shows this buffer.

  • show counters

Show the counters of heartbeat, inbox, and ping subjects.

Example:

pic_a1.rvd[CrmPtIc8B3ZedgdVTW7XOQ]@pic_a1[60]> show counters
   user   |        start      | hb seqno |       hb time     | snd ibx | rcv ibx | ping snd |     ping stime    | pong rcv | ping rcv
----------+-------------------+----------+-------------------+---------+---------+----------+-------------------+----------+---------
 pic_a1.* | 0206 22:09:22.606 |          |                   |         |         |          |                   |          |
 pic_a2.1 | 0206 22:09:23.219 |    17021 | 0208 20:52:00.940 |      19 |      23 |      454 | 0208 20:50:22.608 |      454 |      442
 pic_a3.2 | 0206 22:09:23.806 |    17021 | 0208 20:51:51.687 |      18 |     149 |      438 | 0208 20:50:43.808 |      438 |      444
 pic_a4.3 | 0206 22:09:24.401 |    17020 | 0208 20:51:52.241 |      29 |     125 |      427 | 0208 20:51:00.008 |      427 |      438
  pic_a.4 | 0206 22:09:24.433 |    17020 | 0208 20:51:52.275 |      35 |      37 |      422 | 0208 20:51:21.608 |      422 |      426
robo_a3.5 | 0206 22:09:06.260 |        0 |                   |      11 |      98 |      427 | 0208 20:51:40.528 |      427 |      421
robo_a2.6 | 0206 22:09:05.371 |        0 |                   |      11 |      15 |      424 | 0208 20:51:50.168 |      424 |      423
robo_a4.7 | 0206 22:09:07.183 |        0 |                   |      11 |      95 |      420 | 0208 20:41:30.568 |      420 |      418
robo_a1.8 | 0206 22:09:04.452 |        0 |                   |      11 |      15 |      423 | 0208 20:41:48.848 |      423 |      424
  edo_a.9 | 0206 22:09:12.993 |        0 |                   |       2 |      20 |      422 | 0208 20:42:05.808 |      422 |      419
...

The start field is when the process started. The hb seqno and hb time track the last heartbeat received from the peer when it is directly connected. The snd ibx and rcv ibx fields are counters for many of the _I.<nonce>. subjects, which guard against repeats. These are point to point messages; the peer has the same counters, which should match these. The show inbox command will show the last 32 of these sequences. The ping and pong sequences have their own counters, since these are used to check connectivity between peers and are expected to have loss when the network is unstable.
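
The repeat-guarding behavior of these counters can be sketched as follows; the class and field names are assumptions, not the actual implementation:

```python
# Sketch of inbox sequence tracking: each point-to-point _I.<nonce>. message
# carries an increasing seqno; a receiver accepts a message only when its
# seqno advances past the last one seen, discarding repeats.
class InboxCounter:
    def __init__(self):
        self.rcv_seqno = 0

    def on_recv(self, seqno):
        if seqno <= self.rcv_seqno:
            return False                    # repeat, message is tossed
        self.rcv_seqno = seqno
        return True

ibx = InboxCounter()
# A duplicate (2) and an out-of-date seqno (1) are both discarded.
accepted = [ibx.on_recv(s) for s in (1, 2, 2, 3, 1)]
```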

  • show sync

Show the link state seqno and sub seqno sums.

Example:

    user   |        start      | link_seqno | link_sum | sub_seqno | sub_sum | hb_diff | mc_req | mc_res | req_adj | res_adj | ping_adj
-----------+-------------------+------------+----------+-----------+---------+---------+--------+--------+---------+---------+---------
    chex.* | 0225 01:38:14.590 |          5 |     1447 |         0 |   81677 |         |        |        |         |         |
   edo_a.1 | 0224 17:07:32.126 |         25 |     1447 |         0 |   81653 |       0 |      0 |      0 |       0 |       0 |        0
  edo_a2.3 | 0224 17:07:29.173 |          8 |        0 |         2 |       0 |       0 |      0 |      0 |       0 |       0 |        0
  edo_a1.4 | 0224 17:07:27.696 |          8 |        0 |         2 |       0 |       0 |      0 |      0 |       0 |       0 |        0
  edo_a3.5 | 0224 17:07:30.591 |          8 |        0 |      6673 |       0 |       0 |      0 |      0 |       0 |       0 |        0
  edo_a4.6 | 0224 17:07:32.052 |          7 |        0 |      6874 |       0 |       0 |      0 |      0 |       0 |       0 |        0
  robo_a.7 | 0224 17:07:26.471 |         18 |        0 |         0 |       0 |       0 |      0 |      0 |       0 |       0 |        0
...

The start field is when the process started. The link_seqno and link_sum are the link state seqno and the sum of all of the peers’ link state seqnos. The sub_seqno and sub_sum are the subscription seqno and the sum of all peers’ subscription seqnos. These sums will only appear when the node is directly connected to the peer, since they are the values last seen in the heartbeat messages.

The sequence numbers are always increasing after a change in the link state or subscription state, so the sums of these seqnos are unique for the current network state and provide a way for peers to check whether they are in sync with the network.
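
A minimal sketch of the sum comparison, with made-up peer names and seqno values:

```python
# Sketch of the sync check: each peer sums the link state seqnos it knows
# for every peer; two peers agree on the network state exactly when their
# sums match, since the seqnos only ever increase.
def seqno_sum(link_seqnos):
    """link_seqnos: {peer_name: link state seqno}"""
    return sum(link_seqnos.values())

mine   = {"chex": 5, "edo_a": 25, "robo_a": 18}
theirs = {"chex": 5, "edo_a": 24, "robo_a": 18}   # stale edo_a link state
in_sync = seqno_sum(mine) == seqno_sum(theirs)    # mismatch triggers _M.sync
```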

These are exchanged with the heartbeat messages. When a difference is detected, the hb_diff is incremented and a _M.sync message is multicast to the network. When a peer receives the sync message, it checks that their sums match with the sending peer. If they do not match, then they reply with their current link state and subscription seqno values in a _I.<nonce>.sync point to point message. When a peer receives the sync reply it checks that these are in sync and requests adjacency with _I.<nonce>.sync_req if they do not.

The hb_diff may not always indicate an actual difference with the network, since it is possible that a subscription or a link state message is received and applied to the peer at a different rate than the heartbeat is received, but the reply of the current sequence numbers at the peer will most likely be less than or equal to the state of the network when the peer is in sync.

The mc_req is the number of _M.sync messages received, and mc_res is the number of _I.<nonce>.sync messages received. The req_adj is the number of adjacency requests made as a result of the _M.sync messages, res_adj is the number made as a result of the _I.<nonce>.sync messages, and ping_adj is the number made as a result of _I.<nonce>.ping messages.

  • show pubtype

When a message header is created or unpacked, a counter for the subject class is incremented. This command shows these counters. Only messages that are processed by the network are counted; it is possible for two clients within the IPC transport to exchange messages, and these are not counted.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[7]> show pubtype
      type       | recv_count | send_count
-----------------+------------+-----------
 u_session_hello |          0 |          1
    u_session_hb |      68761 |      68765
      u_peer_add |        134 |         35
      u_peer_del |         16 |          4
  u_bloom_filter |         39 |          3
     u_adjacency |        115 |          4
      u_sub_join |     224621 |         24
     u_sub_leave |     223689 |          0
    u_psub_start |        110 |         89
    u_inbox_auth |          4 |          8
    u_inbox_subs |         10 |          0
    u_inbox_ping |      12476 |      12529
    u_inbox_pong |      12529 |      12481
     u_inbox_rem |          1 |          0
   u_inbox_resub |          0 |        202
 u_inbox_add_rte |          4 |          4
u_inbox_sync_req |          2 |         30
u_inbox_sync_rpy |         29 |          0
 u_inbox_adj_req |          3 |         10
 u_inbox_adj_rpy |         21 |          6
     u_inbox_ack |          0 |          1
     u_inbox_any |          0 |     224476
         u_inbox |          0 |          1
    u_mcast_ping |          5 |          0
 u_inbox_any_rte |         80 |          0
   mcast_subject | 1528812397 |          0
  • show inbox [U]

Show the types of the last 32 system RPC messages sent and received for each peer. Some peers may not have any of these if they are not directly connected.

This is an example of a peer attached to the console connecting to a larger network:

chex.rvd[xpO5ODZvoOcUMJ60QVaSBg]@chex[139]> inbox
  user  | send seqno |     send type    | recv seqno |     recv type
--------+------------+------------------+------------+-----------------
lex_a.1 |          1 |     u_inbox_auth |          1 | u_inbox_sync_rpy
        |          2 |  u_inbox_add_rte |          2 |     u_inbox_auth
        |          3 |  u_inbox_adj_req |          3 |  u_inbox_add_rte
        |          4 | u_inbox_sync_req |          4 |  u_inbox_adj_rpy
        |          5 | u_inbox_sync_req |          5 | u_inbox_sync_rpy
        |          6 | u_inbox_sync_req |          6 | u_inbox_sync_rpy
        |          7 | u_inbox_sync_req |          7 | u_inbox_sync_rpy
        |          8 | u_inbox_sync_req |          8 | u_inbox_sync_rpy
...

The first 3 sequences are the result of authentication, which causes both peers to exchange all their known peers. The following u_inbox_sync_req and u_inbox_sync_rpy pairs are used to request the peers which are not yet authenticated. In this case, the connecting peer has no peers and the peer attached to the network has lots of peers that need synchronizing.

  • show loss

Show the counters of repeated messages (old message sequences), messages not subscribed, message loss, and inbox loss.

When a message is repeated or not subscribed, a counter is incremented and the message is tossed. These types of events can occur through normal operation and don’t have an impact on clients.

The repeated messages can occur during network instability, and not subscribed messages can occur because an unsubscribe has not yet reached the publisher or because the bloom filter did not filter the subject.
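
A toy bloom filter shows how such a "not subscribed" delivery can arise; the bit width and hash construction here are illustrative, not the filter the implementation uses:

```python
# A bloom filter answers "possibly subscribed" or "definitely not", so a
# publisher may forward a subject no one actually wants (a false positive),
# and the receiver counts it as not subscribed and tosses it.
import hashlib

class Bloom:
    def __init__(self, nbits=64, nhash=3):
        self.bits, self.nbits, self.nhash = 0, nbits, nhash

    def _idx(self, subject):
        # Derive nhash bit positions from a keyed hash of the subject.
        for k in range(self.nhash):
            h = hashlib.sha256(f"{k}:{subject}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.nbits

    def add(self, subject):
        for i in self._idx(subject):
            self.bits |= 1 << i

    def maybe_member(self, subject):
        # True means "possibly subscribed"; False is always definitive.
        return all(self.bits >> i & 1 for i in self._idx(subject))

bf = Bloom()
bf.add("PRICE.NYSE.IBM")
```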

The message loss counters are more critical to correct behavior, since they indicate that messages did not reach all subscriptions. The inbox message loss can occur normally, since inbox messages are used to synchronize peers during network instability; they are used to stabilize the network.

The point to point messages using the _INBOX prefix will also use the inbox sequences, but even these are not as critical since clients will have timeouts and retry the operation that uses an _INBOX.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[11]> show loss
   user    | repeat | rep time | not sub | not time | msg loss |      loss time    | ibx loss |      ibx time
-----------+--------+----------+---------+----------+----------+-------------------+----------+------------------
  lex_a2.1 |      0 |          |       0 |          |        0 |                   |        0 |
  lex_a3.2 |      0 |          |       0 |          |        0 |                   |        0 |
  lex_a4.3 |      0 |          |       0 |          |        0 |                   |        0 |
   edo_a.5 |      0 |          |       0 |          |        0 |                   |        0 |
  robo_a.6 |      0 |          |       0 |          |        0 |                   |        0 |
  edo_a4.7 |      0 |          |       0 |          |        0 |                   |        1 | 0209 08:22:25.120
  edo_a3.8 |      0 |          |       0 |          |        0 |                   |        1 | 0209 08:22:25.120
  edo_a1.9 |      0 |          |       0 |          |      640 | 0209 08:24:31.960 |        0 |
 edo_a2.10 |      0 |          |       0 |          |      655 | 0209 08:24:32.080 |        0 |
robo_a3.11 |      0 |          |       0 |          |        0 |                   |        1 | 0209 08:22:25.120
robo_a2.12 |      0 |          |       0 |          |      630 | 0209 08:24:31.761 |        0 |
robo_a4.13 |      0 |          |       0 |          |        0 |                   |        1 | 0209 08:22:25.120
robo_a1.14 |      0 |          |       0 |          |      647 | 0209 08:24:23.841 |        0 |
 lee_a1.15 |      0 |          |       0 |          |        1 | 0209 08:22:27.841 |        0 |
...

The user is the sender of the message. The repeat and rep time fields are the count and the time stamp of the last instance. The not sub and not time fields are for the not subscribed messages. The msg loss and loss time fields are for the multicast message loss. The ibx loss and ibx time fields are for the point to point inbox message loss.

  • show skew

Show the system time skew between peers. There are several messages that include a time stamp which can be used to estimate the system clock skew between peers. This is useful to guard against message replays. If a peer message arrives and the time + skew is older than the subscription window, then it is treated as a repeated message. When the time is within the subscription window, then a sequence will be associated with the last message received from the peer. The subscription window rotate time is configurable, described in Parameters of the config section. The details of the loss calculation are described in [msg_loss].

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[11]> show skew
   user    |   lat |    hb   | ref |   ping   |   pong  |        time
-----------+-------+---------+-----+----------+---------+------------------
  lex_a2.1 | 241us |  63.5us |   0 |   33.7us | -32.5us | 0209 08:47:48.395
  lex_a3.2 | 119us |  76.9us |   0 |   31.4us | -7.15us | 0209 08:47:48.395
  lex_a4.3 | 157us |   236us |   0 |   32.7us | -15.1us | 0209 08:47:48.395
   edo_a.5 | 302us |  -483us |   4 | -0.161us | -26.1us | 0209 08:47:48.395
  robo_a.6 | 291us | -1.09ms |   4 |  0.154us | -1.41ms | 0209 08:47:48.397
  edo_a4.7 | 521us |   282us |   4 |   31.6us |  -131us | 0209 08:47:48.395
  edo_a3.8 | 512us |   250us |   4 |   -5.1us | -14.7us | 0209 08:47:48.395
  edo_a1.9 | 308us |  1.26ms |   4 |  -12.8us |  72.8us | 0209 08:47:48.395
 edo_a2.10 | 452us |  1.02ms |   4 |  -13.2us |  -222us | 0209 08:47:48.395
robo_a3.11 | 528us |   314us |   4 |     28us | -1.44ms | 0209 08:47:48.397
robo_a2.12 | 468us |   477us |   4 |  -3.79us | -1.47ms | 0209 08:47:48.397
robo_a4.13 | 633us |   571us |   4 |   -8.7us |  -1.5ms | 0209 08:47:48.397
...

The first messages a peer will see when connecting are the heartbeat and authentication messages. These have a time attached to them, and this is the first time skew calculation that a peer will have. The hb column contains this value and the ref is the uid of the peer that is attached and calculated the skew. The ping and pong values are calculated later, when a ping pong sequence of messages is exchanged. These are more accurate because there is a larger sample size as the uptime increases. The time is the last time a skew was calculated.
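
The ping/pong skew estimate can be sketched in the style of the classic NTP offset calculation; whether the implementation uses exactly this formula is an assumption:

```python
# t1 = ping send (local clock), t2 = ping receive (peer clock),
# t3 = pong send (peer clock),  t4 = pong receive (local clock).
# Assuming symmetric one-way delay, the peer clock offset falls out of
# the four time stamps.
def estimate(t1, t2, t3, t4):
    latency = (t4 - t1) - (t3 - t2)         # round trip minus peer hold time
    skew    = ((t2 - t1) + (t3 - t4)) / 2   # peer clock minus local clock
    return latency, skew

# Peer clock runs 50us ahead; one-way delay 100us; peer holds pong 10us.
latency, skew = estimate(t1=0, t2=150, t3=160, t4=210)
```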

  • show reachable

Show which transport links can be used to reach a peer. This table associates a connection fd with a list of peers that are using it. If this connection is lost, then these are the peers that may be affected by this event.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[12]> show reachable
   user    |   path  | fd |     tport
-----------+---------+----+------------
  lex_a2.1 | 0,1,2,3 | 19 | lex_amesh.2
  lex_a3.2 |         |    |
  lex_a4.3 |         |    |
  dex_a.24 |         |    |
  pic_a.29 |         |    |
  lee_a.18 |         |    |
  robo_a.6 |         |    |
   edo_a.5 |         |    |
-----------+---------+----+------------
  lex_a3.2 | 0,1,2,3 | 21 | lex_amesh.3
  lex_a2.1 |         |    |
  lex_a4.3 |         |    |
  dex_a.24 |         |    |
  pic_a.29 |         |    |
  robo_a.6 |         |    |
   edo_a.5 |         |    |
...

The user is the peer, the path is a list of paths used with the connection fd, and the tport is the transport that contains the connection.

  • show tree [U]

Show the multicast tree for a user or self. This iterates through the adjacency tables by cost and shows which peers will be reached after each step. The cost increases until all the peers are exhausted. If a U argument is present, then the multicast tree starts from that peer instead of the peer attached to the console.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[14]> show tree
cost | set | alt |  source  |     tport    |   dest
-----+-----+-----+----------+--------------+--------
1000 |   0 |   0 | lex_a1.* |  lex_amesh.2 |  lex_a2
1000 |   1 |   0 | lex_a1.* |  lex_amesh.3 |  lex_a3
1000 |   2 |   0 | lex_a1.* |  lex_amesh.4 |  lex_a4
1000 |   3 |   0 | lex_a1.* |  lex_amesh.5 |   lex_a
-----+-----+-----+----------+--------------+--------
2000 |   0 |   0 | lex_a.33 |    fo_mesh.7 |   edo_a
2000 |   2 |   0 | lex_a.33 |    fo_mesh.9 |  robo_a
2000 |   1 |   0 | lex_a.33 |    fo_mesh.8 |   lee_a
2000 |   4 |   0 | lex_a.33 |   fo_mesh.11 |   dex_a
2000 |   3 |   0 | lex_a.33 |   fo_mesh.10 |   pic_a
-----+-----+-----+----------+--------------+--------
3000 |   0 |   0 |  edo_a.5 |  edo_amesh.4 |  edo_a4
3000 |   1 |   0 |  edo_a.5 |  edo_amesh.5 |  edo_a3
3000 |   2 |   0 |  edo_a.5 |  edo_amesh.6 |  edo_a1
3000 |   3 |   0 |  edo_a.5 |  edo_amesh.7 |  edo_a2
3000 |   4 |   0 | robo_a.6 | robo_amesh.4 | robo_a3
3000 |   5 |   0 | robo_a.6 | robo_amesh.5 | robo_a2
...

The set is an index into the table used for the next hop; this is calculated by transitioning across the transport links. Since the uids are displayed in order, the set may jump back and forth through the table. The alt counter is an alternate path counter. Only the 0 alt path is used, but the others are displayed.

The source is the forwarding peer that sends the message, the tport is the transport local to the source, and dest is the receiver.
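
The cost-layered expansion can be sketched as a Dijkstra-style traversal over an illustrative adjacency table; the real computation walks the per-transport adjacency tables, but the shape of the result matches the show tree output:

```python
# Build a multicast tree from a root peer: expand by increasing total cost
# until every peer is covered, recording which source forwards to which
# destination at each cost step.
import heapq

def multicast_tree(adj, root):
    """adj: {node: [(neighbor, cost), ...]} -> list of (cost, src, dst)."""
    dist, tree = {root: 0}, []
    pq = [(0, root)]
    while pq:
        d, src = heapq.heappop(pq)
        if d > dist[src]:
            continue                        # stale queue entry
        for dst, c in adj.get(src, []):
            if dst not in dist or d + c < dist[dst]:
                dist[dst] = d + c
                tree.append((d + c, src, dst))
                heapq.heappush(pq, (d + c, dst))
    return tree

# Illustrative topology: a local mesh link, then hops through other meshes.
adj = {"lex_a1": [("lex_a2", 1000), ("lex_a", 1000)],
       "lex_a":  [("edo_a", 1000)],
       "edo_a":  [("edo_a4", 1000)]}
tree = multicast_tree(adj, "lex_a1")
```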

  • show path [P]

Show the transports used to reach a peer for a path. This is the forwarding table that is used to send a message from the local peer to other peers.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[15]> show path
    tport   | cost | path_cost |    dest
------------+------+-----------+----------
lex_amesh.2 | 1000 |      1000 |  lex_a2.1
lex_amesh.3 | 1000 |      1000 |  lex_a3.2
lex_amesh.4 | 1000 |      1000 |  lex_a4.3
lex_amesh.5 | 1000 |      2000 |   edo_a.5
lex_amesh.5 | 1000 |      3000 |  edo_a4.7
lex_amesh.5 | 1000 |      3000 |  edo_a3.8
...

The tport is used for sending a message to dest. The cost is the first hop cost, the path_cost is the total cost through all hops.

  • show forward [P]

Show the forwarding table for a message received from each of the peers. When a message is received from a peer, it may need to be forwarded to other peers to completely cover the network. This shows the forwarding tables for each peer.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[16]> show forward
  source  |     tport   | cost
----------+-------------+-----
 lex_a1.* | lex_amesh.2 | 1000
          | lex_amesh.3 | 1000
          | lex_amesh.4 | 1000
          | lex_amesh.5 | 1000
----------+-------------+-----
 lex_a2.1 |             |
----------+-------------+-----
 lex_a3.2 |             |
...

The source indexes the forwarding table used, and the tport is the transport used to forward the message.

  • show fds

Show what each fd is used for. This iterates the fd tables and shows what each fd is doing.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[17]> show fds
fd | rid |      bs     |      br      |     ms    |     mr     | ac | rq | wq | fl |         type        |      kind     |           name           |           address
---+-----+-------------+--------------+-----------+------------+----+----+----+----+---------------------+---------------+--------------------------+--------------------------
 3 |  -1 |           0 |        15321 |         0 |          0 |    |  0 |  0 |    |              logger |        stdout |                          |
 5 |  -1 |             |              |           |            |    |    |    |    |         timer_queue |         timer |                          |
 7 |  -1 |           0 | 717092458452 |         0 | 1943883309 |    |    |    |    |           ipc_route |           ipc |                  rvd.ipc |
 8 |  -1 |           0 |         4235 |         0 |          0 |    |  0 |  0 |    |              logger |        stderr |                          |
 9 |  -1 |             |              |           |            |    |    |    |    |       console_route |       console |              rvd.console |
10 |  -1 |           0 |     99146804 |         0 |     690776 |    |    |    |    |         session_mgr |       session |              rvd.session |
11 |   0 |           0 |  64848767199 |         0 |  261901166 |    |    |    |    |     transport_route |         tport |          rvd.ipc.tport.0 |
12 |   0 |             |              |           |            | 12 |    |    |    |           rv_listen |     rv_listen |        rvd.ipc.rv.list.0 |            127.0.0.1:7500
13 |  -1 |             |              |           |            |  1 |    |    |    |       telnet_listen | telnet_listen |               telnet.tel |              0.0.0.0:2222
14 |  -1 |         210 |            0 |         1 |          0 |    |    |    |    |        name_connect |    mcast_send |           name.eth0.send |        239.23.22.217:8327
15 |  -1 |        1000 |         1260 |         5 |          6 |    |    |    |    |         name_listen |    mcast_recv |           name.eth0.recv |        239.23.22.217:8327
16 |  -1 |             |              |           |            |    |    |    |    |         name_listen |    ucast_recv |          name.eth0.inbox |          172.18.0.2:33643
17 |   1 |             |              |           |            |    |    |    |    |     transport_route |         tport |    rvd.lex_amesh.tport.1 |
18 |   1 |             |              |           |            |  5 |    |    |    | ev_tcp_tport_listen |    tcp_listen | rvd.lex_amesh.tcp_list.1 |          172.18.0.2:42341
19 |   2 |  9458891878 |     28871168 |  27427393 |     121986 |    |  0 |  0 |    |        ev_tcp_tport |    tcp_accept |  rvd.lex_amesh.tcp_acc.1 | lex_a2.1@172.18.0.3:41708
20 |   2 |           0 |     16338022 |         0 |      50587 |    |    |    |    |     transport_route |         tport |    rvd.lex_amesh.tport.2 |
21 |   3 |  9548489486 |     28505122 |  27617221 |     120205 |    |  0 |  0 |    |        ev_tcp_tport |    tcp_accept |  rvd.lex_amesh.tcp_acc.1 | lex_a3.2@172.18.0.4:44630
...

The fields are:

Field Description

fd

File descriptor

rid

Transport id that fd belongs to

bs

Bytes sent

br

Bytes received

ms

Messages sent

mr

Messages received

ac

Listener accept count

rq

Bytes in the receive queue

wq

Bytes in the send queue

fl

Socket flags, R,r,<: reading, W,w,>: writing, +: processing.

type

What type of fd

kind

What class of fd

name

The name associated with fd

address

The local address

  • show buffers

Show the buffer usage of each connection. These buffers expand to contain an entire message, since there is no streaming of large messages.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[18]> show buffers
fd |   wr  |  wmax |   rd  |  rmax | zref |    send   |    recv   | mall | pall |           name
---+-------+-------+-------+-------+------+-----------+-----------+------+------+------------------------
 3 | 32768 | 32768 | 16384 | 16384 |    0 |         0 |       124 |    0 |    0 |
 8 | 32768 | 32768 | 16384 | 16384 |    0 |         0 |        74 |    0 |    0 |
19 | 32768 | 32768 | 16384 | 16384 |    0 |  27189290 |     67973 |    0 |    0 | rvd.lex_amesh.tcp_acc.1
21 | 32768 | 32768 | 16384 | 16384 |    0 |  27224765 |     66485 |    0 |    0 | rvd.lex_amesh.tcp_acc.1
23 | 32768 | 32768 | 16384 | 16384 |    0 |  30118303 |     68727 |    0 |    0 | rvd.lex_amesh.tcp_acc.1
25 | 32768 | 32768 | 16384 | 16384 |    0 |   5498186 |  38629165 |    0 |    0 | rvd.lex_amesh.tcp_acc.1
...

The fields are:

Field Description

fd

File descriptor

wr

Write buffer size

wmax

The largest write buffer used

rd

Read buffer size

rmax

The largest read buffer used

zref

Counter incremented after zero copy sends

send

Bytes sent

recv

Bytes received

mall

Counter incremented when malloc() is used to make a buffer

pall

Counter incremented when a buffer is borrowed from the buffer pool

name

Name associated with fd

  • show windows

Show the size and counts of the subject publish and subscribe windows as well as the size of subscription tables and bloom filters.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[19]> show windows
   tab  | count |   size  | win_size | max_size |     rotate_time   | interval
--------+-------+---------+----------+----------+-------------------+---------
    sub | 22515 | 5534080 |  8388608 |  5534080 | 0208 13:23:11.393 |       10
sub_old |     0 |       0 |          |          | 0208 13:23:01.393 |
    pub |  3737 |  344112 |  4194304 |   344112 | 0208 13:23:11.393 |       10
pub_old |     0 |       0 |          |          | 0208 13:23:01.393 |
  inbox |  2724 |  817824 |          |          | 0209 09:52:42.761 |
  route |   137 |   58848 |          |          |                   |
  bloom |  1135 |   18392 |          |          |                   |
     rv |   102 | 1290420 |          |          |                   |

The first two are the subscription and publish windows. These tables are rotated to the old table when they reach win_size and at least interval seconds have elapsed since the last rotation. The max_size is the largest size this window has reached.
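
The rotation rule can be sketched as follows; this is a toy model of the behavior described above, not the actual implementation:

```python
import time

class Window:
    """Toy window table: the active table rotates to the old slot when
    it reaches win_size and at least `interval` seconds have passed
    since the last rotation."""
    def __init__(self, win_size, interval):
        self.win_size = win_size
        self.interval = interval
        self.size = 0        # active table size
        self.old_size = 0    # rotated (old) table size
        self.max_size = 0    # largest active size seen
        self.rotate_time = 0.0

    def add(self, entry_size, now=None):
        now = time.time() if now is None else now
        self.size += entry_size
        self.max_size = max(self.max_size, self.size)
        if (self.size >= self.win_size and
                now - self.rotate_time >= self.interval):
            self.old_size = self.size   # rotate active -> old
            self.size = 0
            self.rotate_time = now

w = Window(win_size=8388608, interval=10)
w.add(5534080, now=0)    # below win_size: no rotation
w.add(3000000, now=20)   # crosses win_size after interval: rotates
```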

The inbox entry is a route cache for subjects that have an _INBOX prefix. The route entry is a cache for routes, indexed by subject hash. The bloom entry is the sum of the sizes of the bloom filters for every peer in the network. The rv entry is the subscription table for attached RV clients.

  • show blooms [P]

Show where the bloom filters are used for a path. The forwarding table has only one transport entry for each peer and path combination. If a message is forwarded on more than one transport, it is because multiple peers are subscribed across multiple transports for the path. The receiving side also filters the messages through the bloom filters by calculating the ports that are needed for the path to completely cover the network. There may be redundant transports that are inactive for each path, either because their cost is higher or because the path selection prefers one transport over the other.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[20]> show blooms
fd |   dest   |     tport   |                                     bloom                                   |       prefix       | detail | subs | total
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
 9 |  console |       ipc.0 |                                                                   (console) |                  0 |      0 |    0 |     0
11 |    route |       ipc.0 |                                                                 (all-peers) |                  0 |      0 |    0 |     0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
 7 |      ipc | lex_amesh.1 |                                                                       (ipc) | 0x000061DF00C38000 |      0 |   24 |   113
10 |  session | lex_amesh.1 |                                                            (console), (sys) |         0x04000108 |      0 |    7 |    15
17 |    route | lex_amesh.1 |                                                                 (all-peers) |                  0 |      0 |    0 |     0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
 7 |      ipc | lex_amesh.2 |                                                                       (ipc) | 0x000061DF00C38000 |      0 |   24 |   113
10 |  session | lex_amesh.2 |                                                            (console), (sys) |         0x04000108 |      0 |    7 |    15
19 | lex_a2.1 | lex_amesh.2 |                                                              (peer), lex_a2 | 0x0000008004000108 |      0 |   84 |    91
20 |    route | lex_amesh.2 |                                                                 (all-peers) |                  0 |      0 |    0 |     0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
 7 |      ipc | lex_amesh.3 |                                                                       (ipc) | 0x000061DF00C38000 |      0 |   24 |   113
10 |  session | lex_amesh.3 |                                                            (console), (sys) |         0x04000108 |      0 |    7 |    15
21 | lex_a3.2 | lex_amesh.3 |                                                              (peer), lex_a3 | 0x0000008004000108 |      0 |   98 |   105
22 |    route | lex_amesh.3 |                                                                 (all-peers) |                  0 |      0 |    0 |     0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
 7 |      ipc | lex_amesh.4 |                                                                       (ipc) | 0x000061DF00C38000 |      0 |   24 |   113
10 |  session | lex_amesh.4 |                                                            (console), (sys) |         0x04000108 |      0 |    7 |    15
23 | lex_a4.3 | lex_amesh.4 |                                                              (peer), lex_a4 | 0x0000008004000108 |      0 |   89 |    96
24 |    route | lex_amesh.4 |                                                                 (all-peers) |                  0 |      0 |    0 |     0
---+----------+-------------+-----------------------------------------------------------------------------+--------------------+--------+------+------
 7 |      ipc | lex_amesh.5 |                                                                       (ipc) | 0x000061DF00C38000 |      0 |   24 |   113
10 |  session | lex_amesh.5 |                                                            (console), (sys) |         0x04000108 |      0 |    7 |    15
25 | lex_a.33 | lex_amesh.5 | (peer), lex_a, pic_a, edo_a, lee_a4, lee_a3, lee_a1, lee_a2, edo_a4, edo_a2 | 0x000061DF04C38108 |      0 |  482 |   636
   |          |             |      edo_a1, edo_a3, dex_a1, dex_a2, dex_a3, dex_a4, pic_a4, pic_a1, pic_a2 |                    |        |      |
   |          |             |                                                        pic_a3, lee_a, dex_a |                    |        |      |
26 |    route | lex_amesh.5 |                                                                 (all-peers) |                  0 |      0 |    0 |     0

Every peer has a bloom filter associated with it. The console, ipc, and sys filters are the local bloom filters, which are combined into one filter in other peers. They are split in the local peer so that the traffic can be routed to the separate processing functions. The sys filter only matches the subjects used by the system, namely the _I.<nonce>.> subject and the _M.> subject. The console filter holds the subjects subscribed by the console. The ipc filter holds the subjects subscribed by clients. The all-peers filter is the combination of all the peers' subscriptions; it is used for receiving messages. The individual peer bloom filters are used for forwarding messages.
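
The essential property of these filters can be shown with a toy bloom filter; the parameters and hash derivation here are hypothetical, not the actual implementation:

```python
class Bloom:
    """Toy bloom filter: membership tests can return false positives
    (forwarding a message nobody wants) but never false negatives
    (a real subscription always matches)."""
    def __init__(self, nbits=1024, seed=0):
        self.bits = 0
        self.nbits = nbits
        self.seed = seed

    def _positions(self, subject):
        # Derive a few bit positions from one seeded hash.
        h = hash((self.seed, subject))
        return [(h >> (i * 10)) % self.nbits for i in range(3)]

    def add(self, subject):
        for p in self._positions(subject):
            self.bits |= 1 << p

    def maybe(self, subject):
        return all(self.bits & (1 << p) for p in self._positions(subject))

peer = Bloom()
peer.add("RSF.REC.EK.N")
print(peer.maybe("RSF.REC.EK.N"))  # True: a real subscription always matches
```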

The fields are:

Field Description

fd

File descriptor for the connection

dest

Where the message would go

tport

The transport that is used

bloom

The bloom filters

prefix

A bit mask of the prefix match length

detail

A bit mask of the prefix when a suffix is matched or sharded

subs

The subscription count, not including the patterns

total

The subscription count including the patterns

  • show match S

Show which peer bloom filters match a subject. If a message was published with subject S, this shows which peer’s bloom filter would match it. This doesn’t match against the local filters.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[33]> show match _7500.RSF.REC.AVP.N
   user
---------
lee_a2.16
  • show graph

Show the graph description of the network. This creates a description of the network by matching the names of the transports with the names that the peers use. It doesn’t use any network probing; it uses the link state database to calculate the network connectivity. The link state database doesn’t have connection IP addresses associated with it, but it does have a link name and link type. The name/types are enough to describe the network, but they don’t show how the links are connected to hosts with IP addresses.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[34]> show graph
start lex_a1
node edo_a1 edo_a2 edo_a3 lex_a1 lex_a2 edo_a4 edo_a lex_a3 lex_a4 lee_a1 lee_a2 lee_a3 dex_a1 lee_a4 lee_a dex_a2 dex_a3 dex_a4 dex_a pic_a1 pic_a2 pic_a3 pic_a4 pic_a lex_a
mesh_lex_amesh lex_a1 lex_a2 lex_a3 lex_a4 lex_a
mesh_edo_amesh edo_a edo_a4 edo_a3 edo_a1 edo_a2
mesh_fo_mesh edo_a lee_a dex_a pic_a lex_a
mesh_lee_amesh lee_a1 lee_a2 lee_a4 lee_a lee_a3
mesh_dex_amesh dex_a1 dex_a2 dex_a3 dex_a4 dex_a
mesh_pic_amesh pic_a3 pic_a4 pic_a1 pic_a2 pic_a

The start is the peer attached to the console. The node is the list of peers in the network, ordered by age. The following lines have a prefix which is the type of transport used: mesh, tcp, or pgm. The suffix of the type is the name of the transport. Following the "type_name" are the peers connected using this transport. If the cost is not the default of 1000, then there will be a : followed by the cost of the transport.
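
The output is simple to parse. A sketch that turns a condensed version of the output above into an adjacency map, assuming each mesh_* line means the listed peers are connected all-to-all:

```python
# Condensed "show graph" output; a mesh line links every listed peer
# to every other peer on that line (all-to-all).
graph_text = """\
start lex_a1
node lex_a1 lex_a2 lex_a3
mesh_lex_amesh lex_a1 lex_a2 lex_a3"""

adjacency = {}
for line in graph_text.splitlines():
    kind, *names = line.split()
    if kind.startswith("mesh_"):
        for a in names:
            adjacency.setdefault(a, set()).update(n for n in names if n != a)

print(adjacency["lex_a1"])  # lex_a2 and lex_a3, in some order
```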

  • show cache

Show the route cache hit and miss statistics. To reduce the number of bloom filters and hash tables that a message must flow through to match the subject, the route for the subject is cached. This cache needs to be updated when a subscription operation occurs, so the entries affected by these operations are purged, reducing the cache effectiveness. A newly published subject also causes a miss. The cache size has a maximum of 256K entries; when this limit is reached, the cache is purged and recreated.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[35]> show cache
         tport        | hit_pct |      hit    |    miss    | max_cnt | max_size
----------------------+---------+-------------+------------+---------+---------
      rvd.ipc.tport.0 |   86.70 | 14600408979 | 2239005394 |   24576 |      130
rvd.lex_amesh.tport.1 |    0.00 |           0 |          0 |       0 |        0
rvd.lex_amesh.tport.2 |   84.16 |  1513720684 |  284704081 |    1536 |      447
rvd.lex_amesh.tport.3 |   84.17 |  1513725449 |  284673772 |    1536 |      453
rvd.lex_amesh.tport.4 |   84.16 |  1513723831 |  284727847 |    1536 |      444
rvd.lex_amesh.tport.5 |   88.06 | 16786195897 | 2275244513 |   24576 |      209

Each tport has a route cache. The hit_pct is a percentage, hit * 100 / total. The hit count is how many times an entry was present in the cache; a miss means it was not present. The max_cnt is the maximum number of cache entries that has occurred since the transport was created. The max_size is the maximum data size of the entries, which are fds. Some entries have zero size, when there is no route for the subject.
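
The hit_pct arithmetic can be checked against the first row of the example output:

```python
# hit_pct = hit * 100 / (hit + miss), using the rvd.ipc.tport.0 row.
hit, miss = 14600408979, 2239005394
hit_pct = hit * 100 / (hit + miss)
print(f"{hit_pct:.2f}")  # 86.70
```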

  • show poll

Show the latency of poll states, the average time used for processing timers, read, write, and routing events.

Example:

lex_a1.rvd[L0MOCmhQpwX2JqsBjYBypA]@lex_a1[36]> show poll
timer_lat | timer_cnt | read_lat |  read_cnt  |    rd_lo   | route_lat |  route_cnt | write_lat |  write_cnt | wr_poll | wr_hi
----------+-----------+----------+------------+------------+-----------+------------+-----------+------------+---------+------
   2.52us |   5548967 |   4.77us | 4936398767 | 2053110538 |    11.4us | 1434068849 |    15.1us | 2053110184 |       0 |    66

In a busy router, the read, route, and write operations will process multiple messages at a time, depending on how many fit inside a read buffer. A read buffer is 16KB and is resized only when a large message requires more memory. The sum of these is close to the average latency the router spends per event, even though the time spent per message may be a fraction of that, since messages are processed in batches.

The read_cnt is the sum of the counts in the rd_lo and read states, and the write_cnt is the sum of the counts in the write, wr_hi, and wr_poll states. The difference between rd_lo and read is that the rd_lo state occurs after the read buffer is full or the fd has no more data to read. The wr_hi is the number of times that the write buffer is full. The wr_poll is the number of times that the fd is part of the poll set because there is back pressure on the connection.

  • show hosts

Show the RV host services.

Example:

chex.rvd[VCr9OQDldBjnGLnOXVF7gA]@chex[229]> show hosts
 svc |         session        |  user  | port |        start      | cl |    bs   |  br  |  ms  | mr | idl | odl
-----+------------------------+--------+------+-------------------+----+---------+------+------+----+-----+----
7500 | 542AFD39.5F75F9F9BFDED |   chex | 7500 | 0320 19:15:55.308 |  1 | 2670095 | 1593 | 2438 | 17 |   0 |   0
7500 | 542AFD39.5F763014D6394 | nobody | 7500 | 0320 19:15:55.308 |  1 | 2670095 | 1593 | 2438 | 17 |   0 |   0
7501 | 542AFD39.5F76301B06616 | nobody | 7500 | 0320 23:03:15.414 |  1 |       0 | 1572 |    0 | 16 |   0 |   0

The svc is the service number, session is the session identifier, user is the user name associated with the session, port is the daemon port number, and start is when the host started. The cl is the active number of clients. If the number of clients is zero, then the host service is not active and it doesn’t publish any _RV system subjects. The bs, br, ms, mr, idl, odl are the same stats published with the _RV.INFO.SYSTEM.HOST.STATUS.5230FA7C message.

Field Description

svc

Service number

session

Session identifier

session ip

Session identifier in IPv4 address format

port

Daemon port number

start

Start time of the host

cl

Number of clients connected to service

bs

Bytes sent

br

Bytes received

ms

Messages sent

mr

Messages received

idl

Inbound data loss, messages lost by subscriptions

odl

Outbound data loss, messages lost by publishers

The session ip will be a random address unless configured with the no_fakeip setting, described in Tib RV.

  • show rvsub

Show the RV subscriptions, which is any subscription that uses a service number. A service name used by another protocol that is not a valid RV service will not have RV subscriptions.

Example:

chex.rvd[VCr9OQDldBjnGLnOXVF7gA]@chex[228]> show rvsub
 svc |         session        |  user  | p |              subject
-----+------------------------+--------+---+--------------------------------
7500 | 542AFD39.5F75F9F9BFDED |   chex |   |
7500 | 542AFD39.5F762C8EF4C3E | nobody |   | RSF5.REC.EK.N
     |                        |        |   | RSF5.REC.ITT.NaE
     |                        |        |   | RSF5.REC.PPW.NaE
     |                        |        | p | _INBOX.542AFD39.5F762C8EF4C3E.>
7501 | 542AFD39.5F762CC9F8385 | nobody |   | RSF.REC.TMX.N
     |                        |        |   | RSF.REC.GLK.NaE
     |                        |        | p | _INBOX.542AFD39.5F762CC9F8385.>

The svc field is the service number. The session is an identifier for the connection, which in this case uses the host prefix and a nanosecond resolution timestamp as the unique identifier. There are other methods used, but they usually have a host prefix, a timestamp, and/or a process id. The user is derived from the protocol’s method of attaching a user name to the session; it is often a login name when using RV. The p is set when the subscription is a pattern. The subject is the subscription string.

  • show rpcs

Show the console rpcs that are currently running. These are created with commands entered into the console or the web interface. These are: "ping", "remote", "show subs", "sub <subject>", "psub <subject>", and "snap <subject>".

Example:

chex.rvd[VCr9OQDldBjnGLnOXVF7gA]@chex[234]> show rpcs
type |          arg         | recv | count
-----+----------------------+------+------
snap |  _7500.RSF.REC.IBM.N |    0 |     1
 sub | _7500.RSF.REC.TEST.X |    1 |

The type is the command, and the arg is a subject or a peer name. The recv is the number of messages received; count is the number expected if it is not a subscription type. The cancel command will stop the non-subscription type commands; the unsub or punsub commands will stop the subscription type commands.

Test Pub Sub

These commands do pub/sub through the console. The messages have a format attached to them, which is an integer value mapped to decoding methods. If the format matches a decoder, then the message is decoded to field/value pairs and printed. If no method matches, then the value is an opaque string of bytes and is displayed as such.

  • sub S [F]

Subscribe to subject S. If a file is present, then the publishes are sent to the file instead of printed to the console.

  • unsub S [F]

Unsubscribe from subject S. If a file is present, then stop the publishes sent to the file. If only unsub is used, then all subjects are unsubscribed.

  • psub W [F]

Subscribe to RV style wildcard W. If a file is present, then the publishes are sent to the file instead of printed to the console.

  • punsub W [F]

Unsubscribe from RV style wildcard W. If a file is present, then stop the publishes sent to the file. If only punsub is used, then all patterns are unsubscribed.

  • gsub W [F]

Subscribe to glob style wildcard W. If a file is present, then the publishes are sent to the file instead of printed to the console.

  • gunsub W [F]

Unsubscribe from glob style wildcard W. If a file is present, then stop the publishes sent to the file.

  • snap S [F]

Publish an empty message to subject S with an _INBOX reply, then wait for the _INBOX subject and print the message received. The _INBOX subject used is assigned and subscribed by the console automatically.

  • pub S M

Send a message M to subscription S.

  • trace S M

Send a message M to subscription S with the trace flag set, which causes any of the intermediate hops as well as the final destination to send an ack reply.

  • ack S M

Send a message M to subscription S with the ack flag set, which causes the destinations to send an ack reply.

  • rpc S M

Send a message M to subscription S with a return inbox.

  • any S M

Randomly choose a subscription match for S and forward message M to that endpoint. This would include both wildcard subscriptions and normal ones.

  • cancel

A cancel command stops any console subscription or RPC, such as ping. This marks the endpoint as canceled, so if results are returned after a cancel, they will be discarded.

  • reseed

This alters the local bloom filter to use a different seed. Changing the bloom filter seed will alter the bits in the hash such that collisions occur at different positions. If a low rate subscription has a collision with a high rate subscription, this would cause unnecessary traffic that can be avoided by altering the bloom filter seed. This doesn’t help when the 32 bit hashes themselves collide, but that is much less likely than a bloom filter collision.
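
The effect can be illustrated with a simple seeded hash; crc32 stands in for the actual hash function and the bit derivation is hypothetical:

```python
import zlib

def bloom_bits(subject, seed, nbits=1024):
    # Mix the seed into the hash; a different seed relocates the bit
    # positions, so two subjects that collide under one seed are
    # unlikely to still collide under another.
    h = zlib.crc32(subject.encode(), seed)
    return {(h >> s) % nbits for s in (0, 8, 16)}

print(bloom_bits("FAST.SUBJECT", seed=1))
print(bloom_bits("FAST.SUBJECT", seed=2))  # usually a different set of bits
```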

Mute the Logging

  • mute

The log messages are normally printed to the console; this mutes them. The log is still present: the log command will show the messages, and the log file, if active, will still be appended. If messages are being printed to the console too fast for the terminal to display them, mute will turn on automatically.

  • unmute

This removes the mute for printing log messages to the console.

Turn On/Off Debug Logging

  • debug I

The integer value is either a mask or a list of strings that turn the debug logging on or off. When debug 0 is used, this turns off the debug messages.

Name Value Description

tcp

0x1

Print the subjects as they are sent or received on a TCP connection

pgm

0x2

Print the subjects as they are sent or received on a PGM connection

ibx

0x4

The inbox UDP protocol debugging

transport

0x8

Show the message route forwarding

user

0x10

User updates debugging, when changes are made to a user state

link_state

0x20

Link state message updates are printed

peer

0x40

Peer synchronization messages are printed

auth

0x80

Authentication messages are printed

session

0x100

System message dispatching, IPC message forwarding

hb

0x200

Heartbeat and ping messages

sub

0x400

Subscription starts and stops

msg_recv

0x800

Print system messages when they are received

msg_hex

0x1000

Dump the system messages in hex when they are received

telnet

0x2000

Show the telnet protocol states

name

0x4000

Display name transport update messages

repeat

0x8000

Print when the repeated subjects are received

not_sub

0x10000

Print when not subscribed subjects are received

loss

0x20000

Print debugging when message loss occurs

adj

0x40000

Print debugging when the link state Dijkstra algo runs

conn

0x80000

Show debugging about connections, when they are established or dropped or timers expire

stats

0x100000

Print when forwarding stats, when there are subscriptions to _N.> subjects

dist

This causes the Dijkstra algo to run once

kvpub

Turns on debugging when any message is processed

kvps

Turns on debugging when kv pubsub messages are processed

rv

Turns on debugging when rv message is processed

The last 4 don’t have an integer mask because they use different debug variables than the others.
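
The mask values above combine with bitwise OR; a quick sketch, assuming (per the description above) the console accepts either the integer mask or the list of names:

```python
# Combining debug mask values: enable both tcp (0x1) and
# link_state (0x20) logging with one integer.
DEBUG = {"tcp": 0x1, "pgm": 0x2, "ibx": 0x4, "transport": 0x8,
         "user": 0x10, "link_state": 0x20, "peer": 0x40, "auth": 0x80}

mask = DEBUG["tcp"] | DEBUG["link_state"]
print(hex(mask))  # 0x21, equivalent to naming both flags
```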

Write Events to File

  • wevents F

Dumps the current events to a log file for examining later. Useful when a networking problem occurs and is hard to reproduce.

Stop the Server

  • die [I]

Exit the process without shutting down existing connections or sending bye messages to the network.

  • quit/exit

Normal shutdown. Existing connections will stop reading new messages, send bye messages to connected peers, and flush the data in the write queues.

Monitoring

Monitoring.