04 Jan

Observium pollers vertical scaling

There are several free monitoring tools available for commercial or non-commercial use. I decided to use Observium to monitor my home and remote labs. I had no specific requirements except that it had to be able to run in Docker containers on Raspberry Pi (ARMv6/8 CPUs). The downside of every monitoring solution is its resource requirements. Three resources matter – storage, memory, and CPU. Vertical scaling of Observium pollers and other processes is one solution when your hardware resources are limited.

Vertical scaling means adding more workers doing the same process in parallel, where the tasks assigned to each process do not overlap. In my case, I wanted to spread the polling across my Raspberry Pi cluster. The polling process can consume the CPU and make a Raspberry Pi unresponsive. In the worst case, if you have many devices or you are polling lots of data from your devices, the polling processes may not finish their work within 5 minutes (this is how often devices have to be queried), so you will miss some data. You can of course tune the number of threads the polling process starts, but every platform has resource limitations you cannot bypass.

In this article I will present my solution based on a Docker Swarm cluster. It has some limitations and downsides but may work for many people, and not only on Raspberry Pi.

Observium poller vertical scaling idea

The idea of poller vertical scaling has its own section in the Observium Community Edition documentation. Unfortunately, one crucial paragraph is… empty! I will cover it here. Make sure you are familiar with the official documentation before reading further, because I will only briefly touch on some configuration aspects described there.

You will need the vertical scaling in two cases:

  • When the host running Observium is running out of resources
  • When network segmentation requires multiple discovery and polling agents because you cannot access all devices from a single place in your network

In both cases, you can separate the polling and discovery from the web front-end. However, all of them will still need access to the database and RRD repository. 

In my lab, the problem was that polling took over 4 minutes at some point (old devices are slow to respond) and consumed a significant amount of CPU, making some processes on the Raspberry Pi 3 B unresponsive. Therefore I wanted to distribute polling across the whole Swarm stack, leaving the front-end and discovery in a single independent container.

Lab diagram – Observium

There are two things you need to remember when you try to build a scalable Observium infrastructure and distribute the services across multiple hosts:

  • The recurring tasks are the cron jobs. You need to run them exactly as defined in the configuration templates (see the example below). Running them more often, or not often enough, will corrupt the gathered data.
  • The recurring tasks running on separate hosts cannot overlap each other. Running discovery or housekeeping on every poller node wastes resources, and if pollers on different hosts query the same device, you create a situation similar to running the poller too often on a single host, which may corrupt the data.
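As a reference point, the schedule below reflects roughly what the standard Observium crontab template defines; the paths and minute offsets are assumptions you should verify against your own installation's template. In my setup these jobs run only in the front-end container, never on the poller nodes.

# /etc/cron.d/observium in the front-end container (sketch)
33 */6 * * * root /opt/observium/discovery.php -h all >> /dev/null 2>&1
*/5 * * * * root /opt/observium/discovery.php -h new >> /dev/null 2>&1
13 5 * * * root /opt/observium/housekeeping.php -ysel >> /dev/null 2>&1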

Front-end web server, database and RRDs

The database is hosted on my NAS and it has to be accessible by all Observium instances, regardless of their role. The Observium configuration file contains all the required details: hostname, port, credentials, and database name.
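For reference, this is roughly how the database section of config.php looks; the option names follow the standard template, and the hostname and credentials below are placeholders, not my real values.

// config.php – database connection (placeholder values)
$config['db_host'] = 'nas.example.lan';
$config['db_user'] = 'observium';
$config['db_pass'] = 'changeme';
$config['db_name'] = 'observium';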

The rpi-1 host runs a single container with Observium CE and the RRDCached service. RRDCached is a daemon that receives updates to RRD files and writes them into the files. It allows multiple hosts to send updates at the same time, taking care of the writing-to-disk process. You can find the rrdcached package in most Linux distributions. If you want to use it with Observium, you need to update its configuration file and uncomment the NETWORK_OPTIONS="-L" line. This tells the daemon to listen on TCP port 42217 for incoming updates. You cannot use RRDCached and update the RRD files locally at the same time, so once you configure the daemon you must update the Observium configuration file. Inside the container you can't use a socket to write to the rrdcached daemon; you must use the network connection.
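The two relevant pieces of configuration look roughly like this; the daemon options file path is typical for Debian/Ubuntu-based images and may differ on your system, and the hostname simply points at the host running RRDCached.

# rrdcached daemon options (often /etc/default/rrdcached) – listen on TCP instead of the local socket
NETWORK_OPTIONS="-L"

// config.php on every Observium instance – send RRD updates over the network
$config['rrdcached'] = "rpi-1:42217";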

In my lab, RRDCached and the web front-end (including all the recurring tasks except the poller) share the same container. In large installations it is better to separate them – that lets you scale the web front-end service as well, using Docker Swarm replicas. However, the container with the web front-end then cannot perform housekeeping and discovery anymore.

Poller scaling (static)

This section covers the part that is missing from the Observium documentation.

As I described before, the idea of vertical scaling is to distribute the polling process across multiple devices. Each poller instance should query a predefined group of devices in such a way that every device recorded in the Observium database is queried by exactly one poller every 5 minutes. To achieve this you need to provide two additional parameters to the poller-wrapper.py script – the total number of pollers running in your system (-i) and an index number unique to each poller (-n). Using those two values the poller script will determine which devices from the database it should query.
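For example, with four pollers in total, the cron entry on the poller with index 0 would look roughly like this; the thread count (8) and the installation path are assumptions you should adapt to your environment.

# /etc/cron.d/observium-poller on poller-node-0 (sketch): 4 pollers in total, this one is index 0
*/5 * * * * root /opt/observium/poller-wrapper.py 8 -i 4 -n 0 >> /dev/null 2>&1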

As easy as this approach is, it creates serious problems with scaling. The poller-wrapper.py script is defined in the cron configuration, and each poller container must have a cron configuration with the correct parameters. Every time you add or remove a poller container, you must update the cron configuration of all the other poller containers. This can cause problems whether you use standalone containers or Swarm.

I decided to go with static configuration files. I use a Docker Swarm stack to achieve load distribution and redundancy, so I always have 4 replicas of poller containers. Technically it is exactly one replica each of four different services, because each must be provided with a dedicated cron configuration.

ID                  NAME                             MODE                REPLICAS            IMAGE                                                                        PORTS
smh1yzx3nwc3        observium-poller_poller-node-0   replicated          1/1                 observium-ce-poller:ubuntu19.04   
dvzli7cr5765        observium-poller_poller-node-1   replicated          1/1                 observium-ce-poller:ubuntu19.04   
4cplgqyrqxrz        observium-poller_poller-node-2   replicated          1/1                 observium-ce-poller:ubuntu19.04   
59bytcl7pb0o        observium-poller_poller-node-3   replicated          1/1                 observium-ce-poller:ubuntu19.04   

There are several ways you can provide the configuration file to each container. I am using the configs section introduced in version 3.3 of the docker-compose configuration file format. Because each Swarm worker node must have access to the configuration files, I store them on a remote NFS share mounted on each node. If I change the number of workers I need to update all the configuration files, but at least I have them in one place.

configs:
  observium-poller-node-0.cron:
    file: /home/docker-nfs/observium-poller-swarm/observium-poller-node-0.cron

services:
  poller-node-0:
    configs:
      - source: observium-poller-node-0.cron
        target: /etc/cron.d/observium-poller

I provide the Observium configuration file (config.php) the same way, because it is common to all containers. The downside is that every time you edit the configuration files you must update the Swarm stack, which means recreating all the containers. If this is an issue in your network, attach the configuration files to the containers in the volumes section instead. Any change will then be visible in the container right away. Cron re-reads the configuration files every minute, so you don't need to restart its process or the container to apply changes.
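A minimal sketch of the volumes alternative, reusing the NFS path from the configs example above; bind-mounting the files read-only keeps them editable in one place without redeploying the stack.

services:
  poller-node-0:
    volumes:
      - /home/docker-nfs/observium-poller-swarm/observium-poller-node-0.cron:/etc/cron.d/observium-poller:ro
      - /home/docker-nfs/observium-poller-swarm/config.php:/opt/observium/config.php:ro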

Dynamic poller scaling (ideas)

The presented approach is not what you may expect when thinking about vertical scaling with containers and a Docker Swarm stack. It is more load distribution and redundancy than scaling, but it fits my requirements at the moment. The main problem in adapting it to dynamic scaling is the cron configuration, which must contain a correct, unique pair of parameters for each replica. As far as I could find in the Docker Swarm documentation, there is no API method, command, or environment variable that a container can use to learn the number of configured and active replicas. Parsing console output should always be a last-resort solution, and ideally you would read those values from inside the container. But even if you get those two values, you still need to dynamically update the cron configuration in each replica.

At this moment I don't know of any flexible and reasonable solution, using tools available in Docker or external free software, that would solve these problems. Two ideas I have considered:

  • A script reading the number of replicas from the CLI output on the manager, calculating the instance number for each running replica, and passing those values as environment variables to each container. This can be tricky as well. You can provide the total number of replicas to all containers using the docker service update command, but to set a unique instance number for each instance you may need to use docker exec directly. Then you need another script running inside the container that periodically recreates the cron configuration based on the current environment variable values (see the sketch after this list).
  • You may use a database to maintain the required information. The minimum you need to store is the container name and the assigned instance number, which is later used to generate the cron configuration. You need to write a script that registers the container, reads its instance number, and regenerates the cron configuration; this script runs inside each container. Another script running on the stack manager must still update all of the service containers with the total number of running replicas.
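To make the first idea more concrete, here is a minimal sketch of the helper that would run inside each poller container; the POLLER_TOTAL and POLLER_INDEX variable names, the thread count, and the installation path are all my assumptions, not anything Docker or Observium provides out of the box.

#!/bin/bash
# Rebuild the poller cron entry from environment variables pushed by the Swarm manager.
# Run periodically inside each poller container (for example from the entrypoint).
: "${POLLER_TOTAL:?POLLER_TOTAL not set}"
: "${POLLER_INDEX:?POLLER_INDEX not set}"

cat > /etc/cron.d/observium-poller <<EOF
*/5 * * * * root /opt/observium/poller-wrapper.py 8 -i ${POLLER_TOTAL} -n ${POLLER_INDEX} >> /dev/null 2>&1
EOF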

If you know of an easier way to solve this issue, let me know; I will test it and describe it here.

2 thoughts on “Observium pollers vertical scaling”

  1. Useful article, thank you! I can't tell what year it was written in, but the Observium documentation has changed since it was written. You provide a more useful write-up than Observium. But you are describing *Horizontal* scaling here, not Vertical. See https://en.wikipedia.org/wiki/Scalability#Horizontal_(scale_out)_and_vertical_scaling_(scale_up) to at least prove I'm not trying to troll you ;-). Vertical scaling is adding more CPUs, more RAM, more NICs, etc., etc. to a node (or perhaps to multiple nodes simultaneously if it's a multi-node system already). Horizontal scaling is adding more nodes. You can of course do both simultaneously!
