"CheckMK’s documentations that say that something hasn’t been yet documented are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know. And if one looks throughout the history of CheckMK’s documentations and other documentations, it is the latter category that tends to be the difficult ones." :stuck_out_tongue:

This HowTo describes how to create a very simple CheckMK-Agent plugin whose check supplies only a single JSON line as output. The check does not use any metrics, and there is no config for the plugin on the host itself. I’ll try to cover these topics in a separate HowTo later, because CheckMK’s documentation is, again, not very helpful: its structure is quite confusing, and it is incomplete or outdated. You have to look at other plugins and dig through their code to find useful info 😐

Also, the new "API" doesn’t really make it easier to code your plugins IMHO; it is just complicated in a different way. But enough ranting, let’s begin.

Agent Structure

As far as I understand, a minimal agent-plugin like this one should consist of the following files:

└── local
    ├── bin
    ├── lib
    │   ├── check_mk -> python3/cmk
    │   └── python3
    │       └── cmk
    │           └── base
    │               └── plugins
    │                   └── agent_based
    │                       └── oom_kills.py           # Check-File: file where the output of the agent is analyzed
    └── share
        └── check_mk
            ├── agents
            │   ├── bakery
            │   │   └── oom_kills.py                   # Bakery-File: file which defines what the plugin should consist of
            │   └── plugins
            │       └── oom_kills.py                   # Plugin-File: file which will be installed on the host
            ├── checkman
            │   └── oom_kills                          # Checkman-File: file with info about the check
            └── web
                └── plugins
                    └── wato
                        ├── oom_kills_bakery.py        # Bakery-UI-File: file which creates the WATO-UI page for configuring the agent-plugin
                        └── oom_kills_parameters.py    # Check-Parameters-UI-File: file which create the WATO-UI page for configuring the parameters of the check

So six files are needed. I’ll begin with the Plugin-File, which will be installed on the host.

Plugin-File

This "Plugin-File" is executed by the check_mk_agent on the host and gathers all the info needed. In this case, the output will look like:

<<<oom_kills>>>
{"timestamp": "2021-09-10 15:52:02.433240", "last_run": "2021-09-13 08:14:04.432899"}

The script is executed with each check_mk_agent call and looks for OOM-kills since its last run.

It scrapes the dmesg output and looks for the pattern "(O|o)ut of memory".
It also saves the output to a tmp-file, so it knows when the last OOM occurred.
It creates the needed output and returns it.
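
The steps above can be sketched in a few lines. This is only a minimal illustration, not the plugin itself: the sample dmesg lines and the helper name count_oom_kills are made up for the example, and the pattern is a simplified version of the one the plugin uses.

```python
import json
import re
from collections import Counter

# Simplified pattern: kernel timestamp, "Out of memory", process name in parentheses.
OOM_LINE = re.compile(r"^\[\s*(?P<time>\d+\.\d+)\].*[Oo]ut of memory.*\((?P<process>[^)]*)\)")

# Hypothetical dmesg output, only for illustration.
SAMPLE_DMESG = """\
[    1.000000] Linux version 5.10.0
[ 4711.123456] Out of memory: Killed process 4242 (oom_test1.sh) total-vm:1234kB
[ 4720.654321] Out of memory: Killed process 4343 (oom_test1.sh) total-vm:1234kB
[ 4725.000001] Out of memory: Killed process 4444 (oom_test2.sh) total-vm:5678kB
"""


def count_oom_kills(dmesg_text):
    """Return a dict mapping process name to the number of OOM kills found."""
    processes = []
    for line in dmesg_text.splitlines():
        match = OOM_LINE.match(line)
        if match:
            processes.append(match.group("process"))
    return dict(Counter(processes))


if __name__ == "__main__":
    # Same shape as the agent section: header line plus one JSON line.
    print("<<<oom_kills>>>")
    print(json.dumps(count_oom_kills(SAMPLE_DMESG)))
```

The real plugin additionally adds the "timestamp" and "last_run" keys and only counts kills that happened since its last run.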

The plugins are normally placed in /usr/lib/check_mk_agent/plugins

As dmesg -T may not be available on some Linux distributions, I put this script together from several sources. It should be compatible with Python 2 and 3. The script also cannot/should not run multiple times at once, in case it takes longer to scrape through dmesg’s output.
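
The "only one instance at a time" part can be tried out in isolation. The following is a minimal sketch of the same trick the script uses (binding a socket in Linux’s abstract namespace); the function name try_lock and the lock name are made up for the demo, and it only works on Linux.

```python
import os
import socket


def try_lock(name):
    """Try to bind an abstract Unix socket; return the socket on success, None if taken."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # The leading NUL byte puts the name into Linux's abstract socket namespace,
        # so no file is created and the lock vanishes when the socket is closed.
        sock.bind("\0" + name)
    except socket.error:
        sock.close()
        return None
    return sock


if __name__ == "__main__":
    lock_name = "oom_kills_demo_%d" % os.getpid()  # hypothetical name for the demo
    first = try_lock(lock_name)
    second = try_lock(lock_name)
    print("first lock acquired: %s" % (first is not None))
    print("second lock acquired: %s" % (second is not None))
```

The caller has to keep a reference to the returned socket for as long as the lock should be held. The complete plugin script follows.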

#!/usr/bin/env python
# -*- encoding: utf-8; py-indent-offset: 4 -*-

# some parts used from https://gist.github.com/saghul/542780
# Copyright (C) 2010 Saúl ibarra Corretgé <saghul@gmail.com>

# 2021-09-10 - c.steinkogler[at]cashpoint.com
#
# Changelog:
# 2021-09-10 - updated for use with CheckMK v2.0, removed some and/or reformatted some lines
#
# Writes the result to a file and outputs the diff when run again
# The process locks itself - so it cannot run multiple times simultaneously

"""
pydmesg: dmesg with human-readable timestamps
"""

from __future__ import with_statement

import re
import subprocess
import sys
import socket
import json
from datetime import datetime, timedelta
from collections import Counter

_datetime_format = "%Y-%m-%d %H:%M:%S"
_dmesg_line_regex = re.compile(r"^\[((\s*)(?P<time>\d+\.\d+))\](?P<line>.*(O|o)ut of memory.*\((?P<process>.*)\).*)$")
_dmesg_fallback_line_regex = re.compile(r"^\[((\s*)(?P<time>\d+\.\d+))\](?P<line>.*)$")

def read_json_file(json_file_name):
    with open(json_file_name, 'r') as f:  # open in readonly mode
        json_data = f.read()
        json_data = json.loads(json_data)
    # endwith

    return json_data
# enddef

def write_json_file(json_file_name, json_data):
    # current_date_time = datetime.datetime.today()
    # now = current_date_time.strftime('%Y%m%d_%H%M%S')

    with open(json_file_name, 'w') as f:
        pretty_json = json.dumps(json_data, indent=2)
        f.write(pretty_json)
    # endwith

    return json_file_name
# enddef

def get_lock(process_name, section=None):
    # Without holding a reference to our socket somewhere it gets garbage
    # collected when the function exits
    get_lock._lock_socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)

    if section is None:
        section = process_name
    # endif

    try:
        get_lock._lock_socket.bind('\0' + process_name)
        # print("Process locked for %s" % section)
    except socket.error:
        print("%s already or still running. Please, try again later.\n" % section)
        sys.exit(253)
    # endtry
# enddef

def exec_process(cmdline, silent, input=None, **kwargs):
    """Execute a subprocess and returns the returncode, stdout buffer and stderr buffer.
       Optionally prints stdout and stderr while running."""
    try:
        sub = subprocess.Popen(cmdline, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, **kwargs)
        stdout, stderr = sub.communicate(input=input)
        returncode = sub.returncode
        if not silent:
            sys.stdout.write(stdout)
            sys.stderr.write(stderr)
        # endif
    except OSError as e:
        if e.errno == 2:
            raise RuntimeError('"%s" is not present on this system' % cmdline[0])
        else:
            raise
        # endif
    # endtry

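    if returncode != 0:
        # stderr is bytes on Python 3, so decode it before formatting the message
        raise RuntimeError('Got return value %d while executing "%s", stderr output was:\n%s' % (
            returncode, " ".join(cmdline), stderr.decode("utf-8", "replace").rstrip("\n")))
    # endif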

    return stdout
# enddef

def human_dmesg():
    print("<<<oom_kills>>>")
    now = datetime.now()
    first_run = False
    uptime_diff = None
    tmp_file = "/tmp/check_oom_kills.json"
    processes = []
    tmp_output = {}

    try:
        last_output = read_json_file(tmp_file)
        last_event_timestamp = datetime.strptime(last_output['timestamp'], '%Y-%m-%d %H:%M:%S.%f')
        last_run_time = datetime.strptime(last_output['last_run'], '%Y-%m-%d %H:%M:%S.%f')
    except Exception:
        # missing or unreadable state file: treat this as the first run
        first_run = True
    # endtry

    try:
        with open('/proc/uptime') as f:
            uptime_diff = f.read().strip().split()[0]
    except IndexError:
        return
    else:
        try:
            # /proc/uptime reports seconds with two decimals; despite its name,
            # "uptime" holds the boot time (now minus the time the host is up)
            uptime = now - timedelta(seconds=float(uptime_diff))
        except ValueError:
            return
        # endtry
    # endtry

    dmesg_data = exec_process(['dmesg'], True)
    for line in dmesg_data.split(b'\n'):
        if not line:
            continue

        line = line.decode("utf-8")
        match = _dmesg_line_regex.match(line)
        if match:
            try:
                seconds = int(match.groupdict().get('time', '').split('.')[0])
                # dmesg prints six decimals, so the fraction already is in microseconds
                microseconds = int(match.groupdict().get('time', '').split('.')[1])
                line = match.groupdict().get('line', '')
                process = match.groupdict().get('process', '')
                t = uptime + timedelta(seconds=seconds, microseconds=microseconds)
                if first_run:
                    last_run_time = t
                    last_event_timestamp = t
            except IndexError:
                pass
            else:
                if last_run_time <= t and last_event_timestamp <= t:
                    processes.append(process)
                # endif
            # endtry
        # endif
    # endfor

    process_counter = dict(Counter(processes))
    tmp_output.update(process_counter)
    try:
        last_event_timestamp = {"timestamp": str(t)}
    # if we do not have any OOM kills we need a timestamp anyway - we take the first we can find
    # TODO: find prettier solution
    except UnboundLocalError:
        dmesg_data = exec_process(['dmesg'], True)
        for line in dmesg_data.split(b'\n'):
            if not line:
                continue
            # endif

            line = line.decode("utf-8")
            match = _dmesg_fallback_line_regex.match(line)

            if match:
                try:
                    seconds = int(match.groupdict().get('time', '').split('.')[0])
                    # dmesg prints six decimals, so the fraction already is in microseconds
                    microseconds = int(match.groupdict().get('time', '').split('.')[1])
                    line = match.groupdict().get('line', '')
                    t = uptime + timedelta(seconds=seconds, microseconds=microseconds)
                except IndexError:
                    pass
                else:
                    break
                # endtry
            # endif
        # endfor
        last_event_timestamp = {"timestamp": str(t)}
    # endtry

    last_run_time = {"last_run": str(now)}
    tmp_output.update(last_event_timestamp)
    tmp_output.update(last_run_time)
    write_json_file(tmp_file, tmp_output)
    print(json.dumps(tmp_output))
# enddef

if __name__ == '__main__':
    get_lock("oom_kills")
    human_dmesg()
# endif
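
The timestamp handling in the script above boils down to one idea: dmesg prints seconds since boot, so the wall-clock time of a message is the boot time plus that offset. A minimal sketch, assuming the uptime is passed in as a number (the function name and the sample values are made up for illustration):

```python
from datetime import datetime, timedelta


def dmesg_to_wallclock(stamp, now, uptime_seconds):
    """Convert a dmesg timestamp (seconds since boot) to a datetime."""
    boot_time = now - timedelta(seconds=uptime_seconds)
    return boot_time + timedelta(seconds=float(stamp))


if __name__ == "__main__":
    now = datetime(2021, 9, 10, 12, 0, 0)
    # The host has been up for one hour; the message was logged 1830.5 seconds after boot.
    print(dmesg_to_wallclock("1830.500000", now, 3600.0))
```

Because "now" and the uptime are read at slightly different moments, the result can jitter a little between runs, so it should not be treated as exact.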

Check-File

This file is where the output of the plugin is analyzed and where it’s decided if some threshold was hit. In this case the script consists of:

a parse-function, which turns the plugin output into a nice JSON structure
a discover-function, which inventories the check’s services
a check-function, which does the logic and decides whether the service should show as OK, WARN, CRIT or UNKNOWN
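
As far as I understand it, CheckMK hands the parse-function a string_table: a list with one entry per output line, where each line has already been split on whitespace into words. That is why the JSON has to be glued back together before it can be decoded. A minimal sketch without any CheckMK imports (the helper name and the sample data are made up):

```python
import json


def parse_json_line(string_table):
    """Join the words of the first line again and decode them as JSON."""
    if not string_table:
        return None
    line = " ".join(string_table[0])
    return json.loads(line)


if __name__ == "__main__":
    agent_line = ('{"oom_test1.sh": 1, "timestamp": "2021-09-10 13:01:36.323840", '
                  '"last_run": "2021-09-10 13:01:48.323380"}')
    # Simulates the whitespace splitting that happens before the parse-function is called.
    string_table = [agent_line.split()]
    print(parse_json_line(string_table))
```

Joining with a single space would collapse runs of whitespace inside JSON strings, which is not a problem for this output. I believe a section header can also carry a separator option such as sep(0) so that the line is not split at all, but I have not used that here. The complete Check-File follows.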

#!/usr/bin/env python3
# -*- encoding: utf-8; py-indent-offset: 4 -*-

#
# 2021-09-13 c.steinkogler[at]cashpoint.com
#
# Check_MK Check OOM Kills Plugin
# a check for detecting constant OOM kills
# e.g. constantly restarting pods/containers on kubernetes/docker
#
# This Plugin uses parts of https://gist.github.com/saghul/542780 in its agent-plugin
#
# This is free software;  you can redistribute it and/or modify it
# under the  terms of the  GNU General Public License  as published by
# the Free Software Foundation in version 2.  This file is distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY;  with-
# out even the implied warranty of  MERCHANTABILITY  or  FITNESS FOR A
# PARTICULAR PURPOSE. See the  GNU General Public License for more de-
# tails.  You should have  received  a copy of the  GNU  General Public
# License along with GNU Make; see the file  COPYING.  If  not,  write
# to the Free Software Foundation, Inc., 51 Franklin St,  Fifth Floor,
# Boston, MA 02110-1301 USA.

# FOR TESTING
# run for example this test shell script
"""
./oom_test1.sh
#!/bin/bash

choom -p $$ -n 1000

while true; do
    echo "still running"
    sleep 10
done
"""
# on a machine with oom_kills.py plugin installed - choom changes the OOM score to the max value of 1000
# if the OOM killer is triggered it will kill this process first. You can then trigger the killer with
"""
echo f > /proc/sysrq-trigger
"""
# the check plugin will detect this OOM similar to
"""
<<<oom_kills>>>
{"oom_test1.sh": 1, "timestamp": "2021-09-10 13:01:36.323840", "last_run": "2021-09-10 13:01:48.323380"}
"""
# if between runs there are multiple processes killed, they will be listed like
"""
<<<oom_kills>>>
{"oom_test1.sh": 1, "oom_test2.sh": 1, "timestamp": "2021-09-10 13:01:36.323840", "last_run": "2021-09-10 13:01:48.323380"}
"""

import json
# from .agent_based_api.v1 import *
# I use PyCharm; the absolute import below makes "Go To" -> "Declaration or Usages" work.
# Not sure if it's smart to keep this in productive checks. If not, comment out the
# following line once the check works and is finished, and uncomment the line above.
from cmk.base.plugins.agent_based.agent_based_api.v1 import *

def parse_oom_kills(string_table):
    # we check if we have any output received
    # usually in this check we receive only 1 line - so we will find the correct output in string_table[0]
    if len(string_table) > 0:
        # we create from the output one single line
        line = str(" ".join(string_table[0]))

        try:
            json_output = json.loads(line)
        except Exception as read_json_error:
            json_output = {"timestamp": "EXEC-ERROR - please check output of plugin on host - Additional info: %s" % str(read_json_error)}
        # endtry
    else:
        json_output = {"timestamp": "EXEC-ERROR - please check output of plugin on host"}
    # endif

    return json_output
# enddef

# we register the parse function
register.agent_section(
    name="oom_kills",
    parse_function=parse_oom_kills
)

# "section" now contains only the one pretty JSON line, which is output by "parse_oom_kills"
def discover_oom_kills(section):
    # if we didn't find an error we will pass on an item
    if "EXEC-ERROR" not in section["timestamp"]:
        yield Service()
# enddef

def check_oom_kills(params, section):
    # some default variables
    # initial state is always OK
    state = State.OK
    # empty output summary
    summary = ""
    # by default we don't have an error
    error = False
    # warn, crit - levels for service
    warn, crit = params["kill_levels"]

    if "EXEC-ERROR" in section["timestamp"]:
        state = State.UNKNOWN
        summary = 'unknown state - connection error no parseable output of agent (?)'
        error = True
    # endif

    if not error:
        # copy the parsed stuff
        count_kills_data = section.copy()
        # remove timestamp and lastrun
        count_kills_data.pop('timestamp')
        count_kills_data.pop('last_run')
        count_kills = sum(count_kills_data.values())

        # should put a green OK in the summary
        summary_state = "(.)"

        # we do not need sub-second precision, up to the second is ok
        approximate_timestamp = str(section["timestamp"]).split('.', 1)

        if count_kills == 0:
            summary = "No kills since " + approximate_timestamp[0] + " " + summary_state
        else:
            if (warn is not None) and (crit is not None):
                if count_kills >= warn and count_kills >= crit:
                    state = State.worst(State.CRIT, state)
                    # should put a red CRIT in the summary
                    summary_state = "(!!)"
                elif warn <= count_kills < crit:
                    state = State.worst(State.WARN, state)
                    # should put a yellow WARN in the summary
                    summary_state = "(!)"
                else:
                    state = State.worst(State.OK, state)
                # endif
            else:
                # just tag the check with the yellow WARN - but it still will be marked as green
                # as long as there are no levels set
                if count_kills >= 1:
                    summary_state = "(!)"
                # endif
            # endif

            # if there were multiple processes killed, we create one long summary text
            for process in count_kills_data.keys():
                if summary == "":
                    summary = "Process %s killed %d times" % (process, count_kills_data[process])
                else:
                    summary = summary + ", Process %s killed %d times" % (process, count_kills_data[process])
                # endif
            # endfor

            summary = summary + " since last check " + summary_state
        # endif
    # endif

    yield Result(state=state, summary=summary)
# enddef

register.check_plugin(
    name="oom_kills",
    service_name="OOM Kills",
    discovery_function=discover_oom_kills,
    check_function=check_oom_kills,
    check_default_parameters={"kill_levels": (1, 2)},
    check_ruleset_name="oom_kills"
)
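
The threshold decision inside the check-function can be looked at on its own. This is a simplified sketch that returns plain strings instead of the State objects from the API; the function name classify_kills is made up, and unlike the check above it does not handle missing (None) levels.

```python
def classify_kills(count_kills, warn, crit):
    """Map a kill count to a state name using WARN/CRIT thresholds."""
    if count_kills >= crit:
        return "CRIT"
    if count_kills >= warn:
        return "WARN"
    return "OK"


if __name__ == "__main__":
    # With the default levels (1, 2): one kill warns, two or more are critical.
    for count in (0, 1, 2, 5):
        print(count, classify_kills(count, warn=1, crit=2))
```

Checking the CRIT level first keeps the conditions short, because anything that reaches the second test is already known to be below CRIT.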

Check-Parameters-UI-File

As you may have noticed, the "Check-File" has some default parameters which may not be useful. To be able to change parameters like those, you have to create a "Check-Parameters-UI-File" (that’s what I call it). This file just contains some definitions which are used to generate the corresponding WATO-UI page for the service.

In this case we only need one variable, kill_levels, which holds a tuple with a WARN- and a CRIT-threshold.

#!/usr/bin/env python3
# -*- encoding: utf-8; py-indent-offset: 4 -*-

#
# 2021-09-13 c.steinkogler[at]cashpoint.com
#

# This is free software;  you can redistribute it and/or modify it
# under the  terms of the  GNU General Public License  as published by
# the Free Software Foundation in version 2.  This file is distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY;  with-
# out even the implied warranty of  MERCHANTABILITY  or  FITNESS FOR A
# PARTICULAR PURPOSE. See the  GNU General Public License for more de-
# tails.  You should have  received  a copy of the  GNU  General Public
# License along with GNU Make; see the file  COPYING.  If  not,  write
# to the Free Software Foundation, Inc., 51 Franklin St,  Fifth Floor,
# Boston, MA 02110-1301 USA.

# for autotranslating stuff it seems
from cmk.gui.i18n import _
# we need Dictionary, Tuple and Integer as we use those later
from cmk.gui.valuespec import (
    Dictionary,
    Tuple,
    Integer,
)

# we use CheckParameterRulespecWithoutItem because we only have one line in the output of oom_kills
from cmk.gui.plugins.wato import (
    CheckParameterRulespecWithoutItem,
    rulespec_registry,
    RulespecGroupCheckParametersOperatingSystem,
)

# looks similar to old checkmk versions - still can get pretty messy if you have a lot of parameter options
# for a check
def _parameter_valuespec_oom_kills():
    return Dictionary(
        elements=[
            (
                "kill_levels",
                Tuple(
                    title="Thresholds",
                    elements=[
                        Integer(title="Warning threshold", default_value=1),
                        Integer(title="Critical threshold", default_value=2)
                    ]
                )
            )
        ],
    )
# enddef

# need to register the thing so it works - again "WithoutItem" as we only receive one single "item"
rulespec_registry.register(
    CheckParameterRulespecWithoutItem(
        check_group_name="oom_kills",
        group=RulespecGroupCheckParametersOperatingSystem,
        match_type="dict",
        parameter_valuespec=_parameter_valuespec_oom_kills,
        title=lambda: _("Thresholds for detecting OOM kills"),
    )
)
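
What arrives in the check-function is a plain dict. As far as I can tell, the values from a matching rule are laid over the check_default_parameters from the Check-File. The following sketch only imitates that idea with a simple dict update; the helper name effective_params is made up, and the real merging in CheckMK may be more involved.

```python
# Same values as check_default_parameters in the Check-File.
DEFAULT_PARAMS = {"kill_levels": (1, 2)}


def effective_params(rule_params):
    """Overlay rule values on the defaults (a simplification of what CheckMK does)."""
    merged = dict(DEFAULT_PARAMS)
    merged.update(rule_params)
    return merged


if __name__ == "__main__":
    # No rule configured: the defaults apply.
    print(effective_params({}))
    # A rule created via the WATO page overrides the thresholds.
    warn, crit = effective_params({"kill_levels": (3, 5)})["kill_levels"]
    print(warn, crit)
```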

Checkman-File

A file which contains the manual and info for the plugin and check.

title: OOM-Kills Check
agents: plugin
license: GPL
distribution: cashpoint
description:
 Checks for constantly OOM-killed processes and reports how often
 a process was killed between two check intervals. May be useful
 for example to detect constantly restarting docker/kubernetes
 containers.

 Returns {WARN} or {CRIT} if the number of kills reaches the
 configured thresholds and {OK} otherwise.

inventory:
 Creates one check containing all info about found issues.

Bakery-File

This file just defines which files the plugin consists of and which should be added to the agent built for a host. In this case it only defines that oom_kills.py should be added.

#!/usr/bin/env python3
# -*- encoding: utf-8; py-indent-offset: 4 -*-

#
# 2021-09-13 c.steinkogler[at]cashpoint.com
#

# This is free software;  you can redistribute it and/or modify it
# under the  terms of the  GNU General Public License  as published by
# the Free Software Foundation in version 2.  This file is distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY;  with-
# out even the implied warranty of  MERCHANTABILITY  or  FITNESS FOR A
# PARTICULAR PURPOSE. See the  GNU General Public License for more de-
# tails.  You should have  received  a copy of the  GNU  General Public
# License along with GNU Make; see the file  COPYING.  If  not,  write
# to the Free Software Foundation, Inc., 51 Franklin St,  Fifth Floor,
# Boston, MA 02110-1301 USA.

from pathlib import Path
from typing import Any, Dict

from cmk.base.cee.plugins.bakery.bakery_api.v1 import FileGenerator, OS, Plugin, register

def get_oom_kills_files(conf: Dict[str, Any]) -> FileGenerator:
    yield Plugin(base_os=OS.LINUX, source=Path("oom_kills.py"))
# enddef

register.bakery_plugin(
    name="oom_kills",
    files_function=get_oom_kills_files,
)

Bakery-UI-File

Contains the definition of what the UI for configuring the corresponding agent-rule should look like. In this case we only need a dropdown where you can select whether the plugin should be installed or not.

#!/usr/bin/env python3
# -*- encoding: utf-8; py-indent-offset: 4 -*-

#
# 2021-09-13 c.steinkogler[at]cashpoint.com
#

# This is free software;  you can redistribute it and/or modify it
# under the  terms of the  GNU General Public License  as published by
# the Free Software Foundation in version 2.  This file is distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY;  with-
# out even the implied warranty of  MERCHANTABILITY  or  FITNESS FOR A
# PARTICULAR PURPOSE. See the  GNU General Public License for more de-
# tails.  You should have  received  a copy of the  GNU  General Public
# License along with GNU Make; see the file  COPYING.  If  not,  write
# to the Free Software Foundation, Inc., 51 Franklin St,  Fifth Floor,
# Boston, MA 02110-1301 USA.

from cmk.gui.i18n import _
from cmk.gui.plugins.wato import (
    HostRulespec,
    rulespec_registry,
)
from cmk.gui.valuespec import (
    DropdownChoice,
)
from cmk.gui.cee.plugins.wato.agent_bakery.rulespecs.utils import RulespecGroupMonitoringAgentsAgentPlugins

def _valuespec_agent_config_oom_kills():
    return DropdownChoice(
        title=_("OOM-Kills (Linux)"),
        help=_("This will deploy the agent plugin <tt>oom_kills</tt> for checking if there are constant oom-kills "
               "(e.g. crashing pods on kubernetes/docker)"),
        choices=[
            (True, _("Deploy plugin")),
            (None, _("Do not deploy plugin")),
        ]
    )
# enddef

rulespec_registry.register(
    HostRulespec(
        group=RulespecGroupMonitoringAgentsAgentPlugins,
        name="agent_config:oom_kills",
        valuespec=_valuespec_agent_config_oom_kills,
    )
)

Links and Info

The Check_MK documentation, which is messy in my eyes: https://docs.checkmk.com/latest/en/devel_check_plugins.html
Heinlein’s git-repo, where you can already find some nice new CheckMK plugins: https://github.com/HeinleinSupport/check_mk_extensions

A grep command to search your site’s folder and quickly find where useful code may be hidden:

# execute in site's user home
grep -rI --exclude='werk*' --exclude='*.html' --exclude='*.js' "bakery_api.v1" *

Todo

Improve the check 😉, e.g. add some config which tells the check-plugin to run only with each x-th check_mk_agent call
Understand the parsing function better; there seems to be a better method to format the output into correct JSON
Maybe recode the "Plugin-File" itself, as this surely can be done in a much better way 😐

Preview

Stay tuned, more HowTos are coming.

HowTo for SNMP-checks
Hope I can add a HowTo for a full-featured agent-plugin ASAP

CheckMK v2 – Coding own CheckMK-Agent Plugin (no metrics, no multiple services, no config for the host’s plugin)