"CheckMK’s documentations that say that something hasn’t been yet documented are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know. And if one looks throughout the history of CheckMK’s documentations and other documentations, it is the latter category that tends to be the difficult ones." :stuck_out_tongue:
This HowTo describes how to create a very simple CheckMK agent plugin whose check supplies just a single JSON line as output. The check does not use any metrics, and the plugin takes no configuration on the host itself. I’ll try to cover those topics in a separate HowTo later, because CheckMK’s documentation is, again, not very helpful: its structure is quite confusing, and it is incomplete or outdated. You have to dig through the code of other plugins to find useful info 😐
Also, the new "API" doesn’t really make it easier to code your plugins IMHO; it is just complicated in a different way. But enough ranting, let’s begin.
Agent Structure
As far as I understand, a minimal agent plugin like this one should consist of the following files:
└── local
    ├── bin
    ├── lib
    │   ├── check_mk -> python3/cmk
    │   └── python3
    │       └── cmk
    │           └── base
    │               └── plugins
    │                   └── agent_based
    │                       └── oom_kills.py  # Check-File: file where the output of the agent is analyzed
    └── share
        └── check_mk
            ├── agents
            │   ├── bakery
            │   │   └── oom_kills.py  # Bakery-File: file which defines what the plugin should consist of
            │   └── plugins
            │       └── oom_kills.py  # Plugin-File: file which will be installed on the host
            ├── checkman
            │   └── oom_kills  # Checkman-File: file with info about the check
            └── web
                └── plugins
                    └── wato
                        ├── oom_kills_bakery.py  # Bakery-UI-File: file which creates the WATO-UI page for configuring the agent-plugin
                        └── oom_kills_parameters.py  # Check-Parameters-UI-File: file which creates the WATO-UI page for configuring the parameters of the check
So six files are needed. I’ll begin with the Plugin-File, which will be installed on the host.
Plugin-File
This "Plugin-File" is executed by the check_mk_agent on the host and gathers all the info needed. In this case, the output will look like:
<<<oom_kills>>>
{"timestamp": "2021-09-10 15:52:02.433240", "last_run": "2021-09-13 08:14:04.432899"}
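To make the structure of that line explicit, here is a minimal sketch (not part of the plugin) that splits such a JSON line into the per-process kill counters and the two bookkeeping keys. The helper name split_oom_output is made up for illustration; the keys timestamp and last_run are the ones from the output above, and every other key is assumed to be a process name with its kill count.

```python
import json

# Hypothetical helper, for illustration only.
BOOKKEEPING_KEYS = ("timestamp", "last_run")


def split_oom_output(line):
    """Split one agent JSON line into (kill counters, bookkeeping info)."""
    data = json.loads(line)
    meta = {key: data[key] for key in BOOKKEEPING_KEYS if key in data}
    counters = {key: value for key, value in data.items()
                if key not in BOOKKEEPING_KEYS}
    return counters, meta


if __name__ == "__main__":
    # Sample line with one killed process (the process name is an example).
    sample = ('{"oom_test1.sh": 1, "timestamp": "2021-09-10 13:01:36.323840", '
              '"last_run": "2021-09-10 13:01:48.323380"}')
    counters, meta = split_oom_output(sample)
    print(counters)
    print(meta["last_run"])
```

With no OOM kills between two runs, the counters dictionary is simply empty and only the two bookkeeping keys remain.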
The script is executed with each check_mk_agent call and looks for OOM kills since its last run:
- it scrapes the dmesg output and looks for the pattern "(O|o)ut of memory"
- it also saves the output to a tmp-file, so it knows when the last OOM occurred
- it creates the needed output and returns it
The plugins are normally placed in /usr/lib/check_mk_agent/plugins.
As dmesg -T may not be available on some Linux distributions, I plugged together this script from several sources. It should be Python 2 and 3 compatible. Also, the script cannot/should not run multiple times at once, in case it takes longer to scrape through dmesg’s output.
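The core idea of replacing dmesg -T can be sketched in a few lines: match the OOM line, then add the seconds-since-boot value to the boot time. This is only an illustration under assumptions. The function name parse_oom_line, the boot time and the sample kernel line are made up, and the fractional part is treated as microseconds (dmesg prints six digits), which is a simplification of mine and not exactly how the script below computes it.

```python
import re
from datetime import datetime, timedelta

# Same pattern as in the plugin, written as a raw string.
OOM_LINE = re.compile(
    r"^\[((\s*)(?P<time>\d+\.\d+))\]"
    r"(?P<line>.*(O|o)ut of memory.*\((?P<process>.*)\).*)$"
)


def parse_oom_line(line, boot_time):
    """Return (wall-clock time, process name) for an OOM line, else None."""
    match = OOM_LINE.match(line)
    if match is None:
        return None
    seconds, fraction = match.group("time").split(".")
    # Assumption: the fraction is in microseconds; pad or cut to six digits.
    microseconds = int(fraction[:6].ljust(6, "0"))
    event_time = boot_time + timedelta(seconds=int(seconds),
                                       microseconds=microseconds)
    return event_time, match.group("process")


if __name__ == "__main__":
    # Illustrative values: a made-up boot time and a made-up kernel log line.
    boot = datetime(2021, 9, 10, 12, 0, 0)
    sample = ("[ 3696.323840] Out of memory: Kill process 4242 "
              "(oom_test1.sh) score 1000 or sacrifice child")
    print(parse_oom_line(sample, boot))
```

In the real plugin the boot time is derived from the current time minus the value read from /proc/uptime.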
#!/usr/bin/env python
# -*- encoding: utf-8; py-indent-offset: 4 -*-
# some parts used from https://gist.github.com/saghul/542780
# Copyright (C) 2010 Saúl ibarra Corretgé <saghul@gmail.com>
# 2021-09-10 - c.steinkogler[at]cashpoint.com
#
# Changelog:
# 2021-09-10 - updated for use with CheckMK v2.0, removed some and/or reformatted some lines
#
# Write result to file, and outputs diff after run again
# The process locks itself - so it cannot run multiple times simultaneously
"""
pydmesg: dmesg with human-readable timestamps
"""
from __future__ import with_statement
import re
import subprocess
import sys
import socket
import json
from datetime import datetime, timedelta
from collections import Counter
_datetime_format = "%Y-%m-%d %H:%M:%S"
# raw strings avoid invalid-escape warnings for \[ and \s on newer Python versions
_dmesg_line_regex = re.compile(r"^\[((\s*)(?P<time>\d+\.\d+))\](?P<line>.*(O|o)ut of memory.*\((?P<process>.*)\).*)$")
_dmesg_fallback_line_regex = re.compile(r"^\[((\s*)(?P<time>\d+\.\d+))\](?P<line>.*)$")
def read_json_file(json_file_name):
    with open(json_file_name, 'r') as f:  # open in readonly mode
        json_data = f.read()
        json_data = json.loads(json_data)
    # endwith
    return json_data
# enddef


def write_json_file(json_file_name, json_data):
    # current_date_time = datetime.datetime.today()
    # now = current_date_time.strftime('%Y%m%d_%H%M%S')
    with open(json_file_name, 'w') as f:
        pretty_json = json.dumps(json_data, indent=2)
        f.write(pretty_json)
    # endwith
    return json_file_name
# enddef
def get_lock(process_name, section=None):
    # Without holding a reference to our socket somewhere it gets garbage
    # collected when the function exits
    get_lock._lock_socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    if section is None:
        section = process_name
    # endif
    try:
        get_lock._lock_socket.bind('\0' + process_name)
        # print("Process locked for %s" % section)
    except socket.error:
        print("%s already or still running. Please, try again later.\n" % section)
        sys.exit(253)
    # endtry
# enddef
def exec_process(cmdline, silent, input=None, **kwargs):
    """Execute a subprocess and return its stdout buffer.
    Raises RuntimeError if the command is missing or returns non-zero.
    Optionally prints stdout and stderr after the command has finished."""
    try:
        sub = subprocess.Popen(cmdline, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, **kwargs)
        stdout, stderr = sub.communicate(input=input)
        returncode = sub.returncode
        if not silent:
            # note: stdout/stderr are bytes on Python 3; this branch is unused here (silent=True)
            sys.stdout.write(stdout)
            sys.stderr.write(stderr)
        # endif
    except OSError as e:
        if e.errno == 2:
            raise RuntimeError('"%s" is not present on this system' % cmdline[0])
        else:
            raise
        # endif
    # endtry
    if returncode != 0:
        # decode first, as stderr is a bytes object on Python 3
        raise RuntimeError('Got return value %d while executing "%s", stderr output was:\n%s' % (
            returncode, " ".join(cmdline), stderr.decode("utf-8", "replace").rstrip("\n")))
    # endif
    return stdout
# enddef
def human_dmesg():
    print("<<<oom_kills>>>")
    now = datetime.now()
    first_run = False
    uptime_diff = None
    tmp_file = "/tmp/check_oom_kills.json"
    processes = []
    tmp_output = {}
    try:
        last_output = read_json_file(tmp_file)
        last_event_timestamp = datetime.strptime(last_output['timestamp'], '%Y-%m-%d %H:%M:%S.%f')
        last_run_time = datetime.strptime(last_output['last_run'], '%Y-%m-%d %H:%M:%S.%f')
    except Exception:
        # no readable state file yet (or it is malformed): treat this as the first run
        first_run = True
    # endtry
    try:
        with open('/proc/uptime') as f:
            uptime_diff = f.read().strip().split()[0]
    except IndexError:
        return
    else:
        try:
            # despite its name, "uptime" holds the boot time (now minus uptime)
            uptime = now - timedelta(seconds=int(uptime_diff.split('.')[0]),
                                     microseconds=int(uptime_diff.split('.')[1]))
        except IndexError:
            return
        # endtry
    # endtry
    dmesg_data = exec_process(['dmesg'], True)
    for line in dmesg_data.split(b'\n'):
        if not line:
            continue
        line = line.decode("utf-8")
        match = _dmesg_line_regex.match(line)
        if match:
            try:
                seconds = int(match.groupdict().get('time', '').split('.')[0])
                nanoseconds = int(match.groupdict().get('time', '').split('.')[1])
                microseconds = int(round(nanoseconds * 0.001))
                line = match.groupdict().get('line', '')
                process = match.groupdict().get('process', '')
                t = uptime + timedelta(seconds=seconds, microseconds=microseconds)
                if first_run:
                    last_run_time = t
                    last_event_timestamp = t
            except IndexError:
                pass
            else:
                if last_run_time <= t and last_event_timestamp <= t:
                    processes.append(process)
                # endif
            # endtry
        # endif
    # endfor
    process_counter = dict(Counter(processes))
    tmp_output.update(process_counter)
    try:
        last_event_timestamp = {"timestamp": str(t)}
        # if we do not have any OOM kills we need a timestamp anyway - we take the first we can find
        # TODO: find prettier solution
    except UnboundLocalError:
        dmesg_data = exec_process(['dmesg'], True)
        for line in dmesg_data.split(b'\n'):
            if not line:
                continue
            # endif
            line = line.decode("utf-8")
            match = _dmesg_fallback_line_regex.match(line)
            if match:
                try:
                    seconds = int(match.groupdict().get('time', '').split('.')[0])
                    nanoseconds = int(match.groupdict().get('time', '').split('.')[1])
                    microseconds = int(round(nanoseconds * 0.001))
                    line = match.groupdict().get('line', '')
                    t = uptime + timedelta(seconds=seconds, microseconds=microseconds)
                except IndexError:
                    pass
                else:
                    break
                # endtry
            # endif
        # endfor
        last_event_timestamp = {"timestamp": str(t)}
    # endtry
    last_run_time = {"last_run": str(now)}
    tmp_output.update(last_event_timestamp)
    tmp_output.update(last_run_time)
    write_json_file(tmp_file, tmp_output)
    print(json.dumps(tmp_output))
# enddef
if __name__ == '__main__':
    get_lock("oom_kills")
    human_dmesg()
# endif
Check-File
This file is where the output of the plugin is analyzed and where it’s decided whether some threshold was hit. In this case the script consists of:
- a parse function which turns the plugin output into a nice JSON structure
- a discovery function which inventories the check’s services
- a check function which does the logic and decides if the service should show as OK, WARN, CRIT or UNKNOWN
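The parse step is the least obvious of the three, so here is a minimal stand-alone sketch of it without any CheckMK imports. As far as I understand, the parse function receives the section as a list of lines, each already split on whitespace, which is why the JSON line has to be glued back together first. The name parse_section is made up for illustration, and the way the input is mimicked with split() is my assumption.

```python
import json


def parse_section(string_table):
    """Rebuild the JSON line from whitespace-split words and decode it."""
    if not string_table:
        return {"timestamp": "EXEC-ERROR - no output"}
    # Joining with single spaces restores the line, as long as no value
    # contained several consecutive spaces.
    line = " ".join(string_table[0])
    try:
        return json.loads(line)
    except ValueError as error:
        return {"timestamp": "EXEC-ERROR - %s" % error}


if __name__ == "__main__":
    raw = ('{"oom_test1.sh": 1, "timestamp": "2021-09-10 13:01:36.323840", '
           '"last_run": "2021-09-10 13:01:48.323380"}')
    string_table = [raw.split()]  # mimics the whitespace splitting
    print(parse_section(string_table))
    print(parse_section([]))
```

Putting the error marker into the timestamp key mirrors what the check file does, so the discovery and check functions only have to look at one key.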
#!/usr/bin/env python3
# -*- encoding: utf-8; py-indent-offset: 4 -*-
#
# 2021-09-13 c.steinkogler[at]cashpoint.com
#
# Check_MK Check OOM Kills Plugin
# a check for detecting constant OOM kills
# e.g. constantly restarting pods/containers on kubernetes/docker
#
# This Plugin uses parts of https://gist.github.com/saghul/542780 in its agent-plugin
#
# This is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation in version 2. This file is distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY; with-
# out even the implied warranty of MERCHANTABILITY or FITNESS FOR A
# PARTICULAR PURPOSE. See the GNU General Public License for more de-
# tails. You should have received a copy of the GNU General Public
# License along with GNU Make; see the file COPYING. If not, write
# to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
# Boston, MA 02110-1301 USA.
# FOR TESTING
# run for example this test shell script
"""
./oom_test1.sh
#!/bin/bash
choom -p $$ -n 1000
while true; do
echo "still running"
sleep 10
done
"""
# on a machine with the oom_kills.py plugin installed - choom changes the OOM score to the max value of 1000
# if the OOM killer is triggered it will kill this process first. You can then trigger the killer with
"""
echo f > /proc/sysrq-trigger
"""
# the check plugin will detect this OOM similar to
"""
<<<oom_kills>>>
{"oom_test1.sh": 1, "timestamp": "2021-09-10 13:01:36.323840", "last_run": "2021-09-10 13:01:48.323380"}
"""
# if between runs there are multiple processes killed, they will be listed like
"""
<<<oom_kills>>>
{"oom_test1.sh": 1, "oom_test2.sh": 1, "timestamp": "2021-09-10 13:01:36.323840", "last_run": "2021-09-10 13:01:48.323380"}
"""
import json
# from .agent_based_api.v1 import *
# I use PyCharm; importing via the full path makes "Go To" -> "Declaration or Usages" work.
# Not sure if it's smart to keep this in productive checks. If not, comment out the
# following line after the check works and is finished, and uncomment the import above.
from cmk.base.plugins.agent_based.agent_based_api.v1 import *
def parse_oom_kills(string_table):
    # we check if we have any output received
    # usually in this check we receive only 1 line - so we will find the correct output in string_table[0]
    if len(string_table) > 0:
        # we create from the output one single line
        line = str(" ".join(string_table[0]))
        try:
            json_output = json.loads(line)
        except Exception as read_json_error:
            json_output = {"timestamp": "EXEC-ERROR - please check output of plugin on host - Additional info: %s" % str(read_json_error)}
        # endtry
    else:
        json_output = {"timestamp": "EXEC-ERROR - please check output of plugin on host"}
    # endif
    return json_output
# enddef


# we register the parse function
register.agent_section(
    name="oom_kills",
    parse_function=parse_oom_kills
)
# "section" now contains only the one pretty JSON line, which is output by "parse_oom_kills"
def discover_oom_kills(section):
    # if we didn't find an error we will pass on an item
    if "EXEC-ERROR" not in section["timestamp"]:
        yield Service()
# enddef
def check_oom_kills(params, section):
    # some default variables
    # initial state is always OK
    state = State.OK
    # empty output summary
    summary = ""
    # default we don't have an error
    error = False
    # warn, crit - levels for service
    warn, crit = params["kill_levels"]
    if "EXEC-ERROR" in section["timestamp"]:
        state = State.UNKNOWN
        summary = 'unknown state - connection error or no parseable output of agent (?)'
        error = True
    # endif
    if not error:
        # copy the parsed stuff
        count_kills_data = section.copy()
        # remove timestamp and last_run
        count_kills_data.pop('timestamp')
        count_kills_data.pop('last_run')
        count_kills = sum(count_kills_data.values())
        # should put a green OK in the summary
        summary_state = "(.)"
        # we do not need sub-second precision - up to the second is ok
        approximate_timestamp = str(section["timestamp"]).split('.', 1)
        if count_kills == 0:
            summary = "No kills since " + approximate_timestamp[0] + " " + summary_state
        else:
            if (warn is not None) and (crit is not None):
                if count_kills >= warn and count_kills >= crit:
                    state = State.worst(State.CRIT, state)
                    # should put a red CRIT in the summary
                    summary_state = "(!!)"
                elif warn <= count_kills < crit:
                    state = State.worst(State.WARN, state)
                    # should put a yellow WARN in the summary
                    summary_state = "(!)"
                else:
                    state = State.worst(State.OK, state)
                # endif
            else:
                # just tag the check with the yellow WARN - but it still will be marked as green
                # as long as there are no levels set
                if count_kills >= 1:
                    summary_state = "(!)"
                # endif
            # endif
            # if there were multiple processes killed, we create one long summary text
            for process in count_kills_data.keys():
                if summary == "":
                    summary = "Process %s killed %d times" % (process, count_kills_data[process])
                else:
                    summary = summary + ", Process %s killed %d times" % (process, count_kills_data[process])
                # endif
            # endfor
            summary = summary + " since last check " + summary_state
        # endif
    # endif
    yield Result(state=state, summary=summary)
# enddef
register.check_plugin(
    name="oom_kills",
    service_name="OOM Kills",
    discovery_function=discover_oom_kills,
    check_function=check_oom_kills,
    check_default_parameters={"kill_levels": (1, 2)},
    check_ruleset_name="oom_kills"
)
Check-Parameters-UI-File
As you may have noticed, the "Check-File" has some default parameters which may not be useful. To be able to change parameters like those, you have to create a "Check-Parameters-UI-File" (that’s what I call it). This file just contains some definitions which are used to generate the corresponding WATO-UI page for the service.
In this case we only need one variable, kill_levels, which holds a tuple with a WARN and a CRIT threshold.
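How such a (warn, crit) tuple is turned into a state can be sketched without any CheckMK imports. This is a simplified illustration, not the check function itself: the name classify_kills and the plain strings instead of State objects are my own, and it assumes that the WARN threshold is not larger than the CRIT threshold.

```python
def classify_kills(count_kills, warn, crit):
    """Map a kill count to 'OK', 'WARN' or 'CRIT' using (warn, crit) levels.

    Assumes warn <= crit; plain strings stand in for the State values.
    """
    if count_kills >= crit:
        return "CRIT"
    if count_kills >= warn:
        return "WARN"
    return "OK"


if __name__ == "__main__":
    kill_levels = (1, 2)  # the default parameters from the Check-File
    for count in (0, 1, 2, 5):
        print(count, classify_kills(count, *kill_levels))
```

With the defaults (1, 2), a single kill between two checks gives WARN and two or more give CRIT, so raising the levels in the WATO rule is the way to make the service less sensitive.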
#!/usr/bin/env python3
# -*- encoding: utf-8; py-indent-offset: 4 -*-
#
# 2021-09-13 c.steinkogler[at]cashpoint.com
#
# This is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation in version 2. This file is distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY; with-
# out even the implied warranty of MERCHANTABILITY or FITNESS FOR A
# PARTICULAR PURPOSE. See the GNU General Public License for more de-
# tails. You should have received a copy of the GNU General Public
# License along with GNU Make; see the file COPYING. If not, write
# to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
# Boston, MA 02110-1301 USA.
# for autotranslating stuff it seems
from cmk.gui.i18n import _
# we need Dictionary, Tuple and Integer as we use those later
from cmk.gui.valuespec import (
    Dictionary,
    Tuple,
    Integer,
)
# we use CheckParameterRulespecWithoutItem because we only have one line in the output of oom_kills
from cmk.gui.plugins.wato import (
    CheckParameterRulespecWithoutItem,
    rulespec_registry,
    RulespecGroupCheckParametersOperatingSystem,
)


# looks similar to old checkmk versions - still can get pretty messy if you have a lot of parameter options
# for a check
def _parameter_valuespec_oom_kills():
    return Dictionary(
        elements=[
            (
                "kill_levels",
                Tuple(
                    title="Thresholds",
                    elements=[
                        Integer(title="Warning threshold", default_value=1),
                        Integer(title="Critical threshold", default_value=2)
                    ]
                )
            )
        ],
    )
# enddef


# need to register the thing so it works - again "WithoutItem" as we only receive one single "item"
rulespec_registry.register(
    CheckParameterRulespecWithoutItem(
        check_group_name="oom_kills",
        group=RulespecGroupCheckParametersOperatingSystem,
        match_type="dict",
        parameter_valuespec=_parameter_valuespec_oom_kills,
        title=lambda: _("Thresholds for detecting OOM kills"),
    )
)
Checkman-File
A file which contains the manual and info for the plugin and check.
title: OOM-Kills Check
agents: plugin
license: GPL
distribution: cashpoint
description:
 Checks for constantly OOM-killed processes and reports how often
 a process was killed in between two check intervals. May be useful
 for example to detect constantly restarting docker/kubernetes
 containers.
 Returns {WARN} or {CRIT} if the number of kills reaches the
 configured thresholds and {OK} otherwise.
inventory:
 Creates one check containing all info about found issues.
Bakery-File
This file just defines which files the plugin consists of and which should be added to the agent built for a host. In this case it only defines that oom_kills.py should be added.
#!/usr/bin/env python3
# -*- encoding: utf-8; py-indent-offset: 4 -*-
#
# 2021-09-13 c.steinkogler[at]cashpoint.com
#
# This is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation in version 2. This file is distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY; with-
# out even the implied warranty of MERCHANTABILITY or FITNESS FOR A
# PARTICULAR PURPOSE. See the GNU General Public License for more de-
# tails. You should have received a copy of the GNU General Public
# License along with GNU Make; see the file COPYING. If not, write
# to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
# Boston, MA 02110-1301 USA.
from pathlib import Path
from typing import Any, Dict
from cmk.base.cee.plugins.bakery.bakery_api.v1 import FileGenerator, OS, Plugin, register
def get_oom_kills_files(conf: Dict[str, Any]) -> FileGenerator:
    yield Plugin(base_os=OS.LINUX, source=Path("oom_kills.py"))
# enddef


register.bakery_plugin(
    name="oom_kills",
    files_function=get_oom_kills_files,
)
Bakery-UI-File
This file defines what the UI for configuring the corresponding agent rule should look like. In this case we only need a dropdown where you can select whether the plugin should be installed or not.
#!/usr/bin/env python3
# -*- encoding: utf-8; py-indent-offset: 4 -*-
#
# 2021-09-13 c.steinkogler[at]cashpoint.com
#
# This is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation in version 2. This file is distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY; with-
# out even the implied warranty of MERCHANTABILITY or FITNESS FOR A
# PARTICULAR PURPOSE. See the GNU General Public License for more de-
# tails. You should have received a copy of the GNU General Public
# License along with GNU Make; see the file COPYING. If not, write
# to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
# Boston, MA 02110-1301 USA.
from cmk.gui.i18n import _
from cmk.gui.plugins.wato import (
    HostRulespec,
    rulespec_registry,
)
from cmk.gui.valuespec import (
    DropdownChoice,
)
from cmk.gui.cee.plugins.wato.agent_bakery.rulespecs.utils import RulespecGroupMonitoringAgentsAgentPlugins


def _valuespec_agent_config_oom_kills():
    return DropdownChoice(
        title=_("OOM-Kills (Linux)"),
        help=_("This will deploy the agent plugin <tt>oom_kills</tt> for checking if there are constant oom-kills "
               "(e.g. crashing pods on kubernetes/docker)"),
        choices=[
            (True, _("Deploy plugin")),
            (None, _("Do not deploy plugin")),
        ]
    )
# enddef


rulespec_registry.register(
    HostRulespec(
        group=RulespecGroupMonitoringAgentsAgentPlugins,
        name="agent_config:oom_kills",
        valuespec=_valuespec_agent_config_oom_kills,
    )
)
Links and Info
- The Check_MK documentation, which is messy in my eyes: https://docs.checkmk.com/latest/en/devel_check_plugins.html
- Heinlein’s git-repo where you already can find some nice new CheckMK plugins: https://github.com/HeinleinSupport/check_mk_extensions
A grep command to search your site’s folder and quickly find where potentially useful code is hidden:
# execute in site's user home
grep -rI --exclude='werk*' --exclude='*.html' --exclude='*.js' "bakery_api.v1" *
Todo
- improve the check 😉 – e.g. add some config which tells the check-plugin to run with each x-th check_mk_agent call
- understand the parsing function better, seems there is some better method to format the output to correct JSON
- maybe recode the "Plugin-File" itself, this surely can be done in a much better way 😐
Preview
Stay tuned, more HowTos are coming:
- HowTo for SNMP checks
- Hopefully a HowTo for a full-featured agent plugin ASAP