I manage a whole number of device and servers, which are monitored by various utilities, including Nagios. I also have clients who do the same, as well as using other tools that produce notifications – build systems etc.

Nagios is the thing that tells me when my web server is unavailable, or the database has fallen over, or, more often, when my internet connection dies. I have similar setups in various client networks that I maintain.

It logs to the system log, sends me emails, and in some urgent cases, sends a ping to my phone. All very handy, but isn’t very handy for other casual users who may just want to see if things are running properly. For those users, who are somewhat non-technical, it’s a bit much to ask them to read logs, and emails often get lost.

For one of my clients we had a need to be able to collect these status updates from different sources together, make it more persistent, and make it visible in a much more accessible way than log messages (which has a very poor signal to noise ratio) or email alerts (which only go to specific people).

“Known” issues

A solution I came up with was to create a Known site for the network which can be used to log these notifications in a user friendly, chronological and searchable form.

I created an account for my Nagios process, and then, using my Known command line tools, I extended the Nagios script to use my Known site as a notification mechanism.

In commands.cfg:

define command {
        command_name host-notify-by-known
        command_line echo "$HOSTNAME$: $HOSTSTATE$" | /etc/nagios/known_nagios_notify.sh
}
define command {
        command_name service-notify-by-known
        command_line echo "$HOSTNAME$ – $SERVICEDESC$ : $SERVICESTATE$. Additional info: '$SERVICEOUTPUT$'" | /etc/nagios/known_nagios_notify.sh
}

Then in conf.d/contacts.cfg I extended my “Root” contact:

define contact{
        contact_name                    root
        alias                           Root
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-service-by-email, service-notify-by-known
        host_notification_commands      notify-host-by-email, host-notify-by-known
        email                           root@localhost
        }

Finally, the script itself, which serves as a wrapper around the api tools and sets the appropriate path etc:

#!/bin/bash

PATH=/path/to/BashKnown:"${PATH}"

status.sh https://my.status.server nagios *YOURAPICODE* >/dev/null

exit 0

Consolidating rich logs

Of course, this is only just the beginning of what’s possible.

For my client, I’ve already modified their build system to post on successful builds, or build errors, with a link to the appropriate logs. This particular client was already using Known for internal communication, so this improvement was logical.

The rich content types that Known supports also raises the possibility of richer logging from a number of devices, here’s a few thoughts of some things I’ve got on my list to play with:

  • Post an image to the channel when motion is detected by a webcam pointed at the bird feeders (again, trivial to hook up – the software triggers a script when motion is detected, and all I have to do is take the resultant image and CURL it to the API)
  • Post an audio message when a voicemail is left (although that’d require me to actually set up asterisk, which has been on my list for a while now)
  • Attach debugging info & a core dump to automated test results

I might get to those at some point, but I guess my point is that APIs are cool.

8 thoughts on “Tracking system events with Nagios and @withknown

  1. RT @withknown: Known doesn’t have to be for humans. Here’s how @mapkyca wired his datacenter up to a Known site: marcus-povey.co.uk/2015/04/02/tra…

  2. I’ve previously documented how I’ve previously used Known to track system events generated by various pieces of hardware and various processes within my (and my client) infrastructure.
    Like many, I use a UPS to keep my servers from uncontrolled shutdowns, and potential data loss, during the thankfully rare (but still more common than I’d like) power outages.
    Both my UPS’ are made by APC, and are monitored by a small demon called apcupsd. This demon monitors the UPS and will report on its status, from obvious things like “lost power” and “power returned”, but also potentially more important things like “battery needs replacing”.
    While some events do trigger an email, other messages are written to the console using the wall command, so are less useful on headless systems. Thankfully, modifying the script is fairly straightforward.
    Setting up your script
    First, set your script as you did for nagios. Create an account from within your Known install, and then grab its API key, then put it in a wrapping script.



    #!/bin/bash

    PATH=/path/to/BashKnown:”${PATH}”

    status.sh https://my.status.server apc *YOURAPICODE* >/dev/null

    exit 0


    1234567

    #!/bin/bash PATH=/path/to/BashKnown:“${PATH}” status.sh https://my.status.server apc *YOURAPICODE* >/dev/null exit 0


    I need to Pee
    The next step is to modify /etc/apccontrol to call your wrapper script.
    I wanted to maintain the existing ‘post to wall’ functionality as well as posting to my status page. To do this, you need to replace the definition for WALL at the top of the script, and split the pipe between two executables.
    To do this you need a command called pee, which is the pipe version of tee, and is available in the moreutils package on debian based systems. So, apt-get install moreutils.
    Change:



    WALL=wall


    1

    WALL=wall


    To:



    WALL=”pee ‘wall’ ‘/path/to/wrapperscript.sh'”


    1

    WALL=“pee ‘wall’ ‘/path/to/wrapperscript.sh'”


    Testing
    To test, you can run apccontrol directly, although obviously you should be careful which command you run, as some commands fire other scripts.
    I tested by firing:



    ./apccontrol commfailure


    1

    ./apccontrol commfailure


    Happy hacking!
    Share this:EmailLinkedInTwitterGoogleFacebookReddit

Leave a Reply