Sentinel update

January 3, 2013 / Matthew Smith / 1 Comment

Many moons ago…

A while back I started to mention the idea of self-healing systems: a dedicated system that makes use of monitoring and real-time system information to make intelligent decisions about what to do, i.e. I write a complicated program to gradually replace myself. It was suggested that I use hooks in Nagios to do the tasks, but that misses the intelligence side of what I’m trying to get to. Restarting based on Nagios checks is simply an if statement that does something on a certain condition; Sentinel will be more than that.

Back in April I started Sentinel as an open source project. As expected the uptake has been phenomenal! Absolutely no one has even looked at it :) Either way I am not deterred. I have been on and off refactoring Sentinel into something a bit more logical, and I have gone from 3 files to some 13, from 1411 words to 2906, and I even have one fully working unit test! I don’t think I’ll be writing more tests for now, as at the moment they are not really helping me get to where I want to be quickly, but I know I’ll need them at some point!

So far all I have done is split out some of the code to give it structure and added the odd bit here and there. The next thing I need to start doing is to make it better; there are a number of options:

Writing more providers for it so it can start to manage disks, memory etc. and be a bit more useful
Sorting out the structure of the code, adding in more error handling, logging and resilience
Integration with Nagios or some other tool that already monitors system health, and using that to base actions on
Daemonize Sentinel so it runs like a real program!
Configuration file rather than CLI

What to do

I think for me I’d rather sort out the structure of the code and improve what is already there first. I’m in no rush with this, so the least I could do is make what I have less hacky. This also gives me the opportunity to start working out how I’d rather have the whole thing structured.

I did look at writing a plugin framework so it would be possible to just drop in a module or something similar and it would have the correct information about how to manage whatever it was written to do, but I figured that was a bit beyond me at this time and I have better things to do!

After that I think it will be the configuration file and daemonizing the application. The main reason for this is to identify any issues with it running continually; any issue here would be nice to know about sooner rather than later.

This then leaves more providers and Nagios-type integration, which I’m sure will be fun.

Give it AI!

Once those items are done this leaves Sentinel with one more thing to do: start intelligently working out solutions to problems. Obviously I don’t know the right way to tackle this, but I do have a few ideas.

In my head I think about how I would solve an issue, and inevitably it starts with gathering information about the system. But how do you know what information is relevant to which problems, and how much weighting should it have? Well, for starters I figure each provider would return a score about how healthy it thinks it is. So for example:

A provider for checking the site is available notices that it’s not available; this produces a very high score, say 10000. Sentinel then makes sure it’s got the latest information from all providers on the server. One of those providers is disk, which notices one of the volumes is 67% full, but the thresholds have been set to warn at 70% and 95%, so it sets a score of say 250 and is ranked in a list somewhere to come back to if all else fails.
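The scoring idea in that example could be sketched roughly as follows. This is only an illustration in Ruby: the `Reading` struct, the `disk_score` and `rank` helpers, and every number other than the 10000 and 250 from the example are assumptions made up for the sketch, not anything Sentinel actually does.

```ruby
# Hypothetical sketch: each provider reports a score, higher = less healthy.
Reading = Struct.new(:provider, :score)

# Score a disk from its percentage used and its warn/critical thresholds.
# The values 5000 and 1000, and the "within 5 points of warn" rule, are
# illustrative assumptions only.
def disk_score(percent_used, warn: 70, critical: 95)
  return 5000 if percent_used >= critical
  return 1000 if percent_used >= warn
  return 250 if percent_used >= warn - 5 # close to the warning threshold
  0
end

# Order readings so the most suspicious provider comes first.
def rank(readings)
  readings.sort_by { |reading| -reading.score }
end

readings = [
  Reading.new(:disk, disk_score(67)),
  Reading.new(:web, 10_000),
  Reading.new(:network, 0)
]

ranked = rank(readings)
ranked.each { |reading| puts "#{reading.provider}: #{reading.score}" }
```

The ranked list puts the web provider first, with disk kept further down as something to come back to if all else fails.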

At this point it is unlikely that disk is the culprit; we have to assume that whoever set the thresholds knew something about the system. So more information is needed. It checks the local network and gets back a score of 0: as far as the network provider can tell it’s working fine, as it can get to localhost, the gateway and another gateway on the internet. A good test at this point is to try and work out at which layer of the OSI model the issue is, so one of the actions might be to connect to port 80 or 443 or both and see what happens. Is there a web response or not? If there is, does it have any words in it, or a response code that suggests it’s a known web error like a 500? Or does it not get connected at all?
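A minimal sketch of that web probe, assuming Ruby’s standard `net/http` library. The `classify_probe` and `probe` names, the result labels and the idea of passing in an expected piece of text are all hypothetical; they just turn the questions above into code.

```ruby
require "net/http"
require "uri"

# Classify the outcome of an HTTP probe. `code` is the status code as an
# Integer, or nil when no connection could be made. The labels are
# hypothetical names chosen for this sketch.
def classify_probe(code, body, expected_text)
  return :no_connection if code.nil?
  return :server_error if code >= 500
  return :client_error if code >= 400
  return :unexpected_content unless body.to_s.include?(expected_text)
  :healthy
end

# Fetch the URL and classify the response; DNS failures, refused
# connections and timeouts are all treated as :no_connection.
def probe(url, expected_text)
  response = Net::HTTP.get_response(URI(url))
  classify_probe(response.code.to_i, response.body, expected_text)
rescue SocketError, SystemCallError, Timeout::Error
  :no_connection
end
```

Calling something like `probe("http://localhost/", "Welcome")` would then separate "nothing is listening" from "the web server answers but the application is broken", which is roughly the layer question being asked.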

And so on and so forth. This would mean that wherever this logic exists it has to make associations between results and the following actions. One of the ways to do this is to “tag” a provider with potential subsystems that could affect it, then based on the score of each of the subsystems produce a vector of potential areas to check. Combined with the score it’s possible to travel the vector and work out how likely each is to fix the issue; as and when each one produces a result it either dives into a new, more detailed vector or it doesn’t. It would then, in theory, be possible to start making correlations between these subsystems. So say the web one requires disk and networking to be available, and both the networking and disk require CPU; then it can assume that the web one needs CPU too, and based on how many of these connections exist it can rank it higher or lower, much in the same way a search engine would work.
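One way the tagging and ranking might look, purely as a sketch: the `DEPENDS_ON` table mirrors the web, disk, networking and CPU example, and counting how many providers rely on each subsystem is a very crude stand-in for the search-engine-style ranking. All the names here are hypothetical.

```ruby
# Hypothetical dependency tags: each provider lists the subsystems it needs.
DEPENDS_ON = {
  web:     [:disk, :network],
  disk:    [:cpu],
  network: [:cpu],
  cpu:     []
}.freeze

# Every subsystem a provider relies on, directly or indirectly.
def all_dependencies(provider, graph = DEPENDS_ON, seen = [])
  graph.fetch(provider, []).each do |dep|
    next if seen.include?(dep)
    seen << dep
    all_dependencies(dep, graph, seen)
  end
  seen
end

# Count how many providers rely on each subsystem and sort by that count,
# so the most widely needed subsystem comes first.
def rank_by_dependents(graph = DEPENDS_ON)
  counts = Hash.new(0)
  graph.each_key do |provider|
    all_dependencies(provider, graph).each { |dep| counts[dep] += 1 }
  end
  counts.sort_by { |name, count| [-count, name.to_s] }
end

rank_by_dependents.each { |name, count| puts "#{name}: #{count}" }
```

Here the web provider ends up depending on CPU without anyone having said so directly, and CPU ranks highest because everything else leans on it.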

But all of this is for another day; today is just about saying it’s started and I hope to continue on it this year.

Sentinel Update

May 16, 2012 / Matthew Smith / No Comments

No, really an actual update

So after feeling all bad for not committing any work to Sentinel for a while, I decided to brush off the dust and crack on with some simple bits.

As of right now Sentinel will:

Perform basic checks of *nix processes

Is there a process running?
Is it in a running or sleep state (or other healthy state)?

Basic check of system health

Check disk usage with df
If the disk usage is high, log the offending disk info to the system log
Check memory usage of system

Basic Application health

Perform basic URL grab / scrape for search string
Check the amount of memory the application is using of what it requested & of the system total

Basic actions

If process Zombied / Dead, Kill it

Run as a cron job
Log output to file and to screen in a Log4j-style format
Take options from CLI where appropriate
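As a rough illustration of the disk check in the list above, this is how the output of `df -P` (the POSIX portable format) could be parsed in Ruby. The `parse_df` and `full_disks` helpers, the sample output and the threshold of 80 are assumptions for the sketch, not Sentinel’s own code.

```ruby
# Parse `df -P` output into [mount_point, percent_used] pairs.
# POSIX columns: filesystem, blocks, used, available, capacity, mount point.
def parse_df(output)
  rows = output.lines.drop(1).reject { |line| line.strip.empty? }
  rows.map do |line|
    fields = line.split
    # The mount point is re-joined in case it contains spaces.
    [fields[5..-1].join(" "), fields[4].to_i]
  end
end

# Keep only the mounts at or above the given percentage.
def full_disks(usage, threshold)
  usage.select { |_mount, percent| percent >= threshold }
end

# Made-up sample output; on a real system this would come from `df -P`.
sample = <<~DF
  Filesystem 1024-blocks Used Available Capacity Mounted on
  /dev/sda1 20511312 17434615 2028081 90% /
  /dev/sdb1 103081248 41232499 56589373 43% /data
DF

full_disks(parse_df(sample), 80).each do |mount, percent|
  puts "WARN disk #{mount} is #{percent}% full"
end
```

Keeping the parsing separate from running the command makes it easy to test against canned output.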

This is not too bad; there are a few things still missing from what I’d like it to be able to do before I start testing it out, such as:

Identify the service associated with a process and restart it if Sentinel kills the process
Restart application if URL check fails X number of times
Tidy up disk space based on known “safe” files that can be deleted (like log files over X days)
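The "restart if the URL check fails X number of times" item could be sketched like this. It assumes the failures have to be consecutive, which the list does not actually say, and the `FailureCounter` class is a hypothetical name. Because Sentinel runs as a cron job, the count would also need to be saved somewhere between runs; that part is left out here.

```ruby
# Hypothetical helper: decides when repeated failures warrant a restart.
class FailureCounter
  def initialize(limit)
    @limit = limit
    @failures = 0
  end

  # Record one check result. Returns true when the limit of consecutive
  # failures has just been reached, and resets the count at that point.
  def record(success)
    if success
      @failures = 0
      return false
    end

    @failures += 1
    return false if @failures < @limit

    @failures = 0
    true
  end
end

counter = FailureCounter.new(3)
[false, false, true, false, false, false].each do |ok|
  puts "restart needed" if counter.record(ok)
end
```

In this run the one successful check in the middle resets the count, so the restart only triggers on the final failure.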

That’s all I want to achieve. Once I have that I then need to start with what is a rather tiresome activity which will be the testing of that application to make sure it works in a way that is sensible. After the testing that’s it!

I wish. After I’ve done some testing I will then restructure the code. Although it works as it is, it is not maintainable, and it is becoming harder to write code for it without it being in the wrong place and not quite doing the right thing… so I want to split the code out so it is easier to see what each bit does. I will probably create a few classes to put the code in and make it better; either way I’ll grab some advice from people that know how to program first so I can get it into a reasonable structure.

Other than code layout, there’s the structure of the data I’m trying to store. At the moment I am trying to use a default constructor in Ruby to create a class for my scores, but this is not working out; I’m pretty sure I’ll need to write my own constructor and methods to get and set the values. This way I’ll be able to create a hash to store the data in instead of a list of variables.
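A minimal sketch of what such a class might look like, with its own constructor and a hash behind the getter and setter. The `Scores` name, the method names and the choice to default unknown scores to 0 are assumptions for illustration.

```ruby
# Hypothetical scores container backed by a hash rather than one
# instance variable per score.
class Scores
  # Accepts an optional hash of starting values, e.g. { disk: 250 }.
  def initialize(initial = {})
    @values = Hash.new(0) # unknown scores read as 0 (an assumption)
    initial.each { |name, value| set(name, value) }
  end

  # Store a score; names are normalised to symbols, values to integers.
  def set(name, value)
    @values[name.to_sym] = Integer(value)
  end

  def get(name)
    @values[name.to_sym]
  end

  # A copy of the underlying data, so callers cannot modify it by accident.
  def to_h
    @values.dup
  end
end

scores = Scores.new({ disk: 250 })
scores.set("web", 10_000)
puts scores.get(:web)
puts scores.get(:memory)
```

Adding a new kind of score then needs no new variables or accessor methods, just a new key.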

Then there’s writing a proper fix to my 50 line case statement which I imagine is no small task but hopefully that will be pleasantly challenging.
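One common way to shrink a long case statement is a lookup table of handlers. The post does not say what the case statement actually switches on, so the check names and messages below are invented purely to show the shape of the idea.

```ruby
# Hypothetical handlers keyed by check name; each one is a small lambda.
HANDLERS = {
  "disk"    => ->(value) { "disk usage at #{value}%" },
  "memory"  => ->(value) { "memory usage at #{value}%" },
  "process" => ->(value) { "process state is #{value}" }
}.freeze

# Look up the handler for a check and run it, failing loudly on unknown
# names instead of silently falling through.
def handle(kind, value)
  handler = HANDLERS.fetch(kind) { raise ArgumentError, "unknown check: #{kind}" }
  handler.call(value)
end

puts handle("disk", 67)
puts handle("process", "sleeping")
```

Each new case becomes one more entry in the hash, and the handlers can later be moved into their own classes without touching `handle`.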

Is it worth it?

I think so. Even if no one ever uses this, I am learning more about programming structures, which I think will be rather useful in the long term, as I imagine there may be a time when being able to code to a reasonable standard is needed.

When the restructure of the code is done and the last of the features is there, it will hopefully be a useful application to deploy, and one that saves me from being woken up at 3am to deal with a simple support issue.

So that is basically what I’ve been playing with for the majority of this last week, and I will try to continue on with it for now to at least get the final proof of concept done and see how it works out. At that point I’ll start trying to design it a bit better so it works as a proper application rather than a dodgy script.

Sentinel Update

May 9, 2012 / Matthew Smith / No Comments

I’ve been a little lazy

I had grand plans, which quickly fizzled away when I realised I also had work and life to contend with, but nonetheless I did a minor update to Sentinel at the weekend. It is only slightly more useful than it was before, but it is a little milestone for the project, or at least for version 0.1. From the initial launch there are some minor changes; most importantly, I now have every score I wanted to catch generated. There is still a bit more work to do on getting the scores, but I noticed a number of issues…

Next Steps

The Sentinel project has probably gone as far as it will without me thinking about it. I can probably continue with the rest of the features needed for the 0.1 release and “wing” it to make it work, but quite frankly it’s becoming harder to write the code and remember what I’m meant to get and where it comes from, etc. In short, it all needs restructuring.

I need to create some classes for the various aspects of the code and give some thought to the data structures I need to use. This is sort of unfortunate as I don’t have a clue where to start, but that is also why I started this project: a bit of a learning curve.

The downside is I need to do a lot of reading, so maybe next week’s update will have some content around this subject… maybe not, I make no promises. To be honest this is going to be a short one, mainly as I have another four or five of these things to write in the next 24 hours, so let’s see what happens!