Open files – why limits are set

Everything has a tolerance…

At some point everything has a boundary to its optimal operation, be it a memory allocation, someone's temper or the speed of a car. As with all of these things the limits are set for a reason: you can drive a car at 150+ mph all the time if you like, but the brakes will wear quicker and the engine will wear quicker. The same can be applied to someone's temper; you can push and prod them only so much before they eventually snap and bite your head off.

Sometimes it can be useful to understand what these tolerances are, and once they are understood you can exploit them; but if you don't understand them, you should not be changing the details. One of my pet peeves is people that say "Ergh, why is it limited to 1024 files? It needs more than that, set it to unlimited!" Sigh. Why do these people exist in this world?

Some background

So, web apps such as Tomcat require the use of file handles; in fact everything that runs under Linux does, and without file handles life becomes difficult. To get a better picture of this, rather than searching for file handles, try searching for file types: you'll soon see there's a lot that file handles have to do. As a result, whenever any application gets turned on, in this case Tomcat, it needs to consume a number of these file handles to be able to operate.

When a connection comes into Tomcat, that consumes a file handle. That handle will remain open until it is closed, the application that initiated it is killed, or some timeout occurs (I think it's 2 hours but I'm not sure…). While this connection is active, any subsequent work will consume more handles: reading a config file would trigger another, a network poll another, and so on. So at any point when Tomcat is doing something it is consuming file handles. This is fine, it is normal use; we'll come back to that later…
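
If you want to see this for yourself on a Linux box, you can inspect the handles a running process holds; the PID below is just an example:

    # Find Tomcat's PID (say it comes back as 4242)
    pgrep -f tomcat
    # Quick count of its open file descriptors via /proc
    ls /proc/4242/fd | wc -l
    # Full detail: files, sockets, pipes and so on
    lsof -p 4242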

So with all this in mind, when the application does stuff… it needs resources… it's kinda simple in the way it goes upwards. What seems to escape people is that these resources, much like memory, need to be freed again. You ask for a resource, you use it and you give it back. What happens when you don't give it back? Well, it waits until a timeout or the application is killed. This is a resource leak, which can lead to interesting things from a security point of view; from an operational point of view it can stop the server from responding, or at least it would if the kernel didn't have a hard limit in it. Either way, your box could go horribly wrong if this isn't controlled.
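
As a contrived illustration (a Ruby sketch purely to make the point, nothing to do with Tomcat itself), hoarding handles without closing them eventually exhausts the limit:

    # Keep opening a file without ever closing it; holding the references
    # stops the GC from closing them behind our backs, so sooner or later
    # the process hits its limit and raises Errno::EMFILE.
    handles = []
    begin
      loop { handles << File.open("/etc/hosts") }
    rescue Errno::EMFILE
      puts "ran out of file descriptors after #{handles.size} opens"
    end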

By this point it should be making sense why you would put limits on how many file handles you would want each application to use. Which brings us onto the default of 1024. Why 1024 and not 1000? Every file descriptor takes some memory, and 1024 is a power of two (2^10), which maps neatly onto how the kernel sizes its tables, so it makes a tidier unit than a round decimal number. Moving along, I'd like to say each file descriptor was X k but I don't know the answer (my assumption is 4k, but that's an assumption); either way it is a resource, and you can think of it as having physical mass.

Let’s increase the limits

Cool, by this point you have some background on file descriptors. Now someone says "Let's increase the file limit to 2048". That's twice what it was, but not unreasonable. However, you should still have an understanding of why it was 1024 and why it now needs to be 2048. If you don't, you are just throwing file descriptors at it because no one knows… and this is bad.

Potentially, the application could be leaking file descriptors because someone forgot to close a connection, or because the application doesn't handle closing connections in the expected way.

So the sensible thing is to ask how many are needed. In most cases someone can come back with "in our tests we saw it get to X", and that figure is a good starting point.
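
For reference, on most Linux distributions checking and raising the limit looks something like this (the 2048/4096 values and the tomcat user are examples, not recommendations):

    # Show the current soft limit for open files in this shell
    ulimit -n
    # Raise it for this shell session (cannot exceed the hard limit)
    ulimit -n 2048
    # To make it permanent for a user, add lines like these
    # to /etc/security/limits.conf:
    #   tomcat  soft  nofile  2048
    #   tomcat  hard  nofile  4096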

But setting it to unlimited is bad. If the argument comes back that "it needs lots" or "it's dynamic", etc… rubbish. It's a computer program; it does exactly what it's told to do. It will only ever consume a maximum of X descriptors per triggered event, and it can only handle so many events, so its peak usage is calculable.

Imagine a file descriptor as a brick. If you asked a builder to build you a house and asked how many bricks he would need, I'm going to assume he would come back with a number based on experience and some maths to calculate how many bricks were needed.

I certainly wouldn't expect him to say he needs all the bricks in the world; there's not enough space on the building site to store them. Sure, you can stack them up to a point, then it all falls over; the same happens with the OS when you set it to unlimited. Luckily the OS takes some precautions here by hard-limiting the number of open files to a number that is less than the total memory available on the computer, and in some cases it restricts it even further.

Summary

So in short, if someone says set it to unlimited, they are probably trying to avoid doing work, either in working out what the application actually uses or in fixing a leaking file descriptor; these people need to wake up. It takes the stability of a system from something measurable to an unknown, which is not good.

If you find yourself in a situation where it's out of your control, try to get to the point where the file descriptors are monitored. You can use this to work out an average and some tolerances for what is considered normal usage, and then wait for the application to crash…
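
Even something as crude as the following, dropped into cron, will build up a baseline over time (the process name and log path are placeholders):

    # Append a timestamped count of the process's open descriptors
    echo "$(date +%FT%T) $(ls /proc/$(pgrep -f tomcat | head -1)/fd | wc -l)" >> /var/log/fd-usage.log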

focus… Focus…. FOCUS!

Geesh, how hard is it to stay focused!

In a number of jobs I've had, the projects with a clear focus have always seemed to do well, and those lacking focus tend to do worse. I don't know exactly why, but I imagine it is something to do with having all of your resources running all over the place doing all sorts of other things, without clearly defining or completing a single topic.

This is, to say the least, a little chaotic, and not good for the employees either. I mean, who wants to work in an environment where what you are working on changes daily, so you never finish anything?

It’s good to do lots and to be seen doing lots!

Erm, no; at this point I would really like the noise of a buzzer or some other equally annoying sound. It is true that people like to be seen to be delivering a lot of things, but ask yourself this: are you delivering a lot by doing a lot?

The answer is probably not. There is no point in doing a lot of things if you cannot deliver on any of them; trust me, this route leads to isolation and resentment from the rest of the business. So, in short: don't do it.

I worked in a team about 2 years ago that was run by a manager who thought it was a good idea to say yes to everything and only occasionally say no. The result was that we took on a lot of work, delivered poorly on it, and were late on every project. The damage was that the reputation of the IT team diminished so far that there was no faith left in it, and everyone in the business started seeking their own route to a solution. This is even worse than simply tarnishing the name of the IT department: it leads to many unique systems, none of them integrated, and yet the business keeps coming back to IT to get them integrated, so you end up with what can only be termed bespoke crap hooking it all together. Needless to say, this manager did not last too long (thankfully).

It is better to deliver than not

As I touched on above, people do like rapid delivery, but they like it even more if you can keep doing it, so that the next time they ask they get the same experience. You only get that sort of consistency through decent processes, by thinking through the solutions you are delivering, and by accurately time-boxing them. If you are not delivering the full solution, you are not delivering; if you did not consider the wider implications when scoping the project, you are not delivering.

Don't get me wrong, this is a balancing act: you have to work out the right way of delivering a project that still enables you to (1) deliver it and (2) future-proof the delivery.

If you only churn out 2 or 3 projects a month, but all of them are on time and on budget, that is better than turning out 5 that are all late. Credibility buys the department a lot of trust and leeway when needed, and not delivering on what you said you would harms it (a golden no-no).

Focus

Finally, after a long-winded ramble, we are back where we need to be: focus. Staying focused on the road map and the committed goals helps the team and the department continually deliver the things that are needed, and keeps the reputation intact, which is vital.

So the next time a random "wouldn't this be nice" comes up, think long and hard about whether it is needed now, whether it is worth re-directing resources onto it, and, if you do, whether you are still able to deliver what you committed to. In most cases you are better off planning the work in rather than diverting resources to try and deliver extra.

You need to stay focused on what is already committed before adding in extra work. If you absolutely have to add something in, you need a strategy for coping: are you going to pay overtime to get the work done? Slip something else and hurt the reputation of the team? Negotiate delivering another task slightly later so you can do the new thing?

It's not bad for this to happen; it's just bad to keep asking for the work to be done without thinking about how it is actually going to be delivered.

Summary

Stay focused on what is committed, take on extra tasks after considerable thought and stay focused…

GitHub Applications

I thought I’d have a play…

So with Sentinel I have started to think about what I am going to do about documentation and ticketing. It’s quite important to have some documentation to explain what is going on and it’s very important to have a way of tracking issues and features for the application.

It is probably already clear that I care a little more about the ticketing element than the documentation. I had a play with the GitHub issue tracking application, and it's not as bad as I thought it would be.

GitHub issues

I decided to log a few issues within the sentinel project and see what I could do. I was not holding out much hope, as there really is not much of an interface there to do much with. I persevered anyway, and I was amazed at what I could do with it.

It has a couple of things that enable it to be quite powerful with its minimal interface, Labels and Milestones.

Labels… Unlike most ticketing systems, GitHub doesn't make the user select an issue type or set one as a default, so every ticket starts uncategorised; this is about the extent of the downsides. You can create your own labels and you can apply multiple labels to the same ticket, so for example you could have a ticket labelled with both "bug" and "won't fix". This structure gives you enough to categorise tickets for planning and searching purposes.

Milestones… These are very useful: you set a name for the milestone and can associate tickets with it, and each milestone has a completion date, which is about all you need really. This gives you the ability to work in an agile way: you can create small sprints (milestones) with tickets associated with them and plan your work around that.

So in summary, the issue tracking in GitHub is okay. If you have low volume and do not wish to track further details such as effort, then it's fine, and if I didn't know better, or hadn't installed other applications before, I would probably use it; but for my project I will probably go with something else.

GitHub Wiki

This was rather more disappointing; to be clear, I only used it in markdown mode, but it was still disappointing.

You can basically do headings, text formatting, code blocks, lists, images, links, quotes and horizontal rules. This is disappointing as these are only the very basic commands; HTML would have been better. Here's the rough list of what they cover, out of a much larger set: h1, h2, h3, a, img, pre, li, ol, ul. Needless to say, if you are not tech savvy enough to set up MediaWiki (which is a rubbish wiki IMO) but are smart enough to set up GitHub, then there are other issues. Granted, you would need somewhere to host the wiki… if only you could get wiki hosting free online and find it easily by searching for it. Hang on… Try This

In short, there are better tools out there for the job, also free; don't be lazy, look them up. Failing that, do it in plain text and use ASCII art to format the pages.

Summary

The apps GitHub provides around the wiki and issues are good enough to get started with, but they're not really suitable long term. With very little searching online you would find free hosting of better services, so with that in mind I'm probably going to set my own up at some point.

Valuable vs Rapid change

Change, Change, Change, All Change!

For those of you in IT where a change seems a rare and wonderful thing, this post is probably not for you; if you are implementing fewer than one change a week, this post is not for you…

Where I am, we make in some cases several changes a day across our various systems, all in a bid to keep away the Big Bang of changes (also known as the Service Disrupter). Quite simply, our ethos is many small releases, so the change pattern looks like a sawtooth, not a set of steps.

Just like this:

[Graph: many small releases on the left, one big-bang release on the right (courtesy of Its Tech up north)]

See those tiny little releases? That's the aim; the right-hand side of the picture is when it all goes wrong.

So, back to the topic: lots of changes, all the time. At this point you may think I'm going to deviate into discussing the merits of quick releases versus longer releases, as per the graphic. Well, you'd be wrong; you should all be doing the left side of the graphic.

What’s a Valuable change?

It is all too easy to fall into the trap of Agile = quick, therefore make many changes and use all those rapid changes later to fix things up again. This is not right. That is not to say you shouldn't make a change rapidly if it fixes a service-affecting issue, e.g. reconfiguring Apache; it is to say that you shouldn't release half-arsed.

So a valuable change is a well-thought-out change that adds benefit, is in line with the road map and is well tested. There is no time limit on a valuable change: you could spend months working on it, you could spend minutes. The key is as above: it's on the road map, it's well tested and it adds benefit. If any of these three things is missing, you probably have a rapid change; sorry for your service loss.

A Rapid change!

They are cool, they are quick, everyone thinks they are amazing because of how quickly they made it into the product or service, and they are by far the most pointless changes you'll ever have to make.

Sometimes you will need to make rapid changes, e.g. one of the nodes in your cluster isn't working, so you need to make a change to remove it. This is not in line with the road map; it is not even a sensible change in the bigger picture, but it is absolutely necessary in the short term.

So rapid changes are fine to fix service affecting issues, you have to keep the service running, no excuses.

Rapid changes are ugly: often poorly documented, poorly thought out and poorly designed. So they should not be a method for implementing long-term changes, e.g. a new way of logging in, a new graphic, a new way of balancing traffic, etc.

Rapid changes always produce technical debt, whereas valuable changes do not. So the surefire way to identify a rapid change is by the fact it has technical debt.

Slightly off topic: I wonder if it's worth graphing the tickets that produce technical debt and linking those back to the change that produced them. You would then be able to show how much effort was created by a rapid change, categorically prove how bad it was to do, and therefore discourage them in the future…

Summary

You will always have some element of rapid change; the aim is to keep the number of rapid changes as small as possible. As stated, they always produce technical debt, and that is really bad for you, really bad. Where possible, even in a crisis, you should try to come up with solutions that are valuable and do not produce technical debt.

This should always be your aim. For the sake of another couple of minutes thinking about a solution, do it; it may save you a lot of technical debt and reduce the pain of supporting a system that is not as ideal as it should be.

Sentinel Update

No, really an actual update

So, after feeling bad about not committing any work to Sentinel for a while, I decided to brush off the dust and crack on with some simple bits.

As of right now Sentinel will:

  • Perform basic checks of *nix processes
    • Is there a process running?
    • Is it in a running or sleep state (or other healthy state)?
  • Basic check of system health
    • Check disk usage with df (see the sketch after this list)
    • If disk usage is high, log the offending disk info to the system log
    • Check memory usage of system
  • Basic Application health
    • Perform basic URL grab / scrape for search string
    • Check how much memory the application is using, both against what it requested and against the system total
  • Basic actions
    • If a process is zombied / dead, kill it
  • Run as a cron job
  • Log output to file and to screen in a log4j-style format
  • Take options from CLI where appropriate
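
For the disk usage item above, a minimal sketch of the sort of df-based check described might look like the following. To be clear, this is illustrative rather than Sentinel's actual code, and the 90% threshold is my assumption:

    require "syslog"

    THRESHOLD = 90  # percent used before we complain; pick your own number

    Syslog.open("sentinel") do |log|
      # `df -P` gives POSIX output, one filesystem per line
      `df -P`.lines.drop(1).each do |line|
        fs, _size, _used, _avail, pct, mount = line.split
        usage = pct.to_i
        if usage >= THRESHOLD
          # Log the offending disk to the system log, as per the feature list
          log.warning("disk usage high: %s on %s at %d%%", fs, mount, usage)
        end
      end
    end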

This is not too bad. There are a few things still missing from what I'd like it to be able to do before I start testing it out, such as:

  • Identify the service associated with a process, and restart the service if Sentinel kills the process
  • Restart application if URL check fails X number of times
  • Tidy up disk space based on known "safe" files that can be deleted (like log files older than X days)

That's all I want to achieve. Once I have that, I need to start on what is a rather tiresome activity: testing the application to make sure it works in a way that is sensible. After the testing, that's it!

I wish. After I've done some testing I will restructure the code; although it works as it is, it is not maintainable, and it is becoming harder to write code for it without it ending up in the wrong place and not quite doing the right thing. So I want to split the code out so it is easier to see what each bit does. I will probably create a few classes to put the code in and make it better; either way, I'll grab some advice first from people who know how to program, so I can get it into a reasonable structure.

Other than code layout, there's the structure of the data I'm trying to store. At the moment I am trying to use a default constructor in Ruby to create a class for my scores; however, this is not working out, and I'm pretty sure I'll need to write my own constructor and methods to get and set the values. That way I'll be able to store the data in a hash instead of a list of variables.
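
For what it's worth, a minimal sketch of that idea might look like this; the class and method names are hypothetical, not Sentinel's actual code:

    # A scores class with an explicit constructor and simple get/set
    # methods, backed by a hash rather than a pile of variables.
    class Scores
      def initialize
        @scores = {}
      end

      # Set (or update) a named score
      def set(name, value)
        @scores[name] = value
      end

      # Fetch a named score, defaulting to 0 if it was never set
      def get(name)
        @scores.fetch(name, 0)
      end
    end

    scores = Scores.new
    scores.set(:disk, 7)
    puts scores.get(:disk)  # => 7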

Then there's writing a proper fix for my 50-line case statement, which I imagine is no small task, but hopefully it will be pleasantly challenging.
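
One common way to shrink a long case statement is to dispatch through a hash of lambdas. A hedged sketch (I'm guessing at what the cases do; the names are made up):

    # Each key maps straight to the code that handles it, so adding a new
    # check is one line instead of another when/then branch.
    CHECKS = {
      "disk"    => ->(target) { puts "checking disk on #{target}" },
      "memory"  => ->(target) { puts "checking memory on #{target}" },
      "process" => ->(target) { puts "checking processes on #{target}" },
    }

    def run_check(name, target)
      handler = CHECKS[name]
      abort "unknown check: #{name}" unless handler
      handler.call(target)
    end

    run_check("disk", "localhost")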

Is it worth it?

I think so. Even if no one ever uses this, I am learning more about programming structures, which I think will be rather useful in the long term, as I imagine there may come a time when being able to code to a reasonable standard is needed.

When the restructuring of the code is done and the last of the features is in, it will hopefully be a useful application to deploy, and one that saves me from being woken up at 3am to deal with a simple support issue.

So that is basically what I've been playing with for the majority of this last week. I will try to continue with it for now, at least to get the final proof of concept done and see how it works out; at that point I'll start trying to design it a bit better, so it works as a proper application rather than a dodgy script.

Universal troubleshooting

Simples!

Recently I've been going through some "interesting" times with my lower back. It all started several months back (October 2011), and it was only about a month ago that someone started to do tests, actual tests that would prove or disprove the situation. To be honest, it's a little frustrating given that troubleshooting is my job: I'm very used to working with and understanding many different technologies, using personal experience and gut feel for what the root cause of an issue might be. It struck me as a little odd that the physiotherapists took roughly the same approach, except that before proving what the problem was, they would try a cure.

That is kind of like saying, "I can see you can't connect to the internet, so I'm going to call your provider and make sure your bill is up to date". Very strange. This approach to troubleshooting is something I'm going to term "Following the light".

Following the light

What do I mean by following the light? If you’ve ever had a cat and a torch you probably have a picture of what I’m thinking, else have a look at this.

You've seen a small hint of something that could possibly be the cause of the problem, and in your finite wisdom you decide to follow the first credible route until it is proven not to be true, at which point the next credible route is followed. Great: eventually you will solve the problem, by which point the service will have been turned off, or, if you had my physiotherapists, I'd be dead.

This is not to say that the first thing you stumble upon cannot be the cause of the issue; it can. But you are now trying to fix something that may not actually be the problem.

This isn't a bad way to problem solve, it just takes forever. So there's another way, which when everything goes well is by far the quickest problem-solving technique: "Scatter gun".

Scatter gun

Okay, another weird phrase. Imagine someone trying to hit a bank note with a gun from 20 paces away. A good shot would hit it right away, every time; an average shot (which is where most people are with problem solving) misses 9 times out of 10, so they would be better off with a shotgun, as it's the only way they'll hit the target every time. Unfortunately, the rest of the shot misses.

With this approach you dive into a problem, get the first 2 minutes into the problem description, and you already have the problem sorted; except you stopped listening 2 minutes into a 5-minute problem description. Take the following example.

The last time I went to my favourite website it asked me to change my password, so I did; I definitely changed it to meet the security requirements, and I continued to browse around. It worked fine for a week or two, and then when it prompted me for the password again it wouldn't accept the new one. Any idea what the problem is? Seems pretty clear: Error 18 (the error is 18″ away from the computer), PEBKAC, PICNIC, whatever you want to call your generic user error. Continue reading…
Just by chance I tried my old password and it worked fine. What I don't understand is why they would ask me to change my password and then forget to save it? Seems odd, but either way I can log in again now. Cool, so by taking the scatter gun approach you just missed a security breach. Nice one ;)

To be honest, that could happen to anyone, but the scatter gun approach of try this, try that, try something else may only work every now and then; it's not an efficient way to get to the solution, and in theory it could take even longer than "Following the light".

Which leads onto the right way. How about a "Sniper"? It's a lame name, but it's late (at time of writing, anyway) and it's the best I can come up with…

Sniper

Snipers will sit quietly for days at a time waiting for a target, then attack at the most opportune moment, relying on instinct and experience to choose that time. By listening to the whole problem and asking some simple questions you get a fuller picture of what is going on; if you have experienced similar issues, your experience will play a part and your gut instinct will fill in the gaps. This doesn't mean you now grab the shotgun and shoot at the target. Is this the right problem? Does it match the use case? Based on your understanding, could it plausibly cause the described problem? If so, pick up the rifle and fire off a shot. There's a better than average chance you've solved the problem, and if you haven't, you have ruled out the most likely cause based on the current evidence.

What next? You go after the next most likely cause based on the new set of information, and rinse and repeat. The most important step is to re-base your decision on the failed likely outcomes. Now, what do you do when you have no idea what the problem is? Well, you need to start somewhere, so rule something out. If you have the opportunity to do something simple that would rule out multiple possibilities, do it. Then rinse and repeat, always re-basing on the new evidence gathered. It sounds like the scatter gun approach, and it is, but with one difference: you have listened to the whole problem, and based on your experience and gut instinct you have logically chosen the most likely cause. You are not shooting in the dark or following a white rabbit.

Summary

If my physiotherapists had re-based their assumptions on new data, I'd have been treated sooner. Instead I had to go and see a surgical consultant, who worked out what was wrong by listening to the problem, asking some simple questions and hitting me with a little hammer on each foot: one leg reflexes, the other does not. He suspected a prolapsed disc based on what I had said and my symptoms, and then set about doing a number of tests to prove the hypothesis; when one test didn't show the problem, a new test was tried.

Simples.

Sentinel Update

I’ve been a little lazy

I had grand plans, which quickly fizzled away when I realised I also had work and life to contend with, but nonetheless I did a minor update to Sentinel at the weekend. It is only slightly more useful than it was before, but it is a little milestone for the project, or at least for version 0.1. Since the initial launch there are some minor changes; most importantly, every score I wanted to capture is now generated. There is still a bit more work to do on getting the scores, and I noticed a number of issues…

Next Steps

The Sentinel project has probably gone as far as it will without me thinking about it. I can probably continue with the rest of the features needed for the 0.1 release and "wing" it to make it work, but quite frankly it's becoming harder to write the code and remember what I'm meant to get and where it comes from, etc. In short, it all needs restructuring.

I need to create some classes for the various aspects of the code and give some thought to the data structures I need to use. This is sort of unfortunate, as I don't have a clue where to start, but that is also why I started this project: a bit of a learning curve.

The downside is that I need to do a lot of reading, so maybe next week's update will have some content around this subject… maybe not; I make no promises. To be honest, this is going to be a short one, mainly as I have another four or five of these things to write in the next 24 hours, so let's see what happens!

Dealing with Technical Debt

A brief history

Last week I touched on dealing with rapid change, and I kinda got carried away and exceeded my self-imposed 800-word limit. So the point of today's blog is capturing technical debt, and then how best to deal with it, so you don't end up in a situation where the service starts suffering because of tactical decisions made without suitable foresight.

So, to summarise what technical debt is for those that did not read last week's post (shame on you): technical debt is what you accrue when you have to make a change that does not head towards the overall goal, but instead causes additional headaches for managing the environment or somehow limits you in achieving the overall goal of a perfect system.

Capturing technical debt

My personal preference for tracking technical debt is a ticketing system; just create a way within it to easily identify what is technical debt and what is a normal task. You don't have to use a ticketing system, you could just write it on paper or in a spreadsheet or whatever; the important thing is that you capture it with a reasonable amount of detail.

Avoid the temptation to fill the ticket with a lot of information; just put in enough to explain what the problem is and why it needs to be fixed. If you have some suggestions on how it could be fixed, add them to the ticket, but don't worry about saying "this is the solution"; that can be done when the ticket is actioned.

As a rule of thumb, try to make sure you capture the following (a hypothetical example follows the list):

  • What is the issue?
  • Why is it an issue?
  • Any future tasks that are dependent on it
  • How long the task will take
  • A rough priority
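
A captured ticket needs to be no more complicated than this (the content here is entirely made up, purely to illustrate the fields):

    What:     Replace the hand-rolled config parser shared by three scripts
    Why:      Each copy has drifted; two outages traced back to the mismatch
    Blocks:   The planned move to a single deployment pipeline
    Effort:   ~2 days
    Priority: Medium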

These things will help prioritise the task when it comes to planning the next set of tasks and projects to be actioned, but they won't really get it prioritised. Why? Because it will never be the focus of the business to do the boring work that makes things stable, unless there are issues or it has to be done as a dependency of another task.

Actioning technical debt

As I pointed out before, the business will never prioritise technical debt unless there is a real reason to do so: service stability or a dependency of another task. This is a fact of life, and if you've found yourself in this situation, getting frustrated about the hundreds of days of backlogged technical debt that are accruing, panic no more.

The point of capturing the tickets with the right information is that the dependencies are clear and the outcomes of not doing the work are clear; this makes it easier to discuss as a business requirement. That is not enough, though: you will get some tasks done, but you will not be decreasing the technical debt; it will continue to increase.

You need to create an environment in your immediate team, and the extended team, where people understand why the technical debt needs to be actioned. This will be easier than convincing the whole business of why it is important. Once you have the buy-in of the teams, everyone will hopefully understand why it is important to keep the technical debt to a minimum, and it will take a higher priority when it comes to scheduling tasks. It is also a good idea, as a group, to agree an amount of technical debt that you are happy to live with, calculated from the effort in days required to deliver each task. This is a surefire way of getting technical debt actioned, and it will help ensure that it remains at a sustainable level.

There's always a risk that you've tried all of the above and still not gotten anywhere: the technical debt keeps rising and the environment continues to get more and more complicated to work in. Simple: do not worry about it. You did what you were meant to, you raised the issues, you pointed them out to your boss; you can kick back and relax, maybe even take deep joy in the moment when it all fails and someone asks why, and you just point out the years' worth of technical debt that has ground your system to a halt. In short, it's not your problem, it's your boss's; you just need to make sure you capture it and raise it.

Summary

Technical debt is bad. If you're not aware of yours, you need to be; you need a mechanism for dealing with it, and if you don't have one your systems will grind to a halt, and you will probably be one of those companies that rebuilds its entire infrastructure every 3-5 years because the technical debt was so large the environment became unmanageable.

DNS results in AWS aren’t always right

A bit of background

For various reasons that are not too interesting, we have a requirement to run our own local DNS servers, which simply hold the forward and reverse DNS zones for a number of instances. I should point out that the nature of AWS means this approach is not really ideal, specifically if you are not using EIPs, and there are better ways; however, thanks to various technologies it is possible to get this solution to work. Just don't overlook the elephant in the room.

What elephant?

A few months ago, while doing some proof of concept work, I hit a specific issue relating to RDS security groups, specifically where I had added the security group that my instance was in to grant it access to the DB. One day, after the proof of concept had been running for a number of weeks, access to the DB suddenly disappeared for no good reason, and we noticed that adding the public IP of the instance to the RDS security group restored access. Odd. The issue happened once and was not seen again for several months; then it came back. Odd again. Luckily the original ticket was still there, and another ticket was raised with AWS, to no avail.

A bit of a diversion here: if you are using Multi-AZ RDS instances, you can't afford to cache the DNS record; at some random moment it may flip over to a new instance (I have no evidence to support this, but also can't find any to disprove it), so the safest way to get the correct IP address for the DB instance is to ask Amazon for it every time. You can't simply take whatever the last IP returned was and set up a local hosts file or a private DNS record for it; that's kinda asking for trouble.

So we had a DNS configuration that worked flawlessly 99.995% of the time, and at some random, unpredictable moment it would flake out; it was just a matter of time. As everyone should, we run multiple DNS servers, which made tracking down the issue a little harder… but eventually I did. The results we got back depended on which of our name servers the instance queried, and on how busy AWS's name server was when our name server queried it. Occasionally one of the name servers would return the public IP address for the RDS instance, causing the instance to hit the DB on the wrong interface, so the security group lookup within RDS was failing; it was expecting the private IP address.

The fix

It took a few minutes of looking at the DNS server configuration, and it all looked fine; if it were running in a corporate network it would have been fine. But it is not: it's effectively running inside a private network which already has a DNS server running split views. The very simple mistake was the way the forwarders had been set up in the config.

See the following excerpt from here

forward
This option is only meaningful if the forwarders list is not empty. A value of first, the default, causes the server to query the forwarders first, and if that doesn’t answer the question the server will then look for the answer itself. If only is specified, the server will only query the forwarders.

The forward option had been set to first, which for a DNS server in an enterprise is fine: it will use its forwarders first, and if they don't respond quickly enough it will look up the record via the root name servers. This is typically fine when you're looking up a public IP address, as it doesn't matter; however, when you're looking up a private IP address against a name server that uses split views, it makes a big difference in terms of routing.

What we were seeing was that when the AWS name servers were under load and not able to respond quickly enough, our name server got a reply from the root name servers, which could only supply the public IP address. Our instance would therefore route out to the internet, hit Amazon's internet router, turn around and hit the public interface of the RDS instance on its NATed public IP, and thus was not seen as being within the security group. Doh!

Luckily the fix is easy: set it to "forward only" and ta-da. It may mean you have to wait a few milliseconds longer now and then, but you will get the right result 100% of the time. I think this is a relatively easy mistake to make, but it can be annoying to track down if you don't understand the wider environment.
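
In BIND terms the change is a one-liner in named.conf; in the sketch below, the forwarder address is AWS's VPC resolver and is only an example:

    options {
        // Only ever ask the forwarders; never fall back to the root servers
        forward only;
        forwarders { 10.0.0.2; };
    };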

Summary

Be careful: if you're running a DNS server in AWS right now, I suggest you double-check your config.

It is probably also worth learning to use "nslookup <domain> <name server ip>" to help debug any potential issues with your name servers. Be aware that, because of the nature of the problem, you may not see it for a long, long time; seriously, we went months without noticing any issue, and then it just happened. If you're not monitoring the solution, it could go unnoticed for a very long time.
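
Something like the following, with a made-up RDS endpoint and name server addresses, lets you compare what each of your name servers is handing out:

    # Query each internal name server directly (all names/IPs hypothetical)
    nslookup mydb.abc123.eu-west-1.rds.amazonaws.com 10.0.0.10
    nslookup mydb.abc123.eu-west-1.rds.amazonaws.com 10.0.0.11
    # One returning a private 10.x address while the other returns a public
    # address is exactly the intermittent failure described above.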