The Bigger Picture Blog

  • FinOps and Cyber Security

    I had a great time at AWS Summit London 2023 earlier in the year. Saw loads of old friends and met new ones.

    You may well ask what my favourite thing was. Actually, it was the talk How to implement AWS cost optimization strategy that works by Steph Gooch from AWS and David Andrews from Just Eat.

    Why? After all, FinOps has only a limited relationship to security, unless someone is running a crypto miner.

    Because I see that FinOps and Cyber Security (and Data Governance) share a lot of the same ways of working. That being, trying to convince developers to do things that are not always on their journey to delivering a useful product.

    I’m ok with that too. Security is on its own journey and, being pragmatic, it needs to be added at the point it’s needed. But our job is to check in with the teams as close as possible to the time when it’s needed.

    What I enjoyed most about the talk is that the language used is not tainted by years of treating humans as the problem rather than the solution.

    Securing IT systems has been around a long time and has its roots and language in the military and academia, which aren’t necessarily reflective of how a modern business operates. For security to be effective we need to reduce the fear of security, and reducing aggressive language is one way to improve adoption.

    FinOps does not have any of this baggage being at most a few decades old. So it’s refreshing to see how a relatively clean slate can work and inspire developers to care about the thing that you care about.

  • Phishing Simulations and Culture

    My good friend Joel wrote a widely shared post on why phishing simulations don’t work. This itself follows on from excellent discussions with our good friend Emma, on how phishing cannot be over-come by will-power.

    But the area I think has been missing from the discussion so far is what phishing simulations mean for your business culture.

    Businesses these days strive for inclusivity and diversity. When done well this increases innovation and improves your business’s ability to adapt. But improving diversity in a business doesn’t happen in isolation. It’s supported by providing employees with psychological safety, the ability to identify exclusionary thinking, and tools to report and respond to it.

    This culture is important to Cyber Security too. We must create a culture where people feel safe to report suspicious activity, even (or especially) in cases of uncertainty.

    I wanted to contrast this with the way phishing simulations are often managed, and give a sense of the effects on employees, focusing on those subjected to them.

    Phishing simulations, first off, create a culture of fear around accessing e-mail. This expresses itself as stress. I think this is particularly acute where your role requires you to have access to more systems. This stress does not help with the feeling of psychological safety within a business.

    Also a phishing simulation is most effective when someone slips up and clicks on a link in an e-mail. Joel and Emma have discussed the reasons for this, but what happens after this is also important.

    Typically the user will be directed to a page advising them of their mistake, and possibly offer some advice. That’s the best case scenario.

    I know in some businesses the follow-on is getting a talking to from a manager, or being sent on mandatory training. Perhaps different people or teams within the business are ranked against each other in the hope that this will cause them to perform better.

    Really this pits one group of people against another creating a sense of ‘otherness’, adding to stress and lack of inclusivity.

    This is additional stress. Stress makes people more liable to make mistakes. The stress of an exam does not make you better able to do well at the exam. Except the exam is happening every time you open your inbox.

    Think I’m exaggerating? Let’s take email out of this entirely. This is the equivalent of working in a post room. You get post in. You put it in the right pigeon-holes, and you deliver it to the people who need it. Your line manager decides to put in some dummy post, to check how effective you’re being. They check if the dummy post is delivered to the right place and in a way the manager is happy with.

    I hope no one has experienced this, but it’s disrespectful and distrustful. I would call this micro-managing.

    I know from my own experience in phishing simulations, it’s like working with a chance of an axe jumping out at you at any time.

    We should be looking at ways of reducing that stress, and increasing trust. Allowing people to come forward with concerns about their e-mail. Not having a distrustful relationship with the Cyber Security team that should be helping them.

    Let’s allow people to make mistakes, and give them the clarity of thought to identify the mistake and learn from it. Not telling them off for doing something wrong.

  • The Reservoir of Good Will

    The ideal outcome for a Cyber Security team is to make the teams within the business more aware of the risks they carry in their work, and for those teams to react and change their processes to manage the risks better.

    This change in process usually (although not always) requires the team to either add extra steps, or increase cognitive load, and slows their work. Having to do this work, separate from the work itself, damages the relationship a little with Cyber Security teams.

    A busy team, who are struggling with their existing workload, may not appreciate you cheerfully turning up and adding to their existing work.

    A team’s ability to take on additional work is what I like to call the Reservoir of Good Will.

    To explain in more detail, a reservoir is a body of water used for storage. The level goes up as more water enters it, such as rain or snow, and it is drained to produce a valuable resource, such as hydro-electric power or drinking water.

    Good will is the concept of doing something with no expectation of getting something in return.

    The Reservoir of Good Will is an analogy to allow Cyber Security teams to think about how making requests of a team affects the relationship with that team.

    When a Cyber Security team makes a request of a business, it takes the team away from doing the thing they’re expecting to do, leading to a level of frustration. It drains some good will from their reservoir.

    Some teams have a larger reservoir. They have more capacity to take on the inconvenience of the thing you ask. Some have a smaller reservoir, and may struggle.

    As people practicing Cyber Security, our responsibility is to ensure we keep the reservoir of each team topped up. We can take sips or gulps from the reservoir, but we must be mindful of how much it is drained.

    However, if the reservoir gets too full, we are not fulfilling our responsibilities to the business of keeping it as secure as it could be. We will always need to dip into that reservoir a bit.

    We need to keep that reservoir at a good level, and to do that we need to understand where that level is. We must measure it.

    Therefore we must be cautious, poll the levels, and make sure our actions are kept at the right amount to sustain each team, and we need to understand the size of our activities to see if they are going to be too big of a glug.

  • Keys, Secrets, and Tokens

    I have spent some of my time over the last 12 months working out how to improve our response to, first, the Heroku leaks, followed later by the LastPass leaks, closely followed by the CircleCI leaks. Each of these took a significant amount of time to resolve, time that could have been spent improving the features and capability of our main product.

    Something I identified as part of this response was that we, as engineers, have several names for things that are very closely related, namely Keys, Secrets, and Tokens; there might even be others. I think any seasoned engineer will recognise that they do refer to different things, after all, we have AWS KMS and AWS Secrets Manager. But an IAM User has both keys and secrets, and that key is a different kind of key to a KMS key.
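
    To make the overlap concrete, here is a minimal sketch using boto3. The secret name, KMS alias, and IAM user name are placeholders of my own; this is only an illustration of how the three terms surface in AWS, not part of the proposal itself.

```python
# A sketch of how "keys", "secrets", and "tokens" overlap in AWS (boto3).
# The secret name, KMS alias, and IAM user name below are placeholders.
import boto3

# A Secrets Manager "secret": an opaque sensitive value a system uses at runtime.
secret = boto3.client("secretsmanager").get_secret_value(SecretId="example/api-token")
print(secret["SecretString"])

# A KMS "key": not a credential at all, but a managed encryption key.
kms_key = boto3.client("kms").describe_key(KeyId="alias/example-data-key")
print(kms_key["KeyMetadata"]["KeyId"])

# An IAM access key pair: the "key" is an identifier, the "secret" is the credential.
access_key = boto3.client("iam").create_access_key(UserName="example-ci-user")
print(access_key["AccessKey"]["AccessKeyId"])  # the "key" half
# access_key["AccessKey"]["SecretAccessKey"] is the half that needs protecting.
```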

    Having a common language reduces change fatigue and rework when everyone who thought they understood a problem discovers that they didn’t. This way we can have better confidence in the outcomes of the problem we’re discussing.

    I needed a way to refer, without significant friction, to values that are used by systems to interoperate (i.e. not a password), are sensitive, and need careful management.

    I decided that the best word to encompass this was Secret. It seemed like it had more meaning, in a more succinct way than any of the other terms available.

    I proposed this to our business via our usual engineering approvals board and it got accepted.

    That’s not to say that the usage of the words Key or Token is banned, but that Secret encompasses them all. The word Secret could also mean a key or a token. Something sensitive that needs protecting.

    There are likely differences in each business that may mean that Secret isn’t the right word for you, but having at least an awareness of how these words mean similar, or different, things can be helpful.

  • Social Marketing

    Cyber Security is not something that many engineering teams consider automatically, in the same way that engineering teams rarely consider the data governance or legal implications of the product they are looking to build. The team is there to solve user stories, real or imaginary, and external factors, such as security, may only be considered once the team is already on the journey to delivery.

    As security people we recognise that directing changes early has the lowest impact on delivery. But making cyber security integral to the delivery path by having a dedicated person on each team gets expensive fast, either as additional effort for each team, or as gatekeepers blocking releases.

    Instead cyber security teams must accept that security will be included in the delivery at some point and make it our responsibility to intervene and educate engineers so that they can effect changes, or ask questions, as early as possible.

    This thinking made me want to explore how to create more intervention opportunities by improving communication in the early stages of a delivery.

    One industry that I realised is good at making people change their thinking is marketing.

    Marketing, to me, has a strong association with unethical practices, but tweaked in the right way I wanted to see if there might be some re-usable techniques to change people’s thinking about cyber security.

    I reached out to my friends in NCSC because they have a responsibility to do exactly this: communicate with the general public to suggest ways to improve their security so that money and people’s data don’t get stolen.

    From there I was directed towards some academic research titled Can we sell security like soap? which is a wonderful analogy.

    The thing that I liked though is the concept of Social Marketing. As a side note, I want to be clear this is different from Social Media Marketing, and while looking for books on Amazon even it gets the two mixed up.

    Unfortunately I’ve not yet been able to identify a set of guidelines that help me. A lot of what I’ve read so far seems to originate from academia without much grounding it in the practical working world.

    But perhaps that’s the point. As people working in Cyber Security our responsibility is to take what has been written and turn it into something consumable that will make others more effective.

  • Security and Inclusivity

    I do a lot of thinking about second and third order effects, and how poor security controls impact the culture of a business and, in particular, can reduce inclusivity.

    By way of an example, the common Cyber Security control that most annoys me is the way accounts expire after not being used for some time. Without digging into the background too much, it comes about because in business it’s often difficult to have a reliable leavers process. This leads to user accounts existing long after their owners have left the business.

    Automatic blocking of accounts will always be inexact and this is the cause of inclusivity problems. It will catch common absences from work, such as maternity leave or stress-related leave.

    The true answer in this scenario is to get accounts blocked as part of the leavers process, making it as important as ensuring they’re no longer being paid. This will have a big impact on reducing the risk to your business.

    There will still be edge cases, and it’s worth including additional steps to detect long-lived accounts. You can also add a human step of verifying whether the account is still needed. You could even automate this by sending an e-mail to a line manager that explains the risks and impact, asking them to click a link if the account can be blocked.
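
    A minimal sketch of that automated nudge, assuming a hypothetical list of dormant accounts, an internal SMTP relay, and an intranet confirmation link; all the names and addresses are placeholders rather than a real implementation.

```python
# A sketch of the "ask the line manager" nudge described above.
# The dormant account data, SMTP relay, and confirmation URL are placeholders.
import smtplib
from email.message import EmailMessage

# Hypothetical input: accounts unused for a long time and their managers.
dormant_accounts = [
    {"user": "a.example", "manager": "manager@example.org", "days_idle": 120},
]

def notify_manager(account: dict) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"Is the account '{account['user']}' still needed?"
    msg["From"] = "security-team@example.org"
    msg["To"] = account["manager"]
    msg.set_content(
        f"The account '{account['user']}' has not been used for "
        f"{account['days_idle']} days. Unused accounts carry a risk of misuse.\n"
        "If it is no longer needed, please confirm here and we will block it:\n"
        f"https://intranet.example.org/block-account?user={account['user']}"
    )
    with smtplib.SMTP("smtp.example.org") as smtp:  # placeholder relay
        smtp.send_message(msg)

for account in dormant_accounts:
    notify_manager(account)
```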

    Imagine being the person coming back to work after time off with stress, having to go through a support number, prove your identity, and get your account unlocked. Turning that excitement to disappointment. Does that represent your business? Should that represent your business?

    And this is far from the only control that has this kind of impact.

    When presented with a risk, as IT people we think about the controls we have available to us and the mitigation may seem obvious. It takes time and research to identify the impact of a control, and further time and work to fix the underlying problems. But without that work the change is likely to affect the culture of the business.

    This makes it one of the true Cyber Security responsibilities to ensure that security controls are reasonable, equitable, and justified.

  • Safety and Security

    It is dark. You hear a beep.

    It is dark. Another beep.

    You check the time. It’s almost 2.30am.

    Another beep.

    What is that?

    Is it from outside?

    You open a window and wait.

    You hear a beep.

    It’s not outside. That must mean it’s inside. That means it’s your problem.

    Shit.

    There are lots of things it could be. They’re all over the house. From the fridge to the carbon monoxide detectors.

    Bleary eyed you stalk around the house, stopping occasionally to see if you can hear the direction of the beep.

    Going in to one room, doubling back on yourself.

    Eventually. A beep above your head. A smoke alarm. Of course it is.

    Your child is asleep in the next room. He doesn’t sleep well. You have to silence the smoke alarm quickly and quietly.

    What does the beeping mean? The house is not on fire. You happened to have gone around the house already.

    The beeping could mean anything. You don’t know the cause. Your mind races. The smoke alarm is mains powered. The smoke alarm could have lost power. Did you do anything recently that might have caused the smoke alarm to have lost power? Has there been a power cut? Could the battery have lost power? Does it have a battery? Should the battery have lost power if it’s been running off mains? How would the smoke alarm know the battery is dying if it’s mains powered and isn’t using it?

    You check online for anything else it could be. It’s 3 am. You want to sleep, not diagnose a beeping smoke alarm that will wake up your family.

    There are loads of websites giving you lots of options depending on the type of smoke alarm you have and the kind of beep it’s making. You don’t know the make and model of your smoke alarm; they came with the house.

    Apparently it’s common for batteries to start failing during the night when the temperature drops, causing the battery to be less effective.

    All the smoke alarms are on their own circuit. You could turn the whole thing off and worry about it in the morning. It seems unlikely there would happen to be a fire just then. But would that make things worse? Would the smoke alarms keep working off their batteries and drain them faster, causing more of them to beep? Would it set all the smoke alarms off as a failsafe?

    Your best bet is to replace the battery. Often people don’t know that a beeping smoke alarm means the battery needs replacing.

    Is there a spare? Yes. Fortunately. Let’s not think about what you would do if that wasn’t the case.

    It’s also fortunate this is not one that needs a massive ladder to reach.

    Now how do you replace the battery? How do you open the smoke alarm?

    Written on the smoke alarm is an obscure instruction about pushing a screwdriver in to a hole on the side. Looks like you’ll need to find a screwdriver then. Why do you need to find a screwdriver at 3.30am? Fortunately you know where one is.

    You follow the instructions. The smoke alarm is slightly too close to the wall and the screwdriver almost doesn’t fit. You still cannot work out how to get the old battery out. Your arms are so tired from holding them above your head for so long.

    You’re frustrated, angry. You lose it. You pull on the case. Maybe there is a way in. After easing it a bit, it comes away.

    No access to the battery.

    Fuck.

    At least now you can see how you would get to the drawer that holds the battery.

    After a bit of prying the drawer pops out and the battery drops out of it.

    Replace the battery. Put the drawer back in.

    No beeping?

    Go back to bed.

    The adrenaline is surging.

    You’re waiting for the next beep. Was that a beep?

    You stay awake until your son wakes at 6.30am.

    The next morning

    Your electrician advises that there are very high voltages in the smoke alarm due to the electronics used to check for smoke. Removing the case was dangerous. It is best to replace the unit.

    Given no one knows the age or history of any of the smoke alarms and that you have a young family you make the decision to replace them all.

    The electrician recommends the most popular brand. It’s the exact same model you already have.

    At least you know how to replace the battery now.

    Coda

    In the UK the regulations used to say that, when adding a loft extension to your house, all doors had to be replaced with self-closing fire doors.

    People found having all the doors in the house closed inconvenient, which led to people propping the doors open. So the regulations were changed to require smoke alarms in all rooms that are linked.

    This is incredibly sensible and pragmatic advice that responds to people’s real-world usage of safety equipment.

    When talking about security I regularly like to frame it to the world of safety. Safety is tangible and people like talking about safety. But I recognise that they’re often not directly comparable.

    Safety equipment is often tied to regulations that can be complex and expensive to change. But safety equipment can also have the same failings as a lot of security products do. Poor documentation. Poor user experience, especially when things go really wrong. Poor quality alerts.

    If you want to build good, reliable, and accurate equipment you need to think about how it fails, and how your users will overcome the failure.

    Because right now I don’t want smoke alarms in my house. I would rather have a relaxing and peaceful night’s sleep. I won’t be alone in that motivation.

    I wish this story was not based on real events.

  • Reminiscing over the OWASP Top 10

    I was having a discussion with some friends recently where we found that we were in agreement about how much less useful the OWASP Top 10 list is since the updates. I found that no one had written anything critical of the 2021 changes.

    Before the 2021 change I would use the OWASP Top 10 as a checklist for when engineers wanted to know how to improve the security of their web services.

    Understandably the delivery of web services has changed significantly since the last time the Top 10 was run in 2017. But those changes don’t apply to everyone.

    It’s good that OWASP are looking to include more of the delivery pipeline. But this means when I direct an engineer to it they’re presented with confusing wording and subjects that could be outside of their responsibility, meaning engineers looking to cover the basics are no longer able to find the information they need.

    Here are the changes I find interesting:

    • A02:2021-Cryptographic Failures: This one used to be about data leaks. It is now, somehow, about cryptographic failures. “Crypto” sounds like, if not Bitcoin, then encryption. Would gaining access to a public file in S3 that contains sensitive data count as a cryptographic failure?
    • A04:2021-Insecure Design: This one talks about “move left” with scare quotes like it’s a fad. Insecure Design sounds to me like an upfront task, before any engineer has a look-in.
    • A05:2021-Security Misconfiguration: Security Misconfiguration is pretty much the entire list isn’t it?
    • A09:2021-Security Logging and Monitoring Failures: Is it a failure? Or does it not exist?
    • Also A06:2021-Vulnerable and Outdated Components and A08:2021-Software and Data Integrity Failures are too closely related.

    The changes and the reasoning for them are not well explained on the Top 10 page itself. If I was looking for advice on how to secure a pipeline I would not automatically think of OWASP. Instead I think I will still use the 2017 reference in most situations.

  • Sophisticated Attack

    A trend I and some friends in the industry noticed over the last few years is that whenever a company gets hacked they almost always call it a “sophisticated attack”. It kind of became a running joke, no-matter whether the attack was a Denial of Service, or data left in a public S3 bucket, that term was always used.

    The latest one to use that term is the Chief Executive of Optus, a mobile service provider in Australia, when they had their data stolen, seemingly by having their data available publicly on the web. This time the government of Australia called them out on it. Serious questions should be asked about whether a government should call out a private business in that way, and the impact on their working relationship in the future, but it is refreshing to see the term picked up as being false by the media.

    What’s strange is that whenever these hacks occur they always use the same ‘sophisticated attack’ term. Almost as if there is a single comms playbook being used across all these organisations and countries.

    The reason the term is used is that companies want the public to think that the hack took more than a trivial amount of effort, deflecting criticism that the company was negligent in its responsibilities to find and patch vulnerabilities. Whereas the reality is that these events are inevitable and we need to plan for and minimise them. Does a mobile service provider need to store people’s driving license details?

    Anyway, I wanted to explore why this term ‘sophisticated attack’ is being so widely used. Where did it originate? How old is it? Is there a playbook? And lastly, can we improve how it is used?

    The first step is to check Google Trends:

    A chart of Google Trends

    Well that was disappointing. I was expecting a linear or exponential increase. If anything it’s a decrease. The related queries seem to involve military usage, so perhaps the term originates in the military and that has been transferred to cyber security. It wouldn’t be the first time.

    During this research I discovered that I’m not alone in trying to look in to the origins. I don’t agree with some of the language used, but if a paid journalist can’t find the origins I don’t think I can either.

    Ok, so let’s finish off by looking at how we can use the term ‘sophisticated attack’ better.

    Let’s start by identifying what it is not. Some of the types of attack people in the industry tend to exclude are open services leaking data and DDoS. I don’t want to base the definition off a list of types of weaknesses because an exhaustive list will be difficult to maintain. The other commonality is that these exploits can be performed by anyone pressing a button, such as on a DDoS service or an online scanning tool like Nessus. It seems to me then that the definition can exclude anything achieved by running a single script or action.

    That means that chaining exploits is probably a better way to define the level of sophistication. If the attack involved 2 or more steps to get access to the data, that seems like it took effort and skill. If an attacker found an SSRF to exfiltrate cookies and used that to gain access to a poorly protected admin interface, that is 2 or 3 steps. By these metrics that is sophisticated. Perhaps for you it may mean 4 steps.

    This measure also means as security improves 2 steps can be the norm and 3 steps will be the new sophisticated.
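
    As a rough sketch of that yardstick, here is how the chained-steps idea could be expressed; the threshold and the example chains are my own illustration of the SSRF scenario above, not a formal definition.

```python
# A rough sketch of the "chained steps" yardstick for sophistication.
# The threshold is arbitrary; raise it as chained attacks become the norm.
def is_sophisticated(attack_chain: list[str], threshold: int = 2) -> bool:
    """An attack counts as sophisticated if it chains `threshold` or more steps."""
    return len(attack_chain) >= threshold

single_step = ["data read from a public S3 bucket"]
chained = ["SSRF used to exfiltrate cookies", "cookies replayed against a weak admin interface"]

print(is_sophisticated(single_step))  # False: one button press, not sophisticated
print(is_sophisticated(chained))      # True: two chained steps took effort and skill
```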

    By communicating about these attacks better we can start to have real conversations about how to protect data better.

  • Experiments in Stateless Terraform

    This article about the history of statefulness in Terraform really resonated with me. Over the last few months I have been playing with setting up Terraform in such a way that the state file was no longer required.

    This post shares my observations from those experiments. What I found is that, while it is possible to use Terraform without state, there are some limitations in the way AWS currently works that make this more difficult than you would expect.

    The key issue you encounter when you no longer know the state of your infrastructure is identifying which resources are no longer needed. You either need to enumerate them, or to discard them without being aware that they exist. The first solution I tried was to use the AWS Account as an ephemeral container for everything that is built and destroyed. This is guaranteed to find everything and is reasonably quick to achieve.

    I do recognise that what I’m attempting to do is a hack; Terraform and AWS are not designed to be used this way. Terraform has not made the decision to use state in isolation, they will have encountered the problems I describe further on in this article, and the solution they went for was adding complexity through state. I don’t see that as unreasonable, they made those decisions based upon the limitations of AWS at the time, and AWS made their own decisions based on assumptions about how their infrastructure was intended to be used.

    Becoming stateless

    The concept is that everything you need is first built in to one AWS Account and when you want to release a new instance, re-create everything in a second AWS Account and fail over to it. The first AWS Account can then be removed. A separate AWS Account with a load balancer can then perform the switching when the new infrastructure is up and running. This is a form of Green/Blue Deployment.

    MEMBER_ACCOUNT_PAYMENT_INSTRUMENT_REQUIRED

    There are some limitations with this, and this is where I think it gets interesting.

    My first thought for clearing out the old resources was to delete the AWS Account. However, deleting an AWS Account from an AWS Organization requires the sub-account to have credit card details associated with it. That makes it a manual step. A manual step means no automated Green/Blue deployment.

    There is an alternative. I call it the “black hole”. This is an OU in your AWS Organization that has a policy on it that prevents all roles from being assumed in that Account. That way your resources cannot run and your cost will reach zero.
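
    As a sketch of what that black hole could look like, assuming boto3 and a placeholder OU id; the post describes a policy that stops roles being assumed in retired accounts, and a blanket deny service control policy is one simple way to express the same intent.

```python
# A sketch of the "black hole" OU idea using boto3; the OU id is a placeholder.
# A deny-all service control policy stops principals in attached accounts from acting.
import json
import boto3

org = boto3.client("organizations")

deny_all = {
    "Version": "2012-10-17",
    "Statement": [{"Sid": "BlackHole", "Effect": "Deny", "Action": "*", "Resource": "*"}],
}

policy = org.create_policy(
    Name="black-hole",
    Description="Deny everything in accounts that are awaiting deletion",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(deny_all),
)

# Attach the SCP to the black hole OU; accounts moved into it can no longer act.
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-xxxx-xxxxxxxx",  # placeholder OU id
)
```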

    This then runs in to another issue. There are soft limits on the number of sub-Accounts that an AWS Organization can have. By default 10. When you’re doing continuous deployments, you do the math on the number of Accounts you will need per day. It is a soft limit, so you can ask Amazon nicely for it to be increased, but I bet in a short amount of time you will have an unhappy Amazon asking you what you are up to and can you please stop.

    Better Solutions

    The only other way of doing this is to follow the enumeration path. Fortunately tooling such as cloud-nuke exist to empty an AWS Account of resources. You can even be selective about allow-listing certain resources, but it is a very slow process and it may not cover everything.

    However, the interesting part here was finding the limits of what was possible in AWS. Not something I encounter often. I don’t have a lot of experience in other PaaS services so I would love to hear if doing this is possible in Azure or GCP.

    I do wonder if it’s something Amazon will fix, but I also think it is a fundamental limitation in how AWS assumed Accounts would be used. Otherwise they would no doubt have made it easier.

    A different, but unattainable solution I’ve started to see with services like Fastly and Doppler is that they have configuration versioning built right in to their web UI.

    I am hopeful that this is a trend towards a ClickOps model which has the ability to make Infrastructure as Code, and therefore managing state, redundant. The first PaaS service to do infrastructure versioning this way is going to get a lot of attention from me. However, I don’t see this happening in AWS for a long time.

  • What does the future look like

    I remember back in the late 90s and early 2000s there was a portion of the population that were fascinated by the culture of Japan. One aspect that captured people’s imagination is how odd we used to find it that “celebrities” like Sylvester Stallone and Arnold Schwarzenegger were doing adverts for strange products in languages other than their own. This became such a part of our own culture that it was even parodied in the TV series Friends.

    These days we’re not even shocked when “celebrities” like George Clooney endorse coffee, or Snoop Dogg raps about take-away apps.

    Being a fan of Sci-Fi I’m fascinated by the idea of what we would make of technology from the future. If a smart phone from 2050 landed in your hand, how would you interpret it? Or use it? Would it even be recognisable? We can’t know what it would look like, but I think we can predict our feelings towards it.

    This past week I listened to a Cautionary Tales episode about the Sinclair C5 in which they dramatised the launch of the C5, a “revolutionary” electric vehicle. Among its many failings, such as being incapable of coping with the weather of the country in which it was launched, it was also simply “weird”.

    The C5 was unlike everything anyone had seen before. Even now, in a world with eScooters and hoverboards, the C5 looks like a strange device – although that could be to do with the 80s styling.

    I think this can tell us something about those who are yet to be exposed to practices such as ZeroTrust and DevOps, or those that have been exposed and are resistant to them. There is no doubt those ways of working are here and likely to stay, and may well be superseded. But some people view modern engineering techniques as this odd, alien concept that upsets everything that has gone before it. Those people are like us, viewing the culture of Japan through a lens of “well this is weird”.

    Through my own experience of learning and practising Agile techniques, I found it’s a total mind-shift from the traditional way of working. It turns up-side-down so much of what we’re taught by companies, society, and universities.

    It also challenges many things outside an engineering team, things that are usually deemed sacred, like finance and procurement. This means that even if your engineering team does adopt it, its full effect cannot be achieved without a holistic cultural shift in your organisation.

    Remembering those early days is important, to remember how you felt, how you got to a stage where you were comfortable, and to use those “aha!” moments to bring others on the journey with you.

    I hope to write about the best way I find to achieve adoption soon. Implementing Agile methodologies across an organisation certainly is a slow process and potentially generational in nature, and it cannot be done in isolation to work effectively. But I hope that this gives you a sense of empathy towards those who may not have even heard of it.

  • The Maintenance Cost of Security Controls

    I used to work with a security architect who thought that every risk can be mitigated by use of “IP white-lists”.

    There are some immediate concerns with that approach in many scenarios, most obviously the poor user experience, and the ease with which it can be overcome. But one of the other concerns, which I rarely see discussed, is the support impact of such a control.

    Firstly let’s assume that this is to limit user authentication. The user has to realise that the reason a service has been unavailable for hours, or days, is that their IP address has changed. They will then have to find a way to request help through a different path.

    Even with a single user this could be a big overhead, requesting help several times a year. If they were using something like a 4G dongle for their internet connection, which changes IP addresses frequently, they could be requesting help several times a week. That is an extreme example, but multiply that by even a modest number of users and your support teams become overwhelmed. You could create some sort of self-service, but then why have the control at all? All it does is create bureaucracy.

    This post isn’t entirely about bashing IP allow-lists, although it demonstrates the issues effectively.

    Another example is using AWS’s Customer Managed Keys for KMS. By using CMKs you may have mitigated whatever concern you thought you had, but you now have to manage your own encryption keys. You have to create an entire ecosystem for managing those keys, ensuring they’re protected, rotated, and retired securely. That is effort you are now not putting into delivering value to your users, as well as an additional security risk associated with getting every step right.
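
    To give a feel for that extra lifecycle, here is a minimal sketch with boto3; the alias is a placeholder, and real key management involves far more (key policies, monitoring, re-encryption of old data) than these few calls.

```python
# A sketch of the lifecycle you take on with a customer managed KMS key (boto3).
# The alias is a placeholder; real key management involves much more than this.
import boto3

kms = boto3.client("kms")

# Create the key: its policy, usage, and retirement are now your problem.
key_id = kms.create_key(Description="Example customer managed key")["KeyMetadata"]["KeyId"]
kms.create_alias(AliasName="alias/example-cmk", TargetKeyId=key_id)

# Rotation becomes something you have to enable and verify.
kms.enable_key_rotation(KeyId=key_id)
print(kms.get_key_rotation_status(KeyId=key_id)["KeyRotationEnabled"])

# So does retiring the key safely when it is no longer needed.
kms.schedule_key_deletion(KeyId=key_id, PendingWindowInDays=30)
```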

    Any control proposed to mitigate a risk needs to be adequately weighed up, because the impact and implications of that control carry their own risks. You have to consider the secondary effects on how to effectively manage that control, and everything that comes with it.

    With IP allow-listing there is an interesting asymmetry between the risk and the control. The control is vastly more impactful than the thing you’re trying to protect against. To mix metaphors it’s like bringing a barn-door to a knife fight. You have to be careful that each control chosen is measured against the risk, rather than taken at face value.

  • The Not So Sorry State Of Open Source

    I’ve seen recent discussions bemoaning the state of Open Source software. This seems likely to me to have been brought on by questions about their security and integrity in light of recent concerns about supply chain exploits.

    I will declare myself now, I am not a fan of Java. In my personal ranking of programming languages I hate, Java is there right below PHP. I will admit with the log4j issues over the weekend I have been quietly smirking to myself a bit. Although it goes without saying, #hugops to those who have had their weekends ruined.

    However, I do recognise that ranking them comes from my personal biases and is not backed up by any data. It does not make the Java (or PHP) ecosystem insecure.

    I’m sure this also applies to those with concerns about the security of Open Source software: lashing out with whatever their personal biases tell them, without any considered reasoning. It makes me think of the SystemD arguments that are still around today.

    Given the broad use and severity of the log4j issues, what does actually concern me is environments that are difficult to update. I do recognise that we are in a time of transition right now. A lot of software is still manually deployed, if the engineers who built it are still working for the company. And even the ones that are automatically deployed may not fully understand the tooling within them. I’m sure much like Heartbleed the fixes will be gradual rather than big bang.

    Open Source software, Java, PHP, whatever needs tooling that allows it to be patched quickly, reliably and effectively. I would hypothesize that those who are letting their personal biases lead them to conclusions about whatever technology they hate are actually victims of ineffective tooling.

    Hopefully this will be another opportunity for businesses to have the discussion “what can we do to prevent this from happening again?” and do some fish-bone analysis rather than listening to the loudest person in the room then being surprised when we run in to the same issues again.

  • Security Communication

    Lately I’ve been listening to the BBC’s How They Made Us Doubt Everything, which exposes how the manufacturing of doubt was used to protect first the tobacco industry from its link to lung cancer, then the oil industry from its link to climate change.

    The whole show is well worth listening to if you can get access to it, but one of the episodes that stood out to me was about climate change communication, and the section with Susan Hassol describing the confusing and conflicting words that scientists use. There is a similar assessment of the poor use of language in climate science in Susan’s TEDx talk.

    How this ties in to IT security is that I think the community has a similar communication problem. I don’t yet see evidence of adversarial actors intent on casting doubt on, say, the effectiveness of encryption, but we can assume they exist, and we should be thinking about these problems before security has the same communication challenges as climate science.

    MFA, Authentication, Authorisation, backups, privacy. I know what these mean to me. I have a hunch about what they might mean to the readers of this post. But what do these terms mean to those outside the industry, my parents, or my neighbours?

    I know when I met my neighbours for the first time I inadvertently got into a conversation about coding and hacking and had to clarify their meaning. As explained by Hassol, these communication barriers don’t help people understand better, or make information easier to act on.

    Hassol has a published translation of commonly confused terms. Similarly, Gender Decoder, used to check for gender biases in job adverts is also based on published evidence. Where is that analysis for security terms?

    NCSC does put a lot of effort into this, and has even re-emphasised its pragmatic advice when questioned by those in the industry on its three random words policy. But the argument is about as effective as debating whether oil or coal is the largest contributor to green-house gasses. It creates uncertainty, and ultimately misses the more valuable message of secure passwords.

    I want more IT and cyber security discussed in common forums, such as TV and movies. Those in the industry often get excited about seeing nmap or multi-factor authentication used on screen. But it’s rare that security advice is portrayed well, or in seriousness.

    I often think about who is, or will be, the person to communicate IT security well. Who will have an accessible, entertaining and informative TV show on how to secure your e-mail? Who will educate the public in looking for better privacy-serving products? Who will go on the news when a major company gets hit by ransomware or has a data leak?

    Who is, or will be, the David Attenborough or Hannah Fry of Cyber Security?

  • Green and Brown Teams

    A recent Cyber Weekly mentioned Green Teams. This is not a term I had heard before, and from having a quick look around I couldn’t see this term widely used at least in the context of service delivery. I wanted to expand on it.

    The name isn’t so much a description of what a Green Team does, but a reflection of their responsibilities. By which I mean you cannot simply become a Green Team; you are one if you are working on a product that didn’t exist before.

    Thinking on the term a little, I also realised that there is a missing group here: the teams working on existing deliveries and, in many cases, legacy systems. They also need a name. Extending the analogy of Greenfield, the term used for projects that have some existing infrastructure is often Brownfield, so we end up with Brown Teams. Although that name is problematic.

    Whatever the name the categorisation is useful because, as the article points out, we often don’t spend significant time thinking about the issues these teams face, and they require different working practices to get them to a good place.

  • Providing Useful Security Advice

    I know I certainly dread that message:

    You have a massive security issue by the way

    It instantly puts you on the defensive, and in my experience, it’s also not true.

    On top of that the phrasing creates an information asymmetry and puts you in a position you have to defend. You may in reality be in a strong position, but by starting off with a non-collaborative discussion it can be difficult to identify the true problem.

    As I mentioned twice in my previous post, security is all about context, and I wanted to expand on that point. You may, in isolation, have something in your environment that isn’t ideal. We all do. The alternative is to have no service at all. But IT security is entirely about risks and trade-offs.

    It may well be that you do have an issue here, and it may also be that you have managed it already through another control such as alerts that warn when that scenario happens, but to an outsider it is often difficult to draw the relationship between a risk and a control.

    Let’s turn it around. Say you discover an issue in someone else’s service and you want to draw their attention to it. What would be the best way to broach that subject, in a way that is collaborative and resolves the issue? This question appears a lot in interviews in this industry, in one form or another, but I’ve yet to see a response I’m entirely happy with. Let’s expand on it then.

    First off, setting the scenario. I’m excluding where you are an outsider as bug bounty programmes are well established and documented.

    In this scenario you are a Security Architect, Engineer or similar in a business. You have access to the source code. You also have access to the person who is responsible for the code, and you’re using modern delivery methodologies.

    As mentioned, the first issue you have is that perhaps the issue you’ve found has already been risk managed. To account for this the first step is to check the documentation. There is a chance that the issue you are seeing has been documented but is difficult to find. Unless you’re somewhere that employs enough people and pays well enough to attract good people it’s unlikely this is documented at all, but let’s try to minimise wasting someone else’s time first. Time they could spend resolving other issues.

    In general terms you’re trying to identify if the issue you are seeing is actually an issue and test your assumptions.

    In the more likely scenario that the documentation doesn’t exist, is hard to find, or doesn’t have the information you need, the next step is to approach the person responsible. I think it’s important to couch your wording appropriately. Never make assumptions about the issue you have found. You want to come across as someone who may be wrong or was not able to find the right information, and who is now attempting to find it. Phrase it something like:

    I was looking at your code and I’m trying to identify if something I’m seeing is actually an issue.

    Followed by explaining the issue.

    There’s no point saying there is a massive security issue because, as mentioned, you put the other person on the defensive, and it may well be that you are wrong. You are reducing your chances that they will pay attention to you in the future, when perhaps there is a genuine issue.

    You may discover at this point that they have identified the issue, mitigated it, and the documentation is good, but not something that can be found quickly. At this point you may want to suggest making it easier to find the documentation and your job is done.

    However, if they have mitigated the issue but haven’t documented it well, or they haven’t mitigated it at all, then you have some work to do.

    This is your opportunity to create some security credit and I recommend you take it. This will make working on issues in future much easier.

    The first step is to offer to raise a ticket for the work. This means that it has a greater chance of the issue being worked on. If you’re unfamiliar with the project this is also the point to clarify where they record their issues so you know it’s somewhere they will see it.

    Ideally the next step is to be proactive, although it’s not always practical.

    If it’s a documentation issue then offer to update their documents.

    It also may well be a genuine unmitigated issue, and much the same way with the documentation, if you’re comfortable you should then be proactive in fixing the problem. Raising a Pull Request for the change and adding or updating tests.

    That way the team with the issue are more likely to listen to you and fix the problem next time, which is the ultimate goal: reducing the risk of disruption to your system.

  • Armchair Security Experts

    Providing security advice to someone can be very difficult, particularly when unsolicited. This is because IT security is entirely context based. I use the term Armchair Security Experts to describe people who provide security advice, often wrong, as if they had expertise in the subject area.

    Wikipedia explains the term Armchair Expert (or Armchair Theorizing) well enough, and the security aspect is much the same. Someone with limited practical expertise in the field. Or more generously, someone whose responsibilities don’t include owning any risk.

    Security is all about context. If you make broad bold statements about security without leaving your armchair to consider the context you are providing negative value to a delivery.

    The security advice Armchair Security Experts sometimes provide is what I call Security Grenades. This would often be in the middle of a big important meeting where someone semi-senior would throw in something like “I have heard that S3 is insecure”. This is such bafflingly bad information that it de-rails the purpose of the meeting.

    This is closely related to the dead cat strategy (which is more or less as horrible as it sounds) but instead of drawing attention away from a dangerous line of discussion it diverts the delivery team in to trying to handle this knowledge asymmetry.

    This is where, in an Agile delivery, taking all stakeholders on the journey with you is so important.

    You need to keep everyone informed and up to date on your progress, your assumptions, and your solutions. You should be able to identify those who have the power to derail your project and put the extra effort in to keeping them informed so that they don’t scupper your delivery at the final stages.

  • Behavioural Economics and Second Order Thinking

    I really enjoyed the article on Chesterton’s Fence I came across recently. It really made me reflect on Behavioural Economics again – as does so much stuff I read.

    The Chesterton’s Fence analogy is an excellent description of why challenging assumptions in Agile processes is so important, but the reason I like it is because thinking in a Behavioural Economics way you can use it to inform Second Order Thinking. Or in some ways, start to predict the future about what effects a particular action has.

    In Behavioural Economics terms, the fence can have multiple purposes, but what it incentivises is keeping something in or out. Remove that fence and it no longer performs that function, and whatever protection it afforded no longer holds.

    I know in reality I would probably be mindlessly attempting to remove the fence post without thinking about its purpose. Fortunately, because of the repeatability of DevOps practices, we can reinforce that thinking each time.

    I write this, however, not so much attempting to be a soothsayer but, much like the fence, as a warning that what Second Order Thinking and Behavioural Economics can tell you about the future is still assumptions based on assumptions, and that you can really only understand what happens next through small, fail-fast experimentation.

    So perhaps the true analogy isn’t about the fence being moved, but instead about making assumptions about the effect of moving it.

    But then there is such a thing as taking an analogy too far.

  • Is Public Wi-Fi More Secure Than Personal VPN Services?

    An IT security group I associate with recently wrote a blog post on threats and scenarios for securing mobile phones. It’s well worth a read.

    One line stood out to me as interesting because I wanted to understand the evidence for that. It feels true, but that’s not the same as being true.

    That Greg is…

    more likely to select a malicious VPN provider than he is to run into malicious Wi-Fi

    Speaking to the authors, the answer was more anecdotal than quantifiable, based upon the lack of reports of malicious public Wi-Fi in recent years and the number of reports of malicious personal VPN providers.

    To validate that, there are plenty of examples of personal VPN services leaking logs, but as I have written about before that doesn’t make those services more, or indeed less, secure than they are now.

    Also, while personal VPN services are not a new concept, they have become more prominent in the last 5-6 years, which could lead you to think that data leaks by personal VPN services are more likely to be reported because those services are note-worthy and therefore more interesting to report on.

    Conversely, using public Wi-Fi has been a concern for many people for far longer, but with 3G and above being more commonplace its usage is likely lower. It seems though that evidence for this, and for ‘hacks’ on public WiFi services having significant effects in the real world, is hard to track down. The reasons for this are either that it rarely happens, or that it rarely gets identified.

    Defining Terms

    For clarity a public WiFi service is any wireless network with an internet connection provided by a company in a public space. This means coffee shops such as Starbucks but does not include corporate Guest WiFi that you might find in a workplace. A public WiFi provider is the company that provides that WiFi service.

    Whereas personal VPN services are a kind of internet service provider that supplies internet access via a Virtual Private Network and is aimed directly at consumers rather than businesses. A personal VPN provider is a company such as NordVPN or ExpressVPN that run these services.

    Comparisons

    The core of the question here is:

    Are you more likely to use a public WiFi that has a vulnerability that could do you harm, than a personal VPN service that has a vulnerability that could do you harm?

    Part of the complexity in answering this is that we’re comparing apples and oranges. You can tell through a logic test, that you can run a VPN over public Wi-Fi but not the other way around. This dissimilarity makes comparing risks difficult and the only reasonable way to measure it is through qualitative thought experiments.

    Economic Incentives for Security

    To start with, let’s look at the reasons a company might run one of these services, as the economic models for public WiFi services and personal VPN services are quite different. Using Behavioural Economics we can identify constraints and motivations for running them.

    Personal VPN services have a very traditional “quid pro quo” subscription model. This means that those who tout their services as improving security have a specific interest in you having a secure system, or they would be out of business. It doesn’t mean leaks won’t happen, but they are incentivised to be less susceptible to vulnerabilities, in much the same way cloud providers are.

    The payment model for public WiFi services is more complex. Coffee shops and libraries are widely regarded as a place to get free WiFi. The payment model is indirect. In coffee shops you are paying for your WiFi as a portion of your purchases. The WiFi encourages you to choose that venue over other similar venues, and the act of you staying there means you spend more on food and drink. You can infer therefore, that a coffee shop or library is more interested in providing a public WiFi service at the lowest cost they can entice people with. It’s not going to harm their business as much if there’s malware flying around the network – although arguably it should.

    Security Considerations

    I wanted to address some of the good, although patchy and inconsistent, work that personal VPN providers do on the security of their services. WiFi providers, as mentioned, don’t have the same incentives to provide this information.

    Some personal VPN services do bring in auditors for their code, which is good as long as you trust the audit, but also comes with the caveat that the audit covers the code for a particular point in time. Some also have vulnerability reporting incentives. While some can provide additional blocking.

    A thing to consider is that personal VPN services typically require you to download and install potentially untrustworthy applications on to your machine to gain access to their services. Although some provide configurations for widely used VPN software.

    Discoverability

    The final part is discoverability. This is your likelihood of accessing a personal VPN service vs a public WiFi service.

    A personal VPN service would typically be a one-to-one relationship between consumer and business, whereas there are a lot of public WiFi providers. In the United Kingdom public WiFi services are usually from big telecoms providers such as Virgin Media, BT and Sky, but in smaller shops they could also be an ad-hoc PSK posted on a wall, or printed on a menu. You may use several in a year (or used to). However a personal VPN service can be accessed from anywhere at any time.

    I think it’s also likely that, if you were new to personal VPN services, you would perform a search for the best providers, which is really a list of affiliate links where the one at the top probably gives the author the largest share of the referral fee. You might not choose the top one because they’re the most expensive, so maybe the 5th one? Then at some point down the road that company isn’t doing so well and gets bought out by someone not so reputable, and sunk cost fallacy says you’re going to stick with it anyway.

    Moot Points

    There are some security considerations that both public WiFi services and personal VPN services share. It’s worth drawing those out here as some may see them as giving one an advantage over the other.

    HTTPS

    The vast majority of web traffic is now HTTPS, which means you are less likely to have your traffic changed in some way, but it is still likely that the providers can see metadata about your traffic. Typically you provide a public WiFi provider some information about you on sign-up, although I rarely see that being verified. A personal VPN provider can probably tell who was accessing what because you’ve presumably provided them some payment details, although I’m sure some more privacy-minded folks might pay via a crypto currency.

    Further Monetisation

    I have no doubt that large public WiFi providers monetise your traffic by taking metadata and selling it on. While some VPN providers claim they don’t, if you’re looking to make cost savings and choosing a cheaper provider I wouldn’t count on them never changing their minds.

    Data Leaks

    Both public WiFi services and personal VPN services have leaked personal data in the past. It’s an inevitability of time and the nature of IT, though as mentioned I don’t think it’s necessarily a reflection of the current practices of those services.

    Conclusion#

    Like most answers in IT security, it really depends on your circumstances. Tom Scott does (as always) provide a fantastic summary of why you might want to use a VPN, and I completely agree, while Troy Hunt has a less than unbiased view on why you should use NordVPN.

    What this demonstrates is that VPNs are not a quick fix, even for technically minded people. There are a lot of ways to get it wrong.

    That means if you’re in a position to use a personal VPN service, given the number of choices you may choose one that doesn’t meet all your security needs. Whereas public WiFi services are less frequently used and have a larger surface area, and where they are most used they are run by large, reputable companies.

    So it seems like public WiFi services are probably less malicious than some or all of the personal VPN services, but a huge aspect of that is exposure.

  • YAML: Out of Order

    I did enjoy Drew DeVault’s wish-list for the future of YAML today. I have to agree with all of it. YAML is terrigood.

    I recall reading an article years ago (that I can’t find now) comparing the YAML spec to other similar languages such as XML, JSON, etc., and whichever you pick, YAML’s spec surpasses it in terms of length. The implication being that YAML is particularly difficult to implement correctly.

    But YAML is also great, and I think that is down to the syntax being highly intuitive. It’s familiar to a lot of people, and in particular those familiar with Python.

    One thing I think Drew overlooked though, and I’m sure there are others like Drew who would not automatically recognise this, is a really important aspect of YAML, and the reason languages like TOML won’t replace every use of YAML: the inference of sequencing in a YAML file.

    This becomes particularly apparent with Ansible. You cannot have an Ansible playbook without sequencing; tasks have a top-down order to their structure. You will also find this in most CI languages such as GitLab CI and GitHub Actions – although it’s less apparent because you can also break the sequence by referring to other steps.

    This inferred sequencing is not something you can do in TOML. Much like in CI pipelines you could reference other steps, but that starts to act much like GOTO 10, as well as increasing the cognitive load on the reader, who has to find all the steps across a file or set of files.
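    To make that concrete, here’s a minimal and admittedly contrived Python sketch (assuming PyYAML is installed, and Python 3.11+ for tomllib). The step names and the "next" key in the TOML version are invented purely for illustration; the point is that the YAML sequence carries its order implicitly, while the TOML version has to be stitched back together by following references.

```python
# Contrast a YAML sequence (implicitly ordered) with a TOML layout that
# encodes order through hypothetical "next" references.
import tomllib  # standard library from Python 3.11

import yaml  # third-party: pip install pyyaml

playbook_like = """
steps:
  - name: install packages
  - name: copy config
  - name: restart service
"""

toml_like = """
[steps.install_packages]
next = "copy_config"

[steps.copy_config]
next = "restart_service"

[steps.restart_service]
"""

# The YAML list is already in execution order; read it top to bottom.
for step in yaml.safe_load(playbook_like)["steps"]:
    print(step["name"])

# The TOML tables are a mapping, so the order has to be reconstructed by
# chasing the "next" references, GOTO-style.
steps = tomllib.loads(toml_like)["steps"]
current = "install_packages"
while current is not None:
    print(current)
    current = steps[current].get("next")
```

    This isn’t a claim about what TOML can or can’t express, just an illustration of where the sequencing lives in each format.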

    Terraform is another good comparison: HCL doesn’t enforce a sequence, and each block in most cases references another. You then end up building potentially complex graphs in your parser to identify what needs to run first, and what needs to run after that.
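    For a sense of what that graph work looks like, here’s a rough sketch using Python’s standard library topological sorter. The resource names and dependencies are hypothetical, but this is the kind of ordering problem a reader (or a parser) has to solve when blocks reference each other rather than appearing in sequence.

```python
# Build a dependency graph from block references and topologically sort it
# to recover a valid run order. Resource names here are made up.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each block maps to the set of blocks it references, HCL-style.
depends_on = {
    "aws_instance.web": {"aws_subnet.main", "aws_security_group.web"},
    "aws_subnet.main": {"aws_vpc.main"},
    "aws_security_group.web": {"aws_vpc.main"},
    "aws_vpc.main": set(),
}

# static_order() yields dependencies before the blocks that use them, and
# raises CycleError if the references ever loop back on themselves.
print(list(TopologicalSorter(depends_on).static_order()))
# e.g. ['aws_vpc.main', 'aws_subnet.main', 'aws_security_group.web', 'aws_instance.web']
```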

    Again, I do not disagree with Drew’s assessment, but I do think we also have to recognise that the sequencing is an important part of what makes YAML so accessible, and if we’re looking to the future of YAML this is a feature that needs to be retained.

  • Your Lack of Planning is Not My Emergency

    Recently a colleague shared a rule he uses when dealing with urgent requests:

    Your lack of planning is not my emergency

    My gut instinct was to disagree with it, but I couldn’t articulate it properly at the time. Now I see how it’s a regressive solution, which always makes me think there has to be a better way, and I wanted to explain my reasoning.

    The history of it comes from working in IT, where capabilities often need to be updated in a hurry, but the task has a dependency on another team, your team, which does not own that change. Your team now has a bunch of pressure exerted on it because another team encountered something they hadn’t planned for.

    The rule is a way to quickly defuse any expectation that, because another team made a mistake, your team is responsible for fixing it. This is because these events are toxic in nature, causing extra pressure on your own deliveries. You could also reasonably assume that if your team helped out, the team who owned the event won’t learn anything and will know to go direct to you in future. A single urgent request in isolation is fine, but happening regularly it can be crippling to a team.

    However, if this kind of task ends up requiring an urgent response from your team, dismissing it outright means making assumptions. The following is a breakdown of those assumptions.

    Lack of Planning#

    How often in IT are rapid changes caused by a lack of planning? I’m betting in most cases the tasks stem from incidents or security patches. If you dismiss the request, you will be seen as unwilling to help out at a time of need, and as someone who may be unwilling to participate in similar future requests.

    I would expect this also means you reduce your chances of being consulted in future, or those who are making changes don’t know to inform you of them because you are not considered on the normal path to release.

    Emergency Exit#

    The other assumption to check is, how much of an emergency is it? Most of the project managers I’ve worked with would often say “Can you do this by tomorrow?” and when you trace back to the engineering team they’re happy with it being the next day, or May, or whenever. You may have decided not to participate based on faulty data.

    Often the delivery date has been made up based on guess-work and has no meaning. I always like to check the deadlines. They rarely hold true.

    The User Factor#

    Another consideration is that users are often the cause of these rush jobs. Setting aside that telling them ‘no’ is a path to shadow IT, it might be an indicator that your process isn’t fast enough for the demands of the business. If you have users who need software installed today and your SLA is 5 days, then maybe your SLAs are at fault.

    Happy to Help#

    You should always be willing to improve the operational effectiveness of your business. To do that you have to participate. That is the cost of being an effective team. Perhaps a single team is often the source of the problems, so where are their problems stemming from? Are they performing an incident post-mortem? Or learning any lessons? Is there something you can do with them to prevent their issues re-arising? Perhaps this is the true problem.

    Finally, if you haven’t planned extra capacity into your work, then have you actually planned your own work properly?

  • Running Effective One-to-Ones

    I’ve run one-to-ones for a couple of years now, and I realised I didn’t have a good plan for how to run them effectively. Weirdly, there didn’t seem to be a lot of discussion written up about them that I could find online.

    I spoke with some colleagues who pointed to the Manager Tools Basics podcast, which has a set of episodes totalling around 90 minutes on the subject at hand. They were, however, seemingly recorded in 2005, which makes them over 15 years old; this lines up with the banner on the webpage. While this does seem old, and despite pandemics and changes in technology, the advice seems to hold up for the most part. The issues, though, are that, for one, the episodes are very verbose, and second, the information isn’t replicated online.

    What I wanted to do here is to pull out what I found the most important so that others can run their own effective one-to-ones without having to dedicate 90 minutes to listen to a podcast. This is a condensed view of what to do. If you want the complete explanation of why you do this I would recommend listening to the podcast itself.

    In the podcast they refer to the person who works for you as your “direct”. I don’t really like the term, but I can’t think of a better one, so I’m going to re-use it.

    Where and when#

    • Run your one-to-ones weekly. As the manager, put it in your and the direct’s calendars. You can move it, but don’t miss it.
    • No more than 30 minutes. An hour is too long.
    • Run it in a space where the direct feels comfortable.

    What to discuss (Agenda)#

    • Your directs views: 10 minutes
    • Your views: 10 minutes
    • The future: 10 minutes

    Don’t rigidly stick to the agenda. If your direct wants to spend more time talking about their views then that’s their prerogative. This is their meeting.

    Your views might be about how interactions in the team are working, or not. The future is about where they are heading. Are they looking to move on? Or stay?

    An important note is that you, as a manager, should have a broad opening question. In the podcast they suggest “How’s it going?”.

    Review#

    I disagree somewhat with this agenda. It seems to lean towards the manager’s view. The important thing is that your directs have a voice and you understand their wants and needs. If they want to talk for the entire time, that’s up to them. It’s a two-way street anyway, and an opportunity for coaching.

    I personally would avoid the Joey Tribbiani comparison and possibly ask “How are things?”, which has roughly the same meaning.

    The other thing to note is that as meetings go, one-to-ones are deceptively straight-forward. They work better when you don’t over think them. Other than scheduling you almost let your directs run them themselves.

    Other than that I was quite happy to get this insight.

  • Engineers Care Less About The OS

    Last Week in AWS is another one of my favourite blogs, but this week’s one titled ‘Nobody Cares About the Operating System Anymore’ definitely got me thinking.

    If you’ve never read Last Week in AWS, the thing to note is that it scores really high on the snark factor, and like most blogs it tends towards hyperbolic titles to attract readers and discussion; after all, it’s not as pithy, and not going to attract as much attention, if the title is ‘In the cloud, Engineers don’t really care about the Operating System anymore’. Which is probably why my blog is much less snarky, and much more boring.

    However, in this case I think the premise is wrong. It’s not that engineers no longer care about which Operating System you choose, but that cloud-based Virtual Machines (as a catch-all for EC2s, Azure VMs and whatever GCP calls compute) are no longer widely used, and for me their presence is actually an indicator of ‘infrastructure smell’. I.e.: that something is old, or unpatched, or not cloud native. For me, if I have to use a VM in hyperscale cloud, something has gone wrong somewhere.

    There are instances where you might use VMs, more commonly called VPSs, and dedicated boxes, which you find on more commodity providers such as DigitalOcean, Scaleway, or RedStation (to broach the tip of the iceberg), but when you have SaaS, FaaS, and serverless Docker environments such as Fargate and Kubernetes, I really struggle to see the value in using dedicated, always-on compute. If anything it’s technical debt that you need to monitor, maintain, patch and scale.

    As a side note, yes you sometimes have to choose a base Linux distribution for Docker images, but this is essentially a game of pick your preferred package manager, and then it’s done and you never ever change that FROM value except to update the version number.

    I can see why Corey would interpret that as nobody caring about the Operating System, but the broader picture to me is not that the Operating System doesn’t matter. It’s that hyperscale cloud technology is now at a point where VMs can be abstracted away, making the choice of Operating System less relevant, and it’s the cloud providers’ abstraction layers that enable that, rather than engineers not caring.

  • Chaos Engineering Your Team

    Mountain Goat Software runs possibly one of my favourite blogs at the moment. This recent post “Should Your Team Adopt No-Meeting Weeks” really resonated with me.

    I read an article a while back, but of course I cannot find it now, about how you might want to consider applying Chaos Engineering, not only to your technology, but also to your people. A random person gets selected, maybe once a week, to take the day off, simulating a sick-leave-like situation.
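    The mechanism itself is trivial; a sketch of the selection might look something like this (the names are placeholders, obviously):

```python
# Pick one person at random each week to be treated as unavailable,
# simulating the sick-leave scenario described above.
import random

team = ["Aisha", "Ben", "Chen", "Dana", "Eli"]  # placeholder names

def pick_absentee(members: list[str]) -> str:
    """Return this week's simulated absence."""
    return random.choice(members)

print(f"This week, plan as if {pick_absentee(team)} is off.")
```

    The hard part, of course, is not the selection but having processes that keep working when that person really is away.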

    I’m a huge proponent of the idea that no single person should be critical to your business, and in fact they become a massive risk to your business that is only realised once they leave. It also creates an uncollaborative working environment because you will have a lot of different pressures from different people on one person. This is covered a lot in The Phoenix Project.

    Using the principles behind Chaos Engineering you can build into your team the expectation that you should be able to survive the loss of a person.

    What I like about the experiment mentioned in the post, taking a random person out of meetings, is that it is similar to Chaos Engineering your team, but at a smaller scale. In the situation where the person who is critical to your meeting gets unceremoniously booted out of said meeting, you have identified a process that doesn’t account for the loss of one person.

    The other issue is that if people learn they’re not necessary for a meeting they stop coming, which I assume is the intention. However, now you start kicking out important people from the meeting, such as the executive authoriser who was there specifically for that meeting.

    In the post these issues were explained as the experiment failing. I would look at it from the other direction: it’s a failing of the processes to have a dependency on a single person.

    As CGP Grey often says, one is none.

    Now don’t get me wrong, I’m not saying you need two (or more) executives in the meeting. I’m taking the broader picture. Does everyone really need to be in the same room at the same time? In my experience I would say 90% of the time the answer is no.

    Instead the major discussions should happen before the meeting, on your preferred collaboration tool. Any reading material can be distributed beforehand and consumed at your leisure. If someone has a vested interest, such as an executive authoriser, they can raise any concerns before anyone steps in the room.

    The meeting becomes a formality to say “a decision needs to be made by this date and time”. It’s more like that cliche moment in weddings in movies where the vicar says “Speak now or forever hold your peace”. The meeting should be 5 minutes tops.

    The only other time people might want to meet at the same time is if something was being presented, such as a town hall or a sprint review. In those situations you should ensure that those meetings are recorded and published so that others can view and ask questions later if they were interested, but weren’t able to attend.

    This leaves everyone free to work on what’s important, when it’s important. Not spending their entire day in meetings being presented material they could have accessed at any time. Or using time less efficiently because they’re winding up to a meeting.

    But, as the original post admits, this may only work in a Google-esque business process utopia.

  • Common Sense as Bad Practice

    It came to me in conversation recently as to how toxic “Common Sense” is.

    Common Sense says “well, everyone uses X, so we should use X” or “Everyone should know this.”

    But those are fallacies and assumptions that aren’t tested or rooted in evidence.

    Usually Common Sense comes from asking your mates or colleagues, or assuming they think the same as you. By doing that, however, you inherit the bias in those assumptions. You make decisions that don’t hold true: at best you are likely to make poor decisions, and at worst you create an exclusionary environment, locking out people who you will want to include.

    If you believe it is Common Sense, prove it is Common Sense. Test it out. Trial it, and properly. Don’t be part of the problem.

    Also, challenge “Common Sense” when it is used.

    Why is it common sense?

    Who else thinks they know that as the answer?

    You should end up with a better solution that is more open and equitable than your assumptions.

  • Assessing Security Practices of 3rd Party products

    In recent months I’ve been involved in discussions about whether remote working tools are “secure” or not. The answer to any blunt question like that is, as always, “it depends”, but this is as helpful as getting financial advice from YouTube adverts.

    It struck me that a lot of people interested in IT security often judge tools based upon how many vulnerabilities there are in a product. But let’s be accurate here: they are judging it on how many security vulnerabilities are reported, or visible.

    Once the pandemic forced us all to work from home, Zoom seemed to be the target for every “white hat” hacker consultancy generously giving their time to declare the 0-days they found to news websites, with nothing more in return than their company name placed alongside the advert article.

    These articles seemingly led to several instances of “Enterprise” security teams declaring Zoom insecure and attempting to block its usage in their environments.

    But is Zoom insecure? And does blocking it improve things?

    If you block Zoom, what will people use instead? Google Hangout? Skype? Chime? WebEx? Some random tool they found on the internet? Is blocking Zoom making things more or less secure?

    Zoom got the attention because its user base and visibility increased massively during the pandemic. Admittedly it did have some low hanging fruit, but was it more or less of a threat to a business than your employees using something like Omegle for work?

    If we used the same metrics by which we judge Zoom and applied them to Windows or Linux, those security teams would block almost every operating system out there except maybe OpenBSD and BeOS.

    IT security shouldn’t be reactionary#

    IT security needs to be evidence-led. Would you say now, after 6 months of global usage, that Zoom is not secure? Other than the occasional person posting their Zoom codes, which is the video conferencing equivalent of calling your S3 bucket big-bank-data.

    So how can we be less reactionary in future?

    One thing is to recognise that Zoom fixed a lot, if not all, the issues that the security researchers were making a fuss over. Usually within days or weeks.

    Whereas, shouldn’t we trust GitHub less for having a long-standing issue it’s been unable or unwilling to fix?

    What I’m proposing is that instead of making decisions based upon what we read in the tech news this week, we measure good security practice from 3rd parties by how quickly and responsibly vulnerabilities in their products are announced and fixed, and that we stay pragmatic about what our users need and expect from their IT.
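    As a rough sketch of what that measurement could look like, you could track how long each disclosed vulnerability took the vendor to fix, rather than how many headlines it generated. The dates below are entirely made up for illustration.

```python
# Measure vendor responsiveness as the time between disclosure and fix,
# using made-up dates purely for illustration.
from datetime import date
from statistics import median

disclosed_and_fixed = [
    (date(2020, 4, 1), date(2020, 4, 3)),
    (date(2020, 4, 20), date(2020, 4, 27)),
    (date(2020, 6, 2), date(2020, 6, 5)),
]

days_to_fix = [(fixed - disclosed).days for disclosed, fixed in disclosed_and_fixed]
print(f"median days to fix: {median(days_to_fix)}")  # 3 in this made-up sample
```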