December 12, 2017

Day 12 - Monitoring Postgres Replication Lag

By: Kathryn Exline (@kathryn_ex)
Edited By: Baron Schwartz (@xaprb)

Have you created replicas of your PostgreSQL databases? I am going to assume you are a good database steward and answered that question with a resounding “YES INDEEDLY DO!” Ned Flanders style. If not, I recommend taking the time to be kind to your future self and do so as soon as possible. We won’t talk about how to do that here, but you can find details on how to configure replication in the PostgreSQL documentation, the PostgreSQL wiki, and around the internet.

With your trusty replicas in place, make sure you take the time to properly monitor your clusters. One of the most important metrics of replication health, albeit one that is seductively easy to overvalue, is “replication lag”. Before I show you a few simple queries to collect this value on your PostgreSQL clusters, let us briefly talk about replication in PostgreSQL. If you are already familiar with the concept of replication and how it is implemented in PostgreSQL, feel free to skip ahead to the “Why Care About Lag?” section.

The Basics: Replication and WAL

Replication is a mechanism where data from one database (a “primary”) is copied to a secondary database (a “replica” or “standby”), keeping it in sync. Most databases have built-in mechanisms to support this feature. After you configure your primary PostgreSQL database for your service, you should create one or more replica databases in case you lose your primary database or decide you want to offload some operations from the primary. Generally, you initialize a replica with a snapshot of the primary and then it stays up to date by fetching and replaying the primary’s transactions.

PostgreSQL implements replication via the Write-Ahead-Log or the “WAL” (pronounced “wall”, like the big icy thing in Game of Thrones). The notion of the WAL is not unique to PostgreSQL, and is similar to that of journaling in file systems. It ensures transactions are logged durably before they are committed, so updates can be recovered and replayed in the case of a crash. Aside from crash recovery, PostgreSQL leverages the WAL for internal performance gains and built-in replication support.

The WAL is a collection of 16MB binary files located in the pg_xlog directory (renamed pg_wal in version 10) of your data directory. Each time the database gets a transaction that requires changing any data, it appends a record of the transaction to the most recently created WAL segment file and assigns the record a Log Sequence Number (LSN) to note its position in the WAL. I explicitly say position and not time because, as the term Log Sequence Number suggests, the WAL files and their individual records follow a sequence-based timeline. Why? Because if you are processing a high volume of transactions, timestamps may not be unique or granular enough to validate that your transactions are executed in the correct serial order. Not to mention time is full of nasty tricksies. This will be important later when we look at the queries you can use to find your primary’s and replicas’ position in the WAL.

PostgreSQL uses the WAL to make replicas of the primary in one of two ways. The latest and greatest is via streaming replication, where each WAL record is sent to the replica as quickly as possible to be replayed. By default, this is done asynchronously so the replica can process the record without delaying the commit on the primary; however, PostgreSQL also supports synchronous replication, where a transaction on the primary must wait until the WAL record is committed on both the primary and the replica before the transaction is considered successful.

The second and older option is log shipping, where the primary ships one full WAL segment file (16MB) at a time to a replica. This generally results in higher replication lag since the replica will not receive the WAL records until the file is completely filled. Streaming replication is best for most use cases, but I recommend reading the PostgreSQL documentation on log-shipping standby servers for in-depth explanations of these two options.

Why Care About Lag?

Replication lag is the replica’s distance behind the primary in the sequential timeline. The time it takes to copy data from the primary to a replica, and apply the changes, can vary based on a number of factors including network time, replication configuration, and activity on both the primary and replicas. Unsurprisingly, I have seen replication lag spike on several occasions due to network issues. In another case, I saw replication lag spike on a replica that was not able to find and recover a WAL file from an archiving node, and it quietly fell out of date. The potential causes are widespread and I have found that replication lag is often an indicator that something is subtly failing or behaving unexpectedly.

Ultimately, it is safe to assume that there will be some amount of lag on any replica. But why do you need to know the replication lag in your clusters?

Disaster Recovery

In most scenarios where a primary database is lost, users want to promote the most up to date replica to ensure minimal data loss. You and your tooling can measure the lag to select the optimal replacement for the primary.

Service Strategies and Optimizations

If you connect all of your clients to the primary, you will eventually overload your database. When this happens, a common technique is to direct some read-only queries to replicas; however, if you don’t build your service to be aware of, and tolerant of, replication lag, then your users will experience inconsistent behavior from your service. Knowing the typical replication lag of your replicas will help you strategize which services can still function in spite of potential lag.

Debugging and Observation

Just as measuring latency in an HTTP request can indicate an underlying issue, unusually high replication lag can indicate an issue with your databases. Unfortunately replication lag in isolation rarely informs users of the specific underlying problem, but it is a broad indicator of several issues and is another data point in your observability toolbelt.

How To Monitor Lag: Get Your “See Lags”

Now that we have a handle on the importance of monitoring your replication lag, let’s dive into two ways to measure replication lag.

By WAL Location

The most accurate way to determine the lag is to compare the current WAL location on the primary with the last WAL location received by the standby. To find the LSN value of the current WAL location in Postgres versions older than 10.x, run the following on the primary:

=# select pg_current_xlog_location();

In Postgres 10.x, you’ll need to use a newer function:

=# select pg_current_wal_lsn();

You should get an LSN value which looks like the following:

 pg_current_xlog_location 
--------------------------
 9C/1E306050
(1 row)

To find the LSN value of the last WAL location received and synced to disk by the standby, run the following on the replica. Once again, there’s a pre-10 syntax and a newer version for Postgres 10.x:

=# -- in Postgres 9.x
=# select pg_last_xlog_receive_location();

=# -- in Postgres 10.x
=# select pg_last_wal_receive_lsn();

You should get a similar result to that of the previous function:

 pg_last_xlog_receive_location 
-------------------------------
 95/75619450
(1 row)

Note that these functions denote what WAL position the replica has received from the primary, but not what it has applied to bring the replica’s copy into sync with the primary. There could be a difference between these two values. To find out what the replica has replayed, use the following functions on the replica:

=# -- in Postgres 9.x
=# select pg_last_xlog_replay_location();

=# -- in Postgres 10.x
=# select pg_last_wal_replay_lsn();

You can determine whether the replica is at the same point in the WAL as the primary by comparing the values of what’s been committed on the primary and what’s been received or replayed on the replica. The disadvantage of using WAL position is that, despite being an accurate representation of lag, it is difficult for humans to understand what an LSN difference really means. I have seen clever scripts that convert LSNs to byte positions in the WAL and take the difference of these values, but there is an easier option that leverages another built-in function to approximate time lag.
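That said, if you do want the difference in bytes without writing your own conversion script, PostgreSQL provides pg_xlog_location_diff() (renamed pg_wal_lsn_diff() in 10.x) to do the arithmetic for you. Here is a minimal sketch using the two example LSNs shown above; substitute the values you collected from your own primary and replica:

=# -- in Postgres 9.x: byte difference between the primary's and the replica's LSNs
=# select pg_xlog_location_diff('9C/1E306050', '95/75619450');

=# -- in Postgres 10.x
=# select pg_wal_lsn_diff('9C/1E306050', '95/75619450');

With the primary’s LSN as the first argument, a positive result is roughly the number of bytes of WAL the replica still has to catch up on.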

By Time Difference

I told you earlier that time is tricky and that the WAL is based on a sequence, but timestamps are more readable to humans and more easily ingested by external tools than WAL location values. PostgreSQL can report the timestamp of the last transaction replayed during recovery, allowing you to compare it with the current time using the following query on your replica:

=# -- Same in both Postgres 9.x and 10.x 
=# select now() - pg_last_xact_replay_timestamp();

This value needs to be read with additional context and taken with a grain of salt. It is meant to be an approximation of lag and should be treated as such. I find this query most useful to inject into my time-series observability metrics or to toss in a terminal pane when running operations that might affect lag. If you are selecting a replica to replace a failed primary, you should use the LSN instead of the approximate timestamp of the LSN.
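One caveat worth spelling out: if the primary is idle and generating no new WAL, now() - pg_last_xact_replay_timestamp() keeps growing even though the replica is fully caught up. A common guard, sketched here with the 9.x function names (swap in the pg_last_wal_* equivalents on 10.x), is to report zero when the replica has already replayed everything it has received:

=# -- on the replica: report 0 when fully caught up, otherwise approximate seconds of lag
=# select case
       when pg_last_xlog_receive_location() = pg_last_xlog_replay_location() then 0
       else extract(epoch from now() - pg_last_xact_replay_timestamp())
     end as replication_lag_seconds;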

Other Helpful Queries and Tools

You can run the following to determine whether you are interacting with the primary or a replica. A replica will return ‘t’ and the primary will return ‘f’:

=# select pg_is_in_recovery();

You can also translate the LSN value returned by the functions mentioned above into the name of the corresponding WAL file within your pg_xlog (or pg_wal) directory using:

=# -- In Postgres 9.x
=# select pg_xlogfile_name(pg_last_xlog_receive_location());

=# -- In Postgres 10.x
=# select pg_walfile_name(pg_last_wal_receive_lsn());

If you are curious about what a WAL file actually looks like, PostgreSQL introduced the pg_xlogdump tool in version 9.3 to convert the contents of the binary WAL file into human readable form. Note this tool was renamed to pg_waldump in version 10.0 and is intended for educational purposes only.
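If you want to try it out, a minimal invocation looks something like the following. The segment file name here is only an example; use one that actually exists in your pg_xlog or pg_wal directory:

# Postgres 9.3-9.6: dump the first ten records of a WAL segment
pg_xlogdump --limit=10 $PGDATA/pg_xlog/000000010000009C0000001E

# Postgres 10.x: same idea, with the renamed tool and directory
pg_waldump --limit=10 $PGDATA/pg_wal/000000010000009C0000001E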

Beyond the Queries

If your databases run on a cloud platform, your provider may already provide these metrics for you. For example, AWS CloudWatch provides the ReplicaLag metric and GCP provides a replication lag metric. Finally, whether you use external tooling to monitor your replication lag or write your own monitoring plugins, you need to consider how you actually use these metrics.

As we discussed earlier, replication lag is a helpful metric and provides additional data points when making decisions about your services, but think long and carefully before alerting or paging around replication lag. You probably don’t want to. Replication lag is susceptible to a variety of factors, some of which are not actionable or inherently wrong, and it varies enough that you could find yourself bogged down in tuning alerting thresholds or developing complex anomaly detection. If you do choose to page on this value, make sure you embed plenty of headroom in your thresholds, provide context around potential lag causes in your alerting tools, and give your on-call rotation a few extra high-fives.

December 11, 2017

Day 11 - Scaling your on-duty team

By: Damien Pacaud (@damienpacaud)

Edited By: (@bmarsteau)

Our tech team at teads.tv is mostly based in France where labour law and legislation provide quite a strict set of rules and boundaries for working out of office hours.

For this reason we’ve had to adapt and give some thought to our on-duty team organization as we grew from a start-up to a scale-up.

Intro

Scaling your on-duty team is crucial for most fast-growing startups that operate at a global level. The internet never sleeps, and even with the best design for resilience, one day your system will go down. At Teads, we deliver outstream video advertising for the biggest content publishers in the world. Any downtime has serious repercussions not only on our revenue but also on our publishers’ revenue. We decided to think carefully about scaling our on-duty team in order to minimize the downtime when a system goes down. That story is below.

Our problem

In a few years, we’ve scaled from a growing startup operating with a few pizza teams into a company where more than 100 developers in 3 different locations deliver new features on a daily basis. We’ve been able to do so by implementing our own version of the “Spotify model” and it has given us the ability to stay agile while growing the tech team. Applying the same recipe to the on-duty team was a challenge, to say the least. Initially, the on-duty team was composed of a few developers who had been with Teads since the very beginning and who were very knowledgeable about every part of the platform. We relied on their knowledge, their availability, and the fact that they helped build most of the system. As we grew, the system became larger and more complex. The handful of developers keeping the revenue safe overnight could no longer keep up with the knowledge needed to solve every problem.

First step : Growing the on-duty team

We started looking for people to add to the on-duty team, ideally having someone from each of our feature teams be part of the rotation. This was our way of implementing “you built it, you run it” in a country with strict labour laws. It meant growing that team to 12 people, and that’s when we hit the first wall. We tried growing the team while having a few visible production incidents (S3 Service Disruption in us-east-1, anyone?) and of course, no one was voluntarily applying to be on duty.

Besides, being on duty once every twelve weeks seems counterproductive as it is spread over too long a timespan. By the time you are back on duty, a lot of systems have changed and it is difficult to remember good practices.

Lost battle: trying to be ready

One of the main reasons nobody was applying to the on-duty rotation was the lack of documentation for how to react when an incident arises. We tried to tackle this problem: for a few months we set up meetings, put knowledgeable people in a room, and asked them to kindly document the steps to take when incidents happen.

This was too large of a mission, even for a highly motivated team. Soon, meetings were skipped, and documentation was not improving.

At this point, we started thinking about the problem in a different way.

Enter on-duty pairing

The first decision we took was to have two people on duty at the same time for a week-long shift. We tried to choose pairs wisely, with non-overlapping skill sets and experience. We will, for example, pair a back-end developer with a data-oriented developer. This allows us to cover most systems on the critical chain.

The benefits that we see with on-duty pairing are:

  • It’s much easier to bounce ideas off someone when a problem is impacting production and you (or your pair) do not know how to fix it.
  • Sometimes while on duty, the incident runs so deep that a critical business decision must be taken. It’s much easier to share the responsibility of such a decision in the middle of the night. We accept that this may slow down the decision process as there will be back-and-forth within the pair.
  • In the rare event of someone not waking up to the PagerDuty calls, there is a backup. Interestingly enough, we had never experienced someone not waking up until we started pairing. This raised the question of whether pairing may lower each individual’s sense of alertness because there is a backup, but in the end we feel it has more benefits than downsides.

We implemented this change in a few weeks and so far we are quite happy with it. The team has scaled to 12 developers, coming from all feature teams, and the rotation goes smoothly.

Escalation?

The traditional way of dealing with increasing complexity is to have an escalation policy. We chose not to implement this and instead have PagerDuty automatically wake up both developers in the pair at the same time. This automates the decision of waking up another human being and makes PagerDuty responsible for it. We don’t want to be responsible for this hard decision, so we let the robot do it.

Escalation usually also solves the “I need an expert on [insert any well known distributed system here] and I need her right now” problem. Putting experts on escalation policies is great if you have a big enough pool of them for each of the systems that you use. For us this meant that a few people would be on call every other week. We thought this was not acceptable and decided that we could solve it by:

  • Telling the on-duty team members we know they will do their best to recover from the issue
  • Giving them the confidence that, as engineers, they will find a solution
  • Automating as much of the routine maintenance work as we can (taking a bad Cassandra node out of the ring, decommissioning and replacing a Kafka broker…)

Post-incident & Playbook

Soon after an incident, we gather everyone from the on-duty team in a room for a blameless, fact-oriented post-mortem. We aim to leave the room after one hour having filled in our very simple post-incident template:

  • Summary of the issue
  • How to respond to such an issue (should it arise again)
  • Action plan

This process allows us to document our interventions and ensure that, should the same incident happen again, we have a solution to mitigate its effects in a timely manner.

Feedback

After a few months, we are quite happy with this new on-duty rotation. It has proven useful many times and we now have more documentation than ever on how to react to our alerts. The post-incident ritual also acts as a team bonding meeting and we are thinking of creating more rituals specifically for the on-duty team (on top of each individual’s feature team rituals).

The biggest complexity that we encountered since launching was organizing the Christmas rotation period with pairs. It’s always a challenge to find one person available during those holidays, so trying to find two is double the fun.

December 10, 2017

Day 10 - From Product Eng to Systems Eng

By: Will Gallego (@wcgallego)

Edited By: Dave Mangot (@davemangot)

Engineering is an open field, not a paved road

In engineering, straight lines are few and far between. There is no certain path, no strict guide, no singular right way for every task. There are lots of wrong ways, for sure (being dismissive of others’ hard work, excluding underrepresented or underprivileged folks, stealing ideas and claiming them as your own for a few examples). More often than not it’s hard to know you’re “doing things right” until you look back. It took me a while to understand how my career led me to joining a Systems Engineering team in particular, because so much of this doubt clouded my career path early on. If you’re thinking of making a similar transition, I’d love to see you take a chance in exploring a new facet in the tech world.

Everyone has their own flavor of falling into engineering. Sometimes it’s sitting next to just the right person to peek over their shoulder as they work. Maybe you have a mom who was really into hardware hacking, passing that same love for electronics over to you as you grew up. A lot of the time it could be “part of the job” - you needed to pay the bills or they trained you to write scripts so you could automate some of the more mundane tasks along the way. All are valid and don’t let anyone tell you otherwise.

I grew up with a hobby-centric mindset as my approach to software development. HTML in AOL Pages and writing a blackjack game in BASIC were my gateway drugs, back in the 90’s when it was just starting to become widely accessible to geeks in non-geek families. I had enough knowledge to be dangerous but not enough confidence to push past the fear of failure, even through college. I had a similar reluctance through the first few years of my professional career, a hesitation around parts of the stack that “weren’t for me”. Unless someone asked, I didn’t cross the line.

Taking a bigger step

Often as engineers, we tend to wait for someone to tell us what we can and can’t do. We hesitate to apply for senior roles because we haven’t been called “senior” in a title yet. We hold back on ideas in meetings or questions during architecture reviews worrying that they’re obvious or even stupid (spoiler: they’re not). We don’t question previous design decisions because we assume when it was first built it must have been right and must continue to be right. All of these are intertwined with the fear of being less, doing less, or appearing less in the eyes of those assumed to be smarter or simply more talented than us.

Nothing could be further from the truth.

I don’t believe in an intrinsic trait that breeds systems engineers, or engineers at all for that matter. People are not born and bred for this. Sure, there are folks who are naturally drawn to it and find their mark early in admin work or building distributed systems. That said, no one has database administration in their DNA. There isn’t a gene marker for admins. Noble blood lines delineating who can and can’t be an SRE don’t exist. The field is open to anyone interested, and if you hear someone say differently, they’re only displaying to the world their own insecurities.

Hand in hand with this falsehood is the belief that systems engineering is “harder” than other parts of the stack. Many engineers, myself included, started out building products on the front end typically because the feedback loop is shorter and perhaps arguably more tangible. We convince ourselves that backend work is beyond us because it departs from our comfort zone. Letting go of this self doubt in what you’re capable of opens up a ton of opportunities.

Why join Systems Engineering?

Many of the reasons for which you might have found yourself joining Product Engineering have parallels in Systems Engineering, avenues you might not realize exist. We’re problem solvers and builders, investigators and collaborators. We want to create and to improve, expanding our knowledge of our craft both for ourselves and for others directly and indirectly. If you feel this itch but worry “well, I’m just going to be setting up machines and never coding, right?”, let me allay those fears now. There are a ton of reasons to try your hand here.

Because you care

First and foremost, you care. You’re in this industry because you want to build products that will be impactful, tools that improve people’s daily lives or entertain audiences, maybe even save some lives in the process. Systems Engineers deeply believe in that too, just with a small shift in the focus of said audience. People have problems that need solving, ones that you want to put your energy into helping them overcome.

Empathy should be at the core of everything we do in engineering, regardless of role or position in the company. Typically when building a frontend app, your audience consists of folks external to your org. You might meet a few people who say “Oh, you work at X? I love that and use it every day!” which is a great ego boost knowing you’re making people’s lives better. How great is it to get that same feedback from your friends and coworkers? Combining your knowledge of the needs of your frontend with the functional knowledge of what can be done via the backend can make you an important asset to any technical organization. You get to do so at a very personal level, one in which you can directly ask your consumers “how can I help?”.

You have a passion for understanding

You’re not content in making assumptions about your stack and you’re voracious in your consumption of material to learn. One of my favorite interview questions is simple in its ask and incredibly deep in its answers: “What happens when you type google.com in your browser and hit enter?”. If you’ve ever walked through that, you’ll realize just how far the branching pathways extend in various directions.

  • Well, something has to interpret a domain name into an IP address, but I’ve always handwaved that (something something DNS). But how does DNS actually work?
  • Is it just one host machine somewhere though? Most likely not. How would a site that takes tens, hundreds, thousands of requests a second scale?
  • Hrm, what if I’m logged in - there has to be datastore requests for information relevant to my account to fetch. How do we reliably read and write to that, and how does it change for read or write heavy applications?
  • How are all of those concurrent requests being spread out? Is there a caching layer in front of them to reduce some of that load?
  • What happens when one machine, multiple machines, all of them, fail in a localized zone?
  • How do I know which static asset versions I should be serving?
  • What’s our deployment strategy for making updates to the site?
  • Is there any kind of security for accessing this site - a cert to confirm I’m visiting the correct site and not being hit by a man in the middle attack? Why should I use TLS over SSL?

And this is all just scratching the barest surface for these and many more questions. If you’re passionate about the learning aspect of engineering, you can see how extending yourself further down into the stack gives you near limitless opportunities to grow as an engineer. When you see companies asking for full stack engineers, it’s folks who are asking these questions and more when presented with a new or unknown architecture.

Sometimes it’s fun just to be the smart engineer who knows a ton of stuff, too. This goes a bit beyond the scope of this post, but there’s a distinct push and pull between doing great work you’re proud of and maintaining the humility that you can’t know everything. You certainly don’t want to be the know-it-all jerk who spouts pedantic trivia (think “well, actually…”), but it’s exciting to be the conduit between teams who can answer a ton of questions. Helping people out can create a strong positive feedback loop for feeling valuable in what you do.

In short, systems engineering can be an opportunity to bust through silos and open up some black boxes. See what assumptions you hold that you can break. There’s nothing magical inside, and yet your colleagues will think you’re a wizard with how much you know!

Pushing back against Defensive Attribution Hypothesis

Defensive attribution hypothesis is a cognitive bias in which, in the face of failure, we dissociate the thinking, understanding, and skills of others from our own. That blameful feeling you get when something goes awry and you want to point fingers at “those engineers over there”? That comes part and parcel with this. We tend to see failure as external to ourselves and mentally push it away further, creating a gulf between perceived success and failure. If failure is way over there, then I must be successful over here. Of course in doing so, you’re also pushing understanding of the situation - and people - away.

The classic example of this is the devs vs. ops mentality. A deploy goes out to production. Shortly after, the site experiences an outage. You can imagine exactly what happens next. The developers cry foul, saying “the code was working in our dev environment, so production must not be set up correctly!”. The ops team says “of course it’s set up correctly, the site was working fine up until that deploy so it must be your change that broke everything. Why didn’t you test it more?”. There’s no insight here, no learning from what happened.

Now, imagine you have experience in both product engineering and systems engineering. You understand what time pressures devs are under to accomplish goals and the purpose of the app. They don’t want to break production, but they’re trying to hit their targets for the latest sprint. Likewise, you know what load the backend can handle and what’s required to scale it further. You know metrics like the requests per second your systems are currently under and how the architecture would need to be scaled up should those numbers change. You can see both sides of the situation.

As an example, let’s say your product team is launching a feature that adds a new query to the database. They build it in dev and it’s reading/writing as expected, but under much reduced load. They followed protocol, putting in the change request to the DBAs and setting up an index for this query. They’re being proactive! Likewise, your DBAs gain confidence in the deploy because of this request. The devs are testing it and they wouldn’t have asked for the index addition if they weren’t being careful, so they give the thumbs up. The deploy hits prod and things go south, with both sides believing they had done their due diligence and the other is to blame. A lack of perspective promotes conflict. Now let’s add you to this equation, with your prowess in multiple parts of the stack. You see a code review for this pop into your inbox and think:

  • Perhaps this query could be cached for more availability, since it’s fairly read heavy and can hold up to a bit of staleness without adversely affecting what the app is trying to accomplish
  • Maybe you can reduce the number of columns fetched and use a covering index, because you know what is and isn’t needed for the app
  • To give some confidence to future deploys in general, you could set up a canary cluster for the devs to rely on so that they could see the performance of their code changes before it affects end users

Your knowledge of multiple domains lets you assume best of intentions for all parties involved because you can understand where both parties are coming from and empathize with the requirements they face.

What’s next?

So you’re ready to try your hand at systems engineering be it general ops, database admin, networking, etc., but it’s super intimidating to jump in. You have experience with dev work in some form but the divide looks too far to take in one stride. You need to start fresh in a number of areas, which means in a lot of ways you might feel like you’re starting over. Fortunately, there are lots of ways you can ease into the waters without too much disruption. How do you get from here to there?

Temp rotations with your local ops/backend team

Ask your manager if you can do a short term rotation with your company’s ops team. That tool that everyone wants built but no one has time to get to? This is a great opportunity for both you and your teammates to make it happen. With some coordination, you can be everyone’s favorite engineer putting it together while gaining some real world experience. Likewise, if there are tickets in your Jira, Trello, or other respective task board, see if you can snag one or two that look like low hanging fruit. Your admin friends will be grateful for your help in chopping away at the queue and hopefully can extend some of their domain knowledge to level you up in the process. Everyone wins!

Attend new-to-you meetups, conferences, and meetings

Rotations can be a tall order for some companies, though, and not every org’s roadmap allows for a multi-week trip digression. If your company will sponsor you to go to conferences you might not normally attend, that’s an excellent resource for a longer term investment. Setting up smaller sessions internally within your org to spread domain knowledge as well (demos and “lunch and learns” are two great vehicles for this!) can be immensely useful. This can simultaneously help to break engineers free of being single points of failure for maintaining subsystems and can inform a large swath of folks. Local meetups in your area? Jump on those too! If you’re feeling a bit timid, though, grab a friend to go with, as partnering up can also help the feedback loop for answering questions and stimulating ideas. Sharing information and learning as a team can make those intimidating questions become trivial quickly.

Daily tips and tricks

Impostor syndrome can really hit home when you’re making a large shift like this. Trying to shift gears when you’ve focused so heavily on one vertical in tech is really daunting. DNS? Filesystems? Database integrity? There are so many paths to choose from and each goes so deep. With so many people who have deep concentrations of knowledge, it can be intimidating to try to “catch up”.

There’s no need to try to boil the ocean learning all of this in one day. You’re in no rush and you’ve got lots of time in your career no matter where you are looking to pick up skills. You’d probably be surprised at how much you’ve learned in a short time - unless you’re writing out your achievements as they come (and you should! It’s another great way to fight impostor syndrome). There’s a quote attributed to Bill Gates: “Most people overestimate what they can do in one year and underestimate what they can do in ten years.” We try to do so much in the immediate, but set up a long term plan and you can move mountains. Try working on smaller bite-sized projects, courses, and reading that’ll move you along your path.

Here are a few ways to get inspired if you’re looking to attack this problem on multiple fronts:

  • Subscribe to twitter feeds that offer daily tips or regularly contribute to learning. Some of my favorites: Julia Evans (@bork), Command Line Magic (@climagic), and of course SysAdvent in December (@SysAdvent)
  • Some of your favorite sites have great tech blogs - Kickstarter, Yelp, Netflix, and Etsy
  • Subscribe to mailing lists and newsletters with articles that arrive in your mailbox - DevOpsWeekly, SysAdmin Casts, Monitoring Weekly, and SRE Weekly
  • Add some tech podcasts or screencasts to your commute - Arrested DevOps, CodeNewbie Podcast, SysAdmincasts, and Devops Cafe
  • If your company is doing demo days, sit in on some for other teams you don’t typically collaborate with.
  • Likewise, hack weeks hosted by your company can bubble up great ideas and inspire you to venture into new parts of the stack with guidance from their owners.
  • Start a tech book club at your company reading over a chapter every week or two. Learning can be even more effective when you’re sharing ideas with other like minded folks.

Looking just over the horizon

Finally, you can look to parts of the stack adjacent to your comfort zone as a direction into systems engineering. If your strengths lie in mobile app building, ask yourself what the API architecture it interacts with might look like. If you need to set up a datastore call, investigate how you might profile and optimize that query to utilize indexes or set up caching around it. If you’re writing views and controllers in your favorite language (say, php), take a look behind the curtain to see how a dependency management tool, like composer, might be installing packages in various environments.

Picking up new skills doesn’t have to feel like being air dropped into the middle of nowhere. There’s something novel for sure about learning about tools that may be wholly different from where your current strengths lie, but easing into it with tech bordering what you’re comfortable with can smooth that transition. Checking out “over the horizon” tech to see what’s near to what you know but still new can help broaden your skillsets while leaving you with a starting point to build from in your mental model.

Bundling this up

For those of you thinking about a transition to systems engineering work, I can attest from personal experience how rewarding it can be. By opening yourself up, you can be a powerful force for good in your company, one with high adaptability and a wide breadth of scope for promoting positive change. Understanding this can afford you exciting work to be a part of and new challenges to spur your career moving forward. It’s a big step into uncharted territory, but one that can be deeply satisfying.

If you’re restricting yourself to familiar comfort zones or if you have a tendency towards vertical over horizontal learning, make this an opportunity to surprise yourself. There’s no rush - you’re not falling behind by exploring other parts of the stack. The best engineers in the industry have made it because they understand that failure is a necessary risk to achieve personal growth. Yes, you’re going to trip and fall. You did the same getting into engineering in the first place! Trust that you’ll eventually land on your feet and be all the better for it in the end.

December 9, 2017

Day 9 - Using Kubernetes for multi-provider, multi-region batch jobs

By: Eric Sigler (@esigler)
Edited By: Michelle Carroll (@miiiiiche)

Introduction

At some point you may find yourself wanting to run work on multiple infrastructure providers — for reliability against certain kinds of failures, to take advantage of lower costs in capacity between providers during certain times, or for any other reason specific to your infrastructure. This used to be a very frustrating problem, as you’d be restricted to a “lowest common denominator” set of tools, or have to build up your own infrastructure primitives across multiple providers. With Kubernetes, we have a new, more sophisticated set of tools to apply to this problem.

Today we’re going to walk through how to set up multiple Kubernetes clusters on different infrastructure providers (specifically Google Cloud Platform and Amazon Web Services), and then connect them together using federation. Then we’ll go over how you can submit a batch job task to this infrastructure, and have it run wherever there’s available capacity. Finally, we’ll wrap up with how to clean up from this tutorial.

Overview

Unfortunately, there isn’t a one-step “make me a bunch of federated Kubernetes clusters” button. Instead, we’ve got several parts we’ll need to take care of:

  1. Have all of the prerequisites in place.
  2. Create a work cluster in AWS.
  3. Create a work cluster in GCE.
  4. Create a host cluster for the federation control plane in AWS.
  5. Join the work clusters to the federation control plane.
  6. Configure all clusters to correctly process batch jobs.
  7. Submit an example batch job to test everything.

Disclaimers

  1. Kubecon is the first week of December, and Kubernetes 1.9.0 is likely to be released the second week of December, which means this tutorial may go stale quickly. I’ll try to call out what is likely to change, but if you’re reading this and it’s any time after December 2017, caveat emptor.
  2. This is not the only way to set up Kubernetes (and federation). One of the two work clusters could be used for the federation control plane, and having a Kubernetes cluster with only one node is bad for reliability. A final example is that kops is a fantastic tool for managing Kubernetes cluster state, but production infrastructure state management often has additional complexity.
  3. All of the various CLI tools involved (gcloud, aws, kube*, and kops) have really useful environment variables and configuration files that can decrease the verbosity needed to execute commands. I’m going to avoid many of those in favor of being more explicit in this tutorial, and initialize the rest at the beginning of the setup.
  4. This tutorial is based off information from the Kubernetes federation documentation and kops Getting Started documentation for AWS and GCE wherever possible. When in doubt, there’s always the source code on GitHub.
  5. The free tiers of each platform won’t cover all the costs of going through this tutorial, and there are instructions at the end for how to clean up so that you shouldn’t incur unplanned expense — but always double check your accounts to be sure!

Setting up federated Kubernetes clusters on AWS and GCE

Part 1: Take care of the prerequisites

  1. Sign up for accounts on AWS and GCE.
  2. Install the AWS Command Line Interface - brew install awscli.
  3. Install the Google Cloud SDK.
  4. Install the Kubernetes command line tools - brew install kubernetes-cli kubectl kops
  5. Install the kubefed binary from the appropriate tarball for your system.
  6. Make sure you have an SSH key, or generate a new one.
  7. Use credentials that have sufficient access to create resources in both AWS and GCE. You can use something like IAM accounts.
  8. Have appropriate domain names registered, and a DNS zone configured, for each provider you’re using (Route53 for AWS, Cloud DNS for GCP). I will use “example.com” below — note that you’ll need to keep track of the appropriate records. A quick sketch of creating these zones follows this list.
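If you still need to create those DNS zones, a minimal sketch might look like the following. The zone names are illustrative, and you will also need to delegate each subdomain to the name servers the provider assigns:

aws route53 create-hosted-zone --name "aws.example.com" --caller-reference "sysadvent-$(date +%s)"
gcloud dns managed-zones create "gcp-example-com" --dns-name="gcp.example.com." --description="Zone for the GCE work cluster"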

Finally, you’ll need to pick a few unique names in order to run the below steps. Here are the environment variables that you will need to set beforehand:

export S3_BUCKET_NAME="put-your-unique-bucket-name-here"
export GS_BUCKET_NAME="put-your-unique-bucket-name-here"

Part 2: Set up the work cluster in AWS

To begin, you’ll need to set up the persistent storage that kops will use for the AWS work cluster:

aws s3api create-bucket --bucket $S3_BUCKET_NAME

Then, it’s time to create the configuration for the cluster:

kops create cluster \
 --name="aws.example.com" \
 --dns-zone="aws.example.com" \
 --zones="us-east-1a" \
 --master-size="t2.medium" \
 --node-size="t2.medium" \
 --node-count="1" \
 --state="s3://$S3_BUCKET_NAME" \
 --kubernetes-version="1.8.0" \
 --cloud=aws

If you want to review the configuration, use kops edit cluster aws.example.com --state="s3://$S3_BUCKET_NAME". When you’re ready to proceed, provision the AWS work cluster by running:

kops update cluster aws.example.com --yes --state="s3://$S3_BUCKET_NAME"

Wait until kubectl get nodes --show-labels --context=aws.example.com shows the NODE role as Ready (it should take 3–5 minutes). Congratulations, you have your first (of three) Kubernetes clusters ready!

Part 3: Set up the work cluster in GCE

OK, now we’re going to do a very similar set of steps for our second work cluster, this one on GCE. First though, we need to have a few extra environment variables set:

export PROJECT=`gcloud config get-value project`
export KOPS_FEATURE_FLAGS=AlphaAllowGCE

As the documentation points out, using kops with GCE is still considered alpha. To keep each cluster using vendor-specific tools, let’s set up state storage for the GCE work cluster using Google Storage:

gsutil mb gs://$GS_BUCKET_NAME/

Now it’s time to generate the configuration for the GCE work cluster:

kops create cluster \
 --name="gcp.example.com" \
 --dns-zone="gcp.example.com" \
 --zones="us-east1-b" \
 --state="gs://$GS_BUCKET_NAME/" \
 --project="$PROJECT" \
 --kubernetes-version="1.8.0" \
 --cloud=gce

As before, use kops edit cluster gcp.example.com --state="gs://$GS_BUCKET_NAME/" to peruse the configuration. When ready, provision the GCE work cluster by running:

kops update cluster gcp.example.com --yes --state="gs://$GS_BUCKET_NAME/"

And once kubectl get nodes --show-labels --context=gcp.example.com shows the NODE role as Ready, your second work cluster is complete!

Part 4: Set up the host cluster

It’s useful to have a separate cluster that hosts the federation control plane. In production, it’s better to have this isolation to be able to reason about failure modes for different components. In the context of this tutorial, it’s easier to reason about which cluster is doing what work.

In this case, we can use the existing S3 bucket we’ve previously created to hold the configuration for our second AWS cluster — no additional S3 bucket needed! Let’s generate the configuration for the host cluster, which will run the federation control plane:

kops create cluster \
 --name="host.example.com" \
 --dns-zone="host.example.com" \
 --zones=us-east-1b \
 --master-size="t2.medium" \
 --node-size="t2.medium" \
 --node-count="1" \
 --state="s3://$S3_BUCKET_NAME" \
 --kubernetes-version="1.8.0" \
 --cloud=aws

Once you’re ready, run this command to provision the cluster:

kops update cluster host.example.com --yes --state="s3://$S3_BUCKET_NAME"

And one last time, wait until kubectl get nodes --show-labels --context=host.example.com shows the NODE role as Ready.

Part 5: Set up the federation control plane

Now that we have all of the pieces we need to do work across multiple providers, let’s connect them together using federation. First, add aliases for each of the clusters:

kubectl config set-context aws --cluster=aws.example.com --user=aws.example.com
kubectl config set-context gcp --cluster=gcp.example.com --user=gcp.example.com
kubectl config set-context host --cluster=host.example.com --user=host.example.com

Next up, we use the kubefed command to initialize the control plane and add itself as a member:

kubectl config use-context host
kubefed init fed --host-cluster-context=host --dns-provider=aws-route53 --dns-zone-name="example.com"

If the message “Waiting for federation control plane to come up” takes an unreasonably long amount of time to appear, you can check the underlying pods for any issues by running:

kubectl get all --context=host.example.com --namespace=federation-system
kubectl describe po/fed-controller-manager-EXAMPLE-ID --context=host.example.com --namespace=federation-system

Once you see “Federation API server is running,” we can join the work clusters to the federation control plane:

kubectl config use-context fed
kubefed join aws --host-cluster-context=host --cluster-context=aws
kubefed join gcp --host-cluster-context=host --cluster-context=gcp
kubectl --context=fed create namespace default

To confirm everything’s working, you should see the aws and gcp clusters when you run:

kubectl --context=fed get clusters

Part 6: Set up the batch job API

(Note: This is likely to change as Kubernetes evolves — this was tested on 1.8.0.) We’ll need to edit the federation API server in the control plane, and enable the batch job API. First, let’s edit the deployment for the fed-apiserver:

kubectl --context=host --namespace=federation-system edit deploy/fed-apiserver

And within the configuration in the federation-apiserver section, add a --runtime-config=batch/v1 line, like so:

  containers:
  - command:
    - /hyperkube
    - federation-apiserver
    - --admission-control=NamespaceLifecycle
    - --bind-address=0.0.0.0
    - --client-ca-file=/etc/federation/apiserver/ca.crt
    - --etcd-servers=http://localhost:2379
    - --secure-port=8443
    - --tls-cert-file=/etc/federation/apiserver/server.crt
    - --tls-private-key-file=/etc/federation/apiserver/server.key
  ... Add the line:
    - --runtime-config=batch/v1

Then restart the Federation API Server and Cluster Manager pods by rebooting the node running them. Watch kubectl get all --context=host --namespace=federation-system if you want to see the various components change state. You can verify the change applied by running the following Python code:

# Sample code from Kubernetes Python client
from kubernetes import client, config


def main():
    config.load_kube_config()

    print("Supported APIs (* is preferred version):")
    print("%-20s %s" %
          ("core", ",".join(client.CoreApi().get_api_versions().versions)))
    for api in client.ApisApi().get_api_versions().groups:
        versions = []
        for v in api.versions:
            name = ""
            if v.version == api.preferred_version.version and len(
                    api.versions) > 1:
                name += "*"
            name += v.version
            versions.append(name)
        print("%-40s %s" % (api.name, ",".join(versions)))

if __name__ == '__main__':
    main()      

You should see output from that Python script that looks something like:

> python api.py
Supported APIs (* is preferred version):
core                 v1
federation           v1beta1
extensions           v1beta1
batch                v1

Part 7: Submitting an example job

Following along from the Kubernetes batch job documentation, create a file, pi.yaml with the following contents:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

This job spec:

  • Runs a single container to generate the first 2,000 digits of Pi.
  • Uses a generateName, so you can submit it multiple times (each time it will have a different name).
  • Sets restartPolicy: Never, but OnFailure is also allowed for batch jobs.
  • Sets backoffLimit. This generates a parse violation in 1.8, so we have to disable validation.

Now you can submit the job, and follow it across your federated set of Kubernetes clusters. First, at the federated control plane level, submit and see which work cluster it lands on:

kubectl --validate=false --context=fed create -f ./pi.yaml 
kubectl --context=fed get jobs
kubectl --context=fed describe job/<JOB IDENTIFIER>

Then (assuming it’s the AWS cluster — if not, switch the context below), dive in deeper to see how the job finished:

kubectl --context=aws get jobs
kubectl --context=aws describe job/<JOB IDENTIFIER>
kubectl --context=aws get pods
kubectl --context=aws describe pod/<POD IDENTIFIER>
kubectl --context=aws logs <POD IDENTIFIER>

If all went well, you should see the output from the job. Congratulations!

Cleaning up

Once you’re done trying out this demonstration cluster, you can clean up all of the resources you created by running:

kops delete cluster gcp.example.com --yes --state="gs://$GS_BUCKET_NAME/"
kops delete cluster aws.example.com --yes --state="s3://$S3_BUCKET_NAME"
kops delete cluster host.example.com --yes --state="s3://$S3_BUCKET_NAME"

Don’t forget to verify in the AWS and GCE console that everything was removed, to avoid any unexpected expenses.

Conclusion

Kubernetes provides a tremendous amount of infrastructure flexibility to everyone involved in developing and operating software, and there are many different applications for federated Kubernetes clusters beyond the batch job example we walked through here.

Good luck to you in whatever your Kubernetes design patterns may be, and happy SysAdvent!

December 8, 2017

Day 8 - Breaking in a New Company as an SRE

By: Amy Tobey (@AlTobey)
Edited By: Tom Purl (@tompurl)

To put it mildly, 2017 has been an interesting year for everyone. My story has its own bumps and advances, including changing jobs twice recently. Changing jobs is a reality many of us face, hopefully voluntarily. It happens infrequently enough that it isn’t something we spend a lot of time talking about. Life’s trials tend to teach us quickly and I hope to share some of what I’ve learned with you.

Backstory

I started the year as an SRE on the CORE team at Netflix. For various reasons, mostly involving stress, I ended up being fired in late June. I only took a couple days to cool off before kicking off interviews. I knew my financial situation would be stable because of the severance package, but I still wanted to have something lined up as soon as possible. Interviewing can sometimes take many weeks or even months so I got started right away.

My first stop was Apple. I knew lots of engineers from my Cassandra days and a friend there had previously pinged me to see if I was interested. I let them know I was available and that process took off.

I also started talking to two other large tech companies, eventually withdrawing from both. One was too far away and had a lengthy study guide for interviews that I was not willing to commit to. The other was going to make an offer but I found their interview process too heavily tilted towards technical ability with little to no discussion of culture or the day-to-day work.

Finally, I interviewed with two smaller infosec companies. On one call we mutually recognized there was no fit while the other moved on to make me a compelling offer.

I eventually went with the Apple offer as it paid more and I knew more people there. The difference I didn’t adequately account for was the commute; the security company is remote, while Apple involved a long shuttle commute in Bay Area traffic.

I was only at Apple for 6 weeks and during that time I was trying to experience it with a beginner’s mind. I read docs, code, and listened a lot. In the end, it wasn’t a culture fit for me and the commute was unbearable at 1-2 hours each way, so I quit.

The day after quitting, I set up a call with Tenable and they graciously renewed their offer to me which I accepted immediately. This time I took a couple weeks off because I really wanted to give them my best start.

Day Zero

Since my job is remote, they mailed me my gear. It arrived the Friday before my first day and I took inventory and set it all aside. Some of my accounts had been activated early, but I resisted the urge to start nesting before I was official.

Week 1

I found some bugs with the Windows 10 installation on my laptop on the first day. I spent far too much time dinking with it and finally emailed my managers and IT and got it sorted out within a few hours. I still ended up installing Linux because I found some things about the standard image too constricting.

Besides messing with my laptop, I had a few meetings and spent a lot of time reading docs and code. I spent the remainder getting to know folks over chat and video calls.

Lessons:

  • Start using collective speech immediately, i.e. “ours” instead of “yours”.
  • Reach out for help with IT problems more quickly, even if the issue at hand appears user-serviceable.
  • Reading docs and code in downtime was a great idea.
  • Spending time keeping up with chat rooms is a good way to get to know people in a remote job.
  • Keep notes about things that are troublesome like my laptop issues and bumps in setting up accounts (more on this later).
  • Generate 4096-bit ssh keys with a good password to impress your colleagues when they copy/paste your pubkey.
  • Install Linux sooner :)

Week 2

With a working laptop and most of my accounts set up, it was time to start digging in. I got set up on the bastions and started looking at some tickets for our databases. Getting into live machines for analysis helped expose gaps in the new-hire onboarding workflow. Where possible I updated docs and other things were added to my notes.

Lessons:

  • Find a ticket to work on right away and give yourself plenty of time. I hit a few snags with access and local conventions that slowed me down.
  • Try to figure things out by searching docs and poking.
  • Start committing small changes to the team’s primary repos. e.g. formatting, spelling, tests, and minor bug fixes.
  • Set a time limit on self-discovery and ask for help.
  • I don’t always take notes, but when I do, it’s usually during my first month on the job.

Week 3

For my third week, the company flew me to headquarters for the official orientation. The first day of company overview, culture introduction, and getting to know the other new hires was enjoyable and interesting. The next couple days were largely sales-focused. My manager and one of my peers flew in and sprung me from those days. We went over our ongoing projects and prioritized work while we got a feel for each other. On the last day of the week I started researching projects we prioritized and started coordinating with other SREs on how to approach things.

Lessons:

  • Get to know one or two people in your hiring “class”. It’s fun and you’ll probably run into these folks again and again regardless of which department each of you are in.
  • Ask about orientation and find out if you really need to attend all of it. At the SaaS companies I’ve experienced there is a lot of content for sales folks (often much higher turnover than other departments) that might not be very relevant to an SRE.
  • Remember to install your password manager on your travel device (unless international).

Week 4

In my reviews of the Cassandra setup I found a few things that needed addressing. Rather than handing off the work that needed to be done, I decided to take it on a little before I felt ready. This gave me the opportunity to start committing to our Ansible repos. I wrote a bit of shell code and updated a bunch of playbooks that led me through learning the basics I needed to work with Ansible daily. I got one project done that was blocking progress for dev teams which led to making friends in those teams as I worked with them.

By this time I was starting to get questions from managers and peers about how I was liking the company and people. I was happy to reply in the positive and kept mental notes about specific things people asked.

Lessons:

  • Find a project in your wheelhouse to start with. For example, I knew Cassandra or Linux tuning would be a good place for me. Find some familiar ground and use it to launch yourself forward.
  • Clearing blocker tickets is an excellent way to make friends in dev teams.
  • Remember to give yourself extra time to adapt to the new environment.

Beginner’s Mind

A few years ago when I was stressing out about a talk, Aaron Morton recommended that I read Zen Mind, Beginner’s Mind. It’s an excellent book if you’re interested in Zen of course, but one thing I took away from it is the value of seeing things with a beginner’s mind. Once we’ve acclimated to a new environment - be it a company, home, city, or vehicle - we are less likely to notice things that are obvious to beginners. I took my notes and data spanning git logs, email, Jira, and even my shell history to create a document with all the things I noticed that were good and those that needed improvement. I chose to keep things as constructive and positive as I could.

Observations:

  • Pointed out some technical debt, noting that it was not bad and totally fixable.
  • Wrote a quick review of the state of our Ansible repos and procedures.
  • Noted some good patterns I noticed in our AWS setup, particularly that we were leveraging IAM and VPC quite well for being relatively new to cloud ops.
  • Included a detailed review of the Cassandra clusters since that’s an area of expertise for me.
  • Pointed out gaps in operational tooling that we should focus on.
  • Gushed about how I saw the company’s stated culture show up in my interactions with people as I onboarded.
  • Wrote a bit about my experience with our product and threw out my BHAG for it.
  • Recommended improvements for gear & provisioning specific to technical staff.
  • Called out the need to simplify our edge architecture (under way!).
  • Talked briefly about my orientation experience and recommended some changes for engineering hires.
  • Linked some research about IT procedures that needed to be updated.
  • Celebrated that folks were using my name and pronouns consistently.
  • Offered to advise IT folks on some improvements to laptop deploys.

I wrote the doc with my manager, their manager, and my peers in mind. The last I heard it was being passed around the C-levels with positive remarks. Even though I resolved some blockers and deployed a significant upgrade, this document was by far my largest contribution to date. The people I joined were used to their environment and by way of keeping some notes and adding some narrative, I was able to reach a large chunk of the management team and have an impact that benefitted the whole organization.

Wrapup

Starting a new job is stressful regardless of your role. For those of us working in operations and site reliability, the tech work is usually the easy part. Getting acclimated to the social fabric and pacing yourself for the long haul is something I think we all struggle with. By being methodical and mindful about your approach to your first few weeks, it’s possible to make an impact without the usual stress of being new and vulnerable.

As with any tips of this sort, your experience may vary. Some things that worked for me may also work for you. Above all, find what works for you, make a plan, and communicate, communicate, communicate.

December 7, 2017

Day 7 - Running InSpec as a Push Job…or…The Nightmare Before Christmas

By: Annie Hedgepath (@anniehedgie)

Edited By: Jan Ivar Beddari (@beddari)

Bored with his Halloween routine, Jack Skellington longs to spread Christmas joy, but his antics put Santa and the holiday in jeopardy! - Disney

I feel a kindred spirit with Jack Skellington. I, too, wanted to spread some holiday-InSpec joy with my client, but the antics of their air-gapped environment almost put InSpec and my holiday joy in jeopardy. All my client wanted for Christmas was to be able to run my InSpec profile in the Jenkins pipeline to validate configuration of their nodes, and I was eager to give that to them.

Sit back and let me tell the holiday tale of how I had no other choice but to use Chef push jobs to run InSpec in an air-gapped environment and why it almost ruined Christmas.

Nothing would have brought me more holiday cheer than to be able to run the tests as a winrm or ssh command from the Jenkins server, directly against a profile in a git repository without checking it out. However, my soul sank as I uncovered reason after reason for the lack of joy for the season:

Scroogey Problems:

  1. Network Connectivity: The nodes are in an air-gapped environment, and we needed InSpec to run every time a node was added.
  2. Jumpbox Not an Option: I could have PowerShell remoted into the jumpbox and run my InSpec command remotely, but this was, again, not an option for me. You see, my InSpec profile required an attributes file. An attribute is a specific detail about a node, so I had to create an attributes file for my InSpec profile by using a template resource in the Chef cookbook. This is because I needed node attributes and data-bag information for my tests that were specific to that particular Chef environment.
  3. SSL Verification: There is an SSL error when trying to access the git repo in order to run the InSpec profile remotely. Chef is working on a feature to disable SSL verification; when that lands, we will be able to point InSpec at a git link, but not yet.

Because we were already using push jobs for other tasks, I finally succumbed to the idea that I would need to run my InSpec profiles as ::sigh:: push jobs.

Let me tell you real quickly what a push job is. Basically, you run a cookbook on your node that lets the Chef server push a job onto that node. When you run the push jobs cookbook, you define that job with a simple name like “inspec” and what it does, for example: inspec exec .. Then you run that job with a knife command, like knife job start inspec mynodename.
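
As a rough sketch (assuming the knife-push plugin is installed on your workstation and “mynodename” stands in for a real node), the flow from the command line looks something like this:

    # See which nodes are registered and available to receive push jobs
    knife node status

    # Kick off the whitelisted "inspec" job on a single node
    knife job start inspec mynodename

    # Check on recent jobs and their outcomes
    knife job list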

Easy, right? Don’t get so cheery yet.

This was the high level of what would have to happen, or what you might call the top-of-the-Christmas-tree view:

  1. The InSpec profile is updated.
  2. The InSpec profile is zipped up into a .tar.gz file using inspec archive [path] and placed in the files/default folder of a wrapper cookbook to the cookbook that we were testing. The good thing about using the archive command is that it versions your profile in the file name (see the sketch after this list).
  3. The wrapper cookbook with the new version of the zipped up InSpec profile is uploaded to the Chef server.
  4. Jenkins runs the wrapper cookbook as a push job when a new node is added, and the zipped up InSpec profile is added to the node using Chef’s file resource.
  5. During the cookbook run, an attributes file is created from a template file for the InSpec profile to consume. The push jobs cookbook has a whitelist attribute to which you add your push job; you’re just telling Chef that it’s okay to run this job. Because my InSpec command was different each time due to the version of the InSpec profile, I basically had to make the command into a variable, which meant I had to nest my attributes, like this:
    node['push_jobs']['whitelist'] = {
     'chef-client' => 'chef-client',
     'inspec' => node['mycookbook']['inspec_command']
    }

    The inspec_command attribute was defined like this (more nesting):

    "C:/opscode/chef/embedded/bin/inspec exec #{Chef::Config[:file_cache_path]}/cookbooks/mycookbook/files/default/mycookbook-inspec-#{default['mycookbook']['inspec_profile_version']}.tar.gz --attrs #{default['mycookbook']['inspec_attributes_path']}"
  6. Another Jenkins stage is added that runs the “inspec” push job.
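
To make steps 2 and 3 a bit more concrete, here is a sketch of the archive-and-upload dance from a workstation; the profile name, version number, and paths are placeholders, not the real ones from this project:

    # Step 2: archive the profile; inspec names the tarball <profile>-<version>.tar.gz
    inspec archive path/to/mycookbook-inspec

    # Copy the versioned tarball into the wrapper cookbook's files/default directory
    cp mycookbook-inspec-1.2.3.tar.gz path/to/mycookbook-wrapper/files/default/

    # Step 3: upload the wrapper cookbook to the Chef server
    knife cookbook upload mycookbook-wrapper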

And all of that needs to be automated so that it actually stays updated. Yay…

I will not get into the details of automating this process, but here is the basic idea. It is necessary to leverage a build that is kicked off in Jenkins by a pull request made in git. That build, driven by a Jenkinsfile in my InSpec profile, does the following:

  • archives the profile after it merges into master
  • checks out the wrapper cookbook and creates a branch
  • adds the new version of the profile to the files/default directory
  • updates the InSpec profile version number in the attributes file
  • makes a pull request to the wrapper cookbook’s master branch, which also has a pull request build that ensures Test Kitchen passes before it is merged
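
The shell below is only a sketch of what that Jenkinsfile ends up driving; the repo URL, branch name, and version numbers are placeholders, and the pull request step is hand-waved because it depends on your git hosting setup:

    # Archive the freshly merged profile (produces mycookbook-inspec-<version>.tar.gz)
    inspec archive .

    # Check out the wrapper cookbook and create a working branch
    git clone ssh://git@bitbucket.example.com/ops/mycookbook-wrapper.git
    cd mycookbook-wrapper && git checkout -b bump-inspec-profile

    # Drop in the new tarball and bump the version attribute the inspec_command uses
    cp ../mycookbook-inspec-1.2.4.tar.gz files/default/
    sed -i "s/\['inspec_profile_version'\] = .*/['inspec_profile_version'] = '1.2.4'/" attributes/default.rb

    git commit -am "Bump InSpec profile to 1.2.4"
    git push -u origin bump-inspec-profile
    # Open the pull request via your git host's UI or API; that PR's build runs Test Kitchen before merge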

So…this works, but it’s not fun at all. It’s definitely the Nightmare Before Christmas and the Grinch Who Stole Christmas wrapped up into one. It takes a few plugins in both Jenkins and BitBucket, which can be difficult to pull off if you don’t have admin rights. I used this blog post as a reference.

I battled internally over whether there was a simpler way to do this. A couple of nice alternatives could have been SaltStack and Chef Automate, but neither of those was an option for me. I’m not familiar with SaltStack, but I’m told that its remote execution feature would be able to run InSpec in an air-gapped environment. Likewise, Chef Automate has the Chef Compliance feature, which runs all of your InSpec profiles from a Compliance server that you can put inside your network. I’m still on the fence about whether those would have been easier to implement, though, because of how heavily I depended on the node attributes and data-bags stored on the Chef server.

As ugly as this process is, every time I see all those successful test results displayed in the Jenkins output, I can’t help but put a big ol' jolly smile on my face. Sure, it super sucks to jump through all these hoops to get InSpec to work in this environment, but when the automation works, it just works, and no one knows what I had to go through to get it there. It’s like a Christmas miracle.

Do I recommend doing it this way if you don’t have to? No. Is this a great workaround if you have no other way to validate your configuration? Absolutely.

And if you need further convincing of the Christmas magic of InSpec, be sure to read my post last year about how InSpec builds empathy across organizations.

I hope you enjoyed my post! Many special thanks to Jan Ivar Beddari for editing this post and to Chris Webber for organizing this very merry blog for all of us! You can follow me on Twitter @anniehedgie. If you’d like to read more about InSpec, I wrote a whole tutorial series for you to follow here. And if you’d like me and my team at 10th Magnitude to help you out with all things Azure, give us a shout!

December 6, 2017

Day 6 - sysadmins - the evolution of a role amidst revolutionary hype.

By: Robert Treat (@robtreat2)

Edited By: Daniel “phrawzty” Maher (@phrawzty)

Like so many things in our industry, our job titles have become victims of the never-ending hype cycle. While the ideas behind terms like “DevOps” or “Site Reliability Engineering” are certainly valid, over time the ideas get lost and the terms become little more than buzzwords, amplified by a recruiting industry more concerned about their own short-term paychecks than our long-term career journeys. As jobs and workers are rebranded as DevOps and SREs, many sysadmins are left wondering if they are being left behind. Add in a heavy dose of the cloud, and a sysadmin has to wonder whether they will have a job in a few years. The good news is that despite all the noise, the need for sysadmins has never been stronger; you just need to see the connections between the technology you grew up on and the technology that is going to move us forward in the next several years.

It used to be that when you started a new project you first had to determine what hardware to use, both from a physical standpoint and from a pricing standpoint. In the cloud, these concerns are still there but have shifted. While most people in the cloud no longer worry about sizing infrastructure correctly at the start of a project, the tradeoff of being able to re-size VMs with relative ease is that it is also easy to oversize your instances, and in the cloud oversize means overspend, all day every day. At some point you need to work out the math to determine what your growth curve looks like in terms of resource needs vs dollars spent. Ever made that joke about having to bust out the slide rule to run cost comparisons between Riverbed, NetApp, and LSI? As much as they try, the cloud hasn’t made IOPS go away. Helping set estimates on how many IOPS an application will consume still requires a bit of maths, only now you also need to know your way around EBS and SSDs vs Provisioned IOPS in order to determine a reasonable IOPS per dollar ratio. But hey, you’ve done that before.
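
As a back-of-the-envelope sketch of that IOPS-per-dollar math (the prices and volume sizes below are made-up placeholders; substitute your provider’s current numbers):

    # Hypothetical monthly prices -- look up the real figures on your provider's pricing page
    GP2_PRICE_PER_GB=0.10        # general purpose SSD: pay per GB, baseline IOPS scale with size
    PIOPS_PRICE_PER_IOPS=0.065   # provisioned IOPS: pay per IOPS, on top of the storage cost

    # A 1000 GB gp2-style volume with a 3 IOPS/GB baseline: ~3000 IOPS for the storage price alone
    echo "general purpose IOPS per dollar: $(echo "3000 / (1000 * $GP2_PRICE_PER_GB)" | bc -l)"

    # 3000 provisioned IOPS purchased explicitly (ignoring the separate storage charge)
    echo "provisioned IOPS per dollar: $(echo "3000 / (3000 * $PIOPS_PRICE_PER_IOPS)" | bc -l)"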

And that’s the thing; there are many skills which transfer like that. Scared about the new world of microservices? Don’t be - you were dealing with microservice-like architectures long before the rest of us had even heard of the term. At its core, microservices are just a collection of loosely coupled services designed to meet some set of business goals. While we have not traditionally built and run applications that way, the mental leap for a sysadmin familiar with managing machines running Apache, Sendmail, OpenLDAP, and Squid is much less than for a developer who has only ever dealt with building complex monolithic applications. As sysadmins, we don’t think twice about individual services running on different ports, speaking different protocols, and providing completely different methods for observing their behavior; that’s just the way it is. Compare that to a development community that has wasted a generation trying to build ORMs to abstract away the concept of data storage rather than just learning to talk to another service in its own language.

This isn’t to say you can rest on your laurels. The field of Web Operations and the software that powers it is constantly changing, so you need to develop the ability to take what you already know and apply it to new software and systems. It is worth pointing out that this won’t always be easy or clean; new technology often misrepresents itself. For example, the rise of tools like Chef and Docker left many sysadmins wondering which direction to turn, but if you study these tools for a bit, you see that they draw similar patterns to old techniques. It can certainly be difficult for folks who have spent years coding on the command line to grok the syntax of a configuration management tool’s DSL, but you can easily understand why companies want to automate systems; the idea of replacing manual labor with something automated is something we print on t-shirts just for fun. And sure, I understand how the yarn ball of recipes, resources, and roles might look like overkill to a lot of people, but I’ve also seen some crazy complex mixes of bash and Perl used as startup scripts during a PXE boot, so it’s all relative.

When Docker first came on the scene, it also promised to revolutionize everything we know about managing systems. All the hype around containers seemed to focus on resource management and security, but the reality was mostly just a new way to package and distribute software, where by “new” I mostly just mean “different”. I’ve often lamented that something is lost when an operator never learns how to compile code or bypasses the experience of packaging other people’s software. The promise of Docker is that you can put together systems like a set of Legos, snapping together pre-existing pieces, but stray from the well-trodden path even a little and you’ll find all of those magic compile-time errors and strange library dependencies that you are familiar with from systems in the past. Like any system built on abstractions, there are assumptions (and therefore dependencies) baked in three levels deep. If you ever debated whether to rewrite an rpm spec file from scratch after a half day hacking on the distro’s spec file trying to add in the one module you need that the maintainers didn’t… replace “rpm spec file” with “Dockerfile” and you have someone to share root beers with. Sure the container magic is magic when it works, but the devil is in the dependencies.

Of course no conversation about the role of the sysadmin would be complete without touching on the topics of networks and security. While sometimes made the purview of dedicated personnel, at some level these two areas always seem to fall back to the operations team. Understanding the different networks within your organization, the boundaries between those networks, and who gets to traverse them (and how) has always been a part of life as a sysadmin. Unfortunately in what should be one of the most directly applicable skillsets (networks are still networks), the current situation in cloud land has actually gotten worse; the stakes are fundamentally higher in a world where the public internet is always just a mis-configuration away. Incorrect permissions on a network file share might expose sensitive material to everyone in the company, but incorrect permissions in S3 expose those files to the world. Networking is also more complicated in the cloud world. I’ve always been pretty impressed by how far one could get with VPNs and SSH when building your own infrastructure, but with cloud providers focused on attracting enterprise clients in regulated industries, you’ll have to learn new tooling built to meet those needs, for better or worse. It can still be done, just be aware it is going to work a little differently.

So the good news is that the role of the sysadmin isn’t going away. While the specifics may have changed, resource management, automation, packaging, network management, security, rapid response, and all the other parts of the sysadmin ethos remain critical skills that companies will need going forward. Unfortunately that doesn’t solve the problem of companies thinking they need to hire more “DevOps Developers” (though I never see jobs for “DevOps Operators” - go figure!) and other such crazy things. As I see it, it is easy to make fun of companies who want to hire DevOps engineers because they don’t understand DevOps, but you can also look at it like hiring a network engineer or security engineer - believing you need someone who specializes in automation (and likely CM or CI/CD specifically) is not necessarily a bad thing. So the next time you’re feeling lost or wondering where your next journey may take you, remember that if you focus on the principles of running systems reliably and keep your learning focused on fundamental computing skills, then even though the tools may change, the problems are fundamentally the same.