tl;dr

Batteries-included features: baked-in Google Workspace syncing, more rules-based policies, k8s polish, Slack JIT access bot

Catching up

Since the last update, a lot of features have shipped that show the Tailscale platform maturing, with a focus on making it easy to manage and deploy at scale. This is a big deal because these features aren’t showy and don’t demo well, but they pay dividends when it comes to scaling a deployment past the initial POC.

Details

  • Google Sync: Replaces an entire chunk of Terraform code I wrote at my previous job, which used the Google Workspace Provider to scrape group memberships on a periodic basis and populate the policy file with users/groups. Making this a feature of the platform instead of the policy file is really nice and makes it simple to scale up.

  • Via: Useful for a lot of things, this allows for meta rules and is especially useful when paired with Device Postures. It allows for policies such as “SE team’s iOS and macOS devices can use the Salesforce App Connector, but only their registered work MacBook can access the demo cluster”. It also enables regional routing, so a distributed company could ensure users connect to the subnet router closest to them. It would be great if/when the data collected for device posture includes the local subnet of the device, allowing different rules to apply when a user is on a corporate network vs remote. That last case is something I’ve already run into: my home lab is on a different subnet than my main network, and I want my devices to reach the lab via the subnet router that runs there only when I’m remote. When I’m on the main network I don’t need an additional hop, since my L3 switch can handle the traffic for me.

  • IP Sets: While it looks like a large-enterprise feature, it also partially addresses the same issue I had, where my home network is more than one subnet (or a block of sequential subnets that could be listed simply as a single, broader CIDR block). Being able to replace “autogroup:internet” with [“autogroup:internet except all the internal office IP ranges”](https://tailscale.com/kb/1387/ipsets#customize-autogroupinternet) is essentially a more fine-grained version of the “enable local network access” setting in the Tailscale client. A big benefit is that one can also exclude things like trusted CDN endpoints, to lessen the burden on the exit nodes themselves. There are other scenarios this can be used for; in general I see this as a big feature for people rolling out Tailscale to hundreds if not thousands of endpoints and wanting meta controls on how traffic flows across the tailnet. I believe this is also the first time one can write a negative rule in policy, finally allowing an easier way to alias “include everything in this subnet except this one IP”.

  • Kubernetes Operator Updates: Too many to include in a quick review, but a lot of this looks to be aimed at making it easier to run and provide Tailscale features/services (exit nodes, app connectors, funnels) on a Kubernetes platform. I’d not suggest deploying an EKS cluster just to provide Tailscale services, but since Kubernetes/EKS has replaced vSphere for a lot of IT shops, it makes sense to make those teams’ lives easier: just deploy a manifest and get a subnet router on hardware/services already deployed with the right network configurations etc. (If just deploying a subnet router or an exit node, I found an Arm-based EC2 instance to be much more affordable.) Will have to play with this more in my lab.

  • Slack Accessbot: Leveraging device postures again, this Slackbot lets you ask in a Slack workflow for time-limited access to a specific Tailscale device or tag. Under the hood, what it is actually doing is adding a custom posture attribute to the device. Instead of rewriting a policy on the fly, this assumes you’ve already added a policy stating “Janice, Chris, and anyone with a device with ProdAccess=True can access these tags.” All the Slackbot is doing is using the API to update the individual’s device posture (and, I assume, removing it when the time limit expires). So while one could use the accessbot as is, it is also a really good example to build on when customizing one’s own Slack workflow that does something similar. A plugin for incident.io was on my wishlist when deploying a similar solution in the past. It would allow prod access to be granted during an incident but, more importantly, any access to prod would require declaring an incident (even for a minor change - it acts as an incentive that rewards robust engineering that doesn’t require prod access to debug / troubleshoot / monitor).
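
For the regional routing idea in the Via bullet, here is a rough sketch of what such a policy fragment might look like, built as a Python dict so it can be dumped to JSON. The group names, router tags, and the 10.20.0.0/16 route are all hypothetical, and the exact grant fields ("src", "dst", "via", "ip") are written from memory of the grants syntax, so check them against the current policy file documentation before using anything like this.

```python
import json

# Hypothetical regions: each user group is pinned to the subnet router
# tagged for its own region. None of these names come from a real tailnet.
REGIONS = {
    "group:users-us": "tag:router-us",
    "group:users-eu": "tag:router-eu",
}

# Assumed route that every regional subnet router advertises.
OFFICE_ROUTE = "10.20.0.0/16"


def regional_grants(regions, route):
    """Return one grant per region that sends `route` via that region's router."""
    return [
        {"src": [group], "dst": [route], "via": [tag], "ip": ["*"]}
        for group, tag in regions.items()
    ]


policy_fragment = {"grants": regional_grants(REGIONS, OFFICE_ROUTE)}
print(json.dumps(policy_fragment, indent=2))
```

Generating the grants from a small table keeps the per-region rules identical apart from the group and the tag, which is the part that tends to drift when they are hand-edited.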
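
To make the two CIDR points in the IP Sets bullet concrete, the sketch below uses only Python's standard `ipaddress` module (nothing Tailscale-specific). The addresses are made up for illustration. It shows sequential subnets collapsing into one broader block, and what “this subnet except one IP” costs when no negative rule is available.

```python
import ipaddress

# Hypothetical home-lab addressing, chosen only for illustration.
lab_subnets = [
    ipaddress.ip_network("192.168.4.0/24"),
    ipaddress.ip_network("192.168.5.0/24"),
    ipaddress.ip_network("192.168.6.0/24"),
    ipaddress.ip_network("192.168.7.0/24"),
]

# Four sequential /24s collapse into a single /22.
collapsed = list(ipaddress.collapse_addresses(lab_subnets))

# "Everything in this subnet except this one IP", written without a
# negative rule: the /24 has to be split into several smaller blocks.
subnet = ipaddress.ip_network("192.168.4.0/24")
excluded = ipaddress.ip_network("192.168.4.10/32")
remaining = sorted(subnet.address_exclude(excluded))

print(collapsed)
for network in remaining:
    print(network)
```

Excluding a single address from a /24 leaves eight separate CIDR blocks to list and maintain, which is why being able to say “add the subnet, remove the one IP” is such a relief.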
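
For anyone wanting to build their own workflow along the lines of the accessbot, the moving parts might look roughly like the sketch below. This is not the accessbot's actual code. The attribute name `custom:prodAccess`, the policy fragment, the device ID, and the user emails are hypothetical, and the endpoint path and the `expiry` field are my understanding of the Tailscale API's custom posture attribute call, so verify them against the API reference. The function only builds the request and never sends it.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

API_BASE = "https://api.tailscale.com/api/v2"

# Assumed policy that is already in place: named users always have access,
# and anyone whose device carries the custom attribute gets it temporarily.
POLICY_FRAGMENT = {
    "postures": {"posture:prodAccess": ["custom:prodAccess == true"]},
    "grants": [
        {"src": ["janice@example.com", "chris@example.com"],
         "dst": ["tag:prod"], "ip": ["*"]},
        {"src": ["autogroup:member"], "srcPosture": ["posture:prodAccess"],
         "dst": ["tag:prod"], "ip": ["*"]},
    ],
}


def build_grant_request(device_id, api_key, minutes,
                        attribute="custom:prodAccess", now=None):
    """Build (but do not send) a request that sets a time-limited attribute."""
    now = now or datetime.now(timezone.utc)
    expiry = (now + timedelta(minutes=minutes)).strftime("%Y-%m-%dT%H:%M:%SZ")
    body = json.dumps({"value": True, "expiry": expiry}).encode()
    return urllib.request.Request(
        url=f"{API_BASE}/device/{device_id}/attributes/{attribute}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

A Slack workflow step would call something like `build_grant_request(device_id, api_key, 30)` after approval and pass the result to `urllib.request.urlopen`. Leaning on an expiry rather than a later "revoke" call means a crashed bot fails closed instead of leaving access in place.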

Thoughts

I’d have loved to have these features while deploying Tailscale in the past. For any Security or HelpDesk team tasked with managing Tailscale, these are great quality-of-life improvements. The magic of Tailscale is that it is really painless and a pretty seamless experience for most end users, but as the number of users one is managing on the platform grows, the management side of the platform matters more than the end-user features. Having an incrementally faster client isn’t as important as ensuring every user’s device is being upgraded properly to take advantage of that feature (or security fix). It’s refreshing to see a PLG company address the administrative and “boring” side of the feature set - I would hope this continues.

On a less positive note about all these features: refactoring and managing all these settings and configurations in a single policy file may start to become unwieldy. It would be good to see some best practices on how to structure policy files, and possibly some more UI-driven experiences (even if they’re still modifying the policy file on the backend). Troubleshooting why a node can or can’t reach another one can become burdensome and may lead to changes that are accidentally more permissive than required (I know I’ve resorted to this in the past). Tests exist, but one has to become an expert on Tailscale to really appreciate how to write them. Adding guardrails and examples with tests could go a long way toward letting a user focus on providing the business logic (these teams need to access these resources) instead of having to understand IP sets vs aliases and how Via works, etc.
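
As a sketch of what “focus on the business logic” could look like in practice, the snippet below turns a small table of who-can-reach-what into entries for the policy file's `tests` section. The groups, tags, and ports are hypothetical, and the `src` / `accept` / `deny` shape is my reading of the policy tests format, so treat it as a starting point, not a reference.

```python
import json

# Hypothetical business rules: which group should and should not reach what.
ACCESS_RULES = [
    {"who": "group:se", "can": ["tag:demo:443"], "cannot": ["tag:prod:22"]},
    {"who": "group:sre", "can": ["tag:demo:443", "tag:prod:22"], "cannot": []},
]


def to_policy_tests(rules):
    """Convert plain access rules into entries for the policy `tests` section."""
    tests = []
    for rule in rules:
        entry = {"src": rule["who"]}
        if rule["can"]:
            entry["accept"] = list(rule["can"])
        if rule["cannot"]:
            entry["deny"] = list(rule["cannot"])
        tests.append(entry)
    return tests


print(json.dumps({"tests": to_policy_tests(ACCESS_RULES)}, indent=2))
```

With the expectations written down this way, a policy edit that accidentally widens access should fail the `deny` check when the file is saved, instead of being discovered later.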