One of my first jobs included managing on-premises infrastructure for businesses, and configuring VPN systems was a constant hassle. When I moved to Puppet I was glad that VPN services were someone else’s problem and that my role didn’t need access to services in the datacenter. The purpose of the VPN was to extend the old-school hard candy shell of physical network security to remote employees and offices. There was a luxury in knowing that access to specific systems required physical presence in a location with security controls like badged access and 802.1X RADIUS authentication on network ports.

Fast forward to 2022 and I’m deploying Tailscale for a company with no physical office, where all the resources are hosted on someone else’s computers (aka “the cloud”). How would very smart and well-managed WireGuard tunnels be useful for them? They’re developing a cloud-based SaaS: source code is hosted on GitHub, meetings happen in Zoom, chat in Slack, the CRM is Salesforce; the very standard startup stack of tools.

The company’s first use of Tailscale was the exit-node feature. They replaced individuals’ Surfshark and similar personal VPN subscriptions with a handful of small compute instances acting as VPN endpoints in different regions. Employees could select a nearby exit node while at the coffee shop and know they had a secure tunnel out of a suspect network.
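
Standing one of those up is little more than a small instance that joins the tailnet and offers itself as an exit node. A minimal Terraform sketch, assuming a hypothetical AMI with tailscaled preinstalled and hypothetical variable names:

    resource "aws_instance" "exit_node_eu" {
      ami           = var.tailscale_ami_id # assumption: tailscaled preinstalled
      instance_type = "t3.micro"
      subnet_id     = var.public_subnet_id

      user_data = <<-EOT
        #!/bin/bash
        # forward traffic and offer this node as an exit node
        sysctl -w net.ipv4.ip_forward=1
        tailscale up --auth-key="${var.tailnet_auth_key}" --advertise-exit-node
      EOT
    }

The node still has to be approved as an exit node in the admin console (or via auto-approvers) before anyone can select it.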

When I was tasked with implementing more central security controls, I decided to use the fact that most people already had Tailscale for the VPN service as the basis for connecting and managing the rest of the services. Different teams had put together their own controls and generally followed best practices for managing access to sensitive services, but that patchwork left too much to chance and added too much risk of human error. A lot of the security was maintained by very diligent people, but from experience that is an approach that doesn’t scale well and suffers most from attrition and burnout. I wanted to get to the point where developers could see, from their own laptops, exactly which resources they had access to, and Tailscale’s access features played a crucial part in that process. Tailscale provided the underlying infrastructure, essentially the “company LAN”, that I ran the rest of the access management tooling on top of.

Some patterns I found useful:

  • The Terraform provider lets you break the Tailscale config into separate components, so I used the Google Workspace provider to pull the relevant emails and populate the group definitions. This also let the ACL section live in a different set of files entirely, so nobody has to hand-edit a single HuJSON file to make changes (the Terraform run assembles all the pieces into one artifact). This made commit diffs easier to track in the control repo that handled Tailscale and other services around the company. A sketch of the approach follows this list.

  • Subnet collisions are always a problem. Since everything was developed separately and access to resources was originally granted case by case, it was common for the same IP ranges to be reused. Doing the CIDR math for others and publishing lists of IP ranges folks could use per region helped a lot, since most people hate doing that work themselves (see the allocation sketch after this list). This made the subnet router feature in Tailscale much easier to implement: drop in an EC2 instance per region and expose the services by their IP range.

  • CGNAT is in far more use than anyone thinks. Its range (100.64.0.0/10) is what AWS lets users add as a secondary IP range to every VPC (intentionally, for EKS, allowing AWS security groups to apply to pod networking), so I’m really glad to see you can now define which CGNAT ranges your tailnet uses. Also, if you do have to run Tailscale on a node inside CGNAT, tweak the CLI so it doesn’t blanket-drop all CGNAT traffic by accident (--netfilter-mode=off). Tailscale adds per-device routes anyway, but it can also add a blanket “DENY 100.64.0.0/10” rule that kills the other, non-Tailscale connections.

  • Dig into the autoscaling guide for your cloud of choice. For AWS, I used Auto Scaling groups with maximum instance lifetimes to ensure that exit nodes and subnet routers weren’t running for months without patches - instead they were disposable, killed off after a week. With a cloud-init script handling all the patching and setup at boot, the instances were as container-like as they could be for their purpose, without the overhead of running a container service (and it’s best to keep these services outside your Kubernetes or other container systems anyway). I never got to implement a lifecycle hook that generates a one-time-use auth token on instance launch and stages it in a secret, but I did have a separate process rotating the auth tokens frequently, ensuring there weren’t long-lived credentials sitting around. The pattern is sketched after this list.

  • You can use custom private DNS entries if you want, but you can also get away with publishing public Route53 entries for services that point to private IPs behind a subnet router. Anyone can see the DNS entry (and private DNS leaks anyway thanks to certificate transparency logs), but only users on your Tailscale network can reach those IPs. If you do use private DNS instead, for scenarios like Route53 private zones, note that once a zone is attached to a VPC, the VPC’s resolver sits at the base of the VPC range plus two (10.0.0.2 for a 10.0.0.0/16 VPC); add that address to your Tailscale DNS settings and users can resolve those private entries easily enough. Both options are sketched after this list.

  • Tailscale DNS can cause heartburn if folks don’t know it is applied to their machines - the fact that DNS (and split DNS) can be set by the Tailscale admins can throw end users for a loop. A quick sanity check to let them know it’s in effect: if a dig lookup shows 100.100.100.100 as the resolver, the query is going through MagicDNS. It’s great if your org wants to keep DNS information private, but in some cases just using public DNS records is sufficient, since it removes “did the user disable Tailscale DNS on their machine” from the troubleshooting list (see the toggle sketched below).
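
To make the first bullet concrete, here’s a sketch of the split-file layout, assuming the hashicorp/googleworkspace and tailscale/tailscale providers; the attribute names are from memory, so verify them against the provider docs:

    # groups.tf: pull membership from Google Workspace instead of hardcoding it
    data "googleworkspace_group_members" "engineering" {
      group_id = "engineering@example.com" # hypothetical group
    }

    locals {
      engineering_emails = [
        for m in data.googleworkspace_group_members.engineering.members : m.email
      ]
    }

    # acls.tf: a separate file; terraform renders everything into the one
    # policy document Tailscale expects
    resource "tailscale_acl" "tailnet" {
      acl = jsonencode({
        groups = { "group:engineering" = local.engineering_emails }
        acls = [
          { action = "accept", src = ["group:engineering"], dst = ["tag:dev:*"] },
        ]
      })
    }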
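
For the subnet collision bullet, Terraform can even do the CIDR math for you. A small sketch with an illustrative parent range:

    locals {
      parent_cidr = "10.64.0.0/12"
      regions     = ["us-east-1", "us-west-2", "eu-west-1"]

      # carves out 10.64.0.0/16, 10.65.0.0/16, 10.66.0.0/16, ...
      region_cidrs = {
        for i, region in local.regions : region => cidrsubnet(local.parent_cidr, 4, i)
      }
    }

    output "region_cidrs" {
      value = local.region_cidrs
    }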
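
The disposable router pattern looks roughly like the following; the variable names are hypothetical, the instance profile is assumed to be allowed to read the secret, and the auth key is assumed to be staged in Secrets Manager by the rotation process mentioned above:

    resource "aws_launch_template" "subnet_router" {
      name_prefix   = "ts-subnet-router-"
      image_id      = var.base_ami_id
      instance_type = "t3.micro"

      user_data = base64encode(<<-EOT
        #!/bin/bash
        # patch and configure at boot; the instance is cattle, not a pet
        curl -fsSL https://tailscale.com/install.sh | sh
        sysctl -w net.ipv4.ip_forward=1
        AUTH_KEY=$(aws secretsmanager get-secret-value \
          --secret-id tailscale/router-authkey \
          --query SecretString --output text)
        tailscale up --auth-key="$AUTH_KEY" --advertise-routes=10.64.0.0/16
      EOT
      )
    }

    resource "aws_autoscaling_group" "subnet_router" {
      min_size              = 1
      max_size              = 2
      desired_capacity      = 1
      max_instance_lifetime = 604800 # seven days, then AWS cycles the node
      vpc_zone_identifier   = var.subnet_ids

      launch_template {
        id      = aws_launch_template.subnet_router.id
        version = "$Latest"
      }
    }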
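
And the two DNS options; zone IDs, hostnames, and addresses here are all hypothetical:

    # Option 1: a public record pointing at a private IP that is only
    # reachable over the subnet router
    resource "aws_route53_record" "grafana" {
      zone_id = var.public_zone_id
      name    = "grafana.example.com"
      type    = "A"
      ttl     = 300
      records = ["10.64.12.8"]
    }

    # Option 2: keep it private and point Tailscale split DNS at the
    # VPC resolver (base of the VPC range plus two)
    resource "tailscale_dns_split_nameservers" "internal" {
      domain      = "internal.example.com"
      nameservers = ["10.64.0.2"]
    }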
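
Finally, if the troubleshooting burden outweighs the privacy benefit, MagicDNS can be toggled from the same repo (resource name per the Tailscale Terraform provider; worth double checking its docs):

    resource "tailscale_dns_preferences" "tailnet" {
      magic_dns = false # fall back to public DNS records for services
    }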

Since I finished this project, Tailscale has introduced a ton of new features I wish I’d had at my disposal while there, and they’re definitely maturing into a serious, enterprise-ready tool. These show they realize their buyers now aren’t individuals or teams who use the tools they’re paying for, but a director or CISO trying to mitigate security risks for hundreds or thousands of users. Some of the more exciting ones for me have been:

  • Tailscale client customization: being able to force DNS settings (on or off) and other behavior is exactly the kind of thing that simplifies rolling the tool out to teams. Right up there with a real macOS package for deployment, it means they’re building the levers a larger company needs to adapt to Tailscale, and for Tailscale to adapt to its policies and needs. I wouldn’t have gone wild with it where I was before, but this level of tweaking is essential when you’re deploying a tool to more than a handful of people and want them to have a positive experience with it.

  • SSH session recording and the Kubernetes operator would have allowed me to replace an entire second set of tools I had to deploy to audit the last-mile interactions with services. This would have been great, and it also allows fully internal services to be served up over Tailscale directly (the operator acting as an Ingress to a service is amazing), enforcing access not just at the application level but at the network level with the tailnet ACL. The other “zero trust” access tool I used required a lot of overhead and made users change their habits when connecting to servers - with Tailscale SSH, ssh mrz@server-name works exactly the way access was done before, so there’s much less friction for adoption. The fact that it’s happening over a Tailscale tunnel with a level of logging and auditing is transparent to the end user. (The ACL side is sketched after this list.)

  • App Connectors are a brilliant product that repackages some things Tailscale was already doing. A tailnet could already have a subnet router advertising the handful of /32 “subnets” that are github.com IPs, ensuring that all tailnet traffic headed to GitHub’s servers originates from a specific known IP (sketched below). But they’ve made implementing that trivial, and this alone is a huge improvement for security. Not only can you require that traffic to sensitive services go over the Tailscale network - without users needing an exit node enabled at all times - it also makes IP allowlisting practical for many SaaS tools. It’s one of the best safeguards against leaked credentials, because now an attacker also needs to be on your tailnet, not just have a user’s GitHub account credentials.

  • Regional routing is a great addition to the HA story of subnet routers. I had run into this problem deploying transit backbones across regions to allow VPC connectivity in both the EU and US. Prior to this feature, the options were to either route everyone through one region to access the VPCs, or split the VPC subnets into batches and share only a regional subset via subnet routers in each region. The first option meant half the company had a needless round trip across the globe, and the second doubled the number of subnet routers and required keeping the shared subnet lists up to date. Now one can deploy a subnet router in each region advertising identical routes, and users connect through the node their region prefers. So much easier.

  • Mullvad exit nodes would allow throwing out a ton of redundant infrastructure and the overhead of managing exit nodes. I can see using these as the catch-all exit nodes, with the exception of heavily regulated spaces that want internet-destined traffic to go through specific proxies or firewalls in addition to a VPN. This also frees up admin cycles to focus on providing App Connectors and subnet routers for internal services, instead of worrying about providing VPN access to the internet at large.

  • Device posture lets you use the state and assessment of a node to determine whether it should have access to your Tailscale network and which resources therein. This is amazing and something I would have loved to implement alongside Puppet/Facter’s trusted data. Being able to restrict access based not just on user but on device attributes helps minimize the risk associated with a remote office (or one following a bring-your-own-device model). A basic restriction could be limiting an iPhone to some App Connectors and exit nodes while the same user’s laptop has access to github.com and other sensitive resources (see the sketch below).
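
For the Tailscale SSH point, the enforcement side is just another section of the same policy document managed in Terraform. A sketch with hypothetical group and tag names; check Tailscale’s current policy syntax before borrowing it:

    resource "tailscale_acl" "with_ssh" {
      acl = jsonencode({
        acls = [
          { action = "accept", src = ["group:dev"], dst = ["tag:server:*"] },
        ]
        ssh = [
          {
            action = "accept"
            src    = ["group:dev"]
            dst    = ["tag:server"]
            users  = ["autogroup:nonroot"]
          },
        ]
      })
    }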
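
The pre-App-Connector version of the GitHub trick is plain route advertisement from an egress node with a known public IP. The ranges below are illustrative; pull the real list from api.github.com/meta:

    locals {
      github_cidrs = ["140.82.112.0/20", "192.30.252.0/22"]

      # passed to tailscale up on the egress node
      advertise_flag = "--advertise-routes=${join(",", local.github_cidrs)}"
    }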
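
And a device posture sketch; the condition syntax is from my reading of Tailscale’s posture docs and the names are hypothetical, so verify it before use:

    resource "tailscale_acl" "with_posture" {
      acl = jsonencode({
        postures = {
          "posture:managedMac" = ["node:os == 'macos'", "node:tsVersion >= '1.54'"]
        }
        acls = [
          {
            action     = "accept"
            src        = ["group:engineering"]
            srcPosture = ["posture:managedMac"]
            dst        = ["tag:github-connector:*"]
          },
        ]
      })
    }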

What I find really impressive about all of the above changes is how a tool like Tailscale enables a fully remote company to operate securely. This isn’t just replacing the physical doors of a central data center with a virtual abstraction; it allows a fine-grained flow of traffic between the systems and services across the infrastructure. At Puppet we wouldn’t have even considered putting a VPN client on a salesperson’s laptop, but if something like Tailscale had existed then, it would have been trivial to deploy and support, knowing it could be managed in a way that adds real security to their work without being a huge hindrance to doing their job. There’s an added benefit for development teams: the platform doesn’t just provide the security infrastructure for their job, it makes adding a new internal service or system straightforward. ACLs can be configured so a developer first has access to all the machines and services they’ve deployed with Tailscale themselves, great for testing and demoing, before getting sign-off on sharing the service wider on the tailnet (a sketch follows). Making it painless to develop a secure service by default is a huge cultural shift for an org; it’s an actually compelling “shift security left” story.
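
A minimal version of that developer-first ACL, using Tailscale’s autogroup:self (which matches the devices belonging to the connecting user); anything wider becomes a deliberate, reviewed change to the policy file:

    resource "tailscale_acl" "dev_self_service" {
      acl = jsonencode({
        acls = [
          # every member can reach their own devices and nothing else by default
          { action = "accept", src = ["autogroup:member"], dst = ["autogroup:self:*"] },
        ]
      })
    }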

So yeah, Tailscale is a pretty great set of tools and I’m excited to see what they work on next.