Sunday, October 30, 2016

Strongswan to Amazon Virtual Private Gateway

Recently I needed to connect a Rackspace subnet to an AWS subnet. Our VPN software of choice in Rackspace was Strongswan, and for Amazon we wanted to use Amazon's Virtual Private Gateway (VPG).

I figured this would be easy. Set up the Virtual Private Gateway, Customer Gateway, and VPN Connection in AWS, set up Strongswan, and done. Should really only take a few minutes.

It took many hours, spread over a couple of weeks, before I actually had it working. Google searches yielded many sample configurations for Strongswan against Amazon Virtual Private Gateways, but none of them worked for me. I'm providing the configuration that worked for us in the hopes that others can be spared the same pain of figuring it out. The end configuration is straightforward: the part that was troublesome was determining the correct set of options for /etc/ipsec.conf. The only other Strongswan file I modified was /etc/ipsec.secrets; everything else was left untouched (as provided by the Debian Strongswan installation).

The pieces:

  • Rackspace subnet 192.168.17.0/24.
  • Amazon subnet 172.18.0.0/16.

The steps:

  1. In Rackspace, spin up a Debian 8.6 (Jessie) server for Strongswan with three network interfaces:
    1. PublicNet — this is the public interface over which Strongswan will communicate to the Amazon Virtual Private Gateway. Let's say Rackspace gave us 1.2.3.4 as the IP address of this interface.
    2. ServiceNet — the Rackspace network for communication to Rackspace services, not used by Strongswan.
    3. 192.168.17.0/24 — our private subnet on the Rackspace side. The Strongswan server will forward traffic between this network and the Amazon 172.18.0.0/16 network.
  2. On the Strongswan server, set up iptables appropriately. I'm not going to cover iptables configuration in detail here; there was nothing specific to Strongswan beyond allowing traffic from both subnets and allowing forwarding between them (a minimal sketch of steps 2 and 3 follows the list).
  3. On the Strongswan server, enable forwarding. In /etc/sysctl.conf, set net.ipv4.ip_forward=1 and either reload the settings (sysctl -p) or just reboot.
  4. In AWS, create the Virtual Private Gateway: nothing at all to configure here.
  5. Create the Customer Gateway. Set Routing to Static and provide the external IP of the Strongswan server (1.2.3.4 from the Strongswan box above).
  6. In AWS, create the VPN Connection. Set Routing Options to Static and provide the Rackspace subnet (192.168.17.0/24) in Static IP Prefixes.
  7. In AWS, in the Route Propagation tab of the appropriate Route Table, set Propagate to true for the Virtual Private Gateway. This was the only routing configuration needed; I did not have to add any static routes in Amazon or on the Strongswan server for the Strongswan gateway to route appropriately.
  8. In AWS, on the Tunnel Details tab of the VPN Connection, take note of the IP address of Tunnel 1 (we won't be using the second tunnel). Let's say this value was 5.6.7.8.
  9. In AWS, select Download Configuration on the VPN Connection and choose Vendor: Generic, Platform: Generic, and Software: Vendor Agnostic. From this file, locate the Pre-Shared Key for IPSec Tunnel #1. Let's say this value was htFtWOVqKkss2EamZ36rFoPefECU18XJ.
  10. On the Strongswan server from step 1, install Strongswan (in this case Strongswan 5.2.1):
     apt-get install strongswan
  11. On the Strongswan server, update /etc/ipsec.secrets to contain the following (and only the following) to map the Pre-Shared Key to the Amazon Tunnel 1 endpoint:
     5.6.7.8 : PSK "htFtWOVqKkss2EamZ36rFoPefECU18XJ"
  12. On the Strongswan server, update /etc/ipsec.conf to contain the following (and only the following):
    conn %default
      mobike=no
      compress=no
      authby=psk
      keyexchange=ikev1
      ike=aes128-sha1-modp1024!
      ikelifetime=28800s
      esp=aes128-sha1-modp1024!
      lifetime=3600s
      rekeymargin=3m
      keyingtries=3
      installpolicy=yes
      dpdaction=restart
      type=tunnel
    
    conn dc-aws1
      leftsubnet=192.168.17.0/24
      right=5.6.7.8
      rightsubnet=172.18.0.0/16
      auto=start
    
    Note how the only unique values here are the subnets (leftsubnet=192.168.17.0/24 and rightsubnet=172.18.0.0/16) and the Amazon Tunnel 1 endpoint (right=5.6.7.8). Everything else is boilerplate.
  13. Restart Strongswan (for example, with ipsec restart), and done!
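
For reference, steps 2 and 3 amount to something like the following on the Strongswan server. This is only a sketch: the iptables rules need to fit into whatever policy you already have, and you may prefer to edit /etc/sysctl.conf by hand.

  # Step 2 (sketch): allow forwarding between the two subnets
  iptables -A FORWARD -s 192.168.17.0/24 -d 172.18.0.0/16 -j ACCEPT
  iptables -A FORWARD -s 172.18.0.0/16 -d 192.168.17.0/24 -j ACCEPT

  # Step 3: enable forwarding and reload the sysctl settings
  echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf
  sysctl -p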

After the last step, Tunnel 1 of the VPN connection should have a Status of UP. Running ipsec status on the Strongswan server should show something like

root@theserver:~# ipsec status
Security Associations (1 up, 0 connecting):
     dc-aws1[1]: ESTABLISHED 5 seconds ago, 1.2.3.4[1.2.3.4]...5.6.7.8[5.6.7.8]
     dc-aws1{1}:  INSTALLED, TUNNEL, ESP in UDP SPIs: cf07b51b_i 1965694d_o
     dc-aws1{1}:   192.168.17.0/24 === 172.18.0.0/16 
root@theserver:~# 

Note that while I managed to get it working on a Debian Jessie server, I tried but failed to get the same configuration working on CentOS 7. Different defaults for Strongswan, different default software or settings on CentOS 7, or a combination of both? It remains a mystery.

Thursday, April 19, 2012

Creating key pairs for Amazon EC2

A few days ago I needed to generate key pairs for an Amazon account again. I thought I'd write down the process.

There are two key pairs that you need: one pair for making API calls (and using the command line tools, which make API calls under the covers), and another pair to log into your EC2 machines with SSH. The following works on Linux and Mac clients “out of the box”; Windows users will need to download the appropriate software.

Now, Amazon provides facilities for generating key pairs, so why not use those? The first rule of public-key cryptography is that nobody but you ever sees your private key. In fact, that's not just a rule, that's the whole point: the best way to keep a secret is never to share it. If you use Amazon's facilities to generate your private keys, you're violating this rule. Yes, malicious Amazon employees could force the use of key pairs that they have generated themselves, but that should at least be traceable. In the end, when you use Amazon's infrastructure you are putting a certain level of trust in Amazon, but a basic tenet of security is that having security at multiple levels is A Good Thing.

Ok, with the reasons to do it yourself covered, this is how you do it:

Generating AWS Signing Certificates

 openssl req -x509 -newkey rsa:2048 -passout pass:a -keyout kx -out cert
 openssl rsa -passin pass:a -in kx -out key

The first command produces the key pair and a self-signed certificate; just hit return to accept the defaults at all the certificate request prompts (real information is not required or useful). The second command removes the password “a” from the private key file; an unencrypted key is generally required for automation purposes (make sure that the file and your machines are appropriately secured). The kx file can then be deleted.

To use the key pair, upload the cert file as a signing certificate to Amazon and specify the location of the cert and key files in the appropriate environment variables (EC2_CERT and EC2_PRIVATE_KEY) or directly on the command line.
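
As a concrete (hypothetical) example with the classic EC2 API tools, assuming the tools themselves are already set up and using my own choice of file locations:

  export EC2_CERT=$HOME/.ec2/cert
  export EC2_PRIVATE_KEY=$HOME/.ec2/key
  ec2-describe-instances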

Generating EC2 key pairs

 ssh-keygen -b 2048 -t rsa -f aws-key

This will generate two files, aws-key and aws-key.pub containing the private and public keys respectively. Import aws-key.pub as a “key pair” (it's only the public key, not really a pair) into AWS. When you launch a Linux instance with this key, this public key is made available to the instance, where it will typically appear in an authorized_keys file for remote access via ssh. If you don't set the key as your default ssh key on your client, you can use the -i option of ssh to specify the location.
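
For example (the address is a placeholder, and the login user depends on the AMI: ec2-user, ubuntu, admin, and so on):

  ssh -i aws-key ec2-user@203.0.113.10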

Saturday, March 31, 2012

Raising the bar

A few days ago I arrived home a little late, missing supper. As I approached the door, I could hear Victor and Anna bouncing in the front hall, raising a call of "Daddy! Daddy! Daddy!": what a delightful welcome home. So I came in, dropped my laptop bag (containing laptop and power brick, keyboard, mouse, tablet), and gave first Victor (seniority has its privileges) and then Anna big hugs. Then I took off my boots and turned around to see Victor balancing on my laptop bag.

Needless to say, in the actual event I didn't stop to take this photo but rather cried, “Victor, what are you doing!?”. I had Victor perform this reenactment (bag empty, of course) earlier today, although I now realize that while the positioning is correct, the orientation of the bag is wrong. The bag had fallen on its front, which means that the keyboard, mouse, and brick were on the bottom with the tablet and then the laptop above.

Once I had established with Victor that he was never, ever to stand on bags again, I brought the bag to my office to assess the damage.

There was none.

No damage to the MacBook Pro, not to the Apple Wireless Keyboard, not to the iPad, and not even to the Magic Mouse. Yes, it is actually a laptop bag, so it has some padding, but not a significant amount.

I knew the MacBook Pro was durable. A rather younger Victor had once crawled across it with no ill effects, and I'd taken advantage of that durability when doing things like wedging it into hotel room safes. I might ding the aluminum but I wasn't worried about that translating into any real damage. However, with the accessories potentially forming a fulcrum underneath them, I'd worried that there might be enough flex to at least crack the screen. Apparently not, and apparently that applies to the iPad (first generation) as well.

I'm having a hard time imagining a laptop from any other company surviving that without damage. Certainly none of the non-Apple laptops I've owned in the past would have: Victor's weight would have cracked the screens or backs even without a fulcrum effect.

I see other companies attempting to match the style, but not the substance. Clearly the Dell here is intended to match the MacBook Pro. But the case of the Dell uses sheet metal, to a completely different effect. They don't get it. It's actually worse than a plastic case, because the sheet metal flexes much more than plastic. The owner of the Dell informs me that you can reboot it by applying pressure to the sheet metal on the bottom in a particular place. I'm actually afraid to hold that Dell anywhere but the edges.

Prior to owning unibody MacBook Pros, I didn't see any issue with standard laptop construction. Laptops were of course delicate pieces of machinery, and if they broke after being subjected to such abuse, well, what did you expect?

Now I expect more. Apple has raised the bar.

Saturday, February 04, 2012

Mac OS X Lion: stepping backwards

I finally took the plunge and upgraded my MacBook Pro to Lion. Technically I went to 10.7.2. The immediate motivation was to upgrade to the current version of Xcode. I'd delayed for two reasons: I'd heard about issues with a number of major applications, and I knew there were a couple of major changes that would, at best, take some getting used to.

My assumption is that by this point companies have fixed their Lion-related issues, so I just made sure I had all the relevant updates and so far so good. The known changes were another thing, though.

Yes, scrolling is “backwards” now. I understand (and agree with) the motivation for changing it, but I can see it is going to take me a long time to get used to it, particularly since I regularly use Linux and Windows boxes too.

The other major change is that "full-screen mode" uses separate desktops now, so applications on the second monitor disappear. I hate this, as apparently do many other people. I'd say about 50% of the time I'd have one application in full screen mode on one monitor and one or more activities on the other that I was peripherally monitoring... no more of that now that I'm on Lion. No more making Chrome full-screen for writing a blog entry while watching some long-running process in a Terminal on another monitor... like I'm doing right now, except not full-screen. I think I understand why Apple did this too: the previous ways of doing full-screen didn't work well for a lot of scenarios and each application did it in a different way. But it's rather annoying to me since the pre-Lion methods mostly worked really well for my scenarios.

The separate-desktop-for-full-screen idea works great when you have just a single screen. But since I have multiple screens, I want the main activity full-screen (no distractions!) on the main monitor and all the side activities on the second monitor. I hope Apple comes up with a good paradigm for that, but I suspect they don't see it as an issue that needs to be resolved. There are hacks to work around this problem, but they have significant issues. People say “just don't use full-screen”, but there is no “no-chrome” mode in most applications to achieve the effect that I'd get with most full-screen implementations on Snow Leopard. That's what I really want: a way to get rid of the excess application chrome. Perhaps Apple should provide a standard mechanism for that.

The final Lion issue (for this post anyway) is Preview. In Snow Leopard, small images would scale badly (no anti-aliasing etc) when you enlarged them in Preview. On the other hand, other software like Google Chrome scaled images nicely. But that's ok, I'd go full-screen in Preview and then the images would look good. I was hoping Lion would fix this. Lion did, indeed, make the behaviour consistent... full-screen images now look like garbage too. See the image at the top for the difference between what Chrome does (left) and what Preview does (right).

One more issue: the Finder crashed while I was writing this. I don't remember the last time the Finder crashed on Snow Leopard.

So with all the negativity out of the way, I will say that I do like Launchpad, and Preview does seem a lot faster than it was.

Thursday, February 02, 2012

Amazon S3 reliability

According to the Amazon Web Services Blog, they currently have 762 billion objects in Amazon S3. That's impressive. The popularity of Amazon S3 isn't hard to understand: it's easy to use, only $0.14 per gigabyte-month in most regions, and is “designed to provide 99.999999999% durability and 99.99% availability of objects over a given year” (quoted from Amazon Simple Storage Service (Amazon S3)).

That's an impressive statement: 99.999999999% durability. That means that for my 7,038,080 objects, I could expect to lose one every 14,208 years or so; to put it another way, I have about a 0.007% chance of losing an object in a particular year. That seems like a pretty minuscule risk.

But then you look at the scale of Amazon Web Services. There are 762 billion objects in Amazon S3. That means by their design criteria (ignoring the reduced redundancy storage option) they expect to lose at least seven of those objects this year. Have you checked your objects today?
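
If you want to reproduce the arithmetic, it only takes a couple of one-liners (bc is simply what happened to be handy):

  # Expected losses per year for my 7,038,080 objects at eleven nines of durability
  echo '7038080 * (1 - 0.99999999999)' | bc -l          # ≈ 0.00007 per year
  echo '1 / (7038080 * (1 - 0.99999999999))' | bc -l    # ≈ one loss every 14,208 years
  # Expected losses per year across all 762 billion objects in S3
  echo '762000000000 * (1 - 0.99999999999)' | bc -l     # ≈ 7.62 per year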

Now I doubt that the “99.999999999%” probability is a “normal operating conditions” number. I suspect that's a guess at the probability of three Amazon data centres being taken out at once, or something like that. In normal operations I suspect that you might as well just call it 100% reliable in terms of preserving your object. But I find it amazing that they're at the scale where such minuscule probabilities become certainties (if you naïvely apply them).

Monday, January 30, 2012

RAID redo

I spent some time this weekend rebuilding the family file server, which is a Linux box. The last time I rebuilt it was several years ago, and at that time I figured I should set it up with (software) RAID 5 to avoid the hassle of having to recover from backup if a disk failed. This worked great. A few years ago a disk did fail: I bought a new one, plugged it in, it rebuilt, and everything was good.

Similarly, a few years back at work I configured our new VMware ESXi box with an eight-disk RAID 5 array (hardware RAID). Last year a disk failed, and on that machine I didn't even have to power it down: I yanked out the old disk, hot-plugged the new one, and the machine didn't miss a beat.

So, RAID 5 is wonderful, right? Well, the time between the disk failure and disk replacement was somewhat stressful. In both cases, the disk couldn't be replaced immediately. The disk in my home server failed the night before I left on a trip, so I couldn't replace it for two weeks. And the new disk for the work machine had to be ordered and took some time to arrive. There was this gap where there was no redundancy. In both cases there were backups, but restoring from backup takes a lot more time than just plugging in a disk, and I realized that I really, really didn't want to waste my time setting up machines when simply providing a little more redundancy would have removed the need. “You can ask me for anything you like, except time.”

So, since the home server needed a bit of maintenance anyway (for example, the root volume was low on space), I figured that while I was at it I would reorganize the server and take some extra time to fix the redundancy problem by moving to RAID 6 on the four disks. RAID 6 would allow two disks to fail without loss of data. I'd lose some space, but the extra redundancy would be worth it. Why RAID 6 over RAID 10? Well, RAID 6 provides better error checking at the expense of some speed.

This is what I did to prepare:

  1. Took an LVM snapshot of the root partition and copied that snapshot as an image to an external drive (a sketch of this follows the list). Why an image? Sometimes the permissions and ownership of files are important, and I like to preserve that metadata for the root partition.
  2. Copied the truly critical data on the root partition to another machine for extra redundancy. The existing backup process copies the data offsite, which is good for safety but not so good for quick recovery, so I wanted to make sure I didn't have to use the offsite backup.
  3. Copied the contents of the other partitions to the external drive. The other partitions don't contain anything particularly critical so I didn't feel the need for redundancy there.
  4. Zeroed out all the drives with dd if=/dev/zero of=/dev/sdX. Some sites suggested this was important, that the Linux software RAID drivers expected the disks to be zeroed. It seems unlikely, but it didn't cost me anything to do it. There was an interesting result here, though: the first two drives ran at 9.1MB/s, while the second two ran at 7.7MB/s. If I recall correctly, there are three identical drives plus the one I replaced, which is a different brand, so it isn't a drive issue but rather a controller issue: the secondary controller is slower.
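
The snapshot-to-image part of step 1 looks something like this (the volume group and logical volume names, the snapshot size, and the mount point of the external drive are all just examples, not something to copy verbatim):

  # Snapshot the root LV, copy the snapshot out as an image, then drop the snapshot
  lvcreate --snapshot --size 2G --name root-snap /dev/vg0/root
  dd if=/dev/vg0/root-snap of=/mnt/external/root.img bs=4M
  lvremove /dev/vg0/root-snap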

Now that the machine was a blank slate, I set it up from scratch:

  1. Start the Debian 6.0.3 installer from a USB key.
  2. In the installer, partition each of the four disks with two partitions: one small 500M partition and one big partition with the rest of the space (~500G).
  3. Set up RAID 1 across the small partitions (four-way mirroring).
  4. Set up RAID 6 across the large partitions.
  5. Format the RAID 1 volume as ext3, mounted as /boot.
  6. Create an LVM volume group called “main” and add the RAID 6 volume to it.
  7. Create a 5G logical volume for /.
  8. Create a 5G logical volume for /home.
  9. Create a 10G logical volume for swap.
  10. Create a 20G logical volume for /tmp.
  11. Create a 200G logical volume for /important.
  12. Create a 200G logical volume for /ephemeral.
  13. Tell the installer that this machine should be a DNS, file, and ssh server and let the installer run to completion.
  14. Copy the important files to /important and the ephemeral files to /ephemeral.
  15. Configure Samba and NFS (a minimal sketch follows the list).
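
The Samba and NFS configuration in step 15 depends entirely on your environment, but a minimal sketch for sharing /important (the client subnet and the options here are examples only) would be:

  # /etc/exports
  /important 192.168.1.0/24(rw,sync,no_subtree_check)

  # /etc/samba/smb.conf (share section)
  [important]
      path = /important
      read only = no

followed by exportfs -ra and a restart of the Samba service.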

So why this particular structure? Well, Linux can't boot from a software RAID 6 partition, so I needed to put /boot on something that Linux could boot from, hence the RAID 1 partition. The separate logical volumes are primarily about different backup policies. The 5G size for / and /home is to limit growth (these volumes will be backed up as filesystem images), and 5G fits on a DVD for backup in case I want to do that at some point. Swap of course needs to be inside the RAID array if you don't want the machine to crash when a disk fails: yes, Linux knows how to efficiently stripe swap across multiple disks, but a disk failure will cause corruption or a crash. The 20G volume for /tmp is so that there's lots of temp space, and it's on a separate volume so backup processes can ignore it. The /important volume contains user files that are the important data and can be backed up on a file-by-file basis (as opposed to /, which is backed up as a filesystem image). The /ephemeral volume contains files that don't need to be backed up. All filesystems have the noatime mount flag set, and they're all ext4 except for /boot, which is ext3.
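
The Debian installer did all of this for me, but for reference the layout corresponds roughly to the following commands (a sketch: I'm assuming the small partitions are sdX1 and the large ones sdX2, and skipping the filesystem creation for most of the volumes):

  # RAID 1 across the small partitions (for /boot), RAID 6 across the large ones
  mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
  mdadm --create /dev/md1 --level=6 --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
  mkfs.ext3 /dev/md0                 # /boot

  # LVM volume group "main" on the RAID 6 array, then the logical volumes
  pvcreate /dev/md1
  vgcreate main /dev/md1
  lvcreate -L 5G   -n root      main
  lvcreate -L 5G   -n home      main
  lvcreate -L 10G  -n swap      main
  lvcreate -L 20G  -n tmp       main
  lvcreate -L 200G -n important main
  lvcreate -L 200G -n ephemeral main
  mkswap /dev/main/swap
  mkfs.ext4 /dev/main/root           # and likewise for home, tmp, important, ephemeral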

If you're counting you'll note that there is still a lot of empty space in that LVM volume group. There are several reasons for this:

  • Some empty space is required if I want to make an LVM snapshot, so I never want to use up all the space.
  • I frequently make additional temporary volumes for a variety of purposes.
  • If I need to expand any particular logical volume, there is room to do so.

Monday, January 16, 2012

Safe Facebooking with Chrome

I don't like the idea that every page on the web with a Like button will tell Facebook that I've browsed to that page. But at least that information is anonymous... so long as I'm not logged into Facebook.

I used to just not stay logged in: only logging into Facebook for “Facebook sessions” in incognito mode, using a different browser and clearing the history, or similar such mechanisms. But then I found this post about Chrome's certificate pinning, where Chris describes how to “Twitter Like A Boss”. This inspired me to run a new Chrome process with a separate profile for Facebook, using this command line (via an alias):

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome
  --user-data-dir=$HOME/.mb/chrome-safe-browsing/facebook
  --disable-plugins
  --proxy-server=localhost:1
  --proxy-bypass-list='https://facebook.com,https://*.facebook.com,https://*.fbcdn.net,https://*.akamaihd.net'
  https://facebook.com/

With no line breaks, of course.
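
For what it's worth, the alias just wraps that command; a small shell function does the same job (the function name is arbitrary):

  fbchrome() {
    /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
      --user-data-dir="$HOME/.mb/chrome-safe-browsing/facebook" \
      --disable-plugins \
      --proxy-server=localhost:1 \
      --proxy-bypass-list='https://facebook.com,https://*.facebook.com,https://*.fbcdn.net,https://*.akamaihd.net' \
      https://facebook.com/ &
  }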

This isolates Facebook's cookies into a separate profile, which prevents my general web browsing in another Chrome instance from being tracked under my Facebook login. It also disables browsing any other sites (following a link will fail; you need to copy/paste into another browser), forces all Facebook connections to use SSL, and disables plugins.

Note that this doesn't use incognito mode (I want to stay logged in now that my everyday browsing isn't affected) and it doesn't use certificate pinning. The main point was to stop leaking information to Facebook. I may get around to figuring out the certificates to pin at some point, but really I'm hoping that a better solution will arise to the problem that certificate pinning is addressing (which is not to say that certificate pinning hasn't already proved effective in critical scenarios).

It's harder to do the same thing with Google because I use so many Google services. The core principle is to use a separate browser instance for each login for corporations like Facebook and Google that have code on so many third-party web pages; this is one way of doing that which happens to have additional security benefits (forcing SSL etc).