YouTube Outage, Layman Explanation

YouTube went down on Sunday the 24th of February. A good summary of the events (at least for geeks) can be found at:

http://www.renesys.com/blog/2008/02/pakistan_hijacks_youtube.shtml

There have been LOTS of comments on NANOG all weekend about it. NANOG is the North American Network Operators' Group, generally a bunch of folks in the Americas who participate in some way in the operation of networks and the Internet. You can see the archives at: http://www.merit.edu/mail.archives/nanog/ and read some of the mails that flew back and forth regarding the outage.

I thought I'd provide a summary for the one or two folks who read my blog but aren't geeks (or at least aren't network geeks), and maybe teach a little about networking in the process.

Basically, on Sunday the Pakistani government told Pakistan Telecom (along with other ISPs in Pakistan) to block YouTube. Pakistan Telecom decided the best way to do this was to "black hole" some YouTube routes. Black holing traffic on the Internet means forcing traffic to a different location and then throwing that traffic away. One of the most drastic ways to accomplish this is to exploit the very first rule a router applies when deciding where to send a packet next. That rule can be described as "Longest Match Wins".

Think about Longest Match this way. Say you have an address of 221 Main Street, Fairfax, Virginia. Now say you had four paths in front of you: the first path said "Virginia", the second path said "Fairfax, Virginia", the third path said "Main Street, Fairfax, Virginia" and the fourth path said "221 Main Street, Fairfax, Virginia". You would choose the fourth path because it takes you directly to where you need to go.

So, Pakistan Telecom decided to cheat a bit and say: instead of just going to "YouTube", follow these paths to "West Coast YouTube" and "East Coast YouTube". I've greatly simplified how YouTube breaks up their IP addresses, but the concept holds for this example.

Now what SHOULD have happened is that Pakistan Telecom (PT) SHOULD NOT have advertised those more specific directions (address prefixes) to their upstream transit provider. Those more specific address prefixes should have only been used inside the PT network. However, those prefixes got "leaked". Basically someone put the road-sign up for the public telling everyone on the Internet that PT had the most specific path to get to YouTube.

YouTube responded amazingly quickly (30 minutes) and started advertising the more specific blocks themselves. Longest Match could no longer settle things, because now there were two "221 Main Street, Fairfax, Virginia" road-signs posted; one said 10 miles and the other said 1000 miles, and people are going to take the shortest path. Determining the shortest path is another part of routing. Perhaps another day I'll take some time to explain that one.
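For the curious, here is a rough Python sketch of that tie-break. It assumes a router that only looks at two things, how specific the road-sign is and how many networks the path crosses; real routers weigh several other attributes too. The addresses and network names are made up for illustration.

```python
import ipaddress

def pick_route(destination, routes):
    """Prefer the most specific matching prefix; break ties on the shorter path."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    if not candidates:
        return None
    # Longer prefix wins first; among equal prefixes, fewer hops wins.
    return max(
        candidates,
        key=lambda r: (ipaddress.ip_network(r["prefix"]).prefixlen, -len(r["path"])),
    )

# Hypothetical road-signs: two equally specific signs for the same block.
routes = [
    {"prefix": "10.20.1.0/24", "path": ["transit-a", "owner"], "label": "owner"},
    {"prefix": "10.20.1.0/24", "path": ["transit-a", "transit-b", "transit-c", "leaker"], "label": "leaker"},
]

print(pick_route("10.20.1.238", routes)["label"])
```

From where this imaginary router sits, the owner's sign wins because it is fewer hops away. A router sitting closer to the leaker would still pick the leaker's sign.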

Longest Match specifically refers to taking your address and comparing it to a routing advertisement (the prefixes) and looking to see how many bits are identical in the two. If you've worked with computers you probably know about the Subnet Mask that you have to assign along with your IP address. The Subnet Mask lets a machine decide whether it needs to go through a router or can talk to another machine directly. For example: 10.1.1.1 with a subnet mask of 255.255.255.0 (aka 10.1.1.1/24) can talk to any other machine whose IP address begins with 10.1.1 ... WITHOUT going through a router.
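If you like to poke at things, the check a machine makes can be sketched in a few lines of Python using the standard ipaddress module. The function name same_subnet is just something I made up for this example.

```python
import ipaddress

def same_subnet(my_interface, other_ip):
    """Return True if other_ip can be reached directly, without a router."""
    iface = ipaddress.ip_interface(my_interface)  # e.g. "10.1.1.1/24"
    return ipaddress.ip_address(other_ip) in iface.network

print(same_subnet("10.1.1.1/24", "10.1.1.77"))  # True: begins with 10.1.1
print(same_subnet("10.1.1.1/24", "10.1.2.5"))   # False: needs a router
```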

In a prefix advertisement the prefix includes something very similar to a Subnet Mask. In this case the mask tells other routers in the network how specific a route a given prefix represents.

For example: I could say that my home address is "Virginia, Fairfax, Main Street, 221" and I could say that when I advertise my address I'll advertise down to the street name. In networking there is the concept of CIDR notation to describe blocks and sizes of IP addresses. For our teaching example, we'll pretend that to advertise a direction just to street level we would add /streetname to the address. My routing advertisement for path #3 from the above example would look like "Virginia, Fairfax, Main Street/streetname". Then if, like #4, I advertised "Virginia, Fairfax, Main Street, 221/streetnumber", you'd realize that THAT would be the longest match if you were looking for my specific address.

What if you were looking for 223 Main Street though? In that case, the longest possible match would be path #3 for you, "Virginia, Fairfax, Main Street/streetname" and you'd take that path which would get you to my street, but not directly into my driveway. Once you get to the street you'll get further directions on how to get to #223.
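Here is the same idea with real-looking IP prefixes instead of street addresses, as a minimal Python sketch. The prefixes and labels are invented for illustration (they are private addresses, not YouTube's), and an actual router uses much faster lookup structures than a simple loop.

```python
import ipaddress

def longest_match(destination, routes):
    """Return the (network, label) pair with the longest prefix containing destination."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, label in routes:
        net = ipaddress.ip_network(prefix)
        # Keep this route only if it matches and is more specific than the best so far.
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, label)
    return best

# Hypothetical road-signs: a broad block and a more specific slice of it.
routes = [
    ("10.20.0.0/22", "broad sign"),
    ("10.20.1.0/24", "more specific sign"),
]

print(longest_match("10.20.1.238", routes)[1])  # more specific sign
print(longest_match("10.20.2.5", routes)[1])    # broad sign
```

The second lookup is the "223 Main Street" case: the more specific sign doesn't cover that address, so the broader one is the longest match available.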

So, now that you hopefully have an idea of how longest match works, what could have been done to prevent this? The simple solution, and the one that NORMALLY keeps stuff like this from happening, is Route Filters. In this case, PT's transit provider should NOT have accepted any route advertisements from PT for address space that PT doesn't own. Currently the best way to build such a filter is to ask the people you are providing transit for what their addresses are, then look at the various assigned numbers authorities and/or routing registries to verify the blocks of addresses really belong to them, and then create a filter that only allows those addresses to be sent. It is a pretty manual process though, and of course mistakes (or mischief) can happen.
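To make the filter idea concrete, here is a small Python sketch, not how any particular router vendor does it. It assumes the transit provider has already verified a list of blocks for a customer; the function name build_filter and all the address blocks are hypothetical.

```python
import ipaddress

def build_filter(registered_blocks):
    """Build a check that accepts only prefixes inside the customer's verified blocks."""
    allowed = [ipaddress.ip_network(b) for b in registered_blocks]

    def accept(advertised_prefix):
        adv = ipaddress.ip_network(advertised_prefix)
        # Accept only if the advertisement sits entirely within a verified block.
        return any(
            adv.version == block.version and adv.subnet_of(block)
            for block in allowed
        )

    return accept

# Hypothetical customer whose verified address space is 172.16.0.0/16.
accept = build_filter(["172.16.0.0/16"])

print(accept("172.16.4.0/24"))  # True: part of the customer's own space
print(accept("10.20.1.0/24"))   # False: someone else's space, drop it
```

With a filter like this at the transit provider, the leaked road-sign would have been thrown away before the rest of the Internet ever saw it.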

There are discussions ongoing about other ways this could be done. Routing registries could provide certificates, or you could publicly sign the routes you have in the registries. Then when one router talks to another router, it could verify through the signed messages that the number/routing authority has identified you as the proper owner (by your possession of the private key/cert) and accept any of those routes. Whew! That is a pretty straightforward way to accomplish this, and hopefully this incident will remind folks that it is important to move forward with it.

Though straightforward, it isn't easy. Lots of folks have to all agree to do it the same way. Other folks have to build infrastructure to support it. Vendors have to update their routers with software that understands how to process it. And of course, then the operators of the networks have to actually understand and use it. We can dream though. :)

If nothing else though, hopefully transit providers (like UU.NET/Verizon, AT&T, Level 3, PCCW, ATDN, etc.) will pay more attention and filter any routes that don't belong to their customers and prevent this from happening at the edge. Some already do, good for them! Some don't. Bad for them!

Oh well, hopefully you found all this interesting. Didn't mean to be so wordy, just mostly wanted to pass along what happened in non-network geek terms.