Transcript
APNIC 23
Routing SIG
Wednesday 28 February 2007
16:00 - 17:30
Alright. Good afternoon, everybody. I'll stand up. Good afternoon, everybody. Welcome to the Routing Special Interest Group at APNIC 23, APRICOT2007.
Just going through some administrative things before we actually make a start. Routing SIG is chaired by myself, Philip Smith, and my co-chair, Randy Bush. You can reach us at those addresses, sig-routing-chair@apnic.net. If you're not on the mailing list, you can join the mailing list. There's not much traffic on the mailing list, so don't worry about a deluge. If you'd like to generate traffic on the mailing list, you're welcome to do so. To subscribe, it's mailman.apnic.net/mailman/listinfo/sig-routing
So, general info before the main agenda - there are going to be three presentations. If you'd like to ask questions at any time, please use the microphones. There are roving microphones somewhere, George at the back and another even further at the back. You've seen where the teas and coffees have been, so I don't need to tell you where that is.
I should remind you of the APNIC social event this evening, at the Jungle Pub at the Melia Hotel and when I was here with Miwa to do the site survey six months ago, it was a very nice pub. I recommend you come along. Remember you need a ticket and you can get your ticket at the APNIC desk outside - I believe so, anyway.
The reception starts at 7 pm and it's sponsored by Afilias.
MyAPNIC and the Policy flash demo is running all day, as you've probably seen outside. Visit the lounge and you can find out how to win an iPod Shuffle, which seems to be the giveaway of the moment. Complete the meeting survey to win a big prize, another iPod Shuffle, maybe the same one.
NURANI NIMPUNO: A different one!
PHILIP SMITH: Two iPod Shuffles! Great. Also if you go to the APNIC meeting website, you'll find the onsite notice board there.
Before I carry on, again, a friendly reminder to speakers and anybody asking questions - speak slowly and clearly because we have stenographers here who are basically typing up on the screen everything that you say and I do mean everything, as you've noticed.
SPEAKER FROM THE FLOOR: How do you get to the thing tonight? Transport?
PHILIP SMITH: Transport to the Melia tonight - feet. Melia is basically two hotels along from here. I don't know if there is transport but you can walk it in about three minutes. Walk down to the beach to the paved footpath. Turn right. Walk past the Laguna and the hotel after that is the Melia. You should find it easily enough.
As I was saying, please speak slowly and clearly to help the stenographers. It also helps those whose first language is not English, because then they have a better chance of understanding you. Also, please try to avoid acronyms, because an acronym in one part of the world may be meaningless in another part of the world. Watch out for that please.
So, the agenda today, the introduction I am just doing. The review of action items - well, we had no action items from the last SIG meeting, so there's nothing to review. So we'll skip on really. We're going to have a look at the presentations first because, as you can see, I'm standing here by myself. Randy has a conflict so he's currently presenting at another session and he'll be joining us later on. After the three presentations, we're going to have a look at the special interest group guidelines draft and then we're going to hold the special interest group co-chair election. But we'll come to that later on.
So really what I want to do now is the three presentations. The first one is from Martin Winter. The second one will be from Andrei Robachevsky; Andrei also is not here and has another conflict, so hopefully he can make it by the time Martin has finished speaking. And finally I'll start talking again.
Anyway, let's start with Martin.
MARTIN WINTER: Good afternoon, everyone. My name is Martin Winter. I'm jumping in for Clarence who, unfortunately, could not make it to the meeting, but the credit to this presentation goes all to him.
I am talking here about a new technology which allows you to do BGP convergence in much less than a second, so different to today when you see multiple seconds to multiple minutes convergence time.
So let's start with a few definitions first. When I talk about 'down convergence', I'm talking about this scenario. We have an existing traffic flow, which passes along the green arrow and, at some time, we have a link failure. It's inside the IGP and the BGP does not change. The next hop destination is the same. We only have a change inside our local network on the IGP routing. So at this time when we have a link failure, we have an outage of the traffic flow.
The outage goes on while the next hop calculates a new IGP path destination and at some time we have a new path selected, the orange one, and the traffic flows again. So the outage and loss of connectivity is T2 minus T1 and we have a loss of connectivity at that time.
Then, as a second thing, we have 'up convergence'. So we have now the traffic flowing on the orange path, not the optimal path. At time T1, the link gets repaired and comes back up. The IGP does a re-convergence and, at time T2, the best path is selected back to the green arrow. So the up convergence is T2 minus T1. It's important to know that, for all the common implementations today, this does not cause any traffic interruptions. We are changing over from a valid path to another valid path so with up convergence, we don't have any traffic outage in general.
So looking at the convergence, we have the down convergence, which causes an impact, a loss between T1 and T2, and we have the up convergence, which does not cause an impact, as all the modern routers can handle the change from one valid path to another valid path.
This paper will focus on BGP convergence on a down transition. We are not focusing on the issue of boot-up convergence. We're not talking about BGP convergence when a new peer comes up.
So I'm going to introduce a new technology, which allows us to reduce the down time in the down convergence case.
So, for looking at this, we're looking at two different scenarios. The first one is the Core scenario and the second one will be the Edge scenario. We're calling this BGP PIC - which stands for Prefix-Independent Convergence. So it basically means we have a convergence that is independent of the number of prefixes.
So, let's look at the core scenario.
So, the core scenario is defined like this - we are an ISP on the right side. We have a path in our network from PE1, which is a router, to PE2 as the BGP next hop. The router PE1 has installed all the BGP routes, which all point to PE2 as the next hop. In the routing table, we have something that says PE2 is reachable over the link PoS1.
So at some point in time we now have an outage in the middle of our network. That could be a link failure or a router failure. So the traffic stops flowing.
In our case, for our tests, we assume PE2 announced 350,000 BGP routes. So, after some time, we have an IGP convergence and we're learning a new path to PE2 going out a different interface, going out interface PoS2 on the router PE1. We see it on the left side, the IGP is updated there in the yellow part. PE2 is now reachable over PoS2. And now the big question is how long does it take to update all the 350,000 BGP prefixes? And that's basically where the new technology comes in.
So the key requirement we have, we want to leverage this new path as soon as the IGP has converged.
So looking at the real-life test of this new technology. Here we have a testbed with a simulated tier-1 ISP topology. In this ISP topology, we used a CRS-1 running IOX 3.5. We have 5,000 ISIS routes and we have 350,000 BGP routes and these 350,000, as on the previous page, all have the same next hop. So we have 350,000 BGP routes that we have to converge in the case of the down scenario.
So looking at this graph, on the Y-axis, we have the time in milliseconds. On the X-axis, we have the prefixes. We can see the IGP convergence of the 5,000 ISIS routes. At the start, we have the first ISIS routes converged at around 0 milliseconds and the last one at about 350 milliseconds. And for the BGP now we have that on the right side where these four dots are. The first dot is basically the BGP next hop for these 350,000. What we used here, we used the feature to mark certain prefixes as high priority so they converge first so we can make sure that specific IGP next hop is converged at the very beginning, at the 80 milliseconds together with all the other IGP routes.
The second dot is now the first BGP routes, which is exactly the same time, at the 80 milliseconds. The third dot is at the half point, the 175,000 BGP routes and the last point is 350,000, like all. So you see that all the BGP routes converge and use the new path immediately when the new IGP path becomes available and they're all at the same time.
So how is this done? The key technology here is we're using a hierarchical FIB. So in our FIB, we have all the 350,000 networks on the white side. They all have the same next hop so we point them to a BGP LDI, which stands for load info, a structure which describes all the next hops which are common to these BGP routes. So all the networks point to one common structure which describes all the next hops. On that BGP next hop we have another pointer to an IGP LDI structure which describes all the IGP next hops and from there we have another pointer going to the output interface with the adjacency. So now, in the case we had before where the IGP converged, what we actually did was just update one data structure, the IGP LDI. As soon as we updated that, all the BGP routes were immediately using the new path. That's how we achieved the 80 milliseconds for all the 350,000 routes.
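The pointer chain described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the vendor's implementation: the class names, field names and example prefixes are hypothetical, and the lookup is an exact match rather than the longest-prefix match a real FIB performs.

```python
from dataclasses import dataclass, field


@dataclass
class IgpLoadInfo:
    """Shared IGP-level entry: how to reach one BGP next hop (hypothetical name)."""
    out_interface: str


@dataclass
class BgpLoadInfo:
    """Shared BGP-level entry: the BGP next hop plus a pointer to its IGP entry."""
    bgp_next_hop: str
    igp: IgpLoadInfo


@dataclass
class HierarchicalFib:
    prefixes: dict = field(default_factory=dict)  # prefix -> shared BgpLoadInfo

    def install(self, prefix, bgp_ldi):
        self.prefixes[prefix] = bgp_ldi

    def lookup(self, prefix):
        # Exact-match lookup for brevity; follows prefix -> BGP LDI -> IGP LDI.
        return self.prefixes[prefix].igp.out_interface


# One IGP entry and one BGP entry are shared by every prefix behind PE2.
igp_ldi = IgpLoadInfo(out_interface="PoS1")
bgp_ldi = BgpLoadInfo(bgp_next_hop="PE2", igp=igp_ldi)

fib = HierarchicalFib()
prefix_list = [f"10.{i // 256}.{i % 256}.0/24" for i in range(1000)]
for p in prefix_list:
    fib.install(p, bgp_ldi)

# The IGP reconverges: a single write to the shared structure...
igp_ldi.out_interface = "PoS2"
# ...and every dependent prefix resolves to the new interface on its next lookup.
```

The cost of the update is one write regardless of how many prefixes share the structure, which is the property the talk calls prefix independence.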
So looking at the classical way and the difference, this is how it's done traditionally. All this information in there is flattened out. So we have the BGP network entry and it has the output interface directly in the structure. That's how a traditional FIB is organised. So all the information, the BGP next hop to the IGP next hop, is flattened out and compressed. Now look at the case where you have to go and update the output interface, taking a realistic figure of 10 microseconds to update one output interface. If you have 350,000 routes, that means you easily spend 3.5 seconds just on updating in this case. And you see on a graph that the convergence will most likely run from the first to the last, not all at the same time.
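The arithmetic behind the flat-FIB figure can be checked with a short calculation. The 10 microsecond per-entry cost and the 350,000 routes are the numbers quoted in the talk; the function names are hypothetical, and treating a shared-structure write as costing the same as one flat entry update is an assumption made purely for comparison.

```python
PER_ENTRY_UPDATE_SECONDS = 10e-6  # 10 microseconds per entry, as quoted in the talk


def flat_fib_update_seconds(num_prefixes, per_entry=PER_ENTRY_UPDATE_SECONDS):
    """Flat FIB: every prefix entry carries its own output interface,
    so a path change rewrites all of them one by one."""
    return num_prefixes * per_entry


def hierarchical_fib_update_seconds(num_shared_structures=1,
                                    per_entry=PER_ENTRY_UPDATE_SECONDS):
    """Hierarchical FIB: only the shared structures are rewritten
    (assumed here to cost the same as one flat entry each)."""
    return num_shared_structures * per_entry


flat = flat_fib_update_seconds(350_000)        # roughly 3.5 seconds
shared = hierarchical_fib_update_seconds(1)    # roughly 10 microseconds
print(f"flat: {flat:.2f} s, hierarchical: {shared * 1e6:.0f} us")
```

The flat cost grows linearly with the table size, while the hierarchical cost depends only on how many shared structures change.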
Also, the other thing to note is that in the past it was done this way for performance reasons. On classic old routers, the main limitation was the packet look-up for forwarding, and with the flattened structure it is much easier to do the packet look-up and process packets faster, while the new way, with the multiple structures, takes more performance. But today's new routers with dedicated network processors can easily handle this.
So looking at advantages, we have, like, all the BGP dependence, they converge at the IGP convergence there at the same time. So the whole network, it's like we actually use less memory for the FIB space too so we have a much better scaling there and that's thanks to the sharing of some data structures. We also have, like, a much smaller CPU requirement for updating the FIB table so instead of, as I mentioned in the previous example, 3.5 seconds to update it, we may need a few microseconds or a few milliseconds. So it depends on how fast your IGP converges. And after all that, it's commercially available today so you can verify it yourself and it's proven.
So now let's look at the Edge case. So before we looked at the Core case, which basically the next hop stayed the same, the BGP next hop stayed the same and we only had a change of path inside the network.
So with the Edge case, we assume we are like provider A here on the left side in the grey box. We are peering with another provider, the yellow box here, and we assume we have at least two peering links. And one of the key limitations at this time if you want to try it out today in your lab, is we actually require that you've got a multipath scenario. So you have, like, both routers need to be load-shared. So the router, which does the fast convergence, initially you will see two destinations, two BGP next hops, router R1 and router R2. So when the router R1 fails, in our demonstration here, we basically have to change it from a load share between R1 and R2 to only use R2.
Okay, that's the result of our test. Again, this is using a GSR running IOX 3.3 for this test. And this is what the result looks like. We actually used 17 test points so we have 17 flows; every 25,000 network entries we defined a flow and we measured it. Don't be surprised if some are zero for the loss of connectivity because we had a load share scenario so some of them didn't have the failure so we didn't see any loss there and the worst case is up to here, at 180 milliseconds where the convergence happened. These were the flows which were using the router R1. The multiple colours are basically multiple test runs which we used.
So what does it look like if you don't have the new technology? In the classic case, we have again router R1 up there, which is failing. So now we have a linear dependency on the number of prefixes. We did this test with exactly the same set-up but in this case we used the GSR running IOS, which doesn't support the BGP PIC technology. And we did the measurements. We found that for 100,000 routes, with the 12000 running 32S, it took about 30 seconds for the convergence. We did the same test with other vendors. It's around 30 seconds too. They're all in the same ballpark.
The main limitation is here that we have to run through the whole test so when R1 fails we get an IGP, like, next hop down notification on router R3.
At that time, router R3 has to do a new best-path calculation for all the impacted 350,000 routes. After the best-path calculation is done, it has to update the table with the new entry saying, "Now only use R2 instead of using R1 and R2." After we have recalculated, we have to forward that into the FIB table before we can use it. And at this time, that's the common case, R3 will still be doing the BGP updates: notifying all the neighbours will still go on and take, in any case, multiple tens of seconds, but the traffic flow is basically restored after the full FIB convergence.
So, again, how is it done in this case? Same data structures. We have our 350,000 BGP network entries. They all point to the same BGP LDI structure, so the load info structure. The BGP LDI structure in this case initially has two next hops in it because we are using two load-shared next hops. We have one pointing to the router R1 on the top and we have one pointing to the router R2 on the bottom and then the IGP next hop, the IGP LDI, the top one, points to the correct interface, and the bottom one to its interface. Now in the case where we had the outage of R1 we will notice in the IGP protocol that this IGP entry at the top is no longer valid. So we will delete that, and at that time we have a back pointer to know which BGP LDI is impacted. We can access the LDI and remove that load-shared next hop. So after that, basically, everything goes out on router R2.
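The Edge behaviour just described, a shared load-info structure with two load-shared next hops and a back pointer used to prune the failed one, can be sketched as follows. This is a minimal illustration under assumptions: the class names, interface names and placeholder prefix keys are hypothetical and do not reflect the actual router data structures.

```python
class IgpLoadInfo:
    """IGP-level entry for one BGP next hop (hypothetical structure)."""

    def __init__(self, next_hop, out_interface):
        self.next_hop = next_hop
        self.out_interface = out_interface
        self.dependents = []  # back pointers to BGP load-info structures using this entry


class BgpLoadInfo:
    """BGP-level entry shared by all prefixes with the same set of next hops."""

    def __init__(self, igp_entries):
        self.paths = list(igp_entries)
        for igp in self.paths:
            igp.dependents.append(self)  # register the back pointer

    def forwarding_choices(self):
        return [igp.next_hop for igp in self.paths]


def igp_next_hop_down(igp):
    """The IGP reports the next hop unreachable: follow the back pointers and
    remove the dead path from each shared BGP structure. No per-prefix work."""
    for bgp_ldi in igp.dependents:
        bgp_ldi.paths.remove(igp)
    igp.dependents.clear()


via_r1 = IgpLoadInfo("R1", "PoS1")
via_r2 = IgpLoadInfo("R2", "PoS2")
shared = BgpLoadInfo([via_r1, via_r2])

# Every prefix points at the same shared structure (placeholder prefix keys).
fib = {f"prefix-{i}": shared for i in range(1000)}

before = shared.forwarding_choices()         # load-shared over R1 and R2
igp_next_hop_down(via_r1)                    # R1 fails
after = fib["prefix-0"].forwarding_choices() # only R2 remains
```

Because the second path is already installed, the repair is a deletion in one structure, which is why the multipath requirement mentioned in the talk matters for this version of the feature.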
So a few assumptions in here. There are basically two main assumptions. The first thing is that the failure must result in an IGP transition. For this technology to work, we basically need to have a change in the IGP routing protocol. That could be two cases. If we have the case of a PE node failure, it would be the case which we simulated here. The IGP neighbours will detect that there is a loss of that link and will immediately start the recalculation. So we have at that time the IGP trigger. Or the second case will be, if we want to protect against the link failure, that will only work if we use the link as a passive interface, so don't use the next hop in that case; we have to carry this peering link in our IGP routing table. So if you want to protect the peering link, we have to carry that one as a passive interface or as a redistributed connected route in our routing table.
And the second assumption which is there today is that there have to be multipaths, so today it only works if R3 already has both next hops in the load-shared environment and one goes away. That will probably be quite a severe limitation for you today but, if you want to test it, you can try it out. We are working on it and we think we will soon have that limitation removed so it will work in other cases without the multipath but if you want to go and try it out today, you basically have to follow that limitation.
So looking at the conclusion, we have a better FIB memory scaling with this technology. We have a much lower CPU consumption for the FIB updates. So imagine that for an IGP reconvergence during a backbone link or router failure, we don't have to reflatten all the output interfaces in the FIB table. So we don't have to go and update all the router interfaces for each of the BGP networks individually. We only have to go and update one data structure in it. So that's where it's nice and because of that, because of the lower CPU utilisation, we also end up with a much higher robustness. It may also be interesting for your network that now all the BGP networks converge at the same time, so you don't have the inconsistency of not knowing which networks converge in which order.
And then the key thing, the PIC technology. We have a much faster convergence. So we have sub-second convergence for the Core case and sub-second convergence for the Edge case. A key thing there is in the name PIC: as it says, prefix independent convergence. It's independent of how many prefixes you have to converge. It can be 10 or 1 million. They will converge at the same time and at the speed of your IGP convergence. So it's quite a new technology and it should actually have quite an impact on the design of your backbone and also your router design.
So, for the ones of you who want to try it out, availability today: it's available on the 12000 with IOX 3.3 and on the CRS-1 with IOX 3.5. That's for the Core and then we also have this for the Edge. IOX 3.5.
OK. Just about in time. I'm not sure if we have time for a few questions.
JASON BAILEY: It's only a quick one. You mention how you've got that on the CRS series of routers. Are there any plans to move that back to the 7500 and 6500 series?
MARTIN WINTER: There are no plans at this time because it depends a bit on the hardware capabilities. So the CRS basically was designed with this in mind from the beginning so it has the capacity to do the look-up. On the GSR, actually initially we were not sure if we could do it. We finally found a way to do the look-ups there. I'm not familiar enough with the 7600 hardware topology - I'm sure they will look at it - but it's more the question that IOS doesn't really support it. So it's only on IOX.
NURANI NIMPUNO: Can I remind people to state their name before they ask a question for the stenographers and for the webcast. Thank you.
PHILIP SMITH: OK. Alright. Thank you very much, Martin.
Now, Andrei hasn't made it yet.
GEORGE MICHAELSON: Do you want me to go and look?
PHILIP SMITH: It's okay. I can talk but if you can remind him he's got 15 minutes to get here, that would be really useful, George. Thank you.
MARTIN WINTER: Oh, by the way, the presentation on the web here at APRICOT, at least this early afternoon, was still an old version. I think Phil has just uploaded a new one but, if you grabbed an old copy before, you may want to grab the new copy, which looks like the one I showed.
PHILIP SMITH: Yeah. We'll get all the presentations - they certainly will be on the APNIC website and I've forwarded Martin's updated presentation there so I hope that the APNIC hostmaster team can get it on tonight, if not tomorrow morning.
So Andrei is not here so what I will do is I will do my presentation first and then hopefully Andrei can be here in time to do his one on the 32-bit AS number.
So, I don't know how popular I'm going to be with this but anyway, let's see what happens.
I changed this a little bit from what I proposed initially for the APNIC meeting. Really because I was at the SANOG conference last month in Colombo and I tried - I was advised to do a slightly tougher presentation than I was originally proposing to do. So the slightly tougher presentation is here.
The history of this is really some work that we did in the RIPE Routing Working Group to try and put together, I suppose, recommendations for aggregation for ISPs. And it all grew out of the LINX's efforts, LINX, the London Internet Exchange, their efforts to put together an aggregation policy for their membership. It was actually quite interesting because the members of the LINX wanted to do this and so they came up with the policy that I think it was as far as I remember ISPs participating in LINX should announce no more than 50% more prefixes than there would be if they were aggregated. So if they announced ten prefixes, then the maximum they could announce was 15. It was something like that.
So they passed all this but the whole thing failed because it then came around to members who didn't particularly want to deal with all this. It was at a previous - I'm trying to remember which now, whether it was the IX SIG or Routing SIG - when Nigel Titley actually presented on that failed policy. Out of that it was picked up as a RIPE Routing Working Group thing. Myself, Mike Hughes from LINX and Rob Evans from UKERNA, the UK Academic Network, wrote this, using Mike's initial draft as a guideline.
It's now been published. It was published early last month as RIPE-399. The URL is up there on the screen. And it discusses the history of aggregation, the causes of de-aggregation, the impacts on the global routing system, some of the available solutions and recommendations for ISPs.
Going through the document quickly, it goes through the history, so the classful to classless migration, the clean-up efforts that were made in the 193/8 address space. It covers a little bit of what the CIDR Report used to do when it was started by Tony Bates way back when, to encourage the adoption of CIDR and aggregation. It was basically a top-30 list of providers who still had not aggregated their prefixes post the migration to classless routing. It was mostly ignored through the late '90s. In fact, at some stage, ISPs were using it as a positive marketing tool, which I found quite amusing. I took it over from the work that Tony was doing as Tony had less and less time to do this kind of work. And then Geoff Huston started doing a lot of work on generic BGP table analysis and agreed to take this on as part of the job and Geoff has actually done an amazing amount of work putting a web interface in front of it and integrating it with all of his tools. The document then covers the introduction of the Regional Internet Registry system and provider aggregate address space, the good things that that all did.
Then it covers some of the claimed causes of de-aggregation and most of these are feedback from the community because around, what, '98, '99, 2000, there were a group of folks who would send e-mails to ISPs and say, "Hey, you know you're announcing 500 prefixes when you could be announcing 3? Do you need help configuring your routers?" and so forth, and we got various interesting answers back like, "Announcing /24s means that no-one else can DoS the network," "Announcing only address space in use as rest attracts noise". I got several "mind your own business" including, "If you pester me again, I'll tell the procurement people not to buy Ciscos." I don't see how that was relevant but I got a few of those, as did various colleagues from other organisations.
There's a lot of leakage of iBGP outside the local Autonomous System. External BGP is not the same as internal BGP. Now those of us who have been in the industry far too long know this but how many of the newer ISPs actually know this? A common reason was, "Yeah, well we're doing traffic engineering." OK, so spraying out /24s hoping something will work is the way to do traffic engineering? Well, I'm not sure that it is but clearly many people do. Also, people tried to blame the legacy assignments, "It's all these people who got assignments before the registry system was introduced." If you look at the numbers, that's not borne out at all.
The impacts - these are some of them. Router memory - it shortens router lifetime whilst vendors generally underestimate growth requirements. Some people can argue either way about that. Speaking from a vendor point of view, I actually don't mind. I'm happy to sell you more router memory. I'm happy to build your routers with huge amounts of memory. But the thing is people tend to underspecify what they purchase. Depreciation lifetime gets shortened and there's a general increased cost for ISPs and their customers. The router processing power, well, processors could be underpowered as well and if you look at the CPU power growth, Geoff was talking in the plenary this morning about his laptop taking 15 minutes to handle some particular thing for 32-bit AS numbers.
But I'm pretty sure that Geoff's laptop was a lot more powerful than many of the routers out there today.
GEOFF HUSTON: Want to bet? It's an old laptop.
PHILIP SMITH: It's not running Motorola processors I'm sure.
RANDY BUSH: We have routers running -
PHILIP SMITH: Is that a Commodore you've got?
RANDY BUSH: It's faster than just about any RP in the industry.
PHILIP SMITH: An off-microphone discussion about laptop power. I should carry on.
The life cycle is shortened and that increases costs again. The larger routing table means slowed convergence. That can be improved by faster control plane processors, which means lots more upgrades, bigger hardware, which is probably fine for the guys in the centre of the Internet because they have huge budgets, but the guys more to the edge of the Internet don't have such big budgets. I'm not sure the guys at the centre of the Internet would agree with that. Network performance and stability - slowed convergence gives you slowed recovery from failure. If you're doing BGP multihoming with multiple full BGP tables today, trying to do a convergence on multiples of 200,000 prefixes takes a while. Slowed recovery means longer downtime and longer downtime means unhappy customers.
MARTIN WINTER: That's why you have BGP PIC.
PHILIP SMITH: Not everybody can afford a CRS-1, as much as we might like people to buy one. The CIDR Report has been running since 1994. These days it has a nice little box at the bottom where you put your AS number in and it gives you advice as to how to aggregate. People don't seem to use that. I'm forever surprised that people go, "Oh, I never noticed that." Routing table report. I've been sending lists since 1999 to various operational mailing lists. Some people pay attention, others filter it. There are training, tutorials, the Team Cymru guys have done great work. Then we had the CIDR Police. We've given up because we got too much negative reaction.
The BGP features - there's NO_EXPORT Community, which quite a few ISPs are using and other people have never heard of. There's a NOPEER Community, which nobody seems to be using that I know of. But RFC 3765 describes all about it and I think it is a pretty neat thing. AS path limit attribute is still working through the IETF IDR Working Group. That could be a useful tool if it's widely adopted. There are provider-specific communities. I see a lot of North American and European ISPs offering those. I don't see many others offering those.
So the recommendations basically are these:
The announcement of initial allocation as a single entity. I mean that's what we've been doing since day one. Before I was at Cisco, I was one part of UUNET and we took for granted that if we got an address block, we announced it. Subsequent allocations should be aggregated if they're contiguous and business-wise aligned. If they're aggregated and side by side, do so. Spraying out /24s is not traffic engineering. Use some of the BGP enhancements already discussed. I know this will apply to IPv6 too because as far as we're concerned at the operational level, v6 is v4 with bigger numbers.
I have the T shirt, I just didn't wear it.
Looking at de-aggregation, the CIDR Report encourages aggregation. There's a little box at the bottom where you pop your AS number in and it recommends what to do. My routing report does the BGP table status on a per Regional Internet Registry basis so it's the original CIDR Report and a whole lot more stuff because I was kind of curious what was happening on a per region basis, as was APNIC at the time and still are indeed. So - I didn't come up with this. This is thanks to Simon Lyman who suggested a de-aggregation factor. I don't know why I didn't think of this before, it's so obvious. It's basically taking the routing table as it is and dividing it by what the routing table would be after I do the aggregation in my routing report. I work that out on a global basis and I work it out on a per Regional Internet Registry basis.
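The de-aggregation factor described above is the size of the table as announced divided by its size after aggregation. A minimal sketch in Python follows; it assumes pure address-based merging with the standard library `ipaddress.collapse_addresses`, which does not aggregate over holes but also knows nothing about origin AS or path attributes, so it is not the routing report's actual method. The example prefixes are made up.

```python
import ipaddress


def deaggregation_factor(announced):
    """Ratio of announced prefixes to the count left after merging
    contiguous and covered prefixes (address-based merge only)."""
    nets = [ipaddress.ip_network(p) for p in announced]
    aggregated = list(ipaddress.collapse_addresses(nets))
    return len(nets) / len(aggregated)


# Four contiguous /24s that collapse into one /22, plus one unrelated /24.
table = [
    "10.0.0.0/24",
    "10.0.1.0/24",
    "10.0.2.0/24",
    "10.0.3.0/24",
    "192.0.2.0/24",
]

factor = deaggregation_factor(table)  # 5 announced / 2 after aggregation
print(factor)
```

A factor of 1.0 means the table is already fully aggregated under this simplified merge; larger values mean more announced prefixes per aggregate.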
VINCE FULLER: I had a quick question, since I referenced this yesterday. What are the criteria you use for aggregation? Common path? Common attributes? No holes?
PHILIP SMITH: Yes.
VINCE FULLER: You don't aggregate over holes?
PHILIP SMITH: No. It's what I see in the table and I aggregate.
VINCE FULLER: Okay.
PHILIP SMITH: This was Simon's suggestion - just include this ratio in the report. I mean I'd been looking at the way the routing table was increasing and the max aggregation number was increasing and it was, yeah, it's increasing, but I assumed the ratio was pretty much constant. That was a mistake because this is what it was yesterday. 230,000 prefixes in my view, which is basically APNIC's router in Japan. Geoff gives a slightly different number. The global table we see 213,000 prefixes. From North America, so basically the ARIN region, North America and a few other bits, 104,000 prefixes. Europe and the Middle East, the RIPE NCC region, 44,000 prefixes. The global average is about 1.86, North America 1.7, Europe 1.53, by and large the largest piece of the Internet.
Let's look at the newer part of the Internet, the rapidly growing bit. 210,000 prefixes, 48,000 from the Asia Pacific region. 2.39. Remember the RIPE region is 1.5. Africa 3,000 prefixes, 2.67. Latin America and the Caribbean, 14,000 prefixes and 3.47. You say, "Okay, these are nice numbers." But look at the graph. The graph is quite exciting. The yellow line is the global general trend. It's increasing like every other graph for the Internet seems to be doing. The RIPE region is the bottom one here. It's growing very, very steadily. That mostly seems to be coming from the de-aggregation of 193 and 194 space, from a casual look at the numbers. The blue line is the Asia Pacific region. This was around the time of the rapid growth Internet bubble and then the big burst and obviously it had a marked impact in just this part of the world because you don't see it in any other bits.
Of course, LACNIC started functioning around about then. That's mid-2002. So they're the top one here and they're shooting off into space, more or less. The green one, AfriNIC, came along about a year-and-a-half later and they've been shooting off into space as well.
So this is, well, a little bit alarming. What is causing it? Following Vince talking about this yesterday, there's been a wee bit of discussion about possible causes. Is it de-aggregation or is there a deeper economic social kind of thing or a political thing that's doing it.
So look at some numbers. Looking just at the Africa region, you've got TEDATA announcing 244 prefixes. They could knock 238 out of this and just leave us with six. Is there a political reason this is happening? Is it because they don't know how to de-aggregate? That's Africa. Not so interesting for this part of the world. Look at this part of the world.
The interesting top 20 de-aggregated. Our friends at VSNL are announcing 1123 and could throw away 1046. China net could throw away 998, Bharti could throw away 904, Sify could throw away 708. I don't know who Hathway are. TPG could throw away 512. Reliance could throw away 505 and so it continues down the list. If you look all the way down the list, you see this. Why is it happening? I don't know.
North America, we have our usual friends at Covad. They're doing it for business reasons and that's about as far as I can go there. They could knock 983 off 992. That's quite a significant number.
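The "could throw away" figures above are per-origin savings: prefixes announced minus prefixes needed after aggregation. A rough sketch of how such a ranking might be produced is below. The input format, function name and AS numbers are illustrative assumptions, and aggregation here is address arithmetic only.

```python
import ipaddress
from collections import defaultdict


def possible_savings(routes):
    """Rank origin ASes by how many prefixes aggregation would remove.

    routes: iterable of (origin_as, prefix) pairs (hypothetical format).
    Returns a list of (origin_as, announced, savings), largest savings first.
    """
    by_origin = defaultdict(list)
    for origin, prefix in routes:
        by_origin[origin].append(ipaddress.ip_network(prefix))

    report = []
    for origin, nets in by_origin.items():
        aggregated = list(ipaddress.collapse_addresses(nets))
        report.append((origin, len(nets), len(nets) - len(aggregated)))
    return sorted(report, key=lambda row: row[2], reverse=True)


# Documentation AS numbers and prefixes, used purely as sample data.
routes = [
    (64496, "198.51.100.0/25"),
    (64496, "198.51.100.128/25"),
    (64496, "203.0.113.0/24"),
    (64497, "192.0.2.0/24"),
]
for origin, announced, savings in possible_savings(routes):
    print(origin, announced, savings)
```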
We have a question but the microphone doesn't work.
PETER SCHOENMAKER: You're saying that these are all the more specifics they can save with an aggregate and my best guess from my experience is that these are from not setting no export on the /24s while announcing the aggregate out the route on traffic engineering.
PHILIP SMITH: Well, we did explore this and they basically said to me, in Covad's case, "We're doing this for business reasons."
VINCE FULLER: What about the previous ones?
PETER SCHOENMAKER: Yeah, the Asia Pacific ones are what we see from many of the people listed here. If you look at where these people are located, a lot of the cables only support STM 1 or STM 4 and the networks are many, many, many times larger than that so they have many STM 1s or STM 4s that they're doing TE across.
PHILIP SMITH: I'll be interested to see how TE is working but there are loads of /24s when you look at the figures and given the massive size of address space they have it seems unusual.
PETER SCHOENMAKER: But they're connecting the one network, not just us, but other networks with many, many STM 4s, so they announce them across so the path looks the same and everything else looks the same but internal to our network, the next hop is different for each of the /24s.
RANDY BUSH: It's the old one, Randy Bush IIJ. Why should I pay for your traffic engineering? Why does the global Internet have to pay because you can't figure out how to do a different kind of engineering? Right? It's the grazing of the Commons.
GEOFF HUSTON: Getting into Peter's comment a little bit harder, what we haven't actually done over time, yet, is looked at those more specifics and looked at whether they ever actually, inside any updates, move into alternate paths. I suspect that, certainly in the case of a lot of these places which are bandwidth constrained, they don't. So this is all about, if you will, a lack of proxy aggregation on the other side. But what's going on is they are, indeed, announcing more specifics on eBGP to get load balancing but the other end should be going, "Well I just need to announce the aggregate." So it's a lack of communication as well as intelligence. On the North American and European stuff, I actually suspect that there's an issue of diversity and, if you will, resiliency and I suspect that if you looked at the update patterns for these more specifics, you find they're doing a bit of path shifting.
Now I'm not excusing this practice. I actually think that most of this is a result of inadequate understanding of what you can do to do the right thing in routing. It really is an intelligence problem going out in how we spread the understanding of what BGP can do. We can fix training. Vince says that's a good thing and I quite agree. Looking at this from a dynamic point of view of what the updates tell us as well as the tables, might be a good thing.
PHILIP SMITH: Yeah, I think it would be a good idea to look at that.
PETER SCHOENMAKER: I agree. I think there is a percentage of the announcements that are /24s doing TE across providers. There is a substantial percentage of that to one provider.
GEOFF HUSTON: And proxy aggregation is missing.
PETER SCHOENMAKER: We have all the tools to do this today. They don't know though how to set the 'no-export' or the internal communities that we have today. They can use - a lot of big tier-1s have communities to stop announcing the routes to certain peers or certain places. If you're trying to do TE between the two, you can announce the more specific and not announce to the other. We have many tools to actually solve these problems in a really intelligent way that helps the overall growth of the table. As Geoff said, it's an education problem I think and most of the people I deal with are more concerned in making sure their network runs rather than worrying about how everybody else's network runs. There's no incentive for people to spend the time to take the maintenance window and adjust their announcements and watch out for mistakes, and some of them don't know how to do it too.
RANDY BUSH: I think we're all pretty much in agreement. I have one serious disagreement with Geoff, which is proxy aggregation. You don't need it. You don't do it. The only people who do it are the American military and I don't think we want to emulate them.
The reason you don't need it is NO EXPORT and all those people you're exporting to in the first place have nice communities you can attach. The problem is education. The problem is communication. But also part of the problem is a lack of visibility of known normal ways of doing this. So everybody is sitting back there and being smart instead of being intelligent.
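The effect of the NO_EXPORT community described above can be sketched as a simple export filter. This is a toy model, not any router's configuration language: the route dictionaries and function name are hypothetical, and only the community value (0xFFFFFF01, defined in RFC 1997) is taken from the standard.

```python
# Well-known NO_EXPORT community from RFC 1997 (65535:65281).
NO_EXPORT = 0xFFFFFF01


def export_to_ebgp_peers(routes):
    """Return the routes a provider would pass on to its own eBGP peers.

    Routes carrying NO_EXPORT stay inside the receiving AS, so the
    more specifics used for traffic engineering never reach the
    global table.
    """
    return [r for r in routes if NO_EXPORT not in r["communities"]]


# Hypothetical customer announcement: one aggregate plus two
# more specifics tagged NO_EXPORT for load balancing on the link.
received_from_customer = [
    {"prefix": "10.10.0.0/23", "communities": set()},
    {"prefix": "10.10.0.0/24", "communities": {NO_EXPORT}},
    {"prefix": "10.10.1.0/24", "communities": {NO_EXPORT}},
]
exported = export_to_ebgp_peers(received_from_customer)
print([r["prefix"] for r in exported])
```

The provider still sees both /24s and can balance traffic across the links, while everyone else only carries the aggregate.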
PHILIP SMITH: OK. Thank you, Randy. Given I've got virtually no time left, I'll skip through this. I've done it for each one. I noticed there was a big difference between the de-aggregation in Europe and North America versus the other bits of the Internet. Given I have access to quite a few ISPs and I can see what some of the ones there are sending to customers or peers, I am left wondering. I've helped a few ISPs do aggregation and they go, "Oh, we didn't know you could do that." My belief is it's more the educational bit. That's what I say right at the bottom there. The training is there. I mean there are some of us who are doing it. We need a lot more who can spare the time to actually try and do it. I don't want to sound like the global warming people and people in the nice parts are saying, "Oh, well, it doesn't matter for us."
Because it is going to be an issue. It is going to bite us all over again.
The trouble is I was around in '92 and '93 and I remember all the pain then and I don't want us to have to go through the same pain all over. Most of the people running today's networks were not around then and are not aware of the big issues we had to go through.
RIPE 399 is only a recommendation. It's not policy, mandatory or anything like that. But I hope that the registries can include pointers to the document in every address allocation they make just to try and encourage awareness that, you know, aggregation, where possible, is actually a feasible thing to do. That's why I say, you know, just make it the BGP good practice document or at least one of them. And that was pretty much all I had to say.
Are there any other questions?
If not, thank you.
Next up we have Andrei Robachevsky, who has finally arrived. Andrei will be talking about 32-bit AS numbers.
ANDREI ROBACHEVSKY: Hi, my name is Andrei Robachevsky. I work for the RIPE NCC. I'm presenting the presentation and Henk Uijterwaal from the RIPE NCC did the presentation. He presented the presentation at NANOG in February and his team did all the work. So this presentation is about our experience in the RIPE NCC in supporting the new policy that allows assignment of 32-bit AS numbers. I was planning to go through some background, but Geoff's presentation at the plenary saved me this job. Thank you, Geoff. I will just put the slides.
We all know we are running out of AS numbers and the pessimistic forecast is 2010 so we have to be prepared. There is a draft, which is soon to become an RFC, which allows us to start using the 32-bit numbers. It assumes a transition period and assumes coexistence of 16-bit numbers and 32-bit numbers or rather, old BGP speakers. There are some details how it works out, but again, it was in Geoff's presentation.
What we need to deploy, we need - someone needs to get ASN 32. The first aspect - we needed a policy. LIRs have to request them and RIRs have to be able to handle the requests. Use your ASN 32, you need to have adequate hardware and tools and do some tests and probably update your operations.
There is a policy adopted in all five regions basically stating that in the timeframe between 1 January 2007 and 31 December 2008, LIRs can request a 32-bit number when applying for a new ASN. RIRs will give ASN 16, which is the low numbers, by default. But on request, they will be able to give you a 32-bit number.
Starting from 2009 and for the whole of 2009, LIRs can ask for a 16-bit number or a 32-bit number and RIRs will give 32-bit numbers by default and 16-bit by special request. After the cut-off date on 1 January 2010, RIRs will always give 32-bit ASN. We have to be prepared. This policy doesn't change other aspects of the autonomous system number allocation.
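The three phases of the policy just described can be written down as a small decision function. This is only a sketch of the dates as stated in the talk; the function name and the string labels are hypothetical, and treating requests before 2007 as 16-bit only is an assumption.

```python
from datetime import date


def assigned_asn_type(request_date, requested=None):
    """Return which kind of ASN a registry would hand out.

    requested: None (no preference), "16-bit" or "32-bit".
    Assumption: before 1 January 2007 only 16-bit numbers exist.
    """
    if request_date < date(2007, 1, 1):
        return "16-bit"
    if request_date < date(2009, 1, 1):
        # 2007-2008: 16-bit by default, 32-bit on request.
        return requested or "16-bit"
    if request_date < date(2010, 1, 1):
        # 2009: 32-bit by default, 16-bit by special request.
        return requested or "32-bit"
    # From 1 January 2010: always 32-bit.
    return "32-bit"


print(assigned_asn_type(date(2007, 2, 28)))
print(assigned_asn_type(date(2007, 2, 28), "32-bit"))
print(assigned_asn_type(date(2009, 6, 1)))
print(assigned_asn_type(date(2010, 1, 1), "16-bit"))
```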
So, as I said, similar proposals were made in all five regions and consensus reached everywhere late in 2006. The policy was introduced as of 1 January, 2007. All RIRs, well I'll give you some statistics. As of 5 February, the RIRs have to start handling requests, as of 1 January 2007.
So we had to be ready by 1 January, LIRs have to be ready to start using the 32-bit numbers by 1 January, 2009.
But while the speculation - if I have an ASN, why should I care? Probably because you will get new customers and you will need a new number or your customers will need one.
What do the requests look like? If I want to request a 32-bit number, I use exactly the same form. The only additional thing I put a note, "I would like an ASN 32 number, please", and RIPE NCC will give you the number. Other RIRs have similar procedures.
That is the task we face. We have to process those requests and our registration systems were designed to handle 16-bit ASN numbers. Doesn't sound like rocket science, but it is a lot of work.
We found ASNs that were represented in different ways in many different places and those are just some of them. So we had to do some work. We composed a team. We started the work.
That is the team, not a small one. It is not everyone worked full-time, but just gives you an indication about the spread of those numbers in the RIPE NCC. Henk was the manager.
The first problem - notation. How do we represent the AS numbers when we give them to LIRs? There was no standard. There were several suggestions.
One of them, I think, was to use colon, which may be confused with communities, another just to use numbers and the third one to use dot. A recent proposal, proposed by George Michaelson, suggested 32-bit AS numbers.
That is the slide about notation. It is different from all other BGP attributes accepted by, at least as far as I know, one vendor. There is still an open question, is it a valid notation? Yes, it is.
It is still for the IDR working group as far as I know. We had to implement the policy as of 1 January, we couldn't wait and picked up the dotted notation, assuming it is the right thing.
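As an illustration of the dotted notation mentioned above, the sketch below writes a 32-bit AS number as two 16-bit halves separated by a dot and converts it back. That reading of the notation (high half, dot, low half) is an assumption here, and the function names are hypothetical.

```python
def to_dotted(asn):
    """Format a 32-bit AS number as 'high.low' (two 16-bit halves)."""
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError("AS number out of 32-bit range")
    high, low = divmod(asn, 65536)
    return f"{high}.{low}"


def from_dotted(text):
    """Parse 'high.low' back into a single 32-bit integer."""
    high_text, low_text = text.split(".")
    high, low = int(high_text), int(low_text)
    if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
        raise ValueError("each half must fit in 16 bits")
    return high * 65536 + low


print(to_dotted(65536))
print(to_dotted(65546))
print(from_dotted("1.10"))
```

In this form every existing 16-bit number has a high half of zero, so the first genuinely new number is 1.0.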
We at the RIPE NCC, what we did, we took George Michaelson's draft and looked how it will look in RPSL. We did another draft in RPSL that actually outlines attributes that have to be fixed and updated and amended and we submitted the draft to the IETF.
Another thing, the new format required some software upgrades, parsing ASNs on input and output, and realising there is danger in using dot or other notation because some tools will take the notation as a floating point without warning. Thorough testing is to be done.
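The floating-point danger mentioned above is easy to reproduce. In the hedged sketch below, two different dotted AS numbers become the same value once a tool silently treats them as floats, while parsing the two halves as integers keeps them apart. The helper name is hypothetical.

```python
def parse_dotted_asn(text):
    """Parse 'high.low' as two integers, never as a float."""
    high, low = (int(part) for part in text.split("."))
    return high * 65536 + low


# Treated as floating point, "1.10" and "1.1" are indistinguishable.
print(float("1.10") == float("1.1"))

# Parsed as two 16-bit halves they are different AS numbers.
print(parse_dotted_asn("1.10"), parse_dotted_asn("1.1"))
```

Spreadsheets, loosely typed scripts and configuration loaders are the kind of places where such a conversion can happen without any warning.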
ASN 16 bits forever. Code using will break immediately but what about registered INTs? Only will break when you reach a certain number.
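One reading of the remark above is that code storing an ASN in a 16-bit field fails as soon as a 32-bit number appears, whereas code using a signed 32-bit integer only fails once numbers pass a certain size. That interpretation is an assumption. The sketch uses Python's `struct` module to stand in for fixed-width fields.

```python
import struct


def fits(fmt, value):
    """Return True if value can be stored in the given struct field."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


asn = 65546  # a 32-bit AS number, just above the 16-bit range

print(fits("!H", asn))          # unsigned 16-bit field
print(asn & 0xFFFF)             # what silent truncation would keep
print(fits("!i", asn))          # signed 32-bit field, small value
print(fits("!i", 2**31))        # signed 32-bit field, large value
print(fits("!I", 2**32 - 1))    # unsigned 32-bit field, maximum value
```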
Routers - what is the state of the routers with regard to 32-bit ASN? As far as we know Juniper and Redback have announced an implementation and Cisco have one that is not finished.
PHILIP SMITH: It is official.
ANDREI ROBACHEVSKY: Great news, thank you. There is a chicken-and-egg problem, but we will certainly overcome it.
Software routers, well again, there was a presentation about testing OpenBGPD and we are using Quagga. Both have patches and are capable of handling 32-bit ASN numbers, which is good news as well.
Supporting systems have to be upgraded, like Nagios and other tools. We run the Routing Information Service (RIS), which collects BGP information all over the world. It would be very useful if we can detect the 32-bit numbers and therefore we had to upgrade RIS as well.
Other stuff? Whois training material, documentation, various scripts, as I said, all over the place.
That is how we did some planning. We put an internal deadline as of 1 December 2006 so we should have been able to make trial requests for 32-bit numbers, meaning all internal systems are upgraded to be able to be ready for 1 January, 2007, to be able to handle LIR requests. Other systems, late in 2007 and it strongly depends on the vendors. So the question, did it work out? Yes. We upgraded all the software, or all software that was necessary to handle the requests, on 2 January 2007 and 2 January 2007 we received first external request.
We processed this and allocated it to a company in Germany.
Well, that was a good test for our internal systems as well.
We also looked at our RIS. RIS is operating in 16-bit ASN space, but it can still detect 32-bit numbers. We can guess what is behind the code names, right? It is reserved AS number 23456, which represents the 32-bit numbers in the net, and we see it in RIS.
We see at least one in RIS. We can't tell which of the three ASNs it is really, but we can guess.
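The reason a 16-bit collector cannot tell the new ASNs apart can be shown with a toy model of the transition mechanism: every AS number that does not fit in 16 bits is presented to an old BGP speaker as the reserved number 23456 (AS_TRANS). The function name and the sample path are hypothetical.

```python
AS_TRANS = 23456  # reserved 16-bit stand-in for any 32-bit AS number


def path_seen_by_old_speaker(as_path):
    """Replace every AS number above 16 bits with AS_TRANS."""
    return [asn if asn <= 0xFFFF else AS_TRANS for asn in as_path]


# Hypothetical path containing two different 32-bit AS numbers.
path = [64496, 196608, 196609]
print(path_seen_by_old_speaker(path))
```

Both 32-bit numbers collapse to the same placeholder, so an old speaker can see that a 32-bit AS is present but not which one.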
Are people asking for ASN 32? Yes, it is the status as of 5 February and more recent information, you may correct me on that. We allocate ARIN 3.
We upgraded everything apart from RIS and everything based on this. That is pending on our test with Quagga, because this is our BGP speaker that collects all BGP information, but I think we are resolving this and hope to deploy it very, very soon.
Lessons to be learned? As I said, this is not really rocket science. It is changing 16-bit numbers to 32-bit numbers but it appears to be a lot of work. At NCC that amounted up to two man years and involved seven departments. Supporting systems only for a medium-sized network, our estimate would be half or three-quarters of a man year.
What should you do? I think we all should start thinking about 32-bit ASN numbers in our companies and organisations now. Also ask your vendor for support or be prepared for a nasty surprise in 2009. That is the advice, don't wait until you get assigned this number in 2009 and don't know what to do with it.
I think that concludes my presentation now. If you have any question?
RANDY BUSH: Randy Bush, since this is a routing registry and I'm one of the people paying the bills for these things I would like to ask APNIC how many person years they used? Maybe they used women and were able to do it in months.
GEORGE MICHAELSON: We're able to leverage the work RIPE had done.
RANDY BUSH: I found the list of staff and the amount of time horrifying.
PHILIP SMITH: Were you in charge?
GEORGE MICHAELSON: I'm George Michaelson from APNIC, I forgot my own rules.
PHILIP SMITH: Any other questions.
GEORGE MICHAELSON: I would like to make a comment on the notation, if that is acceptable. The notation doesn't matter. It caused a lot of upset for some people, while I am the author of the draft and have huge emotional investment, it doesn't matter. In particular, Quagga will just work fine. The inner core of Quagga will work fine. The patches have two parts, the 32 and some dissent in the Quagga community, with the interface, if they can get over it, Quagga is going to be released 32-bit. Whilst there are realities, for routing people, it doesn't matter.
PHILIP SMITH: Other questions at all? George? Discussion? Okay. Thank you very much.
ANDREI ROBACHEVSKY: Thank you very much.
PHILIP SMITH: Okay. So this is now the business part of the routing SIG. I delayed it, as I said in the start, I delayed it to the end simply because my co-chair was double booked and he couldn't be here as part of the meeting at the start.
So the administration part is basically explaining to you a little bit about the SIG guidelines and then we have a small election for the co-chair position.
I will explain that in a second. One of the things that we have been working on as the chairs of the special interest group is producing documentation that really describes how the APNIC special interest groups are created and how they are run. I have put the URL on the screen there. It was posted to APNIC talk mailing list a few weeks ago. I would encourage you all to grab the document, have a read of it, provide us with feedback. We reckon it describes how the SIGs are operated, but, you know, there could be holes, there could be things missing, things that you would like to know more about, and so forth. Please have a read of the document and give us your feedback.
The highlights are really how to form and dissolve a special interest group. The routing SIG started off at the beginning when the SIGs actually started - that was an unusual system. But the description of how to form a SIG actually was the experience that I had when we launched the IX SIG.
A new thing we have introduced covers SIG chair elections and the length of service of SIG chairs. We definitely - I think we came to a consensus that we definitely don't want the community to think we have this job for life. I don't think any of the SIG chairs would even want the job for life, either! But it wasn't actually documented in the early documentation about how you quit being a SIG chair and how we get a new one into the system.
So we have basically agreed that SIG chairs serve for two years, as does the co-chair. That is effectively starting from this meeting here.
As part of that, nominations went out a couple of months ago and that is why we will have an election after I finish this explanation here. The document also covers what the co-chair's responsibilities are. They are a little more than putting it on one of your wider Internet responsibilities. You actually have to get presentations for the meeting, sit here and lead the discussion as chair or co-chair or both and so forth. You are basically also working with APNIC Secretariat to prepare the agenda for the six-monthly meeting and you are expected to turn up. It is fairly important if you are going to chair a session.
We also cover the presenter guidelines. Each presenter should have received the guidelines, talking slowly, clearly, using straight-forward language, especially to help the stenographer team here and to assist those whose first language is not English. It describes how the working group is set up and the Birds of a Feather. We have had a go at documenting what consensus means. That, I suppose, the guidelines borrowed from other organisations who have strived to achieve consensus.
It is not a yes, no vote - it is a general feeling of the room. Trying to capture it in suitable words was more challenging for us. I recommend you look at that. If you suggest improvements or refinements, that would be quite welcome. That covers the guidelines document.
As I said at the start, I encourage you all to read it, give your feedback to the routing chairs, to APNIC-talk or whatever.
It brings us to the final item - the SIG chair election. As per the guidelines, one of the two chairs steps down at this meeting. So Randy has volunteered to step down. We also agreed that we should add a separate co-chair to this. The routing SIG is quite a big special interest group. It is, in my opinion, very successful. We seem to attract a large number of people and it is a quite a bit more work in doing this than for some other special interest groups. We agreed having two co-chairs would make more sense.
When the call went out, we asked for nominations for the co-chair position. We had two nominations. I nominated Randy to serve again as co-chair. Tomoya Yoshida also volunteered to serve as co-chair. We had two nominations, and they are elected, unopposed. Questions?
RANDY BUSH: At another SIG where Philip was chair, the co-chair said it was great being co-chair because Philip did all the work. A similar situation exists here but I am not proud or happy. So, I feel guilty that I have done an insufficient amount of the work. If there are others who have the energy, etc, even though I'm in the official process, we can't do it right now, please come talk to us and assistance and more hard-working people than I would be appreciated.
PHILIP SMITH: Thank you, Randy. I was going to come to the thank you, Randy, for assisting me, because he has actually assisted me and I appreciate that. I don't know whether I'm being selfish or something grabbing the chair at the head of the table, I feel like I'm being like that, but the other half is always keen to get things done and move forward and generally improve things for everybody. Maybe that is a part of that too. I would like to thank Randy, for probably the last two years of assistance and, of course, welcome him back as co-chair and also welcome to Tomoya Yoshida who will be joining as co-chair of the routing SIG. That is all the business we have. Thank you all very much for coming.
Are there any other questions? Comments, business?
JASON BAILEY: Is the second vice chair here?
PHILIP SMITH: Please stand up. That is actually one of the requirements, that somebody who is nominated actually has to be here at the meeting before they can assume the position. We can't have absentee co-chairs. You can't chair it when you are absent.
If there is no other business we have finished five minutes early, which I am delighted to see. Thank you all for coming. Just a reminder of the APNIC social in the Melia Hotel. If you missed the announcement at the start, through the Westin, down the beach, walk past the Laguna, turn right, look for the Jungle Cafe.
MARTIN WINTER: Basically, it is supported in some platforms, the 7600 we are looking at it. We are looking for it.
PHILIP SMITH: Okay. While on the subject of the social, you do need a ticket to get in. If you have no ticket you will not get in. I have no ticket yet so I probably will not get in. The tickets are available outside. So please, if you want to go, grab one. Otherwise, no other business, thank you all for coming, enjoy the rest of the conference.