Open Source Working Group session
.
29 October 2020
.
3:00pm CET
.
ONDREJ FILIP: Good afternoon everyone, it's 15:00 and 13 seconds, so we are 13 seconds late, but I hope that will not be a problem. My name is Ondrej Filip, co‑chair of the Open Source Working Group with Martin Winter, great guy, probably now in Switzerland, right, and I welcome you to the best Working Group during all the RIPE meetings. You know, we will definitely not bother you with any policy, politics, organisational matters, nothing like that; it will be a purely engineering discussion about open source projects with the developers. Martin, can you please run the slides.
.
A few housekeeping items. You know we have a new tool called Meetecho, but I believe you were able to get used to it during the week. So there is an audio queue; if you want to speak, you will be allowed to. There is also a Q&A button, so you can just type your questions and they will be read afterwards. There is a chat, and we also have stenographers.
.
So ‑‑ and, of course, if you want to speak, do not forget to state your name and affiliation.
.
So now you can see on the screen the first part of our agenda.
.
The welcome part is roughly done.
.
We have basically two presentations and one discussion. So, as usual, I would like to ask if there is any addition to the agenda, if someone wants to add something. After this question there is usually a long silence, so I hope I will not, you know, be rushing anybody if I say okay, I don't see any additions, so the agenda is approved.
.
Also, the minutes from the previous Working Group meetings were published on time, so, we haven't received any kind of additions, questions or remarks, so I also believe that those minutes can be approved. Again, I will wait ten seconds if there is any objection.
.
That's a long time, ten seconds, isn't it?
.
Okay, so I don't hear anything, and then we are in the first part of the ‑‑ the results of the Working Group Chairs re‑election and I will pass the microphone to Martin.
MARTIN WINTER: So, we had a message sent out; obviously every year for the whole Working Group session, we have the re‑election. I didn't receive any candidates, so there was no other person who wanted to become chair from that point on, and with both Ondrej and myself continuing, it basically goes on as it was before. So we'll have the next election again before the next fall meeting, that's next year; hopefully that will be a physical meeting again by that time.
.
And now for the presentations. First, we have a presentation from Nathalie Trenaman, I hope I say her last name correctly, about the lifecycle of the NCC RPKI validator. She will talk a little bit about the current state, the future state of the validator and also some timelines.
.
Then we go on, the Birdist cookbook, they are using Bird and BGP for routing and also especially for the Anycast network in their network, and he wants to talk a little bit about his experience, what observations he has and a few other interesting things.
At the very end, we basically have just a quick ‑‑ at least a start of a discussion about how the Open Source Working Group works during the virtual meetings, and whether there is a way to improve it, or whether we should change some of the format a little bit to work better for virtual meetings.
.
That's basically it from the agenda. Let's go to the first speaker.
NATHALIE TRENAMAN: I am sharing my screen and I think you should be able to see my screen now. So, good afternoon, and thank you for giving me the opportunity to give you an update on the RIPE NCC RPKI validator. As I said, my name is Nathalie Trenaman, you did pretty well pronouncing it, and I am the Routing Security Programme Manager for RIPE NCC.
.
So, let me start by telling you a little bit about what we have been working on recently. For time constraints, I am not going to explain what a validator is in the sense of RPKI; I assume you already know that.
.
What is also important to know is that the software that we use for the validation is officially called relying party software. So for people that are not too familiar with RPKI stuff, that is important to know, because I didn't explain it on my slides.
.
The biggest issue that we heard from users that were using the RPKI validator from the RIPE NCC was the high CPU and memory consumption, and we have been working quite hard to reduce both. To reduce the CPU load we performed a lot of micro‑optimisations in the validator code ‑ for example, better caching and only storing the validation checks that failed. And to improve the memory consumption, we tuned the Xodus caching, the database caching, so it has smaller pages now.
.
We also made some tweaks to the CA tree validation algorithm. So that also gave some better performance.
.
We also had to work on the AS zero policy implementation from APNIC. They created, basically, one very large ROA to have all their unallocated space originating from AS zero, so not being announced in the global routing table, so we had to accommodate that.
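.
To make the effect of an AS zero ROA concrete, here is a minimal sketch of route origin validation in Python. It is not the validator's code; the `Roa` type, the `validate` function and the documentation prefixes and AS numbers are assumptions chosen for illustration. The point it shows is that a ROA for AS zero can never match an announcement, so any route covered only by such a ROA comes out as invalid.

```python
from ipaddress import ip_network
from typing import NamedTuple, Sequence


class Roa(NamedTuple):
    prefix: str      # prefix covered by the ROA (hypothetical values below)
    max_length: int  # longest prefix length the ROA authorises
    asn: int         # authorised origin AS; 0 means no AS may originate it


def validate(prefix: str, origin: int, roas: Sequence[Roa]) -> str:
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    route = ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # subnet_of() raises TypeError across IP versions, so compare versions first.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # An AS0 ROA never matches; it only marks the space as covered.
        if roa.asn != 0 and roa.asn == origin and route.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


roas = [Roa("203.0.113.0/24", 24, 0), Roa("192.0.2.0/24", 24, 64500)]
print(validate("203.0.113.0/24", 64501, roas))   # covered only by the AS0 ROA
print(validate("192.0.2.0/24", 64500, roas))     # matching ROA
print(validate("198.51.100.0/24", 64500, roas))  # no covering ROA
```

The three calls print `invalid`, `valid` and `not-found` in that order.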
.
And we had a lot of feature requests to remove the repositories from the cache if they are not referred by any certificates for more than seven days.
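.
The seven‑day rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the validator's implementation: the dictionary of last‑referenced timestamps, the function name and the example URIs are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict

# Threshold taken from the talk: drop repositories unreferenced for more than seven days.
MAX_UNREFERENCED = timedelta(days=7)


def prune_repositories(last_referenced: Dict[str, datetime], now: datetime) -> Dict[str, datetime]:
    """Return only the repositories some certificate referred to within the last seven days."""
    return {
        uri: seen
        for uri, seen in last_referenced.items()
        if now - seen <= MAX_UNREFERENCED
    }


now = datetime(2020, 10, 29, 15, 0)
cache = {
    "rsync://repo-a.example/repo/": now - timedelta(days=2),   # still referenced recently
    "rsync://repo-b.example/repo/": now - timedelta(days=10),  # stale, will be removed
}
print(sorted(prune_repositories(cache, now)))
```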
.
So, then, what else did we work on? After one of our outages earlier this year, we discovered that different relying party software behaves quite differently with regard to manifest validation rules. Manifests are one of the objects in the CA.
.
If something was wrong with the manifest, we would give a warning, but not exclude the certificate authority as a whole. Other validator software did exclude those CAs. The RFC was not really specific on what the behaviour should be in such cases, but the SIDROPS Working Group in the IETF agreed that we had to work on that and that the RFC needed to be updated, and it is currently working on a BIS document for RFC 6486.
At the moment, there are still some loose ends in the draft so it's still under discussion, but there are some large chunks that have already reached consensus. In the meantime, relying party software is moving forward.
.
rpki-client already had strict validation, and Routinator released 0.8 last week, and drum roll: we released RPKI validator 3.2 yesterday to production, so this is hot news. Here are the places where you can find it on GitHub and on our FTP site.
Like I said, the 6486 BIS draft is not completely finalised, so we have some minor differences. For example, we do not fall back to a previous version of the CA that would still be valid. So if we spot an issue within the CA, we completely drop it and we don't fall back.
.
This is one of the requirements in the BIS draft: all objects listed on the manifest reside at the same publication point. We do not perform that check at this time yet.
.
And the last one that we are not compliant with yet is that the CRL and manifest next update times are not checked for consistency. Typically there is a delay of at most one or two seconds for all the objects to be synchronised, so if we already checked for that consistency now, a lot of the objects would fail. So we're not there yet. We need to do some tweaking on the other side, so on the publication server side, to fix that.
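.
A small Python sketch shows why a one or two second skew matters for such a check. The exact rule in the draft is not spelled out in the talk, so this assumes the simplest reading, "the two next update times must be equal", with an optional tolerance; the function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def next_update_consistent(
    manifest_next: datetime,
    crl_next: datetime,
    tolerance: timedelta = timedelta(0),
) -> bool:
    """True if the two next update times differ by no more than `tolerance`."""
    return abs(manifest_next - crl_next) <= tolerance


manifest_next = datetime(2020, 10, 30, 12, 0, 0)
crl_next = datetime(2020, 10, 30, 12, 0, 2)  # published two seconds apart

print(next_update_consistent(manifest_next, crl_next))                        # strict comparison
print(next_update_consistent(manifest_next, crl_next, timedelta(seconds=5)))  # with some slack
```

With a strict comparison the pair is rejected; with a few seconds of tolerance it passes, which is the trade‑off described above until the publication side is adjusted.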
.
So we do recommend updating to version 3.2 as of today, because it is in production, and we will also tell that to other users.
.
What else? Maybe you saw my RIPE Labs article, and, if not, I'm going to give you a little bit of an update here.
.
We have been looking at the validator landscape and our own strategy for the next years in terms of RPKI, and we came to the conclusion that the current validator landscape is mostly mature, and well‑maintained, also very important, and that we, as RIPE NCC, would like to shift more our focus towards having that really secure, stable and resilient trust anchor and certificate authority.
.
So that meant that we are stepping slowly away from efforts in our validator.
.
We are going to phase out the RIPE NCC validator. I saw on the members‑discuss mailing list that some people said they would like to see us stop immediately with the maintenance, but we believe that a phased approach might be a little bit better and gives a smoother transition.
.
I would like to share our timeline with you.
.
So this is the timeline, I will talk a little bit about the three different phases and we will archive the validator on the 1st July of next year, 2021.
.
For Phase 1, which is quite short, you might not realise it, but Christmas is only eight weeks away. Work continues as normal, so we will work on features, RFC implementations, but also bug reports for version 3.2 of course. If other RIRs come up with new policies and they manage to finish that any time soon, we will also incorporate those changes. And in the meantime, we will inform the community so users will know what we're doing; that's why I'm doing this presentation, for example.
.
Then Phase 2 starts on the 1st January. Then we will stop working on new features, but the rest of the work we will still continue. RFC implementations, policy implementations, bugs and security fixes of course.
.
In January we will also try to reach out to as many users as we possibly can find e‑mail addresses etc., to inform them and to point them to alternatives.
.
Also, at the moment, our RPKI training material consists of two labs: one is Routinator and the other one is our own validator, and our training team is currently looking into alternatives for our validator to be put in those labs.
.
Then, Phase 3. Basically in Phase 3, the only work that we will do is security and bug fixes. We still have this phase because it is security software and if you have to maintain security software it has to be secure until the last day. So that is why we continue to work on that. And then on the 1st July we will archive the RIPE NCC RPKI validator.
.
Luckily, like I said earlier, there are a lot of mature alternatives out there. Here is the list, with all the links, of well‑maintained relying party software: Routinator, FORT, OctoRPKI, rpki-client, RPKI Prover and RPSTIR2.
.
And with that, I am happy to take any questions, if there are any.
ONDREJ FILIP: Thank you very much, Nathalie. I am really kind of impressed by this presentation because, you know, for me, it's always easier to start a project than to end it. So I'm happy that you took the approach, that you also realised that there is sometimes time to end some project. Thank you for that. Are there any questions? As I said, you have more possibilities: you can either put them into the Q&A, type them, or you can request audio and we can let you speak.
.
I do not see ‑‑ oh, we have Rudiger and we have Blake Willis.
RUDIGER VOLK: Not really a question, more a comment.
.
Ondrej, I am surprised that you, as a source of commonly used production software, are not seeing that people who are responsibly using operational stuff actually have to take care that they know when and how long support for the stuff they are using will be available. So, kind of, yes, I am very happy that Nathalie and the NCC is telling and giving the road map for closing down this software support.
.
On the other hand, from ‑‑ if I still was responsible for operating stuff like this, I would be actually pretty disappointed by having fairly short notice. In a large shop, it probably is really a challenge to change to a different production software, do a selection, introduce and test, and so on, within just half a year.
ONDREJ FILIP: Thank you. I just said that I have problems sunsetting projects, I didn't say anything else. But anyway, Nathalie, I think that remark went to you.
NATHALIE TRENAMAN: Thank you, Rudiger. Very fair point, I have to say. It is, as you said, difficult to find a balance on when you sunset a product. Like I said, on members‑discuss there were people saying it should be stopped immediately, and not one, there were multiple people saying that. Now, I also believe that that is not realistic, so that's why I went for a longer period.
.
Some might say half a year is not long enough, and I hear you there.
.
Now, at the moment we have around 450 active users of this software. I think we can reach from what I have quickly glanced at, around 300 of them that I can identify. Now, the others, and this is another concern that I personally have, that I have to look into, is how would I make sure that I reach those people? And this is also another challenge. So, then, indeed, the six months is quite short. I would rather call it ambitious. But, yeah, I see that as a responsibility also on our side to do our utmost to inform people in a very timely manner.
.
MARTIN WINTER: We have one more question from Blake.
BLAKE WILLIS: Thank you for this, Nathalie. Again, this is not a question, it's a comment, so I'd like ‑‑ I guess I'm in the back of the queue.
.
I know there are a lot of opinions on this about some, for example, saying the RIPE NCC's funds shouldn't be used for developing things like this. Other people say, you know, that it should continue forever, and so on and so forth. I personally feel like this kind of project is a good way to prime the pump, and that bringing something ‑‑ you know, the RIPE NCC being the first validator out there. Not necessarily the best, the fastest and so forth, but it's a proof of concept that primes the pump, it gets people started using RPKI and then other solutions develop and then we sunset the thing. And this seems like the right way to go about getting things like RPKI deployed that need a good kick in the pants to get started and then other people can pick it up and move it down the road. So thank you.
MARTIN WINTER: I see one more question from Dmitry in the Q&A, but it doesn't look like it's related to this presentation. So, we'll skip that one and we go on with Alexander Zubkov, talking about your Bird experience.
ALEXANDER ZUBKOV: Hello, my name is Alexander Zubkov, I work at Qrator Labs. We do some BGP Internet connectivity there.
.
For that, we maintain Anycast networks around the world, and one of the big problems we have is spreading route information around the network and, for that, as you should have guessed already, we use Bird and the BGP protocol.
.
So, a little review of what we have. We run Bird only in Linux environments: of course on servers, and also on Arista and Mellanox, and here is an example of Bird tables and their interactions on Linux.
.
We use only Version 2 now. We finally got rid of Bird Version 1 this year. We prefer Version 2 because it's actively developed and it also has features like VRF support, BGP strict bind, so the daemon binds to the specific IPs that we have configured, and also simultaneous support of IP version 6 and 4 in one daemon.
.
And the protocols we use are BGP, coupled with ‑‑ BFD, and large communities and FlowSpec. And of course we extensively use Bird tables and the filtering procedures it provides, and we love it very much; it provides the best abilities for us.
.
So, today I want to talk about our experience and some obstacles we ran into when using Bird. The first is that Bird works in a single thread and, in that thread, it processes events one by one. So, if you have some high‑load set‑up, it won't scale to multiple cores. Also, if you have some hard events, like processing of a BGP full view, or if Bird writes to ‑‑ or makes some blocking operations, it can hang for some time, and if that takes a long time, some timers in protocols like BGP can expire and then it will just drop that session. You will see a message in your log when that loop takes longer than 5 seconds, by default.
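.
The effect of one slow event in a single‑threaded loop can be sketched in Python. This is a toy model, not Bird's event loop: the function name, the event names and the injectable `clock` parameter are assumptions, and only the 5‑second threshold comes from the talk.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_loop(
    events: Iterable[Tuple[str, Callable[[], None]]],
    warn_after: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Run handlers strictly one after another and report the slow ones.

    While one handler runs, nothing else is processed, so a handler that
    blocks also delays every timer and keepalive queued behind it.
    """
    warnings = []
    for name, handler in events:
        started = clock()
        handler()  # a blocking call here stalls the whole loop
        elapsed = clock() - started
        if elapsed > warn_after:
            warnings.append(f"{name} took {elapsed:.1f} s")
    return warnings
```

Passing a fake clock makes the behaviour easy to check without actually sleeping; in a real daemon the interesting part is that a BGP hold timer may expire while the slow handler is still running.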
.
So, when Bird writes either to a file or to syslog, those are both blocking operations, so if you have some problems with I/O, it can hang for some time and you can have problems with it.
.
We did a patch for ourselves. It uses UDP to send logs and you can find it via the link.
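.
The idea of logging over UDP can be illustrated with Python's standard library. This is not the patch mentioned above, only a sketch of the same approach: a datagram send does not wait for the receiver, so a slow disk or log daemon is less likely to stall the sender. The function name, logger name, host and port are assumptions.

```python
import logging
import logging.handlers
import socket


def make_udp_logger(host: str, port: int, name: str = "routing-daemon") -> logging.Logger:
    """Return a logger that ships each record as one UDP datagram in syslog format."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from other, possibly blocking, handlers
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        socktype=socket.SOCK_DGRAM,  # UDP; no connection set-up, no acknowledgement
    )
    logger.addHandler(handler)
    return logger
```

The usual UDP caveat applies: messages can be lost silently, which is the price for not blocking.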
.
Next, the Kernel protocol, which exports and imports routes to and from Kernel tables. It has some limitations. You can use only one Kernel protocol per Kernel table or per Bird table. So, if you want to export your routes to several Kernel tables, or vice versa, you will need to create additional Bird tables just to pipe routes to those tables and then to the Kernel tables.
.
And also, if you have software working with your routes, you have to know that in the Kernel you will see that there is a proto attribute, and when you export routes from Bird, you cannot change that; you only have Bird. Sometimes you may want to change it, but you cannot.
.
And then, when you reload the configuration of the Bird daemon, all protocols update their state to the state that is written in your configuration. So, if you have temporarily disabled some protocols and want them to keep that state and not go up or down when you reload your config, you need to update the state in your configuration, and that's not very convenient, for us at least. We also made a patch for that, which keeps the administrative state of protocols when you reload the configuration of the Bird daemon.
.
Also, if you use strict bind, you can face such an issue: if you run your Bird daemon and it tries to set up some BGP session which uses an IP address which is not yet configured on an interface, Bird will just disable that session and it will not try to run it again.
.
You can ‑‑ you do not see that and do not know that your session is down. We also made a patch for that. We added the free bind option for the socket so that Bird can bind to a nonexistent address. Of course it would still get errors, but the session remains up and it tries to reconnect, so when your address appears, the session will be there.
.
This is also a small issue. You cannot have several neighbours with the same remote IP address, even though you have different local IP addresses. Bird will use only one session from those, which will be active. I don't know if you see the example. There are two sessions with the same remote address; one session went down because of a timeout, the other session took the right to be active, and the first session remains in idle state because Bird can't connect it.
.
We were hit by that when exporting routes to our BGP collector; it's not critical for us, so we just did it.
.
Then, you may know that most BGP daemons, by default, verify that the neighbour's AS number is the first number in the path, but Bird does not do it now. You can do it manually in the filter, but there is also an option in the head, the master branch in the repository of Bird, which you can use to make Bird do the check for you. But it is not released yet. It will be in the next release.
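.
The check itself is simple, as this Python sketch shows. It only illustrates the rule "the first AS in the received path must be the neighbour's AS"; it is not Bird filter syntax, and the function name and AS numbers are assumptions.

```python
from typing import Sequence


def first_as_matches(as_path: Sequence[int], neighbor_as: int) -> bool:
    """True if the path is non-empty and starts with the neighbour's AS number."""
    return len(as_path) > 0 and as_path[0] == neighbor_as


print(first_as_matches([64500, 64510, 64520], 64500))  # path starts with the neighbour
print(first_as_matches([64510, 64520], 64500))         # neighbour did not prepend itself
```

An empty path fails the check in this sketch, which is the behaviour one would usually want on an eBGP session.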
.
And if you make some changes to your AS paths and you migrate from Bird 1 to Bird 2, you need to know that there were changes: Bird 1 made the AS prepend before the export filter and Bird 2 does it vice versa, and so you cannot, for example, cut your (something) from the path in the filter; if you need that, you can use, for example, the rs client option in Version 2.
.
Then, next, you can configure your BGP neighbour as direct or multi‑hop; eBGP neighbours are direct by default and iBGP neighbours are multi‑hop by default. With the channels, you can also configure how incoming routes are processed. You can configure gateway resolution as being direct or recursive; for a direct session you have gateway direct by default, and for multi‑hop you have gateway recursive by default. It's also important that you cannot have a multi‑hop neighbour with a direct gateway. And ‑‑ for example, if you have some direct session and you pass some routes with a non‑local BGP next hop in that session, Bird will just drop them.
.
And if you change the session to multihop, then such routes become unreachable. But you know there is a best path selection order, and it commonly compares routes by different attributes; for example, it chooses the best local pref, then it chooses the shortest AS path. An unreachable route will lose no matter how good a local pref it has. To work around that, you can set some meaningful gateway attribute in the import filter, so that the route will be reachable, at least in the Bird table, and you can compare it with others.
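.
The ordering described here, reachability first, then local pref, then AS path length, can be written as a comparison key in Python. This is a deliberately reduced sketch: real best path selection has many more steps, and the `Route` fields, names and values are assumptions for illustration.

```python
from typing import NamedTuple, Sequence, Tuple


class Route(NamedTuple):
    name: str
    reachable: bool           # could the next hop be resolved?
    local_pref: int
    as_path: Tuple[int, ...]


def best_route(routes: Sequence[Route]) -> Route:
    """Pick the best route: reachable beats unreachable, then higher
    local pref wins, then the shorter AS path wins."""
    return max(routes, key=lambda r: (r.reachable, r.local_pref, -len(r.as_path)))


candidates = [
    Route("multihop-peer", reachable=False, local_pref=200, as_path=(64500,)),
    Route("direct-peer", reachable=True, local_pref=100, as_path=(64510, 64500)),
]
print(best_route(candidates).name)
```

The unreachable route loses despite its higher local pref and shorter path, so the example prints `direct-peer`; making the first route reachable (for instance by giving it a usable gateway) would flip the result.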
.
And let's look at the case of recursive routes. This, of course, is a useful feature. They appear in the BGP protocol, and in the static protocol you can also set a recursive gateway for your route, but you can only have one level of recursion, and so routes that try to resolve through another recursive route will be unreachable.
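.
The one‑level limit can be modelled with a small resolver in Python. It is a simplified sketch, not Bird's algorithm: the `Route` type, the `lookup` and `resolve` helpers and all addresses are assumptions, and direct routes are treated as simply carrying a usable gateway.

```python
from ipaddress import ip_address, ip_network
from typing import NamedTuple, Optional, Sequence


class Route(NamedTuple):
    prefix: str
    next_hop: str
    recursive: bool  # True if next_hop must itself be looked up in the table


def lookup(table: Sequence[Route], addr: str) -> Optional[Route]:
    """Longest-prefix match for addr, or None if nothing covers it."""
    matches = [r for r in table if ip_address(addr) in ip_network(r.prefix)]
    return max(matches, key=lambda r: ip_network(r.prefix).prefixlen, default=None)


def resolve(table: Sequence[Route], route: Route) -> Optional[str]:
    """Return the gateway to use, or None if the route is unreachable.

    Only one level of recursion is allowed: a recursive route may resolve
    through a non-recursive route, but not through another recursive one.
    """
    if not route.recursive:
        return route.next_hop
    via = lookup(table, route.next_hop)
    if via is None or via.recursive:
        return None
    return via.next_hop


table = [
    Route("10.0.0.0/24", "192.0.2.1", recursive=False),
    Route("10.1.0.0/24", "10.0.0.5", recursive=True),  # resolves via the first route
    Route("10.2.0.0/24", "10.1.0.5", recursive=True),  # would need a second level
]
print(resolve(table, table[1]))
print(resolve(table, table[2]))
```

The second route resolves to `192.0.2.1`, while the third comes back as `None`, i.e. unreachable, because its next hop is only covered by another recursive route.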
.
So, it can affect routes that you receive with multihop sessions.
.
And also, if you have some multipath route, you cannot make a recursive route through that; you will not receive a multipath route.
.
An example here: we have a static recursive route and, in the section lower, you can see that we tried to resolve the route through another route which we received via a direct session, and you can see it is okay. In the other case, where we received the route via a recursive session, you see that the route is unreachable now because it cannot be resolved.
.
Bird, unfortunately, does not have aggregation, but in some cases we can imitate it. We can make a static recursive route and set as its gateway some address inside the origin prefix, and so we will get the aggregated route we need, or we can also have some other route, a deaggregated or some related route. But if you make that route, you will also need to filter it out in the export filter if it is unreachable: when the origin route is missing, your static route will be unreachable and you need to filter it.
.
And, of course, it will not inherit the attributes from your original route.
.
Next, Bird treats every route as separate, but in some cases you can configure it so that it makes multipath routes. For example, in the Kernel protocol you have the merge paths option. It will take the best selected route and its equivalents and export them to the Kernel as one multipath route.
.
In BGP you also have the add paths option, but it will not take only the equivalents; it exports all routes which pass the filter to the neighbour. The neighbour should also support this extension, and because it receives all the routes, the neighbour should select the best routes and merge them to multipath by itself.
.
As I said, you cannot have a recursive multipath route, because if you set it, it chooses only the best route as its destination. But if you are able to filter your routes into different tables with different destinations, then you can add a static recursive route to each of the tables and then pipe those routes back to the source table, and so you will receive several of those static routes, which will form the multipath route you need.
.
And then the last part: I want to tell you about how we run Bird on network equipment.
.
First is Linux switches. We use them with switchdev and, because of that, Bird works like on a common ‑‑ and there is nothing special here.
.
In switchdev, VRFs in the data plane are mapped to the VRF subsystem in Linux, which is supported by Bird 2, so it works just normally here. There were some problems in older versions, so we had the protocol, but it is fixed now, and so using Bird is straightforward; there are no problems with it.
.
And on Arista, it's a little bit of a complex story. You have Linux there also, but it is proprietary, and they also map the data plane configuration to Kernel configuration. So, you can run Bird there and it will see the interfaces and the IP addresses and interact with other devices. But, at least with older versions, we had some problems with routes: we could not export multipath routes from Bird through the Kernel table. And because of that, we needed to use other ways for that. We took the Arista SDK and made a patch with an additional protocol for Bird 1.6, which used the SDK to export routes there.
.
And it worked well, but when we migrated to Bird Version 2, nobody wanted to port that patch, so we solved it by leaving the old Bird nearby, with the new Bird exporting its routes by BGP to the old Bird, and the old Bird exporting them through the SDK to the hardware. Then we thought that Arista has its own BGP daemon, so maybe we can send the routes directly there, and we tried a configuration like that. It works. It's our base configuration now for Arista.
.
Also, there are VRFs, but on Arista they are mapped to network namespaces in Linux, and Bird does not know about them, so you should run a separate Bird instance in each VRF.
And in some cases, we need to exchange routes between VRFs and there is no direct network connection between them, but there are Unix sockets. Unfortunately, Bird does not support running BGP over Unix sockets, so we made a proof of concept using proxies in between, and here is an example of the configuration. It worked, but we do not use it in production yet. We do not need it still.
.
So, I showed a number of problems here, but, like every software, it has its sharp edges, and you should use it; it's great software. And if you use it, I recommend that you subscribe to the Bird mailing list; there are a lot of interesting discussions. So, if you have any questions, I will gladly answer them now, or you can write me or e‑mail me.
MARTIN WINTER: We're very short on time, so I would appreciate it, Alexander, if you could maybe hang out on SpatialChat afterwards a little bit, so people can contact you there too and ask questions.
.
Is there any urgent question, anyone who wants something before we go on?
ONDREJ FILIP: I would like to say that I really liked the end of the presentation, that it's a great software. Thank you very much for that. I see a question that is not related. Please contact me I am happy to cooperate on that. That's a question for ‑‑ it's not related to Alex. And with that, I don't see anything. So I think we can move on. Martin.
MARTIN WINTER: Okay. I just wanted to bring up a quick discussion I had a few times. It's the whole discussion about the meeting format, as we now have the second virtual‑only meeting, and I'm not that optimistic that the next one will be a physical meeting again.
.
And what mainly I want to bring up: in our charter, we always talk about how we foster discussion among developers and service providers, and the main thing is the discussion part, not just presentations. Sometimes I am a bit curious how well that works in these virtual meetings and, more importantly, whether there is anything we may want to change in the format of our Open Source Working Group, anything we can make better to actually get a better discussion in. I also know there are quite a few people in the open source community who are not that highly interested in presenting, especially because it's mostly just presentation, and many have the view: well, my stuff is already enough, YouTube is out there, why do I need to show another video? Most of them are more interested in discussion, and I am curious if anyone has some good ideas and thoughts there.
.
We are obviously very short on time, so for most of these things I will start a thread on the mailing list. I would appreciate it if people have ideas or thoughts about it, or if you think everything is fine and we go on, but maybe there is something we can make better for the next RIPE meeting.
.
ONDREJ FILIP: I see some suggestions in the chat. It's more specifying a section of the Working Group meeting for bring and share topics that do not require a full presentation. Thank you for that.
.
And I see Dmitry, who would like to speak, so, please, Dmitry, be brief. I grant you audio.
DMITRY SCHERBAKOV: Thank you very much again. I will talk only about the virtual meetings, not about other problems.
.
So what I can say is that it's a very, very, very good thing because we are ‑‑ now really can come and ‑‑ all together, not going to some things, that Corona is giving us a new possibility to making real works.
.
So, what about the problems? I wrote it, I will send, but I can read it now, because ‑‑ one minute, please ‑‑ so, first, we meet it's ‑‑ for bugs and interface issues. Because what I can find from this meeting, it's strange but another address can see my profile in question and answer. So, my question was needed time to understand what's the problem.
.
It will be good to see all results of polling, not only last polling. It needs to having some more chat rooms, because some discussion may be ‑‑
MARTIN WINTER: Dmitry.
ONDREJ FILIP: We need to revoke you, Dmitry. This is definitely not related to Open Source Working Group. I am sorry, I need to cut you off. I am sure that someone from NCC will take your remarks very seriously, but, sorry, we don't have time for this now. I apologise. Somebody else who wants to speak?
MARTIN WINTER: Then we have, like, one last announcement before everyone goes, from Vesna, about an upcoming hackathon.
VESNA MANOJLOVIC: Hi, everyone. I am from RIPE NCC, I am a community builder, and I wanted to announce the deployathon for the RIPE Atlas software probes. So this is also kind of open‑source‑related because it is an open source project, and we would like to have more probes added. Anybody can actually install them in their own time, but we thought, let's do it all together on the 25th November. Throughout the day, we will have people helping each other, the volunteers from the community and the RIPE Atlas team, and maybe that's going to make it more interesting for the participants to actually join.
.
So, this is the announcement. The details are on RIPE Labs and have also been published on a lot of mailing lists. Thank you.
ONDREJ FILIP: Thank you very much. I see in the Q&A one remark from Simon Leinen from SWITCH: For fostering this discussion, couldn't we have a room in SpatialChat where software folk can put up a table for their project and discuss, either during the whole meeting or during specific hours?
.
Thank you very much, Simon. I think we will consider this. Thank you.
MARTIN WINTER: Okay. I think we are already over time. So, thank you everyone for attending. Have a quick break. It continues soon.
ONDREJ FILIP: Thank you very much. And have a great day. Bye‑bye.
.
(Coffee break)