networkd: Alternate address configuration methods for cloud providers #16547

Open
joshtriplett opened this issue Jul 23, 2020 · 11 comments
Labels
network RFE 🎁 Request for Enhancement, i.e. a feature request

Comments

@joshtriplett
Contributor

joshtriplett commented Jul 23, 2020

I'd like to bring up the network as fast as possible; every millisecond counts. On some cloud providers, there are faster ways of obtaining an address than sending out a DHCP request. I understand and agree with networkd's general policy of not supporting hooks, so I'd like to request first-class support for these as address assignment methods within networkd. (I wouldn't expect any of these to be in any default configuration, just available as options to use in a .network file.)

On some cloud providers (such as Google Cloud), an IPv4 address is available by using the low four bytes of the interface's hardware MAC address. I'd like to have an option along the lines of IPv4FromMAC that implements this. This is the simple case, and I'm hoping it'd be trivial to support.

On other cloud providers (such as AWS), it's possible to get the IPv4 and IPv6 addresses for all interfaces from instance metadata, which would involve bringing up link-local addressing (169.254.x.x) and then fetching a specified URL (http://169.254.169.254/latest/meta-data/network/interfaces/macs/$MAC/local-ipv4s and http://169.254.169.254/latest/meta-data/network/interfaces/macs/$MAC/ipv6s). For these, the configuration option would specify the URL (e.g. IPv4FromURL and IPv6FromURL). I recognize that this would be a larger ask; if this doesn't seem reasonable to do within networkd, I'd be happy to hear suggestions for other ways to implement this, other than running an entirely custom network bring-up daemon.

@yuwata yuwata added network RFE 🎁 Request for Enhancement, i.e. a feature request labels Jul 23, 2020
@poettering
Member

The AWS thing involves IPv4LL and HTTP and is supposed to be quicker than DHCP? That would be sad?

So we already have systemd-network-generator.service which generates networkd config files from kernel cmdline args. I figure at least the google cloud thing could be implemented that way too, i.e. have a google-cloud-generator.service that synthesizes a .network file in /run/systemd/network/.

I'd also be fine if we add IPv4FromMAC=, but it needs to be somewhat generic, i.e. maybe take a MAC address bit mask, and a base IP address, so that it is not google cloud specific but can be used somewhat generically:

IPv4FromMAC=00:00:00:00:FF:FF 192.168.0.0

Or so? The first parameter would specify the bitmask to apply to the MAC address, the second parameter the base address to OR it into. And we should refuse operation if any bits are set in the suffix of that IP address. I guess for the MAC address we need to support masking any bits, i.e. also from the middle, though for the IP address we only have to insert the determined bits to the end of the IP address.
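
As a minimal sketch of that derivation, assuming a consecutive mask covering the low bits of the MAC (the option name and semantics are taken from this proposal, not from an existing networkd setting, and the example MAC is a hypothetical GCP-style one):

/* Sketch only: derive an IPv4 address from (MAC & mask) | base,
 * refusing operation if the base has bits set where the mask inserts bits. */
#include <stdint.h>
#include <stdio.h>

static int ipv4_from_mac(const uint8_t mac[6], const uint8_t mask[6], uint32_t base, uint32_t *ret) {
        uint64_t bits = 0, m = 0;

        for (int i = 0; i < 6; i++) {
                /* Collect the masked MAC bits and the mask itself. */
                bits = (bits << 8) | (uint8_t) (mac[i] & mask[i]);
                m = (m << 8) | mask[i];
        }

        /* Refuse operation if the base address already has bits set in the suffix. */
        if (base & (uint32_t) m)
                return -1;

        *ret = base | (uint32_t) bits;
        return 0;
}

int main(void) {
        /* Hypothetical GCP-style MAC encoding the instance IP in its low four bytes. */
        const uint8_t mac[6]  = { 0x42, 0x01, 0x0a, 0x80, 0x00, 0x02 };
        const uint8_t mask[6] = { 0x00, 0x00, 0xff, 0xff, 0xff, 0xff };
        uint32_t addr;

        if (ipv4_from_mac(mac, mask, 0x00000000, &addr) < 0)   /* base 0.0.0.0 */
                return 1;

        printf("%u.%u.%u.%u\n", addr >> 24, (addr >> 16) & 0xff, (addr >> 8) & 0xff, addr & 0xff);
        return 0;  /* prints 10.128.0.2 for this example */
}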

Is the AWS logic reasonably standardized? (i.e. does it have a spec, and is it used beyond AWS?) If the latter then we could just add native support to networkd I guess, similar to the existing IPv4LL/DHCP/IPv6RA support. If it's strictly AWS specific and underdocumented I doubt this would be the right place though

@joshtriplett
Contributor Author

joshtriplett commented Jul 23, 2020

@poettering wrote:

The AWS thing involves IPv4LL and HTTP and is supposed to be quicker than DHCP? That would be sad?

That is indeed sad, and yet I've confirmed it locally. DHCP is more than 100x slower than fetching the IP address from the instance metadata (10ms vs less than 100us). In theory, more aggressive DHCP might be able to close some of that gap, but we're still talking "communicate over virtual network" versus "communicate with locally attached hardware supplying the instance metadata". And it's probably possible to fetch the address from instance metadata faster than I did, too.

I'd also be fine if we add IPv4FromMAC=, but it needs to be somewhat generic, i.e. maybe take a MAC address bit mask, and a base IP address, so that it is not google cloud specific but can be used somewhat generically:

IPv4FromMAC=00:00:00:00:FF:FF 192.168.0.0

Or so? The first parameter would specify the bitmask to apply to the MAC address, the second parameter the base address to OR it into. And we should refuse operation if any bits are set in the suffix of that IP address. I guess for the MAC address we need to support masking any bits, i.e. also from the middle, though for the IP address we only have to insert the determined bits to the end of the IP address.

That seems reasonable to me. (No need to support non-consecutive bitmasks, at least until something actually needs that.) Perhaps the base IP can default to 0.0.0.0 if not specified (and if the mask contains exactly 32 bits)? For GCP, it'd be IPv4FromMAC=00:00:ff:ff:ff:ff 0.0.0.0, or just IPv4FromMAC=00:00:ff:ff:ff:ff with the default.

Is the AWS logic reasonably standardized? (i.e. does it have a spec, and is it used beyond AWS?) If the latter then we could just add native support to networkd I guess, similar to the existing IPv4LL/DHCP/IPv6RA support. If it's strictly AWS specific and underdocumented I doubt this would be the right place though

It's documented, and Azure and GCP both have similar instance metadata services, with Azure also supplying IP addresses in instance metadata. To handle this in the simplest fashion that would work, it would suffice to have IPv4FromURL and IPv6FromURL, http URLs using IP addresses only (no hostnames), with a substitution allowed in the URL for the permanent MAC, and the response must be the IP address in text form. (There are more complex ways to use the instance metadata service, but this would suffice for both AWS and Azure, and IPv4FromMAC would suffice for GCP.)

GCP's metadata service requires an additional HTTP header, but GCP doesn't supply the IP in instance metadata, so that doesn't matter. AWS has an "instance metadata v2" protocol that's more complex, but I think implementing v1 would suffice here.

@poettering
Member

It's documented, and Azure and GCP both have similar instance metadata services, with Azure also supplying IP addresses in instance metadata. To handle this in the simplest fashion that would work, it would suffice to have IPv4FromURL and IPv6FromURL, http URLs using IP addresses only (no hostnames), with a substitution allowed in the URL for the permanent MAC, and the response must be the IP address in text form. (There are more complex ways to use the instance metadata service, but this would suffice for both AWS and Azure, and IPv4FromMAC would suffice for GCP.)

Sounds OK for me to have. We link some stuff to libcurl anyway already (importd and some journal remoting stuff), adding some super-basic http get code based around it should be OK. Should be done out-of-process though, i.e. forked off so that we can set up some sandboxing for it, after all it might be used to parse complex stuff like TLS and certificates... And most likely we'll add DoH support to resolved eventually too, thus the dependency on libcurl isn't terribly new...
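
A rough, illustrative sketch of such a super-basic HTTP GET with libcurl, fetching the address in text form from the AWS IMDSv1 URL mentioned earlier (the MAC in the URL is a placeholder the caller would substitute); this is not existing systemd code, and the real helper would be forked off and sandboxed as described:

#include <curl/curl.h>
#include <stdio.h>
#include <string.h>

static size_t collect(void *data, size_t size, size_t nmemb, void *userdata) {
        char *buf = userdata;                   /* fixed 64-byte buffer from main() */
        size_t n = size * nmemb, used = strlen(buf);

        /* An IPv4 or IPv6 address in text form fits easily in 63 bytes. */
        if (n > 63 - used)
                n = 63 - used;
        memcpy(buf + used, data, n);
        buf[used + n] = '\0';

        return size * nmemb;                    /* claim everything so curl keeps going */
}

int main(void) {
        char address[64] = "";
        CURLcode res;
        CURL *curl;

        curl = curl_easy_init();
        if (!curl)
                return 1;

        curl_easy_setopt(curl, CURLOPT_URL,
                         "http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:00:00:00:00:01/local-ipv4s");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, address);
        curl_easy_setopt(curl, CURLOPT_TIMEOUT_MS, 1000L);

        res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        if (res != CURLE_OK)
                return 1;

        printf("%s\n", address);                /* e.g. "10.0.0.5" */
        return 0;
}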

GCP's metadata service requires an additional HTTP header, but GCP doesn't supply the IP in instance metadata, so that doesn't matter. AWS has an "instance metadata v2" protocol that's more complex, but I think implementing v1 would suffice here.

We have a pretty neat JSON parser in our codebase, so if the new stuff is a bit of JSON that'd be fine too.

@joshtriplett
Contributor Author

joshtriplett commented Jul 28, 2020 via email

@yuwata
Member

yuwata commented Jul 31, 2020

Could you provide any references about that?

@yuwata
Member

yuwata commented Aug 1, 2020

Thanks.

@angdraug
Contributor

angdraug commented Oct 15, 2020

I measured nspawn container startup with various IP configuration options. Test setup: Debian bullseye/sid, systemd 246, /var/lib/machines is a directory, physical network is Ethernet to Google WiFi 1st gen, container image is minbase debootstrap built with packer-builder-nspawn-debootstrap.

Test sequence: start a Wireshark capture on br0, send pings to the container's IP address every millisecond with sudo ping -i0.001, start the container with sudo machinectl start, and match log entries in journalctl -oshort-precise to the timestamp of the first ping reply in the packet capture.

Total time from start to host0 carrier is consistently around 0.5s. This is a lot, and I didn't dig deeper into what systemd is doing with all that time. Mounting /var/lib/machines on tmpfs made no difference. I couldn't get veth to work with global IPv6 addresses, but with IPv4, using veth instead of a bridge also made no difference. Typical time breakdown from one of the runs:

systemd starting container: 18ms
vb-dev link up:             30ms  (+12ms)
systemd started container:  240ms (+210ms)
container systemd started:  244ms (+4ms)
systemd-networkd started:   482ms (+238ms)
host0 link up:              484ms (+2ms)
br0 port forwarding state:  487ms (+3ms)
host0 gained carrier:       507ms (+20ms)

Total time from carrier to first ping reply varied a lot:

carrier to first ping reply:
dhcpv6:         3832ms 3352ms 5309ms
static ipv6:    1557ms 1168ms 1228ms
dhcpv4:           17ms   23ms   28ms
static ipv4:       9ms    3ms    4ms

Even with static IPv6 configuration (Address=, Gateway=, DNS=), it still takes more than 1s before the container begins to respond, bringing the total start-to-reply time to almost 2s. DHCPv6 adds more round-trips with similarly excessive timings and takes another 2-4s.

Enabling the optimistic_dad sysctl (RFC 4429) on the host and in the container made no difference.

ARO (Address Registration Option) from the NDP optimizations in RFC 6775 could in theory speed this up, but it isn't implemented in Linux yet (according to Stefan Schmidt's report at the LPC 2019 IoT Microconference), and even when it is, it might only apply to IPv6 over IEEE 802.15.4.

Compared to that, the extra 10-20ms that DHCPv4 adds to container startup looks quaint. There might be lower-hanging fruit in systemd-nspawn that could reduce that 500ms start-to-carrier time to the point where faster IPv4 configuration would begin to make a difference.

The ridiculously long time it takes the IPv6 stack to initialize makes me sad and leaves me wondering whether there's anything wrong with my setup. As it stands, it's unsuitable for on-demand containers that get started to serve requests from interactive applications, and wasteful for containers that only need to run for a few seconds at a time as part of a low-frequency compute pipeline.

@joshtriplett
Contributor Author

joshtriplett commented Oct 18, 2020 via email

angdraug added a commit to angdraug/barley that referenced this issue Dec 29, 2020
Only containers running edge services (e.g. Envoy or Nginx) should have
global IPv6 addresses.

Seed host has privileged access to all containers running on it. Access
to Seed hosts is a sensitive security surface that should not be
unnecessarily exposed to additional attack vectors. A globally routable
IPv6 address is not necessary when Seeds are managed from local network.

IPv6 also adds up to 5s to network initialization:
systemd/systemd#16547 (comment)
@arianvp
Contributor

arianvp commented Aug 27, 2023

I have another use case for early access to the IMDS: I want to populate the ssh.authorized_keys.root systemd credential from cloud metadata. This is complicated, though, as systemd-tmpfiles runs before systemd-networkd, I think.

It would be neat if we could set up a route to the link-local address of the metadata server in early boot (maybe udev's net_setup_link can do this?).

Then we could have an aws-network-generator.c that generates networkd units based on the metadata, and we could fetch credentials from the metadata server, e.g. for setting up authorized_keys.

@yuwata
Member

yuwata commented Sep 12, 2023

Based on today's discussion and https://gist.github.com/arianvp/22e1c5182eb6c17bbd8c1bbe823b516b, how about the following?

systemd-netns
  SYNTAX:
    systemd-netns [create] --interface=eth0 --virtual-interface=ipvlan99 --namespace-name=netns99 --protocol=ipv4ll
    systemd-netns delete --namespace-name=netns99

systemd-netns create

  1. enumerate interfaces,
  2. wait until the physical interface is initialized:
    2a. monitor RTM_NEWLINK, and wait until the specified interface is detected,
    2b. monitor uevents, and wait until udevd has initialized the physical interface,
  3. bring up the interface if it is not up yet,
  4. create a network namespace (see the sketch after this list):
    4-1. lock /run/systemd/netns/,
    4-2. check that the netns file does not exist in the directory yet,
    4-3. fork the process, and let the parent create the new netns, so that the child can still access the main netns,
    4-4. bind-mount the netns file under /run/systemd/netns/netns99,
  5. the child process creates an ipvlan or something similar on the interface in the namespace, maybe with IFLA_NET_NS_PID,
  6. the parent process waits until the virtual interface is created,
  7. bring up the virtual interface if necessary,
  8. start sd-ipv4ll on the virtual interface; we may be able to skip the probing,
  9. unlock /run/systemd/netns.
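
A rough sketch of the namespace-creation step (4-1 to 4-4), with the locking omitted; the paths and the netns99 name follow this proposal and this is not existing systemd code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
        const char *pin = "/run/systemd/netns/netns99";
        pid_t pid;
        int fd, status;

        (void) mkdir("/run/systemd/netns", 0755);       /* locking omitted in this sketch */

        /* Empty file to bind-mount the namespace reference onto (steps 4-2/4-4). */
        fd = open(pin, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        close(fd);

        pid = fork();                                   /* step 4-3 */
        if (pid < 0) {
                perror("fork");
                return 1;
        }
        if (pid == 0) {
                /* Child: remains in the main namespace, where it would create
                 * the ipvlan on the physical interface (step 5). */
                _exit(0);
        }

        /* Parent: enter a fresh network namespace and pin it (steps 4-3/4-4). */
        if (unshare(CLONE_NEWNET) < 0 ||
            mount("/proc/self/ns/net", pin, NULL, MS_BIND, NULL) < 0) {
                perror("pin netns");
                return 1;
        }

        (void) waitpid(pid, &status, 0);
        return 0;
}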

systemd-netns delete

  1. lock /run/systemd/netns,
  2. umount netns file,
  3. remove netns file,
  4. unlock the directory,

This may be useful to run commands (e.g. curl or wget) with NetworkNamespacePath=/run/systemd/netns/netns99, e.g.
systemd-run -p NetworkNamespacePath=/run/systemd/netns/netns99 curl URL -o /run/credentials/@system/foo

We can share a lot of code with networkd, so I guess it is not hard to implement.
