networkd: Alternate address configuration methods for cloud providers #16547

Open
joshtriplett opened this issue Jul 23, 2020 · 11 comments
Labels
network RFE 🎁 Request for Enhancement, i.e. a feature request

Comments

@joshtriplett
Contributor

joshtriplett commented Jul 23, 2020

I'd like to bring up the network as fast as possible; every millisecond counts. On some cloud providers, there are faster ways of obtaining an address than sending out a DHCP request. I understand and agree with networkd's general policy of not supporting hooks, so I'd like to request first-class support for these as address assignment methods within networkd. (I wouldn't expect any of these to be in any default configuration, just available as options to use in a .network file.)

On some cloud providers (such as Google Cloud), an IPv4 address is available by using the low four bytes of the interface's hardware MAC address. I'd like to have an option along the lines of IPv4FromMAC that implements this. This is the simple case, and I'm hoping it'd be trivial to support.

On other cloud providers (such as AWS), it's possible to get the IPv4 and IPv6 addresses for all interfaces from instance metadata, which would involve bringing up link-local addressing (169.254.x.x) and then fetching a specified URL (http://169.254.169.254/latest/meta-data/network/interfaces/macs/$MAC/local-ipv4s and http://169.254.169.254/latest/meta-data/network/interfaces/macs/$MAC/ipv6s). For these, the configuration option would specify the URL (e.g. IPv4FromURL and IPv6FromURL). I recognize that this would be a larger ask; if this doesn't seem reasonable to do within networkd, I'd be happy to hear suggestions for other ways to implement this, other than running an entirely custom network bring-up daemon.

@yuwata yuwata added network RFE 🎁 Request for Enhancement, i.e. a feature request labels Jul 23, 2020
@poettering
Member

The AWS thing involves IPv4LL and HTTP and is supposed to be quicker than DHCP? That would be sad?

So we already have systemd-network-generator.service which generates networkd config files from kernel cmdline args. I figure at least the google cloud thing could be implemented that way too, i.e. have a google-cloud-generator.service that synthesizes a .network file in /run/systemd/network/.

I'd also be fine if we add IPv4FromMAC=, but it needs to be somewhat generic, i.e. maybe take a MAC address bit mask, and a base IP address, so that it is not google cloud specific but can be used somewhat generically:

IPv4FromMAC=00:00:00:00:FF:FF 192.168.0.0

Or so? The first parameter would specify the bitmask to apply to the MAC address, the second parameter the base address to OR it into. And we should refuse operation if any bits are set in the suffix of that IP address. I guess for the MAC address we need to support masking any bits, i.e. also from the middle, though for the IP address we only have to insert the determined bits to the end of the IP address.
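
As a minimal sketch of that derivation, assuming a consecutive mask covering the low bits of the MAC (the option name and semantics are taken from this proposal, not from an existing networkd setting, and the example MAC is a hypothetical GCP-style one):

/* Sketch only: derive an IPv4 address from (MAC & mask) | base,
 * refusing operation if the base has bits set where the mask inserts bits. */
#include <stdint.h>
#include <stdio.h>

static int ipv4_from_mac(const uint8_t mac[6], const uint8_t mask[6], uint32_t base, uint32_t *ret) {
        uint64_t bits = 0, m = 0;

        for (int i = 0; i < 6; i++) {
                /* Collect the masked MAC bits and the mask itself. */
                bits = (bits << 8) | (uint8_t) (mac[i] & mask[i]);
                m = (m << 8) | mask[i];
        }

        /* Refuse operation if the base address already has bits set in the suffix. */
        if (base & (uint32_t) m)
                return -1;

        *ret = base | (uint32_t) bits;
        return 0;
}

int main(void) {
        /* Hypothetical GCP-style MAC encoding the instance IP in its low four bytes. */
        const uint8_t mac[6]  = { 0x42, 0x01, 0x0a, 0x80, 0x00, 0x02 };
        const uint8_t mask[6] = { 0x00, 0x00, 0xff, 0xff, 0xff, 0xff };
        uint32_t addr;

        if (ipv4_from_mac(mac, mask, 0x00000000, &addr) < 0)   /* base 0.0.0.0 */
                return 1;

        printf("%u.%u.%u.%u\n", addr >> 24, (addr >> 16) & 0xff, (addr >> 8) & 0xff, addr & 0xff);
        return 0;  /* prints 10.128.0.2 for this example */
}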

Is the AWS logic reasonably standardized? (i.e. does it have a spec, and is it used beyond AWS?) If the latter then we could just add native support to networkd I guess, similar to the existing IPv4LL/DHCP/IPv6RA support. If it's strictly AWS specific and underdocumented I doubt this would be the right place though

@joshtriplett
Contributor Author

joshtriplett commented Jul 23, 2020

@poettering wrote:

The AWS thing involves IPv4LL and HTTP and is supposed to be quicker than DHCP? That would be sad?

That is indeed sad, and yet I've confirmed it locally. DHCP is more than 100x slower than fetching the IP address from the instance metadata (10ms vs less than 100us). In theory, more aggressive DHCP might be able to close some of that gap, but we're still talking "communicate over virtual network" versus "communicate with locally attached hardware supplying the instance metadata". And it's probably possible to fetch the address from instance metadata faster than I did, too.

I'd also be fine if we add IPv4FromMAC=, but it needs to be somewhat generic, i.e. maybe take a MAC address bit mask, and a base IP address, so that it is not google cloud specific but can be used somewhat generically:

IPv4FromMAC=00:00:00:00:FF:FF 192.168.0.0

Or so? The first parameter would specify the bitmask to apply to the MAC address, the second parameter the base address to OR it into. And we should refuse operation if any bits are set in the suffix of that IP address. I guess for the MAC address we need to support masking any bits, i.e. also from the middle, though for the IP address we only have to insert the determined bits to the end of the IP address.

That seems reasonable to me. (No need to support non-consecutive bitmasks, at least until something actually needs that.) Perhaps the base IP can default to 0.0.0.0 if not specified (and if the mask contains exactly 32 bits)? For GCP, it'd be IPv4FromMAC=00:00:ff:ff:ff:ff 0.0.0.0, or just IPv4FromMAC=00:00:ff:ff:ff:ff with the default.

Is the AWS logic reasonably standardized? (i.e. does it have a spec, and is it used beyond AWS?) If the latter then we could just add native support to networkd I guess, similar to the existing IPv4LL/DHCP/IPv6RA support. If it's strictly AWS specific and underdocumented I doubt this would be the right place though

It's documented, and Azure and GCP both have similar instance metadata services, with Azure also supplying IP addresses in instance metadata. To handle this in the simplest fashion that would work, it would suffice to have IPv4FromURL and IPv6FromURL, http URLs using IP addresses only (no hostnames), with a substitution allowed in the URL for the permanent MAC, and the response must be the IP address in text form. (There are more complex ways to use the instance metadata service, but this would suffice for both AWS and Azure, and IPv4FromMAC would suffice for GCP.)

GCP's metadata service requires an additional HTTP header, but GCP doesn't supply the IP in instance metadata, so that doesn't matter. AWS has an "instance metadata v2" protocol that's more complex, but I think implementing v1 would suffice here.

@poettering
Member

It's documented, and Azure and GCP both have similar instance metadata services, with Azure also supplying IP addresses in instance metadata. To handle this in the simplest fashion that would work, it would suffice to have IPv4FromURL and IPv6FromURL, http URLs using IP addresses only (no hostnames), with a substitution allowed in the URL for the permanent MAC, and the response must be the IP address in text form. (There are more complex ways to use the instance metadata service, but this would suffice for both AWS and Azure, and IPv4FromMAC would suffice for GCP.)

Sounds OK for me to have. We link some stuff to libcurl anyway already (importd and some journal remoting stuff), adding some super-basic http get code based around it should be OK. Should be done out-of-process though, i.e. forked off so that we can set up some sandboxing for it, after all it might be used to parse complex stuff like TLS and certificates... And most likely we'll add DoH support to resolved eventually too, thus the dependency on libcurl isn't terribly new...
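
A rough, illustrative sketch of such a super-basic HTTP GET with libcurl, fetching the address in text form from the AWS IMDSv1 URL mentioned earlier (the MAC in the URL is a placeholder the caller would substitute); this is not existing systemd code, and the real helper would be forked off and sandboxed as described:

#include <curl/curl.h>
#include <stdio.h>
#include <string.h>

static size_t collect(void *data, size_t size, size_t nmemb, void *userdata) {
        char *buf = userdata;                   /* fixed 64-byte buffer from main() */
        size_t n = size * nmemb, used = strlen(buf);

        /* An IPv4 or IPv6 address in text form fits easily in 63 bytes. */
        if (n > 63 - used)
                n = 63 - used;
        memcpy(buf + used, data, n);
        buf[used + n] = '\0';

        return size * nmemb;                    /* claim everything so curl keeps going */
}

int main(void) {
        char address[64] = "";
        CURLcode res;
        CURL *curl;

        curl = curl_easy_init();
        if (!curl)
                return 1;

        curl_easy_setopt(curl, CURLOPT_URL,
                         "http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:00:00:00:00:01/local-ipv4s");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, address);
        curl_easy_setopt(curl, CURLOPT_TIMEOUT_MS, 1000L);

        res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        if (res != CURLE_OK)
                return 1;

        printf("%s\n", address);                /* e.g. "10.0.0.5" */
        return 0;
}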

GCP's metadata service requires an additional HTTP header, but GCP doesn't supply the IP in instance metadata, so that doesn't matter. AWS has an "instance metadata v2" protocol that's more complex, but I think implementing v1 would suffice here.

We have a pretty neat JSON parser in our codebase, so if the new stuff is a bit of JSON that'd be fine too.

@joshtriplett
Contributor Author

joshtriplett commented Jul 28, 2020 via email

@yuwata
Member

yuwata commented Jul 31, 2020

Could you provide any references about that?

@yuwata
Member

yuwata commented Aug 1, 2020

Thanks.

@angdraug
Contributor

angdraug commented Oct 15, 2020

I measured nspawn container startup with various IP configuration options. Test setup: Debian bullseye/sid, systemd 246, /var/lib/machines is a directory, physical network is Ethernet to Google WiFi 1st gen, container image is minbase debootstrap built with packer-builder-nspawn-debootstrap.

Test sequence: start a Wireshark capture on br0, send pings to the container's IP address every millisecond with sudo ping -i0.001, start the container with sudo machinectl start, and match log entries in journalctl -oshort-precise to the timestamp of the first ping reply in the packet capture.

Total time from start to host0 carrier is consistently around 0.5s. This is a lot, and I didn't dig deeper into what systemd is doing with all that time. Mounting /var/lib/machines on tmpfs made no difference. I couldn't get veth to work with global IPv6 addresses, but with IPv4, using veth instead of a bridge also made no difference. Typical time breakdown from one of the runs:

systemd starting container: 18ms
vb-dev link up:             30ms  (+12ms)
systemd started container:  240ms (+210ms)
container systemd started:  244ms (+4ms)
systemd-networkd started:   482ms (+238ms)
host0 link up:              484ms (+2ms)
br0 port forwarding state:  487ms (+3ms)
host0 gained carrier:       507ms (+20ms)

Total time from carrier to first ping reply varied a lot:

carrier to first ping reply:
dhcpv6:         3832ms 3352ms 5309ms
static ipv6:    1557ms 1168ms 1228ms
dhcpv4:           17ms   23ms   28ms
static ipv4:       9ms    3ms    4ms

Even with static IPv6 configuration (Address=, Gateway=, DNS=), it still takes more than 1s before the container begins to respond, bringing the total start-to-reply time to almost 2s. DHCPv6 adds more round-trips with similarly excessive timings and takes another 2-4s.

Enabling the optimistic_dad sysctl (RFC 4429) on the host and in the container made no difference.

ARO (Address Registration Option) from the NDP optimizations in RFC 6775 could in theory speed this up, but it isn't implemented in Linux yet (according to Stefan Schmidt's report at the LPC 2019 IoT Microconference), and even when it is, it might only apply to IPv6 over IEEE 802.15.4.

Compared to that, the extra 10-20ms that DHCPv4 adds to container startup looks quaint. There might be lower-hanging fruit in systemd-nspawn that could reduce that 500ms start-to-carrier time to the point where faster IPv4 configuration would begin to make a difference.

The ridiculously long time it takes the IPv6 stack to initialize makes me sad and leaves me wondering whether there's anything wrong with my setup. As it stands, it's unsuitable for on-demand containers that get started to serve requests from interactive applications, and wasteful for containers that only need to run for a few seconds at a time as part of a low-frequency compute pipeline.

@joshtriplett
Contributor Author

joshtriplett commented Oct 18, 2020 via email

angdraug added a commit to angdraug/barley that referenced this issue Dec 29, 2020
Only containers running edge services (e.g. Envoy or Nginx) should have
global IPv6 addresses.

Seed host has privileged access to all containers running on it. Access
to Seed hosts is a sensitive security surface that should not be
unnecessarily exposed to additional attack vectors. A globally routable
IPv6 address is not necessary when Seeds are managed from local network.

IPv6 also adds up to 5s to network initialization:
systemd/systemd#16547 (comment)
@arianvp
Contributor

arianvp commented Aug 27, 2023

I have another use case for early access to the IMDS: I want to populate the ssh.authorized_keys.root systemd credential from cloud metadata. This is complicated, though, as systemd-tmpfiles runs before systemd-networkd, I think.

It would be neat if we could set up a route to the link-local address of the metadata server in early boot (maybe udev's net_setup_link can do this?).

Then we could have an aws-network-generator.c that generates networkd units based on the metadata, and we could fetch credentials from the metadata server, e.g. for setting up authorized_keys.

@yuwata
Member

yuwata commented Sep 12, 2023

Based on today's discussion and https://gist.github.com/arianvp/22e1c5182eb6c17bbd8c1bbe823b516b, how about the following?

systemd-netns
  SYNTAX:
    systemd-netns [create] --interface=eth0 --virtual-interface=ipvlan99 --namespace-name=netns99 --protocol=ipv4ll
    systemd-netns delete --namespace-name=netns99

systemd-netns create

  1. enumerate interfaces,
  2. wait until the physical interface is initialized:
    2a. monitor RTM_NEWLINK, and wait until the specified interface is detected,
    2b. monitor uevents, and wait until udevd has initialized the physical interface,
  3. bring up the interface if it is not up yet,
  4. create a network namespace (see the sketch after this list):
    4-1. lock /run/systemd/netns/,
    4-2. check that the netns file does not exist in the directory yet,
    4-3. fork the process, and let the parent create the new netns, so that the child can still access the main netns,
    4-4. bind-mount the netns file under /run/systemd/netns/netns99,
  5. the child process creates an ipvlan or something similar on the interface in the namespace, maybe with IFLA_NET_NS_PID,
  6. the parent process waits until the virtual interface is created,
  7. bring up the virtual interface if necessary,
  8. start sd-ipv4ll on the virtual interface; we may be able to skip the probing,
  9. unlock /run/systemd/netns.
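
A rough sketch of the namespace-creation step (4-1 to 4-4), with the locking omitted; the paths and the netns99 name follow this proposal and this is not existing systemd code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
        const char *pin = "/run/systemd/netns/netns99";
        pid_t pid;
        int fd, status;

        (void) mkdir("/run/systemd/netns", 0755);       /* locking omitted in this sketch */

        /* Empty file to bind-mount the namespace reference onto (steps 4-2/4-4). */
        fd = open(pin, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        close(fd);

        pid = fork();                                   /* step 4-3 */
        if (pid < 0) {
                perror("fork");
                return 1;
        }
        if (pid == 0) {
                /* Child: remains in the main namespace, where it would create
                 * the ipvlan on the physical interface (step 5). */
                _exit(0);
        }

        /* Parent: enter a fresh network namespace and pin it (steps 4-3/4-4). */
        if (unshare(CLONE_NEWNET) < 0 ||
            mount("/proc/self/ns/net", pin, NULL, MS_BIND, NULL) < 0) {
                perror("pin netns");
                return 1;
        }

        (void) waitpid(pid, &status, 0);
        return 0;
}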

systemd-netns delete

  1. lock /run/systemd/netns,
  2. umount netns file,
  3. remove netns file,
  4. unlock the directory,

This may be useful to run commands (e.g. curl or wget) with NetworkNamespacePath=/run/systemd/netns/netns99, e.g.
systemd-run -p NetworkNamespacePath=/run/systemd/netns/netns99 curl URL -o /run/credentials/@system/foo

We can share a lot of code with networkd, so I guess it is not hard to implement.
