Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.14 Backports 2023-09-26 #28282

Merged
merged 40 commits into from
Sep 29, 2023
Merged

v1.14 Backports 2023-09-26 #28282

merged 40 commits into from
Sep 29, 2023

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Sep 26, 2023

Once this PR is merged, you can update the PR labels via:

for pr in 27567 27996 28065 27953 27870 28131 28122 28133 27841 27959 28018 28161 28145 28173 28172 28163 27873 27691 28221 28249 28219 28225; do contrib/backporting/set-labels.py $pr done 1.14; done

or with

make add-labels BRANCH=v1.14 ISSUES=27567,27996,28065,27953,27870,28131,28122,28133,27841,27959,28018,28161,28145,28173,28172,28163,27873,27691,28221,28249,28219,28225

jschwinger233 and others added 30 commits September 26, 2023 10:35
[ upstream commit 48023b9 ]

This commit makes conn-disrupt-test a github action, so upgrade test
and IPsec key rotation test don't have to copy and paste everywhere.

The idea is to allow caller workflow to specify the commands to execute,
then this action will follow the steps:
1. Run "cilium-cli connectivity test --conn-disrupt-test-setup";
2. Run whatever caller workflow passes: could be upgrade operation or
   IPsec key rotation;
3. Run "cilium-cli connectivity test --include-conn-disrupt-test";

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 85481c4 ]

When executing `cilium encrypt status`, cilium-agent lists xfrm states
and counts number of different AEAD keys. However, cilium-agent panics
if there is any xfrm state using non-AEAD algorithm. These unexpected
xfrm states could be installed by other applications.

To reproduce the panic, we can manually install one by running command:

```
ip x s a src 1.1.1.1 dst 1.1.1.2 proto esp spi 0x3 reqid 1 mode tunnel enc aes 0xf0e1d2c3b4a5f60708090a0b0c0d0e0f
```

Then `cilium encrypt status` crashes.

This patch fixes it.

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 415e48f ]

One dump is enough.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 3be5d2c ]

Looks like these maps were removed with
87c8746 ("Remove sockops-enable and friends").

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit b95a256 ]

Missed in c5348d2 ("ipmasq: Implement ip-masq-agent support for IPv6").

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit fb1c074 ]

Collect the tail-call map for bpf_xdp.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 708e9c1 ]

For the sake of ease of development, we test in CI that the Ginkgo suite
found in /test can compile on multiple platforms, including darwin.

This test suite includes a test of the Hive framework, which initializes
an instance of the daemon Cell. This Cell requires the MetricsRegistry,
and so these tests depend on the MetricsRegistry. Some metrics may soon
depend upon Linux feature probes, and so make the Linux probes package
an indirect dependency of the Ginkgo tests.

In order to support this build dependency, we allow the probes package
to compile on darwin by adding stub values for netlink constants used by
the package, and a linux-specific declaration of the constants using the
netlink package.

Signed-off-by: Andrew Sauber <2046750+asauber@users.noreply.github.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 53b9ccf ]

To avoid a circular import, move testutils/ipam.go into its own package.

The circular import was observed when attempting to use datapath kernel
feature probes in the metrics package. It was affecting only the probe
package's tests.

import cycle:
github.com/cilium/cilium/pkg/datapath/linux/probes
    [github.com/cilium/cilium/pkg/datapath/linux/probes.test]
github.com/cilium/cilium/pkg/testutils
    [github.com/cilium/cilium/pkg/datapath/linux/probes.test]
github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2
    [github.com/cilium/cilium/pkg/datapath/linux/probes.test]
github.com/cilium/cilium/pkg/policy/api
    [github.com/cilium/cilium/pkg/datapath/linux/probes.test]
github.com/cilium/cilium/pkg/metrics
    [github.com/cilium/cilium/pkg/datapath/linux/probes.test]

Moving this package avoids the import cycle

Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 3d964de ]

[ backporter's notes: fixed trivial conflict in spelling_wordlist.txt ]

Add a metric for the current maximum observed interface index

Adds a metric `endpoint_max_ifindex`, which is the current maximum
interface index for all endpoints. The metric is updated during the
periodic invocation of `syncHostIPs`.

This metric can be used to determine if the interface index for the next
Pod will exceed Cilium's limit of max(uint16) on older kernels. On
kernels which do not provide ifindex via the FIB, Cilium must store the
ifindex in the CT map, and 16 bits are reserved per entry for this
purpose. This presents a limit of 65535 lifetime interfaces, which can
be approached quickly with sufficient pod churn.

Users who are subject to this limitation (typically a kernel version
less than 5.10), are advised to reboot or roll the host in this case.

Its default-enabled status is dynamic. On kernels which provide ifindex
via the FIB, the metric is disabled (since Cilium's ifindex limits are
not subject to max(uint16) in this case).

Addresses: #17259
Addresses: #16260

Signed-off-by: Andrew Sauber <2046750+asauber@users.noreply.github.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit d9c9337 ]

[ backporter's notes: re-applied hunk to the v1.14 version of
  the upgrade notes. ]

Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 3ea5e8c ]

[ backporter's notes:
  * dropped changes in install/kubernetes/Makefile as the given line
    was not present in v1.14;
  * fixed trivial conflict in spelling_wordlist.txt;
  * re-generated the cilium/values.yaml and cmdref/cilium-agent.md
    files from scratch.
]

When ToFQDN policies are in use (CIDR) identities are created for each IP
in a DNS response that matches a ToFQDN policy. These identities are garbage
collected when

  1) all endpoints with ToFQDN policies are removed
  2) the maximum number of IPs per host is reached (tofqdn-max-ips-per-host)
  3) when the identity has been unused and not refreshed by conntrack map GC

Problems arise when he conntrack GC interval becomes very long. If it's backed
by a LRU BPF map, the maximum is set to 12 hours (defaults.ConntrackGCMaxLRUInterval),
meaning it will take 12 hours before unused identities are marked dead and collected
(unless tofqdn-max-ips-per-host limit is reached for the FQDN entry).

To allow user to have more control over this add the conntrackGCMaxInterval option
that will allow limiting the maximum interval to something less than the 12 hours.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit ce51167 ]

Previously, the CiliumExecContext() function would retry the cilium command up to 5 times (limitTimes) without verifying the success of previous executions. This pull request introduces a logic enhancement that validates the return code of each executed command and exits upon success.

Additionally, this change optimizes test execution time by reducing unnecessary retries. With a 200ms delay between retries, the overall test execution time is expected to improve by at least 200ms multiplied by the number of cilium commands and the number of cilium pods involved.

Local Ginkgo tests, specifically those focused on 'K8sDatapathBGPTests' 'K8sDatapathCustomCalls' and 'K8sDatapathLRPTests' have shown significant improvements:

Before:
`Ran 2 of 106 Specs in 233.348 seconds`

After:
`Ran 2 of 106 Specs in 168.563 seconds`

Signed-off-by: Boris Petrovic <carnerito.b@gmail.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 071c2af ]

In the past, we have merged automated updates to the coccicheck image
version (for example, [0]) without realising they may break the
workflow, because the workflow wouldn't run on the PR, which contained
no BPF-related changes.

Let's make the workflow run when its definition file is updated, to make
sure we can catch similar issues in the future.

[0] #27947

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 02c17d5 ]

CB_SRC_LABEL is cleared just a few lines above. So to get the source's
actual security identity, we need to use our local variable.

Fixes: e2829a0 ("bpf: lxc: support Pod->Service->Pod hairpinning with endpoint routes")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit aff16b2 ]

This commit sets the routing-mode and tunnel-protocol based on
.Values.tunnel or .Values.routingMode and .Values.tunnelProtocol,
defaulting to the former behavior but honoring new values if set.

Signed-off-by: Marco Aurelio Caldas Miranda <17923899+macmiranda@users.noreply.github.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit fdeb6f3 ]

Signed-off-by: Marco Aurelio Caldas Miranda <17923899+macmiranda@users.noreply.github.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 76c86df ]

Must have port for Service reference

Fixes: #27956

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 567bc43 ]

According to Kubernetes guidlines empty prefix should be allowed.

Fixes: #27994

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit d987c37 ]

The Makefile for the documentation makes sure that the docs-builder
image is always present and up-to-date before building the docs. To do
so, it simply rebuilds the image, every time.

This is efficient to make sure the image is up-to-date, but not that
much in terms of build duration. Let's make it possible to skip that
step if the user is confident their image is present and up-to-date.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit e58d673 ]

Allow users to set the INCREMENTAL environment variable to avoid passing
the "-E" option to Sphinx when skipping the linters. This way, we can
have Sphinx do incremental builds instead of re-reading and re-writing
all files. This can be used for quick iterations on a local setup.

There is no need to make "-E" conditional for the spell checker run,
given that the spell checker seems to always re-read all files anyway.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 38d5669 ]

The documentation build targets, including html, have a dependency on
update-helm-values. There is no clear reason for that: this seems to be
for historical reasons.

- Initially, the dependency was on $(HELM_VALUES), because the file with
  Helm values had not been committed to the Git repo at the time, so we
  needed to ensure that it was generated first.

- When reorganising checks for Helm values in 6167f8a ("docs: check
  updates for the Helm reference"), we replaced it with
  "update-helm-values" which looked cleaner; but we don't need to run
  the _check_ at this stage, only to make sure the _file_ itself is
  available.

Now that the file is part of the repository, there's no need to check
for its existence, and we can simply remove the target dependency (to
align it with other files that receive automatic updates, such as
codeowners.rst or crdlist.rst).

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit cf3dc21 ]

Let's make sure all Makefile targets are properly documented, and nicely
displayed when running "make help".

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 4d699d8 ]

Move around targets inside of Documentation/Makefile to have them
displayed in a more sensible order on "make help".

    $ make help

    Usage:
      make <target>

    Targets (default: "html")

    Development Images
      base-image                    Build the docs-base image for updating the requirements.txt file.
      builder-image                 Build the docs-builder image for rendering and checking the documentation.

    Auto-generated Contents Updates and Validation
      api-flaggen                   Update the table of API flags restrictions.
      update-cmdref                 Update the command reference documents (agent, bugtool, operators, etc.).
      update-codeowners             Update the description of the code owner teams.
      update-crdlist                Update the list of CRDs.
      update-helm-values            Update the Helm reference documentation.
      check                         Validate command and Helm references, policy examples, and others.

    Build
      html                          Check documentation and render it under the specified format.
      epub                          Check documentation and render it under the specified format.
      latex                         Check documentation and render it under the specified format.
      live-preview                  Build and serve the documentation locally.

    Development
      update-requirements           Regenerate the requirements.txt file from requirements-min/requirements.txt.
      clean                         Clean up all artefacts from documentation.
      help                          Display help for the Makefile.

Also add target update-cmdref to .PHONY.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 5b77a54 ]

Since commit 944dddf ("docs: Symlink (instead of copying) the API
reference"), we symlink the _api/ directory (from ../api) instead of
copying it. So there's nothing to clean on that side on "make clean",
otherwise we lose the link.

Fixes: 944dddf ("docs: Symlink (instead of copying) the API reference")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 2b4f6df ]

Rather than having this information hidden in a parenthesis in the
troubleshooting section, let's have a full paragraph in the key rotation
section that explains what needs to be done when using Cluster Mesh.

Signed-off-by: Marga Manterola <marga@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 4cb0940 ]

As k8s UID are unique, we can index the policies by their UId. This will
decrease Cilium's boot-up time for clusters with a large number of
network policies.

For a cluster with a lot of network policies that was taking ~4m20s to
boot Cilium, with this change it will take ~20 seconds.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit c36548c ]

Function is_html5_writer_available() is deprecated, its invocation is
unnecessary (the writer _is_ available in the versions of Sphinx we're
using), and it has been removed in more recent versions of Sphinx. Let's
get rid of it.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 94da061 ]

[ backporter's notes: picked Documentation/requirements.txt verbatim
  from the main branch to keep it consistent across versions, as we
  nonetheless use the same image in CI. ]

Now that we have rebased our theme onto the upstream version, we can use
newer versions of docutils, which allows us to update Sphinx and a bunch
of dependencies. We're still limited to docutils 0.18 because of
sphinx-tabs, though.

One immediate improvement is that we get working links for the value
types in the gRPC reference.

There are newer versions of Sphinx, websupport, or some second-level
dependencies available (currently 7.2.6 and 1.2.6, respectively), but
Netlify doesn't know about them yet and fails to build the previews.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 30c4df9 ]

In the past, we disabled Sphinx's output in check-build.sh. This was
because of the way we handle warnings: in order to have them printed at
the end of the logs, we print them to a file, we filter them, and print
the outstanding ones at the end of the output. Keeping the standard
output meant that warnings or errors wouldn't appear in the middle of
this output, and warnings that we filtered out wouldn't appear at all,
which could be confusing.

With the recent Sphinx update, we are able to filter the warnings from
MyST more efficiently, directly from conf.py, and to get rid of the
filters.

Let's take advantage of this to re-enable the output from Sphinx. While
it's not necessary to build the docs, it's occasionally helpful to debug
the setup when things don't go as expected.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 2f7fbfe ]

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
mhofstetter and others added 2 commits September 26, 2023 15:00
[ upstream commit f73e1c5 ]

[ backporter's notes: switched cilium/dns import to miekg/dns, as v1.14
  was relying on a replace directive instead of pointing to the fork. ]

PR #25309 introduced a data race by sharing the sessionUDPFactory between
the DNS server instances for the different IP families (IPv4 & IPv6).
This has been detected by #27979.

This commit fixes the issue for the TCP servers too. It not set explicitly,
these are initialized with the default udp session factory to prevent nil
pointer exceptions. Therefore, an explicit noop udp session factory is set.

Fixes: #28156

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit c6fbeb1 ]

This target should completely ignore the existing contents of
`cilium/values.yaml` since we always generate from the template files.

Co-authored-by: Donia Chaiehloudj <donia.cld@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 marked this pull request as ready for review September 26, 2023 13:02
@giorio94 giorio94 requested review from a team as code owners September 26, 2023 13:02
Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes look good. Thank you!

@giorio94
Copy link
Member Author

/test-backport-1.14

Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backports for my doc PRs look good, thank you!

Copy link
Member

@nbusseneau nbusseneau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes LGTM.

Copy link
Member

@margamanterola margamanterola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes look good

@giorio94
Copy link
Member Author

The build commits check failed due to timeout. Manually run git rebase --exec "make -C test build -j $(nproc) && make -C test build-darwin" upstream/v1.14 locally, which correctly succeeded (this was the only step not run by the CI).

Copy link
Member

@asauber asauber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config changes LGTM.

Thanks for the backports

Copy link
Member

@jschwinger233 jschwinger233 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mine are good, thanks.

@pippolo84 pippolo84 self-requested a review September 28, 2023 14:10
Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR 28163 and 28219 LGTM 👍

@pippolo84 pippolo84 removed the request for review from mhofstetter September 28, 2023 14:11
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for HTTPRoute commit ✅

@aanm aanm merged commit 8c56587 into v1.14 Sep 29, 2023
197 of 198 checks passed
@aanm aanm deleted the pr/v1.14-backport-2023-09-26 branch September 29, 2023 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet