Skip to content

Commit

Permalink
fix: Major refactoring of Polling, Retry and Timeout logic (#462)
Browse files Browse the repository at this point in the history
* fix: Major refactoring and fix for Polling, Retry and Timeout logic

This is in response to https://freeman.vc/notes/aws-vs-gcp-reliability-is-wildly-different, which triggered an investigation of the whole Polling/Retry/Timeout behavior in Python GAPIC clients and revealed many fundamental flaws in its implementaiton.

To properly describe the refactoring this PR does we need to stick to a rigorous terminology, as vague definitions of retries, timeouts, polling and related concepts seems to be the main source of the present bugs and overal confusion among both groups: users of the library and creators of the library. Please check the documentation of the `google.api_core.retry.Retry` class the `google.api_core.future.polling.Polling.result()` method for the proper definitions and context.

Note, the overall semantics around Polling, Retry and Timeout remains quite confusing even after refactoring (although it is now more or less rigorously defined), but it was clean as I could make it while still maintaining backward compatibility of the whole library.

The quick summary of the changes in this PR:

1) Properly define and fix the application of Deadline and Timeout concepts. Please check the updated documentation for the `google.api_core.retry.Retry` class for the actual definitions. Originally the `deadline` has been used to represent timeouts conflating the two concepts. As result this PR replaces `deadline` arguments with `timeout` ones in as backward-compatible manner as possible (i.e. backward compatible in all practical applications).

2) Properly define RPC Timeout, Retry Timeout and Pollint Timeout and how a generic Timeout concept (aka Logical Timeout) is mapped to one of those depending on the context. Please check `google.api_core.retry.Retry` class documentation for details.

3) Properly define and fix the application of Retry and Polling concepts. Please check the updated documentation for `google.api_core.future.polling.PollingFuture.result()` for details.

4) Separate `retry` and `polling` configurations for Polling future, as these are two different concepts (although both operating on `Retry` class). Originally both retry and polling configurations were controlled by a single `retry` parameter, merging configuration regarding how "rpc error responses" and how "operation not completed" responses are supposed to be handled.

5) For the following config properties - `Retry (including `Retry Timeout`), `Polling` (including `Polling Timeout`) and `RPC Timeout` - fix and properly define how each of the above properties gets configured and which config gets precedence in case of a conflict (check `PollingFuture.result()` method documentation for details). Each of those properties can be specified as follows: directly provided by the user for each call, specified during gapic generation time from config values in `grpc_service_config.json` file (for Retry and RPC Timeout) and `gapic.yaml` file (for Polling), or be provided as a hard-coded basic default values in python-api-core library itself.

6) Fix the per-call polling config propagation logic (the polling/retry configs supplied to `PollingFuture.result()` used to be ignored for actual call).

7) Deprecate the usage of `deadline` terminology in the whole library and backward-compatibly replace it with timeout. This is essential as what has been called "deadline" in this library was actually "timeout" as it is defined in `google.api_core.retry.Retry` class documentation.

8) Deprecate `ExponentialTimeout`, `ConstantTimeout` and related logic as those are outdated concepts and are not consistent with the other GAPIC Languages. Replace it with `TimeToDeadlineTimeout` to be consistent with how the rest of the languages do it.

9) Deprecate `google.api_core.operations_v1.config` as it is an outdated concept and self-inconsistent (as all gapic clients provide configuraiton in code). The configs are directly provided in code instead.

10) Switch randomized delay calculation from `delay` being treated as expected value for randomized_delay to `delay` being treated as maximum value for `randomized_delay` (i.e. the new expected valud for `randomized_delay` is `delay / 2`). See the `exponential_sleep_generator()` method implementation for details. This is needed to make Python implementation of retries and polling exponential backoff consistent with the rest of GAPIC languages. Also fix the uncontrollable growth of `delay` value (since it is a subject of exponential growth, the `delay` value was quickly reaching "infinity" value, and the whole thing was not failing simply due to python being a very forgiving language which forgives multiplying "infinity" by a number (`inf * number = inf`) binstead of simply overflowing to a (most likely) negative number).

11) Fix url construction in `OperationsRestTransport`. Without this fix the polling logic for REST transport was completely broken (is not affecting Compute client, as that one has custom LRO).

12) Las but not least: change the default values for Polling logic to be the following: `initial=1.0` (same as before), `maximum=20.0` (was `60`), `multiplier=1.5` (was `2.0`), `timeout=900` (was `120`, but due to timeout resolution logic was actually None (i.e. infinity)). This, in conjunction with changed calculation of randomized delay (i.e. its expected value now being  `delay / 2`) overall makes polling logic much less aggressive in terms of increasing delays between each polling iteration, making LRO return much earlier for users on average, but still keeping a healthy balance between strain put on both client and server by polling and responsiveness of LROs for user.

*The design doc summarising all the changes and reasons for them is in progress.

* fix ci failures (mainly sphinx errors)

* remove unused code

* fix typo

* Pin pytest version to <7.2.0

* reformat code

* address pr feedback

* address PR feedback

* address pr feedback

* Update google/api_core/future/polling.py

Co-authored-by: Victor Chudnovsky <vchudnov@google.com>

* Apply documentation suggestions from code review

Co-authored-by: Victor Chudnovsky <vchudnov@google.com>

* Address PR feedback

Co-authored-by: Victor Chudnovsky <vchudnov@google.com>
  • Loading branch information
vam-google and vchudnov-g committed Nov 10, 2022
1 parent ddcc249 commit 434253d
Show file tree
Hide file tree
Showing 34 changed files with 789 additions and 416 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
Expand Up @@ -12,7 +12,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: "3.7"
python-version: "3.10"
- name: Install nox
run: |
python -m pip install --upgrade setuptools pip wheel
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/mypy.yml
Expand Up @@ -12,7 +12,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: "3.7"
python-version: "3.10"
- name: Install nox
run: |
python -m pip install --upgrade setuptools pip wheel
Expand Down
28 changes: 19 additions & 9 deletions google/api_core/extended_operation.py
Expand Up @@ -50,10 +50,13 @@ class ExtendedOperation(polling.PollingFuture):
refresh (Callable[[], type(extended_operation)]): A callable that returns
the latest state of the operation.
cancel (Callable[[], None]): A callable that tries to cancel the operation.
retry: Optional(google.api_core.retry.Retry): The retry configuration used
when polling. This can be used to control how often :meth:`done`
is polled. Regardless of the retry's ``deadline``, it will be
overridden by the ``timeout`` argument to :meth:`result`.
polling Optional(google.api_core.retry.Retry): The configuration used
for polling. This can be used to control how often :meth:`done`
is polled. If the ``timeout`` argument to :meth:`result` is
specified it will override the ``polling.timeout`` property.
retry Optional(google.api_core.retry.Retry): DEPRECATED use ``polling``
instead. If specified it will override ``polling`` parameter to
maintain backward compatibility.
Note: Most long-running API methods use google.api_core.operation.Operation
This class is a wrapper for a subset of methods that use alternative
Expand All @@ -68,9 +71,14 @@ class ExtendedOperation(polling.PollingFuture):
"""

def __init__(
self, extended_operation, refresh, cancel, retry=polling.DEFAULT_RETRY
self,
extended_operation,
refresh,
cancel,
polling=polling.DEFAULT_POLLING,
**kwargs,
):
super().__init__(retry=retry)
super().__init__(polling=polling, **kwargs)
self._extended_operation = extended_operation
self._refresh = refresh
self._cancel = cancel
Expand Down Expand Up @@ -114,7 +122,7 @@ def error_message(self):
def __getattr__(self, name):
return getattr(self._extended_operation, name)

def done(self, retry=polling.DEFAULT_RETRY):
def done(self, retry=None):
self._refresh_and_update(retry)
return self._extended_operation.done

Expand All @@ -137,9 +145,11 @@ def cancelled(self):
self._refresh_and_update()
return self._extended_operation.done

def _refresh_and_update(self, retry=polling.DEFAULT_RETRY):
def _refresh_and_update(self, retry=None):
if not self._extended_operation.done:
self._extended_operation = self._refresh(retry=retry)
self._extended_operation = (
self._refresh(retry=retry) if retry else self._refresh()
)
self._handle_refreshed_operation()

def _handle_refreshed_operation(self):
Expand Down
2 changes: 1 addition & 1 deletion google/api_core/future/async_future.py
Expand Up @@ -95,7 +95,7 @@ async def _blocking_poll(self, timeout=None):
if self._future.done():
return

retry_ = self._retry.with_deadline(timeout)
retry_ = self._retry.with_timeout(timeout)

try:
await retry_(self._done_or_raise)()
Expand Down
200 changes: 165 additions & 35 deletions google/api_core/future/polling.py
Expand Up @@ -18,7 +18,7 @@
import concurrent.futures

from google.api_core import exceptions
from google.api_core import retry
from google.api_core import retry as retries
from google.api_core.future import _helpers
from google.api_core.future import base

Expand All @@ -29,14 +29,37 @@ class _OperationNotComplete(Exception):
pass


RETRY_PREDICATE = retry.if_exception_type(
# DEPRECATED as it conflates RPC retry and polling concepts into one.
# Use POLLING_PREDICATE instead to configure polling.
RETRY_PREDICATE = retries.if_exception_type(
_OperationNotComplete,
exceptions.TooManyRequests,
exceptions.InternalServerError,
exceptions.BadGateway,
exceptions.ServiceUnavailable,
)
DEFAULT_RETRY = retry.Retry(predicate=RETRY_PREDICATE)

# DEPRECATED: use DEFAULT_POLLING to configure LRO polling logic. Construct
# Retry object using its default values as a baseline for any custom retry logic
# (not to be confused with polling logic).
DEFAULT_RETRY = retries.Retry(predicate=RETRY_PREDICATE)

# POLLING_PREDICATE is supposed to poll only on _OperationNotComplete.
# Any RPC-specific errors (like ServiceUnavailable) will be handled
# by retry logic (not to be confused with polling logic) which is triggered for
# every polling RPC independently of polling logic but within its context.
POLLING_PREDICATE = retries.if_exception_type(
_OperationNotComplete,
)

# Default polling configuration
DEFAULT_POLLING = retries.Retry(
predicate=POLLING_PREDICATE,
initial=1.0, # seconds
maximum=20.0, # seconds
multiplier=1.5,
timeout=900, # seconds
)


class PollingFuture(base.Future):
Expand All @@ -45,21 +68,29 @@ class PollingFuture(base.Future):
The :meth:`done` method should be implemented by subclasses. The polling
behavior will repeatedly call ``done`` until it returns True.
The actuall polling logic is encapsulated in :meth:`result` method. See
documentation for that method for details on how polling works.
.. note::
Privacy here is intended to prevent the final class from
overexposing, not to prevent subclasses from accessing methods.
Args:
retry (google.api_core.retry.Retry): The retry configuration used
when polling. This can be used to control how often :meth:`done`
is polled. Regardless of the retry's ``deadline``, it will be
overridden by the ``timeout`` argument to :meth:`result`.
polling (google.api_core.retry.Retry): The configuration used for polling.
This parameter controls how often :meth:`done` is polled. If the
``timeout`` argument is specified in :meth:`result` method it will
override the ``polling.timeout`` property.
retry (google.api_core.retry.Retry): DEPRECATED use ``polling`` instead.
If set, it will override ``polling`` paremeter for backward
compatibility.
"""

def __init__(self, retry=DEFAULT_RETRY):
_DEFAULT_VALUE = object()

def __init__(self, polling=DEFAULT_POLLING, **kwargs):
super(PollingFuture, self).__init__()
self._retry = retry
self._polling = kwargs.get("retry", polling)
self._result = None
self._exception = None
self._result_set = False
Expand All @@ -69,57 +100,150 @@ def __init__(self, retry=DEFAULT_RETRY):
self._done_callbacks = []

@abc.abstractmethod
def done(self, retry=DEFAULT_RETRY):
def done(self, retry=None):
"""Checks to see if the operation is complete.
Args:
retry (google.api_core.retry.Retry): (Optional) How to retry the RPC.
retry (google.api_core.retry.Retry): (Optional) How to retry the
polling RPC (to not be confused with polling configuration. See
the documentation for :meth:`result` for details).
Returns:
bool: True if the operation is complete, False otherwise.
"""
# pylint: disable=redundant-returns-doc, missing-raises-doc
raise NotImplementedError()

def _done_or_raise(self, retry=DEFAULT_RETRY):
def _done_or_raise(self, retry=None):
"""Check if the future is done and raise if it's not."""
kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}

if not self.done(**kwargs):
if not self.done(retry=retry):
raise _OperationNotComplete()

def running(self):
"""True if the operation is currently running."""
return not self.done()

def _blocking_poll(self, timeout=None, retry=DEFAULT_RETRY):
"""Poll and wait for the Future to be resolved.
def _blocking_poll(self, timeout=_DEFAULT_VALUE, retry=None, polling=None):
"""Poll and wait for the Future to be resolved."""

Args:
timeout (int):
How long (in seconds) to wait for the operation to complete.
If None, wait indefinitely.
"""
if self._result_set:
return

retry_ = self._retry.with_deadline(timeout)
polling = polling or self._polling
if timeout is not PollingFuture._DEFAULT_VALUE:
polling = polling.with_timeout(timeout)

try:
kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
retry_(self._done_or_raise)(**kwargs)
polling(self._done_or_raise)(retry=retry)
except exceptions.RetryError:
raise concurrent.futures.TimeoutError(
"Operation did not complete within the designated " "timeout."
f"Operation did not complete within the designated timeout of "
f"{polling.timeout} seconds."
)

def result(self, timeout=None, retry=DEFAULT_RETRY):
"""Get the result of the operation, blocking if necessary.
def result(self, timeout=_DEFAULT_VALUE, retry=None, polling=None):
"""Get the result of the operation.
This method will poll for operation status periodically, blocking if
necessary. If you just want to make sure that this method does not block
for more than X seconds and you do not care about the nitty-gritty of
how this method operates, just call it with ``result(timeout=X)``. The
other parameters are for advanced use only.
Every call to this method is controlled by the following three
parameters, each of which has a specific, distinct role, even though all three
may look very similar: ``timeout``, ``retry`` and ``polling``. In most
cases users do not need to specify any custom values for any of these
parameters and may simply rely on default ones instead.
If you choose to specify custom parameters, please make sure you've
read the documentation below carefully.
First, please check :class:`google.api_core.retry.Retry`
class documentation for the proper definition of timeout and deadline
terms and for the definition the three different types of timeouts.
This class operates in terms of Retry Timeout and Polling Timeout. It
does not let customizing RPC timeout and the user is expected to rely on
default behavior for it.
The roles of each argument of this method are as follows:
``timeout`` (int): (Optional) The Polling Timeout as defined in
:class:`google.api_core.retry.Retry`. If the operation does not complete
within this timeout an exception will be thrown. This parameter affects
neither Retry Timeout nor RPC Timeout.
``retry`` (google.api_core.retry.Retry): (Optional) How to retry the
polling RPC. The ``retry.timeout`` property of this parameter is the
Retry Timeout as defined in :class:`google.api_core.retry.Retry`.
This parameter defines ONLY how the polling RPC call is retried
(i.e. what to do if the RPC we used for polling returned an error). It
does NOT define how the polling is done (i.e. how frequently and for
how long to call the polling RPC); use the ``polling`` parameter for that.
If a polling RPC throws and error and retrying it fails, the whole
future fails with the corresponding exception. If you want to tune which
server response error codes are not fatal for operation polling, use this
parameter to control that (``retry.predicate`` in particular).
``polling`` (google.api_core.retry.Retry): (Optional) How often and
for how long to call the polling RPC periodically (i.e. what to do if
a polling rpc returned successfully but its returned result indicates
that the long running operation is not completed yet, so we need to
check it again at some point in future). This parameter does NOT define
how to retry each individual polling RPC in case of an error; use the
``retry`` parameter for that. The ``polling.timeout`` of this parameter
is Polling Timeout as defined in as defined in
:class:`google.api_core.retry.Retry`.
For each of the arguments, there are also default values in place, which
will be used if a user does not specify their own. The default values
for the three parameters are not to be confused with the default values
for the corresponding arguments in this method (those serve as "not set"
markers for the resolution logic).
If ``timeout`` is provided (i.e.``timeout is not _DEFAULT VALUE``; note
the ``None`` value means "infinite timeout"), it will be used to control
the actual Polling Timeout. Otherwise, the ``polling.timeout`` value
will be used instead (see below for how the ``polling`` config itself
gets resolved). In other words, this parameter effectively overrides
the ``polling.timeout`` value if specified. This is so to preserve
backward compatibility.
If ``retry`` is provided (i.e. ``retry is not None``) it will be used to
control retry behavior for the polling RPC and the ``retry.timeout``
will determine the Retry Timeout. If not provided, the
polling RPC will be called with whichever default retry config was
specified for the polling RPC at the moment of the construction of the
polling RPC's client. For example, if the polling RPC is
``operations_client.get_operation()``, the ``retry`` parameter will be
controlling its retry behavior (not polling behavior) and, if not
specified, that specific method (``operations_client.get_operation()``)
will be retried according to the default retry config provided during
creation of ``operations_client`` client instead. This argument exists
mainly for backward compatibility; users are very unlikely to ever need
to set this parameter explicitly.
If ``polling`` is provided (i.e. ``polling is not None``), it will be used
to controll the overall polling behavior and ``polling.timeout`` will
controll Polling Timeout unless it is overridden by ``timeout`` parameter
as described above. If not provided, the``polling`` parameter specified
during construction of this future (the ``polling`` argument in the
constructor) will be used instead. Note: since the ``timeout`` argument may
override ``polling.timeout`` value, this parameter should be viewed as
coupled with the ``timeout`` parameter as described above.
Args:
timeout (int):
How long (in seconds) to wait for the operation to complete.
If None, wait indefinitely.
timeout (int): (Optional) How long (in seconds) to wait for the
operation to complete. If None, wait indefinitely.
retry (google.api_core.retry.Retry): (Optional) How to retry the
polling RPC. This defines ONLY how the polling RPC call is
retried (i.e. what to do if the RPC we used for polling returned
an error). It does NOT define how the polling is done (i.e. how
frequently and for how long to call the polling RPC).
polling (google.api_core.retry.Retry): (Optional) How often and
for how long to call polling RPC periodically. This parameter
does NOT define how to retry each individual polling RPC call
(use the ``retry`` parameter for that).
Returns:
google.protobuf.Message: The Operation's result.
Expand All @@ -128,8 +252,8 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
google.api_core.GoogleAPICallError: If the operation errors or if
the timeout is reached before the operation completes.
"""
kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
self._blocking_poll(timeout=timeout, **kwargs)

self._blocking_poll(timeout=timeout, retry=retry, polling=polling)

if self._exception is not None:
# pylint: disable=raising-bad-type
Expand All @@ -138,12 +262,18 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):

return self._result

def exception(self, timeout=None):
def exception(self, timeout=_DEFAULT_VALUE):
"""Get the exception from the operation, blocking if necessary.
See the documentation for the :meth:`result` method for details on how
this method operates, as both ``result`` and this method rely on the
exact same polling logic. The only difference is that this method does
not accept ``retry`` and ``polling`` arguments but relies on the default ones
instead.
Args:
timeout (int): How long to wait for the operation to complete.
If None, wait indefinitely.
If None, wait indefinitely.
Returns:
Optional[google.api_core.GoogleAPICallError]: The operation's
Expand Down

0 comments on commit 434253d

Please sign in to comment.