
Releases: uber/cadence

v1.2.9

01 May 17:46
ba39678

What's Changed


v1.2.8

26 Mar 18:46
3f64176

What's Changed

Added

Changed

Fixed

  • Set proper max reset points by @neil-xie in #5623
  • Put a timeout for timer task deletion loop during shutdown by @taylanisikdemir in #5626
  • Catch unit test failures in make test by @Groxx in #5635
  • fix: typo in 'get messages between' query over message_id by @zedongh in #5607
  • Fix context leak in tests by @munahaf in #5377
  • Make sure the task processing rate limiter is only applied on the active side by @sankari165 in #5654
  • Fix Pinot query validator bug when a user passes in a not-equal query with a missing value by @neil-xie in #5662
  • Update Pinot query validator failure log, minor refactor of Pinot visibility store to remove panics by @neil-xie in #5664
  • Fix context leak in pinot integration test by @neil-xie in #5682
  • Fix SignalWithStartWorkflow API by @Shaddoll in #5671
  • Fix wrong migration paths in example by @kotcrab in #5668
  • Fix comment in workflow id cache config by @sankari165 in #5661
  • Fix the local integration test docker-compose file by @jakobht in #5695
  • Do not get workflow execution from database when shard is closed by @Shaddoll in #5697

Removed

  • Removed useless metrics tag from the workflowIDcache by @jakobht in #5651
  • Removed the shadower service for cadence-server by @agautam478 in #5660

New Contributors

Full Changelog: v1.2.7...v1.2.8

v1.2.7

09 Feb 19:00
08d5994

What's Changed

Added

Fixed

Changed


v1.2.6

14 Dec 22:11
558780b

What's Changed

Added

Fixed

Changed

  • Cassandra version is changed from 3.11 to 4.1.3 by @taylanisikdemir (#5461)
    • If your machine already has the ubercadence/server:master-auto-setup image, you need to re-pull it so it works with the latest docker-compose*.yml files (see the command after this list)
  • Move dynamic ratelimiter to its own file by @jakobht (#5451)
  • Create and use a limiter struct instead of just passing a function by @jakobht (#5454)
  • Dynamic ratelimiter factories by @jakobht (#5455)
  • Update github action for image publishing to released by @3vilhamster (#5460)
  • Update matching to emit metric for tasklist backlog size by @Shaddoll (#5448)
  • Change variable name from SecondsSinceEpoch into EventTimeMs by @bowenxia (#5463)
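For convenience, the re-pull mentioned above is a standard Docker command (image name as referenced in that item):

docker pull ubercadence/server:master-auto-setup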

Removed

New Contributors

Full Changelog: v1.2.5...v1.2.6

v1.2.5

02 Nov 19:07
eb8eea9

What's Changed

Added

  • Scanner / Fixer changes by @Groxx in #5361
    • Stale-workflow detection and cleanup added to shardscanner, disabled by default.
    • New dynamic config to better control scanner and fixer, particularly for concrete executions.
    • Documentation about how scanner/fixer work and how to control them; see the scanner readme.md
    • This also includes example config to enable the new fixer.
  • MigrationChecker interface to expose migration CLI by @abhishekj720 in #5424
  • Added Pinot as new visibility store option by @neil-xie in #5201
    • Added pinot visibility triple manager to provide options to write to both ES and Pinot.
    • Added pinotVisibilityStore and pinotClient to support CRUD operations for Pinot.
    • Added pinot integration test to set up Pinot test cluster and test Pinot functionality.

Fixed

Full Changelog: v1.2.4...v1.2.5-prerelease3

v1.2.4

27 Sep 19:03
c93d6af

What's Changed

Full Changelog: v1.2.3...v1.2.4

v1.2.3 (Retracted, please use v1.2.4)

15 Sep 22:10
4a16136
Pre-release

Added

  • Expose workflow history size and count to client by @timl3136 (#5392)

Fixed

  • [cadence-cli] fix typo in input flag for parallelism by @sankari165 (#5397)

Changed

  • Update config store client to support SQL database by @Shaddoll (#5395)
  • Scaffold config store for sql plugins by @Shaddoll (#5396)
  • Improve poller detection for isolation by @Shaddoll (#5399)

v1.2.2

19 Sep 16:34
e5f605c

What's Changed

Full Changelog: v1.2.1...v1.2.2

v1.2.1

19 Sep 03:56
0e17485

Project release: Zonal isolation

This version introduces a few resiliency concepts into customers' worker task processing such that they can detect deployment or configuration failures earlier. These features are opt-in.

The high-level concept is to provide a means to subdivide work (called 'isolation-groups') for workers along whatever partitioning mechanism is required for your service.

By default, the provided partitioning mechanism will attempt to keep workflows running in the location where they were started, so that customers can identify broken changes earlier rather than waiting for the deployment of an entire region. However, if there are no pollers available in that subdivision, it'll route the work elsewhere.

Nomenclature

Partitioning: A means to subdivide the tasks given to workflows; many schemes are possible, and one default is provided. When a workflow is started, a set of partition keys is provided via request headers. The partition keys are used to determine which isolation-group of workers should process these workflows.
Workflow pinning: A partitioning scheme which emphasizes keeping workflows running in the location where they were started.
Isolation-groups: A division of work within a customer region, into which workers can be subdivided and workflows pinned. This was originally intended as a synonym for 'zone' in the site-reliability sense, i.e. a subdivision of a region. The important point, however, is that this is a failure domain for customer workflows, so it may be an arbitrary subdivision of your cluster's traffic.
Isolation-group drain: A means of excluding work from an isolation-group. If an isolation-group is drained, workers in that group won't receive any tasks, and customers cannot start workflows from it.

Default concepts and approaches

The partitioning and isolation concepts are intended to be general-purpose, flexible orchestration concepts, with some basic defaults provided. By default, the behaviour is as follows:

  • Partition data is persisted with workflow execution records by the provided middleware when the corresponding header is passed at workflow creation.
  • The Cadence client and worker Go libraries will pass these headers if provided in client options.

Pinning behaviour

The workflow's original zone is captured at workflow start and used during workflow processing.

The default partitioner provides the following behaviour: it attempts to dispatch work to the zone where the workflow was started. However, workers may not be available in that zone, or may no longer be available for some reason, so the partitioner uses a lookback over recent poller information to ensure the workflow can still be processed. If the start isolation-group is not available, it'll pick another healthy one at random.

'Health', here, is determined by the presence of pollers and the absence of drains.

The 'unpinning' is important for two main reasons. Firstly, it's quite possible for a workflow to be started from an isolation-group unrelated to where its pollers run, and suddenly blackholing that work would likely not be the desired behaviour. Secondly, and probably more importantly, it prevents a head-of-line blocking problem inside Cadence: at the database level (in this release, anyway) tasks need to be dispatched in order, so if an isolation-group's tasks were never processed, they would block task processing altogether.
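A minimal Go sketch of the selection logic described above, under the stated assumptions (a poller lookback plus the drain list determine health). All names here are illustrative; this is not Cadence's internal API:

package isolation

import (
	"errors"
	"math/rand"
)

// groupState is what the lookback over poller information (plus the
// drain list from the config-store) tells us about one isolation-group.
type groupState struct {
	hasRecentPollers bool
	drained          bool
}

func healthy(s groupState) bool { return s.hasRecentPollers && !s.drained }

// pickIsolationGroup prefers the group the workflow started in ("pinning")
// and falls back to a random healthy group ("unpinning") so work is
// neither blackholed nor allowed to block in-order task processing.
func pickIsolationGroup(startGroup string, groups map[string]groupState) (string, error) {
	if s, ok := groups[startGroup]; ok && healthy(s) {
		return startGroup, nil
	}
	var candidates []string
	for name, s := range groups {
		if healthy(s) {
			candidates = append(candidates, name)
		}
	}
	if len(candidates) == 0 {
		return "", errors.New("no healthy isolation-groups available")
	}
	return candidates[rand.Intn(len(candidates))], nil
}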

Drains

This release also introduces a simplistic notion of drains, which allows isolation-groups to be excluded from traffic processing, should that be required. Drains can be issued via the Admin API or via the CLI, for example:

cadence admin isolation-groups update-global --set-drains zone-1
cadence admin isolation-groups get-global

This information is stored in the config-store and is not part of dynamic configuration.

Configuration

To use this feature, the following configuration is required:

system.allIsolationGroups: a list of all the possible isolation-groups
system.enableTasklistIsolation: a bool flag to enable the feature for a domain
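As a hedged illustration only: if your cluster uses the file-based dynamic config client, the entries might look roughly like the following (the zone names and domain constraint are placeholders; confirm the exact format against your deployment's dynamic config file):

system.allIsolationGroups:
- value: ["zone-1", "zone-2", "zone-3"]
  constraints: {}
system.enableTasklistIsolation:
- value: true
  constraints:
    domainName: "my-domain"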

Implementation

The changes for this feature are largely in Matching and can be (reductively) described as: sync- and async-match in Cadence are made aware of a new dimension, their associated isolation-group, and tasks piped through the Matching service are matched to the appropriate isolation-group channel.
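A deliberately reductive Go sketch of that idea, using illustrative names only (this is not Cadence's actual Matching code): each isolation-group gets its own task channel, pollers read from the channel of their own group, and a fallback channel carries tasks without group information.

package matching

// Task is a simplified stand-in for a matching-service task.
type Task struct {
	WorkflowID     string
	IsolationGroup string
}

// taskListManager sketches per-isolation-group dispatch: one channel per
// group, plus a fallback channel for tasks with no group information.
type taskListManager struct {
	perGroup map[string]chan Task
	fallback chan Task
}

func (m *taskListManager) dispatch(t Task) {
	if ch, ok := m.perGroup[t.IsolationGroup]; ok {
		ch <- t
		return
	}
	m.fallback <- t
}

// poll is called on behalf of a worker that identified its isolation-group;
// it prefers that group's channel but also drains the fallback channel.
func (m *taskListManager) poll(group string) Task {
	ch, ok := m.perGroup[group]
	if !ok {
		return <-m.fallback
	}
	select {
	case t := <-ch:
		return t
	case t := <-m.fallback:
		return t
	}
}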

What's Changed


v1.0.0

26 Apr 19:59

We are v1.0! (with a schema upgrade)

What does this mean?!

Not much. Primarily that we are declaring "it's stable and in use" more visibly, because we continually get questions about this :) A larger public announcement / state-of-the-project is in the works.

Importantly, v1.0 does not imply any change to backwards compatibility (the minimum supported client version has not changed), RPC compatibility (ditto, all changes are backwards compatible), or Go API compatibility (this is not truly a library, Go compatibility is not a goal).

Going by previous version patterns, this would have been labeled v0.26.0 as it is a relatively incremental change (plus schema changes) from v0.25.0. As such, some strings still reference "0.26", because this older SHA is the one we have been using the most internally.
These strings will be updated and validated soon, and will likely be released as v1.0.1. This should have no behavioral impact at all, but will be visible in metrics, logs, and display strings.

What do I need to do to upgrade?

Schema upgrades needed

There have been schema changes to both normal and visibility datastores, primarily to provide better data for cleanup and hot-shard detection.

These were intentionally kept out of v0.25.0 to keep that upgrade simple, as they were not fully utilized yet.
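As a hedged example for a default Cassandra deployment, the upgrade is typically applied with the bundled schema tool; the endpoint, keyspaces, and schema directories below are the repo defaults, so adjust them for your cluster:

cadence-cassandra-tool -ep 127.0.0.1 -k cadence update-schema -d ./schema/cassandra/cadence/versioned
cadence-cassandra-tool -ep 127.0.0.1 -k cadence_visibility update-schema -d ./schema/cassandra/visibility/versioned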

Replication cache recommendation

We have internally disabled the replication cache (history.replicatorCacheCapacity dynamic config set to 0), due to unexpectedly large memory use under abnormal load, and you may wish to do so as well.

We did not encounter any misbehavior, and it did reduce database load as intended, but we intend to make some changes to it to estimate and constrain memory use before re-enabling.
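If you wish to do the same and you use the file-based dynamic config client, the override might look like this sketch (same format assumptions as any file-based dynamic config entry):

history.replicatorCacheCapacity:
- value: 0
  constraints: {}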

What has changed?

At a very high level, we've been focused on:

  • Internal scaling challenges, both improving bottlenecks and improving our ability to accurately identify bottlenecks
    • Many metrics, logs, and refactors are at least somewhat related to this
    • Our multi-cluster support is improved in particular, as we have been connecting clusters and moving many domains to spread load more evenly
  • Database corruptions, as our Cassandra clusters have had some problems that caused issues for months
    • Many logs, scanner, and stale-task changes are related to this, e.g. to detect and remove invalid data
  • Scaling up the team
    • More changes to come!

Some loosely categorized PRs that were included follow:

Critical bugfixes (resolving issues in v0.25.0)

Parent-close-policies apply to child workflows even after they reset/continue-as-new/etc

  • Update parent close policy to terminate/cancel child workflows even after continue as new by @Shaddoll in #5032
    • This requires new stored data, so it does not apply to child workflows started before this version.

Better config introspection

Schemas are now available via the Go module, as go:embed files.
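As a minimal illustration of the go:embed mechanism only (the package layout, glob pattern, and variable name below are hypothetical, not Cadence's actual structure):

package schema

import "embed"

// Schemas exposes the schema files compiled into the module, so operators
// can inspect them without checking out the repository.
// (The glob pattern below is a placeholder.)
//
//go:embed cassandra/*
var Schemas embed.FS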

Enhancing existing metrics and logging (and more included in other PRs)

Misc
