01 May 17:46

jakobht

ba39678

v1.2.9 Latest

What's Changed

Addition of tests for ArchivalConfigStateMachine in common/domain by @abhishekj720 in #5698
Introduce new dynamic config for enabling wfID based ratelimiting by @jakobht in #5703
Add unit tests for sql plugin registration by @Shaddoll in #5705
Add unit tests for sql helper functions by @Shaddoll in #5706
Add unit test for helper function of sql execution store by @Shaddoll in #5707
Generate a metadata file artifact in unit test buildkite job by @taylanisikdemir in #5708
Write tests for cdb.UpdateWorkflowExecutionWithTasks by @taylanisikdemir in #5709
Add unit tests for helper functions in sql execution store util by @Shaddoll in #5710
Add unit tests for CreateWorkflowExecution by @Shaddoll in #5715
Test: Addition of tests for replicationQueue publish and publish to dlq by @abhishekj720 in #5700
Implemented ratelimiting for external calls per wfid (guarded by feature flag) by @jakobht in #5704
remove old metrics wrappers and use new generated metered wrappers by @3vilhamster in #5717
Proper shutdown of kafka consumer impl and fix test by @taylanisikdemir in #5712
Add additional unit tests for functions in constants.go by @timl3136 in #5713
Initial codecov integration by @taylanisikdemir in #5711
Add tests for UpdateWorkflowExecution by @Shaddoll in #5718
Tests for UpdateWorkflowExecution in nosql store-Part1 by @agautam478 in #5719
Add unit tests for ConflictResolveWorkflowExecution by @Shaddoll in #5721
Add tests for elasticsearch v6 client by @neil-xie in #5716
Add unit tests for persistence task types in DataManagerInterfaces by @timl3136 in #5720
Add unit tests for CreateFailoverMarkerTasks by @Shaddoll in #5724
Change noisy frontend poll timeout log to debug level by @taylanisikdemir in #5725
Added unit tests for nosql_execution_Store_util.go - Part1 by @agautam478 in #5723
Straightforwardly fixes a few minor copy bugs and adds a small fuzz util by @davidporter-id-au in #5572
Add test for ES v6 client Search method by @neil-xie in #5727
Tests for Common/Domain: Adding tests for replication queue message handling and ack update by @abhishekj720 in #5730
Add more unit tests for persistence task types in DataManagerInterfaces by @timl3136 in #5726
Added two more test cases for the updateworkflowexecution by @agautam478 in #5722
[history] refactor history client with timeout wrapper by @shijiesheng in #5728
Add unit tests for PinotVisibilityStore by @bowenxia in #5714
Removed errors file from test coverage by @abhishekj720 in #5735
Test for Common/domain/replication_queue: GetMessagesfromDLQ & AckLevel by @abhishekj720 in #5734
Added unit tests for Delete current and workflow execution, list all … by @agautam478 in #5733
Added unit tests for PrepareResetWorkflowExecutionRequestWithMapsAndE… by @agautam478 in #5731
Adding more unit tests for ES v6 client by @neil-xie in #5739
Tests for GetDLQAckLevel and UpdateDLQAckLevel by @abhishekj720 in #5740
Add unit tests for TaskInfo types and utility functions by @timl3136 in #5732
Tests for common/domain: tests TestGetDLQSize, TestRangeDeleteMessagesFromDLQ and TestDeleteMessageFromDLQ by @abhishekj720 in #5741
Add error case tests for pinot_visibility_store by @bowenxia in #5746
Add unit test for util methods in es v6 client bulk processor by @neil-xie in #5748
Add unit tests for GetWorkflowExecution by @Shaddoll in #5736
Adds test for execution/mutable_state_builder.go by @davidporter-id-au in #5744
Add unit tests for the util functions in data_manager_interface by @timl3136 in #5742
Very minor nil-or-empty cleanup by @Groxx in #5745
Added more tests for nosql_execution_store.go by @agautam478 in #5738
Write more tests for cassandra/workflows.go by @taylanisikdemir in #5750
Added more tests for nosql_execution_stor_util.go by @agautam478 in #5752
Enforce leading space on comments by @Groxx in #5747
Add unit tests for common/persistence/sql/factory.go by @Shaddoll in #5751
[history] fix generated timeout wrapper by @shijiesheng in #5737
Add unit tests for functions in gocql/batch.go by @timl3136 in #5759
Add test for es v6 bulk processor by @neil-xie in #5758
Added test for replicationTaskExecutor: execute by @abhishekj720 in #5754
Add unit test for ES v7 client by @neil-xie in #5760
Added test cases for more util methods by @agautam478 in #5755
More unit tests for nosql_execution_store_test.go by @agautam478 in #5753
Add unit test for pinot folder with coverage to 93.4% by @bowenxia in #5761
[code-coverage] update admin and frontend client to use generated code by @ketsiambaku in #5702
Tests for PurgeAckedMessages and replicationMessage in common/domain/replication_queue by @abhishekj720 in #5749
Code cleanup for sql package by @Shaddoll in #5756
Add unit test for es v7 bulk processor by @neil-xie in #5764
Added test for pinot_visibility_metric_clients.go by @bowenxia in #5767
adding mutable state builder tests - adding continue-as-new events by @davidporter-id-au in #5768
Refactor/adding mutable state builder tests iv by @davidporter-id-au in #5769
Add unit test for open search client part 1 by @neil-xie in #5774
minor mutable-state log fix by @davidporter-id-au in #5776
refactor common/persistence/pinot tests by @bowenxia in #5777
Addition of tests for archivalConfigStateMachine in common/domain by @abhishekj720 in #5778
Re-enable sql unit test by @Shaddoll in #5779
Test: Validate domain config test for attrValidator by @abhishekj720 in #5699
refactor pinot_visibility_store_test by @bowenxia in #5780
[code-coverage] Generate code for matching client timeout wrapper by @ketsiambaku in #5771
Fix data race in matching test suite by @taylanisikdemir in #5781
hot fix for unit test cases that might cause a failure by @bowenxia in #5787
Adding unit tests for TestPrepareTransferTasksForWorkflowTxn by @agautam478 in #5763
Ignore requests sent from pinot response comparator by @bowenxia in #5788
Coverage for dataStoreInterfaces by @Groxx in #5743
Retryable error for workflow rate limits in task processing by @sankari165 in #5782
Re-enable kafka consumer test by @taylanisikdemir in #5791
Global ratelimiter, part 1: core algorithm for computing weights by @Groxx in #5689
Write tests for cassandra SelectWorkflowExecution by @taylanisikdemir in #5792
Fix workflow deletion by @Shaddoll in #5793
Fix checksum validation for SQL implementation by @Shaddoll in #5790
added unit test for function in mapper-thrift-configstore file by @d-vignesh in #5789
Error mapper tests by @jakobht in #5795
Add a benchmark test for crc checksum by @Shaddoll in #5798
Add metric and retry backoff for checksum failure by @Shaddoll in #5797
Added new er...

Contributors

Groxx, jakobht, and 15 other contributors

26 Mar 18:46

neil-xie

v1.2.8

3f64176

What's Changed

Added

Adding unit-test for matching:newTaskListID by @dkrotx in #5513
Get/Update DomainAsyncWorkflowConfiguration methods in admin API and CLI by @taylanisikdemir in #5616
Workflow ID cache size metric by @jakobht in #5619
Add a helper script to run cassandra and execute tests by @taylanisikdemir in #5620
Scaffold StartWorkflowExecutionAsync API by @Shaddoll in #5621
Scaffold async workflow queue provider component by @Shaddoll in #5627
Update run_cass_and_test.sh script to setup cassandra schemas by @taylanisikdemir in #5628
Add debug logs in PinotTripleVisibilityManager for response comparator testing by @bowenxia in #5631
Adding a sample call to TaskValidator in update workflow cycle by @agautam478 in #5634
Add a middleware for comparator to use by @bowenxia in #5637
Generate rate limit frontend api handler by @Shaddoll in #5636
Add generic OAuth support by @mantas-sidlauskas in #5638
Added metrics for when we rate limit by @jakobht in #5640
Implement StartWorkflowExecutionAsync API by @Shaddoll in #5642
Added 2 more tags in log for comparator to use. by @bowenxia in #5646
Async workflow request consumer manager in worker by @taylanisikdemir in #5655
Add async workflow request consumer for Start/SignalWithStart support by @taylanisikdemir in #5658
Set rate limit on Async APIs by @Shaddoll in #5659
Implement SignalWithStartWorkflowExecutionAsync API by @Shaddoll in #5657
Docker compose setup for async workflows with kafka queue by @taylanisikdemir in #5663
Add a make pr target for an easy "do automated checks for PR" command by @Groxx in #5670
Added debug information for decision timeout handling by @3vilhamster in #5674
Async workflows integration test with kafka by @taylanisikdemir in #5678
Add missing IsolationGroups field in domain cache entry by @taylanisikdemir in #5679
Add close status parse method in pinot query validator by @neil-xie in #5680
Add async workflow integration test step to CI by @taylanisikdemir in #5681
Add metrics for external calls for the workflow ID specific rate limits by @jakobht in #5684
Write tests for cdb (Cassandra DB wrapper) basic functions by @taylanisikdemir in #5686
Added a unit test for nosql execution store - createworkflowexecution by @agautam478 in #5687
Write tests for cdb.InsertWorkflowExecutionWithTasks by @taylanisikdemir in #5688
Added more scenarios to createworkflowexecution test- Part1 by @agautam478 in #5690
Added a test for the GetworkflowExecution in the nosql_execution_store.go file. by @agautam478 in #5692
Write tests for cdb.SelectCurrentWorkflow by @taylanisikdemir in #5693
Support AsyncWorkflowConfiguration decoding in admin CLI by @taylanisikdemir in #5694

Changed

Replace JWT validation library by @mantas-sidlauskas in #5592
feat: pprof support config host by @zedongh in #5601
Refactor persistence serializer tests and add more cases by @taylanisikdemir in #5625
Upgrade domain_config type in cassandra schema to add async wf config by @taylanisikdemir in #5630
Refactor frontend API handler and use generated code to emit metrics by @Shaddoll in #5639
Enable the workflow ID cache in shadow mode for start workflow by @jakobht in #5641
Filtering the prefix in custom query log for pinot response comparator by @bowenxia in #5643
The ratelimiter needs to be created with the domain name not the ID by @jakobht in #5644
Update async workflow queue idl change by @Shaddoll in #5645
Rewrite async workflow queue provider component by @Shaddoll in #5648
Store mutable state checksum in SQL storage by @Shaddoll in #5649
Splitting wfCacheEnabled config for internal and external requests by @sankari165 in #5647
Convert pinot query to use unix milliseconds instead of nano by @neil-xie in #5650
Emit metrics when transfer tasks could be ratelimited by @sankari165 in #5652
Update change log for v1.2.7 release by @neil-xie in #5653
Update pinot query validator to handle raw time string by @neil-xie in #5656
Emit metrics when transfer tasks for decisions could be ratelimited by @sankari165 in #5665
Upgrade pinot client version by @neil-xie in #5666
Update the build-changed message failure by @Groxx in #5667
Improve error message for membership resolver by @Shaddoll in #5669
Emits a counter value for every unique view of the hashring by @davidporter-id-au in #5672
Refactor history packages by @jakobht in #5673
Improve test coverage for sql_execution_store_util by @Shaddoll in #5676
Improve test coverage for sql_execution_store by @Shaddoll in #5677
Improve test coverage for constants.go by @timl3136 in #5685
Enable retry on mutable state checksum verification failure by @Shaddoll in #5691

Fixed

Set proper max reset points by @neil-xie in #5623
Put a timeout for timer task deletion loop during shutdown by @taylanisikdemir in #5626
Catch unit test failures in make test by @Groxx in #5635
fix: get messages between query over message_id typo by @zedongh in #5607
Fix context leak in tests by @munahaf in #5377
Make sure task processing rate limiter is only done in the active side by @sankari165 in #5654
Fix Pinot query validator bug when user pass in not equal query with value missing by @neil-xie in #5662
Update Pinot query validator failed log, minor refactor pinot visibility store to remove panics by @neil-xie in #5664
Fix context leak in pinot integration test by @neil-xie in #5682
Fix SignalWithStartWorkflow API by @Shaddoll in #5671
Fix wrong migration paths in example by @kotcrab in #5668
Fix comment in workflow id cache config by @sankari165 in #5661
Fix the local integration test docker-compose file by @jakobht in #5695
Do not get workflow execution from database when shard is closed by @Shaddoll in #5697

Removed

Removed useless metrics tag from the workflowIDcache by @jakobht in #5651
Removed the shadower service for cadence-server by @agautam478 in #5660

New Contributors

@zedongh made their first contribution in #5607
@munahaf made their first contribution in #5377
@kotcrab made their first contribution in #5668

Full Changelog: v1.2.7...v1.2.8

Contributors

Groxx, jakobht, and 14 other contributors

09 Feb 19:00

neil-xie

v1.2.7

08d5994

What's Changed

Added

Add metrics to monitor task validation. by @agautam478 in #5466
Add an "all results" query to scanner/fixer workflows by @Groxx in #5470
Add retries into Scanner BlobWriter by @agautam478 in #5471
Added a unit test for the BlobStoreWriter. by @agautam478 in #5472
Add Debugf and some minor updates to timer queue processor base by @taylanisikdemir in #5475
Add unit tests for cassandra workflow utils part-1 by @taylanisikdemir in #5476
Add workflow query-types command to CLI by @arzonus in #5456
Add unit test for cassandra workflow utils part-2 by @taylanisikdemir in #5480
Unit tests for admin cli decode_thrift command by @taylanisikdemir in #5485
Add unit test for sqlConfigStore by @Shaddoll in #5491
Add unit test for mysql configstore by @Shaddoll in #5502
Add persistence serialization unit tests by @3vilhamster in #5507
Adding unit tests to workflowHandler_test.go by @sankari165 in #5500
Add unit tests for AwaitWaitGroup by @arzonus in #5512
Add unit test for sql domain store by @Shaddoll in #5508
Add unit test for cassandra workflow utils part-3 by @taylanisikdemir in #5506
Adding unit tests for RecordActivityTaskHeartbeat by @sankari165 in #5511
add unit tests for ValidIDLength by @arzonus in #5520
Test for rate limited wrappers around persistence clients by @3vilhamster in #5518
Test for error injection clients by @3vilhamster in #5515
Add unit test for sql history store by @Shaddoll in #5524
Adding unit tests to RespondActivityTaskCompleted and RecordActivityT… by @sankari165 in #5521
Add unit tests for IsEntityNotExistsError by @arzonus in #5528
Add unit tests for CreateXXXRetryPolicy by @arzonus in #5527
Add unit tests for ValidateRetryPolicy by @arzonus in #5529
Add unit tests for ConvertGetTaskFailedCauseToErr by @arzonus in #5531
Add unit tests for WorkflowIDToHistoryShard and DomainIDToHistoryShard by @arzonus in #5533
Added a unit test for the timer.go file in reconciliation folder. by @agautam478 in #5505
Adding logging to scanner.go by @agautam478 in #5535
Adding a metric for hosts not being found in resolver by @davidporter-id-au in #5414
Added logs to concrete_execution.go by @agautam478 in #5536
Add unit tests for sql queue store by @Shaddoll in #5541
Unit tests for timer/transfer queue processor pump loops by @taylanisikdemir in #5540
Add unit tests for sql shard store by @Shaddoll in #5543
Add unit test for kafka partition ack manager by @neil-xie in #5545
Add unit tests for GenerateRandomString by @arzonus in #5532
Add unit tests for IsValidContext by @arzonus in #5546
Add unit tests for CreateChildContext by @arzonus in #5547
Add unit tests for DeserializeSearchAttributeValue by @arzonus in #5548
Add unit tests for GetSizeOfHistoryEvent by @arzonus in #5550
Add unit tests for thrift mappers by @taylanisikdemir in #5542
Add unit tests for sql task store by @Shaddoll in #5558
Added logs into the current execution.go and a unit test by @agautam478 in #5555
Add unit test for kafka producer impl by @neil-xie in #5559
Add shard id to queue processor related metrics by @taylanisikdemir in #5557
Add unit tests for sql execution store by @Shaddoll in #5565
Add unit test for new Kafka client by @neil-xie in #5570
Add unit tests for helper functions in sql execution store util by @Shaddoll in #5571
Added tests for visibility sampling wrapper by @3vilhamster in #5564
Add unit test for consumer impl by @neil-xie in #5573
Add unit tests for workflow state non maps by @Shaddoll in #5578
Add logs to debug timer tasks by @Shaddoll in #5581
Added deprecated domain check to the taskvalidator by @agautam478 in #5580
Add unit tests for IsServiceTransientError by @arzonus in #5551
Add unit tests for IsAdvancedVisibilityWritingEnabled by @arzonus in #5552
Add unit tests for ValidateLongPollXXX by @arzonus in #5553
Add grafana dashboard to visualize persistence metrics for default docker-compose setup by @taylanisikdemir in #5582
Add missing exclude-query support to list-workflows on the CLI by @Groxx in #5583
Add unit tests for DurationToXXX and XXXToDuration by @arzonus in #5530
Add more debug logs for user timer task execution by @taylanisikdemir in #5595
Add cache for workflow specific in memory data by @jakobht in #5594
Added three dynamic config properties by @jakobht in #5602
add ContextKey Struct by @bowenxia in #5606
Adding a stale workflow check to the taskvalidator and code cleanup. by @agautam478 in #5604
Added more error handling in workflow cache by @jakobht in #5611

Fixed

Improves metric and error handling for history by @davidporter-id-au in #5469
Address map access data race in matching engine by @taylanisikdemir in #5477
fix docker compose tests by @3vilhamster in #5479
Fix copying suite.Suite in integration tests by @3vilhamster in #5481
fix scavenger test suite by @3vilhamster in #5490
fix scavenger suite by @3vilhamster in #5498
Fixing matching:TestCheckIdleTaskList test flakiness by @dkrotx in #5494
fix leaky goroutines in matching by @3vilhamster in #5499
Unit test for the fetcher/current.go. by @agautam478 in #5504
More fixes for golint.sh by @Groxx in #5519
Fix race between startup and shutdown in task reader by @Groxx in #5522
Ensure scanner scavenger stops in tests by @3vilhamster in #5510
Bugfix/debugging stuck tasklist by @davidporter-id-au in #5436
Fix multiple lock acquire on membership update by @3vilhamster in #5576
Properly catch errors in ldflag-gathering and fail the build by @Groxx in #5539
Addressed sync issue in workflow cache by @jakobht in #5605
fix a comment by @bowenxia in #5610
Fixed lint errors introduced in previous PR by @jakobht in #5613

Changed

Update kafka config to have isSecure option by @neil-xie in #5473
Minor change to include domainTag and pass domainName. by @agautam478 in #5468
Wrap isSecure config in config map for kafka topic by @neil-xie in #5474
Update changelog for v1.2.6 release by @neil-xie in #5478
Unify cassandra setup in docker-compose by @3vilhamster in #5482
Unify logging in tests by @3vilhamster in #5487
Updated the unit test for BlobstoreIterator into a table format by @agautam478 in #5488
update cassandra dev setup by @3vilhamster in #5501
Converted the existing test for concrete.go execution into a table test by @agautam478 in #5503
Improve logs/metrics of HandleDecisionTaskCompleted by @taylanisikdemir in #5497
Revert gofuzz us...

Contributors

Groxx, jakobht, and 11 other contributors

14 Dec 22:11

neil-xie

v1.2.6

558780b

What's Changed

Added

Added range query support for Pinot json index by @bowenxia (#5426)
Implemented GetTaskListSize method at persistence layer by @Shaddoll (#5442, #5447)
Added a framework for the Task validator service by @agautam478 (#5446)
Added nit comments describing the Update workflow cycle by @agautam478 (#5432)
Added log user query param by @bowenxia (#5437)
Added CODEOWNERS file by @taylanisikdemir (#5453)
Added a function to evict all elements older than the cache TTL by @jakobht (#5464)

Fixed

Fixed workflow replication for reset workflow by @Shaddoll (#5412)
Fixed visibility mode for admin when using Pinot visibility by @neil-xie (#5441)
Fixed workflow started metric by @ketsiambaku (#5443)
Fixed timer-fixer, unfortunately broken in 1.2.5 by @Groxx (#5433)
Fixed confusing comment in matching handler by @jakobht (#5450)

Changed

Cassandra version is changed from 3.11 to 4.1.3 by @taylanisikdemir (#5461)
- If your machine already has the ubercadence/server:master-auto-setup image, you need to re-pull it so it works with the latest docker-compose*.yml files
Move dynamic ratelimiter to its own file by @jakobht (#5451)
Create and use a limiter struct instead of just passing a function by @jakobht (#5454)
Dynamic ratelimiter factories by @jakobht (#5455)
Update github action for image publishing to released by @3vilhamster (#5460)
Update matching to emit metric for tasklist backlog size by @Shaddoll (#5448)
Change variable name from SecondsSinceEpoch into EventTimeMs by @bowenxia (#5463)

Removed

Get rid of noisy task adding failure log in matching service by @taylanisikdemir (#5445)

New Contributors

@jakobht made their first contribution in #5450

Full Changelog: v1.2.5...v1.2.6

Contributors

Groxx, jakobht, and 7 other contributors

02 Nov 19:07

sankari165

v1.2.5

eb8eea9

What's Changed

Added

Scanner / Fixer changes by @Groxx in #5361
- Stale-workflow detection and cleanup added to shardscanner, disabled by default.
- New dynamic config to better control scanner and fixer, particularly for concrete executions.
- Documentation about how scanner/fixer work and how to control them, see the scanner readme.md
- This also includes example config to enable the new fixer.
MigrationChecker interface to expose migration CLI by @abhishekj720 in #5424
Added Pinot as new visibility store option by @neil-xie in #5201
- Added pinot visibility triple manager to provide options to write to both ES and Pinot.
- Added pinotVisibilityStore and pinotClient to support CRUD operations for Pinot.
- Added pinot integration test to set up Pinot test cluster and test Pinot functionality.

Fixed

Fix CreateWorkflowModeContinueAsNew for SQL by @Shaddoll in #5413
Fix CLI count&list workflows error message by @ketsiambaku in #5417
Hotfix for async matching for isolation-group redirection by @davidporter-id-au in #5423
Fix closeStatus for --format flag by @ketsiambaku in #5422

Full Changelog: v1.2.4...v1.2.5-prerelease3

Contributors

Groxx, davidporter-id-au, and 4 other contributors

27 Sep 19:03

neil-xie

v1.2.4

c93d6af

What's Changed

Remove database check for config store tests by @Shaddoll in #5401
Fix persistence tests setup by @Shaddoll in #5402
Implement config store for MySQL by @Shaddoll in #5403
Retract v1.2.3 by @sankari165 in #5406
Implement config store for PostgresSQL by @Shaddoll in #5405
Release v1.2.4 by @Shaddoll in #5407

Full Changelog: v1.2.3...v1.2.4

Contributors

Shaddoll and sankari165

15 Sep 22:10

Shaddoll

v1.2.3

4a16136

v1.2.3 (Retracted, please use v1.2.4) Pre-release

Added

Expose workflow history size and count to client by @timl3136 (#5392)

Fixed

[cadence-cli] fix typo in input flag for parallelism by @sankari165 (#5397)

Changed

Update config store client to support SQL database by @Shaddoll (#5395)
Scaffold config store for sql plugins by @Shaddoll (#5396)
Improve poller detection for isolation by @Shaddoll (#5399)

Contributors

Shaddoll, timl3136, and sankari165

19 Sep 16:34

sankari165

v1.2.2

e5f605c

What's Changed

add a update workflow execution count metric for RI by @allenchen2244 in #5386
Pass partition config and isolation group to history/matching even if isolation is disabled by @Shaddoll in #5385
[CLI] fix nil pointer issue in domain migration command rendering by @shijiesheng in #5378
Release v1.2.2 by @shijiesheng in #5388

Full Changelog: v1.2.1...v1.2.2

Contributors

shijiesheng, Shaddoll, and allenchen2244

19 Sep 03:56

davidporter-id-au

v1.2.1

0e17485

Project release: Zonal isolation

This version introduces a few resiliency concepts into customers' worker task processing such that they can detect deployment or configuration failures earlier. These features are opt-in.

The high-level concept is to provide a means to subdivide work (called 'isolation-groups') for workers along whatever partitioning mechanism your service requires.

By default, the partitioning mechanism provided will attempt to keep workflows running in the location they are started, so that customers can identify broken changes earlier rather than waiting for the deployment of an entire region. However, if there are no pollers available in that subdivision, it'll route the work elsewhere.

Nomenclature

Partitioning: A means to subdivide the tasks given to workflows, of which there are many possible schemes and one default provided. When a workflow is started, a group of partition keys is provided by request headers. The partition keys are used to determine which isolation group of workers should process the workflow.
Workflow pinning: A partitioning scheme which emphasizes keeping workflows running in the location they were started.
Isolation-groups: A division of work within a customer region in which they can subdivide their workers and pin the workflows. This was originally intended as a synonym for 'zone' in the site-reliability sense, as a subdivision of a region. However, the important point is that this is a failure domain for customer workflows, so it may be an arbitrary subdivision of your cluster's traffic.
Isolation-group drain: A means of excluding work from an isolation-group. If an isolation group is drained, workers from that isolation group won't be able to get any tasks, and customers cannot start workflows from that isolation group.

Default concepts and approaches

The partitioning and isolation concepts are intended to be flexible, general-purpose orchestration concepts, with some basic defaults provided. By default, the following behaviour is given:

Partition data is persisted with workflow execution records by the provided middleware if the provided header is passed when workflows are created.
The cadence client and worker Go libraries will pass these as headers if provided in client options.

Pinning behaviour

The workflow's original zone is captured on workflow start and is used during workflow processing.

The default partitioner provides the following behaviour: it will attempt to dispatch work in the zone where the workflow was started. However, workers may not be available in that zone, or may no longer be available for some reason. So the partitioner takes information from a lookback of poller information and uses this lookback data to ensure that the workflow can be processed. If the start isolation-group is not available, it'll pick another healthy one at random.

'Health', here, is determined as the presence of pollers and the absence of drains.
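The pinning-with-fallback rule described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not Cadence's actual partitioner: groupState, healthy, and pickIsolationGroup are hypothetical names, and returning ok=false when no group is healthy is an assumption, since these notes do not say what happens in that case.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// groupState is a hypothetical summary of what is known about one isolation-group.
type groupState struct {
	HasRecentPollers bool // pollers were seen in the lookback window
	Drained          bool // the group is excluded by a drain
}

// healthy follows the definition in the text: pollers present and no drain.
func healthy(s groupState) bool {
	return s.HasRecentPollers && !s.Drained
}

// pickIsolationGroup keeps the start group when it is healthy and otherwise
// picks one of the healthy groups using choose(n), which must return an index
// in [0, n). ok is false when no group is healthy (an assumption of this sketch).
func pickIsolationGroup(start string, groups map[string]groupState, choose func(n int) int) (group string, ok bool) {
	if s, found := groups[start]; found && healthy(s) {
		return start, true
	}
	candidates := make([]string, 0, len(groups))
	for name, s := range groups {
		if healthy(s) {
			candidates = append(candidates, name)
		}
	}
	if len(candidates) == 0 {
		return "", false
	}
	// Map iteration order is unspecified, so sort to make the choice
	// depend only on choose.
	sort.Strings(candidates)
	return candidates[choose(len(candidates))], true
}

func main() {
	groups := map[string]groupState{
		"zone-1": {HasRecentPollers: true, Drained: true}, // drained
		"zone-2": {HasRecentPollers: true},                // healthy
		"zone-3": {},                                      // no pollers
	}
	group, ok := pickIsolationGroup("zone-1", groups, rand.Intn)
	fmt.Println(group, ok) // prints: zone-2 true
}
```

Passing the random choice in as a function keeps the sketch easy to test deterministically; a real implementation may select among healthy groups differently.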

The 'unpinning' is important for two main reasons. Firstly, it's quite possible to start a workflow from an isolation-group unrelated to the one in which the pollers are created, and suddenly blackholing that work would likely not be the desired behaviour. Secondly, and probably more importantly, it prevents a head-of-line blocking problem internally for Cadence: at the database level (in this release anyway) tasks need to be dispatched in order, so if an isolation-group were not processed it would block task processing.

Drains

This release also introduces a simplistic notion of drains, which allow isolation-groups to be excluded from traffic processing, should that be required. Drains can be issued via the Admin API or via the CLI:

e.g.:

cadence admin isolation-groups update-global --set-drains zone-1
cadence admin isolation-groups get-global

This information is stored in the config-store and is not part of dynamic configuration.

Configuration

To use this feature, the following configuration is required:

system.allIsolationGroups: The list of all possible isolation-groups.
system.enableTasklistIsolation: The bool flag that enables the feature for a domain.

Implementation

The changes for this feature are largely in Matching and can be (reductively) described as follows: sync and async match in Cadence are made aware of a new dimension, their associated isolation-group. Tasks piped through the Matching service are matched on the appropriate isolation-group channel.
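As a rough illustration of the "one channel per isolation-group" idea, the Go sketch below routes each task to the channel of its group, so a poller only receives tasks from its own group. It is a simplified model under assumptions and does not reflect the real Matching service: task, groupChannels, offer, and poll are hypothetical names, and the fallback to other groups described under pinning behaviour is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// task is a hypothetical stand-in for a matching task tagged with its isolation-group.
type task struct {
	ID             string
	IsolationGroup string
}

// groupChannels holds one buffered channel per isolation-group (illustrative only).
type groupChannels struct {
	byGroup map[string]chan task
}

func newGroupChannels(groups []string, buffer int) *groupChannels {
	gc := &groupChannels{byGroup: make(map[string]chan task, len(groups))}
	for _, g := range groups {
		gc.byGroup[g] = make(chan task, buffer)
	}
	return gc
}

// offer places a task on the channel for its isolation-group without blocking.
func (gc *groupChannels) offer(t task) error {
	ch, ok := gc.byGroup[t.IsolationGroup]
	if !ok {
		return errors.New("unknown isolation-group: " + t.IsolationGroup)
	}
	select {
	case ch <- t:
		return nil
	default:
		return errors.New("channel full for isolation-group: " + t.IsolationGroup)
	}
}

// poll returns the next task for a poller in the given isolation-group, if any.
func (gc *groupChannels) poll(group string) (task, bool) {
	select {
	case t := <-gc.byGroup[group]: // a nil channel (unknown group) is never ready, so default runs
		return t, true
	default:
		return task{}, false
	}
}

func main() {
	gc := newGroupChannels([]string{"zone-1", "zone-2"}, 8)
	_ = gc.offer(task{ID: "decision-1", IsolationGroup: "zone-1"})

	_, ok := gc.poll("zone-2")
	fmt.Println(ok) // false: a zone-2 poller does not see zone-1's task

	t, ok := gc.poll("zone-1")
	fmt.Println(t.ID, ok) // decision-1 true
}
```

The non-blocking select keeps the example short; how the real service handles back-pressure and redirection between groups is not covered by these notes.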

What's Changed

Set config for shardscanner fixer by @mantas-sidlauskas in #3844
Fix get raw history for transient decision by @yycptt in #3847
Fix error handling when processing parent close policy by @yycptt in #3845
Add logging/metrics for decision attempts by @yycptt in #3849
Switch to gocql interface by @yycptt in #3837
Fix NPE in DescribeMutableState by @yycptt in #3850
Switch the remaining history component to internal types by @vytautas-karpavicius in #3843
Switch Health status endpoints to internal types by @vytautas-karpavicius in #3842
reset workflow with no decision task complete by @yux0 in #3687
error check before return the ActivityLocalDispatchInfo by @mkolodezny in #3853
Delete unused dynamic configs that have no reference anymore by @longquanzheng in #3859
Merge sql updates: Blob size increase by @yux0 in #3858
Handle matching task list conditional error by @yux0 in #3867
Fix go-generate by @yycptt in #3864
Support visibility query with close status represented in string by @yycptt in #3865
Add timers shardscanner by @mantas-sidlauskas in #3846
replace string based logging with tagged logs by @mantas-sidlauskas in #3871
Downgrade golang tools version by @yycptt in #3876
Add instructions to setup local MySQL and Postgres by @yux0 in #3868
Make max activity schedule to start timeout for retry configurable by domain by @yycptt in #3878
Task processing debug logs by @yycptt in #3877
Transfer queue validator by @yycptt in #3875
Pick sql index changes by @yux0 in #3866
Remove strict sanity check to allow reset by @yux0 in #3879
Improve shard context timeout handling by @yycptt in #3881
Add domain name tag in failover metrics by @yux0 in #3882
break out when response is nil by @mantas-sidlauskas in #3886
Allow using Kafka TLS without cert ca and key by @longquanzheng in #3862
Fix dynamic config collection logValue function by @yycptt in #3880
Update read DLQ messages API to return raw task info by @yux0 in #3869
break if adminClient returns error by @mantas-sidlauskas in #3887
Latest idl by @yux0 in #3888
Fix activity lost metrics by @yycptt in #3889
Add replication error logging and metrics by @yux0 in #3891
Simplify templateGetLastMessageIDQuery sql query by @andrewjdawson2016 in #3890
Add task processing workflow busy metric by @yycptt in #3892
CLI 0.18.0 release by @yycptt in #3896
Handle data corruption error in replication by @yux0 in #3895
Add a "help" target to the makefile by @Groxx in #3898
Initial protobuf types and API by @vytautas-karpavicius in #3863
Fix workflow reset command by @yycptt in #3904
CLI 0.18.1 patch release by @yycptt in #3908
Use GetDomainName instead of GetDomainByID for retrieving domain names by @yycptt in #3899
Start enabled shardscanner fixers by @mantas-sidlauskas in #3906
Switch to protoc-gen-go by @vytautas-karpavicius in #3905
Fix scan unsupported workflow in SQL DB by @yux0 in #3909
Makefile cleanup / thrift revamp / gobin removed by @Groxx in #3903
Version goveralls, remove unused go bins from docker setup by @Groxx in #3913
Remove duplicate doc...

Contributors

Groxx, adambabik, and 53 other contributors

26 Apr 19:59

Groxx

v1.0.0

8e81044

We are v1.0! (with a schema upgrade)

What does this mean?!

Not much. Primarily that we are declaring "it's stable and in use" more visibly, because we continually get questions about this :) A larger public announcement / state-of-the-project is in the works.

Importantly, v1.0 does not imply any change to backwards compatibility (the minimum supported client version has not changed), RPC compatibility (ditto, all changes are backwards compatible), or Go API compatibility (this is not truly a library, Go compatibility is not a goal).

Going by previous version patterns, this would have been labeled v0.26.0 as it is a relatively incremental change (plus schema changes) from v0.25.0. As such, some strings still reference "0.26", because this older SHA is the one we have been using the most internally.
These strings will be updated and validated soon, and will likely be released as v1.0.1. This should have no behavioral impact at all, but will be visible in metrics, logs, and display strings.

What do I need to do to upgrade?

Schema upgrades needed

There have been schema changes to both normal and visibility datastores, primarily to provide better data for cleanup and hot-shard detection:

Update-time additions by @neil-xie in #4962 and #4971
Add FirstExecutionRunID to mutable state by @Shaddoll in #5031
Shard ID visibility additions by @allenchen2244 in #5099 and #5123

These were intentionally kept out of v0.25.0 to keep that upgrade simple, as they were not fully utilized yet.

Replication cache recommendation

We have internally disabled the replication cache (history.replicatorCacheCapacity dynamic config set to 0), due to unexpectedly large memory use under abnormal load, and you may wish to do so as well.

We did not encounter any misbehavior, and it did reduce database load as intended, but we intend to make some changes to it to estimate and constrain memory use before re-enabling.

What has changed?

At a very high level, we've been focused on:

Internal scaling challenges, both relieving bottlenecks and improving our ability to accurately identify them
- Many metrics, logs, and refactors are at least somewhat related to this
- Our multi-cluster support is improved in particular, as we have been connecting clusters and moving many domains to spread load more evenly
Database corruptions, as our Cassandra clusters have had some problems that have caused issues for months
- Many logs, scanner, and stale-task changes are related to this, e.g. to detect and remove invalid data
Scaling up the team
- More changes to come!

Some loosely categorized PRs that were included follow:

Critical bugfixes (resolving issues in v0.25.0)

Fix ndc flush buffered events by @Shaddoll in #5009
Hotfix a replication panic causing crashes by @davidporter-id-au in #5074
Resolve an infinite loop around impossible cron schedules by @Groxx in #5097

Parent-close-policies apply to child workflows even after they reset/continue-as-new/etc

Update parent close policy to terminate/cancel child workflows even after continue as new by @Shaddoll in #5032
- This requires new stored data, so it does not apply to child workflows started before this version.

Better config introspection

Config store CLI: make value required when updating by @mantas-sidlauskas in #5089
CLI: print all available dynamic config keys by @mantas-sidlauskas in #5090

Schemas are now available via the go module, as go:embed files

Embed schema files by @Shaddoll in #5040
Embed elasticsearch index templates by @Shaddoll in #5043
Fix ES embedding by @Shaddoll in #5056

Enhancing existing metrics and logging (and more included in other PRs)

Reduce metrics cardinality replication.TaskStore by @vytautas-karpavicius in #4981
Add Metric Emitter, which right now emits a metric once a minute for true replication lag in nanoseconds. by @ZackLK in #4979
Added logs for domainName empty situation by @abhishekj720 in #4987
Improve logs for task executor by @Shaddoll in #4989
Add domain_type and cluster_groups tags by @vytautas-karpavicius in #4990
Introduce per domain metrics by @Shaddoll in #5012
Improve logs for transfer task validator by @Shaddoll in #5044
Make replication log error message better by @davidporter-id-au in #5052
Wf version metrics by @allenchen2244 in #5041
Add domain tag to unregistered field error by @neil-xie in #5070
UpdateWorkflow ShardId based metrics by @allenchen2244 in #5080
Emit workflow counts per workflow type metrics by @neil-xie in #5082
Use zap logger when initialising dynamic config by @mantas-sidlauskas in #5081
add 3 tags to support adding logs for every manual access by @bowenxia in #5112
Add sample log and dynamic config for updateworkflowexecution hot shard detection by @allenchen2244 in #5120
Add attempt-count to task processing logs, and update unit test so that it will cover deadlock by @bowenxia in #5122

Misc

Allow docker compose to work with docker-compose-mysql.yml on M1 by @ZackLK in #4983
Return early when there are no replication tasks by @vytautas-karpavicius in #4982
Update Cassandra deletes to use ALL consistency level by @Shaddoll in #4984
Make test should pass locally by @ZackLK in #4915
Immediate replication task hydration after successful transaction by @vytautas-karpavicius in #4980
Convert client peer resolving errors to service transient errors by @Shaddoll in #4993
Update idls by @Shaddoll in #4997
Fix history corruption check for workflow signaling by @Shaddoll in #4998
Introduce a dynamic config for cassandra all consistency level delete by @Shaddoll in #5000
Adds fix for domain ack level issue by @davidporter-id-au in #5001
Drop dynamic config for gRPC message size by @vytautas-karpavicius in #5002
Fix Cadence CLI by @Shaddoll in #5005
Re-enable workflow test by @Shaddoll in #5007
Add new unit test by @Shaddoll in #5008
Reformatting most things for go 1.19, rebuilding go.mod tools after clean, warning about different go versions by @Groxx in #5019
Enhance workflowDeletionTaskJitterRange to handle deletes piling up when many workflows have finished at the same time. by @ZackLK in #5020
Feature/min initial failover version by @davidporter-id-au in #5015
Fix Makefile OpenSearch rule name in CONTRIBUTING.md install guide, Fix OpenSearch version in dev Docker config by @charlese-instaclustr in #5004
Decouple StateBuilder from TaskGenerator by @vytautas-karpavicius in #4991
Removing unused code by @vytautas-karpavicius in #5024
Use internal IndexedValueType by @Shaddoll in #5016
Fix workflow cancellation by @Shaddoll in #5025
Add UpdateTime to uninitialized workflow execution record and update logic to set the update time by @neil-xie in #5014
Update DSL query to allow filtering by missing start time by @neil-xie in #5017
test: use T.TempDir to create temporary test directory by @Juneezee in #5013
Enable workflow corruption check for Describe and Query API by @Shaddoll in #5028
Remove unused watchdog signal by @demirkayaender in #5029
Add TLS ServerName as CLI option for Cadence Cassandra Tool by @sonpham96 in #5011
Add cli tls support by @charlese-instaclustr in #5027
Improve Cassandra errors for schema check by @mantas-sidlauskas in #5038
Fix SignalWithStartWorkflow by @Shaddoll in #5036
Fix error message by @ZackLK in #5045
Making a schema tooling concrete -> interface by @davidporter-id-au in #5046
Exposing the ability to pull CQL changesets by @davidporter-id-au in https://github.com/uber/ca...

Contributors

Groxx, davidporter-id-au, and 15 other contributors
