Releases: uber/cadence
v1.2.9
What's Changed
- Addition of tests for ArchivalConfigStateMachine in common/domain by @abhishekj720 in #5698
- Introduce new dynamic config for enabling wfID based ratelimiting by @jakobht in #5703
- Add unit tests for sql plugin registration by @Shaddoll in #5705
- Add unit tests for sql helper functions by @Shaddoll in #5706
- Add unit test for helper function of sql execution store by @Shaddoll in #5707
- Generate a metadata file artifact in unit test buildkite job by @taylanisikdemir in #5708
- Write tests for cdb.UpdateWorkflowExecutionWithTasks by @taylanisikdemir in #5709
- Add unit tests for helper functions in sql execution store util by @Shaddoll in #5710
- Add unit tests for CreateWorkflowExecution by @Shaddoll in #5715
- Test: Addition of tests for replicationQueue publish and publish to dlq by @abhishekj720 in #5700
- Implemented ratelimiting for external calls per wfid (guarded by feature flag) by @jakobht in #5704
- remove old metrics wrappers and use new generated metered wrappers by @3vilhamster in #5717
- Proper shutdown of kafka consumer impl and fix test by @taylanisikdemir in #5712
- Add additional unit tests for functions in constants.go by @timl3136 in #5713
- Initial codecov integration by @taylanisikdemir in #5711
- Add tests for UpdateWorkflowExecution by @Shaddoll in #5718
- Tests for UpdateWorkflowExecution in nosql store-Part1 by @agautam478 in #5719
- Add unit tests for ConflictResolveWorkflowExecution by @Shaddoll in #5721
- Add tests for elasticsearch v6 client by @neil-xie in #5716
- Add unit tests for persistence task types in DataManagerInterfaces by @timl3136 in #5720
- Add unit tests for CreateFailoverMarkerTasks by @Shaddoll in #5724
- Change noisy frontend poll timeout log to debug level by @taylanisikdemir in #5725
- Added unit tests for nosql_execution_Store_util.go - Part1 by @agautam478 in #5723
- Straightforwardly fixes a few minor copy bugs and adds a small fuzz util by @davidporter-id-au in #5572
- Add test for ES v6 client Search method by @neil-xie in #5727
- Tests for Common/Domain: Adding tests for replication queue message handling and ack update by @abhishekj720 in #5730
- Add more unit tests for persistence task types in DataManagerInterfaces by @timl3136 in #5726
- Added two more test cases for the updateworkflowexecution by @agautam478 in #5722
- [history] refactor history client with timeout wrapper by @shijiesheng in #5728
- Add unit tests for PinotVisibilityStore by @bowenxia in #5714
- Removed errors file from test coverage by @abhishekj720 in #5735
- Test for Common/domain/replication_queue: GetMessagesfromDLQ & AckLevel by @abhishekj720 in #5734
- Added unit tests for Delete current and workflow execution, list all … by @agautam478 in #5733
- Added unit tests for PrepareResetWorkflowExecutionRequestWithMapsAndE… by @agautam478 in #5731
- Adding more unit tests for ES v6 client by @neil-xie in #5739
- Tests for GetDLQAckLevel and UpdateDLQAckLevel by @abhishekj720 in #5740
- Add unit tests for TaskInfo types and utility functions by @timl3136 in #5732
- Tests for common/domain: tests TestGetDLQSize, TestRangeDeleteMessagesFromDLQ and TestDeleteMessageFromDLQ by @abhishekj720 in #5741
- Add error case tests for pinot_visibility_store by @bowenxia in #5746
- Add unit test for util methods in es v6 client bulk processor by @neil-xie in #5748
- Add unit tests for GetWorkflowExecution by @Shaddoll in #5736
- Adds test for execution/mutable_state_builder.go by @davidporter-id-au in #5744
- Add unit tests for the util functions in data_manager_interface by @timl3136 in #5742
- Very minor nil-or-empty cleanup by @Groxx in #5745
- Added more tests for nosql_execution_store.go by @agautam478 in #5738
- Write more tests for cassandra/workflows.go by @taylanisikdemir in #5750
- Added more tests for nosql_execution_store_util.go by @agautam478 in #5752
- Enforce leading space on comments by @Groxx in #5747
- Add unit tests for common/persistence/sql/factory.go by @Shaddoll in #5751
- [history] fix generated timeout wrapper by @shijiesheng in #5737
- Add unit tests for functions in gocql/batch.go by @timl3136 in #5759
- Add test for es v6 bulk processor by @neil-xie in #5758
- Added test for replicationTaskExecutor: execute by @abhishekj720 in #5754
- Add unit test for ES v7 client by @neil-xie in #5760
- Added test cases for more util methods by @agautam478 in #5755
- More unit tests for nosql_execution_store_test.go by @agautam478 in #5753
- Add unit test for pinot folder with coverage to 93.4% by @bowenxia in #5761
- [code-coverage] update admin and frontend client to use generated code by @ketsiambaku in #5702
- Tests for PurgeAckedMessages and replicationMessage in common/domain/replication_queue by @abhishekj720 in #5749
- Code cleanup for sql package by @Shaddoll in #5756
- Add unit test for es v7 bulk processor by @neil-xie in #5764
- Added test for pinot_visibility_metric_clients.go by @bowenxia in #5767
- adding mutable state builder tests - adding continue-as-new events by @davidporter-id-au in #5768
- Refactor/adding mutable state builder tests iv by @davidporter-id-au in #5769
- Add unit test for open search client part 1 by @neil-xie in #5774
- minor mutable-state log fix by @davidporter-id-au in #5776
- refactor common/persistence/pinot tests by @bowenxia in #5777
- Addition of tests for archivalConfigStateMachine in common/domain by @abhishekj720 in #5778
- Re-enable sql unit test by @Shaddoll in #5779
- Test: Validate domain config test for attrValidator by @abhishekj720 in #5699
- refactor pinot_visibility_store_test by @bowenxia in #5780
- [code-coverage] Generate code for matching client timeout wrapper by @ketsiambaku in #5771
- Fix data race in matching test suite by @taylanisikdemir in #5781
- hot fix for unit test cases that might cause a failure by @bowenxia in #5787
- Adding unit tests for TestPrepareTransferTasksForWorkflowTxn by @agautam478 in #5763
- Ignore requests send from pinot response comparator by @bowenxia in #5788
- Coverage for dataStoreInterfaces by @Groxx in #5743
- Retryable error for workflow rate limits in task processing by @sankari165 in #5782
- Re-enable kafka consumer test by @taylanisikdemir in #5791
- Global ratelimiter, part 1: core algorithm for computing weights by @Groxx in #5689
- Write tests for cassandra SelectWorkflowExecution by @taylanisikdemir in #5792
- Fix workflow deletion by @Shaddoll in #5793
- Fix checksum validation for SQL implementation by @Shaddoll in #5790
- added unit test for function in mapper-thrift-configstore file by @d-vignesh in #5789
- Error mapper tests by @jakobht in #5795
- Add a benchmark test for crc checksum by @Shaddoll in #5798
- Add metric and retry backoff for checksum failure by @Shaddoll in #5797
- Added new er...
v1.2.8
What's Changed
Added
- Adding unit-test for matching:newTaskListID by @dkrotx in #5513
- Get/Update DomainAsyncWorkflowConfiguration methods in admin API and CLI by @taylanisikdemir in #5616
- Workflow ID cache size metric by @jakobht in #5619
- Add a helper script to run cassandra and execute tests by @taylanisikdemir in #5620
- Scaffold StartWorkflowExecutionAsync API by @Shaddoll in #5621
- Scaffold async workflow queue provider component by @Shaddoll in #5627
- Update run_cass_and_test.sh script to setup cassandra schemas by @taylanisikdemir in #5628
- Add debug logs in PinotTripleVisibilityManager for response comparator testing by @bowenxia in #5631
- Adding a sample call to TaskValidator in update workflow cycle by @agautam478 in #5634
- Add a middleware for comparator to use by @bowenxia in #5637
- Generate rate limit frontend api handler by @Shaddoll in #5636
- Add generic OAuth support by @mantas-sidlauskas in #5638
- Added metrics for when we rate limit by @jakobht in #5640
- Implement StartWorkflowExecutionAsync API by @Shaddoll in #5642
- Added 2 more tags in log for comparator to use. by @bowenxia in #5646
- Async workflow request consumer manager in worker by @taylanisikdemir in #5655
- Add async workflow request consumer for Start/SignalWithStart support by @taylanisikdemir in #5658
- Set rate limit on Async APIs by @Shaddoll in #5659
- Implement SignalWithStartWorkflowExecutionAsync API by @Shaddoll in #5657
- Docker compose setup for async workflows with kafka queue by @taylanisikdemir in #5663
- Add a `make pr` target for an easy "do automated checks for PR" command by @Groxx in #5670
- Added debug information for decision timeout handling by @3vilhamster in #5674
- Async workflows integration test with kafka by @taylanisikdemir in #5678
- Add missing IsolationGroups field in domain cache entry by @taylanisikdemir in #5679
- Add close status parse method in pinot query validator by @neil-xie in #5680
- Add async workflow integration test step to CI by @taylanisikdemir in #5681
- Add metrics for external calls for the workflow ID specific rate limits by @jakobht in #5684
- Write tests for cdb (Cassandra DB wrapper) basic functions by @taylanisikdemir in #5686
- Added a unit test for nosql execution store - createworkflowexecution by @agautam478 in #5687
- Write tests for cdb.InsertWorkflowExecutionWithTasks by @taylanisikdemir in #5688
- Added more scenarios to createworkflowexecution test- Part1 by @agautam478 in #5690
- Added a test for the GetworkflowExecution in the nosql_execution_store.go file. by @agautam478 in #5692
- Write tests for cdb.SelectCurrentWorkflow by @taylanisikdemir in #5693
- Support AsyncWorkflowConfiguration decoding in admin CLI by @taylanisikdemir in #5694
Changed
- Replace JWT validation library by @mantas-sidlauskas in #5592
- feat: pprof support config host by @zedongh in #5601
- Refactor persistence serializer tests and add more cases by @taylanisikdemir in #5625
- Upgrade domain_config type in cassandra schema to add async wf config by @taylanisikdemir in #5630
- Refactor frontend API handler and use generated code to emit metrics by @Shaddoll in #5639
- Enable the workflow ID cache in shadow mode for start workflow by @jakobht in #5641
- Filtering the prefix in custom query log for pinot response comparator by @bowenxia in #5643
- The ratelimiter needs to be created with the domain name not the ID by @jakobht in #5644
- Update async workflow queue idl change by @Shaddoll in #5645
- Rewrite async workflow queue provider component by @Shaddoll in #5648
- Store mutable state checksum in SQL storage by @Shaddoll in #5649
- Splitting wfCacheEnabled config for internal and external requests by @sankari165 in #5647
- Convert pinot query to use unix milliseconds instead of nano by @neil-xie in #5650
- Emit metrics when transfer tasks could be ratelimited by @sankari165 in #5652
- Update change log for v1.2.7 release by @neil-xie in #5653
- Update pinot query validator to handle raw time string by @neil-xie in #5656
- Emit metrics when transfer tasks for decisions could be ratelimited by @sankari165 in #5665
- Upgrade pinot client version by @neil-xie in #5666
- Update the build-changed message failure by @Groxx in #5667
- Improve error message for membership resolver by @Shaddoll in #5669
- Emits a counter value for every unique view of the hashring by @davidporter-id-au in #5672
- Refactor history packages by @jakobht in #5673
- Improve test coverage for sql_execution_store_util by @Shaddoll in #5676
- Improve test coverage for sql_execution_store by @Shaddoll in #5677
- Improve test coverage for constants.go by @timl3136 in #5685
- Enable retry on mutable state checksum verification failure by @Shaddoll in #5691
Fixed
- Set proper max reset points by @neil-xie in #5623
- Put a timeout for timer task deletion loop during shutdown by @taylanisikdemir in #5626
- Catch unit test failures in make test by @Groxx in #5635
- fix: get messages between query over message_id typo by @zedongh in #5607
- Fix context leak in tests by @munahaf in #5377
- Make sure task processing rate limiter is only done in the active side by @sankari165 in #5654
- Fix Pinot query validator bug when user pass in not equal query with value missing by @neil-xie in #5662
- Update Pinot query validator failed log, minor refactor pinot visibility store to remove panics by @neil-xie in #5664
- Fix context leak in pinot integration test by @neil-xie in #5682
- Fix SignalWithStartWorkflow API by @Shaddoll in #5671
- Fix wrong migration paths in example by @kotcrab in #5668
- Fix comment in workflow id cache config by @sankari165 in #5661
- Fix the local integration test docker-compose file by @jakobht in #5695
- Do not get workflow execution from database when shard is closed by @Shaddoll in #5697
Removed
- Removed useless metrics tag from the workflowIDcache by @jakobht in #5651
- Removed the shadower service for cadence-server by @agautam478 in #5660
New Contributors
- @zedongh made their first contribution in #5607
- @munahaf made their first contribution in #5377
- @kotcrab made their first contribution in #5668
Full Changelog: v1.2.7...v1.2.8
v1.2.7
What's Changed
Added
- Add metrics to monitor task validation. by @agautam478 in #5466
- Add an "all results" query to scanner/fixer workflows by @Groxx in #5470
- Add retries into Scanner BlobWriter by @agautam478 in #5471
- Added a unit test for the BlobStoreWriter. by @agautam478 in #5472
- Add Debugf and some minor updates to timer queue processor base by @taylanisikdemir in #5475
- Add unit tests for cassandra workflow utils part-1 by @taylanisikdemir in #5476
- Add `workflow query-types` command to CLI by @arzonus in #5456
- Add unit test for cassandra workflow utils part-2 by @taylanisikdemir in #5480
- Unit tests for admin cli decode_thrift command by @taylanisikdemir in #5485
- Add unit test for sqlConfigStore by @Shaddoll in #5491
- Add unit test for mysql configstore by @Shaddoll in #5502
- Add persistence serialization unit tests by @3vilhamster in #5507
- Adding unit tests to workflowHandler_test.go by @sankari165 in #5500
- Add unit tests for AwaitWaitGroup by @arzonus in #5512
- Add unit test for sql domain store by @Shaddoll in #5508
- Add unit test for cassandra workflow utils part-3 by @taylanisikdemir in #5506
- Adding unit tests for RecordActivityTaskHeartbeat by @sankari165 in #5511
- add unit tests for ValidIDLength by @arzonus in #5520
- Test for rate limited wrappers around persistence clients by @3vilhamster in #5518
- Test for error injection clients by @3vilhamster in #5515
- Add unit test for sql history store by @Shaddoll in #5524
- Adding unit tests to RespondActivityTaskCompleted and RecordActivityT… by @sankari165 in #5521
- Add unit tests for IsEntityNotExistsError by @arzonus in #5528
- Add unit tests for CreateXXXRetryPolicy by @arzonus in #5527
- Add unit tests for ValidateRetryPolicy by @arzonus in #5529
- Add unit tests for ConvertGetTaskFailedCauseToErr by @arzonus in #5531
- Add unit tests for WorkflowIDToHistoryShard and DomainIDToHistoryShard by @arzonus in #5533
- Added a unit test for the timer.go file in reconciliation folder. by @agautam478 in #5505
- Adding logging to scanner.go by @agautam478 in #5535
- Adding a metric for hosts not being found in resolver by @davidporter-id-au in #5414
- Added logs to concrete_execution.go by @agautam478 in #5536
- Add unit tests for sql queue store by @Shaddoll in #5541
- Unit tests for timer/transfer queue processor pump loops by @taylanisikdemir in #5540
- Add unit tests for sql shard store by @Shaddoll in #5543
- Add unit test for kafka partition ack manager by @neil-xie in #5545
- Add unit tests for GenerateRandomString by @arzonus in #5532
- Add unit tests for IsValidContext by @arzonus in #5546
- Add unit tests for CreateChildContext by @arzonus in #5547
- Add unit tests for DeserializeSearchAttributeValue by @arzonus in #5548
- Add unit tests for GetSizeOfHistoryEvent by @arzonus in #5550
- Add unit tests for thrift mappers by @taylanisikdemir in #5542
- Add unit tests for sql task store by @Shaddoll in #5558
- Added logs into the current execution.go and a unit test by @agautam478 in #5555
- Add unit test for kafka producer impl by @neil-xie in #5559
- Add shard id to queue processor related metrics by @taylanisikdemir in #5557
- Add unit tests for sql execution store by @Shaddoll in #5565
- Add unit test for new Kafka client by @neil-xie in #5570
- Add unit tests for helper functions in sql execution store util by @Shaddoll in #5571
- Added tests for visibility sampling wrapper by @3vilhamster in #5564
- Add unit test for consumer impl by @neil-xie in #5573
- Add unit tests for workflow state non maps by @Shaddoll in #5578
- Add logs to debug timer tasks by @Shaddoll in #5581
- Added deprecated domain check to the taskvalidator by @agautam478 in #5580
- Add unit tests for IsServiceTransientError by @arzonus in #5551
- Add unit tests for IsAdvancedVisibilityWritingEnabled by @arzonus in #5552
- Add unit tests for ValidateLongPollXXX by @arzonus in #5553
- Add grafana dashboard to visualize persistence metrics for default docker-compose setup by @taylanisikdemir in #5582
- Add missing exclude-query support to list-workflows on the CLI by @Groxx in #5583
- Add unit tests for DurationToXXX and XXXToDuration by @arzonus in #5530
- Add more debug logs for user timer task execution by @taylanisikdemir in #5595
- Add cache for workflow specific in memory data by @jakobht in #5594
- Added three dynamic config properties by @jakobht in #5602
- add ContextKey Struct by @bowenxia in #5606
- Adding a stale workflow check to the taskvalidator and code cleanup. by @agautam478 in #5604
- Added more error handling in workflow cache by @jakobht in #5611
Fixed
- Improves metric and error handling for history by @davidporter-id-au in #5469
- Address map access data race in matching engine by @taylanisikdemir in #5477
- fix docker compose tests by @3vilhamster in #5479
- Fix copying suite.Suite in integration tests by @3vilhamster in #5481
- fix scavenger test suite by @3vilhamster in #5490
- fix scavenger suite by @3vilhamster in #5498
- Fixing matching:TestCheckIdleTaskList test flakiness by @dkrotx in #5494
- fix leaky goroutines in matching by @3vilhamster in #5499
- Unit test for the fetcher/current.go. by @agautam478 in #5504
- More fixes for golint.sh by @Groxx in #5519
- Fix race between startup and shutdown in task reader by @Groxx in #5522
- Ensure scanner scavenger stops in tests by @3vilhamster in #5510
- Bugfix/debugging stuck tasklist by @davidporter-id-au in #5436
- Fix multiple lock acquire on membership update by @3vilhamster in #5576
- Properly catch errors in ldflag-gathering and fail the build by @Groxx in #5539
- Addressed sync issue in workflow cache by @jakobht in #5605
- fix a comment by @bowenxia in #5610
- Fixed lint errors introduced in previous PR by @jakobht in #5613
Changed
- Update kafka config to have isSecure option by @neil-xie in #5473
- Minor change to include domainTag and pass domainName. by @agautam478 in #5468
- Wrap isSecure config in config map for kafka topic by @neil-xie in #5474
- Update changelog for v1.2.6 release by @neil-xie in #5478
- Unify cassandra setup in docker-compose by @3vilhamster in #5482
- Unify logging in tests by @3vilhamster in #5487
- Updated the unit test for BlobstoreIterator into a table format by @agautam478 in #5488
- update cassandra dev setup by @3vilhamster in #5501
- Converted the existing test for concrete.go execution into a table test by @agautam478 in #5503
- Improve logs/metrics of HandleDecisionTaskCompleted by @taylanisikdemir in #5497
- Revert gofuzz us...
v1.2.6
What's Changed
Added
- Added range query support for Pinot json index by @bowenxia (#5426)
- Implemented GetTaskListSize method at persistence layer by @Shaddoll (#5442, #5447)
- Added a framework for the Task validator service by @agautam478 (#5446)
- Added nit comments describing the Update workflow cycle by @agautam478 (#5432)
- Added log user query param by @bowenxia (#5437)
- Added CODEOWNERS file by @taylanisikdemir (#5453)
- Added a function to evict all elements older than the cache TTL by @jakobht (#5464)
Fixed
- Fixed workflow replication for reset workflow by @Shaddoll (#5412)
- Fixed visibility mode for admin when using Pinot visibility by @neil-xie (#5441)
- Fixed workflow started metric by @ketsiambaku (#5443)
- Fixed timer-fixer, unfortunately broken in 1.2.5 by @Groxx (#5433)
- Fixed confusing comment in matching handler by @jakobht (#5450)
Changed
- Cassandra version is changed from 3.11 to 4.1.3 by @taylanisikdemir (#5461)
  - If your machine already has the ubercadence/server:master-auto-setup image, you need to re-pull it so that it works with the latest docker-compose*.yml files
- Move dynamic ratelimiter to its own file by @jakobht (#5451)
- Create and use a limiter struct instead of just passing a function by @jakobht (#5454)
- Dynamic ratelimiter factories by @jakobht (#5455)
- Update github action for image publishing to released by @3vilhamster (#5460)
- Update matching to emit metric for tasklist backlog size by @Shaddoll (#5448)
- Change variable name from SecondsSinceEpoch into EventTimeMs by @bowenxia (#5463)
Removed
- Get rid of noisy task adding failure log in matching service by @taylanisikdemir (#5445)
Full Changelog: v1.2.5...v1.2.6
v1.2.5
What's Changed
Added
- Scanner / Fixer changes by @Groxx in #5361
  - Stale-workflow detection and cleanup added to shardscanner, disabled by default.
  - New dynamic config to better control scanner and fixer, particularly for concrete executions.
  - Documentation about how scanner/fixer work and how to control them, see the scanner readme.md
    - This also includes example config to enable the new fixer.
- MigrationChecker interface to expose migration CLI by @abhishekj720 in #5424
- Added Pinot as new visibility store option by @neil-xie in #5201
  - Added pinot visibility triple manager to provide options to write to both ES and Pinot.
  - Added pinotVisibilityStore and pinotClient to support CRUD operations for Pinot.
  - Added pinot integration test to set up Pinot test cluster and test Pinot functionality.
Fixed
- Fix CreateWorkflowModeContinueAsNew for SQL by @Shaddoll in #5413
- Fix CLI count&list workflows error message by @ketsiambaku in #5417
- Hotfix for async matching for isolation-group redirection by @davidporter-id-au in #5423
- Fix closeStatus for --format flag by @ketsiambaku in #5422
Full Changelog: v1.2.4...v1.2.5-prerelease3
v1.2.4
What's Changed
- Remove database check for config store tests by @Shaddoll in #5401
- Fix persistence tests setup by @Shaddoll in #5402
- Implement config store for MySQL by @Shaddoll in #5403
- Retract v1.2.3 by @sankari165 in #5406
- Implement config store for PostgreSQL by @Shaddoll in #5405
- Release v1.2.4 by @Shaddoll in #5407
Full Changelog: v1.2.3...v1.2.4
v1.2.3 (Retracted, please use v1.2.4)
Added
- Expose workflow history size and count to client by @timl3136 (#5392)
Fixed
- [cadence-cli] fix typo in input flag for parallelism by @sankari165 (#5397)
Changed
- Update config store client to support SQL database by @Shaddoll (#5395)
- Scaffold config store for sql plugins by @Shaddoll (#5396)
- Improve poller detection for isolation by @Shaddoll (#5399)
v1.2.2
What's Changed
- add a update workflow execution count metric for RI by @allenchen2244 in #5386
- Pass partition config and isolation group to history/matching even if isolation is disabled by @Shaddoll in #5385
- [CLI] fix nil pointer issue in domain migration command rendering by @shijiesheng in #5378
- Release v1.2.2 by @shijiesheng in #5388
Full Changelog: v1.2.1...v1.2.2
v1.2.1
Project release: Zonal isolation
This version introduces a few resiliency concepts into customers' worker task processing such that they can detect deployment or configuration failures earlier. These features are opt-in.
The high-level concept is to provide a means to subdivide work for workers (into 'isolation-groups') along whatever partitioning mechanism your service requires.
By default, the partitioning mechanism provided will attempt to keep workflows running in the location where they were started, so that customers can identify broken changes earlier rather than waiting for the deployment of an entire region. However, if there are no pollers available in that subdivision, it'll route the work elsewhere.
Nomenclature
Partitioning: A means to subdivide the tasks given to workflows. Many schemes are possible, and one default is provided. When a workflow is started, a group of partition keys is provided by request headers. The partition keys are used to determine which isolation group of workers should process the workflow.
Workflow pinning: A partitioning scheme which emphasizes keeping workflows running in the location where they were started.
Isolation-groups: A division of work within a customer region in which they can subdivide their workers and pin the workflows. This was originally intended as a synonym for 'zone' in the site reliability sense, as a subdivision of a region. However, the important point is that this is a failure domain for customer workflows, so it may be an arbitrary subdivision of your cluster's traffic.
Isolation-group drain: A means of excluding work from an isolation-group. If an isolation group is drained, workers from that isolation group won't be able to get any tasks, and customers cannot start workflows from that isolation group.
Default concepts and approaches
The partitioning and isolation concepts are intended to be flexible, general-purpose orchestration concepts, with some basic defaults provided. By default, the following behaviour is given:
- Partition data is persisted with workflow execution records by the provided middleware if the provided header is passed when workflows are created.
- The cadence client and worker Go libraries will pass these as headers if provided in client options
Pinning behaviour
The workflow's original zone is captured at workflow start and is used during workflow processing.
The default partitioner provides the following behaviour: it will attempt to dispatch work in the zone where the workflow was started. However, workers may not be available in that zone, or may no longer be available for some reason. So the partitioner takes information from a lookback of poller information and uses this lookback data to ensure that the workflow can be processed. If the start isolation-group is not available, it'll pick another healthy one at random.
'Health', here, is determined as the presence of pollers and the absence of drains.
The 'unpinning' is important for two main reasons. Firstly, it's quite possible to start a workflow from an isolation-group unrelated to the one in which the pollers are created, and suddenly blackholing that work would likely not be the desired behaviour. Secondly, and probably more importantly, this prevents a head-of-line blocking problem internally for Cadence. At the database level (in this release, anyway) tasks need to be dispatched in order, so if an isolation-group were not being processed it would block task processing.
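The pinning and unpinning rules described above can be sketched roughly as follows. This is an illustrative sketch only, not Cadence's actual partitioner: the type `IsolationGroupState`, the function `PickIsolationGroup`, and the injected `randIntn` parameter are hypothetical names chosen for this example, and returning `false` when no group is healthy is an assumption, since the notes above do not say what happens in that case.

```go
package main

import "sort"

// IsolationGroupState is a hypothetical summary of one isolation-group,
// built from the poller lookback and the drain list.
type IsolationGroupState struct {
	HasRecentPollers bool // pollers were seen within the lookback window
	Drained          bool // the group has been excluded by a drain
}

// healthy follows the definition in the text: pollers present, no drain.
func (s IsolationGroupState) healthy() bool {
	return s.HasRecentPollers && !s.Drained
}

// PickIsolationGroup returns the group a task should be dispatched to.
// It keeps the workflow pinned to its start group when that group is
// healthy, and otherwise falls back to a randomly chosen healthy group.
// randIntn(n) must return an index in [0, n); math/rand's rand.Intn fits.
// The second result is false when no healthy group exists (an assumption
// made for this sketch).
func PickIsolationGroup(
	start string,
	groups map[string]IsolationGroupState,
	randIntn func(n int) int,
) (string, bool) {
	// Pinned case: the start group is known and healthy.
	if st, found := groups[start]; found && st.healthy() {
		return start, true
	}

	// Unpinned case: collect every healthy group.
	healthy := make([]string, 0, len(groups))
	for name, st := range groups {
		if st.healthy() {
			healthy = append(healthy, name)
		}
	}
	if len(healthy) == 0 {
		return "", false
	}

	// Map iteration order is random in Go, so sort before indexing to
	// keep the choice a function of randIntn alone.
	sort.Strings(healthy)
	return healthy[randIntn(len(healthy))], true
}
```

In a real caller, `rand.Intn` from `math/rand` could be passed as `randIntn`; injecting it keeps the fallback choice easy to test.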
Drains
This release also introduces a simplistic notion of drains, which allow isolation-groups to be excluded from traffic processing, should that be required. Drains can be issued via the Admin API or via the CLI, for example:
cadence admin isolation-groups update-global --set-drains zone-1
cadence admin isolation-groups get-global
This information is stored in the config-store and is not part of dynamic configuration.
Configuration
To use this feature, the following configuration is required:
- `system.allIsolationGroups`: A list of all the possible isolation-groups
- `system.enableTasklistIsolation`: A bool flag that enables the feature for a domain
Implementation
The changes for this feature are largely in Matching and can be (reductively) described as follows: Sync-match and Async-match in Cadence are made aware of a new dimension, their associated isolation-group. Tasks piped through the Matching service are matched on the appropriate isolation-group channel.
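As a rough illustration of the "one channel per isolation-group" idea, the sketch below routes each task onto the channel for its isolation-group, and a poller only reads from the channel for its own group. It is a simplified model under stated assumptions, not the Matching service's real code: `Task`, `Matcher`, `NewMatcher`, `Offer`, and `Poll` are hypothetical names, and the non-blocking behaviour is a simplification.

```go
package main

// Task is a hypothetical task that already carries the isolation-group
// chosen for it by the partitioner.
type Task struct {
	ID             string
	IsolationGroup string
}

// Matcher holds one buffered channel per isolation-group.
type Matcher struct {
	channels map[string]chan Task
}

// NewMatcher creates a channel for every known isolation-group.
func NewMatcher(groups []string, buffer int) *Matcher {
	m := &Matcher{channels: make(map[string]chan Task, len(groups))}
	for _, g := range groups {
		m.channels[g] = make(chan Task, buffer)
	}
	return m
}

// Offer tries to place the task on its group's channel without blocking.
// It reports false if the group is unknown or the channel is full.
func (m *Matcher) Offer(task Task) bool {
	ch, ok := m.channels[task.IsolationGroup]
	if !ok {
		return false
	}
	select {
	case ch <- task:
		return true
	default:
		return false
	}
}

// Poll returns the next task queued for the poller's own group, if any.
// Tasks queued for other groups are never returned.
func (m *Matcher) Poll(group string) (Task, bool) {
	ch, ok := m.channels[group]
	if !ok {
		return Task{}, false
	}
	select {
	case task := <-ch:
		return task, true
	default:
		return Task{}, false
	}
}
```

With this shape, a task offered for `zone-1` is invisible to a poller in `zone-2`, which is the isolation property the feature is after; rerouting work away from an unhealthy group is the partitioner's job, not the channel's.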
What's Changed
- Set config for shardscanner fixer by @mantas-sidlauskas in #3844
- Fix get raw history for transient decision by @yycptt in #3847
- Fix error handling when processing parent close policy by @yycptt in #3845
- Add logging/metrics for decision attempts by @yycptt in #3849
- Switch to gocql interface by @yycptt in #3837
- Fix NPE in DescribeMutableState by @yycptt in #3850
- Switch the remaining history component to internal types by @vytautas-karpavicius in #3843
- Switch Health status endpoints to internal types by @vytautas-karpavicius in #3842
- reset workflow with no decision task complete by @yux0 in #3687
- error check before return the ActivityLocalDispatchInfo by @mkolodezny in #3853
- Delete unused dynamic configs that have no reference anymore by @longquanzheng in #3859
- Merge sql updates: Blob size increase by @yux0 in #3858
- Handle matching task list conditional error by @yux0 in #3867
- Fix go-generate by @yycptt in #3864
- Support visibility query with close status represented in string by @yycptt in #3865
- Add timers shardscanner by @mantas-sidlauskas in #3846
- replace string based logging with tagged logs by @mantas-sidlauskas in #3871
- Downgrade golang tools version by @yycptt in #3876
- Add instructions to setup local MySQL and Postgres by @yux0 in #3868
- Make max activity schedule to start timeout for retry configurable by domain by @yycptt in #3878
- Task processing debug logs by @yycptt in #3877
- Transfer queue validator by @yycptt in #3875
- Pick sql index changes by @yux0 in #3866
- Remove strict sanity check to allow reset by @yux0 in #3879
- Improve shard context timeout handling by @yycptt in #3881
- Add domain name tag in failover metrics by @yux0 in #3882
- break out when response is nil by @mantas-sidlauskas in #3886
- Allow using Kafka TLS without cert ca and key by @longquanzheng in #3862
- Fix dynamic config collection logValue function by @yycptt in #3880
- Update read DLQ messages API to return raw task info by @yux0 in #3869
- break if adminClient returns error by @mantas-sidlauskas in #3887
- Latest idl by @yux0 in #3888
- Fix activity lost metrics by @yycptt in #3889
- Add replication error logging and metrics by @yux0 in #3891
- Simplify templateGetLastMessageIDQuery sql query by @andrewjdawson2016 in #3890
- Add task processing workflow busy metric by @yycptt in #3892
- CLI 0.18.0 release by @yycptt in #3896
- Handle data corruption error in replication by @yux0 in #3895
- Add a "help" target to the makefile by @Groxx in #3898
- Initial protobuf types and API by @vytautas-karpavicius in #3863
- Fix workflow reset command by @yycptt in #3904
- CLI 0.18.1 patch release by @yycptt in #3908
- Use GetDomainName instead of GetDomainByID for retrieving domain names by @yycptt in #3899
- Start enabled shardscanner fixers by @mantas-sidlauskas in #3906
- Switch to protoc-gen-go by @vytautas-karpavicius in #3905
- Fix scan unsupported workflow in SQL DB by @yux0 in #3909
- Makefile cleanup / thrift revamp / gobin removed by @Groxx in #3903
- Version goveralls, remove unused go bins from docker setup by @Groxx in #3913
- Remove duplicate doc...
v1.0.0
We are v1.0! (with a schema upgrade)
What does this mean?!
Not much. Primarily that we are declaring "it's stable and in use" more visibly, because we continually get questions about this :) A larger public announcement / state-of-the-project is in the works.
Importantly, v1.0 does not imply any change to backwards compatibility (the minimum supported client version has not changed), RPC compatibility (ditto, all changes are backwards compatible), or Go API compatibility (this is not truly a library, Go compatibility is not a goal).
Going by previous version patterns, this would have been labeled v0.26.0 as it is a relatively incremental change (plus schema changes) from v0.25.0. As such, some strings still reference "0.26", because this older SHA is the one we have been using the most internally.
These strings will be updated and validated soon, and will likely be released as v1.0.1. This should have no behavioral impact at all, but will be visible in metrics, logs, and display strings.
What do I need to do to upgrade?
Schema upgrades needed
There have been schema changes to both normal and visibility datastores, primarily to provide better data for cleanup and hot-shard detection:
- Update-time additions by @neil-xie in #4962 and #4971
- Add FirstExecutionRunID to mutable state by @Shaddoll in #5031
- Shard ID visibility additions by @allenchen2244 in #5099 and #5123
These were intentionally kept out of v0.25.0 to keep that upgrade simple, as they were not fully utilized yet.
Replication cache recommendation
We have internally disabled the replication cache (history.replicatorCacheCapacity dynamic config set to 0), due to unexpectedly large memory use under abnormal load, and you may wish to do so as well.
We did not encounter any misbehavior, and it did reduce database load as intended, but we intend to make some changes to it to estimate and constrain memory use before re-enabling.
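If you use the file-based dynamic config client, disabling the cache might look roughly like the fragment below. This is a sketch, not a verified configuration: only the key name and the value 0 come from these notes, while the file layout and the empty constraints block are assumptions based on the sample dynamic config files shipped in the repository. Other dynamic config backends will need the equivalent setting in their own format.

```yaml
# Assumed layout of a file-based dynamic config entry.
# Only the key name and the value 0 are taken from the release notes.
history.replicatorCacheCapacity:
- value: 0
  constraints: {}
```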
What has changed?
At a very high level, we've been focused on:
- Internal scaling challenges, both improving bottlenecks and improving our ability to accurately identify bottlenecks
- Many metrics, logs, and refactors are at least somewhat related to this
- Our multi-cluster support is improved in particular, as we have been connecting clusters and moving many domains to spread load more evenly
- Database corruptions, as our Cassandra clusters have had some problems that have caused issues for months
- Many logs, scanner, and stale-task changes are related to this, e.g. to detect and remove invalid data
- Scaling up the team
- More changes to come!
Some loosely categorized PRs that were included follow:
Critical bugfixes (resolving issues in v0.25.0)
- Fix ndc flush buffered events by @Shaddoll in #5009
- Hotfix a replication panic causing crashes by @davidporter-id-au in #5074
- Resolve an infinite loop around impossible cron schedules by @Groxx in #5097
Parent-close-policies apply to child workflows even after they reset/continue-as-new/etc
- Update parent close policy to terminate/cancel child workflows even after continue as new by @Shaddoll in #5032
- This requires new stored data, so it does not apply to child workflows started before this version.
Better config introspection
- Config store CLI: make value required when updating by @mantas-sidlauskas in #5089
- CLI: print all available dynamic config keys by @mantas-sidlauskas in #5090
Schemas are now available via the go module, as go:embed files
- Embed schema files by @Shaddoll in #5040
- Embed elasticsearch index templates by @Shaddoll in #5043
- Fix ES embedding by @Shaddoll in #5056
Enhancing existing metrics and logging (and more included in other PRs)
- Reduce metrics cardinality replication.TaskStore by @vytautas-karpavicius in #4981
- Add Metric Emitter, which right now emits a metric once a minute for true replication lag in nanoseconds. by @ZackLK in #4979
- Added logs for domainName empty situation by @abhishekj720 in #4987
- Improve logs for task executor by @Shaddoll in #4989
- Add domain_type and cluster_groups tags by @vytautas-karpavicius in #4990
- Introduce per domain metrics by @Shaddoll in #5012
- Improve logs for transfer task validator by @Shaddoll in #5044
- Make replication log error message better by @davidporter-id-au in #5052
- Wf version metrics by @allenchen2244 in #5041
- Add domain tag to unregistered field error by @neil-xie in #5070
- UpdateWorkflow ShardId based metrics by @allenchen2244 in #5080
- Emit workflow counts per workflow type metrics by @neil-xie in #5082
- Use zap logger when initialising dynamic config by @mantas-sidlauskas in #5081
- add 3 tags to support adding logs for every manual access by @bowenxia in #5112
- Add sample log and dynamic config for updateworkflowexecution hot shard detection by @allenchen2244 in #5120
- Add attempt-count to task processing logs, and update unit test so that it will cover deadlock by @bowenxia in #5122
Misc
- Allow docker compose to work with docker-compose-mysql.yml on M1 by @ZackLK in #4983
- Return early when there are no replication tasks by @vytautas-karpavicius in #4982
- Update Cassandra deletes to use ALL consistency level by @Shaddoll in #4984
- Make test should pass locally by @ZackLK in #4915
- Immediate replication task hydration after successful transaction by @vytautas-karpavicius in #4980
- Convert client peer resolving errors to service transient errors by @Shaddoll in #4993
- Update idls by @Shaddoll in #4997
- Fix history corruption check for workflow signaling by @Shaddoll in #4998
- Introduce a dynamic config for cassandra all consistency level delete by @Shaddoll in #5000
- Adds fix for domain ack level issue by @davidporter-id-au in #5001
- Drop dynamic config for gRPC message size by @vytautas-karpavicius in #5002
- Fix Cadence CLI by @Shaddoll in #5005
- Re-enable workflow test by @Shaddoll in #5007
- Add new unit test by @Shaddoll in #5008
- Reformatting most things for go 1.19, rebuilding go.mod tools after clean, warning about different go versions by @Groxx in #5019
- Enhance workflowDeletionTaskJitterRange to handle deletes piling up when many workflows have finished at the same time. by @ZackLK in #5020
- Feature/min initial failover version by @davidporter-id-au in #5015
- Fix Makefile OpenSearch rule name in CONTRIBUTING.md install guide, Fix OpenSearch version in dev Docker config by @charlese-instaclustr in #5004
- Decouple StateBuilder from TaskGenerator by @vytautas-karpavicius in #4991
- Removing unused code by @vytautas-karpavicius in #5024
- Use internal IndexedValueType by @Shaddoll in #5016
- Fix workflow cancellation by @Shaddoll in #5025
- Add UpdateTime to uninitialized workflow execution record and update logic to set the update time by @neil-xie in #5014
- Update DSL query to allow filtering by missing start time by @neil-xie in #5017
- test: use T.TempDir to create temporary test directory by @Juneezee in #5013
- Enable workflow corruption check for Describe and Query API by @Shaddoll in #5028
- Remove unused watchdog signal by @demirkayaender in #5029
- Add TLS ServerName as CLI option for Cadence Cassandra Tool by @sonpham96 in #5011
- Add cli tls support by @charlese-instaclustr in #5027
- Improve Cassandra errors for schema check by @mantas-sidlauskas in #5038
- Fix SignalWithStartWorkflow by @Shaddoll in #5036
- Fix error message by @ZackLK in #5045
- Making a schema tooling concrete -> interface by @davidporter-id-au in #5046
- Exposing the ability to pull CQL changesets by @davidporter-id-au in https://github.com/uber/ca...