Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jepsen: Sets test flaky #17491

Closed
bdarnell opened this issue Aug 7, 2017 · 3 comments · Fixed by #17701
Closed

jepsen: Sets test flaky #17491

bdarnell opened this issue Aug 7, 2017 · 3 comments · Fixed by #17701
Assignees
Milestone

Comments

@bdarnell
Copy link
Contributor

bdarnell commented Aug 7, 2017

The Jepsen sets test is failing about half the time (under the start-kill-2 and majority-ring+start-kill-2 nemeses). The final output looks like this, which appears to be a real inconsistency:

INFO [2017-08-06 08:09:56,963] main - jepsen.core {:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :details
 {:duplicates [2568],
  :valid? false,
  :revived "#{}",
  :lost "#{}",
  :recovered
  "#{995 1519 1530 1532 1542 1557 1571 1721 1744 1758 1763 1766 1792 1909 2084 2086..2087 2094..2097 2100 2105 2107 2109 2112 2196 2200 2202 2205 2209..2210 2213}",
  :ok
  "#{0..327 331 333 335..340 342 344 346 349..353 355..356 359..363 365..368 371..372 374..375 379..380 382 384..389 392..400 402 404..406 408 410..412 414..417 419..420 422..426 428 430 432..443 449..456 458 461..462 464..467 469..472 475..561 564..570 572..573 575 580 582..585 588..593 595 598..600 602 605..607 609..610 612..616 618..620 623..630 633..634 636..638 640..647 650..651 653..659 661..666 669 671..673 675..677 679..682 684..685 687..688 690..693 695..705 707 709..712 714..716 720..727 729..736 738..743 745..991 993..994 997..999 1002..1005 1013..1014 1016 1018 1020..1023 1025 1031 1078 1080 1082 1084 1086..1092 1095..1097 1099..1101 1103 1105 1108..1109 1111..1119 1121 1123..1124 1127..1132 1135..1142 1144..1152 1154..1156 1158 1162..1165 1167..1175 1177..1178 1180..1184 1186..1191 1194 1198..1199 1203 1205..1215 1217 1220..1228 1230 1232..1236 1238..1239 1241..1242 1244..1257 1261 1263..1264 1268 1270..1274 1276..1284 1286..1287 1289..1298 1300..1303 1306 1309 1311..1312 1314 1316 1319 1321..1336 1338 1340..1343 1346..1352 1354 1356 1358 1361..1363 1365 1367..1369 1371..1373 1375..1377 1379..1380 1382..1400 1406..1408 1410..1415 1417 1419..1424 1426 1428..1436 1438..1445 1447..1453 1457..1463 1465..1466 1472..1473 1475..1485 1487..1488 1490..1492 1494..1500 1502..1503 1505 1507 1510 1512..1514 1516..1517 1520..1529 1531 1533..1534 1536..1538 1544..1556 1559..1570 1572..1575 1577..1712 1716 1720 1722..1723 1725..1726 1728..1729 1732 1734..1736 1738..1743 1745..1749 1753 1756..1757 1759 1761..1762 1764 1767..1770 1775..1776 1778..1780 1782..1783 1785..1788 1791 1793..1798 1800..1801 1803..1805 1807..1808 1810 1812..1816 1818..1819 1822 1824..1825 1828 1830..1833 1836..1837 1841..1842 1869 1873..1876 1878 1880 1883 1886..1888 1891..1892 1894 1896..1902 1904 1906..1908 1910 1913 1917..1920 1923 1926..1927 1930..1932 1935 1937..1945 1947..1949 1952 1955..1958 1960 1962..1963 1968 1972..1982 1986..1989 1991..1994 1996..1997 2001..2005 2008..2012 2014..2015 2017..2019 2024..2042 2044..2046 2048 2051..2057 2059..2061 2063..2067 2070..2073 2075..2076 2078 2080..2082 2215 2218 2222..2224 2261 2264 2273..2275 2277 2279..2282 2284..2288 2290..2293 2295..2305 2307..2315 2317..2568 2570 2573 2575 2577..2578 2580..2583 2585..2586 2590..2593 2601 2604 2643..2644 2646..2648 2650 2652 2654..2656 2658 2660..2665 2667 2669 2672..2676 2678..2679 2682..2684 2686..2689 2691 2694..2696 2698..2703 2705..2706 2708..2709 2711..2720 2722..2730 2732..2733 2735..2736 2739..2740 2742..2743 2745..2747 2749..2755 2757..2760 2763 2766 2768..2769 2771..2773 2775..2780 2782..2785 2787..2792 2797..2801 2803..2812 2816..2819 2821..2826 2828 2830..2832 2834..2835 2837..2845 2847..2853 2855..2857 2859..2861 2863..2864 2868..2869 2872..2879 2881..2883 2885 2889..2890 2892..2900 2902 2904..2909 2911 2913..2914 2916..2920 2922..2926 2929 2931 2933..2936 2939..2940 2942..2944 2947..2952 2954..2956 2958..2963 2966 2968..2977 2979..2982 2984..2986 2988..2990 2992 2994 2996 2998..2999 3001..3473}",
  :recovered-frac 11/1158,
  :unexpected-frac 0,
  :revived-frac 0,
  :unexpected "#{}",
  :lost-frac 0,
  :ok-frac 2695/3474},
 :valid? false}


Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
@bdarnell bdarnell added this to the 1.1 milestone Aug 7, 2017
@bdarnell bdarnell self-assigned this Aug 7, 2017
@bdarnell
Copy link
Contributor Author

bdarnell commented Aug 7, 2017

The sets test was the one that prompted the introduction of AmbiguousResultError. It performs a series of insert statements (to a table with the auto-generated unique_rowid() primary key) and verifies that A) no inserts which reported success to the client were lost and B) no insert was executed more than once. Here, an insert is being executed more than once.

This appears to have started on Aug 1 (we don't have much history to work with because the tests had other issues prior to july 25, but it had several green days in a row before it started failing frequently on aug 1). That would put the culprit somewhere in this commit range.

@bdarnell
Copy link
Contributor Author

bdarnell commented Aug 9, 2017

That commit range is a red herring. The failure is non-deterministic and was still present during the four consecutive green runs we had. That means it's been present since before the most recent fixes to the tests, so it's going to be mixed in with other failures and bisecting will not be very efficient in tracking it down.

@bdarnell
Copy link
Contributor Author

The problem is here:

// The Unavailable code is used by GRPC to indicate that a
// request fails fast and is not sent, so we can be sure there
// is no ambiguity on these errors. Note that these are common
// if a node is down.
// See https://github.com/grpc/grpc-go/blob/52f6504dc290bd928a8139ba94e3ab32ed9a6273/call.go#L182
// See https://github.com/grpc/grpc-go/blob/52f6504dc290bd928a8139ba94e3ab32ed9a6273/stream.go#L158
if haveCommit && grpc.Code(err) != codes.Unavailable {

We assume that GRPC's Unavailable error is unambiguous (indicating a fail-fast before anything went out on the wire) and allows us to try again, but this is not true. When a server drains, it also sends the Unavailable error code.

(Why do we see ErrStreamDrain in the jepsen tests (which use kill -9)? It's a part of grpc/grpc-go#1147, which I haven't fully understood yet but tries to disconnect clients that send too many HTTP/2 "ping" frames. This results in an EnhanceYourCalm message in the logs, which we see occasionally. This needs further investigation)

Treating GPRC Unavailable errors as ambiguous should be a quick fix for the jepsen failure (I'll test this), although I think it will increase the rate of ambiguous errors when nodes fail (I'm not sure how effectively we filter out dead nodes instead of relying on this error code). We may need upstream GRPC changes to distinguish ambiguous failures from unambiguous ones.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Aug 16, 2017
This error code is used for fail-fast errors (which can be retried
unambiguously), but it is also used in other cases (such as a server
draining) in which we cannot assume that the previous attempt was not
completed. (It's unclear whether this assumption was once true and
changed or if it's always been incorrect. The specific source of
ambiguous Unavailable errors we're seeing is grpc/grpc-go#1147)

This is expected to increase prevalence of AmbiguousResultErrors; this
will be fixed in a follow-up change.

Fixes cockroachdb#17491
bdarnell added a commit to bdarnell/cockroach that referenced this issue Aug 16, 2017
This error code is used for fail-fast errors (which can be retried
unambiguously), but it is also used in other cases (such as a server
draining) in which we cannot assume that the previous attempt was not
completed. (It's unclear whether this assumption was once true and
changed or if it's always been incorrect. The specific source of
ambiguous Unavailable errors we're seeing is grpc/grpc-go#1147)

This is expected to increase prevalence of AmbiguousResultErrors; this
will be fixed in a follow-up change.

Fixes cockroachdb#17491
bdarnell added a commit to bdarnell/cockroach that referenced this issue Aug 17, 2017
This error code is used for fail-fast errors (which can be retried
unambiguously), but it is also used in other cases (such as a server
draining) in which we cannot assume that the previous attempt was not
completed. (It's unclear whether this assumption was once true and
changed or if it's always been incorrect. The specific source of
ambiguous Unavailable errors we're seeing is grpc/grpc-go#1147)

This is expected to increase prevalence of AmbiguousResultErrors; this
will be fixed in a follow-up change.

Fixes cockroachdb#17491
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant