Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make SMP squid reports YAML-compliant #1784

Draft
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

kinkie
Copy link
Contributor

@kinkie kinkie commented Apr 17, 2024

Format output from each kid as its own separate YAML document

@kinkie
Copy link
Contributor Author

kinkie commented Apr 17, 2024

This solution works by having one yaml document per kid. It is legal and works really well for humans (see an easy-read document). It is very simple to implement and works quite nicely with the current model. It can even work right now for non-YAML reports. The ability by YAML parsers to interpret multiple documents depends on the parser itself, but it is in the end part of the spec.

Another option to get the result could be to use a yaml map, keyed on "kid :" and indenting all output of reports up by one level. The indentation could be achieved in two ways I can think of:

  • turning spaces() or its successor into something stateful, so in case of SMP squid it can be instructed to indent more
  • having spaces() blindly add two additional spaces in case of SMP squid

What do you think?

@kinkie
Copy link
Contributor Author

kinkie commented Apr 17, 2024

The case of SMP-aware reports or non-SMP squid can be handled already, by not emitting "kid #" at all

@kinkie
Copy link
Contributor Author

kinkie commented Apr 17, 2024

Note: this PR does NOT depend on 1735 or any other. It works with the current non-converted reports as well

Copy link
Contributor

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review does not approve or reject any formatting choices; I am discussing pros and cons instead.

Implementation bugs identified in this review can wait until we settle on the format.

@@ -50,9 +50,7 @@ Mgr::FunAction::dump(StoreEntry* entry)
debugs(16, 5, MYNAME);
Must(entry != nullptr);
if (UsingSmp())
storeAppendPrintf(entry, "by kid%d {\n", KidIdentifier);
storeAppendPrintf(entry, "---\nkid: %d\n", KidIdentifier);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution adds a kid or similar field to SMP kid documents and only those documents. Ideally, we want a set of YAML document fields to be constant, regardless of SMP mode. This wrinkle is not a showstopper IMO, but it is a (small) argument against multidocument formats (unless YAML supports some kind of metadata that can name documents).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't.
For instance, in the updated server_list report, we do not emit ICP or HTCP stats if they are not enabled.

In all libraries I've tested, multiple documents get rendered as vectors of documents, and the size of this vector is the authority information for the number of kids. Specifying a kid number is useful because kids might be reordered in different runs but I believe we can safely skip it when there is only one. It would be useful to mark aggregate statistics as such but that's a separate topic

Copy link
Contributor

@rousskov rousskov Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't. For instance, in the updated server_list report, we do not emit ICP or HTCP stats if they are not enabled.

You are right that the set of fields, in the sense you are interpreting that phrase, cannot be constant. I agree that we can treat kid identification as an optional field that may be absent from some reports.

Specifying a kid number is useful because kids might be reordered in different runs

I agree that reporting kid identifier is useful (mostly for humans), even when that information is redundant. I do not think kid reports might be reordered in the current implementation. However, even if we manage to preserve the natural order forever, some kid reports may be missing (because a kid has not started yet or was not revived after death).

It would be useful to mark aggregate statistics as such but that's a separate topic

Agreed on both counts, but the two topics intersect because we may decide to use the same field to report all three variations (option A):

  • (no kids field): non-aggregated non-SMP output
  • kids: [2]: non-aggregated SMP output (produced by kid2)
  • kids: [1,2,3]: aggregated SMP output (aggregated based on kid1, kid2, and kid3 information)
  • kids: [2]: aggregated SMP output (aggregated based on kid2 information)

Cons: No way to distinguish reports aggregated across a single kid (yes, that may happen; see "missing" above for one example) from non-aggregated SMP reports, as illustrated by the second and last bullet using identical spelling. Since aggregation is a complicated action even when one kid is involved, this difference may be important enough to signal about.

The alternative (option B) is to use two different fields:

  • (no kid or kids field): non-aggregated non-SMP output
  • kid: 2: non-aggregated SMP output (produced by kid2)
  • kids: [1,2,3]: aggregated SMP output (aggregated based on kid1, kid2, and kid3 information)
  • kids: [2]: aggregated SMP output (aggregated based on kid2 information)

Cons: Consumers will see a change in the report field name (from kid to kids) as we aggregate more reports (this kind of goes back to my earlier poorly worded argument about the constant set of fields being a desirable design goal).

If we pick option A, then this PR code will be affected.

My current weak preference is for option B: I think option A "cons" is (more) significant.

What do you think?

Copy link
Contributor Author

@kinkie kinkie Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put forward another option, and a proposal for a strategy:

  • kids: 2 for non-aggregated SMP output
  • kids: all for aggregated output
  • no kids key at all for output that needs no aggregation

I don't expect there will ever be a case where we havde a partial collection of kids, therefore listing the kids is imposing unnecessary complexity on the users

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put forward another option, and a proposal for a strategy:

I have added quotes to bullets to show where I think the examples start/end in your proposal.

  • "kids: 2" for non-aggregated SMP output

"kids: 2" reads like "two kids" to me which is kind of the opposite of the intended (AFAICT) "the second kid" or, more precisely, "kid identified as kid2" meaning.

  • "kids: all" for aggregated output

IMO, from machine consumer point of view, we should not change the type of the value from number (e.g., "2" in non-aggregated output) to a string (i.e. "all"), especially when the reverse conversion from the string to a number is impossible.

  • no kids key at all for output that needs no aggregation

I am not sure how "needs no aggregation" is defined in this context. Are you thinking about mgr:menu or a report from a single kid?

I don't expect there will ever be a case where we havde a partial collection of kids

This expectation directly contradicts the "missing" kids examples in my previous comment. Please clarify whether you consider my examples wrong/impossible or unimportant (or perhaps you just did not notice them).

Another example where aggregation may work across a subset of kids are reports specific to certain kid roles. For example, a disk utilization report aggregated exclusively across "disker" kids (i.e. skipping "worker" kids and "coordinator").

therefore listing the kids is imposing unnecessary complexity on the users

The "unnecessary" assertion seems false to me: Even for the typical case where reports from all kids are aggregated, the list of kid IDs is often useful. Imagine, for example, that you are receiving a report from an admin saying that Squid aggregation appears to have half of the expected values. Without the list of kid identifiers, we will not be able to tell whether aggregation code is indeed broken or half of the kids have died.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"kids: 2" reads like "two kids" to me which is kind of the opposite of the intended (AFAICT) "the second kid" or, more precisely, "kid identified as kid2" meaning.

Right. That should be "kid".

kids: [ vector ]

aka

kids:
  - 2

Could be easier for a machine reader, while:

kid: <number>

is better for a human reading it. This is, I believe, the crux of the matter, and the reason why by this logic alone it will be hard , if possible at all, to come to an agreement.

Your argument about being helpful to know the status of worker processes is valid, but it is IMO not best positioned here; it would deserve its own report page and I'll be happy to work on that in due time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The top-level collection can be anonymous. The existing one for "page" already is. Please make the entry within the collection keep a named ID/number ("kid" was/is fine IMO).

---
kid: 1\n
...
---
kid:2\n
...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Francesco: Your argument about being helpful to know the status of worker processes is valid

That is not my argument.

My argument is that the recipient of a report aggregated across kids X and Y must know the IDs of those kids (in some cases). Without that information, it is impossible (in some cases) to interpret the report correctly. If we were able to guarantee that all aggregated reports are aggregated across all kids at all times, then there would be some wiggle room here to omit that information, but the reality we have to deal with is more complex (as detailed already), and our reporting format/approach has to reflect that complexity.

Francesco: status of worker processes ... is IMO not best positioned here; it would deserve its own report page and I'll be happy to work on that in due time.

Sure, but a future "status of kids" report is not my concern, and I am not requesting or even suggesting any such work here.

Amos: The top-level collection can be anonymous.

FWIW, I do not know what "top-level collection" is in this context. If it means "report aggregated across one or more kids", then it should disclose what those kids were (as detailed earlier), either in all cases or in cases where aggregation does not include all kids.

FWIW, I recommend getting back to comprehensive proposals (e.g., option A and option B in #1784 (comment) and another option at #1784 (comment)) that cover all known use cases. Each individual use case is trivial to handle. It is the combination that requires some effort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with you on one count: there are only two valid states here, one is "we are not aggregating data" and the other is "we are aggregating data from all workers that are responding". Piggybacking the information that "only some workers are responding" here is misplacing that kind of information - that status information belongs to some other place, such as the "info" page or a new dedicated report page if enough information needs to be shown.

Furthermore, we don't really have the infrastructure in place to accumulate which workers we are aggregating information from. It can be added, but then we fall victims of scope creep and as a result nothing will get done as it happened other times.
My bet is that by insitsting to emit a "kids: [ vector ] " information, all that will ever be output is "kids: [ exactly one entry in the vector ]", because I have no interest in going further than that, and others who might have interest have no time

Copy link
Contributor

@rousskov rousskov Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with you on one count: there are only two valid states here, one is "we are not aggregating data" and the other is "we are aggregating data from all workers that are responding".

I see at least three cases (called "valid states" in the above text):

  • no aggregation without kid reports (e.g., "mgr:menu")
  • no aggregation with kid reports (e.g., "mgr:objects")
  • aggregation across kids (e.g., "mgr:info")

My "option A" and "option B" sketches define new YAML field(s) format for all three cases (while detailing two out of three sub-cases for the last bullet: multiple kids and a single kid).

Piggybacking the information that "only some workers are responding" here is misplacing that kind of information

Agreed. None of the three cases itemized above should include "only some workers are responding" information (unless some report is about reporting that kind of information, of course, but that consideration is outside this discussion scope).

However, aggregated reports should (eventually) report aggregation scope (for the reasons I have already detailed).

Furthermore, we don't really have the infrastructure in place to accumulate which workers we are aggregating information from.

This assertion is irrelevant to this discussion. To save time, I am not going to review/comment on it in detail. Suffice to say that, currently, I am not requesting that we add "workers we are aggregating from" information to aggregated reports. I am only requesting that we agree on how that information will be added in the future, so that reporting style remains consistent across use cases, and we avoid changing format once again when that information is added.

My bet is that by insitsting to emit a "kids: [ vector ] " information

Nobody insists on emitting that information in this PR AFAICT. Again, If we pick Option A, then this PR code will be affected. If we pick Option B, then this PR code will not be affected. If we pick some other yet-unspecified option, then I do not know whether this PR code will be affected. FWIW, I still lean towards Option B.

src/stat.cc Outdated Show resolved Hide resolved
src/mgr/FunAction.cc Show resolved Hide resolved
src/mgr/FunAction.cc Show resolved Hide resolved
@rousskov rousskov added S-waiting-for-author author action is expected (and usually required) S-waiting-for-more-reviewers needs a reviewer and/or a second opinion labels Apr 17, 2024
@kinkie kinkie added S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box and removed S-waiting-for-author author action is expected (and usually required) labels Apr 24, 2024
@yadij yadij removed the S-waiting-for-more-reviewers needs a reviewer and/or a second opinion label Apr 25, 2024
yadij
yadij previously approved these changes Apr 25, 2024
@yadij yadij requested a review from rousskov April 25, 2024 06:06
@rousskov rousskov removed their request for review April 26, 2024 20:32
@rousskov rousskov added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels Apr 26, 2024
@kinkie
Copy link
Contributor Author

kinkie commented Apr 29, 2024

My understanding is that https://github.com/squid-cache/squid/pull/1784/files#r1582816570 is the only blocker we have right now to move forward with this PR, which is blocking several others (and will require some work in the testing infra - I'm looking for the relative code to update the tests

@kinkie
Copy link
Contributor Author

kinkie commented Apr 29, 2024

I have submitted a PR against squid-overlord to support the revised kids report syntax.

@kinkie kinkie added S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box and removed S-waiting-for-author author action is expected (and usually required) labels Apr 29, 2024
@@ -50,9 +50,7 @@ Mgr::FunAction::dump(StoreEntry* entry)
debugs(16, 5, MYNAME);
Must(entry != nullptr);
if (UsingSmp())
storeAppendPrintf(entry, "by kid%d {\n", KidIdentifier);
storeAppendPrintf(entry, "---\nkid: %d\n", KidIdentifier);
Copy link
Contributor

@rousskov rousskov Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with you on one count: there are only two valid states here, one is "we are not aggregating data" and the other is "we are aggregating data from all workers that are responding".

I see at least three cases (called "valid states" in the above text):

  • no aggregation without kid reports (e.g., "mgr:menu")
  • no aggregation with kid reports (e.g., "mgr:objects")
  • aggregation across kids (e.g., "mgr:info")

My "option A" and "option B" sketches define new YAML field(s) format for all three cases (while detailing two out of three sub-cases for the last bullet: multiple kids and a single kid).

Piggybacking the information that "only some workers are responding" here is misplacing that kind of information

Agreed. None of the three cases itemized above should include "only some workers are responding" information (unless some report is about reporting that kind of information, of course, but that consideration is outside this discussion scope).

However, aggregated reports should (eventually) report aggregation scope (for the reasons I have already detailed).

Furthermore, we don't really have the infrastructure in place to accumulate which workers we are aggregating information from.

This assertion is irrelevant to this discussion. To save time, I am not going to review/comment on it in detail. Suffice to say that, currently, I am not requesting that we add "workers we are aggregating from" information to aggregated reports. I am only requesting that we agree on how that information will be added in the future, so that reporting style remains consistent across use cases, and we avoid changing format once again when that information is added.

My bet is that by insitsting to emit a "kids: [ vector ] " information

Nobody insists on emitting that information in this PR AFAICT. Again, If we pick Option A, then this PR code will be affected. If we pick Option B, then this PR code will not be affected. If we pick some other yet-unspecified option, then I do not know whether this PR code will be affected. FWIW, I still lean towards Option B.

src/mgr/FunAction.cc Show resolved Hide resolved
@rousskov rousskov added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels Apr 30, 2024
@kinkie kinkie marked this pull request as draft May 2, 2024 13:44
@kinkie kinkie added the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label May 2, 2024
@kinkie
Copy link
Contributor Author

kinkie commented May 2, 2024

Commit 6db2c7e works, except for one detail: it can't make use of Action::closeKidSection because there is no accesible StatObjectsAction to rely on.

@rousskov , @yadij , I need some help in understanding why commit f5bb52f does not work. It builds and starts just fine, but when
I call mgr:objects, it stalls immediately after emitting the HTTP headers. What am I missing here?
Thanks for any insights

#endif
if (IamPrimaryProcess())
if (IamPrimaryProcess()) {
storeAppendPrintf(entry, "---\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way to not duplicate code here, except by factoring malloc statistics out

That assertion is false because there is another way: Factor out code that opens and closes reports (as originally suggested!). Call the right functions from the right places, eliminating (poor) code duplication (and possibly fixing other problems).

src/mgr/InfoAction.cc Show resolved Hide resolved
@rousskov rousskov removed the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label May 2, 2024
@rousskov
Copy link
Contributor

rousskov commented May 2, 2024

I need some help in understanding why commit f5bb52f does not work. It builds and starts just fine, but when I call mgr:objects, it stalls immediately after emitting the HTTP headers. What am I missing here?

Unfortunately, there are too many red flags and complexities in that commit for me to be able to identify the primary culprit by reading commit diff. Let's start from the very beginning: Why "Implement StatObjectsAction" in this PR?

@kinkie
Copy link
Contributor Author

kinkie commented May 2, 2024

Unfortunately, there are too many red flags and complexities in that commit for me to be able to identify the primary culprit by reading commit diff. Let's start from the very beginning: Why "Implement StatObjectsAction" in this PR?

Commit 6db2c7e uses the legacy code path. The chain is stat_objects_get > statObjectsStart > iterate statObjects using eventAdd, using StatObjectsActionState to keep state across invocations. Eventually it will iterate over all objects, and callback into onObjectsDumpComplete which, just before closing the StoreEntry, will print the kid terminator.
This is the only non-atomic action code. Action cannot deal with the end framing because when its dump method finishes, the asynchronous code is still running.
The problem with this code is only one: it does not have access to Action::closeKidSection, because the relevant Action code is FunAction which the stats code has no access to.

The cleanest way I can think of to resolve this issue is to migrate this action to the Action API, without the mediation of FunAction. In this way, the implementation has access to all the needed bits of information. In order to minimize changes to the code (this is WIP, as shown by the fact that I've converted the PR back to Draft status), StatObjectsAction inherits the StatObjectsActionState, but reimplements the Action API (mostly adapted from FunAction), porting the statObjectsStart (which becomes StatObjectsAction::dump and statObjects which becomes StatObjectsAction::performStatObjects , to work with the event API [1]

This is all to avoid having an explicit section-closing code in stat.cc, which is what is done currently in master.

The strategy I have in mind is:

  1. get the Action based code to work
  2. clean up the legacy code
  3. fold StatObjectsActionState into StatObjectsAction
  4. cleanup names (e.g. statObjects2)
  5. fix a bug in Action::openKidsSection and Action::closeKidsSection, moving the (!aggregatable() && UsingSmp()) from caller code into them to avoid consistency bugs.
  6. profit

Do you see a better strategy? If not, any hints about what could be blocking strategy point 1?

src/mgr/Action.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything in the code that doesn't follow these specs or can we move on to land this PR?

Both :-).

I have left a few specific change requests to bring PR code closer to the draft specs. I would not be surprised if we will adjust the specs further based on that round of code review/changes! For now, I called the current specs in these change requests as "specs v2".

I expect more tests and development/review rounds. We are probably not going to be done with primary changes after this round, but I think we are making progress.

P.S. I continue to ignore secondary PR problems like Squid code style violations and Doxygen bugs.

Comment on lines 129 to 130
if (!UsingSmp())
return;
Copy link
Contributor

@rousskov rousskov May 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specs v2 do not distinguish SMP from non-SMP reports. This condition needs to be moved lower, to become exclusive to non-YAML reports.

Same for closeKidSection(), of course.

P.S. I assume we want to keep old/non-YAML format unchanged (to the extent possible). Otherwise, this condition can be completely removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Er.. I'm confused. The spec says that a YAML map with a "kid" section; it says nothing about who should emit that "kid" entry. The current code places that burden on the infrastructure for non-aggregatable jobs. Why should we change that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec ... says nothing about who should emit that "kid" entry. The current code places that burden on the infrastructure for non-aggregatable jobs. Why should we change that?

Hm... I am having trouble connecting the question with the change request. This change request does not change who emits the kid entry. It changes when the kid entry is emitted. After this change request is addressed, for YAML reports, the entry will be emitted in more cases (to comply with specs v2) because it will no longer be silenced for non-SMP cases.

In other words, producers of non-aggregated content will call openKidSection() and emit the "kid" entry in both SMP and non-SMP modes. AFAICT, this matches specs v2 partially quoted below:

1: A kid-specific document produced by a kid process is a YAML map containing a kid: ID entry ...
2: A document produced by Coordinator by aggregating SMP kid information is a YAML map containing a kids: [ ... ] entry

Should we reorder the above items and make it more explicit that they are mutually exclusive? Or is the problem elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah, got you. You are suggesting that in the non-SMP case there will be exactly one document for kid1. Sure, that works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the main effect is to break testCacheMgr: testRegister() creates a StoreEntry with no mem_obj

Hm... PR diff for testRegister() shows explicit mem_obj creation:

    StoreEntry *sentry = new StoreEntry();
    sentry->createMemObject();
    sentry->mem_obj->setUris("http://localhost/squid-internal-mgr/menu", ...);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. But before I added the setUris() call, it would fail an Assure check that the well-known-prefix be present. Now it is working, although it's noisy because Assure uses ReportAndThrow() which unconditionally prints to standard error in unit tests even when the resulting exception is caught

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Assure() throws during a test, then that test is not working correctly (unless the test is specifically designed to trigger Squid assertions). Assure() is an assertion -- an indication of a Squid bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do. See testCacheManager.cc:218

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Francesco: Right now the main effect is to break testCacheMgr: testRegister() creates ...

Alex: PR diff for testRegister() shows explicit mem_obj creation ...

Francesco: yes. But before I added [that code] it would fail an Assure check

Alex: If Assure() throws during a test, then that test is not working correctly (unless the test is specifically designed to trigger Squid assertions). Assure() is an assertion -- an indication of a Squid bug.

Francesco: They do.

"They"? There is only one plural in my statement -- "assertions". "Assertions do" does not really make sense in this context.

See testCacheManager.cc:218

I am aware of testInvalidUrl(). How is that testParseUrl()-specific code related to testRegister() we were discussing?

src/mgr/InfoAction.cc Show resolved Hide resolved
Comment on lines +336 to +337
if (state->onCompleteHandler)
state->onCompleteHandler(state->sentry);
Copy link
Contributor

@rousskov rousskov May 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no good reason to add a third parameter to statObjectsStart() when all statObjectsStart() get the same third parameter value and all StatObjectsState instances get that same value. AFAICT, this state->onCompleteHandler is always true. AFAICT, all these changes should be removed and this code will look like this:

Suggested change
if (state->onCompleteHandler)
state->onCompleteHandler(state->sentry);
Mgr::closeKidSection(state->sentry, false); // we have not converted these reports to YAML yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That call to closeKidSection is currently done manually in statObjects,
at the test for state->theSearch->isDone()).
We need to close the framing at that point for non-atomic non-aggregatable actions,
of which at this time there is exactly one but we might have more in the future, this is an
attempt to have a generic(er) solution. It was sparked by me failing to understand why
there was a section being closed there with no corresponding opening (it is in FunAction)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an attempt to have a generic(er) solution

Thank you for disclosing the intent. I do not see sufficient signs that StatObjectsState needs a (more) general solution for closing the YAML section, and I see enough signs to question the proposed form of such a general solution. The change request stands.

It was sparked by me failing to understand why there was a section being closed there with no corresponding opening (it is in FunAction)

I support adding a brief C++ comment to isDone() if statement to assist future code readers in connecting the dots. For example:

Suggested change
if (state->onCompleteHandler)
state->onCompleteHandler(state->sentry);
// match Mgr::openKidSection() in Mgr::FunAction::dump()
Mgr::closeKidSection(state->sentry, false /* we have not converted these reports to YAML yet */ );

#endif
if (IamPrimaryProcess())
if (IamPrimaryProcess()) {
storeAppendPrintf(entry, "---\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, this line should be removed (along with PR-added braces). Removing it may not fix some problems in the official code (not sure), but adding it creates problems in PR code.

src/mgr/Action.cc Outdated Show resolved Hide resolved
Comment on lines 117 to 118
if (!aggregatable() && atomic())
closeKidSection(entry, is_yaml());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the opening condition, this code should also be removed (and restored elsewhere!) AFAICT.

Suggested change
if (!aggregatable() && atomic())
closeKidSection(entry, is_yaml());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That elsewhere is FunAction, which means that the framing would only be done for functions that register a callback function instead of the newer API. My understanding is that the idea is to eventually remove the callback-based API, is it not? @yadij had put in the work here IIRC, any input?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That elsewhere is FunAction, which means that the framing would only be done for functions that register a callback function instead of the newer API.

Not exactly: FunAction (at least) will probably continue to frame kid-produced kid:ID responses. Other/aggregating actions will (eventually) start framing kids:IDs responses because only those actions know which kids they are polling to aggregate the reported info.

My understanding is that the idea is to eventually remove the callback-based API, is it not?

Yes, at least for cases where there is some associated state that should be maintained by an Action-derived class.

Copy link
Contributor Author

@kinkie kinkie May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly: FunAction (at least) will probably continue to frame kid-produced kid:ID responses. Other/aggregating actions will (eventually) start framing kids:IDs responses because only those actions know which kids they are polling to aggregate the reported info.

This I don't get. FunAction means (to me) FunctionAction which means its job is to bridge between the new and the old API. Framing is outside its scope; it used to do it but there is no reason why this shouldn't be done one level up, so to be also useable for Actions that want to use the new API but not aggregate.
By going the direction you're suggesting, each Action using the new API needs to either aggregate or duplicate the framing. This feels to me like binding together concepts that do not need being bound

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FunAction (at least) will probably continue to frame kid-produced kid:ID responses. Other/aggregating actions will (eventually) start framing kids:IDs responses because only those actions know which kids they are polling to aggregate the reported info.

This I don't get. FunAction means (to me) FunctionAction

Correct.

which means its job is to bridge between the new and the old API.

That assertion is not wrong, but it is also possible to view FunAction as a class that provides a convenient way to implement certain simple cache manager actions. For example, if ten actions can be implemented well as ten basic stand-alone functions, then there is no need to define ten new classes, one for each of those ten implementations. I used this broader view of FunAction in my response. Needless to say, if we discover that there are virtually no such actions, then FunAction class will be gone when its last user is converted from a stand-alone function to a class.

Framing is outside its scope; ... there is no reason why this shouldn't be done one level up

N.B. The current PR code does not move all framing decisions "one level up", to Action -- there are still two more places where framing is currently done. I do not yet know whether doing all framing in Action is both feasible and a good idea.

For me, the correctness of the "scope" and "no reason" assertions quoted above remains to be seen: I am not yet ready to say which class or classes will be responsible for framing. We need to clean up and test this PR a bit further before I can confidently agree with those assertions. In general: If the base Action class can do some framing task well, then obviously it should be responsible for that task. Derived classes (possibly but not necessarily including FunAction) will continue to carry the remaining framing responsibilities (if any).

so to be also useable for Actions that want to use the new API but not aggregate.

There is a third option here: Perhaps such Actions will have a common parent that is not Action. Again, for me, this design aspect remains to be seen.

By going the direction you're suggesting, each Action using the new API needs to either aggregate or duplicate the framing.

If we do our job well, there will be no duplication. Please do not interpret my current change requests as an attempt to bring this PR (or even a portion of it) to its final state. They are a (failing) attempt to create a clean starting point and advance cautiously, in several iterations. The only feasible alternative I see in this context is for me to (slowly) define and implement this PR goals (going through these iterations without bothering you) and then post the final changes for your review. If you prefer that model, please let me know.

The affected PR code does not look good because it leaves a YAML document open for actions that are both not aggregatable() and not atomic(). It would be trivial to forget to close that document. The same problem exists in official code, but it is localized to FunAction where we may decide to ignore it. We should avoid making things worse, but if you insist on enlarging the scope/impact of this existing problematic code now, then I will come back to this problem later, after other PR problems have been resolved. Your call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B. The current PR code does not move all framing decisions "one level up", to Action -- there are still two more places where framing is currently done. I do not yet know whether doing all framing in Action is both feasible and a good idea.

Moving the framing fully to Action feels to me way more complicated than the benefit. The cleanest way I can think of to do it is to add and use a hook on storeEntry::completed() calling back into the Action. This is to cover the (so far only) non-atomic actions, which share most code.

There are three calls to closeKidAction() in the current code:

  • one in Action which is where I'd like it to stay
  • one in InfoAction. This is an artifact of the fact that InfoAction has a per-kid section, for which it is to provide its own framing. The correct solution for me is to remove that per-kid section to its own per-kid report, as we already discussed
  • One in the objects-related code. This is necessary because it's a non-atomic action. Non-atomic actions are quite complicated, needing to have state, and continuations. They may end up having a million problems, but forgetting to close a kid section doesn't seem to be one to be concerned. And that is even assuming that we won't have a cleaner way to manage it using the stream-based API that Amos is wokrking on, or that at that point passing a functor to StoreEntry::completed() to close the kid will be a worthwhile endeavour

If we move the framing to FunAction, then every Action which is registered with the newer API will need to do its own framing, even when not atomic; or is there a way you envision to deal with this that I have missed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Francesco: Framing is outside its scope; ... there is no reason why this shouldn't be done one level up

Alex: The current PR code does not move all framing decisions "one level up", to Action -- there are still two more places where framing is currently done. I do not yet know whether doing all framing in Action is both feasible and a good idea.

Francesco: Moving the framing fully to Action feels to me way more complicated than the benefit.

I do not yet know the correct/final answer, but I am glad you see red flags in this direction. I share your concerns. To avoid a misunderstanding, I am not suggesting that we move (or do not move) all framing code to Action at this time.

The cleanest way I can think of to do it is to add and use a hook on storeEntry::completed() calling back into the Action. This is to cover the (so far only) non-atomic actions, which share most code.

I am not yet convinced that StoreEntry::completed() callback is a good solution for several reasons, including these two:

1: StoreEntry does not "complete" on its own. Some external code has to call StoreEntry::completeSuccessfully() or equivalent. If that external code is Action code, then no StoreEntry callback hack should be necessary because Action itself would know when it completes the report! The current PR code where action code itself calls the callback illustrates this point.

2: A completed StoreEntry cannot be modified. It would be too late to add any document closure markup at StoreEntry::completeSuccessfully() point. Perhaps the callback will be called before the entry is completed, but it is not clear what will trigger that StoreEntry callback in that case. The callback should not be synchronous, so it should not be executed from the StoreEntry::complete() itself, unless we split that method into two, effectively introducing a new StoreEntry state and raising red flags as large as Comm's "closing" and "closed" descriptor states!

There are three calls to closeKidAction() in the current code:

  • one in Action which is where I'd like it to stay

That is essentially the subject of this change request. Just to avoid a misunderstanding: My concerns regarding this code have not been addressed yet.

  • one in InfoAction. This is an artifact of the fact that InfoAction has a per-kid section, for which it is to provide its own framing. The correct solution for me is to remove that per-kid section to its own per-kid report, as we already discussed

We have discussed that move, but there is currently no consensus on whether the moved code will have a combination of aggregated and non-aggregated stats. If it has that combination, then the move itself is not going to solve any relevant problems (it will just move these memory stats into a more appropriate, dedicated report).

Please also note that specs v2 do allow mixed reports. Thus, the "we will remove one mixed report" argument does not work unless you are also willing to ban mixed reports. We have discussed that already and, so far, decided to allow them. Let's assume they continue to exist instead of focusing on removing one of them.

  • One in the objects-related code. This is necessary because it's a non-atomic action. Non-atomic actions are quite complicated, needing to have state, and continuations. They may end up having a million problems, but forgetting to close a kid section doesn't seem to be one to be concerned.

Please note that responding with "doesn't seem to be one to be concerned" to a detailed concern does not help us make progress. Again, to me, this seems like a very easy bug to introduce. Needless to say, we have had plenty of bugs where we forget to do something like that or, worse, accidentally made something like that conditional. If possible, I would like us to avoid designs that allow such bugs. I do not yet know whether that is possible in this case, but I hope it will be.

If we move the framing to FunAction, then every Action which is registered with the newer API will need to do its own framing, even when not atomic; or is there a way you envision to deal with this that I have missed?

If we keep framing in FunAction (compared to official code) for now, then we will solve those other problems when we actually need to solve them while moving this PR forward now. That is all I am suggesting at this time. We will come back to this as needed, probably armed with better understanding of our actual needs.

The alternative is to continue discussing how current real concerns may or may not be addressed by some future classes/changes that are currently nowhere close to even rough design consensus (not to mention implementation completion!): "we may forget"; "we will not forget because some future class will not let us forget"; ...

In summary: This code is objectively problematic, even if we disagree on the severity of its problems. We do not need to move it (and, hence, enlarge its scope/impact) now. Let's not do that until we have to.

src/mgr/FunAction.cc Show resolved Hide resolved
handler(entry);
if (atomic() && UsingSmp())
storeAppendPrintf(entry, "} by kid%d\n\n", KidIdentifier);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
if (atomic())
closeKidSection(entry, is_yaml());
// else handler() is responsible for (eventual) closing
}

Copy link
Contributor

@yadij yadij left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in testCacheManager.cc are not exactly in scope for this PR, IMO they should be done separately.

Copy link
Contributor

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in testCacheManager.cc are not exactly in scope for this PR, IMO they should be done separately.

If in-scope PR changes require some testCacheManager.cc changes for "make check" to succeed, then we do not have a choice but to modify that file. AFAICT, that outcome is Francesco's current assumption. We should still avoid unnecessary changes, of course (e.g., testValidUrl() and testInvalidUrl() changes do look out of scope to me).

Otherwise, I certainly support leaving unnecessary testCacheManager.cc improvements outside this PR scope!

@kinkie
Copy link
Contributor Author

kinkie commented May 10, 2024

Otherwise, I certainly support leaving unnecessary testCacheManager.cc improvements outside this PR scope!

Done in PR #1811

Comment on lines +336 to +337
if (state->onCompleteHandler)
state->onCompleteHandler(state->sentry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an attempt to have a generic(er) solution

Thank you for disclosing the intent. I do not see sufficient signs that StatObjectsState needs a (more) general solution for closing the YAML section, and I see enough signs to question the proposed form of such a general solution. The change request stands.

It was sparked by me failing to understand why there was a section being closed there with no corresponding opening (it is in FunAction)

I support adding a brief C++ comment to isDone() if statement to assist future code readers in connecting the dots. For example:

Suggested change
if (state->onCompleteHandler)
state->onCompleteHandler(state->sentry);
// match Mgr::openKidSection() in Mgr::FunAction::dump()
Mgr::closeKidSection(state->sentry, false /* we have not converted these reports to YAML yet */ );

Comment on lines +52 to +53
// always non-aggregatable, no need to check
openKidSection(entry, is_yaml());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not document this hidden1 dependency when the dependency itself can be trivially made explicit in code. That "need to check" does exist here (when human and code refactoring needs are properly accounted for).

A minimal-change PR can continue to secretly assume that aggregatable() is false here and make an unconditional openKidSection() call (without a problematic comment):

Suggested change
// always non-aggregatable, no need to check
openKidSection(entry, is_yaml());
openKidSection(entry, is_yaml());

If you would like to go a step further (e.g., to be explicit about the condition you had to discover the hard way), I would support doing that (but still without a problematic comment):

Suggested change
// always non-aggregatable, no need to check
openKidSection(entry, is_yaml());
if (!aggregatable())
openKidSection(entry, is_yaml());

Footnotes

  1. The implied "FunAction is always non-aggregatable here" dependency in current code is hidden from the compiler. Even with a comment, it is also hidden from humans that are working on related code sections. For example, if I were to adjust FunAction::aggregatable() implementation (e.g., to do something like what Action::atomic() already does), both me and the reviewer might not notice this comment and break things for some FunActions. This PR did not introduce that dependency (i.e. the flaw exists in the official code), but documenting it is not the right solution here.

Comment on lines +57 to +58
// always non-aggratable and atomic, no need to check
closeKidSection(entry, is_yaml());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B. This PR code is subject to the same "always non-aggratable" concerns documented in the other change request. The suggestion below applies the second suggestion from that change request to this code as well.

FunActions are not always atomic(). For example, mgr:objects action (i.e. stat_objects_get) is not atomic. For more information, see Mgr::Action::atomic() implementation inherited by FunAction.

Suggested change
// always non-aggratable and atomic, no need to check
closeKidSection(entry, is_yaml());
if (!aggregatable() && atomic())
closeKidSection(entry, is_yaml());
// else: non-aggregatable() non-atomic() actions call closeKidSection() when they are done

P.S. The comment suggested above does document a dependency, similar to the comment rejected in another change request. However, in this case, the comment is warranted because this dependency cannot be "trivially made explicit in code" (unfortunately).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-for-author author action is expected (and usually required)
Projects
None yet
3 participants