Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more @FormatMethod annotation in Logger #954

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

ksobolew
Copy link

@ksobolew ksobolew commented Oct 19, 2021

They may not look like format methods, but in fact they are, just without parameters. The reason to treat them this way is that then Error Prone can detect mis-use of these methods, like passing a String concatenation instead of a format string plus arguments.

The caveat is that from now on the message on these methods will be actually treated as format string, so any % characters will be interpreted. This will break messages like

Frobnicity increased by 10%!

which would have to be changed to

Frobnicity increased by 10%%!

This is a breaking change, but it makes treatment of the messages consistent across all variants of these log methods. For compatibility with existing code, if there is IllegalFormatException thrown by String#format, we fall back to treating the string as a literal message.

This PR includes dogfooding of this change on Airlift itself. There is a simple Maven profile provided which enables the relevant check (and only that).

@@ -74,6 +75,7 @@ public void debug(Throwable exception, String message)
*
* @param message a literal message to log
*/
@FormatMethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does log.debug("%%") log single percent or two?

Copy link
Contributor

@findepi findepi Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the caller allowed to call log.debug("%...a malformed format string")?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I haven't thought of that :/ No, the %s won't be interpreted by these methods, but Error Prone will treat them as format strings regardless. So, back to the drawing board I guess.

Copy link
Author

@ksobolew ksobolew Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed an update, with the following caveat:

(...) from now on the message on these methods will be actually treated as format string (...) This is a breaking change, but it makes treatment of the messages consistent across all variants of these log methods.

Are we willing to accept that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i admit this is a grey area. log.xxx(String) doesn't need to format, since there are no args, but could do format for consistency. however, it's not IMO OK to make a breaking change like that.

@ksobolew ksobolew force-pushed the kudi/annotate-parameter-less-log-methods-as-format-methods branch from 4d88940 to f52ec26 Compare October 19, 2021 12:32
Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not format methods. Making them so is a breaking change to an API that has existed for over 10 years. I like the idea of catching misuse, but not at the expense of breaking logging which is often not tested or only happens in exceptional circumstances.

@ksobolew
Copy link
Author

These are not format methods. Making them so is a breaking change to an API that has existed for over 10 years. I like the idea of catching misuse, but not at the expense of breaking logging which is often not tested or only happens in exceptional circumstances.

OK, thanks. Just wanted to test the waters, so to speak :)

@ksobolew ksobolew closed this Oct 19, 2021
@ksobolew ksobolew deleted the kudi/annotate-parameter-less-log-methods-as-format-methods branch October 19, 2021 16:28
@electrum
Copy link
Member

Do you have any examples of misuse seen in practice?

I’m wondering if we can do something clever here. There are two kinds of callers:

  • Static message. These can be detected with Error Prone and updated.
  • Dynamic message: log.error(e.getMessage())

Maybe we change so that that the message is now a format string, but if it’s malformed, we simply log the original (old behavior). This would only be for the no-arg version.

It makes usage slightly uglier, but makes the API consistent and allows checkers.

cc @martint

@ksobolew ksobolew restored the kudi/annotate-parameter-less-log-methods-as-format-methods branch October 20, 2021 07:46
@ksobolew
Copy link
Author

I think there are about 30 in Trino. They are in four general categories:

  • Logging using concatenation instead of the format string:
     log.warn("Switching coordinator affinity from " + coordinatorId + " to " + assignments.getCoordinatorId());
  • Logging a pre-formatted String:
    log.info(format("Copying resource dir '%s' to %s", resourceName, tempDir));
  • Preparing a message beforehand so that it can be used to log and as an exception message:
    String message = format("Failure when checking behavior against %s", driver);
    // log immediately since further tests may take more time; "log and rethrown" is not harmful in tests
    log.error(e, message);
    failures.add(new AssertionError(message, e));
  • Non-static-string (like you mentioned):
    log.error(message.toString());

Overall, nothing serious, "just" a code quality issue.

I proposed to mark these methods as format methods, but I am aware that this is an indirect way to make Error Prone do what we want. There are other tools we could use, for example SpotBugs (with some external checks) can detect mis-use of logger methods, but I don't think any of them knows that these are actually logger methods. So we still have to annotate them somehow, and this seems like it's as good as any other way to do that.

So given the above the question now is: is it OK that we mark these methods as format methods even though they're not really? :)

@ksobolew ksobolew reopened this Oct 20, 2021
@findepi
Copy link
Contributor

findepi commented Oct 20, 2021

  • Preparing a message beforehand so that it can be used to log and as an exception message:
  • Non-static-string (like you mentioned):

these are valid and shouldn't be prohibited

still, workaround is easy: log.xxx("%s", message)

maybe

  • deprecated non-formatting log.xxx(String), log.xxx(Throwable, String) methods
    • for interim we could use "%s", message; if we later remove non-formatting methods, we could simplify that, leveraring varargs overload
  • if we don't want deprecate for everyone, we can prohibited them in Trino with modernizer
    • this would mean we never improve to leverage varargs overload

@electrum
Copy link
Member

electrum commented Oct 20, 2021

The no-argument versions are really useful and common, so it would suck to have to use a dummy %s for them:

  • log.info("Doing something here");
  • log.error(e, "Failed doing something");

I think it is fine to say that these are no-arg format string messages, such that the dynamic log.error(m.toString()) would be replaced with an explicit log.error('%s', m). The only question is if we can do it without breaking existing code at runtime (finding new static Error Prone violations or similar is fine).

In other words, if we have log.info('Frobnicity increased by 10%!'); then we can fail if Error Prone flags it, but existing calls in compiled code should work as they did before, with the exception that valid format strings containing %% will have those converted to %.

@ksobolew
Copy link
Author

still, workaround is easy: log.xxx("%s", message)

It's not only a workaround - if the message is to be interpreted as a format string, then it prevents that from happening in case the message variable contains something that may misbehave.

OTOH,

  • log.info("Doing something here");
  • log.error(e, "Failed doing something");

These won't get flagged, because these are literal strings and literal strings are explicitly OK (unless they contain malformed format string).

And lastly,

  • Preparing a message beforehand so that it can be used to log and as an exception message:

I know this sucks to have to repeat onself. For example

            String message = format("User [%s] not a member of an authorized group", user);
            log.debug(message);
            throw new AccessDeniedException(message);

will have to change into

            log.debug("User [%s] not a member of an authorized group", user);
            throw new AccessDeniedException(format("User [%s] not a member of an authorized group", user));

But I don't think it's such a big deal, personally.

@ksobolew
Copy link
Author

And one more thing - I haven't actually found any cases where the message contains something that might be misinterpreted as a format string. Not in Trino, at least. And not in static strings. It may still happen when we log a result of some invocation, though.

@ksobolew ksobolew force-pushed the kudi/annotate-parameter-less-log-methods-as-format-methods branch from f52ec26 to da3e952 Compare October 20, 2021 10:53
@ksobolew
Copy link
Author

In this new version:

  • The parameter on these methods is renamed from message to format (and the javadocs updated accordingly)
  • There is a fallback on IllegalFormatException
  • There are unit tests for the fallback, and the no-arg variants are also covered now

@ksobolew
Copy link
Author

bump

@ksobolew
Copy link
Author

One reason to treat no-param logs as format strings is when we want to merge subsequent logs into one, so that they are both logged atomically:

-        log.info("trying to give consent");
-        log.info("waiting for openId checkbox");
+        log.info("trying to give consent%nwaiting for openId checkbox");

The %n format specifier is the portable way to print newline. It could be replaced with \n, but that would not be ideal.

@ksobolew ksobolew force-pushed the kudi/annotate-parameter-less-log-methods-as-format-methods branch 3 times, most recently from cd8e789 to b73347b Compare November 10, 2021 10:54
@electrum
Copy link
Member

electrum commented Dec 6, 2021

This seems reasonable to me. I do like the consistency, and the fallback of catching IllegalFormatException seems like a good approach.

Can you add a test for including valid placeholders like %s and %d in the no-args version?

@electrum
Copy link
Member

electrum commented Dec 6, 2021

To expand on the motivation for this change, it allows Error Prone to flag code like this:

 log.debug("Task scheduled " + taskId); 
trino/core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java:266:18
java: [FormatStringAnnotation] Format strings must be either literals or variables. Other expressions are not valid.
  Invalid format string: "Task scheduled " + taskId
    (see https://errorprone.info/bugpattern/FormatStringAnnotation)

Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update the CHANGES file in the commit that makes the behavior change?

@ksobolew ksobolew force-pushed the kudi/annotate-parameter-less-log-methods-as-format-methods branch 2 times, most recently from 1333083 to e06f66b Compare December 7, 2021 13:17
@ksobolew
Copy link
Author

ksobolew commented Dec 7, 2021

@electrum AC, thanks!

ksobolew added a commit to ksobolew/trino that referenced this pull request Dec 8, 2021
These all use variants of the `Logger` method which take a `String` and
no format string arguments. Formerly this was treated as a simple
message to be passed to logs verbatim, but this creates plenty of
opportunities to mis-use it. A change in Airlift may change the
semantics of these methods to treat the `String` as a format string with
no arguments, and mark them as `@FormatMethod`s, so that Error Prone can
catch improper uses (see airlift/airlift#954).

There are roughly four categories of instances flagged given the above:

* Logging using concatenation instead of the format string:
  - `logger.info("Foo " + bar + " bazzed");`
* Logging a pre-formatted String:
  - `logger.info(String.format("%s %d", foo, bar));`
* Preparing a message beforehand so that it can be used to log and as an
  exception message:
  - ```
    String message = String.format(...);
    logger.info(message);
    throw new Exception(message);
    ```
* Non-static-string:
  - `log.error(message.toString());`

Most of the above have pretty obvious fixes, except for the third kind,
where we avoid duplication of the message string by using the universal
"%s" format string to just output the preformatted message.
@martint
Copy link
Member

martint commented Dec 8, 2021

I'm not yet 100% convinced this is something we want to change. The choice of making the no-args version a regular string (vs a format string) was intentional. It is true that "they may not look like format methods, but in fact they are, just without parameters". That was the original intention (except for the implications around how naked % are interpreted).

It makes it more cumbersome to log naked strings that contain '%'. With this change, callers are forced to do log.debug("%s", string) or manually escape % characters beforehand. As a consequence, there is no safe way to use the no-arg method unless you know it doesn't contain % -- and every time you call it, you have to think about that fact. We might as well just remove the no-arg method, in that case.

Also, I don't like the implications and constraints this forces. In particular, while things like log.debug("xxx " + taskId) are generally not desirable, they are also not catastrophic. Sometimes they are even necessary. E.g., you might want to log the result of a function call, so "either literals or variables" is too constraining, and forcing the caller to create an artificial variable is an anti-pattern.

@findepi
Copy link
Contributor

findepi commented Dec 9, 2021

It makes it more cumbersome to log naked strings that contain '%'. With this change, callers are forced to do log.debug('%s", string) or manually escape % characters beforehand. As a consequence, there is no safe way to use the no-arg method unless you know it doesn't contain % -- and every time you call it, you have to think about that fact. We might as well just remove the no-arg method, in that case.

@martint see also #954 (comment)

Also, I don't like the implications and constraints this forces. In particular, while things like log.debug("xxx " + taskId) are generally not desirable, they are also not catastrophic. Sometimes they are even necessary. E.g., you might want to log the result of a function call, so "either literals or variables" is too constraining, and forcing the caller to create an artificial variable is an anti-pattern.

I agree with "not catastrophic".
i don't agree with "necessary". Concatenation in log.debug(...) is a bad/undesired code which we want to eliminate, because we pay concatenation cost even if debug log is disabled. log.debug(foo(...)) falls into same cathegory.
If the intent is to call foo regardless of log level settings, and log the result if logging enabled, log.debug(foo(...)) is a pretty nice way to hide that intent.

Anyway, the problem we're solving is: "how to find log.debug(...) calls with concatenation".
We found a disputable solution: mark this as a @FormatMethod and let errorprone do the job.
I'd hope a less contented solution exists. Perhaps we can add an additional checker to errorprone. Similar to how @FormatMethod check works, but without misrepresenting log.debug as a formatting.

findepi pushed a commit to trinodb/trino that referenced this pull request Dec 9, 2021
These all use variants of the `Logger` method which take a `String` and
no format string arguments. Formerly this was treated as a simple
message to be passed to logs verbatim, but this creates plenty of
opportunities to mis-use it. A change in Airlift may change the
semantics of these methods to treat the `String` as a format string with
no arguments, and mark them as `@FormatMethod`s, so that Error Prone can
catch improper uses (see airlift/airlift#954).

There are roughly four categories of instances flagged given the above:

* Logging using concatenation instead of the format string:
  - `logger.info("Foo " + bar + " bazzed");`
* Logging a pre-formatted String:
  - `logger.info(String.format("%s %d", foo, bar));`
* Preparing a message beforehand so that it can be used to log and as an
  exception message:
  - ```
    String message = String.format(...);
    logger.info(message);
    throw new Exception(message);
    ```
* Non-static-string:
  - `log.error(message.toString());`

Most of the above have pretty obvious fixes, except for the third kind,
where we avoid duplication of the message string by using the universal
"%s" format string to just output the preformatted message.
@electrum
Copy link
Member

electrum commented Jan 4, 2022

Related, we should add supplier versions of all the no-arg logger methods, for times when you need to log the result of a method call or other expression. I think that would largely eliminate the objection, since it's very rare you need to log an arbitrary pre-constructed string.

While I understand @martint's argument, I think it's rare enough that the API consistency and Error Prone check makes the extra verbosity in those cases worthwhile.

@electrum
Copy link
Member

electrum commented Jan 4, 2022

@ksobolew There are a few places in Airlift to update before this change, which are examples of arbitrary strings passed to the no-arg methods:

  • io.airlift.http.client.jetty.JettyClientDiagnostics
  • io.airlift.bootstrap.LoggingWriter
  • io.airlift.discovery.client.ExponentialBackOff
  • io.airlift.log.LoggingOutputStream
  • io.airlift.bootstrap.Bootstrap using log::warn

@ksobolew ksobolew force-pushed the kudi/annotate-parameter-less-log-methods-as-format-methods branch from e06f66b to 5faf415 Compare January 5, 2022 14:10
@ksobolew
Copy link
Author

ksobolew commented Jan 5, 2022

@electrum AC, thanks!

@ksobolew There are a few places in Airlift to update before this change, which are examples of arbitrary strings passed to the no-arg methods:

* `io.airlift.http.client.jetty.JettyClientDiagnostics`

* `io.airlift.bootstrap.LoggingWriter`

* `io.airlift.discovery.client.ExponentialBackOff`

* `io.airlift.log.LoggingOutputStream`

* `io.airlift.bootstrap.Bootstrap` using `log::warn`

You're absolutely right, we should first fix the potentially problematic calls in Airlift itself. I included a very simple Maven profile which enables the relevant Error Prone check to make finding these cases easier. I fixed all the instances which it flagged, which include all the examples you provided.

@ksobolew ksobolew force-pushed the kudi/annotate-parameter-less-log-methods-as-format-methods branch from 5faf415 to 9757c71 Compare January 5, 2022 15:50
Currently the message is intepreted literally.
The message returned from the `Supplier` is not considered a format
string - it's assumed that it'll be preformatted by the caller. This
allows us also to re-use these method to implement some of the other
variants.

Note that we can also conveniently use the `Supplier`-taking methods
from the underlying library.
The `Logger` methods which take `String` only and no format arguments,
will be treated in the future by the logger as format strings, which is
unexpected if the message comes from outside and is not intended to be
interpreted in this way. We either have to provide the format string
explicitly, or use the `Supplier` variant.

The latter also has performance benefits in case the particular log
level is not enabled.

This is guided by Error Prone, for which, proactively, a Maven profile
is added. It enables only the `FormatStringAnnotation` check, so it's
very tightly focused for now.
They may not look like format methods, but in fact they are, just
without parameters. The reason to treat them this way is that then Error
Prone can detect mis-use of these methods, like passing a String
concatenation instead of a format string plus arguments.

The caveat is that from now on the message on these methods will be
actually treated as format string, so any `%` characters will be
interpreted. This will break messages like

```
Frobnicity increased by 10%!
```

which would have to be changed to

```
Frobnicity increased by 10%%!
```

This is a breaking change, but it makes treatment of the messages
consistent across all variants of these log methods. For compatibility
with existing code, if there is `IllegalFormatException` thrown by
`String#format`, we fall back to treating the string as a literal
message.
@ksobolew ksobolew force-pushed the kudi/annotate-parameter-less-log-methods-as-format-methods branch from 9757c71 to efb9fef Compare January 5, 2022 16:00
@martint
Copy link
Member

martint commented Jan 10, 2022

log.debug(foo(...)) falls into same category. If the intent is to call foo regardless of log level settings, and log the result if logging enabled, log.debug(foo(...)) is a pretty nice way to hide that intent.

What about:

String result = foo(...);
log.debug(foo);

That a seems pretty legit usage to me, and forcing log.debug("%s", foo) is very unnatural and likely for people to forget to do that. Unless the result from foo() properly escapes %, that is a buggy line.

In fact, such a usage now would become quite dangerous. If, for instance, result contains a sequence like "%1000000000%", String.format will expand that into a 1GB string. So, innocent usages like the one above that would be easy to overlook during code review and hard to catch with ErrorProne would be potential crashes or serious performance issues waiting to happen -- way worse than the effect of string concatenation.

Related, we should add supplier versions of all the no-arg logger methods, for times when you need to log the result of a method call or other expression. I think that would largely eliminate the objection, since it's very rare you need to log an arbitrary pre-constructed string.

Can you elaborate? Not sure how it solves the problem. Also, using capturing lambdas can have performance implications, as they are constructed eagerly before the method gets called. It'd be trading one problem (string concatenation) for another one (creating capturing lambdas before the call).

I think it's rare enough

I personally use that often enough during development or debugging that I wouldn't want it to get in the way. It's not just about the code that will end up merged, but throwaway code you write/build/run while prototyping and developing.

Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per comment above, code like this has worse problems that those caused string concatenation that this change is attempting to catch.

String result = foo(...); // function which returns a string containing "%1000000000%"
log.debug(result);

@ksobolew
Copy link
Author

Per comment above, code like this has worse problems that those caused string concatenation that this change is attempting to catch.

String result = foo(...); // function which returns a string containing "%1000000000%"
log.debug(result);

I agree that this is potentially dangerous. On the other hand, this will definitely be flagged by Error Prone. On the third hand, Error Prone is not a silver bullet as it takes effort to enable it, and in projects like Trino it's not even enabled by default. So, yeah, this is a hard balance. Personally I'm biased towards more static analysis, even at the cost of some inconvenience, but I understand the hesitation.

@ksobolew
Copy link
Author

Related, we should add supplier versions of all the no-arg logger methods, for times when you need to log the result of a method call or other expression. I think that would largely eliminate the objection, since it's very rare you need to log an arbitrary pre-constructed string.

Can you elaborate? Not sure how it solves the problem. Also, using capturing lambdas can have performance implications, as they are constructed eagerly before the method gets called. It'd be trading one problem (string concatenation) for another one (creating capturing lambdas before the call).

The result of the lambda call is not interpreted in any way, including not as a format pattern, so this is one way to fix the problem of potentially dangerous strings. As an added benefit, it reduces overhead in some cases, as the message construction can be encapsulated and executed only when needed.

Regarding the performance concerns, the use case for lambda-taking methods is in places where creating the string is way more costly than setting up a lambda (which I don't believe have any measurable cost anyway, but that's just an aside). It's not for cases where you have a string that you'll want to use for other purposes as well, only for those where the message is constructed solely for the logger call. As this PR demonstrates, there are two such places in Airlift alone.

@martint
Copy link
Member

martint commented Jan 11, 2022

I agree that this is potentially dangerous. On the other hand, this will definitely be flagged by Error Prone

How will Error Prone know to flag that? How would it ever be able to tell that the output from foo(...) might contain such a dangerous pattern?

@ksobolew
Copy link
Author

I agree that this is potentially dangerous. On the other hand, this will definitely be flagged by Error Prone

How will Error Prone know to flag that? How would it ever be able to tell that the output from foo(...) might contain such a dangerous pattern?

Well, it's not smart enough to track "tainted" (or "unsanitized") values, but it sees that the debug(String) method is a @FormatMethod (assuming this PR is merged) and requires the argument to be a String literal, exactly because otherwise you don't know if the string is potentially dangerous.

@martint
Copy link
Member

martint commented Jan 11, 2022

Well, that sucks :)

Also, one big issue is that this assumes you are using ErrorProne. Many projects that depend on Airlift don't, and it's unreasonable to force it on them. Without checks that prevent using the method with anything other than a literal string, this new reimagining of the method is more dangerous, as I described above.

@ksobolew
Copy link
Author

ksobolew commented Jan 12, 2022

You're right, you have a responsibility for not breaking projects using Airlift. This means it's not possible to merge it in the current form, and it's fine. Backward compatibility is difficult :) I still think that detecting more issues is a worthy goal, so I'm thinking of a way to enable it. My first thought was to add methods like debugFormatted, but that's just silly. (Why would anyone use it? And more, why would anyone switch existing debug calls to it?) Or I can add a custom Error Prone check (like the one I'm experimenting with in Trino) specifically for this, which users can opt in to if they want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants