Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoformat confusable units #4430

Merged
merged 4 commits into from Nov 3, 2023
Merged

Conversation

covracer
Copy link
Contributor

I've seen errors crop up from using the different micro and mu characters. Follow matching recommendations on which character to prefer for micro, ohm, and angstrom. References:

@github-actions
Copy link

github-actions bot commented May 14, 2023

PR Check Results

Ecosystem

ℹ️ ecosystem check detected changes. (+2, -0, 0 error(s))

zulip (+2, -0)

+ zerver/tests/test_invite.py:1197:32: RUF001 [*] String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ zerver/tests/test_signup.py:1033:29: RUF001 [*] String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

Rules changed: 1
Rule Changes Additions Removals
RUF001 2 2 0

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.00     13.9±0.02ms     2.9 MB/sec    1.00     13.9±0.02ms     2.9 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      3.4±0.01ms     4.9 MB/sec    1.00      3.4±0.01ms     4.9 MB/sec
linter/all-rules/numpy/globals.py          1.01    428.0±0.47µs     6.9 MB/sec    1.00    424.9±0.45µs     6.9 MB/sec
linter/all-rules/pydantic/types.py         1.00      5.9±0.01ms     4.4 MB/sec    1.00      5.9±0.02ms     4.4 MB/sec
linter/default-rules/large/dataset.py      1.00      6.8±0.01ms     6.0 MB/sec    1.00      6.8±0.01ms     6.0 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01   1508.9±2.89µs    11.0 MB/sec    1.00   1494.0±3.40µs    11.1 MB/sec
linter/default-rules/numpy/globals.py      1.02    173.2±1.81µs    17.0 MB/sec    1.00    169.7±1.31µs    17.4 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.1±0.00ms     8.2 MB/sec    1.00      3.1±0.01ms     8.2 MB/sec
parser/large/dataset.py                    1.01      5.2±0.01ms     7.8 MB/sec    1.00      5.2±0.00ms     7.8 MB/sec
parser/numpy/ctypeslib.py                  1.00   1020.1±0.54µs    16.3 MB/sec    1.00   1019.2±0.66µs    16.3 MB/sec
parser/numpy/globals.py                    1.00    105.6±0.27µs    27.9 MB/sec    1.00    105.5±0.16µs    28.0 MB/sec
parser/pydantic/types.py                   1.01      2.2±0.00ms    11.4 MB/sec    1.00      2.2±0.00ms    11.4 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.00     16.6±0.20ms     2.5 MB/sec    1.00     16.6±0.31ms     2.5 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      4.2±0.05ms     4.0 MB/sec    1.00      4.2±0.04ms     4.0 MB/sec
linter/all-rules/numpy/globals.py          1.01   506.7±14.96µs     5.8 MB/sec    1.00    501.2±6.66µs     5.9 MB/sec
linter/all-rules/pydantic/types.py         1.00      7.0±0.08ms     3.6 MB/sec    1.00      7.0±0.08ms     3.6 MB/sec
linter/default-rules/large/dataset.py      1.00      8.2±0.14ms     4.9 MB/sec    1.00      8.3±0.07ms     4.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00  1771.6±24.87µs     9.4 MB/sec    1.00  1778.4±26.63µs     9.4 MB/sec
linter/default-rules/numpy/globals.py      1.00    203.4±3.13µs    14.5 MB/sec    1.02    207.3±7.52µs    14.2 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.7±0.05ms     6.9 MB/sec    1.01      3.8±0.07ms     6.8 MB/sec
parser/large/dataset.py                    1.00      6.3±0.05ms     6.4 MB/sec    1.02      6.4±0.07ms     6.3 MB/sec
parser/numpy/ctypeslib.py                  1.00  1192.4±16.10µs    14.0 MB/sec    1.02  1221.5±20.10µs    13.6 MB/sec
parser/numpy/globals.py                    1.00    123.3±2.36µs    23.9 MB/sec    1.01    124.7±2.84µs    23.7 MB/sec
parser/pydantic/types.py                   1.00      2.7±0.04ms     9.4 MB/sec    1.01      2.7±0.03ms     9.3 MB/sec

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution.

I'm not very familiar with unicode confusables. So bare with me ;)

Our current table is based on the VS code table and the standard seems to differentiate between mixed script and other confusables. I believe we currently only test for mixed-script confusables (similar to unicode-security). Do you know if these added scripts are also mixed script confusables or general confusables?


use once_cell::sync::Lazy;
use rustc_hash::FxHashMap;

/// Via: <https://github.com/hediet/vscode-unicode-data/blob/main/out/ambiguous.json>
/// See: <https://github.com/microsoft/vscode/blob/095ddabc52b82498ee7f718a34f9dd11d59099a8/src/vs/base/common/strings.ts#L1094>
pub(crate) static CONFUSABLES: Lazy<FxHashMap<u32, u8>> = Lazy::new(|| {
pub(crate) static CONFUSABLES: Lazy<FxHashMap<u32, u32>> = Lazy::new(|| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I wonder if we should instead change the type to <char, char>. Having the chars in place could make the table a bit more readable and avoids the unwrap calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaReiser - That seems reasonable to me, but will it increase the size of this map?

@covracer
Copy link
Contributor Author

Thanks for taking a look!

The VS Code list only contains "... characters (key) that are confusable with a basic ascii [0x20 to 0x7E] character (value)". https://github.com/hediet/vscode-unicode-data/blob/a2ee3e17acaad2957f910cf6c46f6ea2c6eae753/index.ts#L159

I wonder why they did that. Was it an easy starting point and they haven't been motivated to build it out further? Would a full table would be too slow in node.js? 🤷

I too am new to Unicode security, and am still trying to figure out how to apply domain name spoofing concepts to code linting and autoformatting. TR 39 suggested reading TR 36 first, where I found this:

While compatibility normalization and mixed-script detection can handle the majority of spoofing cases, they do not handle single-script confusables. https://www.unicode.org/reports/tr36/#Single_Script_Spoofing

So it seems like normalization comes first in the process, before mixed script or single script handling. The substitutions I'm adding are normalizations.

A longer discussion of normalization from TR 36:

Fortunately the design of IDN prevents a huge number of spoofing attacks. All conformant users of [IDNA2003] are required to process domain names to convert what are called compatibility-equivalent characters into a unique form using a process called compatibility normalization (NFKC)—for more information on this, see [UAX15]. This processing eliminates most possibilities for visual spoofing by mapping away a large number of visually confusable characters and sequences.

@@ -2116,5 +2116,8 @@ pub(crate) static CONFUSABLES: Lazy<FxHashMap<u32, u8>> = Lazy::new(|| {
(1059, 89),
(65283, 35),
(65307, 59),
(0x212B, 0x00C5), // ANGSTROM SIGN → LATIN CAPITAL LETTER A WITH RING ABOVE
(0x2126, 0x03A9), // OHM SIGN → GREEK CAPITAL LETTER OMEGA
(0x00B5, 0x03BC), // MICRO SIGN → GREEK SMALL LETTER MU
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on Section 2.5 in https://www.unicode.org/reports/tr25/, should this also include Kelvin sign?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list from VS Code covers Kelvin with (8490, 75).

@zanieb
Copy link
Member

zanieb commented Oct 19, 2023

I believe we were concerned about increasing the size of the map here but with use of match it shouldn't matter anymore? @covracer would you be interested in updating with the latest changes on main and we can try to get this merged?

@charliermarsh charliermarsh enabled auto-merge (squash) November 3, 2023 04:54
@charliermarsh charliermarsh merged commit 9f30ccc into astral-sh:main Nov 3, 2023
16 checks passed
@charliermarsh
Copy link
Member

Thanks, and sorry for the extensive delay here.

Copy link

github-actions bot commented Nov 3, 2023

ruff-ecosystem results

Linter (stable)

ℹ️ ecosystem check detected linter changes. (+9 -0 violations, +0 -0 fixes in 41 projects)

bokeh/bokeh (+1 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --select ALL

+ examples/interaction/tools/subcoordinates_zoom.py:22:23: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

pandas-dev/pandas (+6 -0 violations, +0 -0 fixes)

+ pandas/_libs/tslibs/timedeltas.pyi:58:6: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5352:26: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5353:46: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5354:69: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/tests/io/test_clipboard.py:67:34: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/tests/scalar/timedelta/test_constructors.py:306:25: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

zulip/zulip (+2 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --select ALL

+ zerver/tests/test_invite.py:1281:32: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ zerver/tests/test_signup.py:1021:29: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

Changes by rule (2 rules affected)

code total + violation - violation + fix - fix
RUF001 6 6 0 0 0
RUF003 3 3 0 0 0

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+9 -0 violations, +0 -0 fixes in 41 projects)

bokeh/bokeh (+1 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --select ALL --preview

+ examples/interaction/tools/subcoordinates_zoom.py:22:23: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

pandas-dev/pandas (+6 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --preview

+ pandas/_libs/tslibs/timedeltas.pyi:58:6: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5352:26: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5353:46: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5354:69: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/tests/io/test_clipboard.py:67:34: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/tests/scalar/timedelta/test_constructors.py:306:25: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

zulip/zulip (+2 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --select ALL --preview

+ zerver/tests/test_invite.py:1281:32: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ zerver/tests/test_signup.py:1021:29: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

Changes by rule (2 rules affected)

code total + violation - violation + fix - fix
RUF001 6 6 0 0 0
RUF003 3 3 0 0 0

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@charliermarsh
Copy link
Member

Note to self: revisit these ecosystem changes today.

@charliermarsh
Copy link
Member

\cc @zanieb - Do you have an opinion on whether this should be a preview change...?

@zanieb
Copy link
Member

zanieb commented Nov 3, 2023

@charliermarsh looks like it creates new violations and it's not a bug fix so it should probably be preview?

@charliermarsh
Copy link
Member

Done in #8473.

@covracer covracer deleted the confusable-units branch November 30, 2023 15:21
@covracer
Copy link
Contributor Author

Thanks! I've looked over the hits and made a best guess at how to proceed. Over time I'll try to test and submit Pull Requests.

examples/interaction/tools/subcoordinates_zoom.py:22:23: RUF001 String ...

I would suggest applying the fix for this abbreviation of microvolts in tooltip text.

pandas/_libs/tslibs/timedeltas.pyi:58:6: RUF001 String ...
pandas/tests/scalar/timedelta/test_constructors.py:306:25: RUF001 String ...

timedeltas.pyi and timedeltas.pyx list many aliases for microseconds (among other units). test_constructors.py tests using the aliases. I would suggest # noqa: RUF001.

pandas/core/indexes/base.py:5352:26: RUF003 Comment ...
pandas/core/indexes/base.py:5353:46: RUF003 Comment ...
pandas/core/indexes/base.py:5354:69: RUF003 Comment ...

This multiline comment documents performance analysis timings. I would suggest applying the fix.

pandas/tests/io/test_clipboard.py:67:34: RUF001 String ...

This is part of dict[str, list[str]] test data for utf8. ASCII and not-confusable Latin-1 Supplement characters are included alongside characters from higher Unicode blocks. I would suggest applying the fix.

zerver/tests/test_invite.py:1281:32: RUF001 String ...
zerver/tests/test_signup.py:1021:29: RUF001 String ...

These tests use the same copy+pasted string to test "non-ASCII characters", 3 from the Basic Latin block and 6 from the Latin-1 Supplement block (5 of which aren't flagged as confusable). Unless the project would like to begin testing characters from higher blocks, I would suggest dropping the character or adding # noqa: RUF001.

@covracer
Copy link
Contributor Author

Bokeh updated in bokeh/bokeh#13668

@charliermarsh
Copy link
Member

Thanks @covracer!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants