Autoformat confusable units #4430

covracer · 2023-05-14T13:18:52Z

I've seen errors crop up from using the different micro and mu characters. Follow matching recommendations on which character to prefer for micro, ohm, and angstrom. References:

Section 22.2 Letterlike Symbols, subsection Unit Symbols, page 877 of The Unicode Standard, Version 15.0
Section 2.5 Duplicated Characters of Unicode Technical Report 25
SI brochure
https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/confusables.txt

github-actions · 2023-05-14T13:25:17Z

PR Check Results

Ecosystem

ℹ️ ecosystem check detected changes. (+2, -0, 0 error(s))

zulip (+2, -0)

+ zerver/tests/test_invite.py:1197:32: RUF001 [*] String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ zerver/tests/test_signup.py:1033:29: RUF001 [*] String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

Rules changed: 1

Rule	Changes	Additions	Removals
RUF001	2	2	0

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.00     13.9±0.02ms     2.9 MB/sec    1.00     13.9±0.02ms     2.9 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      3.4±0.01ms     4.9 MB/sec    1.00      3.4±0.01ms     4.9 MB/sec
linter/all-rules/numpy/globals.py          1.01    428.0±0.47µs     6.9 MB/sec    1.00    424.9±0.45µs     6.9 MB/sec
linter/all-rules/pydantic/types.py         1.00      5.9±0.01ms     4.4 MB/sec    1.00      5.9±0.02ms     4.4 MB/sec
linter/default-rules/large/dataset.py      1.00      6.8±0.01ms     6.0 MB/sec    1.00      6.8±0.01ms     6.0 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01   1508.9±2.89µs    11.0 MB/sec    1.00   1494.0±3.40µs    11.1 MB/sec
linter/default-rules/numpy/globals.py      1.02    173.2±1.81µs    17.0 MB/sec    1.00    169.7±1.31µs    17.4 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.1±0.00ms     8.2 MB/sec    1.00      3.1±0.01ms     8.2 MB/sec
parser/large/dataset.py                    1.01      5.2±0.01ms     7.8 MB/sec    1.00      5.2±0.00ms     7.8 MB/sec
parser/numpy/ctypeslib.py                  1.00   1020.1±0.54µs    16.3 MB/sec    1.00   1019.2±0.66µs    16.3 MB/sec
parser/numpy/globals.py                    1.00    105.6±0.27µs    27.9 MB/sec    1.00    105.5±0.16µs    28.0 MB/sec
parser/pydantic/types.py                   1.01      2.2±0.00ms    11.4 MB/sec    1.00      2.2±0.00ms    11.4 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.00     16.6±0.20ms     2.5 MB/sec    1.00     16.6±0.31ms     2.5 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      4.2±0.05ms     4.0 MB/sec    1.00      4.2±0.04ms     4.0 MB/sec
linter/all-rules/numpy/globals.py          1.01   506.7±14.96µs     5.8 MB/sec    1.00    501.2±6.66µs     5.9 MB/sec
linter/all-rules/pydantic/types.py         1.00      7.0±0.08ms     3.6 MB/sec    1.00      7.0±0.08ms     3.6 MB/sec
linter/default-rules/large/dataset.py      1.00      8.2±0.14ms     4.9 MB/sec    1.00      8.3±0.07ms     4.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00  1771.6±24.87µs     9.4 MB/sec    1.00  1778.4±26.63µs     9.4 MB/sec
linter/default-rules/numpy/globals.py      1.00    203.4±3.13µs    14.5 MB/sec    1.02    207.3±7.52µs    14.2 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.7±0.05ms     6.9 MB/sec    1.01      3.8±0.07ms     6.8 MB/sec
parser/large/dataset.py                    1.00      6.3±0.05ms     6.4 MB/sec    1.02      6.4±0.07ms     6.3 MB/sec
parser/numpy/ctypeslib.py                  1.00  1192.4±16.10µs    14.0 MB/sec    1.02  1221.5±20.10µs    13.6 MB/sec
parser/numpy/globals.py                    1.00    123.3±2.36µs    23.9 MB/sec    1.01    124.7±2.84µs    23.7 MB/sec
parser/pydantic/types.py                   1.00      2.7±0.04ms     9.4 MB/sec    1.01      2.7±0.03ms     9.3 MB/sec

MichaReiser

Thanks for your contribution.

I'm not very familiar with unicode confusables. So bare with me ;)

Our current table is based on the VS code table and the standard seems to differentiate between mixed script and other confusables. I believe we currently only test for mixed-script confusables (similar to unicode-security). Do you know if these added scripts are also mixed script confusables or general confusables?

MichaReiser · 2023-05-15T07:02:52Z

scripts/update_ambiguous_characters.py


 use once_cell::sync::Lazy;
 use rustc_hash::FxHashMap;

 /// Via: <https://github.com/hediet/vscode-unicode-data/blob/main/out/ambiguous.json>
 /// See: <https://github.com/microsoft/vscode/blob/095ddabc52b82498ee7f718a34f9dd11d59099a8/src/vs/base/common/strings.ts#L1094>
-pub(crate) static CONFUSABLES: Lazy<FxHashMap<u32, u8>> = Lazy::new(|| {
+pub(crate) static CONFUSABLES: Lazy<FxHashMap<u32, u32>> = Lazy::new(|| {


Nit: I wonder if we should instead change the type to <char, char>. Having the chars in place could make the table a bit more readable and avoids the unwrap calls.

@MichaReiser - That seems reasonable to me, but will it increase the size of this map?

covracer · 2023-05-16T12:00:18Z

Thanks for taking a look!

The VS Code list only contains "... characters (key) that are confusable with a basic ascii [0x20 to 0x7E] character (value)". https://github.com/hediet/vscode-unicode-data/blob/a2ee3e17acaad2957f910cf6c46f6ea2c6eae753/index.ts#L159

I wonder why they did that. Was it an easy starting point and they haven't been motivated to build it out further? Would a full table would be too slow in node.js? 🤷

I too am new to Unicode security, and am still trying to figure out how to apply domain name spoofing concepts to code linting and autoformatting. TR 39 suggested reading TR 36 first, where I found this:

While compatibility normalization and mixed-script detection can handle the majority of spoofing cases, they do not handle single-script confusables. https://www.unicode.org/reports/tr36/#Single_Script_Spoofing

So it seems like normalization comes first in the process, before mixed script or single script handling. The substitutions I'm adding are normalizations.

A longer discussion of normalization from TR 36:

Fortunately the design of IDN prevents a huge number of spoofing attacks. All conformant users of [IDNA2003] are required to process domain names to convert what are called compatibility-equivalent characters into a unique form using a process called compatibility normalization (NFKC)—for more information on this, see [UAX15]. This processing eliminates most possibilities for visual spoofing by mapping away a large number of visually confusable characters and sequences.

charliermarsh · 2023-06-02T03:10:02Z

crates/ruff/src/rules/ruff/rules/confusables.rs

@@ -2116,5 +2116,8 @@ pub(crate) static CONFUSABLES: Lazy<FxHashMap<u32, u8>> = Lazy::new(|| {
        (1059, 89),
        (65283, 35),
        (65307, 59),
+        (0x212B, 0x00C5), // ANGSTROM SIGN → LATIN CAPITAL LETTER A WITH RING ABOVE
+        (0x2126, 0x03A9), // OHM SIGN → GREEK CAPITAL LETTER OMEGA
+        (0x00B5, 0x03BC), // MICRO SIGN → GREEK SMALL LETTER MU


Based on Section 2.5 in https://www.unicode.org/reports/tr25/, should this also include Kelvin sign?

The list from VS Code covers Kelvin with (8490, 75).

zanieb · 2023-10-19T16:16:34Z

I believe we were concerned about increasing the size of the map here but with use of match it shouldn't matter anymore? @covracer would you be interested in updating with the latest changes on main and we can try to get this merged?

This reverts commit e8879d5.

charliermarsh · 2023-11-03T04:59:01Z

Thanks, and sorry for the extensive delay here.

github-actions · 2023-11-03T05:14:49Z

`ruff-ecosystem` results

Linter (stable)

ℹ️ ecosystem check detected linter changes. (+9 -0 violations, +0 -0 fixes in 41 projects)

bokeh/bokeh (+1 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --select ALL

+ examples/interaction/tools/subcoordinates_zoom.py:22:23: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

pandas-dev/pandas (+6 -0 violations, +0 -0 fixes)

+ pandas/_libs/tslibs/timedeltas.pyi:58:6: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5352:26: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5353:46: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5354:69: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/tests/io/test_clipboard.py:67:34: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/tests/scalar/timedelta/test_constructors.py:306:25: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

zulip/zulip (+2 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --select ALL

+ zerver/tests/test_invite.py:1281:32: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ zerver/tests/test_signup.py:1021:29: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

Changes by rule (2 rules affected)

code	total	+ violation	- violation	+ fix	- fix
RUF001	6	6	0	0	0
RUF003	3	3	0	0	0

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+9 -0 violations, +0 -0 fixes in 41 projects)

bokeh/bokeh (+1 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --select ALL --preview

+ examples/interaction/tools/subcoordinates_zoom.py:22:23: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

pandas-dev/pandas (+6 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --preview

+ pandas/_libs/tslibs/timedeltas.pyi:58:6: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5352:26: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5353:46: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/core/indexes/base.py:5354:69: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/tests/io/test_clipboard.py:67:34: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ pandas/tests/scalar/timedelta/test_constructors.py:306:25: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

zulip/zulip (+2 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --select ALL --preview

+ zerver/tests/test_invite.py:1281:32: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ zerver/tests/test_signup.py:1021:29: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?

Changes by rule (2 rules affected)

code	total	+ violation	- violation	+ fix	- fix
RUF001	6	6	0	0	0
RUF003	3	3	0	0	0

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

charliermarsh · 2023-11-03T12:58:21Z

Note to self: revisit these ecosystem changes today.

charliermarsh · 2023-11-03T13:50:32Z

\cc @zanieb - Do you have an opinion on whether this should be a preview change...?

zanieb · 2023-11-03T15:12:39Z

@charliermarsh looks like it creates new violations and it's not a bug fix so it should probably be preview?

charliermarsh · 2023-11-03T16:09:32Z

Done in #8473.

covracer · 2023-11-30T16:12:56Z

Thanks! I've looked over the hits and made a best guess at how to proceed. Over time I'll try to test and submit Pull Requests.

examples/interaction/tools/subcoordinates_zoom.py:22:23: RUF001 String ...

I would suggest applying the fix for this abbreviation of microvolts in tooltip text.

pandas/_libs/tslibs/timedeltas.pyi:58:6: RUF001 String ...
pandas/tests/scalar/timedelta/test_constructors.py:306:25: RUF001 String ...

timedeltas.pyi and timedeltas.pyx list many aliases for microseconds (among other units). test_constructors.py tests using the aliases. I would suggest # noqa: RUF001.

pandas/core/indexes/base.py:5352:26: RUF003 Comment ...
pandas/core/indexes/base.py:5353:46: RUF003 Comment ...
pandas/core/indexes/base.py:5354:69: RUF003 Comment ...

This multiline comment documents performance analysis timings. I would suggest applying the fix.

pandas/tests/io/test_clipboard.py:67:34: RUF001 String ...

This is part of dict[str, list[str]] test data for utf8. ASCII and not-confusable Latin-1 Supplement characters are included alongside characters from higher Unicode blocks. I would suggest applying the fix.

zerver/tests/test_invite.py:1281:32: RUF001 String ...
zerver/tests/test_signup.py:1021:29: RUF001 String ...

These tests use the same copy+pasted string to test "non-ASCII characters", 3 from the Basic Latin block and 6 from the Latin-1 Supplement block (5 of which aren't flagged as confusable). Unless the project would like to begin testing characters from higher blocks, I would suggest dropping the character or adding # noqa: RUF001.

covracer · 2024-04-26T11:52:32Z

Bokeh updated in bokeh/bokeh#13668

charliermarsh · 2024-04-26T13:16:18Z

Thanks @covracer!

MichaReiser reviewed May 15, 2023

View reviewed changes

charliermarsh reviewed Jun 2, 2023

View reviewed changes

covracer force-pushed the confusable-units branch from 6aaf205 to a9cbfdd Compare June 3, 2023 11:59

covracer requested a review from charliermarsh June 3, 2023 12:22

covracer and others added 2 commits November 3, 2023 00:29

Satisfy clippy

e8879d5

Revert "Satisfy clippy"

25bccf0

This reverts commit e8879d5.

charliermarsh force-pushed the confusable-units branch from a9cbfdd to 2bb2135 Compare November 3, 2023 04:41

charliermarsh self-assigned this Nov 3, 2023

Rebase

292bf7f

charliermarsh force-pushed the confusable-units branch from 2bb2135 to 292bf7f Compare November 3, 2023 04:47

Merge branch 'main' into confusable-units

9e24e64

charliermarsh enabled auto-merge (squash) November 3, 2023 04:54

charliermarsh merged commit 9f30ccc into astral-sh:main Nov 3, 2023
16 checks passed

charliermarsh mentioned this pull request Nov 3, 2023

Use characters instead of u32 in confusable map #8463

Merged

miccal mentioned this pull request Nov 3, 2023

ruff 0.1.4 Homebrew/homebrew-core#153286

Merged

covracer deleted the confusable-units branch November 30, 2023 15:21

covracer mentioned this pull request Mar 29, 2024

F821 triggers when using non-ASCII characters #942

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Autoformat confusable units #4430

Autoformat confusable units #4430

covracer commented May 14, 2023

github-actions bot commented May 14, 2023 •

edited

MichaReiser left a comment

MichaReiser May 15, 2023

charliermarsh Jun 2, 2023

covracer commented May 16, 2023

charliermarsh Jun 2, 2023

covracer Jun 3, 2023

zanieb commented Oct 19, 2023

charliermarsh commented Nov 3, 2023

github-actions bot commented Nov 3, 2023

charliermarsh commented Nov 3, 2023

charliermarsh commented Nov 3, 2023

zanieb commented Nov 3, 2023

charliermarsh commented Nov 3, 2023

covracer commented Nov 30, 2023

covracer commented Apr 26, 2024

charliermarsh commented Apr 26, 2024

Autoformat confusable units #4430

Autoformat confusable units #4430

Conversation

covracer commented May 14, 2023

github-actions bot commented May 14, 2023 • edited

PR Check Results

Ecosystem

Benchmark

Linux

Windows

MichaReiser left a comment

Choose a reason for hiding this comment

MichaReiser May 15, 2023

Choose a reason for hiding this comment

charliermarsh Jun 2, 2023

Choose a reason for hiding this comment

covracer commented May 16, 2023

charliermarsh Jun 2, 2023

Choose a reason for hiding this comment

covracer Jun 3, 2023

Choose a reason for hiding this comment

zanieb commented Oct 19, 2023

charliermarsh commented Nov 3, 2023

github-actions bot commented Nov 3, 2023

ruff-ecosystem results

Linter (stable)

Linter (preview)

Formatter (stable)

Formatter (preview)

charliermarsh commented Nov 3, 2023

charliermarsh commented Nov 3, 2023

zanieb commented Nov 3, 2023

charliermarsh commented Nov 3, 2023

covracer commented Nov 30, 2023

covracer commented Apr 26, 2024

charliermarsh commented Apr 26, 2024

github-actions bot commented May 14, 2023 •

edited

`ruff-ecosystem` results