Make stl.h `list|set|map_caster` more user friendly. #4686

rwgk · 2023-05-30T20:03:10Z

Description

(The equivalent of this PR was merged as google/pybind11k#30045 on July 17, 2023.)

With this PR, it is no longer necessary to explicitly convert Python iterables to tuple(), set(), or map() in several common situations, for example:

-    fn(tuple({"x": 8, "y": 11}.values()))
+    fn({"x": 8, "y": 11}.values())

-    fn(set({5: None, 12: None}.keys()))
+    fn({5: None, 12: None}.keys())

-    fn(dict(PyMappingWithItems().items()))
+    fn(PyMappingWithItems())

See #4686 (comment) below for a more complete demonstration of the behavior differences. (Note that the diff in that comment is reversed, what is + here is - there and vice versa.)

This PR strikes a careful compromise between being user friendly and being safe. This compromise was informed by 10s of thousands of use cases in the wild (Google codebase, which includes a very large number of open source third-party packages). Concretely, in connection with PyCLIF-pybind11 integration work, the behaviors of PyCLIF (http://github.com/google/clif) and pybind11 needed to be converged. Originally PyCLIF was extremely far on the permissive side of the spectrum, by accepting any Python iterable as input for a C++ vector/set/map argument, as long as the elements were convertible. The obvious (in hindsight) problem was that any empty Python iterable could be passed to any of these C++ types, e.g. {} was accepted for C++ vector/set arguments, or [] for C++ map arguments. To fix this, the code base was cleaned up, and gating functions were added to enforce the desired behavior:

https://github.com/google/clif/blob/ae4eee1de07cdf115c0c9bf9fec9ff28efce6f6c/clif/python/runtime.cc#L388-L438

The exact same gating functions are adopted here. Without this behavior change in pybind11, it would be necessary to insert a very large number of explicit tuple(), set(), map() conversions in the Google codebase, to complete the PyCLIF-pybind11 integration work. While it is very easy to tell new users to pass e.g. tuple(iterable) instead of iterable, it is not easy to convince hundreds of happy existing PyCLIF users to accept retroactively inserting the explicit conversions. The obvious question that will come back: Why do we have to clutter our code like this? There is no good reason. The exact implementation of the gating functions may seem a little arbitrary at first sight, but is in fact informed by what amounts to a large-scale field experiment, with the originally very permissive behavior of PyCLIF.

Suggested changelog entry:

stl.h `list|set|map_caster` were made more user friendly: it is no longer necessary to explicitly convert Python iterables to `tuple()`, `set()`, or `map()` in several common situations.

…rc.ptr()`). Note for completeness: This is a more conservative change than google/pybind11k#30042

Skylion007 · 2023-06-01T22:19:11Z

include/pybind11/stl.h

+        if (!convert) {
+            return false;
+        }
+        if (PyGen_Check(src.ptr())) {


Shouldn't we really be supporting a py::iterable object instead? I think a generator would count here and simplify the code.

I definitely like the general thinking.

Please give me a little more time. I'm in the middle of what I am thinking of as "data driven cleanup".

PyCLIF is very far on the permissive side of the spectrum. — That's what's giving me the data: what is being used, how much.

pybind11 is very far on the strict side.

A glimpse on where I am right now: PyCLIF supports passing any iterable for a C++ set. I think that's going too far to the permissive side of the spectrum. So I'm going through cleaning that up, even though it's a lot of leg work. But I believe it makes the code safer and more readable.

I just pushed the equivalent of the behavior I settled on for PyCLIF. I'm still working on rolling out the PyCLIF behavior changes internally. Only after that can I submit internally and then export the changed PyCLIF to github.

High-level: I'm converging the behaviors

by changing PyCLIF to be stricter & safer,

while this PR makes pybind11 more user friendly without crossing into unsafe territory.

Shouldn't we really be supporting a py::iterable object instead?

That's what PyCLIF used to do, but it has pitfalls that led to lots of unclean test code: Python {} (empty dict) passed as input for std::vector and std::set, [] as input for std::set and std::map.

The originally very permissive PyCLIF behavior gave me a fairly unique set of data: I could see what's being used a lot, what makes sense (IMO), and what doesn't make sense or seems unsafe (again IMO). The new code block near the top of stl.h, transferred here from PyCLIF, encodes my conclusions. With that the internal cleanup turned out to be relatively small, only around 50 changes I had to mail out. This is part of the change description for those (slightly edited):

Motivations for the behavior change:

Improves safety / prevents accidents:

For C++ std::set (and std::unordered_set): The potentially reducing conversion from a Python sequence (e.g. Python list or tuple) to a C++ set must be explicit.

For C++ std::vector: In very rare cases (itertools.chain) the conversion from a Python iterable must be explicit.

Improves readability: Python literals to be passed as C++ sets must be set literals.

I'll finish the rollout of the PyCLIF change, then come back here for polishing maybe and further expanding the tests. Please let me know any high-level suggestions.

The PyCLIF behavior change (with prior cleanup around the codebase) is now complete: google/clif@30c5488

Internally I'm testing (globally) PyCLIF-pybind11 with this PR. Unfortunately some unrelated problem slipped in that I need to troubleshoot first, before I can convince myself that this PR actually fixes all pybind11/stl.h vs PyCLIF behavior differences.

…just tests.

…into `PyObjectTypeIsConvertibleToStdVector()`.

…`, `std::map` conversions. **Please request a rollback only as a last resort**, to buy time for working out a forward fix. Forward fixes are almost certainly trivial: Based on the Python traceback, insert an explicit conversion to `tuple()`, `set()`, or `dict()` (e.g. cl/538602196), or change a Python `set` or `list` literal to the correct type (e.g. cl/538563104). Motivations for the behavior changes: * Improves safety / prevents accidents: * For C++ `std::set` (and `std::unordered_set`): The potentially reducing conversion from a Python sequence (e.g. Python `list` or `tuple`) to a C++ set must be explicit. * For C++ `std::vector` (and `std::array`): In very rare cases (see b/284333274#comment11) the conversion from a Python iterable must be explicit. * For C++ `std::map` (and `std::unordered_map`): The conditions added in this CL (`clif::PyObjectTypeIsConvertibleToStdMap()`) did not break any existing unit tests. * Improves readability: Python literals to be passed as C++ sets must be set literals. * Note for completeness: A Python set can still be passed to a C++ `std::vector`. The rationale is that this conversion is not reducing. Implicit conversions of this kind are also fairly commonly used in google3, therefore enforcing explicit conversions has an unfavorable cost : benefit ratio. * Original trigger for this work: Converge pybind11 & PyCLIF behavior (go/pyclif_pybind11_fusion). * Note that pybind11 is very far on the strict side of the spectrum. See also pybind/pybind11#4686. * PyCLIF was originally extremely far on the permissive side of the spectrum. This CL is already the 2nd cleanup step: prior to cl/525187106 a conversion from any zero-length iterator (e.g. the `{}` empty dict literal) to a C++ `std::vector` or `std::set` was possible. This CL was started from cl/535028028, but the changes under google3/third_party/absl/python were backed out, the change for `std::array` in stltypes.h was added. OSS build tested interactively with copybara to local directory & cmake commands in Ubuntu 20.04 / Python 3.9 docker container. PiperOrigin-RevId: 538823846

rwgk · 2023-06-10T04:24:29Z

Close-reopen to trigger GHA, after adding python dev label.

…ertibleTo*()` implementations.

rwgk · 2023-07-15T17:18:07Z

Original PR title and description (outdated)

Title: WIP: Change list_caster (stl.h) to also accept generator objects.

Description:

WIP — It is still undecided if this is the best compromise.

An easy way to understand the pybind11 behavior change is to look at the small test change under commit 7d30378,

     fn = m.pass_std_vector_int
-    with pytest.raises(TypeError):
-        fn(i for i in range(3))
+    assert fn(i for i in range(3)) == 3

What does this help?

This PR resolves a very large number of global test failures in connection with the PyCLIF-pybind11 integration work. The exact number is still to be determined, but estimated to exceed 200k. The number of root causes is difficult to estimate, order of magnitude 100 probably.

It is very easy to tell new users to pass tuple(generator_object) instead of generator_object to wrapped functions with std::vector or std::array arguments.
It is not easy to convince hundreds of happy existing PyCLIF users to retroactively insert tuple().

The large number of existing use cases is also a signal that the feature is definitely useful in aggregate.

For full context:

This PR is a much more conservative than Accept Python iterables and sequences as C++ sets and vectors google/pybind11k#30042 (rolled back), and safer, because generator objects are guaranteed to be fully consumed, behavior-equivalent to passing tuple(generator_object).
There was exactly one global test failure for Accept Python iterables and sequences as C++ sets and vectors google/pybind11k#30042:
- https://github.com/pytorch/pytorch/blob/3263bd24bee43b4e2c263c0076a2136de6ead947/test/test_native_functions.py#L112

The same test passes with this PR.

2023-07-15 09:55 AM PDT: Global testing with commit 6b5c780 passed (Google-internal test id OCL:538028927:BASE:548369355:1689440116187:326f581c)
2023-06-07 10:53 AM PDT: Global testing with commit b4f767b passed (Google-internal test id OCL:538028927:BASE:538529638:1686159334502:31d711fe).
2023-05-31 6:39 AM Global testing with commit 7d30378 passed (Google-internal test id OCL:536396745:BASE:536692592:1685539068476:4a666603).

rwgk · 2023-07-15T17:40:32Z

Demonstration of behavior changes

New tests in test_stl.py adjusted to work with current master (@ 0620d71).

diff --git a/tests/test_stl.py b/tests/test_stl.py
index 43615843..4bc4ce6e 100644
--- a/tests/test_stl.py
+++ b/tests/test_stl.py
@@ -386,11 +386,11 @@ def test_return_vector_bool_raw_ptr():
 )
 def test_pass_std_vector_int(fn, offset):
     assert fn([7, 13]) == 140 + offset
-    assert fn({6, 2}) == 116 + offset
-    assert fn({"x": 8, "y": 11}.values()) == 138 + offset
-    assert fn({3: None, 9: None}.keys()) == 124 + offset
-    assert fn(i for i in [4, 17]) == 142 + offset
-    assert fn(map(lambda i: i * 3, [8, 7])) == 190 + offset  # noqa: C417
+    assert fn(tuple({6, 2})) == 116 + offset
+    assert fn(tuple({"x": 8, "y": 11}.values())) == 138 + offset
+    assert fn(tuple({3: None, 9: None}.keys())) == 124 + offset
+    assert fn(tuple(i for i in [4, 17])) == 142 + offset
+    assert fn(tuple(map(lambda i: i * 3, [8, 7]))) == 190 + offset  # noqa: C417
     with pytest.raises(TypeError):
         fn({"x": 0, "y": 1})
     with pytest.raises(TypeError):
@@ -399,8 +399,8 @@ def test_pass_std_vector_int(fn, offset):

 def test_pass_std_vector_pair_int():
     fn = m.pass_std_vector_pair_int
-    assert fn({1: 2, 3: 4}.items()) == 406
-    assert fn(zip([5, 17], [13, 9])) == 2222
+    assert fn(tuple({1: 2, 3: 4}.items())) == 406
+    assert fn(tuple(zip([5, 17], [13, 9]))) == 2222


 def test_list_caster_fully_consumes_generator_object():
@@ -416,7 +416,7 @@ def test_list_caster_fully_consumes_generator_object():
 def test_pass_std_set_int():
     fn = m.pass_std_set_int
     assert fn({3, 15}) == 254
-    assert fn({5: None, 12: None}.keys()) == 251
+    assert fn(set({5: None, 12: None}.keys())) == 251
     with pytest.raises(TypeError):
         fn([])
     with pytest.raises(TypeError):
@@ -466,7 +466,7 @@ class FakePyMappingGenObj(FakePyMappingMissingItems):
 def test_pass_std_map_int():
     fn = m.pass_std_map_int
     assert fn({1: 2, 3: 4}) == 4506
-    assert fn(FakePyMappingWithItems()) == 3507
+    assert fn(dict(FakePyMappingWithItems().items())) == 3507
     with pytest.raises(TypeError):
         fn(FakePyMappingMissingItems())
     with pytest.raises(TypeError):

rwgk · 2023-07-17T22:08:43Z

For easy reference: The equivalent of this PR was reviewed and merged under google/pybind11k#30045.

If the review here leads to more changes, I'll port them to pywrapcc after merging here.

@Skylion007

* Snapshot of PR pybind#4686 @ commit 3631886 applied to pywrapcc. * Also use `PyObjectTypeIsConvertibleToStdVector()` in `array_caster`. * Update comment pointing to clif/python/runtime.cc (code is unchanged). * Comprehensive test coverage, enhanced set_caster load implementation. * Resolve clang-tidy eror. * Add a long C++ comment explaining what led to the `PyObjectTypeIsConvertibleTo*()` implementations. * Minor function name change in test. * strcmp -> std::strcmp (thanks @Skylion007 for catching this) * Add `PyCallable_Check(items)` in `PyObjectTypeIsConvertibleToStdMap()` * Resolve clang-tidy error * Use `PyMapping_Items()` instead of `src.attr("items")()`, to be internally consistent with `PyMapping_Check()` * Update link to PyCLIF sources. * Fix typo (thanks @wangxf123456 for catching this)

…rc.ptr()`). Note for completeness: This is a more conservative change than google/pybind11k#30042

…just tests.

…into `PyObjectTypeIsConvertibleToStdVector()`.

…ertibleTo*()` implementations.

…nally consistent with `PyMapping_Check()`

…pybind11 into list_caster_pass_generator

henryiii · 2023-11-15T20:09:21Z

include/pybind11/stl.h

+    }
+    return (PyGen_Check(obj) != 0) || (PyAnySet_Check(obj) != 0)
+           || PyObjectIsInstanceWithOneOfTpNames(
+               obj, {"dict_keys", "dict_values", "dict_items", "map", "zip"});


This requires explicit subclasses, rather than any object that conforms to collections.abc.Sequence? (Likewise for collections.abc.Mapping and collections.abc.Set below) The Pythonic way to do this would be to ask if the object has the methods required by these protocols, rather than checking explicit inheritance.

Specifically, from https://docs.python.org/3/library/collections.abc.html:

Sequence: __getitem__, __len__, __contains__, __iter__, __reversed__, index, and count.

Set: __contains__, __iter__, __len__, __le__, __lt__, __eq__, __ne__, __gt__, __ge__, __and__, __or__, __sub__, __xor__, and isdisjoint.

Mapping: __getitem__, __iter__, __len__, __contains__, keys, items, values, get, __eq__, and __ne__.

In addition to what I wrote yesterday, there was actually another concern in the back of my mind: the current implementation is super fast, direct strcmp calls for 5 (vector) of 1 (set) very short string(s).

In contrast, hasattr is far more complex and probably orders of magnitude slower, although I wouldn't really worry about it in the usual bigger context (it'll not even be a blip on the radar in most cases). The code extra code complexity is the bigger worry.

So there is a code complexity and runtime overhead cost.

What is the benefit?

That brings me back to my "large-scale field experiment" experience hinted in the PR description. In a nutshell:

Current pybind11 is super restrictive.

Original PyCLIF was super permissive.

I had this huge pile of data from the wild contrasting the two. Based on that I could see very clearly that there is very little to gain from being more permissive than what this PR implements. IOW I don't think paying the code complexity and runtime overhead cost will help in any significant way.

Could you please let me know if you have a different opinion? — It will probably cost me a whole day worth of effort to implement and fully test (global testing) attribute inspection. We will also have to agree on what subsets of the attributes we actually want to test against. I expect that there will be some devils in the details.

I think having a short path (check inheritance first) could alleviate much of the speed problem - the slower "correct" check could be done if this is not a subclass of X for common X. Also, this list is somewhat arbitrary - you probably included things (like dict_keys? dict_items?) based on usage (in PyCLIF), rather than making a comprehensive list of everything iterable in Python? Also, I believe checking the attributes on a class (you don't need to check them on the instance) is pretty fast if you avoid the descriptor protocol (typing.runtime_checkable switched to this in 3.12 (since 3.2) which made it much faster).

From a usage standpoint, implementing the Protocol specified by these ABC's is supposed to make your class work wherever the standard library classes work, so it seems ideal if this could follow that. Not required (and we could start with this if you really want to), but it seems like the correct thing to do.

We will also have to agree on what subsets of the attributes we actually want to test against.

This is concretely specified in collections.abc. No arguments needed.

I think having a short path (check inheritance first) could alleviate much of the speed problem

I agree. Which basically means: this PR plus follow-on work to make the check more permissive.

Do you think it's reasonable then to merge this PR, and do the fallback attribute inspection in a separate PR? — To be completely honest, I wouldn't want to prioritize this, for the reasons explained before. It a significant amount of work, extra code complexity, and I don't think very many people will ever notice the difference in practice. But if someone else wants to invest the effort, I'd be on board.

This is concretely specified in collections.abc. No arguments needed.

I see, sounds good to me, but from a purely practical viewpoint, assuming that most duck-typing only implements a subset of the attributes, insisting on the full set of attributes here probably means nobody will ever see a difference in practice, unless they are specifically looking to pass the checks here.

rwgk · 2023-11-15T20:30:06Z

Thanks for the feedback @henryiii!

To explain briefly: I was completely focused on making pybind11 more permissive "just enough" to avoid large-scale changes around the Google codebase, changes that would probably be viewed as annoying regressions by users.

I agree aligning with collections.abc is fundamentally better, but that didn't even cross my mind before. I'll work on that asap.

rwgk added 2 commits May 30, 2023 08:18

Add test_pass_std_vector_int(), test_pass_std_set_int() in test_stl

d2a36d9

Change list_caster to also accept generator objects (`PyGen_Check(s…

7d30378

…rc.ptr()`). Note for completeness: This is a more conservative change than google/pybind11k#30042

Skylion007 reviewed Jun 1, 2023

View reviewed changes

rwgk added 5 commits June 5, 2023 12:33

Drop in (currently unpublished) PyCLIF code, use in list_caster, ad…

039d1b7

…just tests.

Use PyObjectTypeIsConvertibleToStdSet() in set_caster, adjust tests.

bdba857

Use PyObjectTypeIsConvertibleToStdMap() in map_caster, add tests.

ec50ca0

Simplify list_caster load() implementation, push str/bytes check …

eeb59af

…into `PyObjectTypeIsConvertibleToStdVector()`.

clang-tidy cleanup with a few extra (... != 0) to be more consistent.

3631886

rwgk mentioned this pull request Jun 6, 2023

Make stl.h list|set|map_caster more user friendly. google/pybind11k#30045

Merged

rwgk added 2 commits June 6, 2023 18:06

Also use PyObjectTypeIsConvertibleToStdVector() in array_caster.

b4f767b

Merge branch 'master' into list_caster_pass_generator

42e22d7

rwgk added the python dev Working on development versions of Python label Jun 10, 2023

rwgk closed this Jun 10, 2023

rwgk reopened this Jun 10, 2023

rwgk removed the python dev Working on development versions of Python label Jun 10, 2023

rwgk requested review from henryiii and removed request for henryiii June 10, 2023 16:02

rwgk added 8 commits June 27, 2023 15:17

Merge branch 'master' into list_caster_pass_generator

9647a92

Merge branch 'master' into list_caster_pass_generator

1ed90cf

Update comment pointing to clif/python/runtime.cc (code is unchanged).

3cc034b

Merge branch 'master' into list_caster_pass_generator

18c4153

Comprehensive test coverage, enhanced set_caster load implementation.

e7330f2

Resolve clang-tidy eror.

da29bf5

Add a long C++ comment explaining what led to the `PyObjectTypeIsConv…

1fa338e

…ertibleTo*()` implementations.

Minor function name change in test.

6b5c780

rwgk changed the title ~~WIP: Change list_caster (stl.h) to also accept generator objects.~~ Make stl.h list|set|map_caster more user friendly. Jul 15, 2023

Merge branch 'master' into list_caster_pass_generator

6c1ec6a

rwgk added 22 commits August 4, 2023 12:34

Merge branch 'master' into list_caster_pass_generator

cc4d69c

Add test_pass_std_vector_int(), test_pass_std_set_int() in test_stl

dacb827

Change list_caster to also accept generator objects (`PyGen_Check(s…

3fc2d2a

…rc.ptr()`). Note for completeness: This is a more conservative change than google/pybind11k#30042

Drop in (currently unpublished) PyCLIF code, use in list_caster, ad…

ff0724f

…just tests.

Use PyObjectTypeIsConvertibleToStdSet() in set_caster, adjust tests.

be02291

Use PyObjectTypeIsConvertibleToStdMap() in map_caster, add tests.

0a79f55

Simplify list_caster load() implementation, push str/bytes check …

ae95424

…into `PyObjectTypeIsConvertibleToStdVector()`.

clang-tidy cleanup with a few extra (... != 0) to be more consistent.

8aa81d7

Also use PyObjectTypeIsConvertibleToStdVector() in array_caster.

4435ed0

Update comment pointing to clif/python/runtime.cc (code is unchanged).

9aca5e8

Comprehensive test coverage, enhanced set_caster load implementation.

9dc7d7c

Resolve clang-tidy eror.

d9eeea8

Add a long C++ comment explaining what led to the `PyObjectTypeIsConv…

9ac319b

…ertibleTo*()` implementations.

Minor function name change in test.

b834767

strcmp -> std::strcmp (thanks @Skylion007 for catching this)

d19a0ff

Add PyCallable_Check(items) in PyObjectTypeIsConvertibleToStdMap()

69a0da8

Resolve clang-tidy error

c86fa97

Use PyMapping_Items() instead of src.attr("items")(), to be inter…

940e190

…nally consistent with `PyMapping_Check()`

Update link to PyCLIF sources.

f7725e4

Fix typo (thanks @wangxf123456 for catching this)

fa72918

Merge branch 'list_caster_pass_generator' of https://github.com/rwgk/…

a6b9a00

…pybind11 into list_caster_pass_generator

Merge branch 'master' into list_caster_pass_generator

ac3ac4d

rwgk added the pybind11k backport label Sep 12, 2023

henryiii reviewed Nov 15, 2023

View reviewed changes

rwgk added 2 commits November 16, 2023 11:04

Merge branch 'master' into list_caster_pass_generator

87ff92a

Fix typo discovered by new version of codespell.

3ae34f3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make stl.h `list|set|map_caster` more user friendly. #4686

Make stl.h `list|set|map_caster` more user friendly. #4686

rwgk commented May 30, 2023 •

edited

Skylion007 Jun 1, 2023

rwgk Jun 1, 2023

rwgk Jun 5, 2023

rwgk Jun 8, 2023

rwgk commented Jun 10, 2023

rwgk commented Jul 15, 2023

rwgk commented Jul 15, 2023

rwgk commented Jul 17, 2023

henryiii Nov 15, 2023

henryiii Nov 15, 2023 •

edited

rwgk Nov 16, 2023

henryiii Nov 16, 2023 •

edited

rwgk Nov 16, 2023

rwgk commented Nov 15, 2023

Make stl.h list|set|map_caster more user friendly. #4686

Are you sure you want to change the base?

Make stl.h list|set|map_caster more user friendly. #4686

Conversation

rwgk commented May 30, 2023 • edited

Description

Suggested changelog entry:

Skylion007 Jun 1, 2023

Choose a reason for hiding this comment

rwgk Jun 1, 2023

Choose a reason for hiding this comment

rwgk Jun 5, 2023

Choose a reason for hiding this comment

rwgk Jun 8, 2023

Choose a reason for hiding this comment

rwgk commented Jun 10, 2023

rwgk commented Jul 15, 2023

Original PR title and description (outdated)

rwgk commented Jul 15, 2023

Demonstration of behavior changes

rwgk commented Jul 17, 2023

henryiii Nov 15, 2023

Choose a reason for hiding this comment

henryiii Nov 15, 2023 • edited

Choose a reason for hiding this comment

rwgk Nov 16, 2023

Choose a reason for hiding this comment

henryiii Nov 16, 2023 • edited

Choose a reason for hiding this comment

rwgk Nov 16, 2023

Choose a reason for hiding this comment

rwgk commented Nov 15, 2023

Make stl.h `list|set|map_caster` more user friendly. #4686

Make stl.h `list|set|map_caster` more user friendly. #4686

rwgk commented May 30, 2023 •

edited

henryiii Nov 15, 2023 •

edited

henryiii Nov 16, 2023 •

edited