
In-place copy of a package gets metadata from installed package #94181

Closed
jaraco opened this issue Jun 23, 2022 · 8 comments
Labels: pending (the issue will be closed if no feedback is provided), topic-importlib

@jaraco (Member) commented Jun 23, 2022

I built an in-place version of numpy with python setup.py build_ext -i (note that this will not produce a .dist-info dir at the root of the repo; distutils doesn't generate one), and then used PYTHONPATH to switch between the numpy installed into the conda env and this in-place build:

$ echo $PYTHONPATH
/home/rgommers/code/numpy

Then open IPython and:

>>> import numpy as np
>>> np.__version__
'1.24.0.dev0+291.g2c5f407cf6'
>>> import importlib.metadata
>>> importlib.metadata.version("numpy")
'1.22.3'

So rather than returning None or raising an exception, I'm getting the metadata for the wrong package.

I haven't yet decided if metadata items being undefined should result in None or raise an Exception (maybe KeyError).

Either way would be fine with me, as long as it's not silently returning incorrect metadata. There will be other cases like this beyond numpy, and PYTHONPATH is pretty commonly used to switch between packages when developing. A .dist-info directory may be required to get correct metadata, but its absence should not cause incorrect results; that's a clear bug.

Originally posted by @rgommers in #91216 (comment)

@jaraco (Member, Author) commented Jun 23, 2022

By design, a distribution package and its metadata are loosely coupled from the Python package(s), for better or for worse. That is, when you import numpy, that's a different "numpy" than when you call importlib.metadata.version('numpy') or importlib.metadata.distribution('numpy'). In some projects, the distribution name doesn't match the modules/packages at all (setuptools supplies setuptools and pkg_resources, Django supplies django, pytest supplies _pytest).
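
On Python 3.10+, packages_distributions() makes this mapping visible directly (a quick illustration; the output assumes an environment with pytest installed):

>>> import importlib.metadata as md
>>> md.packages_distributions()['_pytest']
['pytest']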

importlib.metadata does honor PYTHONPATH and sys.path, but it's because of the separation of metadata and installed modules that the issue arises.

The problem is that by building numpy using old tools and manually patching the environment, you're creating an invalid state (you've upgraded numpy but without replacing the metadata). When you request metadata for numpy, you get the metadata from the only known "numpy" distribution found, the previously-installed package.

If instead you had used pip (or another standards-based tool) to install the package under development, it would have required the metadata to be generated and present, and would have detected that an existing install needed to be removed to make room for the new version.

It's also not necessarily the case that you couldn't have both versions present. importlib.metadata will handle multiple distributions existing on sys.path and will give preference to the first one. Really, what it boils down to is that you have to actually generate the metadata and make sure that metadata is present and ahead on sys.path.
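
To see that preference order concretely, one can enumerate every "numpy" distribution visible on the current path; version('numpy') resolves to the first (a sketch; locate_file('') is used here only to show which path entry each candidate's metadata came from):

>>> import importlib.metadata as md
>>> for dist in md.distributions(name='numpy'):
...     print(dist.version, dist.locate_file(''))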

You're right that distutils doesn't install dist-info metadata. It does have an install_egg_info command, but I wouldn't recommend it, and I'm not confident that format is supported by importlib.metadata anyway.

Your best bet is to use something like setuptools to build/install the package (perhaps in an "editable" mode) and ensure that the metadata is built/present.

If you wish to programmatically override the search path when resolving the distribution, you could do something like:

>>> import importlib.metadata as md
>>> # find a numpy distribution but look only in /dev/null
>>> dist = next(md.distributions(name='numpy', path=['/dev/null']), None)
>>> dist is None
True

I recognize, however, that's not particularly convenient or interesting.

Another option could be to uninstall the installed copy of numpy. Then importlib.metadata.version('numpy') would raise a PackageNotFoundError.

Unfortunately, I don't see any way in the current design for importlib.metadata to know to ignore the metadata that it finds in the environment.

Now that I think about it, maybe there is a cheap and easy way. What if you just create an empty metadata for this in-development package with:

$ touch $HOME/code/numpy/numpy.dist-info

I believe with that, you'll now get either an error or None for .version('numpy') (depending on the outcome of #91216). Taking such an action is the cheap metadata equivalent of setting PYTHONPATH=$HOME/code/numpy.

Does that provide an adequate workaround?

@rgommers commented

The problem is that by building numpy using old tools and manually patching the environment, you're creating an invalid state (you've upgraded numpy but without replacing the metadata)

I don't think old tools are the issue here. For a pure Python package, you can use PYTHONPATH without invoking any build tool. This is an easy way, perhaps the easiest way, to switch between two versions of the same package. I would not call this "invalid state"; it's simply a convenient development practice.

It's also not necessarily the case that you couldn't have both versions present

I think the second one is "present" in a physical sense (files on disk in an environment), but there's no way that import numpy will ever be able to get at that package version.

Does that provide an adequate workaround?

I don't really need a workaround. For context, this came up in this thread on the packaging Discourse where it was suggested that importlib.metadata.version was a good replacement for a __version__ attribute. I tested it in the first open environment I had available in a terminal, and immediately ran into this discrepancy. So if this is not a bug, it's very clear that the two are not the same, and __version__ is needed as the only 100% reliable way of determining the version of a module.

Unfortunately, I don't see any way in the current design for importlib.metadata to know to ignore the metadata that it finds in the environment.

It looks like importlib.import_module actually does look in the right place and imports the first version in the regular import order (i.e., where PYTHONPATH points). So perhaps a consistency check can be done? Raising an exception when the locations of the package and the metadata do not match would solve the issue.
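
For instance, the two locations such a check would compare can already be queried (a sketch; the site-packages path shown is illustrative, and locate_file('') stands in for a proper "location" accessor):

>>> import importlib.util
>>> importlib.util.find_spec('numpy').origin  # what `import numpy` would load
'/home/rgommers/code/numpy/numpy/__init__.py'
>>> import importlib.metadata as md
>>> md.distribution('numpy').locate_file('')  # where the metadata was found
PosixPath('/path/to/conda-env/lib/python3.10/site-packages')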

@jaraco (Member, Author) commented Aug 14, 2022

You're right. importlib.import_module and importlib.metadata deal with two largely separate concerns. The former deals with Python modules/packages and their contents; the latter deals with metadata sitting alongside Python modules/packages. The only association between these two domains is that they both share the PYTHONPATH (or more precisely, sys.meta_path) search path. Metadata is loaded without ever touching the imported modules. Because the association between these two domains is loosely coupled, there's no good way to detect that the metadata you loaded is not for a package you might have imported or might later import. The reason this approach works for most users is that most environments are well-behaved, installing a package and its metadata reliably to a prominent path (pip basically enforces this expectation).
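
The shared machinery is visible on a stock CPython: metadata discovery is routed through whichever sys.meta_path finders implement find_distributions (the exact repr varies by version):

>>> import sys
>>> [finder for finder in sys.meta_path if hasattr(finder, 'find_distributions')]
[<class '_frozen_importlib_external.PathFinder'>]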

Now in the case of an imported package detecting its own version, that might be more feasible... because an imported package knows its own context and can limit the search path when discovering metadata.

So one could add this to numpy/__init__.py:

import importlib.metadata as metadata
from os.path import dirname

# two levels up from numpy/__init__.py: the sys.path entry numpy was loaded from
numpy_where = dirname(dirname(__spec__.origin))
dist, = metadata.distributions(name='numpy', path=[numpy_where])
__version__ = dist.version

That would in most cases resolve the metadata only from the same path where numpy was imported (or fail). I'm not super-confident that dirname(dirname(__spec__.origin)) is always a reliable way to derive the appropriate path entry from an imported module. I'm fairly confident it will work in the standard file system importer and zip importer, but I'm less confident about other custom finders/loaders that might implement loading of modules or metadata differently.

But even thinking about sys.path, I notice that '' on sys.path results in __spec__.origin having absolute paths, so while loading from dirname(dirname(__spec__.origin)) might work, it does mean that metadata is getting loaded from a different path than it would if loaded from sys.path.


Somewhat related, I've stopped (and recommended others to stop) supplying __version__ in packages, once I realized that there isn't a direct relationship between the distribution package version (the origin of that metadata) and the Python module/package, and I realized that any user who cares about the version can simply query it as importlib.metadata.version(dist_pkg_name) directly. I realize it's sometimes convenient to have pkg.__version__, but it's difficult to keep that value consistent with the distribution metadata, given that there's no direct relationship between the module and its metadata.

Things would be different if Python's packaging ecosystem had an explicit, direct relationship between modules and their metadata.

jaraco self-assigned this on Aug 14, 2022
@rgommers commented

Thanks @jaraco, that all makes sense, and I agree there's only a loose coupling between metadata and imported package. Please feel free to close this as invalid/wontfix if you prefer not to do a consistency check inside importlib.

Somewhat related, I've stopped (and recommended others to stop) supplying __version__ in packages, once I realized that there isn't a direct relationship between the distribution package version (the origin of that metadata) and the Python module/package,

This is the one thing I disagree with. Given that there is only a loose coupling, and that I want to get the actual version corresponding to the imported package, it is clear to me that __version__ is a hard necessity. It's also a lot easier on the user than a very verbose importlib invocation. So I'm convinced both are useful, just for different purposes.

@jaraco (Member, Author) commented Aug 20, 2022

It's not at all obvious to me what sort of consistency check would even be possible. Thinking about the scenario you've described, here's one way it could maybe be done:

  • Expose an interface on Distribution to return the "location" of that distribution's metadata.
  • When querying for any distribution's metadata, open the top_level.txt file in the metadata and, for each module named there, use importlib.util.find_spec to check that its origin matches the location of the Distribution.
  • Exclude any distributions that fail the check (a rough sketch follows this list).
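
A minimal sketch of that check, assuming the first bullet's "location" accessor exists; here the private _path attribute of PathDistribution stands in for it, purely for illustration:

import importlib.util
import importlib.metadata as md
from pathlib import Path


def passes_check(dist: md.Distribution) -> bool:
    # top_level.txt lists the top-level modules a distribution installs;
    # not every distribution ships it.
    top_level = dist.read_text('top_level.txt')
    if top_level is None:
        return True  # nothing to check against
    # assumption: _path (the .dist-info directory) is a private detail;
    # a public "location" accessor is what the first bullet proposes.
    location = Path(dist._path).parent
    for mod_name in top_level.split():
        spec = importlib.util.find_spec(mod_name)
        if spec is None or spec.origin is None:
            return False
        if location not in Path(spec.origin).parents:
            return False
    return True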

This computation is probably prohibitively expensive (definitely not something you'd want as an operation that occurs on every import of every top-level package).

Which leads me to think the best thing that can be done here is to provide a helper function that implements the technique I described in #94181 (comment), something like:

import importlib.metadata as metadata
from os.path import dirname


def dist_for_module(name, spec):
    # limit the metadata search to the path entry the module was loaded from
    where = dirname(dirname(spec.origin))
    dist, = metadata.distributions(name=name, path=[where])
    return dist

Then numpy could do something like:

# numpy/__init__.py
import importlib.metadata as metadata

__version__ = metadata.dist_for_module('numpy', __spec__).version

Would that be worth exploring? Do you have any other ideas?

jaraco added the pending label on Sep 10, 2022
@jaraco (Member, Author) commented Oct 16, 2022

@rgommers I'm keenly interested in your take on my latest proposal.

@jaraco (Member, Author) commented Nov 26, 2022

Closing as languishing, but feel free to revive the conversation.

@dansebcar commented Feb 11, 2023

I also think a method to find the metadata for an imported package would be useful. I was using metadata.version(__package__) until I realised that it returns the wrong answer in this case. Here's an implementation that works for me:

from importlib import machinery, metadata
from pathlib import Path


def module_dist(spec: machinery.ModuleSpec) -> metadata.Distribution:
    path = Path(spec.origin)

    # For a package, origin points at __init__.py; step up to the
    # package directory so that path.parent below is the sys.path entry.
    if spec.parent:
        path = path.parent

    # Search for metadata only on the path entry the module was loaded from.
    dists = metadata.distributions(name=spec.name, path=[path.parent])
    try:
        return next(dists)
    except StopIteration as error:
        raise metadata.PackageNotFoundError(spec.name) from error

The usage is __version__ = module_dist(__spec__).version (with a try/except in case the package isn't installed at all; a sketch follows). It supports package directories and module files, but probably not namespace packages or zip files (which I'm not familiar with).
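
Spelled out, that fallback might look like this (a sketch; the '0+unknown' placeholder is just an example):

# mypackage/__init__.py
try:
    __version__ = module_dist(__spec__).version
except metadata.PackageNotFoundError:
    # running from a checkout with no installed metadata
    __version__ = '0+unknown'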
