Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redo Autoimport #464

Closed
wants to merge 67 commits into from
Closed

Redo Autoimport #464

wants to merge 67 commits into from

Conversation

bagel897
Copy link
Contributor

@bagel897 bagel897 commented Apr 5, 2022

Description

Rewrite of the Autoimport code.
The goal is to create an autoimport module capable of being used in pylsp.

Upgrades:

  1. Uses sqllite3 database to store names (either in memory or on disk without manually calling write)
  2. Uses ast parser to parse files quickly without importing them (it imports builtin and compiled so modules)
  3. Multiprocess module loader so it can parse my entire python library (208 packages, 700 mb) in ~ 3.705s (8c/8t cpu ryzen 4700u)
  4. Minimal database size: ~3.2mb
  5. Adds option to search for a package and formats an import statement for it
    image
  6. Basic source classification and sorting
  7. Parses type of a name - Class, function, or variable/constant

Sidegrades (subjective):

  1. Uses modern features such as f-strings (>=3.6) and type hints (>=3.5), pathlib.

Downgrades:

  1. breaks parts of the API, incomplete list below
    a. There's less control of how to handle underlined imports
    b. some changes have to be made regarding how modules are named: "module" wouldn't work but "project.module" will unless added to the path. Fixed
    c. submodules taken out of public API (minor change). Resulting test removed
    d. internal modules must use the resource system. Resulting test removed.
  2. Requires AutoImport object to be closed to work properly with a persistent sqllite3 database (it functions fine without closing, but it should be just in case)

Issues with the resource system

These largely existed before

  1. There is no way to know if an external package was upgraded. In a future PR, it will use the modified time to do this
  2. The rope method of finding python files is inefficient. Switching to pathlib for that (or ideally, everywhere) would increase speed. I'm not sure how to reconcile this with the ignore paths.

TODO (will mark as ready when done):

  • I have added a parser for built in modules
  • I have added a parser for compiled modules
  • I have improved handling of duplicates such as importing the same package
  • I have fixed the performance issues with resource based imports (mostly, at least from my end)
  • I have added the type of a name to the database.
  • I have added tests that prove my fix is effective or that my feature works (haven't added any, but it runs under existing tests. Not sure how to test something which indexes literally everything)
  • I have updated CHANGELOG.md
  • I have made corresponding changes to user documentation for new features, Not Applicable
  • I have made corresponding changes to library documentation for API changes

@bagel897 bagel897 changed the title (WIP) (WIP) Redo Autoimport Apr 5, 2022
@bagel897 bagel897 changed the title (WIP) Redo Autoimport Redo Autoimport Apr 5, 2022
@bagel897 bagel897 marked this pull request as ready for review April 5, 2022 07:26
rope/contrib/autoimport/autoimport.py Outdated Show resolved Hide resolved
rope/contrib/autoimport/autoimport.py Outdated Show resolved Hide resolved
rope/contrib/autoimport/autoimport.py Outdated Show resolved Hide resolved

def generate_modules_cache(
self,
modules_to_find: List[str] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename modules_to_find back to modules. This is part of the public API of the AutoImport class, so unfortunately it can't be changed.

Also, we need to keep the task_handle argument, as it's used by editor integrations to display progress report.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Done
  2. I added a placeholder, but I'll figure out how to add a full one later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 2, I basically suffer from a simple issue: a job set relies on knowing the number of jobs ahead of time, while my code doesn't. So I just set the initial count to 0 and manually incremented the count as it discovered modules and it looks mostly fine. Ideally, there'd be a better way (and shared interfaces for NullJobSets/TaskHandles and their non-Null counterparts). However, I'm not too familiar on how to test this or anything.

self.underlined = underlined
self.names = project.data_files.read_data("globalnames")
if self.names is None:
self.names = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How difficult would it be to retain self.names here as a property that generates the names from the database?

python-mode (which integrates rope into vim) uses AutoImport.names to decide when to regenerate cache. If it's empty, then it calls generate_modules_cache().

I'm rather iffy with how it does that, I'd rather they should update it when they next update their builtin rope, but unfortunately it currently has to do that since there's no public API detect the need to regenerate cache. But the fact that one integration does this means that there may be other integrations that does something similar as well.

The alternative to removing .names is that at the very least python-mode will need to be updated to no longer rely on this attribute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It maintains a separate database of indexed packages, which it uses to determine what needs to be updated currently in around 0.008 sec. In the future I want to add modified times to this. That way generating the cache works with existing caches (although the resource side does probably not work as well)/
I can make a function that checks the length of modules that need to be updated, but then we're doubling that computation across each of them.

.github/workflows/main.yml Outdated Show resolved Hide resolved
if package in banned or (package.startswith("_") and not underlined):
return # Builtins is redundant since you don't have to import it.
try:
module = import_module(str(package))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think compiled-only packages probably should not be indexed by default. import-ing a compiled python module involves executing executable code from the imported module's, which means it may include running potentially harmful or malicious code.

Unless we can statically analyze the compiled modules somehow, this should not be default behavior.

At the very least, I'd suggest removing this code from get_names() for now or restricting it to standard library modules and re-think on how to handle compiled modules as a separate ticket/PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I limited importing compiled modules to Project, Standard, and Builtin modules which isn't 100% accurate. I can remove Project or Standard, but builtin has to remain for modules like os and sys.
I agree that not importing modules is a good idea, but we do need it unless we want to bundle a python decompiler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove Project as well. Just because it's in your own project directory does not necessarily mean that it's trusted code. People may be checking out untrusted codebase so they can review whether or not it is malicious.

In the long run, I think we might make this a config option. If people want to enable this option in a codebase that they know they can trust and understand the implications and risks, that's on them. But the defaults should be as conservatively safe as possible.

Keeping Standard library and Builtin modules, I think are fine. They are a much less likely avenue for these kind of attacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It doesn't really make sense to keep Project since it'd require the user to have a compiled .so module in their local project without the python code. In the future, we could add a whitelist or something for these edge cases.

ropetest/contrib/autoimporttest.py Outdated Show resolved Hide resolved
@bagel897
Copy link
Contributor Author

Hi bageljrkhanofemus, this is really great work. Just to give an interim update that I hadn't forgotten about this PR wink. I'm still wrapping my head around this PR as there's quite a lot of changes from the original implementation.

This is my interim review notes, some of these are just stream of thoughts, so do let me know if anything sounds a bit unclear:

* I would've suggested trying out a less recursive implementation, and instead making a generator that yields `(modname, filename)` (named?)tuples. That might simplify the logic of some of the recursive calls.

* with the generator described above, we can then untangle code for the recursive listing of module (which might not benefit from parallelization much) from parsing of files (which can definitely benefit from running in parallel).

* The current implementation seems to parallelize per module, which is fine, but might not be optimal if some packages are much larger than others, as large package will be processed essentially single threaded. I would suggest trying parallelizing per-module instead.

A pseudocode sketch of what this might look like:

ImportName = str

def generate_modules_cache(...):
    valid_modules = filter(is_valid_modules, iter_modules())
    with ProcessPoolExecutor() as executor:
        for name_list in executor.map(get_names_from_file, names_in_modules):
            for names_list in names_in_modules:
                names_list = filter(is_valid_name, names_list)  # maybe we should still index underscored names, and skip them when calling `import_assist` instead 
                self._add_names(..., names_list, ...)

def is_valid_module(modname: ImportName, filename: Path) -> bool:
    if should skip because underlined:
        return False
    if should skip because in ignore file:
        return False
    return True

def iter_modules() -> Generator[Tuple[ImportName, Path]]:
    for package in self._get_available_packages():
        for module in self.find_modules(package):
            ...

def find_modules(package) -> Tuple[ImportName, Path]:
    ... maybe just path.glob("**.py") and some logic to handle __init__.py and __main__.py ...

Of course, there's a while lot of handwaving the details above.

* Would the `names` table benefit from a `CREATE INDEX`?

Thanks for this feedback. I incorporated the ideas of using namedtuples and generators, which made the code much cleaner. I'll look into CREATE INDEX later.

"""Get all the names from a given file using ast."""
with open(module, mode="rb") as file:
try:
root_node = ast.parse(file.read())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're already using pathlib, we might as well use module.read_bytes() here instead of open()-ing explicitly. read_bytes() would not require context manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if package in banned or (package.startswith("_") and not underlined):
return # Builtins is redundant since you don't have to import it.
try:
module = import_module(str(package))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove Project as well. Just because it's in your own project directory does not necessarily mean that it's trusted code. People may be checking out untrusted codebase so they can review whether or not it is malicious.

In the long run, I think we might make this a config option. If people want to enable this option in a codebase that they know they can trust and understand the implications and risks, that's on them. But the defaults should be as conservatively safe as possible.

Keeping Standard library and Builtin modules, I think are fine. They are a much less likely avenue for these kind of attacks.

return Source.PROJECT
if package.as_posix().__contains__("site-packages"):
return Source.SITE_PACKAGE
if package.as_posix().startswith(sys.prefix):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reasons for using package.as_posix() instead of str(package)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. Changed it to str

rope/contrib/autoimport/utils.py Outdated Show resolved Hide resolved
rope/contrib/autoimport/utils.py Outdated Show resolved Hide resolved
rope/contrib/autoimport/utils.py Outdated Show resolved Hide resolved
@bagel897 bagel897 marked this pull request as ready for review April 14, 2022 21:19
@lieryan
Copy link
Member

lieryan commented Apr 28, 2022

I've been test-driving this branch for the last couple weeks at work and I think it's been pretty great, really great work @bageljrkhanofemus. I did found a number of issues with the database file being locked out when using memory=False, though I hadn't ruled out whether this might just be an issue with my local setup, my hacked-together patches to python-mode, or if it's something to do in the autoimport code itself.

In the interest of merging this into the codebase while allowing easier transitions for IDE/editor integrators, I am going to merge this branch to master but keep the pickle-based implementation as the default for now; instead the new sqlite-based implementation is provided as an alternate experimental implementation to make it easier to start testing and fixing any issues in IDE/editor integrations. Once editor plugin authors have the chance to catch up with testing their integrations and in making any necessary fixes, the intent is to switch the default implementation to the sqlite implementation.

Once this branch is merged, future changes should be made in a separate PR, which will also make them easier to review as well.

@all-contributors add @bageljrkhanofemus for code contribution.

@allcontributors
Copy link
Contributor

@lieryan

I've put up a pull request to add @bageljrkhanofemus! 🎉

@bagel897
Copy link
Contributor Author

Thank you for all your help!

  1. You must call .close() to close the sqllite3 connection. Furthermore, you can't have more than one Autoimport object concurrently (so no parallel testing).
  2. As for IDE integrations, I'm working on (and daily driving) a pylsp implementation which detects when to insert imports and avoids existing names. Ideally, everyone would use the lsp implementation so we don't have to make editor specific versions.
  3. I'll file another PR for that last commit (reduces the number of extraneous imports considerably) and some docs changes.
  4. Should we deprecate the get_modules, get_all_names and import_assist in favor of the new search methods in the sqllite3 implementation? They offer more functionality over the existing methods and would minimize the number of SQL calls.
  5. Later, I'll work on an update mechanism for external packages and possibly the entire database.

@bagel897 bagel897 closed this Apr 28, 2022
@lieryan lieryan added this to the 1.1.0 milestone Apr 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants