Redo Autoimport #464

bagel897 · 2022-04-05T01:21:16Z

Description

Rewrite of the Autoimport code.
The goal is to create an autoimport module capable of being used in pylsp.

Upgrades:

Uses sqllite3 database to store names (either in memory or on disk without manually calling write)
Uses ast parser to parse files quickly without importing them (it imports builtin and compiled so modules)
Multiprocess module loader so it can parse my entire python library (208 packages, 700 mb) in ~ 3.705s (8c/8t cpu ryzen 4700u)
Minimal database size: ~3.2mb
Adds option to search for a package and formats an import statement for it
Basic source classification and sorting
Parses type of a name - Class, function, or variable/constant

Sidegrades (subjective):

Uses modern features such as f-strings (>=3.6) and type hints (>=3.5), pathlib.

Downgrades:

breaks parts of the API, incomplete list below
a. There's less control of how to handle underlined imports
~~b. some changes have to be made regarding how modules are named: "module" wouldn't work but "project.module" will unless added to the path.~~ Fixed
c. submodules taken out of public API (minor change). Resulting test removed
d. internal modules must use the resource system. Resulting test removed.
Requires AutoImport object to be closed to work properly with a persistent sqllite3 database (it functions fine without closing, but it should be just in case)

Issues with the resource system

These largely existed before

There is no way to know if an external package was upgraded. In a future PR, it will use the modified time to do this
The rope method of finding python files is inefficient. Switching to pathlib for that (or ideally, everywhere) would increase speed. I'm not sure how to reconcile this with the ignore paths.

TODO (will mark as ready when done):

I have added a parser for built in modules
I have added a parser for compiled modules
I have improved handling of duplicates such as importing the same package
I have fixed the performance issues with resource based imports (mostly, at least from my end)
I have added the type of a name to the database.
I have added tests that prove my fix is effective or that my feature works (haven't added any, but it runs under existing tests. Not sure how to test something which indexes literally everything)
I have updated CHANGELOG.md
I have made corresponding changes to user documentation for new features, Not Applicable
I have made corresponding changes to library documentation for API changes

rope/contrib/autoimport/autoimport.py

lieryan · 2022-04-14T03:29:31Z

rope/contrib/autoimport/autoimport.py

+
+    def generate_modules_cache(
+        self,
+        modules_to_find: List[str] = None,


Can we rename modules_to_find back to modules. This is part of the public API of the AutoImport class, so unfortunately it can't be changed.

Also, we need to keep the task_handle argument, as it's used by editor integrations to display progress report.

Done

I added a placeholder, but I'll figure out how to add a full one later

On 2, I basically suffer from a simple issue: a job set relies on knowing the number of jobs ahead of time, while my code doesn't. So I just set the initial count to 0 and manually incremented the count as it discovered modules and it looks mostly fine. Ideally, there'd be a better way (and shared interfaces for NullJobSets/TaskHandles and their non-Null counterparts). However, I'm not too familiar on how to test this or anything.

lieryan · 2022-04-14T03:41:12Z

rope/contrib/autoimport.py

-        self.underlined = underlined
-        self.names = project.data_files.read_data("globalnames")
-        if self.names is None:
-            self.names = {}


How difficult would it be to retain self.names here as a property that generates the names from the database?

python-mode (which integrates rope into vim) uses AutoImport.names to decide when to regenerate cache. If it's empty, then it calls generate_modules_cache().

I'm rather iffy with how it does that, I'd rather they should update it when they next update their builtin rope, but unfortunately it currently has to do that since there's no public API detect the need to regenerate cache. But the fact that one integration does this means that there may be other integrations that does something similar as well.

The alternative to removing .names is that at the very least python-mode will need to be updated to no longer rely on this attribute.

It maintains a separate database of indexed packages, which it uses to determine what needs to be updated currently in around 0.008 sec. In the future I want to add modified times to this. That way generating the cache works with existing caches (although the resource side does probably not work as well)/
I can make a function that checks the length of modules that need to be updated, but then we're doubling that computation across each of them.

.github/workflows/main.yml

lieryan · 2022-04-14T16:19:01Z

rope/contrib/autoimport/parse.py

+    if package in banned or (package.startswith("_") and not underlined):
+        return  # Builtins is redundant since you don't have to import it.
+    try:
+        module = import_module(str(package))


I think compiled-only packages probably should not be indexed by default. import-ing a compiled python module involves executing executable code from the imported module's, which means it may include running potentially harmful or malicious code.

Unless we can statically analyze the compiled modules somehow, this should not be default behavior.

At the very least, I'd suggest removing this code from get_names() for now or restricting it to standard library modules and re-think on how to handle compiled modules as a separate ticket/PR.

For now I limited importing compiled modules to Project, Standard, and Builtin modules which isn't 100% accurate. I can remove Project or Standard, but builtin has to remain for modules like os and sys.
I agree that not importing modules is a good idea, but we do need it unless we want to bundle a python decompiler.

Can we remove Project as well. Just because it's in your own project directory does not necessarily mean that it's trusted code. People may be checking out untrusted codebase so they can review whether or not it is malicious.

In the long run, I think we might make this a config option. If people want to enable this option in a codebase that they know they can trust and understand the implications and risks, that's on them. But the defaults should be as conservatively safe as possible.

Keeping Standard library and Builtin modules, I think are fine. They are a much less likely avenue for these kind of attacks.

Done. It doesn't really make sense to keep Project since it'd require the user to have a compiled .so module in their local project without the python code. In the future, we could add a whitelist or something for these edge cases.

ropetest/contrib/autoimporttest.py

bagel897 · 2022-04-14T17:06:35Z

Hi bageljrkhanofemus, this is really great work. Just to give an interim update that I hadn't forgotten about this PR wink. I'm still wrapping my head around this PR as there's quite a lot of changes from the original implementation.

This is my interim review notes, some of these are just stream of thoughts, so do let me know if anything sounds a bit unclear:

* I would've suggested trying out a less recursive implementation, and instead making a generator that yields `(modname, filename)` (named?)tuples. That might simplify the logic of some of the recursive calls.

* with the generator described above, we can then untangle code for the recursive listing of module (which might not benefit from parallelization much) from parsing of files (which can definitely benefit from running in parallel).

* The current implementation seems to parallelize per module, which is fine, but might not be optimal if some packages are much larger than others, as large package will be processed essentially single threaded. I would suggest trying parallelizing per-module instead.

A pseudocode sketch of what this might look like:

ImportName = str

def generate_modules_cache(...):
    valid_modules = filter(is_valid_modules, iter_modules())
    with ProcessPoolExecutor() as executor:
        for name_list in executor.map(get_names_from_file, names_in_modules):
            for names_list in names_in_modules:
                names_list = filter(is_valid_name, names_list)  # maybe we should still index underscored names, and skip them when calling `import_assist` instead 
                self._add_names(..., names_list, ...)

def is_valid_module(modname: ImportName, filename: Path) -> bool:
    if should skip because underlined:
        return False
    if should skip because in ignore file:
        return False
    return True

def iter_modules() -> Generator[Tuple[ImportName, Path]]:
    for package in self._get_available_packages():
        for module in self.find_modules(package):
            ...

def find_modules(package) -> Tuple[ImportName, Path]:
    ... maybe just path.glob("**.py") and some logic to handle __init__.py and __main__.py ...

Of course, there's a while lot of handwaving the details above.

* Would the `names` table benefit from a `CREATE INDEX`?

Thanks for this feedback. I incorporated the ideas of using namedtuples and generators, which made the code much cleaner. I'll look into CREATE INDEX later.

lieryan · 2022-04-14T17:22:05Z

rope/contrib/autoimport/parse.py

+    """Get all the names from a given file using ast."""
+    with open(module, mode="rb") as file:
+        try:
+            root_node = ast.parse(file.read())


Since we're already using pathlib, we might as well use module.read_bytes() here instead of open()-ing explicitly. read_bytes() would not require context manager.

lieryan · 2022-04-14T17:32:40Z

rope/contrib/autoimport/parse.py

+    if package in banned or (package.startswith("_") and not underlined):
+        return  # Builtins is redundant since you don't have to import it.
+    try:
+        module = import_module(str(package))


Can we remove Project as well. Just because it's in your own project directory does not necessarily mean that it's trusted code. People may be checking out untrusted codebase so they can review whether or not it is malicious.

In the long run, I think we might make this a config option. If people want to enable this option in a codebase that they know they can trust and understand the implications and risks, that's on them. But the defaults should be as conservatively safe as possible.

Keeping Standard library and Builtin modules, I think are fine. They are a much less likely avenue for these kind of attacks.

lieryan · 2022-04-14T17:38:39Z

rope/contrib/autoimport/utils.py

+        return Source.PROJECT
+    if package.as_posix().__contains__("site-packages"):
+        return Source.SITE_PACKAGE
+    if package.as_posix().startswith(sys.prefix):


Any reasons for using package.as_posix() instead of str(package)?

Not really. Changed it to str

rope/contrib/autoimport/utils.py

…t in the file

… files

lieryan · 2022-04-28T04:17:35Z

I've been test-driving this branch for the last couple weeks at work and I think it's been pretty great, really great work @bageljrkhanofemus. I did found a number of issues with the database file being locked out when using memory=False, though I hadn't ruled out whether this might just be an issue with my local setup, my hacked-together patches to python-mode, or if it's something to do in the autoimport code itself.

In the interest of merging this into the codebase while allowing easier transitions for IDE/editor integrators, I am going to merge this branch to master but keep the pickle-based implementation as the default for now; instead the new sqlite-based implementation is provided as an alternate experimental implementation to make it easier to start testing and fixing any issues in IDE/editor integrations. Once editor plugin authors have the chance to catch up with testing their integrations and in making any necessary fixes, the intent is to switch the default implementation to the sqlite implementation.

Once this branch is merged, future changes should be made in a separate PR, which will also make them easier to review as well.

@all-contributors add @bageljrkhanofemus for code contribution.

allcontributors · 2022-04-28T04:17:44Z

@lieryan

I've put up a pull request to add @bageljrkhanofemus! 🎉

bagel897 · 2022-04-28T05:02:14Z

Thank you for all your help!

You must call .close() to close the sqllite3 connection. Furthermore, you can't have more than one Autoimport object concurrently (so no parallel testing).
As for IDE integrations, I'm working on (and daily driving) a pylsp implementation which detects when to insert imports and avoids existing names. Ideally, everyone would use the lsp implementation so we don't have to make editor specific versions.
I'll file another PR for that last commit (reduces the number of extraneous imports considerably) and some docs changes.
Should we deprecate the get_modules, get_all_names and import_assist in favor of the new search methods in the sqllite3 implementation? They offer more functionality over the existing methods and would minimize the number of SQL calls.
Later, I'll work on an update mechanism for external packages and possibly the entire database.

bagel897 added 14 commits April 2, 2022 14:56

switch to sqllite3

0bab02d

use ast

0df9167

refactor to use ProcessPoolExecutor

3f2c1a8

actually somewhat working implementation

55de3ae

parse __all__

fa945e4

improve parsing of modules with submodules

0a20abd

fix resource parsing

56460d5

add package finder and update_module

b622adc

propogate underlined more

70e2b99

match public api more closely

42ef20d

migrate to pytest

f57fdbb

handle observer module names correctly

b49dda1

error propogation

3f93257

get name locations

9290528

bagel897 changed the title ~~(WIP)~~ (WIP) Redo Autoimport Apr 5, 2022

bagel897 added 13 commits April 4, 2022 21:19

update generate_module_cache to exclude the project

b37fad5

add builtin handling

65f0c72

split into multiple files

08e8768

make private api public within autoimport

c0aba3d

add so parsing

ab11c36

fix builtin handling

2155385

reformat

bc21e92

fix project location issue

03585be

update documentation

ebfc3b8

use code block

f719f95

use code block

90a1600

improve formatting and change example to Dict

a9ddf65

use old testsuite(mostly)

ed12e20

bagel897 changed the title ~~(WIP) Redo Autoimport~~ Redo Autoimport Apr 5, 2022

bagel897 marked this pull request as ready for review April 5, 2022 07:26

bagel897 added 7 commits April 12, 2022 23:14

itemkind infrence for compiled modules

9e6d956

add more tuples and use them more

277edf1

WIP on autoimport using namedTuples

9f09410

mostly working threaded implementation

780b52b

reformat and update tests

7035b6e

blacken

30595cd

check underlined on submodules

292df31

lieryan reviewed Apr 14, 2022

View reviewed changes

fix some typos

7c4d3c2

lieryan reviewed Apr 14, 2022

View reviewed changes

fix: basic job set implementation, improve list comprehension of utils

dea4f42

bagel897 marked this pull request as ready for review April 14, 2022 21:19

bagel897 added 5 commits April 14, 2022 16:24

hack to make the job_set work

fc8d3d7

fix: don't increment on null job sets

f52695b

process imports on __init__.py

af14440

split out search

77a9e5f

ignore submodules for vanilla import statements

15b82ad

bagel897 mentioned this pull request Apr 18, 2022

Add a plugin to provide autoimport functionality python-lsp/python-lsp-server#199

Merged

6 tasks

bagel897 added 4 commits April 22, 2022 12:40

implement search_full with the ability to ignore names already presen…

1bac155

…t in the file

use CREATE INDEX,remove unnessecary checks

b952b20

use a multithread executor to dramatically speed up indexing of local…

49bc66e

… files

ignore imports which don't belong to the package

4a99a77

lieryan mentioned this pull request Apr 28, 2022

New AutoImport implementation (early-merge of @bageljrkhanofemus's changes) #467

Merged

allcontributors bot mentioned this pull request Apr 28, 2022

docs: add bageljrkhanofemus as a contributor for test, code #468

Merged

bagel897 closed this Apr 28, 2022

lieryan added this to the 1.1.0 milestone Apr 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redo Autoimport #464

Redo Autoimport #464

bagel897 commented Apr 5, 2022 •

edited

lieryan Apr 14, 2022

bagel897 Apr 14, 2022

bagel897 Apr 14, 2022

lieryan Apr 14, 2022

bagel897 Apr 14, 2022

lieryan Apr 14, 2022

bagel897 Apr 14, 2022

lieryan Apr 14, 2022

bagel897 Apr 14, 2022

bagel897 commented Apr 14, 2022

lieryan Apr 14, 2022

bagel897 Apr 14, 2022

lieryan Apr 14, 2022

lieryan Apr 14, 2022

bagel897 Apr 14, 2022

lieryan commented Apr 28, 2022 •

edited

allcontributors bot commented Apr 28, 2022

bagel897 commented Apr 28, 2022

Redo Autoimport #464

Redo Autoimport #464

Conversation

bagel897 commented Apr 5, 2022 • edited

Description

Upgrades:

Sidegrades (subjective):

Downgrades:

Issues with the resource system

TODO (will mark as ready when done):

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bagel897 commented Apr 14, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lieryan commented Apr 28, 2022 • edited

allcontributors bot commented Apr 28, 2022

bagel897 commented Apr 28, 2022

bagel897 commented Apr 5, 2022 •

edited

lieryan commented Apr 28, 2022 •

edited