Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement: Query isin #847

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Conversation

arabidopsis
Copy link
Contributor

Now that I have had a couple of pull-requests merged the awesome lektor maintainers
probably think "I'm not a moron" (ref. Linus). So I'm going to impertinently proffer an enhancement...

isin Expression

Lektor has a contains expression method
e.g. F.tags.contains('amazing') but it doesn't have the complement viz: F.tag.isin(['amazing','zounds!'])

The isin name was chosen because is serves the same purpose as panda's DataFrame isin.

With an isin (and isnotin) some of the necessarily "clunky" expression syntax for ands and ors goes away (sqlalchemy has this problem too).

# imaging doing this for 10 values...
# and don't forget the parentheses!
q1 = projects.children.filter((F.name == "Coffee") | (F.name == "Wolf") | (F.name == "Bagpipe"))
q2 = projects.children.filter((F.name.isin(["Coffee", "Wolf", "Bagpipe"]))
assert q1.all() == q2.all()

Also you don't need a separate syntax for jinja templates as is currently required.

projects.children.filter((F.name == "Coffee").or(F.name == "Wolf").or(F.name == "Bagpipe"))

It plays well with fields that are lists. e.g. with a model field like:

[fields.slideshow]
label = Slideshow
type = checkboxes
source = record.attachments.images
item_key = {{ this._id }}

it's easy to select the correct slides:

{% set slides = this.attachments.images.filter(F._id.isin(this.slideshow)) %}

There is also a isnotin method that makes long chains of
ands moot:

projects.children.filter((F.name != "Coffee").and(F.name != "Wolf").and(F.name != 'Master'))

becomes:

projects.children.filter(F.name.isnotin(['Master','Wolf','Master']))

This is the same as F.name.isin(seq).false() but possibly more
user friendly.

There is also an isnot method attached to the query so that

projects.children.filter(F.name.isnotin(['Master','Wolf','Master'])).all() ==
projects.children.filter(F.name.isin(['Master','Wolf','Master'])).isnot().all()

Allows the entire query to be complemented.

How safe are these changes?

These additions should be safe for existing lektor projects
in the wild since everything is backwards compatible.

The exception being if someone is storing get_structure_hash values long term (Since I have changed Query.__get_lektor_param_hash__).

Notes

Query.isnot doesn't invert is_hidden and is_discoverable values.
This seems more natural. You want to complement the query after selecting
-- or not -- for is_hidden, is_discoverable.

_id values are special in that you can return full records:

{% for child in this.children.filter(F._id.isin(site.get('/someurl').children)) %}

whereas for any other attribute you will need to do something
like.

{% for child in this.children.filter(F.name.isin(site.get('/someurl').children | map(attribute='name'))) %}

Description of Changes

  • Wrote at least one-line docstrings (for any new functions)
  • Added unit test(s) covering the changes (if testable)
  • Included a screenshot or animation (if affecting the UI, see Licecap)
  • Link to corresponding documentation pull request for getlektor.com

@nixjdm
Copy link
Member

nixjdm commented Dec 6, 2020

lektor maintainers probably think "I'm not a moron" (ref. Linus). So I'm going to impertinently proffer an enhancement...

Indeed! More are welcome :D This looks like a good addition to me. On a first read this seems straight-forward and great, but I'll give it a second pass and go over implementation a bit more slowly.

The main expression docs page ought to have it's example updated with this, and some pages made for these new methods.

Thanks!

@arabidopsis
Copy link
Contributor Author

Depending on how much of the PR is accepted I'm perfectly happy to add a PR with the added documentation for the website too.

@dairiki
Copy link
Contributor

dairiki commented Feb 5, 2021

My initial reaction: I really like the addition of Expression.isin() and Expression.isnotin(). These seem useful, and their meaning and use seem fairly obvious and intuitive.

The Query.isnot() addition leaves a funny taste in my mouth. The meaning is not clear. "Is not" what, exactly? From my (perhaps flawed) understanding, a more clear name would be .negate_filter() or similar. Is it really necessary? Can you give an example where it is particularly helpful?
If the purpose is really to invert the filter expression, it seems it would be clearer to add expression syntax which would better allow expression negation. Or maybe a Query.filterout(expr) method?

Anyhow, if I get a vote (an admittedly questionable proposition) I'd vote for Expression.{isin,isnotin} and vote against Query.isnot.

@yagebu
Copy link
Contributor

yagebu commented Feb 15, 2021

Thank you for this PR :) I pretty much agree with the following:

[..] I'd vote for Expression.{isin,isnotin} and vote against Query.isnot.

@arabidopsis
Copy link
Contributor Author

Removed Query.isnot() expression. Everybody is correct, it's confusing.

lektor/db.py Outdated

if self.__evaluated:
return self.__value
self.__value = [eval(v) for v in self.__value]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this caching is not correct. What happens if __eval__ is called a second time with a different record (and the sequence contains an Expression)?

(Is the caching even worth the effort? The evaluation of non-Expression values is cheap. The Expression-valued items are potentially heavy, but how often do they get re-evaluated for the same record?)


Cache or not, there is another optimization that could be made here. I'm not completely sure it's worth the effort, but it's worth considering.

Instead of always evaluating every item in the sequence, __eval__ could return an iterator/generator that evaluates the items one at a time, as they are iterated over, thus saving some (potentially heavy) evaluations if the calling is in operator hits early.

This would make the caching logic more complicated, but not overly so.
If the caching is eliminated, then this optimization is simple enough that it's almost certainly worth making.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the caching is a worry. I'll remove the __evaluated test.

But returning a generator looks problematic too. Since the _ContainmentExpr into which
it is placed (see the isin method) might be evalued more that once in which case the generator will be already
exhausted on the second evaluation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! I take the second paragraph back. _ContainmentExpr re-evalues seq for each on its eval's

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Made the evaluation "lazier" but still had to guard against multiple evaluations of a generator

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if one just "listified" the sequence/iterable in the constructor for _SeqExpr? Something like:

class _SeqExpr(Expression):
    def __init__(self, value):
        if isinstance(value, collections.abc.Collection):
            self.__value = value
        else:
            try:
                self.__value = list(value)
            except TypeError:
                self.__value = [value]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't pick up the case for value being a generator (which can be the case -- my test in
test_db.py fail). I could use isinstance(value, Iterable) but....

There is a footnote in https://docs.python.org/3/library/collections.abc.html that says:

Checking isinstance(obj, Iterable) detects classes that are registered as Iterable or that have an iter() method, but it does not detect classes that iterate with the getitem() method. The only reliable way to determine whether an object is [iterable] is to call iter(obj)

Which the reason for the iter(value) test.

Also... I don't know if I should test for value being a string. Since currently: F.name.isin("ABC") == F.name.isin(['A','B','C'])
(since list("ABC") => ['A','B','C'], which is "probably" not what a user intended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't pick up the case for value being a generator

I'm pretty sure it will. Note that the if isinstance(value, Collection) check is just an optimization — it could be omitted altogether. The check for iterability happens at try: self.__value = list(value) (which is essentially the same test as your try: iter(value), except that now we pre-cache the results of the iteration.

The idea behind the if instance(value, Collection) check is to identify those values for which iter(value) always returns a "fresh" iterator. Since we can already iterate these values multiple times without issue, we can skip the value = list(value) cache step.

For the sake of concreteness, here is my previous proposal, in its entirety. I've rewritten the constructor for (I think) clarity, but it's equivalent to my previous proposal.)

class _SeqExpr(Expression):
    def __init__(self, value):
        if not isinstance(value, collections.abc.Collection):
            try:
                value = list(value)
            except TypeError:
                value = [value]
        self.__value = value

    def __eval__(self, record):
        for item in self.__value:
            if instance(item, record):
                yield item["_id"]
            elif isinstance(item, Expression):
                yield item.__eval__(record)
            else:
                yield item

Since Collections are iterable and sized, the assumption made here is that iter(collection) always produces a fresh iterable (since one would expect len(list(iter(collection))) == len(collection) to always be true.)


Also... I don't know if I should test for value being a string. Since currently: F.name.isin("ABC") == F.name.isin(['A','B','C']) (since list("ABC") => ['A','B','C'], which is "probably" not what a user intended.

I don't have strong feelings, but, probably should either go all-in on the DWIM and test for string, or skip the DWIM altogether and just throw TypeError if value is not iterable. I can see arguments either way. My personal feeling (at the moment) is explicit is better, and skip the DWIM. But, since Expressions are largely intended for use in templates, I can see where the DWIM approach is reasonable, too.

If one does go full DWIM, perhaps things should be made so that F.name.isin('Fred', 'Joe') is equivalent to F.name.isin(['Fred', 'Joe'])? In that case the DWIM logic should probably be moved to Expression.isin and Expression.isnotin (abstracted to a separate helper function).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants