Enhancement: Query isin #847

arabidopsis · 2020-12-05T22:45:38Z

Now that I have had a couple of pull-requests merged the awesome lektor maintainers
probably think "I'm not a moron" (ref. Linus). So I'm going to impertinently proffer an enhancement...

`isin` Expression

Lektor has a contains expression method
e.g. F.tags.contains('amazing') but it doesn't have the complement viz: F.tag.isin(['amazing','zounds!'])

The isin name was chosen because is serves the same purpose as panda's DataFrame isin.

With an isin (and isnotin) some of the necessarily "clunky" expression syntax for ands and ors goes away (sqlalchemy has this problem too).

# imaging doing this for 10 values...
# and don't forget the parentheses!
q1 = projects.children.filter((F.name == "Coffee") | (F.name == "Wolf") | (F.name == "Bagpipe"))
q2 = projects.children.filter((F.name.isin(["Coffee", "Wolf", "Bagpipe"]))
assert q1.all() == q2.all()

Also you don't need a separate syntax for jinja templates as is currently required.

projects.children.filter((F.name == "Coffee").or(F.name == "Wolf").or(F.name == "Bagpipe"))

It plays well with fields that are lists. e.g. with a model field like:

[fields.slideshow]
label = Slideshow
type = checkboxes
source = record.attachments.images
item_key = {{ this._id }}

it's easy to select the correct slides:

{% set slides = this.attachments.images.filter(F._id.isin(this.slideshow)) %}

There is also a isnotin method that makes long chains of
ands moot:

projects.children.filter((F.name != "Coffee").and(F.name != "Wolf").and(F.name != 'Master'))

becomes:

projects.children.filter(F.name.isnotin(['Master','Wolf','Master']))

This is the same as F.name.isin(seq).false() but possibly more
user friendly.

There is also an isnot method attached to the query so that

projects.children.filter(F.name.isnotin(['Master','Wolf','Master'])).all() ==
projects.children.filter(F.name.isin(['Master','Wolf','Master'])).isnot().all()

Allows the entire query to be complemented.

How safe are these changes?

These additions should be safe for existing lektor projects
in the wild since everything is backwards compatible.

The exception being if someone is storing get_structure_hash values long term (Since I have changed Query.__get_lektor_param_hash__).

Notes

Query.isnot doesn't invert is_hidden and is_discoverable values.
This seems more natural. You want to complement the query after selecting
-- or not -- for is_hidden, is_discoverable.

_id values are special in that you can return full records:

{% for child in this.children.filter(F._id.isin(site.get('/someurl').children)) %}

whereas for any other attribute you will need to do something
like.

{% for child in this.children.filter(F.name.isin(site.get('/someurl').children | map(attribute='name'))) %}

Description of Changes

Wrote at least one-line docstrings (for any new functions)
Added unit test(s) covering the changes (if testable)
Included a screenshot or animation (if affecting the UI, see Licecap)
Link to corresponding documentation pull request for getlektor.com

nixjdm · 2020-12-06T06:23:45Z

lektor maintainers probably think "I'm not a moron" (ref. Linus). So I'm going to impertinently proffer an enhancement...

Indeed! More are welcome :D This looks like a good addition to me. On a first read this seems straight-forward and great, but I'll give it a second pass and go over implementation a bit more slowly.

The main expression docs page ought to have it's example updated with this, and some pages made for these new methods.

Thanks!

arabidopsis · 2020-12-06T23:45:22Z

Depending on how much of the PR is accepted I'm perfectly happy to add a PR with the added documentation for the website too.

dairiki · 2021-02-05T05:46:44Z

My initial reaction: I really like the addition of Expression.isin() and Expression.isnotin(). These seem useful, and their meaning and use seem fairly obvious and intuitive.

The Query.isnot() addition leaves a funny taste in my mouth. The meaning is not clear. "Is not" what, exactly? From my (perhaps flawed) understanding, a more clear name would be .negate_filter() or similar. Is it really necessary? Can you give an example where it is particularly helpful?
If the purpose is really to invert the filter expression, it seems it would be clearer to add expression syntax which would better allow expression negation. Or maybe a Query.filterout(expr) method?

Anyhow, if I get a vote (an admittedly questionable proposition) I'd vote for Expression.{isin,isnotin} and vote against Query.isnot.

yagebu · 2021-02-15T16:49:28Z

Thank you for this PR :) I pretty much agree with the following:

[..] I'd vote for Expression.{isin,isnotin} and vote against Query.isnot.

arabidopsis · 2022-02-21T09:40:38Z

Removed Query.isnot() expression. Everybody is correct, it's confusing.

dairiki · 2022-02-23T21:49:49Z

lektor/db.py

+
+        if self.__evaluated:
+            return self.__value
+        self.__value = [eval(v) for v in self.__value]


I think this caching is not correct. What happens if __eval__ is called a second time with a different record (and the sequence contains an Expression)?

(Is the caching even worth the effort? The evaluation of non-Expression values is cheap. The Expression-valued items are potentially heavy, but how often do they get re-evaluated for the same record?)

Cache or not, there is another optimization that could be made here. I'm not completely sure it's worth the effort, but it's worth considering.

Instead of always evaluating every item in the sequence, __eval__ could return an iterator/generator that evaluates the items one at a time, as they are iterated over, thus saving some (potentially heavy) evaluations if the calling is in operator hits early.

This would make the caching logic more complicated, but not overly so.
If the caching is eliminated, then this optimization is simple enough that it's almost certainly worth making.

Yes the caching is a worry. I'll remove the __evaluated test.

But returning a generator looks problematic too. Since the _ContainmentExpr into which
it is placed (see the isin method) might be evalued more that once in which case the generator will be already
exhausted on the second evaluation.

Oops! I take the second paragraph back. _ContainmentExpr re-evalues seq for each on its eval's

OK. Made the evaluation "lazier" but still had to guard against multiple evaluations of a generator

What if one just "listified" the sequence/iterable in the constructor for _SeqExpr? Something like:

class _SeqExpr(Expression): def __init__(self, value): if isinstance(value, collections.abc.Collection): self.__value = value else: try: self.__value = list(value) except TypeError: self.__value = [value]

This won't pick up the case for value being a generator (which can be the case -- my test in
test_db.py fail). I could use isinstance(value, Iterable) but....

There is a footnote in https://docs.python.org/3/library/collections.abc.html that says:

Checking isinstance(obj, Iterable) detects classes that are registered as Iterable or that have an iter() method, but it does not detect classes that iterate with the getitem() method. The only reliable way to determine whether an object is [iterable] is to call iter(obj)

Which the reason for the iter(value) test.

Also... I don't know if I should test for value being a string. Since currently: F.name.isin("ABC") == F.name.isin(['A','B','C'])
(since list("ABC") => ['A','B','C'], which is "probably" not what a user intended.

This won't pick up the case for value being a generator

I'm pretty sure it will. Note that the if isinstance(value, Collection) check is just an optimization — it could be omitted altogether. The check for iterability happens at try: self.__value = list(value) (which is essentially the same test as your try: iter(value), except that now we pre-cache the results of the iteration.

The idea behind the if instance(value, Collection) check is to identify those values for which iter(value) always returns a "fresh" iterator. Since we can already iterate these values multiple times without issue, we can skip the value = list(value) cache step.

For the sake of concreteness, here is my previous proposal, in its entirety. I've rewritten the constructor for (I think) clarity, but it's equivalent to my previous proposal.)

class _SeqExpr(Expression): def __init__(self, value): if not isinstance(value, collections.abc.Collection): try: value = list(value) except TypeError: value = [value] self.__value = value def __eval__(self, record): for item in self.__value: if instance(item, record): yield item["_id"] elif isinstance(item, Expression): yield item.__eval__(record) else: yield item

Since Collections are iterable and sized, the assumption made here is that iter(collection) always produces a fresh iterable (since one would expect len(list(iter(collection))) == len(collection) to always be true.)

Also... I don't know if I should test for value being a string. Since currently: F.name.isin("ABC") == F.name.isin(['A','B','C']) (since list("ABC") => ['A','B','C'], which is "probably" not what a user intended.

I don't have strong feelings, but, probably should either go all-in on the DWIM and test for string, or skip the DWIM altogether and just throw TypeError if value is not iterable. I can see arguments either way. My personal feeling (at the moment) is explicit is better, and skip the DWIM. But, since Expressions are largely intended for use in templates, I can see where the DWIM approach is reasonable, too.

If one does go full DWIM, perhaps things should be made so that F.name.isin('Fred', 'Joe') is equivalent to F.name.isin(['Fred', 'Joe'])? In that case the DWIM logic should probably be moved to Expression.isin and Expression.isnotin (abstracted to a separate helper function).

arabidopsis added 5 commits November 15, 2020 17:23

added isin expression operator

2d67c41

Merge remote-tracking branch 'upstream/master' into query-isin

d5673fe

name change

d3fae20

Merge remote-tracking branch 'upstream/master' into query-isin

55524bb

Merge branch 'master' into query-isin

0f239c8

arabidopsis added 4 commits February 14, 2022 13:36

Merge branch 'master' into query-isin

d65b86b

removed F from test arguments.... it is global now

64df130

Merge branch 'lektor:master' into query-isin

2065b65

removed isnot() operator

e53d57d

dairiki reviewed Feb 23, 2022

View reviewed changes

arabidopsis added 2 commits February 26, 2022 11:44

"lazier" evaluation of sequence expressions

1f0fab4

Simpilfy _SeqExpr

768078c

dairiki force-pushed the master branch from 4aaae00 to 1999510 Compare March 1, 2022 16:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancement: Query isin #847

Enhancement: Query isin #847

arabidopsis commented Dec 5, 2020

nixjdm commented Dec 6, 2020

arabidopsis commented Dec 6, 2020

dairiki commented Feb 5, 2021

yagebu commented Feb 15, 2021

arabidopsis commented Feb 21, 2022

dairiki Feb 23, 2022

arabidopsis Feb 26, 2022

arabidopsis Feb 26, 2022

arabidopsis Feb 26, 2022

dairiki Feb 26, 2022

arabidopsis Feb 26, 2022

dairiki Feb 26, 2022

Enhancement: Query isin #847

Are you sure you want to change the base?

Enhancement: Query isin #847

Conversation

arabidopsis commented Dec 5, 2020

isin Expression

How safe are these changes?

Notes

Description of Changes

nixjdm commented Dec 6, 2020

arabidopsis commented Dec 6, 2020

dairiki commented Feb 5, 2021

yagebu commented Feb 15, 2021

arabidopsis commented Feb 21, 2022

dairiki Feb 23, 2022

Choose a reason for hiding this comment

arabidopsis Feb 26, 2022

Choose a reason for hiding this comment

arabidopsis Feb 26, 2022

Choose a reason for hiding this comment

arabidopsis Feb 26, 2022

Choose a reason for hiding this comment

dairiki Feb 26, 2022

Choose a reason for hiding this comment

arabidopsis Feb 26, 2022

Choose a reason for hiding this comment

dairiki Feb 26, 2022

Choose a reason for hiding this comment

`isin` Expression