POC: Set focus to new and used columns after mutate() #6252

krlmlr · 2022-04-30T15:42:45Z

In the example below, speed, air_time and distance are shown because they played a role in the mutate() . Normally none of these columns are visible.

The focus columns are highlighted with an underline, this formatting can't be shown in a reprex.

library(dplyr)
nycflights13::flights %>%
  mutate(speed = air_time / distance)
#> # A tibble:      336,776 × 20
#> # Focus columns: speed, air_time, distance
#>     year month   day dep…¹ sch…² dep…³ arr…⁴ sch…⁵ arr…⁶ air_time distance speed
#>    <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>    <dbl>    <dbl> <dbl>
#>  1  2013     1     1   517   515     2   830   819    11      227     1400 0.162
#>  2  2013     1     1   533   529     4   850   830    20      227     1416 0.160
#>  3  2013     1     1   542   540     2   923   850    33      160     1089 0.147
#>  4  2013     1     1   544   545    -1  1004  1022   -18      183     1576 0.116
#>  5  2013     1     1   554   600    -6   812   837   -25      116      762 0.152
#>  6  2013     1     1   554   558    -4   740   728    12      150      719 0.209
#>  7  2013     1     1   555   600    -5   913   854    19      158     1065 0.148
#>  8  2013     1     1   557   600    -3   709   723   -14       53      229 0.231
#>  9  2013     1     1   557   600    -3   838   846    -8      140      944 0.148
#> 10  2013     1     1   558   600    -2   753   745     8      138      733 0.188
#> # … with 336,766 more rows, abbreviated variable names ¹dep_time,
#> #   ²sched_dep_time, ³dep_delay, ⁴arr_time, ⁵sched_arr_time, ⁶arr_delay, and 8
#> #   more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, hour <dbl>, minute <dbl>, time_hour <dttm>

^{Created on 2022-04-30 by the reprex package (v2.0.1)}

hadley · 2022-08-04T22:30:08Z

Hmmmm I like the way this looks, but adding the extra attribute is makes a lot of tests fail. I'm confident we can work around this for dplyr itself (see current hack in utils.R) but we'd need to think about how this will play out for other packages. Maybe if is_testing() is TRUE we could instead set waldo_opts? But that'll only help for packages using testthat 3e. And we might need to figure out some way to avoid the small expensive of an env var check for every mutate.

krlmlr · 2022-08-18T23:50:27Z

Crazy thought: what if we reversed responsibility here? If each mutate() registered the resulting data frame together with its focus columns in a global variable (collection?) so that attributes remain unchanged but the focus columns are still known? We could purge old data frames from that container after a relatively brief interval (seconds? minutes? defined with an option?) to avoid hidden state kept indefinitely.

Otherwise: can we set waldo_opts in on_load(), but only when testing?

DavisVaughan · 2022-08-19T13:15:19Z

Concrete example of what I don't particularly love about focus columns. I'm explicitly requesting that the new column come before time_hour, but the focus column makes it look like it came after distance (which also doesn’t even really exist there either)

library(dplyr)

nycflights13::flights %>%
  mutate(speed = air_time / distance, .before = time_hour)
#> # A tibble:      336,776 × 20
#> # Focus columns: air_time, distance, speed
#>     year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ air_time distance speed
#>    <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl>    <dbl>    <dbl> <dbl>
#>  1  2013     1     1      517      515       2     830     819      11      227     1400 0.162
#>  2  2013     1     1      533      529       4     850     830      20      227     1416 0.160
#>  3  2013     1     1      542      540       2     923     850      33      160     1089 0.147
#>  4  2013     1     1      544      545      -1    1004    1022     -18      183     1576 0.116
#>  5  2013     1     1      554      600      -6     812     837     -25      116      762 0.152
#>  6  2013     1     1      554      558      -4     740     728      12      150      719 0.209
#>  7  2013     1     1      555      600      -5     913     854      19      158     1065 0.148
#>  8  2013     1     1      557      600      -3     709     723     -14       53      229 0.231
#>  9  2013     1     1      557      600      -3     838     846      -8      140      944 0.148
#> 10  2013     1     1      558      600      -2     753     745       8      138      733 0.188
#> # … with 336,766 more rows, 8 more variables: carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#> #   variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

I also don't really love that it makes the column look like its in the 12th position, like I could access it with df[[12]], but it isn't really there and it is somewhat hard to figure out where it actually is

DavisVaughan · 2022-08-19T13:15:59Z

Also why is air_time showing up twice? Once as arr_t…³ and once as air_time?

krlmlr · 2022-08-26T05:13:19Z

Thanks for your feedback.

I need to look into the confused column order, something doesn't seem to be right.

I have thought about adding a vertical separator if columns are omitted, perhaps |. Would that help?

air_time is showing only once, the other column is arr_time .

DavisVaughan · 2022-08-26T11:23:45Z

a vertical separator if columns are omitted,

I'm not sure I follow, can you mock an example?

krlmlr · 2022-08-28T01:55:53Z

The column order is correct. We have three focus column as a result of the operation: the two input columns, and the output column.

The focus columns are shown in the header. Is this helpful at all?

I have tweaked the reprex to show what the vertical separator could look like.

Should we treat focus columns differently in non-interactive mode, e.g., ignore them for now?

library(conflicted)
library(dplyr)

options(max.print = 20)

nycflights13::flights %>%
  mutate(speed = air_time / distance, .before = time_hour) %>%
  attributes()
#> $names
#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
#> [13] "origin"         "dest"           "air_time"       "distance"      
#> [17] "hour"           "minute"         "speed"          "time_hour"     
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
#>  [ reached getOption("max.print") -- omitted 336756 entries ]
#> 
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"
#> 
#> $pillar_focus
#> [1] "speed"    "air_time" "distance"
nycflights13::flights %>%
  mutate(speed = air_time / distance, .before = time_hour)
#> # A tibble:      336,776 × 20
#> # Focus columns: speed, air_time, distance
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³|air_time distance speed
#>    <int> <int> <int>    <int>      <int>   <dbl>   <int>|   <dbl>    <dbl> <dbl>
#>  1  2013     1     1      517        515       2     830      227     1400 0.162
#>  2  2013     1     1      533        529       4     850      227     1416 0.160
#>  3  2013     1     1      542        540       2     923      160     1089 0.147
#>  4  2013     1     1      544        545      -1    1004      183     1576 0.116
#>  5  2013     1     1      554        600      -6     812      116      762 0.152
#>  6  2013     1     1      554        558      -4     740      150      719 0.209
#>  7  2013     1     1      555        600      -5     913      158     1065 0.148
#>  8  2013     1     1      557        600      -3     709       53      229 0.231
#>  9  2013     1     1      557        600      -3     838      140      944 0.148
#> 10  2013     1     1      558        600      -2     753      138      733 0.188
#> # … with 336,766 more rows, 10 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#> #   variable names ¹sched_dep_time, ²dep_delay, ³arr_time

^{Created on 2022-08-28 by the reprex package (v2.0.1)}

krlmlr requested review from hadley and DavisVaughan April 30, 2022 15:52

krlmlr mentioned this pull request Aug 18, 2022

Add dplyr_sticky_cols() or similar #5290

Open

krlmlr and others added 4 commits August 28, 2022 03:47

Set focus to new and used columns after mutate()

4204ca9

No longer need pillar PR

7006d95

Quiet test errors

75fa45e

Flip order

f26e48e

krlmlr force-pushed the f-poc-focus branch from f40d45e to f26e48e Compare August 28, 2022 01:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC: Set focus to new and used columns after mutate() #6252

POC: Set focus to new and used columns after mutate() #6252

krlmlr commented Apr 30, 2022 •

edited

hadley commented Aug 4, 2022

krlmlr commented Aug 18, 2022

DavisVaughan commented Aug 19, 2022

DavisVaughan commented Aug 19, 2022

krlmlr commented Aug 26, 2022

DavisVaughan commented Aug 26, 2022 •

edited

krlmlr commented Aug 28, 2022

POC: Set focus to new and used columns after mutate() #6252

Are you sure you want to change the base?

POC: Set focus to new and used columns after mutate() #6252

Conversation

krlmlr commented Apr 30, 2022 • edited

hadley commented Aug 4, 2022

krlmlr commented Aug 18, 2022

DavisVaughan commented Aug 19, 2022

DavisVaughan commented Aug 19, 2022

krlmlr commented Aug 26, 2022

DavisVaughan commented Aug 26, 2022 • edited

krlmlr commented Aug 28, 2022

krlmlr commented Apr 30, 2022 •

edited

DavisVaughan commented Aug 26, 2022 •

edited