
Filtering using boolean indexes? #20

Open
georgemarrows opened this issue Apr 8, 2020 · 12 comments

Comments

@georgemarrows

Thank you very much for developing and maintaining plydata. It makes pandas usable for me.

In analysing data from some logs, I wanted to filter to rows which matched a regex. I ended up using pandas' boolean indexes:

df = df[df.logMessage.map(lambda x: bool(re.search('"success": true', x)))]

Is there a more plydata way to do that? I would have expected a filter verb or an overload of query
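For reference, the same boolean index can be built with pandas' vectorised string methods instead of `map`/`lambda`. This is a pandas-only sketch; the sample data is invented to mirror the log-filtering use case above:

```python
import pandas as pd

# Hypothetical sample data modelled on the logMessage column above.
df = pd.DataFrame({
    "logMessage": ['{"success": true}', '{"success": false}', "no json here"],
})

# Vectorised equivalent of map(lambda x: bool(re.search(...))):
# str.contains performs a regex search by default; na=False treats
# missing values as non-matches.
mask = df.logMessage.str.contains('"success": true', regex=True, na=False)
filtered = df[mask]
```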

@has2k1
Owner

has2k1 commented Apr 8, 2020

You can do it with query, but not in one expression

idx = df.logMessage.map(lambda x: bool(re.search('"success": true', x)))
df = df >> query('@idx')

filter is a Python built-in, so we cannot use it as a verb; that is why we have query. Since filter in dplyr is very similar to the pandas query method, I decided to just use that without changing the API. This is a limitation.

You could also do it using the call helper verb, but it is not too pretty.

df = df >> call('__getitem__', df.logMessage.map(lambda x: bool(re.search('"success": true', x))))

Maybe the solution is to add a switch to query so that it can accept a boolean array. That would change the API, I need to think about it some more.
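For readers unfamiliar with the `@` syntax: pandas' `DataFrame.query` resolves names prefixed with `@` from the enclosing Python scope, which is what makes the two-step version above work. A pandas-only sketch with invented sample data:

```python
import re
import pandas as pd

df = pd.DataFrame({"logMessage": ['{"success": true}', "something else"]})

# Precompute the boolean Series, then reference it inside the query
# string with '@' so pandas looks it up in the local scope.
idx = df.logMessage.map(lambda x: bool(re.search('"success": true', x)))
result = df.query("@idx")
```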

@has2k1 has2k1 added the enhancement and question labels, and added and then removed the NOT YET RELEASED label (the issue has been fixed, but the fix is not in the current official release) on Apr 8, 2020
@antonio-yu

Are there more convenient methods for using regex when filtering rows, i.e. in the query function?
In pandas, regular expressions seem to be limited in DataFrame.query.
For example:
df.query('col.str.contains("a")') doesn't work.
In plydata it is the same situation: df >> query('col.str.contains("a")') doesn't work.

While in R, df %>% filter(str_detect(col, 'a')) is very useful.
Regex is often needed when filtering rows, so I wonder if there are some methods.
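As an aside not mentioned in the thread: pandas' own `DataFrame.query` can evaluate `.str` methods if you bypass the default numexpr engine, though plydata's `query` verb may not expose this option. A pandas-only sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"col": ["apple", "berry", "cherry"]})

# The default numexpr engine rejects .str accessor calls, but the
# pure-Python evaluation engine accepts them.
result = df.query('col.str.contains("a")', engine="python")
```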

@has2k1
Owner

has2k1 commented Nov 4, 2020

There isn't a way to filter using a regex.

df.query('col.str.contains("a")') doesn't work.

Yes, it does not work. query is limited and will stay that way, since it just passes the task on to pandas DataFrame.query. But there is room for an alternative, more general method that would permit a specification like 'col.str.contains("a")' and so on.

@antonio-yu

Thanks for the reply.
Regex is useful for filtering rows of our Chinese text; pandas can handle that, so it is not a big problem.
Anyway, thanks a lot.

@antonio-yu

Regex is useful for filtering rows of our Chinese text; pandas can handle that, so it is not a big problem.

The query function is just not that convenient, since I am used to filter in R.
I wish there were an alternative method that could replace the built-in query.

Just a personal wish, hhh.

@has2k1
Owner

has2k1 commented Dec 8, 2020

@antonio-yu, can you come up with a short specification/example of how you would expect regex filtering to work. Then we can start from there.

@antonio-yu

@has2k1, hi Hassan. Here is an example:

df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']})
   x      y  
0  0   zero  
1  1    one  
2  2    two  
3  3  three 

First, I want to select the rows in which y contains an 'o':

   x     y
0  0  zero
1  1   one
2  2   two               

Here are some different ways to do it with regular expressions.

df1 <- df %>% filter(str_detect(y,'o')) #R

While in pandas and plydata, the syntax df1 = df >> query('y.str.contains("o")') doesn't work.
I have to choose another way: df1 = df[df.y.str.contains('o')].

I also tried the similar package dplython, where the regrex works in a pipe operation:
df1 = df >> sift(X.y.str.contains('o'))

Normally, I need to filter rows by regrex and then do all the further operations, like grouping and aggregation, in a chainable way.
df %>% filter(str_detect(y, 'o')) %>% group_by(y) %>% summarise(num = n()) #R style

In pandas and plydata, the regrex doesn't work in query: df >> query('y.str.contains("o")') fails, so I have to stop the pipe operation and define a new variable, df1 = df[df.y.str.contains('o')], then continue the further operations: df1['num'] = df1.groupby('y').transform('count').

The key point is that if one link breaks, the pipe operation stops. Regrex is always necessary when selecting rows and columns, especially in our Chinese sentences. I wish query in plydata could work the same way as filter in R, so that regrex, comparison operations, and other boolean values can be parameters.

Hope that was clear 😄. Best wishes!
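For what it's worth, plain pandas can keep a regex filter inside a method chain: `.loc` accepts a callable, so no intermediate variable is needed. A pandas-only sketch of the chained filter-then-count, using the example frame from this comment:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [0, 1, 2, 3],
    "y": ["zero", "one", "two", "three"],
})

# .loc with a lambda filters on the current frame in the chain, so the
# regex step flows straight into the groupby/aggregate that follows.
out = (
    df.loc[lambda d: d.y.str.contains("o")]
      .groupby("y")
      .size()
      .reset_index(name="num")
)
```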

@has2k1
Owner

has2k1 commented Dec 11, 2020

First, I want to select the rows in which y contains an 'o'

There are the query_all, query_at and query_if helpers, but I admit they are not easy to think of.

df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']
})

df >> query_at('y', any_vars='{_}.str.contains("o")')

"""
   x     y
0  0  zero
1  1   one
2  2   two
"""

# However in this case since we are querying a single column we do not need
# to use the '{_}' placeholder for the column name.
df >> query_at('y', any_vars='y.str.contains("o")')
"""
   x     y
0  0  zero
1  1   one
2  2   two
"""

Normally, I need to filter rows by regrex and then do all the further operations

I was confused by what you meant by "regrex"; I thought you meant regular expression.

@has2k1
Owner

has2k1 commented Dec 11, 2020

I think it merits a second function. Now, what to call it: sift, sieve, query2, ...?

@antonio-yu

antonio-yu commented Dec 11, 2020

Sorry, I wrote 'regrex' wrongly; I meant 'regex'. 😂

The functions query_all, query_at, query_if are equivalent to filter_all, filter_at, filter_if in R, right? They have the same logic. But in plydata these three functions always need you to select some columns and pass an 'any_vars=' or 'all_vars=' argument, and sometimes that is really not easy to think of.

df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, np.nan]
})

df >> query_at('beta', any_vars='beta.isin(["b"])')

'''
  alpha beta theta  x  y     z
0     a    b     c  1  6   7.0
2     a    b     e  3  4  11.0
'''

If there were a new function that allowed me to write df >> query_at('beta', any_vars='beta.isin(["b"])') in the simpler form df >> new_function('beta.isin(["b"])'), just like the filter syntax in R, df %>% filter(beta %in% c('b')), that would be easier and cooler.
So I think plydata just lacks a mirror function with the same logic as R's filter.

I think it merits a second function. Now, what to call it sift, sieve, query2, ...?

query itself stays the same as pandas DataFrame.query.
So I think filter_by could be a good name: query_at equals filter_at, query_if equals filter_if, query_all equals filter_all, and filter_by equals filter. This would be useful for users who are converting from R to Python.
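A minimal sketch of what such a verb could do, written as a plain pandas helper. The name `filter_by` and this helper are hypothetical, not part of plydata:

```python
import pandas as pd

def filter_by(df, expr):
    # Hypothetical helper: evaluate the expression with the pure-Python
    # engine so .str methods and isin() work inside the query string.
    return df.query(expr, engine="python")

df = pd.DataFrame({
    "alpha": list("aaabbb"),
    "beta": list("babruq"),
})

# Reads close to R's filter(beta %in% c('b')) and stays chainable via pipe.
result = df.pipe(filter_by, 'beta.isin(["b"])')
```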

@antonio-yu

antonio-yu commented Dec 19, 2020

Hi @has2k1, how is everything going?
Another question, about regex in select.

# Select columns by regex.

df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, np.nan]
})

# Select columns that end with 'a'; it works well.
df >> select(endswith='a')

  alpha beta theta
0     a    b     c
1     a    a     d
2     a    b     e
3     b    r     c
4     b    u     d
5     b    q     e 

# But how to select columns that don't end with 'a'?

df >> select(endswith != 'a'), df >> select(endswith='-a'), or df >> select(~(endswith='a'))?

@has2k1
Owner

has2k1 commented Dec 19, 2020

But how to select columns that don't end with 'a'?

You can use a regular expression:

df >> select(matches='[^a]$')
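The same negative match works in plain pandas via `DataFrame.filter`, which selects columns whose names match a regex, using the example frame from the comment above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "alpha": list("aaabbb"),
    "beta": list("babruq"),
    "theta": list("cdecde"),
    "x": [1, 2, 3, 4, 5, 6],
    "y": [6, 5, 4, 3, 2, 1],
    "z": [7, 9, 11, 8, 10, np.nan],
})

# '[^a]$' keeps columns whose last character is anything but 'a'.
kept = df.filter(regex="[^a]$")
```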
