
Filtering using boolean indexes? #20

Open
georgemarrows opened this issue Apr 8, 2020 · 12 comments

Comments

@georgemarrows

Thank you very much for developing and maintaining plydata. It makes pandas usable for me.

In analysing data from some logs, I wanted to filter to rows which matched a regex. I ended up using pandas' boolean indexes:

df = df[df.logMessage.map(lambda x: bool(re.search('"success": true', x)))]

Is there a more plydata way to do that? I would have expected a filter verb or an overload of query
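For reference, the same boolean index can be built with pandas' vectorised string methods instead of `map`/`lambda`. This is a pandas-only sketch; the sample data is invented to mirror the log-filtering use case above:

```python
import pandas as pd

# Hypothetical sample data modelled on the logMessage column above.
df = pd.DataFrame({
    "logMessage": ['{"success": true}', '{"success": false}', "no json here"],
})

# Vectorised equivalent of map(lambda x: bool(re.search(...))):
# str.contains performs a regex search by default; na=False treats
# missing values as non-matches.
mask = df.logMessage.str.contains('"success": true', regex=True, na=False)
filtered = df[mask]
```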

@has2k1
Owner

has2k1 commented Apr 8, 2020

You can do it with query, but not in one expression

idx = df.logMessage.map(lambda x: bool(re.search('"success": true', x)))
df = df >> query('@idx')

filter is a Python built-in, so we cannot use it as a verb; that is why we have query. Since filter in dplyr is very similar to the pandas query method, I decided to just use that without changing the API. This is a limitation.

You could also do it using the call helper verb, but it is not too pretty.

df = df >> call('__getitem__', df.logMessage.map(lambda x: bool(re.search('"success": true', x))))

Maybe the solution is to add a switch to query so that it can accept a boolean array. That would change the API, I need to think about it some more.
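For readers unfamiliar with the `@` syntax: pandas' `DataFrame.query` resolves names prefixed with `@` from the enclosing Python scope, which is what makes the two-step version above work. A pandas-only sketch with invented sample data:

```python
import re
import pandas as pd

df = pd.DataFrame({"logMessage": ['{"success": true}', "something else"]})

# Precompute the boolean Series, then reference it inside the query
# string with '@' so pandas looks it up in the local scope.
idx = df.logMessage.map(lambda x: bool(re.search('"success": true', x)))
result = df.query("@idx")
```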

@has2k1 has2k1 added the enhancement and question labels, and added and then removed the NOT YET RELEASED label (the issue has been fixed, but the fix is not in the current official release) on Apr 8, 2020
@antonio-yu

Are there more convenient methods for using regex when filtering rows, i.e. in the query function?
In pandas, regular expressions seem to be limited in DataFrame.query.
For example:
df.query('col.str.contains("a")') doesn't work.
In plydata it is the same situation: df >> query('col.str.contains("a")') doesn't work.

While in R, df %>% filter(str_detect(col, 'a')) is very useful.
Regex is often needed when filtering rows, so I wonder if there are some methods.
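As an aside not mentioned in the thread: pandas' own `DataFrame.query` can evaluate `.str` methods if you bypass the default numexpr engine, though plydata's `query` verb may not expose this option. A pandas-only sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"col": ["apple", "berry", "cherry"]})

# The default numexpr engine rejects .str accessor calls, but the
# pure-Python evaluation engine accepts them.
result = df.query('col.str.contains("a")', engine="python")
```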

@has2k1
Owner

has2k1 commented Nov 4, 2020

There isn't a way to filter using a regex.

df.query('col.str.contains("a")') doesn't work.

Yes, it does not work. query is limited and will stay that way, since it just passes the task on to pandas DataFrame.query. But there is room for an alternative, more general method that would permit a specification like 'col.str.contains("a")' and so on.

@antonio-yu

Thanks for the reply.
Regex is useful for filtering rows of our Chinese text; pandas can handle that, so it is not a big problem.
Anyway, thanks a lot.

@antonio-yu

Regex is useful for filtering rows of our Chinese text; pandas can handle that, so it is not a big problem.

The query function is just not that convenient, since I am used to filter in R.
I wish there were an alternative method that could replace the built-in query.

Just a personal wish, hhh.

@has2k1
Owner

has2k1 commented Dec 8, 2020

@antonio-yu, can you come up with a short specification/example of how you would expect regex filtering to work. Then we can start from there.

@antonio-yu

@has2k1, hi Hassan. Here is an example:

df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']})
   x      y  
0  0   zero  
1  1    one  
2  2    two  
3  3  three 

First, I want to select the rows in which y contains an 'o':

   x     y
0  0  zero
1  1   one
2  2   two               

Here are some different ways to do it with regular expressions.

df1 <- df %>% filter(str_detect(y,'o')) #R

While in pandas and plydata, the syntax df1 = df >> query('y.str.contains("o")') doesn't work.
I have to choose another way: df1 = df[df.y.str.contains('o')].

I also tried the similar package dplython, where the regrex works in a pipe operation:
df1 = df >> sift(X.y.str.contains('o'))

Normally, I need to filter rows by regrex and then do all the further operations, like grouping and aggregation, in a chainable way.
df %>% filter(str_detect(y, 'o')) %>% group_by(y) %>% summarise(num = n()) #R style

In pandas and plydata, the regrex doesn't work in query: df >> query('y.str.contains("o")') fails, so I have to stop the pipe operation and define a new variable, df1 = df[df.y.str.contains('o')], then continue the further operations: df1['num'] = df1.groupby('y').transform('count').

The key point is that if one link breaks, the pipe operation stops. Regrex is always necessary when selecting rows and columns, especially in our Chinese sentences. I wish query in plydata could work the same way as filter in R, so that regrex, comparison operations, and other boolean values can be parameters.

Hope that was clear 😄. Best wishes!
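For what it's worth, plain pandas can keep a regex filter inside a method chain: `.loc` accepts a callable, so no intermediate variable is needed. A pandas-only sketch of the chained filter-then-count, using the example frame from this comment:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [0, 1, 2, 3],
    "y": ["zero", "one", "two", "three"],
})

# .loc with a lambda filters on the current frame in the chain, so the
# regex step flows straight into the groupby/aggregate that follows.
out = (
    df.loc[lambda d: d.y.str.contains("o")]
      .groupby("y")
      .size()
      .reset_index(name="num")
)
```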

@has2k1
Owner

has2k1 commented Dec 11, 2020

First, I want to select the rows in which y contains an 'o'

There are the query_all, query_at and query_if helpers, but I admit they are not easy to think of.

df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']
})

df >> query_at('y', any_vars='{_}.str.contains("o")')

"""
   x     y
0  0  zero
1  1   one
2  2   two
"""

# However in this case since we are querying a single column we do not need
# to use the '{_}' placeholder for the column name.
df >> query_at('y', any_vars='y.str.contains("o")')
"""
   x     y
0  0  zero
1  1   one
2  2   two
"""

Normally, I need to filter rows by regrex and then do all the further operations

I was confused by what you meant by "regrex"; I thought you meant regular expression.

@has2k1
Owner

has2k1 commented Dec 11, 2020

I think it merits a second function. Now, what to call it: sift, sieve, query2, ...?

@antonio-yu

antonio-yu commented Dec 11, 2020

Sorry, I wrote 'regrex' wrongly; I meant 'regex'. 😂

The functions query_all, query_at, query_if are equivalent to filter_all, filter_at, filter_if in R, right? They have the same logic. But in plydata these three functions always need you to select some columns and pass an 'any_vars=' or 'all_vars=' argument, and sometimes that is really not easy to think of.

df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, np.nan]
})

df >> query_at('beta', any_vars='beta.isin(["b"])')

'''
  alpha beta theta  x  y     z
0     a    b     c  1  6   7.0
2     a    b     e  3  4  11.0
'''

If there were a new function that allowed me to write df >> query_at('beta', any_vars='beta.isin(["b"])') in the simpler form df >> new_function('beta.isin(["b"])'), just like the filter syntax in R, df %>% filter(beta %in% c('b')), that would be easier and cooler.
So I think plydata just lacks a mirror function with the same logic as R's filter.

I think it merits a second function. Now, what to call it sift, sieve, query2, ...?

query itself stays the same as pandas DataFrame.query.
So I think filter_by could be a good name: query_at equals filter_at, query_if equals filter_if, query_all equals filter_all, and filter_by equals filter. This would be useful for users who are converting from R to Python.
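A minimal sketch of what such a verb could do, written as a plain pandas helper. The name `filter_by` and this helper are hypothetical, not part of plydata:

```python
import pandas as pd

def filter_by(df, expr):
    # Hypothetical helper: evaluate the expression with the pure-Python
    # engine so .str methods and isin() work inside the query string.
    return df.query(expr, engine="python")

df = pd.DataFrame({
    "alpha": list("aaabbb"),
    "beta": list("babruq"),
})

# Reads close to R's filter(beta %in% c('b')) and stays chainable via pipe.
result = df.pipe(filter_by, 'beta.isin(["b"])')
```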

@antonio-yu

antonio-yu commented Dec 19, 2020

Hi @has2k1, how is everything going?
Another question, about regex in select.

# Select columns by regex.

df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, np.nan]
})

# Select columns that end with 'a'; it works well.
df >> select(endswith='a')

  alpha beta theta
0     a    b     c
1     a    a     d
2     a    b     e
3     b    r     c
4     b    u     d
5     b    q     e 

# But how to select columns that don't end with 'a'?

df >> select(endswith != 'a'), df >> select(endswith='-a'), or df >> select(~(endswith='a'))?

@has2k1
Owner

has2k1 commented Dec 19, 2020

But how to select columns that don't end with 'a'?

You can use a regular expression:

df >> select(matches='[^a]$')
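The same negative match works in plain pandas via `DataFrame.filter`, which selects columns whose names match a regex, using the example frame from the comment above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "alpha": list("aaabbb"),
    "beta": list("babruq"),
    "theta": list("cdecde"),
    "x": [1, 2, 3, 4, 5, 6],
    "y": [6, 5, 4, 3, 2, 1],
    "z": [7, 9, 11, 8, 10, np.nan],
})

# '[^a]$' keeps columns whose last character is anything but 'a'.
kept = df.filter(regex="[^a]$")
```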
