Replies: 9 comments 1 reply
-
@Elpiro This is a great question and not something that I had considered until now. Based on what I understand from the original publications, I don't think that this would work mathematically. Essentially, a boolean time series is a categorical time series with only two classes (True and False). The matrix profile needs to calculate a z-normalized Euclidean distance between two subsequences and, for a boolean (or categorical) time series, it isn't at all clear what the resulting distance would mean, even if the individual values were cast to 0.0/1.0. So, my instinct would be to advise against it, but I am open to discussing it further if you like.
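To make the concern concrete, here is a minimal sketch (plain NumPy, not stumpy's implementation) of a z-normalized Euclidean distance applied to 0/1 windows. The helper name `znorm_euclidean` and the example windows are made up for illustration, and returning NaN for a constant window is a choice of this sketch only.

```python
import numpy as np

def znorm_euclidean(a, b):
    """Naive z-normalized Euclidean distance between two equal-length windows."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.std() == 0 or b.std() == 0:
        # An all-True or all-False window has zero standard deviation,
        # so it cannot be z-normalized (assumption: report NaN here)
        return float("nan")
    az = (a - a.mean()) / a.std()
    bz = (b - b.mean()) / b.std()
    return float(np.linalg.norm(az - bz))

same = znorm_euclidean([0, 1, 0, 0], [0, 1, 0, 0])      # identical patterns
flat = znorm_euclidean([0, 0, 0, 0], [0, 1, 0, 0])      # one constant window
inverted = znorm_euclidean([0, 0, 1, 0], [1, 1, 0, 1])  # logical negation
```

In this sketch, a window with no events has no defined distance, and a window compared with its logical negation gets the largest possible value. Whether either result is meaningful for boolean data is exactly the open question.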
-
Perhaps @mcyeh or @zpzim have thought about this more. I am also curious as to how matrix profile calculations would work for categorical variables (where there are more than two categories). My intuition tells me that it wouldn't be useful for categorical cases where there isn't a way to compare the categories, but I have also been known to be wrong.
-
Maybe the start of an answer: the course on the matrix profile here, https://www.youtube.com/watch?v=1ZHW977t070&t=1464s, talks about computing the matrix profile on DNA data (at 24:24). They convert the categorical data to a real-valued time series by using this set of rules:
But I feel like it is very case-specific and wouldn't work when the time series has to be created from a set of 100+ features.
-
@Elpiro I don't think I follow the rules (and I'm a trained biochemist... whatever that is worth). It's not clear why one should add/subtract 2 or 1 and why they are mapped to specific values. Any numbers assigned in this way necessarily imply a quantifiable relationship between the categories. To your point, I agree that these types of mappings are very case-specific and I wouldn't feel comfortable providing a universal approach to map categoricals to real-valued time series.
-
A comment: I agree with @seanlaw that it might be stretching the underlying statistical assumptions a bit far, but I personally think I would leave that decision to the user.
-
@miktoki There are several strong concerns here that I don't quite know how to overcome (maybe you've considered this already). When computing the matrix profile one needs to correctly compute:
Also, the speed gain of STOMP/STUMP comes from computing the Euclidean distance specifically and not other forms of distance (e.g., cosine, Mahalanobis, Manhattan).
I don't mean to sound argumentative (so please don't take this as such) but I feel very confident that the statistical assumptions are completely wrong (binary) if we use integers. As much as I would like to "leave the decision to the user", as maintainers, we will be left with the repercussions of whatever API we choose to support. I respect your opinion, so thank you for bringing this up. I want all of us to have continued and thoughtful discussions as we prioritize new features, so know that this dialogue is important to me. So, let's keep this feature in mind and let's collect a dozen examples (with data) to clearly demonstrate/motivate where this application would be warranted. How does that sound?
-
@Elpiro @miktoki Closing this issue for now but feel free to reopen if you think that it would make sense to continue the discussion.
-
Apologies for revisiting this discussion, but I simply wish to confirm my understanding of the issue concerning multivariate time series data comprising both numeric and categorical variables. As far as I understand, generating a multidimensional matrix profile for such cases is indeed feasible, but it has to be done by brute force with some other distance measure, rather than with a fast algorithm like STUMP/STOMP, which only supports Euclidean distance.
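For illustration only, a brute-force profile with a pluggable distance could look roughly like the sketch below. It is univariate, uses a Hamming distance on a made-up categorical sequence, and is not part of stumpy; the names `brute_force_profile` and `hamming` and the exclusion-zone size are assumptions of this sketch.

```python
import numpy as np

def hamming(a, b):
    """Number of positions at which two equal-length windows differ."""
    return sum(x != y for x, y in zip(a, b))

def brute_force_profile(seq, m, dist):
    """Naive O(n^2 * m) nearest-neighbor profile for any distance function."""
    n = len(seq) - m + 1
    excl = max(1, m // 2)  # assumed exclusion zone to skip trivial matches
    P = np.full(n, np.inf)              # distance to nearest neighbor
    I = np.full(n, -1, dtype=np.int64)  # index of nearest neighbor
    for i in range(n):
        for j in range(n):
            if abs(i - j) < excl:
                continue
            d = dist(seq[i:i + m], seq[j:j + m])
            if d < P[i]:
                P[i], I[i] = d, j
    return P, I

seq = list("ABCABDABCA")  # hypothetical categorical sequence
P, I = brute_force_profile(seq, m=3, dist=hamming)
```

Here the window `ABC` at index 0 finds its exact repeat at index 6, so `P[0]` is 0. The cost is quadratic in the number of windows, which is the price of giving up the Euclidean-specific shortcuts.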
-
I think I am late to the party :) I will define a problem myself and try to devise a solution (hack? trick?).
Let's say there are different categories (or tags, activities, labels) one can record...
Let's say we have five categories:
Let's say we have a sequence
In multi-class classification, one approach is 1-vs-rest. For instance, if we have three classes X, Y, Z, we will have three binary classifiers:
So, basically, the first one says whether something is X or not X. We can use a similar concept here. In other words, we can do one-hot encoding (OHE). So, in our example, the categories are
Example:
And when we convert it back to string, we will see:
As @seanlaw pointed out:
(Note that the underlying assumption in the proposed approach above is that the distance between any two non-identical categories is 1)
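A rough sketch of the one-hot idea, assuming three made-up categories and plain (non-normalized) Euclidean distance, might look like this. The helper `one_hot_encode` and the example sequences are hypothetical and are not the ones from the comment above.

```python
import numpy as np

def one_hot_encode(seq, categories):
    """Return a (n_categories, len(seq)) float array, one row per category."""
    index = {c: k for k, c in enumerate(categories)}
    T = np.zeros((len(categories), len(seq)), dtype=np.float64)
    for t, c in enumerate(seq):
        T[index[c], t] = 1.0
    return T

categories = ["A", "B", "C"]  # hypothetical categories
seq1 = list("ABCA")
seq2 = list("ABBA")

T1 = one_hot_encode(seq1, categories)
T2 = one_hot_encode(seq2, categories)

# Each mismatching position flips exactly two one-hot entries, so the
# squared Euclidean distance is twice the number of mismatches
mismatches = sum(x != y for x, y in zip(seq1, seq2))
sq_dist = float(np.sum((T1 - T2) ** 2))
```

Under this encoding every pair of distinct categories is equally far apart, which matches the note above up to a constant factor. This holds for the raw encoding only; once each dimension is z-normalized, that simple relationship is not guaranteed.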
-
Is the matrix profile a tool that can be used for motif discovery in boolean time series? I am asking because the only data type accepted by stumpy.mstump() is float (no int or bool). We can work around this by casting the 0/1 data to float.
But is it mathematically correct to compute the matrix profile on this kind of data?
Use case example: There are 4 buttons on a webpage that can be clicked in any order and as many times as you want. We could represent the click actions as a multidimensional time series (4 dimensions). The patterns we're looking for are the order in which the buttons are clicked, how they are spaced in time, etc.
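As a sketch of that representation, assuming a made-up click log of (time step, button index) pairs and a fixed number of time steps, the boolean array and the float cast could be built like this. Only NumPy is used here; the stumpy call is left as a comment.

```python
import numpy as np

# Hypothetical click log: (time step, button index)
clicks = [(0, 0), (2, 1), (3, 3), (5, 0), (7, 1), (8, 3)]
n_buttons, n_steps = 4, 10

# One row per button, one column per time step
T = np.zeros((n_buttons, n_steps), dtype=bool)
for t, b in clicks:
    T[b, t] = True

# Cast to float, since stumpy.mstump() only accepts float input
T_float = T.astype(np.float64)

# T_float could then be passed on, e.g. stumpy.mstump(T_float, m),
# with the window size m chosen by the user
```

The cast only satisfies the dtype requirement; it does not by itself answer whether the resulting distances are meaningful.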