Centering rather than normalizing subsequences during motif discovery #940

EitanHemed · 2024-01-04T11:17:39Z

Hi all

This is a good opportunity to say thanks to the developers of STUMPY. So useful!

And now to my question. Currently, I am working on a dataset comprising locations (X, Y), bounded between 0 and 1.

Using multidimensional motif discovery, I am able to find motifs and matches that represent common patterns of paths (e.g., left-downward movement). I've tweaked the parameters of stumpy.mstump and stumpy.mmotifs quite a lot but can't really get what I need. Here I give simplified examples, but usually I look for motifs using m=60 (30Hz data).

--

As I set normalize=True, the sub-sequences of the location data included in each motif can differ in scale and be located in different areas of space.

e.g., the path XY1 = (0, 0) >> XY2 = (1, 1), describing an upward-rightward path, would be clustered with XY1 = (0.3, 0.4) >> XY2 = (0.4, 0.5), regardless of the distance covered or the start and end points.

The following scatter displays a set of matches for one of the motifs (left downward movement, here).

The purple scatter is the trivial match (distance = 0).
The orange squares show the most similar match.
The yellow triangle scatter is the least similar match.
The red S indicates the start of the sub-sequence, E stands for the end of the sub-sequence.

I would like to use a different normalization method* for each sub-sequence, rather than Z-normalization.

# For example, apply this centering for each sub-sequence
def new_normalization(d):
    return d - d[0]

as this will allow me to find sets of paths which have similar shape but different scale.
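To make the difference concrete, here is a minimal sketch comparing the two normalizations on the example paths above. It assumes NumPy, straight-line paths sampled at m=60 points, and per-dimension z-normalization; `z_normalize` and `center_on_start` are illustrative helpers, not STUMPY functions.

```python
import numpy as np


def z_normalize(d):
    # Per-dimension z-normalization of an (m, 2) array of XY positions
    return (d - d.mean(axis=0)) / d.std(axis=0)


def center_on_start(d):
    # Shift the path so that it originates from XY = [0, 0]
    return d - d[0]


m = 60
t = np.linspace(0.0, 1.0, m)[:, None]

# (0, 0) >> (1, 1) and (0.3, 0.4) >> (0.4, 0.5), both sampled as straight lines
long_path = np.array([0.0, 0.0]) + t * np.array([1.0, 1.0])
short_path = np.array([0.3, 0.4]) + t * np.array([0.1, 0.1])

z_dist = np.linalg.norm(z_normalize(long_path) - z_normalize(short_path))
c_dist = np.linalg.norm(center_on_start(long_path) - center_on_start(short_path))

print("z-normalized distance:", z_dist)
print("centered distance:", c_dist)
```

Under z-normalization the two paths become numerically indistinguishable, while centering removes only the starting location and keeps the difference in distance covered.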

The matches plotted above will look like this when centered (i.e., originating from XY = [0, 0]).

Is there a way to do that currently in STUMPY? Unless I miss something, a multidimensional profile matrix normalize only takes a boolean value at the moment.

I imagine that this could be a useful feature for many, so if this is an issue of just implementing it, I would be willing to take a stab at it.

One approach I've tried with limited success is, following motif discovery, to take the total distance covered along the path of each sub-sequence and cluster the matches into short vs. long paths. However, I was hoping for a more elegant solution.
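For reference, a rough sketch of that post-hoc workaround, assuming each match is an (m, 2) NumPy array of XY positions. The helper names and the median threshold are illustrative assumptions, not part of STUMPY.

```python
import numpy as np


def path_length(subseq):
    # Total Euclidean distance covered along an (m, 2) array of XY positions
    steps = np.diff(subseq, axis=0)
    return np.linalg.norm(steps, axis=1).sum()


def split_short_long(matches, threshold=None):
    # Label each match as long (True) or short (False) by its path length.
    # The median is an arbitrary default threshold chosen for illustration.
    lengths = np.array([path_length(s) for s in matches])
    if threshold is None:
        threshold = np.median(lengths)
    return lengths, lengths > threshold


t = np.linspace(0.0, 1.0, 60)[:, None]
matches = [
    np.array([0.0, 0.0]) + t * np.array([1.0, 1.0]),  # long diagonal path
    np.array([0.3, 0.4]) + t * np.array([0.1, 0.1]),  # short diagonal path
]
lengths, is_long = split_short_long(matches)
print(lengths, is_long)
```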

Thanks!

NimaSarajpoor · 2024-01-05T02:21:06Z

Maintainer

@EitanHemed

Thank you for your question and welcome to the STUMPY community! Also, thank you for the kind words!

Computing the distance between a multi-dimensional motif and its closest match may need its own discussion, as it is more complicated than computing distances between subsequences of a one-dimensional time series. For now, let's assume you have a single time series (i.e., just one dimension) and try to answer your question for this simpler case first.

> I would like to use a different normalization method* for each sub-sequence, rather than Z-normalization.
>
> Is there a way to do that currently in STUMPY? Unless I miss something, a multidimensional profile matrix normalize only takes a boolean value at the moment.

The short answer is "No". In this similar post, @seanlaw mentioned that:

> Unfortunately, what you are asking for is not possible in STUMPY. While it might seem trivial on the surface, STUMPY employs many complex mathematical manipulations and computational tricks behind the scenes to ensure that the matrix profile is computed as efficiently as possible.

It is also suggested that one might be better off just computing the pairwise distances between all subsequences if the volume of your data is small (see this comment).
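For small data, that brute-force route could look roughly like the sketch below. It is a naive computation in plain NumPy, not a STUMPY API: each subsequence is centered on its first point, all dimensions are flattened together, and an exclusion zone of ceil(m / 4) around each index is assumed in order to skip trivial matches. The function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def centered_profile_bruteforce(T, m):
    # Naive matrix profile on start-centered subsequences.
    # T is an (n,) or (n, d) array; memory grows quadratically with n.
    T = np.asarray(T, dtype=float)
    if T.ndim == 1:
        T = T[:, None]

    # windows has shape (n - m + 1, d, m); subtract each window's first sample
    windows = sliding_window_view(T, m, axis=0)
    centered = windows - windows[:, :, :1]
    flat = centered.reshape(len(centered), -1)

    # Full pairwise Euclidean distance matrix via the dot-product identity
    sq = (flat**2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (flat @ flat.T)
    D = np.sqrt(np.maximum(D2, 0.0))

    # Exclusion zone: ignore each subsequence's immediate neighbours
    excl = int(np.ceil(m / 4))
    idx = np.arange(len(D))
    D[np.abs(idx[:, None] - idx[None, :]) <= excl] = np.inf

    return D.min(axis=1), D.argmin(axis=1)


rng = np.random.default_rng(0)
T = rng.normal(size=(200, 2))
pattern = rng.normal(size=(30, 2))
T[20:50] = pattern
T[120:150] = pattern + np.array([5.0, -3.0])  # same shape, different location

P, I = centered_profile_bruteforce(T, m=30)
print("nearest neighbour of subsequence 20:", I[20], "distance:", P[20])
```

The planted pattern appears at two different locations in space, so centering should pair the two copies with a distance close to zero.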

> I imagine that this could be a useful feature for many, so if this is an issue of just implementing it, I would be willing to take a stab at it.

New contributions are indeed welcome if they are a good fit for the library.

@seanlaw
Do you have any additional thoughts or suggestions?

@EitanHemed
Can you please let us know how many data points you have? Also, would it be possible for you to drag-and-drop the CSV file of the data so that we can better understand what it looks like?

1 reply

seanlaw Jan 5, 2024
Maintainer

As @NimaSarajpoor mentioned, the multi-dimensional matrix profile is far more complex (to interpret the meaning of, especially within the "subspace" where only a subset of all dimensions are considered) than the 1-dimensional matrix profile and so I'd hesitate to deviate from the original published work.

> Is there a way to do that currently in STUMPY? Unless I miss something, a multidimensional profile matrix normalize only takes a boolean value at the moment.

This gets super complicated in the multi-dimensional matrix profile case since, behind the scenes, we need to compute something (obscure and undocumented) called the "minimum-description-length" (something that took me several years on-and-off to fully grasp and understand) for z-normalized subsequences and it would not be practical/possible to support doing this for other normalization functions without adding additional long-term maintenance cost to the code base (My self-preservation instincts are apparent in this statement and so I mean this in the kindest way possible!).

> I imagine that this could be a useful feature for many, so if this is an issue of just implementing it, I would be willing to take a stab at it.

I am always open to new contributions but, respectfully, given the ever-growing size/complexity of our code base, we would need more evidence (in the form of additional user comments or "emoji upvotes" or some other form) to begin having the conversation of increasing the scope of STUMPY. I completely understand and am sympathetic to your need for this feature but we must also balance the practicality of long term maintenance and any potential reduction in code readability. Having said that, we would be happy to review any contributions and to offer feedback.

NimaSarajpoor · 2024-01-07T03:14:11Z

Maintainer

@EitanHemed

I was thinking more about your problem and I think there might be a way to do it. But, there are a few notes that you should consider:

- If the volume of your data is small, it is still recommended that you just get all subsequences, do the transformation, and then compute the full distance matrix.
- If the volume of your data is large, we can start by finding a way to compute the matrix profile for one-dimensional time series data under the new transformation. While the matrix profile may not be what you need, it can be used as our starting point. We can then start thinking about how to modify mstump / mmotifs accordingly (at this moment, I am not aware of their challenges).

I assume the following conditions are met:

(I) The transformation of interest is just an offset; i.e. we are interested in computing $dist(S_{i} - \alpha_{i}, S_{j} - \alpha_{j})$, where $\alpha_{i}$ is the offset for the subsequence $S_{i}$, and $\alpha_{j}$ is the offset for the subsequence $S_{j}$.

(II) dist is the Euclidean (2-norm) distance function.

Let's say you have two subsequences $S_{i}$ and $S_{j}$, each with length m, where $T_{i,s}$ denotes the s-th element of the i-th subsequence.

$S_{i} = [T_{i,0}, T_{i,1}, ..., T_{i,m-1}]$

$S_{j} = [T_{j,0}, T_{j,1}, ..., T_{j,m-1}]$

And, let's say you want to apply some offset to them,
e.g. applying offset $\alpha_{i}$ to $S_{i}$ and offset $\alpha_{j}$ to $S_{j}$

$x = dist(S_{i}, S_{j})$

$y = dist(S_{i} -\alpha_{i}, S_{j} -\alpha_{j})$

where dist is the Euclidean (2-norm) distance function. $x$ is the distance with no transformation, and $y$ is the distance we are looking for.

$y^2 = \sum_{s=0}^{m-1}{[(T_{i,s} - \alpha_{i}) - (T_{j,s} - \alpha_{j})]^{2}}$

$y^2 = \sum_{s=0}^{m-1}{[(T_{i,s} - T_{j,s}) - (\alpha_{i} - \alpha_{j})]^{2}}$

$y^2 = \sum_{s=0}^{m-1}{(T_{i,s} - T_{j,s})^2} + \sum_{s=0}^{m-1}{(\alpha_{i} - \alpha_{j})^2} - \sum_{s=0}^{m-1}{2(\alpha_{i} - \alpha_{j})(T_{i,s} - T_{j,s})}$

$y^2 = \sum_{s=0}^{m-1}{(T_{i,s} - T_{j,s})^2} + \sum_{s=0}^{m-1}{(\alpha_{i} - \alpha_{j})^2} - 2(\alpha_{i} - \alpha_{j})\sum_{s=0}^{m-1}{(T_{i,s} - T_{j,s})}$

$y^2 = \sum_{s=0}^{m-1}{(T_{i,s} - T_{j,s})^2} + \sum_{s=0}^{m-1}{(\alpha_{i} - \alpha_{j})^2} - 2(\alpha_{i} - \alpha_{j})(\sum_{s=0}^{m-1}{T_{i,s}} - \sum_{s=0}^{m-1}{T_{j,s}})$

Since $\sum_{s=0}^{m-1}{T_{i,s}} = m\mu_{i}$, where $\mu_{i}$ is the mean of $S_{i}$ (and likewise for $S_{j}$):

$y^2 = \sum_{s=0}^{m-1}{(T_{i,s} - T_{j,s})^2} + \sum_{s=0}^{m-1}{(\alpha_{i} - \alpha_{j})^2} - 2(\alpha_{i} - \alpha_{j})(m\mu_{i} - m\mu_{j})$

$y^2 = x^2 + m(\alpha_{i} - \alpha_{j})^2 - 2m(\alpha_{i} - \alpha_{j})(\mu_{i} - \mu_{j})$
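As a quick sanity check, the final identity can be verified numerically. The sketch below uses random data and takes each subsequence's first element as its offset; this choice of offset is an assumption for illustration, and any scalar offsets would work.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 60
S_i = rng.random(m)
S_j = rng.random(m)

# Offsets: here the first element of each subsequence (centering on the start)
alpha_i, alpha_j = S_i[0], S_j[0]
mu_i, mu_j = S_i.mean(), S_j.mean()

# x^2: squared distance without any transformation
x2 = np.sum((S_i - S_j) ** 2)

# y^2 computed directly from the shifted subsequences
y2_direct = np.sum(((S_i - alpha_i) - (S_j - alpha_j)) ** 2)

# y^2 computed from x^2, the offsets, and the means
y2_identity = (
    x2
    + m * (alpha_i - alpha_j) ** 2
    - 2 * m * (alpha_i - alpha_j) * (mu_i - mu_j)
)

print(y2_direct, y2_identity)
```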

Let alpha be an array of size len(T) - m + 1, where alpha[i] is the offset we want to apply to the i-th subsequence of length m. Then, we need to change this block of code

stumpy/stumpy/aamp.py

Lines 121 to 140 in 3559b38

    
if uint64_i == 0 or uint64_j == 0:
    p_norm = (
        np.linalg.norm(
            T_B[uint64_j : uint64_j + uint64_m]
            - T_A[uint64_i : uint64_i + uint64_m],
            ord=p,
        )
        ** p
    )
else:
    p_norm = np.abs(
        p_norm
        - np.absolute(T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1])
        ** p
        + np.absolute(
            T_B[uint64_j + uint64_m - uint64_1]
            - T_A[uint64_i + uint64_m - uint64_1]
        )
        ** p
    )

to this:

if uint64_i == 0 or uint64_j == 0:
    p_norm = (
        np.linalg.norm(
            T_B[uint64_j : uint64_j + uint64_m]
            - T_A[uint64_i : uint64_i + uint64_m],
            ord=p,
        )
        ** p
    )

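    # NEW POST-PROCESS, ONLY for p == 2, with offset: convert x ** 2 into y ** 2
    # NOTE: Replace `i`, `j`, and `m` with `uint64_i`, `uint64_j`, and `uint64_m` respectively
    # μ_A and μ_B hold the means of the length-m subsequences of T_A and T_B
    p_norm = p_norm + m * (alpha[i] - alpha[j]) ** 2 - 2 * m * (alpha[i] - alpha[j]) * (μ_A[i] - μ_B[j])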

else:
    
    # compute x based on y, update it, and then compute y again!

    # NEW PRE-PROCESS, ONLY for p == 2, with offset: recover x ** 2 of the previous pair from y ** 2
    p_norm = p_norm - m * (alpha[i - 1] - alpha[j - 1]) ** 2 + 2 * m * (alpha[i - 1] - alpha[j - 1]) * (μ_A[i - 1] - μ_B[j - 1])

    p_norm = np.abs(
        p_norm
        - np.absolute(T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1])
        ** p
        + np.absolute(
            T_B[uint64_j + uint64_m - uint64_1]
            - T_A[uint64_i + uint64_m - uint64_1]
        )
        ** p
    )

    # NEW POST-PROCESS, ONLY for p == 2, with offset
    p_norm = p_norm + m * (alpha[i] - alpha[j]) ** 2 - 2 * m * (alpha[i] - alpha[j]) * (μ_A[i] - μ_B[j])

We can use this to address #900 as well.
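To see the idea outside of the numba code, here is a simplified pure-Python sketch of the same diagonal traversal for a one-dimensional self-join. It is not the STUMPY implementation: the function name is hypothetical, an exclusion zone of ceil(m / 4) is assumed, and instead of converting y back to x before each rolling update, it keeps the unshifted x ** 2 as the rolling quantity and derives y ** 2 from it at every step.

```python
import numpy as np


def offset_profile_diagonal(T, m, alpha):
    # Matrix profile of dist(S_i - alpha[i], S_j - alpha[j]) for a 1D self-join
    T = np.asarray(T, dtype=float)
    l = len(T) - m + 1

    # Rolling means of all length-m subsequences
    csum = np.concatenate(([0.0], np.cumsum(T)))
    mu = (csum[m:] - csum[:-m]) / m

    excl = int(np.ceil(m / 4))
    P2 = np.full(l, np.inf)

    for k in range(excl + 1, l):  # diagonals outside the exclusion zone
        for i in range(l - k):
            j = i + k
            if i == 0:
                # First cell on the diagonal: compute x^2 from scratch
                x2 = np.sum((T[j : j + m] - T[i : i + m]) ** 2)
            else:
                # Rolling update: drop the oldest term, add the newest one
                x2 = abs(
                    x2
                    - (T[j - 1] - T[i - 1]) ** 2
                    + (T[j + m - 1] - T[i + m - 1]) ** 2
                )
            d_alpha = alpha[i] - alpha[j]
            y2 = x2 + m * d_alpha**2 - 2 * m * d_alpha * (mu[i] - mu[j])
            y2 = max(y2, 0.0)  # guard against tiny negative round-off
            P2[i] = min(P2[i], y2)
            P2[j] = min(P2[j], y2)

    return np.sqrt(P2)


rng = np.random.default_rng(1)
T = rng.normal(size=80).cumsum()
m = 10
alpha = T[: len(T) - m + 1]  # center each subsequence on its first value
P = offset_profile_diagonal(T, m, alpha)
print(P[:5])
```

Keeping x ** 2 separate avoids the pre-process step at the cost of one extra variable; both formulations should give the same distances up to floating-point error.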
