
k-means anchors ratios #308

Open · mnslarcher opened this issue May 17, 2020 · 26 comments

@mnslarcher commented May 17, 2020

Hi,

If someone wants to give it a try: like https://github.com/Cli98/anchor_computation_tool, I also adapted the code of https://github.com/zhouyuangan/K-Means-Anchors for obtaining anchors ratios using K-Means.

The main differences are that my implementation is faster (fewer for loops), it takes a .json file as input (not an XML) and it has some additional settings (but the Changlin Li repo has more visualization).

https://github.com/mnslarcher/kmeans-anchors-ratios

If you try my code on COCO you will obtain anchors ratios in line with this repo (but for your dataset they can be different, and this is why it's useful).

Any feedback is appreciated.

UPDATE: I added the commands to get the optimal anchors ratios for COCO; the returned ratios are [(0.6, 1.5), (1.0, 1.0), (1.4, 0.7)], very close to [(0.7, 1.4), (1.0, 1.0), (1.4, 0.7)] (the default values of this repo).
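
For context, these ratio pairs are (width, height) values normalized so that width * height is approximately 1, so only the aspect ratio matters. A minimal sketch of that normalization step (my own illustration, not the repo's actual code; centroids_to_ratios is a hypothetical helper):

import numpy as np

def centroids_to_ratios(centroids):
    # scale each (w, h) centroid so that w * h == 1; only the aspect ratio survives
    ratios = centroids / np.sqrt(np.prod(centroids, axis=1, keepdims=True))
    return [tuple(float(x) for x in np.round(r, 1)) for r in ratios]

print(centroids_to_ratios(np.array([[10.0, 23.7], [50.0, 50.0], [23.7, 10.0]])))
# -> [(0.6, 1.5), (1.0, 1.0), (1.5, 0.6)]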

@Cli98 commented May 21, 2020

@mnslarcher Good to see this.

@mnslarcher

Thanks @Cli98 !

@mnslarcher mnslarcher changed the title k-means anchor ratios k-means anchors ratios May 23, 2020
@Cli98 commented May 27, 2020

@mnslarcher lol, you claimed that your repo is faster than mine. I don't agree with that, since I see no reason that justifies it.

@mnslarcher

Hi @Cli98 :)! I'm saying this because you use the implementation of https://github.com/zhouyuangan/K-Means-Anchors. From that implementation I removed this

        for row in range(rows):
            distances[row] = 1 - iou(boxes[row], clusters)

by vectorizing the iou calculation.

From your notebook:

    rows = boxes.shape[0]

    distances = np.empty((rows, k))
    last_clusters = np.zeros((rows,))

    np.random.seed()

    # the Forgy method will fail if the whole array contains the same rows
    clusters = boxes[np.random.choice(rows, k, replace=False)]

    while True:
        for row in range(rows):
            distances[row] = 1 - iou(boxes[row], clusters)

        nearest_clusters = np.argmin(distances, axis=1)

        if (last_clusters == nearest_clusters).all():
            break

        for cluster in range(k):
            clusters[cluster] = dist(boxes[nearest_clusters == cluster], axis=0)

        last_clusters = nearest_clusters

    return clusters

This is mine:

    np.random.seed(seed)
    last_nearest_centroids = np.ones(boxes.shape[0]) * -1
    # the Forgy method will fail if the whole array contains the same rows
    centroids = boxes[np.random.choice(boxes.shape[0], num_clusters, replace=False)]
    i = 0
    while True:
        if i >= max_iter:
            logger.warning(
                f"[{datetime.now().strftime('%m/%d %H:%M:%S')}] "
                "Maximum number of iterations reached, increase max_iter to do more "
                f"iterations (max_iter = {max_iter})"
            )
            break

        nearest_centroids = np.argmax(iou(boxes, centroids), axis=1)

        for centroid in range(num_clusters):
            centroids[centroid] = centroid_calc_fn(
                boxes[nearest_centroids == centroid], axis=0
            )

        if (nearest_centroids == last_nearest_centroids).all():
            break

        last_nearest_centroids = nearest_centroids
        i += 1

    return centroids

This made the implementation faster, as one would expect. I have an option to run K-Means more than once, but if you set --num-runs=1 you should see some benefit. Obviously feel free to correct me, but this is the reason.

@Cli98 commented May 27, 2020

@mnslarcher

Where is your vectorized iou code? Are you referring to this?

nearest_centroids = np.argmax(iou(boxes, centroids), axis=1)

@mnslarcher commented May 27, 2020

> Where is your vectorized iou code? Are you referring to this?
>
> nearest_centroids = np.argmax(iou(boxes, centroids), axis=1)

In that line I'm using the vectorized iou; this is my implementation:

def iou(boxes, anchors):
    """Calculates the Intersection over Union (IoU) between a numpy array of
    n boxes and an array of k anchors.
    Arguments:
        boxes (array_like): array of shape (n, 2) of boxes' widths and heights.
        anchors (array_like): array of shape (k, 2) of anchors' widths and heights.
    Returns:
        A numpy array of shape (n, k) containing the IoU values for each combination
        of boxes and anchors.
    """
    intersection_width = np.minimum(anchors[:, [0]], boxes[:, 0]).T
    intersection_height = np.minimum(anchors[:, [1]], boxes[:, 1]).T

    if np.any(intersection_width == 0) or np.any(intersection_height == 0):
        raise ValueError("Some boxes have zero size.")

    intersection_area = intersection_width * intersection_height
    boxes_area = np.prod(boxes, axis=1, keepdims=True)
    anchors_area = np.prod(anchors, axis=1, keepdims=True).T
    union_area = boxes_area + anchors_area - intersection_area
    return intersection_area / union_area

This is the original one:

def iou(box, clusters):
    """
    Calculates the Intersection over Union (IoU) between a box and k clusters.
    :param box: tuple or array, shifted to the origin (i. e. width and height)
    :param clusters: numpy array of shape (k, 2) where k is the number of clusters
    :return: numpy array of shape (k,) where k is the number of clusters
    """
    x = np.minimum(clusters[:, 0], box[0])
    y = np.minimum(clusters[:, 1], box[1])
    if np.count_nonzero(x == 0) > 0 or np.count_nonzero(y == 0) > 0:
        raise ValueError("Box has no area")

    intersection = x * y
    box_area = box[0] * box[1]
    cluster_area = clusters[:, 0] * clusters[:, 1]

    iou_ = intersection / (box_area + cluster_area - intersection)

    return iou_

As you can see, one returns an array of shape (n, k) (all boxes against all anchors at once), while the other returns shape (k,) for a single box.
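
A quick shape check (my own snippet; vectorized_iou / original_iou here stand for the two functions above, which are both called iou in their respective repos):

import numpy as np

boxes = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])  # n = 3 boxes as (w, h)
anchors = np.array([[1.0, 1.0], [1.4, 0.7]])                  # k = 2 anchors as (w, h)

print(vectorized_iou(boxes, anchors).shape)   # (3, 2): every box against every anchor
print(original_iou(boxes[0], anchors).shape)  # (2,): one box at a time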

@Cli98 commented May 27, 2020

@mnslarcher I see.
But that will not change much if you only vectorize the iou function. Normally we have ~50 bboxes. That's not a large number. Did you time your kmeans function? Any stats?

@mnslarcher

@Cli98 , I did, but honestly I don't remember the exact result; maybe I will re-do it this weekend and post it here :)

Why do you say that we have ~50 bboxes? In my dataset I have 100k images with more or less 10 boxes each, so ~1M bboxes, far more than 50, if I get your point (probably I misunderstood you). Clearly, if your dataset is small, the benefit is basically zero.

In general the speed increase is not a crucial feature of my repo, it's only a plus. Other than that, I added a couple of functionalities that were useful for me and I will continue to add more. I have not done anything extraordinary, but if anyone wants to try it, I will be happy to receive any feedback.

@mnslarcher commented Jun 6, 2020

Hi @Cli98 , I did the speed test; the result, if I haven't made any mistake, is that my implementation is much faster (70x faster on COCO train 2017).

I saw in a comment (that you then edited) that you didn't find any evidence of this, so I'll post the results here along with how you can replicate them. I did a very fast experiment, so feel free to tell me if you spot any mistake. I didn't use timeit just because I didn't want to wait, but feel free to replace %%time with %%timeit.

The results (mine vs. yours):
with 100 bboxes: 3.47 ms vs 10.7 ms
with 1000 bboxes: 4.16 ms vs 104 ms
with 10000 bboxes: 20.8 ms vs 3.65 s
with 100000 bboxes: 531 ms vs 42.5 s
with 859999 bboxes (COCO train 2017): 6.5 s vs 7min 24s

So, as you can see, for COCO mine takes just 6.5 s and yours 7 min 24 s; both return the same result (mine is 70x faster in this experiment).

Get the bboxes:

import json
import numpy as np
import os

INSTANCES_PATH = "./annotations/instances_train2017.json"

if not os.path.exists(INSTANCES_PATH):
    !wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
    !unzip annotations_trainval2017.zip
    
with open(INSTANCES_PATH) as f:
    instances = json.load(f)
    
# COCO "bbox" entries are [x, y, width, height]: keep the (width, height) pairs with positive area
bboxes = np.array([i["bbox"][2:] for i in instances["annotations"] if np.prod(i["bbox"][2:]) > 0])
del instances

Define the functions (the not_vectorized ones are the original functions that you are also using in your notebook; the others are the ones I defined for my repo). Note that kmeans below uses logger and datetime, so you also need:

import logging
from datetime import datetime

logger = logging.getLogger(__name__)

def not_vectorized_iou(box, clusters):
    """
    Calculates the Intersection over Union (IoU) between a box and k clusters.
    :param box: tuple or array, shifted to the origin (i. e. width and height)
    :param clusters: numpy array of shape (k, 2) where k is the number of clusters
    :return: numpy array of shape (k,) where k is the number of clusters
    """
    x = np.minimum(clusters[:, 0], box[0])
    y = np.minimum(clusters[:, 1], box[1])
    if np.count_nonzero(x == 0) > 0 or np.count_nonzero(y == 0) > 0:
        raise ValueError("Box has no area")

    intersection = x * y
    box_area = box[0] * box[1]
    cluster_area = clusters[:, 0] * clusters[:, 1]

    iou_ = intersection / (box_area + cluster_area - intersection)

    return iou_


def not_vectorized_kmeans(boxes, k, dist=np.median):
    """
    Calculates k-means clustering with the Intersection over Union (IoU) metric.
    :param boxes: numpy array of shape (r, 2), where r is the number of rows
    :param k: number of clusters
    :param dist: distance function
    :return: numpy array of shape (k, 2)
    """
    rows = boxes.shape[0]

    distances = np.empty((rows, k))
    last_clusters = np.zeros((rows,))

    np.random.seed()

    # the Forgy method will fail if the whole array contains the same rows
    clusters = boxes[np.random.choice(rows, k, replace=False)]

    while True:
        for row in range(rows):
            distances[row] = 1 - not_vectorized_iou(boxes[row], clusters)

        nearest_clusters = np.argmin(distances, axis=1)

        if (last_clusters == nearest_clusters).all():
            break

        for cluster in range(k):
            clusters[cluster] = dist(boxes[nearest_clusters == cluster], axis=0)

        last_clusters = nearest_clusters

    return clusters


def iou(boxes, anchors):
    """Calculates the Intersection over Union (IoU) between a numpy array of
    n boxes and an array of k anchors.

    Arguments:
        boxes (array_like): array of shape (n, 2) of boxes' widths and heights.
        anchors (array_like): array of shape (k, 2) of anchors' widths and heights.

    Returns:
        A numpy array of shape (n, k) containing the IoU values for each combination
        of boxes and anchors.
    """
    intersection_width = np.minimum(anchors[:, [0]], boxes[:, 0]).T
    intersection_height = np.minimum(anchors[:, [1]], boxes[:, 1]).T

    if np.any(intersection_width == 0) or np.any(intersection_height == 0):
        raise ValueError("Some boxes have zero size.")

    intersection_area = intersection_width * intersection_height
    boxes_area = np.prod(boxes, axis=1, keepdims=True)
    anchors_area = np.prod(anchors, axis=1, keepdims=True).T
    union_area = boxes_area + anchors_area - intersection_area
    return intersection_area / union_area


def kmeans(boxes, num_clusters=3, max_iter=300, seed=None, centroid_calc_fn=np.median):
    """Calculates K-Means clustering using the Intersection over Union (IoU) metric.

    Arguments:
        boxes (array_like): array of the bounding boxes' heights and widths, with
            shape (n, 2).
        num_clusters (int): the number of clusters to form as well as the number of
            centroids to generate.
        max_iter (int, optional): maximum number of iterations of the K-Means algorithm
        for a single run. Default: 300.
        seed (int, optional): random seed. Default: None.
        centroid_calc_fn (function, optional): function used for calculating centroids
            Default: numpy.median.

    Returns:
        A numpy array of shape (num_clusters, 2).
    """
    np.random.seed(seed)
    last_nearest_centroids = np.ones(boxes.shape[0]) * -1
    # the Forgy method will fail if the whole array contains the same rows
    centroids = boxes[np.random.choice(boxes.shape[0], num_clusters, replace=False)]
    i = 0
    while True:
        if i >= max_iter:
            logger.warning(
                f"[{datetime.now().strftime('%m/%d %H:%M:%S')}] "
                "Maximum number of iterations reached, increase max_inter to do more "
                f"iterations (max_inter = {max_iter})"
            )
            break

        nearest_centroids = np.argmax(iou(boxes, centroids), axis=1)

        for centroid in range(num_clusters):
            centroids[centroid] = centroid_calc_fn(
                boxes[nearest_centroids == centroid], axis=0
            )

        if (nearest_centroids == last_nearest_centroids).all():
            break

        last_nearest_centroids = nearest_centroids
        i += 1

    return centroids

Run whichever experiment you wish, e.g.:

%%time

kmeans(bboxes, 3)

Result:

CPU times: user 6.49 s, sys: 174 µs, total: 6.49 s
Wall time: 6.5 s

array([[201.49, 218.97],
       [ 55.75,  65.25],
       [ 15.97,  19.85]])
%%time

not_vectorized_kmeans(bboxes, 3)

Result:

CPU times: user 7min 28s, sys: 8.91 s, total: 7min 36s
Wall time: 7min 24s

array([[ 15.97,  19.85],
       [ 55.75,  65.25],
       [201.49, 218.97]])
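
Outside a notebook, a plain-Python version of the same timing (my own sketch, reusing bboxes and the two functions defined above; absolute numbers will of course vary by machine):

import time

start = time.perf_counter()
print(kmeans(bboxes, 3))
print(f"vectorized kmeans: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
print(not_vectorized_kmeans(bboxes, 3))
print(f"loop-based kmeans: {time.perf_counter() - start:.2f} s")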

@Cli98 commented Jun 6, 2020

Apologies for the late reply. I'm referring to a single image, not all of them. Normally we do not conduct k-means experiments on all bounding boxes, given their large variation, not to mention the class imbalance issue.

@Cli98 commented Jun 6, 2020

@mnslarcher I actually modified my comment and deleted the second sentence, but I have no idea why you could still see my previous statement.

I actually ran my experiment on all the code snippets and found that, averaged over 1000 trials per 100 bboxes, they give similar performance, 38 vs 57 ms. But you timed it a different way, and I think that also makes sense. So I think that's the end of this discussion, and I will try to do some optimization on my code XD

@mnslarcher

Mmm, I'm not sure I get your point. I understand that maybe you don't want to run K-Means on all your bounding boxes if your dataset is not balanced (mine is balanced), but I don't understand the utility of running K-Means on just one image to decide the anchors ratios. But it's late where I am and I'm tired, so I'll go to sleep now :)...

@Cli98 commented Jun 6, 2020

"running K-Means on just one image to..."
Nope, that's not what I mean sir... But just let it pass.

@zylo117 (Owner) commented Jun 11, 2020

Actually, I think sklearn.cluster.KMeans is much faster, considering it's a C implementation.

@mnslarcher

Probably, but what we call K-Means here is K-Means with IoU as the distance metric (that's the main contribution of this kind of repo), and to my current understanding (in the beginning I tried to see if it was possible to use sklearn) it is not straightforward to implement K-Means with IoU as the distance metric using sklearn. If someone knows how to do the same with that implementation of K-Means, I would be curious to see it and eventually update my code.

The first problem is that that implementation doesn't let you choose a distance metric at all. Then there are other things you need to consider for this implementation; if you look at the K-Means implementation of these repos you should be able to see it 😊.

@zylo117 zylo117 pinned this issue Jun 11, 2020
@zylo117 (Owner) commented Jun 11, 2020

@mnslarcher
I did replace it with sklearn's kmeans.

from sklearn.cluster import KMeans


def kmeans(boxes, num_clusters=3, max_iter=300, seed=None, centroid_calc_fn=np.median):
    """Calculates K-Means clustering using sklearn's Euclidean implementation.

    Arguments:
        boxes (array_like): array of the bounding boxes' heights and widths, with
            shape (n, 2).
        num_clusters (int): the number of clusters to form as well as the number of
            centroids to generate.
        max_iter (int, optional): maximum number of iterations of the K-Means algorithm
            for a single run. Default: 300.
        seed (int, optional): random seed. Default: None.
        centroid_calc_fn (function, optional): kept for signature compatibility;
            ignored here, since sklearn always uses the cluster mean.

    Returns:
        A numpy array of shape (num_clusters, 2).
    """
    kmeans = KMeans(n_clusters=num_clusters, max_iter=max_iter, tol=1e-3, random_state=seed).fit(boxes)
    centroids = kmeans.cluster_centers_
    return centroids

But yours is obviously much faster and more efficient, given that sklearn uses all available CPUs while yours only takes one.

And if the statistics are right, I think both of them give a similarly correct result.

I think using IoU and using Euclidean distance are quite similar for this kind of situation. But yes, IoU is more straightforward. I just realized this is why IoU loss beats L1 loss when training a bbox regression model.

your algorithm:

[06/11 15:18:07] Starting the calculation of the optimal anchors ratios
[06/11 15:18:07] Extracting and preprocessing bounding boxes
[06/11 15:18:07] Discarding 0 bounding boxes with size lower or equal to 0
[06/11 15:18:07] K-Means (10 runs): 100%|███████████████| 10/10 [00:04<00:00,  2.07it/s]
[06/11 15:18:12] Runs avg. IoU: 87.90% ± 0.00% (mean ± std. dev. of 10 runs, 0 skipped)
[06/11 15:18:12] Avg. IoU between bboxes and their most similar anchors after normalizing them so that they have the same area (only the anchor ratios matter): 87.90%
[06/11 15:18:12] Avg. IoU between bboxes and their most similar anchors (no normalization, both anchor ratios and sizes matter): 55.65%
[06/11 15:18:13] Num. bboxes with similar anchors (IoU >= 0.5):  79499/124806 (63.70%)
[06/11 15:18:13] Optimal anchors ratios: [(0.78, 1.29), (1.04, 0.96), (1.25, 0.8)]

sklearn's algorithm:

[06/11 15:18:25] Starting the calculation of the optimal anchors ratios
[06/11 15:18:25] Extracting and preprocessing bounding boxes
[06/11 15:18:25] Discarding 0 bounding boxes with size lower or equal to 0
[06/11 15:18:25] K-Means (10 runs): 100%|███████████████| 10/10 [00:18<00:00,  1.89s/it]
[06/11 15:18:44] Runs avg. IoU: 87.37% ± 0.00% (mean ± std. dev. of 10 runs, 0 skipped)
[06/11 15:18:44] Avg. IoU between bboxes and their most similar anchors after normalizing them so that they have the same area (only the anchor ratios matter): 87.37%
[06/11 15:18:44] Avg. IoU between bboxes and their most similar anchors (no normalization, both anchor ratios and sizes matter): 55.73%
[06/11 15:18:44] Num. bboxes with similar anchors (IoU >= 0.5):  79599/124806 (63.78%)
[06/11 15:18:44] Optimal anchors ratios: [(0.74, 1.36), (1.02, 0.99), (1.28, 0.78)]

@mnslarcher

Nice! Yes, I think it's true that one can get a decent approximation of K-Means with IoU using regular K-Means with Euclidean distance; in both cases you end up with reasonable anchors ratios.
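
A hedged sketch of that approximation (my own illustration, not either repo's code: normalizing the boxes to unit area first, so that Euclidean clustering only sees the aspect ratio; bboxes as extracted in the benchmark above):

import numpy as np
from sklearn.cluster import KMeans

# rescale every (w, h) so that w * h == 1, leaving only the aspect ratio
normalized = bboxes / np.sqrt(np.prod(bboxes, axis=1, keepdims=True))
ratios = KMeans(n_clusters=3, random_state=0).fit(normalized).cluster_centers_
print(np.round(ratios, 2))  # ratio-like (w, h) pairs, each with w * h close to 1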

@zylo117 zylo117 unpinned this issue Jun 11, 2020
@zylo117 zylo117 pinned this issue Jun 11, 2020
@zylo117 (Owner) commented Jun 11, 2020

@mnslarcher
Hi, I tried to compare the optimal anchors with the default anchors.
Most of the time it helps, but in some datasets it doesn't.
Like this:

[06/11 17:08:58] When you use default anchors...
[06/11 17:08:58] Avg. IoU between bboxes and their most similar anchors (no normalization, both anchor ratios and sizes matter): 69.61%
[06/11 17:08:59] Num. bboxes with similar anchors (IoU >= 0.5):  37659/38510 (97.79%)
[06/11 17:08:59] Default anchors ratios: [(0.7, 1.4), (1.0, 1.0), (1.4, 0.7)]
[06/11 17:08:59] Runs avg. IoU: 88.01% ± 0.00% (mean ± std. dev. of 10 runs, 0 skipped)
[06/11 17:08:59] Avg. IoU between bboxes and their most similar anchors after normalizing them so that they have the same area (only the anchor ratios matter): 88.01%
[06/11 17:08:59] Avg. IoU between bboxes and their most similar anchors (no normalization, both anchor ratios and sizes matter): 70.88%
[06/11 17:08:59] Num. bboxes with similar anchors (IoU >= 0.5):  37275/38510 (96.79%)
[06/11 17:08:59] Optimal anchors ratios: [(0.8, 1.3), (1.0, 1.0), (1.2, 0.9)]
[06/11 17:08:59] Improvement: -1.00%

The ratio of bboxes with similar anchors decreases by 1% after using the optimal anchors.
It'd be nice to have a comparison with the default anchors, so that we can know whether we should apply the new ones or not.

Also, the reason why it sometimes doesn't help after the fitting should be intriguing as well.

@mnslarcher

It makes sense: in the end, K-Means is not trying to maximize the coverage; it's doing something correlated but not exactly the same. In particular, it's possible that the default anchors have a lower avg. IoU with the bounding boxes but have an IoU > 0.5 more frequently, while the avg. IoU of the K-Means anchors is probably higher but they sacrifice some bounding boxes for it. At least this is my current intuition.

I think that default > K-Means happens almost always when both have very good coverage after removing the effect of the anchors sizes (like in the case you reported).

To give you an example of the limitations of K-Means: I have one detector that needs to detect electric poles. In almost all the images, for every pole, height >> width, but I also have some cases where the helicopter took the photo from a strange perspective and the pole height << pole width.

In my script you can increase num_anchors_ratios but, given that the cases above are rare, this will just produce another anchor with height > width; it's a limitation of K-Means.

My solution for these cases is to run the algorithm once with num_anchors_ratios=2 or 3 and then re-run it with num_anchors_ratios=1 or 2 (depending on whether I want to end up with 3, 4 or 5 anchors ratios) on what the tutorial calls instances_without_similar_anchors. This produces good results in my case and I think it's a reasonable way to increase coverage.

I'll add the comparison with the default anchors to the output of the script, thanks! :)

@mnslarcher

Quick update:

I did what @zylo117 suggested; now the output is something like this:

[06/13 12:57:38] Reading ./annotations/instances_train2017.json
[06/13 12:57:54] Starting the calculation of the optimal anchors ratios
[06/13 12:57:54] Extracting and preprocessing bounding boxes
[06/13 12:57:57] Discarding 2 bounding boxes with size lower or equal to 0
[06/13 12:57:57] K-Means (3 runs): 100%|██████████████████| 3/3 [00:33<00:00, 11.06s/it]
        Runs avg. IoU: 80.48% ± 0.00% (mean ± std. dev. of 3 runs, 0 skipped)
        Avg. IoU between bboxes and their most similar anchors after norm. them to make their area equal (only ratios matter): 80.48%
[06/13 12:58:33] Default anchors ratios: [(0.7, 1.4), (1.0, 1.0), (1.4, 0.7)]
        Avg. IoU between bboxes and their most similar default anchors, no norm. (both ratios and sizes matter): 55.16%
        Num. bboxes without similar default anchors (IoU < 0.5):  253049/860001 (29.42%)
[06/13 12:58:37] K-Means anchors ratios: [(0.6, 1.5), (1.0, 1.0), (1.4, 0.7)]
        Avg. IoU between bboxes and their most similar K-Means anchors, no norm. (both ratios and sizes matter): 55.72%
        Num. bboxes without similar K-Means anchors (IoU < 0.5):  240788/860001 (28.00%)
[06/13 12:58:37] K-Means anchors have an IoU < 50% with bboxes in 1.43% less cases than the default anchors, you should consider to use them

As you can see, I decided to report the num. of bboxes without similar anchors instead of the num. of bboxes with similar anchors, because it is easier to interpret when the anchors have high coverage.

@Cli98 commented Jun 13, 2020

@mnslarcher and @zylo117

I don't think this comparison should be suggested, as the result of k-means is not stable and highly depends on the distribution of the input data.

To prove this, run the experiment multiple times; the avg. IoU may not be the same, and thus it is highly possible that you see the difference in AP flip from -1 to 1. A totally invalid conclusion.

So I do not think there is any way to solve this, even with the proposed solution discussed above. Let me know your thoughts. If you ask me, I will take the one that differs from the default, even if the performance drops, given the difference is not too much, as the bbox distribution in a customized dataset must have some difference compared with the default.

@Cli98 commented Jun 13, 2020

It's still a trade-off here (number of anchors vs. AP) and thus not optimal. The better solution is to jump out of box.

@mnslarcher

> To prove this, run the experiment multiple times; the avg. IoU may not be the same, and thus it is highly possible that you see the difference in AP flip from -1 to 1. A totally invalid conclusion.

I don't agree. In my implementation I have a num_runs parameter for this; in most cases, running K-Means many times produces exactly the same result (at least on fairly large datasets). This is because we have just two dimensions and, for reasonable datasets, a lot of samples. In any case, it's not something one has to theorize about: it's enough to use a high value for num_runs and look at the avg. IoU std. dev.
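
A quick way to check this empirically (my own sketch, reusing the iou and kmeans functions posted above and bboxes from the benchmark):

avg_ious = []
for run in range(10):
    # a different seed per run gives a different Forgy initialization
    centroids = kmeans(bboxes, num_clusters=3, seed=run)
    # average IoU between each box and its most similar centroid
    avg_ious.append(iou(bboxes, centroids).max(axis=1).mean())

print(f"avg. IoU: {np.mean(avg_ious):.4f} ± {np.std(avg_ious):.4f}")

If the std. dev. is (essentially) zero, the runs converged to the same solution.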

> If you ask me, I will take the one that differs from the default, even if the performance drops, given the difference is not too much, as the bbox distribution in a customized dataset must have some difference compared with the default.

This doesn't make sense to me. It's true that in most cases the default anchors are not the best ones, but there is no logical reason to think that some other ratios are better just because they are different (especially if you have a metric that is telling you the opposite). Following this logic, using default anchors + random noise would be better than using default anchors.

There is a proper way to measure how good the anchors are: you can use a validation set (and in this case I don't think we ever over-fit the training set, so an evaluation on the training set is OK too).

> It's still a trade-off here (number of anchors vs. AP) and thus not optimal. The better solution is to jump out of box.

I don't understand what you mean by "jump out of box". In any case I agree, it's not optimal; it's just a heuristic that I use to balance the limitations of K-Means. This is an example where this heuristic works:

[3, 10] * 1000
[4, 10] * 1000
[5, 10] * 1000
[10, 3] * 100

In this case, with num_anchors_ratios=3, K-Means would find 3 anchors ratios with height > width, because the centroids will be driven by the shapes with height > width. This is not optimal if the objective is to maximize coverage. To avoid this, one can run K-Means with num_anchors_ratios=1 or 2, and the result will be 1 or 2 anchors that alone cover all the boxes except the 100 [10, 3]. At this point one can run K-Means again on the residuals (which are only the [10, 3] boxes) and the result will be [10, 3]; the set of anchors at this point has an IoU > 50% with all the boxes. This is obviously a toy example, just to clarify the idea (see the sketch below).
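
A minimal sketch of this two-pass heuristic on the toy data (my own illustration, reusing the iou and kmeans functions posted earlier in the thread; with Forgy initialization a degenerate draw can leave a cluster empty, so a real run should use more than one attempt):

import numpy as np

# toy data: tall shapes dominate, plus a small group of wide shapes
boxes = np.array(
    [[3, 10]] * 1000 + [[4, 10]] * 1000 + [[5, 10]] * 1000 + [[10, 3]] * 100,
    dtype=float,
)

# first pass: fewer clusters than the number of ratios we want in the end
centroids = kmeans(boxes, num_clusters=2, seed=0)

# residuals: boxes whose most similar centroid still has IoU < 0.5
residuals = boxes[iou(boxes, centroids).max(axis=1) < 0.5]

# second pass on the residuals alone (here, only the [10, 3] boxes survive)
if len(residuals) > 0:
    extra = kmeans(residuals, num_clusters=1, seed=0)
    centroids = np.vstack([centroids, extra])
print(centroids)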

@Cli98 commented Jun 14, 2020

@mnslarcher Are we on the same page? I'm just pointing out my thought that the difference added to decide which is more useful may not lead to a correct conclusion.

> I don't agree. In my implementation I have a num_runs parameter for this; in most cases, running K-Means many times produces exactly the same result (at least on fairly large datasets). [...]

Interesting to see that you get exactly the same result. If this holds, I admit that the decision holds. But k-means is not a stable algorithm, is it? Is the initialization stable? How could you produce the same result, and by doing what? I did not see any part where you actually use a seed; have you done it some other way?

> This doesn't make sense to me. It's true that in most cases the default anchors are not the best ones, but there is no logical reason to think that some other ratios are better just because they are different. [...]

Let me ask you a question: for a given customized dataset, suppose I run your algorithm and get the same result as the COCO default [(1, 1), (1.4, 0.7), (0.7, 1.4)], while I conduct EDA and find the average ratio < 1. If I then get the k-means result [(1, 1), (0.7, 1.4)] and find it scores lower than the default, which should I choose?

Noise can be present anywhere in this world. Running k-means cannot kick all of it out.

> There is a proper way to measure how good the anchors are: you can use a validation set. [...]

> I don't understand what you mean by "jump out of box". In any case I agree, it's not optimal; it's just a heuristic that I use to balance the limitations of K-Means. [...]

Validation is a way, though not effective to me.
What you state is a solution; however, memory may not be able to hold that many anchors.

@mnslarcher

Sorry @Cli98 , I'm not sure I'm following you 100%; in any case, just to clarify:

  1. There is the possibility to set a random seed in my K-Means implementation (you can look at the code a few messages above), but for obvious reasons I don't use it when I run K-Means multiple times (I run K-Means multiple times exactly to try different initializations and keep the best result among them).
  2. What I'm saying is that, even with different initializations, in the special case of using K-Means for this problem, where we have just 2 dimensions, usually many data points and well defined clusters, K-Means tends to always find the same solution. I'm not saying this is always true for K-Means; it depends on the type of problem (dataset) where you apply it. This obviously is not a rigorous demonstration; I'm reporting the empirical evidence that resulted from running K-Means many times with different initializations on several different datasets.
  3. I don't understand 100% the rest of your message, but strictly speaking I'm just saying that once you choose a metric that you want to maximize (like the number of boxes with IoU > 50%) and you find a solution that maximizes it, you should stick with it, independently of how you found it. So, if the default values have a higher score for your metric, and you chose your metric well, you should stick with the default values.

I'm not sure whether I answered you, given that I'm not sure I understood your point.

@Cli98 commented Jun 14, 2020

@mnslarcher So here is my point. I am not saying your k-means algorithm is incorrect; I'm just wondering whether using the difference to measure the result of k-means really works. Since the result is not stable, such a comparison may not really make sense. Just my thought.

So I guess your point is that such randomness may not change the conclusion, as it is to some degree stable. From my perspective this may not always hold; it actually depends on the difficulty of the dataset. And we agree with each other on this. That's enough, I guess.

> I'm reporting the empirical evidence that resulted from running K-Means many times with different initializations on several different datasets.

I'm glad to see any reports in the near future. And this follows your point 3. My point is: since you judge the result of k-means by AP, you may see the difference work. But how effective it is to judge by AP remains controversial to me. Let's see if your experiments can justify this.
