
Incorrect ordering of iterations in scalar reporting of metrics #1265

Open
HackXIt opened this issue May 16, 2024 · 1 comment
Labels
bug Something isn't working

Comments

HackXIt commented May 16, 2024

Describe the bug

I believe I have discovered a bug, or at least odd behavior, in the ClearML scalar reporting mechanism.

In my data processing task I have a metric that, both in theory and in the implementation, can only ever increase in value. I report the scalar in each iteration of the loop.

However, when viewed in ClearML, the scalar appears to drop in value in certain runs of the task.

It is apparent that the ordering of the reported iterations is incorrect, so earlier iterations are actually reported later. This does not occur every time, however.

Additionally, I am confused by the scalar metric in general: my loop clearly runs iterations from 0 to X in increments of 1, yet the plot only shows iterations from 0 up to something like 6 or 7. So something is incorrect there as well.

Correct report:
[screenshot: Correct_scalar-report]
Incorrect report:
[screenshot: Incorrect_scalar-report]

To reproduce

  1. Create a task inside a function
  2. Initialize a counter variable at 0
  3. Run a loop in the task
  4. Perform a lengthy operation in each iteration (for example, calling a subprocess that does data processing)
  5. Increase the counter by X
  6. Report the counter as a scalar in each iteration of the loop

Repeat these steps several times and view the report in ClearML.
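The steps above can be sketched as a minimal script. This is a hedged sketch, not the original task: a stub logger stands in for ClearML's `Logger` so the snippet runs without a ClearML server, and `time.sleep` stands in for the lengthy subprocess call. In a real task you would use `task.get_logger()` and call its `report_scalar` method with the same arguments.

```python
import time

class StubLogger:
    """Stand-in for clearml's Logger, recording reports in arrival order."""
    def __init__(self):
        self.reported = []

    def report_scalar(self, title, series, value, iteration):
        self.reported.append((iteration, value))

logger = StubLogger()
total_widgets = 0
for i in range(5):
    time.sleep(0.01)     # stand-in for the lengthy subprocess call
    total_widgets += 1   # the metric can only ever increase
    logger.report_scalar(title="Generator", series="total_widgets",
                         value=total_widgets, iteration=i + 1)

# Reports are made with strictly increasing iteration numbers and values,
# so the plotted scalar should never drop.
print(logger.reported)  # [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
```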

Code that produced the issue for me
def capture_design(design_folder: str):
    import subprocess, os, shutil
    from clearml import Task
    print(f"Capturing designs from {design_folder}...")
    task = Task.current_task()
    logger = task.get_logger()
    design_files = [f for f in os.listdir(design_folder) if os.path.isfile(os.path.join(design_folder, f))]
    if len(design_files) == 0:
        print(f"No design files found in {design_folder}")
        return
    widgets = {}
    for widget in implemented_types:
        widgets[widget] = 0
    files = []
    errors = 0
    logger.report_scalar(title='Generator', series='total_widgets', value=sum(widgets.values()), iteration=0)
    logger.report_scalar(title='Generator', series='errors', value=errors, iteration=0)
    for widget in widgets:
        logger.report_scalar(title='Widget metrics', series=widget, value=widgets[widget], iteration=0)
    for i, design_file in enumerate(design_files):
        print(f"Iteration: {i+1}/{len(design_files)} - {design_file}")
        attempts = 0
        success = False
        # NOTE Retry mechanism due to possible MemoryErrors when dynamically allocating screenshot data (Trust in the OS to clean up the mess)
        while not success and attempts < 4:
            print(f"Running design generator on file {design_file}")
            gen = subprocess.run([os.path.abspath(env['mpy_path']), os.path.abspath(env['mpy_main']), '-m', 'design', '-o', 'screenshot.jpg', '-f', os.path.abspath(os.path.join(design_folder, design_file)), '--normalize'], cwd=os.path.abspath(os.path.curdir), capture_output=True, text=True)
            if gen.returncode != 0:
                print(f"Failed to generate UI from design file {design_file}:\n{gen.stdout}\n{gen.stderr}")
                attempts += 1
                continue
            success = True
        if not success:
            print(f"Failed to generate UI from design file {design_file} after {attempts} attempts")
            errors += 1
            continue
        tmp_image = os.path.abspath(os.path.join(os.path.abspath(os.path.curdir), "screenshot.jpg"))
        tmp_text = os.path.abspath(os.path.join(os.path.abspath(os.path.curdir), "screenshot.txt"))
        if not os.path.exists(tmp_image) or not os.path.exists(tmp_text):
            print(f"Failed to find generated UI files from design file {design_file}")
            errors += 1
            continue
        gen_image = os.path.abspath(os.path.join(env['output_folder'], f"ui_{i}.jpg"))
        gen_text = os.path.abspath(os.path.join(env['output_folder'], f"ui_{i}.txt"))
        try:
            shutil.move(tmp_image, gen_image)
            shutil.move(tmp_text, gen_text)
        except FileNotFoundError as e:
            print(f"Failed to move files from design file {design_file}:\n{tmp_image} -> {gen_image}\n{tmp_text} -> {gen_text}\n{e}")
            errors += 1
            continue
        files.append((gen_image, gen_text))
        annotation_errors = []
        with open(gen_text, 'r+') as f:
            # Each line is in this format: "class x y w h" (Need to grab class)
            new_lines = []
            for i, line in enumerate(f.readlines()):
                widget, x, y, w, h = line.split(' ')
                x, y, w, h = float(x), float(y), float(w), float(h)
                if any([x < 0.0, y < 0.0, w < 0.0, h < 0.0]) or any([x > 1.0, y > 1.0, w > 1.0, h > 1.0]):
                    errors += 1
                    print(f"[Line {i}] Invalid bounding box found in annotation file of {design_file}")
                    print(f"Removed: {widget} {x} {y} {w} {h}")
                    annotation_errors.append(i)
                    continue
                new_lines.append(line)
                if widget in widgets:
                    widgets[widget] += 1
                else:
                    errors += 1
                    print(f"[Line {i}] Unknown widget class {widget} found in annotation file of {design_file}")
            # NOTE Delete invalid annotations in label file
            f.seek(0)
            f.writelines(new_lines)
            f.truncate()
            del new_lines
        logger.report_scalar(title='Generator', series='total_widgets', value=sum(widgets.values()), iteration=i+1)
        logger.report_scalar(title='Generator', series='errors', value=errors, iteration=i+1)
        for widget in widgets:
            logger.report_scalar(title='Widget metrics', series=widget, value=widgets[widget], iteration=i+1)
    generated_files = len(files)
    env['generated_files'] = generated_files
    env['files'] = files

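One thing worth noting in the snippet above, independent of any server-side ordering issue: the inner annotation loop `for i, line in enumerate(f.readlines())` reuses the outer loop variable `i`, so the `report_scalar(..., iteration=i + 1)` calls that follow it use the last annotation line index rather than the design-file index. A minimal demonstration of that shadowing (the loop bounds here are illustrative, not taken from the actual data):

```python
# Illustrative bounds: 3 "design files", each with 7 "annotation lines".
reported_iterations = []
for i in range(3):                      # outer loop over design files
    for i, _ in enumerate(range(7)):    # inner loop shadows the outer i
        pass
    reported_iterations.append(i + 1)   # what iteration=i + 1 would report

print(reported_iterations)  # [7, 7, 7] -- not [1, 2, 3]
```

This could also explain a plot that tops out around iteration 6 or 7 regardless of how many design files were processed.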
Expected behaviour

The scalar plot should display the reported values for each iteration in the order they were reported (i.e. each iteration in sequence).

Environment

  • Server type app.clear.ml
  • ClearML SDK Version 1.15.1
  • Python Version 3.11
  • OS Linux WSL2 (Debian 12)

Related Discussion

https://clearml.slack.com/archives/CTK20V944/p1715875927944579

jkhenning (Member) commented:

Hi @HackXIt, thanks for the report, we'll try to reproduce.
