Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Systematic framework tests for all machines #4712

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

karlnapf
Copy link
Member

@karlnapf karlnapf commented Jul 9, 2019

testing

  • use sgobject iterator to instantiate and test all machines from a given list
  • individual ignore lists for each test (runtime)
  • automate header and machine name extraction from trained_model_serialization.cc.py
  • separate data generating code into environment
  • clean up

tests

  • consistency of views/subset with subsampled(no-view, same data)
  • training/apply leaves data unchanged
  • training thread consistency
  • test cross-validation thread consistency
  • import trained_model_serialization tests
  • delete old trained_model_serialization tests
  • training without initializing does throw an exception (and doesnt crash)
  • document non-working machines in ignore lists

}
}

TEST(AllMachines, cv_thread_consistency)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theartful these are the kind of tests I was talking about.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are very nice!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think separating the data generating code to the public interface would be a good idea? examples would benefit from them for instance

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I want to move all that into an environment similar to those used by trained_model_serialization so the data can be re-used for other instances. Will add that to the list, thanks

@karlnapf karlnapf force-pushed the feature/machine_tests branch 2 times, most recently from 8c4fb25 to 123757c Compare July 9, 2019 15:09
@@ -1,206 +0,0 @@
/*
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are subsumed by the new ones

Copy link
Member

@vigsterkr vigsterkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant and inconsistent


TEST(AllMachines, train_thread_consistency)
{
std::set<std::string> ignores = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy-paste all around the source... plz avoid redundancy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ignores are test-specific.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah 3x times the very same ignore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is because the number of tested machines is small and not yet automated. Although there are already differences here, see random forest. For more machines tested, there will be more diversity. If not, i can of course drop this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you copy paste 3 times the value... you could create a value of it instead of copy-pasting the same string around. 1 copy-paste is one too many imo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, i will look into this once the coverage is bigger

cv->put("seed", 1);
auto result_single = cv->evaluate();

get_global_parallel()->set_num_threads(4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a by-chance checking of the consistency... depends on many aspects. for example on the machine that the code is being run on etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, but this already revealed bugs. open for ways to improve

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's one thing that you have a test locally to have this tested... its another thing that have this formalised in a test that is part of the codebase and the CI that is going to be run on many different arch/distro etc...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example:

  • do not use global_parallel - that is going to be dropped eventually - the sooner the better
  • why 4, why not 8, or 3 or 11 or 42? you could actually use a function to get the runtime env supported number of max threads and use that number, and if it's 1 then say that foobar.
    ....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the number is of course arbitrary, good idea to use the available number of threads, i will do that

init_machine(machine);
auto data = generate_data(machine);

machine->set_labels(data.second);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the triplet: set_labels, train, apply sprayed around this source again... while it could be a function that returns a tuple that you nicely tie and get it back

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but in this case you would actually just return the predictions... so you dont even need that... just pass the rvalue of the tuple (labels, data)

}

// TODO, generate this automatically, like in trained_model_serialization
std::set<string> all_machines = {"LibSVM", "Perceptron", "LibLinear",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least make a function that returns this set instead of having it here a global var....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do once this extraction is automated


get_global_parallel()->set_num_threads(1);
machine->set_labels(data.second);
if (machine->has("seed"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's actually a function for this: seed(machine, 1);

auto result_single = machine->apply(data.first);

init_machine(machine);
get_global_parallel()->set_num_threads(4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

total random number....

get_global_parallel()->set_num_threads(4);
machine2->set_labels(data.second);
if (machine2->has("seed"))
machine2->put("seed", 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess the seed should be the same as above, right? in that case again just copy-pasting the same number around... refactoring is a pain.. use a variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

machine2->set_labels(data.second);
if (machine2->has("seed"))
machine2->put("seed", 1);
machine2->train(data.first);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as below: set_labels, train, apply. could you use a function for this?
the seeding part could be fixed by having that function a 4th arg that has a default value of some special number that means no seeding...

machine_subsampled->put("seed", 1);
}

machine_subset->set_labels(labels_subset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_labels, train, apply... all over again

@stale
Copy link

stale bot commented Feb 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 26, 2020
@gf712 gf712 self-assigned this Feb 26, 2020
@stale stale bot removed the stale label Feb 26, 2020
@gf712 gf712 removed their assignment Feb 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants