add tracing to evaluate #11911
Conversation
Documentation preview for e095d98 will be available when this CircleCI job completes.
Looks good so far! Shall we add a test?
The tracing branch is now closed. Can you rebase with a merge target to master?
Left some questions about clarifying the supported functionality. Thanks!
@@ -361,6 +363,73 @@ def baseline_model_uri(request):
    return None


def test_mlflow_evaluate_logs_traces():
Could you update the docs for mlflow.evaluate() to describe this new functionality? I'm curious to see:
- When would traces be generated during an evaluate() call? Are all model types supported?
- How to find the traces produced by a particular evaluate run.
- Is model = function supported? If so, what does it look like? Is there an example for this case?
- All model types should be supported. Anything other than langchain would just trace the input/output. With langchain, it would create a span for each individual section of the chain.
- The traces would appear in the traces section as if you were to log a regular trace.
- I don't think there's a difference between a pyfunc model or a function here, because mlflow.evaluate does the necessary steps to turn it all into a model with a predict method. Here we would be tracking the input/output of this predict method.
> Anything other than langchain would just trace the input/output

It's up to the user to decide what level of granularity to use when tracing; if the user instruments each subroutine in their model with @mlflow.trace, then they won't just get the overall input/output. We're not limiting/restricting this granularity in this PR, right @jessechancy?

> The traces would appear in the traces section as if you were to log a regular trace

The key point is that we set the run ID as a tag on the trace, which appears in the UI.

> I don't think there's a difference between a pyfunc model or a function here, because mlflow.evaluate does the necessary steps to turn it all into a model with a predict method. Here we would be tracking the input/output of this predict method.

Same as above - the user might instrument at a level that's more granular than just input/output. Can we confirm this works? Looks like your tests already cover that.
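To make the granularity point concrete, here's a minimal sketch: the function names are hypothetical, and @mlflow.trace is assumed to be MLflow's tracing decorator. Instrumenting a subroutine produces a child span rather than only the overall input/output of the predict call.

```python
import mlflow
import pandas as pd

@mlflow.trace
def _summarize(text: str) -> str:
    # Stand-in for a real subroutine; the decorator records it as its own span.
    return text[:20]

def model_fn(inputs: pd.DataFrame) -> list:
    return [_summarize(t) for t in inputs["text"]]

eval_df = pd.DataFrame({"text": ["hello world", "tracing in evaluate"]})

with mlflow.start_run():
    # With this PR, traces emitted while model_fn runs are captured and tagged
    # with the evaluation run's ID.
    mlflow.evaluate(model=model_fn, data=eval_df, model_type="text")
```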
        targets=iris_dataset._constructor_args["targets"],
        evaluators="dummy_evaluator",
    )
    assert len(get_traces()) == 1
What does the trace look like with a regressor model? In which scenario would the user want to see a trace for a regressor evaluation?
Hi @jessechancy , is this addressed somewhere? Could you share a pointer if possible?
Besides, I'm wondering if we should provide a config to turn off the tracing for evaluate() runs, in case non-genai users get confused with the auto-logged traces. cc @BenWilson2
        k: {**v, "disable": True} for k, v in AUTOLOGGING_INTEGRATIONS.items()
    }
    try:
        yield None
Suggested change:
-        yield None
+        yield
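For context on the nit: a bare yield is the usual idiom when a @contextlib.contextmanager has no value to hand to the with block. A small self-contained sketch, where AUTOLOGGING_INTEGRATIONS is a placeholder dict standing in for the real registry:

```python
import contextlib

# Placeholder for the real autologging integration registry.
AUTOLOGGING_INTEGRATIONS = {"langchain": {"disable": False}}

@contextlib.contextmanager
def all_autologging_disabled():
    saved = {k: dict(v) for k, v in AUTOLOGGING_INTEGRATIONS.items()}
    AUTOLOGGING_INTEGRATIONS.update(
        {k: {**v, "disable": True} for k, v in AUTOLOGGING_INTEGRATIONS.items()}
    )
    try:
        yield  # nothing to bind in `with ... as x`, so bare yield reads cleaner than `yield None`
    finally:
        AUTOLOGGING_INTEGRATIONS.clear()
        AUTOLOGGING_INTEGRATIONS.update(saved)
```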
@jessechancy can we address this nit?
    TRACE_BUFFER.clear()
We may want to add one final test to this suite:
1. Call evaluate on a model.
2. Verify traces are recorded for a simple langchain chain.
3. Call mlflow autolog (specifying no overrides to the defaults).
4. Invoke the model.
5. Ensure that the default settings for autologging apply and that the expected tracked run and artifacts/metrics/params are logged to the run.

This is mostly to ensure that modifications to autologging behavior within a single session are locally scoped to the configuration overrides used in the context handler. It works as expected now, but if this logic needs to change in the future, having the test will guard against a regression in the behavior introduced in this PR.
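A rough sketch of what that safeguard test could look like, assuming pytest fixtures named simple_chain and eval_data and the existing get_traces() helper from this suite; the final assertion depends on the langchain autologging defaults at the time:

```python
import mlflow

def test_autolog_defaults_unaffected_by_evaluate(simple_chain, eval_data):
    # 1) evaluate a model and 2) verify traces are recorded for the chain
    with mlflow.start_run():
        mlflow.evaluate(
            model=lambda df: [simple_chain.invoke(q) for q in df["inputs"]],
            data=eval_data,
            model_type="text",
        )
    assert len(get_traces()) >= 1

    # 3) call autolog with no overrides and 4) invoke the model directly
    mlflow.langchain.autolog()
    with mlflow.start_run() as run:
        simple_chain.invoke("hello")

    # 5) the overrides evaluate() applied internally must not leak: default
    # autologging should still record its usual run data (exact assertions
    # depend on the default autologging behavior)
    client = mlflow.MlflowClient()
    assert client.list_artifacts(run.info.run_id)
```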
Great idea. Can we also test:
1. Call autolog with some specific settings.
2. Call evaluate on a model.
3. Verify traces are recorded for a simple langchain chain.
4. Invoke the model.
5. Ensure that the settings for autologging from (1) apply and that the expected tracked run and artifacts/metrics/params are logged to the run.

Can we also test these cases when evaluate fails with an exception?
LGTM once the safeguard test is added
LGTM overall, just a question about the non-genai eval run case.
@jessechancy Do we have another JIRA ticket or PR for updating the documentation? If so, could you link it? Thanks!
I noticed some langchain deps changes that might interrupt non-langchain users. Can you help make sure the experience is still smooth in that case? Thanks!
mlflow/models/evaluation/base.py (outdated)

    def monkey_patch_predict(x):
        # Disable all autologging except for traces
        mlflow.langchain.autolog(log_inputs_outputs=False, disable=False)
Shall we check whether langchain is imported? I see this code block is not langchain-specific, so if the customer only uses non-langchain packages, shall we skip this?
+1, we might want to only apply this enablement if the langchain core module is available, via a validation with importlib.
.github/workflows/master.yml (outdated)

@@ -240,7 +240,7 @@ jobs:
      - name: Install dependencies
        run: |
          source ./dev/install-common-deps.sh
-         pip install pyspark torch transformers
+         pip install pyspark torch transformers langchain
Do we want to make langchain a common dep? It seems to me that langchain should not be required for non-genai users (related to my comment in mlflow/models/evaluation/base.py).
Can we test that a user can still call mlflow.evaluate() when langchain is not installed?
It should not be a common dependency - this line change is only for test infra.
That being said, a +1 to the validation check that the patch function with evaluate will work if langchain is not installed. Temporarily removing the install from the REPL context would be good to validate.
Regarding #11911 (comment) - this is valid. I feel like the non-genai use cases (classification, regression) should be opt-in for this behavior as it doesn't make a great deal of sense and could potentially generate a very large dataset that doesn't provide any value, even for debugging.
mlflow/models/evaluation/base.py (outdated)

@@ -1265,6 +1265,7 @@ def _validate_dataset_type_supports_predictions(data, supported_predictions_data
def _evaluate(
    *,
    model,
    model_predict_func,
@jessechancy This will break all existing third-party evaluator plugins, including ones developed by Databricks. Can we construct model_predict_func inside the default evaluator instead?
+1 that it seems naturally scoped to _extract_predict_fn, except in the case that a user would pass this explicitly into DefaultEvaluator.evaluate. Is there a clear use case around that we are trying to address?
I think _extract_predict_fn doesn't cover all cases here, since I see direct usage of model.predict.
@contextlib.contextmanager
def restrict_langchain_autologging_to_traces_only():
    if importlib.util.find_spec("langchain") is None:
Is it sufficient to check sys.modules instead? Are there examples that recommend using find_spec() instead?
There are a few examples throughout the codebase that use importlib.util.find_spec, but I can change it to sys.modules.
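For reference, the two checks answer different questions: find_spec() tells you whether langchain is installed at all, while sys.modules only tells you whether it has already been imported in the current process.

```python
import importlib.util
import sys

# True if langchain is installed (importable), even if it hasn't been imported yet.
langchain_installed = importlib.util.find_spec("langchain") is not None

# True only if some code in this process has already imported langchain.
langchain_imported = "langchain" in sys.modules
```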
tests/evaluate/test_evaluation.py (outdated)

def test_langchain_autolog_parameters_matches_default_parameters():
    # get parameters from mlflow.langchain.autolog
    params = inspect.signature(mlflow.langchain.autolog).parameters
    for name in params:
        assert name in MLFLOW_EVALUATE_RESTRICT_LANGCHAIN_AUTOLOG_TO_TRACES_CONFIG
    for name in MLFLOW_EVALUATE_RESTRICT_LANGCHAIN_AUTOLOG_TO_TRACES_CONFIG:
        assert name in params
Thanks for adding this test. Can you give guidance to future engineers who will see this test fail without much context? How should they determine the value of any new parameter in MLFLOW_EVALUATE_RESTRICT_LANGCHAIN_AUTOLOG_TO_TRACES_CONFIG so as not to violate the traces-only requirement?
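One way to bake that guidance into the test itself might be a docstring along these lines (the wording is a suggestion, not existing code):

```python
def test_langchain_autolog_parameters_matches_default_parameters():
    """Keep MLFLOW_EVALUATE_RESTRICT_LANGCHAIN_AUTOLOG_TO_TRACES_CONFIG in sync
    with mlflow.langchain.autolog().

    If this test fails because a new parameter was added to
    mlflow.langchain.autolog(), add the same key to the config with a value
    that keeps evaluate() traces-only: trace collection stays enabled, and any
    new logging side effect (models, inputs/outputs, artifacts) is turned off.
    """
```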
LGTM, though not familiar enough with implications for official approval
def _extract_predict_fn(model, raw_model, model_predict_fn=None):
    predict_fn = model.predict if model is not None else None
    if model_predict_fn is not None:
        predict_fn = model_predict_fn
Can we add a docstring that indicates the precedence order and restructure this method so that the logic is if / elif / else? It's currently difficult to reason about what happens when model_predict_fn is specified and raw_model is also specified, etc.
Can we also add type hints to the function to indicate which parameters are optional?
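A possible shape for that refactor, as a sketch only: the precedence is inferred from the excerpt above (an explicit model_predict_fn wins over model.predict), and raw_model handling is left out because it isn't visible here.

```python
from typing import Any, Callable, Optional

def _extract_predict_fn(
    model: Optional[Any],
    raw_model: Optional[Any] = None,
    model_predict_fn: Optional[Callable[..., Any]] = None,
) -> Optional[Callable[..., Any]]:
    """Resolve the prediction function used during evaluation.

    Precedence: an explicitly supplied ``model_predict_fn``, else
    ``model.predict`` when a model is given, else ``None``.
    """
    if model_predict_fn is not None:
        return model_predict_fn
    elif model is not None:
        return model.predict
    else:
        return None
```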
LGTM once small comments are addressed. Thanks @jessechancy !
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Add tracing to mlflow.evaluate() with support for regular Python functions, pyfunc models, and LangChain.
When a user passes a chain or function to mlflow.evaluate() that has been instrumented with MLflow tracing, MLflow should log traces during evaluation and link them to the evaluation run using the same tagging approach from https://databricks.atlassian.net/browse/ML-40783 .
We should make sure to support LangChain: even if the user hasn't enabled LangChain autologging, mlflow.evaluate() should enable autologging for the duration of the evaluation so that traces are captured.
We should make sure to only enable the tracing part of LangChain autologging, not the part that logs models.
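A hedged end-to-end sketch of the LangChain case described above: the chain itself is assumed to be built elsewhere, and the exact metrics depend on the chosen model_type and targets.

```python
import mlflow
import pandas as pd

def model_fn(inputs: pd.DataFrame) -> list:
    # `chain` is assumed to be a LangChain runnable constructed elsewhere.
    return [chain.invoke({"question": q}) for q in inputs["question"]]

eval_df = pd.DataFrame(
    {"question": ["What is MLflow Tracing?"], "answer": ["A way to record spans."]}
)

with mlflow.start_run() as run:
    # Even without the user calling mlflow.langchain.autolog(), evaluate()
    # temporarily enables the tracing part of LangChain autologging, so spans
    # for each chain step are captured and tagged with run.info.run_id,
    # without logging the chain as a model.
    mlflow.evaluate(
        model=model_fn,
        data=eval_df,
        targets="answer",
        model_type="question-answering",
    )
```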
How is this PR tested?
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- Yes should be selected for bug fixes, documentation updates, and other small changes.
- No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
- Bug fixes, doc updates and new features usually go into minor releases.
- Bug fixes and doc updates usually go into patch releases.