Fix GCP step logging #2366

Closed
wants to merge 3 commits into from

@adtygan adtygan commented Jan 26, 2024

Describe changes

The issue arises because GCS objects are immutable (see the linked Stack Overflow thread), so an existing log file cannot be appended to in place. To fix this, I rewrite the existing file with its old contents followed by the buffer's contents.
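
A minimal sketch of the rewrite-instead-of-append idea described above. It uses a local path as a stand-in for a GCS object, and the helper name `rewrite_with_appended` is hypothetical; it is not the PR's actual code or part of ZenML.

```python
import os
import tempfile
from typing import List


def rewrite_with_appended(path: str, buffer: List[str]) -> None:
    """Emulate an append on storage whose objects cannot be modified in place."""
    old = ""
    if os.path.exists(path):
        # Read back everything written so far.
        with open(path, "r") as f:
            old = f.read()
    new = "".join(line + "\n" for line in buffer)
    # Overwrite the whole file with old contents plus the new buffer.
    with open(path, "w") as f:
        f.write(old + new)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "test.txt")
    rewrite_with_appended(log_path, ["line 0", "line 1"])
    rewrite_with_appended(log_path, ["line 2"])
    with open(log_path) as f:
        print(f.read())
```

Note that every flush re-reads and rewrites the whole file, so the work per flush grows with the size of the log.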

Code to test the change: (Credits: @strickvl)

Please run it with a GCS stack active.

import gcsfs
from zenml.client import Client
from zenml.logging.step_logging import StepLogsStorage

client = Client()
_ = client.active_stack

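TEST_FILE = "gs://zenml-2211/test.txt"

# A small max_messages should trigger several buffer flushes for 11 lines.
log_storage = StepLogsStorage(logs_uri=TEST_FILE, max_messages=5)
for i in range(11):
    log_storage.write(f"I'm log line #{i}")
# Flush whatever is still left in the buffer.
log_storage.save_to_file()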

fs = gcsfs.GCSFileSystem()
with fs.open(TEST_FILE, 'r') as f:
    all_of_it = f.read()

print(all_of_it)

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

Contributor

coderabbitai bot commented Jan 26, 2024

Important

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.



gitguardian bot commented Jan 26, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
- Google Cloud Keys 82c26cb zenml-key.json View secret
- Google Cloud Keys c57dd68 zenml-key.json View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


Author

adtygan commented Jan 26, 2024

I accidentally included my GCP keys in the PR. I later made another commit to delete them, but GitGuardian still shows an error. I have not contributed to open source before, so I would appreciate some assistance with this.

Contributor

@htahir1 htahir1 left a comment


Thanks for the PR!

Two questions

  • Will this affect other artifact stores like S3?
  • Is this still performant? Somehow I feel like we are doing too many IO operations... Maybe we should slow it down a bit?

Author

adtygan commented Jan 26, 2024

Thanks for reviewing @htahir1 !

  • Will this affect other artifact stores like S3?
    I do not think so. The existing solution appended the logs to the file, whereas the proposed solution overwrites the existing file with the updated content, so it should work on other stores as well.
  • Is this still performant? Somehow I feel like we are doing too many IO operations... Maybe we should slow it down a bit?
    I did try to think of better solutions. The main limitation, that GCS files are immutable, led me to this option. Could you suggest any alternative approaches that I could try?

Thanks

Contributor

htahir1 commented Jan 26, 2024

@adtygan As the keys remain in the git history, please invalidate them in GCP ASAP to keep your cloud account safe.

For the performance, maybe a good way to do it would be to benchmark some running pipelines with varying amounts of logs. I have noticed that if you use a rich or TQDM progress bar it slows down a LOT, and I'd love some benchmarks on the local store vs the GCS store for varying scripts :-)

Author

adtygan commented Jan 27, 2024

Thanks for the suggestion @htahir1 , I have invalidated my key. With regard to benchmarks, please give me some time. I will get back on this and update you.

Author

adtygan commented Feb 3, 2024

Hello @htahir1 , I want to confirm with you if I understand what you said correctly. I'm planning to measure the running time for logging 100, 1,000 and 10,000 lines. For each of these, I'm going to measure the running times for local, GCP, local with TQDM, GCP with TQDM. In total this should give 12 run time values. I need to measure these 12 values for the current version of the code and my PR version.
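
The measurement plan above could be sketched roughly as follows. This is only an illustration: `run_matrix` and `fake_logger` are hypothetical names, and the workload is a placeholder that formats strings rather than running a real pipeline on a local or GCP stack. The same matrix would be run once per code version.

```python
import itertools
import time
from typing import Callable, Dict, Tuple

LINE_COUNTS = (100, 1_000, 10_000)
STORES = ("local", "gcp")
TQDM = (False, True)

Config = Tuple[int, str, bool]


def run_matrix(log_fn: Callable[[int, str, bool], None]) -> Dict[Config, float]:
    """Time log_fn once for every (line count, store, tqdm) combination."""
    results: Dict[Config, float] = {}
    for n, store, use_tqdm in itertools.product(LINE_COUNTS, STORES, TQDM):
        start = time.perf_counter()
        log_fn(n, store, use_tqdm)
        results[(n, store, use_tqdm)] = time.perf_counter() - start
    return results


def fake_logger(n: int, store: str, use_tqdm: bool) -> None:
    # Placeholder workload; a real benchmark would run a step that logs n lines.
    for i in range(n):
        _ = f"I'm log line #{i}"


timings = run_matrix(fake_logger)
print(len(timings))  # 3 line counts x 2 stores x 2 tqdm settings = 12
```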

Is this correct? Thanks.

Contributor

htahir1 commented Feb 3, 2024 via email

@strickvl strickvl changed the title fixed gcp step_logging Fix gcp step_logging Feb 5, 2024
@strickvl strickvl changed the title Fix gcp step_logging Fix GCP step logging Feb 5, 2024
@strickvl strickvl added the bug Something isn't working label Feb 5, 2024
@strickvl strickvl linked an issue Feb 5, 2024 that may be closed by this pull request
1 task
Author

adtygan commented Feb 6, 2024

I did part of the benchmarking and noticed several issues.

Here are the details of the runs for writing 100 lines of logs (averaged over 10 runs):

  • My version on GCP stack: 27.807 seconds
  • Develop branch version on GCP stack: (did not run this because it does not properly write logs)
  • My version on local stack: (this created a file larger than 1 GB, and I don't yet know why; on the GCP stack it works fine)
  • Develop branch version on local stack: 0.003 seconds

I'm noticing two big issues with my fix:

  1. It is not fast enough.
  2. It creates a huge file on the local stack, and I can't understand why.

I need some more time to look into this issue.
Thanks.

Contributor

strickvl commented Feb 8, 2024

@adtygan note that you're getting some linting failures on the CI. if you could fix those as well that'd be great!

Author

adtygan commented Feb 14, 2024

Hello @strickvl , I don't think my current code can be optimized to improve performance. Instead, I checked the Potential Solution you had mentioned in the initial post of the issue (#2211 (comment)). This option looks like the best choice. However, I want to clarify how to go about incorporating it.

If I understand correctly, you are suggesting opening the log file in write mode and then proceeding with logging. This would write all the contents in one go. I have tested this on a GCP stack and it works. The only issue I see is that it will overwrite past logs.

Can I work on a solution where we create a temporary file to store the logs, and then using the exit() method append this file's contents to the main log file?
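
A rough sketch of that temporary-file idea, written as a context manager whose exit step merges the buffered logs into the main file. The name `buffered_log` is hypothetical, local paths stand in for the artifact store, and this is not the code from any ZenML PR.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import IO, Iterator


@contextmanager
def buffered_log(main_path: str) -> Iterator[IO[str]]:
    """Collect log lines in a temp file, then merge them into main_path on exit."""
    tmp = tempfile.NamedTemporaryFile("w+", delete=False, suffix=".log")
    try:
        yield tmp
        # Read back everything logged during the block.
        tmp.flush()
        tmp.seek(0)
        new = tmp.read()
        old = ""
        if os.path.exists(main_path):
            with open(main_path) as f:
                old = f.read()
        # A single rewrite of the main file keeps past logs and adds the new ones.
        with open(main_path, "w") as f:
            f.write(old + new)
    finally:
        tmp.close()
        os.unlink(tmp.name)


with tempfile.TemporaryDirectory() as d:
    main = os.path.join(d, "step.log")
    with open(main, "w") as f:
        f.write("earlier run\n")
    with buffered_log(main) as log:
        log.write("line 0\n")
        log.write("line 1\n")
    with open(main) as f:
        merged = f.read()
    print(merged)
```

With this shape the main file is rewritten once per logging session instead of once per buffer flush.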

Thanks

Contributor

htahir1 commented Feb 16, 2024

@adtygan this sounds like a reasonable plan to try out! Would love to see how this new approach would benchmark against the old one

Author

adtygan commented Mar 12, 2024

Sorry @htahir1 , I took a break from work for a few weeks and did not keep you updated. I will get back to working on the issue.

Author

adtygan commented Mar 16, 2024

Hello @htahir1 and @strickvl , I have opened a new PR (#2533). Please review it. To the best of my knowledge, it does not have any errors.

@strickvl strickvl closed this May 3, 2024
@strickvl strickvl mentioned this pull request May 3, 2024
9 tasks
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix step logging when using GCS Artifact Store
3 participants