feat(server): fully accelerated nvenc #9452

Merged on May 16, 2024 (20 commits)

Contributor

@mertalev mertalev commented May 14, 2024

Description

Edit: An earlier version of this PR made a breaking change here, but after some consideration I decided against it. The PR now makes hardware decoding opt-in, and existing setups will continue to work. There is a new toggle for hardware decoding that applies to NVENC and RKMPP. Since it defaults to false, NVENC behaves the same as it does today. RKMPP will be downgraded to software decoding until the admin enables hardware decoding.
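
As a rough illustration of the opt-in behavior described above, here is a minimal shell sketch of a toggle that selects between hardware and software decoding. The variable name `ACCEL_DECODE` and the function `decode_flags` are hypothetical and not taken from the PR; the input options are the ones used in the NVENC commands posted later in this conversation.

```shell
#!/bin/sh
# Hypothetical toggle name; the real setting name is not given in this thread.
# It defaults to false, matching the opt-in behavior described above.
ACCEL_DECODE="${ACCEL_DECODE:-false}"

# decode_flags ENABLED: print the ffmpeg input options for the chosen decoding mode.
decode_flags() {
  if [ "$1" = "true" ]; then
    # Hardware decoding: frames are decoded on the GPU and stay in CUDA memory.
    echo "-hwaccel cuda -hwaccel_output_format cuda -noautorotate -threads 1"
  else
    # Software decoding: no hwaccel options, frames stay in system memory.
    echo ""
  fi
}

echo "ffmpeg $(decode_flags "$ACCEL_DECODE") -i input.mkv"
```

Running it with `ACCEL_DECODE=true` prints the hardware-decoding variant; leaving the variable unset keeps the existing software-decoding behavior.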

This is a smaller version of #9402 that only changes the behavior for NVENC. That PR aimed to streamline the decoding and filtering process to one pipeline by leveraging libplacebo and Vulkan's cross-device capabilities. However, this is premature due to the following reasons:

  1. Vulkan is not supported on RKMPP, so a separate pipeline is still necessary for end-to-end acceleration.
  2. There is a roughly 30% speed penalty on CPU compared to the more traditional pipeline, likely due to the overhead of uploading frames to Vulkan and downloading them again.
  3. Vulkan on FFmpeg is a very active area of development, and we are not able to upgrade to the newest and most feature-complete versions of FFmpeg while Jellyfin is still on 6.0. This limits the APIs and devices that can use Vulkan.
  4. After speaking with Jellyfin devs, they strongly recommended against relying on it too heavily due to an above-average rate of breaking changes and poor backwards compatibility (hitting Windows primarily, but also affecting some Intel devices on Linux).

Vulkan works very well with Nvidia from my testing and has reasonable backwards compatibility (9xx series onward), so it's fine to use it here. Maybe someday it can be used more extensively, but in the meantime it's similar to this XKCD.

Testing

Tested transcoding a video on NVENC and CPU with tone-mapping enabled and disabled, confirming success logs and confirming the video plays (with browser caching disabled to ensure the video is up-to-date).

@mertalev mertalev changed the title feat(server): fully accelerated nvenc feat(server)!: fully accelerated nvenc May 14, 2024
Contributor

@jrasm91 jrasm91 left a comment

Was this just for testing?

server/Dockerfile (outdated, resolved)
@mertalev mertalev force-pushed the feat/server-hw-decoding-no-toggle branch from 98edc66 to d5ace68 Compare May 14, 2024 01:05

cloudflare-pages bot commented May 14, 2024

Deploying immich with Cloudflare Pages

Latest commit: 75906c1
Status: ✅  Deploy successful!
Preview URL: https://6541231f.immich.pages.dev
Branch Preview URL: https://feat-server-hw-decoding-no-t.immich.pages.dev

@mertalev mertalev force-pushed the feat/server-hw-decoding-no-toggle branch from eb495c8 to 60aadf3 Compare May 14, 2024 14:19
docker/hwaccel.transcoding.yml (outdated, resolved)
@mertalev mertalev force-pushed the feat/server-hw-decoding-no-toggle branch from 60aadf3 to ea33cd9 Compare May 15, 2024 00:55
@mertalev mertalev changed the title feat(server)!: fully accelerated nvenc feat(server): fully accelerated nvenc May 15, 2024
@mertalev
Contributor Author

I decided to add the hardware decoding toggle after all. The new command may not work with older kernels or drivers, and it's useful for debugging purposes. Overall, this takes it from being a change I'm slightly nervous about to one I can confidently say is fine.

@nyanmisaka

Just an FYI, jellyfin-ffmpeg includes our homemade native filter tonemap_cuda. It has lower overhead than the hwupload+libplacebo implementation (which requires an extra semaphore for CUDA<->Vulkan interop) and performs better on entry-level NVIDIA GPUs, although it only has the most basic functionality. It also avoids depending on the Vulkan runtime; the CUDA runtime alone is enough. That way you can avoid breaking changes introduced by Vulkan. Its usage is similar to the existing tonemap_opencl.

BTW, for issue #9252, we also have transpose_{cuda,opencl} as well as vpp_{qsv,rkrga} and flip_vulkan filters. FFmpeg is not smart enough to insert them automatically in a full hardware pipeline, so video captured by a GoPro can end up upside down after transcoding.
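
Since these filters exist only in some builds (jellyfin-ffmpeg, per the comment above), it can help to check what a given binary provides before relying on them. The following is a small sketch, not part of the PR: `has_filter` is a hypothetical helper that scans the output of `ffmpeg -filters`, where the filter name is the second whitespace-separated column.

```shell
#!/bin/sh
# has_filter NAME: read `ffmpeg -filters` output on stdin and
# succeed (exit 0) only if a filter called NAME is listed.
has_filter() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit (found ? 0 : 1) }'
}

# Only query ffmpeg if a binary is actually on PATH.
if command -v ffmpeg >/dev/null 2>&1; then
  for f in tonemap_cuda transpose_cuda flip_vulkan; do
    if ffmpeg -hide_banner -filters 2>/dev/null | has_filter "$f"; then
      echo "$f: available"
    else
      echo "$f: missing"
    fi
  done
fi
```

The options of an individual filter can then be listed with `ffmpeg -hide_banner -h filter=NAME`.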

@zackpollard
Contributor

I was going to give this a test but given the latest comment I will wait for @mertalev to give it another pass if he wants to make changes related to that comment.

@mertalev
Contributor Author

Thanks for the tips, @nyanmisaka! I didn't know about tonemap_cuda or the other filters you mentioned. I'll try using it here.

@fyfrey
Contributor

fyfrey commented May 15, 2024

@mertalev let me know if (and when) I should test this on my RK3588 device

@mertalev
Contributor Author

mertalev commented May 15, 2024

@nyanmisaka I did some testing and tonemap_cuda is indeed faster (97s vs 106s for libplacebo), but the resulting colors seem off compared to both libplacebo and the source. Do you have any thoughts on why that might be?
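
To put the two timings in perspective, the relative saving can be worked out with a one-liner. This sketch assumes 106 s is the libplacebo run and 97 s is the tonemap_cuda run; `speedup_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# speedup_pct SLOW FAST: percentage of wall-clock time saved by the faster run.
speedup_pct() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", (slow - fast) / slow * 100 }'
}

# Assumption: 106 s = libplacebo, 97 s = tonemap_cuda.
speedup_pct 106 97
```

Under that assumption, tonemap_cuda saves about 8.5% of the transcode time on this test video.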

Test video (downloaded with youtube-dl)

tonemap_cuda command:

ffmpeg -hwaccel cuda -hwaccel_output_format cuda -noautorotate -threads 1 -i HDR.mkv \
-tune hq -qmin 0 -rc-lookahead 20 -i_qfactor 0.75 -c:v av1_nvenc -c:a aac -movflags faststart \
-fps_mode passthrough -map 0:0 -map 0:1 -g 256 -temporal-aq 1 -v verbose -preset p1 -cq:v 40 \
-vf scale_cuda=-2:720,tonemap_cuda=matrix=bt709:primaries=bt709:range=pc:tonemap=hable:transfer=bt709:format=nv12 \
SDR.mp4

libplacebo command:

ffmpeg -hwaccel cuda -hwaccel_output_format cuda -noautorotate -threads 1 -i HDR.mkv \
-tune hq -qmin 0 -rc-lookahead 20 -i_qfactor 0.75 -c:v av1_nvenc -c:a aac -movflags faststart \
-fps_mode passthrough -map 0:0 -map 0:1 -g 256 -temporal-aq 1 -v verbose -preset p1 -cq:v 40 \
-vf scale_cuda=-2:720,hwupload=derive_device=vulkan,libplacebo=color_primaries=bt709:color_trc=bt709:colorspace=bt709:downscaler=none:format=yuv420p:tonemapping=hable:upscaler=none,hwupload=derive_device=cuda \
SDR.mp4

Performance results:

  • Disabled means software decoding + tone-mapping with zscale before hardware encoding

[Image: nvenc]

libplacebo (1:06 mark):

[Image: hw_decoding_libplacebo_no_deband]

tonemap_cuda:

[Image: hw_decoding_tonemap_cuda]

Also wow, I was not expecting that big of a gap between software and hardware decoding / tone-mapping.
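
For reference, the "Disabled" configuration (software decoding and zscale tone-mapping before hardware encoding) can be sketched roughly as below. This is an assumption-laden illustration: the filter chain is a commonly used zscale + tonemap recipe from stock FFmpeg (zscale requires a build with libzimg), not necessarily the exact chain Immich runs, and both function names are hypothetical.

```shell
#!/bin/sh
# sw_tonemap_filter: a commonly used CPU-side HDR-to-SDR chain.
# Assumption: illustrative only, not the exact chain used in the PR.
sw_tonemap_filter() {
  echo "zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=hable:desat=0,zscale=t=bt709:m=bt709:r=tv,format=yuv420p"
}

# sw_tonemap_command INPUT OUTPUT: decode and tone-map on the CPU,
# then hand system-memory frames to the NVENC encoder.
sw_tonemap_command() {
  echo "ffmpeg -i $1 -vf $(sw_tonemap_filter) -c:v av1_nvenc -c:a aac $2"
}

sw_tonemap_command HDR.mkv SDR.mp4
```

In this variant every frame is processed in system memory and is only handed to the GPU for encoding, which is one plausible reason for the gap seen in the results.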

Edit: I noticed there are some artifacts in the libplacebo image (lower right, middle left, top left). Looks like there's a bug when used in tandem with -temporal-aq. This is what it looks like without that flag (and debanding enabled):

[Image: libplacebo_deband_true_peak_detect_no_aq]

Interestingly, both filters also have distorted colors before a scene change when -temporal-aq is used.

@mertalev
Contributor Author

@fyfrey There shouldn't be any other changes to the RKMPP code so feel free to test it whenever you're able! It basically makes the software decoding variant the default, but the commands are otherwise the same.

@mertalev mertalev force-pushed the feat/server-hw-decoding-no-toggle branch from 54dbbf9 to a6732cf Compare May 16, 2024 01:24
@nyanmisaka

@mertalev The image produced by tonemap_cuda is less saturated because the filter option desat is not turned off. Just like tonemap_opencl, it imposes a fixed value for desaturation.

[Image: 1 mp4_20240516_101637 150]

The complete filter options can be queried through the following command.

./ffmpeg -hide_banner -h filter=tonemap_cuda
Filter tonemap_cuda
  GPU accelerated HDR to SDR tonemapping
    Inputs:
       #0: default (video)
    Outputs:
       #0: default (video)
tonemap_cuda AVOptions:
   tonemap           <int>        ..FV....... Tonemap algorithm selection (from 0 to 7) (default none)
     none            0            ..FV.......
     linear          1            ..FV.......
     gamma           2            ..FV.......
     clip            3            ..FV.......
     reinhard        4            ..FV.......
     hable           5            ..FV.......
     mobius          6            ..FV.......
     bt2390          7            ..FV.......
   tonemap_mode      <int>        ..FV....... Tonemap mode selection (from 0 to 1) (default max)
     max             0            ..FV.......
     rgb             1            ..FV.......
   transfer          <int>        ..FV....... Set transfer characteristic (from -1 to INT_MAX) (default bt709)
     bt709           1            ..FV.......
     bt2020          14           ..FV.......
     smpte2084       16           ..FV.......
   t                 <int>        ..FV....... Set transfer characteristic (from -1 to INT_MAX) (default bt709)
     bt709           1            ..FV.......
     bt2020          14           ..FV.......
     smpte2084       16           ..FV.......
   matrix            <int>        ..FV....... Set colorspace matrix (from -1 to INT_MAX) (default bt709)
     bt709           1            ..FV.......
     bt2020          9            ..FV.......
   m                 <int>        ..FV....... Set colorspace matrix (from -1 to INT_MAX) (default bt709)
     bt709           1            ..FV.......
     bt2020          9            ..FV.......
   primaries         <int>        ..FV....... Set color primaries (from -1 to INT_MAX) (default bt709)
     bt709           1            ..FV.......
     bt2020          9            ..FV.......
   p                 <int>        ..FV....... Set color primaries (from -1 to INT_MAX) (default bt709)
     bt709           1            ..FV.......
     bt2020          9            ..FV.......
   range             <int>        ..FV....... Set color range (from -1 to INT_MAX) (default tv)
     tv              1            ..FV.......
     pc              2            ..FV.......
     limited         1            ..FV.......
     full            2            ..FV.......
   r                 <int>        ..FV....... Set color range (from -1 to INT_MAX) (default tv)
     tv              1            ..FV.......
     pc              2            ..FV.......
     limited         1            ..FV.......
     full            2            ..FV.......
   format            <string>     ..FV....... Output format (default "same")
   apply_dovi        <boolean>    ..FV....... Apply Dolby Vision metadata if possible (default true)
   tradeoff          <int>        ..FV....... Apply tradeoffs to offload computing (from -1 to 1) (default auto)
     auto            -1           ..FV.......
     disabled        0            ..FV.......
     enabled         1            ..FV.......
   peak              <double>     ..FV....... Signal peak override (from 0 to DBL_MAX) (default 0)
   param             <double>     ..FV....... Tonemap parameter (from DBL_MIN to DBL_MAX) (default nan)
   desat             <double>     ..FV....... Desaturation parameter (from 0 to DBL_MAX) (default 0.5)
   threshold         <double>     ..FV....... Scene detection threshold (from 0 to DBL_MAX) (default 0.2)

As for the artifacts caused by the -temporal-aq option of the av1_nvenc encoder, I think it is an internal issue of the NVENC hardware encoder and has nothing to do with the tonemap filter. You can use hwdownload to save that frame as raw YUV and check it with the YUView tool:

... -vf scale_cuda=...,tonemap_cuda=...,hwdownload,format=nv12 -f rawvideo /path/to/raw.yuv
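
A fuller version of that raw-frame dump might look like the sketch below. Several parts are assumptions: the filter arguments are copied from the tonemap_cuda command earlier in this conversation, the `-ss` timestamp, `-frames:v 1`, and `-an` options are added here to grab a single video frame, and `dump_frame_command` is a hypothetical helper that only prints the command.

```shell
#!/bin/sh
# dump_frame_command INPUT TIMESTAMP OUTPUT: print an ffmpeg command that
# tone-maps on the GPU, downloads the frame to system memory, and writes raw NV12.
dump_frame_command() {
  input="$1"; timestamp="$2"; output="$3"
  # Assumption: same scale/tonemap arguments as the earlier tonemap_cuda command.
  filter="scale_cuda=-2:720,tonemap_cuda=matrix=bt709:primaries=bt709:range=pc:tonemap=hable:transfer=bt709:format=nv12,hwdownload,format=nv12"
  echo "ffmpeg -hwaccel cuda -hwaccel_output_format cuda -ss $timestamp -i $input -vf $filter -frames:v 1 -an -f rawvideo $output"
}

# 1:06 is the mark used for the comparison screenshots in this thread.
dump_frame_command HDR.mkv 00:01:06 raw.yuv
```

Because the frame is captured before it reaches the encoder, any artifact visible in the raw file would point at the filter rather than at NVENC.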

@mertalev
Contributor Author

@nyanmisaka Thanks, disabling de-saturation did the trick! The colors in the OpenCL image actually look closer to the HDR video than with libplacebo now.

With that change, I think this PR is ready for final review.
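
The de-saturation fix discussed above can be sketched as follows. It reuses the filter arguments from the earlier tonemap_cuda command and adds the `desat` option listed in the filter help; the sketch assumes that a value of 0 turns de-saturation off, as it does for FFmpeg's stock tonemap filter. `tonemap_cuda_filter` is a hypothetical helper name.

```shell
#!/bin/sh
# tonemap_cuda_filter [DESAT]: build the scale + tone-map chain with an explicit
# desat value. The filter's own default is 0.5; this helper defaults to 0.
tonemap_cuda_filter() {
  desat="${1:-0}"
  echo "scale_cuda=-2:720,tonemap_cuda=desat=${desat}:matrix=bt709:primaries=bt709:range=pc:tonemap=hable:transfer=bt709:format=nv12"
}

# Pass the result to ffmpeg via -vf, e.g. -vf "$(tonemap_cuda_filter 0)".
tonemap_cuda_filter 0
```

Passing `0.5` instead reproduces the filter's default and the less saturated output seen earlier.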

@nyanmisaka

> @nyanmisaka Thanks, disabling de-saturation did the trick! The colors in the OpenCL image actually look closer to the HDR video than with libplacebo now.

Tone-mapping is a lossy process, so there is no completely accurate result. Personal preference also plays an important role here.

As for performance, the difference is small on beefy GPUs such as the RTX series and more noticeable on weak GPUs such as the GTX 1050.

Contributor

@zackpollard zackpollard left a comment

LGTM. If you've tested NVIDIA, I'm happy for this to be merged. It would be good to get a test of RKMPP, although that seems to be effectively the same as before, with some code moved around to account for the new hardware decoding toggle.

Development

Successfully merging this pull request may close these issues.

Immich defaults to software encoding for remaining videos after a hardware accelerated encode fails.
6 participants