Bug #11514

Review/Change opencl_device_priority defaults, or benchmark, to avoid slowdowns

Added by M. Andree over 2 years ago. Updated over 2 years ago.

Status: In Progress
Priority: Low
Assignee: -
Category: OpenCL
Start date: 02/18/2017
Due date:
% Done: 50%
Affected Version: git master branch
System: Ubuntu
bitness: 64-bit
hardware architecture: amd64/x86

Description

The default setting for the OpenCL scheduler is opencl_device_priority=*/!0,*/*/*, i.e. the preview pipeline cannot use OpenCL device #0, per https://www.darktable.org/usermanual/ch10s02s08.html.php
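To make the effect of that default concrete, here is a minimal Python sketch of how such a priority string could be interpreted. It assumes four "/"-separated fields in the order image, preview, export, thumbnail, with "*" meaning any device and "!N" excluding device N. The function name and the field order are assumptions for illustration, not darktable's actual parser.

```python
# Hypothetical interpretation of an opencl_device_priority string.
# Assumed field order: image / preview / export / thumbnail.
PIPE_TYPES = ("image", "preview", "export", "thumbnail")


def parse_priority(spec, num_devices):
    """Return, per pipe type, the ordered list of device numbers it may use."""
    fields = spec.split("/")
    if len(fields) != len(PIPE_TYPES):
        raise ValueError("expected %d fields, got %d" % (len(PIPE_TYPES), len(fields)))
    result = {}
    for pipe, field in zip(PIPE_TYPES, fields):
        tokens = [t.strip() for t in field.split(",") if t.strip()]
        # First pass: collect exclusions such as "!0".
        excluded = {int(t[1:]) for t in tokens if t.startswith("!")}
        order = []
        # Second pass: expand explicit device numbers and the "*" wildcard.
        for token in tokens:
            if token.startswith("!"):
                continue
            candidates = range(num_devices) if token == "*" else [int(token)]
            for dev in candidates:
                if dev < num_devices and dev not in excluded and dev not in order:
                    order.append(dev)
        result[pipe] = order
    return result


# With a single GPU, the default leaves the preview pipe without any device.
print(parse_priority("*/!0,*/*/*", 1))
```

Under these assumptions, a machine with only one OpenCL device gets an empty device list for the preview pipe, so the preview always falls back to the CPU.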

I dispute that this is adequate, for the reasons shown below. The measurements are from a 24 MPx, 14-bit uncompressed ARW file from a Sony A7 series camera, on a 2.5 GHz quad-core AMD Phenom II X4 905e, with an NVidia GeForce 1060 6GB as the OpenCL device (with the latest NVidia beta driver):

Performance without OpenCL:

[dev_process_thumbnail] pixel pipeline processing took 4.454 secs (12.852 CPU)
[dev_process_image] pixel pipeline processing took 2.953 secs (8.280 CPU)
[dev_process_preview] pixel pipeline processing took 1.730 secs (4.480 CPU)

Performance with OpenCL at default settings:

[dev_process_thumbnail] pixel pipeline processing took 0.507 secs (1.244 CPU)
[dev_process_image] pixel pipeline processing took 0.298 secs (0.376 CPU)
[dev_process_preview] pixel pipeline processing took 1.567 secs (4.188 CPU)

Performance with opencl_device_priority=*,*/*/*:

[dev_process_thumbnail] pixel pipeline processing took 0.126 secs (0.092 CPU)
[dev_process_image] pixel pipeline processing took 0.303 secs (0.364 CPU)
[dev_process_preview] pixel pipeline processing took 0.331 secs (0.364 CPU)

So we see that even though the preview pipeline may run on the CPU in parallel with the other pipelines on the GPU, it takes more than 1 s longer (1.567 s vs. 0.331 s) than if we do everything sequentially on the GPU.
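The comparison can be checked with a short Python sketch that parses the log lines quoted above. The regular expression and variable names are illustrative. The sketch also assumes that under the defaults the image and preview pipes run in parallel, and that in the all-GPU case they run one after the other.

```python
import re

# Matches lines like:
# [dev_process_preview] pixel pipeline processing took 1.567 secs (4.188 CPU)
LOG_RE = re.compile(
    r"\[dev_process_(\w+)\] pixel pipeline processing took ([\d.]+) secs \(([\d.]+) CPU\)"
)


def parse_timings(log_text):
    """Map pipe name (thumbnail, image, preview) to wall-clock seconds."""
    return {m.group(1): float(m.group(2)) for m in LOG_RE.finditer(log_text)}


DEFAULT_LOG = """
[dev_process_thumbnail] pixel pipeline processing took 0.507 secs (1.244 CPU)
[dev_process_image] pixel pipeline processing took 0.298 secs (0.376 CPU)
[dev_process_preview] pixel pipeline processing took 1.567 secs (4.188 CPU)
"""

ALL_GPU_LOG = """
[dev_process_thumbnail] pixel pipeline processing took 0.126 secs (0.092 CPU)
[dev_process_image] pixel pipeline processing took 0.303 secs (0.364 CPU)
[dev_process_preview] pixel pipeline processing took 0.331 secs (0.364 CPU)
"""

default = parse_timings(DEFAULT_LOG)
all_gpu = parse_timings(ALL_GPU_LOG)

# Extra time the preview needs when it is kept off the GPU.
preview_penalty = default["preview"] - all_gpu["preview"]
# Parallel case: the slower pipe determines when both are done.
parallel_wall = max(default["image"], default["preview"])
# Sequential case: both pipes share the GPU, so their times add up.
sequential_wall = all_gpu["image"] + all_gpu["preview"]

print("preview penalty: %.3f s" % preview_penalty)
print("parallel (default): %.3f s, sequential (GPU only): %.3f s"
      % (parallel_wall, sequential_wall))
```

Even when both GPU runs are added together, the sequential case finishes before the CPU preview of the default configuration does.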

Suggestion: The OpenCL scheduler should benchmark the CPU and GPU pipes to determine their relative speed and, on fast OpenCL devices, permit running the preview pipe on the GPU as well. It might even be parallel trial and error: if OpenCL is present, dispatch to both in parallel and abort the slower pipe once the faster one has delivered its result. The latter might be at a disadvantage, though, when CPU-only computation steps are in the pipeline.
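A minimal Python sketch of the benchmark idea might look as follows. The function names and the speed-up threshold are hypothetical; a real scheduler would time the actual CPU and OpenCL pixel pipelines rather than arbitrary callables.

```python
import time


def best_time(run_pipe, repeats=3):
    """Run a pipeline callable several times and return the fastest wall time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_pipe()
        best = min(best, time.perf_counter() - start)
    return best


def preview_may_use_gpu(cpu_secs, gpu_secs, min_speedup=2.0):
    """Allow the preview pipe on the GPU only if the GPU is clearly faster.

    min_speedup is an assumed tuning knob, not an existing darktable setting.
    """
    return gpu_secs > 0 and cpu_secs / gpu_secs >= min_speedup


# Using the preview timings reported above: CPU only vs. GPU.
print(preview_may_use_gpu(1.730, 0.331))
```

With the numbers from this report the GPU is roughly five times faster for the preview, so the check would permit the preview pipe to use it.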

History

#1 Updated by Roman Lebedev over 2 years ago

That will heavily depend both on the hardware, and on the exact modules used, and their settings.

#2 Updated by M. Andree over 2 years ago

OK, so we might need speculative execution (i.e. dispatch CPU + GPU in parallel, see which one arrives first at a synchronization point, then abort the other path while letting the winner continue until the next "CPU or GPU" dispatch decision) if we want to get it right in the general case.
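As a rough illustration of this speculative scheme, the following Python sketch races two simulated paths in threads and cancels the loser cooperatively. All names are hypothetical, and the sleeps stand in for real pipeline work. It is not how darktable dispatches pixelpipes.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_stage(name, step_secs, steps, cancel):
    """Simulate a pipeline path that checks for cancellation between steps."""
    for _ in range(steps):
        if cancel.is_set():
            return None  # aborted because the other path won
        time.sleep(step_secs)
    return name


def race(paths):
    """Start all paths in parallel and return the name of the first to finish.

    paths maps a path name to (seconds per step, number of steps).
    """
    cancels = {name: threading.Event() for name in paths}
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        futures = {
            pool.submit(run_stage, name, step, steps, cancels[name]): name
            for name, (step, steps) in paths.items()
        }
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        winner = futures[next(iter(done))]
        # Ask every other path to stop at its next cancellation check.
        for name, event in cancels.items():
            if name != winner:
                event.set()
    return winner


print(race({"cpu": (0.01, 100), "gpu": (0.01, 3)}))
```

The losing path still consumes resources until it reaches its next cancellation check, which is the cost of this approach.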

#3 Updated by Ulrich Pegelow over 2 years ago

M. Andree wrote:

Performance with opencl_device_priority=*,*/*/*:

[dev_process_thumbnail] pixel pipeline processing took 0.126 secs (0.092 CPU)
[dev_process_image] pixel pipeline processing took 0.303 secs (0.364 CPU)
[dev_process_preview] pixel pipeline processing took 0.331 secs (0.364 CPU)

Can you please elaborate on these settings and results? According to my understanding preview and full pixelpipe are started in parallel in the cases that are relevant for this topic (change of module parameters etc.). The first pixelpipe will get the first free OpenCL device, the second one will then normally run on the CPU. How did you manage to have the GPU process both of them?
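The scheduling behaviour described here can be sketched in Python as a non-blocking lock per device: the first pipe takes the free device, and a second pipe that finds it busy falls back to the CPU. The class and method names are hypothetical and only model the description above.

```python
import threading


class DevicePool:
    """Toy model of OpenCL device allocation with one lock per device."""

    def __init__(self, num_devices):
        self._locks = [threading.Lock() for _ in range(num_devices)]

    def acquire(self, candidates):
        """Return the first free device from candidates, or None for CPU fallback."""
        for dev in candidates:
            if self._locks[dev].acquire(blocking=False):
                return dev
        return None

    def release(self, dev):
        self._locks[dev].release()


pool = DevicePool(1)
first = pool.acquire([0])   # first pixelpipe gets the GPU
second = pool.acquire([0])  # second pixelpipe finds it busy
print(first, second)
```

In this model two pipes can only both use the single GPU if the first one has released the device before the second one asks for it.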

#4 Updated by Ulrich Pegelow over 2 years ago

M. Andree wrote:

OK, so we might need speculative execution (i.e. dispatch CPU + GPU in parallel, see which one arrives first at a synchronization point, then abort the other path while letting the winner continue until the next "CPU or GPU" dispatch decision) if we want to get it right in the general case.

Frankly, I don't like that idea. It's wasting energy and increasing noise level.

#5 Updated by M. Andree over 2 years ago

Ulrich Pegelow wrote:

Performance with opencl_device_priority=*,*/*/*:

[dev_process_thumbnail] pixel pipeline processing took 0.126 secs (0.092 CPU)
[dev_process_image] pixel pipeline processing took 0.303 secs (0.364 CPU)
[dev_process_preview] pixel pipeline processing took 0.331 secs (0.364 CPU)

Can you please elaborate on these settings and results? According to my understanding preview and full pixelpipe are started in parallel in the cases that are relevant for this topic (change of module parameters etc.). The first pixelpipe will get the first free OpenCL device, the second one will then normally run on the CPU. How did you manage to have the GPU process both of them?

Frankly, I don't know how it happens, but it's about 95% consistent. Once in a while one of the pipes gets dispatched to the CPU, and then it is noticeably slow, because the GUI update takes many seconds longer than usual. This is my entire set of OpenCL-related settings:

opencl=TRUE
opencl_async_pixelpipe=TRUE
opencl_avoid_atomics=false
opencl_checksum=2934628546
opencl_device_priority=*/*/*/*
opencl_enable_markesteijn=true
opencl_library=
opencl_memory_headroom=300
opencl_memory_requirement=768
opencl_micro_nap=100
opencl_number_event_handles=100
opencl_size_roundup=16
opencl_synch_cache=false
opencl_use_cpu_devices=false
opencl_use_pinned_memory=false
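For anyone who wants to compare their own configuration against this list, a small Python sketch can pull the OpenCL-related entries out of such a settings dump. It assumes plain key=value lines as shown above. The function name and the shortened sample are illustrative.

```python
def read_opencl_settings(text):
    """Collect all key=value lines whose key starts with "opencl"."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.startswith("opencl"):
            settings[key] = value
    return settings


# Shortened sample taken from the settings listed above.
SAMPLE = """
opencl=TRUE
opencl_async_pixelpipe=TRUE
opencl_device_priority=*/*/*/*
opencl_library=
opencl_use_cpu_devices=false
"""

settings = read_opencl_settings(SAMPLE)
print(settings["opencl_device_priority"])
```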

#6 Updated by M. Andree over 2 years ago

Ulrich Pegelow wrote:

Frankly, I don't like that idea. It's wasting energy and increasing noise level.

If the noise level during this fraction-of-a-second computation peak is a concern, then the system design neglected noise entirely. I can run Unigine Valley OpenGL for extended periods of time here, like 10 benchmarks in a row (some 3 minutes each) at about 35 fps in "Extreme HD", and even after half an hour the computer is not audibly louder than when left idle. Yes, the GPU gets some 35 K warmer, up to 70 °C, and its two fans spin up to about 1200/min, but noise is not a concern at all.

#7 Updated by Ulrich Pegelow over 2 years ago

  • % Done changed from 0 to 50
  • Status changed from New to In Progress

Pull request #1464 contains a corresponding feature.
