Marathon Match News Interview with NVIDIA CUDA Superheros iquadrat and b285714!

The first challenge of the NVIDIA CUDA Superhero Challenge is complete and the winners announced. TopCoder took some time to interview the first and second place winners so check out what they had to say!

Read Interviews.

TC: How does it feel to win the NVIDIA CUDA Superhero challenge? Can
you explain how you approached solving the problem?

Iquadrat: It feels great to be one of the winners of the NVIDIA CUDA Superhero
Challenge! It comes to me a bit as surprise as I started exploring CUDA only in the second week of the contest. I had heard of it before but never got the chance to try it out myself.

In the first week, I explored some different algorithms to attack the CCL problem on the CPU only. Finally, I found the find-and-union algorithm most suitable as it needed only a single pass over all
neighbor relations. It works as following: Initially each pixel is assigned to a separate set (with its index as id of the set) by assigning itself as parent. Then, the pixels sets were joined according to the neighboring relation. The joining of two pixel sets was done by finding the ids of the pixels (the pixel that has itself as parent) and then making the pixel with the larger id the child of the pixel with the smaller child. In a final step, the set id is looked up for each pixel and written to the result array. On the CPU, this could already be calculated pretty fast using SIMD instructions to compute the neighboring relations. My fastest CPU-only version scored a bit over 13,000 points.

In week two, I explored the CUDA framework. On Monday I decided to buy a GTX 260 card as my laptop card did not support all the instructions I wanted to use. The most runtime consuming task was getting the data onto the GPU and from there back into the host memory. The copy time was more than the calculation time itself. I did not manage to interlave the data transfer with the computation on the GPU as this would have needed pinned memory. Copying the input vector to pinned memory first
wasn’t an option as this took too much time to be a benefit afterwards.

My algorithm for the GPU was very similar to the one on the CPU:

1) Copy the data to GPU.
2) Divide the image into 44×44 blocks. Each kernel copies its part of
the image into shared memory, then solves the CCL for this part.
3) Join the solved CCLs from the second step along the seams of the
blocks in the graphic card’s global memory.
4) Final pass, look up the final index of the pixels.
5) Copy the data back to host memory.

Because the pixel sets were joined concurrently, I used atomic instructions to prevent race conditions. So, the find-and-union for two pixels repeatedly tries to join the sets until it succeeds. The GPU solution scored about 38000 points on the intermediate tests, so it was about three times faster then the CPU algorithm, despite the overhead of copying the data form the CPU to the GPU and back.

Approx time distribution: Step 1 30%, Step 2 20%, Step 3 15%, Step 4 5%, Step 5 30%.

One of the main problems was the instability of the testing environment.  Solutions sometimes failed test cases without apparent reasons. Because of this, it was very difficult to say if a failing test was because of a bug in the solution or an issue with the tester or hardware.

b285714: The CUDA superhero challenge brought me an excitement I haven’t felt for a long time. It feels great to compete with talented programmers world-wide.

I solved the problem by trying a lot of different alternatives to find the fastest one. This is achieved using BSGP, a high level GPU programming language built on top of CUDA hardware. The built in parallel primitives like reduce/compact helped a great deal in gathering optimization information.

I wrote all my submission entries in BSGP and hand translated them to CUDA for submission.

TC: What did you learn about CUDA from this competition?

Iquadrat: I learned much about the CUDA framework and the computation power of the
NVIDIA graphic cards as well as about the CCL problem and different ways to solve it.

b285714: A lot of new performance tricks, especially when dealing with large data.

TC: Would you recommend this series to others?

Iquadrat: Apart from the instability of the testing environment, the contest was very much fun! Therefore, I recommend the series to everybody who is interested in really massive parallel algorithms and unleashing the power of contemporary graphic cards — and already owns a GTX 260 or higher graphic card which is really needed to get exact timing results and get the most optimized code.

b285714: Surely yes.

TC: What advice would you give to other TopCoder members?

Iquadrat: Try a bunch of different algorithms, keep attention to the TopCoder forums and never give up :-)

b285714: Try BSGP:) It would help your GPU programming experience.

Remember, the second NVIDIA CUDA Superhero Challenge competition starts November 23, 2009 so get practicing!

Read more here:
here!

TopCoder Forums RSS Feed

  • Read the Winner's Interviews Thursday, 1 October 2009, 3:11 pm
    iquadrat and b285714 discuss their solutions, what they learned, and dish out some advice for other competitors. Don't miss their interviews.