INF5063 - Autumn 2017

Home Exam 2: Video Encoding on Tegra X1 using the CUDA framework

In this assignment, you will take advantage of the computing power available on a graphics processor to accelerate video encoding.

You are supposed to:

Optimize the c63 encoder using CUDA and the GPUs on the Nvidia Jetson TX1 boards.
Write a short report in which you describe the optimizations you have implemented and discuss your results. Do not describe optimizations that you considered or planned but did not test.
Create a poster (2 A3 pages) and participate in the poster session on October 26th.

Codec63

Codec63 is a modified variant of Motion JPEG that supports inter-frame prediction. It does not comply with any standard on its own, so the precode contains both an example encoder and a decoder (which converts an encoded file back to YUV). C63's inter-frame prediction decides independently for every macroblock whether it is encoded with a motion vector. If a motion vector is used, it refers to the previous frame.

Macroblocks are encoded according to the JPEG standard [1] if no motion vector is used, and stored in the output file. If a motion vector is used, the residual is encoded in the same manner, and the motion vector is stored right before the encoded residual. An illustrative overview of the steps involved in JPEG encoding can be found on Wikipedia [2].

It is your task to optimize the c63 encoder using the CUDA framework.

The c63 encoder is very basic and shows behavior that would not be acceptable in a standard encoder. This concerns in particular the Huffman tables and the unconditional use of motion vectors in non-I-frames. You should not modify these Huffman tables. You can decide to use conditional motion vectors, but you must search for motion vectors, and you must write code that potentially uses the whole motion vector search range (hard-coded to 16 in the precode).
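The requirement to cover the whole search range can be pictured with a minimal C sketch of an exhaustive block-matching search. It assumes 8x8 blocks, a sum-of-absolute-differences (SAD) cost, and a single plane stored row by row. The names sad_block and full_search are hypothetical and are not taken from the precode; only the range of 16 comes from the text above.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 8   /* assumed block size, not confirmed by the assignment text */
#define RANGE 16  /* search range stated in the assignment text */

/* Sum of absolute differences between two BLOCK x BLOCK blocks
 * that live in planes with the same row stride. */
int sad_block(const uint8_t *a, const uint8_t *b, int stride)
{
    int sum = 0;
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x)
            sum += abs(a[y * stride + x] - b[y * stride + x]);
    return sum;
}

/* Exhaustive search: try every offset in [-RANGE, RANGE] in both
 * directions, skip candidates that fall outside the reference frame,
 * and keep the offset with the lowest SAD. Returns the best SAD and
 * writes the offset to mv_x / mv_y. */
int full_search(const uint8_t *cur, const uint8_t *ref, int w, int h,
                int bx, int by, int *mv_x, int *mv_y)
{
    int best = INT_MAX;
    *mv_x = 0;
    *mv_y = 0;
    for (int dy = -RANGE; dy <= RANGE; ++dy) {
        for (int dx = -RANGE; dx <= RANGE; ++dx) {
            int rx = bx + dx;
            int ry = by + dy;
            if (rx < 0 || ry < 0 || rx + BLOCK > w || ry + BLOCK > h)
                continue;
            int s = sad_block(cur + by * w + bx, ref + ry * w + rx, w);
            if (s < best) {
                best = s;
                *mv_x = dx;
                *mv_y = dy;
            }
        }
    }
    return best;
}
```

In this sketch every block and every candidate offset is independent of the others, which is the kind of structure that can be mapped to GPU threads without replacing the search algorithm itself.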

The video scenario is live streaming. You should not have an encoder pipeline of more than 3 frames. In addition, you should not use parallelization techniques that severely degrade the video quality.

You should not replace the algorithms that you find in c63. Alternative motion vector search algorithms and DCT encoding algorithms provide large speedup potential, but they distract from the main goal of this home exam, which is to identify and implement parallelization options.

Two test sequences in YUV format are available in the /mnt/sdcard directory on the lab machines:

foreman (352x288) CIF
tractor (1920x1080) 1080p

These should be used as input to the provided c63 encoder, and can be used to test your implementations.
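When sizing host and device buffers for these sequences, it helps to know how many bytes one frame occupies. The following is a small sketch that assumes the files are planar YUV 4:2:0 (one full-resolution luma plane and two chroma planes subsampled by two in each direction); the text above only says "YUV format", so verify this against the precode. The function name yuv420_frame_bytes is hypothetical.

```c
#include <stddef.h>

/* Bytes in one planar YUV 4:2:0 frame: a full-resolution luma plane
 * plus two chroma planes at half width and half height.
 * Assumes 8-bit samples and even width and height. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    size_t luma = width * height;
    size_t chroma = (width / 2) * (height / 2);
    return luma + 2 * chroma;
}
```

Under that assumption, one foreman frame (352x288) is 152064 bytes and one tractor frame (1920x1080) is 3110400 bytes.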

Precode

The precode consists of the reference c63 code including:

an encoder
a decoder
the command c63pred (which extracts the prediction buffer for debugging purposes)

The precode is written in C. You are not required to touch the decoder or c63pred.

The precode can be downloaded from a Git repository here:

git clone https://bitbucket.org/mpg_code/inf5063-codec63.git

You must log in to the Jetson TX1 devkit assigned to your group for this assignment. You should have received an email from the course administrators about which kits to use. Information about how to access the kits can be found in the GPU FAQ.

You are free to adapt, modify or completely rewrite the provided encoder to take full advantage of the target architecture. You are, however, not allowed to replace the algorithms for Motion Estimation, Motion Compensation or DCT/iDCT. You are not allowed to paste any other pre-written code into your implementation. You are also not allowed to post any code from the home exam on the Internet.

Start by profiling the encoder to see which parts of it are the bottlenecks. Remember that after optimizing one part of the code, more profiling might be needed to find new bottlenecks.
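One simple way to get a first picture, before reaching for a dedicated profiler, is to accumulate wall-clock time per encoder stage. The sketch below is only an illustration: the names now_seconds, stage_timer, stage_add and stage_share are hypothetical and do not exist in the precode, and timespec_get requires a C11 library.

```c
#include <time.h>

/* Current wall-clock time in seconds (C11 timespec_get). */
double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Accumulated time and call count for one named encoder stage. */
struct stage_timer {
    const char *name;
    double total;
    long calls;
};

/* Add one measured interval [start, end] to a stage. */
void stage_add(struct stage_timer *t, double start, double end)
{
    t->total += end - start;
    t->calls += 1;
}

/* Fraction of the overall run time spent in this stage. */
double stage_share(const struct stage_timer *t, double grand_total)
{
    return grand_total > 0.0 ? t->total / grand_total : 0.0;
}
```

A stage would be wrapped as `double t0 = now_seconds(); /* stage */ stage_add(&timer, t0, now_seconds());`. Because CUDA kernel launches are asynchronous, the device has to be synchronized (for example with cudaDeviceSynchronize()) before the end time is read, otherwise only the launch overhead is measured.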

Some usage examples:

To encode the foreman test sequence

$ ./c63enc -w 352 -h 288 -o /tmp/test.c63 foreman.yuv

To decode a sequence

$ ./c63dec /tmp/test.c63 /tmp/test.yuv

To play back a raw YUV file

$ mplayer /tmp/test.yuv -demuxer rawvideo -rawvideo w=352:h=288

Evaluation

Write a short report where you discuss your results. The exam will be graded on how well you are able to take advantage of the GPU architecture to solve the task at hand.

In evaluation, we will consider (in order):

Motion Estimation & DCT/iDCT algorithmic functions in the source code have been offloaded to the GPU.
- Document the bottleneck and the effect of your optimization.
A program that works (on the Jetson TX1 provided)
- Runs to completion. (*)
- Encodes tractor (1080p) correctly.
- Output video has a quality and file size similar to those of the reference encoder's output.
- Readable, well-commented code
Effect of the GPU offload
- Understanding the SoC architecture and minimizing the overhead of moving data between the CPU and GPU.
- Investigating whether mixed precision (FP16) offers any advantage on the GPU
- Correctness of memory use on the GPU (memory types, bank conflicts) and GPU code optimization with regard to branching.
- Bonus points for other non-obvious optimizations such as Motion Compensation and/or offloading parts of VLC.
The quality of the report that accompanies the code
- Clear and structured report of the performance changes caused by your modifications to the precode
- References to the relevant parts of the accompanying code (to aid the reviewer of the submitted assignment)
- Graphical presentation of the optimization steps and performance results (plots of performance changes)
- Comparison of / reflection about the alternative approaches tried out by your group

(*) We do not debug code before testing; correctness and effectiveness are not evaluated if this is not fulfilled.
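To check that an optimized encoder still produces output of similar quality, one option is to decode its output and compare it with the original sequence plane by plane. The sketch below computes the mean squared error between two 8-bit sample buffers. The metric and the name mse_8bit are assumptions made for illustration; the assignment does not prescribe how quality is measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Mean squared error between two buffers of n 8-bit samples,
 * e.g. the luma plane of an original frame and of the decoded frame.
 * Returns 0.0 for identical buffers or when n is 0. */
double mse_8bit(const uint8_t *a, const uint8_t *b, size_t n)
{
    if (n == 0)
        return 0.0;
    double sse = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double)a[i] - (double)b[i];
        sse += d * d;
    }
    return sse / (double)n;
}
```

If a PSNR figure is preferred for the report, it follows from this value as 10 * log10(255^2 / MSE) for 8-bit samples. Running the same comparison on the reference encoder's output gives the baseline to compare against.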

Report

You must write up the results as a technical report of no more than 4 pages in ACM format. The report should serve as a guide to the code modifications you have made and the resulting performance changes.

Machine Setup

The Jetson TX1 devkits are situated at Simula Research Laboratory. Machine names and how to access them can be found in the GPU FAQ. If you have reported your group to the course administration, you should have been assigned to a devkit and provided with a username and a password.

Contact inf5063@ifi.uio.no if you have problems logging in.

Formal Information

The deadline for handing in your assignment is: Friday, October 27th at (16:00:00.00).

Deliver your code and report (as PDF) at https://devilry.ifi.uio.no/. Submit the poster (as PDF) to inf5063@ifi.uio.no.

The groups should also prepare a poster (2 x A3 pages) and a quick 2-minute talk (without slides) where you pitch your poster to the class on October 31st. Name the poster with your group name, and email the poster to inf5063@ifi.uio.no no later than noon (12:00) on October 30th. We will then print the poster for you.

For questions and course related chatter, we have created a Slack space: https://inf5063.slack.com

There will be a prize for best poster/presentation (awarded by an independent panel and independent of the grade).

Please check the GPU FAQ page for updates and answers to frequently asked questions.

For questions please contact:

inf5063@ifi.uio.no

[1] http://www.w3.org/Graphics/JPEG/itu-t81.pdf

[2] http://en.wikipedia.org/wiki/JPEG#JPEG_codec_example