It’s not hard to find low-quality streaming videos of anime on the Internet. Mathematically speaking, I knew that reducing file size through lossy compression would inevitably result in a loss of quality. What I did not know was exactly how the reduction of data was done, or the algorithms behind the lossy encoding process. The “Dragon Ball Z Season One” DVD box set does include some bonus videos, including “Dragon Ball Z Rebirth”, “A New Look”, and “Dragon Ball Z Trailer”. It’s fair to say that a ten-minute video can hardly provide much detailed technical information; these videos mostly use a little technical material to advertise the Dragon Ball Z DVD box sets, i.e. to convince potential customers to buy every single box set (9 in total for Dragon Ball Z, 16 in total for the entire series). Still, they gave me the starting point for further investigation of the theory behind video/film production and lossy and lossless video encoding.
In computer graphics, a raster graphics image, or bitmap, is a dot matrix data structure representing a generally rectangular grid of pixels, or points of color, viewable via a monitor, paper, or other display medium. In an uncompressed image, each pixel, the smallest unit of the image, has its own color information, which consists of several components or channels; three in the case of the RGB color space. Human eyes can discriminate up to about 10 million different colors.
Electromagnetic radiation, or a photon, can be identified by its frequency or wavelength. The wavelength of a given photon is frame-dependent, or observer-dependent: the perceived change in the frequency or wavelength of light due to a change in the observer’s inertial frame of reference is known as the Doppler effect. For an observer in a given inertial frame of reference, human eyes roughly differentiate visible light of one color from another by the wavelength or frequency of the photons, at a given light intensity or brightness. The perceived brightness of a light also affects human perception of color. In fact, human eyes are far more sensitive to brightness than to the frequency of light (chroma). What we perceive as color consists of both luminance and chrominance. In practice, the light we see often consists of photons of different frequencies, yet we still perceive a single color; ‘white’ is an example of this, containing all the wavelengths of the visible spectrum at full brightness and without absorption.
For an image with a bit depth of 24 bits, each color channel has a bit depth of 8 bits; a pixel can represent 2^24 or 16,777,216 possible colors. As each pixel represents a single color, each pixel costs three bytes in an uncompressed image. Hence, regardless of its contents, an uncompressed image of 1920 × 1080 pixels has a size of 6,220,800 bytes or 5.93 MiB.
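The arithmetic is simple enough to check in a few lines of Python; this sketch just restates the formula above:

```python
def uncompressed_image_size(width, height, bits_per_pixel=24):
    """Size in bytes of an uncompressed image: one fixed-size
    color record per pixel, regardless of content."""
    return width * height * bits_per_pixel // 8

size = uncompressed_image_size(1920, 1080)  # 24-bit RGB: 3 bytes per pixel
print(size)                    # 6,220,800 bytes
print(round(size / 2**20, 2))  # 5.93 MiB
```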
Lossless Compression of Image (Raster Graphics/bitmap)
Lossless data compression involves encoding information using fewer bits than the original representation, in a way that allows the exact original data to be reconstructed from the compressed data. Lossless compression reduces bits by identifying and eliminating statistical redundancy. Different algorithms are used for lossless data compression, each of which makes its own assumptions about what kinds of redundancy the input data are likely to contain.
Portable Network Graphics (PNG) is an example of a raster graphics file format that supports lossless data compression. PNG uses a non-patented lossless data compression method known as DEFLATE, the same algorithm used in the zlib compression library.
A single-color 24-bit image of 1920 × 1080 pixels has a size of 8,610 bytes or 8.40 KiB as a PNG. The reduction in file size is obviously due to the repetition of the same color across pixels. Of course, as the contents of the image get more complicated, the size of an image at the same resolution increases as well. Unlike an uncompressed image, the file size of a losslessly compressed image is content-dependent. The efficiency of compression depends on how accurately the redundancy in the data is identified.
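Python exposes the same DEFLATE algorithm through its standard zlib module, so the effect of redundancy on compressed size is easy to demonstrate on raw pixel data. (This compresses bare RGB bytes, not a real PNG file; PNG adds per-row filtering before DEFLATE, so its output differs slightly.)

```python
import zlib

# A "single-color image": 1920 x 1080 pixels, 3 RGB bytes each.
solid = bytes([255, 128, 0]) * (1920 * 1080)
compressed = zlib.compress(solid, level=9)

print(len(solid))       # 6,220,800 bytes uncompressed
print(len(compressed))  # only a few KiB: the repetition is almost entirely removed
```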
Lossy Compression of Image (Raster Graphics/bitmap)
In contrast with lossless compression, lossy compression discards some information in exchange for a smaller file size. Hence, lossy compression is irreversible; while the original data can never be fully reconstructed from lossily compressed data, it can be approximately reconstructed. Whereas algorithms for lossless compression assume that certain kinds of redundancy are likely to exist within an image file, algorithms for lossy compression assume that certain kinds of information will be less noticeable, or imperceptible, to human beings. Lossy compression is mostly done automatically by an encoder with predefined algorithms; it is atypical for a person to analyze the data case by case to identify the less noticeable or imperceptible parts.
The more aggressively lossy compression is applied, the lower the fidelity of the resulting image. Generally speaking, it is possible to create a high-fidelity image from uncompressed or losslessly compressed source data. However, when a lossily compressed image undergoes further rounds of lossy compression, each performed by an encoder with predefined algorithms, the discarded data inevitably results in visual artifacts and loss of quality. Hence, while lossy compression may achieve a much smaller file size than lossless compression, it is only done when no additional editing of the image is required. The loss of quality from repeated lossy compression is known as “generation loss”. Images that are still in production or require additional editing are stored in losslessly compressed formats; images for distribution are generally in lossily compressed formats.
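Generation loss can be illustrated with a toy model: treat each lossy pass as quantization, i.e. snapping sample values to a coarser grid. Two passes with different step sizes discard more information than either pass alone. This is only an illustration of irreversibility, not a real image codec:

```python
def quantize(samples, step):
    """A toy lossy step: snap each value to the nearest multiple of `step`."""
    return [step * round(s / step) for s in samples]

original = [37, 101, 164, 220]
first = quantize(original, 7)   # first lossy encode
second = quantize(first, 10)    # re-encoding the already-lossy data

print(first)   # the values already differ from the original
print(second)  # errors accumulate; the original values cannot be recovered
```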
How does reduction of file size by lossy compression impact image quality?
For an uncompressed image, the bit depth of each pixel determines the number of possible colors. For an uncompressed image of a given resolution, the higher the bit depth (number of possible colors), the larger the file size. Bits per pixel do not change under lossless compression if the encoder does not change the bit depth; the reduction in file size is achieved by reducing or eliminating redundancy. A 24-bit image of 1920 × 1080 pixels can have at most 2,073,600 different colors; that is, if every pixel has a different color. If such an image had no redundancy at all and contained 2,073,600 different colors, the uncompressed image and its losslessly compressed counterpart would have about the same file size. If such an image undergoes lossy compression, some colors in the image are changed or lost. In practice, however, we can do both lossless and lossy compression because of the redundancy, and the limited number of noticeable colors, within an image. In other words, we trade away details of the image in exchange for a smaller file size.
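In other words, the number of distinct colors an image can actually contain is bounded by both its bit depth and its pixel count:

```python
def max_distinct_colors(width, height, bits_per_pixel=24):
    """An image can show no more colors than it has pixels,
    and no more than its bit depth can address."""
    return min(2 ** bits_per_pixel, width * height)

print(max_distinct_colors(1920, 1080))  # 2,073,600: the pixel count is the limit
print(max_distinct_colors(8192, 8192))  # 16,777,216: the bit depth is the limit
```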
A commonly used image format of lossy compression is JPEG.
There are many different algorithms used to achieve lossy compression of an image; chroma subsampling is one of them. It is the practice of encoding images with less resolution for chroma information than for luma information. It goes back to a fundamental question: how do we encode color information into digital data?
We mentioned “color space” earlier. Just as its name suggests, it is an abstract mathematical model in which colors can be represented as tuples of numbers. For example, in the sRGB color space, red can be represented by (r: 255, g: 0, b: 0); orange can be represented by (r: 255, g: 127, b: 0). sRGB is just one of the color spaces that can be used to encode color information. 24-bit sRGB consists of three channels: red, green, and blue. Colors in the sRGB color space can be re-encoded in Y’CbCr. Note that Y’CbCr is not an absolute color space like sRGB, but a way of encoding RGB information. The justification for re-encoding RGB information as Y’CbCr is to enable chroma subsampling.
Instead of red, green, and blue channels, color is represented by a luma component Y’ (brightness), a chroma (color) component CB (blue-difference), and a chroma (color) component CR (red-difference).
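As a sketch, the full-range conversion used by JPEG (derived from the BT.601 coefficients) maps one 8-bit RGB pixel to Y’CBCR as follows. Note that video encoders typically use a limited-range variant, so the exact constants differ by standard:

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range JPEG/BT.601 conversion of one 8-bit RGB pixel."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    clamp = lambda v: max(0, min(255, round(v)))
    return clamp(y), clamp(cb), clamp(cr)

print(rgb_to_ycbcr(255, 255, 255))  # white: maximum luma, neutral chroma
print(rgb_to_ycbcr(255, 0, 0))      # pure red: high Cr
```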
Chroma subsampling is used in many video encoding schemes as well as in JPEG encoding.
Chroma subsampling counts on the human visual system being much more sensitive to brightness than to color. The subsampling scheme is commonly expressed as a three-part ratio J:a:b, where
• J: horizontal sampling reference (width of conceptual region). Usually 4
• a: number of chrominance samples (CR, CB) in the first row of J pixels.
• b: number of (additional) chrominance samples (CR, CB) in the second row of J pixels.
For example, for 4:2:2,
A 24-bit image can be regarded as being made of blocks of 4 × 2 pixels. In each block, every pixel has 8 bits of Y’. In the first row of 4 pixels, only two pixels carry 16 bits of CB and CR; the other two pixels in the first row have no chroma data of their own and instead take it from neighboring pixels. The same applies to the second row. As a result, a block of 4 × 2 pixels carries 128 bits of color information instead of 192 bits. On average, that is 16 bits per pixel even though the bit depth is 24 bits. The reduction in file size is achieved by sharing the CB and CR color information.
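The same bookkeeping works for any J:a:b scheme. This sketch reproduces the 16-bit average for 4:2:2 and also gives 12 bits for 4:2:0, assuming 8 bits per sample and a J × 2 block:

```python
def avg_bits_per_pixel(j, a, b, bits_per_sample=8):
    """Average stored bits per pixel in a J x 2 block under J:a:b subsampling."""
    luma   = 2 * j * bits_per_sample        # every pixel keeps its own Y' sample
    chroma = (a + b) * 2 * bits_per_sample  # each chroma sample is a (Cb, Cr) pair
    return (luma + chroma) / (2 * j)

print(avg_bits_per_pixel(4, 4, 4))  # 24.0: no subsampling
print(avg_bits_per_pixel(4, 2, 2))  # 16.0
print(avg_bits_per_pixel(4, 2, 0))  # 12.0
```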
For 4:2:0, in the first row of 4 pixels, only two pixels carry 16 bits of CB and CR. None of the pixels in the second row has chroma data of its own; they take it from the first row.
Different variants of 4:2:0 chroma configurations are found in:
• All ISO/IEC MPEG and ITU-T VCEG H.26x video coding standards, including H.262/MPEG-2 Part 2 implementations (although some profiles of MPEG-4 Part 2 and H.264/MPEG-4 Part 10 allow higher-quality sampling schemes such as 4:4:4)
• DVD-Video (H.262 Main Profile at Main Level, or MPEG-1 Part 2) and Blu-ray Disc (H.262, H.264, and VC-1)
• PAL DV and DV CAM
• AVCHD and AVC-Intra 50
• Apple Intermediate Codec
• most common JPEG/JFIF and MJPEG implementations
4:4:4 implies no chroma subsampling: every pixel keeps its own chroma samples, so the full color resolution of the source is preserved.
A film/video/motion picture is a series of still images which, when shown on screen, create the illusion of moving images. The human eye and its brain interface, the human visual system, can process 10 to 12 separate images per second, perceiving them individually. A film frame or video frame is one of the many still images that compose the film. Films aimed at theatrical release are typically shot at 24 frames per second.
A video can be displayed in two ways: progressive scan and interlaced scan. Progressive scan is a way of displaying, storing, or transmitting moving images in which all the lines of each frame are drawn in sequence. In analog television broadcasting, people wanted to reduce the bandwidth consumed while preserving perceived picture quality; instead of transmitting a full frame at once, only half a frame, or a field, is displayed at a time; this is called interlaced scan. In the NTSC (National Television System Committee) standard, 60 fields or 30 frames are displayed every 1.001 seconds; in the PAL (Phase Alternating Line) standard, 50 fields or 25 frames are displayed every second. Unlike progressive scan, because only a field, or half a frame, is displayed every 1.001/60 seconds (NTSC) or every 1/50 second (PAL), a full frame in interlaced scan actually consists of fields captured at different times. To display an image, the television sequentially draws all of the odd-numbered lines from top to bottom and then proceeds to fill in the even lines. The field rate of the NTSC standard is typically written as 59.94i; the field rate of the PAL standard is typically written as 50i; 24 frames per second in progressive scan is typically written as 24p.
Interlaced scan can only reduce bandwidth for analog television broadcast at the expense of image quality. Because a frame consists of fields captured at different times, finer details of the image are lost in the process; in high-speed motion, if a fast-moving object is in different positions when each field is captured, a “motion artifact” results.
Before the advent of high-speed Internet and the personal computer (PC), people typically watched video either on TV (from analog broadcast or a video storage medium) or at the movie theater. A video aimed at analog broadcast is typically produced and displayed with interlaced scan; a theatrically released film is produced and displayed with progressive scan.
A video can be captured and stored as analog or digital data. From the early days even until today, film stock has been used for theatrically released films. The master copy of the Dragon Ball anime series is on film stock. Video can be transferred from film to digital data and vice versa. Other than film stock, video is also stored on other analog media, such as magnetic tape.
As video can undergo lossless or lossy compression, interlaced scan may not be necessary for digital distribution of video on physical storage media such as DVD, Blu-ray, or USB flash drives; over the Internet; or via digital television broadcast. The file size of an uncompressed video is calculated much like that of an uncompressed still image:
The length of the video (seconds) × frame rate × resolution (pixels per frame) × bit depth × 1/8
For example, a one-hour video with a resolution of 1920 × 1080 pixels, a bit depth of 24 bits, and a frame rate of 24 FPS occupies 537,477,120,000 bytes or 500.56 GiB. Note that in some video players, such as Media Player Classic Home Cinema, the “Bit Depth” shown by MediaInfo under “Properties” actually means bits per channel; 8 bits per channel means 24 bits in total for three-channel Y’CBCR. Technically, Y’UV is a different color encoding meant for analog systems, but in the context of digital systems it has been used interchangeably with Y’CBCR.
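The formula above, applied to the one-hour example in Python:

```python
def uncompressed_video_size(seconds, fps, width, height, bit_depth=24):
    """Size in bytes of an uncompressed video; bit_depth is bits per pixel."""
    return seconds * fps * width * height * bit_depth // 8

size = uncompressed_video_size(3600, 24, 1920, 1080)
print(size)                    # 537,477,120,000 bytes
print(round(size / 2**30, 2))  # 500.56 GiB
```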
Lossless Compression of Video
As video is a series of images, each frame can be compressed individually by lossless compression algorithms; this is known as intra-frame compression. Redundancy, the repetition of data, may also exist from frame to frame; instead of encoding the information for each entire frame, only the extra information, the changes relative to other frames, is encoded; reducing redundancy across several frames is known as inter-frame compression. Reference frames are frames of a compressed video that are used to define future frames; they are only used in inter-frame compression. Losslessly compressed video files are rarely found in video distribution. On physical media such as DVD or Blu-ray, and on video-streaming or file-sharing web sites, what we find is typically lossily compressed. While average consumers may find PNG files all over the Internet, they will hardly find any losslessly compressed video files. Still, lossless compression is necessary in video editing and production to avoid generation loss. The digital master copy of a video may be stored in a losslessly compressed file format. Examples of lossless video codecs are Dirac Lossless, FFV1, H.264 Lossless, Huffyuv, etc. Note that not every lossless video codec supports inter-frame compression.
As a video file is technically made of a series of images, a master copy of a one-minute video at 24 FPS can, without inter-frame compression, also mean 1440 losslessly compressed image files. With inter-frame compression, however, encoding them into a single video file reduces the file size if redundancy exists across frames. Let me give you an example of the impact of inter-frame compression.
FFmpeg is a free software project, a multimedia framework able to decode, encode, mux, demux, stream, and play many image, audio, and video formats. In particular, it has a command-line tool that uses various encoders and decoders, allowing the encoding and decoding of images, video, and audio in various file formats.
Using FFmpeg, we can create a losslessly compressed 24 FPS video from a series of still images; 1440 frames means a one-minute video. As inter-frame lossless compression relies on redundancy of image data across frames, for the purpose of testing, these images are exactly identical to each other.
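The 1440 identical frames can be produced by simply copying the first one. A minimal sketch in Python; the f_%d.png naming matches the ffmpeg commands shown later in this article, and the placeholder file here merely stands in for a real exported frame:

```python
import os
import shutil
import tempfile

def duplicate_frames(directory, first_frame="f_1.png", total=1440):
    """Copy the first frame so that f_1.png ... f_1440.png all exist."""
    src = os.path.join(directory, first_frame)
    for n in range(2, total + 1):
        shutil.copyfile(src, os.path.join(directory, f"f_{n}.png"))

workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "f_1.png"), "wb") as f:
    f.write(b"placeholder")         # a real workflow would export frame 1 here
duplicate_frames(workdir)
print(len(os.listdir(workdir)))     # 1440 files
```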
FFmpeg uses x264 (an encoder) to create H.264 video. H.264 videos are commonly found on various Internet streaming sites such as YouTube, Youku, and Dailymotion, as well as on Blu-ray discs. Those videos are lossily compressed. The use of lossy compression by Internet streaming sites and Blu-ray video producers is mainly to save bandwidth and reduce file size. H.264, nevertheless, does have a lossless mode.
A single-frame (or 1/24-second) H.264 video of 1920 × 1080 pixels has a file size of 1,935,449 bytes or 1.84 MiB. Theoretically, a perfect inter-frame lossless compression of the 1440-frame video should have the same file size as the single-frame video, because I deliberately created identical duplicates of the first frame for the next 1439 frames. In practice, I am using the x264 encoder included with FFmpeg, with its predefined algorithms, to identify and reduce redundancy. Hence, the file size difference is quite noticeable even though I used exact duplicates of the first frame: the 1440-frame video has a file size of 11,908,492 bytes or 11.3 MiB, more than 6 times that of the single-frame video. That result is far from perfect, but the effect of inter-frame compression is still obvious. H.264 supports both intra-frame and inter-frame compression; otherwise the single-frame video would have a file size of 6,220,800 bytes or 5.93 MiB. It would seem the x264 encoder only manages to find redundancy across some frames, not every frame.
Within the FFmpeg project, a lossless video codec was also invented: FFV1. Unlike H.264, it only supports intra-frame lossless compression; it supports neither inter-frame compression nor lossy compression. I encoded the 1440-frame video with the FFV1 codec; the resulting file size is 2,499,512,014 bytes or 2.32 GiB. The uncompressed video of 1440 frames has a file size of 8,957,952,000 bytes or 8.34 GiB. The comparison clearly shows the benefit of inter-frame compression.
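Putting the three measured sizes side by side makes the gap plain:

```python
uncompressed = 8_957_952_000  # 1440 raw 1920 x 1080 frames at 24 bits per pixel
ffv1         = 2_499_512_014  # intra-frame lossless only
h264         = 11_908_492     # intra-frame plus inter-frame lossless

print(round(uncompressed / ffv1, 1))  # about 3.6:1 without inter-frame compression
print(round(uncompressed / h264, 1))  # about 752.2:1 with it (on identical frames)
```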
In actual film/video/animation production, clearly we will not have identical frames for the entire length of a video; but we can certainly count on the repetition of certain image data across several frames, such as backgrounds, inanimate objects, etc.
To encode a series of images into a single H.264 video file of lossless compression:
ffmpeg -r "24" -i "f_%d.png" -r "24" -vcodec "libx264" -crf "0" "output.mkv"
To encode the first frame into a single-frame H.264 video file of lossless compression:
ffmpeg -r "24" -i "f_1.png" -r "24" -vcodec "libx264" -crf "0" "output-2.mkv"
To encode a series of images into a single FFV1 video file:
ffmpeg -r "24" -i "f_%d.png" -r "24" -vcodec "ffv1" "output-3.mkv"
To encode a single video file into a series of images:
ffmpeg -i "input.mkv" "t_%d.png"
The input and output files are to be placed in the same directory as the ffmpeg command-line executable.
• -r: Frame rate. The frame rate for an input/output has to be specified before that input/output option. If no input frame rate is specified, the default frame rate for image sequences is 25 FPS. For a video file, however, no input frame rate needs to be specified if the original frame rate is intended. The frame rate given before the output option determines the frame rate of the output video. Together, the input and output frame rates determine whether the playback speed of the output video is slowed down, sped up, or unchanged.
• -i: Input. Specifies the name (with extension) of the file to read.
• %d: Used for image file sequences; “%d” is replaced with “1, 2, 3, …”.
• -vcodec: Video codec. “libx264” is x264, an encoder for H.264 video; “ffv1” is the codec that supports lossless intra-frame compression.
• -crf: Constant Rate Factor. This method lets x264 attempt to achieve a certain output quality for the whole file when the output file size is of less importance. You cannot predict the output file size with this option. The scale ranges from 0 to 51, where 0 is lossless, 23 is the default, and 51 is the worst possible; a subjectively sane range is 18–28. The CRF scale is not linear for x264; it is logarithmic.
• -pix_fmt: Sets the pixel format. The supported pixel formats depend on the codec and profile being used; H.264 High Profile, for example, only supports Y’CBCR 4:2:0. By default, encoding a series of losslessly compressed images into lossless H.264 (High 4:4:4 Predictive Profile) uses the Y’CBCR 4:4:4 pixel format (no chroma subsampling). Encoding from a lossily compressed (Y’CBCR 4:2:0) video with “-crf 0” retains the original pixel format (Y’CBCR 4:2:0), i.e. no conversion. We can also force chroma subsampling on the output file even when CRF is set to 0, e.g. with “-pix_fmt yuv420p”.
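A widely quoted rule of thumb for x264’s logarithmic CRF scale is that a change of ±6 roughly halves or doubles the resulting bit rate. As a rough model (an approximation from community documentation, not part of any specification):

```python
def approx_bitrate_ratio(crf_from, crf_to):
    """Rule of thumb: every +6 in CRF roughly halves the bit rate."""
    return 2 ** ((crf_from - crf_to) / 6)

print(approx_bitrate_ratio(23, 17))  # about 2.0: lowering CRF by 6 roughly doubles size
print(approx_bitrate_ratio(23, 29))  # about 0.5: raising CRF by 6 roughly halves size
```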
Lossy Compression of Video
As most videos we get from commercial physical media, Internet distribution, or digital broadcast are already lossily compressed, we cannot recover the original uncompressed video, or a lossless compression of it, from lossily compressed sources; in other words, lossy compression is irreversible. What happens if we try to encode a lossily compressed video into a losslessly compressed one? We see a sharp increase in file size without recovering the original video data.
If we create our own films, videos, or animations, we may keep the master copy in a losslessly compressed file format; but local playback of losslessly compressed video is highly resource-intensive, especially for high-definition video (1280 × 720 or 1920 × 1080 pixels at an aspect ratio of 16:9). Just like lossy compression of an image, lossy compression of video directly from an uncompressed or losslessly compressed source can still encode high-quality image information. For video, thanks to inter-frame compression, the compression level achievable at a given quality is generally higher than for still images, for both lossy and lossless compression. It is sensible to play lossily compressed video instead of lossless, and likewise for distribution. In other words, we keep losslessly compressed video for archival and editing purposes, and lossily compressed video for consumption and distribution.
Lossy compression of video also counts on some image information being less noticeable, or imperceptible, to the human visual system.
Commonly used lossy video codecs include H.264, H.262, VC-1, etc. We have already mentioned H.264 and H.262 in this article. H.262, also known as MPEG-2 Part 2, is used in DVD-Video, Blu-ray video, digital broadcast, digital cable TV, digital satellite TV, etc.
A subset of the H.262 specification is used by DVD-Video. DVD-Video uses H.262 video at constant bit rate (CBR) or variable bit rate (VBR). It supports H.262 Main Profile at Main Level and Simple Profile at Main Level. Field rates of 59.94i and 50i are expressly supported. Progressively sourced video is usually encoded on DVD as interlaced field pairs that can be re-interleaved by a progressive player to recreate the original progressive frames. Allowable image resolutions for 59.94i (NTSC) video are 720 × 480, 704 × 480, 352 × 480, and 352 × 240 pixels; allowable image resolutions for 50i (PAL) video are 720 × 576, 704 × 576, 352 × 576, and 352 × 288 pixels. The maximum video bit rate is 9.8 Mbps. The “average” video bit rate is around 4 Mbps but depends entirely on length, quality, amount of audio, etc. Chroma subsampling of Y’CBCR 4:2:0 is used. An H.262 video of 720 × 480 pixels, 4 hours long and encoded at an average bit rate of 4 Mbps, has a file size of 7,200,000,000 bytes or 6.71 GiB, compared with 358,318,080,000 bytes or 333.71 GiB for the uncompressed video at 24 FPS. The compression ratio is 1:49.7664.
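The DVD arithmetic above, checked in Python:

```python
hours       = 4
avg_bitrate = 4_000_000                          # bits per second
dvd_bytes   = avg_bitrate // 8 * hours * 3600    # encoded size at the average rate
raw_bytes   = 720 * 480 * 3 * 24 * hours * 3600  # 24 FPS, 3 bytes per pixel

print(dvd_bytes)                        # 7,200,000,000 bytes (6.71 GiB)
print(raw_bytes)                        # 358,318,080,000 bytes (333.71 GiB)
print(round(raw_bytes / dvd_bytes, 4))  # 49.7664
```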
The intent of the H.264 project was to create a standard capable of providing good video quality at substantially lower bit rates than previous standards (e.g. half or less the bit rate of H.262 or H.263) without increasing the complexity of the design so much that it would be impractical or excessively expensive to implement. In addition to lossy compression, H.264 also supports lossless compression, which H.262 does not. The H.264 format has a very broad range of applications, covering all forms of digitally compressed video from low-bit-rate Internet streaming to HDTV broadcast and digital cinema applications with nearly lossless coding. H.264 is one of the three mandatory codecs for Blu-ray video.
Features of H.264
H.264 contains a number of new features that allow it to compress video much more effectively than older standards and to provide more flexibility for application to a wide variety of network environments.
In particular, some key features include:
1. Multi-picture inter-picture prediction including the following features:
• Using previously encoded pictures as references in a much more flexible way than in past standards, allowing up to 16 reference frames (or 32 reference fields, in the case of interlaced encoding) to be used in some cases. This is in contrast to prior standards, where the limit was typically one; or, in the case of conventional “B-Pictures” (B-Frames), two. This particular feature usually allows modest improvements in bit rate and quality in most scenes. But in certain types of scenes, such as those with repetitive motion or back-and-forth scene cuts or uncovered background areas, it allows a significant reduction in bit rate while maintaining clarity.
• Variable block-size motion compensation (VBSMC) with block sizes as large as 16 × 16 pixels and as small as 4 × 4 pixels, enabling precise segmentation of moving regions. The supported luma prediction block sizes include 16 × 16 pixels, 16 × 8 pixels, 8 × 16 pixels, 8 × 8 pixels, 8 × 4 pixels, 4 × 8 pixels, and 4 × 4 pixels, many of which can be used together in a single macroblock. Chroma prediction block sizes are correspondingly smaller according to the chroma subsampling in use.
• The ability to use multiple motion vectors per macroblock (one or two per partition) with a maximum of 32 in the case of a B macroblock constructed of 16 partitions of 4 × 4 pixels. The motion vectors for each partition region of 8 × 8 pixels or larger can point to different reference pictures.
• The ability to use any macroblock type in B-frames, including I-macroblocks, resulting in much more efficient encoding when using B-frames. This feature was notably left out from MPEG-4 Part 2 Advanced Simple Profile.
• Six-tap filtering for derivation of half-pel luma sample predictions, for sharper sub-pixel motion-compensation. Quarter-pixel motion is derived by linear interpolation of the half-pel values, to save processing power.
• Quarter-pixel precision for motion compensation, enabling precise description of the displacements of moving areas. For chroma the resolution is typically halved both vertically and horizontally (4:2:0) therefore the motion compensation of chroma uses one-eighth chroma pixel grid units.
• Weighted prediction, allowing an encoder to specify the use of a scaling and offset when performing motion compensation, and providing a significant benefit in performance in special cases – such as fade-to-black, fade-in, and cross-fade transitions. This includes implicit weighted prediction for B-frames, and explicit weighted prediction for P-frames.
2. Spatial prediction from the edges of neighboring blocks for “intra” coding, rather than “DC” only prediction found in H.262 and the transform coefficient prediction found in H.263v2 and MPEG-4 Part 2. This includes luma prediction block sizes of 16 × 16 pixels, 8 × 8 pixels, and 4 × 4 pixels (of which only one type can be used within each macroblock).
3. Lossless macroblock features include:
• A lossless “PCM macroblock” representation mode in which video data samples are represented directly, allowing perfect representation of specific regions and allowing a strict limit to be placed on the quantity of coded data for each macroblock.
• An enhanced lossless macroblock representation mode allowing perfect representation of specific regions while ordinarily using substantially fewer bits than the PCM mode.
4. Flexible interlaced-scan video coding features include:
• Macroblock-adaptive frame-field (MBAFF) coding, using a macroblock pair structure for pictures coded as frames, allowing macroblocks of 16 × 16 pixels in field mode (compared with H.262, where field mode processing in a picture that is coded as a frame results in the processing of half-macroblocks of 16 × 8 pixels).
• Picture-adaptive frame-field coding (PAFF or PicAFF) allowing a freely-selected mixture of pictures coded either as complete frames where both fields are combined together for encoding or as individual single fields.
5. New transform design features include:
• An exact-match integer 4 × 4 spatial block transform, allowing precise placement of residual signals with little of the “ringing” often found in prior codec designs. The design is conceptually similar to the well-known discrete cosine transform (DCT), introduced in 1974 by N. Ahmed, T. Natarajan, and K.R. Rao, but simplified and made to provide exactly specified coding.
• Adaptive encoder selection between the 4 × 4 transform and 8 × 8 transform block sizes for the integer transform operation.
• A secondary Hadamard Transform performed on “DC” coefficients of the primary spatial transform applied to chroma DC coefficients (and also luma in one special case) to obtain even more compression in smooth regions.
6. Quantization design features include:
• Logarithmic step size control for easier bit rate management by encoders and simplified inverse-quantization scaling
• Frequency-customized quantization scaling matrices selected by the encoder for perceptual-based quantization optimization
7. Entropy coding (lossless coding) design features include:
• Context-adaptive binary arithmetic coding (CABAC), an algorithm to losslessly compress syntax elements in the video stream knowing the probabilities of syntax elements in a given context. CABAC compresses data more efficiently than CAVLC but requires considerably more processing to decode.
• Context-adaptive variable-length coding (CAVLC), which is a lower-complexity alternative to CABAC for the coding of quantized transform coefficient values. Although lower complexity than CABAC, CAVLC is more elaborate and more efficient than the methods typically used to code coefficients in other prior designs.
• A common simple and highly structured variable length coding (VLC) technique for many of the syntax elements not coded by CABAC or CAVLC, referred to as Exponential-Golomb coding.
8. Loss resilience features include:
• A Network Abstraction Layer (NAL) definition allowing the same video syntax to be used in many network environments. One very fundamental design concept of H.264 is to generate self-contained packets, removing the header duplication seen in MPEG-4’s Header Extension Code (HEC). This was achieved by decoupling information relevant to more than one slice from the media stream. The combination of the higher-level parameters is called a parameter set. The H.264 specification includes two types of parameter sets: the Sequence Parameter Set (SPS) and the Picture Parameter Set (PPS). An active sequence parameter set remains unchanged throughout a coded video sequence, and an active picture parameter set remains unchanged within a coded picture. The sequence and picture parameter set structures contain information such as picture size, optional coding modes employed, and the macroblock-to-slice-group map.
• Flexible macroblock ordering (FMO), also known as slice groups, and arbitrary slice ordering (ASO), which are techniques for restructuring the ordering of the representation of the fundamental regions (macroblocks) in pictures. Typically considered error/loss robustness features, FMO and ASO can also be used for other purposes.
• Data partitioning (DP), a feature providing the ability to separate more important and less important syntax elements into different packets of data, enabling the application of unequal error protection (UEP) and other types of improvement or error/loss robustness.
• Redundant slices (RS), an error/loss robustness feature that lets an encoder send an extra representation of a picture region (typically at lower fidelity) that can be used if the primary representation is corrupted or lost.
• Frame numbering, a feature that allows the creation of “sub-sequences”, enabling temporal scalability by optional inclusion of extra pictures between other pictures, and the detection and concealment of losses of entire pictures, which can occur due to network packet losses or channel errors.
9. Switching slices, called SP and SI slices, allowing an encoder to direct a decoder to jump into an ongoing video stream for such purposes as video streaming bit rate switching and “trick mode” operation. When a decoder jumps into the middle of a video stream using the SP/SI feature, it can get an exact match to the decoded pictures at that location in the video stream despite using different pictures, or no pictures at all, as references prior to the switch.
10. A simple automatic process for preventing the accidental emulation of start codes, which are special sequences of bits in the coded data that allow random access into the bitstream and recovery of byte alignment in systems that can lose byte synchronization.
11. Supplemental enhancement information (SEI) and video usability information (VUI), which are extra information that can be inserted into the bitstream to enhance the use of the video for a wide variety of purposes. An SEI FPA (Frame Packing Arrangement) message specifies the 3D arrangement:
• 0: checkerboard – pixels alternate between L and R.
• 1: column alternation – L and R are interleaved by column.
• 2: row alternation – L and R are interleaved by row.
• 3: side by side – L is on the left, R is on the right.
• 4: top bottom – L is on the top, R is on the bottom.
• 5: frame alternation – one view per frame.
12. Auxiliary pictures, which can be used for such purposes as alpha compositing.
13. Support of monochrome (4:0:0), 4:2:0, 4:2:2, and 4:4:4 chroma subsampling (depending on the selected profile).
14. Support of sample bit depth precision (bits per channel) ranging from 8 to 14 bits per sample (depending on the selected profile).
15. The ability to encode individual color planes as distinct pictures with their own slice structures, macroblock modes, motion vectors, etc., allowing encoders to be designed with a simple parallelization structure (supported only in the three 4:4:4-capable profiles).
16. Picture order count, a feature that serves to keep the ordering of the pictures and the values of samples in the decoded pictures isolated from timing information, allowing timing information to be carried and controlled/changed separately by a system without affecting decoded picture content.
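To make item 7 concrete, here is a minimal Python sketch (function names are my own) of order-0 Exponential-Golomb coding, the simple and highly structured VLC technique mentioned above; each codeword is a run of leading zeros, a 1, and then as many remainder bits as there were zeros:

```python
def exp_golomb_encode(n: int) -> str:
    """Encode a non-negative integer as an order-0 Exp-Golomb bit string."""
    bits = bin(n + 1)[2:]            # binary representation of n + 1
    prefix = "0" * (len(bits) - 1)   # one leading zero per remainder bit
    return prefix + bits

def exp_golomb_decode(bits: str) -> int:
    """Decode a single order-0 Exp-Golomb codeword back to an integer."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:zeros + zeros + 1], 2) - 1

for n in range(4):
    print(n, exp_golomb_encode(n))   # 1, 010, 011, 00100
```

Note how small values get short codes; this is why Exp-Golomb suits syntax elements whose values are usually near zero.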
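The start-code emulation prevention of item 10 is also simple to sketch: within a NAL unit payload, any two consecutive zero bytes followed by a byte in the range 0x00–0x03 get a 0x03 byte inserted between them, so the payload can never accidentally contain a start code (0x000001). A rough Python illustration (function name is my own):

```python
def insert_emulation_prevention(payload: bytes) -> bytes:
    """Insert emulation_prevention_three_byte (0x03) where needed."""
    out = bytearray()
    zeros = 0  # count of consecutive zero bytes emitted so far
    for b in payload:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)  # break up the would-be start code
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

raw = bytes([0x00, 0x00, 0x01, 0x25])
print(insert_emulation_prevention(raw).hex())  # 0000030125
```

The decoder simply strips any 0x03 that follows two zero bytes, recovering the original payload.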
With these techniques and several others, H.264 performs significantly better than any prior standard under a wide variety of circumstances in a wide variety of application environments. H.264 can often perform radically better than H.262 video, typically obtaining the same quality at half the bit rate or less, especially in high bit rate and high resolution situations.
Selected H.264 Profiles
The standard defines 21 sets of capabilities, referred to as profiles, targeting specific classes of applications. A decoder must support at least one profile, but not necessarily all of them. The profiles supported by FFmpeg’s x264 encoder include baseline, main, high, high10, high422, and high444.
|baseline||Baseline Profile. Primarily for low-cost applications that require additional data loss robustness, this profile is used in videoconferencing and mobile applications.|
|main||Main Profile. Used for standard-definition digital television broadcasts|
|high||High Profile. Broadcast and disc storage application, particularly for high-definition television applications|
|high10||High 10 Profile. Going beyond mainstream consumer product capabilities, this profile builds on top of High Profile, adding support up to 10 bits per sample (channel) of decoded picture precision|
|high422||High 4:2:2 Profile. Primarily targeting professional applications that use interlaced video, this profile builds on top of High 10 Profile, adding support for the 4:2:2 chroma subsampling format while using up to 10 bits per sample (channel) of decoded picture precision|
|high444||High 4:4:4 Predictive Profile. This profile builds on top of the High 4:2:2 Profile, supporting up to 4:4:4 chroma subsampling, up to 14 bits per sample, and additionally supporting efficient lossless region coding and the coding of each picture as three separate color planes.|
Feature support in particular profiles
|Feature||Baseline||Main||High||High 10||High 4:2:2||High 4:4:4 Predictive|
|Arbitrary Slice Ordering (ASO)||Yes||No||No||No||No||No|
|Interlaced Coding (PicAFF, MBAFF)||No||Yes||Yes||Yes||Yes||Yes|
|CABAC entropy encoding||No||Yes||Yes||Yes||Yes||Yes|
|8 × 8 vs. 4 × 4 transform adaptivity||No||No||Yes||Yes||Yes||Yes|
|Quantization scaling matrices||No||No||Yes||Yes||Yes||Yes|
|Separate CB and CR QP control||No||No||Yes||Yes||Yes||Yes|
|Separate color plane coding||No||No||No||No||No||Yes|
|Predictive lossless coding||No||No||No||No||No||Yes|
|Sample depths (bits)||8||8||8||8 – 10||8 – 10||8 – 14|
Recall how we encode a series of images into a single losslessly compressed H.264 video file:
ffmpeg -r "24" -i "f_%d.png" -r "24" -vcodec "libx264" -crf "0" "output.mkv"
We cannot specify a profile for H.264 lossless encoding because the -vprofile option is incompatible with lossless encoding; the only x264 profile that supports lossless encoding is the High 4:4:4 Predictive Profile, which supports both lossless and lossy encoding.
ffmpeg -i "input.mkv" -vcodec "libx264" -crf "20" -pix_fmt "yuv444p" "output.mkv"
Because only the High 4:4:4 Predictive Profile supports the 4:4:4 chroma format (no chroma subsampling), x264 selects the High 4:4:4 Predictive Profile.
If -pix_fmt is omitted, the profile selected depends on various constraints determined by both the user and the video source; for example, if the source uses the 4:2:0 chroma format, x264 may select High Profile instead of High 4:4:4 Predictive Profile at CRF 20; if the source uses 4:4:4, High 4:4:4 Predictive Profile may be selected. With -pix_fmt, x264 disregards the chroma format of the source and uses the chroma format specified by the user.
We can also explicitly specify the profile to be used for lossy encoding.
ffmpeg -i "input.mkv" -vcodec "libx264" -crf "20" -vprofile "main" -pix_fmt "yuv420p" "output.mkv"
Note that the 4:2:0 chroma format may need to be specified if color space conversion is involved (e.g. encoding a 4:4:4 video into a Main Profile video).
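The data savings from chroma subsampling are easy to quantify. This Python sketch (function name is my own) computes raw 8-bit Y′CbCr frame sizes for the chroma formats discussed here, using the 1280 × 720 resolution from the examples:

```python
def bytes_per_frame(width: int, height: int, chroma: str) -> int:
    """Uncompressed 8-bit Y'CbCr frame size for common chroma formats."""
    luma = width * height                 # one Y' sample per pixel
    chroma_samples = {
        "4:4:4": 2 * luma,                # full-resolution Cb + Cr
        "4:2:2": luma,                    # Cb + Cr at half horizontal resolution
        "4:2:0": luma // 2,               # Cb + Cr at half resolution both ways
    }
    return luma + chroma_samples[chroma]

for fmt in ("4:4:4", "4:2:2", "4:2:0"):
    print(fmt, bytes_per_frame(1280, 720, fmt))
```

So even before any compression, 4:2:0 carries half the data of 4:4:4 per frame, which is why subsampled formats dominate consumer delivery.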
A “level” is a specified set of constraints that indicates a degree of required decoder performance for a profile. Those constraints include maximum decoding speed, maximum frame size, and maximum video bit rate. x264 automatically selects a level based on the options chosen during encoding.
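For example, the Level 3.1 seen in the tables below caps the frame size at 3,600 macroblocks and the processing rate at 108,000 macroblocks per second (figures from Table A-1 of the H.264 specification). A rough Python check (function name is my own) shows why 1280 × 720 at 24 fps fits:

```python
MAX_FS_L31 = 3600      # Level 3.1: max macroblocks per frame
MAX_MBPS_L31 = 108000  # Level 3.1: max macroblocks per second

def fits_level_3_1(width: int, height: int, fps: int) -> bool:
    """Check frame-size and throughput limits for H.264 Level 3.1."""
    mbs = (width // 16) * (height // 16)  # 16 x 16 macroblocks per frame
    return mbs <= MAX_FS_L31 and mbs * fps <= MAX_MBPS_L31

print(fits_level_3_1(1280, 720, 24))   # 3600 MBs/frame, 86400 MB/s -> True
print(fits_level_3_1(1920, 1080, 24))  # 8160 MBs/frame -> False
```

1280 × 720 is exactly 80 × 45 = 3,600 macroblocks, so 720p fits Level 3.1 up to 30 fps; anything larger pushes the encoder to a higher level.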
Video Format Compression Comparison
|Video Format||Uncompressed Video||H.264 Lossless High 4:4:4 Predictive@L3.1|
|Resolution||1280 × 720 pixels|
|Number of Frames||68,945|
|File Size (Bytes)||190,619,136,000||14,465,171,340|
|Rate Control||–||-crf "0"|
|Video Format||H.264 Lossy High 4:4:4 Predictive@L3.1||H.262 Lossy 4:2:2@High|
|Resolution||1280 × 720 pixels|
|Number of Frames||68,945|
|File Size (Bytes)||1,406,569,877||3,158,756,615|
|Rate Control||-crf "18"||-q "2"|
1. The encoding is done by the FFmpeg command-line tool.
2. A CRF value of "18" is the lowest value in the subjectively sane range of "18 – 28", though lower values are technically possible; a CRF value of "0" invokes lossless mode. Subjective quality is observer-dependent; the recommended range is therefore based on subjective blind tests performed on a sufficiently large number of people of diverse backgrounds (age, sex, etc.).
3. H.262 does not have a lossless mode. A qscale value of "2" is the lowest value accepted by the H.262 encoder included in the FFmpeg library.
4. The highest chroma format supported by H.262 is 4:2:2 Y′CBCR.
5. The designs and specifications of the encoders and library are subject to change.
6. The output file size depends on the source. The above result merely represents a particular case, which may not tell conclusively how other sources would be encoded. A different CRF value may be selected on a case-by-case basis, depending on trial and error, the intended output file size, the acceptable subjective perceptual quality, etc., since a CRF value does not indicate the average output video bit rate. Lossy encoding is always a trade-off. If the output file size is more important than subjective perceptual quality, the average bit rate can be set with the -vb option in place of a CRF value.
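As a sanity check on the comparison above, the uncompressed figure is exactly 24-bit RGB (3 bytes per pixel) times the pixel and frame counts, and each encoder’s overall compression ratio follows directly from the file sizes:

```python
# Reproduce the uncompressed size from the table and derive compression ratios.
frames, width, height = 68945, 1280, 720
uncompressed = width * height * 3 * frames  # 3 bytes/pixel (24-bit RGB)
print(uncompressed)  # 190619136000, matching the table

for name, size in [("H.264 lossless", 14_465_171_340),
                   ("H.264 lossy CRF 18", 1_406_569_877),
                   ("H.262 lossy qscale 2", 3_158_756_615)]:
    print(f"{name}: {uncompressed / size:.1f}x smaller than uncompressed")
```

Even lossless H.264 shaves the raw video down by more than an order of magnitude here, while the lossy H.264 encode is far smaller again than the H.262 one at these settings.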
39. About FFmpeg