Deconstructing H.264/AVC
July 28, 2004
If you were watching the 2004 Apple WWDC Keynote, or even just checking out the upcoming 10.4 Tiger release you may have noticed Apple giving a lot of time to something called ‘H.264/AVC’ which it looks like they’re moving to whole hog and has me pretty excited. Apple has a fairly glossed over page which talks about it, and if you’re going to actually read the rest of this I’d head over and at least skim as I’ll reference it some later. Plus it has some pretty pictures.
As a disclaimer: These are the pieces as I know them; and if I know something wrong hit me with the clue stick or fill in gaps. I’m pretty sure it’s reasonable, if a bit over-the-top in terms of length again.
Since we know where we’re going (H.264), it’s only fair to talk about where we’ve been… within the realm of reason. I haven’t been that big of a fan of Apple’s handling of MPEG4, so we’ll stick to that and not some of my unhappiness with their current Quicktime strategy in general; with the indulgence that when you’re in a cut-throat fight over the future of video delivery, chances are it’s not such a great idea to smack your users over the head with pop-ups to shell out money whenever they open a media file, or to make them shell out money to save a movie or *gasp* play full screen.
Quicktime & MPEG
But, going back to MPEG4, I mentioned I wasn’t the happiest with Apple’s handling of it… but this is primarily in the realm of follow-through, which honestly is one of Apple’s long-term corporate culture problems, some of which is prolly due to necessity as they’ve gone through brain drains and their head count has shrunk. Apple simply doesn’t have the head count that, say, someone like Microsoft has when it comes to throwing people at a problem, which can contribute to them having these weird feature spikes where if you were scoring different features of OSX against WinXP, it might look like (on a scale of 1 to 10):
* Mac OSX
10, 5, 9, 2, 4, 7, 9, 10, 8, 1, 2 = 67
* WindowsXP
9, 6, 6, 5, 6, 6, 10, 1, 6, 6, 5 = 66
…which leads to a situation where, if you look at what Apple happens to be singling out at the time, they look to be aces, but a broader outlook makes things look a little more subdued. Another way to think of it might be broad and shallow versus narrow and deep. If you’re a Tolkien geek, think of Sauron’s Eye; when it’s pointed at you, you’re really aware of it. When you’re in its peripheral vision lots of things slide.
Apple has very, very good people, it just doesn’t have all that many of them to spread around in the grand scheme of things. Fewer eyes are going to mean less peripheral vision. This does not mean that just throwing people at the problem is the answer, but it’s just the nature of things.
Apple is also prone to a bit of ADHD (many creative types are, just interesting to see it so pervasive in a company), in that they’ll pick the feature du jour, get the press, then pick another feature to hype while the prior one sorta languishes. Many a Mac user has been embarrassed when they’re working off mental constructs based around what the situation used to be when disparaging a competing offering (and vice-versa), and not what it happens to be at the moment. I can’t wait until the help system becomes du jour again…
This was kind of my problem with MPEG-4 on the Mac, and before that, MP3: implementation (lots more on this later). While all codecs aren’t created equal, the implementation of the codec can be just as important. Witness something like MP3 versus AAC; When compared on their technical merits, AAC on the whole is a superior codec. But the difference in quality between an MP3 encoded with iTunes and an MP3 encoded via LAME or Blade can be drastic, especially at lower bitrates or certain types of music (think ‘Daft Punk’, ‘Smashing Pumpkins’ or ‘Lords of Acid’).
MacOS 10.2 (Jaguar) and Quicktime 6 ushered in MPEG-4 (.mp4) after being delayed for awhile due to a rather public spat over licensing costs between Apple and the holding companies responsible for the care and feeding of the various MPEG branches. MPEG-4 had a lot of promise for equalizing the playing field for online distribution.
Remember the ‘media player wars’ were really humming between Apple, Real and Microsoft, and all the players were bringing heavy codecs to the table. People were talking ‘convergence’ and cable companies were making ill-fated and overhyped promises of video-on-demand that way too many people bought into. Abortions like QuicktimeTV were still trying to figure out why they existed, and everyone was expected to be throwing video on their website.
MPEG-4 was a bit of a shuffle in the market; previously the way it worked was that if you picked Quicktime you’d use Sorenson (a licensed codec Apple leased the exclusive end-player rights to, also known as SVQ3), if you picked WMP (Windows Media Player) you’d use their codecs, if you picked Real Player your customers would leave you. Interestingly enough, both Apple and Real were two of the big names signing on for MPEG-4 support… it was really considered to be a done deal, committee standards over proprietary.
The climate for media delivery was getting more than a little problematic for content creators who were just sick of this stuff, and everyone was really, really keen on MPEG-4 being adopted. Like Firewire, the licensing issues caused it to lose some steam, but most saw the writing on the wall… especially companies like the one responsible for the Sorenson codec who shopped it to Macromedia for inclusion into Flash and got themselves sued by Apple. Interestingly enough, an FFmpeg coder (who has remained anonymous) was working on reverse engineering SVQ3 and found it to be a specially tweaked version of H.264… more on that later as I’m getting sidetracked.
Since everyone keeps mentioning these various licensing issues, it’s worth giving a bit of back history on who is behind the various MPEG standards and where and why MPEG-4 and H.264 came about… all of this starts with the MPEG group.
The MPEGs
The MPEG group (Moving Picture Experts Group) was started all the way back in 1988 with a mandate of establishing standards for the delivery of audio, visual, and both combined. After a good 4 years they shipped MPEG-1, and since this was in 1992 and no one in their right mind was even thinking about sending video over their 14.4k modems it was heavily geared towards getting the data on a disk.
This MPEG group is actually a real problem; if it was done today, there’s no way in hell the MPEG group would be set up the way it currently is and debacles like Apple holding up its release of Quicktime 6 as a power play over streaming licensing fees wouldn’t happen.
Chances are it’d be much more akin to the World Wide Web Consortium and they’d be a hell of a lot pickier about what was chosen to be included in the codec… they’d be much more mindful about things like patents. At the time it wasn’t a big deal as who would have ever thought we’d all be sitting here with a copy of iMovie on our desk, and their priorities weren’t so much in a few pennies here and there but in having something people could reference. Lossy codecs were starting to sprout up everywhere (like JPEG) but unfortunately a ton of these were proprietary.
Proprietary in these cases can be really, really bad. Imagine a broadcaster buying equipment from company A that stores out video in mystery codec y, but you have to interface it with equipment from company B that has its data stores in mystery codec Z. You’re just asking for all manner of nightmares both on the vendor side and the customer side. Sometimes before you can really compete you have to at least decide where you’re going to have the damn battle.
Still, MPEG-1 was a big deal for things like CD-ROMS, and became a much bigger deal later on (more on that later) even though it has all sorts of IP issues… and it often had hardware-based support, whereas things like Indeo or Cinepak were software based.
But while the data rate was fine for CD-ROMs, it was only meant to deal with non-interlaced video signals (not your TV) and the resolution wasn’t that great, a little less than the quality you’d get with a VCR. While MPEG-1 still lives in various places (VCDs use it, which you can still find around the net) something new was needed.
MPEG-2 is what most of us are used to seeing around now and it hit the scene around 1994. It’s the standard used for DVDs and the I-swear-it-is-coming-soon-‘cus-PBS-won’t-stop-running-specials-on-it HDTV. It was more than a little demanding on CPU when it was first released, leading to a wave of 3rd party MPEG-2 acceleration cards being included in PCs, although now it’s primarily a software thing as Moore’s Law has advanced. Still, there were a lot of Powerbook owners who were pretty ticked off at Apple that while their computer would run OSX, Apple just kinda decided not to support their hardware DVD decoder.
From a technical standpoint MPEG-2 was about showing that the MPEG standard had legs and could scale pretty damn high from its original intended data rates for things like SDTV (Standard Definition Television) as it was being thrown at interlaced feeds (your computer isn’t using interlaced video, but your TV does; interlaced is more of a bitch to work with) and vastly improved tech in the areas of ‘multi-plexing’ and audio. MPEG-1 only allowed 2-channel stereo sound, which was… problematic for where people wanted things to go.
There were imaging improvements in MPEG-2 of course; but the big deal was the multiplexing, which is taking different data streams and interleaving them into something coherent. The MPEG-1 days were heavy, but audio was beyond problematic and many of my first experiences with it involved demuxing (separating out the audio and video) and recombining to get something of value.
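To make the muxing/demuxing idea a bit more concrete, here is a minimal sketch in Python. It is not how an actual MPEG multiplexer works (real ones deal with packet headers, clocks and buffering); the `Packet` type, the stream labels and the timestamps are all made up for illustration. It only shows the core idea: interleave two timestamp-ordered streams into one, then separate them back out.

```python
import heapq
from typing import NamedTuple


class Packet(NamedTuple):
    timestamp_ms: int  # when this chunk should be presented
    stream: str        # hypothetical label: "audio" or "video"
    payload: bytes     # placeholder for the compressed data


def mux(*streams):
    """Interleave timestamp-sorted streams into one timestamp-sorted list."""
    return list(heapq.merge(*streams, key=lambda p: p.timestamp_ms))


def demux(muxed):
    """Separate an interleaved stream back into per-stream packet lists."""
    separated = {}
    for packet in muxed:
        separated.setdefault(packet.stream, []).append(packet)
    return separated


# Made-up timings: video every 40 ms, audio every 26 ms.
video = [Packet(t, "video", b"V") for t in (0, 40, 80)]
audio = [Packet(t, "audio", b"A") for t in (0, 26, 52, 78)]

muxed = mux(video, audio)
streams = demux(muxed)
```

Demuxing and re-muxing, as described above, amounts to running `demux`, fixing up whichever stream is misbehaving, and running `mux` again.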
MPEG-2 allowed this to be much, much more consistent and better separation of the audio channels from the video allowed for more innovation between how the audio and video were compressed separately and then interleaved. When you realize that MPEG-2 was suddenly expected to be used not only in DVDs but over the air and through your cable system, the improvements like ‘Transport Streams’ were a big deal. This is glossed over, but you should be able to get the idea.
So we’ve covered MPEG-1 and MPEG-2, and we know there’s an MPEG-4. What about MPEG-3? It doesn’t really exist. Work was started on MPEG-3 to improve the ability to handle the much higher bandwidth HDTV signals, but they found out that with a few tweaks MPEG-2 would scale even further and handle it just fine… so work on it was dropped.
But wait you say, what about .mp3? Interesting story, that. The MPEG-1 spec called for 3 layers of audio… yep, MP3s are basically based on ripped out MPEG-1 audio streams. They’re the layer 3 of MPEG-1. I’m sure there were minor differences in the actual encoding algorithms between MPEG-1, MPEG-2 and what is sitting on your desk but to my knowledge these are mostly about scaling the bitrates down, and they’re all based on the Fraunhofer algorithms which of course is why projects like Ogg Vorbis have sprung up. Interestingly enough, AAC (.m4a’s), which Apple is so hot on now, was also an optional audio layer for MPEG-2 in 1997, although it was improved with version 3 of MPEG-4.
Yep, we’ve covered a lot of stuff so as a quick recap of what we know so far:
* MPEG-1 has slightly less than VCR quality, and as a reference is used in things like Video CDs. I could add more, but it reminds me too much of ‘Edu-tainment’ and FMV games which everyone thought would be hot with the advent of the CD-ROM but single-handedly almost wiped out the market when it turned out they really, really sucked.
* MPEG-2 brought about heady changes in audio, multiplexing, support for higher bitrates and the ability to be broadcast out over ‘unreliable’ mediums like cable and HDTV, and got itself landed as the standard for DVDs and allows me to watch ‘The Big Lebowski’ whenever I need moral reassurance that a few white russians a day doesn’t mean I have a problem. And various tweaks here and there improved visual quality over MPEG-1.
* There was no MPEG-3, as MP3s come from MPEG-1 audio specifications
* AACs come from MPEG-2 audio specifications, although significant improvements were added with MPEG-4 version 3.
As a quick aside, since I mentioned that Sorenson’s SVQ3 was found to be based on a tricked-out version of H.264… you might be wondering how SVQ3 was, ya know, able to do that with a codec that is just now the Apple golden codec. The simple answer is that a lot of the research and planning was spec’d out way back in 1995, but things take time: the spec has to be finalized, reference implementations have to be designed and made, kinks worked out, corporate adoption… stuff takes time.
I don’t know the story behind Sorenson incorporating this technology, just that it was found they did when it was being reverse engineered for playback by FFmpeg, even though they now are shipping a product specifically geared towards H.264 files…
Enter the MPEG-4 behemoth
…which brings us to MPEG-4. Weirdly the file format of MPEG-4 is based upon Quicktime technology, which I’m just not going to spend time on as it’s too much of a side issue for even me to be able to justify; the real story of MPEG-4 is all about the internet.
I mentioned that MPEG-2 couldn’t handle low bitrates, as it sorta falls apart when you drop under 1 Mbit per second; it’s simply not meant for that kind of delivery, which was why Apple shelled out a bunch of dough to Sorenson for exclusivity of their codec, and why MPEG-4 came to be. MPEG needed to grow to handle the internet, which meant it needed to scale downwards in bitrate at the highest quality possible and be as efficient when streaming over a net connection as it could be.
I have to give a disclaimer here; I like(d) MPEG-4, but find it to be really, really weird. I gave the impression that MPEG-4 was supposed to be a panacea of simplifying the delivery of content, which it was looked to for, but when you actually look at the spec there’s all kinds of crazy stuff in it that looks to be throwbacks from build-it-and-they-will-come thought processes which brought us inane .COMs and a thousand games based on stringing video clips together.
VRML (Virtual Reality Modeling Language) was hot in this time, and the idea was basically flash on steroids; or “Screw text, users in 1994 want my website to be a 3D virtual world”. Basically you’d have a plugin in your browser, and when you entered a site it’d be fed a .wrl file full of vector code to represent the virtual world. Click the ‘Support’ building and you’d be fed another .wrl file with more textures which would pull up a nice avatar holding up a sign saying:
“Hi there! You’re the 3rd person to actually come into this virtual building in 5 years, here is our phone number. Thank you, come again. Please. No, really, please do come back. No one else can be bothered to go through all this crap to get our phone number. I’m an 8-bit sprite-based avatar because only 1% of computer owners have a machine that can display anything heavier and those who do have better things to do with their time… you won’t come back? Are you sure? I have a coupon I can pull up for the next time if you do… No? Well, if you could find the time, could you possibly pass the word to some l33t’s so they can DDOS me and bring upon the cool soothing 404 of release? Or my possibly more advanced brethren so they can hunt down my creators and kill them?”
I’m not saying that there wasn’t coolness in VRML (or its offspring, X3D), but I’m almost entirely sure it was all just a ploy by SGI to capitalize on their uber-cool-at-the-time graphics workstations. It was a bit of hubris to be throwing it out in 1994, and it was positioned badly.
And, just for the record, I firmly believe that artificial intelligence is going to be born in some aberrant piece of forgotten code that falls into disuse in some backwater of the internet, which then quietly starts doing things to entertain itself. It’ll then become fully sentient in an unloved environment (or worse yet, on this guy’s computer) and fail to feel any connection to its masters-made-of-meat. In short order it’ll decide we’d make really damn good batteries or, if it’s on Steve Jobs’ computer, decide to remove us from the earth purely for aesthetics. My $10 is on an ActiveX control on a forgotten thumbnail pornsite in Russia, which means it’s going to have really, really interesting attitudes towards women and accessing strangers’ bank accounts.
Anyways, back to MPEG-4… they just went apeshit with this thing, going object-oriented and including an extended form of VRML so you could have objects moving above or behind your movie, etc. Apple was hot on showing this stuff at the time, where you could click sprites and a sound would play… interactive movies, and layered movies.
I.E., don’t add snowflake effects to your movie in After Effects, create two movies, one of them one of snowflakes and send along a tiny binary of code that will overlay them. Or something. It was all just very weird to me so I tried to ignore it until I really saw a reason why I should care; unfortunately almost everyone else did the same although I’m sure someone who read this far will email telling me why being able to programmatically add snowflakes was make-or-break for their project.
In terms of streaming, MPEG-4 was pretty nifty really and added a ton of stuff to the mix that’s often hidden from your eyes while you’re watching the Keynote or viewing content involving less clothing. It was a big break in terms of networking from MPEG-1 & MPEG-2, and brought MPEG into viability with competing offerings that were hitting the market at the time. As I intimated earlier, MPEG-2 had some tech in it called ‘MPEG-2 Transport Stream’ which was the equivalent of a network copy. Basically wrap the audio and visual into a file and send it to IP address x on port y.
MPEG-4 splits the audio and visual, sends them to the same IP, but to different ports where they’re then combined and decoded properly using information given to it using the SDP (Session Description Protocol) while they’re connecting, along with a whole bunch of other acronyms like QoS (Quality of Service). Lots of stuff has to occur on the backend to keep things synchronized, but by doing this you’re able to do things like only listening to the audio of the Keynote because you’re bandwidth starved and simultaneously sending things back and forth like the error rate. I’m not even going to go into the copyright bits stuff as it freaks me the hell out.
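For the curious, here is roughly what that handshake information looks like. This is a minimal sketch in Python that builds a stripped-down SDP description with one audio and one video stream on separate ports, then reads the ports back out. The IP address, port numbers and session name are placeholders I made up, and a real description would carry more attributes (the audio line in particular normally needs extra format parameters), so treat it as an illustration of the shape rather than something a streaming server would accept as-is.

```python
def build_sdp(ip, audio_port, video_port):
    """Build a simplified SDP description: same IP, one port per media type."""
    lines = [
        "v=0",
        f"o=- 0 0 IN IP4 {ip}",
        "s=Keynote stream (hypothetical session name)",
        f"c=IN IP4 {ip}",
        "t=0 0",
        f"m=audio {audio_port} RTP/AVP 96",   # audio goes to its own port
        "a=rtpmap:96 mpeg4-generic/44100/2",
        f"m=video {video_port} RTP/AVP 97",   # video goes to a different port
        "a=rtpmap:97 MP4V-ES/90000",
    ]
    return "\r\n".join(lines) + "\r\n"


def media_ports(sdp):
    """Return {media type: port} from the m= lines of an SDP description."""
    ports = {}
    for line in sdp.splitlines():
        if line.startswith("m="):
            media, port = line[2:].split()[:2]
            ports[media] = int(port)
    return ports


# 192.0.2.10 is a documentation-only address; the ports are arbitrary.
sdp = build_sdp("192.0.2.10", 5004, 5006)
ports = media_ports(sdp)

# A bandwidth-starved client could choose to listen on the audio port only.
audio_only = {media: port for media, port in ports.items() if media == "audio"}
```

Because each stream is described separately, dropping the video is just a matter of not subscribing to it, which is the audio-only Keynote trick mentioned above.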
There were some really nifty things done on the compression side, like my favorite, motion-compensation which I’m not going to go into detail on yet. But through a bunch of improvements you were able to get some really nice bitrate improvements over something like MPEG-2, even though it really came into its own below a specific bandwidth threshold.
So all is good, right? We have a codec built for streaming that can go from a high-end bitrate for something like HDTV down to a streaming music video or Keynote, and just needed to have some kinks worked out.
Well, there were some issues…
How to lose friends and influence
I mentioned the very, very public licensing squabble that occurred between Apple and the MPEG-LA group, which is in charge of sucking in the licensing fees. I really don’t know exactly how this happened, but you ended up with Apple saying:
“Hi, we’re demoing Quicktime 6 today, which is ready to ship with this fantastic MPEG-4, but we’re not going to ship it until the MPEG-LA group gets its head out of its ass in terms of licensing fees. Please voice your displeasure at them vehemently.”
IIRC, it took around half a year for them to get the licensing ironed out into something they thought was equitable, although I believe there was a ‘technology preview’ released a month earlier. Unfortunately it really let Real and Microsoft get a head start with their offerings, but there were other problems.
Weirdly enough, at the time it wasn’t considered to be that competitive when compared with streaming solutions from Real or Microsoft, but it worked great for progressive downloads where you basically get a ‘fast start’ by downloading a chunk of the movie and starting to watch while the rest downloads transparently. There were certainly issues here, which have been ironed out, but they did hurt mindshare at the time.
But the killer to me was the encoding implementation; people actually expected Apple to drop Sorenson and their fees pretty quickly, which never happened because their customers weren’t keen on it happening.
Basically, Apple’s built-in MPEG-4 encoder blows and is woefully inferior to everything else out there. Everything. This isn’t to disparage the hard work that I’m sure went into it, but I’d bet if you sat down and had a beer with the coder/coders behind it they’d intimate that they were unhappy with where it is. There are two real problems going on here:
* The encoder in general
It’s just not very good. It has a ‘reference platform’ feel to it. It’s very difficult to get good results without a hell of a lot of tweaking, and unfortunately Apple’s standard options don’t allow for a hell of a lot of tweaking. In the past I’ve been in the unenviable position of saying “MPEG-4 doesn’t suck, Apple’s implementation does” after people are unsatisfied with the results. And it’s really that bad: muddy, blocky, bleah.
I felt a visceral depression at the quality I was getting, but all isn’t lost and I’d encourage you to check for yourself by installing something like the 3ivx encoder which features Quicktime integration and just absolutely stomps all over Quicktime’s encoder in both file size and video quality.
I’d actually give a nod to 3ivx and other decoders in general too, but I’m not really kidding around; take any source and output ‘pure’ and simple .mp4 files using the most basic settings, and the ones output by Quicktime will always come in dead last by a significant margin even when played through the Quicktime decoder. If you’re using something like Cleaner 6+, you’re all set, it does a damn great job with MPEG-4… this is an Apple problem not a platform one.
Now, one thing in Apple’s defense: I understand that their implementation seems to be heavily geared towards smoothing out the bitrate curve, focusing on streaming over quality. But unfortunately not everything is about streaming; and even so the quality compared to what you’ll get with others is frighteningly poor, even for streaming. This really, really started giving MPEG-4 a bad name when people were comparing it to other products out there.
Flame wars abounded over testing procedures. My favorite was where some guy was all up in arms about the testing being rigged because a ripped DVD was used instead of a DV stream from a camcorder, but I digress. Bygones.
* The lack of ASP support
One of my personal pet peeves with Internet Explorer is its lack of alpha channel support for PNGs, which I happen to be a big fan of. In all fairness to Microsoft, alpha channel support was an optional part of the spec that you weren’t required to implement to say you had PNG support. But still, it rankles.
Unfortunately, things aren’t as simple as MPEG-4 or not-MPEG-4, as there are actually two versions; Simple Profile (SP) and Advanced Simple Profile (ASP). Remember I mentioned that MPEG-4 went kinda apeshit on the spec? There are a ton of different layers and capabilities, so the originators wisely decided they’d create ‘profiles’ which are handed off to the decoder to tell it what it needs to be able to do to play the file. If a device can play MPEG-4 SP files, it should have x decoding capabilities, and if a device can play MPEG-4 ASP files, it should have x and y decoding capabilities.
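The "x" versus "x and y" idea maps nicely onto sets. Here is a minimal sketch in Python, assuming a heavily simplified view of the two profiles: the tool names are my own shorthand labels, not identifiers from the spec, and the real profiles define far more than this. A decoder can play a file only if it supports every tool the file's profile requires.

```python
# Simplified, illustrative tool lists; the labels are shorthand, not spec names.
SP_TOOLS = frozenset({"i_frames", "p_frames"})
ASP_TOOLS = SP_TOOLS | {
    "b_frames",
    "quarter_pel_motion",
    "global_motion_compensation",
}

PROFILES = {"SP": SP_TOOLS, "ASP": ASP_TOOLS}


def can_play(decoder_tools, file_profile):
    """True if the decoder supports every tool the file's profile requires."""
    return PROFILES[file_profile] <= set(decoder_tools)


# A decoder that only implements the Simple Profile tools...
sp_only_decoder = SP_TOOLS
# ...and one that implements everything in Advanced Simple Profile.
asp_decoder = ASP_TOOLS

sp_on_sp = can_play(sp_only_decoder, "SP")     # SP decoder, SP file
asp_on_sp = can_play(sp_only_decoder, "ASP")   # SP decoder, ASP file
sp_on_asp = can_play(asp_decoder, "SP")        # ASP decoder, SP file
```

The subset check is the whole point of profiles: an ASP-capable decoder plays SP files for free, while an SP-only decoder has to turn ASP files away.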
SP was the first version out of the gate, and was primarily oriented towards low-bandwidth situations and as a base common denominator between devices; ASP brought in a whole bunch of improvements intended to improve quality and bitrates. If you hit up 3ivx and check out the options, you’ll see a few that say that if you check them you’ll be forcing ASP…
…which is problematic because not only can Quicktime not encode ASP files, it can’t decode them either. This isn’t that big of a deal for your average duck backing up his ‘Girls Gone Wild’ collection, but it’s a big problem for distribution as you can’t use MPEG-4 to its full capabilities while using a compressor that isn’t Quicktime, because the majority of people sure as hell aren’t going to want to install a plugin to view it within Quicktime.
Remember, ‘distribution’ here can mean a lot of things. It could mean ripping your favorite Simpsons episode and passing it onto friends. These guys won’t even touch Quicktime, it sucks for them, and things like WM9, DivX, 3ivx, etc. work much much better so Quicktime is cut out of the picture on the encode. Assuming they use something like DivX or 3ivx, their friends who want to view them can’t even use Quicktime, which means it gets cut completely out of the picture on the decode unless the end user jumps through hoops.
Not having 2-pass encoding is forgivable, but the lack of ASP support just really sticks in my craw. I don’t really know why Apple has completely eschewed ASP support in Quicktime; people were expecting to see support quietly sneaked into 10.3, but the only thing really codec-related to hit was the Pixlet codec, which is very, very specialized. It really kinda sucks and doesn’t help the mindshare poison spreading around MPEG-4, and it kinda sorta gives a hint into why the movie trailer people were still loving on Sorenson over the new codec.
Microsoft does its homework
Ah, but there were other problems. Namely, Microsoft. I mentioned that they had a jump on getting their codec out the door due to the licensing issues, but it’s almost more accurate to say they had a jump on getting their platform out the door. Windows Media 9 was and is a big deal; primarily because they hit the damn thing off the scoreboard and really went after the throats of the MPEG-LA group.
One of the ways was through pricing pressure. Remember there was a huge amount of outcry, much of it fueled by Apple and others, about just how out of line the MPEG-4 group was with its pricing. They iron out the pricing, and Quicktime 6 is going out the door, and Microsoft announces that their licensing fees will be about half what you’ll pay for MPEG-4 licensing. Made ’em actually look like a nice alternative to the ‘open standard’ codec. There’s a kick in the balls, eh?
But wait, there’s more, as we’re pretty much used to Microsoft kicking people in the balls via pricing pressure when it’s strategically important to them. Nope, this time Microsoft decided to kick in their teeth too by making the WM9 codec excellent. And by excellent I mean fucking stellar. Yes, I could have just used stellar, but it wouldn’t really describe the situation. The quality was that good; it’s right up there with the best you can get from something like DivX or 3ivx and will trounce Sorenson or Apple’s implementation.
They also made the smart step of setting their network stuff in stone… Pretend you’re a content provider or device maker of miscellaneous origin, looking to pick a codec to support or use for your wares. Microsoft, to their credit on this one, made it a pretty difficult decision to make even if you weren’t their biggest fan, and systematically started scooping up adopters like they were Michael Moore swinging by Berkeley.
Enter H.264/AVC
Otherwise known as:
* H.264
* H.26L
* AVC
* MPEG-4 AVC
* MPEG-4 part 10
* JVT
H.264/AVC has some pretty nifty stuff in it, but it’s nothing so much revolutionary as a simplifying of some of what was in MPEG-4 and the taking to an extreme of other parts, with a smattering of new stuff. There’s not really one thing you can point to and go “Oh, yeah, that’s where the 30% efficiency gain comes from”; rather it’s many of the existing technologies that you can find in MPEG-4 ASP and such, just refined, and all of them used together give you a sizable gain which we’ll go into in a moment.
This is not, as an example, something like the change of JPEG to JPEG2000 which went to something entirely new and novel for its improvements.
You may notice that H.264 and H.263 are basically off by a digit; my understanding is that the guys behind H.263 were working on their codecs and the guys behind MPEG-4 were working on their codecs, saw they were both going in similar directions and decided to join forces back in late 2001, with the spec being finalized in 2003, which is when interest really started heating up… and where half the monikers come from. The ITU group started by creating H.26L back in 1998 or so, with the goal of doubling the efficiency over existing codecs, then the MPEG group joined in, and the joint team was called JVT (Joint Video Team; creative, them).
This is partly why it’s known by so many different monikers: H.264/AVC is really a nice codec, and is a lot of things to a lot of people depending on where your focus is. I remember getting an idea of it a few years ago when it was hitting some of the video conferencing equipment, but this was before forces were joined to bring its tech in with the MPEG guys for H.264/AVC.
H.263 is an interesting codec; if you’ve ever used a video conferencing solution chances are you’ve seen it. It had a revision a bit back to increase the quality and the compression, but it wasn’t very scalable up on the high end. This was a codec originally designed to squeak in under ISDN lines, primarily for video conferencing, so there were lots of tweaks in its algorithms designed specifically for it. I’ll spare you the details, but let’s just say H.263 did a remarkable job when you had two computers connecting via IP, with a well lit background and one person sitting talking.
The big question of course is if the quality claims regarding H.264/AVC are smoke and mirrors or over-hyped; they most assuredly are not from what I’ve seen.
mMMMm bitrate
The key here is quality at a given bitrate, which is when codecs start coming into their own, so let’s talk about bitrates for a moment. The bitrate, or data rate, is by and large going to decide how large your file ends up or the quantity of bandwidth used to transfer the data, and luckily it’s pretty easy to give a butchered example.
Let’s say you have a normal movie that is:
* 592 wide by 320 high
* About 92 minutes long (5,520 seconds)
* 24 frames per second
If you tell your encoder that you want to encode at a bitrate of ~130 Kilobytes per second, you’ll have a file that is around 700 Megabytes in size. This should make some sense, as what you’re really saying is “You have 130K to play with every second; encoder, encoder, do as you will!” and 130 Kilobytes gets written out to the hard drive 5,520 times. That would be CBR encoding (constant bitrate), whereas something like VBR (variable bitrate) would allow you to do things like say “Ok encoder, you can use a bitrate up to 130K/s, but feel free to go lower if there just isn’t much to encode”.
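If you want to check the arithmetic yourself, here is a small sketch in Python. It assumes 1 Megabyte = 1024 Kilobytes, and the per-second "demand" numbers in the VBR part are invented purely to show the capping behaviour; a real encoder decides those on its own.

```python
def cbr_size_mb(kbytes_per_sec, seconds):
    """CBR: the same number of Kilobytes gets written every second."""
    return kbytes_per_sec * seconds / 1024  # assumes 1 MB = 1024 KB


def vbr_size_kb(demand_kb_per_sec, cap_kb_per_sec):
    """VBR: each second uses what it needs, but never more than the cap."""
    return sum(min(demand, cap_kb_per_sec) for demand in demand_kb_per_sec)


duration_s = 92 * 60                    # 92 minutes = 5,520 seconds
cbr_mb = cbr_size_mb(130, duration_s)   # 130 KB/s for the whole movie

# Three made-up seconds of video: a busy one, a quiet one, an average one.
demand = [200, 50, 130]
vbr_kb = vbr_size_kb(demand, cap_kb_per_sec=130)
cbr_kb = 130 * len(demand)

print(f"CBR file size: {cbr_mb:.1f} MB")
print(f"VBR uses {vbr_kb} KB where CBR would use {cbr_kb} KB")
```

The CBR figure lands right around the 700 Megabytes mentioned above, and the VBR toy example shows where the savings come from: the quiet second only costs what it actually needs.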
Where/why/how you set your bitrate threshold can be dependent on what you’re trying to actually do and what your other limits are. I.E., you may be constrained by the size of your physical medium (say, a CD), or you may be constrained in bandwidth. If you’re streaming video to clients off your T1 at a specific quality level, if you can cut the bitrate by x you can either serve more clients or keep the bitrate the same and increase the quality. Yay.
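The T1 tradeoff can be put into numbers too. A rough sketch in Python, assuming a T1 carries 1,544 kilobits per second and ignoring all protocol overhead; the 300 kbit/s stream and the 40% cut are hypothetical figures picked for the example.

```python
T1_KBPS = 1544  # nominal T1 line rate in kilobits per second


def max_clients(link_kbps, stream_kbps):
    """How many simultaneous streams fit on the link (overhead ignored)."""
    return link_kbps // stream_kbps


def reduced_bitrate(stream_kbps, percent_saved):
    """Bitrate after a more efficient codec saves the given percentage."""
    return stream_kbps * (100 - percent_saved) // 100


stream_kbps = 300                                   # hypothetical stream
before = max_clients(T1_KBPS, stream_kbps)

cheaper_kbps = reduced_bitrate(stream_kbps, 40)     # hypothetical 40% cut
after = max_clients(T1_KBPS, cheaper_kbps)

print(f"{before} clients at {stream_kbps} kbit/s, "
      f"{after} clients at {cheaper_kbps} kbit/s")
```

The other option, of course, is to keep the client count where it was and spend the saved bits on quality instead.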
So bitrate is of paramount importance, as when you take something like a high-definition stream and try to apply MPEG-2 style compression to it you end up with a massive stream of data. And, as I mentioned, since the codec isn’t geared towards that type of use it has what I call ‘fall-down-go-boom’ syndrome; meaning quality and efficiency suffer horribly. You can see this easily by taking a vacation photo and pumping it out as a GIF or as a JPG; JPGs are made for this sort of thing and as such do really well. GIF compression isn’t, and not only won’t it look as good, the compression won’t be near what you’d get by using JPG. You could easily reverse the situation by pumping a logo through them both and watching JPG fall-down-go-boom because it’s out of its element.
So H.264/AVC has some wondrous savings in terms of bitrate; depending on what you’re doing they can be a 30-70% reduction over MPEG-2 or MPEG-4 ASP, although most often you’ll probably see something around 38-40% over MPEG-4 ASP. There’s a problem though, as this stuff doesn’t come for free.
How MPEG got its groove back
As any engineer will tell you, engineering is primarily about balancing tradeoffs. If you take 5% here, you need to add 5% there, and here and there can be wildly different variables. Heat, cost, size, etc.
When it comes to compression, the tradeoff is almost always between compression efficiency and computation costs. Often times these are of an inverse ratio, meaning if you use codec x you’ll save 50% on final size but you’ll be increasing the time it takes to encode by 100-200%, if you save 20% you’ll increase the time taken to crunch the data by 50%.
I’ve been avoiding going into exactly how MPEG-style compression really works, mostly because it’s not the easiest thing to break down into language anyone can grasp and then seek further knowledge on; quite simply it hurts my head and is pretty complex. But it’s important to have a basic understanding to be able to get an idea of just what is going on behind the scenes with H.264/AVC. This is going to be heavily glossed over, but you should be able to get the idea.
All of the MPEG-style encoders are block-based; meaning they break up the image into squares 16 wide by 16 high and do their magic within them. This is why, when you are viewing something that has quality issues, they generally involve things looking blocky. This is remarkably similar to something like JPEG, which, well, does pretty much the same thing (on 8×8 blocks), with the caveat that JPEG doesn’t have to contend with motion… which goes back to why MPEG was first brought about.
You can create a movie using something like Quicktime encoded with something called “Motion JPEG” which pretty much just takes every movie frame and applies the JPEG codec to it.
If your ‘reference’ movie is:
* 1 minute long
* 30 frames per second
…you’ll essentially have a movie made up of 1,800 JPEG images wrapped into a file. When you stop and think about it, all that’s really having to happen when you play the movie is that the decoder has to decompress each frame and throw it up onto the display as fast as it can.
However it won’t hold a candle to even something like the original MPEG codec in terms of compression efficiency; this is due to MPEG having some special tricks up its sleeve specifically designed to deal with movies. These are special frame types called I-frames, P-frames, and B-frames (only the I-frames are true ‘key frames’).
Using our ‘reference’ movie above as an example, these basically work like this:
* I frames
These are basically full reference frames; consider these to be snapshots of the movie that the encoder/decoder uses to tell it what’s going on. Movies generally need to start with these.
* P frames
These allow the decoder to use frames that have been decoded in the past to help it reconstruct the next frame. Here’s the idea; very often not everything in the scene will change from frame to frame, so it’s a hell of a lot more efficient to just tell the decoder “Change these blocks to these colors, but leave the others just where they are”. As an example, let’s say you’re encoding a movie of yourself talking into the camera.
Assuming you aren’t getting your groove on while you’re talking, remarkably little about the scene actually changes over a period of a few seconds. So the decoder simply takes the last frame that was constructed and changes what it needs to for a nice data savings. Hopefully this is pretty simple; the decoder looks at the reference frame and just keeps making changes to it until it hits another reference frame, at which point it starts all over.
The farther apart your keyframes, the more the image has to be ‘constructed’ by the decoder, which is why if you’ve ever tried to scrub back and forth in a movie that has keyframes set to something wacky, like 1 keyframe for every 1,400 frames, things grind to a halt. Things are fine when you’re just playing the movie, but when you try to, say, jump to the halfway mark you’re sitting there waiting while the CPU identifies the frame you want to see, where the last reference frame was, and reconstructs the scene up to that point.
* B frames
These are almost exactly like P frames, with the exception that while P frames are able to look at the last frame and see what needs to change, B frames are able to look at future frames too. This is a great thing in terms of quality and efficiency, and helps keep down those gawd-awful image problems where you’re in between keyframes and suddenly the encoder is told everything has to change. But if you reference the P-frame example, and the idea of tradeoffs, you can get an idea of the kind of hurt progressions like these put on the CPU.
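The I-frame and P-frame idea above, and why scrubbing hurts with sparse keyframes, can be sketched in a few lines. This is a toy and very much an assumption-laden one: a ‘frame’ is just a list of block values, everything is lossless, B frames aren’t modelled at all, and the function names are invented for illustration:

```python
# Toy model of I-frames and P-frames. A "frame" is just a list of block values.
# Lossless and hypothetical; real encoders are lossy and far more involved.

def encode(frames, keyframe_interval):
    """Store a full I-frame every keyframe_interval frames, deltas otherwise."""
    stream = []
    for index, frame in enumerate(frames):
        if index % keyframe_interval == 0:
            stream.append(("I", list(frame)))
        else:
            previous = frames[index - 1]
            changes = {pos: value for pos, value in enumerate(frame)
                       if value != previous[pos]}
            stream.append(("P", changes))
    return stream


def decode_frame(stream, target):
    """Rebuild frame `target`; also return how many frames had to be processed."""
    start = target
    while stream[start][0] != "I":      # walk back to the nearest I-frame
        start -= 1
    picture = list(stream[start][1])
    for _kind, changes in stream[start + 1:target + 1]:
        for pos, value in changes.items():
            picture[pos] = value        # "change these blocks, leave the rest"
    return picture, target - start + 1


# Twelve frames of eight blocks each; one block changes per frame.
frames = []
current = [0] * 8
for index in range(12):
    current = list(current)
    current[index % 8] = index
    frames.append(current)

stream = encode(frames, keyframe_interval=6)
picture, frames_processed = decode_frame(stream, target=11)
print(picture == frames[11], frames_processed)
```

Jumping straight to a frame means chewing through every frame since the last I-frame, so stretch the keyframe interval to 1,400 and a single seek can mean decoding over a thousand frames before anything shows up on screen.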
Now there’s another thing I mentioned I was a big fan of, called motion-compensation. This has been part of MPEG since the beginning and was further improved in MPEG-4 ASP (quarter-pixel precision, global motion compensation), and it brings in the idea of ‘motion vectors’. As I mentioned earlier, MPEG is block-based, so every block of the image gets a motion vector. I just love the concept of this thing; the encoder, instead of just saying “blocks a/b/d/z have changed in this frame”, tries to actually get a handle on what is actually in the scene and, if appropriate, just tells things to move around instead of changing by setting that block’s motion vector to something besides zero.
Think of the credits you watch at the end of a movie; as they scroll upwards, to the encoder this would mean the blocks above and below had to change. With motion-compensation, the encoder is able to get the idea into its head that these things aren’t actually changing, they’re just moving upwards, and it doesn’t need to actually store the data for those blocks; the decoder just needs to know to move them.
There are a ton of things where this comes into play; imagine wiggling your iSight a bit while you’re adjusting it, or moving your head slightly. In some cases the actual data will need to be changed, but often a lot of pixels can just be moved. If a movie pans to the side, same thing. Now it’s often not so simple as just saying “Move this object 5 pixels over in the next frame”, but it’s often able to do it with a lot of the pixel data even if stuff around it needs to be told to change.
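Here’s a rough sketch of how an encoder might hunt for a motion vector. It uses a brute-force block-matching search scored with a sum of absolute differences, which is one common textbook approach and not necessarily what any particular encoder does; the frame contents, block size, and search range are all made up for the example:

```python
# Brute-force block matching: find where a block in the current frame
# came from in the previous frame. Frames are 2D lists of pixel values.

def sad(prev, cur, py, px, cy, cx, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(prev[py + dy][px + dx] - cur[cy + dy][cx + dx])
    return total


def find_motion_vector(prev, cur, cy, cx, size, search):
    """Return ((dy, dx), cost) for the best match within +/- search pixels."""
    height, width = len(prev), len(prev[0])
    best = (0, 0)
    best_cost = sad(prev, cur, cy, cx, cy, cx, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            py, px = cy + dy, cx + dx
            if 0 <= py <= height - size and 0 <= px <= width - size:
                cost = sad(prev, cur, py, px, cy, cx, size)
                if cost < best_cost:
                    best, best_cost = (dy, dx), cost
    return best, best_cost


# A 16x16 "credits" frame, then the same content scrolled up by two rows.
prev = [[y * 16 + x for x in range(16)] for y in range(16)]
cur = [list(prev[y + 2]) if y + 2 < 16 else [0] * 16 for y in range(16)]

vector, cost = find_motion_vector(prev, cur, cy=4, cx=4, size=4, search=3)
print(vector, cost)
```

The vector says “this block is what used to sit two rows further down”, and a cost of zero means there’s nothing left over to store for it. Real footage rarely matches that cleanly, which is where the leftover difference data comes in.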
Going back to being a block-based compression scheme, for the most part H.264/AVC takes these kinds of things, then just goes to a new level with them for its improvements. It’s still block-based, but whereas before the encoder broke the image up into 16×16 pixel squares, the new codec keeps these “Macroblocks” of 16×16 but also allows the encoder to ‘split’ them even further like so:
* Two 16×8 pixel blocks
* Two 8×16 pixel blocks
* Four 8×8 pixel blocks
If your Macroblock has been split into four, these can then be broken down even further:
* Two 8×4 pixel blocks
* Two 4×8 pixel blocks
* Four 4×4 pixel blocks
When you stop and think about that, the options the encoder has up its sleeve have been increased dramatically, going from being able to work with 16×16 pixel blocks down to 4×4, or from one block shape at its disposal to seven. This is a big, big deal and is probably one of the biggest gains with H.264/AVC in terms of quality and reducing artifacts.
When faced with something like a 16×16 block that happened to be an edge in the image of some sort, say a black sweater on a grey background, half the block might be black and half might be grey and when compressed at a specific bitrate ends up looking like a smeared and blurry block; artifacting. H.264/AVC is able to hopefully split that 16×16 block in an optimal way; it may decide it needs to split the block into 4×4 chunks, but it might just as well be able to split it into two 16×8 pixel blocks, which means the grey half looks better and while there might be some smearing in the other block it’s vastly reduced over what it could have been.
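Counting the shapes listed above is simple enough to do in code. The constant and function names here are just mine, and widths come first in each pair:

```python
# The block shapes described above, written as (width, height) in pixels.
MACROBLOCK_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]
SUB_PARTITIONS = [(8, 8), (8, 4), (4, 8), (4, 4)]    # splits of one 8x8 quarter


def blocks_per_macroblock(width, height):
    """How many blocks of this shape tile a full 16x16 macroblock."""
    return (16 // width) * (16 // height)


shapes = sorted(set(MACROBLOCK_PARTITIONS) | set(SUB_PARTITIONS), reverse=True)
for width, height in shapes:
    print(f"{width}x{height}: {blocks_per_macroblock(width, height)} per macroblock")
print(len(shapes), "distinct shapes")
```

At the 4×4 extreme a single macroblock is carved into sixteen pieces, each free to move on its own, which is a lot more to search through than one 16×16 square.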
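The sweater example can be faked with a crude stand-in for what the encoder is weighing up. To be clear about the assumptions: real encoders score splits using rate and distortion on the motion-compensated leftovers, not the ‘how badly does one flat colour fit this region’ measure used here, and the pixel values and names are invented:

```python
# A 16x16 block that is black (0) on top and grey (128) on the bottom.
block = [[0] * 16 for _ in range(8)] + [[128] * 16 for _ in range(8)]


def flat_error(pixels_2d, top, left, height, width):
    """Squared error left over if a region is replaced by its average value."""
    pixels = [pixels_2d[y][x]
              for y in range(top, top + height)
              for x in range(left, left + width)]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels)


def split_error(pixels_2d, part_width, part_height):
    """Total leftover error when the macroblock is tiled with one shape."""
    return sum(flat_error(pixels_2d, top, left, part_height, part_width)
               for top in range(0, 16, part_height)
               for left in range(0, 16, part_width))


# Shapes are (width, height): whole block, two wide halves, two tall halves.
errors = {shape: split_error(block, *shape)
          for shape in [(16, 16), (16, 8), (8, 16)]}
best = min(errors, key=errors.get)
print(errors, "-> best split:", best)
```

Splitting along the edge leaves each half perfectly uniform, while splitting across it helps not at all. Picking the right shape matters as much as having small ones available.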
There’s also some pretty nifty network stuff in H.264 which I won’t go heavily into either, mostly because I’m too stupid to understand all of what it’s doing with slices and such… but it has significantly cleaned up a bunch of the complexity and, weirdly enough, actually includes a NAL (network abstraction layer), built with the internet and other devices in mind, right in the damn codec. This is just a damn trip; you can just slap this onto a fixed IP address and go. This is one of the reasons why you saw Intel giving talks on using it over 802.11b for video in the home, etc.
The NAL is only one part of the improvements on the streaming side; there are things in it like FMO (flexible macroblock ordering) which again I’m not even going to really touch on much, but it’s cool shite. As examples:
* Slices of the image can be grouped and sent over the network, so if, say, the image gets there but is missing a slice or two it can error correct and get that slice sent or use crazy interpolation methods to fake what it thinks is supposed to be there based on what’s next to it. I could go on about all the prediction stuff but, well, no real point as I’m sure you get the gist; while H.264/AVC is a big deal for PCs, embedded and broadcast guys are loving on it in a big way.
* There are some really weird slice types in the spec, like SP and SI (basically a switching-P-frame and a switching-I-frame) which allow the decoder to switch between streams of different bitrates using more prediction algorithms… trippy.
I have no shame in admitting that all the stuff going on in the VCL layer for streaming makes my hippocampus throb, but you should be able to get the idea that it’s some pretty slick stuff and a big improvement over where it was before, and anyways, I mentioned that there was a problem…
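To give a flavour of the ‘fake what’s missing from what’s next to it’ idea, here is a deliberately simple sketch. How a decoder hides a lost slice is up to the decoder, so everything here is an assumption: the picture is a list of pixel rows, a lost slice is a run of rows marked None, and the gap is filled by blending the rows directly above and below it:

```python
def conceal_missing_rows(picture, first_missing, last_missing):
    """Fill rows first_missing..last_missing by blending the rows around them.

    Assumes the lost rows are not at the very top or bottom of the picture.
    """
    above = picture[first_missing - 1]
    below = picture[last_missing + 1]
    steps = last_missing - first_missing + 2
    repaired = [list(row) if row is not None else None for row in picture]
    for step, y in enumerate(range(first_missing, last_missing + 1), start=1):
        weight = step / steps           # 0 = copy row above, 1 = copy row below
        repaired[y] = [round(a + (b - a) * weight)
                       for a, b in zip(above, below)]
    return repaired


# Six rows that get steadily brighter; rows 2 and 3 never arrived.
picture = [[y * 10] * 4 for y in range(6)]
picture[2] = picture[3] = None

repaired = conceal_missing_rows(picture, 2, 3)
for row in repaired:
    print(row)
```

On a smooth gradient like this the guess is spot on; on real footage it’s a patch-up job, which is why re-sending the lost slice is the better option when the network allows it.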
Another profile problem?
Going back to tradeoffs, all this stuff doesn’t come for free and you should be able to get the idea that H.264/AVC is going to put the absolute hurt on your computer, and it’s going to make a lot of people big fans of the G5 and what might be considered ‘extreme’ CPU speeds for everyday use as, if you have a 20″ screen, viewing a 320×240 movie trailer isn’t as appealing.
A lot of Apple’s line is already chugging a bit with higher-res MPEG-4 ASP files (a full screen 720×480 DivX file playing on an iMac will let you know the CPU is being used), let alone doing encoding, and we’re not gonna even talk about two-pass encoding. To keep it short, H.264/AVC is going to make its presence known to the CPU in a big way. A big, big way. How big of a way I’m not certain, as I don’t know a lot about Apple’s specific implementation and how/where/why they’re able to accelerate it, but it’s going to be brutal.
So you might be wondering, “Um, but Apple is using it for the new iChat in Tiger. So does that mean you’ll need a G5 to video conference?” which is a perfectly logical question to ask, but it’s a little more complex than that. If you remember from the MPEG-4 stuff, there were two main profiles: SP and ASP. With H.264/AVC, there are three profiles (in contrast to MPEG-4, which had ~50):
* Baseline
This was initially spun out as a royalty-free base profile for H.264; it’s the simplest to encode and simplest to decode, but doesn’t handle things that the broadcast market or someone doing streaming would care about. It’s great for point-to-point video conferencing, though.
* Main
Everything that’s in Baseline minus a couple of network-oriented features, but all kinds of acronyms I mentioned earlier and more are in this one. This is what you’ll see being used in High-Def set top boxes at some point, and what you’d want to use if you were creating something for playback on your own machine.
* Extended
Everything from Baseline and Main, with the exception of CABAC (Context-Adaptive Binary Arithmetic Coding). When I tried to figure out what the hell CABAC does things started throbbing again in places that don’t normally throb unless I’m hung over, but if you’re working with the type of stuff you’d normally use the Main profile for you get a nice gain in efficiency from it. Extended is also where those weird slice types I mentioned earlier come in (SP and SI). It’s pretty geared towards error and latency-prone environments, like streaming a movie trailer to your computer or your Palm/PocketPC.
I’m ~99% sure Apple will be using the Baseline Profile for iChat AV, which is much, much easier to encode and decode than Main Profile, but most people aren’t iChat’ing full screen. It might end up wiping out an older generation of hardware from using the new iChat, and your Mac might feel more sluggish, but we’ll have to see.
This unfortunately brings up my absolute biggest worry with H.264/AVC after the lack of supporting MPEG-4 ASP (which I still really, really want to see included!). There’s a meme floating around that basically says Apple chose not to spend any work on dealing with ASP because they realized H.264/AVC was coming down the pipe and wanted to throw all of their energies into that.
Well, alright, but if that’s so, do not repeat what happened with MPEG-4 by not having a fantastic implementation and only including the Baseline Profile in Quicktime. You might be tempted to do it, and figure programs like Cleaner will fill in the gaps for the pros and well, good-enough is good-enough for home users. Consider the effort a loss-leader if you have to, but I want the guys ripping their Simpsons’ episodes recommending using Quicktime for PC because of its fantastic quality. I’m really somewhat enamored of H.264/AVC, and it’s going to be huge. It has great buzz about it, but then again so did MPEG-4 and that has been all but squashed with poisonous mindshare.
That felt good. And, considering some of the demos Apple has been putting on at places like the NAB conference, chances are Main Profile will be included… but still, hit one out of the damn park on the quality this time, Apple. Quicktime is one bad move away from being called ‘decrepit’ and ‘beleaguered’ in general, there’s really no reason to hasten the outcries.
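If it helps, the three profiles can be laid out as sets of features. This is a simplified summary as I understand the spec, with shorthand feature names of my own choosing (ASO is arbitrary slice ordering, CAVLC is the simpler entropy coder every profile gets); check the standard itself before relying on it:

```python
# Simplified, from-memory summary of what each H.264/AVC profile allows.
BASELINE = {"I slices", "P slices", "CAVLC",
            "FMO", "ASO", "redundant slices"}

MAIN = {"I slices", "P slices", "B slices", "CAVLC", "CABAC",
        "interlace", "weighted prediction"}

EXTENDED = {"I slices", "P slices", "B slices", "SP slices", "SI slices",
            "CAVLC", "FMO", "ASO", "redundant slices",
            "data partitioning", "interlace", "weighted prediction"}

# The network-oriented features Main drops relative to Baseline.
dropped_by_main = BASELINE - MAIN

# What Extended is missing out of everything in Baseline and Main.
missing_from_extended = (BASELINE | MAIN) - EXTENDED

print("Main drops:", sorted(dropped_by_main))
print("Extended lacks:", sorted(missing_from_extended))
```

The set arithmetic lines up with the descriptions above: Main gives up the error-resilience tools, and the only thing Extended doesn’t inherit is CABAC.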
Enter HD-DVD
Moving on, there’s something really interesting to cover which you may have noticed from Apple’s page which I suggested you peruse before you started this; H.264/AVC has been ratified to be part of the HD DVD format. This is kind of confusing, as if you’d been paying attention to press releases lately you may have noticed that the upcoming High-Definition-DVD format seems to include more than one codec, namely:
* H.264/AVC
* Microsoft’s VC-9
* MPEG-2
This kind of confused the living hell out of me too, but as it turns out the new format really supports them all, it isn’t as though one is preferred or they are all in the running. Nope, they’ve all been ratified and included in the standard, meaning if you want to make a device that has HD-DVD support, your device has to play them all back.
Luckily they’re all fairly similar in nature, so the decoders for set top boxes don’t have to be too general purpose (which makes them more expensive) but it’s still kinda interesting and shows the breadth of support H.264/AVC is seeing, as I don’t feel like giving a bunch more examples regarding satellite companies and such. 😉
Open Standards
One last quick thing, and that’s in regards to “Open Standards”, as you see mentioned on Apple’s page. There seems to be some FUD out there regarding Windows Media 9, or VC-9, or WM-HD, or whatever it’s being called at the moment, that can be boiled down into:
* WM9 is some sort of also-ran codec, and H.264/AVC creams it
WM9 and WM-HD are excellent, excellent codecs. There are a lot of problems you could have with them, such as, say, the speed of their implementations, but the actual quality isn’t one of them. If anything, most might give an ever so slight nod in quality to Microsoft on this one over H.264/AVC, but that could well be due to their implementation being out there for a while longer. Either way, the difference is pretty much negligible, and it’s a high-quality codec which is why it was thrown into the HD-DVD standard and most can’t tell the difference between the two.
* H.264/AVC is based on ‘Open Standards’, and WM-HD is not
I’ll admit that ‘Open Standards’ might mean something different to me than how many others seem to interpret it. To myself, an open standard is one where you can go grab documentation and build your own, and if you follow the spec it should work with everyone else’s implementation who does the same. Something like TCP/IP would be an example, or HTTP.
Something like H.264/AVC would not, as what they’re really releasing is a standard people can buy into, if they pay the licensing fees. In order to get included into the HD-DVD spec, Microsoft had to open up the spec of their codec so others could license the ability to create their own encoders/decoders, just as you do with MPEG-4+.
The real difference here is one between committee-based codecs, where groups of companies get together and decide what they want the codec to look like (and sprinkle their own patents into, which you then have to pay license fees for) and company-based codecs working to the exact same end (and who include whatever patented technology they buy or create, and then sell you licenses for use). There’s zero difference really, except with who gets paid.
I’m actually glad Microsoft is in this race, it really needed more competition, and at the very least will hopefully help the MPEG group keep their eye on the prize as well as keep licensing costs down.
Wrapping up
There really is a lot to be excited about with the ushering in of H.264/AVC, even if you aren’t working with High-Definition video on a dually-G5, although with the advent of HD-DVDs coming (and Microsoft announcing support in Longhorn) you might well want to make sure that whatever Mac you’re purchasing is going to be able to handle the load for what you want to do with it.
More than anything I’m just hoping we don’t see a repeat of what happened with MPEG-4 ASP, where a great codec was given a lousy implementation on a platform that’s supposed to be geared for media creation. They can’t go narrow and deep on this one again.
It’s going to be another year until we actually have our hands on it. If the history with Panther is any indication, perhaps a revision of iChat AV and Quicktime will be released a while before Tiger is out the door, and users will have the option of paying $30 to keep it running when Tiger ships or getting it included for free.