
Hypervideo: Exploring Superfast Motion-Capture Teleconferencing


Oct 7, 2020

An exploration by HFC Labs


Background


When COVID-19 hit NYC and offices began to shutter in March, millions of people abruptly switched from in-person meetings to video conferencing for all of their meetings. The crunch of traffic strained ISPs and many conferencing services, and the value of high-quality conferencing rose tremendously overnight. Zoom’s stock, for example, has nearly tripled in the last six months.


During the worst of this initial crunch, I started to wonder if there might be an alternative to traditional video that could be more efficient over the network. Maybe it wouldn’t look as good, but enough detail could come through for facial expressions to register. My first thought was simply converting the video feed to ASCII characters and animating them. This had already been explored somewhat, though the results never looked that great. Another idea was a rotoscoped effect, transforming simple shapes to approximate video, a la Waking Life or A Scanner Darkly. There are some great apps out there that shoot video with an effect like this (Olli is fantastic), but it wasn’t clear to me how one would convert those still frames to something like an SVG with transforms (more on this idea later).


Finally, after seeing a demo of facial motion capture used by a smartphone to animate a 3D character, it clicked. Essentially, by tracking head movement in 3D space and transforms of specific anchor points on the face, you could animate a 3D model that looks however you want it to — much like Apple’s Animoji / Memoji. As far as I could tell, using that approach to replace video had not been done. Sure, you can FaceTime video chat with a cartoon bear face layered over your actual face, but you’re still sending video to the other device.


Thus the premise was formed: Is there a low-bandwidth way to ‘video’ chat that would appeal to a reasonably sized audience? Our goal was to test whether facial motion capture ‘video’ conferencing was a) possible and b) at least 10x smaller over the network. Here’s what we learned.


Motion Capture


Since Apple has invested a lot in the hardware and software for tracking faces, we looked to ARKit for examples of facial tracking as a first stop. Apple has its own demo app to showcase this functionality, and there are open-source projects that transmit the facial tracking data from an iOS device to Unity. ARKit tracks 52 blend-shape coefficients on the face (most come in left/right pairs), which makes for a decent model of expressions.


Animating a mesh based on your facial expressions is actually pretty simple to do, thanks to ARKit. With some help from trusty Stack Overflow we were able to center the user’s head in the frame without too much trouble. The next step was getting the data to another device and animating it there as well.


Viewing Peer Motion Capture Data


To get the proof of concept working, we started out by using Multipeer Connectivity to transmit the data, rather than sending data over the Internet.


Our payload size was initially about 2.5 KB per frame, or 75 KB/s at 30 fps. Netflix claims 1 hour of HD video is 3 GB of data transferred. Conservatively, let’s say 1.5 GB per hour (the lowest other estimate I’ve seen). That comes out to roughly 13.9 KB per frame, or 416.7 KB/s, making our initial payload about 5.6x smaller. To get the payload even smaller, we shortened the variable names and compressed the JSON using MessagePack, bringing it to 658 bytes per frame, or 19.74 KB/s: about 21x smaller than our conservative estimate of HD video. Note that this is at 30 fps; we have enough headroom to hit 60 fps and still (barely) meet our 10x target. Not bad!
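As a sanity check on the arithmetic above (taking “KB” as 1,000 bytes and using the conservative 1.5 GB/hour figure for HD video):

```python
# Sanity check of the bandwidth comparison, using the figures from the text.
FPS = 30

hd_bytes_per_sec = 1.5e9 / 3600               # 1.5 GB/hour of HD video
hd_bytes_per_frame = hd_bytes_per_sec / FPS   # ~13.9 KB per frame

initial_frame = 2500   # initial JSON payload, bytes per frame
packed_frame = 658     # after key shortening + MessagePack, bytes per frame

print(round(hd_bytes_per_sec / 1000, 1))                   # 416.7 (KB/s, HD video)
print(round(hd_bytes_per_sec / (initial_frame * FPS), 1))  # 5.6 (x smaller)
print(round(hd_bytes_per_sec / (packed_frame * FPS), 1))   # 21.1 (x smaller)
```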


Example payload:


{
  "eulerAngles" : {
    "x" : -0.1867765486240387,
    "y" : -0.28117412328720093,
    "z" : 0.11671698093414307
  },
  "worldPosition" : {
    "x" : 0.0054289456456899643,
    "y" : -0.0052993814460933208,
    "z" : -0.31366893649101257
  },
  "blendShapes" : {
    "b" : 0,
    "3" : 0.06923520565032959,
    "K" : 0.34852069616317749,
    "c" : 0.14012959599494934,
    ...
  }
}
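To make the shrinking step concrete: MessagePack is a third-party library, so here is a stdlib-only sketch of the same idea using struct, replacing ASCII JSON text with fixed-width binary floats. The frame contents are placeholders of ours (real frames carry ARKit’s actual blend-shape names), and a fixed field order lets this sketch drop the keys entirely, which MessagePack would keep in self-describing form:

```python
import json
import struct

# A hypothetical frame: head pose (6 floats) plus 52 blend-shape
# coefficients. Shape names and values are placeholders.
frame = {
    "eulerAngles": {"x": -0.1868, "y": -0.2812, "z": 0.1167},
    "worldPosition": {"x": 0.0054, "y": -0.0053, "z": -0.3137},
    "blendShapes": {f"shape{i:02d}": 0.1 * (i % 10) for i in range(52)},
}

verbose = json.dumps(frame).encode()  # what we sent per frame at first

# If sender and receiver agree on field order, the keys can be dropped and
# every value packed as a 4-byte little-endian float.
values = (
    list(frame["eulerAngles"].values())
    + list(frame["worldPosition"].values())
    + list(frame["blendShapes"].values())
)
packed = struct.pack(f"<{len(values)}f", *values)

print(len(verbose), len(packed))  # packed is a small fraction of the JSON size
```

The binary form here is 58 floats, 232 bytes; our actual MessagePack frames were larger (658 bytes) because they kept the shortened keys.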

Transmitting the data itself wasn’t too difficult, but rendering the face on the peer device required setting up an instance of SceneKit, initializing another facial mesh model, and animating it using the transmitted data.
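The rendering itself is SceneKit-specific, but the core step, feeding received coefficients into the peer model’s morph-target weights, is easy to sketch. A toy version (the function and the smoothing step are ours, not an ARKit or SceneKit API), with clamping to ARKit’s [0, 1] range and a simple low-pass filter to hide network jitter:

```python
def apply_blend_shapes(received, weights, smoothing=0.5):
    """Blend newly received blend-shape coefficients into the current
    morph-target weights. `weights` is a stand-in for the peer model's
    morpher state; values are clamped to [0, 1], and the low-pass filter
    keeps a late or dropped frame from causing a visible jump."""
    for name, value in received.items():
        value = min(1.0, max(0.0, value))
        prev = weights.get(name, 0.0)
        weights[name] = prev + smoothing * (value - prev)
    return weights

weights = {}
apply_blend_shapes({"jawOpen": 1.0}, weights)  # first frame: halfway there
apply_blend_shapes({"jawOpen": 1.0}, weights)  # converging toward 1.0
print(weights["jawOpen"])  # prints 0.75
```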


Once this was all working well over Multipeer, we looked to Pusher to transmit the data over the Internet, but found that the service rate-limits to 10 messages per second, which would look pretty choppy versus the typical 30 fps. Earlier this summer we used Twilio for audio and video transmission on an experimental real-time AR interior design app, and we’d planned to use Twilio for audio in Hypervideo. As it turns out, Twilio’s DataTrack API could support our 30 fps target without breaking a sweat, so we used it to transmit the motion capture frame data.
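Whatever the transport, the capture side has to downsample: ARKit can deliver anchors faster than the channel’s message budget (10/s for Pusher, 30/s for our target). A Bresenham-style decimation sketch (the function and its names are ours, not a Twilio API) picks which capture frames to forward using only integer arithmetic:

```python
def decimate(capture_fps, send_fps, n_frames):
    """Return the indices of capture frames to forward so the outgoing
    rate approximates send_fps. Excess frames are dropped, not queued,
    which keeps latency low. An error accumulator makes the selection
    exact without floating point."""
    keep, acc = [], 0
    for i in range(n_frames):
        acc += send_fps
        if acc >= capture_fps:
            acc -= capture_fps
            keep.append(i)
    return keep

print(decimate(60, 30, 8))  # prints [1, 3, 5, 7]: every other frame
```

For example, decimate(30, 10, 9) keeps [2, 5, 8], every third frame, which is exactly why a 10-message budget looks choppy against 30 fps capture.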


What’s it Like?



Serious GIF compression here ^


It’s actually kind of cool, in an early-’90s sci-fi movie sort of way. At 30 fps, performance is pretty reliable. At 60 fps we started running into network problems (perhaps pushing Twilio’s limits). We think 60 fps is possible with a custom server, but we decided not to pursue that just yet.


From a UX perspective, it’s still a ‘lower resolution’ experience overall. It’s more personal than shutting your camera off in a video conference, but without seeing eye movements, hand movements, and subtle expressions that aren’t being tracked using ARKit out of the box, it feels a bit removed from reality. That said, it’s pretty fun, and in testing I found it preferable to audio-only.


Future



Shan Huang’s Pose Animator


Incrementally, we could improve the experience in a few clear ways, notably by allowing customization of facial models and backgrounds. For now, we’ve stuck to the stock low-poly mesh that Apple offers and black backgrounds, but with a bit of work one could let users select from a set of facial models, or with a lot of work, let users customize their own, a la Memoji. Audio processing on voices could be interesting as well. It’s hard to imagine this as a business in its own right, but stranger things have happened. The idea of this kind of experience baked into another service, particularly around gaming, feels plausible, even if this approach’s reliance on iOS makes that a virtual nonstarter.


There are some interesting experiments in the space of using motion capture via webcam to animate characters, including this great demo called Pose Animator on the TensorFlow blog from Shan Huang. The facial tracking in the provided demo doesn’t capture expressions with as much nuance as one would want in a ‘video’ chat context, but that could be refined, potentially well beyond what iOS allows out of the box given the larger set of facial anchor points available (486 vs ARKit’s 52). The tradeoff then becomes data transfer.


Just recently, when this post was already drafted and scheduled to go live (October 5, 2020), Nvidia announced a technique that uses an image of a user as input, tracks facial movements, then uses neural networks to render new animation frames rather than streaming actual video. The results are pretty stunning: generally lighter on data transfer and often better-looking than the traditional H.264 video codec. One Twitter commenter put it well: “We’ll all be controlling digital face puppets of ourselves on video calls in the future!”



Conclusion


This has been a fun experiment in using existing technology in new ways, and it was interesting getting to know ARKit and SceneKit in more depth. Judging by the other concepts we found along the way, interest in alternatives to our current video conferencing technology is gathering. We’re excited to see how it develops, and what new interactions and experiences can be had.


Thanks to Peter Grates for bringing this crazy concept into reality, and the many open source contributors (linked above) working on the ideas that helped us along.



Learn more about software we build at HFC Labs and our work for clients at happyfuncorp.com.
