HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Inventing Virtual Meetings of Tomorrow with NVIDIA AI Research

NVIDIA Developer · Youtube · 80 HN points · 4 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention NVIDIA Developer's video "Inventing Virtual Meetings of Tomorrow with NVIDIA AI Research".
Youtube Summary
New AI breakthroughs in NVIDIA Maxine, a cloud-native video streaming AI SDK, slash bandwidth use while making it possible to re-animate faces, correct gaze, and animate characters for immersive and engaging meetings. Learn more: https://nvda.ws/3l9foIn

Watch the full GTC 2020 keynote: https://nvda.ws/30Fa4Vl

#NVIDIA #GTC20

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
How much data would be available? Apple have been doing stuff adjacent to this NVIDIA announcement (116 bytes/frame of key-point data, with a GAN or a 3D model for reconstruction) since at least last year's Memoji: https://youtu.be/NqmMnjJ6GEg
NVIDIA just announced this AI-based video conference system that addresses the problem of gaze correction and 3D reconstruction, as well as a bunch of other stuff.

https://developer.nvidia.com/maxine

The 3D reconstruction piece is interesting -- it ends up being a video compression algorithm as well. It replaces a video codec with a GAN. It takes an initial high-res snapshot of your face (a keyframe), then maps out a set of keypoints which get transmitted to the other end. The algorithm at the destination uses a GAN to reconstruct the face. Because keypoints (vectors instead of raster) are an order of magnitude smaller than images, this results in a very high level of compression and very low latency video conferencing on low-bandwidth connections.
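A rough sketch of that flow is below. The interfaces (camera.read(), detector.keypoints(), generator.render(), channel.send()) are hypothetical placeholders, not NVIDIA's actual SDK; this just shows the shape of the idea.

```python
def sender(camera, channel, detector):
    keyframe = camera.read()                  # one-time high-res snapshot
    channel.send(("keyframe", keyframe))      # full image, sent once
    while True:
        frame = camera.read()
        kp = detector.keypoints(frame)        # a handful of landmark vectors
        channel.send(("keypoints", kp))       # orders of magnitude smaller than pixels

def receiver(channel, generator):
    keyframe = None
    for kind, payload in channel:
        if kind == "keyframe":
            keyframe = payload                # source appearance to animate
        elif keyframe is not None:
            # the GAN re-renders the keyframe in the pose given by the keypoints
            yield generator.render(keyframe, payload)
```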

https://www.youtube.com/watch?v=NqmMnjJ6GEg

This idea is not unlike Apple's Memoji which uses facial recognition to create an animated emoji, except this reproduces real faces over the wire.

Oct 06, 2020 · 13 points, 6 comments · submitted by janosett
glial
IMO this is a fantastic use of this technology. The bandwidth is low enough to enable videoconferencing on a dial-up connection. I only wonder how big the model is - gigabytes I bet.
Fronzie
The article states that it does send over at least one key-frame. That might be used to find coefficients in a much smaller model, similar to how Active Appearance Models approximate a lot of faces with a relatively small model.
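A toy illustration of that coefficient idea, in the spirit of Active Appearance Models but on synthetic data (this is not the actual AAM fitting pipeline): a small linear basis plus a handful of coefficients can describe each new face.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "faces" that really do live in a 20-dimensional subspace
latent = rng.normal(size=(200, 20))
mixing = rng.normal(size=(20, 64 * 64))
faces = latent @ mixing + 0.01 * rng.normal(size=(200, 64 * 64))

mean = faces.mean(axis=0)
# PCA via SVD: an orthonormal basis for the centered data
_, _, vt = np.linalg.svd(faces - mean, full_matrices=False)
basis = vt[:20]                                  # keep the top 20 components

new_face = faces[0]
coeffs = basis @ (new_face - mean)               # 20 numbers instead of 4096 pixels
reconstruction = mean + basis.T @ coeffs
print(np.abs(new_face - reconstruction).max())   # small relative to pixel values
```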
BHSPitMonkey
Existing discussion: https://news.ycombinator.com/item?id=24694565
pkulak
This and other ways of using ML for video compression (like DLSS) scare me a bit because it kinda screws with reality. Like, with MPEG compression, you just see less detail when there's less bandwidth and you know exactly what you're missing. But we seem to be moving to methods where instead of removing the missing data, we fill it in with what should most likely be there. There's something that just seems wrong with that.
glial
> instead of removing the missing data, we fill it in with what should most likely be there

Just like the blind spot caused by your optic nerve!

rasz
It will be fine as long as codecs still encode the difference between the model prediction/ML hallucination and the ground truth. We don't need another Xerox compression fiasco http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...
Oct 06, 2020 · 3 points, 0 comments · submitted by TomAnthony
Oct 06, 2020 · 59 points, 21 comments · submitted by andygcook
emerged
Basically it's not video anymore, it's motion capture applied to an avatar, where the default avatar is your original face.

It seems like this could /also/ be used for video by using this technique along with residual coding.

nullc
> It seems like this could /also/ be used for video by using this technique along with residual coding.

Not trivially-- in the pixel or DCT domain the residual would almost certainly be extremely non-sparse, and would take a lot of bits-- potentially similar to just sending an image (for a given target MSE level). Consider, edges (other than the keypoint controlled ones) aren't even in exactly the same place-- so the residual doesn't just need to code the edge, it needs to code it twice. This has been one of the big impediments in using 'synthesis' techniques in video coding generally.
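A toy numpy check of that density argument (illustrative only; the one-pixel shift and the DCT are stand-ins for the general case): a feature the GAN draws one pixel off shows up twice in the residual, and smears across nearly all DCT coefficients.

```python
import numpy as np
from scipy.fft import dct

truth = np.zeros(16)
truth[8] = 1.0                      # a thin bright line at column 8
synthesized = np.zeros(16)
synthesized[9] = 1.0                # the GAN draws the same line at column 9

residual = truth - synthesized      # +1 at col 8, -1 at col 9:
print(np.count_nonzero(residual))   # the feature is coded twice (2 nonzeros, not 1)
# and the residual is spread across nearly all 16 DCT coefficients
print(np.count_nonzero(np.abs(dct(residual)) > 1e-9))
```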

It might be possible to code a residual in some latent NN space and get more useful results, however.

guiambros
This is pretty phenomenal. Just the face realignment and the reduced bandwidth consumption can drastically change the videoconferencing experience.

Looking forward to having these available in consumer hardware soon.

More examples here:

https://developer.nvidia.com/maxine

jack_arleth
Some comments have touched on possible issues such as swapping in a key-frame of someone else's face, and possible funky effects from introducing other faces and/or objects into the camera image.

But I haven't seen anybody touch on the compute cost required to implement this. As I'm not in the machine learning field I don't have a good idea what the compute cost is for something like this. Can anybody chime in on that?

If this "codec" were to require a somewhat beefy GPU I don't see the benefits at all. Current H.264 is usually done with hardware decode and sometimes even encode. In areas where bandwidth is constrained I would imagine a lack of computing resources, thus nullifying the entire premise. That said, in current times it would save a substantial amount of data transmitted. But I'm not sure if we should lock in our entire videoconferencing system to NVIDIA just to save some bandwidth.

ksec
What sort of latency are we looking at for these AI regenerative videos?

I thought comparing it in KB per frame was a strange way to measure it, since video codecs are usually measured like network throughput, in kbps or Mbps.

So the video codec comparison was actually at 50 kbps, which is indeed a very low bitrate. But this was done with H.264, which is now nearly 20 years old. Modern codecs like HEVC and VP9, or state-of-the-art ones like AV1 and VVC, would have done much, much better.
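For a rough sense of scale, here is the 116 bytes/frame figure quoted earlier on this page converted into a bitrate (30 fps is an assumed frame rate, not stated in the video):

```python
# Back-of-the-envelope: per-frame keypoint payload -> bitrate
bytes_per_frame = 116
fps = 30  # assumption
kbps = bytes_per_frame * 8 * fps / 1000
print(f"{kbps:.1f} kbps of keypoint data")  # ~27.8 kbps, vs the ~50 kbps H.264 reference
```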

Next problem: would this only work on NVIDIA GPUs? Apple are already doing something similar in FaceTime, but only with respect to eye contact. Are we entering an era where even AI video codecs are bound to specific devices?

I used to hope and wish Apple would introduce these kinds of features to the iPhone. But their actions and responses around the App Store are making me wary.

villgax
The GAN is similar to the unsupervised one used to create deepfakes by Aliaksandr et al. The catch is that if the subject moves a lot w.r.t. the original frame it creates hilarious artefacts. But still great, sure, if you have GPUs on each end.
a_e_k
This one has some pretty distracting artifacts to my eye. The red and white pattern on the first woman's shoulder blurs and swims, while the door latch behind the woman with the mask follows her shoulder and even changes size.
ageitgey
The unspoken elephant in the room is obviously that it doesn't even have to be your face that is being animated in the video call. You could swap out the first keyframe image and appear to be any other real person during the video call with the same fidelity. Sounds great for corporate espionage and lurking on calls that you shouldn't be on.

I don't think it's fair to call this video compression as much as real-time photo-realistic animation via motion capture.

gsnedders
They literally have such an example in the video; it's definitely not unspoken!
Jasper_
The magic is knowing when to take a new reference photo, in case someone else walks in, or I drink from a cup of coffee, or hold up an object to the camera. At which point we're almost back to H.264, except it's unclear if that will work without additional training.
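One speculative way such a refresh heuristic could look (not part of Maxine; the interfaces are hypothetical and match the sketch further up the page): the sender has the ground-truth frame, so it can run the reconstruction locally and re-send a full reference frame when the error gets too large.

```python
def maybe_refresh_keyframe(frame, keyframe, detector, generator, channel,
                           threshold=20.0):
    """Send keypoints normally, but fall back to a fresh keyframe when the
    sender-side reconstruction drifts too far from the real camera frame,
    e.g. because a cup or a second person appeared."""
    kp = detector.keypoints(frame)
    preview = generator.render(keyframe, kp)   # sender-side reconstruction
    error = ((preview - frame) ** 2).mean() ** 0.5
    if error > threshold:
        channel.send(("keyframe", frame))      # costs roughly an I-frame
        return frame                           # new reference from now on
    channel.send(("keypoints", kp))
    return keyframe
```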
IshKebab
Really need to see failure modes before judging this. What happens if you actually move your head?
PretzelPirate
One step closer to having decent VR meetings. I don’t want to see someone’s avatar, I want a virtual representation of their face that looks like they’re really talking.
BHSPitMonkey
Not really; the only thing making this hard to do with VR today is a lack of means to capture the HMD wearer's face, because there's stuff in the way, and needing a dedicated/stationary camera pointed at your face while you're in VR is just impractical.

Future headsets might start to implement some Leap-esque sensors pointed at the user's chin and some eye tracking cameras inside the headset to address this eventually, but that's going to be prohibitively costly for some time (not just in terms of dollars, but in added weight/heat as well).

tipoftheiceberg
I love seeing new technologies like this emerge during these changing and unfamiliar times.

FaceTime calls on a remote satellite internet setup will be revolutionary.

jaimex2
The main issue with satellite is latency, not bandwidth. I don't think this would help.
kyriakos
would probably make it worse
jaimex2
Limitations are pretty crippling for real world use.

Need a fixed camera, one face, and a fairly static background, so there goes mobile or conference-room use.

ragebol
Trading bandwidth for compute.

Reminds me of a section in Hofstadter's 'Godel, Escher, Bach' about there being knowledge in the signal vs. the receiver, or something akin to that.

pilooch
Yes, that's 250W of consumption on both sides for a one-to-one video call. Not sure it's a good trade-off...
dogma1138
I highly doubt that this takes 250W.... none of the broadcast "RTX" features NVIDIA has released so far seem to tax the GPU in any meaningful way.
LeoPanthera
FaceTime in iOS 14 also includes a feature that makes it appear that you are looking into the camera even when you are not.

https://appleinsider.com/articles/20/06/22/facetime-eye-cont...

Oct 06, 2020 · 5 points, 0 comments · submitted by ghosh
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.