HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
[RIFEv1.0: 24FPS to 96FPS] Video frame interpolation, GPU real-time flow-based method

黄哲威 · Youtube · 112 HN points · 2 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention 黄哲威's video "[RIFEv1.0: 24FPS to 96FPS] Video frame interpolation, GPU real-time flow-based method".
Youtube Summary
Our model can run 30+FPS for 2X 720p interpolation on a 2080Ti GPU. Currently our method supports 2X/4X interpolation for video, and multi-frame interpolation between a pair of images. Everyone is welcome to use this alpha version and make suggestions!

Github: https://github.com/hzwer/arXiv2020-RIFE
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
There are already some impressive models that can render animations at higher framerates, for example DAIN[0] and RIFE[1].

DAIN video: https://www.youtube.com/watch?v=q2i6FXVjNT0

RIFE video: https://www.youtube.com/watch?v=lqtqmP46LaA

[0] https://github.com/baowenbo/DAIN

[1] https://github.com/hzwer/arXiv2020-RIFE

majewsky
And here's the obligatory rebuttal against such tweening models from an animator: https://www.youtube.com/watch?v=_KRb_qV9P4g
Nov 15, 2020 · 112 points, 30 comments · submitted by hzwer
TaylorAlexander
I just wanted to point out that the latest (RTX+) Nvidia cards support hardware dense optical flow calculation, which sounds really fast based on their docs. I quickly checked the repo for TFA but didn't see any mention of it. It would be interesting to see their method adapted to use hardware-calculated optical flow, if that would help at all.

Here’s the NVIDIA library: https://developer.nvidia.com/opticalflow-sdk

I’ve played with the NVENC hardware and it’s fast! I’ve not specifically tried the hardware optical flow calculation but docs say it can process flow for 4k video at 150fps!
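
For readers unfamiliar with the general idea being discussed, here is a minimal sketch of flow-based interpolation: estimate a dense optical flow field between two frames, then warp each frame halfway along that flow and blend the results. It uses OpenCV's CPU Farneback estimator purely as a stand-in for a flow source (hardware or learned); it is not the method in the video, just an illustration of the flow-then-warp idea, and the Farneback parameters below are arbitrary choices.

```python
# Toy flow-based frame interpolation: estimate dense flow between two frames
# and warp each frame halfway toward the other to synthesize a middle frame.
# Farneback (CPU) stands in here for a hardware flow engine or a learned flow
# network; treat this only as an illustration of the general idea.
import cv2
import numpy as np

def interpolate_midpoint(frame0, frame1):
    """Synthesize a rough middle frame between two equally sized BGR frames."""
    gray0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)

    # Dense flow from frame0 to frame1, one 2D displacement vector per pixel.
    # Positional args: pyr_scale, levels, winsize, iterations, poly_n,
    # poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(gray0, gray1, None,
                                        0.5, 3, 21, 3, 5, 1.2, 0)

    h, w = gray0.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))

    # Naive midpoint warp: pull each source frame half a step along the flow
    # and average. No occlusion handling, so expect artifacts at object edges.
    half_u, half_v = 0.5 * flow[..., 0], 0.5 * flow[..., 1]
    warped0 = cv2.remap(frame0, xs - half_u, ys - half_v, cv2.INTER_LINEAR)
    warped1 = cv2.remap(frame1, xs + half_u, ys + half_v, cv2.INTER_LINEAR)
    return cv2.addWeighted(warped0, 0.5, warped1, 0.5, 0)
```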

da-x
Perhaps not impossible - sometime after 2040, a deep learning model reconstructs the entire "Titanic" (the James Cameron film) from a total of 1,000 frames, the script, and voice samples of the actors.
drran
After 2080, it will be able to reconstruct your life from a photo and your comments on HN.
app4soft
Does this mean I could record a desktop screencast at 15 FPS (so as not to overload the GPU) and then convert it to 60 FPS?
scoopertrooper
Maybe? It might struggle with highly detailed, text-heavy workloads though. You can see from their GitHub that it seems to copy stable sections of the video to areas where movement is taking place and applies some sort of transformation to them. This could create some curious artefacts in a screencast. It'd probably be fine for gaming though.
app4soft
> It'd probably be fine for gaming though.

Yeah, that would be the good case.

scoopertrooper
Just keep in mind you wouldn't be able to do this in real time.
drran
It could be done in real time if the GPU provided a low-detail sketch of the next frame and then kept working on it, filling the gap between full frames with optical-flow frames.

Also, instead of producing a static image, the GPU could calculate and show an animated image, e.g. an MP4 fragment.

varenc
That’s definitely possible, but you won’t be able to do this in real time without more GPU power. And the frame interpolation techniques might not work as well on a desktop recording.

If you can record and process later, you might try recording your desktop in a more raw format. It’ll be very large on disk, but this avoids the need to transcode the recording in real time and strain your GPU/CPU. In ffmpeg just use `-c:v copy` to capture it raw. (Assuming the transcoding is the main limiting factor)
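
As a concrete sketch of that capture-now, transcode-later workflow: the script below assumes Linux with X11 (ffmpeg's x11grab device), a 1920x1080 display at :0.0, and placeholder file names and durations. On macOS or Windows the capture device and options differ, so treat this only as a starting point rather than a recipe.

```python
# Sketch: capture the desktop without transcoding, then encode offline.
# Assumes Linux/X11 and an ffmpeg build with x11grab; display name,
# resolution, duration and file names are placeholders.
import subprocess

RAW = "capture_raw.nut"   # NUT container can hold the raw video stream
OUT = "capture_h264.mp4"

# Step 1: grab the screen at 15 fps and stream-copy it to disk (-c:v copy),
# so no encoding happens during capture. Expect very large files.
subprocess.run([
    "ffmpeg", "-y",
    "-f", "x11grab", "-framerate", "15",
    "-video_size", "1920x1080", "-i", ":0.0",
    "-t", "30",              # stop after 30 seconds for this demo
    "-c:v", "copy", RAW,
], check=True)

# Step 2: later, when the GPU/CPU is free, encode the raw capture to H.264
# (this is also the point where you would run frame interpolation to 60 fps).
subprocess.run([
    "ffmpeg", "-y", "-i", RAW,
    "-c:v", "libx264", "-crf", "18", "-preset", "medium", OUT,
], check=True)
```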

pastrami_panda
The quality looks very good. From a quick glance I did find one example where the interpolation makes the video less coherent:

https://youtu.be/lqtqmP46LaA?t=28

At first I watched the interpolated video and didn't quite understand the movement of the hockey stick closest to the camera; looking at the original video, I found it more coherent in some sense. Overall quite amazing though.

rasz
The training data probably didn't contain any hockey matches, or, more generally, flat surfaces being rotated quickly.
hzwer
I made another demo: https://www.youtube.com/watch?v=kUQ7KK6MhHw We trained a more robust model, and it works very well on video game clips.
codelord
Huh. When I look at the 15 fps side on the left, it looks normal and doesn't flicker, as though my brain adapts to the frame rate and just fills in the gaps a bit. When I look at the right side of the video, the left side looks super flickery!
gazab
Related but not entirely the same model: Boosting Stop-Motion to 60 fps using AI: https://www.youtube.com/watch?v=sFN9dzw0qH8
skwb
IMHO it is missing the fully sampled solution (the ground-truth 60 fps footage). It's sorta hard to discern the quality of this work without that benchmark.
rasz
The goal is usually passing human scrutiny, not necessarily faithfully replicating the ground truth.
RhysU
Is this sort of upsampling essentially what the eye sees when it "sees" movement?
sorenjan
Our eyes and brain don't see frames; they see changes in light intensity. There are cameras that try to mimic this behavior, called event cameras. I think they're going to see much more use in robotics in the future, but at the moment they're mainly (or only) used in research.

https://www.prophesee.ai/2019/07/28/event-based-vision-2/

https://www.youtube.com/watch?v=6Sn9-M7qXLk
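
A toy way to see the difference between frames and intensity-change events: the sketch below converts an ordinary grayscale frame sequence into event-style output by firing a per-pixel event whenever the log intensity has changed by more than a contrast threshold. The threshold value is an arbitrary assumption, and real event sensors do this asynchronously in hardware with microsecond timestamps; this is only a frame-based approximation of the idea.

```python
# Toy, frame-based simulation of event-camera output: emit (x, y, polarity)
# events wherever the log intensity changed by more than a contrast threshold
# since the last event at that pixel. Real event sensors do this per pixel,
# asynchronously and with fine-grained timestamps; this only mimics the idea.
import numpy as np

CONTRAST_THRESHOLD = 0.15  # arbitrary assumption, not a real sensor spec

def events_between(reference_log, frame):
    """Return (events, updated reference) for one new 8-bit grayscale frame."""
    log_frame = np.log(frame.astype(np.float32) + 1.0)
    diff = log_frame - reference_log

    ys, xs = np.nonzero(np.abs(diff) >= CONTRAST_THRESHOLD)
    polarity = np.sign(diff[ys, xs]).astype(np.int8)   # +1 brighter, -1 darker
    events = np.stack([xs, ys, polarity], axis=1)

    # Pixels that fired reset their reference level; others keep accumulating.
    reference_log[ys, xs] = log_frame[ys, xs]
    return events, reference_log

# Usage sketch, with placeholder frame arrays:
#   ref = np.log(first_frame_gray.astype(np.float32) + 1.0)
#   for frame in later_frames_gray:
#       ev, ref = events_between(ref, frame)
```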

toxik
Event cameras are certainly interesting, but they have their own set of challenges. I do not believe they’re a better choice in any general sense, but can certainly deliver some impressive performance.
iandanforth
Yes and no. We do hallucinate both temporal and spatial content from any scene we perceive, but there's also a big part of the brain that causes you to ignore missing information. For example, during a saccade (rapid eye movement) you're basically blind; the brain just edits out of your perceptual stream the part where the world is an incomprehensible blur.
scoot
Expanding on this: if you've ever glanced at a ticking clock and the second hand seemed frozen for a moment, that was the brain backfilling the gap (after the fact!) left by the saccade.

I was aware of the phenomenon (from personal experience) but not of the cause; there was a post here relatively recently that went into the details. Apologies that I can't now find it.

InvaderFizz
I can't find the HN post, but I believe this was the tweet thread that it referenced: https://threadreaderapp.com/thread/1014267515696922624.html
scoot
That's the one - thanks!
derekhsu
Looks great. This deserves more attention.
andy_ppp
Really cool, particularly the interpolation of two images to 16 frames on their GitHub here: https://github.com/hzwer/arXiv2020-RIFE
scoot
The occluded objects (car behind the pole for example) are particularly impressive.
interestica
Those are from two images? That's just bonkers.
RhysU
It would be more compelling contrasted with ones that did not work.
amelius
I suppose a wheel with spokes, when turning at the right speed, would not work due to aliasing (the wheel would appear to be motionless in both the original and the interpolated video).
extr
That's probably true, but it feels like somewhat of a trivial example... I think there would be some very interesting things to test here. When you get down to something like a 16-frame interpolation of 2 stills, the model is essentially guessing, based on context, what the interpolated frames should look like. This starts to verge into computational photography territory, where the model is supplying its own interpretation of the action based on a human-like semantic understanding of the scene. As someone with an interest but not a career in bleeding-edge machine learning, I would be curious to get an intuitive sense of how much of this is going on.

Interesting boundary cases might be visuals of physical processes with inherently "chaotic" small-scale behavior. For example:

* What would happen if you fed the model two stills of a drop of food coloring expanding in water? Would it wholesale invent chaotic action that is obviously only one of many solutions but plausibly interpolates between the two states? Maybe not in its…

* Fireworks can completely change visually on 2-3 frame timescales. The underlying process is immediately recognized and could easily be imagined by most people; does the model understand the context here?

Maybe it wouldn't do well right now, but how much better could performance get on the above if there were more examples in the training set?

* In the opposite direction of chaos, it might be interesting to look at something like two photos of a starry night taken 30 minutes apart. Based on the two photos, can the model understand the geometry of the scene and rotate the points correctly?

I would also like to know what would happen if, instead of using frames next to each other, you took frames further and further apart. For 1 second of a 30 FPS video, could you give it frames 1, 15, and 30 and ask it to find the other 27? How about 5 seconds of a 30 FPS video, giving it frames 1, 30, 60, 90, 120, and 150? Etc etc

It looks like they have a Colab notebook, so perhaps I should quit writing and start playing around!
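
The "give it frames 1, 15, and 30" questions above map naturally onto the recursive scheme that a pairwise 2X model implies (and that the 2X/4X modes in the video description suggest): synthesize a midpoint, then recurse on each half. A rough sketch, where `interpolate_middle` is a placeholder for whatever model is plugged in (RIFE, DAIN, or the toy flow warp sketched earlier), and which is not claimed to match the repo's exact implementation:

```python
# Recursive bisection: given any model that synthesizes one in-between frame
# for a pair of images, produce 2**depth - 1 evenly spaced in-between frames.
# `interpolate_middle` is a placeholder for the actual model call; frames are
# whatever array type that model expects.
from typing import Callable, List, TypeVar

Frame = TypeVar("Frame")

def interpolate_recursive(frame_a: Frame, frame_b: Frame, depth: int,
                          interpolate_middle: Callable[[Frame, Frame], Frame]
                          ) -> List[Frame]:
    """Return the frames strictly between frame_a and frame_b, in order."""
    if depth == 0:
        return []
    mid = interpolate_middle(frame_a, frame_b)
    left = interpolate_recursive(frame_a, mid, depth - 1, interpolate_middle)
    right = interpolate_recursive(mid, frame_b, depth - 1, interpolate_middle)
    return left + [mid] + right

# depth=1 is the 2X case, depth=2 the 4X case; depth=4 yields 15 in-between
# frames (16 even intervals between the pair). Note the scheme only fills
# power-of-two gaps: filling, say, the 13 frames between given frames 1 and 15
# would need a model conditioned on an arbitrary timestep, or resampling of
# the bisection output afterwards.
```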

andy_ppp
Yes, it would be good to see results with two images of fireworks.
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.