Hacker News Comments on
[RIFEv1.0: 24FPS to 96FPS] Video frame interpolation, a GPU real-time flow-based method
黄哲威
·
YouTube
·
112 HN points
·
2 HN comments
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this video.

There are already some impressive models that can render animations at higher framerates, for example DAIN [0] and RIFE [1].

DAIN video: https://www.youtube.com/watch?v=q2i6FXVjNT0
RIFE video: https://www.youtube.com/watch?v=lqtqmP46LaA
⬐ majewsky
And here's the obligatory rebuttal against such tweening models from an animator: https://www.youtube.com/watch?v=_KRb_qV9P4g
⬐ TaylorAlexander
I just wanted to point out that the latest (RTX+) Nvidia cards support hardware dense optical flow calculation, which sounds really fast from their docs. I quickly checked the repo for TFA but didn't see mention of it. It would be interesting to see their method redesigned to utilize hardware-calculated optical flow, if that would help at all.

Here's the NVIDIA library: https://developer.nvidia.com/opticalflow-sdk

I've played with the NVENC hardware and it's fast! I've not specifically tried the hardware optical flow calculation, but the docs say it can process flow for 4K video at 150fps!
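To make the comment above concrete: a dense optical flow field is one (dx, dy) motion vector per pixel, and an in-between frame can be approximated by warping an existing frame along scaled flow vectors. This is a toy pure-Python sketch of that warping step only; the hard part (estimating the flow itself) is what the NVIDIA SDK or RIFE's network does on the GPU.

```python
def warp(frame, flow, t):
    """Backward-warp `frame` by t * flow (nearest-neighbour, clamped edges).

    frame: 2-D list of pixel values; flow: 2-D list of (dx, dy) vectors.
    t = 0.5 produces the frame halfway between this frame and the next.
    """
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(int(round(x - t * dx)), 0), w - 1)
            sy = min(max(int(round(y - t * dy)), 0), h - 1)
            out[y][x] = frame[sy][sx]
    return out

# A 1x4 "image" whose single bright pixel moves 2 px right per frame:
frame0 = [[0, 255, 0, 0]]
flow = [[(2, 0)] * 4]           # every pixel moves +2 in x
half = warp(frame0, flow, 0.5)  # the frame halfway in between
print(half)  # → [[0, 0, 255, 0]]
```

Real interpolators also blend warps from both neighbouring frames and inpaint occlusions, which this toy version ignores.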
⬐ da-x
Perhaps not impossible - sometime after 2040, a deep learning model reconstructs the entire "Titanic" (the James Cameron film) from a total of 1,000 frames, the script, and voice samples of the actors.

⬐ drran
After 2080, it will be able to reconstruct your life from photos and comments on HN.

⬐ app4soft
Does it mean that I may record a desktop screencast in 15FPS (to not overload the GPU) and then convert it to 60FPS?

⬐ pastrami_panda
Maybe? It might struggle with highly detailed, text-heavy workloads though. You can see from their GitHub that it seems to copy stable sections of the video to areas where movement is taking place and does some sort of transformation on that. This could create some curious artefacts in a screencast. It'd probably be fine for gaming though.

⬐ app4soft
> It'd probably be fine for gaming though.

Yeah, that would be the good case.

⬐ scoopertrooper
Just keep in mind you wouldn't be able to do this in real time.

⬐ drran
It can be done in real time, if the GPU provides a low-detail sketch of the next frame and then continues working on it, filling the gap between full frames with optical-flow frames. Also, instead of producing a static image, the GPU could calculate and show an animated image, e.g. an mp4 fragment.

⬐ varenc
That's definitely possible, but you won't be able to do this in real time without more GPU power. And the frame interpolation techniques might not work as well on a desktop recording.

If you can record and process later, you might try recording your desktop in a more raw format. It'll be very large on disk, but this avoids the need to transcode the recording in real time and strain your GPU/CPU. In ffmpeg just use `-c:v copy` to capture it raw. (Assuming the transcoding is the main limiting factor.)
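On the 15 FPS → 60 FPS question: a model that only doubles frame rate (RIFE's basic mode works this way) is applied recursively, each pass inserting one synthesized frame between every adjacent pair. A quick sketch of the arithmetic:

```python
import math

def doubling_passes(src_fps, dst_fps):
    """Number of 2x interpolation passes to go from src_fps to dst_fps."""
    ratio = dst_fps / src_fps
    passes = math.log2(ratio)
    assert passes.is_integer(), "fps ratio must be a power of two for pure 2x passes"
    return int(passes)

def timestamps(passes):
    """Fractional positions of all frames between two original frames."""
    n = 2 ** passes
    return [i / n for i in range(n + 1)]

print(doubling_passes(15, 60))  # → 2 passes (15 → 30 → 60)
print(timestamps(2))            # → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Non-power-of-two ratios (say 24 → 60) need a model that can synthesize a frame at an arbitrary timestep t, which later RIFE versions support; the doubling assumption above is the simple case.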
Quality looks very good. I did find one example from a quick glance where the interpolation makes the video less coherent: https://youtu.be/lqtqmP46LaA?t=28
At first I watched the interpolated video and didn't quite understand the movement of the hockey stick closest to the camera - while looking at the original video I found that to be more coherent in some sense. Overall quite amazing though.
⬐ rasz
Training data probably didn't contain any hockey matches, or more generic flat surfaces being rotated quickly.

⬐ hzwer
I made another demo: https://www.youtube.com/watch?v=kUQ7KK6MhHw We trained a more robust model and it works very well on video game clips.

⬐ codelord
Huh. When I look at the 15 fps side on the left, it looks normal, doesn't flicker. As though my brain adapts to the frame rate and just fills in the gaps a bit. When I look at the right side of the video, the left side looks super flickery!

⬐ gazab
Related but not entirely the same model: Boosting Stop-Motion to 60 fps using AI: https://www.youtube.com/watch?v=sFN9dzw0qH8

⬐ skwb
IMHO it is missing the fully sampled (ground truth 60 fps) solution. It's sorta hard to discern the quality of this work without that benchmark.

⬐ rasz
The goal is usually passing human scrutiny, not necessarily faithfully replicating ground truth.

⬐ RhysU
Is this sort of upsampling essentially what the eye sees when it "sees" movement?

⬐ sorenjan
Our eyes and brain don't see frames, they see changes in light intensity. There are cameras that try to mimic this behavior; they're called event cameras. I think they're going to see much more use in robotics in the future, but at the moment they're mainly (or only) used in research.

⬐ toxik
Event cameras are certainly interesting, but they have their own set of challenges. I do not believe they're a better choice in any general sense, but they can certainly deliver some impressive performance.

⬐ iandanforth
Yes and no. We do hallucinate both temporal and spatial content from any scene we perceive, but there's also a big part of the brain which causes you to ignore missing information. For example, during a saccade (rapid eye movement) you're basically blind; the brain just edits out of your perceptual stream the part where the world is an incomprehensible blur.

⬐ scoot
Expanding on this - if you've ever glanced at a ticking clock and the second hand seemed frozen for a moment, that was the brain backfilling the gap (after the fact!) left by the saccade. I was aware of the phenomenon (from personal experience) but not of the cause; there was a post here relatively recently that went into the details. Apologies that I can't now find it.
⬐ InvaderFizz
I can't find the HN post, but I believe this was the tweet thread that it referenced: https://threadreaderapp.com/thread/1014267515696922624.html

⬐ scoot
That's the one - thanks!

Looks great. This deserves more attention.

⬐ andy_ppp
Really cool, particularly the interpolation of two images to 16 frames on their GitHub here: https://github.com/hzwer/arXiv2020-RIFE

⬐ scoot
The occluded objects (the car behind the pole, for example) are particularly impressive.

⬐ interestica
Those are from two images? That's just bonkers.

⬐ RhysU
It would be more compelling contrasted with ones that did not work.

⬐ amelius
I suppose a wheel with spokes, when turning at the right speed, would not work due to aliasing (the wheel would appear to be not moving in both the original and the interpolated video).

⬐ extr
That's probably true but feels like somewhat of a trivial example... I think there would be some very interesting things to test here. When you get down to something like a 16-frame interpolation of 2 stills, the model is essentially guessing based on context what the interpolated frames should look like. This starts to verge into computational photography territory, where the model is supplying its own interpretation of the action based on a human-like semantic understanding of the scene. As someone with an interest but not a career in bleeding-edge machine learning, I would be curious to get an intuitive sense of how much this is going on.

Interesting boundary cases might be visuals of physical processes with inherently "chaotic" small-scale behavior. For example:
* What would happen if you fed the model two stills of a drop of food coloring expanding in water? Would it wholesale invent chaotic action that is obviously only one of many solutions but plausibly interpolates between the two states? Maybe not in its…
* Fireworks can completely visually change on 2-3 frame timescales. The underlying process is immediately recognized and could easily be imagined by most people; does the model understand the context here? Maybe it wouldn't do well right now, but how much better could performance get on the above if there were more examples in the training set?
* In the opposite direction of chaos, it might be interesting to look at something like two photos of a starry night taken 30 minutes apart. Based on the two photos, can the model understand the geometry of the scene and rotate the points correctly?
I would also like to know what would happen if, instead of using frames next to each other, you took them further and further apart. For 1 second of a 30 FPS video, could you give it frames 1, 15, and 30, and ask it to find the other 27? How about 5 seconds of a 30 FPS video, giving it frames 1, 30, 60, 90, 120, and 150? Etc etc
It looks like they have a Colab notebook, so perhaps I should quit writing and start playing around!
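The "two images to 16 frames" demo discussed above boils down to recursive midpoint insertion: each pass synthesizes a frame between every adjacent pair. Here is a toy sketch of that recursion, with a plain linear blend standing in for the learned model (the real network hallucinates motion instead of averaging; frames are single-pixel lists purely for illustration):

```python
def midpoint(a, b):
    """Stand-in for the learned interpolator: naive per-pixel average."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def interpolate(frames, passes):
    """Apply `passes` rounds of 2x frame insertion to a frame list."""
    for _ in range(passes):
        out = []
        for a, b in zip(frames, frames[1:]):
            out += [a, midpoint(a, b)]
        out.append(frames[-1])
        frames = out
    return frames

# Two 1-pixel "frames"; four doubling passes yield 17 frames total,
# i.e. 16 synthesized-or-original intervals between the two endpoints.
frames = interpolate([[0.0], [16.0]], passes=4)
print(len(frames))   # → 17
print(frames[:3])    # → [[0.0], [1.0], [2.0]]
```

Swapping the averaging `midpoint` for a model call is what makes the occlusion examples impressive: an average would ghost the car through the pole, while the network invents plausible in-between content.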
⬐ andy_ppp
Yes, it would be good to see results with two images of fireworks.