Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]
The research community has had seemingly efficient and effective vision tokenizers for some time now. Is there any indication that the big players' models use anything other than fixed-patch tokenization? I suspect not, and I'm trying to work out why:

- marginal gains?
- pipelines that need a fixed number of tokens per image up front, for efficiency reasons (or even harder constraints)?
- scaling laws for input-adaptive patching that are not yet well understood?
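
To make the fixed-token-count point concrete, here is a rough sketch rather than any production pipeline. It compares the token count of a standard fixed-patch ViT with a toy variance-driven quadtree splitter. The quadtree scheme, its thresholds, and the function names are all assumptions made up for illustration; real adaptive tokenizers differ in the details.

```python
import numpy as np

def fixed_patch_token_count(height, width, patch=16):
    """Fixed-patch ViT: one token per non-overlapping patch, known up front."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

def quadtree_token_count(img, min_patch=16, max_patch=64, var_threshold=0.01):
    """Hypothetical adaptive scheme: keep splitting a block into four
    while its pixel variance is above var_threshold (toy criterion)."""
    def count(block, size):
        if size <= min_patch or block.var() <= var_threshold:
            return 1  # this block becomes a single token
        half = size // 2
        return sum(count(block[i:i + half, j:j + half], half)
                   for i in (0, half) for j in (0, half))

    height, width = img.shape
    return sum(count(img[i:i + max_patch, j:j + max_patch], max_patch)
               for i in range(0, height, max_patch)
               for j in range(0, width, max_patch))

rng = np.random.default_rng(0)
flat = np.zeros((256, 256))                 # no detail anywhere
noisy = rng.random((256, 256))              # detail everywhere
mixed = np.zeros((256, 256))
mixed[:, 128:] = rng.random((256, 128))     # detail in the right half only

fixed = fixed_patch_token_count(256, 256)
adaptive = [quadtree_token_count(x) for x in (flat, mixed, noisy)]

# Batching variable-length sequences by padding to the longest one
# wastes compute on the padded positions.
padded_len = max(adaptive)
wasted = 1 - sum(adaptive) / (len(adaptive) * padded_len)

print("fixed tokens per image:", fixed)
print("adaptive tokens per image:", adaptive)
print("fraction of padded batch that is padding:", wasted)
```

With fixed patches, every image of a given resolution costs the same number of tokens, so batch shapes and memory are known before the image is even looked at. With the toy adaptive scheme, the count depends on content, so a naively padded batch spends a large share of its positions on padding unless the pipeline adds sequence packing or bucketing.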