Thu Apr 17 2025
Vision-language models are integral to computer vision research. Many high-performing models remain closed-source, obscuring their data, design and training recipe. We analyze standard training pipelines without distillation from proprietary models. We explore large-scale synthetic data to identify critical data gaps.
Keywords: video understanding,vision language,video captions,challenging video,videobench
Thu Apr 17 2025
Engineering sketches consist of geometric primitives (e.g. points, lines) connected by constraints that define the relationships between them. We adapt alignment techniques from reasoning LLMs to the task of generating sketch constraints found in computer-aided design (CAD) models.
Keywords: sketch constraints,generate cad,constraint generation,generate constraints,generative cad
Thu Apr 17 2025
Inspired by the human cognitive phenomenon of attentional bias, we reconceptualize neural architectures as associative memory modules. We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs.
Keywords: recurrent neural,recurrent models,memory learning,attentional bias,associative memory
Thu Apr 17 2025
Sleep-time compute allows models to "think" offline about contexts before queries are presented. By anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time.
Keywords: query gsm,stateful gsm,gsm symbolic,gsm,anticipating queries
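As a rough illustration of the idea of pre-computing useful quantities before a query arrives, here is a minimal sketch. The context format (item, quantity, unit price), the function names `sleep_time_precompute` and `answer`, and the anticipated queries are all hypothetical and are not taken from the paper.

```python
# Toy illustration only: the "context" is a list of (item, quantity, unit_price)
# facts, and the anticipated queries are cost lookups. All names are hypothetical.

def sleep_time_precompute(context):
    """Offline step: derive quantities that likely queries will need."""
    per_item_cost = {name: qty * price for name, qty, price in context}
    return {
        "per_item_cost": per_item_cost,
        "total_cost": sum(per_item_cost.values()),
    }


def answer(query, cache):
    """Test-time step: an anticipated query becomes a cheap lookup."""
    if query == "total_cost":
        return cache["total_cost"]
    if query in cache["per_item_cost"]:
        return cache["per_item_cost"][query]
    raise KeyError(f"query {query!r} was not anticipated offline")


context = [("apples", 3, 2), ("pears", 2, 5)]
cache = sleep_time_precompute(context)   # done before any query is seen
print(answer("total_cost", cache))
print(answer("pears", cache))
```

The offline pass is paid once per context, so its cost can be shared by every later query about that context.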
Thu Apr 17 2025
RUKA is a tendon-driven humanoid hand that is compact, affordable, and capable. Made from 3D-printed parts and off-the-shelf components, RUKA has 5 fingers with 15 underactuated degrees of freedom, enabling diverse human-like grasps.
Keywords: humanoid hand,robotic hands,leveraging hand,hands teleoperation,grasping
Thu Apr 17 2025
MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. For causal variable localization, we find that the supervised DAS method performs best. SAE features are not better than neurons, i.e., standard dimensions of hidden vectors.
Keywords: autoencoders,sparse autoencoders,autoencoders saes,distributed alignment,mechanistic interpretability
Thu Apr 17 2025
Energy-Based Reward Model (EBRM) is a lightweight post-hoc framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations.
Keywords: reward models,models reward,reward model,based reward,language models
Thu Apr 17 2025
VistaDPO is a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment.
Keywords: video language,video hierarchical,video understanding,video temporal,benchmarks video
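For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. It shows only the generic DPO objective, not VistaDPO's hierarchical spatial-temporal variant; the function name `dpo_loss` and the example log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the trained
    policy and under a frozen reference model. `beta` scales the margin.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Made-up log-probabilities: the policy has moved toward the chosen response.
print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy favors the chosen response more strongly than the reference does.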
Thu Apr 17 2025
This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues.
Keywords: crash narratives,critical nlp,language models,crash analysis,classification crash
Thu Apr 17 2025
RoboTwin uses 3D generative foundation models to produce diverse expert datasets. It also introduces a spatial relation-aware code generation framework. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual
Keywords: trained robotwin,robotwin generative,robotics dual,magic robot,robotic tasks
Thu Apr 17 2025
Topological insulators (TIs) and topological crystalline insulators (TCIs) are valuable for practical applications. Such materials, particularly those with a full band gap, remain scarce. We apply reinforcement fine-tuning to a pre-trained generative model.
Keywords: topological insulators,topological materials,topological crystalline,crystalline insulators,materials generative
Thu Apr 17 2025
Current BVSR methods often fail to restore sharp details at high resolutions. We propose a novel event-enhanced network, Ev-DeblurVSR. On real data, our method is +2.59 dB more accurate and 7.28$\times$ faster than
Keywords: feature deblurring,deblurvsr effectively,blurry video,deblurring,deblur frame
Thu Apr 17 2025
A PRVR framework encodes diverse contexts within a video into a fixed number of prototypes. We introduce strategies to enhance text association and video understanding within prototypes. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction
Keywords: video contexts,video retrieval,contexts video,video understanding,context representations
Thu Apr 17 2025
We propose VLCA, which combines language space and vision space. We connect the multiple image domains by using semantic space as the bridge domain. Finally, the language representation is aligned with the vision representation through the multimodal space of text and image.
Keywords: image domains,domain generalization,multimodal space,capture semantic,image features
Thu Apr 17 2025
NaVAB is a comprehensive benchmark to evaluate the alignment of LLMs with the values of five major nations. NaVAB implements a national value extraction pipeline to efficiently construct value assessment datasets. It can be combined with alignment techniques to effectively reduce value concerns.
Keywords: country values,national values,value extraction,assessment datasets,value assessment
Thu Apr 17 2025
Federated learning (FL) enables collaborative model training using decentralized private data from multiple clients. While FL has shown robustness against poisoning attacks with basic defenses, our research reveals new vulnerabilities stemming from non-independent and identically distributed (non-IID) data among clients. These vulnerabilities pose a
Keywords: malicious gradients,backdoor attacks,collaborative backdoor,backdoor defenses,federated learning
Thu Apr 17 2025
EmoVoice is a novel emotion-controllable TTS model. It exploits large language models to enable fine-grained freestyle natural language emotion control. EmoVoice achieves state-of-the-art performance on the English and Chinese test sets.
Keywords: emotional speech,emotion dataset,language emotion,emotion labels,emotion evaluation
Thu Apr 17 2025
We present a novel approach to training specialized instruction-based image-editing diffusion models. We create an online reinforcement learning framework that aligns the diffusion model with human preferences. This approach simplifies users' efforts to achieve highly specific edits.
Keywords: intricate edits,visual edits,editing diffusion,image editing,scenes maintaining
Thu Apr 17 2025
Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions (NLQs) by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods either rely on employing task-specific strategies or custom
Keywords: textual reasoning,unified knowledge,structured knowledge,knowledge representation,knowledge reasoning
Thu Apr 17 2025
SimUSER is an agent framework that serves as a believable and cost-effective human proxy. It identifies self-consistent personas from historical data, enriching user profiles with unique backgrounds and personalities. Users equipped with persona, memory, perception, and brain modules engage in interactions with the
Keywords: recommender simuser,profiles,recommender parameters,recommender systems,recommender
Thu Apr 17 2025
CLIP-Refine is a post-pre-training method for CLIP models, applied at a phase between pre-training and fine-tuning. It aims to align the feature space with one epoch of training on small image-text datasets, without degrading zero-shot performance.
Keywords: training clip,trained clip,epoch training,language image,text features
Thu Apr 17 2025
We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD) for a fixed budget of iterations per task. When all tasks are jointly linearly separable and are presented in a cyclic/random order, we show the directional convergence of the trained linear class
Keywords: continual learning,catastrophic forgetting,averaged forgetting,forgetting,transfer forgetting
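The setting described above (sequential GD with a fixed iteration budget per task, on jointly separable tasks visited cyclically) can be sketched on toy data as follows. The logistic loss, the two hand-made tasks, the step size, and the helper names are assumptions chosen for illustration, not the paper's experimental setup.

```python
import numpy as np


def gd_on_task(w, X, y, steps, lr):
    """Run `steps` iterations of gradient descent on the mean logistic loss."""
    for _ in range(steps):
        margins = y * (X @ w)
        # Gradient of mean(log(1 + exp(-margin))) with respect to w.
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w = w - lr * grad
    return w


def train_cyclic(tasks, cycles=200, steps_per_task=5, lr=0.5):
    """Visit the tasks in cyclic order with a fixed GD budget per visit."""
    w = np.zeros(tasks[0][0].shape[1])
    for _ in range(cycles):
        for X, y in tasks:
            w = gd_on_task(w, X, y, steps_per_task, lr)
    return w


# Two toy tasks with labels in {-1, +1}; both are separated by w = (1, 1),
# so the tasks are jointly linearly separable.
tasks = [
    (np.array([[2.0, 0.5], [-2.0, -0.5]]), np.array([1.0, -1.0])),
    (np.array([[0.5, 2.0], [-0.5, -2.0]]), np.array([1.0, -1.0])),
]

w = train_cyclic(tasks)
direction = w / np.linalg.norm(w)
print("learned direction:", direction)
```

Only the normalized direction of `w` is informative here, because on separable data the norm of `w` keeps growing under the logistic loss.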
Thu Apr 17 2025
We propose that learning in deep neural networks proceeds in two phases: a rapid curve fitting phase followed by a slower compression or coarse graining phase. This view is supported by the shared temporal structure of three phenomena: grokking, double descent and the information bottleneck.
Keywords: learning deep,deep neural,generalization training,neural,learning
Thu Apr 17 2025
Perceived risk is subjective and difficult to evaluate using existing methods. DSPR achieves the highest prediction accuracy of 87.91% in predicting SRRs. The CNN-Bi-LSTM-TPA network achieves the highest accuracy among four different LSTM structures.
Keywords: lstm tpa,driver subjective,lstm,bi lstm,driver trust