Information Technology and Control https://itc.ktu.lt/index.php/ITC <p>The periodical journal <em>Information Technology and Control / Informacinės technologijos ir valdymas</em> covers a wide range of problems in computer science and control systems. All articles should be prepared in accordance with the requirements of the journal. Please use the <a style="text-decoration: underline;" href="https://itc.ktu.lt/public/journals/13/Guidelines for Preparing a Paper for Information Technology and Control (5).doc.rtf">„Article Template“</a> to prepare your paper properly. Together with your article, please submit a signed <a href="https://itc.ktu.lt/public/journals/13/info/Authors_Guarantee_Form_ITC.DOCX">Author's Guarantee Form</a>.</p> en-US <p>Copyright terms are indicated in the Republic of Lithuania Law on Copyright and Related Rights, Articles 4-37.</p> robertas.damasevicius@ktu.lt (Prof. Robertas Damaševičius) itc@ktu.lt (Vilma Sukackė) Wed, 25 Jun 2025 00:00:00 +0300 OJS 3.2.1.1 http://blogs.law.harvard.edu/tech/rss 60 ADFN: Adaptive Dynamic Fusion Network for Real-time Multispectral Object Detection https://itc.ktu.lt/index.php/ITC/article/view/39803 <p>Multispectral object detection leverages the complementary strengths of infrared (IR) and visible (VIS) modalities to improve detection accuracy. However, existing approaches often lack adaptability to dynamic lighting conditions or fail to achieve real-time performance due to their complexity. We propose the Adaptive Dynamic Fusion Network (ADFN), a novel architecture that integrates adaptive multi-path computation and attention-guided feature fusion to address these challenges. ADFN incorporates Collaborative and Alternating Attention (CAA) modules for efficient feature alignment and an Adaptive Dynamic Pathway (ADP) strategy that dynamically adjusts computational pathways according to lighting conditions, optimizing the balance between accuracy and efficiency. Experiments on the FLIR2 and LLVIP datasets demonstrate that ADFN achieves superior mAP@50-95 and real-time performance, showcasing its robustness and efficiency across diverse environments. ADFN offers a practical solution for multispectral object detection under dynamic lighting conditions and resource constraints.</p> Lin Yang, Gangzhu Qiao Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/39803 Mon, 14 Jul 2025 00:00:00 +0300 A Prediction Method for Highway Traffic Flow Based on the IHPO-VMD-LSTM-Informer Model https://itc.ktu.lt/index.php/ITC/article/view/39228 <p>Accurate and timely predictions of highway traffic flow are crucial for implementing intelligent highway management. This paper introduces a novel prediction approach for highway traffic flow employing the IHPO-VMD-LSTM-Informer model, with the aim of enhancing prediction accuracy. Initially, key indicators measuring highway traffic are identified, and Nonlinear Principal Component Analysis (NPCA) is applied to minimize the dimensionality and interdependence among these indicators. This reduction replaces the original complex indicators with a smaller number of principal components, thereby simplifying the structure of the feature matrix. Subsequently, Variational Mode Decomposition (VMD) processes historical highway traffic flow data, enhanced by a strategically improved Hunter-Prey Optimization (HPO) algorithm. This optimization enables adaptive parameter adjustment for the VMD, allowing effective decomposition of highway traffic flow time series data. The Sample Entropy (SE) of the Intrinsic Mode Functions (IMFs) from this decomposition is combined with the key indicators to form a comprehensive feature matrix.
Then, the predictive module combines a Long Short-Term Memory (LSTM) network with the Informer architecture to accurately predict highway traffic flow from the feature matrix. The effectiveness of the proposed model is verified using the public KDD CUP 2017 motorway traffic dataset. The results indicate that the proposed model outperforms existing models in prediction accuracy, achieving a MAPE of 8.09 and an RMSE of 2.84, thus significantly advancing intelligent highway management.</p> Ruinan Wang, Yan Cao, Xingyu Ji, Di Qiao Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/39228 Mon, 14 Jul 2025 00:00:00 +0300 Improved Agricultural Machinery Navigation Algorithm Based on Machine Learning and Machine Vision Technology https://itc.ktu.lt/index.php/ITC/article/view/37912 <p>The automatic navigation of agricultural machinery is one of the important directions in intelligent agriculture research. Automatic planning of navigation routes is the key to realizing automated agricultural production. Considering the complexity of the agricultural production environment, an agricultural machinery navigation model is constructed based on binocular vision technology, and an optimized BP network is used to calibrate the binocular vision model. Given the difficulty traditional machine vision technology has in crop identification, RGB color-space techniques are used for image segmentation and noise processing, and an optimized S-RANSAC algorithm is used to extract image features. The experimental results showed that in the multi-algorithm rice field image feature matching test, the S-RANSAC algorithm accurately identified differences in seedling color, shape, and hydrological environment, whereas the other algorithms were unable to identify complex environmental features.
At the same time, in the complex agricultural environment positioning test, the maximum error of the S-RANSAC algorithm was 4.16 m, better than the 5.17 m of SURF, giving it the best positioning performance. The proposed technology thus performs well in practical scenarios, providing an important technical reference for the intelligent development of agriculture and the innovation of visual navigation technology.</p> Fengwu Zhu, Weijian Zhang, Qinglai Zhao, Xianzhang Meng, Chunkai Zhao, Weizhi Feng Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/37912 Mon, 14 Jul 2025 00:00:00 +0300 Multi-strategy Hybrid Improved Intelligent Algorithm for Solving UAV-MTSP https://itc.ktu.lt/index.php/ITC/article/view/40640 <p>Unmanned aerial vehicles (UAVs) have been increasingly used in fire monitoring and rescue operations, offering flexibility and efficiency. However, determining the shortest path for all UAVs to visit all regions, known as the Multiple Traveling Salesman Problem (MTSP), is a crucial issue for saving time and energy. This paper proposes a novel hybrid heuristic algorithm, MCPWOA, to solve the MTSP with a focus on UAV path planning applications. The algorithm integrates the Whale Optimization Algorithm (WOA), Crested Porcupine Optimizer (CPO), Chaotic Mapping Strategy (CMS), Arcsine Control Strategy (ACS) and Reverse Learning Strategy (RLS) to diversify the initial population and achieve rapid exploration. The algorithm's performance is evaluated on the CEC2022 benchmark function set for function minimization and on the TSPLIB dataset for UAV-MTSP solution finding. Results indicate that MCPWOA outperforms the existing WOA, CPO, and other advanced algorithms on most tests, showing higher convergence accuracy. Moreover, MCPWOA's effectiveness is demonstrated in actual UAV fire monitoring and rescue path planning, enhancing fire response efficiency through optimized UAV configuration and task allocation.</p> Zixin Wang, Danqing Wang, Jiguang Yu Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/40640 Mon, 14 Jul 2025 00:00:00 +0300 Gas Hydrate Pipeline Is Optimized: Levy Flight, Cauchy Mechanism, and Perception Probability https://itc.ktu.lt/index.php/ITC/article/view/38663 <p>Pipelines used for the hydraulic lifting of gas hydrate particles from deep-sea gas hydrates consume a large quantity of energy, so the efficiency of resource exploitation is very low and it is challenging to maintain an efficient gas supply. Therefore, this article aims to optimize and analyze the rigid-pipe hydraulic lifting process, an essential part of a deep-sea gas hydrate extraction system. First, the objective function is constructed from the relationships between the extraction system's parameters, with specific energy consumption taken as the objective for deep-sea gas hydrate extraction. Then, the range of each parameter is determined according to the extraction system's actual situation. Next, an improved crow search algorithm with a hybrid strategy covering dynamic perception probability, Levy flight, and a Cauchy variation mechanism is employed to solve the optimization model. Finally, the improved crow search algorithm is applied to the experimental settings and compared with other optimization algorithms. The experimental results show that the proposed method, namely the improved crow search algorithm, has good computational efficiency, can effectively optimize the parameters of the deep-sea natural gas hydrate system, and is robust to numerical fluctuations of the parameters. Thus, the performance of the pipeline is improved and the energy consumption of the system is effectively reduced, providing a theoretical reference for the development of deep-sea gas hydrate. The proposed algorithm, I-CSA, can effectively handle larger sample data and maintain high computational efficiency with lower MAPE as the sample size increases, supporting the deep exploitation and utilization of deep-sea gas hydrate.</p> Dawei Qin, Lanlan Chen Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/38663 Mon, 14 Jul 2025 00:00:00 +0300 MEA-IFE: An Improved Multi-modal Fusion Framework Based on DCNN-BERT-BiLSTM and Its Application in Sentiment Analysis https://itc.ktu.lt/index.php/ITC/article/view/39960 <p>In the real world, emotional data often comes from multiple heterogeneous sources, making it difficult for unimodal approaches to capture emotional information fully. Existing sentiment analysis models struggle with accuracy when handling complex emotional expressions. Accordingly, this paper proposes a multi-modal sentiment analysis framework, MEA-IFE, characterized by effective feature extraction and high predictive accuracy. To mitigate potential information loss and expression limitations in BERT-BiLSTM during text feature extraction, MEA-IFE introduces a parallel structure of SK-Net and BiLSTM, enhancing the ability to extract multi-dimensional text features. Additionally, it integrates the ECA mechanism to improve the capture of essential information in text. For images, MEA-IFE incorporates a Vision Transformer, combining CNN and Transformer architectures to better capture both global and detailed image features. During the feature fusion phase, MEA-IFE employs a multi-head attention mechanism to dynamically integrate text and image features, exploring the interactive potential between different modalities. Experiments performed using the Kaggle text dataset and the FER2013 image dataset demonstrate an accuracy of up to 98.00%, validating its effectiveness.
When compared with models such as AM-MF, AMSAER, HAN-CA-SA, and TBGAV, MEA-IFE shows outstanding performance across accuracy, precision, recall, and F1 score, with respective improvements of 0.40%, 0.20%, 0.75%, and 0.52%. The model also excels in the AUC metric, further confirming its advantages. The proposed MEA-IFE model possesses high predictive accuracy and strong feature integration capabilities, meeting the precision demands of complex multi-modal sentiment tasks.</p> Hongfei Ye, Xiaochen Xiao Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/39960 Mon, 14 Jul 2025 00:00:00 +0300 Method of Ship Target Oblique Frame Detection in Lightweight SAR Image Based on Recurrent Neural Network https://itc.ktu.lt/index.php/ITC/article/view/37944 <p>When ship targets appear in SAR images at different angles, their shapes and contours may change significantly. Current target box detection algorithms typically match and recognize targets using templates with fixed shapes and orientations; when the angle of a ship target changes, these templates may no longer apply, degrading detection performance and making it difficult to accurately identify and locate targets. Therefore, to solve this angle sensitivity problem, a method of ship target oblique frame detection in lightweight SAR images based on a recurrent neural network is studied. Using a recurrent neural network, a framework for oblique frame detection in lightweight SAR images is established that ensures detection accuracy, significantly reduces the demand for computing resources, and achieves more efficient detection. In this framework, SAR images enter at the input layer and are transmitted to the hidden layer; a lightweight convolutional neural network is used as the hidden layer, and a channel attention mechanism is introduced to improve the extraction of useful ship target features. The output layer processes the ship target features, predicts the ship target center point heat map, and calculates the oblique frame vertex coordinates from the heat map, providing better adaptability to ship targets that tilt or rotate in the SAR image, solving the angle sensitivity problem, and completing the oblique frame detection. The cubature Kalman filter algorithm is used to train the recurrent neural network and optimize the network weights, improving the detection accuracy of the ship target oblique frame. Experiments show that this method effectively extracts ship target features, accurately detects the oblique frame of ship targets under different backgrounds, and remains robust under different occlusion rates.</p> Liang Huang, Xufang Zhu, Bing Luo Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/37944 Mon, 14 Jul 2025 00:00:00 +0300 Bi-Encoder Polyp Net: A Novel Architecture for Enhanced Polyp Segmentation in Endoscopic Images https://itc.ktu.lt/index.php/ITC/article/view/41107 <p>Automatic polyp segmentation in endoscopic images holds critical clinical value for early colorectal cancer diagnosis. While existing segmentation models have achieved notable progress, two key challenges persist in algorithmic performance improvement. First, dynamic adjustments of colonoscope tip orientation during examinations induce viewpoint variations, which amplify polyp appearance diversity and hinder robust feature learning. Second, the inherent similarity between polyps and surrounding tissues leads to blurred boundaries.
Although convolutional neural networks (CNNs) have demonstrated significant advancements, their limitations in modeling global dependencies and reliance on aggressive downsampling operations often cause redundant network structures and local detail loss. To address these bottlenecks, we propose Bi-Encoder Polyp Net – a novel parallel architecture integrating Pyramid Vision Transformer and ResNet. This dual-branch design effectively captures global contextual dependencies while preserving low-level spatial details. A feature alignment module bridges the semantic gap between dual-branch feature maps, and an iterative semantic embedding unit further injects high-level semantic information into aligned low-level features. Extensive experiments across five public polyp segmentation benchmarks validate the network’s effectiveness, demonstrating superior capability in processing real-world colonoscopy images.</p> Qiqiang Duan, Cong Gu Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/41107 Mon, 14 Jul 2025 00:00:00 +0300 Yolov5-based Intelligent Detection Method for Retail Goods https://itc.ktu.lt/index.php/ITC/article/view/40728 <p><span class="TextRun SCXW110788693 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW110788693 BCX0" data-ccp-parastyle="Body Text">In the current context, intelligent unmanned retail checkout systems offer the prospect of efficient and innovative development. This study proposes an enhanced lightweight YOLOv5 merchandise detection and recognition method. 
The method introduces </span><span class="NormalTextRun SpellingErrorV2Themed SCXW110788693 BCX0" data-ccp-parastyle="Body Text">SELayer</span><span class="NormalTextRun SCXW110788693 BCX0" data-ccp-parastyle="Body Text"> and a multi-headed self-attentive module of Transformer in YOLOv5 to enable the network to focus more on essential factors such as commodities when performing retail merchandise </span><span class="NormalTextRun ContextualSpellingAndGrammarErrorV2Themed SCXW110788693 BCX0" data-ccp-parastyle="Body Text">detection, and</span><span class="NormalTextRun SCXW110788693 BCX0" data-ccp-parastyle="Body Text"> improve the recognition performance of the model. Also, the Ghost module is introduced to reduce network parameters and computation, increase computation speed and reduce latency. We </span><span class="NormalTextRun SCXW110788693 BCX0" data-ccp-parastyle="Body Text">validated</span><span class="NormalTextRun SCXW110788693 BCX0" data-ccp-parastyle="Body Text"> the performance of the approach on a public dataset. Compared with the existing YOLOv5 model, the model achieves a 0.9% improvement in detection accuracy and a 27.7% reduction in GFLOPs. 
With this study, we optimise the problem of small batch identification of retail goods, providing a basis for automated processing of intelligent retail supply and marketing systems with practical implications.</span></span><span class="EOP SCXW110788693 BCX0" data-ccp-props="{&quot;134245417&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:861,&quot;335559737&quot;:132,&quot;335559738&quot;:131,&quot;335559740&quot;:218}"> </span></p> Zixin Jiang Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/40728 Mon, 14 Jul 2025 00:00:00 +0300 Neural Networks and Ensemble Model to Automatic Music Coordination: A Performance Comparison https://itc.ktu.lt/index.php/ITC/article/view/36737 <p><span class="TextRun SCXW133869063 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text">In order to</span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text"> solve the problems of low classification accuracy, </span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text">poor quality</span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text"> of generated music, and insufficient consideration of the order and duration of notes in music coordination, this paper adopts a long short-term memory network (LSTM) and ensemble model based on the combination of timing and self-attention mechanism. 
The experimental model uses the LSTM network to automatically learn the </span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text">important features</span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text"> of </span><span class="NormalTextRun ContextualSpellingAndGrammarErrorV2Themed SCXW133869063 BCX0" data-ccp-parastyle="Body Text">notes, and</span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text"> introduces the timing and self-attention mechanism to enhance the model's ability to pay attention to the note sequence and features, and better capture the long-distance dependencies and emotional changes in music. Compared with the traditional model, the model used in this paper is more detailed in considering the order and duration of </span><span class="NormalTextRun ContextualSpellingAndGrammarErrorV2Themed SCXW133869063 BCX0" data-ccp-parastyle="Body Text">notes, and</span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text"> combines emotional labels with audio data to improve the quality of music generation. The experiment is verified by the three music datasets of Lim, </span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text">Rhyu</span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text"> and Lee. The ensemble model combined with LSTM and self-attention mechanism in this paper performs well in comprehensive evaluation scores and chord classification accuracy, which is significantly improved compared with the traditional LSTM model. The novelty lies in the better integration of the timing relationship and emotional information of the note sequence, which improves the performance of music coordination. The model in this paper achieved 43 points (out of 50 points) and 95.6% in comprehensive evaluation score and chord classification accuracy, respectively. 
The chord classification accuracy was significantly improved by 3.3% compared with LSTM. It also has unique advantages in model structure design and feature integration, especially in the introduction of timing and self-attention mechanisms, and the combination of emotional labels. It has achieved better results and brought </span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text">new ideas</span><span class="NormalTextRun SCXW133869063 BCX0" data-ccp-parastyle="Body Text"> and methods to the field of music generation.</span></span><span class="EOP SCXW133869063 BCX0" data-ccp-props="{&quot;134245417&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:861,&quot;335559737&quot;:132,&quot;335559738&quot;:131,&quot;335559740&quot;:218}"> </span></p> Lu Wang Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/36737 Mon, 14 Jul 2025 00:00:00 +0300 A Two-stage Cattle Face Recognition Method Based on Target Detection and Recognition Network https://itc.ktu.lt/index.php/ITC/article/view/35918 <p><span class="TextRun SCXW181036564 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract" data-ccp-parastyle-defn="{&quot;ObjectId&quot;:&quot;fe0017e7-5150-5d7c-ac64-5f509ec11579|1&quot;,&quot;ClassId&quot;:1073872969,&quot;Properties&quot;:[469775450,&quot;Abstract&quot;,201340122,&quot;2&quot;,134234082,&quot;true&quot;,134233614,&quot;true&quot;,469778129,&quot;Abstract&quot;,335572020,&quot;1&quot;,268442635,&quot;18&quot;,335559705,&quot;1033&quot;,335551547,&quot;1033&quot;,335559739,&quot;200&quot;,335559738,&quot;600&quot;,335551550,&quot;6&quot;,335551620,&quot;6&quot;,469777841,&quot;Times New Roman&quot;,469777842,&quot;Times New Roman&quot;,469777843,&quot;SimSun&quot;,469777844,&quot;Times New Roman&quot;,469769226,&quot;Times New 
Roman,SimSun&quot;]}">T</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">raditional methods </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">of cattle management</span> <span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">have problems such as high error rates, easy failure of tags, and the need to consume a lot of time and </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">manpower</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> costs. </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">However</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">,</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> a</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">s one of the biological characteristics, the recognition of cattle face is one of the important technical means to achieve inte</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">lligent farming, </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">accurate</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> feeding, and health management of cattle. </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">Thus,</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> the article proposed improved algorithms based on YOLOv7 and </span><span class="NormalTextRun SpellingErrorV2Themed SCXW181036564 BCX0" data-ccp-parastyle="Abstract">VoVNet</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> for cattle face detection and recognition using a contactless approach. 
For the improved YOLOv7 cattle face detection</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> model, the </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">efficient layer aggregation networks (</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">ELAN</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">)</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> structures in the backbone and neck networks were replaced with the </span><span class="NormalTextRun SpellingErrorV2Themed SCXW181036564 BCX0" data-ccp-parastyle="Abstract">ConvNeXt</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> network and </span><span class="NormalTextRun SpellingErrorV2Themed SCXW181036564 BCX0" data-ccp-parastyle="Abstract">CoTNet</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> Transformer module, respectively, aiming to improve the detection speed and robustness while reducing co</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">mputation. The </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">SimAM</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> (A Simple, Parameter-Free Attention Module)</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> attention mechanism, considering both spatial and channel dimensions, was introduced in the neck network to enhance feature representation without adding extra parameters to the original netw</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">ork. 
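For intuition, the parameter-free weighting of SimAM can be sketched in a few lines. This is a NumPy illustration of the published SimAM energy formulation, not the authors' code; the `e_lambda` stabiliser value and the single-image `(C, H, W)` shape are assumptions.

```python
import numpy as np

def simam(x, e_lambda=1e-4):
    """Parameter-free SimAM attention over a (C, H, W) feature map (sketch)."""
    _, h, w = x.shape
    n = h * w - 1                                  # neurons per channel minus the target
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2                              # squared deviation from the channel mean
    v = d.sum(axis=(1, 2), keepdims=True) / n      # channel variance estimate
    e_inv = d / (4 * (v + e_lambda)) + 0.5         # inverse energy per neuron
    return x * (1.0 / (1.0 + np.exp(-e_inv)))      # sigmoid-gated features, same shape
```

Because the gate is derived from the feature statistics themselves, the module adds no learnable parameters, which is exactly why it can be inserted into the neck network for free.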
Experimental results on the constructed facial detection dataset of Holstein and Simmental beef cattle showed that the improved CCS-YOLOv7 cattle face detection model achieved a precision of 99.43% and a recall rate of 99.10%, with significantly impro</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">ved detection speed and reduced model size. As for the improved </span><span class="NormalTextRun SpellingErrorV2Themed SCXW181036564 BCX0" data-ccp-parastyle="Abstract">VoVNet</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> cattle face recognition model, </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">residual connections</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> (RC)</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> were added from the input to the output of the </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">One-Shot Aggregation (</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">OSA</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">)</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> modules of </span><span class="NormalTextRun SpellingErrorV2Themed SCXW181036564 BCX0" data-ccp-parastyle="Abstract">VoVNet</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> to enhance the representation of deep features. 
The </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">Efficient Channel Attention (</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">ECA</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">)</span> <span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">was added to the final feature extraction layer of the OSA modules to improve the feature extraction capability for cattle face image classification. E</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">xperimental results on the facial recognition dataset of Holstein dairy cows and Simmental beef cattle, built upon the improved CCS-YOLOv7 cattle face detection model, </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">demonstrated</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> that the </span><span class="NormalTextRun SpellingErrorV2Themed SCXW181036564 BCX0" data-ccp-parastyle="Abstract">VoVNet</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">-ECA-RC model achieved a precision of 99.37% for cattle face</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> recognition with a final model size of 41.4MB. 
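As background, the channel-gating idea behind ECA can be sketched as follows. This is an illustrative NumPy sketch only: the learned 1-D convolution is replaced by a fixed averaging kernel, so the weights are assumptions rather than the trained module.

```python
import numpy as np

def eca(x, k=3):
    """Efficient Channel Attention sketch for a (C, H, W) feature map."""
    c = x.shape[0]
    gap = x.mean(axis=(1, 2))                      # squeeze: global average pool -> (C,)
    padded = np.pad(gap, k // 2, mode="edge")
    kernel = np.ones(k) / k                        # stand-in for the learned conv1d weights
    conv = np.array([padded[i:i + k] @ kernel for i in range(c)])
    gate = 1.0 / (1.0 + np.exp(-conv))             # sigmoid channel gates in (0, 1)
    return x * gate[:, None, None]                 # reweight channels, shape preserved
```

The 1-D convolution over the pooled channel descriptor is what keeps ECA cheap: it captures local cross-channel interaction with only `k` weights instead of a full channel-mixing layer.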
Therefore, the proposed architectures can serve as a reference for non-contact individual recognition </span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract">in intelligent</span><span class="NormalTextRun SCXW181036564 BCX0" data-ccp-parastyle="Abstract"> farming.</span></span><span class="EOP SCXW181036564 BCX0" data-ccp-props="{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:454,&quot;335559737&quot;:130,&quot;335559738&quot;:131,&quot;335559739&quot;:0,&quot;335559740&quot;:219}"> </span></p> Piaoyi Zheng, Minghui Deng, Junjie Gong, Guiping Li, Yanling Yin Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/35918 Mon, 14 Jul 2025 00:00:00 +0300 Integration of Explainable AI with Deep Learning for Breast Cancer Prediction and Interpretability https://itc.ktu.lt/index.php/ITC/article/view/39443 <p><span class="TextRun SCXW256411650 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW256411650 BCX0">The present paper proposes an integrated breast cancer diagnosis approach that combines ML, DL, and Explainable AI methods using the Breast Cancer Wisconsin (Diagnostic) Data Set. We compare standard machine learning approaches, namely Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR), with more intricate techniques based on deep learning. Although ML models help understand the problem, a DL model may be more </span><span class="NormalTextRun SCXW256411650 BCX0">appropriate when</span><span class="NormalTextRun SCXW256411650 BCX0"> the data’s dimensionality and complexity are high. Addressing these limitations, we present a new Hybrid Explainable Attention Mechanism (HEAM) for DL models that utili</span><span class="NormalTextRun SCXW256411650 BCX0">ses attention to improve performance. 
This method is applied in CNNs</span><span class="NormalTextRun SCXW256411650 BCX0"> with saliency maps and Grad-CAM methods to show clinical users which parts of the input the model bases its predictions on, such as characteristics of cell nuclei in images. Using the Breast Cancer Wisconsin dataset, the novel deep learning model with HEAM enhancement is tested against traditional ML models for breast cancer classification. The findings of this investigation provide evidence that HEAM not only boosts the prediction accuracy to 99.5% but also enhances the model by providing sound visual attention that explains each prediction, thereby improving the clinical relevance of the model. </span></span><span class="EOP SCXW256411650 BCX0" data-ccp-props="{&quot;134245417&quot;:false,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:720}"> </span></p> A. Rhagini, S. Thilagamani Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/39443 Mon, 14 Jul 2025 00:00:00 +0300 Single-Pulse Detection Method of Radar Weak Target Based on a Two-Stage Deep Neural Network https://itc.ktu.lt/index.php/ITC/article/view/40167 <p><span class="TextRun SCXW40850234 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">With the increasing prevalence of drones in low</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">-</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">altitude airspace, the radar detection of weak targets with a low signal</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">-</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">to</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">-</span><span class="NormalTextRun 
SCXW40850234 BCX0" data-ccp-parastyle="Body Text">noise ratio (SNR) still poses a crucial challenge. Traditional constant false alarm rate (CFAR) methods </span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">encounter</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text"> issues of high false alarms and low accuracy when the SNR is below </span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">-</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">15 dB.</span> <span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">This paper puts forward a two</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">-</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">stage deep neural network to improve weak target detection by emulating human visual perception. In the first stage (coarse detection), potential targets are rapidly localized through grid</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">-</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">based regression. 
In the second stage (fine detection), depth</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">-</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">wise separable convolution (DSC) and residual connections are </span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">utilized</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text"> for </span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">accurate</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text"> classification.</span> <span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">Experimental results show that, at an SNR of -20 dB, the detection rate of the proposed method is 20% higher than that of CFAR methods, and the inference speed is 3.66 times faster than that of single</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">-</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">stage networks. Ablation studies confirm the efficiency improvements brought by the coarse detection network. 
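The efficiency gain from depthwise separable convolution can be illustrated with a simple parameter count. This is a generic sketch of the DSC idea, not the paper's network; the example channel and kernel sizes are assumptions.

```python
def conv_params(c_in, c_out, k):
    """Weight counts (bias ignored) for a standard vs. a depthwise separable conv layer."""
    standard = c_in * c_out * k * k                    # one k x k filter per (in, out) pair
    depthwise_separable = c_in * k * k + c_in * c_out  # per-channel k x k + 1x1 pointwise
    return standard, depthwise_separable

std, dsc = conv_params(64, 128, 3)  # e.g. a 3x3 layer mapping 64 -> 128 channels
```

For this hypothetical layer the factorisation cuts the weights from 73,728 to 8,768, roughly an 8x reduction, which is why DSC is a common choice when inference speed matters.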
This approach offers a robust solution for real</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">-</span><span class="NormalTextRun SCXW40850234 BCX0" data-ccp-parastyle="Body Text">time drone surveillance in complex and cluttered environments.</span></span><span class="EOP SCXW40850234 BCX0" data-ccp-props="{&quot;134245417&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:861,&quot;335559737&quot;:132,&quot;335559738&quot;:131,&quot;335559739&quot;:0,&quot;335559740&quot;:218}"> </span></p> Mingjie Qiu, Jianming Wang, Guangxin Wu Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/40167 Mon, 14 Jul 2025 00:00:00 +0300 SAEDF: A Synthetic Anomaly-Enhanced Detection Framework for Detection of Unknown Network Attacks https://itc.ktu.lt/index.php/ITC/article/view/40247 <p><span class="TextRun SCXW236665459 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW236665459 BCX0" data-ccp-parastyle="Body Text">Detecting unknown cyber-attacks (i.e., zero-day) is difficult because network environments change </span><span class="NormalTextRun SCXW236665459 BCX0" data-ccp-parastyle="Body Text">frequently</span><span class="NormalTextRun SCXW236665459 BCX0" data-ccp-parastyle="Body Text"> and there are few </span><span class="NormalTextRun SpellingErrorV2Themed SCXW236665459 BCX0" data-ccp-parastyle="Body Text">labeled</span><span class="NormalTextRun SCXW236665459 BCX0" data-ccp-parastyle="Body Text"> examples of anomalies. Traditional methods for detecting anomalies often struggle to handle unknown attack types and work effectively with complex, high-dimensional data. 
To overcome these problems, we propose </span><span class="NormalTextRun SCXW236665459 BCX0" data-ccp-parastyle="Body Text">a new approach</span><span class="NormalTextRun SCXW236665459 BCX0" data-ccp-parastyle="Body Text"> called the synthetic attack-enhanced detection framework (SAEDF). SAEDF combines synthetic anomaly generation, flexible feature extraction, and unsupervised anomaly detection. The framework employs a model known as the adaptive and dynamic generative variational autoencoder (ADGVAE). This model generates realistic synthetic attacks and adapts its structure to work effectively with datasets of varying complexity. This helps the model work well with a wide range of attack patterns while still being efficient. Tests on benchmark datasets show that SAEDF performs better than other methods. It achieves higher scores for F1, Recall, and has a much lower rate of false positives. These results show that SAEDF is effective in finding unknown attacks, improving detection accuracy, and handling complex and changing network traffic.</span></span><span class="EOP SCXW236665459 BCX0" data-ccp-props="{&quot;134245417&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:862,&quot;335559737&quot;:130,&quot;335559738&quot;:131,&quot;335559740&quot;:218}"> </span></p> Kai Liang, Chuanfeng Li, Qiong Duan Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/40247 Mon, 14 Jul 2025 00:00:00 +0300 Learn from Adversarial Examples: Learning-Based Attack on Time Series Forecasting https://itc.ktu.lt/index.php/ITC/article/view/37758 <p><span class="TextRun SCXW125103179 BCX0" lang="EN-US" xml:lang="EN-US" data-contrast="auto"><span class="NormalTextRun SCXW125103179 BCX0">Adversarial</span> <span class="NormalTextRun ContextualSpellingAndGrammarErrorV2Themed SCXW125103179 BCX0">attack</span> <span class="NormalTextRun SCXW125103179 BCX0">in</span> <span class="NormalTextRun 
SCXW125103179 BCX0">time series forecasting (TSF) has attracted growing interest in recent years. While some black-box attack methods have been proposed for TSF, they require continuous queries to the target model, and the computational cost increases as model and data complexity grow. In fact, the perturbations generated by these methods exhibit characteristic patterns, especially under an L0-norm constraint, and such patterns can be captured and learned by a model. In this study, we propose Learning-Based Attack (LBA), a novel black-box adversarial attack method for TSF tasks that focuses on the adversarial examples themselves, i.e., the perturbed data. By utilizing a model to learn adversarial examples and generate similar ones, we achieve performance comparable to the original attack methods while significantly reducing the number of queries to the target model, ensuring high efficiency and stealthiness. We evaluate our method on several public datasets. In this paper, we learn the adversarial samples produced by the n-Values Time Series Attack (nVITA), a sparse black-box attack for TSF. The results show that we can effectively learn the attack information and generate similar adversarial samples with lower computational overhead, thus achieving both stealthiness and efficiency. Furthermore, we verify the transferability of our method and find that it can also be applied to attack other models. Our code is available on GitHub. </span></span></p> Youbang Xiao, Zhongguo Yang, Qi Zou, Peng Zhang Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/37758 Mon, 14 Jul 2025 00:00:00 +0300 An Early Warning Model for Industrial Network Security Issues: A Crafted Strategy for High Accuracy Based on Machine Learning Approach https://itc.ktu.lt/index.php/ITC/article/view/39543 <p>Industrial networks have become important infrastructure. As they develop, their cybersecurity problems become more and more prominent: attacks on networks are advancing faster than ever, and their destructive force keeps growing. The available early warning technology for industrial network security issues therefore requires greater accuracy and timeliness, since serious delays occur in real cases. The article proposes a high-accuracy strategy based on a machine-learning algorithm. 
Two significant parts of the problem are the nonlinear, high-dimensional data with diverse feature characteristics in cyber-attacks and the low training efficiency of conventional early warning models in predicting attacks. Thus, the manuscript suggests a feature selection method based on the Tuna Swarm Optimization (TSO) algorithm to filter out redundant features and reduce the data’s dimensionality. Then, the Extreme Learning Machine (ELM) and Auto-Encoder (AE) are combined to construct the Extreme Learning Machine-Auto Encoder (ELM-AE) model, which serves as the basis of the early warning model for industrial network security. Afterward, the improved Whale Optimization Algorithm (I-WOA) is used to optimize the parameters of the ELM, yielding the optimized model. Finally, the optimized model is applied as an early warning method to detect attacks on industrial cybersecurity systems. The proposed model is then tested by constructing an evaluation index system that measures how effectively the early warning system functions. The experimental results show that the proposed warning model for industrial network security issues achieves high warning accuracy and efficiency concurrently, providing an advanced early warning model for network attacks. 
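For readers unfamiliar with ELMs, the core idea, a fixed random hidden layer with a closed-form least-squares readout, can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions (hidden size, tanh activation, random seed); the AE coupling and I-WOA parameter tuning described in the abstract are omitted.

```python
import numpy as np

def elm_fit(X, Y, hidden=64, seed=0):
    """Train a basic Extreme Learning Machine: random hidden layer, closed-form readout."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))  # random, untrained input weights
    b = rng.normal(size=hidden)
    H = np.tanh(X @ W + b)                     # hidden-layer activations
    beta = np.linalg.pinv(H) @ Y               # Moore-Penrose least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```

Because only `beta` is fitted, and in a single linear solve, training is far faster than backpropagation, which is the efficiency property the early warning model relies on.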
With 92.64% precision and a 51.84 s average execution time, the proposed model outperforms the other methods.</p> Xiang Le, Yong Zhao Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/39543 Mon, 14 Jul 2025 00:00:00 +0300 Hybrid Attention Approach for Source Code Comment Generation https://itc.ktu.lt/index.php/ITC/article/view/36699 <p><span class="TextRun SCXW190754959 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="auto"><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">Currently, developers are often </span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">obligated</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> to enhance code quality. High-quality code is often accompanied by comprehensive summaries, including code documentation and function explanations, which are invaluable for maintenance and further development. Regrettably, few software projects provide sufficient code comments owing to the </span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">high costs</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> associated with human </span><span class="NormalTextRun SpellingErrorV2Themed SCXW190754959 BCX0" data-ccp-parastyle="Body Text">labeling</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">. Contemporary researchers in software engineering concentrate on methods for automated comment generation. Initial algorithms depended on handwritten templates or information retrieval methods. With the advancement of machine learning, researchers instead construct automated models based on machine translation. 
Nonetheless, the produced code comments </span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">remain</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> inadequate owing to the significant disparity between code structure and natural language. This study introduces a unique deep learning model, At-</span><span class="NormalTextRun SpellingErrorV2Themed SCXW190754959 BCX0" data-ccp-parastyle="Body Text">ComGen</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">, which </span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">utilizes</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> hybrid attention for the automated creation of source code comments. Utilizing two separate LSTM encoders, our approach integrates essential tokens from source code functions with the code structure, represented by a corresponding Abstract Syntax Tree. In contrast to earlier data-driven models, our </span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">methodology</span> <span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">utilizes</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> code syntax and semantics in the generation of comments. The hybrid attention method, used for comment creation for the first time to our knowledge, enhances the quality of code comments. The tests </span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">demonstrate</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> that At-</span><span class="NormalTextRun SpellingErrorV2Themed SCXW190754959 BCX0" data-ccp-parastyle="Body Text">ComGen</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> is efficacious and surpasses other prevalent methodologies. 
</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">Machine comments from Seq2Seq and CODE-NN disregard the code structure that underlies </span><span class="NormalTextRun SpellingErrorV2Themed SCXW190754959 BCX0" data-ccp-parastyle="Body Text">DeepCom</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> and At-</span><span class="NormalTextRun SpellingErrorV2Themed SCXW190754959 BCX0" data-ccp-parastyle="Body Text">ComGen</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text">. At-</span><span class="NormalTextRun SpellingErrorV2Themed SCXW190754959 BCX0" data-ccp-parastyle="Body Text">ComGen</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> has 59.3%, 36.4%, 43.3%, and 13.1% higher comment BLEU values than baseline models for a 5-line function. Even though model performance declines as comment length grows, At-</span><span class="NormalTextRun SpellingErrorV2Themed SCXW190754959 BCX0" data-ccp-parastyle="Body Text">ComGen's</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> comments often outperform others. 5–10-word machine comments work best. For reference length 10, At-</span><span class="NormalTextRun SpellingErrorV2Themed SCXW190754959 BCX0" data-ccp-parastyle="Body Text">ComGen</span><span class="NormalTextRun SCXW190754959 BCX0" data-ccp-parastyle="Body Text"> has 38.2%, 23.7%, 9.3%, and 4.4% greater BLEU values than the other baseline models. 
</span></span><span class="EOP SCXW190754959 BCX0" data-ccp-props="{&quot;134245417&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:861,&quot;335559737&quot;:132,&quot;335559738&quot;:131,&quot;335559740&quot;:218}"> </span></p> Yao Meng Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/36699 Mon, 14 Jul 2025 00:00:00 +0300 Embedding Numerical Features and Meta-Features in Tabular Deep Learning https://itc.ktu.lt/index.php/ITC/article/view/39134 <p><span class="TextRun SCXW135651063 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="none"><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">Tabular data is ubiquitous in real-world applications, and an increasing number of deep learning approaches have been developed for tabular </span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">data prediction. Among these approaches, embedding techniques serve as both a common and essential </span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">component</span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">. However, the design of tabular embedding paradigms </span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">remains</span> <span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">relatively limited</span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">, and there is a lack of systematic evaluation </span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">regarding</span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text"> the performa</span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">nce of many existing methods in specific scenarios. 
In this paper, we focus on embedding numerical features and meta-features. To enrich the embedding methods for numerical features, we propose an ordering-oriented regularization technique applicable to pi</span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">ecewise linear embeddings, along with an unsupervised feature grouping method to </span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">facilitate</span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text"> partial embedding sharing. We </span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">demonstrate</span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text"> that these methods contribute to building more efficient and lightweight embedding modules. Importantly, we highlight orde</span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">ring and sharing as two promising directions in the design of embeddings for numerical features. 
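A piecewise linear embedding of a scalar feature can be sketched as a bin-fraction encoding. This is a generic illustration of the scheme, not the authors' implementation; the handling of values outside the edges is an assumption.

```python
import numpy as np

def piecewise_linear_encode(x, edges):
    """Encode scalar x against sorted bin edges: 1.0 for bins entirely below x,
    a fractional fill for the bin containing x, and 0.0 for bins above it."""
    out = np.zeros(len(edges) - 1)
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if x >= hi:
            out[i] = 1.0                      # bin entirely below x
        elif x > lo:
            out[i] = (x - lo) / (hi - lo)     # partial fill in x's own bin
    return out
```

Unlike a raw scalar, this vector is monotone component-wise in `x`, which is the ordering property that regularization techniques for such embeddings aim to preserve.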
Additionally, we address several evaluation gaps: we assess the robustness of existing embeddings for numerical features and evaluate a set of general designs </span><span class="NormalTextRun SCXW135651063 BCX0" data-ccp-parastyle="Body Text">separately for data type embeddings and positional embeddings, providing insights into their practical applications and further developments.</span></span><span class="EOP SCXW135651063 BCX0" data-ccp-props="{&quot;134245417&quot;:false,&quot;201341983&quot;:2,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:862,&quot;335559737&quot;:130,&quot;335559738&quot;:131,&quot;335559740&quot;:18}"> </span></p> Xingyu Ma, Bin Yao Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/39134 Mon, 14 Jul 2025 00:00:00 +0300 ORPTQ: An Improved Large Model Quantization Method Based on Optimal Quantization Range https://itc.ktu.lt/index.php/ITC/article/view/40573 <p><span class="TextRun SCXW244387556 BCX0" lang="EN-US" xml:lang="EN-US" data-contrast="none"><span class="NormalTextRun SCXW244387556 BCX0" data-ccp-parastyle="Body Text">Quantization reduces model storage by </span><span class="NormalTextRun SCXW244387556 BCX0" data-ccp-parastyle="Body Text">representing</span> <span class="NormalTextRun ContextualSpellingAndGrammarErrorV2Themed SCXW244387556 BCX0" data-ccp-parastyle="Body Text">the model</span><span class="NormalTextRun SCXW244387556 BCX0" data-ccp-parastyle="Body Text"> in low bits. It can help to improve the application capability of transformer-based large models </span></span><span class="TextRun SCXW244387556 BCX0" lang="EN-US" xml:lang="EN-US" data-contrast="auto"><span class="NormalTextRun SCXW244387556 BCX0" data-ccp-parastyle="Body Text">and make it possible to deploy them on resource-limited systems such as PCs</span><span class="NormalTextRun SCXW244387556 BCX0" data-ccp-parastyle="Body Text"> and mobile devices. 
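As background, the basic uniform weight quantization that such methods refine can be sketched as a round-to-nearest baseline. This is illustrative only, not ORPTQ itself; the bit width and symmetric max-based scaling are assumptions.

```python
import numpy as np

def quantize(w, bits=4):
    """Uniform symmetric weight quantization to signed integers (sketch)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed values
    scale = np.abs(w).max() / qmax             # map the weight range onto [-qmax, qmax]
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```

The round-trip error of this baseline is bounded by half the step size `scale / 2`; choosing a better quantization range (rather than the raw maximum) is precisely the kind of refinement range-optimizing methods pursue.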
</span></span>The current best weight-only quantization method uses second-order information to fine-tune the weights step by step during the quantization process, compensating for the quantization errors that have already occurred. By adjusting the remaining elements through algebraic transformations at each step, the method minimizes the functional loss of the weights caused by quantization. However, its performance deteriorates rapidly when the weight adjustment deviates too far from the starting point, especially in low-bit quantization (e.g., 4 bits or fewer). To satisfy the mathematical prerequisite of this method during quantization, this paper introduces two parameters, α and β, to adjust the quantization range based on the second-order method, and presents three approaches for finding their optimal values. The experimental results show that the proposed method significantly outperforms the original second-order method in low-bit quantization. The code for this paper is available at github.com/t-scen/ORPTQ. <span class="EOP SCXW244387556 BCX0" data-ccp-props="{&quot;134245417&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:861,&quot;335559737&quot;:132,&quot;335559738&quot;:131,&quot;335559740&quot;:218}"> </span></p> Shicen Tian, Kejie Huang Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/40573 Mon, 14 Jul 2025 00:00:00 +0300 WNASNet: Wavelet-Guided Neural Architecture Search for Efficient Single-Image De-raining 
https://itc.ktu.lt/index.php/ITC/article/view/40643 <p>On rainy days, the uncertain shape and distribution of rain streaks can cause the images captured by RGB image-based measurement equipment to be blurred and distorted. The wavelet transform is extensively utilized in conventional image-enhancement techniques because of its capacity to deliver both spatial- and frequency-domain information and its multidirectional and multiscale characteristics. In image de-raining, the distribution of rain streaks is intricately linked to both spatial-domain characteristics and frequency-domain attributes. Nonetheless, deep learning-based rain removal models predominantly depend on the spatial characteristics of the image, and RGB data is sometimes insufficient to differentiate rain streaks from image details, resulting in the loss of essential image information during the rain removal process. To overcome this limitation, we have created a lightweight single-image rain removal model named the wavelet-enhanced neural architecture search network (WNASNet). This technique isolates image features from rain-affected images and can eliminate rain artifacts more efficiently. The proposed WNASNet makes three notable contributions. First, it utilizes the wavelet transform to extract multi-frequency feature components and allocates a distinct feature search block (FSB) to each component, facilitating the identification of task-specific feature extraction networks to enhance de-raining efficacy. Second, we present a straightforward yet efficient wavelet feature fusion technique (SFF) that selectively employs high- and low-frequency features during the inverse wavelet transform. This method maintains de-raining efficacy while substantially decreasing computational complexity relative to conventional frequency-blending techniques. Comprehensive studies on four synthetic and two real-world datasets illustrate the superior performance of WNASNet across many evaluation measures, including PSNR, SSIM, LPIPS, NIQE, and BRISQUE, thereby verifying its efficacy and robustness for single-image de-raining tasks. </p> Wenyin Tao, Qiang Chen, Chunjiang Yu Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/40643 Mon, 14 Jul 
2025 00:00:00 +0300 Enhancing Open-Set Few-Shot Object Detection with Limited Visual Prompts https://itc.ktu.lt/index.php/ITC/article/view/41078 <p>The text-prompt-based open-vocabulary object detection model effectively encapsulates the abstract concepts of common objects, thereby overcoming the limitation of pre-trained models that are restricted to detecting a fixed, predefined set of categories. However, due to data scarcity and the constraints of textual descriptions, representing rare or complex objects through text alone remains challenging. In this study, we propose an open-set detection model that supports both visual and textual prompt queries (VTP-OD) to enhance few-shot object detection. A small number of visual prompts not only provide rich class-wise visual features, which enhance class textual representations, but also enable flexible extension to new classes for different downstream tasks. Specifically, we incorporate two cross-attention-based adaptation modules to adapt the pre-trained vision-language model so that it supports both text and visual queries. These modules facilitate (i) visual fusion between a limited number of visual prompts and query images and (ii) visual-language fusion between class-aware visual features and the textual representations of the classes. Subsequently, the model undergoes prompt tuning on the available few-shot downstream data to adapt to the target detection tasks. Experimental results demonstrate that our model outperforms the pre-trained model on the LVIS and COCO benchmarks. Furthermore, we validate its effectiveness on the real-world CoalMine dataset. </p> Qinghua Yang, Yan Tian, Jing Sun, Fangyuan He Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/41078 Mon, 14 Jul 2025 00:00:00 +0300 Towards Real-World Power Grid Scenarios: Video Action Detection with Cross-scale Selective Context Aggregation https://itc.ktu.lt/index.php/ITC/article/view/41005 <p><span class="TextRun SCXW73769680 BCX0" lang="EN-GB" xml:lang="EN-GB" data-contrast="none"><span class="NormalTextRun SCXW73769680 BCX0" data-ccp-parastyle="Body Text">In this study, we propose a single-stage model for video action detection and POWER, a real-world action detection dataset collected from real power operation scenarios. While previous studies have made significant progress in overall classification and localization performance, they often struggle with actions of short duration, hindering the application of these approaches. 
To address this, we introduce the Cross-scale Selective Context Aggregation Network (CSCAN), which focuses on improving the detection of short actions. The network integrates three key components: 1) a cross-scale feature conduction structure combined with a tailored alignment mechanism; 2) a selective context aggregation module based on a gating mechanism; and 3) an effective scale-invariant consistency training strategy that enables the model to learn scale-invariant action representations. We evaluated our method on the self-collected POWER dataset and on the most widely used action detection benchmarks, THUMOS14 and ActivityNet v1.3. The extensive results show that our model outperforms other approaches, especially in detecting real-world short actions, demonstrating the effectiveness of our approach.</span></span><span class="EOP SCXW73769680 BCX0" data-ccp-props="{&quot;134245417&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559685&quot;:861,&quot;335559737&quot;:132,&quot;335559738&quot;:131,&quot;335559740&quot;:218}"> </span></p> Lingwen Meng, Siwu Yu, Shasha Luo, Anjun Li Copyright (c) 2025 Information Technology and Control https://itc.ktu.lt/index.php/ITC/article/view/41005 Mon, 14 Jul 2025 00:00:00 +0300
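<p>The ORPTQ abstract in this issue turns on a simple idea: in low-bit weight quantization, the clipping range chosen before rounding largely determines the error. The following is a minimal sketch of uniform asymmetric quantization with two range-scaling parameters; the function names are illustrative, and the exact role the paper assigns to α and β (here, simple scaling of the min/max clipping bounds) is an assumption, not the paper's actual parameterization.</p>

```python
import numpy as np

def quantize_weights(w, bits=4, alpha=1.0, beta=1.0):
    """Uniform asymmetric quantization of a weight tensor.

    alpha and beta scale the lower and upper ends of the clipping
    range before the step size is computed; alpha = beta = 1 recovers
    plain min/max quantization.  (Illustrative sketch only -- not the
    ORPTQ paper's parameterization.)
    """
    lo, hi = alpha * w.min(), beta * w.max()
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax
    # Round to the nearest level, clipping outliers that fall outside
    # the shrunken range.
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return q.astype(np.int32), scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

# Shrinking the range (alpha, beta < 1) clips a few outlier weights but
# reduces the rounding error for the bulk of the distribution -- the
# trade-off that motivates searching for an optimal range.
w = np.random.randn(256)
q, scale, lo = quantize_weights(w, bits=4, alpha=0.8, beta=0.8)
w_hat = dequantize(q, scale, lo)
```

<p>In this sketch the three "approaches to seek optimal values" mentioned in the abstract would correspond to different search strategies over (α, β), e.g. grid search minimizing the reconstruction error between w and w_hat.</p>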