• Holme Griffith published 1 year, 3 months ago

In particular, our CCNet achieves mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are new state-of-the-art results. The source code is available at https://github.com/speedinghzl/CCNet.

This paper analyzes regularization terms recently proposed for improving the adversarial robustness of deep neural networks (DNNs), from a theoretical point of view. Specifically, we study possible connections between several effective methods, including input-gradient regularization, Jacobian regularization, curvature regularization, and a cross-Lipschitz functional. We investigate them on DNNs with general rectified linear activations, which constitute one of the most prevalent families of models for image classification and a host of other machine learning applications. We shed light on the essential ingredients of these regularizations and re-interpret their functionality. Through the lens of our study, more principled and efficient regularizations may be devised in the near future.

Graph matching aims to establish node correspondences between two graphs, a problem that has long been fundamental owing to its NP-complete nature. One practical consideration is the effective modeling of the affinity function in the presence of noise, such that the mathematically optimal matching result is also physically meaningful. This paper resorts to deep neural networks to learn the node and edge features, as well as the affinity model, for graph matching in an end-to-end fashion. The learning is supervised by a combinatorial permutation loss over nodes. Specifically, the learned parameters belong to convolutional neural networks for image feature extraction, graph neural networks for node embedding that convert structural (beyond second-order) information into node-wise features, which reduces matching to a linear assignment problem, and the affinity kernel between the two graphs. Our approach enjoys flexibility in that the permutation loss is agnostic to the number of nodes, and the embedding model is shared among nodes, so the network can deal with varying numbers of nodes during both training and inference. Moreover, our network is class-agnostic. Experimental results on extensive benchmarks show its state-of-the-art performance. It exhibits some generalization capability across categories and datasets, and is capable of robust matching against outliers.
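
To make the combinatorial permutation loss above concrete, here is a minimal sketch in PyTorch. The Sinkhorn normalization and the binary cross-entropy form are common choices for this kind of loss; the function names, toy sizes, and this exact formulation are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def sinkhorn(log_scores, n_iters=10):
    # Alternately normalize rows and columns in log space so that the
    # result is (approximately) doubly stochastic.
    for _ in range(n_iters):
        log_scores = log_scores - torch.logsumexp(log_scores, dim=1, keepdim=True)
        log_scores = log_scores - torch.logsumexp(log_scores, dim=0, keepdim=True)
    return log_scores.exp()

def permutation_loss(log_scores, gt_perm):
    # gt_perm: (n, n) 0/1 ground-truth permutation matrix. Element-wise
    # binary cross-entropy between the predicted doubly-stochastic matrix
    # and the permutation matrix; note it is agnostic to the node count n.
    s = sinkhorn(log_scores).clamp(1e-8, 1 - 1e-8)
    return -(gt_perm * s.log() + (1 - gt_perm) * (1 - s).log()).mean()

# Toy usage: matching 4 nodes to 4 nodes with the identity as ground truth.
log_scores = torch.randn(4, 4, requires_grad=True)
gt_perm = torch.eye(4)
loss = permutation_loss(log_scores, gt_perm)
loss.backward()
```
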
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominant technique in AI, deep learning has been used successfully to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges posed by processing point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks: 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring directions for future research.

This paper addresses the problem of photometric stereo, in both calibrated and uncalibrated scenarios, for non-Lambertian surfaces based on deep learning. We first introduce a fully convolutional deep network for calibrated photometric stereo, which we call PS-FCN. Unlike traditional approaches that adopt simplified reflectance models to make the problem tractable, our method directly learns the mapping from reflectance observations to surface normals, and is able to handle surfaces with general and unknown isotropic reflectance. At test time, PS-FCN takes an arbitrary number of images and their associated light directions as input and predicts a surface normal map of the scene in a fast feed-forward pass. To deal with the uncalibrated scenario, where light directions are unknown, we introduce a new convolutional network, named LCNet, to estimate light directions from the input images. The estimated light directions and the input images are then fed to PS-FCN to determine the surface normals. Our method does not require a pre-defined set of light directions and can handle multiple images in an order-agnostic manner. Thorough evaluation of our approach on both synthetic and real datasets shows that it outperforms state-of-the-art methods in both calibrated and uncalibrated scenarios.

In this work, we introduce the average top-k (ATk) loss, which is the average of the k largest individual losses over the training data, as a new aggregate loss for supervised learning. We show that the ATk loss is a natural generalization of the two most widely used aggregate losses, namely the average loss and the maximum loss, yet it can better adapt to different data distributions because of the extra flexibility provided by the choice of k. Furthermore, it remains a convex function of the individual losses and can be combined with different types of individual loss without a significant increase in computation. We then provide interpretations of the ATk loss from the perspectives of the modification of individual losses and robustness to training data distributions. We further study the classification calibration of the ATk loss and the error bounds of the ATk-SVM model. We demonstrate the applicability of minimum average top-k learning to supervised learning problems, including binary/multi-class classification and regression, using experiments on both synthetic and real datasets.
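
Because the ATk loss has such a compact definition, a short sketch makes the generalization explicit. This is a minimal PyTorch illustration; the function and variable names are mine, and hinge loss is just one arbitrary choice of individual loss:

```python
import torch

def atk_loss(individual_losses, k):
    # Average of the k largest individual losses. k = N recovers the
    # average loss; k = 1 recovers the maximum loss.
    topk_values, _ = torch.topk(individual_losses, k)
    return topk_values.mean()

# Toy usage with hinge losses for a binary classifier.
scores = torch.randn(8, requires_grad=True)            # model outputs f(x_i)
labels = torch.tensor([1., -1., 1., 1., -1., -1., 1., -1.])
individual = torch.clamp(1 - labels * scores, min=0)   # hinge loss per sample

avg_loss = atk_loss(individual, k=individual.numel())  # average loss
max_loss = atk_loss(individual, k=1)                   # maximum loss
atk      = atk_loss(individual, k=3)                   # average top-3
atk.backward()
```

Convexity in the individual losses is preserved because the average of the k largest entries is a pointwise maximum of averages over all size-k subsets.
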
In this paper, we propose a novel approach to two-view minimal-case relative pose problems based on homography with a known gravity direction. This case is relevant to smartphones, tablets, and other camera-IMU (inertial measurement unit) systems, which have accelerometers to measure the gravity vector. We exploit the rank-1 constraint on the difference between the Euclidean homography matrix and the corresponding rotation, and propose an efficient two-step solution for solving both the calibrated and semi-calibrated (unknown focal length) problems. Based on the hidden variable technique, we convert the problems into polynomial eigenvalue problems and derive new 3.5-point, 3.5-point, and 4-point solvers for the cases where the two focal lengths are unknown but equal, where one of them is unknown, and where both are unknown and possibly different, respectively. We present detailed analyses and comparisons with the existing 6- and 7-point solvers, including results on smartphone images.
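
The rank-1 constraint mentioned above follows directly from the standard plane-induced homography decomposition. The symbols below are the conventional ones, not notation taken from the paper: R is the relative rotation, t the translation, and (n, d) the normal and distance of the scene plane in the first camera.

```latex
% Plane-induced Euclidean homography between two views:
H = R + \frac{1}{d}\, t\, n^{\top}
% Hence the difference from the rotation is an outer product of two
% vectors, which has rank at most one:
H - R = \frac{1}{d}\, t\, n^{\top}, \qquad \operatorname{rank}(H - R) \le 1
```
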
This paper presents a photometric stereo method based on deep learning. One of the major difficulties in photometric stereo is designing an appropriate reflectance model that is both capable of representing real-world reflectances and computationally tractable for deriving surface normals. Unlike previous photometric stereo methods that rely on a simplified parametric image formation model, such as Lambert's model, the proposed method aims at establishing a flexible mapping between complex reflectance observations and surface normals using a deep neural network. In addition, the proposed method predicts the reflectance, which allows us to understand surface materials and to render the scene under arbitrary lighting conditions. Accordingly, we propose a deep photometric stereo network (DPSN) that takes reflectance observations under varying light directions and infers the surface normal and reflectance in a per-pixel manner. To make DPSN applicable to real-world scenes, a dataset of measured BRDFs (the MERL BRDF dataset) is used to train the network. Evaluation on simulated and real-world scenes shows the effectiveness of the proposed approach in estimating both surface normals and reflectance.

VQA is the task of answering natural language questions tied to the content of visual images. Most VQA approaches apply an attention mechanism to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf visual relation reasoning methods. However, they still suffer from several drawbacks. First, they mostly model only simple relations between objects, so many complicated questions cannot be answered correctly for lack of sufficient relational knowledge. Second, they seldom exploit the harmonious cooperation of visual appearance features and relation features. To solve these problems, we propose a novel end-to-end VQA model, termed Multi-modal Relation Attention Network (MRA-Net). The proposed model explores both textual and visual relations to improve performance and interpretability. Specifically, we devise 1) a self-guided word relation attention scheme, which explores the latent semantic relations between words, and 2) two question-adaptive visual relation attention modules that can extract not only fine-grained and precise binary relations between objects but also more sophisticated trinary relations. Both kinds of question-related visual relations provide more, and deeper, visual semantics, thereby improving the visual reasoning ability of question answering. Furthermore, the proposed model combines appearance features with relation features to reconcile the two types of features.

We show that existing upsampling operators can be unified using the notion of the index function. This notion is inspired by an observation in the decoding process of deep image matting, where indices-guided unpooling can often recover boundary details considerably better than other upsampling operators such as bilinear interpolation. By viewing the indices as a function of the feature map, we introduce the concept of 'learning to index' and present a novel index-guided encoder-decoder framework in which indices are learned adaptively from data and are used to guide the downsampling and upsampling stages, without extra training supervision. At the core of this framework is a new learnable module, termed the Index Network (IndexNet), which dynamically generates indices conditioned on the feature map. IndexNet can be used as a plug-in applicable to almost all convolutional networks that have coupled downsampling and upsampling stages, enabling such networks to dynamically capture variations of local patterns. In particular, we instantiate and investigate five families of IndexNet, highlight their superiority in delivering spatial information over other upsampling operators with experiments on synthetic data, and demonstrate their effectiveness on four dense prediction tasks: image matting, image denoising, semantic segmentation, and monocular depth estimation. Code is available at https://git.io/IndexNet.
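
The indices-guided unpooling that motivates IndexNet is easy to demonstrate with the standard max-pooling/max-unpooling pair in PyTorch. This sketch shows only the hard-argmax baseline, not IndexNet's learned indices:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)  # a toy feature map

# Downsample while recording *where* each maximum came from.
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)

# ... decoder processing of `pooled` would happen here ...

# Upsample by scattering values back to the recorded locations,
# preserving the spatial detail that bilinear interpolation blurs.
unpooled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)

# IndexNet replaces these hard argmax indices with indices predicted
# from the feature map itself, learned end-to-end without extra supervision.
```
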

Large-scale distributed training of deep neural networks results in models with worse generalization performance as a consequence of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or by ad hoc modifications of batch normalization. We propose SP-NGD (Scalable and Practical Natural Gradient Descent), a principled approach for training models that allows them to attain similar generalization performance to models trained with first-order optimization methods, but with accelerated convergence. Furthermore, SP-NGD scales to large mini-batch sizes with negligible computational overhead compared to first-order methods. We evaluated SP-NGD on a benchmark task for which highly optimized first-order methods are available as references: training a ResNet-50 model for image classification on ImageNet. We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.
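
As background, the natural gradient update that SP-NGD builds on preconditions the ordinary gradient with the inverse Fisher information matrix. The formula below is the textbook form in conventional notation; the paper's contribution lies in approximating and distributing this update efficiently, which the formula does not capture:

```latex
% Natural gradient descent on parameters \theta with learning rate \eta:
\theta_{t+1} = \theta_t - \eta\, F^{-1} \nabla_{\theta} \mathcal{L}(\theta_t),
\qquad
F = \mathbb{E}\!\left[ \nabla_{\theta} \log p(y \mid x; \theta)\,
                       \nabla_{\theta} \log p(y \mid x; \theta)^{\top} \right]
```
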