All my publications are listed below, together with their abstracts and links to the papers in the University ePrints service. For a list of my publications ordered by citations, please see my Google Scholar page.
Iterative algorithms solve problems by taking steps until a solution is reached. Models in the form of Deep Thinking (DT) networks have been demonstrated to learn iterative algorithms in a way that can scale to different-sized problems at inference time using recurrent computation and convolutions. However, they are often unstable during training, and have no guarantees of convergence/termination at the solution. This paper addresses the problem of instability by analyzing the growth in intermediate representations, allowing us to build models, referred to as Deep Thinking with Lipschitz Constraints (DT-L), with far fewer parameters that provide more reliable solutions. Additionally, our DT-L formulation provides guarantees that the learned iterative procedure converges to a unique solution at inference time. We demonstrate that DT-L is capable of robustly learning algorithms which extrapolate to harder problems than those in the training set. We benchmark on the traveling salesperson problem to evaluate the capabilities of the modified system on an NP-hard problem where DT fails to learn.
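As a rough illustration of the convergence argument (not the authors' implementation), the sketch below builds a recurrent convolutional block whose update is made strictly contractive by combining spectral normalisation (a common approximation of the conv operator norm) with a damping factor below one, so repeated application approaches a unique fixed point. Layer sizes, iteration budget and tolerance are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class ContractiveThinkingBlock(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # Spectral normalisation bounds each conv's Lipschitz constant by ~1;
        # the 0.9 factor in forward() makes the recurrent update strictly contractive.
        self.step = nn.Sequential(
            spectral_norm(nn.Conv2d(2 * channels, channels, 3, padding=1)),
            nn.ReLU(),
            spectral_norm(nn.Conv2d(channels, channels, 3, padding=1)),
        )

    def forward(self, x, n_iters=100, tol=1e-4):
        h = torch.zeros_like(x)
        for _ in range(n_iters):                       # repeated "thinking" steps
            h_next = 0.9 * self.step(torch.cat([h, x], dim=1))
            if (h_next - h).abs().max() < tol:         # (approximate) fixed point reached
                return h_next
            h = h_next
        return h

out = ContractiveThinkingBlock()(torch.randn(1, 32, 16, 16))
print(out.shape)
```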
Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to several key advantages in latency, privacy and always-on availability. However, due to limited computing resources, efficient DNN deployment on mobile and embedded platforms is challenging. Although many hardware accelerators and static model compression methods have been proposed in previous works, at system runtime multiple applications are typically executed concurrently and compete for hardware resources. This raises two main challenges: Runtime Hardware Availability and Runtime Application Variability. Previous works have addressed these challenges through either dynamic neural networks that contain sub-networks with different performance trade-offs, or runtime hardware resource management. In this thesis, we propose a combined method: a system for DNN performance trade-off management that combines the runtime trade-off opportunities in both algorithms and hardware to meet dynamically changing application performance targets and hardware constraints in real time. We co-designed novel Dynamic Super-Networks to maximise runtime system-level performance and energy efficiency on heterogeneous hardware platforms. Compared with the state of the art, our experimental results using ImageNet on the GPU of a Jetson Xavier NX show our model is 2.4x faster for similar ImageNet Top-1 accuracy, or 5.1% higher in accuracy at similar latency. We also designed a hierarchical runtime resource manager that tunes both dynamic neural networks and DVFS at runtime. Compared with the Linux DVFS governor schedutil, our runtime approach achieves up to a 19% energy reduction and a 9% latency reduction in a single-model deployment scenario, and an 89% energy reduction and a 23% latency reduction in a two-concurrent-model deployment scenario.
Distributed inference is a popular approach for efficient DNN inference at the edge. However, traditional Static and Dynamic DNNs are not distribution-friendly, causing system reliability and adaptability issues. In this paper, we introduce Fluid Dynamic DNNs (Fluid DyDNNs), tailored for distributed inference. Distinct from Static and Dynamic DNNs, Fluid DyDNNs utilize a novel nested incremental training algorithm to enable independent and combined operation of their sub-networks, enhancing system reliability and adaptability. Evaluation on embedded Arm CPUs with a DNN model and the MNIST dataset shows that, in scenarios of single-device failure, Fluid DyDNNs ensure continued inference, whereas Static and Dynamic DNNs fail. When devices are fully operational, Fluid DyDNNs can operate either in a High-Accuracy mode, achieving accuracy comparable to Static DNNs, or in a High-Throughput mode, achieving 2.5x and 2x the throughput of Static and Dynamic DNNs, respectively.
This dataset supports the publication: "Fluid Dynamic DNNs for Reliable and Adaptive Distributed Inference on Edge Devices" by Lei Xun, Mingyu Hu, Hengrui Zhao, Amit Kumar Singh, Jonathon Hare and Geoff V. Merrett, presented at the Design, Automation and Test in Europe Conference 2024. This dataset includes the experimental results for Figure 2 of the paper, showing the throughput and accuracy of the different models (static, dynamic and fluid) considered under different distributed-system cases (master & worker, master, worker). This dataset contains: - 'data.csv': Data supporting Fig. 2. The throughput and accuracy of the different models (static, dynamic and fluid) considered under different distributed-system cases (master & worker, master, worker). Related projects: International Centre for Spatial Computational Learning
Dataset supporting the publication "Efficient Deployment of Early-Exit DNN Architectures on FPGA Platforms", presented at the Design, Automation & Test in Europe Conference. This dataset contains: 'Fig2a.csv': Data supporting Fig. 2 (a). Execution time in ms of the Dynamic Deep Neural Network on different platforms (CPU, CPU+GPU, Jetson Xavier and FPGA Xilinx ZCU106). 'Fig2b.csv': Data supporting Fig. 2 (b). Energy consumption and power needed for the execution of the Dynamic Deep Neural Network on different platforms (CPU, CPU+GPU, Jetson Xavier and FPGA Xilinx ZCU106). 'Fig3.csv': Data supporting Fig. 3. Number of samples first correctly predicted after the execution of each layer of ResNet-32. Related projects: Engineering and Physical Sciences Research Council (EPSRC) under EP/S030069/1 Licence: CC BY 4.0
Conventional approaches to TinyML achieve high accuracy by deploying the largest deep learning model, with the highest input resolution, that fits within the size constraints imposed by the microcontroller's (MCU's) fast internal storage and memory. In this paper, we perform an in-depth analysis of prior works to show that models derived within these constraints suffer from low accuracy and, surprisingly, high latency. We propose an alternative approach that enables the deployment of efficient models with low inference latency, free from the constraints of internal memory. We take a holistic view of typical MCU architectures, and utilise plentiful but slower external memories to relax internal storage and memory constraints. To avoid the lower speed of external memory impacting inference latency, we build on the TinyOps inference framework, which performs operation partitioning and uses overlays via DMA to reduce latency. Using insights from our study, we deploy efficient models from the TinyOps design space onto a range of embedded MCUs, achieving record performance on TinyML ImageNet classification with up to 6.7% higher accuracy and 1.4x faster inference compared to state-of-the-art internal memory approaches.
Deep Neural Network (DNN) inference is increasingly being deployed on edge devices, driven by the advantages of lower latency and enhanced privacy. However, the deployment of these models on such platforms poses considerable challenges due to the intensive computation and memory access requirements. While various static model compression techniques have been proposed, they often struggle when adapting to the dynamic computing environments of modern heterogeneous platforms. The two main challenges we focus on in our research are: (1) Dynamic Hardware and Runtime Conditions: Modern edge devices are equipped with heterogeneous computing resources, including CPUs, GPUs, NPUs, and FPGAs. Their availability and performance can change dynamically during runtime, influenced by factors such as device state, power constraints, and thermal conditions. Moreover, DNN models may need to share resources with other applications or models, introducing an additional layer of complexity to the quest for consistent performance and efficiency. (2) Dynamic Application Requirements: The same DNN model can be used in a variety of applications, each with unique and potentially fluctuating performance requirements.
In this poster, we will explore the world of dynamic neural networks, with a particular focus on their role in efficient model deployment in dynamic computing environments. Our system leverages runtime trade-offs in both algorithms and hardware to optimize DNN performance and energy efficiency. A cornerstone of our system is the Dynamic-OFA, a dynamic version of the 'once-for-all network', designed to efficiently scale the ConvNet architecture to fit the dynamic application requirements and hardware resources. It exhibits strong generalization across different model architectures, such as Transformers. We will also discuss the benefits of integrating algorithmic techniques with hardware opportunities, including Dynamic Voltage and Frequency Scaling (DVFS) and task mapping. Our experimental results, using ImageNet on a Jetson Xavier NX, reveal that the Dynamic-OFA outperforms state-of-the-art Dynamic DNNs, offering up to 3.5x (CPU) and 2.4x (GPU) speed improvements for similar ImageNet Top-1 accuracy, or a 3.8% (CPU) and 5.1% (GPU) increase in accuracy at similar latency.
Federated Learning (FL) has been an exciting development in machine learning, promising collaborative learning without compromising privacy. However, the resource-intensive nature of Deep Neural Networks (DNNs) has made it difficult to implement FL on edge devices. In a bold step towards addressing this challenge, we present FedTM, the first FL framework to utilize the Tsetlin Machine, a low-complexity machine learning alternative. We propose a two-step aggregation scheme for combining local parameters at the server which addresses challenges such as data heterogeneity, varying participating client ratios and bit-based aggregation. Compared to conventional Federated Averaging (FedAvg) with Convolutional Neural Networks (CNNs), on average, FedTM provides a substantial 30.5× reduction in communication costs and a 36.6× reduction in storage memory footprint. Our results demonstrate that FedTM outperforms BiFL-BiML (SOTA) in every FL setting while providing a 1.37−7.6× reduction in communication costs and a 2.93−7.2× reduction in run-time memory on our evaluated datasets, making it a promising solution for edge devices.
This dataset supports the publication "Exploration of Decision Sub-Network Architectures for FPGA-based Dynamic DNNs", to be published in the Proceedings of the 2023 Design, Automation and Test in Europe Conference and Exhibition. This dataset contains: - 'Fig2a.csv': Data supporting Fig. 2 (a). Execution time in ms of the Dynamic Deep Neural Network on different platforms (CPU, CPU+GPU, Jetson Xavier and FPGA Xilinx ZCU106). - 'Fig2b.csv': Data supporting Fig. 2 (b). Energy consumption and power needed for the execution of the Dynamic Deep Neural Network on different platforms (CPU, CPU+GPU, Jetson Xavier and FPGA Xilinx ZCU106). Related projects: Engineering and Physical Sciences Research Council (EPSRC) under EP/S030069/1 Licence: CC BY 4.0
Dynamic Deep Neural Networks (DNNs) can achieve faster execution and less computationally intensive inference by spending fewer resources on easy-to-recognise or less informative parts of an input. They make data-dependent decisions which strategically deactivate a model's components, e.g. layers, channels or sub-networks. However, dynamic DNNs have only been explored and applied on conventional computing systems (CPU+GPU) and programmed with libraries designed for static networks, limiting their effectiveness. In this paper, we propose and explore two approaches for efficiently realising the sub-networks that make these decisions on FPGAs. A pipeline approach targets the use of the existing hardware to execute the sub-network, while a parallel approach uses dedicated circuitry for it. We explore the performance of each using the BranchyNet early-exit approach on LeNet-5, and evaluate on a Xilinx ZCU106. The pipeline approach is 36% faster than a desktop CPU. It consumes 0.51 mJ per inference, 16x lower than a non-dynamic network on the same platform and 8x lower than an Nvidia Jetson Xavier NX. The parallel approach executes 17% faster than the pipeline approach when no early exits are taken during dynamic inference, but incurs a 28% increase in energy consumption.
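The sketch below illustrates the BranchyNet-style early-exit mechanism that such decision sub-networks implement, using an entropy-based confidence test; the toy layer sizes and threshold are illustrative assumptions, not the configuration evaluated on the ZCU106.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    def __init__(self, n_classes=10, threshold=0.5):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4))
        self.exit1 = nn.Linear(8 * 4 * 4, n_classes)   # decision sub-network (side branch)
        self.stage2 = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.exit2 = nn.Linear(16, n_classes)          # final exit
        self.threshold = threshold

    def forward(self, x):
        h = self.stage1(x)
        logits1 = self.exit1(h.flatten(1))
        p = F.softmax(logits1, dim=1)
        entropy = -(p * p.clamp_min(1e-9).log()).sum(dim=1)
        if entropy.max() < self.threshold:             # confident enough: take the early exit
            return logits1
        return self.exit2(self.stage2(h).flatten(1))   # otherwise run the rest of the network

print(EarlyExitNet()(torch.randn(1, 1, 28, 28)).shape)
```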
Multilayer Perceptrons struggle to learn certain simple arithmetic tasks. Specialist neural modules for arithmetic can outperform classical architectures, with gains in extrapolation, interpretability and convergence speed, but are highly sensitive to the training range. In this paper, we show that Neural Multiplication Units (NMUs) are unable to reliably learn tasks as simple as multiplying two inputs when given different training ranges. Causes of failure are linked to inductive and input biases which encourage convergence to solutions in undesirable optima. A solution, the stochastic NMU (sNMU), is proposed, applying reversible stochasticity that encourages avoidance of such optima whilst converging to the true solution. Empirically, we show that stochasticity provides improved robustness, with the potential to improve the learned representations of upstream networks for numerical and image tasks.
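A minimal sketch of the idea, assuming the noise is injected multiplicatively on the inputs and divided back out of the output (which is exact once the NMU weights converge to {0, 1}); the shapes and noise range are illustrative rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class StochasticNMU(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Parameter(torch.rand(out_features, in_features))   # ideally converges to {0, 1}

    def forward(self, x):
        w = self.W.clamp(0.0, 1.0)
        if self.training:
            n = torch.empty_like(x).uniform_(1.0, 5.0)                 # multiplicative noise on the inputs
            num = (w * (n * x).unsqueeze(1) + 1.0 - w).prod(dim=-1)    # NMU applied to the noised inputs
            den = (w * n.unsqueeze(1) + 1.0 - w).prod(dim=-1)          # removes the noise exactly when w is 0/1
            return num / den
        return (w * x.unsqueeze(1) + 1.0 - w).prod(dim=-1)             # plain NMU at inference

module = StochasticNMU(2, 1)
module.eval()
print(module(torch.tensor([[2.0, 3.0]])))   # approaches 6.0 once both weights reach 1
```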
Of the four fundamental arithmetic operations (+, -, $\times$, $\div$), division is considered the most difficult for both humans and computers. In this paper, we show that robustly learning division in a systematic manner remains a challenge even at the simplest level of dividing two numbers. We propose two novel approaches for division which we call the Neural Reciprocal Unit (NRU) and the Neural Multiplicative Reciprocal Unit (NMRU), and present improvements for an existing division module, the Real Neural Power Unit (Real NPU). In total we measure robustness over 475 different training sets for setups with and without input redundancy. We discover robustness is greatly affected by the input sign for the Real NPU and NRU, input magnitude for the NMRU and input distribution for every module. Despite this issue, we show that the modules can learn as part of larger end-to-end networks.
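For intuition, the sketch below shows the power-based form that such division modules build on, where dividing two inputs corresponds to learning exponents of +1 and -1; the sign handling and gating used in the Real NPU are deliberately omitted, so this is an illustration rather than the proposed modules.

```python
import torch
import torch.nn as nn

class PowerUnit(nn.Module):
    def __init__(self, in_features, out_features, eps=1e-7):
        super().__init__()
        self.W = nn.Parameter(torch.zeros(out_features, in_features))
        self.eps = eps

    def forward(self, x):
        # exp(W log|x|) == prod_i |x_i| ** w_i  (stable for positive inputs)
        return torch.exp(torch.log(x.abs() + self.eps) @ self.W.t())

unit = PowerUnit(2, 1)
with torch.no_grad():
    unit.W.copy_(torch.tensor([[1.0, -1.0]]))   # the ideal solution for a / b
print(unit(torch.tensor([[6.0, 2.0]])))          # ~3.0
```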
We consider the problem of composing images by combining an arbitrary foreground object with some background. To achieve this we use a factorized latent space, and introduce a model called the “Background and Foreground VAE” (BFVAE) that can combine arbitrary foregrounds and backgrounds from an image dataset to generate unseen images. To enhance the quality of the generated images, we also propose a VAE-GAN mixed model called “Latent Space Renderer-GAN” (LSR-GAN), which substantially reduces the blurriness of BFVAE images.
Neural Arithmetic Logic Modules have become a growing area of interest, though they remain a niche field. These modules are neural networks which aim to achieve systematic generalisation in learning arithmetic and/or logic operations such as {+,−,×,÷,≤,AND} while also being interpretable. This paper is the first to discuss the current state of progress of this field, explaining key works, starting with the Neural Arithmetic Logic Unit (NALU). Focusing on the shortcomings of the NALU, we provide an in-depth analysis to reason about the design choices of recent modules. A cross-comparison between modules is made on experiment setups and findings, where we highlight inconsistencies in a fundamental experiment that prevent direct comparison across papers. To alleviate the existing inconsistencies, we create a benchmark which compares all existing arithmetic NALMs. We finish by providing a novel discussion of existing applications for the NALU and research directions requiring further exploration.
Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to low latency and better privacy. However, efficient deployment on these platforms is challenging due to the intensive computation and memory access. We propose a holistic system design for DNN performance and energy optimisation, combining the trade-off opportunities in both algorithms and hardware. The system can be viewed as three abstract layers: the device layer contains heterogeneous computing resources; the application layer has multiple concurrent workloads; and the runtime resource management layer monitors the dynamically changing algorithms' performance targets as well as hardware resources and constraints, and tries to meet them by tuning the algorithm and hardware at the same time. Moreover, we illustrate the runtime approach through a dynamic version of the 'once-for-all network' (namely Dynamic-OFA), which can scale the ConvNet architecture to fit heterogeneous computing resources efficiently and has good generalisation across different model architectures such as Transformers. Compared to the state-of-the-art Dynamic DNNs, our experimental results using ImageNet on a Jetson Xavier NX show that the Dynamic-OFA is up to 3.5x (CPU) and 2.4x (GPU) faster for similar ImageNet Top-1 accuracy, or 3.8% (CPU) and 5.1% (GPU) higher in accuracy at similar latency. Furthermore, compared with Linux governors (e.g. performance, schedutil), our runtime approach reduces energy consumption by 16.5% at similar latency.
Active debris removal missions pose demanding guidance, navigation and control requirements. We present a novel approach which applies deep learning technologies to the problem of attitude determination of an uncooperative debris satellite of a priori unknown geometry. A siamese convolutional neural network is developed which detects and tracks inherently useful landmarks from sensor data, after training upon synthetic datasets of visual, LiDAR or RGB-D data. The method is capable of real-time performance while improving upon conventional computer vision-based approaches, and generalises well to previously unseen object geometries, making this approach a feasible solution for safely performing guidance and navigation in active debris removal, satellite servicing and other close-proximity operations. The performance of the algorithm, its sensitivity to model parameters and its robustness to illumination and shadowing conditions are analysed via numerical simulation.
Deep Learning on microcontroller (MCU) based IoT devices is extremely challenging due to memory constraints. Prior approaches focus on using internal memory or external memories exclusively, which limits either accuracy or latency. We find that a hybrid method using internal and external MCU memories outperforms both approaches in accuracy and latency. We develop TinyOps, an inference engine which accelerates the inference latency of models in slow external memory, using a partitioning and overlaying scheme via the available Direct Memory Access (DMA) peripheral to combine the advantages of external memory (size) and internal memory (speed). Experimental results show that architectures deployed with TinyOps significantly outperform models designed for internal memory, with up to 6% higher accuracy and, importantly, 1.3-2.2x faster inference latency, setting the state-of-the-art in TinyML ImageNet classification. Our work shows that the TinyOps space is more efficient than the internal or external memory design spaces and should be explored further for TinyML applications.
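The sketch below illustrates the partitioning idea only, with a hypothetical greedy policy and made-up operator sizes; it is not the TinyOps implementation.

```python
# Keep as many operators as possible in fast internal memory; the rest are
# streamed (overlaid) from external memory via DMA at inference time.
def partition_operators(op_sizes_kb, internal_budget_kb):
    """Return (internal, external) operator lists under an internal-memory budget."""
    internal, external, used = [], [], 0
    # Prefer placing small operators internally; large ones are overlaid via DMA.
    for name, size in sorted(op_sizes_kb.items(), key=lambda kv: kv[1]):
        if used + size <= internal_budget_kb:
            internal.append(name)
            used += size
        else:
            external.append(name)
    return internal, external

ops = {"conv1": 40, "conv2": 120, "conv3": 300, "fc": 80}   # hypothetical sizes in KB
print(partition_operators(ops, internal_budget_kb=200))
```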
This paper explores a new paradigm for decomposing an image by seeking a compressed representation of the image through an information bottleneck. The compression is achieved iteratively by refining the reconstruction by adding patches that reduce the residual error. This is achieved by a network that is given the current residual errors and proposes bounding boxes that are down-sampled and passed to a variational auto-encoder (VAE). This acts as the bottleneck. The latent code is decoded by the VAE decoder and up-sampled to correct the reconstruction within the bounding box. The objective is to minimise the size of the latent codes of the VAE and the length of code needed to transmit the residual error. The iterations end when the size of the latent code exceeds the reduction in transmitting the residual error. We show that a very simple implementation is capable of finding meaningful bounding boxes and using those bounding boxes for downstream applications. We compare our model with other unsupervised object discovery models.
Successive video frames often have near-identical appearance, so convolutional neural networks (CNNs) extract similar features from them. Conventional CNNs for video recognition nevertheless process each frame independently with a fixed computational effort, resulting in numerous redundant computations and an inefficient use of limited energy resources, particularly for edge computing applications. To alleviate the high energy requirements associated with video frame processing, this paper presents similarity-aware CNNs that recognise similar feature pixels across frames and avoid computations on them. First, with a loss of less than 1% in recognition accuracy, a proposed similarity-aware quantization technique increases the average number of unchanged feature pixels across frame pairs by up to 85%. Then, a proposed similarity-aware dataflow improves energy consumption by minimising redundant computations and memory accesses across frame pairs. According to simulation experiments, the proposed dataflow decreases the energy consumed by video frame processing by up to 30%.
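A toy sketch of the underlying principle: because convolution is linear, a frame's features can be updated from the previous frame's features using only the pixels that changed, skipping computation where the change is below a threshold. The threshold and shapes here are illustrative assumptions, not the paper's dataflow.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, 3, padding=1, bias=False)   # bias=False keeps conv(a)+conv(b) == conv(a+b)
frame1 = torch.rand(1, 3, 64, 64)
frame2 = frame1.clone()
frame2[..., 20:30, 20:30] += 0.5                     # only a small region changes

feat1 = conv(frame1)
delta = frame2 - frame1
mask = (delta.abs().amax(dim=1, keepdim=True) > 0.05).float()   # "changed pixel" map
sparse_delta = delta * mask                          # computation would be skipped where mask == 0
feat2 = feat1 + conv(sparse_delta)                   # incremental update instead of a full pass

print(torch.allclose(feat2, conv(frame2), atol=1e-5))  # True when all changed pixels are kept
```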
Perceptions is a study of how a machine perceives a photograph at different layers within its neural network. We generate sets of pen strokes which are drawn by a robot using pen and ink on Bristol board. The illustrations are produced by maximising the similarity between the machine's internal perception of the illustration and chosen target photographs. The study focusses on the difference between different inductive biases (shape versus texture) in the training of the neural network, as well as how the machine's perception changes as a function of depth within its network. The photos chosen are from travels to far away cities, taken before the COVID-19 pandemic.
We present an investigation into how representational losses can affect the drawings produced by artificial agents playing a communication game. Building upon recent advances, we show that a combination of powerful pretrained encoder networks, with appropriate inductive biases, can lead to agents that draw recognisable sketches, whilst still communicating well. Further, we start to develop an approach to help automatically analyse the semantic content being conveyed by a sketch and demonstrate that current approaches to inducing perceptual biases lead to a notion of objectness being a key feature despite the agent training being self-supervised.
This data is associated with the "Similarity-aware CNN for Efficient Video Recognition at the Edge" article published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
Evidence that visual communication preceded written language and provided a basis for it goes back to prehistory, in forms such as cave and rock paintings depicting traces of our distant ancestors. Emergent communication research has sought to explore how agents can learn to communicate in order to collaboratively solve tasks. Existing research has focused on language, with a learned communication channel transmitting sequences of discrete tokens between the agents. In this work, we explore a visual communication channel between agents that are allowed to draw with simple strokes. Our agents are parameterised by deep neural networks, and the drawing procedure is differentiable, allowing for end-to-end training. In the framework of a referential communication game, we demonstrate that agents can not only successfully learn to communicate by drawing, but with appropriate inductive biases, can do so in a fashion that humans can interpret. We hope to encourage future research to consider visual communication as a more flexible and directly interpretable alternative for training collaborative agents.
Deep convolutional neural networks (CNNs) are computationally and memory intensive. In CNNs, intensive multiplication has resource implications that can make effective deployment of inference on resource-constrained edge devices challenging. This paper proposes GhostShiftAddNet, where the motivation is to implement a hardware-efficient deep network: a multiplication-free CNN with fewer redundant features. We introduce a new bottleneck block, GhostSA, that converts all multiplications in the block to cheap operations. The bottleneck uses an appropriate number of bit-shift filters to process intrinsic feature maps, then applies a series of transformations consisting of bit-shifts and addition operations to generate more feature maps that fully learn the information underlying the intrinsic features. We schedule the number of bit-shift and addition operations for different hardware platforms. We conduct extensive experiments and ablation studies with desktop and embedded (Jetson Nano) devices for implementation and measurements. We demonstrate that the proposed GhostSA block can replace the bottleneck blocks in the backbone of state-of-the-art network architectures and gives improved performance on image classification benchmarks. Further, our GhostShiftAddNet achieves higher classification accuracy using fewer FLOPs and parameters (reduced by up to 3x) than GhostNet. Compared to GhostNet, inference latency on the Jetson Nano is improved by about 1.3x and 2x on GPU and CPU respectively.
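As a small illustration of the multiplication-free principle (not the GhostSA block itself), the sketch below rounds weights to signed powers of two so that each multiply can be realised as a bit-shift plus additions; the exponent range is an arbitrary choice.

```python
import torch

def to_power_of_two(w, min_exp=-8, max_exp=0):
    """Quantise weights to sign * 2^k so that w * x can be implemented as a shift of x."""
    sign = torch.sign(w)
    exp = torch.clamp(torch.round(torch.log2(w.abs().clamp_min(2.0 ** min_exp))), min_exp, max_exp)
    return sign * (2.0 ** exp)

w = torch.tensor([0.3, -0.07, 0.5, 0.9])
print(to_power_of_two(w))   # tensor([ 0.2500, -0.0625,  0.5000,  1.0000])
```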
This dataset supports the publication 'GhostShiftAddNet: More Features from Energy-Efficient Operations' in 'British Machine Vision Conference 2021'.
To achieve systematic generalisation, it first makes sense to master simple tasks such as arithmetic. Of the four fundamental arithmetic operations (+,-,$\times$,$\div$), division is considered the most difficult for both humans and computers. In this paper we show that robustly learning division in a systematic manner remains a challenge even at the simplest level of dividing two numbers. We propose two novel approaches for division which we call the Neural Reciprocal Unit (NRU) and the Neural Multiplicative Reciprocal Unit (NMRU), and present improvements for an existing division module, the Real Neural Power Unit (Real NPU). Experiments in learning division with input redundancy on 225 different training sets find that our proposed modifications to the Real NPU obtain an average success rate of 85.3$\%$, improving over the original by 15.1$\%$. In light of this, our NMRU approach can further improve the success rate to 91.6$\%$.
The Transformer architecture is widely used for machine translation tasks. However, its resource-intensive nature makes it challenging to implement on constrained embedded devices, particularly where the available hardware resources can vary at run-time. We propose a dynamic machine translation model that scales the Transformer architecture based on the resources available at any particular time. The proposed approach, 'Dynamic-HAT', uses a HAT SuperTransformer as the backbone to search for SubTransformers with different accuracy-latency trade-offs at design time. The optimal SubTransformers are sampled from the SuperTransformer at run-time, depending on the latency constraints. The Dynamic-HAT is tested on the Jetson Nano and uses inherited SubTransformers sampled directly from the SuperTransformer with a switching time of <1s. Using inherited SubTransformers results in a BLEU score loss of <1.5% because the SubTransformer configuration is not retrained from scratch after sampling. However, to recover this loss in performance, the dimensions of the design space can be reduced to tailor it to a family of target hardware. The new reduced design space results in a BLEU score increase of approximately 1% for sub-optimal models from the original design space, with a wide range of performance scaling between 0.356s and 1.526s for the GPU and 2.9s and 7.31s for the CPU.
Active debris removal missions pose demanding guidance, navigation and control requirements. We propose that novel machine learning techniques can help to meet several of the outstanding requirements. Building upon previous work which adopts machine learning technologies for tracking the rotational state of an unknown and uncooperative debris satellite, we improve the approach by further applying machine learning to make use of past measurements. The attitude of the debris target is reconstructed, thereby enabling different debris removal methods. The construction of a simulation framework for generating accurate labelled image data is presented, with the aim of facilitating further research in this area. Finally, we show that a neural network can also learn to track satellites and identify suitable locations for contact-based removal methods, without a-priori knowledge of the object's geometry.
This dataset supports the publication: 'Dynamic Transformer for Efficient Machine Translation on Embedded Devices' in '3rd ACM/IEEE Workshop on Machine Learning for CAD (MLCAD'21)'.
Mobile and embedded platforms are increasingly required to efficiently execute computationally demanding DNNs across heterogeneous processing elements. At runtime, the available hardware resources to DNNs can vary considerably due to other concurrently running applications. The performance requirements of the applications could also change under different scenarios. To achieve the desired performance, dynamic DNNs have been proposed in which the number of channels/layers can be scaled in real time to meet different requirements under varying resource constraints. However, the training process of such dynamic DNNs can be costly, since platform-aware models of different deployment scenarios must be retrained to become dynamic. This paper proposes Dynamic-OFA, a novel dynamic DNN approach for state-of-the-art platform-aware NAS models (i.e. Once-for-all network (OFA)). Dynamic-OFA pre-samples a family of sub-networks from a static OFA backbone model, and contains a runtime manager to choose different sub-networks under different runtime environments. As such, Dynamic-OFA does not need the traditional dynamic DNN training pipeline. Compared to the state-of-the-art, our experimental results using ImageNet on a Jetson Xavier NX show that the approach is up to 3.5x (CPU), 2.4x (GPU) faster for similar Top-1 accuracy, or 3.8% (CPU), 5.1% (GPU) higher accuracy at similar latency.
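The runtime-manager idea can be sketched as follows, assuming each pre-sampled sub-network has been profiled offline; the profiled numbers below are invented for illustration.

```python
def choose_subnetwork(profiles, latency_budget_ms):
    """Pick the most accurate pre-sampled sub-network that meets the latency budget."""
    feasible = [p for p in profiles if p["latency_ms"] <= latency_budget_ms]
    if not feasible:                        # nothing fits: fall back to the fastest option
        return min(profiles, key=lambda p: p["latency_ms"])
    return max(feasible, key=lambda p: p["top1"])

subnets = [
    {"name": "small",  "latency_ms": 12, "top1": 74.1},
    {"name": "medium", "latency_ms": 21, "top1": 77.3},
    {"name": "large",  "latency_ms": 38, "top1": 79.8},
]
print(choose_subnetwork(subnets, latency_budget_ms=25)["name"])   # "medium"
```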
DNN inference is increasingly being executed locally on embedded platforms, due to the clear advantages in latency, privacy and connectivity. Modern SoCs typically execute a combination of different and dynamic workloads concurrently, making it challenging to consistently meet latency/energy budgets because the local computing resources available to the DNN vary considerably. In this poster, we show how resource management can be applied to optimise the performance of DNN workloads by monitoring and tuning both software and hardware constantly at runtime. This work shows how dynamic DNNs trade off accuracy with latency/energy/power on a heterogeneous embedded CPU-GPU platform.
Active debris removal missions pose demanding guidance, navigation and control requirements. We present a novel approach which adopts deep learning technologies to the problem of attitude determination of an uncooperative debris satellite of a-priori unknown geometry. A siamese convolutional neural network is developed, which detects and tracks inherently useful landmarks from sensor data, after training upon synthetic datasets of visual, LiDAR or RGB-D data. The method is capable of real-time performance while significantly improving upon conventional computer vision-based approaches, and generalises well to previously unseen object geometries, enabling this approach to be a feasible solution for guidance in active debris removal missions. The performance of the algorithm and its sensitivity to model parameters are analysed via numerical simulation.
This dataset supports the publication: 'Dynamic-OFA: Runtime DNN Architecture Switching for Performance Scaling on Heterogeneous Embedded Platforms' in 'Efficient Deep Learning for Computer Vision Workshop at CVPR Conference 2021'.
Neural Arithmetic Logic Modules have become a growing area of interest, though they remain a niche field. These units are small neural networks which aim to achieve systematic generalisation in learning arithmetic operations such as {+, -, ×, ÷} while also being interpretable in their weights. This paper is the first to discuss the current state of progress of this field, explaining key works, starting with the Neural Arithmetic Logic Unit (NALU). Focusing on the shortcomings of the NALU, we provide an in-depth analysis to reason about the design choices of recent units. A cross-comparison between units is made on experiment setups and findings, where we highlight inconsistencies in a fundamental experiment that prevent direct comparison across papers. We finish by providing a novel discussion of existing applications for the NALU and research directions requiring further exploration.
Disentanglement has seen much work recently for its interpretable properties and the ease with which it can be induced in the latent representations of variational auto-encoders. As a concept, disentanglement has proven hard to precisely define, with many interpretations leading to different metrics which do not necessarily agree. Higgins et al. [2018] offer a precise definition of a linear disentangled representation which is grounded in the symmetries of the data. In this work we focus on cyclic symmetry structure. We examine how VAE posterior distributions are affected by different observations of the same problem and find that cyclic structure is encouraged even when it is not explicitly observed. We then find that better prior distributions, found via normalising flows, result in faster convergence and lower encoding costs than the standard Gaussian. We also find that linear representations can be distinguished from standard ones solely through disentanglement metric scores, possibly due to their highly structured posteriors. Finally, we find preliminary evidence that linear disentangled representations offer better data efficiency than standard disentangled representations.
Primate visual systems are well known to exhibit varying degrees of bottlenecks in the early visual pathway. Recent works have shown that the presence of a bottleneck between 'retinal' and 'ventral' parts of artificial models of visual systems, simulating the optic nerve, can cause the emergence of cellular properties that have been observed in primates: namely centre-surround organisation and opponency. To date, however, state-of-the-art convolutional network architectures for classification problems have not incorporated such an early bottleneck. In this paper, we ask what happens if such a bottleneck is added to a ResNet-50 model trained to classify the ImageNet data set. Our experiments show that some of the emergent properties observed in simpler models still appear in these considerably deeper and more complex models, however, there are some notable differences particularly with regard to spectral opponency. The introduction of the bottleneck is experimentally shown to introduce a small but consistent shape bias into the network. Tight bottlenecks are also shown to have only a very slight effect on the top-1 accuracy of the models when trained and tested on ImageNet.
Manual design of efficient Deep Neural Networks (DNNs) for mobile and edge devices is an involved process which requires expert human knowledge to improve efficiency in different dimensions. In this paper, we present DEff-ARTS, a differentiable efficient architecture search method for automatically deriving CNN architectures for resource constrained devices. We frame the search as a multi-objective optimisation problem where we minimise the classification loss and the computational complexity of performing inference on the target hardware. Our formulation allows for easy trading-off between the sub-objectives depending on user requirements. Experimental results on CIFAR-10 classification showed that our approach achieved a highly competitive test error rate of 3.24% with 30% fewer parameters and multiply and accumulate (MAC) operations compared to Differentiable ARchiTecture Search (DARTS).
Classification problems using deep learning have been shown to have a high-curvature subspace in the loss landscape equal in dimension to the number of classes. Moreover, this subspace corresponds to the subspace spanned by the logit gradients for each class. An obvious strategy to speed up optimisation would be to use Newton's method in the high-curvature subspace and stochastic gradient descent in the co-space. We show that a naive implementation actually slows down convergence and we speculate why this might be.
Human behaviours consist of different types of motion; we show how they can be disambiguated into their components in a richer way than is currently possible. Studies on optical flow have concentrated on motion alone, without the higher-order components: snap, jerk and acceleration. We are the first to show how the acceleration, jerk, snap and their constituent parts can be obtained from image sequences and deployed for analysis, especially of behaviour. We demonstrate the estimation of acceleration in sport, human motion, traffic and in scenes of violent behaviour, demonstrating the wide potential for application of acceleration analysis. Determining higher-order components is suited to the analysis of scenes which contain them: higher-order motion is innate to scenes containing acts of violent behaviour. It is not only for behaviour which contains quickly changing movement: human gait contains acceleration, though existing approaches have yet to consider radial and tangential acceleration, since they concentrate on motion alone. The analysis of synthetic and real-world images illustrates the ability of higher-order motion to discriminate different objects under different motion. The new approaches are then applied to heel strike detection in the analysis of human gait. These results demonstrate that the new approach is ready for developing new applications in behaviour recognition and provides a new basis for future research and applications of higher-order motion analysis.
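A simplified sketch of estimating an acceleration field and its tangential/radial components from consecutive optical-flow fields, using plain finite differences and ignoring the warping along the flow that a full treatment would apply; the toy flow fields are illustrative.

```python
import numpy as np

def acceleration_flow(flow_t0, flow_t1):
    """flow_*: arrays of shape (H, W, 2) holding per-pixel (u, v) motion."""
    accel = flow_t1 - flow_t0                      # finite-difference acceleration estimate
    speed = np.linalg.norm(flow_t1, axis=-1, keepdims=True) + 1e-8
    unit = flow_t1 / speed
    tangential = (accel * unit).sum(-1, keepdims=True) * unit   # component along the motion
    radial = accel - tangential                                 # component normal to the motion
    return accel, tangential, radial

f0 = np.zeros((4, 4, 2)); f1 = np.zeros((4, 4, 2)); f1[..., 0] = 1.0
print(acceleration_flow(f0, f1)[0][0, 0])          # [1., 0.]
```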
Disentangled representation learning has seen a surge in interest over recent times, generally focusing on new models to optimise one of many disparate disentanglement metrics. It was only with Symmetry Based Disentangled Representation Learning that a robust mathematical framework was introduced to define precisely what is meant by a “linear disentangled representation”. This framework determines that such representations would depend on a particular decomposition of the symmetry group acting on the data, showing that actions would manifest through irreducible group representations acting on independent representational subspaces. Caselles-Dupré et al. [2019] subsequently proposed the first model to induce and demonstrate a linear disentangled representation in a VAE model. In this work we empirically show that linear disentangled representations are not present in standard VAE models and that they instead require altering the loss landscape to induce them. We proceed to show that such representations are a desirable property with regard to classical disentanglement metrics. Finally we propose a method to induce irreducible representations which forgoes the need for labelled action sequences, as was required by prior work. We explore a number of properties of this method, including the ability to learn from action sequences without knowledge of intermediate states and robustness under visual noise. We also demonstrate that it can successfully learn 4 different symmetries directly from pixels.
Recent work suggests that changing Convolutional Neural Network (CNN) architecture by introducing a bottleneck in the second layer can yield changes in learned function. To understand this relationship fully requires a way of quantitatively comparing trained networks. The fields of electrophysiology and psychophysics have developed a wealth of methods for characterising visual systems which permit such comparisons. Inspired by these methods, we propose an approach to obtaining spatial and colour tuning curves for convolutional neurons, which can be used to classify cells in terms of their spatial and colour opponency. We perform these classifications for a range of CNNs with different depths and bottleneck widths. Our key finding is that networks with a bottleneck show a strong functional organisation: almost all cells in the bottleneck layer become both spatially and colour opponent, while cells in the layer following the bottleneck become non-opponent. The colour tuning data can further be used to form a rich understanding of how colour is encoded by a network. As a concrete demonstration, we show that shallower networks without a bottleneck learn a complex non-linear colour system, whereas deeper networks with tight bottlenecks learn a simple channel-opponent code in the bottleneck layer. We further develop a method of obtaining a hue sensitivity curve for a trained CNN which enables high-level insights that complement the low-level findings from the colour tuning data. We go on to train a series of networks under different conditions to ascertain the robustness of the discussed results. Ultimately, our methods and findings coalesce with prior art, strengthening our ability to interpret trained CNNs and furthering our understanding of the connection between architecture and learned representation. Trained models and code for all experiments are available at https://github.com/ecs-vlc/opponency.
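As a toy illustration of how a spatial tuning curve can be obtained (with an untrained stand-in network and arbitrary grating parameters, not the protocol of the paper): present sinusoidal gratings at a range of spatial frequencies and record a single unit's response.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 8, 7, padding=3), nn.ReLU())   # stand-in for a trained model

def grating(freq, size=64):
    xs = torch.linspace(0, 2 * torch.pi * freq, size)
    img = torch.sin(xs).repeat(size, 1)                   # vertical grating varying along x
    return img.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)  # (1, 3, H, W)

tuning_curve = []
with torch.no_grad():
    for freq in [1, 2, 4, 8, 16]:                         # cycles per image
        response = net(grating(freq))[0, 0, 32, 32]       # unit 0 at the image centre
        tuning_curve.append(response.item())
print(tuning_curve)
```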
We investigate the problem of generating natural language summaries from knowledge base triples. Our approach is based on a pointer-generator network which, in addition to generating regular words from a fixed target vocabulary, is able to verbalise triples in several ways. We undertake an automatic and a human evaluation on single and open-domain summary generation tasks. Both show that our approach significantly outperforms other data-driven baselines.
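For readers unfamiliar with pointer-generator decoding, the sketch below shows the standard mixture computed at each decoding step, where a learned gate interpolates between generating from the vocabulary and copying (here, verbalising) source tokens via attention; the sizes and random inputs are toy stand-ins for the real model.

```python
import torch

vocab_size, src_len = 6, 4
p_vocab = torch.softmax(torch.randn(vocab_size), dim=0)     # generation distribution
attention = torch.softmax(torch.randn(src_len), dim=0)      # attention over source tokens
src_token_ids = torch.tensor([2, 5, 5, 1])                  # vocabulary ids of the source tokens
p_gen = torch.sigmoid(torch.randn(()))                      # generate-vs-copy gate

p_final = p_gen * p_vocab
p_final = p_final.scatter_add(0, src_token_ids, (1 - p_gen) * attention)  # add copy probability mass
print(p_final.sum())   # ~1.0
```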
We investigate the problem of generating natural language summaries from knowledge base triples. Our approach is based on a pointer-generator network, which, in addition to generating regular words from a fixed target vocabulary, is able to verbalise triples in several ways. We undertake an automatic and a human evaluation on single and open-domain summary generation tasks. Both show that our approach significantly outperforms other data-driven baselines.
Traditional set prediction models can struggle with simple datasets due to an issue we call the responsibility problem. We introduce a pooling method for sets of feature vectors based on sorting features across elements of the set. This can be used to construct a permutation-equivariant auto-encoder that avoids this responsibility problem. On a toy dataset of polygons and a set version of MNIST, we show that such an auto-encoder produces considerably better reconstructions and representations. Replacing the pooling function in existing set encoders with FSPool improves accuracy and convergence speed on a variety of datasets.
Traditional set prediction models can struggle with simple datasets due to an issue we call the responsibility problem. We introduce a pooling method for sets of feature vectors based on sorting features across elements of the set to learn better set representations. This can be used to construct a permutation-equivariant auto-encoder, which avoids the responsibility problem. On a toy dataset of polygons and a set version of MNIST, we show that such an auto-encoder produces considerably better reconstructions. Used in set classification, FSPool significantly improves accuracy and convergence speed on the set versions of MNIST and CLEVR.
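A minimal sketch of featurewise sort pooling, simplified here to a fixed set size (the original uses a continuous piecewise-linear weighting to handle variable-sized sets): each feature dimension is sorted across the elements of the set and then reduced with learned weights, giving a permutation-invariant representation.

```python
import torch
import torch.nn as nn

class SimpleFSPool(nn.Module):
    def __init__(self, n_features, set_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_features, set_size))

    def forward(self, x):                                   # x: (batch, set_size, n_features)
        sorted_x, _ = x.sort(dim=1, descending=True)        # sort each feature over the set
        return (sorted_x.transpose(1, 2) * self.weight).sum(dim=-1)   # (batch, n_features)

x = torch.randn(2, 5, 8)
pool = SimpleFSPool(8, 5)
print(torch.allclose(pool(x), pool(x[:, torch.randperm(5)])))  # True: permutation-invariant
```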
We study the problem of predicting a set from a feature vector with a deep neural network. Existing approaches ignore the set structure of the problem and suffer from discontinuity issues as a result. We propose a general model for predicting sets that properly respects the structure of sets and avoids this problem. With a single feature vector as input, we show that our model is able to auto-encode point sets, predict bounding boxes of the set of objects in an image, and predict the attributes of these objects in an image.
Colour vision has long fascinated scientists, who have sought to understand both the physiology of the mechanics of colour vision and the psychophysics of colour perception. We consider representations of colour in anatomically constrained convolutional deep neural networks. Following ideas from neuroscience, we classify cells in early layers into groups relating to their spectral and spatial functionality. We show the emergence of single and double opponent cells in our networks and characterise how the distribution of these cells changes under the constraint of a retinal bottleneck. Our experiments not only open up a new understanding of how deep networks process spatial and colour information, but also provide new tools to help understand the black box of deep learning.
Spatial Transformer Networks (STNs) have the potential to dramatically improve performance of convolutional neural networks in a range of tasks. By ‘focusing’ on the salient parts of the input using a differentiable affine transform, a network augmented with an STN should have increased performance, efficiency and interpretability. However, in practice, STNs rarely exhibit these desiderata, instead converging to a seemingly meaningless transformation of the input. We demonstrate and characterise this localisation problem as deriving from the spatial invariance of feature detection layers acting on extracted glimpses. Drawing on the neuroanatomy of the human eye we then motivate a solution: foveated convolutions. These parallel convolutions with a range of strides and dilations introduce specific translational variance into the model. In so doing, the foveated convolution presents an inductive bias, encouraging the subject of interest to be centred in the output of the attention mechanism, giving significantly improved performance.
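A minimal sketch of the foveated-convolution idea, here using parallel dilated convolutions only (the full construction also varies strides); channel counts and dilations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FoveatedConv(nn.Module):
    def __init__(self, in_ch=3, out_ch=8, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in dilations
        )

    def forward(self, glimpse):
        # Each branch sees the same glimpse with a progressively larger receptive field.
        return torch.cat([b(glimpse) for b in self.branches], dim=1)

print(FoveatedConv()(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 24, 32, 32])
```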
Current approaches for predicting sets from feature vectors ignore the unordered nature of sets and suffer from discontinuity issues as a result. We propose a general model for predicting sets that properly respects the structure of sets and avoids this problem. With a single feature vector as input, we show that our model is able to auto-encode point sets, predict the set of bounding boxes of objects in an image, and predict the set of attributes of these objects.
Representations of sets are challenging to learn because operations on sets should be permutation-invariant. To this end, we propose a Permutation-Optimisation module that learns how to permute a set end-to-end. The permuted set can be further processed to learn a permutation-invariant representation of that set, avoiding a bottleneck in traditional set models. We demonstrate our model's ability to learn permutations and set representations with either explicit or implicit supervision on four datasets, on which we achieve state-of-the-art results: number sorting, image mosaics, classification from image mosaics, and visual question answering.
National Mapping Agencies (NMAs) are tasked with providing highly accurate geospatial data for a range of customers. This challenge has traditionally been met by combining remote sensing data gathering, field work and manual interpretation and processing of the data. This is a significant logistical undertaking which requires novel approaches to improve potential feature extraction from the available data. Using research undertaken at Great Britain’s NMA, Ordnance Survey (OS), as an example, this paper provides an overview of recent advances in the use of artificial intelligence (AI) to assist in improving feature classification from remotely sensed aerial imagery, describing research that applies high-level neural network architectures to image classification using convolutional neural network learning.
Land cover (LC) and land use (LU) have commonly been classified separately from remotely sensed imagery, without considering the intrinsically hierarchical and nested relationships between them. In this paper, for the first time, a highly novel Joint Deep Learning framework is proposed and demonstrated for LC and LU classification. The proposed Joint Deep Learning (JDL) model incorporates a multilayer perceptron (MLP) and convolutional neural network (CNN), and is implemented via a Markov process involving iterative updating. In the JDL, LU classification conducted by the CNN is made conditional upon the LC probabilities predicted by the MLP. In turn, those LU probabilities together with the original imagery are re-used as inputs to the MLP to strengthen the spatial and spectral feature representations. This process of updating the MLP and CNN forms a joint distribution, where both LC and LU are classified simultaneously through iteration. The proposed JDL method provides a general framework within which the pixel-based MLP and the patch-based CNN provide mutually complementary information to each other, such that both are refined in the classification process through iteration. Given the well-known complexities associated with the classification of very fine spatial resolution (VFSR) imagery, the effectiveness of the proposed JDL was tested on aerial photography of two large urban and suburban areas in Great Britain (Southampton and Manchester). The JDL consistently demonstrated greatly increased accuracies with increasing iteration, not only for the LU classification, but for both the LC and LU classifications, achieving by far the greatest accuracies for each at around 10 iterations. The average overall classification accuracies were 90.18% for LC and 87.92% for LU for the two study sites, far higher than the initial accuracies and consistently outperforming benchmark comparators (three each for LC and LU classification). This research, thus, represents the first attempt to unify the remote sensing classification of LC (state; what is there?) and LU (function; what is going on there?), where previously each had been considered separately only. It, thus, has the potential to transform the way that LC and LU classification is undertaken in future. Moreover, it paves the way to address effectively the complex tasks of classifying LC and LU from VFSR remotely sensed imagery via joint reinforcement, and in an automatic manner.
Alignments between natural language and Knowledge Base (KB) triples are an essential prerequisite for training machine learning approaches employed in a variety of Natural Language Processing problems. These include Relation Extraction, KB Population, Question Answering and Natural Language Generation from KB triples. Available datasets that provide those alignments are plagued by significant shortcomings – they are of limited size, they exhibit a restricted predicate coverage, and/or they are of unreported quality. To alleviate these shortcomings, we present T-REx, a dataset of large scale alignments between Wikipedia abstracts and Wikidata triples. T-REx consists of 11 million triples aligned with 3.09 million Wikipedia abstracts (6.2 million sentences). T-REx is two orders of magnitude larger than the largest available alignments dataset and covers 2.5 times more predicates. Additionally, we stress the quality of this language resource thanks to an extensive crowdsourcing evaluation. T-REx is publicly available at: https://w3id.org/t-rex.
In this paper, we propose a novel approach for efficient training of deep neural networks in a bottom-up fashion using a layered structure. Our algorithm, which we refer to as Deep Cascade Learning, is motivated by the Cascade Correlation approach of Fahlman, who introduced it in the context of perceptrons. We demonstrate our algorithm on networks of convolutional layers, though its applicability is more general. Such training of deep networks in a cascade directly circumvents the well-known vanishing gradient problem by ensuring that the output is always adjacent to the layer being trained. We present empirical evaluations comparing our deep cascade training with standard end-to-end training using backpropagation for two convolutional neural network architectures on benchmark image classification tasks (CIFAR-10 and CIFAR-100). We then investigate the features learned by the approach and find that better, domain-specific, representations are learned in early layers when compared to what is learned in end-to-end training. This is partially attributable to the vanishing gradient problem, which inhibits early-layer filters from changing significantly from their initial settings. While both networks perform similarly overall, in cascade training recognition accuracy increases progressively with each added layer, with discriminative features learnt at every stage of the network, whereas in end-to-end training no such systematic feature representation was observed. We also show that such cascade training has significant computational and memory advantages over end-to-end training, and can be used as a pre-training algorithm to obtain better performance.
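A schematic sketch of cascade training under illustrative choices of model sizes, optimiser and epoch budget (not the paper's exact setup): layers are trained one at a time from the bottom up, each with its own temporary output head, so the layer being trained is always adjacent to the loss.

```python
import torch
import torch.nn as nn

def train_head(trunk, head, loader, epochs=1):
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)   # only the new layer + its head are updated
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = trunk(x)                          # already-trained layers are kept fixed
            loss = loss_fn(head(feats), y)
            opt.zero_grad(); loss.backward(); opt.step()

# Toy data and a three-layer cascade.
loader = [(torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))) for _ in range(4)]
trunk, in_ch = nn.Identity(), 3
for out_ch in (16, 32, 64):
    new_layer = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())
    head = nn.Sequential(new_layer, nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(out_ch, 10))
    train_head(trunk, head, loader)
    trunk = nn.Sequential(trunk, new_layer)               # freeze and extend the trunk
    in_ch = out_ch
```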
Most people need textual or visual interfaces in order to make sense of Semantic Web data. In this paper, we investigate the problem of generating natural language summaries for Semantic Web data using neural networks. Our end-to-end trainable architecture encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on the encoded vector. We explore a set of different approaches that enable our models to verbalise entities from the input set of triples in the generated text. Our systems are trained and evaluated on two corpora of loosely aligned Wikipedia snippets with triples from DBpedia and Wikidata, with promising results.
Urban land use information is essential for a variety of urban-related applications such as urban planning and regional administration. The extraction of urban land use from very fine spatial resolution (VFSR) remotely sensed imagery has, therefore, drawn much attention in the remote sensing community. Nevertheless, classifying urban land use from VFSR images remains a challenging task, due to the extreme difficulties in differentiating complex spatial patterns to derive high-level semantic labels. Deep convolutional neural networks (CNNs) offer great potential to extract high-level spatial features, thanks to their hierarchical nature with multiple levels of abstraction. However, blurred object boundaries and geometric distortion, as well as huge computational redundancy, severely restrict the potential application of CNNs for the classification of urban land use. In this paper, a novel object-based convolutional neural network (OCNN) is proposed for urban land use classification using VFSR images. Rather than pixel-wise convolutional processes, the OCNN relies on segmented objects as its functional units, and CNN networks are used to analyse and label objects so as to partition within-object and between-object variation. Two CNN networks with different model structures and window sizes are developed to predict linearly shaped objects (e.g. Highway, Canal) and general (other non-linearly shaped) objects. Then a rule-based decision fusion is performed to integrate the class-specific classification results. The effectiveness of the proposed OCNN method was tested on aerial photography of two large urban scenes in Southampton and Manchester in Great Britain. The OCNN combined with large and small window sizes achieved excellent classification accuracy and computational efficiency, consistently outperforming its sub-modules, as well as other benchmark comparators, including the pixel-wise CNN, contextual-based MRF and object-based OBIA-SVM methods. The proposed method provides the first object-based CNN framework to effectively and efficiently address the complicated problem of urban land use classification from VFSR images.
In some forms of gait analysis it is important to be able to capture when the heel strikes occur. In addition, in terms of video analysis of gait, it is important to be able to localise the heel where it strikes on the floor. In this paper, a new motion descriptor, acceleration flow, is introduced for detecting heel strikes. The key frame of a heel strike can be determined by the quantity of acceleration flow within the Region of Interest (ROI), and the position of the strike can be found from the centre of rotation caused by radial acceleration. Our approach has been tested on a number of databases which were recorded indoors and outdoors with multiple views and walking directions, to evaluate the detection rate under various environments. Experiments show the ability of our approach for both temporal detection and spatial positioning. The immunity of this new approach to three anticipated types of noise in real CCTV footage is also evaluated in our experiments. Our acceleration flow detector is shown to be less sensitive to Gaussian white noise, whilst being effective on low-resolution images and with incomplete body position information, when compared to other techniques.
Recent advances in computer vision and pattern recognition have demonstrated the superiority of deep neural networks using spatial feature representation, such as convolutional neural networks (CNN), for image classification. However, any classifier, regardless of its model structure (deep or shallow), involves prediction uncertainty when classifying spatially and spectrally complicated very fine spatial resolution (VFSR) imagery. We propose here to characterise the uncertainty distribution of CNN classification and integrate it into a regional decision fusion to increase classification accuracy. Specifically, a variable precision rough set (VPRS) model is proposed to quantify the uncertainty within CNN classifications of VFSR imagery, and partition this uncertainty into positive regions (correct classifications) and non-positive regions (uncertain or incorrect classifications). Those “more correct” areas were trusted by the CNN, whereas the uncertain areas were rectified by a Multi-Layer Perceptron (MLP)-based Markov random field (MLP-MRF) classifier to provide crisp and accurate boundary delineation. The proposed MRF-CNN fusion decision strategy exploited the complementary characteristics of the two classifiers based on VPRS uncertainty description and classification integration. The effectiveness of the MRF-CNN method was tested in both urban and rural areas of southern England as well as Semantic Labelling datasets. The MRF-CNN consistently outperformed the benchmark MLP, SVM, MLP-MRF and CNN and the baseline methods. This research provides a regional decision fusion framework within which to gain the advantages of model-based CNN, while overcoming the problem of losing effective resolution and uncertain prediction at object boundaries, which is especially pertinent for complex VFSR image classification.
Human assessments by either experts or crowdworkers are used extensively for the evaluation of systems employed on a variety of text generative tasks. In this paper, we focus on the human evaluation of textual summaries from knowledge base triple-facts. More specifically, we investigate possible similarities between the evaluation that is performed by experts and crowdworkers. We generate a set of summaries from DBpedia triples using a state-of-the-art neural network architecture. These summaries are evaluated against a set of criteria by both experts and crowdworkers. Our results highlight significant differences between the scores that are provided by the two groups.
While Wikipedia exists in 287 languages, its content is unevenly distributed among them. It is therefore of utmost social and cultural importance to focus efforts on languages whose speakers only have access to limited Wikipedia content. We investigate supporting communities by generating summaries for Wikipedia articles in underserved languages, given structured data as an input.
We focus on an important support for such summaries: ArticlePlaceholders, dynamically generated content pages in underserved Wikipedias. They enable native speakers to access existing information in Wikidata. To extend those ArticlePlaceholders, we provide a system which processes the triples of the KB as they are provided by the ArticlePlaceholder and generates a comprehensible textual summary. This data-driven approach is employed with the goal of understanding how well it matches the communities' needs in two underserved languages on the Web: Arabic, a language with a large community but disproportionate access to knowledge online, and Esperanto, an easily acquired constructed language whose Wikipedia content is maintained by a small but devoted community. With the help of the Arabic and Esperanto Wikipedians, we conduct a study which evaluates not only the quality of the generated text, but also the usefulness of our end-system to any underserved Wikipedia version.
The contextual-based convolutional neural network (CNN) with a deep architecture and the pixel-based multilayer perceptron (MLP) with a shallow structure are well-recognized neural network algorithms, representing the state-of-the-art deep learning method and the classical non-parametric machine learning approach, respectively. The two algorithms, which have very different behaviours, were integrated in a concise and effective way using a rule-based decision fusion approach for the classification of very fine spatial resolution (VFSR) remotely sensed imagery. The decision fusion rules, designed primarily based on the classification confidence of the CNN, reflect the generally complementary patterns of the individual classifiers. In consequence, the proposed ensemble classifier MLP-CNN harvests the complementary results acquired from the CNN, based on deep spatial feature representation, and from the MLP, based on spectral discrimination. Meanwhile, limitations of the CNN arising from the adoption of convolutional filters, such as uncertainty in object boundary partition and loss of useful fine spatial resolution detail, were compensated for. The effectiveness of the ensemble MLP-CNN classifier was tested in both urban and rural areas using aerial photography together with an additional satellite sensor dataset. The MLP-CNN classifier achieved promising performance, consistently outperforming the pixel-based MLP, the spectral and textural-based MLP, and the contextual-based CNN in terms of classification accuracy. This research paves the way for effectively addressing the complicated problem of VFSR image classification.
Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.
The associated repository contains the code and the corpora that were used in order to build a "learnable" system that generates open-domain textual summaries in Arabic and Esperanto given a set of Wikidata triples as input. The two corpora that have been used for the experiments are included in the repository: (i) Wikidata triples aligned with Wikipedia summaries in Arabic and (ii) Wikidata triples aligned with Wikipedia summaries in Esperanto.
The adverse visual conditions of surveillance environments and the need to identify humans at a distance have stimulated research in soft biometric attributes. These attributes can be used to describe a human's physical traits semantically and can be acquired without their cooperation. Soft biometrics can also be employed to retrieve identity from a database using verbal descriptions of suspects. In this paper, we explore unconstrained human face identification with semantic face attributes derived automatically from images. The process uses a deformable face model with keypoint localisation which is aligned with attributes derived from semantic descriptions. Our new framework exploits the semantic feature space to infer face signatures from images and bridges the semantic gap between humans and machines with respect to face attributes. We use an unconstrained dataset, LFW-MS4, consisting of all the subjects from view-1 of the LFW database that have four or more samples. Our new approach demonstrates that retrieval via estimated comparative facial soft biometrics yields a match in the top 10.23% of returned subjects. Furthermore, modelling of face image features in the semantic space can achieve an equal error rate of 12.71%. These results reveal the latent benefits of modelling visual facial features in a semantic space. Moreover, they highlight the potential of using images and verbal descriptions to generate comparative soft biometrics for subject identification and retrieval.
This chapter presents the work of the 12-month project Seals and Their Impressions in the Ancient Near East (SIANE), a collaborative effort of the University of Southampton, Oxford University and the University of Paris (Nanterre). Recognising the need for improved visual documentation of ancient Near Eastern cylinder seals and the potential presented by new technologies, there have been several approaches to 3D-imaging cylinder seals in recent years (e.g. Pitzalis et al. 2008; Reh et al. 2016; Wagensonner forthcoming). SIANE focused on the development of equipment and workflow that can quickly capture the maximum amount of meaningful data from a seal, including 3D data from structured light and an automated production of ‘digital unwrappings’. The project addressed some issues regarding the physical mounting of seals and developed a method of efficient data-capture that allows the imaging of large numbers of cylinder seals for research and presentation purposes. A particular research benefit from 3D image capture of entire seal collections is the potential for exploring computer-aided image recognition, which could contribute to comparative glyptic studies as well as helping to address the question of whether any original seals can be linked to known ancient impressions on tablets or sealings possibly separated across modern collections.
The recent growth in CCTV systems and the challenges of automatically identifying humans under the adverse visual conditions of surveillance have increased the interest in soft biometrics, which are physical attributes that can be used to describe people semantically. Soft biometrics enable human identification based on verbal descriptions, and they can be captured in conditions where it is impossible to acquire traditional biometrics such as iris and fingerprint. The research on facial soft biometrics has tended to focus on identification using categorical attributes, whereas comparative attributes have shown better accuracy. Nevertheless, research in comparative facial soft biometrics has been limited to small constrained databases, while identification in surveillance systems involves unconstrained large databases. In this chapter, we explore human identification through comparative facial soft biometrics in large unconstrained databases using the Labelled Faces in the Wild (LFW) database. We propose a novel set of attributes and investigate their significance. We also analyse the reliability of comparative facial soft biometrics for realistic databases and explore identification and verification using comparative facial soft biometrics. The results of the performance analysis show that, by comparing an unknown subject to a line-up of only ten subjects, a correct match will be found in the top 2.08% of retrieved subjects from a database of 4038 subjects.
While Wikipedia exists in 287 languages, its content is unevenly distributed among them. In this work, we investigate the generation of open-domain Wikipedia summaries in underserved languages using structured data from Wikidata. To this end, we propose a neural network architecture equipped with copy actions that learns to generate single-sentence, comprehensible textual summaries from Wikidata triples. We demonstrate the effectiveness of the proposed approach by evaluating it against a set of baselines on two languages of different natures: Arabic, a morphologically rich language with a larger vocabulary than English, and Esperanto, a constructed language known for its easy acquisition.
Within data science, many problems are solved using machine learning. Recently, with the introduction of deep learning, we see this trend spread out across industries of which archaeological object detection on remote sensor data is a case in point. From the known case studies, we have identified the main issues and developed improvements accordingly.
The main issue with archaeological datasets is that only a limited number of sites are known, which makes networks prone to overfitting. Overfitting happens when a network is trained on too few examples and learns patterns that do not generalize well to new data. To an extent, data augmentation can be used to prevent overfitting; however, the training images would still be highly correlated. Therefore, it is argued that the greatest gains come from limiting the storage of irrelevant features in networks. This can be done by optimising network architectures and, additionally, by using transfer learning, in which pre-trained networks are used to initialise training. Even when pre-trained on datasets without archaeological sites, such networks can still be useful for their low-level features (including lines and edges). A downside of pre-trained networks is that they can only work with data in the same format as that on which they were trained.
Our main contribution is research into including multi-sensor data. We will present approaches to training networks on images with stacked multi-sensor data, applying fusion networks, and generating pre-trained networks for the available data of different sensors.
Previous research in motion analysis of image sequences has generally not considered the basic nature of higher orders of motion such as acceleration. In this work, we disambiguate different types of motion, and in particular focus on acceleration. First, we show acceleration can be computed in a principled manner by extending Horn and Schunck’s algorithm for global optical flow estimation. We then demonstrate an approximation of the acceleration field using an alternative established optical flow technique, since most real motions violate the global smoothness assumption of Horn and Schunck. Furthermore, we decompose acceleration into radial and tangential components for greater depth of understanding of the motion. As a general motion descriptor, we show how acceleration provides the capability for differentiating different types of motion in video sequences.
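As a rough illustration of the kind of computation involved (an assumption-laden sketch, not the authors' implementation), the Python snippet below approximates a dense acceleration field as the temporal difference of two optical-flow fields estimated with OpenCV's Farnebäck method, and then splits it into components along and perpendicular to the local velocity, i.e. tangential and radial acceleration.

    import cv2
    import numpy as np

    def acceleration_field(f0, f1, f2):
        # three consecutive BGR frames -> per-pixel acceleration and its decomposition
        g0, g1, g2 = (cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in (f0, f1, f2))
        flow01 = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flow12 = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        accel = flow12 - flow01                                  # finite-difference acceleration
        speed = np.linalg.norm(flow12, axis=2, keepdims=True)
        vhat = flow12 / (speed + 1e-8)                           # unit velocity direction
        tangential = (accel * vhat).sum(axis=2, keepdims=True) * vhat
        radial = accel - tangential                              # centripetal component
        return accel, tangential, radial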
As a result of the New Forest Knowledge project, many new sites were discovered. This was partly due to the LiDAR survey that was undertaken, which was followed by an intensive manual process to interpret the results. The research presented in this paper looks at methods to automate this process, especially for round barrow detection, using deep learning.
Traditionally, automated methods require manual feature engineering to extract the visual appearance of a site from remote sensing data. Whereas this approach is difficult, expensive and bound to detect a single type of site, recent developments have moved towards automated feature learning, of which deep learning is the most notable. In our approach, we use known site locations together with LiDAR data and aerial images to train Convolutional Neural Networks (CNNs). Such a network is typically constructed of many layers, each representing a different filter (e.g. to detect lines or edges). When the network is trained, each new site location that is fed to it updates the weights of these features to better represent the appearance of sites in the remote sensing data. For this learning process, an accurate dataset with many examples is required, and the New Forest is therefore a very suitable case study, especially thanks to the extensive research of the New Forest Knowledge project.
In this paper, our latest results will be presented together with a future perspective on how we can scale our approach to a country-wide detection method as computing power becomes even more efficient.
The linked repository contains the code along with the required corpora that were used in order to build a system that "learns" how to generate English biographies for Semantic Web triples. Two corpora are included: (i) DBpedia triples aligned with Wikipedia biographies and (ii) Wikidata triples aligned with Wikipedia biographies.
We aim to develop a process by which we can extract generic features from aerial image data that can both be used to infer the presence of objects and characteristics and to discover new ways of representing the landscape. We investigate the fine-tuning of a 50-layer ResNet deep convolutional neural network that was pre-trained with ImageNet data, and extract features at several layers throughout both the pre-trained and the fine-tuned networks. These features were applied to several supervised classification problems, obtaining a significant correlation between classification accuracy and layer number. Visualising the activations of the networks' nodes showed that fine-tuning had not achieved coherent representations at later layers. We conclude that we need to train with considerably more varied data but that, even without fine-tuning, features derived from a deep network can produce better classification results than image data alone.
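A minimal sketch of the feature-extraction step (assuming a PyTorch/torchvision setup, which the abstract does not specify) is shown below: forward hooks on an ImageNet-pretrained ResNet-50 collect pooled activations at several depths, which could then be fed to a conventional classifier.

    import torch
    from torchvision import models

    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
    features = {}

    def capture(name):
        def hook(module, inputs, output):
            # global-average-pool each feature map into a fixed-length vector
            features[name] = output.mean(dim=(2, 3)).detach()
        return hook

    for name in ("layer1", "layer2", "layer3", "layer4"):
        getattr(resnet, name).register_forward_hook(capture(name))

    with torch.no_grad():
        resnet(torch.randn(1, 3, 224, 224))   # stand-in for an aerial image tile
    print({name: tensor.shape for name, tensor in features.items()})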
Traditionally, research initiatives into the automated detection of archaeological objects focussed on feature engineering to detect individual object types. These methods have been criticised for their lack of accuracy, which is mostly caused by their inability to capture the variability within an object type and the objects’ appearance across different land cover types.
Recently, rather than further optimizing features, research has shifted towards feature learning, which offers more flexibility. This shift was triggered by the overwhelming successes of deep learning (shown for, e.g., self-driving cars and medical imagery). A deep convolutional neural network is built up of many layers and learns features from images of known objects which are fed to the network. In the early layers of a network only basic abstractions such as lines and edges are learned, and as the deeper layers are reached the features become more refined and are able to extract the key characteristics of the object type. This process is very similar to how a human learns, although there are some important advantages to the structure of deep networks. For example, they can be designed to incorporate different types of remote sensor data and can hence internally compare this variety of data. In this manner a network will quickly identify obvious false positives and adapt the weights of the layers accordingly. Another important point is that a network can fully appreciate small variations in pixel values without any image enhancements. For LiDAR data this effect can be demonstrated with a network that identifies a slope in the first layers of the network and later on learns that the slope direction and local relief are important features for a specific object type.
The approaches listed above just scratch the surface of the wide range of possible methods for using deep learning in aerial archaeology. In the end, the shift in research is mainly driven by the far-future concept of a national model which automatically retrains on newly acquired remote sensing data to allow for new discoveries that can further improve the networks.
An essential aspect of archaeology is the protection of sites from looters, extensive agriculture and erosion. Under this constant threat of destruction, it is of utmost importance that sites are located so that they can be monitored and protected. This is mostly done on the ground or by using remote sensing data such as aerial images or LiDAR derived elevation models. This task is time consuming and requires highly specialised and experienced people and would thus immensely benefit from automation. Within this novel research, the potential of deep learning for the detection of archaeological sites is being assessed.
Wikidata is a community-driven knowledge graph, strongly linked to Wikipedia. However, the connection between the two projects has been sporadically explored. We investigated the relationship between the two projects in terms of the information they contain by looking at their external references. Our findings show that while only a small number of sources is directly reused across Wikidata and Wikipedia, references often point to the same domain. Furthermore, Wikidata appears to use less Anglo-American-centred sources. These results deserve further in-depth investigation.
Recent expansion in surveillance systems has motivated research into soft biometrics that enable the unconstrained recognition of human faces. Comparative soft biometrics show superior recognition performance to categorical soft biometrics and have been the focus of several studies which have highlighted their ability for recognition and retrieval in constrained and unconstrained environments. These studies, however, only addressed face recognition for retrieval using human-generated attributes, posing a question about the feasibility of automatically generating comparative labels from facial images. In this paper, we propose an approach for the automatic comparative labelling of facial soft biometrics. Furthermore, we investigate unconstrained human face recognition using these comparative soft biometrics in a human-labelled gallery (and vice versa). Using a subset of the LFW dataset, our experiments show the efficacy of the automatic generation of comparative facial labels, highlighting the potential extensibility of the approach to other face recognition scenarios and larger ranges of attributes.
Soft biometrics are attracting a lot of interest with the spread of surveillance systems and the need to identify humans at a distance and under adverse visual conditions. Comparative soft biometrics have shown a significantly better impact on identification performance than traditional categorical soft biometrics. However, existing work on comparative soft biometrics has been based on small datasets with samples taken under constrained visual conditions. In this paper, we investigate human identification using comparative facial soft biometrics on a larger and more realistic scale, using 4038 subjects from the View 1 subset of the LFW database. Furthermore, we introduce a new set of comparative facial soft biometrics and investigate the effect of these on identification and verification performance. Our experiments show that by using only 24 features and 10 comparisons, a rank-10 identification rate of 96.98% and a verification accuracy of 93.66% can be achieved.
Finding the natural language equivalent of structured data is both a challenging and promising task. In particular, an efficient alignment of knowledge bases with texts would benefit many applications, including natural language generation, information retrieval and text simplification. In this paper, we present an approach to build a dataset of triples aligned with equivalent sentences written in natural language. Our approach consists of three main steps. First, target sentences are annotated automatically with knowledge base (KB) concepts and instances. The triples linking these elements in the KB are extracted as candidate facts to be aligned with the annotated sentence. Second, we use textual mentions referring to the subject and object of these facts to semantically simplify the target sentence via crowdsourcing. Third, the sentences provided by different contributors are post-processed to keep only the most relevant simplifications for the alignment with KB facts. We present different filtering methods, and share the constructed datasets in the public domain. These datasets contain 1050 sentences aligned with 1885 triples. They can be used to train natural language generators as well as semantic or contextual text simplifiers.
Erica the Rhino is an interactive art exhibit created by the University of Southampton, UK. Erica was created as part of a city-wide art trail in 2013 called "Go! Rhinos", curated by Marwell Wildlife, to raise awareness of rhino conservation. Erica arrived as a white fibreglass shell which was then painted and equipped with 5 Raspberry Pi Single Board Computers (SBCs). These computers allowed the audience to interact with Erica through a range of sensors and actuators. In particular, the audience could feed and stroke her to prompt reactions, as well as send her Tweets to change her behaviour. Pi SBCs were chosen because of their ready availability and their educational pedigree. During the deployment, 'coding clubs' were run in the shopping centre where Erica was located; these allowed children to experiment with and program the same components used in Erica. The experience gained through numerous deployments around the country has enabled Erica to be upgraded to increase reliability and ease of maintenance, whilst the release of the Pi 2 has allowed her responsiveness to be improved.
The linked repository contains the resultant datasets of the Semantic Sentence Simplification (S3) methodology. Two high quality data-to-text corpora have been built: (i) DBpedia triples aligned with single Wikipedia sentences and (ii) triples from the Unified Medical Language System (UMLS) aligned with single MedlinePlus sentences.
Combining items from social media streams, such as Flickr photos and Twitter tweets, into meaningful groups can help users more effectively contextualise and consume the torrents of information continuously being made available on the social web. This task is made challenging by the scale of the streams and the inherently multimodal nature of the information being contextualised.
The problem of grouping social media items into meaningful groups can be seen as an ill-posed and application specific unsupervised clustering problem. A fundamental question in multimodal contexts is determining which features best signify that two items should belong to the same grouping.
This paper presents a methodology which approaches social event detection as a streaming multi-modal clustering task. The methodology takes advantage of the temporal nature of social events and as a side benefit, allows for scaling to real-world datasets. Specific challenges of the social event detection task are addressed: the engineering and selection of the features used to compare items to one another; a feature fusion strategy that incorporates relative importance of features; the construction of a single sparse affinity matrix; and clustering techniques which produce meaningful item groups whilst scaling to cluster very large numbers of items.
The state-of-the-art approach presented here is evaluated using the ReSEED dataset with standardised evaluation measures. With automatically learned feature weights, we achieve an F1 score of 0.94, showing that a good compromise between precision and recall of clusters can be achieved. In a comparison with other state-of-the-art algorithms our approach is shown to give the best results.
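As an informal illustration of one ingredient of the methodology above (hand-set weights and a fixed time window stand in for the learned quantities), the sketch below fuses per-feature similarities into a single sparse affinity matrix, only comparing items whose timestamps are close, which is what lets the clustering scale.

    import numpy as np
    from scipy.sparse import lil_matrix

    def fused_affinity(items, feature_sims, weights, time_window=3600.0):
        # items: list of dicts with a "time" field (seconds since epoch)
        # feature_sims: functions sim(i, j) -> [0, 1]; weights: one weight per feature
        n = len(items)
        A = lil_matrix((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                if abs(items[i]["time"] - items[j]["time"]) > time_window:
                    continue                      # temporal pruning keeps the matrix sparse
                s = sum(w * sim(i, j) for w, sim in zip(weights, feature_sims))
                A[i, j] = A[j, i] = s
        return A.tocsr()                          # ready for a graph-based clusterer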
Validating user tags helps to refine them, making them more useful for finding images. In the case of interpretation-sensitive tags, however, automatic (i.e., pixel-based) approaches cannot be expected to deliver optimal results. Instead, human input is key. This paper studies how crowdsourcing-based approaches to image tag validation can achieve parsimony in their use of human input from the crowd, in the form of votes collected from workers on a crowdsourcing platform. Experiments in the domain of social fashion images are carried out using the dataset published by the Crowdsourcing Task of the MediaEval 2013 Multimedia Benchmark. Experimental results reveal that when a larger number of crowd-contributed votes are available, it is difficult to beat a majority vote. However, additional information sources, i.e., crowdworker history and visual image features, allow us to maintain similar validation performance while making use of less crowd-contributed input. Further, investing in “expensive” experts who collaborate to create definitions of interpretation-sensitive concepts does not necessarily pay off. Instead, experts can cause interpretations of concepts to drift away from conventional wisdom. In short, validation of interpretation-sensitive user tags for social images is possible, with “just a little help from the crowd”.
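A toy sketch of the baseline comparison discussed above (the vote weighting is an illustrative assumption, not one of the paper's models): a plain majority vote versus a vote weighted by each worker's historical agreement rate.

    from collections import defaultdict

    def validate_tag(votes, worker_accuracy=None):
        # votes: list of (worker_id, vote) pairs with vote in {True, False}
        score = defaultdict(float)
        for worker, vote in votes:
            weight = worker_accuracy.get(worker, 0.5) if worker_accuracy else 1.0
            score[vote] += weight
        return score[True] >= score[False]

    print(validate_tag([("a", True), ("b", False), ("c", True)]))             # plain majority
    print(validate_tag([("a", True), ("b", False)], {"a": 0.55, "b": 0.95}))  # history-weighted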
The LivingKnowledge project aimed to enhance the current state of the art in search, retrieval and knowledge management on the web by advancing the use of sentiment and opinion analysis within multimedia applications. To achieve this aim, a diverse set of novel and complementary analysis techniques have been integrated into a single, but extensible software platform on which such applications can be built. The platform combines state-of-the-art techniques for extracting facts, opinions and sentiment from multimedia documents, and unlike earlier platforms, it exploits both visual and textual techniques to support multimedia information retrieval. Foreseeing the usefulness of this software in the wider community, the platform has been made generally available as an open-source project. This paper describes the platform design, gives an overview of the analysis algorithms integrated into the system and describes two applications that utilise the system for multimedia information retrieval.
Knowing the location where a photograph was taken provides us with data that could be useful in a wide spectrum of applications. With the advance of digital cameras, and with many users exchanging their digital cameras for GPS-enabled mobile phones, photographs annotated with geographical locations are becoming ever more present on photo-sharing websites such as Flickr. However, there is still a mass of content that is not geotagged, meaning that algorithms for efficient and accurate geographical estimation of an image are needed. This paper presents a general model for effectively using both textual metadata and visual features of photos to automatically place them on a world map with state-of-the-art performance. In addition, we explore how information from user modelling can be fused with our model, and investigate the effect such modelling has on performance.
A central goal of the EPSRC-funded Semantic Media Network project is to support interesting collaboration opportunities between researchers in order to foster relationships and encourage working together (EPSRC priority 'Working Together'). SemanticNews was one of the four projects funded in the first round of Semantic Media Network mini-projects, and was a collaboration between the Universities of Southampton and Sheffield, together with the BBC.
The SemanticNews project aimed to promote people's comprehension and assimilation of news by augmenting broadcast news discussion and debate with information from the semantic web in the form of linked open data (LOD). The project has laid the foundations for a toolkit for (semi-)automatic provision of semantic analysis and contextualisation of the discussion of current events, encompassing state-of-the-art semantic web technologies including text mining, consolidation against Linked Open Data, and advanced visualisation.
SemanticNews was bootstrapped using episodes of the BBC Question Time programme that already had transcripts and manually curated metadata, which included a list of the topical questions being debated. This information was used to create a workflow that a) extracts relevant entities using established named entity recognition techniques to identify the types of information to contextualise for a news article; b) provides associations with concepts from LOD resources; and, c) visualises the context using information derived from the LOD cloud.
This document forms the final report of the SemanticNews project, and describes in detail the processes and techniques explored for the enrichment of Question Time episodes. The final section of the report discusses how this work could be expanded in the future, and also makes a few recommendations for additional data that could be captured during the production process to make the automatic generation of the contextualisation easier.
The data contained on the web and the social web are inherently multimedia and consist of a mixture of textual, visual and audio modalities. Community memories embodied on the web and social web contain a rich mixture of data from these modalities. In many ways, the web is the greatest resource ever created by human-kind. However, due to the dynamic and distributed nature of the web, its content changes, appears and disappears on a daily basis. Web archiving provides a way of capturing snapshots of (parts of) the web for preservation and future analysis. This paper provides an overview of techniques we have developed within the context of the EU funded ARCOMEM (ARchiving COmmunity MEMories) project to allow multimedia web content to be leveraged during the archival process and for post-archival analysis. Through a set of use cases, we explore several practical applications of multimedia analytics within the realm of web archiving, web archive analysis and multimedia data on the web in general.
This paper describes the approach we take to the analysis of social media, combining opinion mining from text and multimedia (images, videos, etc), and centred on entity and event recognition. We examine a particular use case, which is to help archivists select material for inclusion in an archive of social media for preserving community memories, moving towards structured preservation around semantic categories. The textual approach we take is rule-based and builds on a number of sub-components, taking into account issues inherent in social media such as noisy ungrammatical text, use of swear words, sarcasm etc. The analysis of multimedia content complements this work in order to help resolve ambiguity and to provide further contextual information. We provide two main innovations in this work: first, the novel combination of text and multimedia opinion mining tools; and second, the adaptation of NLP tools for opinion mining specific to the problems of social media.
This paper describes a modular architecture for searching and hyperlinking clips of TV programmes. The architecture aimed to unify the combination of features from different modalities through a common representation based on a set of probability density functions over the timeline of a programme. The core component of the system consisted of analysis of sections of transcripts based on a textual query. Results show that search is made worse by the addition of other components, whereas in hyperlinking precision is increased by the addition of visual features.
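The common representation can be pictured with a small sketch (purely illustrative; the smoothing width, weights and peak-picking below are assumptions): each modality's matches are turned into a density over the programme timeline and the densities are combined by a weighted sum before selecting candidate segments.

    import numpy as np

    def modality_density(hit_times, duration, sigma=15.0, step=1.0):
        # place a Gaussian at every matching timestamp and normalise the result
        t = np.arange(0.0, duration, step)
        d = np.zeros_like(t)
        for h in hit_times:
            d += np.exp(-0.5 * ((t - h) / sigma) ** 2)
        return t, d / (d.max() + 1e-12)

    t, text_d = modality_density([120, 125, 410], duration=1800)    # transcript matches
    _, visual_d = modality_density([118, 600], duration=1800)       # visual matches
    combined = 0.7 * text_d + 0.3 * visual_d                        # illustrative weights
    anchor_time = t[int(np.argmax(combined))]                       # candidate link anchor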
Combining items from social media streams, such as Flickr photos and Twitter tweets, into meaningful groups can help users contextualise and effectively consume the torrents of information now made available on the social web. This task is made challenging due to the scale of the streams and the inherently multimodal nature of the information to be contextualised. We present a methodology which approaches social event detection as a multi-modal clustering task. We address the various challenges of this task: the selection of the features used to compare items to one another; the construction of a single sparse affinity matrix; combining the features; relative importance of features; and clustering techniques which produce meaningful item groups whilst scaling to cluster large numbers of items. In our best tested configuration we achieve an F1 score of 0.94, showing that a good compromise between precision and recall of clusters can be achieved using our technique.
There is a wide array of online photographic content that is not geotagged. Algorithms for efficient and accurate geographical estimation of an image are needed to geolocate these photos. This paper presents a general model for using both textual metadata and visual features of photos to automatically place them on a world map.
The 2013 MediaEval Crowdsourcing task looked at the problem of working with noisy crowdsourced annotations of image data. The aim of the task was to investigate possible techniques for estimating the true labels of an image by using the set of noisy crowdsourced labels, and possibly any content and metadata from the image itself. For the runs in this paper, we’ve applied a shotgun approach and tried a number of existing techniques, which include generative probabilistic models and further crowdsourcing.
The 2013 MediaEval Retrieving Diverse Social Images Task tackled the problem of search result diversification for Flickr result sets formed from queries about geographic places and landmarks. In this paper we describe our approach of using a min-max similarity diversifier coupled with pre-filters and a re-ranker. We also demonstrate a number of novel features for measuring similarity for use in the diversification step.
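A hedged sketch of what a greedy min-max style diversifier can look like (the exact criterion and parameters of the submission are not reproduced here): after pre-filtering, the top-ranked item is kept and each subsequent pick is the candidate whose maximum similarity to the already selected items is smallest.

    def minmax_diversify(ranked_ids, sim, k=20):
        # ranked_ids: item ids in relevance order; sim(a, b): similarity in [0, 1]
        selected = [ranked_ids[0]]
        candidates = list(ranked_ids[1:])
        while candidates and len(selected) < k:
            pick = min(candidates, key=lambda c: max(sim(c, s) for s in selected))
            selected.append(pick)
            candidates.remove(pick)
        return selected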
The data contained on the web and social web is inherently multimedia, consisting of a mix of textual, visual and audio modalities. Community memories embodied on the web and social web contain a rich mixture of data from these modalities. This paper explores some uses for the automatic analysis of multimedia data within the context of the archival and post-archival analysis of community memories on the web and social web.
The data contained within the web is inherently multimedia, consisting of a rich mix of textual, visual and audio modalities. Prospective Web Observatories need to take this into account from the ground up. This paper explores some uses for the automatic analysis of multimedia data within a Web Observatory, and describes a potential platform for an extensible and scalable multimedia Web Observatory.
Search result diversification can increase user satisfaction in answering a particular information need. There are many ways of diversifying search results. In some cases the user has a clear idea of how they would like to see their results diversified. This work presents a system that is capable of diversifying search results along specific, user-specified axes of diversity.
Millions of images are tweeted every day, yet very little research has looked at the non-textual aspect of social media communication. In this work we have developed a system to analyse streams of image data. In particular we explore trends in similar, related, evolving or even duplicated visual artefacts in the mass of tweeted image data — in short, we explore the visual pulse of Twitter.
The ability to handle very large amounts of image data is important for image analysis, indexing and retrieval applications. Sadly, in the literature, scalability aspects are often ignored or glanced over, especially with respect to the intricacies of actual implementation details.
In this paper we present a case-study showing how a standard bag-of-visual-words image indexing pipeline can be scaled across a distributed cluster of machines. In order to achieve scalability, we investigate the optimal combination of hybridisations of the MapReduce distributed computational framework which allows the components of the analysis and indexing pipeline to be effectively mapped and run on modern server hardware. We then demonstrate the scalability of the approach practically with a set of image analysis and indexing tools built on top of the Apache Hadoop MapReduce framework. The tools used for our experiments are freely available as open-source software, and the paper fully describes the nuances of their implementation.
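The per-image work that such a pipeline distributes is easy to sketch (an illustration under assumed data shapes, not code from the tools themselves): each mapper quantises an image's local descriptors against a fixed codebook and emits a bag-of-visual-words histogram, and the reducers assemble the inverted index from the emitted (visual word, image) pairs.

    import numpy as np
    from scipy.spatial import cKDTree

    def quantise(descriptors, codebook):
        # descriptors: (n, 128) SIFT-like features; codebook: (k, 128) cluster centres
        tree = cKDTree(codebook)
        _, words = tree.query(descriptors)                    # nearest visual word per descriptor
        return np.bincount(words, minlength=len(codebook))    # term-frequency histogram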
Photo publishing in Social Networks and other Web 2.0 applications has become very popular due to the pervasive availability of cheap digital cameras, powerful batch upload tools and a huge amount of storage space. A portion of uploaded images are of a highly sensitive nature, disclosing many details of the users’ private lives. We have developed a web service which can detect private images within a user’s photo stream and provide support in making privacy decisions in the sharing context. In addition, we present a privacy-oriented image search application which automatically identifies potentially sensitive images in the result set and separates them from the remaining pictures.
OpenIMAJ and ImageTerrier are recently released open-source libraries and tools for experimentation and development of multimedia applications using Java-compatible programming languages. OpenIMAJ (the Open toolkit for Intelligent Multimedia Analysis in Java) is a collection of libraries for multimedia analysis. The image libraries contain methods for processing images and extracting state-of-the-art features, including SIFT. The video and audio libraries support both cross-platform capture and processing. The clustering and nearest-neighbour libraries contain efficient, multi-threaded implementations of clustering algorithms. The clustering library makes it possible to easily create BoVW representations for images and videos. OpenIMAJ also incorporates a number of tools to enable extremely large-scale multimedia analysis using distributed computing with Apache Hadoop. ImageTerrier is a scalable, high-performance search engine platform for content-based image retrieval applications using features extracted with the OpenIMAJ library and tools. The ImageTerrier platform provides a comprehensive test-bed for experimenting with image retrieval techniques. The platform incorporates a state-of-the-art implementation of the single-pass indexing technique for constructing inverted indexes and is capable of producing highly compressed index data structures.
The SIFT keypoint descriptor is a powerful approach to encoding local image description using edge orientation histograms. Through codebook construction via k-means clustering and quantisation of SIFT features we can achieve image retrieval treating images as bags-of-words. Intensity inversion of images results in distinct SIFT features for a single local image patch across the two images. Intensity inversions notwithstanding these two patches are structurally identical. Through careful reordering of the SIFT feature vectors, we can construct the SIFT feature that would have been generated from a non-inverted image patch starting with those extracted from an inverted image patch. Furthermore, through examination of the local feature detection stage, we can estimate whether a given SIFT feature belongs in the space of inverted features, or non-inverted features. Therefore we can consistently separate the space of SIFT features into two distinct subspaces. With this knowledge, we can demonstrate reduced time complexity of codebook construction via clustering by up to a factor of four and also reduce the memory consumption of the clustering algorithms while producing equivalent retrieval results.
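The reordering can be illustrated with a small sketch. It rests on a derivation rather than on the paper itself: inverting intensities rotates every gradient, and hence the keypoint's dominant orientation, by 180 degrees, so in the orientation-normalised frame the 128-dimensional descriptor (4x4 spatial cells by 8 orientation bins) of the inverted patch is the original descriptor with its spatial grid rotated by 180 degrees. The exact permutation used in the paper may differ.

    import numpy as np

    def invert_sift(descriptor):
        # descriptor: length-128 SIFT vector laid out as (4 x 4 spatial cells) x 8 orientations
        hist = np.asarray(descriptor).reshape(4, 4, 8)
        return hist[::-1, ::-1, :].reshape(128)   # 180-degree rotation of the spatial grid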
In this paper we study the connection between sentiment of images expressed in metadata and their visual content in the social photo sharing environment Flickr. To this end, we consider the bag-of-visual words representation as well as the color distribution of images, and make use of the SentiWordNet thesaurus to extract numerical values for their sentiment from accompanying textual metadata. We then perform a discriminative feature analysis based on information theoretic methods, and apply machine learning techniques to predict the sentiment of images. Our large-scale empirical study on a set of over half a million Flickr images shows a considerable correlation between sentiment and visual features, and promising results towards estimating the polarity of sentiment in images.
We present a brief overview of the way in which image analysis, coupled with associated collateral text, is being used for auto-annotation and sentiment analysis. In particular, we describe our approach to auto-annotation using the graph-theoretic dominant set clustering algorithm and the annotation of images with sentiment scores from SentiWordNet. Preliminary results are given for both, and our planned work aims to explore synergies between the two approaches.
The availability of a large, freely redistributable set of high-quality annotated images is critical to allowing researchers in the area of automatic annotation, generic object recognition and concept detection to compare results. The recent introduction of the MIR Flickr dataset allows researchers such access. A dataset by itself is not enough, and a set of repeatable guidelines for performing evaluations that are comparable is required. In many cases it also is useful to compare the machine-learning components of different automatic annotation techniques using a common set of image features. This paper seeks to provide a solid, repeatable methodology and protocol for performing evaluations of automatic annotation software using the MIR Flickr dataset together with freely available tools for measuring performance in a controlled manner. This protocol is demonstrated through a set of experiments using a “semantic space” auto-annotator previously developed by the authors, in combination with a set of visual term features for the images that has been made publicly available for download. The paper also discusses how much training data is required to train the semantic space annotator with the MIR Flickr dataset. It is the hope of the authors that researchers will adopt this methodology and produce results from their own annotators that can be directly compared to those presented in this work.
This paper proposes a new technique for auto-annotation and semantic retrieval based upon the idea of linearly mapping an image feature space to a keyword space. The new technique is compared to several related techniques, and a number of salient points about each of the techniques are discussed and contrasted. The paper also discusses how these techniques might actually scale to a real-world retrieval problem, and demonstrates this through a case study of a semantic retrieval technique being used on a real-world data-set (with a mix of annotated and unannotated images) from a picture library.
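The core of a linear feature-to-keyword mapping can be sketched in a few lines (a ridge-regression illustration under assumed data shapes; the paper's own formulation may differ): learn a matrix that projects image features into keyword space and score keywords for an unannotated image by projecting its features.

    import numpy as np

    def learn_mapping(X, Y, lam=1e-2):
        # X: (n_images, d) image features; Y: (n_images, k) binary keyword indicators
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # (d, k) mapping

    def annotate(x, W, vocabulary, top=5):
        scores = x @ W                                 # position of the image in keyword space
        return [vocabulary[i] for i in np.argsort(scores)[::-1][:top]]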
The diffusion of new Internet and web technologies has increased the distribution of different digital content, such as text, sounds, images and videos. In this paper we focus on images and their role in the analysis of diversity. We consider diversity as a concept that takes into account the wide variety of information sources, and their differences in perspective and viewpoint. We describe a number of different dimensions of diversity; in particular, we analyze the dimensions related to image searches and context analysis, emotions conveyed by images and opinion mining, and bias analysis.
This paper describes the diversity-enabled retrieval system constructed at Southampton for the ImageCLEFphoto 2009 task. The retrieval system used Terrier as the underlying textual indexing and retrieval system, and combined it with a technique for re-ranking the results by maximising the visual dissimilarity of retrieved images. The results show that our visual re-ranking method does indeed increase the diversity of the top results; however, at the same time it causes a slight drop in precision. The text-based approach designed for handling the 'part 1 topics' of the task is also shown to perform very well.
This paper describes Southampton's submissions to the 2009 ImageCLEF photo annotation task. For the task we used an annotation system based on the idea of constructing semantic spaces, which was developed previously at Southampton. To represent the image content, we used a combination of different SIFT and Colour-SIFT features detected using the difference-of-Gaussian and MSER techniques. These features were converted into a visual term representation by applying vector quantisation using a codebook learnt from a hierarchical k-means clustering. In terms of EER and AUC, the annotator performs reasonably well, however, it struggles when evaluated using the hierarchical measure proposed for the task, due to the way the annotation confidences are thresholded.
LifeGuide is a software package that allows health professionals and researchers with no programming skills to easily and flexibly create, evaluate and modify behavioural interventions. An intervention called the ‘Internet Doctor’ was developed as a way of identifying many of the tools that were required in LifeGuide. The ‘Internet Doctor’ provides people suffering from cold and flu symptoms with tailored advice for the self-care of cold and flu symptoms. Participants were automatically randomised to one of two versions of the website: (i) the full, ‘more interactive’ version, or (ii) a ‘less interactive’ version which omitted references to the Internet Doctor and links to obtain further information. Participants who viewed the less interactive version were more likely to complete the full consultation cycle for their selected symptom and were also more likely to consult for more symptoms than those in the more interactive version. Few participants clicked on the optional links in the more interactive version. It is concluded that although the more interactive version of the website provided more information, participants did not make full use of the interactive features which displayed this information, and did not consult for as many symptoms, so may not have benefited from the website as much as those viewing the less interactive version.
The IMS Question and Test Interoperability (QTI) standard identifies sixteen different question types which may be used in on-line assessment. While some partial implementations exist, the R2Q2 project has developed a complete solution that renders and responds to all sixteen question types as specified. In addition, care has been taken in the R2Q2 project to ensure that the solution produced will allow for future changes in the specification. The design of R2Q2 is described, the focus being on lessons learnt. We describe the architecture and the rationale of the internal Web services and explain the approach taken in implementing the QTI specification, showing how the design allows for future tags to be added with minimal programming effort. The QTI standard has not had a great take-up, in part due to the lack of tools. In the 2006 JISC Capital Programme, three Assessment projects were commissioned: item authoring, item banking, and QTI-compliant test delivery. This paper describes the ‘ASDEL’ test delivery engine, focusing upon its architecture, its relation to the item authoring and item banking services, and the integration of the R2Q2 Web service.
Behavioural interventions are used by social scientists to effect change in a person’s behaviour. The LifeGuide project is developing tools to enable the easy creation, deployment and trialling of Internet-based behavioural interventions. The use of on-line behavioural interventions is appealing as it can be more cost-effective than face-to-face interventions, can deliver tailored advice at times that suit the participants, and can provide detailed statistical information that can be used to better understand behaviour or demonstrate the efficacy of the interventions themselves. The problem, however, is that developing on-line interventions is a complex, time-consuming task that has often involved high levels of specialist computing support in construction and delivery. The LifeGuide project is looking to put tools into the hands of domain specialists (psychologists, social scientists, health professionals, etc.) that enable them to easily construct their own behavioural interventions and deploy them on the Internet. This paper looks at the authoring tools currently being developed by the project, assesses their usability through case studies of interventions developed so far, and suggests where the project will look in the future to continue to improve the tools to meet the needs of the wide range of intervention authors.
We are developing a set of software resources named ‘the LifeGuide’ that will enable researchers to collaboratively create, evaluate and modify two central dimensions of behavioural interventions: a) providing tailored advice; b) supporting sustained behaviour.
Behavioural interventions are a technique used by social scientists and health professionals to mediate the behaviour of a subject. Traditionally, interventions take the form of tailored advice given in a face-to-face setting. Internet-based behavioural interventions harness the power of the web to deliver tailored advice to participants at the time that most suits them. The LifeGuide project is a multidisciplinary collaboration with the aim of developing and proving a set of software tools for the development and deployment of internet-based behavioural interventions. The tools developed in LifeGuide cover the complete lifecycle of an intervention, from initial authoring to trialling and refinement to final deployment. Looking ahead, in the longer term we intend to investigate how the LifeGuide toolset can be applied to other domains.
The IMS Question and Test Interoperability (QTI) standard has not had a great take-up, in part due to the lack of tools. In the 2006 JISC Capital Programme, three Assessment projects were commissioned: item authoring, item banking, and QTI-compliant test delivery. This paper describes the ‘ASDEL’ test delivery engine, focusing upon its architecture, its relation to the item authoring and item banking services, and the integration of the R2Q2 Web service. The project first developed a Java library to implement the system. This will allow other developers and researchers to build their own systems or take aspects of QTI they want to implement.
Semantic spaces encode similarity relationships between objects as a function of position in a mathematical space. This paper discusses three different formulations for building semantic spaces which allow the automatic-annotation and semantic retrieval of images. The models discussed in this paper require that the image content be described in the form of a series of visual-terms, rather than as a continuous feature-vector. The paper also discusses how these term-based models compare to the latest state-of-the-art continuous feature models for auto-annotation and retrieval.
The IMS Question and Test Interoperability (QTI) standard has had a restricted take-up, in part due to the lack of tools. This paper describes the ‘ASDEL’ test delivery engine, focusing upon its architecture, its relation to item authoring and item banking services, and the integration of the R2Q2 web service. The tools developed operate with a web client, as a plug-in to Moodle, or as a desktop application. The paper also reports on the load testing of the internal services and concludes that these are best represented as components. The project first developed a Java library to implement the system. This will allow other developers and researchers to build their own systems or incorporate aspects of QTI they want to implement.
The IMS Question and Test Interoperability (QTI) standard has not had a great take-up, in part due to the lack of tools. This paper describes the ‘ASDEL’ test delivery engine, focusing upon its architecture, its relation to the item authoring and item banking services, and the integration of the R2Q2 Web service. The project first developed a Java library to implement the system. This will allow other developers and researchers to build their own systems or take aspects of QTI they want to implement.
Users of image retrieval systems often find it frustrating that the image they are looking for is not ranked near the top of the results they are presented. This paper presents a computational approach for ranking keyworded images in order of relevance to a given keyword. Our approach uses machine learning to attempt to learn what visual features within an image are most related to the keywords, and then provide ranking based on similarity to a visual aggregate. To evaluate the technique, a Web 2.0 application has been developed to obtain a corpus of user-generated ranking information for a given image collection that can be used to evaluate the performance of the ranking algorithm.
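A minimal sketch of the aggregate-and-rank idea (illustrative only; the learned visual features and similarity used in the paper are not reproduced): average the features of images already associated with a keyword and rank candidates by cosine similarity to that aggregate.

    import numpy as np

    def rank_for_keyword(candidate_feats, tagged_feats):
        # candidate_feats: (n, d) features to rank; tagged_feats: (m, d) features of tagged images
        prototype = tagged_feats.mean(axis=0)                         # the visual aggregate
        sims = candidate_feats @ prototype / (
            np.linalg.norm(candidate_feats, axis=1) * np.linalg.norm(prototype) + 1e-12)
        return np.argsort(sims)[::-1]                                 # most relevant first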
The MapSnapper project aimed to develop a system for robust matching of low-quality images of a paper map taken from a mobile phone against a high-quality digital raster representation of the same map. The paper presents a novel methodology for performing content-based image retrieval and object recognition from query images that have been degraded by noise and subjected to transformations through the imaging system. In addition, the paper provides an insight into the evaluation-driven development process that was used to incrementally improve the matching performance until the design specifications were met.
Purpose – To provide a better-informed view of the extent of the semantic gap in image retrieval, and the limited potential for bridging it offered by current semantic image retrieval techniques.
Design/methodology/approach – Within an ongoing project, a broad spectrum of operational image retrieval activity has been surveyed, and, from a number of collaborating institutions, a test collection assembled which comprises user requests, the images selected in response to those requests, and their associated metadata. This has provided the evidence base upon which to make informed observations on the efficacy of cutting-edge automatic annotation techniques which seek to integrate the text-based and content-based image retrieval paradigms.
Findings – Evidence from the real-world practice of image retrieval highlights the existence of a generic-specific continuum of object identification, and the incidence of temporal, spatial, significance and abstract concept facets, manifest in textual indexing and real-query scenarios but often having no directly visible presence in an image. These factors combine to limit the functionality of current semantic image retrieval techniques, which interpret only visible features at the generic extremity of the generic-specific continuum.
Research limitations/implications – The project is concerned with the traditional image retrieval environment in which retrieval transactions are conducted on still images which form part of managed collections. The possibilities offered by ontological support for adding functionality to automatic annotation techniques are considered.
Originality/value – The paper offers fresh insights into the challenge of migrating content-based image retrieval from the laboratory to the operational environment, informed by newly-assembled, comprehensive, live data.
This paper introduces a faceted model of image semantics which attempts to express the richness of semantic content interpretable within an image. Using a large image data-set from a museum collection the paper shows how the facet representation can be applied. The second half of the paper describes our semantic retrieval system, and demonstrates its use with the museum image collection. A retrieval evaluation is performed using the system to investigate how the retrieval performance varies with respect to each of the facet categories. A number of factors related to the image data-set that affect the quality of retrieval are also discussed.
The QTI standard identifies sixteen different question types which may be used in on-line assessment. While some partial implementations exist, the R2Q2 project has developed a complete solution that renders and responds to all sixteen question types as specified. In addition, care has been taken in the R2Q2 project to ensure that the solution produced will allow for future changes in the specification. The paper summarises the rationale of Web services and a Service Oriented Architecture, and then demonstrates how the R2Q2 project integrates into JISC’s e-Framework, and the reference model for assessment (FREMA). The design of R2Q2 is described, the focus being on lessons learnt. We describe the architecture and the rationale of the internal Web services and explain the approach taken in implementing the QTI specification, showing how the design allows for future tags to be added with minimal programming effort. A major objective of the design was to solve the problem of having to undertake a major redesign and reimplementation as a result of minor modifications to the specification. In the 2006 Capital Programme from JISC, three new projects were commissioned in the area of Assessment: one for authoring of items, one for item banking, and one for a complete test engine as described in the QTI specification. The R2Q2 Web service is at the heart of all three projects and this paper will describe how the R2Q2 Web service will be used.
This poster demonstrates our recent work in the field of intelligent image retrieval in response to real requests from the practitioner domain. The poster shows how we are developing a data-driven 'semantic space' framework for information retrieval which can enable retrieval of unannotated imagery through natural language queries, and also facilitate automatic annotation of imagery.
We live in a world where we are surrounded by ever-increasing numbers of images. More often than not, these images have very little metadata by which they can be indexed and searched. In order to avoid information overload, techniques need to be developed to enable these image collections to be searched by their content. Much of the previous work on image retrieval has used global features such as colour and texture to describe the content of the image. However, these global features are insufficient to accurately describe the image content when different parts of the image have different characteristics. This thesis initially discusses how this problem can be circumvented by using salient interest regions to select the areas of the image that are most interesting and generate local descriptors to describe the image characteristics in that region. The thesis discusses a number of different saliency detectors that are suitable for robust retrieval purposes and performs a comparison between a number of these region detectors. The thesis then discusses how salient regions can be used for image retrieval using a number of techniques, but most importantly, two techniques inspired by the field of textual information retrieval. Using these robust retrieval techniques, a new paradigm in image retrieval is discussed, whereby the retrieval takes place on a mobile device using a query image captured by a built-in camera. This paradigm is demonstrated in the context of an art gallery, in which the device can be used to find more information about particular images. The final chapter of the thesis discusses some approaches to bridging the semantic gap in image retrieval. The chapter explores ways in which unannotated image collections can be searched by keyword. Two techniques are discussed; the first explicitly attempts to annotate the unannotated images automatically, so that the resulting annotations can be used for searching. The second approach does not try to explicitly annotate images, but rather, through the use of linear algebra, it attempts to create a semantic space in which images and keywords are positioned such that images are close to the keywords that represent them within the space.
We present Ambient Gestures, a novel gesture-based system designed to support ubiquitous ‘in the environment’ interactions with everyday computing technology. Hand gestures and audio feedback allow users to control computer applications without reliance on a graphical user interface, and without having to switch from the context of a non-computer task to the context of the computer. The Ambient Gestures system is composed of a vision recognition software application, a set of gestures to be processed by a scripting application and a navigation and selection application that is controlled by the gestures. This system allows us to explore gestures as the primary means of interaction within a multimodal, multimedia environment. In this paper we describe the Ambient Gestures system, define the gestures and the interactions that can be achieved in this environment and present a formative study of the system. We conclude with a discussion of our findings and future applications of Ambient Gestures in ubiquitous computing.
Semantic representation of multimedia information is vital for enabling the kind of multimedia search capabilities that professional searchers require. Manual annotation is often not possible because of the sheer scale of the multimedia information that needs indexing. This paper explores the ways in which we are using both top-down, ontologically driven approaches and bottom-up, automatic-annotation approaches to provide retrieval facilities to users. We also discuss many of the current techniques that we are investigating to combine these top-down and bottom-up approaches.
Traditionally, statistical models for image auto-annotation have been coupled with image segmentation. Considering the performance of current segmentation algorithms, it can be beneficial to avoid a segmentation stage. In this paper, we propose a new approach to image auto-annotation using statistical models. In this approach, segmentation is avoided through the use of salient regions. The use of the statistical model results in an annotation performance which improves upon our previously proposed saliency-based word propagation technique. We also show that the use of salient regions achieves better results than the use of general image regions or segments.
This paper attempts to review and characterise the problem of the semantic gap in image retrieval and the attempts being made to bridge it. In particular, we draw from our own experience in user queries, automatic annotation and ontological techniques. The first section of the paper describes a characterisation of the semantic gap as a hierarchy between the raw media and full semantic understanding of the media's content. The second section discusses real users' queries with respect to the semantic gap. The final sections of the paper describe our own experience in attempting to bridge the semantic gap. In particular we discuss our work on auto-annotation and semantic-space models of image retrieval in order to bridge the gap from the bottom up, and the use of ontologies, which capture more semantics than keyword object labels alone, as a technique for bridging the gap from the top down.
This paper presents a novel technique for learning the underlying structure that links visual observations with semantics. The technique, inspired by a text-retrieval technique known as cross-language latent semantic indexing, uses linear algebra to learn the semantic structure linking image features and keywords from a training set of annotated images. This structure can then be applied to unannotated images, thus providing the ability to search the unannotated images based on keyword. This factorisation approach is shown to perform well, even when using only simple global image features.
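A minimal sketch of the factorisation idea, assuming images are described by counts of quantised visual terms: stack the visual-term and keyword matrices of the training images, take a truncated SVD, then fold an unannotated image into the reduced space and score keywords by cosine similarity. The function names, the choice of k and the cosine scoring are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def build_semantic_space(feature_term_matrix, keyword_matrix, k=2):
    """Learn a joint 'semantic space' from annotated training images.

    feature_term_matrix: (n_visual_terms, n_images) counts of visual terms
    keyword_matrix     : (n_keywords, n_images) binary keyword occurrences
    Returns the truncated SVD factors of the stacked observation matrix.
    """
    O = np.vstack([feature_term_matrix, keyword_matrix])
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(x, U_k, s_k):
    """Project a (padded) term vector into the k-dimensional semantic space."""
    return (U_k.T @ x) / s_k

def keyword_scores(new_image_terms, U_k, s_k, n_visual_terms, n_keywords):
    """Score keywords for an unannotated image by cosine similarity in the space."""
    # Pad the visual-term vector with zeros in the keyword block.
    x = np.concatenate([new_image_terms, np.zeros(n_keywords)])
    img_vec = fold_in(x, U_k, s_k)
    scores = []
    for j in range(n_keywords):
        e = np.zeros(n_visual_terms + n_keywords)
        e[n_visual_terms + j] = 1.0                 # unit vector for keyword j
        kw_vec = fold_in(e, U_k, s_k)
        c = img_vec @ kw_vec / (np.linalg.norm(img_vec) * np.linalg.norm(kw_vec) + 1e-12)
        scores.append(c)
    return np.array(scores)
```

Folding new images into the space via the projection shown above is the standard LSI treatment of unseen documents; the paper may use a different weighting or similarity measure.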
This paper introduces the iGesture platform for investigating multimodal gesture based interactions in multimedia contexts. iGesture is a low-cost, extensible system that uses visual recognition of hand movements to support gesture-based input. Computer vision techniques support gesture based interactions that are lightweight, with minimal interaction constraints. The system enables gestures to be carried out 'in the environment' at a distance from the camera, enabling multimodal interaction in a naturalistic, transparent manner in a ubiquitous computing environment. The iGesture system can also be rapidly scripted to enable gesture-based input with a wide variety of applications. In this paper we present the technology behind the iGesture software, and a performance evaluation of the gesture recognition subsystem. We also present two exemplar multimedia application contexts which we are using to explore ambient gesture-based interactions.
This paper presents an investigation into the use of a mobile device as a novel interface to a content-based image retrieval system. The initial development has been based on the concept of using the mobile device in an art gallery for mining data about the exhibits, although a number of other applications are envisaged. The paper presents a novel methodology for performing content-based image retrieval and object recognition from query images that have been degraded by noise and subjected to transformations through the imaging system. The methodology uses techniques inspired from the information retrieval community in order to aid efficient indexing and retrieval. In particular, a vector-space model is used in the efficient indexing of each image, and a two-stage pruning/ranking procedure is used to determine the correct matching image. The retrieval algorithm is shown to outperform a number of existing algorithms when used with query images from the mobile device.
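The two-stage pruning/ranking step might look roughly like the sketch below, assuming each image has been quantised into a set of visual terms: an inverted index shortlists database images sharing terms with the query, and the shortlist is then re-ranked by cosine similarity between term vectors. The data structures and parameter names are illustrative, not the paper's implementation.

```python
from collections import defaultdict
import numpy as np

def build_inverted_index(db_terms):
    """db_terms: dict image_id -> set of quantised visual-term ids."""
    index = defaultdict(set)
    for img, terms in db_terms.items():
        for t in terms:
            index[t].add(img)
    return index

def two_stage_retrieve(query_terms, index, query_vec, db_vecs, shortlist=20):
    # Stage 1: prune -- keep the images sharing the most visual terms with the query.
    votes = defaultdict(int)
    for t in query_terms:
        for img in index.get(t, ()):
            votes[img] += 1
    candidates = sorted(votes, key=votes.get, reverse=True)[:shortlist]

    # Stage 2: rank -- cosine similarity between term-frequency vectors.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return sorted(candidates, key=lambda img: cosine(query_vec, db_vecs[img]), reverse=True)
```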
The vector-space retrieval model and Latent Semantic Indexing approaches to retrieval have been used heavily in the field of text information retrieval for many years. The use of these approaches in image retrieval, however, has been somewhat limited. In this paper, we present methods for using these techniques in combination with an invariant image representation based on local descriptors of salient regions. The paper also presents an evaluation in which the two techniques are used to find images with similar semantic labels.
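As a hedged sketch of applying Latent Semantic Indexing to visual terms, assuming a term-by-image count matrix built from quantised salient-region descriptors, images can be compared in a truncated-SVD latent space; the rank k and the similarity measure are assumptions.

```python
import numpy as np

def lsi_image_similarity(term_image_matrix, k=10):
    """Compare images in a reduced latent space built from visual-term counts.

    term_image_matrix: (n_visual_terms, n_images) bag-of-visual-terms counts.
    Returns an (n_images, n_images) cosine-similarity matrix in the k-D space.
    """
    U, s, Vt = np.linalg.svd(term_image_matrix, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k]).T          # each row: one image in the latent space
    docs = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + 1e-12)
    return docs @ docs.T
```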
In this paper, we propose a model of automatic image annotation based on propagation of keywords. The model works on the premise that visually similar image content is likely to have similar semantic content. Image content is extracted by computing local descriptors at salient points within the image and quantising the feature vectors into visual terms. The visual terms for each image are modelled using techniques taken from the information retrieval community. The modelled information from an unlabelled query image is compared to the models of a corpus of labelled images, and labels are propagated from the most similar labelled images to the query image.
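A minimal sketch of the propagation step, under the assumption that each image is represented by a visual-term vector and that keywords are transferred from the k most similar labelled images with similarity-weighted voting (the paper's exact weighting may differ):

```python
from collections import defaultdict
import numpy as np

def propagate_keywords(query_vec, labelled_vecs, labelled_keywords, k=5, n_labels=3):
    """Propagate keywords from the k most similar labelled images to a query image.

    query_vec        : visual-term vector of the unlabelled query image
    labelled_vecs    : dict image_id -> visual-term vector
    labelled_keywords: dict image_id -> list of keywords
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    sims = {img: cosine(query_vec, v) for img, v in labelled_vecs.items()}
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]

    # Accumulate similarity-weighted votes for each keyword of the neighbours.
    votes = defaultdict(float)
    for img in neighbours:
        for kw in labelled_keywords[img]:
            votes[kw] += sims[img]
    return sorted(votes, key=votes.get, reverse=True)[:n_labels]
```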
Much previous work on image retrieval has used global features such as colour and texture to describe the content of the image. However, these global features are insufficient to accurately describe the image content when different parts of the image have different characteristics. This paper discusses how this problem can be circumvented by using salient interest points, and extends previous work by incorporating the concept of scale into the selection of salient regions, so that the most interesting areas of the image are selected and local descriptors generated to describe the image characteristics in those regions. The paper describes and contrasts two such salient region descriptors and compares them through their repeatability rate under a range of common image transforms. Finally, the paper goes on to investigate the performance of one of the salient region detectors in an image retrieval situation.
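The repeatability rate used to compare detectors can be sketched as follows, assuming the ground-truth transform between an image pair is a known homography and that a region counts as repeated when a detection in the transformed image lies within a small pixel tolerance of its mapped centre (the tolerance and the matching criterion are assumptions):

```python
import numpy as np

def repeatability(regions_a, regions_b, homography, tol=3.0):
    """Fraction of regions detected in image A that are repeated in transformed image B.

    regions_a, regions_b: (N, 2) and (M, 2) arrays of region centres (x, y)
    homography          : 3x3 matrix mapping A coordinates into B
    tol                 : pixel tolerance for declaring a correspondence
    """
    # Map A's centres into B's frame using homogeneous coordinates.
    pts = np.hstack([regions_a, np.ones((len(regions_a), 1))])
    mapped = (homography @ pts.T).T
    mapped = mapped[:, :2] / mapped[:, 2:3]

    repeated = 0
    for p in mapped:
        dists = np.linalg.norm(regions_b - p, axis=1)
        if dists.size and dists.min() <= tol:
            repeated += 1
    return repeated / max(len(regions_a), 1)
```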
In this paper, we introduce a novel technique for image matching and feature-based tracking. The technique is based on the idea of using the Scale-Saliency algorithm to pick a sparse set of ‘interesting’ or ‘salient’ features. Feature vectors for each of the salient regions are generated and used in the matching process. Due to the nature of the sparse representation of feature vectors generated by the technique, sub-image matching is also accomplished. We demonstrate the technique’s robustness to geometric transformations in the query image and suggest that the technique would be suitable for view-based object recognition. We also apply the matching technique to the problem of feature tracking across multiple video frames by matching salient regions across frame pairs. We show that our tracking algorithm is able to explicitly extract the 3D motion vector of each salient region during the tracking process, using a single uncalibrated camera. We illustrate the functionality of our tracking algorithm by showing results from tracking a single salient region in near real-time with a live camera input.
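A rough sketch of matching sparse salient-region descriptors between two images, using nearest-neighbour matching with a distance-ratio test; the ratio test is an assumption standing in for whatever matching criterion the paper actually uses.

```python
import numpy as np

def match_salient_regions(desc_a, desc_b, ratio=0.8):
    """Match sparse salient-region descriptors between two images.

    desc_a, desc_b: (Na, d) and (Nb, d) arrays of region feature vectors.
    Uses nearest-neighbour matching with a distance-ratio test (an assumption;
    the paper's own matching criterion may differ).
    Returns a list of (index_in_a, index_in_b) matches.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        if len(order) >= 2 and dists[order[0]] < ratio * dists[order[1]]:
            matches.append((i, int(order[0])))   # unambiguous nearest neighbour
        elif len(order) == 1:
            matches.append((i, int(order[0])))
    return matches
```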