Kent Academic Repository

Understanding Environmental Cues Using Deep Learning-Based Image Analysis

Mohamed, Elhassan (2022) Understanding Environmental Cues Using Deep Learning-Based Image Analysis. Doctor of Philosophy (PhD) thesis, University of Kent. (doi:10.22024/UniKent/01.02.97922) (KAR id:97922)

Abstract

The recent advances in Artificial Intelligence (AI) motivate its ubiquitous use in computer vision, with applications in several fields such as autonomous and assisted navigation. Deep learning, a branch of AI, has proven useful for tasks such as object detection and semantic segmentation, where its algorithms achieve high performance in terms of accuracy and computational time compared to conventional techniques. However, new application areas, such as providing information about arbitrary environments to support indoor navigation or simultaneous indoor and outdoor navigation, introduce several challenges that must be overcome. Addressing these challenges requires novel deep learning-based methods to be introduced, implemented, and tested in realistic scenarios, such as the assistive driving of mobile platforms, e.g., powered wheelchairs.

This thesis introduces and explores novel deep learning techniques for object detection and semantic segmentation to enable intelligent systems that aid scene understanding and human-system interaction and that could be used in the navigation of any robotic platform. A prominent area in which such systems are needed, namely aiding the driving of powered wheelchairs for users with visual disabilities, is chosen as the realistic application in which to test the performance of the introduced algorithms and methods. Extensive investigations of their characteristics are performed, including the use of explainable AI (XAI) to justify the corresponding system outputs.

A review of relevant literature reveals a number of distinct challenges that need to be addressed to develop a system able to operate in realistic environments:

The first challenge our proposed systems aim to address is performing well on small and large objects simultaneously. State-of-the-art object detection systems struggle with the localisation of small objects: they are usually trained on large objects that contain abundant information and large numbers of pixels for the model to exploit during training and inference. Our research investigates the performance of these detectors on a proposed dataset that mainly contains small objects. Furthermore, the study discusses means of enhancing a detector's performance on tasks that involve the detection of multi-size objects by using multi-head detectors that make predictions on feature maps at different scales. The introduced multi-head detector achieved a mean Average Precision (mAP) of 0.818 on the proposed dataset. Finally, our findings propose a roadmap to help the scientific community choose the best detector for a given application.
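To illustrate the multi-head idea, the PyTorch sketch below attaches one prediction head to each of several backbone feature maps, so small objects can be localised on high-resolution maps and large objects on coarse ones. The channel widths, class count, and anchor count are hypothetical placeholders, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class MultiHeadDetector(nn.Module):
    """Toy multi-head detector: one prediction head per feature-map scale."""

    def __init__(self, in_channels=(256, 512, 1024), num_classes=20, num_anchors=3):
        super().__init__()
        # Each head predicts, per anchor and per location, the class
        # scores plus 4 bounding-box offsets.
        self.heads = nn.ModuleList(
            nn.Conv2d(c, num_anchors * (num_classes + 4), kernel_size=3, padding=1)
            for c in in_channels
        )

    def forward(self, feature_maps):
        # feature_maps: finest resolution first (e.g. strides 8, 16, 32);
        # the fine map carries the many-pixel detail that small objects need.
        return [head(fmap) for head, fmap in zip(self.heads, feature_maps)]

# Dummy feature maps standing in for a real backbone/FPN output.
fmaps = [torch.randn(1, c, s, s) for c, s in [(256, 80), (512, 40), (1024, 20)]]
for out in MultiHeadDetector()(fmaps):
    print(out.shape)  # one multi-scale prediction tensor per head
```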

The second important challenge to be addressed is the requirement to provide information about the elements of the scene on which the system's decisions are based. System transparency ensures the reliability of deep learning-based computer vision systems: it is important to attain not only an accurate system but also one that can explain its predictions. A black-box system should provide insights into what is happening inside it before it can be approved and used in real-life applications, and policymakers and legislators require a certain level of system transparency to approve such technologies. Therefore, in this thesis, we investigate the robustness of systems in terms of their ability to explain the reasons behind specific decisions by introducing novel explanation techniques, thus contributing to the field of XAI. Novel explanation techniques that can visualise the two main characteristics of robust explanation maps, i.e., fine-grained details and discriminative regions, in a single representation (in the form of a "heatmap") are introduced, implemented, and tested for well-known AI methods. Unlike the standard visualisation methods currently in use, the introduced ones can identify multiple important image characteristics upon which the system's decisions are based.
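As a concrete but deliberately generic illustration of fusing the two characteristics in one heatmap, the sketch below combines a Grad-CAM-style coarse, class-discriminative map with a fine-grained input-gradient map by element-wise multiplication; this follows the well-known Guided Grad-CAM pattern rather than the thesis' exact technique, and the `features`/`classifier` split of the network is an assumption for the example.

```python
import torch
import torch.nn.functional as F

def combined_heatmap(features, classifier, image, target_class):
    """Fuse a coarse discriminative map with fine-grained detail.

    `features` maps the image to the last conv activations; `classifier`
    maps those activations to class logits (assumed splits of some
    trained network, with pooling/flattening inside `classifier`).
    """
    image = image.clone().requires_grad_(True)
    acts = features(image)            # (1, C, h, w) conv activations
    acts.retain_grad()                # keep grads of a non-leaf tensor
    classifier(acts)[0, target_class].backward()

    # Coarse, class-discriminative map (Grad-CAM style): weight each
    # channel by its spatially averaged gradient, then ReLU the sum.
    weights = acts.grad.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)

    # Fine-grained map: per-pixel magnitude of the input gradient.
    fine = image.grad.abs().max(dim=1, keepdim=True).values

    fused = cam * fine                # both cues in a single heatmap
    return (fused / (fused.max() + 1e-8)).detach()
```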

The third main challenge addressed in this work is providing information about objects in the environment at the level of individual image pixels. Semantic segmentation at the pixel level is explored to better utilise the available images of the dataset used. Pixel classification defines the boundaries and geometric shape of a target object better than object detection, which only provides bounding boxes containing the detected objects. Classifying every pixel in an image, and consequently identifying each object's boundary, size, and location, facilitates subsequent tasks and human-system interaction at higher accuracy. Novel semantic segmentation architectures that can process images from both indoor and outdoor environments are also introduced, implemented, and tested in this work. Analysing and understanding data from two different distributions (indoor and outdoor), with a variety of object sizes, is challenging because of the difference in image contexts and the limited number of publicly available datasets; our systems were shown to handle this task with significant accuracy and processing speed.
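The sketch below shows the core mechanics of pixel-level classification in a few lines: a tiny fully convolutional network produces one score map per class, and a per-pixel argmax yields the segmentation mask. The layer widths and eight-class count are illustrative placeholders, not the architectures introduced in the thesis.

```python
import torch
import torch.nn as nn

# Minimal fully convolutional segmenter: every pixel receives a class
# label, which delineates object boundaries and shape more precisely
# than a bounding box.
segmenter = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 8, 1),             # 8 hypothetical classes, 1 score map each
)

image = torch.randn(1, 3, 240, 320)  # dummy RGB frame
logits = segmenter(image)            # (1, 8, 240, 320) class scores per pixel
mask = logits.argmax(dim=1)          # (1, 240, 320) one class label per pixel
print(mask.shape, mask.unique())
```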

Finally, the proposed systems are tested in a realistic scenario drawn from the field of assistive robotics in powered mobility (Electric Powered Wheelchairs, EPWs). Visually impaired persons with comorbidities are often not prescribed a powered wheelchair because of their sight condition, which makes this an ideal setting in which to test our system under real conditions. The proposed semantic segmentation system aims to provide visual cues that aid the navigation process and increase the user's independence. As these systems are meant to be installed on moving platforms such as mobile robots (EPWs in our case), they are susceptible to mechanical vibrations caused by different terrains, which can negatively impact the performance of deep learning-based computer vision systems. Vibration effects on these systems are examined in detail, and the implications for performance and prospective solutions are highlighted. Our final results indicate a 4% deterioration in performance due to these vibrations.
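As a rough illustration of how such degradation can be quantified offline, the sketch below emulates camera shake with random pixel shifts and compares per-pixel accuracy on clean versus perturbed inputs. The jitter model and the `model`/`images`/`masks` names are assumptions for the example, not the experimental setup used in the thesis.

```python
import torch

def shake(batch, max_shift=6):
    # Crude stand-in for vibration: random translation of the whole frame.
    dy, dx = (int(torch.randint(-max_shift, max_shift + 1, (1,)))
              for _ in range(2))
    return torch.roll(batch, shifts=(dy, dx), dims=(2, 3))

@torch.no_grad()
def pixel_accuracy(model, images, masks, perturb=None):
    inputs = perturb(images) if perturb is not None else images
    preds = model(inputs).argmax(dim=1)   # (N, H, W) class per pixel
    return (preds == masks).float().mean().item()

# Hypothetical usage with a trained segmenter and a labelled batch:
# drop = pixel_accuracy(model, images, masks) \
#        - pixel_accuracy(model, images, masks, perturb=shake)
```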

Item Type: Thesis (Doctor of Philosophy (PhD))
Thesis advisor: Sirlantzis, Konstantinos
Thesis advisor: Howells, Gareth
DOI/Identification number: 10.22024/UniKent/01.02.97922
Uncontrolled keywords: Computer vision; Deep learning; Explainable artificial intelligence; Neural networks; Object detection; Semantic segmentation; XAI
Subjects: T Technology > TA Engineering (General). Civil engineering (General)
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Engineering and Digital Arts
SWORD Depositor: System Moodle
Depositing User: System Moodle
Date Deposited: 10 Nov 2022 15:10 UTC
Last Modified: 01 Nov 2023 00:00 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/97922
