Open access
Author
Date
2022
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Deep neural networks (DNNs) have succeeded in many perception tasks, e.g., computer vision, natural language processing, and reinforcement learning. High-performing DNNs, however, rely on intensive resource consumption. For example, training a DNN requires a large dynamic memory, a large-scale dataset, and a large number of computations (a long training time); even inference with a DNN demands a large amount of static storage, computations (a long inference time), and energy. Therefore, state-of-the-art DNNs are often deployed on a cloud server with a large number of super-computers, a high-bandwidth communication bus, a shared storage infrastructure, and a high power supply.
Recently, emerging intelligent applications, e.g., AR/VR, mobile assistants, and the Internet of Things, require us to deploy DNNs on resource-constrained edge devices. Compared to a cloud server, edge devices have only a rather small amount of resources. To deploy DNNs on edge devices, we need to reduce their size, i.e., we target a better trade-off between resource consumption and model accuracy.
In this thesis, we study four edge intelligence scenarios and develop different methodologies to enable deep learning in each of them. Since current DNNs are often over-parameterized, our goal is to find and reduce the redundancy of the DNNs in each scenario. We summarize the four studied scenarios as follows:
- Inference on Edge Devices: Firstly, we enable efficient inference of DNNs under the fixed resource constraints of edge devices. Compared to cloud inference, inference on edge devices avoids transmitting the data to the cloud server and can thus be more stable, faster, and more energy-efficient. Since the main resource constraints of inference stem from storing a large number of weights and from the computation, we propose Adaptive Loss-aware Quantization (ALQ) for multi-bit networks (see the ALQ sketch after this list). ALQ reduces the redundancy in the quantization bitwidth. The direct optimization objective (i.e., the loss) and the learned adaptive bitwidth assignment allow ALQ to acquire extremely low-bit networks with an average bitwidth below 1-bit, while yielding a higher accuracy than state-of-the-art binary networks.
- Adaptation on Edge Devices: Secondly, we enable efficient adaptation of DNNs when the resource constraints on the target edge devices change dynamically at runtime, e.g., the allowed execution time or the allocatable RAM. To maximize the model accuracy during on-device inference, we develop a new synthesis approach, Dynamic REal-time Sparse Subnets (DRESS), which samples and executes sub-networks with different resource demands from a single backbone network (see the DRESS sketch after this list). DRESS reduces the redundancy among the sub-networks through weight sharing and architecture sharing, resulting in storage efficiency and re-configuration efficiency, respectively. The generated sub-networks have different sparsity levels and can therefore be fetched and executed under varying resource constraints by utilizing sparse tensor computations.
- Learning on Edge Devices: Thirdly, we enable efficient learning of DNNs when facing unseen environments or users on edge devices. On-device learning requires both data efficiency and memory efficiency. We thus propose a new meta-learning method, p-Meta, to enable memory-efficient learning from only a few samples of unseen tasks (see the p-Meta sketch after this list). p-Meta reduces the updating redundancy by identifying and updating only structure-wise adaptation-critical weights, which lowers the memory consumption required for the weight updates.
- Edge-Server System: Finally, we enable efficient inference and efficient updating on edge-server systems. In an edge-server system, several resource-constrained edge devices are connected to a resource-sufficient server through a constrained communication bus. Since only limited relevant training data is available before deployment, pretrained DNNs can often be improved significantly after the initial deployment. On such an edge-server system, on-device inference is preferred over cloud inference, since it achieves a fast and stable inference with less energy consumption. Yet retraining on the cloud server is preferred over on-device retraining (or federated learning) due to the limited memory and computing power of edge devices. We propose a novel pipeline, Deep Partial Updating (DPU), to iteratively update the deployed inference models (see the DPU sketch after this list). In particular, when newly collected data samples from the edge devices or from other sources are available at the server, the server selects only a subset of critical weights to update and sends them to each edge device. This weight-wise partial updating reduces the redundant updating by reusing the pretrained weights, and achieves an accuracy similar to full updating at a significantly lower communication cost.
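ALQ sketch. A minimal sketch of the multi-bit representation used by ALQ, where a weight group is approximated as a sum of scaled binary bases, w ≈ Σ α_i b_i with b_i ∈ {-1,+1}. The function name and the early-stopping rule are illustrative assumptions: the sketch stops on reconstruction error, whereas ALQ itself assigns each group's bitwidth by its effect on the training loss.

```python
import numpy as np

def multibit_quantize(w, max_bits=3, tol=1e-2):
    """Greedy residual fit: w ~ sum_i alpha_i * b_i, b_i in {-1, +1}.

    Simplified stand-in for ALQ's multi-bit form; ALQ picks bitwidths
    loss-aware, not by the reconstruction error used here.
    """
    residual = w.astype(np.float64).copy()
    alphas, bases = [], []
    for _ in range(max_bits):
        b = np.sign(residual)
        b[b == 0] = 1.0
        alpha = np.abs(residual).mean()   # optimal scale for a sign basis
        alphas.append(alpha)
        bases.append(b)
        residual -= alpha * b
        if np.linalg.norm(residual) / (np.linalg.norm(w) + 1e-12) < tol:
            break                          # adaptive: stop early -> fewer bits
    w_hat = sum(a * b for a, b in zip(alphas, bases))
    return np.array(alphas), np.stack(bases), w_hat

w = np.random.default_rng(0).normal(size=64)
alphas, bases, w_hat = multibit_quantize(w)
print(len(alphas), "bits; rel. error:",
      np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

Because groups that are easy to fit stop after one or two bases, the average bitwidth over all groups can drop below 1-bit even though individual groups use several bases.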
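DRESS sketch. A minimal sketch of the weight-sharing idea behind DRESS: sub-networks of different sparsity reuse one set of backbone weights via binary masks. Both the magnitude-based ranking and the strict nesting of the masks are simplifying assumptions made here to keep the sketch short; DRESS learns the masks jointly with the shared weights during training.

```python
import numpy as np

def nested_sparse_masks(w, sparsities=(0.5, 0.8, 0.95)):
    """Derive nested binary masks over one shared weight tensor.

    Each sparser subnet keeps a subset of the weights of a denser one
    (ranked by magnitude here), which makes weight sharing explicit.
    """
    order = np.argsort(-np.abs(w).ravel())    # largest magnitude first
    masks = {}
    for s in sorted(sparsities):
        k = int(round((1 - s) * w.size))      # number of kept weights
        m = np.zeros(w.size, dtype=bool)
        m[order[:k]] = True                   # top-k prefix -> nesting for free
        masks[s] = m.reshape(w.shape)
    return masks

w = np.random.default_rng(1).normal(size=(16, 16))
masks = nested_sparse_masks(w)
# At run time, pick the mask that fits the current budget and infer
# with the sparse tensor w * mask.
assert np.all(masks[0.95] <= masks[0.5])      # sparser subnet is a subset
```

Since all masks index the same backbone tensor, switching between sub-networks at runtime only requires loading a different mask, not a different set of weights.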
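p-Meta sketch. The following sketch illustrates the structure-wise partial updating behind p-Meta at a very coarse granularity: rank weight structures by gradient norm and apply the update step only to the most critical fraction. The function name, the per-tensor granularity, and the gradient-norm criterion are all illustrative assumptions; p-Meta works at a finer (e.g., layer- and channel-wise) granularity and learns which structures are adaptation-critical during meta-training.

```python
import torch

def structured_partial_update(model, loss, keep_ratio=0.2, lr=1e-2):
    """Apply an SGD step only to the most adaptation-critical structures.

    Coarse sketch of p-Meta's idea: structures that are not selected stay
    frozen, so their update buffers need not be kept in memory.
    """
    params = list(model.parameters())
    grads = torch.autograd.grad(loss, params)
    # Size-normalized gradient norm as a (hypothetical) criticality score.
    scores = torch.stack([g.norm() / (p.numel() ** 0.5)
                          for g, p in zip(grads, params)])
    k = max(1, int(keep_ratio * len(params)))
    keep = set(torch.topk(scores, k).indices.tolist())
    with torch.no_grad():
        for i, (p, g) in enumerate(zip(params, grads)):
            if i in keep:
                p -= lr * g                   # only critical structures change

# Toy few-shot adaptation step on a (hypothetical) tiny model.
model = torch.nn.Linear(8, 2)
x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = torch.nn.functional.cross_entropy(model(x), y)
structured_partial_update(model, loss)
```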
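DPU sketch. Finally, a minimal sketch of the communication pattern of DPU: the server retrains the model, keeps only a small fraction of the weight changes, and sends this sparse patch to each device, which reuses all other pretrained weights unchanged. Ranking the changes by magnitude is an illustrative assumption, as are the function names; DPU itself selects the weights by their contribution to the loss reduction.

```python
import numpy as np

def partial_update(w_deployed, w_retrained, budget=0.01):
    """Server side: pick the small set of weight changes that matter most.

    Simplified: rank by update magnitude and keep a `budget` fraction.
    Returns the sparse patch that is actually communicated.
    """
    delta = (w_retrained - w_deployed).ravel()
    k = max(1, int(budget * delta.size))
    idx = np.argsort(-np.abs(delta))[:k]
    return idx, delta[idx]                    # indices + values: tiny payload

def apply_patch(w_deployed, idx, vals):
    """Device side: overwrite only the selected weights, reuse the rest."""
    w = w_deployed.ravel().copy()
    w[idx] += vals
    return w.reshape(w_deployed.shape)

rng = np.random.default_rng(2)
w_old = rng.normal(size=(128,))
w_new = w_old + 0.1 * rng.normal(size=(128,))
idx, vals = partial_update(w_old, w_new)
w_patched = apply_patch(w_old, idx, vals)
print("communicated", idx.size, "of", w_old.size, "weights")
```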
Permanent link
https://doi.org/10.3929/ethz-b-000574442
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Deep learning; On-device AI; Efficient deep learning; Edge AI; Edge computing
Organisational unit
03429 - Thiele, Lothar (emeritus) / Thiele, Lothar (emeritus)
Funding
180545 - NCCR Automation (phase I) (SNF)