e-space
Manchester Metropolitan University's Research Repository

    A generic performance model for deep learning in a distributed environment

    Kavarakuntla, Tulasi, Han, Liangxiu (ORCID: https://orcid.org/0000-0003-2491-7473), Lloyd, Huw (ORCID: https://orcid.org/0000-0001-6537-4036), Latham, Annabel (ORCID: https://orcid.org/0000-0002-8410-7950), Kleerekoper, Anthony (ORCID: https://orcid.org/0000-0002-3621-8568) and Akintoye, Samson B (ORCID: https://orcid.org/0000-0001-5058-433X) (2024) A generic performance model for deep learning in a distributed environment. IEEE Access, 12. pp. 8207-8219. ISSN 2169-3536

    Published Version
    Available under License Creative Commons Attribution Non-commercial No Derivatives.


    Abstract

    Performance modelling of a deep learning application is essential to quantify and improve the efficiency of the model framework. However, existing performance models are mostly case-specific, with limited applicability to new deep learning frameworks and applications. In this paper, we propose a generic performance model of an application in a distributed environment, with a generic expression of the application execution time that considers the influence of both intrinsic factors/operations (e.g. algorithmic parameters and internal operations) and extrinsic scaling factors (e.g. the number of processors, data chunks and batch size). We formulate the fitting of this expression as a global optimisation problem and solve it using a regularised cost function and a differential evolution algorithm, which finds the best-fit values of the constants in the generic expression so that it matches the experimentally determined computation time. We have evaluated the proposed model on three deep learning frameworks (TensorFlow, MXNet and PyTorch). The experimental results show that the proposed model provides accurate performance predictions and interpretability. In addition, the proposed work can be applied to any distributed deep neural network without instrumenting the code, and it provides insight into the factors affecting performance and scalability.
