Consider a distributed deep learning experiment using data parallelism and synchronous SGD across multiple servers. Each server has 8 P100 GPUs, and the batch size per GPU is kept fixed at 64. The training dataset contains 131072 images. Scaling efficiency is defined as the ratio of the per-iteration time when training with 1 server to the per-iteration time when training with N servers. The results of this experiment, showing how the per-iteration time scales with the number of servers, are given below.

Based on this data, answer the following. For answers with a decimal part, round to two decimal places; no points will be awarded if the answer is incorrectly rounded.

1. Per-epoch time (in seconds) with 256 GPUs:
2. Throughput (images/sec) with 256 GPUs:
3. Per-epoch time (in seconds) with 1024 GPUs:
4. Throughput (images/sec) with 1024 GPUs:
5. Up to 256 GPUs, the scaling efficiency is greater than 85%. True or False.
6. The scaling efficiency is less than 75% with 1024 GPUs. True or False.
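All six answers follow from the fixed quantities in the problem (8 GPUs per server, 64 images per GPU per iteration, 131072 images per epoch) combined with the measured per-iteration times. Those measured times come from the experiment's plot, which is not reproduced here, so the timing values in the sketch below are purely illustrative placeholders; only the formulas are meant to carry over. A minimal Python sketch of the arithmetic, assuming hypothetical per-iteration times:

```python
GPUS_PER_SERVER = 8
BATCH_PER_GPU = 64
DATASET_SIZE = 131072  # images per epoch

# Hypothetical per-iteration times (seconds), keyed by number of servers.
# Replace these with the values read off the experiment's plot.
per_iter_time_sec = {
    1: 0.50,    # 8 GPUs    (placeholder)
    32: 0.55,   # 256 GPUs  (placeholder)
    128: 0.70,  # 1024 GPUs (placeholder)
}

def global_batch(servers):
    """Images processed per iteration across all GPUs (per-GPU batch is fixed)."""
    return servers * GPUS_PER_SERVER * BATCH_PER_GPU

def per_epoch_time(servers):
    """Seconds per epoch = iterations per epoch * per-iteration time."""
    iters_per_epoch = DATASET_SIZE / global_batch(servers)
    return iters_per_epoch * per_iter_time_sec[servers]

def throughput(servers):
    """Images processed per second."""
    return global_batch(servers) / per_iter_time_sec[servers]

def scaling_efficiency(servers):
    """T_1 / T_N, per the definition in the problem statement."""
    return per_iter_time_sec[1] / per_iter_time_sec[servers]

for n in (32, 128):
    print(f"{n * GPUS_PER_SERVER} GPUs: "
          f"per-epoch time = {per_epoch_time(n):.2f} s, "
          f"throughput = {throughput(n):.2f} img/s, "
          f"efficiency = {scaling_efficiency(n):.2%}")
```

Note that with 256 GPUs the global batch is 256 x 64 = 16384 images, so an epoch over 131072 images takes exactly 8 iterations; with 1024 GPUs the global batch is 65536 images, so an epoch takes 2 iterations. Only the per-iteration times themselves need to be read from the experimental data.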