
Concurrent downloads with Python using asyncio or thread pools


When downloading a large number of files with Python, you are I/O bound. A vanilla implementation with requests like the one below would yield sequential, blocking calls with files downloaded one at a time.

import time
import requests

start = time.perf_counter()
urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

for url in urls:
    print(f"Downloading {url}")
    resp = requests.get(url)
    with open(url.split("/")[-1], 'wb') as f:
        f.write(resp.content)
        print(f"Done downloading {url}")

print(f"Total time: {time.perf_counter() - start}")

In the snippet above, when requests.get(url) is called, the calling thread blocks and waits for a response from the server. Only once the response arrives does it save the file and move on to the next URL, as is evident from the output below.

Downloading https://i.imgur.com/AD3MbBi.jpeg
Done downloading https://i.imgur.com/AD3MbBi.jpeg
Downloading https://i.imgur.com/zYhkOrM.jpeg
Done downloading https://i.imgur.com/zYhkOrM.jpeg
Downloading https://i.imgur.com/LRoLTlK.jpeg
Done downloading https://i.imgur.com/LRoLTlK.jpeg
Downloading https://i.imgur.com/gtWsPu9.jpeg
Done downloading https://i.imgur.com/gtWsPu9.jpeg
Downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/jDimNTZ.jpeg
Total time: 1.2269745559970033

The core idea to speed this up is to fire requests without waiting for each response, pushing throughput closer to server or channel capacity. In Python this can be done in a few ways, and I’ll cover two popular ones: process/thread pools and asyncio. Also, while the examples here are downloads, the same techniques apply to any I/O bound task.


1. Process/Thread pools

One familiar approach here is to create multiple processes or threads and fire requests in parallel. While you can do this directly with the multiprocessing or threading modules, you should probably use the concurrent.futures module instead since it provides a nicer interface. Here’s a simple example using ThreadPoolExecutor:


import time
from concurrent.futures import ThreadPoolExecutor
import requests

start = time.perf_counter()

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

def download_image(url):
    print(f"Downloading {url}")
    resp = requests.get(url)
    with open(url.split("/")[-1], 'wb') as f:
        f.write(resp.content)
    print(f"Done downloading {url}")

with ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(download_image, urls)

print(f"Total time: {time.perf_counter() - start}")

Here, a pool of up to 5 threads is created and each thread is handed a URL to download. You can see from the output that the requests are fired in parallel and the total time is a fraction of the sequential version’s. Process pools (ProcessPoolExecutor) work the same way; they are heavier-weight, but offer more isolation and are not constrained by the GIL.

Downloading https://i.imgur.com/AD3MbBi.jpeg
Downloading https://i.imgur.com/zYhkOrM.jpeg
Downloading https://i.imgur.com/LRoLTlK.jpeg
Downloading https://i.imgur.com/gtWsPu9.jpeg
Downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/AD3MbBi.jpeg
Done downloading https://i.imgur.com/zYhkOrM.jpeg
Done downloading https://i.imgur.com/gtWsPu9.jpeg
Done downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/LRoLTlK.jpeg
Total time: 0.4605532810019213
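Switching to a process pool is mostly a drop-in change: swap in ProcessPoolExecutor, make sure the worker is a top-level (importable) function, and create the pool under a `__main__` guard so child processes can be spawned safely. A minimal sketch of the swap, with the network call simulated by time.sleep so it runs offline (the example.com URLs are placeholders, not real images):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def fetch(url):
    # Stand-in for requests.get(url) + writing the file: sleep briefly,
    # then return the filename that would have been saved.
    time.sleep(0.1)
    return url.split("/")[-1]

if __name__ == "__main__":
    urls = [f"https://example.com/img{i}.jpeg" for i in range(5)]
    start = time.perf_counter()
    # Each URL is handled by a separate worker process; map preserves order.
    with ProcessPoolExecutor(max_workers=5) as executor:
        names = list(executor.map(fetch, urls))
    print(names)
    print(f"Total time: {time.perf_counter() - start:.2f}s")
```

The `__main__` guard matters: on platforms using the spawn start method (Windows, macOS), each worker re-imports the module, and unguarded pool creation would recurse.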

When to use it:

- The bottleneck is blocking I/O and the libraries involved (like requests) are synchronous.
- You want minimal changes to existing sequential code: wrap the loop body in a function and map it over a pool.

Gotchas:

- Threads share state, so guard any shared mutable data with locks.
- The GIL means thread pools won’t speed up CPU-bound work; reach for a process pool there.
- executor.map defers exceptions until you iterate over its results, so failed downloads can pass silently if you discard the return value.


2. Asyncio

Another way to achieve this is asyncio, the standard-library framework for writing concurrent code (available since Python 3.4; the asyncio.run used below needs 3.7+). It lets you side-step threads entirely if you don’t want to deal with them, though you can also combine asyncio with processes/threads when that’s useful. Note that requests is not async, so you’ll need aiohttp instead if you want pure async, or httpx if you want a sync/async hybrid.

Here’s a simple example using asyncio:

import time
import asyncio
import aiohttp

start = time.perf_counter()

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

async def download_image(session, url):
    print(f"Downloading {url}")
    async with session.get(url) as resp:
        with open(url.split("/")[-1], 'wb') as f:
            f.write(await resp.read())
    print(f"Done downloading {url}")

async def main():
    # One shared session reuses connections instead of opening one per URL.
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[download_image(session, url) for url in urls])

asyncio.run(main())

print(f"Total time: {time.perf_counter() - start}")

The output follows the same pattern as the thread pool example: all requests are fired up front and completions arrive as responses come back.

Downloading https://i.imgur.com/AD3MbBi.jpeg
Downloading https://i.imgur.com/zYhkOrM.jpeg
Downloading https://i.imgur.com/LRoLTlK.jpeg
Downloading https://i.imgur.com/gtWsPu9.jpeg
Downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/AD3MbBi.jpeg
Done downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/zYhkOrM.jpeg
Done downloading https://i.imgur.com/gtWsPu9.jpeg
Done downloading https://i.imgur.com/LRoLTlK.jpeg
Total time: 0.4095943249994889
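With only five URLs, gather-everything works fine; with thousands, you’ll usually want to cap how many requests are in flight at once so you don’t exhaust sockets or hammer the server. A sketch of the usual pattern with asyncio.Semaphore, simulated with asyncio.sleep in place of session.get so it runs without aiohttp or network access (the example.com URLs are placeholders):

```python
import asyncio
import time

async def fetch(url, sem):
    async with sem:  # at most 2 downloads in flight at any moment
        await asyncio.sleep(0.1)  # stand-in for session.get(url) + resp.read()
        return url.split("/")[-1]

async def main():
    urls = [f"https://example.com/img{i}.jpeg" for i in range(6)]
    sem = asyncio.Semaphore(2)
    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

start = time.perf_counter()
names = asyncio.run(main())
print(names)
print(f"Total time: {time.perf_counter() - start:.2f}s")
```

Six 0.1-second tasks limited to 2 at a time finish in roughly 0.3 seconds here, instead of all six starting simultaneously.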

Pros:

- A single thread can juggle thousands of concurrent connections, with far less memory overhead than a comparable pool of threads or processes.
- Scheduling is cooperative: control only switches at await points, which makes shared state easier to reason about than with preemptive threads.

Gotchas:

- Any blocking call inside a coroutine (requests.get, time.sleep, plain file I/O) stalls the entire event loop; use async-native libraries or push blocking work to a thread.
- async/await is contagious: anything that awaits a coroutine must itself be a coroutine, entered via asyncio.run at the top.
- Firing thousands of requests at once can overwhelm the server or exhaust local sockets; bound concurrency with asyncio.Semaphore.
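The blocking-call gotcha has a standard escape hatch: asyncio.to_thread (Python 3.9+) runs a blocking function in a worker thread and hands back an awaitable, letting you use a synchronous library like requests from async code. A sketch with time.sleep standing in for the blocking call (the example.com URLs are placeholders):

```python
import asyncio
import time

def blocking_fetch(url):
    # Stand-in for a blocking requests.get(url) call.
    time.sleep(0.1)
    return url.split("/")[-1]

async def main():
    urls = [f"https://example.com/img{i}.jpeg" for i in range(5)]
    # Each blocking call runs in its own worker thread, so they overlap
    # instead of stalling the event loop one after another.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_fetch, u) for u in urls)
    )

names = asyncio.run(main())
print(names)
```

Under the hood this is just the default thread pool again, so it inherits the thread-pool trade-offs rather than being "pure" async.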



