Concurrent downloads with Python using asyncio or thread pools
When downloading a large number of files with Python, you are I/O bound. A vanilla implementation with requests like the one below makes sequential, blocking calls, downloading files one at a time.
```python
import time

import requests

start = time.perf_counter()

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

for url in urls:
    print(f"Downloading {url}")
    resp = requests.get(url)
    with open(url.split("/")[-1], 'wb') as f:
        f.write(resp.content)
    print(f"Done downloading {url}")

print(f"Total time: {time.perf_counter() - start}")
```
In the snippet above, when requests.get(url) is called, the CPU sits idle and waits for a response from the server. Once a response is received, the program saves the file and moves on to the next URL, as evident from the output below.
```
Downloading https://i.imgur.com/AD3MbBi.jpeg
Done downloading https://i.imgur.com/AD3MbBi.jpeg
Downloading https://i.imgur.com/zYhkOrM.jpeg
Done downloading https://i.imgur.com/zYhkOrM.jpeg
Downloading https://i.imgur.com/LRoLTlK.jpeg
Done downloading https://i.imgur.com/LRoLTlK.jpeg
Downloading https://i.imgur.com/gtWsPu9.jpeg
Done downloading https://i.imgur.com/gtWsPu9.jpeg
Downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/jDimNTZ.jpeg
Total time: 1.2269745559970033
```
The core idea to speed this up is to fire requests without waiting for a response, pushing throughput closer to server or channel capacity. In Python this can be done in a few ways; I'll cover two popular ones: process/thread pools and asyncio. And while the examples here are downloads, the same approach applies to any I/O-bound task.
1. Process/Thread pools
One familiar approach here is to create multiple processes/threads and fire requests in parallel. While you can do this with the multiprocessing or threading modules, you should probably use the concurrent.futures module instead, since it provides a nicer interface. Here's a simple example using ThreadPoolExecutor:
```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

start = time.perf_counter()

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

def download_image(url):
    print(f"Downloading {url}")
    resp = requests.get(url)
    with open(url.split("/")[-1], 'wb') as f:
        f.write(resp.content)
    print(f"Done downloading {url}")

with ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(download_image, urls)

print(f"Total time: {time.perf_counter() - start}")
```
Here, at most 5 threads are created and each thread is assigned a URL to download. You can see that the requests are fired in parallel and the total time taken is much less than the sequential version. Process pools work in a similar way (a sketch follows the output below); processes are heavier than threads, but they offer more isolation and aren't constrained by the GIL.
```
Downloading https://i.imgur.com/AD3MbBi.jpeg
Downloading https://i.imgur.com/zYhkOrM.jpeg
Downloading https://i.imgur.com/LRoLTlK.jpeg
Downloading https://i.imgur.com/gtWsPu9.jpeg
Downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/AD3MbBi.jpeg
Done downloading https://i.imgur.com/zYhkOrM.jpeg
Done downloading https://i.imgur.com/gtWsPu9.jpeg
Done downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/LRoLTlK.jpeg
Total time: 0.4605532810019213
```
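Since a process pool only differs from a thread pool at the executor level, here's a minimal sketch of the same download with ProcessPoolExecutor. Two caveats worth noting: the worker function must live at module level so it can be pickled, and the pool should be created under an `if __name__ == "__main__":` guard on platforms that spawn worker processes (Windows, and macOS by default):

```python
import time
from concurrent.futures import ProcessPoolExecutor

import requests

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

# Must be a module-level function so worker processes can pickle it.
def download_image(url):
    resp = requests.get(url)
    with open(url.split("/")[-1], 'wb') as f:
        f.write(resp.content)
    return url

if __name__ == "__main__":
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=5) as executor:
        # map() yields results in input order as workers finish them
        for url in executor.map(download_image, urls):
            print(f"Done downloading {url}")
    print(f"Total time: {time.perf_counter() - start}")
```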
When to use it:
- You don't want to deal with async code and prefer a more familiar interface.
- You want to process the files after downloading them and have multiple cores available; for that CPU-bound work, a process pool will be faster than asyncio alone.
Gotchas:
- You might be limited by the number of processes/threads you can create on your system.
- If you're sharing state between workers, you'll need some manual synchronization.
- If you're using a large number of processes/threads, you might run into connection limits & memory issues.
2. Asyncio
Another way to achieve this is asyncio, Python's native library for writing concurrent code (available since 3.4). This lets you side-step threads if you don't want to deal with them; note that you can also use asyncio within processes/threads, which might be useful in some cases. Also, Python's requests library is synchronous, so you'll need aiohttp instead if you want pure async, or httpx if you want a sync/async hybrid. Here's a simple example using asyncio:
```python
import time
import asyncio

import aiohttp

start = time.perf_counter()

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

async def download_image(url):
    print(f"Downloading {url}")
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            with open(url.split("/")[-1], 'wb') as f:
                f.write(await resp.read())
    print(f"Done downloading {url}")

async def main():
    await asyncio.gather(*[download_image(url) for url in urls])

asyncio.run(main())
print(f"Total time: {time.perf_counter() - start}")
```
The above snippet follows a similar behavioral pattern to the thread pool example, with one nit: it opens a new ClientSession per URL. A sketch that shares a single session follows the output below.
```
Downloading https://i.imgur.com/AD3MbBi.jpeg
Downloading https://i.imgur.com/zYhkOrM.jpeg
Downloading https://i.imgur.com/LRoLTlK.jpeg
Downloading https://i.imgur.com/gtWsPu9.jpeg
Downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/AD3MbBi.jpeg
Done downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/zYhkOrM.jpeg
Done downloading https://i.imgur.com/gtWsPu9.jpeg
Done downloading https://i.imgur.com/LRoLTlK.jpeg
Total time: 0.4095943249994889
```
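One refinement worth sketching: the snippet above opens a fresh ClientSession for every URL, whereas aiohttp's docs recommend sharing a single session so connections get pooled and reused. Pairing that with an asyncio.Semaphore also caps how many requests are in flight at once (the limit of 5 below is an arbitrary choice):

```python
import asyncio

import aiohttp

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

async def download_image(session, semaphore, url):
    # The semaphore bounds in-flight requests; the shared session
    # reuses connections instead of opening a pool per URL.
    async with semaphore:
        async with session.get(url) as resp:
            data = await resp.read()
    with open(url.split("/")[-1], 'wb') as f:
        f.write(data)

async def main():
    semaphore = asyncio.Semaphore(5)  # arbitrary concurrency cap
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(download_image(session, semaphore, url) for url in urls)
        )

asyncio.run(main())
```

With the cap in place, growing the URL list into the thousands no longer risks exhausting sockets or memory.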
Pros:
- No real limit on the number of concurrent requests you can make, since you're not bound by how many processes/threads you can create.
- You can always use asyncio within processes/threads if you need to; a minimal sketch follows this list.
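As a rough illustration of that last point, here's a sketch that splits the URL list across worker processes, each of which runs its own event loop over its slice (the choice of 2 processes is arbitrary):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

async def download_image(session, url):
    async with session.get(url) as resp:
        with open(url.split("/")[-1], 'wb') as f:
            f.write(await resp.read())

async def download_all(chunk):
    # One shared session per process; requests fired concurrently.
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(download_image(session, url) for url in chunk))

def worker(chunk):
    # Each worker process spins up its own event loop.
    asyncio.run(download_all(chunk))

if __name__ == "__main__":
    n = 2  # arbitrary number of worker processes
    chunks = [urls[i::n] for i in range(n)]
    with ProcessPoolExecutor(max_workers=n) as executor:
        list(executor.map(worker, chunks))
```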
Gotchas:
- Connection limits & memory issues are still a concern.
- Not very intuitive if you're not familiar with async code.
- You'll need to switch to aiohttp or httpx if you want to use async requests (see the httpx sketch after this list).
- If you're using Jupyter notebooks, you'll need nest_asyncio (call nest_asyncio.apply() before asyncio.run) to make it work, or run the coroutine within the notebook's existing event loop.
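For reference, here's roughly what the same download looks like with httpx, which mirrors the requests API in both sync (httpx.get) and async (httpx.AsyncClient) flavors:

```python
import asyncio

import httpx

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg",
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

async def download_image(client, url):
    # Same call shape as requests, just awaited on a shared client.
    resp = await client.get(url)
    with open(url.split("/")[-1], 'wb') as f:
        f.write(resp.content)

async def main():
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(download_image(client, url) for url in urls))

asyncio.run(main())
```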
References
- SuperFastPython, an excellent source to get a deep understanding of concurrency in Python.