This will be a rather brief overview and benchmark of two different ways you can parallelize HTTP requests in Python. The complete code snippet can be found at the end of this article.
Part 1 - concurrent.futures & requests
Most people familiar with Python have used the requests library in one way or another; it’s one of the simplest and most elegant solutions for making HTTP requests in Python. So, naturally, when we think of multithreading HTTP calls, wrapping requests in some form of parallel execution is the first thing that comes to mind.
Let’s write a base method that makes an HTTP GET call using requests:
def http_get_with_requests(url: str, headers: Dict = {}, proxies: Dict = {}, timeout: int = 10) -> Tuple[int, Dict[str, Any], bytes]:
    response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)

    # Not every response body is JSON; requests raises ValueError if decoding fails
    response_json = None
    try:
        response_json = response.json()
    except ValueError:
        pass

    response_content = None
    try:
        response_content = response.content
    except Exception:
        pass

    return (response.status_code, response_json, response_content)
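For reference, a single call looks like this (using the same test URL as in the benchmark below):

status, body_json, body = http_get_with_requests("https://api.myip.com/")
print(status, body_json)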
We then add parallelization on top of it, using ThreadPoolExecutor from concurrent.futures:
def http_get_with_requests_parallel(list_of_urls: List[str], headers: Dict = {}, proxies: Dict = {}, timeout: int = 10) -> Tuple[List[Tuple[int, Dict[str, Any], bytes]], float]:
    t1 = time.time()
    results = []
    # The context manager makes sure the pool's threads are cleaned up when we're done
    with ThreadPoolExecutor(max_workers=100) as executor:
        for result in executor.map(http_get_with_requests, list_of_urls, repeat(headers), repeat(proxies), repeat(timeout)):
            results.append(result)
    t2 = time.time()
    t = t2 - t1
    return results, t
You can try out other values for max_workers. On my PC, going below 100 drops the request speed, and going above 100 doesn’t really change anything; your PC’s hardware + internet speed combination may produce other results.
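If you want to find your own sweet spot, a minimal sketch of a worker-count sweep might look like this (the worker counts and the urls list are just example values, reusing the test URL from the benchmark in Part 3):

for workers in (25, 50, 100, 200):
    urls = ["https://api.myip.com/" for _ in range(1000)]
    t1 = time.time()
    # Same mapping as above, just with a varying pool size
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(http_get_with_requests, urls, repeat({}), repeat({}), repeat(10)))
    print(workers, 'workers:', round(len(urls) / (time.time() - t1), 2), 'r/s')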
P.S. Bonus points - try out how ProcessPoolExecutor works on your system. I didn’t notice any significant differences in speed on my PC.
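If you want to try that yourself, a minimal sketch of the swap could look like this (the function name and worker count are my own choices; note that ProcessPoolExecutor pickles its arguments, so the worker function must live at module top level):

def http_get_with_processes(list_of_urls: List[str], headers: Dict = {}, proxies: Dict = {}, timeout: int = 10) -> Tuple[List[Tuple[int, Dict[str, Any], bytes]], float]:
    t1 = time.time()
    # Processes carry pickling and startup overhead, which rarely pays off for I/O-bound work
    with ProcessPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(http_get_with_requests, list_of_urls, repeat(headers), repeat(proxies), repeat(timeout)))
    return results, time.time() - t1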
Part 2 - asyncio & aiohttp
An alternative, newer and more robust approach is to take a dive into Python’s asyncio and make HTTP calls with aiohttp.
Same as before, we’ll write a base HTTP GET call:
async def http_get_with_aiohttp(session: ClientSession, url: str, headers: Dict = {}, proxy: str = None, timeout: int = 10) -> Tuple[int, Dict[str, Any], bytes]:
    response = await session.get(url=url, headers=headers, proxy=proxy, timeout=timeout)

    response_json = None
    try:
        # content_type=None skips aiohttp's content-type check before decoding
        response_json = await response.json(content_type=None)
    except json.decoder.JSONDecodeError:
        pass

    response_content = None
    try:
        response_content = await response.read()
    except Exception:
        pass

    return (response.status, response_json, response_content)
And a concurrent version - asyncio runs on a single thread, but keeps all of the requests in flight at once:
async def http_get_with_aiohttp_parallel(session: ClientSession, list_of_urls: List[str], headers: Dict = {}, proxy: str = None, timeout: int = 10) -> Tuple[List[Tuple[int, Dict[str, Any], bytes]], float]:
    t1 = time.time()
    # gather schedules all the coroutines at once and waits for every one of them to finish
    results = await asyncio.gather(*[http_get_with_aiohttp(session, url, headers, proxy, timeout) for url in list_of_urls])
    t2 = time.time()
    t = t2 - t1
    return results, t
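One caveat worth knowing about asyncio.gather: by default, a single failed request raises and discards all the other results. If you’d rather get a per-URL mix of results and exceptions, gather accepts a return_exceptions flag; a sketch of that variant:

results = await asyncio.gather(
    *[http_get_with_aiohttp(session, url, headers, proxy, timeout) for url in list_of_urls],
    return_exceptions=True  # failed requests come back as exception objects in the list
)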
Note that we’ll need to pass an additional session object to these methods; this ClientSession will be initialized later in the main function.
Part 3 - Benchmarking
As you may have noticed, we always measure execution time in our parallel methods; that’s because now we’ll need to compare them. To make a fair, lengthy comparison, let’s take 1000 URLs, run this batch 10 times with each approach, collect the request speeds (the number of requests made divided by the time it took to execute them all) and compare their averages.
# URL list
urls = ["https://api.myip.com/" for i in range(0, 1000)]

# Benchmark aiohttp
session = ClientSession()
speeds_aiohttp = []
for i in range(0, 10):
    results, t = await http_get_with_aiohttp_parallel(session, urls)
    v = len(urls) / t
    print('AIOHTTP: Took ' + str(round(t, 2)) + ' s, with speed of ' + str(round(v, 2)) + ' r/s')
    speeds_aiohttp.append(v)
await session.close()

# Benchmark requests
speeds_requests = []
for i in range(0, 10):
    results, t = http_get_with_requests_parallel(urls)
    v = len(urls) / t
    print('REQUESTS: Took ' + str(round(t, 2)) + ' s, with speed of ' + str(round(v, 2)) + ' r/s')
    speeds_requests.append(v)

# Calculate averages
avg_speed_aiohttp = sum(speeds_aiohttp) / len(speeds_aiohttp)
avg_speed_requests = sum(speeds_requests) / len(speeds_requests)
print('--------------------')
print('AVG SPEED AIOHTTP: ' + str(round(avg_speed_aiohttp, 2)) + ' r/s')
print('AVG SPEED REQUESTS: ' + str(round(avg_speed_requests, 2)) + ' r/s')
For the aiohttp part we had to initialize the ClientSession before any of the requests were made, and we closed it manually after all requests were done. Yes, Python’s async with construction would work here; I just didn’t want to add another indentation level.
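For completeness, the context-manager variant would look something like this (note that it’s async with for a ClientSession, not plain with):

async with ClientSession() as session:
    for i in range(0, 10):
        results, t = await http_get_with_aiohttp_parallel(session, urls)
        # ... same speed bookkeeping as in the benchmark above
# The session is closed automatically when the block exits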
Part 4 - Complete code and results
Well, first of all, the benchmark results. Yes, aiohttp is faster, somewhere around 2.1 - 2.3 times faster than the ThreadPoolExecutor approach.
RUN №1
AVG SPEED AIOHTTP: 501.25 r/s
AVG SPEED REQUESTS: 215.94 r/s

RUN №2
AVG SPEED AIOHTTP: 500.95 r/s
AVG SPEED REQUESTS: 221.53 r/s

RUN №3
AVG SPEED AIOHTTP: 489.21 r/s
AVG SPEED REQUESTS: 226.95 r/s
Of course, there are too many variables to cover here - your PC’s multithreading abilities, your connection speed, the URLs that you call, server load & performance, and a lot of other factors can influence these request speeds. But as we determined, in general and under similar conditions, aiohttp will work better and faster.
If you decide to experiment and compare these approaches, please send me your findings; I’d be very interested to know how they behave in different environments.
Thanks for reading! Hope this article was helpful in deciding which approach to parallel HTTP requests you should take. And here’s the complete code fragment:
import asyncio
import json
import time
from typing import Dict, Any, List, Tuple
import requests
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from itertools import repeat
from aiohttp import ClientSession


def http_get_with_requests(url: str, headers: Dict = {}, proxies: Dict = {}, timeout: int = 10) -> Tuple[int, Dict[str, Any], bytes]:
    response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)

    response_json = None
    try:
        response_json = response.json()
    except ValueError:
        pass

    response_content = None
    try:
        response_content = response.content
    except Exception:
        pass

    return (response.status_code, response_json, response_content)


def http_get_with_requests_parallel(list_of_urls: List[str], headers: Dict = {}, proxies: Dict = {}, timeout: int = 10) -> Tuple[List[Tuple[int, Dict[str, Any], bytes]], float]:
    t1 = time.time()
    results = []
    with ThreadPoolExecutor(max_workers=100) as executor:
        for result in executor.map(http_get_with_requests, list_of_urls, repeat(headers), repeat(proxies), repeat(timeout)):
            results.append(result)
    t2 = time.time()
    t = t2 - t1
    return results, t


async def http_get_with_aiohttp(session: ClientSession, url: str, headers: Dict = {}, proxy: str = None, timeout: int = 10) -> Tuple[int, Dict[str, Any], bytes]:
    response = await session.get(url=url, headers=headers, proxy=proxy, timeout=timeout)

    response_json = None
    try:
        response_json = await response.json(content_type=None)
    except json.decoder.JSONDecodeError:
        pass

    response_content = None
    try:
        response_content = await response.read()
    except Exception:
        pass

    return (response.status, response_json, response_content)


async def http_get_with_aiohttp_parallel(session: ClientSession, list_of_urls: List[str], headers: Dict = {}, proxy: str = None, timeout: int = 10) -> Tuple[List[Tuple[int, Dict[str, Any], bytes]], float]:
    t1 = time.time()
    results = await asyncio.gather(*[http_get_with_aiohttp(session, url, headers, proxy, timeout) for url in list_of_urls])
    t2 = time.time()
    t = t2 - t1
    return results, t


async def main():
    print('--------------------')

    # URL list
    urls = ["https://api.myip.com/" for i in range(0, 1000)]

    # Benchmark aiohttp
    session = ClientSession()
    speeds_aiohttp = []
    for i in range(0, 10):
        results, t = await http_get_with_aiohttp_parallel(session, urls)
        v = len(urls) / t
        print('AIOHTTP: Took ' + str(round(t, 2)) + ' s, with speed of ' + str(round(v, 2)) + ' r/s')
        speeds_aiohttp.append(v)
    await session.close()

    print('--------------------')

    # Benchmark requests
    speeds_requests = []
    for i in range(0, 10):
        results, t = http_get_with_requests_parallel(urls)
        v = len(urls) / t
        print('REQUESTS: Took ' + str(round(t, 2)) + ' s, with speed of ' + str(round(v, 2)) + ' r/s')
        speeds_requests.append(v)

    # Calculate averages
    avg_speed_aiohttp = sum(speeds_aiohttp) / len(speeds_aiohttp)
    avg_speed_requests = sum(speeds_requests) / len(speeds_requests)
    print('--------------------')
    print('AVG SPEED AIOHTTP: ' + str(round(avg_speed_aiohttp, 2)) + ' r/s')
    print('AVG SPEED REQUESTS: ' + str(round(avg_speed_requests, 2)) + ' r/s')


asyncio.run(main())
In case you’d like to check my other work or contact me: