How to Extract Data from APIs for Data Pipelines using Python
- 1. Introduction
- 2. APIs are a way to communicate between systems on the Internet
- 3. API Data extraction = GET-ting data from a server
- 4. Conclusion
- 5. Further reading
1. Introduction
Extracting data is one of the critical skills for data engineering. If you have ever wondered:
- How to get started extracting data from an API for the first time
- What some good resources to learn API data extraction are
- Whether there are any recommendations, guides, videos, etc., for dealing with APIs in Python
- Which Python library to use to extract data from an API
- “I don’t know what I don’t know. Am I missing any libraries?”
Then this post is for you. Imagine being able to mentally visualize how systems communicate via APIs. By the end of this post, you will have learned how to pull data via an API. You can quickly and efficiently create data pipelines to pull data from most APIs.
The code for this post is available here.
Jargon:
- Server: A machine running code that accepts requests and responds with some data.
2. APIs are a way to communicate between systems on the Internet
Systems need to communicate over the Internet. For example, your browser needs to communicate with a server to access this website. An API (Application Programming Interface) is a set of rules that allows different software applications to communicate with each other.
Depending on your use case, there are multiple architectural patterns. In data engineering, we typically use the REST architecture pattern (which is commonly used for communicating between your browser and a website server).
2.1. HTTP is a protocol commonly used for websites
HTTP is a communication protocol that defines how messages are formatted and transmitted over the Internet. It’s the foundational protocol for data exchange on the World Wide Web. It establishes rules for how clients (like browsers) request data from servers and how servers respond.
While the HTTP protocol has multiple components, the key parts to know for data extraction are:
2.1.1. Request: Ask the Internet exactly what you want
The request defines what we want from the server. Requests have four key components:
- URL refers to the server address. This is typically something like ‘https://jsonplaceholder.typicode.com/’
- Method refers to the action you want the server to perform. Most data extraction uses the GET method to read data from the server (reference docs).
- Headers are key-value pairs that add context to the request. Common headers tell the server what data types the requester can understand (Accept), what browser/script is making the request (User-Agent), authorization, encoding, cookies, etc.
- Parameters provide additional context to the request.
Let’s look at a GET data request that we will do later in this post:
GET /posts/1 HTTP/1.1 # Method, path, http version
Host: jsonplaceholder.typicode.com # URL
User-Agent: python-requests/2.32.3 # Who is making the data request
Accept-Encoding: gzip, deflate # Type of acceptable data response & encoding
Accept: */*
Connection: keep-alive
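We can see this same anatomy from Python by building the request with the requests library without sending it. This is a minimal sketch: a Session fills in the default headers (User-Agent, Accept, etc.) that appear in the raw request above.

```python
import requests

# Build (but do not send) the same GET request; the Session supplies
# the default headers that requests would put on the wire
session = requests.Session()
request = requests.Request('GET', 'https://jsonplaceholder.typicode.com/posts/1')
prepared = session.prepare_request(request)

print(prepared.method)          # GET
print(prepared.url)             # https://jsonplaceholder.typicode.com/posts/1
print(dict(prepared.headers))   # User-Agent, Accept-Encoding, Accept, Connection
```

Sending the prepared request is then just `session.send(prepared)`.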
2.1.2. Response is what you get from the server
The response from the server will have the typical components:
- A response status code indicating if the request was successful or not
- Headers, which provide information about the type of data that was sent back to the requester
- Data from the server
A response to the above request might look like this:
HTTP/1.1 200 OK # http version, status code specifying if your response was successful or not and, if not, what happened
# Headers (key-value pair)
Date: Tue, 22 Apr 2025 11:24:59 GMT
Content-Type: application/json; charset=utf-8 # data type
Transfer-Encoding: chunked
Connection: keep-alive
Server: Cloudflare
...
Via: 1.1 vegur
Cf-Cache-Status: HIT
Age: 3664
Content-Encoding: gzip
CF-RAY: 9344c20649d203d5-EWR
alt-svc: h3=":443"; ma=86400
# response data
{
"userId": 1,
"id": 1,
"title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
"body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}
3. API Data extraction = GET-ting data from a server
The Python library requests is the most popular way to extract data from APIs.
3.1. GET data
To get data from a server, we need to
- Define the URL and type of request to make
- Make the request
- Check whether the request was successful, and
- Decode the data.
We will use the free jsonplaceholder server for this.
import requests

url = 'https://jsonplaceholder.typicode.com/posts'
response = requests.get(url=url)
print(response.status_code) # should be 200
It’s critical to check that the status code is between 200 and 299, which means the request succeeded. Once we know that the status is successful, we can look at the data sent back.
response.content # data in binary format
# let's see the data in json format
response.json()
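As a sketch of that 2xx check (the helper name here is ours, not part of requests):

```python
def is_success(status_code: int) -> bool:
    # 2xx status codes mean the request succeeded; note that requests'
    # built-in response.ok is looser (it is True for anything below 400)
    return 200 <= status_code < 300

print(is_success(200), is_success(404))  # True False
```

In practice, `response.raise_for_status()` is often cleaner: it raises a `requests.HTTPError` for any 4xx/5xx response instead of requiring a manual check.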
3.1.1. GET data for a specific entity
If you look at the URL, you will see the additional /posts at the end. This is how most APIs are designed; the /posts at the end of the URL represents the business entity posts. You can imagine other entities in this (blog-style) application. The jsonplaceholder server has posts, comments, albums, photos, to-dos, and user entities (also called resources).
If we only want to pull data for a specific post ID, it is typically done via /posts/post_id, as shown below:
post_id=1
response = requests.get(url=f'{url}/{post_id}')
print(response.json()) # data for post 1
3.1.2. GET data for related entities
In most cases, entities are related to each other. For example, in our blog example, a post can have multiple comments. These URLs are usually defined as /posts/post_id/comments, which returns all the comments on a specific post_id. Let’s see how this works.
entity = 'comments'
# add post id url
response = requests.get(url=f'{url}/{post_id}/{entity}')
print(response.json()) # all the comments on post 1
API design ultimately depends on the server, but this is the typical pattern most API systems use.
3.1.3. Use query params to specify the data you need
While we can use the /posts/post_id/comments/comment_id pattern to pull related entities, what if we want to look for more specific data, such as all the posts made within a certain date range, all posts with at least 10 comments, or the top 10 posts by popularity?
APIs make such querying available via query parameters (check your API docs).
For example, we can look at results from google search:
# search for electronics on google
url = 'https://www.google.com/search'
query_params = {'q': 'electronics'}
response = requests.get(url=url, params=query_params)
response.content[:100] # This is an HTML file
3.2. Pull large data as small chunks, aka Paginate
The HTTP protocol was designed to transfer small amounts of data over the Internet. However, data pipelines typically tend to get a large amount of data.
Most API servers implement a technique called pagination to avoid having to send large amounts of data as a single response.
Pagination refers to a system where the server provides chunks of data, and the requester can make new requests for the next chunk of data as needed. You would have seen tables on websites with the next button or tables populated with new data as you scroll; these are done with pagination.
There are multiple ways a server can implement pagination. Still, as consumers, we only care about requesting chunks of data.
Pagination usually follows one of two patterns.
3.2.1. Use limit and offset to specify how many and which data points you want
In this pattern, when making a request, you send 2 query parameters:
- Limit: the number of data points to get
- Offset: the position from which to start the limit count
For example, if the offset is 100 and the limit is 10, the response will contain items 101 to 110 (offset + 1 through offset + limit). The order is generally decided by the server or accepted as part of the request.
Let’s see how we can use this pattern with the free PokeAPI:
params = {'offset': 5, 'limit': 3}
url = 'https://pokeapi.co/api/v2/'
entity = 'ability'
response = requests.get(url=f'{url}{entity}', params=params)
response.json()
As another simple example, an offset of 3 skips the first 3 rows, and a limit of 5 selects the next 5 rows of data.
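The limit/offset pattern extends naturally to a loop that keeps requesting the next chunk until the server returns an empty one. Here is a generic sketch; `paginate` and `fetch` are our own names, and with the PokeAPI the `fetch` callable would wrap `requests.get(url, params={'offset': ..., 'limit': ...})` and read the `results` list from the JSON.

```python
def paginate(fetch, limit=3):
    """Yield items one by one, fetching them in limit/offset chunks.

    `fetch` is any callable accepting (offset, limit) and returning a list
    of items; an empty list signals that no data is left.
    """
    offset = 0
    while True:
        chunk = fetch(offset=offset, limit=limit)
        if not chunk:
            return
        yield from chunk
        offset += limit

# demo with an in-memory "server" of 10 items
data = list(range(10))
fetched = list(paginate(lambda offset, limit: data[offset:offset + limit], limit=4))
print(fetched)  # the full dataset, pulled 4 items at a time
```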
3.2.2. Use the next page link to let the server decide what data to send
Another pattern APIs use is sending a next_page_link as part of the response, which you use to make the next request for the next chunk of data. The PokeAPI also provides us with a link to the next page.
url = 'https://pokeapi.co/api/v2/'
entity = 'ability'
response = requests.get(url=f'{url}{entity}')
json_response = response.json()
print(json_response)
response = requests.get(json_response.get('next'))
response.json()
In the above example, we use the next field from the response to get the next chunk of data. While this next link is the same URL with predefined limit and offset parameters, some APIs can provide you with links in non-modifiable formats.
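The next-link pattern also reduces to a simple loop: keep requesting whatever URL the server hands back until it hands back none. This is a generic sketch (the function names are ours); with the PokeAPI, `get_page` would simply be `lambda url: requests.get(url).json()`.

```python
def follow_next(get_page, first_url):
    """Collect all results by following each page's 'next' link until it is None."""
    url = first_url
    results = []
    while url is not None:
        page = get_page(url)
        results.extend(page['results'])
        url = page.get('next')
    return results

# demo with an in-memory two-page "server"
pages = {
    '/ability?page=1': {'results': [1, 2, 3], 'next': '/ability?page=2'},
    '/ability?page=2': {'results': [4, 5], 'next': None},
}
print(follow_next(pages.get, '/ability?page=1'))  # [1, 2, 3, 4, 5]
```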
3.3. Do not overload the server
3.3.1. Retry reasonably
There are times when the server is busy or an error occurs and your request fails. If the server you are pulling data from can have such issues, it’s a good idea to retry.
# start a local fake server
import subprocess
process = subprocess.Popen(
    ['uv', 'run', 'dummy_server.py'],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure the retry strategy
retries = Retry(
    total=3,
    status_forcelist=[500, 502, 503, 504],  # Retry on these server errors
    allowed_methods={"GET"},  # Explicitly allow only GET requests
)

# Create a session and mount the retry adapter
session = requests.Session()
adapter = HTTPAdapter(max_retries=retries)
session.mount('http://', adapter)

# URL for our local API, which errors randomly on ~2 out of 3 requests
api_url = "http://localhost:8000/api"
response = session.get(api_url)

process.terminate()  # stop the server
We define a Retry object where we specify:
- The number of retries
- The methods (GET) to retry for
- The response codes (500, 502, 503, 504) to retry for
The 500, 502, 503, and 504 response codes represent server errors.
If you know the server is prone to issues, and that when issues occur they take a long time to resolve, you can use backoffs, a mechanism for waiting between retry attempts where the wait time increases exponentially (see the code example).
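A sketch of the same retry strategy with a backoff added. `backoff_factor` is a real `urllib3.util.retry.Retry` parameter; the wait before each retry grows roughly as `backoff_factor * 2 ** (retry number)` seconds (the exact formula varies slightly across urllib3 versions).

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=3,
    backoff_factor=1,  # waits between attempts grow exponentially (~1s, 2s, 4s)
    status_forcelist=[500, 502, 503, 504],
    allowed_methods={"GET"},
)

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retries))
# session.get(...) now retries failed GETs with increasing pauses
```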
3.3.2. Rate limit to prevent servers from being overloaded
Servers don’t want a single requester to hog all their resources (server, DB, etc) by accidentally (or intentionally) making so many requests that the server cannot respond to other requesters.
To prevent such a case, most API servers implement a rate limiter. A rate limiter only allows a specific requester to make a specified number of requests per time unit. This will ensure that one requester is not hogging the server.
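On the client side, hitting a rate limit usually shows up as an HTTP 429 (Too Many Requests) response, often with a Retry-After header saying how long to wait. Here is a minimal sketch of honoring it; the helper name is ours, and note that Retry-After can also be an HTTP date, which this sketch does not handle.

```python
def seconds_to_wait(status_code, headers):
    """Return how many seconds to sleep before retrying; 0 means go ahead."""
    if status_code == 429:
        # fall back to a 1-second wait if the server gives no hint
        return float(headers.get('Retry-After', 1))
    return 0.0

print(seconds_to_wait(429, {'Retry-After': '5'}))  # 5.0
print(seconds_to_wait(200, {}))                    # 0.0
```

In a real pipeline, you would check each response and `time.sleep()` for the returned duration before retrying.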
3.4. Most servers require authentication
In the examples above, we’ve used open APIs that don’t require authentication or API keys. In most cases, you will need to use one of the following:
- API Keys:
- Simple string tokens included in request headers, query parameters, or in the request body
- Easy to implement but offers basic security
- Best for public APIs with limited access requirements
- OAuth 2.0:
- Industry-standard protocol for authorization
- Provides secure delegated access using access tokens
- Common flow types: Authorization Code, Implicit, Client Credentials, and Resource Owner Password
- Bearer Tokens:
- Access tokens sent in the Authorization header
- Often used with OAuth 2.0
- Format: Authorization: Bearer <token>
- Basic Authentication:
- Uses base64 encoded username
- Sent in Authorization header
- Format: Authorization: Basic <base64(username:password)>
- Should only be used with HTTPS
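Here is a sketch of what each of these looks like with requests. The URL, header names, and credentials are hypothetical placeholders, and the requests for each scheme are built but not sent.

```python
import requests

API_KEY = 'my-api-key'            # hypothetical credential
TOKEN = 'my-oauth-access-token'   # hypothetical credential
URL = 'https://api.example.com/data'  # hypothetical endpoint

session = requests.Session()

# API key in a custom header (the header name varies by API; X-API-Key is common)
key_req = session.prepare_request(
    requests.Request('GET', URL, headers={'X-API-Key': API_KEY}))

# Bearer token in the Authorization header (typical with OAuth 2.0)
bearer_req = session.prepare_request(
    requests.Request('GET', URL, headers={'Authorization': f'Bearer {TOKEN}'}))

# Basic auth: requests base64-encodes username:password for you
basic_req = session.prepare_request(
    requests.Request('GET', URL, auth=('user', 'password')))

print(bearer_req.headers['Authorization'])  # Bearer my-oauth-access-token
print(basic_req.headers['Authorization'])   # Basic dXNlcjpwYXNzd29yZA==
```

Always check the API docs for where the credential goes; sending any of these over plain HTTP exposes it, so use HTTPS.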
4. Conclusion
To recap, we saw
- How APIs work
- How to get data from an API
- How to send additional context with your request
- How to pull chunks of data at a time
- How to retry and what rate limiting is
The next time you have to pull data from an API, use this article to guide you.
Please share this with your friends and colleagues if you think it might be helpful.
5. Further reading
- MDN HTTP docs
- Python data pipelines
- Bitcoin monitor
- Python essentials for data engineers
- What is rate limiting