How to Extract Data from APIs for Data Pipelines using Python
- 1. Introduction
- 2. APIs are a way to communicate between systems on the Internet
- 3. API Data extraction = GET-ting data from a server
- 4. Conclusion
- 5. Further reading
1. Introduction
Extracting data is one of the critical skills for data engineering. If you have ever wondered:
- How to get started extracting data from an API for the first time
- What some good resources to learn API data extraction are
- Whether there are any recommendations, guides, videos, etc., for dealing with APIs in Python
- Which Python library to use to extract data from an API
- “I don’t know what I don’t know. Am I missing any libraries?”
Then this post is for you. Imagine being able to mentally visualize how systems communicate via APIs. By the end of this post, you will have learned how to pull data via an API. You can quickly and efficiently create data pipelines to pull data from most APIs.
The code for this post is available here.
Jargon:
- Server: A machine running code that accepts requests and responds with some data.
2. APIs are a way to communicate between systems on the Internet
Systems need to communicate over the Internet. For example, your browser needs to communicate with a server to access this website. An API (Application Programming Interface) is a set of rules that allows different software applications to communicate with each other.
Depending on your use case, there are multiple architectural patterns. In data engineering, we typically use the REST architecture pattern (which is commonly used for communicating between your browser and a website server).
2.1. HTTP is a protocol commonly used for websites
HTTP is a communication protocol that defines how messages are formatted and transmitted over the Internet. It’s the foundational protocol for data exchange on the World Wide Web. It establishes rules for how clients (like browsers) request data from servers and how servers respond.
While the HTTP protocol has multiple components, the key parts to know for data extraction are:
2.1.1. Request: Ask the Internet exactly what you want
The request defines what we want from the server. Requests have four key components:
- URL refers to the server address. This is typically something like ‘https://jsonplaceholder.typicode.com/’
- Method refers to the action you want the server to perform. Most data extraction uses the GET method to read data from the server (reference docs).
- Headers are key-value pairs that add context to the request. Common headers tell the server what data types the requester can understand (Accept), what browser/script is making the request (User-Agent), authorization, encoding, cookies, etc.
- Parameters provide additional context to the request.
Let’s look at a GET data request that we will do later in this post:
GET /posts/1 HTTP/1.1 # Method, path, http version
Host: jsonplaceholder.typicode.com # URL
User-Agent: python-requests/2.32.3 # Who is making the data request
Accept-Encoding: gzip, deflate # Type of acceptable data response & encoding
Accept: */*
Connection: keep-alive
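We can see this same anatomy from Python by building the request with the requests library without sending it. This is a minimal sketch: a Session fills in the default headers (User-Agent, Accept, etc.) that appear in the raw request above.

```python
import requests

# Build (but do not send) the same GET request; the Session supplies
# the default headers that requests would put on the wire
session = requests.Session()
request = requests.Request('GET', 'https://jsonplaceholder.typicode.com/posts/1')
prepared = session.prepare_request(request)

print(prepared.method)          # GET
print(prepared.url)             # https://jsonplaceholder.typicode.com/posts/1
print(dict(prepared.headers))   # User-Agent, Accept-Encoding, Accept, Connection
```

Sending the prepared request is then just `session.send(prepared)`.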
2.1.2. Response is what you get from the server
The response from the server will have the typical components:
- A response status code indicating if the request was successful or not
- Headers, which provide information about the type of data that was sent back to the requester
- Data from the server
A response to the above request might look like this:
HTTP/1.1 200 OK # http version, status code specifying if your response was successful or not and, if not, what happened
# Headers (key-value pair)
Date: Tue, 22 Apr 2025 11:24:59 GMT
Content-Type: application/json; charset=utf-8 # data type
Transfer-Encoding: chunked
Connection: keep-alive
Server: Cloudflare
...
Via: 1.1 vegur
Cf-Cache-Status: HIT
Age: 3664
Content-Encoding: gzip
CF-RAY: 9344c20649d203d5-EWR
alt-svc: h3=":443"; ma=86400
# response data
{
"userId": 1,
"id": 1,
"title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
"body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}
3. API Data extraction = GET-ting data from a server
The Python library requests is the most popular way to extract data from APIs.
3.1. GET data
To get data from a server, we need to
- Define the URL and type of request to make
- Make the request
- Check whether the request was successful, and
- Decode the data.
We will use the free jsonplaceholder server for this.
import requests

url = 'https://jsonplaceholder.typicode.com/posts'
response = requests.get(url=url)
print(response.status_code) # should be 200
It’s critical to check that the status code is between 200 and 299, which means the request succeeded. Once we know that the status is successful, we can look at the data sent back.
response.content # data in binary format
# let's see the data in json format
response.json()
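As a sketch of that 2xx check (the helper name here is ours, not part of requests):

```python
def is_success(status_code: int) -> bool:
    # 2xx status codes mean the request succeeded; note that requests'
    # built-in response.ok is looser (it is True for anything below 400)
    return 200 <= status_code < 300

print(is_success(200), is_success(404))  # True False
```

In practice, `response.raise_for_status()` is often cleaner: it raises a `requests.HTTPError` for any 4xx/5xx response instead of requiring a manual check.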
3.1.1. GET data for a specific entity
If you look at the URL, you will see the additional /posts at the end. This is how most APIs are designed; the /posts at the end of the URL represents the business entity posts. You can imagine other entities in this (blog-style) application. The jsonplaceholder server has posts, comments, albums, photos, to-dos, and user entities (also called resources).
If we only want to pull data for a specific post ID, it is typically done via /posts/post_id, as shown below:
post_id=1
response = requests.get(url=f'{url}/{post_id}')
print(response.json()) # data for post 1
3.1.2. GET data for related entities
In most cases, entities are related to each other. For example, in our blog example, a post can have multiple comments. These URLs are usually defined as /posts/post_id/comments, which returns all the comments on a specific post_id. Let’s see how this works.
entity = 'comments'
# add post id url
response = requests.get(url=f'{url}/{post_id}/{entity}')
print(response.json()) # all the comments on post 1
API design ultimately depends on the server, but this is the typical pattern most API systems use.
3.1.3. Use query params to specify the data you need
While we can use the /posts/post_id/comments/comment_id pattern to pull related entities, what if we want to look for more specific data, such as all the posts made within a certain date range, all posts with at least 10 comments, or the top 10 posts by popularity?
APIs make such querying available via query parameters (check your API docs).
For example, we can look at results from google search:
# search for electronics on google
url = 'https://www.google.com/search'
query_params = {'q': 'electronics'}
response = requests.get(url=url, params=query_params)
response.content[:100] # This is an HTML file
3.2. Pull large data as small chunks, aka Paginate
The HTTP protocol was designed to transfer small amounts of data over the Internet. However, data pipelines typically tend to get a large amount of data.
Most API servers implement a technique called pagination to avoid having to send large amounts of data as a single response.
Pagination refers to a system where the server provides chunks of data, and the requester can make new requests for the next chunk of data as needed. You would have seen tables on websites with the next button or tables populated with new data as you scroll; these are done with pagination.
There are multiple ways a server can implement pagination. Still, as consumers, we only care about requesting chunks of data.
Pagination usually follows one of two patterns.
3.2.1. Use limit and offset to specify how many and which data points you want
In this pattern, when making a request, you send 2 query parameters:
- Limit: the number of data points to get
- Offset: the position from which to start the limit count
For example, if the offset is 100 and the limit is 10, the response will contain items 101 to 110 (offset + 1 through offset + limit). The order is generally decided by the server or accepted as part of the request.
Let’s see how we can use this pattern with the free PokeAPI:
params = {'offset': 5, 'limit': 3}
url = 'https://pokeapi.co/api/v2/'
entity = 'ability'
response = requests.get(url=f'{url}{entity}', params=params)
response.json()
As another simple example, an offset of 3 skips the first 3 rows, and a limit of 5 selects the next 5 rows of data.
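The limit/offset pattern extends naturally to a loop that keeps requesting the next chunk until the server returns an empty one. Here is a generic sketch; `paginate` and `fetch` are our own names, and with the PokeAPI the `fetch` callable would wrap `requests.get(url, params={'offset': ..., 'limit': ...})` and read the `results` list from the JSON.

```python
def paginate(fetch, limit=3):
    """Yield items one by one, fetching them in limit/offset chunks.

    `fetch` is any callable accepting (offset, limit) and returning a list
    of items; an empty list signals that no data is left.
    """
    offset = 0
    while True:
        chunk = fetch(offset=offset, limit=limit)
        if not chunk:
            return
        yield from chunk
        offset += limit

# demo with an in-memory "server" of 10 items
data = list(range(10))
fetched = list(paginate(lambda offset, limit: data[offset:offset + limit], limit=4))
print(fetched)  # the full dataset, pulled 4 items at a time
```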
3.2.2. Use the next page link to let the server decide what data to send
Another pattern APIs use is sending a next_page_link as part of the response, which you use to make the next request for the next chunk of data. The PokeAPI also provides us with a link to the next page.
url = 'https://pokeapi.co/api/v2/'
entity = 'ability'
response = requests.get(url=f'{url}{entity}')
json_response = response.json()
print(json_response)
response = requests.get(json_response.get('next'))
response.json()
In the above example, we use the next field from the response to get the next chunk of data. While this next link is the same URL with predefined limit and offset parameters, some APIs can provide you with links in non-modifiable formats.
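The next-link pattern also reduces to a simple loop: keep requesting whatever URL the server hands back until it hands back none. This is a generic sketch (the function names are ours); with the PokeAPI, `get_page` would simply be `lambda url: requests.get(url).json()`.

```python
def follow_next(get_page, first_url):
    """Collect all results by following each page's 'next' link until it is None."""
    url = first_url
    results = []
    while url is not None:
        page = get_page(url)
        results.extend(page['results'])
        url = page.get('next')
    return results

# demo with an in-memory two-page "server"
pages = {
    '/ability?page=1': {'results': [1, 2, 3], 'next': '/ability?page=2'},
    '/ability?page=2': {'results': [4, 5], 'next': None},
}
print(follow_next(pages.get, '/ability?page=1'))  # [1, 2, 3, 4, 5]
```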
3.3. Do not overload the server
3.3.1. Retry reasonably
There are times when the server is busy or an error occurs and your request fails. If the server you are pulling data from can have such issues, it’s a good idea to retry.
# start a local fake server
import subprocess
process = subprocess.Popen(
    ['uv', 'run', 'dummy_server.py'],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure the retry strategy
retries = Retry(
    total=3,
    status_forcelist=[500, 502, 503, 504],  # Retry on these server errors
    allowed_methods={"GET"},  # Explicitly allow only GET requests
)

# Create a session and mount the retry adapter
session = requests.Session()
adapter = HTTPAdapter(max_retries=retries)
session.mount('http://', adapter)

# URL for our local API, which errors randomly on ~2 out of 3 requests
api_url = "http://localhost:8000/api"
response = session.get(api_url)

process.terminate()  # stop the server
We define a Retry object where we specify:
- The number of retries
- The methods (GET) to retry for
- The response codes (500, 502, 503, 504) to retry for
The 500, 502, 503, and 504 response codes represent server errors.
If you know the server is prone to issues, and that when issues occur they take a long time to resolve, you can use backoffs, a mechanism for waiting between retry attempts where the wait time increases exponentially (see the code example).
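A sketch of the same retry strategy with a backoff added. `backoff_factor` is a real `urllib3.util.retry.Retry` parameter; the wait before each retry grows roughly as `backoff_factor * 2 ** (retry number)` seconds (the exact formula varies slightly across urllib3 versions).

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=3,
    backoff_factor=1,  # waits between attempts grow exponentially (~1s, 2s, 4s)
    status_forcelist=[500, 502, 503, 504],
    allowed_methods={"GET"},
)

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retries))
# session.get(...) now retries failed GETs with increasing pauses
```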
3.3.2. Rate limit to prevent servers from being overloaded
Servers don’t want a single requester to hog all their resources (server, DB, etc) by accidentally (or intentionally) making so many requests that the server cannot respond to other requesters.
To prevent such a case, most API servers implement a rate limiter. A rate limiter only allows a specific requester to make a specified number of requests per time unit. This will ensure that one requester is not hogging the server.
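On the client side, hitting a rate limit usually shows up as an HTTP 429 (Too Many Requests) response, often with a Retry-After header saying how long to wait. Here is a minimal sketch of honoring it; the helper name is ours, and note that Retry-After can also be an HTTP date, which this sketch does not handle.

```python
def seconds_to_wait(status_code, headers):
    """Return how many seconds to sleep before retrying; 0 means go ahead."""
    if status_code == 429:
        # fall back to a 1-second wait if the server gives no hint
        return float(headers.get('Retry-After', 1))
    return 0.0

print(seconds_to_wait(429, {'Retry-After': '5'}))  # 5.0
print(seconds_to_wait(200, {}))                    # 0.0
```

In a real pipeline, you would check each response and `time.sleep()` for the returned duration before retrying.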
3.4. Most servers require authentication
In the examples above, we’ve used open APIs that don’t require authentication or API keys. In most cases, you will need to use one of the following:
- API Keys:
- Simple string tokens included in request headers, query parameters, or in the request body
- Easy to implement but offers basic security
- Best for public APIs with limited access requirements
- OAuth 2.0:
- Industry-standard protocol for authorization
- Provides secure delegated access using access tokens
- Common flow types: Authorization Code, Implicit, Client Credentials, and Resource Owner Password
- Bearer Tokens:
- Access tokens sent in the Authorization header
- Often used with OAuth 2.0
- Format: Authorization: Bearer <token>
- Basic Authentication:
- Uses base64 encoded username
- Sent in Authorization header
- Format: Authorization: Basic <base64(username:password)>
- Should only be used with HTTPS
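Here is a sketch of what each of these looks like with requests. The URL, header names, and credentials are hypothetical placeholders, and the requests for each scheme are built but not sent.

```python
import requests

API_KEY = 'my-api-key'            # hypothetical credential
TOKEN = 'my-oauth-access-token'   # hypothetical credential
URL = 'https://api.example.com/data'  # hypothetical endpoint

session = requests.Session()

# API key in a custom header (the header name varies by API; X-API-Key is common)
key_req = session.prepare_request(
    requests.Request('GET', URL, headers={'X-API-Key': API_KEY}))

# Bearer token in the Authorization header (typical with OAuth 2.0)
bearer_req = session.prepare_request(
    requests.Request('GET', URL, headers={'Authorization': f'Bearer {TOKEN}'}))

# Basic auth: requests base64-encodes username:password for you
basic_req = session.prepare_request(
    requests.Request('GET', URL, auth=('user', 'password')))

print(bearer_req.headers['Authorization'])  # Bearer my-oauth-access-token
print(basic_req.headers['Authorization'])   # Basic dXNlcjpwYXNzd29yZA==
```

Always check the API docs for where the credential goes; sending any of these over plain HTTP exposes it, so use HTTPS.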
4. Conclusion
To recap, we saw
- How APIs work
- How to get data from an API
- How to send additional context with your request
- How to pull chunks of data at a time
- How to retry and what rate limiting is
The next time you have to pull data from an API, use this article to guide you.
Please share this with your friends and colleagues if you think it might be helpful.
5. Further reading
- MDN HTTP docs
- Python data pipelines
- Bitcoin monitor
- Python essentials for data engineers
- What is rate limiting