
Instagram's massive ecosystem, boasting over 2 billion monthly active users, represents an unparalleled source of public data for market research, competitor analysis, and academic studies. However, tapping into this resource responsibly demands more than technical skill—it requires a thorough understanding of legal constraints, ethical boundaries, and robust engineering practices. This guide is designed for intermediate-to-advanced developers, data engineers, and technical growth hackers who need to collect public Instagram data at scale while avoiding pitfalls like account bans, legal exposure, and data instability. We cover the full spectrum: from GDPR and CCPA compliance to tool selection, reverse-engineered API techniques, and sustainable data pipelines. By the end, you'll have a clear roadmap for building scraping systems that are both powerful and principled.
The core thesis is simple: ethical Instagram scraping is not a compromise—it's the only sustainable approach. By respecting rate limits, using mobile proxies, and leveraging official or stable internal endpoints, you ensure long-term access and reliability. This guide transforms Instagram data from a volatile exploit into a strategic asset.
With over 2 billion monthly active users, Instagram public data is a goldmine for business intelligence. It reveals real-time market trends, audience demographics, and competitor strategies that simply aren’t available elsewhere. This data powers everything from product development to precise ad targeting.
Unethical Instagram scraping triggers severe, concrete penalties:

- Permanent IP and account bans from Instagram's anti-bot systems.
- Legal exposure under GDPR and CCPA, with GDPR fines reaching up to 4% of global revenue.
- Data instability: blocked endpoints and CAPTCHAs that break pipelines overnight.
The critical insight is this: Instagram data is only valuable if you can access it consistently. Violating Instagram Terms of Service isn’t a shortcut—it’s a fast track to being cut off entirely. Ethical scraping, which respects rate limits and platform APIs, is the only sustainable method for long-term business intelligence. It transforms Instagram data from a volatile exploit into a reliable strategic asset. The choice isn’t between scraping and ethics; it’s between temporary gains and permanent access.
The legal boundaries of Instagram scraping are narrow but critical. Compliance isn't optional—it's the prerequisite for any sustainable data operation. Here is the definitive breakdown.
| Permitted Data (Generally) | Prohibited Data (Strictly) |
|---|---|
| Publicly available profile information (bio, username, public posts) | Any data behind a login wall (follower lists, DMs, private posts) |
| Follower counts, engagement metrics (likes/comments on public posts) | Any data accessed via non-public APIs or session hijacking |
| Public hashtag feeds | Aggregated or derivative databases that violate Meta's terms |
Even permitted public data is heavily constrained by three overlapping legal frameworks:

- GDPR (EU): personal data of EU residents, even if public, requires a lawful basis, data minimization, and a documented retention period.
- CCPA (California): grants California residents rights over collected personal information, including deletion on request.
- Instagram's Terms of Service: prohibit "automated means" of accessing the service; violating them revokes your license to view the content.
A quick compliance checklist:

- Collect only publicly visible data; never cross a login wall.
- Document a lawful basis (e.g., legitimate interest) and a retention period.
- Minimize: store only the fields your analysis requires.
- Honor rate limits and stop immediately when challenged (CAPTCHAs, HTTP 429s).
The core legal principle is: Instagram grants a limited, revocable license to view public content. Scraping is a technical act of mass copying that violates that license. The line between browsing and scraping is defined by automation and scale, not just the data's visibility.
A stable, reproducible environment is the foundation of any reliable scraping operation. Inconsistent setups are behind most "it works on my machine" debugging sessions and are a leading cause of unexpected bans due to version drift. Here is the standard, battle-tested setup.
1. Python & Virtual Environment
Start here—never skip this step. A Python virtual environment isolates your project's dependencies from the system Python.
```shell
# Create and activate the environment (Linux/macOS)
python3.8 -m venv .venv
source .venv/bin/activate

# Install core scraping libraries via pip
pip install requests lxml
```

2. IDE & Debugger Selection
Your IDE is your primary debugging interface. Choose based on workflow:
| IDE | Best For | Key Strength | Cost |
|---|---|---|---|
| VS Code | General purpose, lightweight, excellent extensions | Integrated terminal, vast extension marketplace | Free |
| PyCharm Professional | Large codebases, advanced refactoring | Superior debugger, database tools | Paid |
| Jupyter Notebook | Exploratory analysis, data prototyping | Cell-based execution, inline plots | Free |
3. Dependency Management
Pin exact versions in a requirements.txt file. This guarantees that every team member works in an identical environment, preventing subtle bugs from version mismatches in libraries like lxml or requests.
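As a sketch, a pinned requirements.txt looks like this (the version numbers below are illustrative, not recommendations—pin whatever `pip freeze` reports in your own working environment):

```text
requests==2.31.0
lxml==4.9.3
```

Regenerate the file with `pip freeze > requirements.txt`, and reproduce the environment anywhere with `pip install -r requirements.txt`.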
Choosing the right scraping tools is a tactical decision with major downstream effects on development speed, reliability, and cost. The selection hinges on three factors: the target content's nature (static HTML vs. JavaScript-rendered), required scale, and your team's existing skill set. Here is a direct comparison.
| Tool | Best For | Key Pro | Key Con | Learning Curve |
|---|---|---|---|---|
| Requests + BeautifulSoup | Simple, static pages | Extremely lightweight, full control | No JavaScript support; manual pipeline | Low |
| Scrapy | Large-scale static scraping | Fast, built-in pipelines, middleware | Steep framework specifics; mobile proxies require setup | High |
| Selenium | Legacy dynamic sites | Full browser automation | Very slow, high resource use | Medium |
| Playwright / Puppeteer | Modern JS-heavy apps | Fast headless browser, auto-waits | Heavier than pure HTTP; needs management | Medium |
| Custom Python + Mobile Proxies | Scaling securely (e.g., Instagram scraping) | Bypasses anti-bot AI completely via CGNAT | Requires writing your own extraction logic | Medium |
Decision Flowchart:

- Static HTML, small scope? → Requests + BeautifulSoup.
- Static HTML at scale? → Scrapy.
- JavaScript-rendered content? → Playwright/Puppeteer (Selenium only for legacy sites).
- Scaling against anti-bot systems (e.g., Instagram)? → Custom Python + Mobile Proxies.
Instagram scraping via API reverse-engineering is more reliable than HTML parsing. Instagram's frontend uses stable, internal JSON endpoints that return structured profile data without the fragility of CSS selectors. This walkthrough extracts a user's username, bio, follower count, and public contact email using the web_profile_info endpoint.
1. Discover the Endpoint in Chrome DevTools
Open any public Instagram profile. Press F12 → Network tab. Filter by XHR/Fetch. Scroll the page to trigger a network request named graphql/query/. Right-click it → Copy → Copy as cURL. Paste into a text editor. The URL contains the endpoint path and a query hash. The request payload holds the user ID variable.
2. Replicate Browser Request Headers
Instagram blocks generic scripts. Your request must mimic a real browser session. The minimal required headers are User-Agent, X-IG-App-ID, X-Requested-With, and Referer—each is set in the script below.
3. Python Implementation
This script fetches the profile data as JSON. It handles the two-step process: (1) get the user's numeric ID from the profile page's HTML if not known, (2) query the web_profile_info endpoint.
```python
import requests
import re
import json

def get_user_id_from_html(profile_url):
    """Fallback: scrape the numeric user ID from the page's initial HTML."""
    headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15'}
    resp = requests.get(profile_url, headers=headers)
    match = re.search(r'profilePage_(\d+)', resp.text)
    return match.group(1) if match else None

def fetch_profile_data(username, user_id=None):
    base_url = "https://www.instagram.com/api/graphql"
    query_hash = "d4d88dc1500312af6f937f5b8a9f58d2"  # Verify in DevTools
    variables = {"id": user_id or username, "render_surface": "PROFILE"}
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15',
        'X-IG-App-ID': '936619743392459',
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': f'https://www.instagram.com/{username}/',
    }
    params = {
        'query_hash': query_hash,
        'variables': json.dumps(variables)
    }
    try:
        resp = requests.get(base_url, headers=headers, params=params, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        user = data['data']['user']
        return {
            'username': user['username'],
            'bio': user['biography'],
            'follower_count': user['edge_followed_by']['count'],
            'email': user.get('business_email')  # Only public if user has a business account
        }
    except (KeyError, requests.RequestException) as e:
        print(f"Failed to fetch {username}: {e}")
        return None

# Example usage
profile = fetch_profile_data('instagram')
if profile:
    print(json.dumps(profile, indent=2))
```

This API call typically returns in 200-500ms from an allowed IP. Parsing the same profile data from HTML with BeautifulSoup consistently takes 1.5-3× longer due to DOM traversal. The JSON endpoint is the efficient path for Instagram scraping.
Critical Caveat: This endpoint is undocumented and may change without notice. You must re-verify the query_hash in DevTools periodically, as Instagram rotates them. For reference, the minimal header set used above:

- User-Agent: A modern mobile browser string.
- X-IG-App-ID: Typically 936619743392459 for web.
- X-Requested-With: XMLHttpRequest.
- Referer: The profile URL you're scraping.

Begin API discovery by opening Chrome DevTools (F12) and navigating to the Network tab. Filter requests to show only XHR requests. Load any public profile—you'll see a graphql/query/ request, often named web_profile_info. This endpoint returns structured profile data as JSON from Instagram's internal API.
Click the request to inspect its headers and payload. Critical headers include:
- User-Agent: Must mimic a real browser string.
- X-IG-App-ID: Instagram's internal app identifier (currently 936619743392459). Note that this value changes periodically—always capture the current value from DevTools.
- Cookie: A valid session cookie (optional for public data but reduces blocking).
- Referer: The target profile URL.

This production-ready Python script demonstrates Instagram scraping via the internal profile API. It correctly configures headers, manages cookies, integrates Mobile Proxies, implements rate limit handling for HTTP 429 responses, and parses the JSON response to extract key fields.
```python
import requests
import time

def fetch_profile_data(username):
    url = "https://www.instagram.com/api/graphql/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)',
        'X-IG-App-ID': '936619743392459',
        'Referer': f'https://www.instagram.com/{username}/',
    }
    # Routing traffic through Mobile Proxies is mandatory for production
    proxies = {
        "http": "http://user:pass@your-mobile-proxy:port",
        "https": "http://user:pass@your-mobile-proxy:port"
    }
    session = requests.Session()
    session.headers.update(headers)
    session.proxies.update(proxies)
    params = {
        'query_hash': 'd4d88dc1500312af6f937f5b8a9f58d2',
        'variables': f'{{"id":"{username}","render_surface":"PROFILE"}}'
    }
    try:
        resp = session.get(url, params=params, timeout=10)
        if resp.status_code == 429:
            time.sleep(5)  # Simple backoff; implement exponential in production
            return fetch_profile_data(username)  # Recursive retry
        resp.raise_for_status()
        data = resp.json()
        user = data['data']['user']
        return {
            'username': user['username'],
            'bio': user['biography'],
            'follower_count': user['edge_followed_by']['count'],
            'is_verified': user['is_verified']
        }
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
    except (KeyError, ValueError) as e:
        print(f"Parsing error: {e}")
        return None
```

Sample JSON response with extracted fields indicated:
```json
{
  "data": {
    "user": {
      "username": "target_user",
      "biography": "Bio text here",
      "edge_followed_by": { "count": 12500 },
      "is_verified": true
    }
  }
}
```

In Instagram scraping, ethical extraction of a public email is strictly limited to addresses users voluntarily include in their public biography field. Emails behind "Email" buttons or login walls are protected by GDPR and Instagram's Terms of Service—scraping them is illegal and risks severe penalties.
Use this regex to reliably extract emails from the bio text:
```python
import re

pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(pattern, bio_text)
```

For cleaner results, validate matches with a library like email-validator. This ethical approach keeps your Instagram scraping operation compliant and sustainable.
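The broad regex above can also catch stray punctuation and duplicate matches. A small stdlib-only post-processing step helps before validation—a sketch (the `clean_emails` helper name is ours):

```python
import re

EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def clean_emails(bio_text):
    """Extract, normalize, and deduplicate email candidates from bio text."""
    seen, result = set(), []
    for match in EMAIL_PATTERN.findall(bio_text):
        email = match.lower().rstrip('.')  # A trailing dot is usually sentence punctuation
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result
```

For example, `clean_emails("Contact: Sales@Example.com. Also sales@example.com works!")` returns a single normalized address instead of two case-variant duplicates.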
Instagram's GraphQL endpoint delivers structured post data, eliminating the brittleness of HTML parsing. This method retrieves media URLs, captions, and engagement metrics via a single stable API call.
1. Discover the Query ID
In DevTools, open the Network tab, filter for XHR, and navigate to a target profile's posts page. Trigger a request to www.instagram.com/api/graphql/. Copy the doc_id from the URL—this is your query identifier for that specific data shape (it changes when Instagram updates its frontend, so re-verify it periodically).
2. Format the POST Request
Instagram requires a POST with JSON payload. Essential headers:

- User-Agent: Modern browser string.
- X-IG-App-ID: 936619743392459 (verify in DevTools).
- Content-Type: application/json.
- Referer: The profile's posts URL.
Payload structure:
```json
{
  "doc_id": "YOUR_DOC_ID",
  "variables": {
    "id": "USER_ID",
    "first": 12,
    "after": "CURSOR"
  }
}
```

Omit the `after` field for the first page.

3. Paginate with end_cursor
The response contains page_info with has_next_page and end_cursor. For subsequent pages, pass the cursor in the variables.after field.
4. Python Implementation
```python
import requests
import json

def fetch_posts(user_id, doc_id, cursor=None):
    url = "https://www.instagram.com/api/graphql/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'X-IG-App-ID': '936619743392459',
        'Content-Type': 'application/json',
        'Referer': f'https://www.instagram.com/{user_id}/'
    }
    variables = {"id": user_id, "first": 12}
    if cursor:
        variables["after"] = cursor
    payload = {
        "doc_id": doc_id,
        "variables": json.dumps(variables)
    }
    resp = requests.post(url, headers=headers, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Extract posts from response
data = fetch_posts("user_id", "doc_id")
edges = data['data']['user']['edge_owner_to_timeline_media']['edges']
for edge in edges:
    node = edge['node']
    print(f"Media URL: {node['display_url']}")
    print(f"Likes: {node['edge_liked_by']['count']}")
    print("---")
```

| Field Path | Description |
|---|---|
| node.display_url | Direct media URL (image/video) |
| node.edge_media_to_caption.edges[0].node.text | Caption text |
| node.edge_liked_by.count | Like count |
| node.edge_media_to_comment.count | Comment count |
| node.taken_at_timestamp | Post timestamp (Unix) |
| node.is_video | Boolean: video post? |
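The field paths in the table above can be collected into one mapping helper. This sketch (the `parse_post` name is ours) tolerates posts that lack a caption or other optional fields:

```python
def parse_post(node):
    """Map a raw GraphQL post node to the fields from the table above."""
    caption_edges = node.get('edge_media_to_caption', {}).get('edges', [])
    return {
        'media_url': node.get('display_url'),
        'caption': caption_edges[0]['node']['text'] if caption_edges else None,
        'likes': node.get('edge_liked_by', {}).get('count'),
        'comments': node.get('edge_media_to_comment', {}).get('count'),
        'timestamp': node.get('taken_at_timestamp'),
        'is_video': node.get('is_video', False),
    }

# Example with a minimal fake node
sample = {
    'display_url': 'https://example.com/img.jpg',
    'edge_media_to_caption': {'edges': [{'node': {'text': 'hello'}}]},
    'edge_liked_by': {'count': 42},
    'edge_media_to_comment': {'count': 3},
    'taken_at_timestamp': 1700000000,
    'is_video': False,
}
print(parse_post(sample))
```

Using `.get()` everywhere means a schema change on one field degrades to a `None` value instead of crashing the whole pipeline.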
Required headers (recap):

- User-Agent: Modern browser string.
- X-IG-App-ID: 936619743392459 (verify in DevTools).
- Content-Type: application/json.
- Referer: Profile posts URL.

To access Instagram's GraphQL endpoint for post data, you must first locate the correct doc_id in Chrome DevTools.
1. Open Chrome DevTools (F12), go to the Network tab, and filter for graphql/query.
2. Scroll the target profile's posts to trigger a request, then inspect its payload for the doc_id field. This is your stable query identifier.

Note: the doc_id changes with Instagram's frontend updates. Always extract it fresh from your own DevTools session; hardcoded values from old tutorials will fail.

Instagram's frontend uses infinite scroll, but its GraphQL API relies on a stable cursor-based pagination system. Unlike brittle HTML scraping, this method uses the end_cursor value from the page_info object to fetch subsequent pages reliably.
Pagination Loop Logic:

1. Send a POST with the doc_id and variables (include after: cursor for subsequent pages).
2. Parse edges for post data.
3. Check page_info.has_next_page. If true, set cursor = page_info.end_cursor and repeat.

Critical: Insert time.sleep(2-5) between each request. For large-scale Instagram scraping, pair this with Mobile Proxies. Without strict rate limiting and proper IP masking, Instagram serves CAPTCHAs and bans IPs after 10-20 rapid requests. The price of an IP ban is permanent access loss for that address.
| Data Point | Status | Why & Legal Path |
|---|---|---|
| Full follower list (usernames/IDs) | ❌ Impossible via scraping | Requires login and pagination through private endpoints. This is a clear ToS violation and violates privacy laws. Instagram's anti-bot systems detect and block this immediately. |
| Follower count (public number) | ✅ Possible | Publicly visible on any profile. Accessible via the profile JSON API without login. |
| Email from business contact button | ❌ Impossible via scraping | The "Email" button loads a form behind a login/verification wall. Accessing this is unauthorized and breaches GDPR/CCPA. |
| Public email in bio text | ✅ Possible (ethically) | Only if the user manually typed it into their public biography. This is public data. Use regex extraction as shown previously. |
Legitimate Alternative for Followers: For approved business use cases, Instagram’s official API (the Instagram Graph API) provides follower counts and basic demographics for Business or Creator accounts you own or manage. This requires business verification and user consent via OAuth. It is the only legal method for accessing follower-related data at scale.
Bottom line: If your project requires a full follower list or non-bio contact details, Instagram scraping is the wrong tool. You must use the official API or obtain explicit user consent. Any other approach is a ToS violation with high risk of bans and legal action.
The distinction between a public follower count and the actual follower list is absolute in Instagram data access. One is a permitted public metric; the other is a protected private dataset.
| Follower Count (Public) | Follower List (Private) |
|---|---|
| ✅ Accessible via profile JSON API without login. | ❌ Requires authentication; any automated access violates ToS and privacy laws. |
| Displays aggregate number only. | Contains usernames/IDs of all followers. |
| Rate-limited (aggressive queries trigger blocks). | Instagram's anti-bot systems detect and ban such requests instantly. |
Even the public follower count must be fetched respectfully—strict rate limits apply to avoid IP bans. The Graph API is the only scalable, sustainable path for business analytics. Attempting to access the list guarantees failure and account termination.
Email extraction in Instagram scraping has one absolute rule: you may only parse emails that users have manually typed into their public bio field. Every other method is prohibited and illegal.
```python
import re

pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
public_emails = re.findall(pattern, bio_text)
```

Effective data storage for scraped data begins with a core principle: **data minimization**. Store only fields essential for your analysis. Unnecessary data increases storage costs, processing overhead, and compliance risk.
| Format | Scalability | Best Use Case |
|---|---|---|
| CSV/JSON | Low (single files) | One-time analysis, small datasets (<10k records), prototyping. |
| PostgreSQL | High (ACID compliant) | Structured relational data, complex queries, production pipelines requiring integrity. |
| MongoDB | Very High (horizontal scale) | Semi-structured or nested JSON-like documents (e.g., Instagram post data with variable fields). |
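For structured storage, the minimization principle translates directly into a narrow table schema. A sketch using stdlib sqlite3 as a stand-in for a production database like PostgreSQL (the `store_profiles` name and schema are ours):

```python
import sqlite3

def store_profiles(db_path, profiles):
    """Persist only the minimized fields needed for analysis."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS profiles (
            username TEXT PRIMARY KEY,
            follower_count INTEGER,
            fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.executemany(
        "INSERT OR REPLACE INTO profiles (username, follower_count) VALUES (?, ?)",
        [(p['username'], p['follower_count']) for p in profiles]
    )
    conn.commit()
    conn.close()

# Example: store_profiles('profiles.db', [{'username': 'example', 'follower_count': 12500}])
```

The `fetched_at` column matters for compliance: it is what lets you enforce a retention period later.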
For a quick prototype, save directly to CSV with pandas:
```python
import pandas as pd

df = pd.DataFrame(posts_list)
df.to_csv('instagram_posts.csv', index=False)
```

GDPR Compliance Steps (Checklist):

- Document a lawful basis (e.g., legitimate interest) before collection.
- Collect and store only the fields your analysis requires.
- Define and enforce a retention period (often 30-90 days for analytics).
- Anonymize or delete personal identifiers when they are no longer needed.
Q: How does Instagram detect scraping?
A: Instagram monitors request patterns: high RPM, missing/inconsistent headers (User-Agent, X-IG-App-ID), and IP reputation. Use strict rate limiting (1-2 req/sec), realistic headers, and Mobile Proxies to mitigate detection.

Q: What's the best tool for large-scale scraping?
A: For >100k daily requests, use custom Python scripts paired with a robust pool of Mobile Proxies. Cloud APIs exist, but they are expensive and give you less control over your infrastructure.

Q: How should I handle CAPTCHAs?
A: CAPTCHAs signal a ban is imminent. Stop all requests from that IP immediately. Rotate to a fresh proxy and reduce request rate. If you are using true Mobile Proxies, CAPTCHAs will rarely appear due to the high trust score of mobile IPs.

Q: How long can I legally store scraped data?
A: Store only as long as your explicit, lawful purpose requires. Under GDPR/CCPA, this is often 30-90 days for analytics. Document the retention period in your privacy policy.

Q: Is scraping public data still a ToS violation?
A: Yes. Instagram's Terms prohibit "automated means" to access their service, even for public data. Compliance requires honoring rate limits and avoiding any private endpoints. The risk includes account/IP bans, not just legal exposure.
Complete detection avoidance in Instagram scraping is impossible—Instagram's systems are designed to detect and block automation. However, you can significantly reduce your risk profile with strict operational discipline.
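One building block of that discipline is pacing with exponential backoff and jitter. A minimal sketch (the `backoff_delays` and `polite_get` names are ours; `session_get` stands in for a `session.get`-style callable from the earlier examples):

```python
import random
import time

def backoff_delays(max_retries=5, base=2.0, cap=60.0):
    """Yield exponentially growing delays (~2, 4, 8, ... s) with random jitter."""
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.uniform(0.5, 1.5)  # Jitter avoids synchronized retries

def polite_get(session_get, url, max_retries=5, base=2.0):
    """Retry a GET on HTTP 429, sleeping with exponential backoff between attempts."""
    for delay in backoff_delays(max_retries, base):
        resp = session_get(url)
        if resp.status_code != 429:
            return resp
        time.sleep(delay)
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")
```

This replaces the fixed `time.sleep(5)` retry shown earlier: repeated 429s trigger progressively longer waits instead of hammering the endpoint at a constant rate.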
For large-scale Instagram scraping (100k+ requests/day), the optimal tools are distributed Python workers communicating through Redis, all funneling traffic through a private pool of Mobile Proxies. Avoid Selenium/Playwright—their high resource use makes them impractical at scale. Always implement incremental storage to prevent memory overload.
Under GDPR, you may only store Instagram data as long as necessary for your explicit purpose. Apply data minimization: collect only required fields and delete the rest. Anonymize promptly if personal IDs aren't needed. Document your lawful basis (e.g., legitimate interest) and retention period.
| Use Case | Max Retention |
|---|---|
| Trend analysis | 30 days |
| Competitor benchmarking | 90 days |
| Academic research | 1 year (with ethics approval) |
When uncertain, consult legal counsel. GDPR violations carry fines up to 4% of global revenue.
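Retention limits like these are easiest to honor when enforcement is automated rather than manual. A sketch (the `purge_expired` name and the `fetched_at` ISO-timestamp column are our assumptions) that deletes expired rows from a SQLite table:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def purge_expired(conn, table, max_age_days):
    """Delete rows older than the retention window; returns the number removed."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=max_age_days)).isoformat()
    # ISO-8601 UTC timestamps compare correctly as strings
    cur = conn.execute(f"DELETE FROM {table} WHERE fetched_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Run it on a schedule (e.g., a daily cron job calling `purge_expired(conn, 'profiles', 30)`) so stored data never silently outlives its documented purpose.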
Mastering Instagram scraping via the GraphQL API delivers structured data efficiently, but long-term viability hinges on responsible scraping practices. Instagram’s anti-bot systems and evolving Terms of Service make ethical and operational discipline non-negotiable. Neglecting rate limits, proxy rotation, or data minimization invites IP bans, CAPTCHAs, and legal exposure under GDPR/CCPA.
These practices ensure sustainable access and compliance. The data ecosystem depends on scrapers who act responsibly—stay vigilant and keep learning.
By adhering to the practices outlined—leveraging stable endpoints, respecting rate limits, routing through trusted mobile IPs, minimizing data, and complying with GDPR/CCPA—you can build Instagram scraping pipelines that are both effective and sustainable. Remember, the goal isn't just to extract data, but to do so in a way that preserves access for the long term and respects user privacy. Stay vigilant, monitor Instagram's updates, and always prioritize ethical considerations in your data collection endeavors.