Scraping with Python

30+ video tutorials to help you master scraping web pages
with everything you need to crawl websites and manage scraped data.

Table of contents

Getting started with Scrapy

Get started with scraping your first website

Writing your first spider: Scraping Hacker News homepage
Using scrapy shell
Exporting data to CSV

Selecting data

Learn the various ways to select data on a web page.

Selecting data with XPath
Selecting data with CSS class and id attributes

Navigating pages

Learn to identify links and navigate from one page to another to continue scraping. Also learn how to selectively scrape URL patterns so you avoid scraping unnecessary pages.

Navigating to more pages to scrape
Identifying URLs with Scrapy Selectors
Selective scraping with allow and deny rules
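
The idea behind allow and deny rules can be sketched without any scraping library: a URL is followed only if it matches no deny pattern and, when allow patterns are given, at least one of them. This is only a rough illustration. The function name `should_follow` and the example patterns and URLs are assumptions made for this sketch and are not taken from the course.

```python
import re


def should_follow(url, allow=(), deny=()):
    """Return True if the URL passes the allow/deny patterns.

    Hypothetical helper for illustration only.
    """
    # A deny match always wins, even if an allow pattern also matches.
    if any(re.search(pattern, url) for pattern in deny):
        return False
    # With no allow patterns, every URL that was not denied is accepted.
    if not allow:
        return True
    return any(re.search(pattern, url) for pattern in allow)


# Example URLs and patterns (assumed for this sketch).
urls = [
    "https://example.com/item/1",
    "https://example.com/item/2?print=1",
    "https://example.com/login",
]
allow = [r"/item/\d+"]
deny = [r"print=1"]

selected = [url for url in urls if should_follow(url, allow, deny)]
print(selected)
```

Giving deny patterns priority keeps the rules predictable: a broad allow pattern can be narrowed later by adding a deny pattern, without rewriting the allow list.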

Exporting data

Export data to formats that can be parsed easily by other programs. Learn about the JSON Lines format.

The JSON Lines format
Exporting data to JSON Lines format
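
As a small illustration of why JSON Lines is easy for other programs to parse: each line of the file is one complete JSON object, so records can be written and read one at a time. The sketch below uses only the Python standard library. The sample records and the helper names `write_jsonl` and `read_jsonl` are made up for this example.

```python
import io
import json

# Sample scraped records (invented for this sketch).
items = [
    {"title": "First story", "points": 120},
    {"title": "Second story", "points": 45},
]


def write_jsonl(records, fileobj):
    """Write one JSON object per line."""
    for record in records:
        fileobj.write(json.dumps(record) + "\n")


def read_jsonl(fileobj):
    """Read records back, one line at a time, skipping blank lines."""
    return [json.loads(line) for line in fileobj if line.strip()]


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_jsonl(items, buffer)
buffer.seek(0)
loaded = read_jsonl(buffer)
print(loaded)
```

Because every line stands alone, a consumer can process a large export as a stream instead of loading the whole file into memory first.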

Managing scraped data

Organize scraped data with objects and learn to post-process it with Item Pipelines.

Standardising data with Scrapy Items
Post-processing data with Item Pipelines
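
The two ideas in this section can be sketched in plain Python: a typed object gives every scraped record the same shape, and a chain of small steps cleans or filters each record in turn. This is a simplified, library-free sketch, not Scrapy's own implementation. The `Story` class, both pipeline classes and the `run_pipelines` helper are hypothetical names, and returning `None` to discard a record is a shortcut chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Story:
    """A fixed shape for every scraped record (hypothetical item)."""
    title: str
    points: int


class CleanTitlePipeline:
    """Collapse repeated whitespace in the title."""

    def process_item(self, item, spider):
        item.title = " ".join(item.title.split())
        return item


class MinimumPointsPipeline:
    """Discard items below a points threshold."""

    def __init__(self, minimum):
        self.minimum = minimum

    def process_item(self, item, spider):
        # Returning None to discard an item is a simplification for this sketch.
        return item if item.points >= self.minimum else None


def run_pipelines(items, pipelines, spider=None):
    """Pass each item through every pipeline in order."""
    kept = []
    for item in items:
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider)
            if item is None:
                break  # item was discarded, skip the remaining steps
        else:
            kept.append(item)  # reached only if no pipeline discarded it
    return kept


raw = [Story("  Hello   world ", 50), Story("Low score", 3)]
results = run_pipelines(raw, [CleanTitlePipeline(), MinimumPointsPipeline(10)])
print(results)
```

Keeping each step small and single-purpose means steps can be reordered or switched off without touching the code that extracts the data.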

Configuring a spider

Learn to do more with a spider with some handy tricks.

Scraping a specific URL
Overriding settings for individual spiders

Advanced topics

Get up to speed with Scrapy concepts that will come in handy.

Intercept with middlewares
Using Spider Contracts to test spiders
Dupe filters
Logging

Ethical scraping

This section is a reminder that real people run the websites you scrape, and it covers the things you can do to be polite when scraping.

Throttling per site
Throttling outgoing requests
Caching

Working around limitations

Learn how to get past limitations such as forms and logins, and how to scrape websites with dynamic content.

Scraping content that requires login
Scraping websites that load dynamic content
Scraping content that requires filling forms

Circumvent blocking

And then there are times when you need to fly under the radar.

Masking the User-Agent to look like a real user
Using a proxy to circumvent blocking
Keeping your cookies

Scraping other content

Learn to scrape different kinds of content using Scrapy.

Downloading images
Scraping JSON
Scraping microformats

Deploying Scrapy projects

Sure, you can run your scraping projects on your own computer. But for large scraping projects, running them on servers allows you to scale beyond your bedroom.

Deploying to ScrapingHub
Deploying to servers using SpiderMon

Bonus content

Some tricks to speed up your implementation when scraping websites.

Sitemaps & Robots.txt
What to start with to scrape a website

From me to you

"Hello there. Everything in this course is from my learnings in the last few years building and maintaining large scraping projects, and helping other folks do the same.
This course is all my notes and learnings, so that you can move on to your projects without a struggle."

— Akash Manohar (@HashNuke)

Early preview of this course is coming soon. Let me know if you need access to it.

© 2020 ScrapingWithPython.com