Which Web Scraping Tool Should I Use?

A lot of different web scraping tools out there will get the job done.

So, how do you know which one to try?

I have put a lot of time and effort into testing and comparing web scraping tools, which led me to create the Simple Scraper. (I couldn't find anything low-cost that works in most common scraping situations and is very, very easy to use!) But I believe that each tool (for the most part) has a use and a value, and a team behind it that put a lot of effort into making something cool.

Here's a quick breakdown of how these different tools have worked for me, comparing them on a few different points: Price, Difficulty in Learning, and Feature Set. I also considered who might best benefit from each tool.

You can use the image below to get a quick glance at these different tools, or scroll down for a lengthier description of each inside the comparison chart.

Have questions? Feel free to comment below. 

| Product | Price | Cost | Payment Interval | Difficulty | Feature Set | General Features | Who’s it for? |
|---|---|---|---|---|---|---|---|
| Outscrape Simple Scraper | Low | $67 | One-time | Low | Low-Medium | Visual, multi-tier, point-and-click, includes javascript; No ajax or input | General web scraping tasks, good for freelancers or anyone doing many different scraping tasks a few times |
| FMiner | Medium | $168-$250 | One-time | Medium | Medium | Visual, Most important features (form filling, multi-tiered, ajax, recursive next button, fuzzy scraping) | General web scraping tasks, good for freelancers or anyone doing many different scraping tasks a few times |
| import.io | Very High | $300-$5000 | Monthly | Low | Medium | Visual, Point and Click interface, easy to use, extremely fast scraping on servers | Enterprise level companies with large, one-time scraping projects or budgets to scrape massive data regularly |
| Octoparse | Very High | $75-$158 | Monthly | High | High | Visual / Scripting, similar to Mozenda but more complex + difficult to learn | Medium to Enterprise level businesses with regular scraping projects defined and run monthly |
| Scrapebox | Medium | $97 | One-time | Low-Medium | Low-Medium | Scrapebox is more like a machine gun than a scraping tool. Use it to grab huge amounts of data without defining where it comes from | Anyone collecting huge amounts of data indiscriminately – SEOs, someone collecting huge amounts of content, etc |
| Scrapy | Free | $0 | Free | High | Medium | Scrapy is a Python framework for collecting data – many, many options, but requires programming knowledge | Very useful for programmers, not worth learning unless you have a background or plan to use extensively |
| Mozenda | Very High | $.02/page | Per Page | Low | High | Visual, comparable to import.io but per page cost | Enterprise level companies with large, one-time scraping projects or budgets to scrape massive data regularly |
| WebHarvy | Medium | $99 | One-time | Medium | Low | Good basic point-and-click visual scraper, can’t load javascript pages so fairly limited | General web scraping tasks, good for freelancers or anyone doing many different scraping tasks a few times |
| VisualWebRipper | High | $349 | One-time | Medium | Medium | Good general feature set, includes ajax / infinite scroll scraping | General web scraping tasks, good for freelancers or anyone doing many different scraping tasks a few times |
| Web Content Extractor | Low | $49 | One-time | High | Medium | Decent feature set, can’t use javascript | Requires too great a knowledge of scraping and some scraping concepts for most new/medium users, but if you are an expert not bad for the cost |

How to Do Web Scraping

In March I wrote an introduction to web scraping for a forum, and as I was writing it, I realized that there was a LOT of information. Way more than I could explain in a simple post. That's how I ended up writing Web Scraping Secrets Exposed. If you're interested in a basic introduction to web scraping - how people use it and why they do it - this is the reproduced post.

After years of web scraping and working with people who do data collection, data harvesting, data indexing, data aggregation, web crawling, screen scraping, or whatever you want to call it, I wanted to put together a very basic list of ideas on how anyone can profit from the info that’s already out there.

First: What’s scraping? My definition is basic: scraping is intelligently, automatically taking content from somewhere, generally structured content, with the intention of reproducing it or examining it for trends or valuable information.

Second: Why scraping?
Because data is valuable. Knowledge is power. Yadayada. You know all this.

What you might not know: scraping is often free. So, we’re talking about free value. So here’s what this guide is going to answer, very basically and quickly to get your mind working:

  • Where to get the data
  • What to do with it

One popular cloud-based scraping product suggests these basic scraping categories:

Method (site example)

Machine learning (Google images)
Price monitoring (Ebay)
Lead generation (Yelp) [scraping contact info for local biz]
Market research (Brewdog) [scraping types of beer and their ratings, for example]
App Development (Realtor.com) [I can only assume scraping realty data and copying it]
Academic Research (Techcrunch)

Nice, but I’m going to break it down for you in terms of how to actually make money with this stuff. Here are the basic categories I could come up with (Note: the ebook has 20+ different methods):

  1. Duplicating sites
  2. Offering scraped data as a service
  3. Lead gen
  4. Offering “scraping” itself as a service
  5. Scraping to get around APIs

1. Duplicating Sites

This is an obvious one. No matter what website you want to create, there’s probably already one out there that’s similar. Here are some site ideas that could benefit from reproducing scraped data:

  • Forums
  • Job boards
  • Blogs
  • Q&A Site
  • Coupon Sites
  • Knowledgebase/Wiki Sites
  • Social network
  • Review sites (think Yelp, Amazon, etc)
  • Any site with data that you could reproduce and create a better interface/app/etc for

A ton of sites you might already own could use scraped content like this: to look active, to get more traffic, for SEO, as part of a PBN, or as the place to actually get the data to begin with (for a coupon site, for example).

 

2. Scraped Data As A Service

People want the info below. If you aggregate it regularly or quickly you’ve got yourself some value. Build a targeted search engine, for example, that pulls data from the top 10 or 20 providers of any kind of niche product and you’ve got something that probably doesn’t exist anywhere else. Consider:

  • Stocks (Often sites require a cost to scrape anything past a certain date – but you could scrape this once and then provide it for free)
  • Niche News Aggregation (pick a niche, like celebrity news sites, scrape the top 10 sites, etc)
  • Daily News (pay for a subscription to get past major site paywalls, then make the data free or discounted)
  • Anything with a paywall – if you’re a student, you can grab this for free – but be careful, because that’s what got Aaron Swartz in trouble
  • Any kind of niche content to auto-send your mailing list, post via social media, etc (think a newsletter just for the top trends in blackhat IM, or a bot that auto-tweets when a house gets sold in a specific zipcode)
  • Offline, intranet, or hard-to-access data – any legacy database or collection of info can be scraped and converted into a new format and put online, and I’ve seen companies pay big bucks to have this done rather than pay to have entire legacy software systems rebuilt.

 

3. Lead Generation

This is a goldmine, and one that could be considered less than legal, but you wouldn’t believe the number of big companies who use this data for all sorts of things (import.io SUGGESTS you use Yelp for lead gen, despite scraping Yelp being against the TOS).

Ever get targeted by a mailer because you bought a house, had a kid, moved, went to jail, started a business, etc? A lot of this is public info. You wouldn’t believe the number of lawyers and realtors I’ve talked to who use public databases to get clients. Those two groups, for example, usually have access to a poorly designed database that doesn’t export easily and requires scraping to go through the vast datasets.

If you have access to a unique dataset, or you’re willing to pay for it, or you can grab something that’s public and re-form it, you’re in a great position. You could collect the data and sell it, or you could use it yourself by targeting the contacts directly with offers.

Note – Learn regex. Many places are going to have contact info like email addresses throughout that isn’t easily scrapable. With regex and the right software, you can grab any email address from any dataset and copy ONLY that.
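
To make the regex tip concrete, here is a minimal sketch in Python. The pattern is deliberately simple (it catches common addresses, but it is not a full email validator), and the sample text and the `extract_emails` helper are made up for illustration.

```python
import re

# A deliberately simple pattern: good enough for most scraped text,
# but not a complete RFC-style email validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique email addresses in text, in order of first appearance."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()  # treat SALES@... and sales@... as the same address
        if key not in seen:
            seen.add(key)
            emails.append(match)
    return emails


# Hypothetical sample of messy scraped text.
sample = """
Contact: Jane Doe <jane.doe@example.com>, phone 555-0100.
Sales team: sales@example.org or SALES@example.org (same inbox).
"""

print(extract_emails(sample))
```

Run it against any blob of scraped text and you get back ONLY the addresses, which is exactly what you want for a lead list.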

Places to scrape:

  • Social networks like Linkedin, Facebook, Twitter
  • Public datasets/records like insurance data, criminal records and other law databases, voting records, tax records, gov’t spending databases.
  • Realty (home foreclosures, new homes)
  • Car / vehicle sales websites
  • Review sites like Yelp

 

4. Scraping as a Service

This sounds like offering scraped data as a service but it’s slightly different, essentially because it’s time-based. A lot of SaaS companies out there are just scrapers or content aggregators. You can be too. For instance, you could:

  • Monitor websites for updates or changes
  • Proxies
  • Sales data (Amazon, Ebay, etc) or any kind of item and product listings for competitive price monitoring and market research, a price comparison portal, price arbitrage (what/when can you buy from Amazon and sell on Ebay for a profit?) or inventory tracking
  • Locate the highest-ranking keywords of your competitors on all major search engines
  • Automate ad buying research
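
The first idea in the list above, monitoring websites for changes, can be sketched in a few lines of Python. This is only one possible approach, under some assumptions: you have already scraped the page text somehow (the fetching step is left out), and the function names and example URLs are invented for the illustration.

```python
import hashlib


def fingerprint(page_text):
    """Hash the page text after collapsing whitespace, so spacing-only changes are ignored."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous, current_pages):
    """Compare stored fingerprints with freshly scraped pages.

    previous: dict mapping URL -> fingerprint saved from the last run
    current_pages: dict mapping URL -> page text scraped on this run
    Returns (changed_urls, updated_fingerprints). A URL that was never
    seen before is reported as changed.
    """
    updated = {}
    changed = []
    for url, text in current_pages.items():
        digest = fingerprint(text)
        updated[url] = digest
        if previous.get(url) != digest:
            changed.append(url)
    return sorted(changed), updated


# Hypothetical data standing in for two scraping runs.
first_run = {
    "https://example.com/pricing": "Plan A: $10",
    "https://example.com/about": "We sell  widgets.",
}
second_run = {
    "https://example.com/pricing": "Plan A: $12",
    "https://example.com/about": "We sell widgets.",
}

_, stored = detect_changes({}, first_run)
changed, stored = detect_changes(stored, second_run)
print(changed)
```

Only the pricing page is reported, because the about page differs by nothing but whitespace. Store the fingerprints between runs, schedule the script, and you have the core of a monitoring service.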
Like This Blog Post? Get 40 Pages from the Ebook Free and Learn...
  • How I used web scraping to make a killing on a social network
  • The basics of scraping, and tools and resources I recommend
  • How to come up with your first web scraping idea
  • Lessons from Seth Godin, Facebook, Google, and more
  • BONUS: Join 230 People in a Private Web Scraping Mastermind

Enter your email and I'll send you the first 40 pages, free!  

Your email address will be safe. 

5. API Alternatives

A lot of sites that have APIs have them because people are willing to pay for the data – if that’s true, then just ask yourself, why?

APIs are awesome, but they often cost money. If you had all the money in the world and plenty of time to code, I’m sure you would use them. The great thing is that sites that have APIs usually have structured content on their site as well. If you need to get data fast and easy, and for basically free, skip the API and go straight for scraping the data directly.
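
Here is a rough sketch of what "scraping the structured content directly" can look like, using only Python's built-in `html.parser`. The HTML snippet, the class names (`listing`, `title`, `price`) and the `ListingParser` class are all assumptions made up for the example; a real site will use its own markup.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect the text of elements whose class matches one of the wanted fields."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.current = None  # field whose text we are waiting for, if any
        self.rows = []       # one dict per listing found

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "listing" in classes:
            self.rows.append({})  # a new listing starts here
        for name in classes:
            if name in self.fields and self.rows:
                self.current = name

    def handle_data(self, data):
        if self.current and data.strip():
            self.rows[-1][self.current] = data.strip()
            self.current = None


# Hypothetical markup, standing in for a page you have already downloaded.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="title">Blue Widget</span> <span class="price">$19.99</span></li>
  <li class="listing"><span class="title">Red Widget</span> <span class="price">$24.50</span></li>
</ul>
"""

parser = ListingParser(fields=["title", "price"])
parser.feed(SAMPLE_HTML)
print(parser.rows)
```

Because the page repeats the same structure for every item, a few lines of parsing give you the same rows an API would have charged for.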

In fact, one way to get scraping site ideas is to look up sites that have APIs. Example:

http://www.computersciencezone.org/50-most-useful-apis-for-developers/
http://www.programmableweb.com/news/most-popular-apis-least-one-will-surprise-you/2014/01/23

 

It's All Overwhelming. Where do I start with web scraping?

1. Start with what you know.

If you’re into old cars, build a search engine / listing site for old cars for sale. See if you can automate it and monetize it. If you’re into gov’t spending or something related to legislation, here are a few fun places to start:

https://www.fcc.gov/licensing-databases/general/search-fcc-databases
https://www.data.gov/
https://www.foia.gov/search.html

2. Play around.

One reason I love scraping is that it’s fun. The programming part of it is annoying, but getting the data is fun.

3. Don’t freak out.

Yes, there’s a lot of data. Yes, there are almost always sites that exist already that do something like what you plan to do, but usually they’re making money, and you could make some of that money, too. And if your idea is niche enough, you might actually be the first to aggregate the data or offer the service.

If you want to jump in headfirst, grab my ebook, which is a great introduction to web scraping for the non-coder/entrepreneur/marketer:

Get the Full eBook - Web Scraping Secrets Exposed

ONLY $34

  • 7 Chapters - Over 150 Pages Of Content!
  • 20+ Detailed Methods
  • 30+ Niches
  • BONUS: Private Mastermind Access
  • BONUS: 687 Freelance Scraping Jobs
  • BONUS: 150 Sample URLS
  • BONUS: Sample Scraper Code
  • BONUS: Scraper Demo Videos
  • BONUS: 4 Samples of Scraped Data

Secure checkout provided by Paypal and Stripe. 

How to Use Web Scraping In These 7 Niches

You can use web scraping to get data from almost any niche, because web scraping gets you data and content. Inside Web Scraping Secrets Exposed you'll learn exactly how you can use web scraping to get data from, and for, over 30 niches. Here's a sample of just seven of those.

Niche 1: Web Scraping On Online Marketplaces (like Amazon)

After years of scraping data and talking to clients who scrape data, I can tell you that Online Marketplaces and E-Commerce websites (like Amazon) are probably the most common source of scraped data. In the back of this section you’ll see a list of other example sites taken from interviews with scrapers. Basically, I define an online marketplace as somewhere that products are sold by more than one company.

I’ve spoken to more than one provider of niche products that isn’t actually the provider, just the scraper who has a better search functionality than the original providers. You don’t even have to charge a customer. Just give access to your proprietary data in exchange for an email! Once you have that, you know the niche they are in, and you can profit.

Niche 2: Using Web Scraping on Business Directories (like YellowPages)

Business directories like Crunchbase or Yellowpages.com are incredible resources. But don't forget about the smaller niches inside this: restaurants, hospitals, clinics, construction companies, event agencies, tourism, transport, communication, marketing, fashion/design, music/film/production and more all have directories if you look hard enough.

Scraping YP.com

You could build a single report for a local business, or you could build an entire database or portal or all-in-one site showing everything there is to know about the competition in the area or across the nation. You could compile all the info you can about all the dentists across the nation, for example, just from business directories. Or maybe you want to sell fresh email lists of contacts that might want to hire marketers? Put together an email list every month of all the companies on Crunchbase that have announced Series A or Seed funding. There are so many ways to use this info that you'll end up using Business Directories more than almost any other niche, I think.

Niche 3: Web Scraping Crowdfunding Sites (like Kickstarter)

If you're running your own Kickstarter or crowdfunding project, then there are OBVIOUS ways to use scraping. Researching crowdfunding to understand the success of similar projects, who donated to them, and why, can help you succeed. Scraping is a great way to do that. But you can also research info for other projects too. I'll show you how inside the book!

Wondering if there are enough different projects and crowdfunding sites to do this research? You'd be surprised! Crowdfunding is a nearly $50 billion industry. And there are a LOT of different sites.

You can use these sites to collect backers’ information from the funding pages of related projects. Reach out to these backers, explain the crossover between the project they funded and yours, and get an instantaneous funding boost.

  • Kickstarter.com
  • Gofundme.com
  • Teespring.com
  • Indiegogo.com
  • Patreon.com
  • Crowdrise.com
  • Crowdtilt.com
  • More inside the book...

Niche 4: Scraping Real Estate Sites (like Zillow.com)

Real estate is one of my favorite categories for web scraping. In my experience scraping the web and talking to scrapers, this is one of the big ones. Realtors, home buyers or sellers and even renters have to work fast, to get new sales data and to grab options or clients before someone else does. This makes scraped data in the real estate market extremely useful and valuable.

The market is big: The Association of Real Estate License Law Officials (ARELLO) estimates that there are about 2 million active real estate licensees in the United States alone. The market is valuable: According to the 2007 Economic Census, there are 109,472 real estate brokerage firms operating in the United States. And in 2010, primary residences accounted for 29.5% of total family assets, according to the Federal Reserve's Survey of Consumer Finances. And one last thing: the market and sales data is, mostly, public. So the market is great for scraping.

An example spreadsheet of scraped data pulled from a Real Estate site

Huge troves of public data, lots of money at stake, swings in the economy - all of this means there's plenty of data available. One scraper told me he used real estate data to understand pricing trends and merge information from similar listings found on various real estate listing websites, to create trend data from all the mess of online scraped data in different places. You could do this too - and create all sorts of other information. There are obviously entire college departments dedicated to doing trend research on the economy, but usually on a global or national scale. Imagine what you could do with a bit of very specific local data from your area's sales data or MLS data.
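
To give a feel for the "merge similar listings, then look at trends" idea, here is a small sketch in Python. Everything in it is an assumption for illustration: the field names (`address`, `price`, `listed`), the sample listings, and the rule that the most recent record wins when two sites list the same address. It is not the method the scraper I talked to used.

```python
from collections import defaultdict
from statistics import median


def normalize_address(address):
    """Lowercase and drop punctuation so the same home matches across sites."""
    cleaned = "".join(ch for ch in address.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def merge_listings(*sources):
    """Combine listings from several sites, one record per normalized address.

    When an address appears more than once, the most recently listed record wins.
    Dates are ISO strings (YYYY-MM-DD), so plain string comparison orders them.
    """
    merged = {}
    for source in sources:
        for listing in source:
            key = normalize_address(listing["address"])
            if key not in merged or listing["listed"] > merged[key]["listed"]:
                merged[key] = listing
    return list(merged.values())


def median_price_by_month(listings):
    """Group listings by YYYY-MM and return the median asking price per month."""
    by_month = defaultdict(list)
    for listing in listings:
        by_month[listing["listed"][:7]].append(listing["price"])
    return {month: median(prices) for month, prices in sorted(by_month.items())}


# Hypothetical listings scraped from two different sites.
site_a = [
    {"address": "12 Oak St.", "price": 250000, "listed": "2016-01-10"},
    {"address": "7 Elm Ave", "price": 310000, "listed": "2016-02-03"},
]
site_b = [
    {"address": "12 OAK ST", "price": 245000, "listed": "2016-01-20"},
    {"address": "99 Pine Rd", "price": 280000, "listed": "2016-01-25"},
]

merged = merge_listings(site_a, site_b)
trend = median_price_by_month(merged)
print(len(merged), trend)
```

The two "12 Oak St" records collapse into one, so the trend is built from three homes instead of four. Real addresses are messier than this, so expect to spend most of your time on the normalizing step.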

Niche 5: Scraping Sports Sites

You can find all sorts of sites for every sport to pull data from. One scraper told me he built a scraper to extract NBA score data and shared the code for it. Another scraper has a side project to scrape play-by-play data from NBA games and put it together on a site. In fact, the more time you spend looking at sports stats, the clearer it is that a lot of sites are just aggregated data. Look at sites like Vorped and NBAWowy for examples. And don’t just focus on the big leagues: smaller divisions have rabid fans as well.

Niche 6: Scraping Social Media Sites (like Facebook or Instagram)

One great way to use social media sites: to validate ideas! It takes some effort, but you can usually find more info about what people want or don’t want in a product on social media than you can on Google. Use something like Twitter to search for keywords, competitor info, or other things related to your plan or product, and then scrape the data! I’ve also used social media to find out more about what exists, and what features should exist in products. This can help you determine whether the idea you have is a good one, and what features are most likely to get you off to a good start. Offering this sort of analysis as a service is a great way to make money from people who are interested in starting a business but not sure if it is viable.
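
Once you have the scraped posts, the analysis can start out very simple. The sketch below counts how many posts mention each keyword you care about. The posts, the keywords and the `count_mentions` helper are invented for the example, and it only handles single-word keywords.

```python
import re
from collections import Counter


def count_mentions(posts, keywords):
    """Count how many posts mention each keyword (case-insensitive, whole words only)."""
    counts = Counter()
    for post in posts:
        words = set(re.findall(r"[a-z0-9']+", post.lower()))
        for keyword in keywords:
            if keyword.lower() in words:
                counts[keyword] += 1
    return counts


# Hypothetical posts standing in for scraped social media data.
posts = [
    "I wish my scraper had an export to CSV button",
    "Why is CSV export so slow? Need scheduling too.",
    "Scheduling would save me hours every week",
]

counts = count_mentions(posts, ["csv", "scheduling", "proxy"])
print(counts.most_common())
```

A feature that keeps showing up in complaints and wish lists is a decent first signal that people would pay for it.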

Niche 7: Scraping Forums (like Reddit)

Forums are great places for people to voice their opinions. They were the Twitter before there was a Twitter, so it makes sense that you can find lots of information about how people feel about certain things by looking on forums.

There are niche forums that focus on just tools, just cars, just photography, just SEO, pretty much anything you can think of. I highly recommend forums as a place to start if you’re considering leadgen, but there are many other options for using the info that's there! It's not that difficult to scrape forums and comb through the data for posts about related topics, then tie them together to build your own unique content, or just to get ideas of your own.
