PHP Web Scraping: What to know before you start with Symfony Panther, Goutte, and more

First and foremost, scraping the web (in PHP) is bad, mkay? PHP web scraping isn’t worse than doing it with other languages, it’s just that web scraping in general is most likely to be looked upon with disdain by content producers. It’ll also make your code more brittle than it should be, and it’s generally going to make an application more complex to build.

That said, sometimes “web scraping” is your only choice. If you love PHP (I do), and need to do some web scraping, you’re in the right place. In this article I’ll give a quick summary of the state-of-the-art for PHP web scraping, and some details about why to do it and what tools to use. Let’s start this PHP web scraping tutorial!

What is Web Scraping?

Web scraping with PHP is no different than any other kind of “web scraping.” And while different people mean different things when they say “web scraping,” what I mean is that you’re extracting information from within the HTML of a web page when the owner of that information hasn’t made it available in a REST, SOAP, or GraphQL API. (Nor any other kind of identifiable programming-friendly interface.)

Web scraping is your last choice when someone doesn’t provide a formal web API for data-access

I was recently looking to do some web scraping to get information about our web affiliate businesses for WPShout. (BTW, if you’re looking for the best WordPress web hosting, read our great article on the topic.) The owner of the affiliate program data didn’t offer any notable API. But I wanted us to have an “earning dashboard.” So I built a PHP scraper script to fetch it for me. And it’s working great. (Unfortunately I can’t show it to you, as I don’t want to share the data it scrapes.)

The Reasons to do Web Scraping

So we web scrape because…

  • We want data inside of our PHP script
  • The owner of the data doesn’t expose an API by which we can more efficiently get that data
  • We really want their data inside of our PHP script

I said “we want that data” twice because a characteristic of web scraper scripts is that they’re fragile. Because they’re getting at underlying data presented in a web page’s internal HTML, they can break for random reasons. Like a designer changing the HTML that surrounds the data you’re seeking. This is why Intuit (makers of Mint, QuickBooks, etc) is spending millions on its bank web scrapers every year. (Conservatively. But without citation 🙃)

The Reasons Not to Do Web Scraping

Here are a few of the reasons that your web scraper (once it’s working, which is a whole other topic…) will break:

  • The data is now presented differently in terms of its text strings
  • The data’s location on the page is moved because of design considerations
  • The host’s data format (the surrounding HTML) is changed because of design considerations
  • The data you’re trying to scrape gets intentionally obfuscated by its host (think of Facebook’s anti-ad-block markup)
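You can’t prevent these breakages, but you can make them loud instead of silent. Here’s a minimal sketch of that idea, using PHP’s built-in DOMDocument and DOMXPath rather than any scraping library. The function name, the markup, and the XPath queries are all made up for illustration: the script tries a list of selectors in order and throws when none of them match, so a redesign shows up as an error rather than as an empty value.

```php
<?php

/**
 * Try several XPath queries in order and return the first non-empty match.
 * Throws instead of returning an empty string, so a layout change is noticed.
 */
function extractFirstMatch(string $html, array $xpathQueries): string
{
    $document = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    foreach ($xpathQueries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }

    throw new RuntimeException('No selector matched; the page layout may have changed.');
}

// Hypothetical markup: the old layout used #earnings, a redesign uses .total.
$html = '<html><body><div class="total"> $42.00 </div></body></html>';
$earnings = extractFirstMatch($html, [
    '//*[@id="earnings"]',
    '//div[@class="total"]',
]);
```

Keeping the older selector in the list costs nothing, and the exception gives you one obvious place to hook in an alert.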

In short, web scraping should always be the last resort. You’re doing a thing that the content-producer is at best a little disappointed with. If they wanted to make that information available to you, and they could, they would have. They may truly not have the technical capacity, or interest. Which is when web scraping is a great fit. Because a slow-moving website is one of the best targets for scraping data from websites using PHP.

Why Use PHP for Web Scraping?

There are a number of PHP web scraping libraries. And while I’ve not done an exhaustive search, I do suspect there are better languages than PHP to use for scraping. I doubt the absolute best web scraping framework is written in PHP. No PHP web scraping framework I know is mind-blowingly good.

The primary reason for doing PHP web scraping is that you know and love PHP. Use PHP for your web scraping if the rest of your application (that’s going to use the result of this web scraping) is written in PHP. Scraping with PHP is not so easy that I’d plan to use it in the middle of a Python web project, for example. The PHP scraping libraries are quite good, but they’re not amazing.

Reasons to Avoid PHP Web Scraping

Web scraping with PHP is easy enough. And good enough that I’d do it without a second’s hesitation in a PHP project. So the primary reason I wouldn’t do PHP scraping? That I knew a different language better, or was already using it. Web scraping with PHP is not better enough that I’d use it in preference to some language like Java that I was already writing my project in.

The other big reason not to do PHP web scraping is simply that you’re not wanting to do web scraping at all. There are tons of good reasons for that, including the increasing commonness of CAPTCHAs and other bot-stopping maneuvers. It’s still a useful technique to know for sure, but it’s getting less-useful than it was a decade ago.

Getting Started with PHP Web Scraping

There are a number of PHP web scraping framework options. While I could make this tutorial a thorough tour of using each one of those, I think that Goutte and Symfony Panther make a potent combination here, and I wouldn’t really make an effort to use a different system. You can if you need to, but I won’t give you a full list.

Which PHP Web Scraping Libraries Should I Use?

Choose your web scraping team carefully. It’s like surgery. With rockets 😛

So, I think the obvious answer here is “whatever you like.” No PHP scraping framework I’ve ever tried is so good that I’d use it in preference to another.

I started doing some light PHP web scraping in the context of a project that was using the Symfony PHP web framework. And, in general, I enjoy the Symfony tools enough to not look for others. So what we’ll cover in the rest of the PHP web scraping tutorial is FriendsOfPHP/Goutte and Symfony/Panther. But there are a lot of good options. In general the major difference I’d highlight is between a PHP web scraping library like Panther or Goutte, and a PHP web request library like cURL, Guzzle, Requests, etc.

In my mind, a PHP web request library is distinguished from a web scraping library because:

  • It can make requests using all the major HTTP methods
  • It can get you the basic HTML of a page, which you can parse how you’d like
  • It doesn’t help you parse the web page your HTTP request returns
  • It doesn’t help you to make a series of requests in sequence while moving through a series of web pages you’re trying to scrape
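To make the distinction concrete, here’s roughly the work a request library leaves to you. This is only a sketch using PHP’s built-in DOMDocument; the function name and the sample markup are invented for the example. Once you have the raw HTML, finding the links you’d need to follow to reach the next page is your job:

```php
<?php

/**
 * Return the href of every <a> element in the given HTML.
 */
function extractLinks(string $html): array
{
    $document = new DOMDocument();
    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($document->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Skip anchors that have no href at all (getAttribute returns '').
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return $links;
}

// In a real script $html would come from a request library (cURL, Guzzle, ...).
$html = '<p><a href="/page/2">Next</a> <a name="top">Top</a> '
    . '<a href="https://example.com/about">About</a></p>';
$links = extractLinks($html);
```

Note what’s still missing: resolving the relative `/page/2` against the page’s URL, requesting it, carrying cookies along, and so on. That bookkeeping is what a scraping library takes off your hands.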

So I’d count Goutte, Panther, and Laravel Dusk (which we’ll just briefly cover at the end) as proper PHP web scraping libraries. I’d count just about every other PHP tool I’ve ever heard of as a “request library.”

Getting Started with Symfony Goutte

So, let’s get to the step-by-step of our PHP Web Scraping tutorial. Goutte was the first PHP web scraper I used, and it still works pretty well for all the basic needs you’ll have: getting pages, filling in their web forms, and extracting content from them.

To use Goutte, we must first get it:

composer require fabpot/goutte

Here’s a script that will scrape a page with Goutte:

<?php

include ('vendor/autoload.php');

$client = new \Goutte\Client();
$crawler = $client->request('GET', 'http://example.com/');
$fullPageHtml = $crawler->html();
$pageH1 = $crawler->filter('h1')->text();

This is using the PHP package manager, Composer. So will all other examples here. I don’t yet have an article to get you started with Composer, so let me know if you need one.

What this crawler does is pretty simple: it goes to example.com and loads the page. Then it filters the HTML and pulls the page’s <h1> element, getting us its content. There is nothing very cool here, but it should give you a sense of how Goutte works for PHP web scraping.

When You’ll Need Symfony Panther

The primary obstacle that every basic PHP site scraper will have is that a lot of the modern web requires JavaScript to work. Long gone are the days when every website developer made sure that their site worked great without executing any JavaScript. And it’s precisely this issue that will make it necessary for your web scraper to use Panther instead of Goutte for PHP web scraping.

What’s great about Symfony Panther is that you’re actually spawning and controlling an instance of the Google Chrome web browser from your PHP scraper script, as opposed to doing it with raw HTML requests. This is a nice thing because Google Chrome is great for executing JavaScript. It’s bad because you’ll have another dependency on your deployed box which is a little abnormal: that it has Google Chrome. It also will add a few speed bumps to your development, one of which I note after the code sample.

Unsure if you’ll be able to get by with Goutte or will need to use Panther? Disable JavaScript in your web browser. (You’ll be able to solve this with a quick web search of “Disable JavaScript in [browsername].”) Then try to do what you’ll want your scraper to do. If you can, go back to Goutte. If you can’t, it’ll be Panther time.
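If you’d rather script that check than click through browser settings, here’s a rough approximation, with the caveat that it’s a heuristic and not a guarantee. It assumes you’ve already fetched the raw HTML (no JavaScript executed) and that you know a piece of text you expect to see on the rendered page. The function name and the sample markup are made up for this sketch; it uses PHP’s built-in DOMDocument.

```php
<?php

/**
 * Heuristic: does the text we want already appear in the raw HTML,
 * ignoring anything that only exists inside <script> elements?
 */
function appearsWithoutJavaScript(string $rawHtml, string $expectedText): bool
{
    $document = new DOMDocument();
    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($rawHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array first: removing nodes while
    // iterating the live list would skip elements.
    $scripts = iterator_to_array($document->getElementsByTagName('script'));
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }

    $root = $document->documentElement;
    if ($root === null) {
        return false;
    }

    // Case-insensitive search in what is left of the page text.
    return stripos($root->textContent, $expectedText) !== false;
}

// Hypothetical pages: one rendered on the server, one rendered by JavaScript.
$staticPage = '<html><body><h1>Earnings: $42</h1></body></html>';
$jsPage = '<html><body><div id="app"></div>'
    . '<script>render("Earnings: $42")</script></body></html>';

$staticResult = appearsWithoutJavaScript($staticPage, 'Earnings');
$jsResult = appearsWithoutJavaScript($jsPage, 'Earnings');
```

If the function returns true for the data you care about, Goutte is probably enough. If it returns false, the content is likely injected by JavaScript and you’re in Panther territory.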

Here’s our PHP scraper script that will browse and move-through a page with Panther:

First, get the package:

composer require symfony/panther

Then use this in your PHP script:

<?php

include ('vendor/autoload.php');

$client = \Symfony\Component\Panther\Client::createChromeClient();
$crawler = $client->request('GET', 'http://example.com/');
$fullPageHtml = $crawler->html();
$pageH1 = $crawler->filter('h1')->text();

This is, you’ll notice, identical to the above Goutte code. We’ll highlight some of the cooler features of Panther in the next snippet. At this point we’re just replacing Goutte with Panther, and getting the same page content. If you really execute both of these, you’ll notice that Panther is way slower than Goutte. Spinning up Chrome is way more expensive than just getting HTML with PHP itself, which is all Goutte is doing under the hood. Not the end of the world, but certainly something to know.

Random useful tip for Symfony Panther play: I’ve had a lot of times when my development scripts would error and leave a Chrome running on port 9515. That will later raise an exception reporting the issue. The Bash command kill $(lsof -t -i:9515) is the best way I found to kill that and get my script back to working.
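A complementary habit is to shut the browser down yourself whenever the script throws. Here’s a small sketch of that pattern, assuming the client object exposes a quit() method (Panther’s client has a method by that name). The withBrowser() helper and the FakeClient class are invented for this example, so that it runs without Chrome installed.

```php
<?php

/**
 * Run $work with a freshly created client and always call quit() afterwards,
 * even if $work throws.
 */
function withBrowser(callable $createClient, callable $work)
{
    $client = $createClient();
    try {
        return $work($client);
    } finally {
        $client->quit();
    }
}

// Stand-in for a real browser client so the sketch runs without Chrome.
final class FakeClient
{
    public bool $quitCalled = false;

    public function quit(): void
    {
        $this->quitCalled = true;
    }
}

$fake = new FakeClient();
try {
    withBrowser(
        fn () => $fake,
        function ($client) {
            throw new RuntimeException('Scrape failed halfway through');
        }
    );
} catch (RuntimeException $e) {
    // The error still reaches us, but quit() has already been called.
}
```

With Panther you’d pass something like fn () => \Symfony\Component\Panther\Client::createChromeClient() as the first argument. This won’t save you when PHP dies from a fatal error or the process gets killed outright, so the kill command is still worth keeping around.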

Doing More Complex Operations with Goutte and Panther

The part of web scraping I struggle with the most is the operations other than loading a web page and getting some data that was in the underlying HTML. Things like filling forms, finding and clicking links, and more are possible but not as obvious. I actually find them a little awkward with Goutte and Panther. But to get you started, here’s an example of filling in the search value on Wikipedia and clicking the search button:

<?php

include ('../vendor/autoload.php');

$client = new \Goutte\Client();
// For Panther
//$client = \Symfony\Component\Panther\Client::createChromeClient();

// Load the homepage, find the search form, and fill in its "search" field
$crawler = $client->request('GET', 'https://www.wikipedia.org/');
$form = $crawler->filter('#search-form')
    ->form(['search' => 'web scraping']);

// Submitting returns a NEW crawler for the results page; keep this assignment
$crawler = $client->submit($form);
// For Panther
//$client->takeScreenshot('screenshot.png');
//$client->waitFor('.firstHeading');

// Print the first paragraph of the resulting article
echo $crawler->filter('.mw-parser-output p')->first()->text();

What’s awesome about this example of a web scraping script is that you’re able to adapt it to just about any form you can think of. What’s not great is that while playing with the above snippet I wasted about 90 minutes because I forgot the second assignment to $crawler from the form submission and was getting obscure errors when running with Symfony Panther.

What does the above do? It searches Wikipedia for “web scraping” by finding the search form on the homepage and submitting it. (Because of the relatively smart structure of Wikipedia URLs, you might not need this part. Because lots of other things you’ll want to scrape require you to fill out a form, I intentionally did this search via filling out the form rather than “URL hacking” where I just took the best-guess of the structure of the final URL.)

What’s also really fun about Panther and Goutte is that when you don’t do stupid things, their APIs are largely compatible. Because Goutte is basically an HTML-only browser, it can’t do cool things like take a screenshot, nor wait for DOM elements to load. But other than that, these two will work the same. But the screenshot feature? It is certainly pretty cool. 🤓

Laravel Fan? Laravel Dusk Looks a Lot Like Panther, with Slightly Nicer Interactions

A photo of dusk. Because we could all use more beauty. Especially while studying web scraping ;p

As I mentioned, I’ve not done a comprehensive review of PHP web scraping frameworks. But it did occur to me as I wrote my above minor complaint about Symfony Panther’s heuristics for navigation that there was something called Laravel Dusk that I’d not really studied.

Just like Symfony Panther, Laravel Dusk is meant primarily as a tool for you to test your own web application, and not to scrape web sites you don’t own. But just the same, it also supports the idea of doing whatever you want with a PHP scraping tool.

There’s a Lot More To Do with PHP Web Scrapers

Web scraping with PHP is really only limited by your imagination. And complex human tests like reCAPTCHAs. But other than that, I like how this web scraping tutorial finished. PHP is a powerful language, and understanding how you can use it to harvest data from the web at large is well worth the effort.

Go forth and act responsibly, gathering data that the owner is hopefully OK with your web-scraping.

11 thoughts on “PHP Web Scraping: What to know before you start with Symfony Panther, Goutte, and more”

  1. Mike S. says:

    Hi, WPShout reader here. I’m confused what web scraping actually is, you don’t really spend much time describing it for us newbies.

    Wikipedia says this: “Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.”

    I’m confused why it would be frowned upon? Don’t all browsers basically do this when you visit a website?

    Thanks!

    • David says:

      Hi Mike,

      A good question. It’s exactly what a web browser does. But usually it’s assumed that a human is behind a web browser. And lots of “bots” do bad things–think comment spam, data theft, etc–so people don’t generally like bot traffic, because most of it is for neutral or negative aims.

      Does that make sense?

  2. SAM says:

    One of the rare quality articles on web scraping. What I liked is that you explained that modern websites use JavaScript and that is a problem for PHP when scraping.

    I’m now subscribed to your blog!

  3. henry yen says:

    i have been trying to get panther working. I started with ubuntu 18.04 a clean install of all necessary package.

    I try to implement the code above and are getting the following errors, which I can’t seem to find much info? any help? thanks

    PHP Fatal error: Uncaught Facebook\WebDriver\Exception\UnknownServerException: unknown error: Chrome failed to start: exited abnormally
    (unknown error: DevToolsActivePort file doesn’t exist)
    (The process started from chrome location /usr/bin/chromium-browser is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
    (Driver info: chromedriver=76.0.3809.68 (420c9498db8ce8fcd190a954d51297672c1515d5-refs/branch-heads/3809@{#864}),platform=Linux 4.15.0-62-generic x86_64) in /var/www/html/vendor/facebook/webdriver/lib/Exception/WebDriverException.php:114
    Stack trace:
    #0 /var/www/html/vendor/facebook/webdriver/lib/Remote/HttpCommandExecutor.php(331): Facebook\WebDriver\Exception\WebDriverException::throwException(13, ‘unknown error: …’, Array)
    #1 /var/www/html/vendor/facebook/webdriver/lib/Remote/RemoteWebDriver.php(144): Facebook\WebDriver\Remote\HttpCommandExecutor->execute(Object(Facebook\WebDriver\Remote\WebDriverCommand))
    #2 /var/www/html/vendor/symfony/panther/src/ProcessManager/ChromeManager.php(62): in /var/www/html/vendor/facebook/webdriver/lib/Exception/WebDriverException.php on line 114

    • David says:

      Hi Henry,

      Great question! Generally it seems that Chrome isn’t quite working, so Panther can’t find and use it. As to the specifics, I can’t really say from the error message. I’d maybe try to make sure that Chrome is up-to-date and try again. I realize that’s not a perfect answer, but your error message seems to suggest that it *is* installed (the first problem I thought of), so it’s just that it’s not working with Panther/the Facebook WebDriver that Panther is built on.

  4. I’m creating an app to get Order Tracking Update Data from a website that does not offer API…

    I have already made my app work locally… but as I was making my app live on my Centos OS server – it simply did not work…

    searched the Panther documentation and found that for Panther to work, it would need a locally installed chrome browser.

    how can I make my Symfony panther powered app to work on my CPanel?

  5. JR says:

    Web scraping is at its core what the web is for. Not everything has an API or distributes JSON openly. Data deserves to be freed. If you want it private, make it so, but don’t put the ethical burden on those who seek raw data to analyze. Mega corporations will sue, if they can find you. So, yes, be careful out there.

  6. George says:

    Thanks for the great tutorial. I have always set up random web scrapers over the years, but never felt too great about it. I think the best way to prevent errors is to set up some good unit tests and run weekly reports so you can see when page formats change.

  7. OK, excellent tutorial, in fact, one of the best.

    Of course, I can call the PHP function “get_class_methods”, (below) however,

    Where can I find a concise list (with spec) of all the methods from the returned objects?

    i.e.
    * $client
    * $crawler
    * $form
    etc.

    Patrick

    ***************
    *** CLIENT ***
    ***************

    Array
    (
    [0] => createChromeClient
    [1] => createFirefoxClient
    [2] => createSeleniumClient
    [3] => __construct
    [4] => getBrowserManager
    [5] => __destruct
    [6] => start
    [7] => getRequest
    [8] => getResponse
    [9] => followRedirects
    [10] => isFollowingRedirects
    [11] => setMaxRedirects
    [12] => getMaxRedirects
    [13] => insulate
    [14] => setServerParameters
    [15] => setServerParameter
    [16] => getServerParameter
    [17] => click
    [18] => submit
    [19] => refreshCrawler
    [20] => request
    [21] => back
    [22] => forward
    [23] => reload
    [24] => followRedirect
    [25] => restart
    [26] => getCookieJar
    [27] => waitFor
    [28] => getWebDriver
    [29] => get
    [30] => close
    [31] => getCurrentURL
    [32] => getPageSource
    [33] => getTitle
    [34] => getWindowHandle
    [35] => getWindowHandles
    [36] => quit
    [37] => takeScreenshot
    [38] => wait
    [39] => manage
    [40] => navigate
    [41] => switchTo
    [42] => execute
    [43] => findElement
    [44] => findElements
    [45] => executeScript
    [46] => executeAsyncScript
    [47] => getKeyboard
    [48] => getMouse
    [49] => followMetaRefresh
    [50] => xmlHttpRequest
    [51] => getHistory
    [52] => getCrawler
    [53] => getInternalResponse
    [54] => getInternalRequest
    [55] => clickLink
    [56] => submitForm
    )

    ******************
    *** CRAWLER ***
    ******************

    Array
    (
    [0] => __construct
    [1] => clear
    [2] => add
    [3] => addContent
    [4] => addHtmlContent
    [5] => addXmlContent
    [6] => addDocument
    [7] => addNodeList
    [8] => addNodes
    [9] => addNode
    [10] => eq
    [11] => each
    [12] => slice
    [13] => reduce
    [14] => first
    [15] => last
    [16] => siblings
    [17] => nextAll
    [18] => previousAll
    [19] => parents
    [20] => children
    [21] => attr
    [22] => nodeName
    [23] => text
    [24] => html
    [25] => evaluate
    [26] => extract
    [27] => filterXPath
    [28] => filter
    [29] => selectLink
    [30] => selectImage
    [31] => selectButton
    [32] => link
    [33] => links
    [34] => image
    [35] => images
    [36] => form
    [37] => setDefaultNamespacePrefix
    [38] => registerNamespace
    [39] => getNode
    [40] => getElement
    [41] => count
    [42] => getIterator
    [43] => click
    [44] => getAttribute
    [45] => getCSSValue
    [46] => getLocation
    [47] => getLocationOnScreenOnceScrolledIntoView
    [48] => getSize
    [49] => getTagName
    [50] => getText
    [51] => isDisplayed
    [52] => isEnabled
    [53] => isSelected
    [54] => sendKeys
    [55] => submit
    [56] => getID
    [57] => findElement
    [58] => findElements
    [59] => getUri
    [60] => getBaseHref
    [61] => matches
    [62] => closest
    [63] => outerHtml
    [64] => xpathLiteral
    )

    *************
    *** FORM ***
    *************

    Array
    (
    [0] => __construct
    [1] => getButton
    [2] => getElement
    [3] => getFormNode
    [4] => setValues
    [5] => getValues
    [6] => getFiles
    [7] => getMethod
    [8] => has
    [9] => remove
    [10] => set
    [11] => get
    [12] => all
    [13] => offsetExists
    [14] => offsetGet
    [15] => offsetSet
    [16] => offsetUnset
    [17] => getPhpValues
    [18] => getPhpFiles
    [19] => getUri
    [20] => getName
    [21] => disableValidation
    [22] => getNode
    )

  8. Peter says:

    Hey David,

    Interesting article! Thank you for sharing your knowledge here!

    I would love to hear what you think about my side-project phpscraper.de – it’s taking a bit a different approach to scraping. A stronger focus on reducing code and simply getting information out.

    Panther is one of the next points on the todo list. Unfortunately it isn’t a 100% a drop-in replacement…

    Peter
