A crawler with PHP and WebDriver

Required tools

  1. ChromeDriver, the WebDriver for Chrome (download link; a VPN may be needed to reach it)
  2. Composer
  3. PHP 7

Preparation

  • Start ChromeDriver
chromedriver.exe --port=9515
  • Install the required dependencies
composer require php-webdriver/webdriver  # WebDriver client library
composer require guzzlehttp/guzzle # package for sending HTTP requests

Approach

  1. Fetch the target page
  2. Match the DOM nodes that hold the data
  3. Extract the data and save it

Initializing ChromeDriver

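require(__DIR__.'/vendor/autoload.php');
use Facebook\WebDriver\Chrome\ChromeOptions; // needed for the ChromeOptions class used below
use Facebook\WebDriver\Remote\{RemoteWebDriver, DesiredCapabilities};
use Facebook\WebDriver\WebDriverBy;
use GuzzleHttp\Client;

// 1. Initialize ChromeDriver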
$options = (new ChromeOptions)->addArguments([
    // '--disable-gpu',
    '--headless'
]);

$host = 'http://localhost:9515';
$driver = RemoteWebDriver::create($host, DesiredCapabilities::chrome()->setCapability(
    ChromeOptions::CAPABILITY,
    $options
));
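
RemoteWebDriver::create() fails if nothing is listening on the port yet, so it can help to check the driver first. The following is a minimal sketch, assuming ChromeDriver answers the W3C WebDriver GET /status command with a JSON body containing a value.ready flag; the function names isDriverReady and waitForDriver are hypothetical helpers, not part of php-webdriver.

```php
<?php
// Returns true when a WebDriver /status response body reports readiness.
// Assumption: the body looks like {"value": {"ready": true, "message": "..."}}.
function isDriverReady(string $json): bool
{
    $data = json_decode($json, true);
    return is_array($data) && ($data['value']['ready'] ?? false) === true;
}

// Polls the status endpoint until the driver is ready or the timeout expires.
function waitForDriver(string $host, int $timeoutSeconds = 10): bool
{
    $deadline = time() + $timeoutSeconds;
    do {
        // @ hides the warning raised while nothing is listening yet.
        $body = @file_get_contents($host . '/status');
        if ($body !== false && isDriverReady($body)) {
            return true;
        }
        sleep(1);
    } while (time() < $deadline);
    return false;
}
```

A call such as waitForDriver('http://localhost:9515') placed before RemoteWebDriver::create() would then let the script stop early with a clear message when ChromeDriver has not been started.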

Target site

https://www.mzitu.com/

Fetching the target page

$driver->get('https://www.mzitu.com/');

Matching the DOM nodes and extracting the data (CSS selectors are used here)

$i = 1;
$fetchData = [];
while (true) {
    $el = $driver->findElements(WebDriverBy::cssSelector("#pins > li:nth-child($i) > a"));
    if (!count($el)) {
        break;
    }
    $title = $driver->findElement(WebDriverBy::cssSelector("#pins > li:nth-child($i) > a > img"))->getAttribute('alt');
    $detailLink = $driver->findElement(WebDriverBy::cssSelector("#pins > li:nth-child($i) > a"))->getAttribute('href');

    $fetchData[] = [$title, $detailLink];
    $i++;
}
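
To see what these selectors pick out without starting a browser, the same extraction can be tried offline. This is only an illustrative sketch: it uses PHP's built-in DOM extension and an XPath expression instead of WebDriver, and the markup in $html is a hypothetical fragment shaped like the list the selectors assume, not a copy of the real page. The function name extractPins is also hypothetical.

```php
<?php
// Hypothetical markup: a list with id "pins" whose items wrap an image in a link.
$html = <<<HTML
<ul id="pins">
  <li><a href="https://example.com/1"><img alt="First" src="a.jpg"></a></li>
  <li><a href="https://example.com/2"><img alt="Second" src="b.jpg"></a></li>
</ul>
HTML;

// Returns [title, link] pairs, the same shape as $fetchData above.
function extractPins(string $html): array
{
    $doc = new DOMDocument();
    // @ suppresses parser warnings that imperfect HTML would trigger.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $rows = [];
    // XPath counterpart of the CSS selector "#pins > li > a".
    foreach ($xpath->query('//*[@id="pins"]/li/a') as $a) {
        $img = $xpath->query('./img', $a)->item(0);
        $rows[] = [$img ? $img->getAttribute('alt') : '', $a->getAttribute('href')];
    }
    return $rows;
}

print_r(extractPins($html));
```

Here one query returns every link at once, so no nth-child counter is needed; with WebDriver, a single findElements call on "#pins > li > a" would play the same role.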

Fetching the detail pages and saving the images

foreach ($fetchData as $fetch) {
    list($title, $detailLink) = $fetch;
    print_r([$title, $detailLink]);
    try {
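        // Create a folder named after the title
        // Root directory for images
        $rootDir = __DIR__ . '/imgs';
        if (!file_exists($rootDir)) {
            mkdir($rootDir, 0777, true); // mkdir(path, permissions, recursive)
        }
        // Directory for this gallery; characters that are invalid in file names are replaced
        $imgDir = $rootDir . '/' . preg_replace('#[\\\\/:*?"<>|]#', '_', $title);
        if (!file_exists($imgDir)) {
            mkdir($imgDir, 0777, true);
        }

        // Load the detail page
        $driver->get($detailLink);
    } catch (Throwable $e) {
        echo $e->getMessage() . PHP_EOL;
        echo 'Waiting 6 seconds' . PHP_EOL;
        sleep(6);
    }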

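    // Read the highest page number from the pagination links
    try {
        $pageMax = (int) $driver->findElement(WebDriverBy::xpath('//div[@class="pagenavi"]/a[last()-1]'))->getText();
    } catch (Throwable  $e) {
        echo $e->getMessage() . PHP_EOL;
        echo $detailLink . PHP_EOL;
        continue; // without a page count there is nothing to download; skip this gallery
    }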
    $http = new Client(['verify' => false]);
    $fileName = md5($detailLink);
    for ($page = 1; $page <= $pageMax; $page++) {
        $imgPath = $imgDir . '/' . $fileName . '_' . $page . '.jpg';
        if (file_exists($imgPath)) {
            echo $imgPath . ' already exists' . PHP_EOL;
            continue;
        }
        try {
            $imgLink = $detailLink . '/' . $page;
            $driver->get($imgLink);
            // Find the URL of the target image
            $src = $driver->findElement(WebDriverBy::xpath('//div[@class="main-image"]//img'))->getAttribute('src');
            // Download the image and save it ('sink' replaces the older 'save_to' option name)
            $http->get($src, ['sink' => $imgPath, 'headers' => ['referer' => $imgLink, 'user-agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36']]);
        } catch (Throwable $e) {
            echo $e->getMessage();
            echo PHP_EOL . 'Waiting 5 seconds' . PHP_EOL;
            sleep(5);
            $page = $page - 1; // step back so the loop retries the same page
        }
    }
}
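
The inner loop retries a failed page by stepping $page back, which repeats forever if a page keeps failing. One possible alternative is a bounded retry. The sketch below is an assumption about how that could look, not part of the original script; retry is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```php
<?php
/**
 * Calls $task until it succeeds or $maxAttempts attempts have been made.
 * Returns the task's result, or rethrows the last failure.
 */
function retry(callable $task, int $maxAttempts = 3, int $delaySeconds = 5)
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $task();
        } catch (Throwable $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // give up and let the caller decide what to do
            }
            echo $e->getMessage() . PHP_EOL;
            sleep($delaySeconds);
        }
    }
}
```

Inside the page loop, the page load and the image download could be wrapped in a closure passed to retry(), with the surrounding catch block logging the failure and moving on to the next page instead of decrementing $page.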

Closing the driver

$driver->quit(); // quit() ends the whole session; close() only closes the current window