Building a Simple Website Crawler with Node.js

Webmaster

May 28, 2023, 23:45 · 17 views

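This article shows how to use two third-party Node.js modules, axios and cheerio, to build a program that crawls a website for image, video, and audio files. The program walks through the pages of the site and downloads the image, video, and audio files it finds into designated folders.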

Installing the dependencies

Before writing any code, install the required modules. They can be installed with npm, the Node.js package manager:

npm install axios cheerio

Here, axios is used to send HTTP requests and cheerio is used to parse HTML pages.

Implementation

The Node.js code is as follows:

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');
const path = require('path');

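// Site to crawl
const website = 'https://example.com';

// Collected image, video, and audio URLs
const images = [];
const videos = [];
const audios = [];

// Pages already crawled, so that no page is fetched twice
const visited = new Set();

// Fetch a page, collect its media links, and follow links to other pages of the same site
async function getLinks(url) {
  const pageUrl = new URL(url).href; // normalized form of the page address
  if (visited.has(pageUrl)) return;
  visited.add(pageUrl);

  try {
    const response = await axios.get(pageUrl);
    const $ = cheerio.load(response.data);

    // All links on the page
    const links = $('a');
    for (let i = 0; i < links.length; i++) {
      const rawHref = $(links[i]).attr('href');
      if (!rawHref) continue; // skip <a> elements without an href

      // Resolve relative links against the current page
      let target;
      try {
        target = new URL(rawHref, pageUrl);
      } catch {
        continue; // skip malformed URLs
      }
      if (target.protocol !== 'http:' && target.protocol !== 'https:') continue;
      target.hash = '';
      const href = target.href;
      const pathname = target.pathname.toLowerCase();

      // Image, video, and audio files go into the matching array
      if (pathname.endsWith('.jpg') || pathname.endsWith('.png') || pathname.endsWith('.gif')) {
        images.push(href);
      } else if (pathname.endsWith('.mp4') || pathname.endsWith('.avi') || pathname.endsWith('.wmv')) {
        videos.push(href);
      } else if (pathname.endsWith('.mp3') || pathname.endsWith('.wav') || pathname.endsWith('.ogg')) {
        audios.push(href);
      } else if (target.origin === new URL(website).origin) {
        // Any other link on the same site is treated as a page and crawled recursively
        await getLinks(href);
      }
    }
  } catch (error) {
    console.error(error);
  }
}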

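// Download one image, video, or audio file into the given directory
async function downloadFile(url, directory) {
  try {
    const response = await axios.get(url, { responseType: 'stream' });

    // Build the save path from the last segment of the URL path (query string excluded)
    const filename = path.basename(new URL(url).pathname);
    const filePath = path.join(directory, filename);

    // Write the file to disk and wait until the write has finished
    const writer = fs.createWriteStream(filePath);
    await new Promise((resolve, reject) => {
      writer.on('finish', resolve);
      writer.on('error', reject);
      response.data.on('error', reject);
      response.data.pipe(writer);
    });
    console.log(`Downloaded ${filename}`);
  } catch (error) {
    console.error(error);
  }
}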

// Create the directories for images, videos, and audio files
const imagesDir = 'images';
const videosDir = 'videos';
const audiosDir = 'audios';
for (const dir of [imagesDir, videosDir, audiosDir]) {
  // With recursive: true, mkdirSync does nothing if the directory already exists
  fs.mkdirSync(dir, { recursive: true });
}

// Run the crawler
getLinks(website).then(() => {
  // Download all images
  Promise.all(images.map((image) => downloadFile(image, imagesDir)))
         .then(() => console.log('All images downloaded.'));

  // Download all videos
  Promise.all(videos.map((video) => downloadFile(video, videosDir)))
         .then(() => console.log('All videos downloaded.'));

  // Download all audio files
  Promise.all(audios.map((audio) => downloadFile(audio, audiosDir)))
         .then(() => console.log('All audios downloaded.'));
});

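In the code above, we first define the address of the site to crawl and the arrays that hold the image, video, and audio links. Next comes the getLinks function, which uses axios to send an HTTP request for a page and then parses the returned HTML with cheerio. During parsing, a jQuery-style selector picks out every link (a element) on the page, and each link is checked by file extension to see whether it points to an image, video, or audio file; if so, the link is added to the matching array. If the link points to another page of the site instead, getLinks is called recursively on it so that the whole site is traversed.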
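The link-handling step can be isolated into a small helper that is easy to test on its own. The following is a minimal sketch, not part of the program above: the names classifyLink and MEDIA_TYPES are hypothetical, and it assumes links should be resolved against the page they were found on and that only pages with the same origin as the site should be followed.

```javascript
const path = require('path');

// Hypothetical extension table; the lists mirror the extensions checked in getLinks.
const MEDIA_TYPES = {
  image: ['.jpg', '.png', '.gif'],
  video: ['.mp4', '.avi', '.wmv'],
  audio: ['.mp3', '.wav', '.ogg'],
};

// Resolve href against the page it was found on and classify it.
// Returns { type, url }, or null when the link cannot be crawled.
function classifyLink(href, pageUrl, siteUrl) {
  if (!href) return null; // <a> element without an href attribute

  let target;
  try {
    target = new URL(href, pageUrl); // also handles relative links such as "/a.png"
  } catch {
    return null; // malformed URL
  }
  if (target.protocol !== 'http:' && target.protocol !== 'https:') return null;

  // Look at the path only, so "?v=2" or "#top" does not hide the extension.
  const ext = path.extname(target.pathname).toLowerCase();
  for (const [type, extensions] of Object.entries(MEDIA_TYPES)) {
    if (extensions.includes(ext)) return { type, url: target.href };
  }

  // Anything else counts as a page; only same-site pages should be followed.
  const sameSite = target.origin === new URL(siteUrl).origin;
  return { type: sameSite ? 'page' : 'external', url: target.href };
}

console.log(classifyLink('/img/logo.PNG?v=2', 'https://example.com/blog/', 'https://example.com'));
console.log(classifyLink('about.html', 'https://example.com/blog/', 'https://example.com'));
```

A crawler loop could then push the returned url into the array for its type and recurse only when the type is 'page'.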

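After getLinks, we define a downloadFile function. It uses axios to download the file at a given link and saves it into the given directory. The file name is taken from the URL and joined with the directory to form the save path.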
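One detail worth illustrating is how to know when a streamed download has actually reached the disk. The sketch below is one possible approach and is not taken from the program above: fileNameFromUrl and saveStream are hypothetical helper names, the 'index' fallback name is an assumption, and pipeline from stream/promises requires Node.js 15 or later.

```javascript
const fs = require('fs');
const path = require('path');
const { pipeline } = require('stream/promises');

// Derive a file name from a URL, ignoring the query string and fragment.
// Falls back to a hypothetical default name when the path has no last segment.
function fileNameFromUrl(url, fallback = 'index') {
  const name = path.basename(new URL(url).pathname);
  return name || fallback;
}

// Pipe any readable stream into directory/filename.
// The returned promise settles only after all data is written or an error occurs.
async function saveStream(readable, directory, filename) {
  const filePath = path.join(directory, filename);
  await pipeline(readable, fs.createWriteStream(filePath));
  return filePath;
}

console.log(fileNameFromUrl('https://example.com/media/song.mp3?token=abc'));
```

Inside a download function, the response stream returned by axios (response.data when responseType is 'stream') could be passed to saveStream, so that the "Downloaded" message is logged only after the file is complete.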

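Finally, we create the directories for images, videos, and audio files and, once the crawl has finished, call downloadFile to download every collected image, video, and audio file.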
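On a large site, starting every download at once can open a great many connections. As a hedged alternative, the sketch below runs downloads in fixed-size batches; chunk and runInBatches are hypothetical helpers, and the batch size is an assumption to tune per site.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Run `worker` over `items`, at most `batchSize` at a time.
// Promise.allSettled waits for the whole batch even if some items fail.
async function runInBatches(items, batchSize, worker) {
  const results = [];
  for (const batch of chunk(items, batchSize)) {
    const settled = await Promise.allSettled(batch.map((item) => worker(item)));
    results.push(...settled);
  }
  return results;
}

console.log(chunk(['a.jpg', 'b.jpg', 'c.jpg'], 2));
```

With the program above, a call such as runInBatches([...new Set(images)], 5, (url) => downloadFile(url, imagesDir)) would also skip duplicate links before downloading.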

Summary

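In this article, we showed how to use the third-party axios and cheerio modules in Node.js to build a program that crawls a website for image, video, and audio files. In practice, the code needs to be adapted to the target site and to what you want to crawl, and privacy and copyright issues must be kept in mind.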