Run Puppeteer in Docker with Chromium

By Filip on 10/05/2024

Learn how to create a Dockerfile for Puppeteer with Chromium to run headless browser automation tests in a containerized environment.

Introduction

This guide provides a step-by-step process for running Puppeteer, a Node library for controlling headless Chrome, within a Docker container. It begins by outlining the project setup, including creating a directory, initializing a Node.js project, and installing Puppeteer. The guide then details the creation of a Dockerfile, which defines the environment for the container. This involves selecting a base image, setting the working directory, copying project files, installing dependencies, exposing ports, and specifying the command to run the application. Instructions are provided for building the Docker image using the docker build command and running the container using the docker run command with appropriate flags. The guide emphasizes the importance of writing Puppeteer scripts that account for the headless nature of Chromium and potential sandbox restrictions within Docker. A basic example script is included. Additional considerations such as security, resource management, and debugging are also discussed. By following these steps, users can effectively utilize Puppeteer within Docker for various web automation tasks.

Step-by-Step Guide

Here's a comprehensive guide to running Puppeteer within a Docker container, addressing common challenges and incorporating best practices:

1. Setting Up Your Project:

  • Project Structure: Create a directory for your project and navigate to it in your terminal.
  • Package Initialization: Run npm init -y to initialize a new Node.js project and create a package.json file.
  • Install Dependencies: Install Puppeteer and any other required libraries using:
npm install puppeteer

2. Creating the Dockerfile:

  • Create a file named Dockerfile in your project's root directory.
  • Start with a base image that includes Node.js and the necessary dependencies. For example:
FROM node:16-alpine
  • Set the working directory within the container:
WORKDIR /app
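  • Tell Puppeteer to skip its own Chrome download and use the system Chromium instead. The Chrome build that Puppeteer downloads is linked against glibc and generally does not run on Alpine's musl libc. The path below is where Alpine's chromium package has typically placed the binary, so verify it in your image (older Puppeteer releases read PUPPETEER_SKIP_CHROMIUM_DOWNLOAD instead):
ENV PUPPETEER_SKIP_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser
  • Copy the package manifests and install dependencies first so Docker can cache that layer, then copy the rest of your project files:
COPY package*.json ./
RUN npm install
COPY . .
  • Install Chromium along with the fonts and shared libraries it needs:
RUN apk add --no-cache chromium \
    nss \
    freetype \
    harfbuzz \
    ca-certificates \
    ttf-freefont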
  • Expose the port you want your application to listen on (if applicable):
EXPOSE 3000
  • Define the command to run your application:
CMD ["node", "your-script.js"]

3. Building the Docker Image:

  • Open a terminal in your project's root directory.
  • Build the Docker image using the following command, replacing <image-name> with your desired name:
docker build -t <image-name> .

4. Running the Docker Container:

  • Start a Docker container from the image you built:
docker run -it --rm --shm-size=2gb <image-name>
  • Explanation of flags:
    • -it: Runs the container in interactive mode with a TTY.
    • --rm: Automatically removes the container when it exits.
    • --shm-size=2gb: Sets the size of /dev/shm, the shared memory Chrome uses. Docker's default is 64 MB, which Chrome can exhaust on heavier pages and then crash.
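
If you cannot pass --shm-size (for example on a hosted CI runner), Chromium's --disable-dev-shm-usage flag makes it write shared-memory files to /tmp instead of /dev/shm. The sketch below picks launch arguments accordingly. The helper name chromiumArgs and the SMALL_SHM environment variable are hypothetical and not part of Puppeteer or Docker:

```javascript
// Hypothetical helper: builds the Chromium argument list for puppeteer.launch().
// Pass smallShm: true when the container runs with a default-sized /dev/shm,
// i.e. without the --shm-size flag.
function chromiumArgs({ smallShm = false } = {}) {
  const args = ['--no-sandbox', '--disable-setuid-sandbox'];
  if (smallShm) {
    // Real Chromium flag: use /tmp instead of /dev/shm for shared-memory files.
    args.push('--disable-dev-shm-usage');
  }
  return args;
}

// SMALL_SHM is an assumed variable name; set it yourself,
// e.g. docker run -e SMALL_SHM=1 <image-name>
const args = chromiumArgs({ smallShm: process.env.SMALL_SHM === '1' });
console.log(args.join(' '));
```

The resulting array can be passed as the args launch option in place of the hard-coded list.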

5. Writing Your Puppeteer Script:

  • Create a JavaScript file (e.g., your-script.js) in your project directory.
  • Write your Puppeteer code, ensuring you handle Chromium's headless nature and potential sandbox restrictions within the Docker environment. Here's a basic example:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // ... your Puppeteer actions here ...
  await browser.close();
})();
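
The Dockerfile above installs Chromium through apk, so the script may need to point Puppeteer at that binary through the executablePath launch option rather than rely on a downloaded browser. Here is a minimal sketch of resolving the path. The helper name findChromium and the candidate paths are assumptions, so check where your image actually places the binary:

```javascript
const fs = require('node:fs');

// Assumed install locations for a distribution-packaged Chromium; adjust for your image.
const CANDIDATE_PATHS = ['/usr/bin/chromium-browser', '/usr/bin/chromium'];

// Hypothetical helper: returns the first path that exists, preferring the
// PUPPETEER_EXECUTABLE_PATH environment variable. Returns undefined when
// nothing is found, which leaves Puppeteer to its default browser lookup.
function findChromium(env = process.env, exists = fs.existsSync) {
  const candidates = [env.PUPPETEER_EXECUTABLE_PATH, ...CANDIDATE_PATHS];
  return candidates.find((p) => Boolean(p) && exists(p));
}

const launchOptions = {
  headless: true,
  executablePath: findChromium(),
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
};
console.log(launchOptions);
// In your script, pass the object on: puppeteer.launch(launchOptions)
```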

Additional Considerations:

  • Security: The --no-sandbox flag turns off Chromium's sandbox, so be cautious about which pages you load, especially in production environments. Run the container as a non-root user and follow security best practices for containerized applications.
  • Resource Management: Allocate sufficient resources (CPU, memory) to the container to avoid performance issues.
  • Debugging: Use console.log statements or debugging tools to troubleshoot any problems within the container.
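
On the debugging point: a container has no visible browser window, so it helps to forward what happens inside the page to the Node process output, where docker logs can show it. This minimal sketch uses Puppeteer's console, pageerror, and requestfailed page events. The helper name attachPageLogging is hypothetical:

```javascript
// Hypothetical helper: forwards browser-side events to the Node process output.
// `page` is a Puppeteer Page; `log` defaults to console.log.
function attachPageLogging(page, log = console.log) {
  // Messages the page itself writes with console.log/warn/error.
  page.on('console', (msg) => log(`[page ${msg.type()}] ${msg.text()}`));
  // Uncaught exceptions thrown inside the page.
  page.on('pageerror', (err) => log(`[page error] ${err.message}`));
  // Requests that failed at the network level (DNS, refused connection, ...).
  page.on('requestfailed', (req) => {
    const failure = req.failure();
    log(`[request failed] ${req.url()} ${failure ? failure.errorText : ''}`);
  });
}
```

Call attachPageLogging(page) right after browser.newPage(), before navigating, so that early messages are not missed.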

By following these steps and addressing potential challenges, you can effectively run Puppeteer within a Docker container for web scraping, automation, and other tasks.

Code Example

This JavaScript code uses Puppeteer within a Docker container to automate browser tasks. It opens a headless Chrome instance, navigates to Google, takes a screenshot, performs a search, and retrieves the page title. The code includes explanations of each step and tips for error handling and best practices.

Following the guide above, here's a basic JavaScript example for your-script.js that demonstrates how to use Puppeteer within a Docker container:

const puppeteer = require('puppeteer');

(async () => {
  // Launch Puppeteer with headless mode and sandbox disabled
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  try {
    const page = await browser.newPage();

    // Navigate to a website
    await page.goto('https://www.google.com');

    // Take a screenshot and save it as 'google.png'
    await page.screenshot({ path: 'google.png' });

    // Type "Puppeteer in Docker" into the search box. The attribute selector
    // matches the box whether the page renders it as <input> or <textarea>.
    await page.type('[name="q"]', 'Puppeteer in Docker');

    // Submit the search and wait for the results page. Start waiting before
    // pressing Enter so a fast navigation is not missed.
    await Promise.all([
      page.waitForNavigation(),
      page.keyboard.press('Enter')
    ]);

    // Get the title of the page
    const title = await page.title();
    console.log(`Title of the page: ${title}`);
  } finally {
    // Close the browser, even if a step above failed
    await browser.close();
  }
})();

Explanation:

  1. Launch Puppeteer: We launch Puppeteer with headless: true to run it without a visible browser window. The args are crucial for running Puppeteer within the Docker environment, disabling the sandbox for compatibility.
  2. Navigation and Screenshot: We navigate to Google's homepage and take a screenshot to demonstrate basic interaction.
  3. Search Interaction: We simulate typing a search query and submitting the form, showcasing how to interact with page elements.
  4. Page Title: We retrieve and log the page title to demonstrate data extraction.
  5. Closing Browser: Finally, we close the browser to end the session.

Remember:

  • This is a basic example. You can expand it to include more complex Puppeteer actions like clicking buttons, filling forms, and extracting data.
  • Ensure your Dockerfile and build process are set up as described in the guide before running this script.
  • Adjust the script based on your specific use case and website interactions.

Additional Tips:

  • Use try...catch blocks to handle potential errors gracefully.
  • Implement proper waiting mechanisms (e.g., waitForSelector, waitForNavigation) to ensure elements are loaded before interacting with them.
  • Consider using libraries like puppeteer-extra and its plugins for additional features and stealth options.
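
The first two tips can be combined in a small wrapper that always closes the browser and a task that waits for an element before reading it. This is a sketch rather than a prescribed pattern. The names withBrowser and readHeading are hypothetical, and the launch function is passed in so the wrapper is not tied to one Puppeteer configuration:

```javascript
// Hypothetical helper: runs `task` with a browser and always closes it,
// even when the task throws. `launch` is any function that returns a browser,
// for example () => puppeteer.launch({ args: ['--no-sandbox'] }).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Example task (assumes the target page has an <h1>): wait for the element
// to appear before reading it, instead of reading immediately after goto().
async function readHeading(browser, url) {
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  const heading = await page.waitForSelector('h1', { timeout: 10000 });
  return heading.evaluate((el) => el.textContent.trim());
}

// Usage, inside a script that has required puppeteer:
// withBrowser(() => puppeteer.launch({ args: ['--no-sandbox'] }),
//             (browser) => readHeading(browser, 'https://example.com'))
//   .then(console.log)
//   .catch((err) => { console.error(err); process.exitCode = 1; });
```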

Additional Notes

While the provided guide offers a solid foundation, here are some extra notes and troubleshooting tips to enhance your experience with Puppeteer in Docker:

Docker Considerations:

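  • Base Image Choices: While node:16-alpine is a popular choice for its small size, consider using node:16-slim if you encounter compatibility issues, since the Debian-based images use glibc rather than Alpine's musl. For more comprehensive tooling, node:16 might be suitable. Node 16 has reached end-of-life and recent Puppeteer releases require a newer Node.js version, so check Puppeteer's supported versions and prefer a current LTS tag for new projects.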
  • Resource Limits: If you're running resource-intensive tasks, adjust the --shm-size flag or allocate more CPU and memory to the container using Docker's resource constraints options.
  • Docker Compose: For managing multi-container applications, consider using Docker Compose to define and run your Puppeteer service alongside other services.

Puppeteer Tips:

  • Headless Chrome Options: Explore additional launch options like defaultViewport to set the screen size, userDataDir for persistent storage, and executablePath to specify a custom Chrome installation.
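  • Network Settings: If you're working with websites that require specific network configurations, pass Chromium's --proxy-server flag in the launch args to route traffic through a proxy, and call page.setExtraHTTPHeaders() to send additional headers with every request.
  • Error Handling: Implement robust error handling using try...catch blocks and Puppeteer's page events such as error (the page crashed) and pageerror (an uncaught exception inside the page) to gracefully handle unexpected situations.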
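
The launch options mentioned above can be assembled in one place. The sketch below maps environment variables onto real Puppeteer launch options (defaultViewport, userDataDir) and the Chromium --proxy-server flag. The helper name buildLaunchOptions, the variable names PROXY_URL and CHROME_PROFILE_DIR, and the proxy host are all assumptions for illustration:

```javascript
// Hypothetical helper: maps environment variables (names are assumptions)
// onto Puppeteer launch options and Chromium flags.
function buildLaunchOptions(env = process.env) {
  const args = ['--no-sandbox', '--disable-setuid-sandbox'];
  if (env.PROXY_URL) {
    // The proxy is a Chromium command-line flag, not a Puppeteer option.
    args.push(`--proxy-server=${env.PROXY_URL}`);
  }
  const options = {
    headless: true,
    defaultViewport: { width: 1280, height: 720 },
    args,
  };
  if (env.CHROME_PROFILE_DIR) {
    // Keep cookies and cache between runs, e.g. on a mounted volume.
    options.userDataDir = env.CHROME_PROFILE_DIR;
  }
  return options;
}

// proxy.internal is a made-up host used only to show the resulting shape.
console.log(buildLaunchOptions({ PROXY_URL: 'http://proxy.internal:3128' }));
```

Extra headers are set per page rather than at launch, for example with page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US' }).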

Troubleshooting Common Issues:

  • Chromium Launch Failures:
    • Ensure you've installed the necessary dependencies as mentioned in the guide.
    • Check if the --no-sandbox and --disable-setuid-sandbox flags are included in your launch options.
    • Verify that the shared memory size (--shm-size) is sufficient.
  • Timeout Errors:
    • Increase timeout values for navigation and element interactions using page.setDefaultTimeout().
    • Implement retry mechanisms for flaky actions.
  • Detection and Blocking:
    • Use tools like puppeteer-extra and its plugins to evade detection and blocking by websites.
    • Rotate user agents and IP addresses if necessary.
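
For the retry suggestion under Timeout Errors, here is a minimal sketch of a retry wrapper with a fixed delay between attempts. The helper name retry and its option names are hypothetical, so tune the attempt count and delay to your workload:

```javascript
// Hypothetical helper: retries an async action, waiting delayMs between attempts.
// Resolves with the first successful result; rethrows the last error otherwise.
async function retry(action, { attempts = 3, delayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      return await action(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Usage inside a Puppeteer script (page and url come from your own code):
// page.setDefaultTimeout(60000);
// await retry(() => page.goto(url), { attempts: 3, delayMs: 2000 });
```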

Advanced Usage:

  • Puppeteer Cluster: For parallel execution and improved performance, consider using puppeteer-cluster to manage multiple browser instances.
  • Stealth Plugins: Explore plugins like puppeteer-extra-plugin-stealth to make detection more difficult.
  • Custom Configurations: Tailor your Puppeteer setup with custom configurations and extensions to meet your specific needs.
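
puppeteer-cluster handles parallel execution for you. As a library-free illustration of the underlying idea (this is not puppeteer-cluster's API), the sketch below processes a list of items with a fixed number of concurrent workers. The helper name mapWithConcurrency is hypothetical:

```javascript
// Hypothetical helper: runs `worker` over `items` with at most `limit` calls
// in flight at once. Results keep the order of `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next;
      next += 1;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}

// Usage inside a Puppeteer script (browser and urls come from your own code):
// const titles = await mapWithConcurrency(urls, 3, async (url) => {
//   const page = await browser.newPage();
//   try {
//     await page.goto(url);
//     return await page.title();
//   } finally {
//     await page.close();
//   }
// });
```

Keeping the limit small matters in a container, since every open page adds to Chromium's memory use.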

By incorporating these additional notes and troubleshooting tips, you can further optimize your Puppeteer experience within Docker and tackle potential challenges effectively.

Summary

| Step | Action | Command/Notes |
|---|---|---|
| 1 | Project Setup | |
| | Initialize Node.js project | npm init -y |
| | Install Puppeteer | npm install puppeteer |
| 2 | Create Dockerfile | |
| | Choose base image (e.g., Node.js) | FROM node:16-alpine |
| | Set working directory | WORKDIR /app |
| | Copy project files & install dependencies | COPY, RUN npm install |
| | Install Puppeteer dependencies (fonts, libraries) | RUN apk add ... (chromium, nss, freetype, harfbuzz, etc.) |
| | Expose port (if needed) | EXPOSE 3000 |
| | Define command to run application | CMD ["node", "your-script.js"] |
| 3 | Build Docker Image | |
| | Build image with chosen name | docker build -t <image-name> . |
| 4 | Run Docker Container | |
| | Start container with options | docker run -it --rm --shm-size=2gb <image-name> |
| | Options explanation | -it: interactive, --rm: remove on exit, --shm-size: shared memory |
| 5 | Write Puppeteer Script | |
| | Create JavaScript file for Puppeteer code | your-script.js |
| | Handle headless mode & sandbox restrictions | headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox'] |
| | Include Puppeteer actions (navigation, scraping, etc.) | |
| Extra | Additional Considerations | |
| | Security best practices | Non-root user, container security |
| | Resource management | Allocate sufficient CPU/memory |
| | Debugging | console.log, debugging tools |

Conclusion

In conclusion, running Puppeteer in Docker offers a robust and efficient solution for web scraping, automation, and various browser-based tasks. By following the outlined steps, you can create a containerized environment that effectively executes your Puppeteer scripts while addressing potential challenges. Remember to consider security best practices, resource management, and debugging techniques to ensure smooth operation. With careful planning and implementation, Puppeteer in Docker empowers you to harness the capabilities of headless Chrome for a wide range of applications.
