Web Crawler
Automate web data extraction within your workflows to collect structured insights from websites, APIs, and dynamic pages. Effortlessly scrape content, analyze search engine results pages, and integrate raw or processed data into automation pipelines for research, monitoring, and content generation at scale.
Core Features & Extraction Methods
AI-Powered Extraction
Describe the data you need in a natural language prompt and let AI parse and extract it, with no coding required. Well suited to complex or frequently changing website structures.
CSS/XPath Extraction
Use precise CSS selectors or XPath expressions for rule-based extraction from HTML structures, giving consistent, fast, and reliable results on well-structured websites.
API Extraction
Directly query backend APIs for structured JSON data, bypassing HTML parsing entirely for maximum efficiency and reliability when websites expose data through APIs.
Advanced Processing
Handle pagination automatically, render JavaScript-heavy sites, route through premium or stealth proxies, implement smart rate limiting, and process outputs through AI.
Extraction Methods Explained
Choose from three extraction methods, each suited to different scraping scenarios. Knowing when to use each one improves both your scraping success rate and efficiency.
1. AI-Powered Extraction
Let artificial intelligence understand your data needs through natural language prompts. This method excels when website structures change frequently, when you need to extract complex nested data, or when you want to quickly prototype scrapers without studying HTML structure.
Input your target URLs, adding multiple pages if needed. Include pagination placeholders like {page} for automatic crawling.
Select extraction mode: Single Extraction for direct scraping, or Wizard Mode for two-step discovery and extraction.
Configure your AI prompt with natural language instructions or define structured fields using JSON schema.
Optionally limit extraction to a specific CSS selector to focus the AI's attention and improve accuracy.
Enable advanced features: JavaScript rendering for dynamic content, Premium/Stealth proxies for protected sites, Auto-pagination for entire result sets.
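The {page} placeholder mentioned in the first step can be pictured as a simple template substitution. The sketch below is only an illustration of that idea, not the node's internal logic; the function name `expand_page_urls` and the example URL are assumptions.

```python
def expand_page_urls(template: str, first_page: int, last_page: int) -> list[str]:
    """Replace the {page} placeholder with each page number in the range."""
    if "{page}" not in template:
        # No placeholder: treat the template as a single fixed URL.
        return [template]
    return [
        template.replace("{page}", str(number))
        for number in range(first_page, last_page + 1)
    ]


# Hypothetical listing URL used purely for illustration.
urls = expand_page_urls("https://example.com/products?page={page}", 1, 3)
for url in urls:
    print(url)
```

Using `str.replace` rather than `str.format` keeps any other braces in the URL untouched.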
2. CSS/XPath Extraction
Use CSS selectors or XPath expressions to extract data from HTML precisely and quickly. This rule-based approach works best for stable, well-structured websites where you can identify consistent patterns.
Input your target URLs with optional pagination placeholders for complex URL patterns.
Load pre-built templates for common scenarios like e-commerce, news articles, social media posts, or job listings.
Add extraction rules for each field: specify name, CSS selector or XPath, attribute to extract, data processing, and multi-value handling.
Enable advanced options: JavaScript rendering for dynamic content, Premium/Stealth proxies for anti-scraping measures, Auto-pagination with configurable limits.
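To make the extraction-rule fields above concrete (name, selector, attribute, processing, multi-value handling), here is a minimal sketch using only the Python standard library. The rule dictionary layout, the sample markup, and the `apply_rules` helper are assumptions for illustration and do not reflect the node's actual rule format. Note that `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, so real-world HTML would need a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a fetched product page.
PAGE = """
<html><body>
  <div class="product">
    <h2 class="title"> Blue Kettle </h2>
    <a class="link" href="/p/1">details</a>
    <span class="tag">kitchen</span><span class="tag">sale</span>
  </div>
</body></html>
""".strip()

# Each rule carries a field name, an XPath expression, an optional attribute
# to read, an optional processing function, and a multi-value flag.
RULES = [
    {"name": "title", "xpath": ".//h2[@class='title']", "attr": None,
     "process": str.strip, "multiple": False},
    {"name": "url", "xpath": ".//a[@class='link']", "attr": "href",
     "process": None, "multiple": False},
    {"name": "tags", "xpath": ".//span[@class='tag']", "attr": None,
     "process": str.strip, "multiple": True},
]


def apply_rules(root, rules):
    """Apply every rule to the parsed document and return one record."""
    record = {}
    for rule in rules:
        values = []
        for node in root.findall(rule["xpath"]):
            # Read the named attribute if given, otherwise the element text.
            raw = node.get(rule["attr"]) if rule["attr"] else (node.text or "")
            if raw is None:
                continue
            values.append(rule["process"](raw) if rule["process"] else raw)
        if rule["multiple"]:
            record[rule["name"]] = values
        else:
            record[rule["name"]] = values[0] if values else None
    return record


record = apply_rules(ET.fromstring(PAGE), RULES)
print(record)
```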
3. API Extraction
Query backend APIs directly to access structured JSON data at the source, avoiding the fragility of HTML parsing. This approach is typically faster and more reliable than scraping HTML, and less likely to break when website designs change.
Input the API endpoint URL with placeholders like {page}. Discover endpoints using the Network tab in your browser's Developer Tools.
Select HTTP method: typically GET for data retrieval or POST for complex queries.
Configure headers (JSON format) for authentication tokens or API keys. Set query parameters or request body as needed.
Set the results path to specify where in the JSON response to find your data (e.g., 'data.products').
Choose pagination strategy: Number-based for simple increments, or Response-based for cursor tokens from API responses.
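The results path and response-based pagination steps can be sketched as follows. This is an illustrative model only: the helper names, the `meta.next_cursor` field, and the in-memory fake API are assumptions, and a real endpoint will name its cursor field differently.

```python
from typing import Any, Callable, Optional


def resolve_path(payload: Any, path: str) -> Any:
    """Walk a dotted path such as 'data.products' through nested dicts."""
    current = payload
    for key in path.split("."):
        current = current[key]
    return current


def collect_with_cursor(
    fetch_page: Callable[[Optional[str]], dict],
    results_path: str,
    cursor_path: str,
    max_pages: int = 10,
) -> list:
    """Response-based pagination: follow the cursor each response returns."""
    items = []
    cursor = None  # The first request carries no cursor.
    for _ in range(max_pages):
        response = fetch_page(cursor)
        items.extend(resolve_path(response, results_path))
        cursor = resolve_path(response, cursor_path)
        if not cursor:
            break  # No further cursor means the last page was reached.
    return items


# Hypothetical API responses keyed by the cursor that requests them.
FAKE_PAGES = {
    None: {"data": {"products": [{"id": 1}, {"id": 2}]},
           "meta": {"next_cursor": "abc"}},
    "abc": {"data": {"products": [{"id": 3}]},
            "meta": {"next_cursor": None}},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    """Stand-in for an HTTP call so the sketch runs offline."""
    return FAKE_PAGES[cursor]


products = collect_with_cursor(fake_fetch, "data.products", "meta.next_cursor")
print(products)
```

Number-based pagination is the simpler case: the {page} value is incremented until a page returns no results or a configured limit is reached.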
Step-by-Step Configuration Guide
Follow this systematic approach to configure your Web Crawler Node correctly from the start. Each step builds on the previous one.
Adding the Node
From the right-side Tools menu in the Playground interface, select the Web Crawler Node and drag it onto your workflow canvas. Assign a descriptive title that clearly indicates what this scraper does, such as 'E-commerce Product Scraper' or 'News Article Extractor'.
Method Selection
Choose your extraction method based on your target website's characteristics. Use AI-Powered for complex or changing sites. Choose CSS/XPath for stable, well-structured sites. Select API extraction when you've identified backend APIs.
Method-Specific Configuration
Configure parameters specific to your chosen method. For AI-Powered: craft clear prompts or define schemas. For CSS/XPath: identify selectors and configure processing. For API: map request structure and authentication.
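For the AI-Powered method, a structured field definition might look like the sketch below. It uses the general JSON Schema style as an assumption; the exact schema format the node accepts is not specified here, and the field names and the `check_record` helper are hypothetical.

```python
import json

# Hypothetical schema in general JSON Schema style; field names are examples.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Product name as shown on the page"},
        "price": {"type": "number", "description": "Current price without currency symbol"},
        "in_stock": {"type": "boolean", "description": "Whether the item can be ordered"},
    },
    "required": ["name", "price"],
}

TYPE_MAP = {"string": str, "boolean": bool}


def matches_type(value, type_name):
    """Check a value against a schema type name."""
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, TYPE_MAP[type_name])


def check_record(record, schema):
    """Return a list of problems found in one extracted record."""
    problems = []
    for field in schema["required"]:
        if field not in record:
            problems.append(f"missing required field: {field}")
    for field, spec in schema["properties"].items():
        if field in record and not matches_type(record[field], spec["type"]):
            problems.append(f"wrong type for {field}")
    return problems


print(json.dumps(PRODUCT_SCHEMA, indent=2))
print(check_record({"name": "Blue Kettle", "price": 24.5}, PRODUCT_SCHEMA))
```

Checking a couple of test records against the schema in this way is one option for catching missing or mistyped fields before a full run.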
Workflow Connections
The Web Crawler operates independently without input connections. Configure output connections to an Assistant Node for analysis, a Vector Store for embeddings, a File Base for storage, or direct exports. Results are also accessible in the Bucket menu.
Test, Save, and Deploy
Test your configuration with one or two pages first to validate selectors, prompts, or API calls, and check the results carefully. Once confident, save your settings and click Start Extraction to run the full scrape. Review completed extractions in the Bucket menu.
Common Use Cases
Product Research & Competitive Pricing
Monitor competitor prices, product availability, and new launches across multiple e-commerce platforms. Extract product details and feed to AI for trend analysis and price optimization.
Content Aggregation & News Monitoring
Use AI Wizard Mode to discover and extract articles from news sites and blogs. Generate summaries, perform sentiment analysis, and create automated news digests for market research.
API-Driven Lead Generation
Query classified ads APIs or job boards to pull listings with sophisticated filtering. Extract contact information and company details for CRM integration and sales pipeline automation.
Market Monitoring & Sentiment Analysis
Extract reviews, comments, and user-generated content from multiple platforms. Process through AI to classify sentiment, identify themes, and generate alerts for reputation management.
Best Practices for Successful Web Scraping
- Start with small test runs of just one or two pages to validate your configuration before launching full scrapes. This conserves credits and helps identify problems early.
- Combine extraction methods strategically based on each site's characteristics. Use AI-Powered for complex sites, CSS/XPath for structured sites, and API extraction whenever possible.
- Enable proxies according to the target site's protection level: Premium proxies for most commercial websites, Stealth proxies for heavily guarded sites.
- Use Wizard Mode for multi-level scraping where you first discover item URLs, then extract detailed information from each item page.
- Cache results to avoid re-scraping unnecessarily. Limit extractions to essential fields and use CSS selectors to narrow scope, reducing processing time and costs.
- Leverage pre-built templates and browser Developer Tools for quick selector discovery and API endpoint identification. Process outputs through Assistant Node for cleaning and enrichment.
- Always respect website terms of service and robots.txt directives. Use appropriate rate limiting to avoid overloading servers. Verify you have permission before scraping.
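The robots.txt and rate-limiting advice in the last point can be checked in advance with Python's standard `urllib.robotparser`. The sketch below is a rough illustration that parses an inline, hypothetical robots.txt instead of fetching one; the user agent name, URLs, and `throttle` helper are assumptions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed_urls(urls, user_agent="example-bot"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def throttle(urls, delay_seconds, sleep=time.sleep):
    """Yield URLs one at a time, pausing between consecutive items."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)
        yield url


candidates = [
    "https://example.com/products",
    "https://example.com/private/report",
]
to_fetch = allowed_urls(candidates)

# Fall back to one second when the site declares no Crawl-delay.
delay = parser.crawl_delay("example-bot") or 1.0

for url in throttle(to_fetch, delay):
    print("would fetch", url)
```

A robots.txt check covers only part of the picture; a site's terms of service may restrict automated access even where robots.txt does not.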
Integration & Output Options
The Web Crawler Node operates independently as an output-only node that forwards structured results in JSON or CSV format to other workflow components.
- Assistant Nodes: Process and analyze content using AI
- Vector Store: Create searchable embeddings for semantic search
- File Base: Organized storage and retrieval
- Direct Download: Export as JSON or CSV from Bucket menu
- Google Sheets: Reporting and data tracking
- Telegram: Notifications and alerts
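If a downstream tool expects CSV while the results are in JSON, the conversion can be done with a short script. This is a minimal sketch under the assumption that the export is a JSON array of flat-ish records; the sample data and the `json_to_csv` helper are hypothetical.

```python
import csv
import io
import json

# Hypothetical exported results: a JSON array of records.
records_json = (
    '[{"name": "Blue Kettle", "price": 24.5},'
    ' {"name": "Red Mug", "price": 7.0, "tags": ["sale"]}]'
)


def json_to_csv(text: str) -> str:
    """Convert a JSON array of objects into CSV text."""
    records = json.loads(text)
    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Nested lists and dicts are stored as JSON strings in a single cell.
        row = {
            key: json.dumps(value) if isinstance(value, (list, dict)) else value
            for key, value in record.items()
        }
        writer.writerow(row)  # Missing keys are written as empty cells.
    return buffer.getvalue()


csv_text = json_to_csv(records_json)
print(csv_text)
```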
Important: Always respect website terms of service and robots.txt directives when scraping. Use appropriate rate limiting to avoid overloading servers. Some websites prohibit automated access, so verify you have permission. Web scraping laws vary by jurisdiction, so consult legal counsel if unsure.