AI Parfums Scraper
Production-grade web intelligence system with anti-bot evasion, proxy rotation, and relational data storage.
Objective
Build an automated indexing service that continuously extracts product specifications (brand, olfactory notes, concentrations), prices, and historical review scores from perfume catalogs.
Anti-bot evasion
Target platforms use Cloudflare JavaScript challenges and IP profiling. I built a Playwright Stealth setup that randomizes user agents and mouse motion paths, adjusts its request rate dynamically, and rotates traffic across premium proxy nodes.
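A minimal sketch of how dynamic rate limiting and proxy rotation could fit together, assuming a simple policy: double the delay after a blocked response, relax it slowly after a success, and cycle through proxies round-robin. The class names, the status codes treated as blocks, and the delay values are illustrative assumptions, not the project's actual configuration.

```python
import itertools
import random
from dataclasses import dataclass, field


@dataclass
class AdaptiveThrottle:
    """Delay that grows after blocked responses and shrinks after successes.

    Hypothetical helper: the thresholds below are assumptions.
    """
    base_delay: float = 2.0   # seconds between requests when all is well
    max_delay: float = 60.0   # upper bound for the backoff
    delay: float = field(init=False)

    def __post_init__(self) -> None:
        self.delay = self.base_delay

    def record(self, status_code: int) -> None:
        # Assumption: 403, 429 and 503 indicate throttling or a challenge page.
        if status_code in (403, 429, 503):
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            self.delay = max(self.delay * 0.9, self.base_delay)

    def next_wait(self) -> float:
        # Add +/-25% jitter so requests are not evenly spaced.
        return self.delay * random.uniform(0.75, 1.25)


class ProxyRotator:
    """Round-robin selection over a fixed list of proxy addresses."""

    def __init__(self, proxies: list[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._cycle)
```

In a Playwright script, the value from next_proxy() could be passed as the "server" entry of the proxy option when creating a browser context, and next_wait() could feed the sleep between page loads.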
Data integrity
Scraped records pass through a validation layer that rejects malformed entries before they reach the database. A PostgreSQL UNIQUE constraint on (brand, name, concentration) prevents duplicate rows.
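The validation step might look roughly like the sketch below. It assumes raw records arrive as dictionaries keyed by the table's column names, and it derives its length and price limits from the column types in the schema. The function names and the exact rejection rules are assumptions for illustration.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Optional

MAX_PRICE = Decimal("100000000")  # NUMERIC(10, 2) allows 8 digits before the point


def _clean_notes(value: Any) -> Optional[list[str]]:
    """Return a list of trimmed notes, or None if the value is malformed."""
    if value is None:
        return []
    if not isinstance(value, (list, tuple)):
        return None
    return [str(note).strip() for note in value if str(note).strip()]


def validate_record(raw: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a cleaned record, or None if the entry should be rejected."""
    brand = str(raw.get("brand") or "").strip()
    name = str(raw.get("name") or "").strip()
    # brand and name are NOT NULL, VARCHAR(100) and VARCHAR(150).
    if not brand or not name or len(brand) > 100 or len(name) > 150:
        return None

    concentration = raw.get("concentration")
    if concentration is not None:
        concentration = str(concentration).strip() or None
        if concentration is not None and len(concentration) > 50:
            return None

    price = raw.get("price")
    if price is not None:
        try:
            price = Decimal(str(price))
            if not price.is_finite():
                return None
            price = price.quantize(Decimal("0.01"))
        except InvalidOperation:
            return None
        if not (Decimal("0") <= price < MAX_PRICE):
            return None

    notes = {}
    for key in ("top_notes", "heart_notes", "base_notes"):
        cleaned = _clean_notes(raw.get(key))
        if cleaned is None:
            return None
        notes[key] = cleaned

    return {"brand": brand, "name": name, "concentration": concentration,
            "price": price, **notes}
```

Returning None instead of raising keeps the scraping loop simple: rejected entries can be counted or logged and the run continues.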
CREATE TABLE perfumes (
    id            SERIAL PRIMARY KEY,
    brand         VARCHAR(100) NOT NULL,
    name          VARCHAR(150) NOT NULL,
    concentration VARCHAR(50),
    top_notes     TEXT[],
    heart_notes   TEXT[],
    base_notes    TEXT[],
    price         NUMERIC(10, 2),
    scraped_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT unique_profile UNIQUE (brand, name, concentration)
);
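One way a re-scraped product could be written without tripping the unique_profile constraint is an upsert. The sketch below only builds the statement and its parameters; it assumes a psycopg-style driver with %s placeholders and a cleaned record dictionary keyed by column name. The choice to refresh notes, price, and scraped_at on conflict is an assumption, not a documented behavior of this project.

```python
from typing import Any

# Insert a product, or refresh the mutable fields if the
# (brand, name, concentration) profile already exists.
UPSERT_SQL = """
INSERT INTO perfumes
    (brand, name, concentration, top_notes, heart_notes, base_notes, price)
VALUES (%s, %s, %s, %s, %s, %s, %s)
ON CONFLICT ON CONSTRAINT unique_profile
DO UPDATE SET
    top_notes   = EXCLUDED.top_notes,
    heart_notes = EXCLUDED.heart_notes,
    base_notes  = EXCLUDED.base_notes,
    price       = EXCLUDED.price,
    scraped_at  = CURRENT_TIMESTAMP
"""


def upsert_params(record: dict[str, Any]) -> tuple:
    """Order the record's fields to match the placeholders in UPSERT_SQL."""
    return (
        record["brand"],
        record["name"],
        record.get("concentration"),
        record.get("top_notes", []),
        record.get("heart_notes", []),
        record.get("base_notes", []),
        record.get("price"),
    )

# Hypothetical usage with an open cursor:
#     cursor.execute(UPSERT_SQL, upsert_params(record))
```

By default PostgreSQL treats NULL values as distinct in a UNIQUE constraint, so two rows with the same brand and name but a NULL concentration would not conflict. Whether that matters depends on how often concentration is missing in the scraped data.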
Result
More than 1,000 verified products indexed. The structured dataset is ready to be converted into vector embeddings, which would enable semantic product recommendations based on ingredient notes.