Case study

Cloned-site takedown scraping workflow

Automated page discovery and content mapping for takedown support

A premium nursery retailer discovered that a cloned version of its WooCommerce site was appearing in Google search results. That immediately created both a brand-protection issue and an SEO problem, and the team needed a way to respond quickly.

The challenge was that the cloned site did not use readable URLs that could be matched back to the original site. Instead, it relied on random 18-digit numeric URLs with no obvious pattern, which made page-by-page mapping impractical. With more than 2,000 products involved, manually reviewing the site would have taken far too long.

Rather than trying to brute-force the work by hand, I built a custom scraping workflow that automated URL discovery and page-level data collection. The output gave the team structured data they could compare against an export from the original site, making it possible to map original URLs to cloned URLs and prepare a clean spreadsheet for the takedown process.

The challenge

This was more involved than simply crawling a site.

Because the cloned site used randomized numeric URLs, we could not rely on URL structure to match products and pages back to the original WooCommerce store. We needed another way to identify pages at scale using fields that were more stable and more human-readable, such as H1s and page titles.

At the same time, the work was urgent. The cloned site was already showing up in search results, and the catalog size meant a fully manual review would have been slow, tedious, and expensive in labor hours. The goal was not just to gather data, but to do it fast enough that the team could keep moving toward a takedown.

What I built

I built a Node.js and Puppeteer-based scraping pipeline that started from listing pages, extracted detail-page URLs, and then scraped blogs, products, and general pages into timestamped CSV exports. The workflow also logged failures, generated rerun lists, and captured troubleshooting artifacts, so incomplete runs could be reviewed and resumed instead of restarted from scratch.
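To give a feel for the discovery step, here is a simplified sketch. The real pipeline rendered listing pages with Puppeteer; this pure function stands in for the extraction stage, and the exact URL shape and function name are illustrative, inferred from the 18-digit pattern described above.

```javascript
// Hypothetical sketch: pull cloned-site detail URLs out of listing-page HTML.
// The production pipeline rendered pages with Puppeteer; this pure function
// stands in for the extraction step so it is easy to test in isolation.
function extractDetailUrls(html, baseUrl) {
  // The cloned site used random 18-digit numeric slugs, e.g. /123456789012345678
  const pattern = /href="([^"]*\/\d{18}\/?)"/g;
  const urls = new Set(); // Set dedupes repeated links across a listing page
  let match;
  while ((match = pattern.exec(html)) !== null) {
    // Resolve relative hrefs against the listing page's base URL
    urls.add(new URL(match[1], baseUrl).href);
  }
  return [...urls];
}

// Example listing-page fragment (made up for illustration)
const listingHtml = `
  <a href="/products/123456789012345678">Rose</a>
  <a href="/products/987654321098765432/">Fern</a>
  <a href="/about">About</a>
  <a href="/products/123456789012345678">Rose (duplicate)</a>
`;
console.log(extractDetailUrls(listingHtml, 'https://cloned-site.example/'));
```

In the real workflow the equivalent extraction ran per listing page, and the resulting detail URLs were fed to the scraping scripts.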

The scraper workflow included separate extraction and scraping scripts for blogs, products, and general pages, along with structured outputs such as CSV files, failure logs, rotating logs, and error screenshots. That made it much more useful than a one-off script, because it produced organized output the team could actually work from.

To make the results actionable, I exported a page list from the original site with fields we could compare against the scraper output, including page titles and H1s. From there, we used Excel to match the original pages to the cloned-site pages and assemble a final spreadsheet mapping original URLs to cloned URLs for takedown support.
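In spirit, that Excel matching step was a join on normalized text fields. A sketch of the same idea in code, with illustrative field names and the caveat that the real matching was done in a spreadsheet:

```javascript
// Normalize a title or H1 so minor formatting differences don't break the match
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Join original-site pages to cloned-site pages on normalized title, falling
// back to H1 — the same matching the team performed in Excel.
function mapOriginalToCloned(originalPages, clonedPages) {
  const byKey = new Map();
  for (const page of clonedPages) {
    byKey.set(normalize(page.title), page.url);
    byKey.set(normalize(page.h1), page.url);
  }
  return originalPages.map((page) => ({
    originalUrl: page.url,
    clonedUrl:
      byKey.get(normalize(page.title)) ?? byKey.get(normalize(page.h1)) ?? null,
  }));
}

// Illustrative inputs: an export from the original site and the scraper output
const originalPages = [
  { url: 'https://shop.example/climbing-roses', title: 'Climbing Roses', h1: 'Climbing Roses' },
  { url: 'https://shop.example/hardy-ferns', title: 'Hardy Ferns', h1: 'Hardy  Ferns' },
];
const clonedPages = [
  { url: 'https://clone.example/123456789012345678', title: 'Climbing Roses', h1: 'Climbing Roses' },
  { url: 'https://clone.example/987654321098765432', title: 'hardy ferns', h1: 'Hardy Ferns' },
];
console.log(mapOriginalToCloned(originalPages, clonedPages));
```

Unmatched rows come back with a `null` cloned URL, which flags them for manual review rather than silently dropping them.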

Why it was interesting

What made this project especially satisfying was the mix of urgency, scale, and practical problem-solving.

This was not a polished product feature or a long-planned internal platform. It was a fast-moving response to a real business problem. The team needed a repeatable way to gather and compare data across a very large WooCommerce catalog without sinking days of labor into manual review.

There was also a reliability challenge. As the scraper ran, the target server began throwing errors, which suggested the automation may have been getting flagged. To keep the job running, I adjusted the behavior by randomizing the headless browser profile and varying request timing. That made the process slower, but it made it dependable enough to run in the background overnight and keep producing useful output.
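The stabilization amounted to making each request look less uniform. A minimal sketch of the idea, where the user-agent strings and delay bounds are illustrative rather than the project's actual values:

```javascript
// Illustrative pool of desktop user-agent strings; the real run varied more
// of the headless browser profile than just this one header.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
];

function randomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Uniformly random delay between minMs and maxMs, so requests don't arrive
// on a fixed cadence the server can fingerprint.
function jitteredDelayMs(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage inside a Puppeteer scraping loop (page object assumed):
//   await page.setUserAgent(randomUserAgent());
//   await sleep(jitteredDelayMs(2000, 8000));
//   await page.goto(detailUrl, { waitUntil: 'domcontentloaded' });
```

Random per-request delays are what made the overall run slower, but they are also what let it keep producing output overnight without being cut off.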

That tradeoff was the right one. Raw speed was less important than getting a stable workflow that reduced labor hours and helped the team move toward a usable deliverable.

My role

I handled the technical implementation and helped shape how the workflow operated in practice, including:

  • Designing the scraping pipeline for listing-page extraction and detail-page collection
  • Structuring outputs so the data could be used in a comparison workflow
  • Extracting page-level identifiers such as titles, H1s, and IDs
  • Adding failure tracking, rerun support, and troubleshooting outputs
  • Adjusting scraper behavior when the target server began resisting requests
  • Supporting the handoff process by helping turn the exports into a usable URL-mapping spreadsheet

Outcome

The project was completed in about three days end to end, but it did not require three straight days of manual effort. Once the workflow was stabilized, much of the scraping ran in the background overnight, which kept the labor hours relatively low compared with the scale of the task.

The final output was a spreadsheet that mapped original URLs to cloned URLs so the team could support the takedown process with a clear page-by-page record. The cloned site was ultimately taken down.

Just as important, the project turned what could have been a painful manual review across thousands of pages into a structured, repeatable workflow. It solved an urgent problem quickly, reduced repetitive labor, and produced a deliverable the non-developer side of the process could actually use.