Scanning takes forever

Started by ljs, June 10, 2015, 11:48:46 PM

ljs

Same as many others here. I left the scan running overnight and stopped it this morning after 14 hours to check the results so far. It looks like every possible URL has been scanned.

The number of 'Jobs waiting in crawler engine' keeps growing and A1 wants to keep going and going. I'm not sure what to do.

Webhelpforums

The best thing to do in such a situation is to stop the scan and inspect the URLs collected.

Remember that you can resume scans later:
http://www.microsystools.com/products/sitemap-generator/help/sitemap-generator-resume-scan/

Anyhow, most likely your problem is that your website generates an infinite / near-infinite number of URLs.

From http://www.microsystools.com/products/sitemap-generator/help/creating-sitemaps-large-websites/ help page:

Quote: List of things to check:

A) Check if your website is generating an infinite number of unique URLs. If it does, the crawler will never stop, as new unique page URLs are found all the time. A good method to discover and solve these kinds of problems is to:


  • Start a website scan.
  • Stop the website scan after e.g. half an hour.
  • Inspect whether everything appears correct, i.e. whether most of the URLs found seem correct.

Example #1
A website returns 200 instead of 404 for broken page URLs. Example of an infinite pattern: the original 1/broken.html links to 1/1/broken.html, which links to 1/1/1/broken.html, etc.
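The deepening pattern comes from a relative link being resolved against an ever-deeper base URL. A minimal Python sketch (hypothetical site name) showing why a soft-404 page containing the relative link 1/broken.html keeps producing new "unique" URLs:

```python
from urllib.parse import urljoin

# A broken page at /1/broken.html contains the relative link "1/broken.html".
# If the server answers 200 instead of 404, the crawler follows it forever,
# because each resolution yields a new, one-level-deeper URL.
url = "http://example.com/1/broken.html"
for _ in range(3):
    url = urljoin(url, "1/broken.html")
    print(url)
# http://example.com/1/1/broken.html
# http://example.com/1/1/1/broken.html
# http://example.com/1/1/1/1/broken.html
```

This is why checking that the server really returns 404 for nonsense URLs is the first thing to verify.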

Example #2
The website platform / CMS generates a huge number of 100% duplicate URLs for each actually existing URL. To read more about duplicate URLs, see this help page. Remember that you can analyze and investigate internal website linking in case something looks wrong.
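One way to spot 100% duplicates yourself is to normalize each URL to a canonical form and compare. A rough sketch of the idea (the normalization rules and parameter names here are illustrative, not A1's actual algorithm):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that do not change the page content (illustrative set).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def canonical(url: str) -> str:
    """Reduce a URL to a canonical form so duplicate variants compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removesuffix(":80")  # default port is redundant
    path = parts.path.rstrip("/") or "/"             # /shop/ and /shop collapse
    # Drop tracking/session parameters and sort the rest for a stable order.
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k not in TRACKING_PARAMS
    ))
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))

a = canonical("HTTP://Example.com:80/shop/?utm_source=x&page=2")
b = canonical("http://example.com/shop?page=2")
print(a == b)  # True - both reduce to http://example.com/shop?page=2
```

Running the crawled URL list through something like this and counting collisions quickly shows whether a CMS is multiplying URLs.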


B) Check if your project configuration and website content will cause the crawler to download files hundreds of megabytes in size.

Example #1: Your website contains many huge files (hundreds of megabytes) that the crawler must download. (While the memory is freed after each download completes, this can still cause problems on computers with little memory.)

Remember that you can see internal linking after you stop the scan - this will usually show how a URL was found:
http://www.microsystools.com/products/sitemap-generator/help/sitemaps-generator-analyze-links/




To help more, I will need to see the website. Feel free to post it here or email it: http://www.microsystools.com/home/contact.php
TechSEO360 | MicrosysTools.com | A1 Sitemap Generator, A1 Website Analyzer etc.

ljs

Actually, it's a car parts web shop, so there are hundreds of car brands and models, which all have hundreds of matching parts. Every model/part combination generates a URL. There are now about 65,000 internal URLs listed. Should I keep going, or is there an alternative way to make a sitemap for this type of website?

Thanks!

Webhelpforums

If there are URL patterns you know are duplicates of some sort, you can exclude them in the analysis filters and output filters before starting the scan. Applied carefully, this can shorten the scan time quite drastically.

The first step is to stop the scan and inspect the results for URLs you do not want in the final XML sitemap.
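If it helps to see the concept outside of A1's filter dialogs, an exclude filter is essentially a set of patterns applied to the crawled URL list before the sitemap is written. A sketch with plain regular expressions (the patterns below are made up for illustration - adapt them to whatever duplicate variants you actually find):

```python
import re

# Hypothetical exclude patterns, e.g. print views and session-id variants
# of the same part pages.
EXCLUDE = [
    re.compile(r"[?&]print=1\b"),
    re.compile(r"[?&]sid=[0-9a-f]+"),
]

def keep(url: str) -> bool:
    """True if the URL matches none of the exclude patterns."""
    return not any(p.search(url) for p in EXCLUDE)

crawled = [
    "http://example.com/parts/brake-pads",
    "http://example.com/parts/brake-pads?print=1",
    "http://example.com/parts/brake-pads?sid=ab12",
]
sitemap_urls = [u for u in crawled if keep(u)]
print(sitemap_urls)  # ['http://example.com/parts/brake-pads']
```

The same pattern applied as an *analysis* filter saves even more time, because excluded URLs are never crawled at all.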

If the website/server can handle it, you can also increase the number of worker threads and simultaneous connections (the default settings in A1SG are around 10% of the maximum).
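The speedup from more simultaneous connections comes from overlapping the waits on network I/O. A toy sketch with Python's thread pool (the `fetch` stub below stands in for a real HTTP request, so the timing is simulated):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url: str) -> str:
    time.sleep(0.05)  # stand-in for network latency of one request
    return f"200 {url}"

urls = [f"http://example.com/page/{i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

# With 10 workers the 20 x 0.05 s waits overlap: roughly 0.1 s total
# instead of about 1 s sequentially.
print(len(results), round(elapsed, 2))
```

The flip side is that every extra connection is extra load on the server, which is why the reply above says "if the website/server can handle it".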

Be sure to check this help page for more info and links to other relevant help pages:
http://www.microsystools.com/products/sitemap-generator/help/creating-sitemaps-large-websites/
