Help reducing scan time

Started by Laptop Plus, May 12, 2014, 10:21:29 PM

Laptop Plus

I've gone through the Help index, but the suggestions haven't sped up the scan time all that much.

We've recently implemented some changes to the website which significantly increased the amount of interlinking. There are approximately 75K-100K unique pages, but A1 Sitemap Generator is picking up 50 million+ pages for "Init" found link (check if unique). I assume this is why the scan now takes so long to complete: there are literally millions of new links, most of which end up pointing to the same URLs.

Is there any way to reduce the time taken for unique link checking?

Webhelpforums

In recent versions, A1 Sitemap Generator tests new links before they get queued, and tests all links in the existing queue at intervals. If it finds that a given link is the same as a URL already identified, it is handled immediately (all necessary information regarding links-to / linked-by etc. is of course updated).
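
To illustrate the general idea of checking a link for uniqueness before it is queued, here is a minimal Python sketch. It is not A1 Sitemap Generator's actual implementation; the function name `enqueue_unique`, the `http://` scheme on the example URLs, and the choice to treat URLs that differ only by a `#fragment` as identical are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urldefrag

def enqueue_unique(found_links, seen, queue):
    """Queue only links not seen before; return how many duplicates were skipped."""
    skipped = 0
    for link in found_links:
        # Assumption: two links that differ only by "#fragment" are the same page.
        url, _fragment = urldefrag(link)
        if url in seen:
            skipped += 1      # duplicate: never enters the waiting queue
            continue
        seen.add(url)
        queue.append(url)
    return skipped

seen, queue = set(), deque()
links = [
    "http://www.mywebsite.com/Asus.aspx?manid=1&catid=1",
    "http://www.mywebsite.com/Asus.aspx?manid=1&catid=1#top",
    "http://www.mywebsite.com/Asus/Series.aspx?seriesid=1&manid=1&catid=1",
]
skipped = enqueue_unique(links, seen, queue)
```

The set lookup is cheap per link, but with tens of millions of found links the check still runs once for every link, which is why heavy interlinking adds up in scan time even when the queue itself stays small.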

However, that only minimizes memory usage by keeping the number of queued "waiting" links down. For speed, I think the only thing you can do is increase the number of worker threads to 100. But you will need a really powerful computer with *lots* of memory for it to make a difference.

Maybe you could split the website crawl into small sections? (e.g. directories)
TechSEO360 | MicrosysTools.com  | A1 Sitemap Generator, A1 Website Analyzer etc.

Laptop Plus

Thanks for the reply. A combination of what you've suggested should eventually get the job done. However, I'm having a bit of trouble scanning individual directories due to the .NET nature of the site.

For example, for the Asus subdirectory, there'll be something of a doorway page such as:

www.mywebsite.com/Asus.aspx?manid=1&catid=1

The above page will then link to many pages within the Asus subdirectory, such as:

www.mywebsite.com/Asus/Series.aspx?seriesid=1&manid=1&catid=1

I guess the issue is that the Asus subdirectory and all of the pages contained in it don't actually exist without being linked to from the Asus doorway page. The pages are created dynamically and don't exist within a static directory structure on the server.

Also, is there a way of limiting the scan so that it will only scan descending subdirectory levels, to avoid it reaching the root directory again and trying to crawl the whole site?


Laptop Plus

There's a relatively small number of subdirectories (19) off the root directory, so I may be able to use analysis filters to scan only one subdirectory at a time.

Webhelpforums

1) Set root to your domain.

2) Add start search paths to some pages that link into the directories you want.

3) Configure "analysis filters" to only allow those URLs + URLs found in the wanted directories.

4) Configure "output filters" to only allow URLs found in the wanted directories.
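
The logic behind the steps above can be sketched roughly as follows in Python. This only models the filtering idea and is not how the filters are entered in A1 Sitemap Generator; the names `START_PATHS`, `WANTED_DIRS`, `allow_analysis` and `allow_output` are hypothetical, and the `http://` scheme is assumed.

```python
from urllib.parse import urlsplit

HOST = "www.mywebsite.com"      # step 1: root is the whole domain
START_PATHS = {"/Asus.aspx"}    # step 2: doorway pages linking into the directory
WANTED_DIRS = ("/Asus/",)       # steps 3 and 4: directories to keep

def in_wanted_dir(url):
    """True if the URL is on the site and its path lies inside a wanted directory."""
    parts = urlsplit(url)
    return parts.netloc == HOST and parts.path.startswith(WANTED_DIRS)

def allow_analysis(url):
    """Step 3: crawl the doorway pages plus everything in the wanted directories."""
    parts = urlsplit(url)
    is_start = parts.netloc == HOST and parts.path in START_PATHS
    return is_start or in_wanted_dir(url)

def allow_output(url):
    """Step 4: only URLs inside the wanted directories reach the sitemap."""
    return in_wanted_dir(url)

doorway = "http://www.mywebsite.com/Asus.aspx?manid=1&catid=1"
series = "http://www.mywebsite.com/Asus/Series.aspx?seriesid=1&manid=1&catid=1"
root = "http://www.mywebsite.com/"
```

Here the doorway page is analyzed so its links can be followed, but it is left out of the output, while the root page is excluded from both, so the crawl cannot wander back up and across the whole site.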

Laptop Plus

Cheers, I've got it sorted now. Awesome piece of software.

heiberlin

I have the same problem. A little website (7 pages) - the scan needs more than 7 hours!
Sitemap Generator 5.1.0 - bought, downloaded and unlocked. No options changed.

What about my sites with more than 100 pages?
I cancelled the first attempts after more than 20 hours.

The hints in this thread brought no improvement.

Webhelpforums

Hi heiberlin,

Your problem sounds entirely different. Some possible causes could be:

Please email the problematic website URL or post it here.