Large sites eat all available memory

Started by seofreelance, August 19, 2013, 11:14:58 AM

seofreelance

Hi,
I am running A1SG (always the latest Pro version) on a 2nd generation i7 with 8 GB RAM, and I find that A1SG has historically eaten all available memory on larger projects, large meaning > 100K URLs.


(This snapshot comes from the problem happening while scanning france-voyage.com.)

I know there are settings for making querying and memory usage lighter, but since I use this (great) tool for SEO purposes I need data as close to reality as possible, especially from misconfigured or tricky servers (the ones that may give problems if I enable the settings suggested in thread http://webhelpforums.net/index.php?topic=4283.0).

This memory issue has a workaround: stop the crawler, let it finish all pending requests and save the project. Then mark the option to continue where the scan left off and keep going with the next batch, and so on (a lot of "so on" with a large site!).
The downside of this workaround: I can't leave it working while, say, picking the kids up from school. If I forget about it or rely too much on the time it is allowed to run, I may end up with a stuck laptop and have to kill the A1SG process from Task Manager, and then all the processing time and collected data go to waste.

Is it an A1SG bug? Is it anything you can fix in future versions?

I'd be happy enough with solid alternatives such as:
- Autosave settings
- Memory-limit triggers (say stop and save, save and prompt, or data batches like those ".part" compressed packs being assembled at a final stage; see the sketch after this list)
- Whatever else you can think of :)
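
Just to illustrate the kind of memory-limit trigger I mean (purely a sketch, not anything from A1SG itself): an external watchdog could poll the crawler process and trigger a stop-and-save once memory use crosses a threshold. The psutil calls are real, but the save_and_stop callback and the 6 GB limit are made up for the example.

# Hypothetical memory-limit watchdog; NOT part of A1SG, just a sketch of the idea.
import time
import psutil

MEMORY_LIMIT_BYTES = 6 * 1024 ** 3   # e.g. stop before an 8 GB machine starts swapping

def watch_crawler(process: psutil.Process, save_and_stop) -> None:
    """Poll the crawler's memory usage; save and stop once it crosses the limit."""
    while process.is_running():
        rss = process.memory_info().rss      # resident memory in bytes
        if rss > MEMORY_LIMIT_BYTES:
            save_and_stop()                  # hypothetical callback: flush queue, save project
            break
        time.sleep(10)                       # check every 10 seconds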

Thanks for the attention,
Ricard Menor, Spanish SEO consultant based in Barcelona

Webhelpforums

One of the reasons stopping the scan helps is that there are sometimes huge queues of found links that need to be checked. On *some* sites this can cause memory usage far beyond what is normal. I think I have a suggestion that will solve your problem.

Try unchecking Scan website > Crawler engine > Default to GET for page requests

This solution often works well because it clears the queue much faster. For details see:
http://www.microsystools.com/products/sitemap-generator/help/website-crawl-progress-status/


Incidentally, if you do stop/resume scans, using HEAD requests will also be better since it can mark more page URLs as "scanned completely" faster. For details on that, see:

http://www.microsystools.com/products/sitemap-generator/help/sitemap-generator-resume-scan/


The reason HEAD is not the default is that some servers (and website CMS plugins) error on or deny such requests, and using GET has proven to be slightly faster on most websites.
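
To illustrate the HEAD vs GET difference outside of A1SG (a rough Python sketch using the requests library, not how the crawler is implemented internally): a HEAD request returns only the status line and headers, so a URL can be verified without downloading the page body, which is why it clears a link-checking queue faster. The example URL is just a placeholder.

# HEAD vs GET when only the URL's status needs checking.
import requests

url = "http://www.example.com/some-page"

# HEAD: status and headers only, no body downloaded.
# Note: some servers reject HEAD (e.g. respond with 405), which is why GET stays the default.
head_response = requests.head(url, allow_redirects=True, timeout=10)
print(head_response.status_code, head_response.headers.get("Content-Length"))

# GET: same status information, but the full body is downloaded as well,
# which is slower for pages that only need to be verified, not parsed.
get_response = requests.get(url, timeout=10)
print(get_response.status_code, len(get_response.content))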
TechSEO360 | MicrosysTools.com  | A1 Sitemap Generator, A1 Website Analyzer etc.

seofreelance

Thanks, I will try these settings out shortly.
Ricard Menor, Spanish SEO consultant based in Barcelona
