A1 Website download exclude pages

Started by ericgourmet, July 09, 2017, 06:47:12 AM

ericgourmet

Hi,
I am trying to exclude web pages whose names begin with the characters MS_
The pages are downloaded into different directories; each page name begins with MS_ and has an .html extension. How do I exclude these pages from both analysis and output?

Example:
dir1/MS_987.html should be excluded
dir2/MS_48732.html should be excluded
dir1/boxes.html should be included

Thanks!

Webhelpforums

Please see:
https://www.microsystools.com/products/website-download/help/website-download-convert-links/

It sounds like those files (MS_ + .html) are the ones created when two different URLs would map to the same file name on disk (this happens because file names on disk do not allow the same characters as URLs do).

I have contemplated a different approach, but a problem I encountered with huge websites was that if I simply removed illegal characters or used a similar "simple" method, I would get cases where URLs collided when saved to disk.
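
To illustrate the collision problem, here is a minimal Python sketch. It is not how A1 Website Download works internally; the function name `naive_file_name`, the character list and the sample URLs are assumptions made up for this example. It only shows that dropping illegal characters can turn two different URLs into the same file name.

```python
import re

# Characters that Windows does not allow in file names.
ILLEGAL = re.compile(r'[<>:"/\\|?*]')

def naive_file_name(url_path: str) -> str:
    """Hypothetical 'simple' mapping: just drop every illegal character."""
    return ILLEGAL.sub("", url_path)

# Two different URLs (made up for this example).
first = naive_file_name("news?id=1:2")
second = naive_file_name("news?id=12")

# Both end up as the same name, so one file would overwrite the other.
print(first, second, first == second)
```

Any scheme that only removes or replaces characters has this weakness, which is why some extra unique part in the file name is needed for the colliding cases.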




General way of excluding pages from the crawl:
https://www.microsystools.com/products/website-download/help/website-crawler-scanner-filters/

General way of excluding pages from the final output:
https://www.microsystools.com/products/website-download/help/website-crawler-output-filters/
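
As a rough illustration of the matching rule only: the sketch below is plain Python, not A1 Website Download filter syntax (see the help pages above for that). The helper name `is_excluded` and the regular expression are assumptions for this example. It checks whether the file name part of a path begins with MS_ and ends with .html.

```python
import re
from pathlib import PurePosixPath

# File name must start with "MS_" and end with ".html".
MS_PAGE = re.compile(r"MS_.*\.html")

def is_excluded(path: str) -> bool:
    """Return True when the file name (not the directory) matches MS_*.html."""
    return MS_PAGE.fullmatch(PurePosixPath(path).name) is not None

paths = ["dir1/MS_987.html", "dir2/MS_48732.html", "dir1/boxes.html"]
kept = [p for p in paths if not is_excluded(p)]
print(kept)
```

Matching on the file name rather than the full path means the rule works the same in every directory.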
TechSEO360 | MicrosysTools.com  | A1 Sitemap Generator, A1 Website Analyzer etc.

ericgourmet

OK, that solved my problem. I had to go into the Crawler Options and check the option Cutout "?" (GET parameters) in internal links.
Thanks!

Webhelpforums

That will not be a favourable solution for websites where ? parameters are an important part of URLs

e.g. fetchpage?page=about and fetchpage?page=contact

The way A1WD handles it by default ensures that both pages are downloaded (but renamed, because e.g. "?" cannot be part of a file name on Windows) and that internal linking still works.
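
A small Python sketch of why cutting out the "?" part loses pages on such sites. The host `example.com` and the helper `cut_query` are assumptions for this example; this is not A1WD code, only the general idea.

```python
from urllib.parse import urlsplit, urlunsplit

def cut_query(url: str) -> str:
    """Drop the '?...' part of a URL and keep everything else."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))

urls = [
    "http://example.com/fetchpage?page=about",
    "http://example.com/fetchpage?page=contact",
]

# A set keeps only distinct URLs after the query has been cut out.
stripped = {cut_query(u) for u in urls}
print(stripped)
```

Two distinct pages collapse into one URL, so only one of them would be crawled and saved.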

It would be possible to keep more of the old names and simply append the "MS_xx" part to them. Would that be better? If so, I will add it to the wishlist / create an option for it.
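
Roughly what such a naming scheme could look like, as a Python sketch. Everything here is hypothetical: the helper `disk_name`, the use of a short SHA-1 hash as the "xx" part and the exact name layout are assumptions for illustration, not the format A1WD actually uses.

```python
import hashlib
import re

# Characters that Windows does not allow in file names.
ILLEGAL = re.compile(r'[<>:"/\\|?*]')

def disk_name(url_path: str, ext: str = ".html") -> str:
    """Keep a readable version of the old name and append a unique MS_ tag."""
    readable = ILLEGAL.sub("_", url_path)
    # Hypothetical "xx" part: first 8 hex digits of a hash of the original URL.
    tag = hashlib.sha1(url_path.encode("utf-8")).hexdigest()[:8]
    return f"{readable}_MS_{tag}{ext}"

print(disk_name("fetchpage?page=about"))
print(disk_name("fetchpage?page=contact"))
```

The readable part makes files easier to recognise, while the appended tag keeps names distinct even when the readable parts are identical after replacing illegal characters.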

RichardBaker

Were you able to figure out how to exclude these pages from both analysis and output?

Webhelpforums

Quote from: Webhelpforums on July 29, 2017, 04:50:31 AM
That will not be a favourable solution for websites where ? parameters are an important part of URLs

e.g. fetchpage?page=about and fetchpage?page=contact

The way A1WD handles it by default ensures that both pages are downloaded (but renamed, because e.g. "?" cannot be part of a file name on Windows) and that internal linking still works.

It would be possible to keep more of the old names and simply append the "MS_xx" part to them. Would that be better? If so, I will add it to the wishlist / create an option for it.

Update: The option now exists and is found in "Scan website | Download options".
