Limits and filters

Started by spiweb, February 21, 2012, 03:55:05 AM

spiweb

Hi,
I am trying to get a grip on limits and filters in A1WebsiteDownload.
I have read the Help page, but I am not entirely clear about a few things.

1. Can I make "Exclude" filters override "Include" limits?
For instance, I want to download the icecream/ path, let's say, but I don't need the icecream/strawberry/ subpath.

2. How do I express an absolute path?
In a project for website www.example.org
would the RegEx
::^$sweets/
refer to www.example.org/sweets/
?

BTW, is there a RegEx tester for A1WD?


3. A limit to
:scones/
would include the relative path
/scones/cranberry/
so a
:cranberry/
rule would not be necessary for /scones/cranberry/
Did I get that right?

4. Are limits sometimes automatically used as starting paths?


Thanks!

Webhelpforums

#1
1)
I tend to think of the "exclude" and "limit to" sections as being applied at the same time. (Same result no matter the order.) So yes, you can combine them, as they are both applied to the URLs.
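
If it helps, here is a minimal Python sketch of how I picture the two lists combining (just an illustration of the AND-style logic, not the actual A1WD code; the paths are made up):

def url_allowed(relative_url, limit_to, exclude):
    # A URL is kept if it falls under at least one "limit to" path
    # and under none of the "exclude" paths - order does not matter.
    within_limits = any(relative_url.startswith(p) for p in limit_to)
    excluded = any(relative_url.startswith(p) for p in exclude)
    return within_limits and not excluded

print(url_allowed("icecream/vanilla/", ["icecream/"], ["icecream/strawberry/"]))     # True
print(url_allowed("icecream/strawberry/", ["icecream/"], ["icecream/strawberry/"]))  # False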


2)
Common to "string", ":path/" and "::regex" filters is that they are all applied to the relative part of the URL (the "relative/part/of/the/url" in http://example.com/relative/part/of/the/url)

So "^sweets/$" would refer to "http://www.example.org/sweets/" and "^$" would match "http://www.example.org/"


3)
":scones/" (path filter) includes both "http://example.com/scones/" and "http://example.com/scones/cranberry/" while ":cranberry/" (path filter) would include "http://example.com/cranberry/"


4)
In "analysis filters", the paths (not strings and not regex) in "limit to" are automatically inserted as extra search starting paths:
http://www.microsystools.com/products/website-download/help/root-aliases-start-paths/

spiweb

#2
Ok, thanks!

1) So, correct me if I am wrong, but I understand that the explicit Exclude prevails over the Limit to.
For instance,
Limit to: :icecream/
Exclude: :strawberry/
should download neither icecream/strawberry/ nor strawberry/icecream/

But in fact it seems to me it does not! I mean, the "Exclude" path does not override the "Limit to" path.


2) As an alternative to the RegEx "^sweets/$", can I exclude that absolute path just by adding
:http://www.example.org/sweets/
to the Exclude filters?

I am asking this because I noticed that external absolute addresses are added this way:
:http://connect.facebook.net/it_IT/all.js
(if I add them to the Exclude filters with the red button/Ctrl-Alt-F in the External sitemap window)

3) I also read that an asterisk after the relative path will apply the filter to the sub-paths but not to the relative path itself

From the Help:
":blogs/* matches relative paths excluding itself that start with "blogs/" such as http://www.microsystools.com/blogs/sitemap-generator/. "

But then, in Sitemap, there is a command "Add selected + subpaths..." that adds entries to the filters with the sub-path match asterisk...

4) Ok. At first I didn't notice it. Now I also remove those starting paths if I remove some "Limit to" filters while creating a project.

Webhelpforums

Concerning *1*:

"
For instance,
Limit to: :icecream/
Exclude: :strawberry/
should download neither icecream/strawberry/ nor strawberry/icecream/
"

":icecream/" limits analysis to http://example.com/icecream/
":strawberry/" excludes http://example.com/strawberry/ from analysis

Are you sure you don't mean this:
Limit to: ":icecream/"
Exclude: ":icecream/strawberry/"

Then your question would be more "correct" since that would limit analysis to the "icecream/" folder, and in *addition* exclude the subdirectory ":icecream/strawberry/" from analysis.


Webhelpforums

About *2*

You have used the exclude button on external URLs, and it therefore adds the complete URLs rather than relative ones.
It would seem those "quick" exclude buttons should be disabled for external URLs to avoid this confusion (!)


About *3*

"
there is a command "Add selected + subpaths..." that adds to filters with sub path match asterisk...
"

The button currently adds, as you describe, ":example/*" (i.e. subpaths only).

This is in contradiction to the text on the button. It causes unwanted confusion and will most likely be corrected very fast :)
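
For reference, the subpaths-only behavior from the help text can be pictured like this (a hypothetical sketch, not the real matching code):

def matches_subpaths_only(relative_url, path):
    # ":path/*" style: subpaths under "path/" match, but "path/" itself does not.
    return relative_url.startswith(path) and relative_url != path

print(matches_subpaths_only("blogs/sitemap-generator/", "blogs/"))  # True
print(matches_subpaths_only("blogs/", "blogs/"))                    # False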


About *4*

Part of the original problem was that people used limit filters in such a way that the root would not be analyzed, thus stopping the website scan from ever getting started.

spiweb

Quote from: Webhelpforums on February 21, 2012, 07:49:15 PM
Concerning *1*:

"
For instance,
Limit to: :icecream/
Exclude: :strawberry/
should download neither icecream/strawberry/ nor strawberry/icecream/
"

":icecream/" limits analysis to http://example.com/icecream/
":strawberry/" excludes http://example.com/strawberry/ from analysis

Are you sure you don't mean this:
Limit to: ":icecream/"
Exclude: ":icecream/strawberry/"

Then your question would be more "correct" since that would limit analysis to the "icecream/" folder, and in *addition* exclude the subdirectory ":icecream/strawberry/" from analysis.

I see. Now let's say it's icecream/something/else/unknown/strawberry/ that I need to exclude.
I can use the RegEx, right?
Limit to: ":icecream/"
Exclude: "::icecream/.*/strawberry/"

spiweb

Quote from: Webhelpforums on February 21, 2012, 08:06:02 PM
About *2*

You have used the exclude button on external URLs, and it therefore adds the complete URLs rather than relative ones.
It would seem those "quick" exclude buttons should be disabled for external URLs to avoid this confusion (!)

Thanks for all the explanations!

Regarding *2*: so complete URLs don't have any effect in filters? Only relative paths work?
How can I instruct A1WebsiteDownload not to download anything, not even as "source" files (JS, images, or anything else), from a given external site I don't want (typically ads and banners)?

Webhelpforums

Change
"
Exclude: "::icecream/.*/strawberry/"
"
to
"
Exclude: "::icecream/.*?/strawberry/"

(.*? = non-greedy match and .* = greedy match)
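
The difference is easy to see with Python's re module (illustration only; the path below is just an example with two "strawberry/" segments):

import re

relative = "icecream/something/else/unknown/strawberry/sub/strawberry/"
print(re.search(r"icecream/.*/strawberry/", relative).group())   # greedy: matches through the last strawberry/
print(re.search(r"icecream/.*?/strawberry/", relative).group())  # non-greedy: stops at the first strawberry/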

...

Relative filters are the way to go. (At present, at least!)

...

Concerning not downloading external stuff that is *used* (like JavaScript, images shown on the page, CSS, etc.), see the option:

"Download used images and similar residing on external domains" (checkbox)

found in

"Scan website | Download options"
