Limits and filters

Started by spiweb, February 21, 2012, 03:55:05 AM

spiweb

Hi,
I am trying to get a grip on limits and filters in A1WebsiteDownload.
I have read the Help page, but I am not entirely clear about a few things.

1. Can I make "Exclude" filters override "Include" limits?
For instance, I want to download the icecream/ path, let's say, but I don't need the icecream/strawberry/ subpath.

2. How do I express an absolute path?
In a project for website www.example.org
would the RegEx
::^$sweets/
refer to www.example.org/sweets/
?

BTW, is there a RegEx tester for A1WD?


3. A limit to
:scones/
would include the relative path
/scones/cranberry/
so a
:cranberry/
rule would not be necessary for /scones/cranberry/
Did I get that right?

4. Are limits sometimes automatically used as starting paths?


Thanks!

Webhelpforums

#1
1)
I tend to think of the "exclude" and "limit to" sections as being applied at the same time. (Same result no matter the order.) So yes, you can combine them, as they are both applied to the URLs.
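
If it helps, here is a minimal Python sketch of how I picture the two lists combining (just an illustration of the AND-style logic, not the actual A1WD code; the paths are made up):

def url_allowed(relative_url, limit_to, exclude):
    # A URL is kept if it falls under at least one "limit to" path
    # and under none of the "exclude" paths - order does not matter.
    within_limits = any(relative_url.startswith(p) for p in limit_to)
    excluded = any(relative_url.startswith(p) for p in exclude)
    return within_limits and not excluded

print(url_allowed("icecream/vanilla/", ["icecream/"], ["icecream/strawberry/"]))     # True
print(url_allowed("icecream/strawberry/", ["icecream/"], ["icecream/strawberry/"]))  # False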


2)
Common to "string", ":path/" and "::regex" filters is that they are all applied to the relative part of the URL (the "relative/part/of/the/url" in http://example.com/relative/part/of/the/url)

So "^sweets/$" would refer to "http://www.example.org/sweets/" and "^$" would match "http://www.example.org/"


3)
":scones/" (path filter) includes both "http://example.com/scones/" and "http://example.com/scones/cranberry/" while ":cranberry/" (path filter) would include "http://example.com/cranberry/"


4)
In "analysis filters", the paths (not strings and not regex) in "limit to" are automatically inserted as extra search starting paths:
http://www.microsystools.com/products/website-download/help/root-aliases-start-paths/

spiweb

#2
Ok, thanks!

1) So, correct me if I am wrong, but I understand that the explicit Exclude prevails over the Limit to.
For instance,
Limit to: :icecream/
Exclude: :strawberry/
should download neither icecream/strawberry/ nor strawberry/icecream/

But in fact it seems to me it does not! I mean, the "Exclude" path does not override the "Limit to" path.


2) As an alternative to the RegEx "^sweets/$", can I exclude that absolute path just by adding
:http://www.example.org/sweets/
to the Exclude filters?

I am asking this because I noticed that external absolute addresses are added this way:
:http://connect.facebook.net/it_IT/all.js
(if I add them to the Exclude filters with the red button/Ctrl-Alt-F in the External sitemap window)

3) I also read that an asterisk after the relative path will apply the filter to the sub-paths but not to the relative path itself

From the Help:
":blogs/* matches relative paths excluding itself that start with "blogs/" such as http://www.microsystools.com/blogs/sitemap-generator/. "

But then, in Sitemap, there is a command "Add selected + subpaths..." that adds entries to the filters with the sub-path match asterisk...

4) Ok. At first I didn't notice it. Now I also remove those starting paths if I remove some "Limit to" filters while creating a project.

Webhelpforums

Concerning *1*:

"
For instance,
Limit to: :icecream/
Exclude: :strawberry/
should download neither icecream/strawberry/ nor strawberry/icecream/
"

":icecream/" limits analysis to http://example.com/icecream/
":strawberry/" excludes http://example.com/strawberry/ from analysis

Are you sure you don't mean this:
Limit to: ":icecream/"
Exclude: ":icecream/strawberry/"

Then your question would be more "correct" since that would limit analysis to the "icecream/" folder, and in *addition* exclude the subdirectory ":icecream/strawberry/" from analysis.


Webhelpforums

About *2*

You have used the exclude button on external URLs, and it therefore adds the complete URLs rather than relative ones.
It would seem those "quick" exclude buttons should be disabled for external URLs to avoid this confusion (!)


About *3*

"
there is a command "Add selected + subpaths..." that adds to filters with sub path match asterisk...
"

The button currently adds, as you describe, ":example/*" (i.e. subpaths only).

This is in contradiction to the text on the button. It causes unwanted confusion and will most likely be corrected very fast :)
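
For reference, the subpaths-only behavior from the help text can be pictured like this (a hypothetical sketch, not the real matching code):

def matches_subpaths_only(relative_url, path):
    # ":path/*" style: subpaths under "path/" match, but "path/" itself does not.
    return relative_url.startswith(path) and relative_url != path

print(matches_subpaths_only("blogs/sitemap-generator/", "blogs/"))  # True
print(matches_subpaths_only("blogs/", "blogs/"))                    # False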


About *4*

Part of the original problem was that people used limit filters in such a way that the root would not be analyzed, thus stopping the website scan from ever getting started.

spiweb

Quote from: Webhelpforums on February 21, 2012, 07:49:15 PM
Concerning *1*:

"
For instance,
Limit to: :icecream/
Exclude: :strawberry/
should download neither icecream/strawberry/ nor strawberry/icecream/
"

":icecream/" limits analysis to http://example.com/icecream/
":strawberry/" excludes http://example.com/strawberry/ from analysis

Are you sure you don't mean this:
Limit to: ":icecream/"
Exclude: ":icecream/strawberry/"

Then your question would be more "correct" since that would limit analysis to the "icecream/" folder, and in *addition* exclude the subdirectory ":icecream/strawberry/" from analysis.

I see. Now let's say it's icecream/something/else/unknown/strawberry/ that I need to exclude.
I can use the RegEx, right?
Limit to: ":icecream/"
Exclude: "::icecream/.*/strawberry/"

spiweb

Quote from: Webhelpforums on February 21, 2012, 08:06:02 PM
About *2*

You have used the exclude button on external URLs, and it therefore adds the complete URLs rather than relative ones.
It would seem those "quick" exclude buttons should be disabled for external URLs to avoid this confusion (!)

Thanks for all the explanations!

Regarding *2*: so complete URLs don't have any effect in filters? Only relative paths work?
How can I instruct A1WebsiteDownload not to download anything, not even as "source" files (JS, images, or anything else), from a given external site I don't want (typically ads and banners)?

Webhelpforums

Change
"
Exclude: "::icecream/.*/strawberry/"
"
to
"
Exclude: "::icecream/.*?/strawberry/"

(.*? = non-greedy match and .* = greedy match)
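
The difference is easy to see with Python's re module (illustration only; the path below is just an example with two "strawberry/" segments):

import re

relative = "icecream/something/else/unknown/strawberry/sub/strawberry/"
print(re.search(r"icecream/.*/strawberry/", relative).group())   # greedy: matches through the last strawberry/
print(re.search(r"icecream/.*?/strawberry/", relative).group())  # non-greedy: stops at the first strawberry/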

...

Relative filters are the way to go. (At present, at least!)

...

Concerning not downloading external stuff that is *used* (like JavaScript, images shown on the page, CSS, etc.), see the option:

"Download used images and similar residing on external domains" (checkbox)

found in

"Scan website | Download options"
