Adhere to robots.txt
It seems that many of the pages your tool crawled are blocked by our robots.txt. These should not be indexed, as they contain no content, sit behind a login, and so on. It would therefore be great if, rather than going straight to crawling, the tool offered a few options:
Respecting robots.txt should be one of them.
Another option should be the ability to specify folders to NOT be indexed. E.g. our international pages are all under /ja/... for Japan.
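As an illustration of what both options could look like on the crawler side, here is a minimal sketch using Python's standard urllib.robotparser. The rules and the "ExampleBot" user agent are hypothetical; a real implementation would fetch the site's actual robots.txt and merge in any user-supplied excluded folders (such as /ja/) before crawling:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block login pages for all crawlers, plus a
# user-supplied exclusion for the /ja/ international folder.
rules = """\
User-agent: *
Disallow: /login/
Disallow: /ja/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The crawler would check each URL before fetching it.
print(rp.can_fetch("ExampleBot", "https://example.com/ja/tokyo"))  # blocked
print(rp.can_fetch("ExampleBot", "https://example.com/about"))     # allowed
```

A folder-exclusion option could simply be appended to the parsed rule set in the same way, so both features share one pre-fetch check.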
Anonymous shared this idea