Configuring Site Audit

To set up a Site Audit, you first need to create a Project for the domain. Once you have your new project, select the “Set up” button in the Site Audit block of your Project interface.

If you are having problems getting your Site Audit to run, please reference Troubleshooting Site Audit for help.

Domain and Limit of Pages

You’ll be taken to the first part of the setup wizard, Domain and Limit of Pages. From here, you can either select “Start Site Audit,” which immediately runs an audit of your site with our default settings, or customize the audit settings first. You can always change your settings later and re-run the audit to crawl a more specific area of your site.

Crawl Scope

To crawl a specific domain, subdomain, or subfolder, you can enter it into the “Crawl scope” field. If you enter a domain in this field, you’ll be given the option to crawl all subdomains of your domain with a checkbox.

By default, the tool checks the root domain, which includes all available subdomains and subfolders of your site. In the Site Audit settings, you can enter a subdomain or subfolder as the crawl scope and uncheck “Crawl all subdomains” if you don’t want other subdomains to be crawled.

For example, suppose you want to audit only the blog of your website. You can set the crawl scope to blog.semrush.com or semrush.com/blog/, depending on whether the blog is implemented as a subdomain or a subfolder.
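
The scope rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Site Audit is implemented: the function name in_scope and its crawl_all_subdomains argument are hypothetical, and details such as how "www" is treated are not modeled.

```python
from urllib.parse import urlsplit


def in_scope(url: str, scope: str, crawl_all_subdomains: bool = True) -> bool:
    """Return True if `url` falls inside the crawl scope (illustrative only)."""
    target = urlsplit(url)
    # Accept scopes written without a scheme, such as "semrush.com/blog/".
    scope_parts = urlsplit(scope if "//" in scope else "//" + scope)
    host = (target.hostname or "").lower()
    scope_host = (scope_parts.hostname or "").lower()
    scope_path = scope_parts.path or "/"

    same_host = host == scope_host
    is_subdomain = host.endswith("." + scope_host)
    if not (same_host or (crawl_all_subdomains and is_subdomain)):
        return False
    # A subfolder scope only matches URLs whose path starts with that folder.
    return scope_path == "/" or target.path.startswith(scope_path)


print(in_scope("https://blog.semrush.com/post", "blog.semrush.com"))
print(in_scope("https://semrush.com/pricing", "semrush.com/blog/"))
```

The first call is inside a subdomain scope, while the second falls outside a subfolder scope.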

A window displaying basic Site Audit settings: Crawl scope, Limit of checked pages, and Crawl source. A red arrow is pointing at the View examples line under Crawl scope, which opens the examples of what root domains, subdomain, and subfolders look like.

A GIF displaying basic Site Audit settings. The number of pages and a crawl source are being selected, demonstrating available options. After selecting the desired number of pages to crawl and a crawl source, a cursor heads down to the green Start Site Audit button.

Limit of Checked Pages

Next, select how many pages you want to crawl per audit. You can enter a custom amount using the “Custom” option. Choose this number based on your subscription level and how often you plan to re-audit your website.

  • Pro users can crawl up to 100,000 pages per month and 20,000 pages per audit
  • Guru users can crawl up to 300,000 pages per month and 20,000 pages per audit
  • Business users can crawl up to 1 million pages per month and 100,000 pages per audit
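
To see how the per-audit and monthly limits interact, here is a rough calculation. It assumes that every audit crawls the full number of pages you selected and that one project uses the whole monthly allowance. The names PLAN_LIMITS and audits_per_month are made up for this example.

```python
# Limits copied from the list above: (pages per month, pages per audit).
PLAN_LIMITS = {
    "Pro": (100_000, 20_000),
    "Guru": (300_000, 20_000),
    "Business": (1_000_000, 100_000),
}


def audits_per_month(plan: str, pages_per_audit: int) -> int:
    """How many audits of a given size fit into the monthly page limit."""
    monthly_limit, per_audit_limit = PLAN_LIMITS[plan]
    if not 0 < pages_per_audit <= per_audit_limit:
        raise ValueError(f"{plan} allows at most {per_audit_limit} pages per audit")
    return monthly_limit // pages_per_audit


for plan, (_, per_audit) in PLAN_LIMITS.items():
    print(plan, audits_per_month(plan, per_audit))
```

Under these assumptions, a Pro user who audits 20,000 pages each time can run 5 audits per month, while lowering the limit to 5,000 pages allows 20.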

Crawl Source

Setting the crawl source determines how the Semrush Site Audit bot crawls your website and finds pages to audit. In addition to setting the crawl source, you can set masks and parameters to include/exclude from the audit in steps 3 and 4 of the setup wizard.

There are 4 options to set as your Audit’s crawl source: Website, Sitemap on site, Sitemap by URL, and a file of URLs.

1. Crawling from Website means we will crawl your site like Googlebot, using a breadth-first search algorithm and navigating through the links we see in your page’s code, starting from the homepage.

If you just want to audit the most important pages of a site, crawl from Sitemap instead of Website. The audit will then cover the pages listed in your sitemap rather than the ones that are most accessible from the homepage.

2. Crawling from Sitemaps on site means we will only crawl the URLs found in the sitemap that is referenced in your robots.txt file.
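
To show why a breadth-first crawl favors pages close to the homepage, here is a minimal sketch that walks a small made-up link graph and stops at a page limit. The SITE dictionary and the function name are invented for illustration. The real crawler fetches pages over the network and applies many more rules.

```python
from collections import deque


def breadth_first_crawl(links: dict, start: str, page_limit: int) -> list:
    """Visit pages level by level from `start`, stopping at `page_limit` pages."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < page_limit:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site: each page maps to the pages it links to.
SITE = {
    "/": ["/shoes/", "/blog/"],
    "/shoes/": ["/shoes/mens/", "/"],
    "/blog/": ["/blog/post-1/"],
    "/shoes/mens/": ["/shoes/mens/hiking-boots/"],
}

print(breadth_first_crawl(SITE, "/", page_limit=4))
```

With a limit of 4 pages, the deeper pages (/blog/post-1/ and /shoes/mens/hiking-boots/) are never reached, which is the situation where a sitemap-based crawl source may be the better choice.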

3. Crawling from Sitemap by URL is the same as crawling from “Sitemaps on site,” but this option lets you specifically enter your sitemap URL.

Since search engines use sitemaps to understand which pages they should crawl, you should always try to keep your sitemap as up-to-date as possible and use it as a crawl source with our tool to get an accurate audit.

Note: Site Audit can only use one sitemap URL as a crawl source at a time, so if your website has several sitemaps, the next option (Crawling from a file of URLs) may work as a workaround.
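
If you are not sure how many sitemaps your robots.txt file declares, you can check with Python’s standard urllib.robotparser module (the site_maps() method needs Python 3.8 or later). The robots.txt content and URLs below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap-products.xml
Sitemap: https://www.example.com/sitemap-blog.xml
"""


def sitemaps_in_robots(robots_txt: str) -> list:
    """Return the sitemap URLs declared in a robots.txt document."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


for sitemap_url in sitemaps_in_robots(ROBOTS_TXT):
    print(sitemap_url)
```

If this prints more than one URL, the single-sitemap limitation described in the note above applies to your site.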

4. Crawling from a file of URLs lets you audit a super-specific set of pages on a website. Make sure that your file is properly formatted as a .csv or .txt with one URL per line, and upload it directly to Semrush from your computer.

This method is useful if you want to check specific pages and conserve your crawl budget. For example, if you changed only a small set of pages on your site, you can audit just those pages without spending crawl budget on the rest of the site.

After uploading your file, the wizard will tell you how many URLs were detected so that you can double-check that it worked properly before running the audit.
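
One way to prepare such a file is to merge the URLs from several sitemaps into a single list with one URL per line. The sketch below does this for sitemap XML held in strings. The sample sitemaps and function names are hypothetical, and real sitemaps would first need to be downloaded. Comparing the printed count with the number the wizard reports is a quick sanity check.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap contents used only for this example.
PRODUCTS = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/shoes/</loc></url>
  <url><loc>https://www.example.com/shoes/mens/</loc></url>
</urlset>"""
BLOG = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/blog/</loc></url>
  <url><loc>https://www.example.com/shoes/</loc></url>
</urlset>"""


def urls_from_sitemap(xml_text: str) -> list:
    """Extract every <loc> value from one sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def build_url_file(sitemaps: list) -> str:
    """Merge sitemaps into one-URL-per-line text, dropping duplicates."""
    merged = []
    for xml_text in sitemaps:
        merged.extend(urls_from_sitemap(xml_text))
    unique = list(dict.fromkeys(merged))  # keeps the first occurrence, in order
    return "\n".join(unique) + "\n"


file_text = build_url_file([PRODUCTS, BLOG])
print(file_text, end="")
print(len(file_text.splitlines()), "URLs")
```

Saving file_text as a .txt file gives you something you can upload as the crawl source.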

Site Audit settings window. For Crawl source, the URLs from file option is selected and highlighted. To demonstrate how this function works, an example file was uploaded. The tool has provided a number of URLs detected in the file (97 URLs), and the number is highlighted in this menu as well.

Crawling JavaScript

If you use JavaScript on your site, you can enable JS rendering in the settings of your Site Audit campaign.

Site Audit settings window with the second tab, Crawler settings, on display. The JS rendering part, which is the last one on this tab, is highlighted to make this option easier to find.

Note that this feature is available only with Guru and Business subscriptions.

Advanced Setup and Configuration

Note: The following four steps of the configuration are advanced and optional.

Crawler Settings

This is where you can choose the user agent that will crawl your site. First, set your audit’s user agent by choosing between the mobile or desktop version of either SemrushBot or Googlebot. You can also choose the OpenAI-Search user agent, which will check whether your website is crawlable by the OpenAI search bot.

By default, we check your site with our mobile crawling bot, which audits your website in the same way Google’s mobile crawler would go through it. You can switch to the Semrush desktop crawler at any time.

Site Audit settings menu. On the second tab, titled "Crawler settings", the User agent section is highlighted. The User agent selection has the default option enabled: SiteAuditBot-Mobile.

As you change the user agent, you’ll see the code in the dialog box below change as well. This is the user agent string, which you can use in a curl command if you want to test the user agent on your own.
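
If you prefer Python to curl, the same kind of test can be sketched with the standard library. The USER_AGENT value below is a placeholder, not the real string: copy the actual one from the dialog box. The function names are made up for this example.

```python
import urllib.error
import urllib.request

# Placeholder only: copy the real string from the dialog box in the wizard.
USER_AGENT = "ExampleAuditBot/1.0 (hypothetical placeholder)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Create a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url: str, user_agent: str = USER_AGENT, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends to this user agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; report their code instead.
        return error.code


request = build_request("https://www.example.com/")
print(request.get_header("User-agent"))
```

Calling fetch_status with your own URL and comparing the result with what a regular browser receives can hint at whether the server treats the crawler differently.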

Crawl-Delay Options

Next, you have 3 options for setting a crawl delay: Minimum delay, Respect robots.txt, and 1 URL per 2 seconds.

If you leave the “Minimum delay between pages” option selected, the bot will crawl your website at its normal rate. By default, SemrushBot will wait around one second before starting to crawl another page.

If you have a robots.txt file on your site and have specified a crawl delay in it, you can select the “Respect robots.txt crawl-delay” option to have our Site Audit crawler follow that delay.

Below is what a crawl delay looks like in a robots.txt file (the value is the number of seconds to wait between requests):

Crawl-delay: 20

If our crawler slows down your website and you do not have a crawl delay directive in your robots.txt file, you can tell Semrush to crawl 1 URL per 2 seconds. This may force your audit to take longer to complete, but it will cause fewer potential speed issues for actual users on your website during the audit.
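
To get a feel for how the delay affects audit duration, you can read the Crawl-delay value with Python’s urllib.robotparser and multiply it by the number of pages. This is a rough lower bound that assumes pages are fetched one at a time and ignores response times. The function names are hypothetical.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def crawl_delay_seconds(robots_txt: str, user_agent: str = "*") -> Optional[int]:
    """Return the Crawl-delay that applies to `user_agent`, or None if absent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def estimated_hours(pages: int, delay_seconds: float) -> float:
    """Time spent waiting between requests, assuming one page at a time."""
    return pages * delay_seconds / 3600


robots_txt = "User-agent: *\nCrawl-delay: 20"
delay = crawl_delay_seconds(robots_txt)
print(delay)
print(round(estimated_hours(20_000, 2), 1))
```

For a 20,000-page audit, the “1 URL per 2 seconds” option works out to roughly 11 hours of waiting alone under these assumptions, and a Crawl-delay of 20 would stretch that to more than 100 hours.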

Allow/Disallow URLs

To crawl specific subfolders or block certain subfolders of a website, refer to the Allow/Disallow URLs Site Audit setup step. This step also allows auditing multiple specific subfolders at once.

In the text box, enter everything in the URL that follows the TLD. For example, if you wanted to crawl the subfolder http://www.example.com/shoes/mens/, you would enter /shoes/mens/ into the allow box on the left.

Allow/Disallow URLs step in the Site Audit settings. On the screenshot, there are two fields for both allow/disallow masks. In the "allow" field, the following example is added: /shoes/mens/. A red arrow is pointing at it, indicating that this is what format masks need to have to apply correctly.

To avoid crawling specific subfolders, you would have to enter that subfolder’s path in the disallow box. For example, to crawl the men’s shoes category but avoid the hiking boots sub-category under men’s shoes (https://example.com/shoes/mens/hiking-boots/), you would enter /shoes/mens/hiking-boots/ in the disallow box.

Example on how to further specify the allow/disallow mask settings. In the "allow" field, a red arrow is pointing at the same /shoes/mens/ example. In "disallow", a second arrow is pointing at /shoes/mens/hiking-boots to indicate that certain subfolders can be excluded despite the "allow" section directives.

If you forget to enter the / at the end of the path in the disallow box (for example, /shoes), Semrush will skip all pages in the /shoes/ subfolder as well as all URLs that begin with /shoes (such as www.example.com/shoes-men).
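
The effect of the trailing slash is easier to see in code. The sketch below assumes that masks are matched as plain path prefixes and that a disallow mask wins over an allow mask, which is consistent with the examples above but is not a description of Semrush’s actual matching logic. The function name is hypothetical.

```python
def is_crawled(path: str, allow: list, disallow: list) -> bool:
    """Decide whether a URL path passes the allow/disallow masks (prefix match)."""
    if any(path.startswith(mask) for mask in disallow):
        return False  # disallow is assumed to take priority
    # With no allow masks, everything that is not disallowed is crawled.
    return not allow or any(path.startswith(mask) for mask in allow)


allow = ["/shoes/mens/"]
disallow = ["/shoes/mens/hiking-boots/"]
print(is_crawled("/shoes/mens/sneakers/", allow, disallow))
print(is_crawled("/shoes/mens/hiking-boots/alpine/", allow, disallow))

# Without the trailing slash, the mask also matches unrelated paths.
print(is_crawled("/shoes-men", [], ["/shoes"]))
print(is_crawled("/shoes-men", [], ["/shoes/"]))
```

The last two calls differ only in the trailing slash of the disallow mask, yet only the second one lets /shoes-men through.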

Remove URL Parameters

URL parameters (also known as query strings) are elements of a URL that do not fit into the hierarchical path structure. Instead, they are added to the end of a URL and pass additional information to the page being requested.

A URL parameter consists of a ? followed by the parameter name (page, utm_medium, etc.), an =, and a value. Additional parameters are separated by &.

So “?page=3” is a simple URL parameter that could indicate the third page of a paginated list on a single URL.

The 4th step of the Site Audit configuration allows you to specify any URL parameters that your website uses in order to remove them from the URLs while crawling. This helps Semrush avoid crawling the same page twice in your audit. If a bot sees two URLs, one with a parameter and one without, it may crawl both pages and waste your crawl budget as a result.

Remove URL parameters step in Site Audit settings. In the parameters fields, several example options are displayed, with a red arrow pointing at them to bring attention to what the URL parameter format should look like to work.

For example, if you add “page” to this box, Site Audit removes the “page” parameter from any URL that contains it, such as URLs ending in ?page=1, ?page=2, and so on. URLs that differ only by that parameter (for example, “/shoes/” and “/shoes/?page=1”) are then treated as one URL and crawled only once.
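
The idea of removing parameters before crawling can be illustrated with Python’s urllib.parse. This is a minimal sketch of the concept, not Semrush’s implementation. The remove_parameters function and the example URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_parameters(url: str, removed: set) -> str:
    """Drop the listed query parameters from a URL and keep everything else."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in removed
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "https://www.example.com/shoes/",
    "https://www.example.com/shoes/?page=1",
    "https://www.example.com/shoes/?page=2&utm_medium=email",
]
unique = {remove_parameters(url, {"page", "utm_medium"}) for url in urls}
print(unique)
```

All three URLs collapse into a single entry once “page” and “utm_medium” are removed, so the page would be crawled once instead of three times.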

Common uses of URL parameters include pages, languages, and subcategories. These types of parameters are useful for websites with large catalogs of products or information. Another common URL parameter type is UTMs, which are used for tracking clicks and traffic from marketing campaigns.

If you already have a project set up and would like to change your settings, you can do so using the Settings gear:

Site Audit Overview report with the settings menu opened via the gear icon on the top right. The settings menu is a dropdown and can be scrolled down for additional settings options. The list of such options is highlighted with red, as well as the gear icon, to make them easier to find.

Select the “Masks” and “Removed Parameters” options and follow the same directions as above.

Bypass Website Restrictions

To audit a website in pre-production or hidden by basic access authentication, step 5 offers two options:

  1. Bypassing the disallow in robots.txt and robots meta tag
  2. Crawling with your credentials to bypass password-protected areas

If you want to bypass the disallow directives in robots.txt or in the robots meta tag (usually found in your website’s <head> section), you will have to upload the .txt file provided by Semrush to the main folder of your website.

You can upload this file the same way you would upload a file for GSC verification, for example, directly into your website’s main folder. This process verifies your ownership of the website and allows us to crawl the site.

The Bypass website restrictions step of Site Audit settings. Each widget of this step is highlighted with red. A red arrow under number 1 points at the "Bypass disallow in robots.txt and by robots meta tag" option checked in. The second arrow under number 2 points at the "Crawling with your credentials option". Crawling with your credentials option is also enabled, presenting the username and password fields.

Once the file is uploaded, you can start the Site Audit and gather results.

To crawl with your credentials, simply enter the username and password that you use to access the hidden part of your website. Our bot will then use your login info to access the hidden areas and provide you with the audit results.
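
If the hidden area uses basic access authentication, you may want to confirm that the credentials work before entering them in the wizard. The sketch below only builds the request and shows what the Authorization header looks like. The function names, the staging URL, and the credentials are placeholders, and this is not a description of how the Semrush bot sends them.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_authenticated_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Create a request that carries Basic credentials for a protected area."""
    headers = {"Authorization": basic_auth_header(username, password)}
    return urllib.request.Request(url, headers=headers)


# Placeholder URL and credentials for illustration only.
request = build_authenticated_request("https://staging.example.com/", "user", "pass")
print(request.get_header("Authorization"))
```

Passing the request to urllib.request.urlopen and getting a 200 response instead of 401 suggests that the username and password are correct.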

Schedule

Lastly, select how often you would like us to automatically audit your website. Your options are:

  • Weekly (choose any day of the week)
  • Daily
  • Once

You can always re-run the audit at your convenience within the Project.

An example of schedule options on the last tab of Site Audit setup menu. A dropdown menu is opened, demonstrating weekly option for each day of the week, daily, and one-time crawl option.

After completing all of your desired settings, select “Start Site Audit.”

Troubleshooting

If you see an “auditing domain has failed” dialog, check that our Site Audit crawler is not blocked by your server. To ensure a proper crawl, please follow our Site Audit Troubleshooting steps to whitelist our bot.

Alternatively, you can download the log file that’s generated when the failed crawl occurs and provide the log file to your webmaster so they can analyze the situation and try to find a reason why we are blocked from crawling.

Connecting Google Analytics and Site Audit

After completing the setup wizard, you will be able to connect your Google Analytics account to include issues related to your top-viewed pages. 

If any issue persists with running your Site Audit, try Troubleshooting Site Audit.
