Whitelisting independent search crawlers

Seznam is a Czech search engine that uses its own web crawlers to build an independent search index. While it is a Czech-first search engine, it has robust English-language search results, including a good selection of New Leaf Journal results for certain queries. I was interested enough in the project to navigate the mostly-Czech website to secure webmaster tools. (I will note that Seznam has an admirably clean UI and lack of clutter.) I came across an interview with a top Seznam executive which included something of a public service announcement asking webmasters to not block Seznam’s crawler. I agree fully with the request and want to share it here while also explaining why it is important.

Table of Contents

The importance of search engines with their own indexes
Seznam’s PSA
Concurring with Mr. Pergler

The importance of search engines with their own indexes

Search engines with their own indexes are important for adding diversity to the search engine market. Google is the dominant search engine in the English-speaking world (and in most of the non-English speaking world, with very limited exceptions – namely China and Russia), followed by Microsoft Bing as a distant second. Many of the alternatives draw their results from Google’s or Bing’s index (usually Bing). See, for example, DuckDuckGo (Bing), Ecosia (Bing), Qwant (Bing), and Startpage (Google) (see my review of DuckDuckGo Lite and Qwant Lite).

(Additional reviews: Peekier (Bing-based), Norton Safe Search (Ask-based, which in turn is based on Google), FrogFind (DuckDuckGo front-end, which in turn is based on Bing), Oscobo (Bing).)

While these front-ends serve valuable purposes, their status as alternatives is limited by their upstream dependencies. See, for example, my articles on how Bing’s censoring of search results for the Chinese Communist Party or blacklisting domains affects every search tool that uses Bing’s index (The New Leaf Journal recently fell victim to the latter phenomenon). True alternative search engines give people the option to see results from entirely independent indexes that are not encumbered by the policies and biases of Google and Bing.

Seznam’s PSA

On April 17, 2019, Search Engine Journal published an interview with Mr. Tomáš Pergler, the director of Seznam’s Search Division. The interview is worth reading in full, but we will focus on the section dealing with Seznam’s struggles with its bot, SeznamBot, being blocked from crawling websites outside the Czech Republic.

The interviewer, Mr. Dan Taylor, asked Mr. Pergler whether SeznamBot has trouble crawling non-Czech websites. Mr. Pergler answered in the affirmative:

Unfortunately, we experience serious access problems when crawling international web. Increasing number of websites tends to block all traffic except for GoogleBot.
Tomáš Pergler

Websites can set up what is called a robots.txt file. The robots file contains directives for web crawlers, or bots. Some websites opt to severely limit which bots can crawl their sites. As Mr. Pergler notes, some sites allow only GoogleBot. I will venture that many allow only GoogleBot and BingBot. Many crawlers of the nefarious variety will ignore robots directives. However, legitimate crawlers such as Seznam’s do obey them. Thus, a robots file that does not allow SeznamBot will keep it from crawling the website. Mr. Pergler went on to note that, in 2019 at least, some lists of unwanted crawler IPs utilized by webmasters included those of SeznamBot.
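To make the problem concrete, here is a minimal sketch of the kind of robots file Mr. Pergler describes, checked with the robots.txt parser in Python's standard library. The file contents, the example.com address, and the helper name is_allowed are my own illustrations rather than anything taken from the interview, and a real crawler may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that admits Googlebot and turns everyone else away.
RESTRICTIVE_ROBOTS = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-article"  # placeholder URL
    for agent in ("Googlebot", "SeznamBot"):
        print(agent, is_allowed(RESTRICTIVE_ROBOTS, agent, page))
```

Under these assumptions, a well-behaved crawler that identifies itself as SeznamBot falls into the catch-all group and stays away, while Googlebot is let through.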

After discussing the problems, Mr. Pergler expressed his hope that the interview would improve the sad state of affairs for SeznamBot:

It would be great if this article would help encourage webmasters to allow SeznamBot to their sites, so it may bring some visits from users in the Czech Republic.
Tomáš Pergler

In short, he hoped that webmasters would consider expressly allowing SeznamBot if they were otherwise blocking it. In the alternative, Mr. Pergler suggested including contact information in the Robots file for interested persons and organizations running non-nefarious crawlers to get in touch and ask for permission to crawl.
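Both of Mr. Pergler's suggestions can be sketched in a single robots file: an explicit group that names SeznamBot alongside the big two, plus a comment line with contact details for other crawler operators. The address below is a placeholder and the exact wording of the comment is my own assumption. Robots comments are free text for humans, not a directive that crawlers read. The Python standard library parser is used only to confirm the effect.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the contact address is a placeholder, not a real one.
PERMISSIVE_ROBOTS = """\
# Crawler operators: write to webmaster@example.com to request access.
User-agent: Googlebot
User-agent: bingbot
User-agent: SeznamBot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(PERMISSIVE_ROBOTS.splitlines())

page = "https://example.com/some-article"  # placeholder URL
seznam_allowed = parser.can_fetch("SeznamBot", page)
unlisted_allowed = parser.can_fetch("UnlistedBot", page)  # made-up crawler name

print("SeznamBot allowed:", seznam_allowed)
print("UnlistedBot allowed:", unlisted_allowed)
```

A site that is comfortable with well-behaved crawlers in general could instead drop the catch-all block entirely, which avoids having to maintain a list of approved names.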

Concurring with Mr. Pergler

This interview is from 2019, so I do not know if the situation has improved for Seznam since then. However, I have noted that it has done a solid job of indexing The New Leaf Journal, and its English-language results seem decent from a cursory look. So it seems that Seznam is doing fairly well with its international indexing efforts. Seznam is a very neat project. Unless a public website has distinct bandwidth limitations, I recommend following Mr. Pergler’s advice and ensuring that it does not block the friendly SeznamBot. One way to promote valuable attempts to build independent search engines is by allowing them to index interesting new content. Moreover, as Mr. Pergler noted, there is no harm in being indexed by one of the major search players in the Czech market.

I offer the following resources for those who are interested in learning more and making sure that their sites are available to be indexed by Seznam:

In the spirit of the article, I will take the time to promote several other independent search engine bots that should be allowed, absent special cases (the list is not exhaustive).

Mojeek
InfoTiger
Marginalia Search (primarily for lightweight and non-commercial sites)
Common Crawl

These are just a few examples of crawlers from independent search engines that are worth allowing. Interested webmasters can research search engines with their own indexes to find others to look into (see a good resource).
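For webmasters who want to check their own robots file against several of these crawlers at once, a rough audit might look like the sketch below. The user-agent tokens for Mojeek and Common Crawl are my assumption and should be confirmed against each project's documentation. I have left out InfoTiger and Marginalia rather than guess at theirs. The sample file and the helper name blocked_agents are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens; verify each against the crawler's own documentation.
INDEPENDENT_AGENTS = ["SeznamBot", "MojeekBot", "CCBot"]


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents from the list that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that shuts out one crawler and hides a private folder.
SAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS, INDEPENDENT_AGENTS))  # prints ['CCBot']
```

This only inspects robots directives. It says nothing about blocks applied elsewhere, such as the IP lists Mr. Pergler mentioned.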

If a webmaster is having an issue with the behavior of one of these bots or another bot from a legitimate, independent search engine, he or she can try modifying the robots.txt file or contacting the person or organization behind the crawler for guidance before blocking. I have not personally had to raise an issue regarding a crawler, but I can say from experience that Mojeek (listed above) is very quick to respond to questions and issues raised through its contact form and email. The same may be true of others.

For general guidance on the robots.txt format and standards, see robotstxt.org and Google Search Console’s guide to the standard.

Helping small search engines that are trying to build and improve their own indexes is one way, in the long run, to help make the internet a better place. If you run a blog, writing website, or site with some valuable resources, information, or media, you can do your part by making it possible for independent search engines to add your web pages to their indexes.
