User:Colin M/Determining commonname

This essay is in development.

It contains the advice or opinions of one or more Wikipedia contributors.

Essays may represent widespread norms or minority viewpoints. Consider these views with discretion, especially since this page is still under construction.

A keystone of Wikipedia's article naming policy is WP:COMMONNAME:

Wikipedia does not necessarily use the subject's "official" name as an article title; it generally prefers the name that is most commonly used (as determined by its prevalence in a significant majority of independent, reliable English-language sources)

In some cases, it may not be immediately clear which name (if any) satisfies this condition, and editors may disagree. This page suggests some concrete methods for collecting and interpreting data relevant to WP:COMMONNAME.

Weighing different sources

The WP:COMMONNAME is the name most commonly used in independent, reliable English-language sources. Any tally of usage should therefore exclude:

Unreliable sources: For example, blogs or other user-generated content not written by established subject matter experts (though usage in everyday communications may have some evidentiary use for evaluating the WP:CRITERIA of recognizability and naturalness)
Non-independent sources: For example, if the topic under consideration is a company, this would include that company's website, press releases issued by the company, or other publications of the company.
Non-English sources: See WP:UE for how to handle the rare case where there is insufficient English-language coverage to determine an English commonname.

Among sources that pass these filters, it may be prudent to give additional weight to:

More recent sources, particularly when usage of different names has shifted over time (see WP:NAMECHANGES)
High quality, in-depth sources. For example, in determining the name to use for a scientific topic, the name used by a textbook which devotes a chapter to the topic is likely to carry more weight than the name used in a newspaper article which has just one passing mention of the topic.
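
As a rough illustration of the weighting idea above, the following sketch tallies a hand-collected sample of sources, giving extra weight to recent and in-depth ones. The `Source` record, the `weighted_tally` function, the multipliers, and the sample data are all hypothetical, chosen only for illustration; nothing in policy prescribes particular numbers, and a tally like this can inform editorial judgement but not replace it.

```python
from dataclasses import dataclass

@dataclass
class Source:
    name_used: str   # which candidate title the source uses
    year: int        # publication year
    in_depth: bool   # True for e.g. a textbook chapter, False for a passing mention

def weighted_tally(sources, recent_since, recency_weight=2.0, depth_weight=3.0):
    """Sum one weight per source for each name. The weights are illustrative only."""
    totals = {}
    for s in sources:
        weight = 1.0
        if s.year >= recent_since:
            weight *= recency_weight   # favour recent usage (cf. WP:NAMECHANGES)
        if s.in_depth:
            weight *= depth_weight     # favour in-depth coverage over passing mentions
        totals[s.name_used] = totals.get(s.name_used, 0.0) + weight
    return totals

# Hypothetical sample: three older passing mentions vs. two recent sources.
sample = [
    Source("Name A", 1995, False),
    Source("Name A", 2001, False),
    Source("Name A", 2003, False),
    Source("Name B", 2019, True),
    Source("Name B", 2021, False),
]
totals = weighted_tally(sample, recent_since=2015)
print(totals)
```

In this made-up sample, "Name A" leads on a raw count (three sources to two), but "Name B" leads once recency and depth are taken into account.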

Manual methods

Perhaps the most robust and reliable method for determining the commonname of a topic is to manually inspect a sufficiently large sample of the most in-depth, high-quality sources dealing with the topic. This approach carries a few drawbacks and obstacles:

Unless you're a subject-matter expert, it may be difficult to determine the most significant sources for the topic at hand.
- If the article on the topic is well-developed, the "References" section may be a reasonable proxy (paying special attention to any sources which are frequently cited)
- If this is a scholarly topic, sources with a high citation count (as listed, for example, in Google Scholar) are likely to be important in the field.
If the most significant sources about the topic are print books or paywalled articles, gaining access to them may prove difficult or impossible.
Compared to the automated methods described below, this approach is slow going.

As a quick heuristic approximation, consider examining how the sources cited in the article refer to the article subject in their titles (if applicable).
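
The title heuristic can be sketched as a small script, assuming you have copied the titles of the cited sources into a list by hand. The function name `count_title_usage` and the sample titles are hypothetical.

```python
import re

def count_title_usage(titles, candidates):
    """Count how many source titles contain each candidate name
    (case-insensitive, matching the whole phrase only)."""
    counts = {name: 0 for name in candidates}
    for title in titles:
        for name in candidates:
            # Lookarounds prevent matching the phrase inside a longer word.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, title, flags=re.IGNORECASE):
                counts[name] += 1
    return counts

# Hypothetical titles, as they might appear in a "References" section.
titles = [
    "Why we keep beating a dead horse",
    "Flogging a dead horse: a history of the idiom",
    "Stop flogging a dead horse, say economists",
    "Annual report on equine welfare",
]
counts = count_title_usage(titles, ["flogging a dead horse", "beating a dead horse"])
print(counts)
```

Titles that mention neither name (like the last one here) simply do not count toward either candidate.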

Automatic methods

When deciding between titles A and B, a natural approach is to do a search for each title and see which term returns more results. This can be an effective path, but there are a number of important considerations for getting useful data.

Choosing a search engine

General-purpose search engines such as Google web search, Bing, or DuckDuckGo should never be used for ascertaining the commonname for a subject. They indiscriminately index all publicly accessible webpages, including lots of user-generated content (such as web forums and blogs), procedurally generated or spammy content, and otherwise unreliable sources. Because WP:COMMONNAME is concerned only with usage in reliable, independent, English-language sources, you should choose and configure a search engine which is, as far as possible, limited to documents matching these criteria.

The most appropriate search engines will depend on the nature of the subject's coverage (is most of the coverage recent? Is it covered in the news? In books? In journal articles?). Some of the most frequently useful places to search include:

Google Books Ngram Viewer and Google Books search
Google News search (and other news databases and aggregators, such as Proquest and Newspapers.com)
Google Scholar (and other sites indexing academic papers, such as Jstor)
The archives of specific news publications, either via a search feature on the publication's own website, or by a Google search using the site: keyword

Tips and caveats specific to particular search engines can be found in later sections.

Constructing queries

Search terms should be enclosed in quotes, e.g. "French toast", not French toast. This is especially important for multi-word titles, though it's not a bad practice to do it for single words too. (For example, some search engines will do stemming on unquoted search terms, causing a term like Staples to also match staple, stapled, etc.)

Be aware of the possibility of false positive matches if the term you're searching for can also refer to other topics. For example, many results for the query "Iron maiden" will be referring to the band rather than the torture device. If a basic search would give a high false positive rate, consider narrowing your query by:

Adding keywords relating to the topic area. For example, "hawk" "foreign policy" will greatly reduce bird-related false positives compared to just "hawk".
Restricting the context in which it appears, by searching for a particular word sequence which includes the name. For example, rather than comparing counts for "chair" vs. "chairman", try "appointed chair" vs. "appointed chairman"

These restrictions on your query may also exclude many true positives, but that's not a problem. We only care about the relative frequency of the candidate titles, not their absolute frequency, so precision is much more important than recall.
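
The query-building advice above can be sketched in code. The helpers `build_query` and `share` are hypothetical names, and `example.org` is a placeholder domain. The script only assembles query strings and compares counts that you have read off the results pages yourself; it does not contact any search engine.

```python
def build_query(name, context_keywords=(), site=None):
    """Return a search string with every term quoted,
    optionally limited to one website via Google's site: operator."""
    parts = ['"{}"'.format(name)]
    parts += ['"{}"'.format(keyword) for keyword in context_keywords]
    if site:
        parts.append("site:" + site)
    return " ".join(parts)

def share(count_a, count_b):
    """Fraction of the combined hits that belong to the first name
    (None if both counts are zero)."""
    total = count_a + count_b
    return count_a / total if total else None

# Narrowing by topic keyword, and by surrounding word sequence.
print(build_query("hawk", ["foreign policy"]))
print(build_query("appointed chair", site="example.org"))
print(build_query("appointed chairman", site="example.org"))

# Hypothetical counts read off the two results pages.
print(share(30, 10))
```

Because only the relative frequency matters, the same narrowing (keywords, word sequence, site restriction) should be applied to every candidate name being compared.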

Interpreting the results

Comparing the count of results returned for each potential title should give some indication of which is likely to be the commonname. However, it's a good practice to manually inspect at least a small sample of the results returned for each query as a sanity check (especially if the margin between the counts is small). Keep an eye out for the following issues which might complicate the interpretation of the raw counts:

False positive matches, where the term is being used to refer to something other than the title of the article (see above for mitigation strategies)
Unreliable, non-independent, or non-English sources. Even specialized search engines such as Google News and Google Scholar may still index some unreliable sources (such as articles published in predatory journals, or news sources which have been assessed by the community as generally unreliable or deprecated).
Duplicate sources. For example, do many of your results come from the same press wire article which was republished by multiple news outlets? This can be especially relevant if the absolute counts you're dealing with are low.
A single source might use multiple names for the subject throughout the article. It may be that the raw counts for names A and B are close, but on closer inspection, most sources prefer to use name B through most of the running text, and generally mention A only once.
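
The last two issues (duplicates, and sources that mention both names) can be partly checked with a script once you have the text of a sample of results. The following is a minimal sketch, assuming the article texts have been collected by hand. The function names and the sample texts are hypothetical, and the duplicate check only catches near-verbatim copies.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and collapse punctuation and whitespace,
    so that republished copies of the same text compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def preferred_names(articles, candidates):
    """For each distinct article, find the candidate name it uses most often,
    and count one 'vote' per article for that name."""
    seen = set()
    tally = Counter()
    for text in articles:
        key = normalize(text)
        if key in seen:   # skip republications, e.g. wire copy
            continue
        seen.add(key)
        uses = {name: len(re.findall(re.escape(name), text)) for name in candidates}
        best = max(uses, key=uses.get)
        # Only count the article if one name is clearly used most often.
        if uses[best] > 0 and list(uses.values()).count(uses[best]) == 1:
            tally[best] += 1
    return tally

# Hypothetical sample: one article, a republished copy of it, and a second article.
wire = "Kyiv (also known as Kiev) is the capital. Kyiv's mayor spoke. Kyiv officials agreed."
articles = [wire, wire + "\n", "The mayor of Kiev has resigned."]
tally = preferred_names(articles, ["Kiev", "Kyiv"])
print(tally)
```

A raw search on this sample would report three matching documents for "Kiev" and two for "Kyiv", whereas the per-article tally shows one distinct source preferring each name.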

Notes on particular search engines

Google News

It's very important to note that the count of results at the top of the page ("About n results (x seconds)") is fuzzed, sometimes to an extreme degree. The real count can usually be found by clicking through to the last page of results (by following the numbered links at the bottom of the results page).

For example, at time of writing, a Google News search for "stuck behind a horse" reported "About 227 results". However, scrolling down to the bottom, we can see there are only four pages of results. Once we click on the fourth page, the figure at the top gets revised: "Page 4 of 32 results".

But for queries which match a large number of pages, this technique is not effective. For example, when we follow this procedure for the very common phrase "the weather in", Google News stops at "Page 32 of about 316 results". This is certainly not accurate, since doing the same search but limiting to results from the past week gives 266 results. If this is a problem, consider narrowing your search queries in an arbitrary way. For example, instead of "Kiev" vs. "Kyiv", try "the mayor of Kiev has" vs. "the mayor of Kyiv has".

Google Books Ngram Viewer

The Ngram Viewer is one of the most powerful tools available for collecting commonname data. Given one or more phrases of up to 5 words each, it will display their relative frequency in English books over time. For example, here's an ngram comparison of "flogging a dead horse" vs. "beating a dead horse".

It is built from scans of millions of English-language books (the current corpus extends through 2019), so it represents an enormous sample size. Some important considerations:

Search strings are case sensitive (unless you toggle the "Case-Insensitive" button).
A phrase must appear in at least 40 different books to be searchable in the Ngram Viewer.
Searching for the target phrase in Google Books search will give you some limited ability to inspect the usages indexed by the Ngram Viewer in books. But bear in mind:
- Unlike the Ngram Viewer, Google Books search is not case sensitive, and is mostly indifferent to punctuation
- For copyright reasons, Google Books may give you only a limited snippet of the phrase's appearance in the book (or none at all).

See the official documentation for further details.