
User:Sphilbrick/Sandbox re noindexing of new articles by novices

From Wikipedia, the free encyclopedia

New pages, created by novices, should include noindex by default

Proposal summary

Pages created by users who are not autoconfirmed should, by default, include a noindex template. The template should be removable only by an autoconfirmed user.

What problem is being addressed?

Brand-new users, many of them unfamiliar with Wikipedia's guidelines, create pages that violate many rules. Many of these pages are indexed by search engines such as Google. Links to Wikipedia pages often rank very high in Google's results, so these pages are very often among the first presented to a person using a search engine. For some rule violations (poor grammar, for example), it is only mildly troubling that such a page would appear in a search engine. However, other violations, such as copyright violations and libelous statements about living persons, are far more serious. While established editors can and do run afoul of the rules, pages created by the newest Wikipedians are more likely to be troublesome.

I did a very unscientific review of a few recently created pages. Results are here: User:Sphilbrick/Sandbox for support of proposal. The sample size is too small to support broad conclusions, but it does illustrate some of the issues. In short, I reviewed a number of new pages created by users with fewer than ten edits and found eight that raised concerns. Six of the eight are already in Google and can still be found there even though, in some cases, the underlying page has been deleted.

What are the benefits?

Under this proposal, none of the eight pages identified would be found using a search engine (outside of Wikipedia). One of these pages is flagged as a copyvio, and all but one of the others reflect poorly on Wikipedia.

While this benefit is admittedly modest, I estimate that dozens of problem pages are created each day by non-autoconfirmed users, and these pages are finding their way into search engines. If a better estimate of the potential count is needed, I can do a more rigorous analysis, but I'll note that my review covered fewer pages than are created in a single day, so it is highly unlikely that the number is less than thousands per year. (However, I don't know how long it takes for a deleted page to drop out of Google, so I don't have an estimate of the number of problem pages in existence at any one time.)
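The "thousands in a year" floor can be reproduced with a line of arithmetic. This is only a rough sketch: it takes the eight problem pages from the first-pass review as a per-day minimum, on the assumption that the sampled stretch of Special:NewPages is typical. The variable names are illustrative.

```python
# Eight problem pages were found in a sample smaller than one day of new pages,
# so eight per day is treated here as a lower bound (assumption: a typical day).
problem_pages_in_sample = 8
days_per_year = 365

lower_bound_per_year = problem_pages_in_sample * days_per_year
print(lower_bound_per_year)  # 2920
```

Even this conservative floor lands in the low thousands; the "dozens per day" estimate would push it higher.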

What are the costs?

I see four types of costs:

  1. Implementation time
  2. Education time
  3. Template removal time
  4. False positive cost

My list is somewhat long, but each item is minor. I am not a programmer (IANAP), so I don't know precisely what is necessary to implement this, but I assume that when a user creates a new page, some code runs that would need modification. That code needs to determine whether the user is autoconfirmed, but my understanding is that this check already happens every time a user performs an action, so it adds no cost. The code would have to create the new page in a slightly different way for the two types of users, but I think adding a template should not be a major cost. (Does the system already add a non-patrolled template, or is that handled a different way?)
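To make the implementation cost concrete, here is a minimal sketch of the logic in Python. It is not MediaWiki code. The `User` class, the function names, and the tag text are hypothetical, and the thresholds (4 days and 10 edits) are an assumption about how "autoconfirmed" is defined; they are parameters so that a stricter rule could be tried instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed thresholds for "autoconfirmed"; the proposal itself does not state them.
MIN_ACCOUNT_AGE = timedelta(days=4)
MIN_EDIT_COUNT = 10

NOINDEX_TAG = "{{NOINDEX}}"  # placeholder tag text used by this sketch


@dataclass
class User:
    """Hypothetical stand-in for an account record."""
    name: str
    registered: datetime
    edit_count: int


def is_autoconfirmed(user, now, min_age=MIN_ACCOUNT_AGE, min_edits=MIN_EDIT_COUNT):
    """True if the account is old enough and has made enough edits."""
    return (now - user.registered) >= min_age and user.edit_count >= min_edits


def initial_page_text(user, body, now):
    """Prepend the noindex tag when the page creator is not autoconfirmed."""
    if is_autoconfirmed(user, now):
        return body
    return NOINDEX_TAG + "\n" + body


def may_remove_noindex(user, now):
    """Only autoconfirmed users may remove the tag."""
    return is_autoconfirmed(user, now)
```

The only branch added at page creation is the one in `initial_page_text`, which is the sense in which the change looks small.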

If the proposal is implemented, user documentation would need a rewrite. The cost is not zero, and I may be underestimating the number of places affected, but it is not major.

If the proposal is implemented, someone will have to remove the template at some point. However, most of these pages will carry templates relating to other issues, so removing this template when the other problem messages are removed should add little time.

Finally, there is a possibility that a new user will create an excellent article about a breaking news item, and users of search engines will come across the article later than they would under the present approach. While this needs to be mentioned, I find it difficult to believe it could be a meaningful problem. New pages relating to breaking news are highly likely to be patrolled, and it seems very unlikely that such a page would go more than a few minutes without review by an autoconfirmed user who is able to remove the template.

Background

This proposal was inspired by a discussion in Technical here. This proposal stands on its own, but interested readers might want to follow the link to see, for example, why removal based on a time limit was criticized, or why requiring an administrator to change the status was considered a problem.

Alternative

I feel that the noindex rule should apply more broadly than just to users who are not yet autoconfirmed. I would prefer a threshold of something like a few hundred edits and a few months. However, creating a new class of users is likely to be a big deal, so I chose to draw the line at autoconfirmed. I do note that Tor users have rules applied at a different point: 90 days and 100 edits. If that status is easily available, I would find it a better choice.

Third pass review

Selection criteria:

  • Review Special:NewPages for first eight hours 23 May 2009
  • Identify new pages created by users with <10 edits and without a user page
  • Review page to subjectively determine whether the world would be a better place if the entry could be found in a search engine.
  • Add to list, identify whether the answer is "yes", "no" or "maybe".
  • Exclude redirects
  • Sort list to place more problematic pages first

Observations:

  • These articles are newer, having been created today
  • Over one third (nine of 25) of the identified articles should not (in my opinion) be in a search engine
  • More relevantly, 7 of the 25 have already been identified by other editors as possibly deserving deletion.
  • Seven of the nine articles identified by me are already in Google
  • Six of the seven articles identified by other Wikipedians as problematic are already in Google

| Count | Article | In Google? | Should it be in Google? | Comments |
|---|---|---|---|---|
| 1 | Jonathan Payne | Yes | No | proposed that this article be deleted |
| 2 | Abdelaziz bin Hamad bin Abdullah | Yes | No | |
| 3 | Datti | No | No | so this may be an attack article |
| 4 | Rawls Byrd Elementary School | No | No | |
| 5 | Magaluf Card Game | Yes | No | being considered for deletion |
| 6 | Emc X | Yes | No | proposed that this article be deleted |
| 7 | Mrittika Sen | Yes | No | being considered for deletion |
| 8 | Pab social club | Yes | No | being considered for deletion |
| 9 | Beat Persuasion | Yes | No | being considered for deletion |
| 10 | Holy Life | No | Perhaps | |
| 11 | Paranoia (Eiko Shimamiya) | No | Perhaps | |
| 12 | 3·14 riot | No | Perhaps | |
| 13 | Cameron Cerasani | No | Perhaps | |
| 14 | Hot 100 Brazil | No | Perhaps | |
| 15 | Eggdancer Productions | No | Perhaps | |
| 16 | DosWin32 | No | Yes | |
| 17 | Ismail Yaacob | No | Yes | |
| 18 | Luke Fowler | No | Yes | |
| 19 | Beta Kappa Fraternity | No | Yes | |
| 20 | Princes Park Stadium | No | Yes | |
| 21 | Danny Sullivan (football) | No | Yes | |
| 22 | Rod Culbertson | No | Yes | |
| 23 | Dania Nassief | No | Yes | |
| 24 | Eric Bress | No | Yes | |
| 25 | J. Mackye Gruber | No | Yes | |
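
The counts quoted in the observations can be re-derived from the third-pass table. The sketch below encodes each row as a tuple; treating a non-empty Comments cell as "identified by other editors" is an assumption made for this tally, and the variable names are illustrative.

```python
from collections import Counter

# One tuple per table row: (in_google, should_be_in_google, has_comment).
rows = (
    [("Yes", "No", True), ("Yes", "No", False),
     ("No", "No", True), ("No", "No", False)]   # rows 1-4
    + [("Yes", "No", True)] * 5                  # rows 5-9
    + [("No", "Perhaps", False)] * 6             # rows 10-15
    + [("No", "Yes", False)] * 10                # rows 16-25
)

verdicts = Counter(should for _, should, _ in rows)
unwanted = [r for r in rows if r[1] == "No"]
flagged = [r for r in rows if r[2]]  # assumption: a comment means another editor flagged it

unwanted_in_google = sum(r[0] == "Yes" for r in unwanted)
flagged_in_google = sum(r[0] == "Yes" for r in flagged)

print(len(rows))                                    # 25
print(verdicts["No"], verdicts["No"] / len(rows))   # 9 0.36
print(unwanted_in_google)                           # 7
print(len(flagged), flagged_in_google)              # 7 6
```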

First pass review

Selection criteria:

  • Review Special:NewPages
  • Identify new pages created by users with <10 edits
  • Review page to subjectively determine whether the world would be a better place if the entry could be found in a search engine.
  • Add to list if the answer is "no" or "maybe".
  • Stop after eight entries (selected from fewer than 1000 new articles)

| Article | Created | Edits by creator | In Google? | Should it be in Google? |
|---|---|---|---|---|
| Fstg | 21-May-09 12:10:00 | 4 | Yes | No |
| Ijare | 21-May-09 10:37:00 | 1 | Yes | No |
| David Collier (producer) | 21-May-09 7:14:00 | 2 | No | No |
| Courtney Dowdall | already deleted | <10 | Yes | No |
| University of uva wellassa | 21-May-09 3:27:00 | 9 | Yes | No |
| Women's Action Alliance | 21-May-09 2:31:00 | 7 | No | Maybe |
| Arthur Do | 22-May-09 1:17:00 | 5 | Yes | No |
| Liam Bunston | already deleted | <10 | Yes | No |

Second pass review

Selection criteria:

  • Review Special:NewPages for 25 April 2009
  • Identify new pages created by users with <10 edits and without a user page
  • Review page to subjectively determine whether the world would be a better place if the entry could be found in a search engine.
  • Add to list, identify whether the answer is "yes", "no" or "maybe".
  • Exclude redirects
  • Sort list to place more problematic pages first

Observations:

  • Higher ratio of acceptable pages than in the first review, but this is partly because pages already found to be seriously problematic have been removed from WP and from the list of new pages, even though some can still be found in search engines
  • While many of these pages deserve to be found in search engines, they will be found as soon as any autoconfirmed editor removes the noindex tag, so false positives are not a significant problem


| Count | Article | In Google? | Should it be in Google? |
|---|---|---|---|
| 1 | Charity registry | Yes | No |
| 2 | Laura Stevenson | Yes | No |
| 3 | Ashlee Young | Yes | No |
| 4 | D.imman | Yes | No |
| 5 | Dom (internet slang) | Yes | No |
| 6 | Basantapur High School | Yes | No |
| 7 | Paroom | Yes | No |
| 8 | Fort Weaver Road | Yes | No |
| 9 | Wes Sechrist | Yes | No |
| 10 | Dani Donadi | Yes | No |
| 11 | Chef (tool) | Yes | No |
| 12 | Shabab El Waladeya | Yes | No |
| 13 | Jewett Academy | Yes | No |
| 14 | Commonwealth High School | Did not check | No |
| 15 | Anders lundkvist | Yes | Perhaps |
| 16 | Meddy Ford | Yes | Perhaps |
| 17 | Sekolah Menengah Kebangsaan Petra Jaya | Yes | Perhaps |
| 18 | Shen Qing | Yes | Perhaps |
| 19 | Tribal Seeds | Yes | Perhaps |
| 20 | Ka Ho Mok | Yes | Perhaps |
| 21 | Triazolopyridine | Yes | Perhaps |
| 22 | Fort Barrette Road | Yes | Perhaps |
| 23 | Ganglion mother cell | Yes | Perhaps |
| 24 | Octa-Vibraphone | Did not check | Perhaps |
| 25 | Burns v. Reed | Did not check | Perhaps |
| 26 | Customer to customer | Yes | Yes |
| 27 | Duck duck go | Yes | Yes |
| 28 | PstI | Yes | Yes |
| 29 | HR 4587 | Yes | Yes |
| 30 | Broome Tramway | Yes | Yes |
| 31 | Providencia stuartii | Yes | Yes |
| 32 | Kadiadara | Yes | Yes |
| 33 | Samuel Rüling | Yes | Yes |
| 34 | American xplorer | Yes | Yes |
| 35 | Shiraito Falls | Yes | Yes |
| 36 | Tormented (2009 film) | Yes | Yes |
| 37 | Terry Gygar | Yes | Yes |
| 38 | John Kay (Poet) | Yes | Yes |
| 39 | DFS 332 | Yes | Yes |
| 40 | TheFREEhoudini | Yes | Yes |
| 41 | Hero of War | Yes | Yes |
| 42 | People's United Community | Yes | Yes |
| 43 | Wythenshawe Bus Garage (Manchester) | Yes | Yes |
| 44 | The Singing Scott Brothers | Yes | Yes |
| 45 | Paulina Maj | Yes | Yes |
| 46 | Baragudi | Yes | Yes |
| 47 | Gary M Pomerantz | Yes | Yes |
| 48 | Kawasaki Bajaj Caliber | Yes | Yes |
| 49 | Arma-Goddamn-Motherfuckin-Geddon | Yes | Yes |
| 50 | Mary McGowan | Yes | Yes |
| 51 | KMB Route 1A | Yes | Yes |
| 52 | Mary McCaslin | Did not check | Yes |
| 53 | Schuman Collection | Did not check | Yes |
| 54 | Vaivaka | Did not check | Yes |
| 55 | Sports Team in the Central Pennsylvania Area | Did not check | Yes |
| 56 | Sylvan Lake, New York | Did not check | Yes |
| 57 | Josh Carlson | Did not check | Yes |
| 58 | Yuri Rozhdestvensky | Did not check | Yes |
| 59 | Montcalm High School | Did not check | Yes |
| 60 | Little Neebish Resort | Did not check | Yes |
| 61 | Kathryn Tanner | Did not check | Yes |
| 62 | Sad Clown Bad Dub II | Did not check | Yes |
| 63 | The Grove-Jefferson, Texas | Did not check | Yes |