Wikipedia:Administrators' noticeboard/Incidents/CCI

Darius Dhlomo CCI incident discussion

I have boldly moved a LOT of older discussion from this page to /Archive 1, beginning with DD's unblock request and analysis of the scale of copying. The current status (basically, the outcome of the discussion) is summarized at the CCI case main page and the bot task explanation. For now I am leaving the section structure and some still-relevant discussion on this page, but am boxing it so new comments should be made in a new section. In some cases I have left summaries of the archived material. Note that to keep the size of this page manageable, a lot of the stuff has been chopped even inside the box. Look in the archive if you want to see it all. 67.119.14.196 (talk) 23:25, 17 September 2010 (UTC) [reply]
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section.

Now in /Archive 1.

67.119.14.196 (talk) 22:49, 17 September 2010 (UTC)[reply]


Review of unblock request and discussion of possible community ban

Scale of the problem

What's the scale of the copyright problem here? I've identified these so far: Sammy Korir (2006) Joetta Clark (2007) Canyon Ceman (2008, notice removed by this same editor and still a copyright violation right now), and Phil McMullen (athlete) (2010). I am unable to determine why the 'bot thought that Kamil Damašek was a copyright violation. Is this all that there is? Uncle G (talk) 20:29, 4 September 2010 (UTC)[reply]

Archived further threaded discussion at /Archive 1. 67.119.14.196 (talk) 23:25, 17 September 2010 (UTC)[reply]

Ban

Archived ban proposal and discussion at /Archive 1. Current status is DD indefblocked but not officially banned. Standard offer was extended early on, not sure if offer is still on the table. 67.119.14.196 (talk) 23:25, 17 September 2010 (UTC)[reply]

Mass deletion: Give up and start over

Now at /Archive 1. This was a proposal to delete all 10,000 of DD's articles with a bot. It got considerable support at first, then some opposition, leading to the current consensus to blank the articles for manual history review instead of deleting them outright. I am leaving a few comments here that discuss the nature of the copying, as still having some relevance. These are also in the archive. Another proposal (in the archive) was to move the blanked articles to incubation rather than leaving them in article space. 67.119.14.196 (talk) 23:25, 17 September 2010 (UTC)[reply]
  • Rather than a straightforward mass deletion, may I suggest an element of triage? Some of the articles that this editor created will have been edited by others, and some will be more or less notable than others. If we identify and delete those that are tagged as orphans, unreferenced or tagged for notability would that leave us something more manageable? ϢereSpielChequers 16:04, 5 September 2010 (UTC)[reply]
    • Almost certainly not. I've reviewed a few hundred of the biographies now. Yes, it's less than 5% of the problem, but I was selecting at random, from the list before it was sorted, so I have little suspicion that my sample is biased. Notability is almost never an issue on which these subjects have been challenged or tagged. These are not exactly minor sporting figures and events. Likewise, orphan status would be problematic. Many of these articles are on navigation templates for sports teams, regular sporting competitions, and the like, and are unlikely to be orphans. (Quite a few cross-reference one another, too.) Nor, indeed, is lack of any citations a recurrent issue. Darius Dhlomo has linked almost all of xyr creations to on-line sports databases and the like. As criteria for filtering out the problematic articles, from what I've seen I suspect these won't be useful at all.

      I suggested that we find some filtering criteria, above. I haven't yet come up with any, and Moonriddengirl quite rightly notes, above, that it might not be safe from a copyright perspective to even do that. Even the 1-paragraph stubs might be a mass copying exercise, from some source that we are unaware of. All of us who have reviewed the article set so far seem to have come to the same conclusion, that Darius Dhlomo simply doesn't write original prose, at all, anywhere, even if it's only a couple of sentences to make up a small paragraph. Pick a couple of hundred for yourself, check them for copyright violations, and see what conclusions you draw.

      If you find from doing so some triage criteria that actually work in practice, that would be good news, of course. ☺ Uncle G (talk) 16:29, 5 September 2010 (UTC)[reply]

Mass blanking of ten thousand articles by a 'bot

Technical discussion of blanking proposal, in /Archive 1, outcome reflected in current blanking plan (wp:Contributor_copyright_investigations/Darius_Dhlomo/Task_explanation). 67.119.14.196 (talk) 23:25, 17 September 2010 (UTC)[reply]
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Prod-like proposal

Archived. This basically suggested tagging all articles with a PROD-like template and deleting them in a week if copyvios weren't fixed. 67.119.14.196 (talk) 23:41, 17 September 2010 (UTC)[reply]

When voting is not evil

Archived proposal to vote on mass-deletion proposal, with votes allocated according to how many articles an editor reviews. Status: no mass deletion. 67.119.14.196 (talk) 05:58, 18 September 2010 (UTC) [reply]

Triage criteria

Let's talk a little bit about triage criteria. All these points are open to discussion, but initially for this purpose I'll assume:

  • "Triage" is a process designed to be carried out by a bot or script that examines (in some way) all 20,000+ articles that Darius has touched, and labels those that meet given criteria that we're trying to specify here. The script should run with no human intervention and not much human review of the final output. Of course we'd first run it on smaller sample sets and examine the results carefully to tune the criteria. Triage should divide all those articles into several possible categories, such as:
    • Articles needing no special attention (Darius's only edits didn't add any significant content)
    • Articles presumed copyvio and which should be deleted or blanked without additional attention (all significant content in the article came from Darius)
    • Articles presumed containing copyvio but which should get careful attention anyway (e.g. article contains significant content from both Darius and others)
    • Articles that the bot isn't sure how to classify, but that a human can probably tell with a quick look.

By "manual edits" I mean edits to an article made manually by human editors. "Human" is specified because edits by bots shouldn't count for this purpose. (A spot check of the Darius-created articles indicates that the majority of the edits in them are probably bot edits). Script-assisted human edits (routine maintenance scripts) mostly shouldn't count either. "Text" means any sequence of more than 5 consecutive words in the article body (not in category or interwiki tags). Here is a simple proposal for criteria and labels:

  • Articles that contain no text added by Darius (just tables with names and numbers) => no attention needed
  • Articles that contain text added by Darius and no text added by others => delete
  • Articles containing text added by both Darius and by others => if text is more than 80% Darius, then delete, else flag for attention
  • Articles with 2 or more manual edits from editors other than Darius => these may be of interest, flag these in a sample set and study for further ideas.

Any thoughts?

67.122.211.178 (talk) 19:11, 6 September 2010 (UTC)[reply]
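For concreteness, the classification rules above could be sketched roughly as follows. This is only an illustration, not actual bot code: the function name and the word-count inputs are hypothetical, since the real work (deciding who added which text, and which edits were bot edits) would have to come from the CCI tooling.

```python
# Hypothetical sketch of the proposed triage labels. "Words" here means
# running text as defined above: sequences of more than 5 consecutive
# words in the article body, ignoring category and interwiki tags.

def classify(darius_words: int, other_words: int) -> str:
    """Label an article per the proposed criteria.

    darius_words / other_words: words of running text added by Darius
    and by other human editors respectively (bot and script-assisted
    edits already excluded upstream).
    """
    if darius_words == 0:
        return "no attention needed"   # just tables with names/numbers
    if other_words == 0:
        return "delete"                # all significant text is his
    total = darius_words + other_words
    if darius_words / total > 0.80:
        return "delete"                # more than 80% Darius text
    return "flag for attention"        # mixed authorship, human review
```

For example, `classify(900, 100)` returns `"delete"` (90% Darius text), while `classify(400, 400)` returns `"flag for attention"`.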

I like it, but is it technically feasible to separate these articles? If so, go for it. fetch·comms 19:54, 6 September 2010 (UTC)[reply]
Yeah, most of the above should be doable. I may try coding something later today. 67.122.211.178 (talk) 20:17, 6 September 2010 (UTC)[reply]
This sounds really interesting; I'll look forward to seeing what you can come up with. I would, though, want to find some way to exclude those articles that have been cleared already through the CCI. We've got some really good volunteer work going on there. :) --Moonriddengirl (talk) 23:16, 6 September 2010 (UTC)[reply]
For the moment, I'm using User:CorenSearchBot/manual as a double-check; it's a bit buggy and not very reliable (I caught two vios it missed), but might help some users. It definitely needs to be reviewed by a human, but for articles that are already two-sentence stubs, there's really no way it can be a cv if the bot passes the second sentence (as the first is usually changed to be MOS-compliant). If anyone can find a better cv-checker process, however, please tell everyone. The Earwig's tool only does one page at a time and he told me that he doesn't have time to let it process multiple ones at once. fetch·comms 00:20, 7 September 2010 (UTC)[reply]
I thought we'd already concluded that a non-finding from the searchbot doesn't really tell us much. We also don't know about possible copying from materials like printed almanacs and magazines that have never been online and so won't show up in any search engines. We're left treating all Darius text additions as vios whether we locate a source or not. Are we still going by that approach? Are there enough different views on this that we should open a discussion section about it? 67.122.211.178 (talk) 01:13, 7 September 2010 (UTC)[reply]
There are a lot of stubs he made that can't be vios. I mean, "X (born [date] in [place]) was a [occupation]" is a standard first sentence; if that was a copyvio, it'd be pure coincidence. In many of the vios I saw, he creates a page (most back in 2006), then comes back in 2008 to add the cv in. Now, this isn't the case for every article, as he also creates cvs at the beginning, but deleting everything he created doesn't help. I also have seen several instances of what he copied being changed over the years so that it is no longer a cv at this point. The only way to do this right is not to take the easy way out and delete all the pages, but rather to separate out the more-likely copyvios (excluding category/template/table-only changes and going through the rest). For the issue of print sources, it may be better to stubbify articles to which he has added more than a couple of sentences but which do not appear to have been taken from Internet sources. fetch·comms 02:07, 7 September 2010 (UTC)[reply]
How do we even know that the people in those articles exist? See trap street. If Darius got a sports almanac and entered info about some fictitious player whom the almanac writers invented, that's a copyvio even if no words were copied. I just can't work up much motivation to try to retain articles that nobody other than Darius contributed to. Most are quite unencyclopedic, sort of a phone book about athletic events. 67.122.211.178 (talk) 04:18, 7 September 2010 (UTC)[reply]
I'm not much of a deletionist/inclusionist kind of guy, but I don't think that it's fair to just delete all these potentially useful articles of notable people (to some degree). The issue of fictitious persons in his sources can't really be helped, I guess; we could basically delete every article without a source under that premise. I know that going through the list manually is not desirable, but I'd rather check everything and salvage what we can. fetch·comms 04:24, 7 September 2010 (UTC)[reply]
We use an WP:AGF approach to normal contributions, but in Darius's case there's such a rampant pattern of vios that we may be better off treating every one of his contributions as tainted. I asked on his talkpage if he copied from any print sources, but he hasn't responded yet. The articles that don't contain (presumably copied) text inserted by Darius are mostly uninformative stubs, not all that useful compared to just using a search engine. We do in fact now have a policy (being implemented in stages that are still under way) of deleting all unsourced BLP's. 67.122.211.178 (talk) 06:03, 7 September 2010 (UTC)[reply]

Nuclear option

Above I suggested blanking articles and instigating a mass checking effort. But at this point I reckon triage here should involve excluding edits which cannot be copyvios, particularly by virtue of being too small, or merely changing categories etc. Everything else should be presumed copyvio of one form or another, possibly from print sources (and therefore hard to impossible to identify). All the evidence (and Darius' inability so far to point to things which are not copyvios is damning) suggests to me that Darius simply does not write substantive prose. My feeling is he's one of those people who (possibly English not first language?) isn't confident writing, and so virtually always copies with some minor modification. This would explain his ignoring the warnings - he felt simply unable to contribute without doing it in this copyright-violating way. In logical consequence, all prose he's ever written which remains in articles should be deleted as being a copyvio of something or other. This seems to be a situation where it is unreasonable to say "let's see which of these are copyvios, and remove them if proven"; instead, "nuke the entire site from orbit. It's the only way to be sure." I'm not even basing my view on the amount of work involved in checking: I'm basing it on the unacceptable likelihood of large numbers of copyvios not being identified, and so retained - especially if the checking is done by people not familiar with copyright checking. So i) bot-blank and tag all affected articles, excluding whatever can be excluded; ii) the tag requires all Darius prose to be removed for the article to be restored, which has the tremendous advantage of simplicity; iii) allow a long time to handle blanked articles, say a year, then delete any that are left. Rd232 talk 12:00, 7 September 2010 (UTC)[reply]

I'm beginning to find the sprawling proposals very hard to follow here. :D I agree with you that what cannot be copyvio should be excluded. We have traditionally excluded edits below 100b as likely to be de minimis at worst. It's not perfect, but it's workable. (Workable matters. This is just one CCI. We have several dozen still open, some of which are over a year old.) In this case, I agree that we need to consider that all creative content added by this user is a copyright problem that needs to be removed. I like the tag modification as made by Uncle G (see section above): User:Moonriddengirl/CCIdf. The combination of that tag, Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation and Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help would invite all interested contributors to help out. The only thing we might wish to reconsider is how they can recognize problems. Given that there is a risk of offline sourcing (so far, none found, but some of the sources that have been found didn't show up in a Google search engine), we should presumptively remove or rewrite all of his creative content. Perhaps those who want to help out should be invited just to identify what creative text he added and remove or rewrite it. --Moonriddengirl (talk) 12:23, 7 September 2010 (UTC)[reply]
My bad. I guess what I'm really doing here is arguing that the tag directions to editors finding the blanked page should not tell people to check whether there is a copyvio, but simply to remove (or rewrite) any Darius prose, because any such prose almost certainly is one. And it's an error-prone waste of time trying to prove a negative. Beyond that, this is basically the "Mass blanking of ten thousand articles" proposal. Rd232 talk 12:57, 7 September 2010 (UTC)[reply]
I agree. I think these two proposals merge well together. --Moonriddengirl (talk) 13:08, 7 September 2010 (UTC)[reply]
A lot of what Darius wrote were two-line stubs that probably aren't vios (and if they were, it'd only be the second sentence because the first would have had to be changed to match the MOS) as he uses fairly plain wording everywhere: "X is an [sport] player. Xe won the [medal] with Y country in the Z Olympics. Xyr personal record was [time] at [place].", etc. So, I agree that everything he wrote has the potential to be a vio, but for many of the stubs, just doing a 30-second reword of the only possibly offending sentence or two should be enough to remove any lingering doubt. Until we can run through all those, just keeping it blanked ought to do fine. As long as a query can eliminate the minor diffs (categories, templates, tables, etc. changed only), then we have a lot less to worry about. fetch·comms 00:11, 8 September 2010 (UTC)[reply]

Simple proposal

This has three parts:

  1. Skip the articles where he changes less than 200b. He's either adding a row to tables, adding categories, or adding templates.
  2. Everyone stop worrying about this AN/I thread and go check some articles. If there are 23,000 articles on the list, 100 users can go through 230 articles each and that will be that. 230 is not a lot, considering I went through 20 in about three minutes last night, which were ones where about 100kb was changed. If we just skip those, and start at the end, we can eliminate a good portion of the articles as cv-free, and deal with the likely cv ones.
  3. Blank all articles he created, and make a separate list of those to which he added more than 1,000b, which need individual examination.

Otherwise, the triage idea above seems good. fetch·comms 19:54, 6 September 2010 (UTC)[reply]
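A minimal sketch of how the three parts above might bucket the list, assuming we already have, per article, a byte total for Darius's changes and a created-by-Darius flag. Both inputs and the function are hypothetical; no such script exists in this thread.

```python
# Rough sketch of the three-part proposal. Each entry is assumed to be
# (title, bytes_added_by_darius, created_by_darius).

def bucket(entries, skip_below=200, review_above=1000):
    skip, blank, review = [], [], []
    for title, bytes_added, created in entries:
        if bytes_added < skip_below:
            skip.append(title)       # part 1: table rows, categories, templates
        elif created:
            blank.append(title)      # part 3: blank all articles he created
        if bytes_added > review_above:
            review.append(title)     # part 3: list for individual examination
    return skip, blank, review
```

Note an article he created with a large addition lands in both the blank and the review lists, which matches the proposal: blank it now, examine it individually later.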

200kb is the size of the entire ANI page. Do you mean 200 bytes? I think that is too many (1 word = approx 6 bytes). 67.122.211.178 (talk) 20:16, 6 September 2010 (UTC)[reply]
My bad. I meant bytes. Fixed accordingly. 200b is around adding a few categories, a medal chart template, or couple of rows to a table. fetch·comms 20:47, 6 September 2010 (UTC)[reply]
1 word = 6 bytes so 200b = 30+ words, which can be a pasted sentence. I would not want to accept anything that had more than 4 or 5 consecutive words added, where "consecutive" means e.g. not in separate table cells. 67.122.211.178 (talk) 22:09, 6 September 2010 (UTC)[reply]
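To illustrate the "consecutive words" idea, here is a rough sketch. The choice of break characters (table markup, newlines) and the threshold are just this comment's suggestion rendered as code, not an existing tool.

```python
import re

# Flag an addition if any run of text not broken by table markup ("|",
# "!") or a newline contains more than `limit` consecutive words.
# Separate table cells therefore never combine into one run.

def has_long_run(added_text: str, limit: int = 5) -> bool:
    cells = re.split(r"[\n|!]+", added_text)
    return any(len(cell.split()) > limit for cell in cells)
```

Under this rule, a table row like `1964 | Tokyo | 400 m | 45.1` passes, while a ten-word pasted sentence is flagged.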
Articles to which he has added below 100b are excluded from the listing as minor. If we add them back in, the number of articles we must check jumps from 23,197 to 41,108. --Moonriddengirl (talk) 22:51, 6 September 2010 (UTC)[reply]
What software are you using to find that? I've started fooling around with some code, but might be duplicating existing effort. 67.122.211.178 (talk) 04:10, 7 September 2010 (UTC)[reply]
It's our CCI tool; you can read about it and access it here. I'm afraid I know zilch about how it works...just that it does. :) --Moonriddengirl (talk) 10:40, 7 September 2010 (UTC)[reply]

Should we move this discussion?

Subpaged.

I begin to think we should move this to a newly created project page or set of pages (some such pages have already been created), or an RFC. Otherwise it is going to swamp ANI pretty soon. Part of the discussion should be about technical aspects of proposed bots and scripts, that would be too far in the weeds to clutter ANI with. 67.122.211.178 (talk) 22:48, 6 September 2010 (UTC)[reply]

Yes, it's already 1/16th of the page.--intelati(Call) 22:50, 6 September 2010 (UTC)[reply]
  • (edit conflict) Usually a subpage of ANI. From the instructions: "When moving long threads to a subpage, add a link to the subpage and sign without a timestamp: 'Moonriddengirl (talk)'; this prevents premature archiving. Move to Wikipedia:Administrators' noticeboard/Incidents/[concise title]." The title here is long and not that descriptive, and I wouldn't want to use the contributor's name as it could be a real name. How about Wikipedia:Administrators' noticeboard/Incidents/CCI? We don't have anything at that title yet. We leave a summary here with ~~~ to prevent early archiving and add {{unresolved}}. --Moonriddengirl (talk) 23:03, 6 September 2010 (UTC)[reply]

Breathe deeply

I'm somewhat aghast that the nuclear option, blowing up 10,000 articles by bot, is being discussed so cavalierly. Certainly, this is the worst, most extreme option — to be avoided unless absolutely necessary. Having reviewed Darius' talk page — both the current and past versions — I am struck by the fact that he does admit having made serious mistakes but has contended that the number of flagrant copyright violations is relatively small and that he has offered to help find them and liquidate them.

I wonder why the CorenWhatchamacallit Copyright Bot isn't run over each and every article to which Darius has contributed, to flag copyright vios? That's what alerted us to the problem to begin with, did it not? Let the bot check for violations — subject everything to review.

I think the punishment meted out to this editor should be severe, but I don't see why the most draconian corrective measure is being discussed before all corrective options have been exhausted. Bot-check the works and blow up everything that comes back positive for copyright violations... Carrite (talk) 06:12, 7 September 2010 (UTC)[reply]

There seems to have been earlier discussion concluding that bot-checking doesn't help that much and we have to assume it's all tainted. If you are suggesting we rethink that notion, it's probably best to open a new section of this page for thoughts and comments. I personally don't feel very attached to articles with no substantial content contributions from anyone other than Darius. If those articles were so important, other people would have edited them too. I agree about not blowing up the ones with contributions from others. 67.122.211.178 (talk) 07:19, 7 September 2010 (UTC)[reply]
This is not in any way, shape or form, punishment: this is dealing with copyright violation on an unusually vast and barely manageable scale. There is a proposal to blank affected articles with an explanatory note (with variations on the theme), which deals with the problem immediately, allowing anyone interested in the article to deal with the problem. If anything, I have a concern that spreading the copyvio checking so widely risks too many trickier cases being missed by people not normally involved in checking such things. Good instructions will mitigate that, but it's still a worry. Rd232 talk 09:05, 7 September 2010 (UTC)[reply]
Carrite, the thing is, 10,000 articles built upon a copyvio aren't ours to take and publish. They aren't even ours to modify: even if the text has been edited out later on to the point there is nothing left of the original, we have created an unauthorized derivative work. Those are not our articles, they're effectively someone else's, and we have no claim on them. There's nothing cavalier about that. MLauba (Talk) 09:06, 7 September 2010 (UTC)[reply]
I don't think the "10,000" number is anywhere close to accurate, unless Darius is lying through his teeth. Do we know that the problem is actually this vast? Nor do I have any problem whatsoever blowing up any article found with copyright violations. The question is this: how big is this problem, really? I would suggest that the punishment help mitigate the crime, that for the next six months Darius be limited to editing, with a new account, articles which he created and only articles which he created... With a view to eliminating copyright violations. His work can silently be checked "over his shoulder." At the end of that period, extreme scrutiny should be applied to remaining articles to see if the problem has been fixed or not, and the community can proceed from there based upon findings made at that time. Darius' previous account name should be locked out and a new account name initiated, with edits starting again from zero and no autoreviewed status for multiple years, in my estimation... Current thinking seems to be obsessed with making the problem instantly go away by mass deletion of the good, the bad, and the ugly via automation in one fell swoop. My suggestion is that the culprit be instructed to get to work for half a year fixing his own mess. Carrite (talk) 11:28, 7 September 2010 (UTC)[reply]
With all due respect, copyright violation is not like spelling mistakes; it's not something to be fixed when we get round to it. And you seem not to have heard me when I said this was not about punishment. Finally, there appears to be a consensus that the problem affects an enormous proportion of Darius' substantive prose edits. This is in no way, shape or form a minor issue. Rd232 talk 11:41, 7 September 2010 (UTC)[reply]
I'm not saying it's a minor issue and I'm not saying it shouldn't be immediately addressed. And I did hear you when you said this was not about punishment — and I argue that it should be about punishment, with the punishment being the immediate fixing of his mess by the culprit, bearing in mind that Rome wasn't built in a day and that it will take time to ferret out everything... Further, I challenge the assertion that any consensus can be drawn about the scope of the problem until it is systematically studied. See the random sampling below. Expand that process, let's look at this problem scientifically before we go nuclear on it. Carrite (talk) 12:04, 7 September 2010 (UTC)[reply]
Have you looked at the actual copyright investigation subpages? Beginning with Wikipedia:Contributor copyright investigations/Darius Dhlomo and moving through (there's a sidebar above that links to them all), articles that have been checked and cleared are marked Red X, while articles wherein copyright problems have been found are marked Green tick. The listing of five articles below is a significantly smaller sample than has already been evaluated. --Moonriddengirl (talk) 12:08, 7 September 2010 (UTC)[reply]
I quickly count 32 green checks and 159 red Xs = 16.75% violation rate. Carrite (talk) 12:19, 7 September 2010 (UTC)[reply]
That would be a much higher outcome than the hundreds I've predicted, but it's possible that contributors are zeroing in on more problematic areas. :/ --Moonriddengirl (talk) 12:28, 7 September 2010 (UTC)[reply]
On page 2 I figured out how to let the browser find function do the counting and came up with 7 violations and 100 clean pages = 6.5% violation rate; total now 39/298 = 13.08% violation rate. Carrite (talk) 12:33, 7 September 2010 (UTC)[reply]
On page 3 it's 5 bad, 98 good = 4.85% violation rate. I need to go back and check my first count mechanically and redo the arithmetic... It looks like a violation rate of under 10%... Carrite (talk) 12:38, 7 September 2010 (UTC)[reply]
% violation rate is meaningless as a proportion of all edits - the problem only applies to substantial prose edits. Most edits (including my sampling on page 3) are not substantial prose; they're adding infoboxes, categories, basic data and the like. Rd232 talk 12:53, 7 September 2010 (UTC)[reply]


Darius has no credibility on this issue; he assured us he had done this in no more than 15 articles. We had more than doubled that count before Uncle G stopped counting. Since his block at his talk page, Darius told us that Fabián Roncero was fine; it isn't. He told us that Núria Camón is fine; it isn't. How are we supposed to trust him to identify his copyright violations, much less acknowledge them and address them? Even though I believe that there will probably be hundreds rather than thousands of articles that are a copyright problem by the time the investigation is done, there are still tens of thousands of articles that need review. Having somebody silently check over his shoulder only works when we know that he (a) can and (b) will accurately assist. --Moonriddengirl (talk) 11:36, 7 September 2010 (UTC)[reply]


(unindenting) I did a recount, manually counting numbers over 100 since the find feature only counts that high on Safari. I found:

  • Page 1 — 37 violations, 168 clean pages
  • Page 2 — 7 violations, 212 clean pages
  • Page 3 — 5 violations, 98 clean pages
  • Page 4 — 0 violations, 12 clean pages = 49/539 = 9.09% violation rate

Most of the violations were on the first page. I'm not sure whether these listings are chronological, whether those articles were being critiqued more harshly, or what else was going on. The violation rate for the first page seems an anomaly compared to the next two. What is clear is that we are probably not talking about "10,000" copyright violation articles here, but some substantially lower number in the general range of 5-8% of Darius' total contributed pages. Carrite (talk) 13:05, 7 September 2010 (UTC)[reply]
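For what it's worth, the arithmetic on the counts in the list above checks out:

```python
# Recomputing the per-page tallies quoted above: (violations, clean pages).
pages = {1: (37, 168), 2: (7, 212), 3: (5, 98), 4: (0, 12)}

vios = sum(v for v, _ in pages.values())       # 37 + 7 + 5 + 0 = 49
total = sum(v + c for v, c in pages.values())  # 49 violations + 490 clean = 539
rate = 100 * vios / total

print(f"{vios}/{total} = {rate:.2f}%")  # 49/539 = 9.09%
```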

So, basically, you're saying that you believe my estimated hundreds is low? You may be right. --Moonriddengirl (talk) 13:11, 7 September 2010 (UTC)[reply]
There are 13,542 articles in the queue... If you'll accept my premise that the 9.09% rate is somewhat inflated for some unknown reason (learning curve of the editor or more harsh judgment of inspectors) and that the actual rate falls in the 5-8% range, we are talking about between 677 and 1,083 articles with substantial issues, give or take. "Hundreds" is accurate. Carrite (talk) 13:18, 7 September 2010 (UTC)[reply]
I believe that User:VernoWhitney has indicated that the numbering in the queue is inaccurate. We have had some difficulties with putting together the listing because of its scope and the fact that initially we tried to isolate only articles he had created. There are over 23,000 articles listed by our CCI program excluding reverts and minor edits, which makes 5% something in the order of 1,150. --Moonriddengirl (talk) 13:23, 7 September 2010 (UTC)[reply]
(Oh, I just noted that above you mentioned not being sure if these things are chronological: they are not. They're listed by size of total contributions, beginning with greater. I would expect more problems in the front end. That's usually the way it goes. --Moonriddengirl (talk) 13:25, 7 September 2010 (UTC))[reply]
There are 23,197 total articles, pages 1-10 are articles they created and then the numbering (and order by size) restarts on page 11 for articles they didn't create and just edited. VernoWhitney (talk) 13:36, 7 September 2010 (UTC)[reply]
Thanks. :) More unusual than I knew! --Moonriddengirl (talk) 13:39, 7 September 2010 (UTC)[reply]

(unindenting) Okay, so we're seeing a much higher copyright violation rate with long contributions vs. stubs — is that a fair summary? Numbers 1-1000 in size, maybe something like 1 out of 5 or 6 of those are defective, whereas the copyright violation incidence rate falls to what might be considered "normal" levels with shorter contributions (has the question of copyright violation across random WP articles ever been studied? Four or five percent of articles having "problems" would be a pretty reasonable guess, I'd think...).

Anyway, what seems to be needed is a high-priority manual check of the top 1,000 or so original articles as well as the top 1,000 or so content contributions to already-established articles, with maybe some sort of cursory bot-checking of the remaining short and stub articles. Is that a reasonable perspective? Carrite (talk) 16:18, 7 September 2010 (UTC)[reply]

Yes, and quite common for CCIs. I'm not sure what you mean by "normal" levels, though. I don't know if anybody's ever done a random copyvio study of Wikipedia articles. I would kind of hope not, as I'd rather see anybody with that kind of time on their hands trying to help clean them up. :) Again, this is one of dozens of CCIs. We've got more articles than I want to count waiting for review. I don't know that it's reasonable to limit review to 2,000 out of the 23,000+ articles that he's done non-minor edits to. By "top" I assume you mean contribution length: a lot depends on the pattern of the CCI subject. He's got a lot of table and list contribs. Those are high in volume, but low in risk. A single paragraph of creative text from him would worry me a whole lot more than his most prolific contrib, 1982 in athletics (track and field) (already cleared). I see that copied content has already been found in article #9514 of the articles he's created. Limiting our checks to the top 1,000 of his articles would stop well short of that. If a bot didn't detect it, the copied content would remain. --Moonriddengirl (talk) 17:25, 7 September 2010 (UTC)[reply]

In-depth study of small random sample[edit]

Sample size[edit]

My back-of-the-envelope estimate, based upon the number of copyright violations found versus the number of articles that I looked at, was that around 10% of articles, just over a thousand, will turn out to be copyright violations. A sample size of five isn't nearly enough. Have ten articles to look at. (Even that's not enough.) Uncle G (talk) 15:57, 7 September 2010 (UTC)[reply]

Were those selected uniformly at random from the whole set of Darius-created articles? If yes, it's surprising that they're all biographies. Anyway the point of selecting the 5 articles wasn't to find vios per se, but to examine them carefully to see if anything could be learned about Darius's methods. So I'd rather keep examining the original 5 for a while before expanding the set. 75.57.241.73 (talk) 18:23, 7 September 2010 (UTC)[reply]
From your 10-article sample below, it looks like five have had copying detected so far, giving a 50% rather than 10% copying rate. In my 5-article set above there are two with copying detected and at least one I'd consider suspicious, so again looking at around 50%. I think the large set getting 10% is not being examined very closely. 75.62.3.153 (talk) 02:30, 11 September 2010 (UTC)[reply]
Steve Spence
  • Green tickY later additions were cleaned two days ago but the very same text introduced at article creation time appears on official bios. Blanked and listed at WP:CP for now. MLauba (Talk) 16:20, 7 September 2010 (UTC)[reply]
Leonard Nitz
  • Green tickY cv of [12]. All the copied content was added word-for-word by Darius as his second edit; the original stub seems to have gotten all the info from the cv source. I have stubbified the article for now, as the original stuff doesn't seem to be a vio. Darius was clearly getting his info from a source, writing a quick stub, and pasting in the rest a few days later, without listing the offending link as a source. fetch·comms 00:59, 8 September 2010 (UTC)[reply]
Steffen Radochla
  • Red XN "turned professional in 2001" is a bit boilerplate-sounding, but it's also a common wording. One-line stub with a short list; no cv found on Google. fetch·comms 12:51, 8 September 2010 (UTC)[reply]
Mark Gorski
Gerrit de Vries (cyclist)
  • Red XN Seems OK to me, although the wording is basically like the other stubs. fetch·comms 02:22, 9 September 2010 (UTC)[reply]
Lauren Hewitt
Japhet Kosgei
Lee Naylor (athlete)
  • Green tickY More from ABC: [15]. Has been modified by others, but still fundamentally the same wording/structure. fetch·comms 02:26, 9 September 2010 (UTC)[reply]
George Mofokeng (athlete)
  • ? Interesting wording. Could not locate source. fetch·comms 02:47, 9 September 2010 (UTC)[reply]
    • I devoted about 90 minutes to searching for a source for the prose here. No luck. However, given that this article's history matches Dhlomo's pattern of starting a 1 sentence stub, and then a short time later copying in text from a source, (and also because of the specific nature of the prose) I strongly suspect CV. The source may be a web page that has since been taken down, or something in print. Revcasy (talk) 12:38, 13 September 2010 (UTC)[reply]
Gert Thys
  • ? This has a close copy of the wording in the article, but Darius created it in 2007, while that link was published in August 2009. Some close wording in the last sentence to [16], but that was part of the text of a ruling. I removed most of the text as unsourced information in a BLP anyway. fetch·comms 02:38, 9 September 2010 (UTC)[reply]
    • Now this brings to mind another case. Dhlomo has in my eyes been a very problematic user with regard to writing tables instead of prose, more specifically replacing prose with tables. It might have nothing to do with this case, except for the unwillingness to listen to talk page comments in the past. Geschichte (talk) 23:44, 10 September 2010 (UTC)[reply]

Technical question[edit]

Couple of things. One: has any of the triage stuff been listed (i.e., eliminating edits where he only touched categories, etc.)? Two: can someone make a quick list of all articles he made (just page 5 of the CCI for now) where he made at least one edit after initial creation that added more than 500b to an article (ignoring category-only edits, etc., preferably)? This may help to see if he often created a stub first, then pasted in a couple of paragraphs later. If this is not technically feasible, I'll just keep looking manually. fetch·comms 01:18, 8 September 2010 (UTC)[reply]

1) slightly complicated but doable, I've just been juggling other things. 2) easier, I'll see if I can bang it out. 75.57.241.73 (talk) 02:19, 8 September 2010 (UTC)[reply]
It looks like the CCI report already has this info (the additions are broken out as separate edits). What is it that you're asking for that's not already there? Note the threshold is probably more like 100b than 500b. Even 3-word snippets like "retired female swimmer" are enough to pick up some vios. 75.57.241.73 (talk) 02:24, 8 September 2010 (UTC)[reply]
I want a list of the pages where at least one of the later additions was over 500b (removing the lesser ones). I know it's not perfect; just searching for a possible pattern, seeing how much he may have lifted at once, and then narrow it down from there. If possible, just run through the existing CCI page and get a list of the articles with a diff that says more than 500. fetch·comms 02:47, 8 September 2010 (UTC)[reply]
Lifting a three word phrase like "retired female swimmer" isn't technically a copyright violation, although it might be a pointer to a section of lifted text. Carrite (talk) 02:29, 8 September 2010 (UTC)[reply]
Right, finding the same phrase in dozens of articles can point to a common source. 75.57.241.73 (talk) 02:30, 8 September 2010 (UTC)[reply]
Fetchcomms, it looks like articles with that pattern (contribution > 500b on other than the first edit) on page 5 are easy to spot by eyeballing. Do you want something more than that? 75.57.241.73 (talk) 03:06, 8 September 2010 (UTC)[reply]
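For anyone wanting to script rather than eyeball that pattern, here is a minimal Python sketch of the >500b filter under discussion. The listing format below is made up purely for illustration; the real CCI report layout differs, so the regexes would need adjusting.

```python
import re

# Hypothetical, simplified CCI-listing format (illustration only):
# one article per line, byte count of each listed addition in (+N)
# form, with the creation edit first.
SAMPLE = """\
* [[Sammy Korir]] (+1200)(+80)
* [[Joetta Clark]] (+90)
* [[Canyon Ceman]] (+300)(+650)
"""

def articles_over_threshold(listing, threshold=500):
    """Titles where some addition *after* the creation edit exceeds threshold bytes."""
    hits = []
    for line in listing.splitlines():
        title = re.search(r"\[\[([^\]]+)\]\]", line)
        sizes = [int(n) for n in re.findall(r"\(\+(\d+)\)", line)]
        # sizes[0] is the creation edit; only later additions matter here
        if title and any(s > threshold for s in sizes[1:]):
            hits.append(title.group(1))
    return hits

print(articles_over_threshold(SAMPLE))  # ['Canyon Ceman']
```

Only the third sample article has a post-creation addition over the threshold, matching the "stub first, paste later" pattern being looked for.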
Meh, alright. I'm probably just getting lazy :P. Working on them now... until I sleep. fetch·comms 03:18, 8 September 2010 (UTC)[reply]
Source to watch out for: I found two copyvios so far (he slightly changed it by merging some sentences and reordering them) from http://www.hockey.org.au/index.php[17] and [18]. Both were in the original creation, it seems, so for field hockey articles, that seems like a source he may have used repeatedly. Is there a way to list all the articles he created that are part of Category:Australian field hockey players, so we can check against this site? fetch·comms 03:42, 8 September 2010 (UTC)[reply]
Yeah, give me a few minutes. 75.57.241.73 (talk) 04:10, 8 September 2010 (UTC)[reply]
They are below, feel free to uncollapse the list or move it. I can add links to the 1st rev of each article if that is useful. 75.57.241.73 (talk) 05:04, 8 September 2010 (UTC)[reply]

Fetchcomms, please read Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help. This has already been written down. Uncle G (talk) 11:09, 8 September 2010 (UTC)[reply]

  • I assume you mean his strategy (I didn't find any links there). That page is very helpful, though hopefully we can add more info as we progress. fetch·comms 12:50, 8 September 2010 (UTC)[reply]
    • Yes, the strategy. I've already invited everyone to be bold in adding to and improving that page. If you want a list of the created articles, simply go back in the edit history of the first two CCI list pages to before the list was sorted and revamped.

      I've reviewed a few hundred of the biographies. The common creation strategy was for the whole text to be in the first edit. But sometimes there are a few later edits fixing copy and paste errors. As I wrote above, a productive triage approach is to first go back to the latest revision by Darius Dhlomo before someone else (aside from a 'bot) touched the article, and read that. Check the foundation content first, in other words. Uncle G (talk) 13:18, 8 September 2010 (UTC)[reply]

More stratagems that you can incorporate into Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help (be bold!):

  • Articles that cite the Beach Volleyball Database have uniformly proven to be taken from the biographies there. The BVB pages aren't datestamped, unfortunately. But Moonriddengirl and I did some checking with the Wayback Machine to check the relative dates.
  • If an article cites a "profile" somewhere, it's quite productive to check that profile first. Unfortunately, some profiles pointed to have been removed from the WWW in the intervening years. The Wayback Machine is of some help, here.
  • As discussed above, if an article cites an IAAF profile, there's a likelihood that the prose came from another article somewhere else on the IAAF site. (I originally skipped a lot of articles that cited IAAF profiles, because I wasn't aware of the other pages.)

Uncle G (talk) 13:32, 8 September 2010 (UTC)[reply]

Field hockey players[edit]

Articles from Category:Australian field hockey players created by Darius, per Fetchcomms' request.

collapsed list of 92 players

75.57.241.73 (talk) 04:40, 8 September 2010 (UTC)[reply]

Temporary list of known used sources for Australian field hockey players

I'm just using this for myself right now, but a central list of sources he copies from frequently could be useful in identifying some vios, as Google does not show all of these on the first page or two. fetch·comms 15:44, 8 September 2010 (UTC)[reply]

  • This also indicates usage of the hockey.org.au site before they reorganized and changed links. fetch·comms 15:49, 8 September 2010 (UTC)[reply]

Have we alerted the appropriate Wikiprojects and enlisted their help?[edit]

As it stands, a casual survey of the articles in question seems to limit the fields to mostly athletics and specific sports within them. Have the appropriate Wikiprojects been contacted and enlisted to try to help? I could see the bot option being suggested below as very effective if there are dedicated members of the affected projects getting involved to help clean up stuff, using a coordination page to drop admin help requests when needed. --MASEM (t) 16:03, 8 September 2010 (UTC)[reply]

Notice of the CCI has been given to WikiProject Athletics and WikiProject Olympics. Some projects are very responsive to these. This CCI's been a little unusual; I don't know if the people who've expressed interest in helping have been pointed to this discussion. It seems like it would be good to link to this discussion from the CCI page; I'll do that now. --Moonriddengirl (talk) 16:09, 8 September 2010 (UTC)[reply]

Implementing bot?[edit]

Executive summary for people coming here from the watchlist notice:

  • 'bot task: Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation
  • instructions: Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help
  • blanking notice for articles: Wikipedia:Contributor copyright investigations/Darius Dhlomo/Notice
  • 'bot technical discussion: Wikipedia:Bots/Requests for approval/Uncle G's major work 'bot
  • discussion of the task: here on this very page
  • articles for this (first) pass: 1–5000, 5001–9664
  • example 'bot edits: here, here
  • watch what happens to the first-pass articles: Special:RecentChangesLinked/User:The-Pope/DDCCI_list

At this point, I propose that we go ahead with the following, based on discussions above:

The advantages of this: we pull the content from publication immediately, and we invite the wider community to help with cleanup. This could be the most efficient means of addressing a CCI ever, and it may not linger for more than a year as some of our others have done. There is a substantial risk that some of these tags will simply be removed by users who don't care about copyright. I see this routinely at WP:CP. We try to address this at WP:CCI by requiring that only those who have themselves had no copyright issues assist, but this isn't foolproof.

Still to be determined: what then? At what point do we go through the ones still tagged?

Thoughts? --Moonriddengirl (talk) 12:55, 8 September 2010 (UTC)[reply]

Go for it. After that, just keep checking manually, I guess. fetch·comms 13:25, 8 September 2010 (UTC)[reply]
  • I realize I'm in the minority by now, so I have to say this--not to get into further debate about it but just to indicate that there are still some of us who feel this way--but I still favor the mass deletion approach over any of these schemes for sucking up massive amounts of community effort cleaning up Darius's mess (plus exposing everyone who touches any of those articles to potential legal liability). The articles aren't for the most part really articles at all. They're more like database dumps that Darius vacuumed from various places into WP article space. None of them are written from secondary sources as our articles are supposed to be. Yes I know some of them are about legitimately notable people. I just don't feel any sense of tragedy that there might exist a notable person someplace in the world who is temporarily not the subject of a WP article, at least til somebody else gets around to writing a real one with real sourcing.
  • That said, I wonder if we could deploy some additional automation to help with cleanup. Is there some kind of script around that integrates all the diffs from when an article was created, in order to highlight all the text in the last revision, that was originally put in by a particular editor (i.e. Darius)? I can probably write one, but it would surprise me if it hasn't been done already. On the other hand it wouldn't be perfectly accurate. (Update: someone at refdesk mentions User:Cacycle/wikEdDiff which is not what I had in mind, but looks interesting anyway).
  • Do you want any additional filtering or processing of the 23,000 articles? Like maybe for articles created by people other than Darius, instead of blanking the whole article, the bot could revert to just before Darius's first edit to it. We'd write a different template for articles that got reverted but not blanked. Also I can still attempt some of the triage stuff discussed above, like noticing category-only edits with a script (I just have RL things to do as well). 75.57.241.73 (talk) 13:37, 8 September 2010 (UTC)[reply]
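On the diff-integration idea mentioned above: a rough Python sketch of the approach, run here on hand-made sample revisions. A real tool would pull revision texts and editor names from the MediaWiki API, and line-based attribution is only approximate, as already noted.

```python
import difflib

# Illustrative sample only: revisions as (editor, full_text) pairs, oldest first.
REVISIONS = [
    ("OtherUser", "Born 1970.\n"),
    ("DariusDhlomo", "Born 1970.\nHe won the 1995 marathon.\n"),
    ("OtherUser", "Born 1970.\nHe won the 1995 marathon.\nRetired 2001.\n"),
]

def lines_introduced_by(revisions, editor):
    """Lines first added by `editor` that still survive in the last revision."""
    introduced = set()
    prev = ""
    for who, text in revisions:
        diff = difflib.ndiff(prev.splitlines(), text.splitlines())
        added = {line[2:] for line in diff if line.startswith("+ ")}
        if who == editor:
            introduced |= added
        prev = text
    final = set(revisions[-1][1].splitlines())
    return sorted(introduced & final)

print(lines_introduced_by(REVISIONS, "DariusDhlomo"))
# ['He won the 1995 marathon.']
```

Text flagged this way could then be highlighted in the current revision for manual checking against likely sources.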
  • I completely understand favoring the mass deletion approach. I was leaning that way myself when we started. But so many people have been putting their time into this already, and I do not wish to devalue their efforts. Too, this approach has some exciting prospects for future cases like this. Getting assistance at CCI is a challenge; most of them involve thousands of articles, though few are this scale. If we find that this approach actually works, then it may be useful for other similar CCIs down the road...a way to encourage involvement from those members of the community who actually do view these articles. If this leads to finding a new, viable system for these, we might not have dozens of open CCIs with probably hundreds of thousands of articles cumulatively waiting for review. (oi)
  • I have no idea what automation can do. I'm technologically in the school of "challenged by using my remote control." I don't know of any script that integrates the diffs or how we might process it to automatically revert back to the pre-Darius version, but if those things are possible, they might be good approaches. I already have a notice for talk pages about rolling back CCI articles: User:Moonriddengirl/CCIr. I only use it when there is evidence of copying, but it could be easily modified to this situation.
  • Do you write scripts? There are several ideas I have for copyright cleanup tools that I would love to see in the works. If you do and you're up for it, come by my talk page. :D (Note, though, that I am technologically clueless. I never know if my ideas are in the realm of "easily accomplished" or "needs a magic wand.") --Moonriddengirl (talk) 14:17, 8 September 2010 (UTC)[reply]
  • The bit about scripts is still very useful, as we should probably focus first on the articles he created, which have a greater likelihood of diffs containing vios rather than just extra categories. fetch·comms 14:55, 8 September 2010 (UTC)[reply]

75.57.241.73, blanking the articles now does not preclude deleting them later, if we find that in six months we still have ten thousand blanked articles. But going straight to deletion immediately precludes any other approach. (You would have to get someone else to volunteer to do that, in any event. None of my 'bot accounts have sysop privileges, and I'm not going to do 'bot edits with this account.)

I reiterate my request for everyone to please boldly fix anything in Wikipedia talk:Contributor copyright investigations/Darius Dhlomo/How to help, Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation, and User:Moonriddengirl/CCIdf — which latter I suggest reside somewhere like Wikipedia:Contributor copyright investigations/Darius Dhlomo/Article notice or in the Template: namespace (if we aren't unhappy about Wikipedia mirrors showing the same notice). If we're going to do this, I want those all thoroughly reviewed beforehand. Uncle G (talk) 15:25, 8 September 2010 (UTC)[reply]

What should we do about translations of these copyvios in other languages? Is there a way to get a list of all interwikis created after Darius's articles here? If the en versions are determined to be vios, the other-language ones need to go as well. fetch·comms 15:39, 8 September 2010 (UTC)[reply]
In the useless-comment category, I have no idea. :/ I have once or twice communicated with other wikis when I knew that an article had been copied, but we have never had a practice of doing this. --Moonriddengirl (talk) 13:13, 9 September 2010 (UTC)[reply]

FWIW, from the few articles I've looked at, I've seen the following patterns:

  • Articles of the type "[insert name of country] at the [insert year] [insert competition]" are quite frequently little more than a bit of boilerplate at the top, a formatted list in the middle, & the expected stuff at the end. Probably best examined by hand in case he added any further text -- which is likely a copyvio. (And the subject name does vary a bit.)
  • Same for individual sports at the Olympics, Pan-American games, etc. Same treatment.
  • Biographical articles that are not minimal stubs seem to routinely have copyvio material in them. If we can determine the cut-off size for these -- where the visible text is more than "X [insert birth & death information] is a [insert country] [insert athletic specialty]. He/she was active [insert length of career]" -- those could either be safely deleted or stubbified.
  • My experience confirms Uncle G's estimate that around 10% of his articles are copyright violations; the rest are simply stubs. Deleting the 90% of acceptable -- & likely useful -- articles just to purge this poisonous share is overkill -- unless one believes all stubs are potential maintenance problems & should be deleted. (Not arguing for or against that opinion, but I suspect it is a motivation of some of those who favor mass deletion.) -- llywrch (talk) 16:02, 8 September 2010 (UTC)[reply]
  • Llywrch, besides maintenance issues, it occurs to me that a more visceral reason I want to delete these articles is WP:DENY. We customarily revert any edits made by banned editors without trying to figure out whether they're good edits or not (occasional exceptions are permitted if a user wants to proxy a particular edit as their own, but that's not supposed to be done as a matter of course). Copyvios introduced as flagrantly as this should be treated like banned edits and undone. If the edit is one that creates a new article, undoing means deleting the article. There are some Wikipedians who somehow equate deleting an article with drowning a kitten, but really, in a case like this (where nobody else has edited the article), we should just think of the deletions as reversions and deal with it. We shouldn't let someone make 10,000 banned edits and have them stay in the encyclopedia.

    Also, wanting to maintain high standards for BLP sourcing is in part for WP's neutrality and not just maintainability. OK, these particular articles are about athletes, who tend to not be self-promoters and don't bother me all that much, but if they were mostly about garage bands or motivational speakers or fringe political weirdos, I'd see this incident as someone spewing 10,000 self-serving search magnets into WP article space to bias its content and laughing his head off if they were allowed to stay, and I'd be outraged.

    You additionally wrote "[p]eople here appear a lot more eager to tell us what the solution is & expect someone else to do it, than to actually help fix the problem" and that's part of it too. It seems to me that proposing 100's of other editors spend 1000's of hours examining the articles and cleaning the vios is exactly expecting others to fix the problem. It's tragic that someone like Fetchcomms says he's taking time away from his own writing to help preserve this Darius spew. By contrast, nuking all the affected articles with a bot can be done by one person in a few hours. I'm ok with volunteering to implement such a bot myself if that's the decided outcome. The code would have to go through some approval hoops and run from an admin account, so deployment would still require some other people's involvement, but I could write the code and hand it over to a bot op for activation. So I'm willing to personally do (most of) the work of the mass deletion proposal, and implement something that takes care of all 10000 articles in one go. I don't see any preservation proponents offering to personally do most of the work of reviewing all 10000 articles manually.

    Finally, you claim that 90% (or whatever) of the articles are vio-free because they don't contain copied text. IANAL but if Darius has copied the dates, names, events, times, etc from 1000's of entries in some copyrighted sports statistics book into WP articles, I'm sure as heck not willing to bet my skateboard on the assertion that those articles are not copyvios even if zero words of actual prose are copied. So I think these Darius-created articles should be treated as 100% copyvio even if they contain no text (just names and numbers copied en masse from wherever into tables and templates). So I don't accept the 90% non-vio figure and I'm not willing to declare a single one of those articles vio-free if I have to accept liability for it if I'm wrong. If others want to do so, that's up to them.

    I'm not especially trying to swing the discussion back to bot deletion at this point (I'm used to being in the minority on stuff like this), but just trying to let you know where I'm coming from. I hope that helps. Regards, 75.57.241.73 (talk) 03:16, 9 September 2010 (UTC)[reply]

  • And all of that is relevant to my volunteered observations exactly how? Or did you intend to respond to someone else's post? -- llywrch (talk) 04:16, 9 September 2010 (UTC)[reply]
You wrote "Deleting the 90% of acceptable -- & likely useful -- articles just to purge this poisonous share is overkill -- unless one believes all stubs are potential maintenance problems & should be deleted." I responded to explain that there are many other reasons besides "all stubs are potential maintenance problems" to believe deletion is not overkill and is the right thing. Anyway, per WP:IINFO, "useful" by itself is not a sufficient reason to keep something. We're looking for documentation from secondary sources, and these articles mostly (entirely?) don't have that. 75.57.241.73 (talk) 04:58, 9 September 2010 (UTC)[reply]
Okay, I see the connection. Despite your appeal to WP:DENY, deleting all of those articles is still overkill. Sanctioning this editor, cleaning up his mess, & moving on will result in a minimal impact which will deny him any attention. Deleting all of those articles will give him far more attention: people will come to the articles, find them deleted, & inquire about what happened to them -- & DD's story will be repeated once again, which will keep attention on him. As for my comment about "useful information", I was referring to the quality of the stubs: many of them clearly have more detail than, for example, "llywrch (1957 -) is an American editor of Wikipedia, who has edited for almost 8 years." There are a lot of stubs in Wikipedia with that much text, which have avoided deletion only because their subject is notable. (Not to imply I am a notable subject, mind you.) -- llywrch (talk) 22:11, 9 September 2010 (UTC)[reply]
The DENY article is about trolls. Darius has caused an immense mess for us but I don't think for one second that (a) he is a troll, or (b) he's enjoying this notoriety in any way, shape or form. SFB/talk 23:41, 10 September 2010 (UTC)[reply]
  • Do we need a bot to do this? According to WP:Administrator, "The English Wikipedia has 1,755 administrators as of September 8, 2010." I know some are retired/inactive, but if each administrator looked over 10 or so articles they would all be reviewed within a day. If the article is a copyright violation they can delete it. This would save all of the copyright-violation-free articles. Of course this would take quite a bit of organization, but it is just an idea. --Alpha Quadrant (talk) 17:27, 8 September 2010 (UTC)[reply]
  • If they would, we wouldn't. But this has been publicized at both admin noticeboards as well as plenty of points around Wikipedia, and so far we've got nowhere near full admin participation. I suspect we're not going to get even a tenth of them involved. --Moonriddengirl (talk) 17:34, 8 September 2010 (UTC)[reply]
  • If every admin wrote four FAs a year, we'd be well on our way to a better encyclopedia, but there's no way a thousand people will be coaxed into even reviewing one article. Whoever can help, please help. Otherwise, we can't ask any more of users who are already busy in RL. I've personally postponed some writing goals to work with this CCI because I value copyright very seriously, and I realize this is an understaffed area. But I can't speak for others' priorities. fetch·comms 18:11, 8 September 2010 (UTC)[reply]
  • I have made a BOLD edit to Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation per my suggestion above.[19] Note that implementing the suggested change will require some additional code in the bot, which would of course be up to Uncle G. 75.57.241.73 (talk) 21:59, 8 September 2010 (UTC)[reply]
    • Seems sensible; as long as Uncle G can make the bot revert to the last "major" (non-category, etc.) change, or it will take ages to get through this list if he added one category in 2006 and then the article was legitimately expanded by someone else. fetch·comms 22:28, 8 September 2010 (UTC)[reply]
      • I cannot do any sort of reversion at all without a fair amount more work. As I said, these are very simplistic tools when it comes to content editing. I can append/prepend wikitext to a page, or replace a page entirely with some given wikitext. (These tools' prior content editing tasks have included nothing more hefty than raking sandboxes and creating boilerplate xFD pages.) I could write a tool to do this, I expect. I'm not even going to try to give an estimate on that right now, or even definitely confirm that it's possible. Right now I'm revisiting code that I wrote half a decade ago and updating it to work with the current MediaWiki interfaces. ☺ Then I have to do some testing.

        The first pass of anything that the 'bot does will be the ten thousand creations, simply blanked. (We all agree on that, yes?) More complex work on the remainder after that, the touched but not created articles, we can come back to. I'll probably need a new list off VernoWhitney at that point, anyway. Uncle G (talk) 22:50, 8 September 2010 (UTC)[reply]

        • The API can give the categories directly from the last version of any article[20] and the bot could use that list to categorize the reverted articles, instead of trying to parse the categories out of the article text. So I think the categories aren't a big problem in their own right, though some other templates may be worth scanning for. I agree that it's reasonable to separate the task into blanking the 10000 Darius-created articles first, while deferring til later possibly doing more complicated stuff with the other 13000 articles. That would mean we're back to doing just the 10000 for now, which is OK. (Added: note "reversion" means replacing the entire wikitext with another text, in this case text from an earlier revision of the article). 75.57.241.73 (talk) 23:46, 8 September 2010 (UTC)[reply]
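For the record, the categories lookup described above is a standard MediaWiki API call. In this sketch the request URL is only constructed, not sent, and the response being parsed is a hand-made sample in the shape the API returns:

```python
import json
from urllib.parse import urlencode

# Standard MediaWiki API query for an article's categories (not executed here).
params = {
    "action": "query",
    "prop": "categories",
    "titles": "Ted Morgan (boxer)",
    "cllimit": "max",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)

# Hand-made sample response in the shape the API returns:
sample = json.loads("""{"query": {"pages": {"123": {"title": "Ted Morgan (boxer)",
  "categories": [{"title": "Category:New Zealand boxers"},
                 {"title": "Category:Olympic boxers"}]}}}}""")

cats = [c["title"]
        for page in sample["query"]["pages"].values()
        for c in page.get("categories", [])]
print(cats)
```

A bot could re-apply exactly this list to a reverted or blanked article instead of parsing categories out of old wikitext.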

I'm ready to use the list. I've tested the 'bot on Ted Morgan (boxer) (which was a definite copyright violation of this biography). You can see the edit here. That's what's going to happen, and that's what it's going to look like. I might tweak the edit summary a bit. Uncle G (talk) 04:38, 9 September 2010 (UTC)[reply]

  • I would change the category in the template to something like CCI-DD (for Darius Dhlomo, I don't know that it's appropriate to spew a username directly into articles even in a category) since CCI refers to lots of different incidents/investigations. The edit summary should have the same change. I'd just link CCI-DD in the edit summary to the info page rather than having "what is this bot doing" there. Otherwise, looks reasonable. Can you run it on, say, 5 articles as the next test? 75.57.241.73 (talk) 04:50, 9 September 2010 (UTC)[reply]
    • I leave it up to Moonriddengirl and the other CCI regulars as to what the category is. I'd rather not even put "DD" in the edit summary or the article notice, myself. I think it unfair for Darius Dhlomo's name, even abbreviated, to come up all over the WWW as a result of this. Uncle G (talk) 13:32, 9 September 2010 (UTC)[reply]
      • How about CCI-2010A. So if there is another one this year it would be 2010B etc. 67.119.12.29 (talk) 03:23, 12 September 2010 (UTC)[reply]
  • Also, if you're not going to have the bot make any logs, maybe you could make a new account ("Uncle G's Major Work Bot -- DD CCI task" or some such) just for the purpose of this task. That would make it easier to locate all affected articles by pulling down the special account's contribs.

    As a separate matter, I had wondered if we might have a bit more discussion about moving the articles to incubation, rather than leaving them in article space (or deleting them). 75.57.241.73 (talk) 04:53, 9 September 2010 (UTC)[reply]

    • Seems logical. I already saw one of Darius' articles graduate from the Incubator, all possible traces of copyvio gone. But we need to delete all of the found vios first, of course. fetch·comms 12:50, 9 September 2010 (UTC)[reply]
    • The 'bot hasn't really done anything for just over a year, and this is the only task that it's doing. It will be very easy to spot the list of articles from its contributions history. The article list that it's going to be working from is on-wiki in the first place, anyway. It's here (1–5000) and here (5001–9664). Uncle G (talk) 13:32, 9 September 2010 (UTC)[reply]
  • Just to note: I'm here and ready to start restoring articles that have already been cleared. --Moonriddengirl (talk) 13:13, 9 September 2010 (UTC)[reply]
  • Okay, I've made a restoration of Ted Morgan (boxer) as a stub. Is this what we should expect? If not, revert/delete my version & replace it with the proper version. (Or revert me if I acted too quickly.) -- llywrch (talk) 22:19, 9 September 2010 (UTC)[reply]
    • Well, now it's an unreferenced BLP that may be subject to deletion under some forthcoming BLP cleanup unrelated to this copyright stuff. You might want to transfer some references from old revisions. Anyone currently cleaning up the articles should keep in mind that the bot might wipe them, in which case they can revert the bot. 75.57.241.73 (talk) 23:28, 9 September 2010 (UTC)[reply]
      • List of refs restored. Unsourced BLPs are just about as bad as copyvios, so we need to avoid those as well. fetch·comms 02:15, 10 September 2010 (UTC)[reply]
  • OK, I just saw this on AN a moment ago, so forgive me if this has already been answered, but what is the list of articles we are looking at blanking here? - NeutralhomerTalk • 04:41, 10 September 2010 (UTC)[reply]
    • Revision links are a couple of paragraphs above. ↑ Uncle G (talk) 04:59, 10 September 2010 (UTC)[reply]
  • This reminds me of the Samurai Archives cleanup we did a few years ago. Not nearly as many articles, but still a royal pain to deal with. ···日本穣? · 投稿 · Talk to Nihonjoe · Join WikiProject Japan! 07:21, 11 September 2010 (UTC)[reply]

Categories[edit]

Not sure if this is the right spot to ask, but it's all a bit confusing for those just arriving at the party! If I understand it, 10000 articles are about to be blanked. Unless I've missed it, the list is two old revs of a list? I've cut and pasted that list to User:The-Pope/DDCCI_list (for some reason only 9664 pages) so that AWB can suck it in and I can do some list comparisons on it. Did we end up agreeing on adding a special category to the blanked page? If so, then the WP:CatScan tool could also be used for people to help in the rescue/rewriting - but that would require the categories to remain. The test edit removed the cats - surely they aren't a copyvio, and removing them will make rescue near impossible.

If I go proactive and completely rewrite an article (or have already completely rewritten one), will it still be blanked, so that I would have to revert it myself? Is that correct? If it is, it's a drastic measure, but I understand: compared to the last great BLP issue (unreferenced BLPs), this one is about an actual problem, not just a "it might be a problem." I might start publishing some "rescue lists" of higher-profile athletes, such as Olympic medalists etc. There are 213 Australian sportspeople on those two pages - which is my area of interest, and I'm just checking how many are medalists. The-Pope (talk) 01:55, 11 September 2010 (UTC)[reply]

  • The transcluded notice will (unless someone changes it) add the articles to Category:Articles tagged for CCI copyright problems, so they could be tracked that way. As far as your second question goes, someone would have to revert/unblank it. You of course could, and Moonriddengirl has volunteered for that mind-numbing task for every article which has already been cleaned, so if you don't, she or somebody else will in relatively short order. VernoWhitney (talk) 02:08, 11 September 2010 (UTC)[reply]
    • But won't that category potentially have other copyvio issues unrelated to this task, or has it been set up only for these mass blankings? I can't see why removing categories is a good thing. Is there any reason? They are the one part of the article that is absolutely not a copyvio. BTW, there are 1,392 Olympic medalists in the list to be blanked, including 436 Gold medalists. The-Pope (talk) 02:30, 11 September 2010 (UTC)[reply]
      • That category has been specifically set up for this, so it should only have these pages aside from any leftover testing pages, such as the only member currently present in the category. As far as leaving existing categories - I know it's easier to just replace everything with the blanking, but you can ask at the BRFA for the task or at the bot owner's talk page, and maybe they'll be able to work that in. VernoWhitney (talk) 03:13, 11 September 2010 (UTC)[reply]
  • If you rewrite an article, just watchlist it and revert the bot after the bot blanks it. Generating the list of Darius's articles took about 20 minutes with soxred93's pages created tool if anyone cares. No new ones have been created since this started, since Darius has been blocked. 75.62.3.153 (talk) 04:20, 11 September 2010 (UTC)[reply]
    • Forgot to add: Pope, yes, 9664 sounds like the actual number; we've just been rounding it off to 10000 in conversation. Sorry for the confusion. 75.62.3.153 (talk) 05:19, 11 September 2010 (UTC)[reply]
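For anyone wanting to regenerate such a list without the external tool, the MediaWiki API can return page creations directly. A sketch only, not the tool's actual code: the `created_pages` helper name is mine, and real use would page through results with `uccontinue`.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def created_pages(username, fetch=None, limit=500):
    """List article titles created by `username`, using
    list=usercontribs with ucshow=new (creations only)."""
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": username,
        "ucshow": "new",     # only edits that created the page
        "ucnamespace": "0",  # main (article) namespace
        "uclimit": str(limit),
        "format": "json",
    }
    if fetch is None:  # default: hit the live API
        def fetch(p):
            url = API + "?" + urllib.parse.urlencode(p)
            with urllib.request.urlopen(url) as r:
                return json.load(r)
    data = fetch(params)
    return [c["title"] for c in data["query"]["usercontribs"]]
```

A single query is capped at 500 results for ordinary accounts, so a list on the scale of this CCI would take a couple of dozen continuation requests.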
  • Uncle G, maybe you could make the bot preserve the categories in the articles, while reverting the content? The categories are quite easy to retrieve from api.php. The current API is tons better than the stuff that existed 5 years ago. 75.62.3.153 (talk) 04:23, 11 September 2010 (UTC)[reply]
    • See the two new example edits. ↑ Uncle G (talk) 14:37, 11 September 2010 (UTC)[reply]
  • It's mostly all or nothing with this 'bot's current tools. It doesn't have the ability to parse wikitext. "major work" has always been mass tagging or mass renaming or mass boilerplate creation. Mass detailed content editing I have left to other 'bots and other 'bot owners. I volunteered the 'bot for this task because it is, essentially, a mass tagging task. I'll see what I can cobble together with grep and WIKIGET to pull out categories, or whether it's a simple task to write a new tool to pull out the categories, but I make no promises of full (or any) function in this regard and I'm not going to write a wikitext parser for this task.

    Note that, as stated above, all of the articles will be categorized by the notice that is being applied. (q.v.) That's intended as a depopulate-this-category worklist for CCI volunteers. Uncle G (talk) 12:17, 11 September 2010 (UTC)[reply]

  • When it comes to interwiki links, I refer you to these edits by SieBot (talk · contribs), which made me chuckle: SieBot edit SieBot edit SieBot edit. Uncle G (talk) 12:43, 11 September 2010 (UTC)[reply]
  • Uncle G, the bot should not have to parse the wikitext to get the old categories out. It can get the old categories with an API query and the interwikis the same way. Then just inject the retrieved category and interwiki info into the new wikitext along with putting in the template. That should be easier than parsing. 67.119.12.106 (talk) 16:38, 11 September 2010 (UTC)[reply]
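As a rough illustration of that suggestion (a hypothetical helper, not the 'bot's actual code): `prop=categories|langlinks` returns both sets of links in one query, and injecting them under the notice needs no wikitext parsing at all.

```python
def keep_cats_and_interwikis(api_page_json, notice_wikitext):
    """Append a page's categories and interwiki links (as returned by
    action=query&prop=categories|langlinks, legacy JSON format) to the
    replacement wikitext, so the blanking preserves them."""
    page = next(iter(api_page_json["query"]["pages"].values()))
    lines = [notice_wikitext]
    for cat in page.get("categories", []):
        lines.append("[[%s]]" % cat["title"])  # title arrives as "Category:Foo"
    for link in page.get("langlinks", []):
        lines.append("[[%s:%s]]" % (link["lang"], link["*"]))
    return "\n".join(lines)
```

The categories land as ordinary `[[Category:…]]` lines and the interwikis as `[[lang:Title]]` lines below the notice, which is exactly the layout a later cleanup edit would expect.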
  • I mentioned this in the BRFA but it may bear repeating here, it occurs to me that clobbering the interwiki info may cause a bunch of disruption because it will break up the interwiki connection graph across all the wikipedias. So the various interwiki bots watching those links will both start trying to restore the interwikis in the now-clobbered English article, and also start messing up some of the non-English ones because the English one had been used for some inference about some other language article and is now removed. 67.119.12.106 (talk) 16:44, 11 September 2010 (UTC)[reply]

How long will it take?[edit]

As a complete outsider looking at his first mass deletion, how long will it take the bot to blank the 10,000 articles? Thanks, and intriguing conversation. --intelati(Call) 05:45, 10 September 2010 (UTC)[reply]

  • It's a mass blanking, not a mass deletion. How fast the 'bot will run, and other purely technical issues (aside from the "Should we do this? Is there another way?" questions, which we're rightly discussing here), is still up for discussion at Wikipedia:Bots/Requests for approval/Uncle G's major work 'bot. I'm taking a fairly conservative approach. My major work 'bot has, historically, run at only a few edits per minute. There are ways for 'bots to run much faster, with approval. I'd need to change my 'bot in order to use them. I might do that later on. (The tools for the 'bot are fairly simple. The necessary change complicates them a fair bit.) But even at (say) 4 edits per minute, 10,000 articles is only 41.666666666667 hours. (If you look at the 'bot's contributions history, you'll see one task that took it more than a month.) We've spent longer than that on this discussion, for comparison. I'll probably do the work in batches, too, so there might be gaps as I do various unnecessary things like eating, sleeping, and so forth before submitting a fresh batch. Uncle G (talk) 06:27, 10 September 2010 (UTC)[reply]
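The arithmetic in that estimate, for anyone checking (a trivial sketch; the 4 edits per minute is Uncle G's illustrative rate above, not a measured figure):

```python
def blanking_hours(articles=10_000, edits_per_minute=4):
    """Rough wall-clock time for the blanking pass at a steady edit rate."""
    return articles / edits_per_minute / 60  # minutes -> hours
```

At 4 edits a minute that is just under 42 hours of 'bot time, before any batching gaps for eating, sleeping, and so forth.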
    • At least you have good humor about it.--intelati(Call) 15:16, 10 September 2010 (UTC)[reply]
    • Wow, I just saw the hatnote!! You guys must have your hands full!! Too bad, now Wikipedia will only have 3,397,565 articles as of now. :-) Don't work too hard! --Funandtrvl (talk) 00:39, 11 September 2010 (UTC)[reply]
      • Now that's being an optimist. ;) -- œ 18:03, 17 September 2010 (UTC)[reply]

Tweaking the edit summary[edit]

I'm going to tweak the edit summary a bit more. It already contains a clear naming of the task, a description of the edit, and a hyperlink to a detailed explanation. I'm not happy with the hyperlink being phrased as a question. A declarative "What this 'bot is doing" might be better. I might expand "mass blanking", too. I'm going to look at whether I can do anything with categories and tweak the edit summary, then I'm going to shove a batch of ten or so articles through, to double-check the edit rates and the main script. Uncle G (talk) 12:38, 11 September 2010 (UTC)[reply]

  • See the new edit summary in the 'bot's two new test edits. Uncle G (talk) 14:37, 11 September 2010 (UTC)[reply]
    • Nice job with the categories (further discussion on the BRFA). I think I'd change "What this 'bot is doing" to something less cutesy, like "Explanation", but that's just me. 67.119.12.29 (talk) 01:37, 12 September 2010 (UTC)[reply]

Short Pages[edit]

For the sake of us short pages patrollers, please make one modification to the bot's blanking pass. Please add a comment such that the resulting blanked page is somewhere over 200 characters long. Without this, the blanked pages will flood out the various short pages lists, making them worthless for any sort of patrol for an extended period of time. - TexasAndroid (talk) 02:07, 14 September 2010 (UTC)[reply]
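A minimal way to satisfy this (illustrative only; the padding text and the `pad_notice` name are mine, not the 'bot's) is to append an HTML comment until the saved wikitext clears the threshold, since comments render as nothing:

```python
def pad_notice(notice, min_len=200):
    """Pad the blanking notice with HTML comments so the blanked page
    stays above the short-pages threshold."""
    filler = "<!-- Padding so this blanked page stays off Special:ShortPages. -->"
    out = notice
    while len(out) < min_len:
        out += "\n" + filler
    return out
```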

That deals with my concerns. Thank you very much. - TexasAndroid (talk) 15:02, 14 September 2010 (UTC)[reply]

To avoid redundancy[edit]

I am generating a list of articles that have been checked and cleared so that the Bot will not blank these. The question is how to avoid redundancy of labors after. We will then have a category that contains a list of all blanked articles and the usual CCI board listing of articles that have not been cleared. Should we switch our efforts to the category, which does run a risk that copyrighted content may be overlooked if the blanking is reversed by a contributor who (a) does not recognize the problem or (b) does not care? Or should we ask those who remove the bot blanking to note their action at the list? (Not so hard if they follow the "what links here" to find the specific subpage.) --Moonriddengirl (talk) 14:05, 15 September 2010 (UTC)[reply]

I think it would be beneficial to ask those who remove the blanking to note their action at the CCI subpages. That way there will be an easily accessible permanent record of who cleared what. This is also the way that every other CCI is handled and so is what will be expected unless we overhaul the entire system from this point on. I know it will mean more work for those who clear articles and so probably less help, but I think the transparency is important in the copyright area where we can either be restoring copyvios or removing large masses of text with no other explanation. VernoWhitney (talk) 14:11, 15 September 2010 (UTC)[reply]
Why not just have a bot or script pick up when someone removes the CCI template? I've asked Coren if CorenSearchBot can do that, since it would be good for the bot to sanity check the page after someone has pronounced it clean. 67.119.14.196 (talk) 09:21, 17 September 2010 (UTC)[reply]

Questions[edit]

Ok, I've just spent a fair amount of time reading all of this page (and others).

I won't claim to understand the CV stuff, except that I thought that the current practice with CV is delete, and then sort it out.

It doesn't matter if the article is now featured or is some perfect article we "need".

delete and start over.

Not to mention that the rule of thumb with banned users is to delete/revert their contributions.

Wikipedia:BAN#Edits_by_and_on_behalf_of_banned_editors.

I'm just concerned that the blanking will lead to the common editor (who doesn't care about cv, and is just enthusiastic for their favoured info to be on Wikipedia) shrugging their shoulders and reverting the blanking.

Not to sideline ANY help, but shouldn't the pages all be protected, at least semi?

Also... I'd suggest oversighting all the edits, but from what I read above, we need them to compare what are his edits (and cv) and what aren't, in the hopes of salvaging?

And not to go the WP:BEANS route, but what's to stop him from block evasion, creating an even further mess of having to now track 20k+ articles for IPs, and other fun stuff, forever....

Further clarification on this would be welcome. - jc37 03:42, 10 September 2010 (UTC)[reply]

  • Three points:
    • It's not that simple. Some articles are 1-paragraph stubs that contain barely more than raw numbers, names, and dates. Some articles have been heavily edited since, by other editors. Some articles had most of their content written by other editors anyway. Some articles were just touched for non-prose-content-related tasks such as (re)categorization. When we are talking about a corpus of twenty-three thousand articles, any "some" is a lot.
    • Darius Dhlomo wasn't necessarily acting in bad faith. Xe wouldn't be the first person to just not get that taking other people's prose wholesale, tweaking some pronouns and shuffling the sentences around a bit, is not original writing. It's not that xe has, or had, bad intentions. Clearly xe had good ones, trying to give us articles on Olympic athletes and so forth. It is that xe enacted those good intentions entirely wrongly; and has not been particularly forthcoming about either the scale of the problem or specifics of particular articles when questioned.
    • There's a fairly clear warning on the template that anyone bulk reverting for the sake of it and reintroducing copyright violations by doing so will be treated as a problem editor. And anyone doing so only has to look at Darius Dhlomo's block log to see where that road leads.
  • This isn't just a once-off copyright violation, or even a new account that's submitted a couple of copy-and-paste articles. This is Wikipedia:Contributor copyright investigations territory. Things are different, here, not least because processes like Wikipedia:Copyright problems simply don't scale to this number of articles. Even the usual CCI process is having difficulty scaling to this, which is why we're trying this new approach. Uncle G (talk) 04:13, 10 September 2010 (UTC)[reply]

response (Also edit conflicted - I'll try to respond to the other posts (This responds to UncleG's original response) following this.)

To clarify: I wasn't commenting on whether he should be banned - though, from what I read above, indef blocking (at least) seems appropriate. Though, to be honest, regardless of how many possible positive edits he's made, he really seems to have been less than honest about all of this, so trust seems misplaced, at least for now.

And it doesn't matter if the cv was in good faith or not. It's cv - delete/revert. And if we can't trust any of his edits, then send a bot through his edit history and that's that. And I'm presuming that this isn't a case where we can allow for "oops, we missed those cv edits in his edit history".

To touch on your points:

"Some articles are 1-paragraph stubs that contain barely more than raw numbers, names, and dates." - the scale and scope of the cv's precludes caring about the other, possibly "clean" edits. delete and revert, and only restore when assessed as allowable. You've shown repeatedly above that the cv is vast, throughout his edits.

"* This isn't just a once-off copyright violation, or even a new account that's submitted a couple of copy-and-paste articles." - I don't think it matters if he's made one edit or one million edits. If his edits are cv, then they MUST be removed from this encyclopedia. Maybe I'm missing something here, but my understanding was/is that no extenuating circumstances allow for CV material to stay on Wikipedia.

And I get that this is beyond the typical scope. I agree that this should be handled by bot. I'm just saying that too much leeway is being given here for cv material to NOT be removed. WP:AGF has worn out its welcome here. He's lied to you all repeatedly, boldly and point blank. He has cv throughout a massive editing history. There really isn't much left but to remove from the encyclopedia, and then AFTER that, have the CCI experts (and whomever else would help) to try to restore anything salvageable from this mess.

Also, just to be clear, in re-reading my comments above, they sound terse in "voice", I hope you understand that you all have my full empathy in what I am certain is a royal pain and mess that you're trying to sift through. So please don't take anything I'm saying as directed negatively towards any of you. - jc37 04:53, 10 September 2010 (UTC)[reply]

  • No worries.

    Four more points:

    • Any administrator deleting 10,000 articles in the normal manner is in for a severe case of repetitive strain injury. Any administrator deleting 10,000 articles with a 'bot is in for massive drama and an arbitration case. (Even just blanking I know is going to cause complaints. Trying to prevent this by letting people know that this is coming ahead of time, is why I've been posting notices all over the place.) And there's no way a Developer is going to touch this without a poll.
    • My calculations, and those of others above, are that only 10% of the articles are copyright violations. The problem is that we don't have a mechanical way to determine which 10%. So we're sharing the pain, as it were, of finding out. The trade-off that you're looking at is the outright deletion of ~9,000 good articles for the sake of ~1,000 bad ones. That's something that we all, as people who want to build an encyclopaedia, understandably feel uncomfortable with.
    • This step doesn't rule out taking further steps. We're dealing with ten thousand articles out of twenty-three thousand, after all. Even on its own, this isn't the complete solution to the problem. (It shrinks it a little bit, though.) We could, six months from now, if we find that only a small percentage of articles have been reviewed, decide that the process hasn't worked, and choose to take a different route.
    • This is all about not concentrating either the decision making or the enactment of the outcome in the hands of one, or a few, administrators. Anyone, even someone without an account, can fix an article if we take this route. If we go down the delete-everything-and-let-the-challenges-come route, we have a small number of administrators having to review the deleted content of ten thousand articles. It's not the case that only administrators are capable of this sort of thing. Aside from the fact that non-administrators scan for, and tag, copyright violations every day, in good faith, let's not forget that most of Wikipedia's content was written by editors who don't even have accounts. The non-administrator editorship is a force to be reckoned with. It does, in the main, operate in good faith. And it may well prove to be capable of demolishing this problem with ease.
  • Uncle G (talk) 05:30, 10 September 2010 (UTC)[reply]
    • Some of what I would respond to above is already covered in my "responses to others" below, so I'll leave that there.
    • And I'm empathetic about concerns about various kinds of "community blowback" (Past dramas involving responses to and from betacommand come immediately to mind). I just think we're past that stage (I think you've done a pretty decent job of communicating that there is an immense problem.) And this should move forward.
    • "Even on its own, this isn't the complete solution to the problem." - And that's acceptable? To even let one "slip through" when we could prevent it by just mass reverting his edits.... In this case, the baby needs to go out with the bathwater. We'll retrieve the baby afterwards. Nothing else is acceptable, from what I understand. And honestly, I thought that the current system is to revert to some edit just prior to the editor's first edit on an article, and then allow the wiki world to salvage anything after, if there is anything salvageable. So why is that not the case here? And if the editor created the article, blank (in mild cases) and delete in cases of rampant cv. Again, why not here?
    • I'm not saying that admins should be the only ones doing this. I'm saying that, due to the scope, and due to how the resolution appears to be implemented, that admins might as well do it, since they'll have to meticulously go through and follow up on the bot, and everyone else's edits anyway... I'd like to wish that wasn't true, but, it seems that way to me at least. And when one considers that most admins are less than active (I only just recently came back from a rather extended wikibreak myself), that idea is also daunting. - jc37 05:46, 10 September 2010 (UTC)[reply]

I see that I'm being accused of being an optimist. Again. ☺ Here are some more points to allay your concerns:

  • There's nothing stopping an ordinary editor from reverting a bad undo. Again, more than administrators can police this. Indeed, we already rely upon ordinary editors to police copyright violations using the ordinary editing tool. If someone reverts to a non-infringing version, and then someone reverts back, that's policable, and policed, with the editing tool.
  • Place yourself in the shoes of the hypothetical not-caring-about-stealing-the-writing-of-others editor who mass undoes the notices. How long do you think that would take with 10,000 articles? Xe wouldn't be using a 'bot, after all. Do you think that someone might notice, say, 500 such undos in a contributions history and raise a query?
  • If we had a lot of such editors, we'd already be in big trouble. I contend that the fact of our continued existence is proof that the well-meaning ordinary editors outnumber the ill-intentioned.
  • Who said that we were stopping at just this? 75.57.241.73, above, has some ideas for steps to take after this one. (q.v.)
  • It is important to bear in mind that this first action is addressing the list of ten thousand articles created by Darius Dhlomo. If there's a copyright violation, it's very likely going to be a foundational one. There's nothing to revert back to in such a case. On the other hand, as demonstrated, there may be independently written, non-derived-work, content that an editor can salvage from later additions by other editors. Ted Morgan (boxer), the 1 sample 'bot edit so far, is perhaps a fairly good example of this.
  • If you, or anyone else, want to watch for notice removals, I'm sure that Kingpin13 could get SDPatrolBot, or PleaseStand could get PSBot, to watch for notice removals and list them on some page somewhere. Talk to them about it. Note that, again, not just administrators could watch and police such a list, just as not solely administrators already look at the lists of de-prods.

Uncle G (talk) 07:35, 10 September 2010 (UTC)[reply]

"If we had a lot of such editors, we'd already be in big trouble." - In a typical situation, sure. But this isn't typical. So unlike the normal case, where CVs would be staggered over time, this will be thousands blanked at once. Which means that we will be providing the rare opportunity to see a million monkeys in action : )

"Who said that we were stopping at just this?" - Multiple steps, when the encyclopedia is constantly being edited (from a programmer's pov, the changes are going in "live"), mean that the longer we wait and the more steps are involved, the more likely it is that this won't be all-inclusive. Looking at this from an outsider's view (like a judge) - "Did you take all action necessary to prevent CV? If not, why not?"

Hence why I'm not understanding why all his contributions are not (?!) being reverted. And then we start restoring from there.

Anyway, that aside, yes, having a category or subpage somewhere that lists all of these that a bot can check for simple reversion of the blanking, would probably be a good idea. (give vandal patrollers something else to watch for - "many monkeys, meet many monkeys" : )

What might be really nice is if a filter could check to see if (let's say) less than 200b was changed between the version restored by some user, and the version prior to blanking. Though I'm not sure atm if that's within the possible use of the filter system. (I'll have to take a look). - jc37 14:22, 10 September 2010 (UTC)[reply]
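The size comparison jc37 describes could also be done outside the filter system (a sketch with made-up names; per-revision byte sizes are available from `action=query&prop=revisions&rvprop=size`):

```python
def looks_like_plain_unblank(size_before_blanking, size_after_restore,
                             threshold=200):
    """Flag a restore whose size is within `threshold` bytes of the
    pre-blanking revision, i.e. probably a straight revert with no
    copyvio cleanup done."""
    return abs(size_after_restore - size_before_blanking) < threshold
```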


Other responses[edit]

(The threading is slightly confusing (for me at least), trying to make some sense of it.) - jc37 05:57, 10 September 2010 (UTC)[reply]

(edit conflict) Blanking seems more viable for a massive list. Better tracked, and only maybe 10% are actually vios. See section just above for a bit more on this.

Is he a sock of a banned user? We only "revert any edits made in defiance of a ban", so if we ban him now, we won't rollback his edits before that.
We can have a list of pages blanked, and just check relatedchanges every so often. I really doubt anyone will bother to remove warning tags for admittedly obscure articles, but they'll get warned and/or blocked if that continues, anyway.
Oh, that would probably kill the servers a couple times over, protection for 10,000+ pages :P. To be serious, it would screw up some tracking lists of semi'd pages and probably isn't needed at this point.

Oversight --> revdel works fine, but for vios, I'd only personally bother with that for major, major vios that have been around since who knows when.

If he creates socks, they will be noticed pretty quickly, I'd imagine. New users and/or IPs removing copyright tags and all is pretty suspicious. If he does that, then just block, etc. IDK what else is needed. He knows that evading blocks is wrong, so if he does it, there goes any chance of an unblock within a few years.
I hope I made sense. It's pretty late where I am right now yawn... fetch·comms 04:23, 10 September 2010 (UTC)[reply]

(edit conflict)

jc37, I think the situation is:

  • WP practice is to remove cv's from displayed articles immediately, but except in very egregious cases, the cv is allowed to stay in the revision history unless the copyright owner requests removal. See WP:CP#Instructions. So if someone pastes too much of a newspaper story into the Elvis Presley article, we just revert it, not delete the article. I guess it's possible now to revdel the revision with the cv, but that may in some cases cause attribution problems for later revisions.
  • Darius was not banned at the time he created these articles (whether he's banned now or merely blocked is unclear). So they're not banned edits under a literal interpretation of the BAN policy AFAIK. I've made the case for treating them as banned edits anyway, based on the scale, the fact that he had received previous warnings, and the (in my opinion) relatively low value of these articles as evidenced by absence of secondary sources and non-involvement of other editors in most of the articles. But it doesn't look like they're being handled that way.
  • Oversight is usually reserved for bad privacy vios and other extreme circumstances. These copyvios are fairly routine on the scale of things, except that there's so many of them done by one person.

75.57.241.73 (talk) 04:25, 10 September 2010 (UTC)[reply]

responses to the others

WP:BAN says something different now than it did when I last read it. Once upon a time, I seem to recall banned editors (especially when the ban involved dubious or POV content) having their edit histories arbitrarily reverted. So let's drop that facet of my comments above for now.

Part of what I am commenting on is what I kept reading above, where it seemed that there was the idea that we shouldn't revert/blank/delete his edits because he was a prolific editor, so we should have some other standard for cv from his edits. I hope that's not what was intended to be said, but quite a few were coming across that way.

Another part is that, due to the vast number of contributions, and the vast number of cvs interspersed, I guess I'm just not seeing how we're going to avoid some of the CVs getting lost in the cracks.

And finally, while I AGF, I'm enough of a realist/pragmatist to know that people are going to revert the blanking and not care about the cv concerns. And so unless someone is looking over the shoulder of all the blanked pages, it's going to happen. And if it does, then we might as well delete, and have admins go through the 20k+ pages themselves anyway. This being something that needs doing right, and not according to the typical "wiki way", which allows for things to be done halfway, with the idea that someone else will come along and finish up after you.

So I'm concerned. - does that make more sense? - jc37 05:08, 10 September 2010 (UTC)[reply]

Your concerns are reasonable. My take is that the nuisances that you predict will happen, but will just be the ordinary sorts of nuisances that wikipedia deals with all the time. There's always tons of cv's "in the cracks" already, and they get handled when they are discovered. The drasticness of this particular situation comes from having 1000's of cv's in one place. So the bot operation as currently planned may not eliminate the problem, but (we hope) will knock it back down to the level that we can handle by usual methods. At least I think that's the rationale. 75.57.241.73 (talk) 08:02, 10 September 2010 (UTC)[reply]
Well, we're past the "discovery phase", so I think we may have passed the time when "lost in the cracks" is allowed. In this case, "the cracks" are merely in the edit history of a single user. Revert them all, and there are none lost. Yes, that means for right now a lot of "other" articles will be temporarily lost in the cracks, but I have faith in our million monkeys to set that right. To use a virus analogy (which applies well here, I think) - We need to destroy the virus before it blossoms beyond our ability to find it.

And since this is a wiki, anything salvageable can be salvaged over time, according to the wiki-way, as normal. It's all there in the edit history. - jc37 14:22, 10 September 2010 (UTC)[reply]

Ok, can someone clarify for me, is this 23,000 completely copyright infringing articles, or is it only parts of the 23,000 articles that are copyright violations? If it's not the complete articles that are infringements, then it seems a bit overzealous to remove the whole articles...but if it's the complete articles, then this would be the best solution...but on this scale I can't think what the balance in between would be. Ks0stm (TCG) 20:07, 10 September 2010 (UTC)[reply]

  • 10000 articles created by Darius, most of which are pretty stubby and have only slight if any editing by other people. That's the target of this initial bot operation. Clear copying found in maybe 10% of a subset that has been examined. Opinions vary about how much total copying there is.
  • 23000 total articles in which Darius has made non-minor edits (includes the 10000 above). There's some preliminary discussion on this page about a future bot operation to revert the other 13000 to before Darius's first non-minor edit. Amount of copying in these has not been studied AFAIK.
  • 40000 or so total articles edited by Darius in any way including minor. I think nobody is asking to do anything about the ones with only minor edits from Darius.
66.127.55.183 (talk) 20:11, 10 September 2010 (UTC)[reply]

Still more questions[edit]

Forgive me if these have been asked; there is a lot to go through above and I didn't see them, so here goes....OK, four questions:

  • We are talking about a lot of sports articles and sports people, right?
  • These are articles started by one person, right?
  • Couldn't they be mass deleted?
  • My main question, why was this user's edits called into question? - NeutralhomerTalk • 16:21, 10 September 2010 (UTC)[reply]
  • I can answer two out of the four:

    In the first pass, we're looking at the just under 10,000 articles created by Darius Dhlomo, yes. We are discussing ways to come back and look at the remaining thirteen thousand odd articles out of the 23,197 total articles touched by Darius Dhlomo. (Many of the touches are benign. We're still looking for ways to whittle that list down.)

    Yes, they could be mass deleted. But see what I wrote above in response to jc37 about the two likely fates of any administrator who takes it upon xyrself to mass delete 10,000 articles.

    Others will have to answer the CCI history question. I wasn't involved in the late August discussion, or what preceded it. Uncle G (talk) 16:27, 10 September 2010 (UTC)[reply]

    • OK, thanks for those two answers. :) Makes a tad bit more sense now, now to wait for the other two. :) - NeutralhomerTalk • 16:31, 10 September 2010 (UTC)[reply]

(ec) Re questions: 1) Yes; 2) Yes; 3) Proposed early on, but apparently considered overkill compared to blanking and letting people try to salvage the content from the histories on an article by article basis; 4) User had been notified about copyvios several times over the years, but scope of the problem wasn't recognized until this current ANI started a few days ago. 75.57.241.73 (talk) 16:36, 10 September 2010 (UTC)[reply]

Thanks Anon, much appreciated. - NeutralhomerTalk • 16:42, 10 September 2010 (UTC)[reply]
More specifically, User:Sillyfolkboy noticed a copyright problem, checked and found other issues and filed a CCI. That got things rolling towards the initial ANI filing. --Moonriddengirl (talk) 16:47, 10 September 2010 (UTC)[reply]
  • If mass deletion were still on the table (I think it's not) then the next step would be to put the proposal through AfD. That would be interesting, an AfD for 10 thousand articles. If the AfD closed in favor of deletion, I'd hope (ha ha) that the admins who implemented it would be absolved from any blame as long as they did what the closure specified. 75.57.242.217 (talk) (please excuse these address resets, I don't want them either) 21:22, 10 September 2010 (UTC)[reply]
    • I think the point is being largely missed that this is most significantly a biography problem. Darius's created sports results type articles are typically very easy to sort through and very easy to clean. I'm making quite a bit of progress already. There is a lot of talk going on here, but not as many editors as I expected have been checking the articles. When so many cases are so obvious, I don't see why we can't get through the created articles manually in the near future. SFB/talk 23:51, 10 September 2010 (UTC)[reply]
      • It seems like we might. I have never seen so many people pitch in on a CCI. I'm thrilled. (I am myself still struggling to get another one in shape to request review from experienced image editors, but once that is arranged will be freed up to work again in my usual text neighborhood.) --Moonriddengirl (talk) 12:13, 11 September 2010 (UTC)[reply]
  • OK, I think I understand now, thanks guys. Just wanted to make sure this didn't affect any areas I know how to edit. Unfortunately, I can't help with the recreation, don't know anything about sports really, my thing is radio and television stations. I do recommend an admin (if possible) going through and RevDel'ing the edits this user created once the bot has passed so they are completely out of the system for the viewing public. - NeutralhomerTalk • 12:33, 11 September 2010 (UTC)[reply]

Just a thought, but many of these articles are taken from the same publications. Has anyone asked the owners of these publications if they are concerned that their material has been used? Perhaps they are pleased to have their sport publicised? Shipsview (talk) 21:55, 17 September 2010 (UTC)[reply]

  • That actually might be worth a try. But these articles come from a lot of different places, so getting permission for more than a fraction of them is unlikely. Also this is a situation where it's better to ask permission than forgiveness, i.e. do the cleanup first, rather than saying "we've copied your stuff all over the place and it's in 100's/1000's of articles now, please let us know if that's ok." 67.119.14.196 (talk) 22:46, 17 September 2010 (UTC)

WP:Filter[edit]

I've posted a note there requesting whether watching (and auto re-reverting if the cv is restored) the blanked pages is possible.

If it is, this might be a nice way to handle all copy vio in the future: blank/template the page, and let editors fix. - jc37 16:47, 10 September 2010 (UTC)[reply]

Cyclist tables[edit]

Spanish ordinals, from Wiktionary:
  • primero (first)
  • segundo (second)
  • tercero (third)
  • cuarto (fourth)
  • quinto (fifth)

Not sure if anyone has mentioned this as such: one type of copy vio seems to be the "cyclist results biography" (e.g. Franklin Chacón). Contrary to the rest of Darius's table type work, I believe these are unique in that they do actually constitute a violation. Stripped straight from here, the format and content is virtually entirely unchanged. I believe that this is a form of copyright infringement (i.e. sweat of the brow grounds) which is not an issue on athlete biographies such as Konrad Dobler.

I would recommend getting rid of the tables on this cyclist's article (and others like it) and leaving just the basic one or two sentences left. SFB/talk 22:37, 10 September 2010 (UTC)[reply]

  • IANAL but I thought the United States (the relevant jurisdiction for Wikipedia because that's where the servers and the WMF are) rejected the sweat of the brow doctrine in Feist v. Rural. My guess is that's why some editors in this page seem to think that articles containing only dates/events/tables are automatically not cv's. I'm not saying they're wrong, but I don't feel completely assured, because legal questions are often more complicated than they look. 75.62.3.153 (talk) 00:30, 11 September 2010 (UTC)[reply]
    • The US and thus Wikipedia does reject sweat of the brow, so the only thing we're concerned with is creativity. First, if the data itself is creative that's a problem and the info would have to be removed, but if it's just a list of all of their results in professional races that should be fine (someone who knows more about sports results than I do should know whether it's standard information or not). Second, the way the data is presented could be creative (e.g., non-standard sorting or other formatting), but that could be resolved by reformatting the data into a standard order (again, someone who knows more about sports should know at least a standard way to present a table). VernoWhitney (talk) 00:57, 11 September 2010 (UTC)[reply]
      • In all of the articles I've seen, Darius's format and presentation of tables/data is entirely different from the original source (thus it is not a copyright issue at all anywhere) but these cyclist articles pose a different problem. These are categorically different from other tables in that they copy the information verbatim. Does anyone with greater knowledge of copyright know if these pose a problem in that the presentation of the information is regarded as copyright-able? SFB/talk 11:02, 11 September 2010 (UTC)[reply]
        • I don't believe they're a problem as nothing about the information appears to be creative, but I'll ask our resident copyright guru to opine here. VernoWhitney (talk) 11:18, 11 September 2010 (UTC)[reply]
          • I'm afraid I can't even read the table at Franklin Chacón. "5º in Maillot..." What? What does 5º mean in this context? What's with 3rd place, bronze medalist(s)? The folks at Wikipedia:WikiProject Cycling may be able to help determine the creativity of the list, since they can say, "Oh, that's bog-standard. Nothing fancy there!" or "Wow, what a brilliant way to categorize these results!" or somewhere in between. I'll ask. (ETA: And have.) --Moonriddengirl (talk) 12:04, 11 September 2010 (UTC)[reply]
            • See table. Uncle G (talk) 12:26, 11 September 2010 (UTC)[reply]
              • Well, that makes sense. :) I wasn't sure if there was some esoteric "degree of slope of road" or something going on. That has me leaning towards "non-creative" with Verno. --Moonriddengirl (talk) 12:38, 11 September 2010 (UTC)[reply]
                • Er, the whole presence of that 5º stuff suggests the table was copied from somewhere. Otherwise it would say 5th or the like. 67.119.12.106 (talk) 23:13, 11 September 2010 (UTC)[reply]
                  • It doesn't matter if it's copied if it's not copyrightable. We permit the import of public domain content. --Moonriddengirl (talk) 23:21, 11 September 2010 (UTC)[reply]

(edit conflict)

  • Reply from WP:CYCLING: taking the Franklin Chacón article as an example: "5º" means fifth, the results that show a jersey (maillot) are national championships, the bronze medal with a three shows a third place in an international competition that awarded medals.
The article obviously has some problems (Manual of Style issues, too many results, not enough prose), but I understand you are interested in the copyright issues. The list of results was obviously copied from "cyclingarchives", which is also given as source. Darius added some things in the format:
  1. He bolded the years (standard practice in our project).
  2. He linked to the races, if they have wikipedia articles.
  3. He added jerseys to national championship races.
  4. He added medals to the races where medals were awarded.
There are a few things more you can change with the format of the results, but you'll always need the rank, the race and optionally the place where the race was run, and to order them chronologically is pretty standard. So, with respect to format, I guess I would say that Darius changed the format enough to say it is different from the original source. The content is a different story.--EdgeNavidad (Talk · Contribs) 12:39, 11 September 2010 (UTC)[reply]
If you and the Cycling Project don't think it's unusual or constitutes a violation and are happy with retaining these types of tables, then I'm inclined to agree with you. SFB/talk 13:45, 11 September 2010 (UTC)[reply]
A second view from WP:CYCLING. Results lists almost always include the year, race and placing, and sometimes include location and date (rather than just year) and at an extreme annual competition and level of race. There have been several different ways to display these from simple lists through to sortable tables but no way in particular is considered ideal. Lists usually cull any placings below third place. The lists from Darius have their problems as EdgeNavidad mentions but the style of having a yearly sub-heading and then the list of (notable) results from that year isn't particularly unusual. SeveroTC 15:03, 14 September 2010 (UTC)[reply]
My view is that result/achievement lists are not copyrightable. Normally, if one uses a copyrighted (and copyrightable) source, one is precluded from verbatim copying, because that would be a violation. However, paraphrasing a source is not a violation; paraphrasing reliable sources is the essence of all encyclopedic work. But it is impossible to "paraphrase" a result list, because it is a simple collection of facts. Therefore, if such lists were copyrightable, there would be no way for us to list someone's achievements without violating copyright. Also, if such lists were copyrightable, who would own the copyright? Since the content of such list is not original, anyone can write it down and thus claim "authorship", then force all who follow to pay royalties. Of course, that would not make sense. IANAL, but US law is applicable to Wikipedia, so you might want to take a look at e.g. Feist v. Rural.
That's why I'm absolutely against deletion, because 99% of DD's work by volume was not in violation of copyright, and deletion seems to be excessively heavy-handed. Blanking is the way to go. GregorB (talk) 16:40, 11 September 2010 (UTC)[reply]
You may realize the difference yourself, but I need to clarify for others reading that paraphrasing a source certainly can be a copyright violation if the content is insufficiently altered. Copyright protection in the U.S. is not restricted to verbatim reproduction. (Osterberg, Eric C. (2003). Substantial Similarity in Copyright Law. Practising Law Institute. p. §1:1, 1-2. ISBN 1402403410. With respect to the copying of individual elements, a defendant need not copy the entirety of the plaintiff's copyrighted work to infringe, and he need not copy verbatim.) If you read the text above, you'll see Feist has already been discussed; at question here is not whether non-creative compilations of facts are protected by copyright in the U.S. (they are not, though "sweat of the brow" is protected in some other countries), but whether this particular content was creative. EdgeNavidad would seem to confirm it is not, and I take it you agree. --Moonriddengirl (talk) 18:31, 11 September 2010 (UTC)[reply]
I'm all for avoiding copyright paranoia, but while well-intended, I don't think the amateur lawyering from GregorB is very helpful. Feist was about a telephone directory, not about bicycle racing info. How similar or different the two are is the type of thing lawyers get paid a lot to analyze, because of their training. AFAIK none of us here are qualified to be doing that. We should just follow WP precedent. 67.119.12.106 (talk) 21:35, 11 September 2010 (UTC)[reply]
WP precedent is to remove lists when they are creative and not when they are not. At least, that's been the outcome of every list discussion I've seen, including those that have received input from User:MikeGodwin, in the two + years that I've been addressing copyright problems on Wikipedia. --Moonriddengirl (talk) 23:20, 11 September 2010 (UTC)[reply]
I agree with you. Also, GregorB is possibly discussing things from an athletics perspective. On this topic, the tables are not similar to any source in either composition or presentation. Similarly, bare results articles such as this contain information which is in the public domain and not subject to copyright. Of all the data I've seen in Darius's work, only the cyclist biographies have directly copied table material.
I'm very confused as to why people aren't more opposed to the mass blanking or deletion ideas. So many of his articles are easily identifiable as non-violations. I've already marked hundreds myself. I think we should leave more difficult cases (such as mixed up Darius violations and original edits on biographies) until later and simply spend our initial time whittling down the numbers by marking the great mass of results articles which largely do not contain violations. SFB/talk 11:19, 12 September 2010 (UTC)[reply]
People are opposed to the mass deletion; mass blanking is not really a problem, in my opinion, although there will now be some redundancy of labor in restoring those which you and other volunteers at the CCI have already checked. I've raised my hands to do this. As long as the checked articles are logged at the CCI subpage, I can get through that relatively easily. The idea of blanking is to both remove from publication swiftly any content that is problematic and alert regular contributors to those areas to the problem, whereupon they can help out if they choose. If they are easily identifiable as non-violations, they need only revert the blanking, and the article will be removed from the category. It's a potentially new way to get the community involved in reviewing these articles. --Moonriddengirl (talk) 12:47, 12 September 2010 (UTC)[reply]
If the blanking plan works like that then it's reasonable, I suppose. I was just particularly worried that we would be left with blanked pages of previously fine material. The voluntary nature of Wikipedia means things are often left half-done! SFB/talk 13:07, 12 September 2010 (UTC)[reply]
That's part of what the blanking is hoping to prevent. This is one of dozens of CCIs, at least one of which has been open over a year. We simply don't have the volunteer force to take care of it. :/ In those CCIs, what we have is copyvios that are still being published and still being edited--wasting the time of contributors whose work will later be lost as "derivative." In this case, as articles are cleared, the category created from the blanking template will depopulate, and it will become much easier to see which ones *haven't* been checked. It sounds like a really elegant solution to me, aside from the pain of start-up. :) --Moonriddengirl (talk) 13:11, 12 September 2010 (UTC)[reply]

Can we somehow or other distinguish those articles that have substantial non-tabular content? It seems improbable that the purely tabular ones constitute the copying of creative content. Then we can focus our attention on the remainder. Of those, rather than blanking, I'd prefer to see a slightly less draconian bot action that would preserve any listed references, something which would provide grounds for a clean rewrite without introducing any possibility of copyvio. LeadSongDog come howl! 21:23, 15 September 2010 (UTC)[reply]

Enlisting projects and other editors[edit]

In the unreferenced BLP project we used Bots and project tagging to notify creators of unreferenced BLPs and relevant projects of lists of articles. I suggest we do something similar here. User:Tim1357 has the technology to inform relevant projects and anyone who has edited an article that has just been bot blanked, but will need:

  1. Someone needs to draft a message explaining the problem and telling projects/editors what they need to do to restore and cleanse an article.
  2. Before Tim does a bot run to hundreds of projects and probably more than ten thousand editors he is going to want some copyvio experts to agree to watchlist a talkpage that any queries can be forwarded to (I did this for the recent bot run to 8,000 authors of unreferenced BLPs and only had to handle one query - so if the explanation is clear this won't be onerous).
  3. In the uBLP project we project tagged over 10,000 articles that had no project tag other than as a biography - if we are going to inform the projects we need to do some project tagging, which is easiest done before the articles are blanked.
  4. The blanking should not blank any categories that the articles have.
  5. A Signpost article would help
  6. We need a clear explanation on the talkpage and the blanking notice as to how anyone can correct these articles, setting out exactly what they need to do.

ϢereSpielChequers 12:41, 11 September 2010 (UTC)[reply]

  • On instructions about what to do, see the "executive summary", above. ↑ Uncle G (talk) 13:02, 11 September 2010 (UTC)[reply]
You can count on point 5, a Signpost N&N article is in the works. ResMar 20:55, 11 September 2010 (UTC)[reply]
The Signpost article is out. I wonder how many people will read it. The article is a bit more alarming than the underlying situation, in terms of the volume of copying described. We should write a more detailed info page than what we currently have. 67.117.146.236 (talk) 06:50, 14 September 2010 (UTC)[reply]

On the thought of enlisting project help, it would be helpful if each project received a list of the articles within their project that will be affected by the bot blanking, so that each project can enlist their members to help fix the articles. Would this be possible? Stevie is the man! TalkWork 14:24, 16 September 2010 (UTC)[reply]

Splitting the articles out into categories or by the presence of wikiproject banners on their talk pages is easy enough. There is also CATSCAN, a toolserver application that lets you find articles that are in the intersection of two categories. 67.119.14.196 (talk) 20:24, 17 September 2010 (UTC)

Can we slow down please?[edit]

I highly respect and appreciate all of you who are donating your time and energy to this amazing project, and I truly understand the desire to keep Wikipedia infringement-free. But when did that desire become more important than all of Wikipedia's other goals, like being the world's largest encyclopedia, or not having firm rules? I can't believe that Wikipedians are seriously considering blanking 10,000 articles (actually more than a quarter percent of all articles on the English Wikipedia), and replacing them with a big, scary warning that most readers and editors will not bother to fully read, let alone help correct—because of evidence that some portion of them may be copyright violations. According to Wikipedia's (well sourced) article on copyright infringement, "The enforcement of copyright is the responsibility of the copyright owner," and "Online intermediaries who host content that infringes copyright are not liable, so long as they do not know about it and take actions once the infringing content is brought to their attention." Is there some reason Wikipedia is now more concerned about copyright violation than the law is? So much so, in fact, that we're willing to amputate a substantial portion of our encyclopedic content on the suspicion of infringement, even when the law clearly states that the host is not liable in such cases? The CorenSearchBot already goes far beyond what the law requires us to do, by proactively finding potential violations instead of waiting for the copyright holder to submit a complaint. Do we really need to go this much further?

Even if Wikipedia is now so paranoid about copyright as to contemplate this massive surgery, there will always be some amount of large scale infringement in a project this open, and there will always be a risk of a lawsuit no matter how proactive we are. Like other open content hosting systems, we should promptly remove specific infringing content when it is pointed out by a copyright holder, and make a good faith effort to prevent copyright violations from happening. But the action being contemplated here is far beyond such a good faith effort. At the very least, there needs to be a much wider community discussion about Wikipedia's stance with respect to large scale potential copyright violations before such a drastic action is taken. Peace, GreatBigCircles (talk) 08:11, 15 September 2010 (UTC)[reply]

We've discussed copyright violations repeatedly in the past. Prominent among those was the hard decision to delete many thousands of images with imperfect licenses. Every time it comes up we've agreed that it's necessary to comply with the law and with scholarly standards. This is not a new thing that requires an even larger discussion than it's already received. The fact that we can handle copyright and plagiarism problems quickly is a positive reflection on the community.   Will Beback  talk  10:07, 15 September 2010 (UTC)[reply]
Wikipedia is not now more concerned about copyright violation than the law is. Our copyright policy has long established a proactive approach to copyright infringement. This is not the first contributor copyright investigation; we have a board specifically for these actions. This was opened after wider community discussion confirmed consensus for the need for such a thing.
There are two generally compelling reasons for our proactive stance on copyright (there are more, but these may persuade even those who do not believe copyright infringement is a legitimate concern). First, unlike many online web hosts, we actively encourage others to reuse our content, including in commercial works. The broad dissemination of free information is part of our mission statement. Removing copyrighted content is a (relatively) simple matter for an online webhost. It's not quite so easy for those who fix our content into print, with our encouragement. Wikipedia is very cautious of trusting YouTube content because of their rampant copyright violations. If our potential reusers take a similar stance on our content, we are undermining our own missions.
Second, responding to official take down requests is not the be-all and end-all of OCILLA. There is a duty of care implicit in 17 U.S.C. § 512(c)(1)(A)(ii). As explicated in Report 105-551 pt.2 by the House of Representatives:

New subsection (c)(1)(A)(ii) can best be described as a ‘‘red flag’’ test. As stated in new subsection (c)(l), a service provider need not monitor its service or affirmatively seek facts indicating infringing activity (except to the extent consistent with a standard technical measure complying with new subsection (h)), in order to claim this limitation on liability (or, indeed any other limitation provided by the legislation). However, if the service provider becomes aware of a ‘‘red flag’’ from which infringing activity is apparent, it will lose the limitation of liability if it takes no action. The ‘‘red flag’’ test has both a subjective and an objective element. In determining whether the service provider was aware of a ‘‘red flag,’’ the subjective awareness of the service provider of the facts or circumstances in question must be determined. However, in deciding whether those facts or circumstances constitute a ‘‘red flag’’—in other words, whether infringing activity would have been apparent to a reasonable person operating under the same or similar circumstances—an objective standard should be used.

I suspect that most reasonable persons would conclude that we should at this point be aware of the infringing activity from this user. Indeed, they might argue that we should have suspended his access to the service years ago, after his second notice. OCILLA also requires that OSPs adopt and reasonably implement "a policy that provides for the termination in appropriate circumstances of subscribers and account holders of the service provider’s system or network who are repeat infringers."(17 U.S.C. § 512(i)(1)(A)) Arguably, we may have lost our protection under OCILLA in this instance after the second copyright problem was discovered. The parameters of online liability are still largely unexplored. --Moonriddengirl (talk) 14:00, 15 September 2010 (UTC)[reply]
"But when did that desire become more important than all of Wikipedia's other goals, like being the world's largest encyclopedia, or not having firm rules?" - I don't think either of those are project goals. Being a "free" encyclopedia is, in which case copyright violations have no place here. IAR is a pragmatism based around helping to meet the project goals, it's about improving the project, leaving the project open to containing large numbers of copyvio's isn't a legitimate application of IAR --82.7.40.7 (talk) 21:13, 15 September 2010 (UTC)[reply]
  • Another "hold your horses": in how many of the listed articles, has the copyvio matter been effectively destroyed by later edits by miscellaneous people? Does the list include every article that User:Darius Dhlomo has edited, or only those which are known to contain copyvio matter? Anthony Appleyard (talk) 06:19, 16 September 2010 (UTC)[reply]
    • The list has every article DD created. If we knew which ones had copyvios, we wouldn't need to blank them all with a bot. They all have to be manually reviewed. Current estimates are that around 10% have vios. A small number have been cleaned up since this incident started. Those will probably be blanked too, but can be unblanked. 67.119.14.196 (talk) 09:14, 16 September 2010 (UTC)[reply]
      • Ones with two or less sentences of prose can likely be considered violation free. It's quite hard to infringe copyright when the two sole prose aspects of it are (a) a bare-bones, to the point description of a topic in Wikipedia style and (b) a short sentence briefly expanding on the first.
      • Furthermore, I think people are missing the point that things like 1999 World Weightlifting Championships – Women's 69 kg are not violations (merely doing the usual two basic sentences and following it with an all-encompassing boilerplate description of the sport). I am reminded of the sentences prefixing the "Background" section of most wrestling pay-per-view events. I am 100% sure that the great majority of Darius's articles are simply not substantial enough to form violations. It doesn't matter who wrote "The competition at the 1999 World Weightlifting Championships took place in Athens, Greece." No one could ever enforce authorship of such an indistinctive, unoriginal piece of prose. SFB 19:08, 16 September 2010 (UTC)[reply]
        • True that a sentence such as that would not be a copyvio if it's alone. If it is combined with other barebone sentences, though, it does accrue protection in the very arrangement. The more of it there is, the more likely a problem becomes. And you've done great work on this, so you know Darius's patterns at least as well as anyone at this point, but I know you're aware that even one or two sentences can be a problem if the sentences are not bare-bones. If he's copied one or two sentences from The Great Big Book of All Sports Information into one article, it would almost certainly be de minimis. If he's copied one or two sentences from The Great Big Book of All Sports Information into, say, thirty articles, it's a different matter. :/ --Moonriddengirl (talk) 19:20, 16 September 2010 (UTC)[reply]

So we're going to blank a quarter percent of all articles on the English Wikipedia because we suspect that just 10% of them had a copyright violation when they were created, which may have been edited away long ago? Why isn't it enough to let CorenSearchBot do its thing? It's clearly very good at finding violations. Has it been run on all 10,000 DD articles? As I said before, that kind of proactive violation seeking is far beyond what the law requires. Yes, of course the law requires a duty of care, but we already demonstrate that in many other ways. Duty of care may mean we should have acted sooner on Darius—and those procedures should probably be revisited—but no reasonable judge would rule that it means we should eliminate the work of hundreds or thousands of other people because we suspect that at some point in the past 10% of that work was tainted with a sentence or two copied from somewhere. There is no doubt that that kind of violation happens all the time on Wikipedia. It's not hard to imagine that it might even happen in more than 10% of all articles. We can't catch them all, and we don't have to.

I would like to know when and where these past discussions about copyright supposedly happened, and how the community was engaged. I looked, and couldn't find any—just a policy page with no discussion. I would like to be part of those discussions. I'd also like to find out if anything this vast was ever discussed before. I can't believe that most Wikipedians would support eliminating this much content when such a small percentage of it is problematic.

To the anonymous person above regarding Wikipedia's goals, being an online encyclopedia and not having firm rules are two of Wikipedia's Five Pillars, which as I understand it are the most fundamental principles on which Wikipedia is based. Peace, GreatBigCircles (talk) 23:26, 16 September 2010 (UTC)[reply]

Yes, that's what we're going to do. CorenSearchBot is very good; hence, how it found Evergreen Cooperatives when you created that article in June 2010 so that the copied content could be deleted and replaced. But it is not fail-safe; if it was, it would have found far more of these vios when they were created, and we wouldn't be looking at so many of them now. If you would like to research more of the conversations that have shaped Wikipedia's approach to copyright, you might start with WT:C and WT:CP, both of which have search functions as well as comprehensive archives. More conversations, including about CCI, have been held at WP:AN and WP:VPP, which are also handily endowed with search functions. That said, you seem to be missing the point that the majority of this content is not going to be eliminated; these articles are being blanked for review, not deletion. WP:5P requires us to "Respect copyright laws"; our copyright policy requires us to "Never use materials that infringe the copyrights of others. This could create legal liabilities and seriously hurt Wikipedia. If in doubt, write the content yourself, thereby creating a new copyrighted work which can be included in Wikipedia without trouble"; our Wikipedia:Copyright violations policy says, "If contributors have been shown to have a history of extensive copyright violation, it may be assumed without further evidence that all of their major contributions are copyright violations, and they may be removed indiscriminately." It also says, at the bottom of every edit screen, "Content that violates any copyrights will be deleted". Temporarily blanking this material is certainly less dramatic than indiscriminate removal, even though that solution is supported by policy, and it will permit us to meet the mandates of our other policies: to write new content as necessary to ensure that copyright laws are respected. --Moonriddengirl (talk) 23:45, 16 September 2010 (UTC)[reply]


Sorry, but as a minor editor, I most certainly believe that obeying the law and protecting people's rights are more important than being bigger than anyone else or "having no firm rules." I hardly think that WP strives to become bigger than anyone else by stealing content. Objective3000 (talk) 00:27, 17 September 2010 (UTC)[reply]

A breakpoint[edit]

  • Can your copyvio-detector bot distinguish a copyvio from a Template:Backwardscopyvio? Anthony Appleyard (talk) 08:18, 23 September 2010 (UTC)[reply]
    • Are you talking about CorenSearchBot? Usually, it's not an issue, because it scans brand-new articles. There are a few mirrors that duplicate us so quickly that the bot has picked up our articles as copies of them; those are generally whitelisted. It does pick up mirrors when content is split from one article to another or when deleted content is pasted into a new article, but that's actually a good thing, since it gives us a chance to make sure the content is properly attributed. --Moonriddengirl (talk) 10:46, 23 September 2010 (UTC)[reply]

Archive?[edit]

This page is getting pretty big, and lots of stuff on it is stale or resolved. Could somebody create

Wikipedia:Administrators' noticeboard/Incidents/CCI/Archive 1 ?

If that happens I'll see if I can move some older discussion there. I think it's best to do that manually rather than using an archive bot, because of how intermixed the stuff in the different threads is. 67.119.14.196 (talk) 03:30, 17 September 2010 (UTC)[reply]

 Done - Decided to be BOLD and created page with infobox from above. Hope that's Ok. -      Hydroxonium (talk) 09:32, 17 September 2010 (UTC)[reply]
Sounds like a good idea to me. --Moonriddengirl (talk) 11:18, 17 September 2010 (UTC)[reply]
I have started archiving some sections. Feel free to help, or to resurrect stuff if you must. I will archive some more later unless there are objections. 67.119.14.196 (talk) 23:32, 17 September 2010 (UTC)[reply]
I think archiving is a good idea. The discussion seems to be rehashing the same things, plus it makes things easier for slow dial-up connections. Thanks. -      Hydroxonium (talk) 04:49, 18 September 2010 (UTC)[reply]

Not sure what to do[edit]

I reviewed Piet Raijmakers and found that DD copied it from here. Uncle G's bot has already blanked it. Am I supposed to just subst {{Copyvio}} with the url parameter and leave the notice that Uncle G's bot put there? I'm not sure how else anybody would know it was created by DD. Thanks. -      Hydroxonium (talk) 09:45, 17 September 2010 (UTC)fix sentence Hydroxonium 09:54, 17 September 2010 (UTC)[reply]

Then I'm supposed to mark it on the list of 9,000. Is that correct? -      Hydroxonium (talk) 09:49, 17 September 2010 (UTC)[reply]
Darius didn't copy the content from the page you linked to, which contains a biography of a different Dutch athlete called Jos Lansink (which is itself copied from the Wikipedia article at Jos Lansink). Given that the article is very short and doesn't contain any elaborate prose I suspect Darius wrote it by summarising statistics and it doesn't violate copyright at all. Hut 8.5 09:56, 17 September 2010 (UTC)[reply]
Oops, wrong website. Big mistake on my part. The site I should have linked to copied the info from Wikipedia. Not the other way around, so double mistake on my part. This is going to be much harder than I thought with all the places using Wikipedia content. Can anybody answer the question about the process though? Thanks. -      Hydroxonium (talk) 10:46, 17 September 2010 (UTC)[reply]

Having re-assessed my abilities, I've decided I'm not competent to validate copyright stuff. So I will be checking items with no prose (i.e., data-only contributions) and marking them off the list. I am adding the article title in the edit summary just in case that may help in the future. Hope that is OK with everybody. -      Hydroxonium (talk) 12:10, 17 September 2010 (UTC)[reply]

Okay? It's fabulous. :) Thanks so much for your time! You don't have to include the article title in edit summary, since the page itself records the title. (And the backwards vio situation happens. It can be a challenge to determine which came first.) --Moonriddengirl (talk) 12:16, 17 September 2010 (UTC)[reply]

lol this is so ridiculous[edit]

seriously, blanking over 10k pages has got to be one of the stupidest things the admins of Wikipedia have come up with. So dumb. There have got to be others besides me who are way, way, way against something this drastic and over the top. — Preceding unsigned comment added by Bigdottawa (talkcontribs) 03:42, 19 September 2010

WP:IDON'TLIKEIT anyone? GiftigerWunsch [TALK] 11:57, 19 September 2010 (UTC)[reply]
The inherent problem with a large-scale blanking is that an individual user won't know ahead of time which individual items are going to disappear from his watch list - except by plowing through the list of 5,000 items or whatever - or by downloading the watchlist and the list of 5,000, and comparing them. Ugh. ←Baseball Bugs What's up, Doc? carrots→ 12:13, 19 September 2010 (UTC)[reply]
If you accept that there is a problem and the only solution is a complete rewrite of the article, then you can do the rewrite either before or after it has been blanked. If you do it beforehand, the blanking isn't really an issue: mark the rewrite as "rewrite to avoid copyvio" or similar in the edit summary, and the blanking will be reversed fairly quickly. The real issue is systematically ensuring that every single article does get checked or rewritten. That is where the bot idea for doing the blanking works so well. The work for us, whether now or after blanking, is to do the copyvio checks and/or rewriting. The-Pope (talk) 12:50, 19 September 2010 (UTC)[reply]
Bugs, the bot is going to run without a bot flag, so any page on your watchlist that gets blanked will give you a watchlist notification. Does that help? 67.119.14.196 (talk) 17:37, 19 September 2010 (UTC)[reply]
If the history is left intact [as noted below], then you're right, and it should be no problem. ←Baseball Bugs What's up, Doc? carrots→ 17:48, 19 September 2010 (UTC)[reply]
Not very constructive. Stevie is the man! TalkWork 20:01, 19 September 2010 (UTC)[reply]
I wonder what alternative the OP could suggest, for removing mass quantities of suspected copyright violations? ←Baseball Bugs What's up, Doc? carrots→ 20:16, 19 September 2010 (UTC)[reply]
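The comparison Baseball Bugs describes above — downloading one's watchlist and the blanking list and comparing them — amounts to a set intersection. A minimal sketch, with purely illustrative article titles (not the actual lists):

```python
# Hypothetical sketch of the watchlist check discussed above: which of my
# watched pages are on the blanking list? Titles here are illustrative only.
watchlist = {"Sammy Korir", "Jos Lansink", "1989 Hammer Throw Year Ranking"}
blanking_list = {"Sammy Korir", "Piet Raijmakers", "1989 Hammer Throw Year Ranking"}

# Set intersection gives the watched pages that will be blanked.
affected = sorted(watchlist & blanking_list)
print(affected)  # -> ['1989 Hammer Throw Year Ranking', 'Sammy Korir']
```

In practice the bot running unflagged (so blankings show up as ordinary watchlist notifications, as noted above) makes this manual comparison unnecessary.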

Why the rush?[edit]

As just an 'umble editor, the first I heard of this was a message in the header today. I haven't read through all the preamble to it, so this may have already been covered, in which case my apologies, but deleting so many articles does seem a little heavy-handed. An alternative approach may be to post on each of the potentially-copyvio articles a message to the effect of: "This article may be subject to copyvio by DD. Please review and fix if possible; once fixed, add the {{DD-copyvio-fixed}} tag. All articles still unresolved by (a date to be set) will be subject to deletion."

Regards, Lynbarn (talk) 16:26, 19 September 2010 (UTC)[reply]

Nothing is being deleted. The intention is to blank the pages, so all of the history etc. is accessible to us 'umble editors. Legally, they think this is the best option - doing anything else (and maybe even procrastinating over it for 3 weeks like we have been) might leave the foundation open to problems. The-Pope (talk) 16:35, 19 September 2010 (UTC)[reply]
Lynbarn, this is similar to what I suggested before, but consensus seems to be in favour of blanking them until they're fixed at some point in the future, rather than not blanking them but deleting them after a fixed period. Both methods have their own merits, imo. GiftigerWunsch [TALK] 20:33, 19 September 2010 (UTC)[reply]
Blanking rather than deleting provides a lot more flexibility for the editors. ←Baseball Bugs What's up, Doc? carrots→ 20:48, 19 September 2010 (UTC)[reply]

Why the delay?[edit]

Let's get this show on the road already. We have a problem, we have a solution - what's the implementation hold-up? To those concerned with losing content (albeit just blanking until somebody fixes it): WP:IMPERFECT: "Wikipedia is a work in progress.". Somebody will get to it sooner or later. With this approach, it'll probably be sooner, because anybody with the article on their watchlist will see the issue and have a chance to respond. Historically, large-scale copyvios don't get that kind of attention (as the tireless User:Moonriddengirl will tell you), and that has always been a problem, and with copyvios on this scale, it's even more of a problem. Analogy: if you don't know which of 10 muffins has been poisoned, you don't just sell them in your shop because nine of them are OK. You've got to test each one (or else ditch the lot and make new ones). If they're already on the shelves you don't spend ages arguing about the testing process - you start by getting everything suspect off the shelves! Rd232 talk 21:27, 19 September 2010 (UTC)[reply]

Concur, at the risk of generating more discussion. The copyvio is already at least two weeks stale from the start of this discussion. Having read about half of the discussion, the plan seems sound and minimally invasive. Is the bot ready to roll? User A1 (talk) 21:48, 19 September 2010 (UTC)[reply]
I'm good to go. :) The Bot was approved yesterday; I'm not sure when it will be ready. As soon as I'm told the bot is ready to roll, I'll be generating a list of articles not to blank (see Wikipedia:Administrators' noticeboard/Incidents/CCI#To avoid redundancy). Haven't done this yet, because people were still working on the CCI last I saw. Still kind of an outstanding question whether we should ask those who revert the blanking to note what they've done at the CCI. This seems like the best idea to avoid redundancy and to keep things transparent. (I like the muffin analogy. :D) --Moonriddengirl (talk) 21:56, 19 September 2010 (UTC)[reply]

100 of Rd232's muffins have been tagged[edit]

I've taken the first 100 articles on the list, sorted them into alphabetical order, and run them through the process. It took just under 22 minutes, as expected. Now I am going to wait a little while, just to see what rolls in. The next batch is 900 articles long, taking us to the 1,000-article mark. The batch after that is a really brave leap straight for 10,000. Moonriddengirl, if you have a list, I can subtract it from these two batches. Uncle G (talk) 22:58, 19 September 2010 (UTC)[reply]

  • I've listed the ones that have been cleared out of the first 1,000: User:Moonriddengirl/checked. I'll do the rest of the list tomorrow, but it's past my bedtime now. :/ (You guys have done some awesome work on this already!) --Moonriddengirl (talk) 01:10, 20 September 2010 (UTC)[reply]
    • Bearing in mind what jc37 said earlier, about still marking the edit history, what I could do is run those as a single batch on their own, straightaway. Uncle G (talk) 06:54, 20 September 2010 (UTC)[reply]
  • Go for it. 75.62.108.42 (talk) 08:38, 20 September 2010 (UTC)[reply]

A running count of progress please[edit]

CCI Progress

The 10,000 articles        | Blanked by bot      | Cleaned by human     | Left to go
9,664 created              | 200                 | 100                  | 9,364

The other 13,000 articles  | Red XN No problems  | Green tickY Cleaned  | Left to go
13,542 non-minor contribs  | 2,000               | 500                  | 11,042

Could we get a bot or something to do a running count of progress please? Something like the table above would be great. Thanks very much. -      Hydroxonium (talk) 17:03, 20 September 2010 (UTC)[reply]

  • major work 'bot statistics thus far:
    • Batch blanked as noted in the section above: 100 articles
    • Size of Moonriddengirl's list: 634 articles
    • Already blanked articles out of the first 100 that were on Moonriddengirl's list: 6 articles
    • Current batch being blanked as I write this (all remaining already-checked articles from Moonriddengirl's supplied list): 628 articles
    • Remainder for the next batch(es): 8936 articles
    • Articles on VernoWhitney's original list that have been renamed or redirected: 82 articles (corrected from 190)
  • Uncle G (talk) 17:33, 20 September 2010 (UTC)[reply]
  • I've started a new batch, comprising the articles that (as of a few hours ago) had been moved elsewhere (leaving the original listings as redirects), to get these done quickly, too. It turns out that my list contained quite a few duplications.
    • Articles on VernoWhitney's original list that have been renamed or redirected: 82 articles
    • Remainder for the next batch(es): 8936 articles
  • Uncle G (talk) 19:52, 20 September 2010 (UTC)[reply]
  • I've started another batch, comprising the articles that (originally) would have taken us to the 1000 article mark. Note that at this point we're well into the territory of articles that haven't necessarily been reviewed, or even edited, by anybody (see 1989 Hammer Throw Year Ranking, for example).
    • Current batch: 800 articles
    • Remainder for the next batch(es): 8136 articles
  • Uncle G (talk) 10:00, 21 September 2010 (UTC)[reply]
    • Maybe I'm misunderstanding the issue here, but it seems to me that the easiest way to rectify the situation (without blanking pages) would be to design a bot that (1) identifies which articles have been edited by Darius Dhlomo, (2) determines when the first edit by said user was made to each page, and (3) locates and reinstates the previous revision of that particular article. As long as you know how the Wikipedia database is structured (and assuming a working knowledge of SQL, of course), anyway... Sebastian Garth (talk) 20:07, 21 September 2010 (UTC)[reply]
      • That may work for "The other 13,000 articles" (which we have not yet begun to address), but in the case of the articles being blanked at this point, the contributor created them. :/ --Moonriddengirl (talk) 20:09, 21 September 2010 (UTC)[reply]
  • Moonriddengirl has just given me a second list. It's almost not worth the 'bot giving it special treatment, from the figures. Most of the articles thereon have already been blanked. I'm processing the remainder now. In just over 20 minutes, every article on both of Moonriddengirl's lists should have been processed by the 'bot.
    • Size of Moonriddengirl's second list: 679 articles
    • Articles on that list not yet touched by the 'bot, that are being touched as I write this: 90 articles
    • Remainder for the next batch(es): 8046 articles
  • Uncle G (talk) 20:40, 21 September 2010 (UTC)[reply]
  • The 'bot has just completed its attempt to leap straight to 10,000. I have to review the logs, to see whether there were any problems. (The machine running the 'bot did have to be restarted partway through, and I had what appeared to be a connectivity outage at one point.) There might be some dribs and drabs coming through in the next couple of days. But since there are currently 7853 articles in Category:Articles tagged for CCI copyright problems, it seems that the 'bot has done the leap successfully bar just a few articles. Uncle G (talk) 16:01, 23 September 2010 (UTC)[reply]
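The reinstatement idea raised in this thread — identify DD's first edit to each page and restore the revision just before it — can be sketched as pure logic. This is a hypothetical illustration, not any bot that was actually run; the revision-history format (chronological `(revision_id, username)` pairs) is an assumption for the sketch:

```python
# Hypothetical sketch of the "revert to the pre-DD revision" idea discussed
# above. A page history is assumed to be a chronological list of
# (revision_id, username) pairs; this does not reflect an actual bot.

def revision_to_restore(history, flagged_user):
    """Return the id of the last revision made before flagged_user's first
    edit, or None if flagged_user created the page (no clean revision exists,
    which is the case for the ~10,000 created articles being blanked)."""
    last_clean = None
    for rev_id, username in history:
        if username == flagged_user:
            return last_clean
        last_clean = rev_id
    # The flagged user never edited this page; the latest revision stands.
    return history[-1][0] if history else None

# An article DD edited but did not create: restore the revision before his.
history = [(101, "OtherEditor"), (102, "Darius Dhlomo"), (103, "ThirdEditor")]
print(revision_to_restore(history, "Darius Dhlomo"))  # -> 101

# An article DD created: nothing to restore, hence blanking instead.
print(revision_to_restore([(201, "Darius Dhlomo")], "Darius Dhlomo"))  # -> None
```

The `None` case is exactly Moonriddengirl's caveat: rollback can only work for "the other 13,000 articles", since the articles being blanked were created by the contributor.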

Help wanted for igloo users[edit]

As you can see from the edit history of User talk:Uncle G's major work 'bot, HJ Mitchell (talk · contribs) is having all sorts of joy with igloo, which apparently goes berserk when it sees accounts blanking multiple articles. If you can help igloo users, I'm sure that it will be appreciated. Uncle G (talk) 19:45, 20 September 2010 (UTC)[reply]

Wait, igloo is doing those reversions all by itself? That sounds more like a user tagging edits as vandalism without bothering to look at them, a common enough occurrence but never appropriate. Could Uncle G please ask HJ Mitchell to be more careful in general? [redacted some griping]. Thanks. 69.111.195.229 (talk) 21:12, 20 September 2010 (UTC)[reply]
I left a suggestion at the igloo implementer's page.[21] 69.111.195.229 (talk) 21:33, 20 September 2010 (UTC)[reply]

Assistance requested[edit]

Some help with the editor who keeps changing the task explanation mid-task with irrelevancies (Read the surrounding text that is being edited and what it is discussing.) would be appreciated. Uncle G (talk) 11:53, 21 September 2010 (UTC)[reply]

Appears to be a long-term tendentious editor. 69.111.195.229 (talk) 15:47, 21 September 2010 (UTC)[reply]
Person seems to have switched to the talk page about the topic, which is an acceptable place to discuss it I suppose. 69.111.195.229 (talk) 16:57, 21 September 2010 (UTC)[reply]

Cleared articles[edit]

I've made my second list of cleared articles at User:Moonriddengirl/checked 2. I'm afraid at this rate, I'll never finish it! What I'm going to do, then, is follow the articles that have been cleared and revert the bot after it has blanked the articles. (It took me over an hour to compile my first list; I can't say how long it took to do the second because, alas, real life and other Wikipedia business kept interrupting. Assuming a rate something like the first, it would take me ~seven hours to complete the list of his created articles alone.) --Moonriddengirl (talk) 20:07, 21 September 2010 (UTC)[reply]

  • It occurs to me that a little bit of template hacking could help this process. Accompany every entry in the list with a template expansion that starts out as {{tl|cci-dd-status-u}} with a parameter (article=name_of_article) or something like that, where "status-u" means "status unknown". Anyone checking the article would change the status-u to status-y or status-n (or maybe an additional code like status-q, meaning the article has been examined but the status is still in doubt, so it's being flagged for further attention). Then it would be quick to sweep through all the references to the template and collect the names of the articles. Alternatively someone checking an article could insert a status template into the article itself. The bot could do null edits on articles with status-n instead of blanking them. That would still leave bot edits in the edit history. 66.127.54.226 (talk) 23:12, 21 September 2010 (UTC)[reply]
    • @Moonriddengirl, you are overworked! I just want to double check — should an inexperienced user like me be removing the diffs when I mark a page? I have been working on the 13,000, marking the ones that are data only as Red XN. I started at 4501 and have been leaving the diffs. I can go back and remove them and just leave the Red XN and sig if that will help. Please let me know. -      Hydroxonium (talk) 00:11, 22 September 2010 (UTC)[reply]
        • Generally, we do remove the diffs and just leave Red XN and the sig. It makes it easier to see at a glance if an article has been cleared, and it gradually shortens the pages, which can be kind of long. :) It's not worth going back to remove the ones you've left, I don't think, but if you want to do it going forward that's fine. It's not a huge deal, either way. The important thing is that the work is getting done. --Moonriddengirl (talk) 10:47, 22 September 2010 (UTC)[reply]
      • I also wanted to ask about the 10,000 created. WP:CCI/DD/Created articles list has a total of 9,664 while WP:CCI/DD page 10 lists a total of 9,657. That's close enough for me, but I was wondering about the difference. Thanks. -      Hydroxonium (talk) 00:30, 22 September 2010 (UTC)[reply]
        • Several articles on the original list have since been deleted. Some might have been deleted in the period between the original list and the current list. I've been trying to cope with things that have changed since the list was constructed. I attempted to ensure that I caught any renamed articles, for example. See User talk:Bigger digger#That CCI where Bigger digger has been kindly reviewing that. Uncle G (talk) 01:13, 22 September 2010 (UTC)[reply]
  • I've created a list of cleared articles that are currently blanked: User:Hut 8.5/Blanked cleared articles. Hut 8.5 09:35, 22 September 2010 (UTC)[reply]
    • Well, aren't you my hero? :) That'll make my life easier! Thanks! --Moonriddengirl (talk) 10:47, 22 September 2010 (UTC)[reply]

Edit summaries[edit]

It is important to use good edit summaries when undoing a blanking. See User:Moonriddengirl/CCI 'bot stalk report for why. Uncle G (talk) 15:10, 24 September 2010 (UTC)[reply]

Rollback bot[edit]

I have filed Wikipedia:Bots/Requests for approval/VWBot 9 to have something ready to go in case there is support for rolling back all articles edited-but-not-created by DD (I know it has been mentioned repeatedly above, but the actual execution of this bot would obviously wait for consensus). VernoWhitney (talk) 20:34, 22 September 2010 (UTC)[reply]

I think there is general tentative support for it, but it has been waiting on 1) implementation (thanks for taking care of that), and 2) a desire to do this novel operation in stages, so we can see what kinds of disruption results and what we have to do to deal with it. That is, we want to see how the initial blanking operation works out before proceeding to the rollback operation. That gives us better chances to modify our plans in response to lessons learnt as we go. 71.141.90.138 (talk) 00:10, 23 September 2010 (UTC) Added: I commented at the BRFA. 71.141.90.138 (talk) 00:36, 23 September 2010 (UTC)[reply]

I understand the reasoning but I want to voice my opposition to this unless a better solution can be developed. I just watched my watchlist fill up (which in itself is great), but after reviewing 20 - 30 of the "reversions" done by the bot, not one had any copyvio in it. Additionally, several others have complained about the same problem. After discussing this with VM on their talk page it was suggested I bring it here. Using the example article I used on their talk page: Accutink did some early edits to the Ernest Spybuck article, and since then a lot of other edits have been made by several other editors. If this bot goes back and reverts all that, it will not only cause harm to the article (and hundreds or thousands more), it will also anger users, do more harm than good to WP and the articles in it, and give this bot, its operator and the process in general a bad name. Now myself and other editors are forced to go through hundreds or thousands of articles to fix this mess that's being caused by a knee-jerk reaction to a problem. Rather than continue to leave comments on a completely hidden page that almost no one has visibility of, I suggest some limitations for the bot that at a minimum should be met before going any further.

  1. Move this conversation to a more public venue like the Village pump, especially now that it is causing so many problems.
  2. Do not revert any article that has been through a GA or better review process or peer review process since Accutink or one of their socks edited it. If it's gone through these processes, any problems would have been caught and fixed.
  3. Do not revert the article if more than 3-5 edits have been made since Accutink's last edit.
  4. Do not revert if more than 1500 bytes of info has changed since their last edit.
  5. Do not revert if all that's being reverted is a minor edit.
  6. Do not revert if the only action was a change to External links or if the editor was adding or expanding an inline citation. --Kumioko (talk) 16:18, 10 December 2010 (UTC)[reply]

Problem solving and the serious threat this problem highlights[edit]

I'm late coming to this table, and forgive me if this has been brought up before. From what I've read so far, this isn't being addressed very strongly. The only place that I've so far seen this mentioned is here, where it says "Things like this endanger the entire project that we are working upon" If there is some other place where there is discussion regarding the more abstract problem, please feel free to point me towards it.

One of the core aspects of problem solving techniques is the concept of identifying the root cause of a problem and putting in place corrective actions that prevent that root cause from occurring again. From everything I read, this isn't being done.

Quite a bit of effort has been exerted on problem containment for this iteration of the root problem. We know a problem has happened, and we've done an excellent job of identifying what articles are potentially affected. That's not hard; we just need a list of articles edited by this editor as a starting point. Further, we blocked the editor in question so that no further copyright violations happen.

What we haven't done is any significant work in going back to the root problem or really even asking about it. How is it that an editor could work for years right under the noses of thousands of other editors and at least one bot and not be discovered? What in our process is flawed such that we can have an outcome such as this particular iteration of the mass copyvios by DD?

Allow me this analogy: a very large dam has broken upstream, causing mass flooding below the dam, in the form of 10,000+ articles being horrifically damaged. Shock and horror ensue. Our containment action is to, in effect, quarantine the articles and remove direct display of the copyright violations. We've built a temporary dam, in the form of blocking DD, to prevent more flooding from the same problem iteration. Meanwhile, we've got dams all over the country (Wikipedia) of exactly the same design, exactly the same stress loads, exactly the same potential damage downstream, and nothing has been done to question whether it might be a good idea to go look at those other dams.

So, off we go and 'fix' this current problem. Yet, right now, at this very moment we could have a hundred other editors causing exactly the same kind of damage, and we have no way of knowing it.

We had plenty of warning signs that there was a serious problem. CorenSearchBot did catch him: February 2008, April 2010, April 2010, October 2007. VernoWhitney caught him in July 2010; Sillyfolkboy in August 2010. Coren knew as early as October 2007, but someone else knew as early as November 2007 too.

The problem isn't Darius Dhlomo. Darius Dhlomo is a symptom of the real problem. Our methods for catching this incredibly easy-to-catch perpetrator were completely inadequate to the task.

I'm not suggesting we change our way of creating and updating articles. But we have to come to a much clearer picture of exactly how this was even possible, and work forward from that towards a means of making this impossible (insofar as possible) in the future, and having some metrics to help us gauge our success. Without doing this, there will be more symptoms. Whether it's one user doing 10,000 articles, or 1,000 editors doing 10 copyright violations, the effect is the same. Darius was 'easy' (though it took ridiculously long to catch) because of the scale concentrated in one user's contributions. I promise you, there are thousands of other copyright violations scattered through our ~3.5 million articles. Yet, we have absolutely no way to find them.

There's a fundamental flaw here that must be addressed. If we don't, we're chasing our tails. This symptom will happen again. For proof of that, all you need to do is look at the comments of so many editors who just don't think this problem is a big deal. Uncle G notes above "our continued existence is proof that the well-meaning ordinary editors outnumber the ill-intentioned", and I agree. But, the number of editors who routinely violate copyright out of ignorance or willful intent may or may not be significant. We don't know. Darius spent years creating these violations. None of our tools managed to catch him. None.

This headline hasn't hit the press yet, but there's a good chance it will: "Wikipedia Violates Copyright: Project had thousands of violations for years and never knew it" And our response? "Oh we fixed those". Then when a journalist asks "What have you done to prevent this from happening again?" Our answer: "Uh, nothing". That's just not an acceptable response. So you cleaned up New Orleans. Big deal. What have you done to prevent it from happening again? Answering "nothing" will get you sacked if you are in the public eye. What is being sacked here is the reputation of Wikipedia.

Get your thinking out of the trenches, and look at this problem from the more abstract view. Darius is a symptom, not the problem. Fixing his problematic article edits is sticking your thumb in the dike, completely unconscious to the tidal wave approaching the dike. --Hammersoft (talk) 15:02, 23 September 2010 (UTC)[reply]

Yes, finding repeat copyright violators earlier is a known problem. What's your solution? VernoWhitney (talk) 15:57, 23 September 2010 (UTC)[reply]
  • I didn't say I had a solution. I said there's a lack of focus on the actual problem. --Hammersoft (talk) 16:03, 23 September 2010 (UTC)[reply]
  • This is not really the place to focus on the overall problem; it's an "incident" report, after all. :) That said, I do have my own ideas about it. The solution, I think, is to educate our contributing body and to change the still present attitude among some Wikipedians that copyright doesn't matter. I'd say we've made considerable strides in just the last few years, though I agree that the problem is still rampant. Since Darius started copying content onto Wikipedia, we have created a project (Wikipedia:WikiProject Copyright cleanup) and a process board (Wikipedia:Contributor copyright investigations). We've also changed the way SCV works, which I think is going to make repeat CSB catches less likely to go undetected. Reports used to be stripped from the page; now they are retained as a permanent record, and the people who address the listings sign them. They are also reviewed after a week at the copyright problems board now. We don't look at every listing, but I spot-check the ones that are cleared, and I have more than once found repeat infringers that way. We've got several new volunteers who put considerable labor into copyright work (particularly User:VernoWhitney, who seems to make a full time job of it) and a couple of new admins who rose through the ranks of WP:SCV and are savvy on the issues.

    That said, this isn't really a tidal wave approaching the dike; the tidal wave is already here. Remember: this is one CCI. We have 34 other CCIs currently open (and have already completed 40). This is a big one, granted, but take a look at Wikipedia:Contributor copyright investigations/Paknur, who is still adding copyvios under various sock accounts. That CCI has been open for over a year. Take a look at Wikipedia:Contributor copyright investigations/Aiman abmajid. (We didn't even number articles then; we just dumped them. :/) I've considered various ways to strengthen our anti-copyvio message, from Wikipedia:Editnotices on particularly troubled articles to offering bounties to projects that keep their articles clean, even to having a bot that notes when a contributor has received multiple CSB notices, but it's difficult to know what measures might be most effective at spreading the word and containing the problem. I'd love to see us do a better job of both and would heartily applaud anybody who comes up with good ways to advance those goals. --Moonriddengirl (talk) 16:12, 23 September 2010 (UTC)[reply]
  • Maybe we should start a more abstract discussion elsewhere. But for now; permit me to summarize a bit and tell me if I'm wrong: "CCI is overwhelmed, and we lack the tools to manage".  ? --Hammersoft (talk) 16:18, 23 September 2010 (UTC)[reply]
  • Decent summary, yes. However, being protective of my peeps, I'd like to say that we do a pretty good job with the workload we're given under the circumstances. There may be 35 CCIs open, but I'm proud of the fact that we've closed 40. :) I would be thrilled to see better tools to expedite these processes. It's greatly distressing to me to see these languishing so long, but there's always a new one just around the corner and we never have spare time to catch up. :/ --Moonriddengirl (talk) 16:26, 23 September 2010 (UTC)[reply]
  • I meant to express no opinion on the work CCI does. I've been lightly involved in the past, even helped out once, and am pleased it exists. The work is absolutely necessary and done to the best of our abilities. The problem I think is the lack of horsepower, whether it's provided by people, tools or a combination thereof. If all that effort couldn't catch DD after years of copyright violations, there certainly are a lot of other rights violators going undetected too. --Hammersoft (talk) 17:12, 23 September 2010 (UTC)[reply]
  • Amen. :) (I didn't mean to imply I was feeling attacked by you, by the way, when I mentioned feeling protective. CCI work often goes unnoticed, so I wanted to be sure to point out to bystanders the good work that has been done. :D) If you want to go meta somewhere, please let me know. I really would love to see some significant steps taken in cutting down this problem. --Moonriddengirl (talk) 17:30, 23 September 2010 (UTC)[reply]
  • I'm open to suggestions on where to go. --Hammersoft (talk) 17:39, 23 September 2010 (UTC)[reply]
  • There is a solution that allows for better and regular automated checks. Flagged revisions (the original version, not the watered-down Pending Changes), enabled project-wide on every article, and a CSB II that scans every accepted revision. That way we don't just get a measure of issues on new articles but also a means to crack, progressively, into all those thousands of copyvios introduced after article creation. Sadly, even the weak pending changes has issues getting passed here, and during the trial, at least a dozen editors with multiple CSB and copyvio warnings to their name were granted reviewer rights. But if the community as a whole isn't ready to take drastic measures for BLPs, I'm not holding my breath on copyvios. Too many people shy away from the topic, and even more refuse to admit it's a serious concern. MLauba (Talk) 20:28, 23 September 2010 (UTC)[reply]
  • I entirely disagree that Darius was an easy or obvious case. When an editor fires out dozens of articles a day (of which 98% have minimal content and are "clean") as well as having an edit history the size of a large baleen whale, filled with no discussion and minor edits on tables and categories, it becomes difficult to find a pattern of violation (the way most copyright violators are found). Unsurprisingly, there are very few people who could mingle the odd violation into a five-year history of 160,000 edits (just over 20 people in the world, actually). Outside of bots, only a handful of people in the world could conceal violations by their sheer volume of articles created. The majority of violators create their account and then create a slew of violations in a row; the majority are thus easy to catch. We know that there are not hundreds of editors like Darius Dhlomo causing widespread havoc simply because he is such a unique user on this website. SFB 22:12, 24 September 2010 (UTC)[reply]

Thinking out loud, from the peanut gallery: one idea for the larger problem ... Add some sort of delay and filtering step to the editing process, check string(s) of words from submitted content against one or two search engines, and if that exact phrasing exists elsewhere then flag the contribution as a possible copyright violation. Possibly check another string for a soft confirm. The new content could still be added to the article, but another editor could act on the flag -- and/or flags could be logged server-side, and someone could review the log and either confirm or reject the flag(s). In a trial run, tweak the parameters -- maybe it checks 3 words in a row that are each longer than 3 characters, for instance; or 4 words, etc. Add exceptions and other rules. Admittedly, someone more familiar with advanced search algorithms may see a fatal flaw in this idea. So forgive me if it sounds naive in WP context; I'm a copyeditor with limited experience in coding. I realize this filter+delay would introduce some wrinkles into the current i/o process of article editing, but it could be worth a try. I suppose a central question would be: is there any inherent reason why article changes have to be instantaneous? -PrBeacon (talk) 06:54, 25 September 2010 (UTC) revised[reply]
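For what it's worth, the word-string filter described above can be sketched without any search-engine integration; the shingle length, word-length cutoff, and hit threshold below are just the trial parameters the comment suggests, and the whole thing is illustrative rather than any existing Wikipedia tool:

```python
import re

def ngrams(text, n=4, min_len=4):
    """Split text into lowercase words of at least min_len letters,
    then return the set of all n-words-in-a-row strings."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text) if len(w) >= min_len]
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_possible_copyvio(submission, sources, n=4, threshold=2):
    """Flag the submission when at least `threshold` distinct n-word strings
    also appear verbatim in a source (the 'soft confirm' idea above)."""
    sub_grams = ngrams(submission, n)
    for source in sources:
        hits = sub_grams & ngrams(source, n)
        if len(hits) >= threshold:
            return True, sorted(hits)
    return False, []

# Invented example text -- not from any real article.
source = "The athlete finished the marathon in record time despite heavy rain conditions."
copied = "He said the athlete finished the marathon in record time despite the storm."
flagged, evidence = flag_possible_copyvio(copied, [source])
```

In a real deployment the inner loop would be a search-engine query per shingle rather than a local set intersection, which is where the "delay" in the proposal comes from.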

Missed article[edit]

I've just removed a violation from Ronaldo da Costa (edit here) but that article does not feature in the non-minor contributions list. Why is this? Given the change from 664 bytes to 2,075 bytes with all prose additions, this should be the kind of edit that shows up. Is there any way of refining the non-minor list to one which focuses solely on the addition of prose? If a bot could somehow combine the functions of the Prosesize script with a check on edits over a certain level of bytes (say 800 bytes for example) then it would pretty much nail 90% of copied prose edits. Is this even feasible? SFB 22:23, 26 September 2010 (UTC)[reply]

That edit does show up: It's the first one listed after the article (which is itself listed in Wikipedia:Contributor copyright investigations/Darius Dhlomo 11#Articles 721 through 740). As far as the prosesize goes...probably, but I don't know that there's a bot already set up to do it and I imagine it would take a while to get a good algorithm set up to do that. VernoWhitney (talk) 22:52, 26 September 2010 (UTC)[reply]
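The byte-threshold half of that idea reduces to a simple filter over revision sizes. A minimal sketch, assuming revision metadata has already been fetched: the `size` field matches what the MediaWiki API reports per revision, while the revision IDs and the sample history are invented to mirror the 664-to-2,075-byte edit mentioned above. Note this does not measure prose the way the Prosesize script does; it counts all bytes, tables included.

```python
def large_prose_additions(revisions, min_growth=800):
    """Given a page's revisions oldest-first, each carrying the byte 'size'
    reported by the MediaWiki API, return the revision IDs whose edit grew
    the page by at least min_growth bytes."""
    flagged, prev_size = [], 0
    for rev in revisions:
        if rev["size"] - prev_size >= min_growth:
            flagged.append(rev["revid"])
        prev_size = rev["size"]
    return flagged

# Invented sample history: stub, large prose addition, minor tweak.
history = [
    {"revid": 101, "size": 664},   # stub creation
    {"revid": 102, "size": 2075},  # +1,411 bytes -- the kind of edit SFB means
    {"revid": 103, "size": 2110},  # minor tweak
]
suspects = large_prose_additions(history)
```

Combining this with a prose-only byte count would need the revision *content*, which is exactly the extra bandwidth cost discussed further down the page.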

Coordinating lists? bot help?[edit]

Hi. Since very few people are coordinating the efforts with the CCI listings, I'm wondering if a bot can help out with that. Can we get a bot to check the list of articles that have been blanked for unblanking and add an indented subnote or something beneath the CCI listing indicating who, when and with what edit summary? Is that possible? We might want to give this another week or so, if so, because I imagine if we run this bot more than once it'll cause tons of redundant listings, unless we can add some kind of exclusion to keep it from checking articles it has already annotated. I don't know, but that sounds complex to me. :) --Moonriddengirl (talk) 16:53, 29 September 2010 (UTC)[reply]

  • I've been keeping the User:Moonriddengirl/CCI 'bot stalk report lists fairly up to date. You can check against those, as a fallback. Uncle G (talk) 18:53, 29 September 2010 (UTC)[reply]
    • Thanks. :) I guess we could always eventually move that to a subpage of the CCI for a long-term record. We will need at some point to see what's still blanked for further attention. Is there any way to get it to update when blanking has been reverted, as here, and then later checked? Or is that beyond bot abilities? :) --Moonriddengirl (talk) 18:57, 29 September 2010 (UTC)[reply]
      • I have been thinking about that, for those very reasons. It's not trivial. The current tool is fairly simple. It just looks at the account, revision ID, and summary of the immediately following edit. It doesn't look at the content of the edit. (It doesn't even retrieve the content of the edit.) Uncle G (talk) 19:03, 29 September 2010 (UTC)[reply]
      • The most obvious way (that I can think of) would be to create an account that has all those blanked pages on its watchlist (populated via the API). Then query the watchlist a few times an hour and notice the updates. That avoids having to query all 10,000 articles or monitor Recent Changes or anything like that. 67.122.209.115 (talk) 15:56, 30 September 2010 (UTC)[reply]
        • We already have Special:RecentChangesLinked/Wikipedia:Contributor copyright investigations/Darius Dhlomo/Created articles list. What Moonriddengirl is looking for is something akin to the User:Moonriddengirl/CCI 'bot stalk report lists but sensitive to subsequent rollbacks to the blanked version. That's harder to do, since it involves looking at the contents of edits. (If the rollbacks had been re-blankings by the 'bot, it would have been easier.) If there's a way to get api.php to provide an easily-machine-readable version of this, it isn't apparent from the documentation. A lack of it means that one has to instead download the wikitext for each newer revision and compare it to the wikitext of the blanked one, which requires a lot more bandwidth, for a report that is already making tens of thousands of queries. Uncle G (talk) 17:02, 30 September 2010 (UTC)[reply]
          • You can probably notice rollback to the blanked versions pretty reliably from the decrease in page size. The page size is included in the API response. Also, with that recent-change list, you shouldn't have to do tens of thousands of queries except maybe once for an initial report. After that, you only have to query for pages that are actually edited, which looks like a few hundred pages a day, not that high a number. 67.122.209.115 (talk) 17:41, 30 September 2010 (UTC)[reply]
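The size heuristic suggested above could look something like this. The `size` field is what the MediaWiki API reports per revision, but the byte counts, revision IDs, and tolerance here are invented for illustration:

```python
def revisions_near_blanked_size(blanked_size, later_revisions, tolerance=50):
    """Find later revisions whose byte 'size' is close to the blanked
    revision's size -- a cheap signal that someone rolled the page back
    to the blanked version, with no content download required."""
    return [rev["revid"] for rev in later_revisions
            if abs(rev["size"] - blanked_size) <= tolerance]

blanked_size = 350  # invented: page reduced to just the blanking template
later = [
    {"revid": 201, "size": 2075},  # content restored (unblanked)
    {"revid": 202, "size": 355},   # back near the blanked size -- candidate
]
candidates = revisions_near_blanked_size(blanked_size, later)
```

Candidates found this way would still want a human (or a content diff) to confirm, since an unrelated edit could coincidentally land near the blanked size.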

Any semi automated cleanup on the way?[edit]

I've just removed the tag from 2005 Men's Hockey Champions Trophy and I can see that a lot of these articles are just tables of results with minimal text. I don't have time to clear this up, but it looks like someone with AWB (or similar) can make a really quick first pass and resolve many articles just by cropping the text and leaving the tables. You can't copyright the facts. 188.222.170.156 (talk) 21:53, 29 September 2010 (UTC)[reply]
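A crude first pass in that spirit could strip prose lines while keeping wiki-table markup and section headings. This is a sketch of the suggestion, not an actual AWB module; the sample wikitext is invented:

```python
def keep_tables_only(wikitext):
    """Drop prose paragraphs but keep wiki tables ({| ... |}) and
    section headings -- facts in tables aren't copyrightable, so
    cropping the prose resolves many table-heavy articles."""
    kept, in_table = [], False
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.startswith("{|"):
            in_table = True
        if in_table or stripped.startswith("="):
            kept.append(line)
        if stripped.startswith("|}"):
            in_table = False
    return "\n".join(kept)

sample = """==Results==
Some possibly copied prose here.
{|
|-
| Runner || 2:10
|}
More trailing prose."""
cropped = keep_tables_only(sample)
```

A human pass would still be needed afterwards, since table captions or notes rows can themselves contain copied prose.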

Are we allowed to ask for help?[edit]

I have gone through 100 articles in the last 2 weeks. I was able to mark 70 of them as Red XN because they were only names, stats and other data. That's 0.3% of the 23,197 articles — not even close to 1%. This is a HUGE task. I'm happy to see the discussion is listed on WP:CENT, but it doesn't mention that we need help. Are we allowed to ask for help on the WP:CENT message? Thanks. -      Hydroxonium (talk) 08:59, 30 September 2010 (UTC)[reply]

OK, analysing for names and stats would be easy enough to automate; "other data" is a little more vague. Rich Farmbrough, 05:24, 2 October 2010 (UTC).[reply]
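The names-and-stats check might be automated with a heuristic along these lines; the word threshold and the markup-stripping rules are guesses for illustration, not anything CCI actually uses:

```python
import re

def looks_data_only(wikitext, max_prose_words=30):
    """Heuristic: call an article 'names and stats only' when, after
    dropping table rows, templates and category lines, little running
    prose remains."""
    prose_lines = [line for line in wikitext.splitlines()
                   if not line.strip().startswith(("{|", "|", "!", "{{", "[[Category"))]
    words = re.findall(r"[A-Za-z]+", " ".join(prose_lines))
    return len(words) < max_prose_words

# Invented samples: a results-table stub vs. a prose-heavy article.
stats_stub = "{|\n! Rank !! Athlete !! Time\n|-\n| 1 || A. Runner || 2:10:00\n|}"
prose_article = " ".join(
    "Sentence number %d of ordinary running prose." % i for i in range(20))
```

Articles flagged as data-only could then be auto-suggested for the Red XN mark, with a human confirming, which might spare reviewers much of the 70-out-of-100 routine cases described above.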