Wikipedia talk:WikiProject United States Public Policy/Archive 1

Article Quality Metric

OK, go easy on me, this is my first time posting on WP. A little background: at the end of the project, Wikimedia must show the grant trustees that the quality of public policy articles actually improved, and that is my job. It is MUCH easier said than done. So I have been working on developing an understandable and quantifiable quality metric that is not overly complex. To evaluate article quality, I propose the following metric, called the NICE/BUMLU. What I like about it:

  1. It is clear to new users; subject-matter experts who are unfamiliar with WP might be willing to use it because it clearly outlines how to rate articles.
  2. It standardizes the value to be placed on certain aspects of article quality, placing the greatest value on content, where I believe it belongs.
  3. It results in two ratings: 1) a numeric score from 0-20, which is easily quantifiable for analysts (yay!), and 2) an acronym rating that, to people familiar with the metric, identifies the areas in which the article needs improvement.

My concerns about using a metric like this are:

  1. The resultant acronym rating is not intuitive to readers the way the Featured Article, Good Article, B, C scale is.
  2. If the metric is not widely used, there still won't be enough quality data to get an accurate measurement of quality improvement.

NICE/BUMLU Article Quality Rating Metric

This metric is specific to encyclopedia articles; it measures the quality of any encyclopedia article, not just public-policy-related topics. The NICE/BUMLU article quality measurement evaluates articles in four distinct areas: neutrality, importance, content, and format. It classifies articles into categories that indicate their deficiencies and strengths, and produces a numeric score ranging from 0-20 that captures all areas except relative importance. One potential problem with this metric is that, currently, the rating would not get updated with each revision to the article.

Neutrality

Only neutral articles can be awarded Heirloom status; articles are either neutral or not. Some topics are very difficult to present neutrally; for these topics, important viewpoints must be presented equally. The rating is Neutral or Biased (N, B); for evaluation purposes, neutral articles score 3 and biased articles score 0.

Importance

Only articles of Time-limited high notability or Big importance may become Heirlooms. If a topic is not notable, it should not be included in the encyclopedia, so topics that do not rate even at the lowest level (not to be forgotten) should be deleted. The examples given below provide a sense of the article topics at each level of importance. The rating is either Important or Underlying (I, U).

  • Important
    • Big importance – These topics are very important in their time and have an impact in some historical context. These topics typically have mind share in the public, so most en.Wikipedia users are at least familiar with the topic. Topics in this category have been important for several decades and/or will be important in 50 years.
      • United States House of Representatives
      • McDonald's
      • Star Wars
      • Deepwater Horizon oil spill, the largest offshore spill in U.S. history
    • Time-limited high notability – These topics are typically highly prevalent for a certain period of time, and may later prove to be of Big importance.
      • Affordable Health Care for America Act (or HR 3962)
      • Super Size Me (2004)
      • Mark Hamill, the actor who played Luke Skywalker in the Star Wars movies
      • Tony Hayward, CEO of BP during Deepwater Horizon oil spill
  • Underlying
    • Essential to a larger subject area – These topics typically have value with respect to a larger subject area.
      • House Armed Services Subcommittee on Seapower and Expeditionary Forces
      • Ronald McDonald
      • Jabba the Hutt
      • Wellhead
    • Not to be forgotten – A nugget that the collective knowledge should not lose, but does not hold a lot of importance for most people. It may have its own relatively short article or exist as a sub-topic in a larger article, or on a list.
      • Eric Massa (a one-term Congressman)
      • Happy Meal Toys (fun, but not very important)
      • Gorax (species on Endor Moon)
      • Gulf Sturgeon (an endangered subspecies of sturgeon in the Gulf of Mexico)

Content

Content, the most important aspect of an article, is scored between 0 and 12. Scores of 9-12 earn a Concrete rating, scores of 6-8 earn a Marginal rating, and scores of 5 and below are rated Latent (C, M, L). Scoring follows this protocol:

  • Completeness – Information is complete and relevant to topic: 0-4 points
  • Accuracy – Information is accurate and provides sufficient sources: 0-4 points
    • When scoring this aspect, remember that content must be cited according to WP conventions, which means that every bit of information is cited with a reliable source immediately following the relayed information. This convention is important because articles have multiple contributors, and immediate citation helps to maintain article integrity in future revisions.
  • Length – Length of the article is appropriate to its subject area: 0-2 points
  • Images – Images are of good visual quality, accurately display information, fit the topic, and provide a source: 0-2 points

Format

Format and grammar add polish and respectability to an article; scores range from 0 to 5. Scores of 4-5 earn a format rating of Enhanced, and scores of 0-3 earn a format rating of Unpolished (E, U). Scoring follows this protocol:

  • Grammar – verb tense, spelling, sentence length, and punctuation are all correct: 0-3 points
  • Wikipedia format – the article uses appropriate Wikipedia formatting conventions: 0-2 points

Example Article Ratings

  • Heirloom – A NICE article that epitomizes the goal of Wikipedia to pass on high-quality information; this rating is similar to a Featured Article.
  • NICE – This article is essentially complete: it is Neutral, Important, Concrete in content (it scores 9 or above in content), and has Enhanced grammar and WP format (it scores 4 or 5 in format).
  • NIMU – This article is Neutral, Important, Marginal in content (it scores 6-8 in content), and Unpolished in grammar and WP format (it scores 0-3 in format).
  • BULU – This article is Biased, and the topic is of Underlying importance; it needs work. Its content has Latent potential, but essential sources or information are lacking (it scores 5 or below in content), and the grammar and format are Unpolished (it scores 3 or below in format).
  • Other article ratings reflect problem areas in the article. For example, NICU indicates a neutral, important article that is strong in content but weak in format.
  • Alternatively, articles may be rated by totaling the numeric scores. For example, NICE or NUCE articles may have numeric scores from 16 to 20, and BIMU and BUMU articles may have scores ranging from 6 to 11 (the sketch after this list illustrates the arithmetic).
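
To make the scoring arithmetic concrete, here is a minimal sketch in Python of the rating computation described above; the function and variable names are illustrative assumptions, not part of the proposal itself:

    # Sketch of NICE/BUMLU scoring as described above. Hypothetical code;
    # only the point ranges and letter categories come from the proposal.
    def nice_bumlu(neutral, important, completeness, accuracy, length, images,
                   grammar, wp_format):
        # Content sub-scores: completeness 0-4, accuracy 0-4, length 0-2, images 0-2.
        content = completeness + accuracy + length + images  # 0-12
        fmt = grammar + wp_format  # grammar 0-3 + WP format 0-2 = 0-5
        acronym = (
            ("N" if neutral else "B")
            + ("I" if important else "U")
            + ("C" if content >= 9 else "M" if content >= 6 else "L")
            + ("E" if fmt >= 4 else "U")
        )
        # Numeric score 0-20: neutrality contributes 3 or 0; importance
        # contributes a letter but no points.
        score = (3 if neutral else 0) + content + fmt
        return acronym, score

    # A neutral, important article with strong content but weak format
    # comes out NICU, matching the example given above.
    print(nice_bumlu(True, True, 4, 3, 2, 2, 2, 1))  # ('NICU', 17)

Under this sketch, a NICE article totals 3 + (9 to 12) + (4 to 5) = 16 to 20, consistent with the score ranges given above.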

Evaluation Rubric

When using the NICE/BUMLU Article Quality Measurement, the evaluator takes all four quality measurement areas into account to generate a rating. An example rubric is provided in the rubric PDF mentioned below.

—Preceding unsigned comment added by ARoth (Public Policy Initiative) (talkcontribs)

I haven't read your comment yet, but for future reference, on talk pages such as this one, it's standard practice to "sign" all comments that you make by adding four tildes to the end of your comment. Congratulations on your first post! johnpseudo 19:38, 25 June 2010 (UTC)

Johnpseudo's comments

Generally, this seems like an awkward task for someone not intensely familiar with Wikipedia to be attempting. Although you're looking for something that can be analyzed statistically, I think there's a lot more you can learn and borrow from the official wikipedia assessment criteria. More specifically:
  1. I don't understand why article importance plays any role in this assessment, and "time-limited notability" seems at odds with the requirement that all articles in Wikipedia have "enduring notability". Maybe I'm reading too much into that.
  2. The "Heirloom" term seems hokey to me
  3. You seem to be lacking a lot of wikipedia-specific features (flow, article division, content and graphic layout, hyperlinks) - maybe you were aiming to judge it purely on its traditional printed value?
  4. I don't know that the entire acronym construction is necessary. Wikipedia is not paper, so whatever extra space you need to make the rating "intuitive to readers" is not a problem.
Maybe you could just use the official assessment criteria, and assign a point value based on whether it is Stub/Start/C/B/A/Featured. Or if there's a specific reason why your assessment should differ in what it values from what the wikipedia assessment method values, could you spell that out? Thanks! johnpseudo 20:02, 25 June 2010 (UTC)

Thanks for the formatting tips; I will work on it, and am getting a crash course in WP/WM. One problem with the existing assessment framework is that it is widely inconsistent at gauging article quality. Different users place different values on things like content, wiki formatting, and images, and this creates a wide range of quality among articles all rated the same ("Good Article", etc.). The existing framework requires new evaluators to have a lot of experience and knowledge of Wikipedia. This is a drawback, especially for this project, because part of its goal is to introduce and welcome new subject matter experts to becoming regular Wikipedia contributors with the help of the established community. The existing assessment system is problematic for new Wikipedia users, and experts in the topic area are deterred from evaluating articles in their area of expertise both by a lack of understanding of the subjective criteria for each article and by ignorance (like mine) of the established Wikipedia culture. My Wikipedia coaches suggested that rather than trying to change the existing system, it would meet less resistance to create a new metric. They also identified the importance rating as problematic, but others liked it, so I included it here. But please don't get hung up on the importance criteria; if the rubric PDF is visible, it shows that importance does not even become part of the actual quality score. It was included to indicate that some articles do not need to be as fully developed as others, so what might be a complete article for one topic is bare bones for another. ARoth (Public Policy Initiative) —Preceding undated comment added 21:13, 25 June 2010 (UTC).

A few quick points/questions:
  1. "Different users place different values on things"- that is inevitable, no matter what rubric you use.
  2. "The existing framework requires new evaluators to have a lot of experience and knowledge in Wikipedia" - in what way, specifically?
The standard assessment system basically requires people who are doing ratings to understand Wikipedia's core policies and basic stylistic norms, and for anything higher than a B-class rating, some knowledge of how the quality processes (WP:GAN, WP:FAC) work is needed as well. If the importance scale is to be used as well, raters ought to be familiar with the basic scope of the relevant WikiProject. That's definitely a non-trivial learning curve, especially for people who are new to wiki markup and general wiki conventions as well.--Sross (Public Policy) (talk) 20:10, 28 June 2010 (UTC)
  1. I get the impression that there is a parallel conversation going on amongst Wikimedia employees. It'd help to know what you've talked about so far and to keep all conversation centrally located. johnpseudo 16:22, 26 June 2010 (UTC)
Yes, there were some prior conversations. Basically, Frank encouraged us to think about how ratings might be done in an ideal world, with the understanding that it might or might not be feasible to do something that departs significantly from the system already in place. Since Amy has expertise as a research analyst but doesn't see the world through Wikipedia-colored glasses, we wanted her to share her idea of what a de novo assessment system would look like; in the meantime, she's learning about what people have done and have proposed doing based on the existing system. That's basically where the internal discussion left off, because we wanted to move the discussions into the open as soon as we could. We also brainstormed some ideas for basically testing the existing system against itself, by training a small number of people in the details of the existing system and having them independently rate a set of articles that already have ratings on Wikipedia. But what sort of metric(s) will be used is still an open question. We also talked about testing out the ReaderFeedback extension to get another sort of data, but at least for now I think we've decided that would be too messy to be useful.--Sross (Public Policy) (talk) 20:10, 28 June 2010 (UTC)

Jodi.a.schneider's comments

Hi Amy (via your Wiki-research-l email) - I like the categories for importance. I agree that readability of the terms (rather than acronyms) is important. Since you're looking at information quality, I wonder whether research in that area, such as B. Stvilia, M.B. Twidale, L.C. Smith, and L. Gasser, "Information Quality Work Organization in Wikipedia," Journal of the American Society for Information Science and Technology, vol. 59, 2008, pp. 983-1001, might help you. I see that your account is just for this project; I think more familiarity with Wikipedia will help you, especially with the ratings as discussed above, so I hope that you have a personal, non-project account to play with. It may also help to give more context about the project on your userpage. In your rubric, consider using icons (such as checkmarks or bars shaded from red to green) associated with the values; the initials are cryptic and take too much processing. Best of luck with your project! Jodi.a.schneider (talk) 23:49, 25 June 2010 (UTC)

Hi Jodi, I just wanted to thank you for the article reference. I was finally able to get a copy, and it was really helpful. For me and other new Wikipedians on the team, it offered a great description of the complex infrastructure involved in Wikipedia quality control. A new hybrid metric has emerged through this process; check it out if you get a chance, your feedback helps.
ARoth (Public Policy Initiative) (talk) 16:22, 1 July 2010 (UTC)

DGG's comments

Yes, it would be nice to have some general background.

  1. Is your proposed metric a version of some standard metric in the field of measuring information quality, or is it a completely independent one, with no prior background? (I'm not trying to express a preference--I can see advantages in either--just to get the information.)
  2. Is it based on some theoretical construct? What are the purposes that an article serves that the measures relate to? What are the factors in human understanding that they represent? What manner of measuring them are available, except unguided impressionistic human judgement?
  3. If it is based on some explicit construct, has there been, or is there intended to be, any attempt to measure the validity? How would you propose to determine this?
  4. In measuring the validity, are you aiming at showing that the quality achieved will accomplish the educational goals of Wikipedia, that it will satisfy people who might doubt the quality, or, perhaps, just at satisfying the funders? The first two goals are good; the third is, I would hope, achieved by some combination of them.

Specifically, I agree with some of the comments above. I know this is going to sound over-critical right at the start, but you need to accept this environment:

  1. Wikipedia uses too many acronyms as it is. Please do not introduce new ones. Especially do not use them if they seem in any way like institutional or report-writing terminology--too many of us have had unsatisfactory experiences with that way of working.
  2. The "Heirloom" term gives the false impression that a high quality article will remain of high quality. I cannot think of many topics within this field that will not need continuous updating and revision. Even the historical ones are subject to re-evaluation, though not at as rapid a time scale.
  3. The proposed scale of quality: Concrete, Marginal, Latent, strikes me as non-standard jargon. (Perhaps it is standard and I am ignorant of it, but it is still jargon.) Marginal seems clear, but it can mean marginal between good and not good, or marginal between good and excellent. What Concrete and Latent mean in this context I do not know--I suppose they are intended to mean satisfactory/unsatisfactory, but that seems unrelated to the common meanings of the terms.
  4. Though our concept is enduring notability, it means two things--that something must be of more than transient interest which we express by saying Wikipedia is not a newspaper, but also the opposite: that once it is of substantial interest, it will remain of interest forever--that notability is permanent.
  5. Simple sounding criteria are not necessarily simple in application:
    1. Neutral Point of view is required, but it is naïve to think that articles are either neutral or non-neutral--they are of varying degrees of neutrality. Everyone will agree that the different major views must be balanced, but we frequently disagree quite sharply over what is the appropriately neutral balance. And in particular we tend to really disagree on the importance of minority viewpoints--there's a general rule that coverage is proportional to the importance of the viewpoint, and that far fringe views need not be covered, at least in detail, but how to apply this has caused many of our most difficult controversies, even in the natural sciences, where it is much easier to distinguish fringe than in the social sciences. More generally, I do not think that any humans can write on things that matter to them without some degree of expression of their own personal viewpoints--and if they work on things that do not matter to them, as they work on them, they will acquire a viewpoint.
    2. Accuracy: we have a principle that we follow reliable sources but do not attempt to determine whether they are in fact correct; the phrase we use is Verifiability, not truth. Many of the basic facts as well as interpretations in public policy--as in many other fields--are not universally accepted.
    3. "every bit of information is cited with a reliable source immediately following the relayed information" This is not a universally accepted standard here. It is currently fashionable to have extensive inline citations, although this is different from the practice of every other general encyclopedia. In particular, there are many of us who object to citing bits of information within sentences, unless there is some special reason to do so. Overcitation does not help comprehension.
    4. Completeness. Perhaps you mean "balance" or comprehensiveness--I do not see how we shall ever have a complete article on anything, or that it is necessarily even desirable in a general encyclopedia.
    5. Grammar. You mention sentence length; surely that is a matter of style, not grammar. You do not include style--there are accepted standards for good prose style, including clarity, comprehensibility, directness, and interest, which demand attention to many factors, including that of varying the sentence structure to avoid monotony. I often say that the hallmark of a Wikipedia article is that it is dull, rather than individualistic, but this is not necessarily desirable, just inevitable with community editing, where any carefully done effects are very likely to be disrupted by further edits. DGG ( talk ) 09:05, 26 June 2010 (UTC)

Swayland's comments

From what I understand, this project–Wikipedia:WikiProject United States Public Policy–requires a quantitative assessment of the improvement of articles over the duration of the project. What Wikimedia is doing here is, hopefully, performing a sort of field experiment that will tell us whether the intervention (using student assignments and public policy professors to improve articles) resulted in improved article quality. In order to do this successfully, they will need two things:

  1. A designed experiment - to establish causality, or at least show a larger-than-expected change over time
  2. A metric designed to reliably measure article quality

Of these two, the most difficult is the second. Wikipedians have established the official wikipedia assessment criteria but, after carefully reading through the assessment page, it appears that the criteria are open to substantially different interpretation and valuation. For example, an article with good content but marginal formatting could be rated as either a B or C according to the assessment criteria, "The article is better developed in style, structure and quality than Start-Class, but fails one or more of the criteria for B-Class." However, the difficult portion of article writing is in establishing good content. Any careful Wikipedian could perform an excellent reformat of an article that he could not possibly have written. It also appears (after trying to perform one) that one of these assessments can only be reliably performed by a very experienced Wikipedian who is well versed in the arcana of Wikipedia and has looked at many other assessed articles. Amy's article assessment tool separates the different qualities of each article into different categories, which gives us much more information to work with when we look at the results of the assessment. Furthermore, it has two very useful consequences:

  1. The editors of the assessed article have immediate, concrete feedback regarding the merits and shortcomings of the article.
  2. A rubric can be written that yields valid, reliable results. A clear rubric can also be used by anyone who understands the topic of the article and the rubric.

If you know of a good metric that can be applied to Wikipedia articles, please post it. Otherwise, I suggest that Amy's method (with some improvements) would be an effective tool for an impact evaluation of this project. Also, this discussion should probably be moved to Wikipedia:WikiProject United States Public Policy/Assessment. --Swayland (talk) 18:16, 28 June 2010 (UTC)

Thanks, Swayland. Definitely, a more detailed rating schema along the lines Amy has mooted here would be much more useful to researchers than the current one. The problem to think about, though, is how to get people to use it; a great system with only a little bit of data won't be as useful as a less detailed one with ample data. So if we're going to use something totally different, we also need to come up with a plan to find reviewers. Using a system like this might require as much investment in figuring out how the system works as the existing one, without the advantage of an existing group familiar with it. That doesn't mean it's out of the question, it just complicates things; it may be that finding outsiders to do reviews with something like Amy's rubric would be the way to go, and we could then compare that data to the existing system.
It will probably be good to separate the discussion soon, and move this to an assessment subpage as you suggest. My thought was to avoid fragmenting the discussions too soon.--Sross (Public Policy) (talk) 18:57, 28 June 2010 (UTC)

One more thing. A project like this one seems like a great place to alpha test a new article assessment method, as well as compare the results of the new method to the established one. Having a dedicated researcher, and others, assigned to a project is extremely valuable. --Swayland (talk) 18:26, 28 June 2010 (UTC)

Yeah, Amy's fresh perspective on how to do quality assessment is definitely welcome, from my perspective. --Sross (Public Policy) (talk) 18:57, 28 June 2010 (UTC)
Eh, call me blind, but I don't see how ARoth's metric is any more detailed or objective than the current one. It certainly breaks down the rating into individual point-based categories, but I don't think those categories are valid or practically applicable. Can you imagine an article with zero "grammar, correctness of spelling, punctuation, verb tense and sentence length", but with "presentation and formatting according to Wikipedia conventions"? Or an article with a length completely inappropriate to the topic, but with "information completeness and relevance to topic"? I agree with DGG above that "Simple sounding criteria are not necessarily simple in application". Just help us by answering some of the questions we've asked already. johnpseudo 19:27, 28 June 2010 (UTC)
There don't have to be articles that fall into every possible combination of scores for the metric to be valid or useful. I can think of no reason that there should have to be an article such as you describe in order for the metric to be useful. While it may be true that "Simple sounding criteria are not necessarily simple in application", simple criteria are usually much easier to apply than complicated ones.--Swayland (talk) 20:30, 28 June 2010 (UTC)

As I see it, there are two possibilities for effective quantitative assessment of articles: use a new metric with point-based categories and a carefully written rubric, or add quantitative values to the current assessment method and develop a rubric for assigning those values. The key here is good categorization, a reliable, repeatable rubric, and a numeric quantity assigned to each category. Are there:

  • Any suggestions for valid and applicable categories?
  • Any improvements to the criteria Amy suggested above?
  • Suggestions for how the current assessment criteria could be assigned separate sub-categories?
  • A way to combine the two (Amy's and the established assessment)?

This problem is definitely solvable; I'll be thinking about how. Perhaps we need several things to get going:

  1. A problem statement
  2. A definition for a high-quality article (perhaps this exists?)
  3. A list of categories that can define article quality
  4. A rubric defining scoring for each of those categories--Swayland (talk) 20:30, 28 June 2010 (UTC)
[edit-conflicted reply to johnpseudo] Yeah, there are definitely some ways we could tighten up ARoth's metric if we go forward with something like it, and I agree fully with DGG's point about simplicity in practice. But I think it's also pretty clear that the current rating system is a very blunt instrument, and that the rough outline of what she proposes--ratings broken down and weighted along several axes of quality--would be useful for analyzing Wikipedia. I'm actually quite skeptical of the feasibility of something like this proposal, just because I have a pretty good understanding of why the current system is the way it is. But I think it's still worth thinking seriously about possible alternatives or supplements to it, especially in the context of something for outside reviewers rather than Wikipedians to use. We don't want to just boil in our own broth continuously.--Sross (Public Policy) (talk) 20:34, 28 June 2010 (UTC)

I am glad that there is some interest in this topic. The feedback here will be valuable to the success of the project and, hopefully, in the larger view, to the purpose of Wikipedia. I just wanted to clarify a couple of things. My purpose in posting this metric at this early stage was to get input and collaboration in its development from the WikiCommunity, to hopefully create something that would be useful to Wikipedia. The public policy project does require quantification of article quality improvement, so the project will probably use something like the metric described above or a similar one that evolves through this process. As a pilot program, we're trying to figure out the best model to replicate in the future, but we certainly aren't expecting this to be used wholesale on Wikipedia right now. It is more in line with the sustainability goals of the project to develop something that has long-term value to Wikipedia, but either way the public policy articles must be evaluated for improvement. I discussed the metric at length with three different long-time Wikipedians. The consensus from them was to use a different rating system, so that it was clear to the WikiCommunity that the Public Policy Project is not trying to take over Wikipedia or control Wikipedia-wide assessment methods; hence the use of the term "Heirloom," which is basically the same as a "Featured Article." They also urged the use of non-judgmental terms, which is where "concrete" and "latent" came from, and they mean exactly what their definitions suggest: concrete means the information is solid; latent means the article has potential that is not yet realized. The Wikipedia experts agreed with me that content should carry the greatest weight in article quality, and that the current system is nearly impossible for new contributors to use - which is a huge problem for this project because one of its major goals is to recruit and retain new subject matter expert Wikipedians. My information about article citation also came from Wikipedians, but they were not in total agreement on that point, for the reasons DGG stated.

Thank you, Jodi, for the reference; I am working on getting that article.

DGG brings up a good point about using an established metric. That would be great; does anyone know of one that has been tested and used for similar purposes? I have not been able to find anything that fits. I see the public policy project as a pilot of sorts, or an opportunity to test whether certain tools would be useful and whether they are implementable in Wikipedia.

Swayland outlined some of the reasons that I do not think the existing assessment method would fit this project's needs. DGG also touches on it when describing the current system as more of a grading system. The reasons I see for modifying or implementing a metric are:

  1. The current assessment system is difficult for new contributors to understand and use.
  2. The current assessment system has an enormous degree of variability which makes analysis difficult and maybe impossible.
  3. The current system requires an article to be linked to a WikiProject in order to be assessed.
  4. The right new metric will establish more consistent weighting in aspects of article quality.
  5. The public policy project is an opportunity to pilot a new metric.

ARoth (Public Policy Initiative) (talk) 15:24, 29 June 2010 (UTC) 01:33, 29 June 2010 (UTC)

Let's play a game…

…and see if the new article evaluation criteria work: Challenge. --Fschulenburg (Public Policy) (talk) —Preceding undated comment added 16:05, 29 June 2010 (UTC).

Frank is suggesting a one day "get to know you" test of Amy's metric, on July 1-2 (this Thursday). Let's try to refine it before then and incorporate some of the feedback everyone's given so far. I've copied the draft metric and her rubric for evaluating articles on a subpage, /Article Quality Rating Metric. Everyone should feel free to start hacking it into better shape.--Sross (Public Policy) (talk) 18:00, 29 June 2010 (UTC)

Article Quality Metric: Revised

I want to start by explaining more about what I'm trying to do here. As the Research Analyst for the Public Policy Initiative, I need a scientific baseline for the quality of public policy articles on the English Wikipedia before we engage the experts to start editing. I will then measure article quality again after we wrap up this trial project and compare the two.

This new rating system:

  • needs to be defensibly quantifiable for the Public Policy Initiative.
  • ideally, would also be something that could be implemented in Wikipedia in the future if the editing community finds it useful.

It will be used by:

  • subject-matter experts who may or may not be experienced Wikipedians (most are not).
  • Public Policy Initiative staff.
  • ideally Wikipedians who are eager to help us establish a scientific baseline.

The main question seems to be, what's wrong with the current Wikipedia article assessment? Here's what I see as being wrong:

  • It's difficult for new users to use.
  • There is no quantifiable rubric.
  • It is inconsistent in a number of ways:
    • There is no consistent weighting of specific article quality aspects (for example, some evaluators give style more weight than content, and vice versa; some consider content equal to style).
    • The ratings that articles receive seem inconsistent (there's a wide range of quality within each category).

My goals for the new rating system address these problems:

  • Experts and new users need to have a clearly defined rubric with which they can assess articles.
  • Weighting each area puts more emphasis on content than on format, but recognizes all are important.
  • Each article should get similar ratings from different evaluators -- in other words, each rating must be reproducible within a point or two (see the sketch after this list).
  • For the purposes of my analysis for the Public Policy Initiative, each article should have a numeric score.
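
As an illustration of how that reproducibility target could be checked, here is a minimal sketch; all names and numbers below are hypothetical, not project data:

    # Hypothetical reproducibility check: several evaluators score the
    # same article, and the largest disagreement between any two of
    # their numeric scores should stay within a point or two.
    from itertools import combinations

    def max_spread(scores):
        # Largest absolute difference between any pair of evaluators.
        return max(abs(a - b) for a, b in combinations(scores, 2))

    ratings = {"Article A": [17, 16, 18], "Article B": [9, 12, 8]}  # made-up
    for title, scores in ratings.items():
        verdict = "reproducible" if max_spread(scores) <= 2 else "rubric needs tightening"
        print(title + ": " + verdict)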

I set out to create the example article ratings (Heirloom/NICE/etc.) as a way to keep this assessment separate from the existing Wikipedia assessment, because I don't want to impose something on the community. And although what is most important for my role in this project is to get a quantifiable rating system that new Wikipedians can easily use, I also want the community to be involved as much as you'd like to be.

So here's my question to Johnpseudo, Jodi.a.schneider, Swayland, DGG, and the rest of my Public Policy team members who are more experienced Wikipedians:

I need a quantitative, weight-consistent, and reproducible metric for the project goals. Would the community be more apt to use some version of what I've proposed here (I don't care what anything is labeled specifically; it just needs to have those components) than if I simply attached weights and numerical values to the existing system?

Simply attaching numerical values to the existing system won't work: that will magnify the problems that the lack of weighting brings to the current system.

My revised rubric is posted here: WikiProject U.S. Public Policy / Article Quality Rating Metric

ARoth (Public Policy Initiative) (talk) 22:24, 29 June 2010 (UTC)

Let's try to rate an example article: Wikipedia talk:WikiProject United States Public Policy/Article Quality Rating Metric. --Fschulenburg (Public Policy) (talk) 23:00, 29 June 2010 (UTC)
As I understand it, you only really need two assessments, one before and one after? The existing stub-through-FA system holds very little data; its primary strengths are that it's relatively easy to assess quality without subject or style knowledge (GA and FA are done "for you", so to speak), and edit wars over quality assessment are (to my knowledge) rare, properties which are substantially less important for this project. I wouldn't seek to emulate the 1.0 ratings because those ratings are used on a continuous basis, only being updated when someone notices they're plainly wrong. I'd just do two assessment drives to gather as much data as you want, one before and one after. Nifboy (talk) 03:24, 30 June 2010 (UTC)
If the project scope were as simple as comparing before and after snapshots of article quality, then what you suggest would probably work. The reason we need to address the metric is because a big part of this project is an organized attempt to integrate new subject area experts into Wikipedia through collaboration with several university classes. So, we need to provide a metric that brand new Wikipedians can learn to use relatively quickly, and it will also hopefully be more understandable to them when their class-related Wikipedia editing work gets assessed. Check out the new [hybrid metric] revised by long-time Wikipedians. ARoth (Public Policy Initiative) (talk) 16:55, 1 July 2010 (UTC)
Hm, okay. I guess from the standpoint of integrating it into classroom grades it makes sense, because then there's a tangible event worth reassessing for and someone on hand to do the assessment. My experience is that the Wikipedia community by itself is really bad at continuous reassessment even at the current ABCStart scale, and it tends to only get done when someone specifically asks for it on a particular article, usually as part of the GA process. Nifboy (talk) 23:20, 1 July 2010 (UTC)
The problem you mention, if I can co-opt it into my own terms as the "slowness" of the WikiCommunity to update article assessments, is another hurdle for the project that I had not identified. Do you have any suggestions for motivating assessment drives at the beginning and end of the project? ARoth (Public Policy Initiative) (talk) 17:29, 2 July 2010 (UTC)
A drive isn't a problem, because it's a finite workload: We have however many hundreds of articles to assess, and when they're all assessed, we're done. A second (or third or fourth) drive is also not a problem, because you start from scratch. My suggestion is to prominently timestamp assessments, instead of the existing system which is assumed to always be current and editable whenever. This also helps with the before-and-after analysis because you can have, in a section that is probably collapsed by default, all the different assessments that have been done on the article. Nifboy (talk) 23:22, 2 July 2010 (UTC)