Talk:Database normalization/Archive 3

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Archive 1

Archive 2

Archive 3

Spelling: normalize vs normalise

'Normalize' is perfectly correct ([1]) spelling of the word. Changing it would be no improvement, and would render the article inconsistent with most usage, and inconsistent with the other wikipedia pages on database theory. --VinceBowdren 20:39, 7 January 2007 (UTC)roniel

'Normalise' is the British spelling and 'normalize' is the American spelling. Both are correct, and any English speaker, American, British, or otherwise will know what it means. As for the correctness of the articles, neither matters, just be consistent. Vince Bowdren makes a strong case for continuing with the normalize spelling.

Incomplete Information and potential source for tremendous bias

This is the first article I've encountered in the Wikipedia that I have the potential to help with. I'm not sure of the best way to go about doing so. Some of these comments also apply to the dimensional database article too.

In the Products section, Cache is listed, but not the original MUMPS that it was based on, also the current crop of multi-valued aka post-relational aka multidimensional databases (OpenQM, QM, uniVerse, UniDATA, mvBase, mvEnterprise, jBase, D3, etc.) are not mentioned.

There are also links to web pages by Fabian Pascal. He is a paid author and is known for his mischaracterizations or, and rants against, multi-value databases. His typical "argument" style consists of making a statement, and when it's argued, his reply is essentially no more than "The math proves it and if you don't see that you're too stupid to argue with." Also, when people point inaccuracies and mischaracterizations in his examples, the typical response is along the lines of, "You're too stupid to know your own product." Finally, when people point out his bias, he defends himself by trashing one of the SQL-style databases that he doesn't derive income from.

ElRay 16:27, 7 February 2007 (UTC) Ray

Erm... that's a very strong assertion. Can you source that statement? (BTW, I took the liberty of converting some of your text to links)

Regarding the Products section, I believe what they're trying to do is list some of the most widely used databases: from my experience, the post-relational databases haven't gained too much traction yet. For a complete list of DBMS, you'd want to consult Category:Database management systems. Anyway, it doesn't have anything to do with this article: it's a template that's transcluded throughout here. You'd want to bring your objections to this page: Template talk:Databases.

I'm glad to see someone from the field volunteering to help out this article. It is in desperate need of expert attention. If you have any more questions, don't hesitate to ask! — Edward Z. Yang^(Talk) 02:34, 8 February 2007 (UTC)

I used to come across Fabian alot in the old Compuserve Foxpro forum days where I was very active and achieved some level of acceptance (MVP award). While I think Fabian is pretty well right on factually it was sometimes very hard to tell due to exactly the types of arguement ElRay states.

I added a little bit to the first few sentences of the article that I hope helps introduce the subject by defining the word. I have been writing about database theory, informally in forums and such, and teaching database normalization off and on for years. The better you get the model (normalized database) the more flexible and long lasting the program is my experience. Billpennock 01:10, 12 August 2007 (UTC)

Teaching the Wrong Thing

Please do not teach the wrong way how to apply normalization. Starting with the universal relation (including all attributes) as a starting point is the worst way to go. Every expert in the field will tell You: Never start modeling with the universal relation! Don't try to normalize in praxis, since every thing is decomposed and normalized almost always. Just tell this your readers: Normalization is only theory failing always when practical problems are considered. --213.61.130.220 09:09, 1 March 2007 (UTC)

Exactly where does the theory fail? This is often asserted, never articulated where in fact the theory fails, and never proven. Codd submitted rigorous mathematical proof of the relational model, and was given the ACM Turing Award. Please either prove Codd wrong and step forward to claim your award or stop repeating this unsubstantiated nonsense. --[User:TGantos] 06:00, 28 May 2007 (UTC)Tgantos 04:01, 28 May 2007 (UTC).

Theory is correct, of course, but it fails where it assums that real world problems are brought into a form where this theory is applicable: Just start modeling with elementary fact tables (predicates of ary 1-3) and normalisation theory simply does not apply. Obviously, you can then show that joining those tables in advance will introduce all those problems you never had with the original design. So theory is simply presented the wrong way around for decades (following award winning God(d), however). --213.61.130.220 08:54, 26 July 2007 (UTC)

As a software architect (whatever that means ;-) and designer, specialising in database design for the last 12 years, I agree with 213.61.130.220 (gasp). Let's try not to obfuscate the real issue here, which is how does database normalization help real software projects! Let's not pull out the old overworn examples. Let's use instead a real-world approach, which starts best by gathering business knowledge of the entities involved and then testing the documented assertions against the available data, then reconsidering where attributes should normalise to.

The true benefits of real-world normalization only become evident as you consider pros and cons of alternative designs using various normal forms. How can we demonstrate this to our readers?

Can we start to collect here, some business cases and design cases, for use of the more esoteric normal forms? Under what circumstances does the cost-benefit favour these approaches?

-- Matthew 1130 12:44, 1 March 2007 (UTC)

213.61.130.220 says "Don't try to normalize in praxis, since every thing is decomposed and normalized almost always." By "everything", if I'm interpreting him correctly, he means "everything that's been modelled using common sense". This accords with what's often said about normalization, that it's merely formalized common sense. If you design sensibly, your design will come out normalized. I acknowledge this in the article in the passage which reads, "A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these higher normal forms." Also see the introduction to the example, in which it is acknowledged that the example is somewhat contrived for the purposes of illustrating the way the normal forms allow anomalies to be overcome; it is not meant to reflect what typically happens in design. Remember: the article is primarily concerned with explaining what it means to achieve the various normal forms. How one achieves them is another story. Finally, let me reply to Matthew's point about the more esoteric normal forms. It is actually very hard not to achieve these normal forms. In the vast majority of cases, a design adheres to those esoteric normal forms already, regardless of whether the designer has any knowledge of them. See the passage just referenced. There would not be business cases and design cases for meeting those esoteric normal forms, because except in rare cases, one meets them without trying! --Nabav 17:34, 12 March 2007 (UTC)

Thank you for clarification, Nabav. I agree in general. Why not to include your complete paragraph above into the article, since it's the thruth about normalization for me and for others? Why is 'normalization' (not to be confused with 'normal forms') still in the textbooks after 30 years of the relational model forcing Wikipedia to follow this tradition? What are the historical, academic or economic reasons? Who ever writes down functional dependencies, to say nothing on multivalued dependencies, accept in lessons or exams, however? Why using contrived examples and justifying this academic tradition by apologizing in the first paragraph (and by for the sake of this example phrases, which sound like a confession anyway)? Why not to delete the normalization example section completely instead of apologizing, justifying and confessing? Even in most textbooks, designers are first guided to model with entities and relationships or some linguistic approach in mind. It can be proven that some restricted formalism always lead to relational models satisfying at least BCNF, not to talk about PJ/NF or inclusion dependencies ... --213.61.130.220 09:56, 15 March 2007 (UTC)

These are good points, and have made me think about what 'normalization' is essentially about for me. I have some ideas for how the article could be changed to address some of your concerns. Unfortunately I don't have a great deal of time at the moment, but as a long-term ambition I would like to a) make the article convey a better sense of what normalization is (and isn't), and b) add material about the relation between normal forms and E/R modelling, and other approaches to modelling. --Nabav 07:17, 22 March 2007 (UTC)

Having 20+ years in DB design experience, using everything from DBII to Access (from its inception), to DBIV, to FMPro, MySQL (to name a few), I found this article to be a pretty easy-to-follow explanation. I have taught Advanced DB Design, and have made lots of money normalizing DBs that were designed poorly. The clients found their mistakes when the data reporting was not what they expected (b/c of of inconsistent redundancies).

I agree that diagrams would be easier to follow for some, but there is obviously a greater effort required to upload that to Wikipedia. I think a combination of graphics, tables, and textual explanations would cover all manners of learning styles. In a perfect world . . .

Keep up the good work! EsmereldaPea 19:32, 23 April 2007 (UTC)

jasimab says "The content in this page is useful for students trying to learn the theory. A good database developer would design a database so as to follow the best practices. He would unknowingly be normalzing the database he creates. So this section, if looked from a practical point of view, can be renamed to something like 'Best Practices in Database Design' and make it more practical oriented. But for me, I would suggest to keep this as such, and save some time for the many students who have to mug up this theory for scoring in their university exams." Jasimab 18:29, 8 May 2007 (UTC)

Thank You for removing the misleading section "Example of the Process" in August (justification doesn't matter) --213.61.130.220 09:53, 1 November 2007 (UTC)

Examples

Wouldn't it be easier to have E-R diagrams instead of the actual tables? It would be much more intuitive for someone who doesn't fully grasp the data in the table at first glance. --68.239.240.144 00:11, 21 March 2007 (UTC)

the thing about database normalization is that it's very dependant on the data in tables and implications that can be drawn from the data. ER diagrams don't convey nearly the same message. :( McKay 14:31, 21 March 2007 (UTC)

Superkey example incorrect

I belive that the assertion that {DVD ID, Member ID} is a superkey for the DVD rentals table is incorrect. A particular member may rent a particular DVD on more than one occasion. Thus an additional value (date / time) would be needed to guarantee that the key is unique. —The preceding unsigned comment was added by 66.28.245.154 (talk) 16:46, 23 April 2007 (UTC).

Whether it's correct depends on the predicate. If the desired predicate is "The DVD designated by [DVD ID] is currently rented out by the member designated by [Member ID], who lives at [Member Address]", then there is no problem. However, I concede that the predicate of a real-life DVD-rentals table would need to mention the date of rental in order to be of any use. This suggests that the example should be changed. Note that the DVD rentals material occurs in several places. What would be ideal would be a very simple example (3 columns) in which the time element is plainly not very important; that way it won't be vulnerable to this type of objection. --Nabav 22:45, 23 April 2007 (UTC)

The example has now been changed. --Nabav 18:40, 28 May 2007 (UTC)

1st normal form

The article says:

'Sometimes this second requirement is expressed like "there may not be repeating groups", leading to some prevalent misconceptions. The first misconception is that 1NF precludes a series of columns repeating the same domain'

However, authoritative sources such as "Data Modeling Essentials" (by Simsion and Witt) explicitly say that getting rid of repeating groups is part of getting a model into 1st normal form. Even Litt's Tips (http://www.troubleshooters.com/littstip/ltnorm.html) that is referenced in the further reading section of this article says:

'First Normal Form: No repeating groups. As an example, it might be tempting to make an invoice table with columns for the first, second, and third line item (see above). This violates the first normal form...'

Can we get some clarification to this?

This also seems to be contradicted in the NF² section, which uses a list of favorite colors as the example of something in non-first, much like the recipe example in the 1NF section, which claims NOT to be a violation of 1NF. --74.73.41.31 12:48, 18 July 2007 (UTC)

Yes, this is a very confusing area, with different authors understanding the concept of "repeating groups" in different ways. As a first step in sorting through the confusion, I have rewritten the First normal form article in a way that hopefully does justice to all the complexities of 1NF. Next step is to provide a balanced and internally consistent summary of 1NF here. --Nabav 12:14, 22 July 2007 (UTC)

The section on 1NF is bizarre, appearing outwardly to have been written by multiple people. It is "non-nulls-are-fine-according-to-Codd-and-more-realistic" sandwiched between "never-use-non-nulls-in-1NF" sentiment. It could use some smoothing-out and a more cohesive style. --207.170.253.102 (talk) 21:40, 22 July 2008 (UTC)

Third normal form

U should to provide an example with a table.—Preceding unsigned comment added by 220.227.168.118 (talk • contribs) 2007-11-13
Above comment moved here from Article page.--Boson 07:15, 13 November 2007 (UTC) An example is provided in the main article (see text below heading).--Boson 07:15, 13 November 2007 (UTC)

Sixth normal form

The paragraph on 6th normal form doesn't make much sense and looks like an advertisement (nobody except a particular article appears to use the term and it appears to be a different kind of constraint than the other normal forms). I propose to delete this paragraph. Rp (talk) 11:58, 21 May 2008 (UTC)

It looks like some WP:OR original research from the author of the linked book, see " In recent work of my own--documented in the book TEMPORAL DATA AND THE RELATIONAL MODEL, by myself with Hugh Darwen and Nikos Lorentzos--my coworkers and I have come up with a new sixth normal form, 6NF" [2]. The paper linked at the end of the paragraph doesn't mention the sixth normal form anywhere, and it's just referring to some research unrelated SQL stuff. I'm deleting it right now, since this is non-notable research by one author. It can be re-added later if there are other references on literature to this research. --Enric Naval (talk) 14:03, 21 May 2008 (UTC)

Hum, it seems that 6FN is actually discussed on some books[3] and used on some academic sources [4]. I also see that it's related to temporal relations on SQL, so the linked paper was actually relevant. I'll see what I can do with that. --Enric Naval (talk) 14:19, 21 May 2008 (UTC)

You quote Darwen and Date. What we need is independent quotes, and at least a formal definition. I personally don't mind original research, but I do mind "definitions" for which you have to buy the author's books or articles before they can be understood. Rp (talk) 16:34, 23 May 2008 (UTC)

I changed it so that it's clear that the 6FN definition is not formal and that it's not accepted by everybody, and that some people use it as a synonim for DKNF --Enric Naval (talk) 15:01, 21 May 2008 (UTC)

It's an improvement, but I still think a clear definition is required. Rp (talk) 17:07, 23 May 2008 (UTC)

I have attempted to clarify. As I see it, McKenna is one of those who started to use 6NF for DKNF (or something similar with no formal definition), which is described elsewhere. I think that "definition" should be taken there. Date's definitions is a direct extension of 5NF, i.e. it is "immediate" from the definition that every relvar that is in 6NF is also in 5NF, but in Date's 6NF there are no nontrivial join dependencies at all. This is significant for temporal relations. The concept of "projections" may be extended in this context. I would give a direct quote from Date's book, which would obviously be allowed under fair use, but I'm not sure it would be allowed under GFDL.--Boson (talk) 22:13, 23 May 2008 (UTC)

You can paraphrase the idea that he explains, but remember to use a reference to point to where you got the idea from. Or you can use a short relevant quote provided that you make very clear that it's a direct quote, and you give the exact source. Thanks for trying to clarify it. --Enric Naval (talk) 11:47, 24 May 2008 (UTC)

Extraneous Text?

I removed some extraneous text from the end of Third normal form section that appeared to be a bad edit:

if you have three columns in a single column. A COLUMN IS STUDENT ID , B COLUMN IS STUDENT NAME AND C COLUMN IS STUDENT ADDRESS A -> B A -> C C -> B

Mpd (talk) 18:44, 1 October 2008 (UTC)

This article is too wordy and difficult to read, needs a fresh rewrite

I felt like whoever wrote most of the copy in this article was trying to use as many large complicated words as possible to sound sophisticated, but in effect was clouding the language and making it difficult for a reader to understand the ideas being conveyed.

Take a look at http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html, 10 times easier to read and yet it explains the same topic. I think the rewrite should be more like the one in this link, for Wikipedia's purposes. J. Straub (talk) 07:12, 12 December 2008 (UTC)

I have begun a rewrite: thus far I've done the intro and an "objectives of normalization" section, and intend to rejig and clean up all the other sections as well. Consulting (and including citations to) the relevant papers is an enormous help to keeping the article focused and accurate. It is very easy to write sloppily and incorrectly about normalization: an enormous amount of misinformation is floating about on the Internet, in published how-to manuals, and in people's heads. This explains why the article has gone to seed. But I'm hopeful it can be whipped into shape. --Nabav (talk) 16:46, 1 January 2009 (UTC)

The first diagram is unclear, it is kind of ambiguous what the rowspans are supposed to represent ( presumably a violation of 2nd normal form, repeating data in multiple rows? ) 65.13.73.143 (talk) 22:46, 12 February 2009 (UTC)

Underneath the diagram it says: "An update anomaly. The Employee 519 is shown as having different addresses on different records." That's clear, isn't it? The diagram is an example of just what the caption says it is: an update anomaly. Accordingly, the diagram occurs just to the right of the section of the article that describes update anomalies. --Nabav (talk) 09:49, 13 February 2009 (UTC)