HN Books @HNBooksMonth

The best books of Hacker News.

Hacker News Comments on
The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling

Ralph Kimball, Margy Ross · 7 HN comments
HN Books has aggregated all Hacker News stories and comments that mention "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling" by Ralph Kimball, Margy Ross.
View on Amazon [↗]
HN Books may receive an affiliate commission when you make purchases on sites after clicking through links on this page.
Amazon Summary
The latest edition of the single most authoritative guide on dimensional modeling for data warehousing! Dimensional modeling has become the most widely accepted approach for data warehouse design. Here is a complete library of dimensional modeling techniques-- the most comprehensive collection ever written. Greatly expanded to cover both basic and advanced techniques for optimizing data warehouse design, this second edition to Ralph Kimball's classic guide is more than sixty percent updated. The authors begin with fundamental design recommendations and gradually progress step-by-step through increasingly complex scenarios. Clear-cut guidelines for designing dimensional models are illustrated using real-world data warehouse case studies drawn from a variety of business application areas and industries, including: * Retail sales and e-commerce * Inventory management * Procurement * Order management * Customer relationship management (CRM) * Human resources management * Accounting * Financial services * Telecommunications and utilities * Education * Transportation * Health care and insurance By the end of the book, you will have mastered the full range of powerful techniques for designing dimensional databases that are easy to understand and provide fast query response. You will also learn how to create an architected framework that integrates the distributed data warehouse using standardized dimensions and facts. This book is also available as part of the Kimball's Data Warehouse Toolkit Classics Box Set (ISBN: 9780470479575) with the following 3 books: The Data Warehouse Toolkit, 2nd Edition (9780471200246) The Data Warehouse Lifecycle Toolkit, 2nd Edition (9780470149775) The Data Warehouse ETL Toolkit (9780764567575)
HN Books Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this book.
Check out the Data Warehouse Toolkit. You might want to look into natural vs surrogate keys.

https://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimen...

squeaky-clean
Found a pdf copy (just to browse, I promise!) and it looks great. Ordered a used copy, thanks.

Edit: no instances of candidate key when I ctrl+f though, still looks pretty good.

Aug 01, 2016 · baakss on Why we lost Uber as a user
Yes, especially in business intelligence / data warehousing. Here is an excellent resource if you're interested in this sort of thing: https://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimen...
Dimensional modeling (I'm a fan of Kimball's approach) mitigates these problems quite well while still offering very flexible ad-hoc reporting. Works great on a row-based RDBMS, even better on columnar.

http://en.wikipedia.org/wiki/Dimensional_modeling http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...

Redshift is indeed a solid product but all these comparisons against Hive are surprising, as that's not the right tool in the first place. Infobright, greenplum, aster, vertica, etc are the products which Redshift seeks to disrupt.

jacques_chester
I realised a few years ago that pretty much every database course taught only teaches OLTP. OLAP never really gets a lookin.

At my university, standard normalisation was taught in the "databases" course. OLAP was mentioned as part of the "advanced databases" course.

The database course at that time blew about half its time on building PHP applications to talk to the database. I hate to second guess my professors, but I can't help but feel that a more productive use of the time would have been to teach normalised OLTP in the first half, and dimensionally modelled OLAP in the second half. Better yet, to divide them into two courses and spend some time talking about database history ("here's why network and hierarchical databases sucked") and maybe some introduction to how query planners work.

eru
Do you know of any resource that talks about the same topic as your comment, but in more detail?
jacques_chester
I'm not sure I follow you.
eru
To be more stereotypical: I am intrigued and would like to sign up to your newsletter.
jacques_chester
Well ... really, just read a good pair of textbooks on each side of the spectrum. Date's Databases and Kimball's The Data Warehouse Toolkit are good.

Edit: actually, maybe not Date. It's up to you. It's good, but it's controversial because he's not a fan of SQL and so he uses his own language.

The one I used in uni was Ramakrishnan & Gehrke's Database Management. It was OK but there's a certain amount of at-the-time trendy bullshit that to me detracts from a focus on relational databases for their own sake.

Edit 2: and Joe Celko's SQL for Smarties contains good oil on the relational paradigm.

wisty
Philip Greenspan's SQL tutorial is a nice starting point: http://philip.greenspun.com/sql/

It's a bit old, but still pretty good.

I would suggest reading some books on the topic of Dimensional Modeling [1] such as "The Data Warehouse Toolkit" [2]. The critical thing you need to expose to your users is the ability to ask for things which make sense in their world that are actually really difficult for even an engineer to code. Things like: "Show me average 9am-12pm sales on Mondays, Wednesday and Fridays for 1st quarter, 2012"

  [1] http://en.wikipedia.org/wiki/Dimensional_modeling
  [2] http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
alex_anglin
Speaking as someone who does his fair share of dimensional modeling, I would just point out that the example you cite could only involve two tables in a well designed dimensional model (sales fact and time/date dimension, I reckon). The challenge is in getting to that point.

To speak to OPs point about difficulty in querying data warehouses, most business intelligence tools that I'm aware of provide semantic layer[1]-type capabilities, whereby the user interface of the tool is presented in the language of the business domain. Nevertheless, I still agree that this is still difficult work, unfortunately. That it is getting more complicated in some respects, such as through unstructured data, doesn't help either.

[1] http://en.wikipedia.org/wiki/Semantic_layer

meritt
I guess I wasn't clear enough if I came across like my example was complex. It's one easily solved via DM and one that's extremely hard to execute in most non-dimensionally-modeled setups. That's exactly why I'm a huge advocate of DM instead of just throwing a ton of servers, hadoop & MR at everything.
Jul 26, 2011 · eneveu on Data Warehousing 101
If you are interested in Data Warehousing, you should read Ralph Kimball's "The Data Warehouse Toolkit": http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...

When I started learning about BI (Business Intelligence), a few members of the Pentaho community advised me to read this book. I'm glad I did. Kimball is one of the "fathers" of data warehousing, and his book had a lot of great insights for dimensional modeling. It helped me avoid many design mistakes while building my DWH, and gave me insight I might have taken years to discover.

It's a "theoretical" book, in the sense that it does not focus on any specific technology; it's also a "practical book", because he uses real-world scenarios (inventory management, e-commerce, CRM...) to demonstrate the various dimensional modeling techniques. I also liked the part about BI project management and encouraging BI in a company (= how to engage users and how to "sell" a BI project to management).

He also has a newsletter with many DWH design tips (archives here: http://www.kimballgroup.com/html/07dt.html ).

I am currently doing reports/datawarehouses/dashboards. When something more complicated that simple questions is needed (see Data Warehousing for Cavemen), ad-hoc queries are quite often not the answer anymore, either with NoSQL or with SQL.

I don't want my clients to be dependent on me (or someone else) to build complicated SQL queries when they have questions, so I focus on getting an easy to maintain facts/dimensions model (as advocated by Ralph Kimball http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...) which can evolve if needed.

The nice point about MongoDB when doing this is that it makes it a lot easier to add attributes to dimensions, or load the data, or evolve the reporting system in general (and I like that).

You can apply the same principles to build dimensions/facts based data structure and answer questions that SQL alone wouldn't be able to answer easily.

Example of such question: how many calls did we receive during french legal week #9 that were handled by team X outside the normal working hours or while we were in vacations ? In those calls, how many were issued by a woman (as it has a financial impact in this case) ?

This method has been the bane of my existence for going-on 5 years now. YMMV, of course, but here's my experience:

Back in 2004, we were decided to denormalize a set of records containing year, quarter, and 15 or so data points going back to 1996. Using this cross-tab method, we started out with 8 (years) * 4 (quarters/year) * 15 (data points) = 480 columns (a few more for non-denormalized data). 5 years later, we're up to the high 700s. Needless to say, this pretty much blows out your standard MySQL admin tools.

Another pitfall is, you really have to make sure you data is pristine (of course you should be doing that anyway, but reality is usually far from ideal, especially if you're me in 2004), otherwise you end up with doubling of values or values included when they shouldn't be. The generated queries are very very difficult to debug, again blowing out many admin tools and also text editors (unless you add pretty formatting in your query generator).

So what would I recommend instead? There are multi-dimensional databases or ETL tools that will properly put your transactional data in an analytical style data warehouse. Read Kimball's Data Warehouse Toolkit [1] (ignore the kinda ratty paper quality—the content is good). The key thing is to denormalize down to a star schema, but keep your facts and dimensions separate (things you'll learn in DWT).

There's an Java based OLAP server called Mondrian [2] that also may be worth your time. I ended up going with ActiveWarehouse and ActiveWarehouse-ETL [3], but I've felt some things a little more complicated than they needed to be (but that's another discussion—and I haven't kept up with their development lately, so it may have improved).

On a side note, recently I've started wondering if map-reduce style tools would make sense in this context, but I haven't explored using them that way yet, so I can't actually make a recommendation there.

[1] http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...

[2] http://mondrian.pentaho.org/

[3] http://activewarehouse.rubyforge.org/

alexandros
Cross-tabulation is not about denormalization. If anything, it is about producing the reports you want while keeping your data normalized.
AndrewO
It creates repeating groups (gender -> M,F; location -> Boston, Cagliari, etc; dept -> Development, Personel, etc)—to me, that looks like denormalization.

I'm not sure this is what you're objecting to, but whether or not the data is _stored_ denormalized (as opposed to just using the result set and throwing it away), is not really of consequence. I think we'll both agree that we can and should keep our transactional data normalized.

Either way, this method goes against the grain of what SQL is supposed to do (otherwise you could meta-program queries without having to resort to a Perl script to string this thing together). My critique still stands that these queries can grow to intractable sizes.

If one is going to be a lot of reporting like this, they should really look to separate their transactional and analytical databases.

But hey, like I said YMMV. In a simple case it works. But trust me when I say, after 5 years of data growth and requirements shifts, I wish I knew then the burden that it is now. :)

HN Books is an independent project and is not operated by Y Combinator or Amazon.com.
~ yaj@
;laksdfhjdhksalkfj more things
yahnd.com ~ Privacy Policy ~
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.