Hacker News Comments on The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling · 7 HN comments
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this book.

Check out the Data Warehouse Toolkit. You might want to look into natural vs surrogate keys. https://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimen...
⬐ squeaky-clean Found a PDF copy (just to browse, I promise!) and it looks great. Ordered a used copy, thanks. Edit: no instances of "candidate key" when I ctrl+F, though; still looks pretty good.
Yes, especially in business intelligence / data warehousing. Here is an excellent resource if you're interested in this sort of thing: https://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimen...
Dimensional modeling (I'm a fan of Kimball's approach) mitigates these problems quite well while still offering very flexible ad-hoc reporting. It works great on a row-based RDBMS, even better on columnar.
http://en.wikipedia.org/wiki/Dimensional_modeling http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...
Redshift is indeed a solid product but all these comparisons against Hive are surprising, as that's not the right tool in the first place. Infobright, greenplum, aster, vertica, etc are the products which Redshift seeks to disrupt.
⬐ jacques_chester I realised a few years ago that pretty much every database course only teaches OLTP. OLAP never really gets a look-in. At my university, standard normalisation was taught in the "databases" course. OLAP was mentioned as part of the "advanced databases" course.
The database course at that time blew about half its time on building PHP applications to talk to the database. I hate to second guess my professors, but I can't help but feel that a more productive use of the time would have been to teach normalised OLTP in the first half, and dimensionally modelled OLAP in the second half. Better yet, to divide them into two courses and spend some time talking about database history ("here's why network and hierarchical databases sucked") and maybe some introduction to how query planners work.
⬐ eru Do you know of any resource that covers the same topic as your comment, but in more detail?
⬐ jacques_chester I'm not sure I follow you.
⬐ wisty To be more stereotypical: I am intrigued and would like to sign up to your newsletter.
⬐ jacques_chester Well ... really, just read a good pair of textbooks from each side of the spectrum. Date's Databases and Kimball's The Data Warehouse Toolkit are good. Edit: actually, maybe not Date. It's up to you. It's good, but it's controversial because he's not a fan of SQL and so he uses his own language.
The one I used in uni was Ramakrishnan & Gehrke's Database Management Systems. It was OK, but there's a certain amount of at-the-time-trendy bullshit that, to me, detracts from a focus on relational databases for their own sake.
Edit 2: and Joe Celko's SQL for Smarties contains good oil on the relational paradigm.
Philip Greenspun's SQL tutorial is a nice starting point: http://philip.greenspun.com/sql/
It's a bit old, but still pretty good.
I would suggest reading some books on the topic of Dimensional Modeling [1] such as "The Data Warehouse Toolkit" [2]. The critical thing you need to expose to your users is the ability to ask for things which make sense in their world but are actually really difficult for even an engineer to code. Things like: "Show me average 9am-12pm sales on Mondays, Wednesdays and Fridays for 1st quarter, 2012"
[1] http://en.wikipedia.org/wiki/Dimensional_modeling [2] http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
⬐ alex_anglin Speaking as someone who does his fair share of dimensional modeling, I would just point out that the example you cite would only involve two tables in a well-designed dimensional model (a sales fact and a time/date dimension, I reckon). The challenge is in getting to that point. To speak to the OP's point about difficulty in querying data warehouses, most business intelligence tools that I'm aware of provide semantic layer[1]-type capabilities, whereby the user interface of the tool is presented in the language of the business domain. Nevertheless, I agree that this is still difficult work, unfortunately. That it is getting more complicated in some respects, such as through unstructured data, doesn't help either.
⬐ meritt I guess I wasn't clear enough if it came across like my example was complex. It's one easily solved via DM, and one that's extremely hard to execute in most non-dimensionally-modeled setups. That's exactly why I'm a huge advocate of DM instead of just throwing a ton of servers, Hadoop & MR at everything.
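The "average 9am-12pm sales on Mondays, Wednesdays and Fridays" question quoted above is easy to picture as a two-table star-schema query. Here is a minimal, self-contained sketch using Python's sqlite3; all table and column names are invented, and for brevity the date and time-of-day attributes are collapsed into a single dimension (Kimball typically splits them):

```python
import sqlite3

# Hypothetical two-table star schema: a sales fact plus a date/time
# dimension whose attributes (day name, quarter, hour) are precomputed
# at load time, so the business question becomes a plain filter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key   INTEGER PRIMARY KEY,
    day_name   TEXT,      -- 'Monday', 'Wednesday', ...
    quarter    INTEGER,   -- 1..4
    year       INTEGER,
    hour       INTEGER    -- 0..23
);
CREATE TABLE sales_fact (
    date_key   INTEGER REFERENCES date_dim(date_key),
    amount     REAL
);
""")
conn.executemany("INSERT INTO date_dim VALUES (?,?,?,?,?)", [
    (1, "Monday",    1, 2012, 9),
    (2, "Wednesday", 1, 2012, 11),
    (3, "Saturday",  1, 2012, 10),   # excluded: wrong day
    (4, "Monday",    2, 2012, 9),    # excluded: wrong quarter
    (5, "Monday",    1, 2012, 14),   # excluded: outside 9am-12pm
])
conn.executemany("INSERT INTO sales_fact VALUES (?,?)", [
    (1, 100.0), (2, 200.0), (3, 999.0), (4, 999.0), (5, 999.0),
])

# "Average 9am-12pm sales on Mondays, Wednesdays and Fridays, Q1 2012"
avg = conn.execute("""
    SELECT AVG(f.amount)
    FROM sales_fact f
    JOIN date_dim d ON d.date_key = f.date_key
    WHERE d.day_name IN ('Monday', 'Wednesday', 'Friday')
      AND d.quarter = 1 AND d.year = 2012
      AND d.hour BETWEEN 9 AND 11
""").fetchone()[0]
print(avg)  # 150.0
```

The hard part, as the replies note, is not this query; it is the ETL work that gets attributes like `day_name` and `quarter` precomputed into the dimension in the first place.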
If you are interested in Data Warehousing, you should read Ralph Kimball's "The Data Warehouse Toolkit": http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...
When I started learning about BI (Business Intelligence), a few members of the Pentaho community advised me to read this book. I'm glad I did. Kimball is one of the "fathers" of data warehousing, and his book had a lot of great insights for dimensional modeling. It helped me avoid many design mistakes while building my DWH, and gave me insight that might otherwise have taken me years to discover.
It's a "theoretical" book, in the sense that it does not focus on any specific technology; it's also a "practical book", because he uses real-world scenarios (inventory management, e-commerce, CRM...) to demonstrate the various dimensional modeling techniques. I also liked the part about BI project management and encouraging BI in a company (= how to engage users and how to "sell" a BI project to management).
He also has a newsletter with many DWH design tips (archives here: http://www.kimballgroup.com/html/07dt.html).
I am currently doing reports/data warehouses/dashboards. When something more complicated than simple questions is needed (see Data Warehousing for Cavemen), ad-hoc queries are quite often not the answer anymore, either with NoSQL or with SQL. I don't want my clients to be dependent on me (or someone else) to build complicated SQL queries when they have questions, so I focus on getting an easy-to-maintain facts/dimensions model (as advocated by Ralph Kimball: http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...) which can evolve if needed.
The nice point about MongoDB when doing this is that it makes it a lot easier to add attributes to dimensions, or load the data, or evolve the reporting system in general (and I like that).
You can apply the same principles to build dimensions/facts based data structure and answer questions that SQL alone wouldn't be able to answer easily.
Example of such a question: how many calls did we receive during French legal week #9 that were handled by team X outside of normal working hours or while we were on vacation? Of those calls, how many were placed by a woman (as that has a financial impact in this case)?
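The call-center question above becomes a simple filter once the awkward attributes are precomputed into dimensions. Here is a tiny in-memory sketch of that idea in Python; every name here (the dimension keys, the `legal_week`/`working_hours`/`vacation` flags, team "X") is invented for illustration, not taken from the commenter's actual system:

```python
# Hypothetical dimensional layout: fact rows reference dimension rows,
# and attributes that are hard to compute in a raw query (legal week
# number, working-hours flag, vacation flag) are derived at load time.
calendar_dim = {  # date_key -> precomputed calendar attributes
    10: {"legal_week": 9, "working_hours": False, "vacation": False},
    11: {"legal_week": 9, "working_hours": True,  "vacation": False},
    12: {"legal_week": 9, "working_hours": True,  "vacation": True},
    13: {"legal_week": 8, "working_hours": False, "vacation": False},
}
caller_dim = {100: {"gender": "F"}, 101: {"gender": "M"}}

call_facts = [  # one row per received call
    {"date_key": 10, "caller_key": 100, "team": "X"},
    {"date_key": 11, "caller_key": 101, "team": "X"},  # working hours: excluded
    {"date_key": 12, "caller_key": 100, "team": "X"},  # vacation day: included
    {"date_key": 13, "caller_key": 100, "team": "X"},  # wrong week: excluded
    {"date_key": 10, "caller_key": 101, "team": "Y"},  # wrong team: excluded
]

def matches(fact):
    """Team X, legal week 9, outside working hours OR during vacation."""
    cal = calendar_dim[fact["date_key"]]
    return (fact["team"] == "X" and cal["legal_week"] == 9
            and (not cal["working_hours"] or cal["vacation"]))

hits = [f for f in call_facts if matches(f)]
women = [f for f in hits if caller_dim[f["caller_key"]]["gender"] == "F"]
print(len(hits), len(women))  # 2 2
```

The same shape maps directly onto MongoDB documents or a SQL star schema; the point is that the business logic lives in the load step, not in each ad-hoc query.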
This method has been the bane of my existence for going on 5 years now. YMMV, of course, but here's my experience: Back in 2004, we decided to denormalize a set of records containing year, quarter, and 15 or so data points going back to 1996. Using this cross-tab method, we started out with 8 (years) * 4 (quarters/year) * 15 (data points) = 480 columns (plus a few more for non-denormalized data). 5 years later, we're up to the high 700s. Needless to say, this pretty much blows out your standard MySQL admin tools.
Another pitfall is that you really have to make sure your data is pristine (of course you should be doing that anyway, but reality is usually far from ideal, especially if you're me in 2004), otherwise you end up with doubled values, or values included when they shouldn't be. The generated queries are very, very difficult to debug, again blowing out many admin tools and also text editors (unless you add pretty formatting in your query generator).
So what would I recommend instead? There are multi-dimensional databases or ETL tools that will properly put your transactional data in an analytical style data warehouse. Read Kimball's Data Warehouse Toolkit [1] (ignore the kinda ratty paper quality—the content is good). The key thing is to denormalize down to a star schema, but keep your facts and dimensions separate (things you'll learn in DWT).
There's a Java-based OLAP server called Mondrian [2] that also may be worth your time. I ended up going with ActiveWarehouse and ActiveWarehouse-ETL [3], but I've felt some things were a little more complicated than they needed to be (but that's another discussion, and I haven't kept up with their development lately, so it may have improved).
On a side note, recently I've started wondering if map-reduce style tools would make sense in this context, but I haven't explored using them that way yet, so I can't actually make a recommendation there.
[1] http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...
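To make the contrast with the 700-plus-column table concrete, here is a minimal sketch (with invented table and column names) of the narrow fact-plus-dimension layout recommended above. New quarters arrive as new rows rather than new columns, and any cross-tab is produced at query time instead of being stored:

```python
import sqlite3

# Star-schema alternative to a wide cross-tab table: one row per
# (date, measure) observation, keyed to a small date dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY,
                       year INTEGER, quarter INTEGER);
CREATE TABLE fact     (date_key INTEGER REFERENCES date_dim(date_key),
                       measure TEXT, value REAL);
""")
conn.executemany("INSERT INTO date_dim VALUES (?,?,?)",
                 [(1, 1996, 1), (2, 1996, 2), (3, 2004, 1)])
conn.executemany("INSERT INTO fact VALUES (?,?,?)", [
    (1, "revenue", 10.0), (2, "revenue", 12.0), (3, "revenue", 50.0),
])

# The cross-tab view is computed when asked for, not stored:
rows = conn.execute("""
    SELECT d.year, d.quarter, SUM(f.value)
    FROM fact f JOIN date_dim d ON d.date_key = f.date_key
    WHERE f.measure = 'revenue'
    GROUP BY d.year, d.quarter
    ORDER BY d.year, d.quarter
""").fetchall()
print(rows)  # [(1996, 1, 10.0), (1996, 2, 12.0), (2004, 1, 50.0)]
```

Adding a 16th measure or a new quarter here is an INSERT, not a schema migration, which is exactly the maintenance burden the comment describes escaping.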
⬐ alexandros Cross-tabulation is not about denormalization. If anything, it is about producing the reports you want while keeping your data normalized.
⬐ AndrewO It creates repeating groups (gender -> M, F; location -> Boston, Cagliari, etc.; dept -> Development, Personnel, etc.), and to me, that looks like denormalization. I'm not sure this is what you're objecting to, but whether or not the data is _stored_ denormalized (as opposed to just using the result set and throwing it away) is not really of consequence. I think we'll both agree that we can and should keep our transactional data normalized.
Either way, this method goes against the grain of what SQL is supposed to do (otherwise you could meta-program queries without having to resort to a Perl script to string this thing together). My critique still stands that these queries can grow to intractable sizes.
If one is going to be doing a lot of reporting like this, they should really look to separate their transactional and analytical databases.
But hey, like I said, YMMV. In a simple case it works. But trust me when I say: after 5 years of data growth and requirements shifts, I wish I had known then the burden it would become. :)