Hacker News Comments on "The architecture of Stack Overflow" www.dev-metal.com Video


www.dev-metal.com · 98 HN points · 0 HN comments

HN Theater has aggregated all Hacker News stories and comments that mention www.dev-metal.com's video "The architecture of Stack Overflow".

www.dev-metal.com Summary

One of the most interesting talks these weeks, and a rare insight into one of the most active pages on the web: Marco Cecconi of StackOverflow speaks about the general server architecture, why they don’t unit-test (!), how they release (5 times a day) and shows some awesome server load screenshots. It’s fascinating that they

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.

The architecture of Stack Overflow [video]

⬐

Jan 13, 2014 · 98 points, 48 comments · submitted by schmylan

⬐ esw
Here are the slides for anyone who's interested: https://speakerdeck.com/sklivvz/the-architecture-of-stackove...

⬐ carsongross
The most important thing, technically, is having great developers who ship.
For pith's sake, I want to say "Everything else is noise" but that isn't true. Everything else can help or hurt, depending on the application and how doctrinaire the application of a given approach/methodology is, the organizational knock-on effects (e.g. "Mr Tough Guy Testalot" holds up the release train or nukes your architecture to make it 'testable'), etc. But, seriously, "great developers who ship" is really what moves the needle.

⬐ UK-AL
Why not combine both?
Every feature/user story has to go through a workflow of Selected for Development -> UX Design (if required) -> Development -> Unit Tests (or the other way around) -> Staging -> Load Test -> Acceptance Test -> Production -> Analytics (to see if people actually use it) -> Learn from analytics -> back to start if required.
The goal is to get as many issues through the workflow as fast and rigorously (no shortcuts) as possible at a sustainable pace. Have a continuous flow of features rolling out through this process, ideally with continuous delivery to automate the majority of it.

⬐ DavidWoof
Everything else is noise. If you have great developers who ship, then by definition you don't have doctrinaire methodology or "Mr Tough Guy Testalot" (I generally find "Mr No Test" to be a much bigger problem anyway). You might have the situation where you have great devs but bad management, but that's next to impossible in the real world.
There's really only two steps to great software development.
1. Hire good developers.
2. Don't hire bad developers

⬐ WestCoastJustin
Having a great Ops staff also helps ;) Of note is Thomas Limoncelli, who wrote "The Practice of System and Network Administration" [1] and "Time Management for System Administrators" [2] and now works for Stack Exchange (formerly at Google). The Practice of System and Network Administration is basically the bible for most sysadmins, myself included.
ps. I only singled Thomas Limoncelli out as an example just to highlight the caliber of their Ops staff.
[1] http://www.amazon.com/Practice-System-Network-Administration...
[2] http://www.amazon.com/Management-System-Administrators-Thoma...

⬐ carsongross
Violently agree.

⬐ skeletonjelly
Vehemently? Or do you want to punch someone?

⬐ carsongross
Violently.
It's funnier.

⬐ skittles
He mentioned that they use the servicestack.text library. I've looked into servicestack recently (using the nuget packages), but then found the library to be pay-to-play. There's an older version (v3) that is BSD licensed that is being maintained. Do any of you have experience with it? I have grown tired of Microsoft pushing new solutions to the same problem (REST service with WCF and then Asp.net web api).

⬐ dan_b
ServiceStack is just plain awesome when it comes to developing web services, though it's gone commercial for v4 onward. Nancy is another popular alternative - it's basically Sinatra for .Net. Every time I go back to WCF I want to stab myself in the face.

⬐ sklivvz1971
We used it at the time I gave that talk, we don't anymore. We only used JSON serialization and we have rolled out our own free solution, Jil.
https://github.com/kevin-montrose/Jil

⬐ kmontrose
Technically we use Newtonsoft and Jil, Jil replacing Newtonsoft as we become increasingly confident in it.
I wouldn't suggest anyone use Jil in a production role unless you're at Stack Overflow. It's too untested at the moment, and the typical person can't get me on the horn to fix whatever just broke.

⬐ guiomie
Why would I use Jil over Newtonsoft?

⬐ JasonPunyon
You wouldn't right now (Kevin doesn't recommend it). But in the end you'll want to use it if JSON serialization is a performance bottleneck for you.

⬐ y0ghur7_xxx
I would love to know more about the Databases:
- Are they used for different things on the sites?
- Is data partitioned across tables?
- Are they all SQL Server instances?

⬐ zero1zero
I would like to know more about this as well.
It sounds like they are all SQL Server instances. However, he made it seem like they are reproducing the schema once per site? I.e., a separate database per site rather than sharding the shared data to multiple hosts per site. Did I hear this right in the question/answer portion?

⬐ kmontrose
Stack Exchange has one database per site, so Stack Overflow gets one, Super User gets one, Server Fault gets one, and so on. The schema for these is the same.
There are a few wrinkles. There is one "network wide" database which has things like login credentials, and aggregated data (mostly exposed through stackexchange.com user profiles, or APIs). Careers Stack Overflow, stackexchange.com, and Area 51 all have their own unique database schema.
All databases are MS SQL Server.
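A minimal sketch of the layout described above: one database per site sharing a schema, plus a single network-wide database. The table and column names (`Posts`, `Accounts`) and the helper functions are hypothetical, and in-memory SQLite stands in for SQL Server purely so the example runs anywhere. It is an illustration, not Stack Exchange's actual code.

```python
import sqlite3

# Hypothetical schemas; the real ones are not public.
SITE_SCHEMA = "CREATE TABLE Posts (Id INTEGER PRIMARY KEY, Title TEXT)"
NETWORK_SCHEMA = "CREATE TABLE Accounts (Id INTEGER PRIMARY KEY, Email TEXT)"


def build_databases(site_names):
    """Create one database per site (identical schema) plus one network-wide database."""
    sites = {}
    for name in site_names:
        conn = sqlite3.connect(":memory:")  # each connection is its own database
        conn.execute(SITE_SCHEMA)
        sites[name] = conn
    network = sqlite3.connect(":memory:")
    network.execute(NETWORK_SCHEMA)
    return sites, network


def table_names(conn):
    """List the tables in a database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def post_count(conn):
    """Count the posts stored in one site's database."""
    return conn.execute("SELECT COUNT(*) FROM Posts").fetchall()[0][0]


sites, network = build_databases(["stackoverflow", "superuser", "serverfault"])

# A write to one site's database is invisible to every other site.
sites["superuser"].execute(
    "INSERT INTO Posts (Title) VALUES ('How do I resize a partition?')"
)
```

Because every site database has the same schema, the same query code works against any of them; only the connection changes.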

⬐ avemg
How do you manage schema changes with release deployments across all of the databases that are meant to be standard?

⬐ sklivvz1971
All the schema changes are applied to all site databases at the same time. They need to be backwards compatible so, for example, if you need to rename a column (a worst-case scenario) it's a multi-step process: add a new column, add code which works with both columns, backfill the new column, change code so it works with the new column only, remove the old column.
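The five-step rename described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Users` table, the `Name` to `DisplayName` rename, and the function names are hypothetical, and SQLite is used in place of SQL Server so the sketch is runnable. The last step rebuilds the table only for portability across SQLite versions; on SQL Server it would typically be an `ALTER TABLE ... DROP COLUMN`.

```python
import sqlite3


def step1_add_column(conn):
    # Step 1: add the new column alongside the old one.
    conn.execute("ALTER TABLE Users ADD COLUMN DisplayName TEXT")


def read_name_both(conn, user_id):
    # Step 2: code that works with both columns, preferring the new one.
    rows = conn.execute(
        "SELECT COALESCE(DisplayName, Name) FROM Users WHERE Id = ?", (user_id,)
    ).fetchall()
    return rows[0][0]


def step3_backfill(conn):
    # Step 3: copy old values into the new column where it is still empty.
    conn.execute("UPDATE Users SET DisplayName = Name WHERE DisplayName IS NULL")


def read_name_new(conn, user_id):
    # Step 4: code that relies on the new column only.
    rows = conn.execute(
        "SELECT DisplayName FROM Users WHERE Id = ?", (user_id,)
    ).fetchall()
    return rows[0][0]


def step5_drop_old_column(conn):
    # Step 5: remove the old column (done here via a table rebuild).
    conn.execute("CREATE TABLE Users_new (Id INTEGER PRIMARY KEY, DisplayName TEXT)")
    conn.execute("INSERT INTO Users_new (Id, DisplayName) SELECT Id, DisplayName FROM Users")
    conn.execute("DROP TABLE Users")
    conn.execute("ALTER TABLE Users_new RENAME TO Users")


def column_names(conn):
    return [row[1] for row in conn.execute("PRAGMA table_info(Users)").fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("INSERT INTO Users (Id, Name) VALUES (1, 'alice')")

step1_add_column(conn)
# Old code may still be writing only the old column at this point.
conn.execute("INSERT INTO Users (Id, Name) VALUES (2, 'bob')")
name_during_migration = read_name_both(conn, 2)

step3_backfill(conn)
step5_drop_old_column(conn)
name_after_migration = read_name_new(conn, 2)
```

The point of the ordering is that at every intermediate step, both the previous and the next version of the application code can run against the schema, so code and schema deployments do not have to be perfectly synchronized.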

⬐ avemg
Thanks for the reply. We have a similar architecture where I work so this is interesting to me. A couple more questions if you don't mind:
- Do you use any tools for orchestrating the rollout of those schema changes or do you just have some homegrown scripts?
- Do you separate your schema versioning and deployment process from your application versioning and deployment process?
- How do you handle cases where backwards-compatibility is not possible? For example, a new application feature that depends on a brand new table.

⬐ schmylan
Before the title was moderated there was an important tidbit. StackOverflow doesn't unit-test. Fascinating.

⬐ schmylan
tldw; He says he doesn't advocate it but they get away with it by having the community test it out for them in their meta site. Then the community writes up the bugs.

⬐ merak136
He actually says " I'm not advocating that you shouldn't put in tests. [ The reason we can get away with this ] is that we have a great community. "
I take this to mean that he feels that StackOverflow doesn't need tests. Not that tests are useless.

⬐ kmontrose
That's an accurate read.
- Stack Exchange employee

⬐ BrandonY
User community as testers presents some interesting pros and cons.
Pros:
* Tests are self-updating. Add a new feature: tests come in for free. Change a feature: tests automatically update. Fail to document a change: tests fail.
* Tests are unusually thorough
* Eventually consistent testing. If nobody ever complains, it probably wasn't a bug worth fixing.
Cons:
* Tests cannot be run offline. Feature must be committed and deployed before tests can be run.
* Potentially large quantity of false positives (bad bug reports)
* Potentially large quantity of false negatives (nobody notices particular bug, release considered good)
* Does not work for non-user-visible features
So basically you trade the reliability of your tests for a substantial build/release speedup. Some users experience each bug, but they are the users who are actively using the meta-community and have signed up to experience more bugs. Still, lack of pre-release unit testing must radically increase the importance of VERY careful code reviews.
Not the decision I would have made, but definitely has the sorts of advantages that a small team of engineers would drool over.

⬐ sklivvz1971
Remember that our community writes bug reports but also vets bug reports. We rarely have to deal with bad reports. Interestingly, large quantities of false negatives are a non-issue.

⬐ welegan
Presumably the same reason why they don't have a ton of bad questions on stack overflow: their community scoring would apply just as much to bug reports

⬐ alexgartrell
Dear any Stack Overflow Developers,
Can you describe the network infrastructure in finer detail? Specifically what type of load balancer are you running?
And what's peak RPS? Where are your network peaks? (I'm guessing major peak US Pacific and minor US Atlantic?)

⬐ TacticalCoder
IIRC at first they had an entire Microsoft stack (I may be mistaken on that).
But nowadays, from what I've read here on HN by SE devs in other threads, they're using lots and lots and lots of Linux: HAProxy, Redis, Nagios, etc.
I just double-checked the slide and although I didn't notice it at first, you can see that 'HA Proxy' and 'Redis' are mentioned.
The core Q&A is in C#/MS-SQL so that's probably not going to move to Linux anytime soon.

⬐ dimension64
This might be a stackoverflow question, so what is a static code?

⬐ notastartup
is there an open source, self-hosted version of stack overflow that you can deploy on your own domain?

⬐ robzienert
Yes. http://meta.stackoverflow.com/questions/2267/stack-overflow-...

⬐ m_myers
To be clear: there is no version of the actual Stack Overflow code that is publicly available. There are, however, numerous open-source reimplementations of portions of the site code.
Also (as the video perhaps mentioned), the Stack Overflow developers have often been able to spin off pieces of the code as open-source libraries. See http://blog.stackoverflow.com/2012/02/stack-exchange-open-so...

⬐ dlazerka
I wouldn't trust Joel Spolsky's code expertise -- just look at Excel internals! Nevertheless, Stack Overflow is super cool. But that tells nothing about its architectural quality.

⬐ merak136
Some points that I find interesting:
[1] StackOverflow has VERY FEW tests. He says that StackOverflow doesn't use many unit tests because of their active community and heavy usage of static code.
[2] Most StackOverflow employees work remotely. This is very different than a lot of companies that are now trying to force employees back into an office.
[3] Heavy usage of Static classes and methods. His main argument is that this gives them better performance than a more standard OO approach.
[4] Caching even simple pages in order to avoid performance issues caused by garbage collection.
[5] They don't worry about making a "Square Wheel". If their developers can write something more lightweight than an already developed alternative, they do! This is very different from the normal mindset of " don't reinvent the wheel ".
[6] Always using multiple monitors. I love this. I feel like my productivity is nearly halved when I am working on one tiny screen.
Overall, I was surprised at how few of the "norms" that they follow. Either way, seems like it could be a pretty cool place to work.

⬐ tegeek
[7]. Millions of page views and just 25 servers for whole infrastructure, which includes everything including load balancing, cache, dbs' etc. etc.
Seems very, very optimized and cost-effective. It's brilliant.

⬐ vladimirralev
I see more and more static methods and classes last 2 years maybe. It's probably more about the stateless design and less side effects, but it definitely helps garbage collection if you avoid classes at session scope or smaller. In OOP there is another pattern that helps - object pools, but it's a lot of work to get it to work correctly and it's not as efficient.
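For readers unfamiliar with the object-pool pattern mentioned above, here is a minimal sketch. The `BufferPool` class and its methods are hypothetical names chosen for illustration, and the example is in Python rather than C#, so it shows the shape of the pattern rather than its effect on the .NET garbage collector.

```python
class BufferPool:
    """Reuses a fixed set of buffers instead of allocating one per request."""

    def __init__(self, size, buffer_bytes):
        self._buffer_bytes = buffer_bytes
        self._free = [bytearray(buffer_bytes) for _ in range(size)]

    def acquire(self):
        # Hand out a pooled buffer, or allocate a fresh one if the pool is empty.
        if self._free:
            return self._free.pop()
        return bytearray(self._buffer_bytes)

    def release(self, buf):
        # Clear leftover state before returning the buffer to the pool;
        # forgetting this is one of the classic ways pools go wrong.
        buf[:] = bytes(len(buf))
        self._free.append(buf)


pool = BufferPool(size=1, buffer_bytes=8)
first = pool.acquire()
first[0] = 255
pool.release(first)
second = pool.acquire()  # the same underlying object, zeroed again
```

The reset in `release` hints at why pools are "a lot of work to get right": every piece of per-use state has to be cleared by hand, which a fresh allocation gives you for free.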

⬐ edwinnathaniel
If most of your classes are small, I don't see why people have to resort to static methods.
If your methods are static, there is a tendency/lust to use static member variables (hence stateful), which will cause side effects.
Don't forget the following points too:
1) You still have pooled objects somewhere (stateless business logic classes like XYZServices, repository classes that may be backed by pooled DB connections and Transaction Managers) provided/managed by your Application Server or by 3rd-party framework (Spring does this).
2) Your Application Server tends to have beefy hardware, good enough not to care about GC hiccups.
There are other reasons to use static methods but I don't think they're strong enough in this case.

⬐ vladimirralev
Small classes are actually worse GC-wise, because they will fill up the GC graph with many small nodes as opposed to fewer large nodes, which are released in bulk with little fragmentation. Small and large nodes have the same GC overhead essentially. In general you want your objects to be large. When they are small, once the GC realizes what's going on (usually at some high threshold, 90% or so), it will have to run some O(n^x) graph reduction algorithm or defragmentation. Special tuning is required for such cases. Beefy hardware doesn't help in many cases due to locks. There are very few production-ready lock-less GCs.

⬐ edwinnathaniel
We're talking in the context of stateless Request <-> Response of the Web-Application nature here.
When a Request comes in, the App-Server will allocate (or use from the pool) a thread to serve that Request (in .NET/JVM world, Ruby/Python uses Processes unless you use different App Server).
If you create small objects within the scope of that Request (which usually lives inside a method) and those objects are contained and don't hold references to any long-lived objects, they will be GC-ed quickly (and potentially way quicker) once the method is finished.
The thread is GC-ed as well once it's finished (unless you wish to release it back to the 'unused' pool).
My feeling is that their use of static methods has nothing to do at all with GC.

⬐ barrkel
Object lifetime is a much more important factor than class size for most server request / response style processing.
Typically there are three lifetimes for objects in server processes. Those that are allocated around startup and are never deallocated; those that are allocated per-request and become garbage once the response goes out; and lifetimes that span multiple requests, like objects in caches.
The first are normally ultra-cheap to "collect": with a generational GC, you simply don't scan them at all, because they haven't changed.
The second group, per-request, are also fairly cheap to collect. Every so often, you GC the youngest generation, and you only need to keep track of references in registers and on the stack. Ideally many requests will have occurred between collections, and the only objects that get kept alive are objects that are in-flight for the current request. And this is why you need at least three generations; you really don't want to have to scan the oldest generation to collect these ephemeral objects after they've built up over a number of youngest generation collections.
It's the third group that kills you. You can save on the cost of scanning the whole heap, using write barriers to track new roots buried in the oldest generation; but that adds accounting costs, and eventually overtakes the cost of a whole heap GC. These guys can also cause the fragmentation you're worried about - they need to be compacted down, copied possibly multiple times. On the CLR, last time I checked, you need a full gen2 GC in order to get rid of them, as they've likely survived a gen1 collection.
With these guys, it's worthwhile doing the big object thing. In fact, it may be worthwhile not having any GC heap storage for them at all, and refer to them using different techniques, like ephemeral keys that look up in Redis, or native pointers stored in statically allocated arrays.
In app servers I've designed, I've never seen GC CPU usage over 5% or so, even with heavy usage of tiny short-lived objects. But you need to care about lifetime.

⬐ thedufer
> StackOverflow employees work from home.
Many do, but they have a fairly large office in NYC and a smaller one in London.

⬐ merak136
You are correct. I edited my post and actually found a good blog post on the subject.
http://blog.stackoverflow.com/2013/02/why-we-still-believe-i...

⬐ kmontrose
The Stack Overflow Q&A dev team has 2 people in New York, out of a team of 10. The Careers dev team is more New York heavy, 3 remote and 5 in New York. The sysadmin team is also quite remote, though I don't know the breakdown offhand.
I believe at this point most new technical hires are remote.
Our offices are mostly sales, Denver and London exclusively so.

⬐ thedufer
I saw that Jason went remote recently. Any particular reason so many devs are going remote? Is it people making individual decisions or the company providing new incentives to do so? My impression when you were at 55 was that most devs worked at the office (I've been at Fog Creek since a little before you guys moved. Hi!).

⬐ kmontrose
The most common reason for someone going remote (that I'm aware of) is starting a family. New York's great, but spacious it is not.
I can think of 3 devs who have gone remote, and 2 devs (including myself) who have moved to NYC since I've been here. Most people stay wherever they were hired. The only location-specific policy I'm aware of is a cost-of-living adjustment in NYC (though that may also apply to London/SF/etc., I don't honestly know).

⬐ jaydles
People making individual decisions. All else equal, we'd slightly prefer to have people in NYC, because we think the in-person time is a plus for the casual interaction that happens in between "getting things done". But we've set ourselves up to make real work and official team collaboration work almost entirely online. We've learned that the in-person benefit is more than outweighed by how much you get from being able to hire the best talent that loves the product anywhere, not just the ones willing to live in the city you happen to be in.

⬐ JasonPunyon
It's not that no one follows the norms or tests. On the Careers team we do much more automated testing because there's money and literally people's jobs at stake. We have unit tests, integration tests and UI tests that all run on every push. All the tests must succeed before a production build run is even possible.
