HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
"Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson

Strange Loop Conference · YouTube · 65 HN points · 29 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention Strange Loop Conference's video "Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson.
YouTube Summary
Debugging highly concurrent distributed systems in a noisy network environment is an exceptionally challenging endeavor. On the one hand, evaluating all possible orders in which program events can occur is a task ill-suited to human cognition, rendering a pure analytic understanding of the control flow of such a system beyond the reach of any individual programmer. On the other hand, a more “empirical” approach to the task is also fraught with difficulty, as the dependence of severe bugs on precise timings or transient network conditions makes every part of the debugging cycle – from bug replication to verification of a fix – a Sisyphean labor bordering on the impossible.

One approach which has been developed to ameliorate this situation is that of deterministic simulation, wherein the hardware components of the system – including hard disks, network links, and the machines themselves – are replaced in testing with software which fulfills the contracts of those systems, but whose state is completely transparent to the developer. This enables the simulation of a wide diversity of failure modes including network failures, disk failures or space exhaustion, unexpected machine shutdown or reboot, IP address changes, and even entire datacenter failures. Moreover, once a particular pattern of failures has been identified which uncovers a bug, the determinism property of the simulation means that the exact same series of events can be replayed an indefinite number of times, greatly facilitating the debugging process, and providing confidence when a bug has been fixed.
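To make that replay property concrete, here is a minimal sketch of a deterministic discrete-event simulator in Python. This is purely illustrative (FoundationDB's real framework is built in Flow, its C++ actor dialect); the point is that every source of nondeterminism, including network latency and packet loss, is drawn from a single seeded RNG:

```python
import heapq
import random

class Simulation:
    """Toy deterministic discrete-event simulator. All nondeterminism
    comes from one seeded RNG, so a given seed always produces the
    exact same event history."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # the ONLY source of randomness
        self.now = 0.0
        self.seq = 0                    # tie-breaker: keeps event order total
        self.queue = []                 # min-heap of (time, seq, callback)

    def schedule(self, delay, callback):
        heapq.heappush(self.queue, (self.now + delay, self.seq, callback))
        self.seq += 1

    def send(self, node, message):
        # "Network" with simulated latency and a 1% chance of loss.
        if self.rng.random() < 0.01:
            return  # dropped packet
        self.schedule(self.rng.uniform(0.001, 0.05),
                      lambda: node.receive(message, self))

    def run(self, until=60.0):
        while self.queue and self.queue[0][0] < until:
            self.now, _, callback = heapq.heappop(self.queue)
            callback()
```

Since the same seed replays bit-for-bit, a failure found overnight by a random seed can be rerun under a debugger as many times as needed, which is exactly the property the summary describes. (`node.receive` is a hypothetical interface that simulated processes would implement.)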

Attendees of this talk will gain an understanding of the benefits, drawbacks, and tradeoffs involved in implementing a deterministic simulation framework, with frequent reference both to theory and to real-world engineering experience gleaned from applying this method to a complex distributed system. Attendees will also learn about language features which aid in the development of such a framework.

Will Wilson
FoundationDB

Will Wilson works on the engineering team at FoundationDB (https://foundationdb.com). Will started his career in biotechnology, leading a successful R&D effort in spinal cord injury diagnostics, currently undergoing commercialization by a company he co-founded. Since then, Will has worked in a variety of technical and business roles at data science and data virtualization startups. Will has a degree in math and philosophy from Yale.

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
I'm constantly surprised more folks don't use FoundationDB. I'm pretty sure the Jepsen folks said something to the effect that the way FoundationDB is tested goes far beyond what Jepsen does (good talk on FDB testing: https://www.youtube.com/watch?v=4fFDFbi3toc).

My read is that most use cases just need something that works _enough_ at scale that the product doesn't fall over and any issues introduced by such bugs can be addressed manually (i.e. through customer support, or just sufficient ad-hoc error handling). Couple that with the investment some of these databases have put into onboarding and developer-acquisition, and you have something that can be quite compelling even compared to something which is fundamentally more correct.

staticassertion
Having looked at FoundationDB a bit it wasn't clear why I would choose it. It has transactions, which is nice, but not that big of a deal despite how much time they put into talking about it. I actually don't even need transactions since all of my writes commute, so it's particularly uninteresting to me.

They say they're fast, but I didn't find a ton of information about that.

Ultimately the sell seemed to be "nosql with transactions" and I just couldn't justify putting more time into it. I did watch their excellent talk on testing, and I respect that they've put that level of effort into it, and it was why I even considered it, but yeah, what am I missing?

jwr
As someone who is switching to FoundationDB: because it's not easy. It doesn't look like other databases, it isn't in fashion (yes, these things matter), and it requires thinking and adapting your application to really use it to its full potential. It could also benefit from a bit more developer marketing.

But it's the best thing out there.

AHA! I should have known that generation-reference business was intended for games all along.

Reminds me of the time the FoundationDB team wrote a compiler to write a simulator to test their database: https://www.youtube.com/watch?v=4fFDFbi3toc

I was watching the old 2014 video: https://youtu.be/4fFDFbi3toc?t=1081

I've included the timestamp of the relevant bit. He is talking about swizzling, which is like clogging++.

He says they take a subset of the network connections and have them time out, but then bring them all back in reverse order, and that this somehow exposes far more bugs than other methods.

This seems like a very interesting "well, that's funny!" moment that might lead to understanding something foundational about how distributed systems fail.

Does anyone have more insight on this?
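For what it's worth, here is a guess at what "swizzling" looks like mechanically, reusing the style of the toy simulator sketched earlier in this page. This is only an illustration of the talk's description, not FoundationDB's code; `conn.clog`/`conn.unclog` are invented names for stopping and restoring a simulated connection:

```python
import random

def swizzle(sim, connections, rng: random.Random):
    # Pick a random subset of connections, clog them one at a time,
    # then bring them back in REVERSE order. Per the talk, this
    # ordering exposes far more bugs than clogging/unclogging randomly.
    victims = [c for c in connections if rng.random() < 0.5]
    t = 0.0
    for conn in victims:
        t += rng.uniform(0.0, 1.0)
        sim.schedule(t, conn.clog)
    for conn in reversed(victims):
        t += rng.uniform(0.0, 1.0)
        sim.schedule(t, conn.unclog)
```

Why the reverse ordering is so much more effective than random recovery is exactly the open question being asked here.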

hinkley
That sounds reasonable to me. In a cascading failure where network congestion ramps up (e.g., old-school NFS), the machines with the worst network bandwidth or latency would time out first, and be the last to recover.

In a heterogeneous cluster, I might expect the fastest machines to die last, and either a human operator concentrating on getting the most capacity back online first, or just automation getting the fastest one online first, since a faster machine can go from hard down to green health check sooner than the slow machine. Not precisely reverse chronological order, but certainly with large statistical clusters.

And on top of that, exponential back-off would cause the same phenomenon.

But if your question is "why more bugs that way?" your guess is as good as mine. Mine is 'consensus is hard.'

The FoundationDB team took a different approach, one I found pretty interesting. They created a fully deterministic (single-threaded) simulation of a distributed database, and then use that simulation to test their implementation of the database under different scenarios. They'd probably be interested in something like what you describe, as the bulk of their work seems to be rooting out and squashing bugs caused by factors out of their control (dropped connections, etc.)

https://www.youtube.com/watch?v=4fFDFbi3toc

Yeah, it's their own language on top of C++ to help them test distributed systems with deterministic simulation.

Their talk from a while ago about it was something that really blew me away at the time [0]

[0] https://www.youtube.com/watch?v=4fFDFbi3toc

Here's another excellent talk at Strangeloop on FoundationDB's simulation testing by Will Wilson in 2014: https://www.youtube.com/watch?v=4fFDFbi3toc
Jun 07, 2021 · 1 point, 0 comments · submitted by rubyn00bie
Neat to learn about thread sanitizer. It sounds similar to another tool from Microsoft Research called Torch (https://www.microsoft.com/en-us/research/project/torch/) which automatically instruments binaries to detect data races.

Coyote is similar in some ways but different in others. Coyote serializes the execution of the entire program (running one task at a time), exploring one set of interleavings and then rewinding, and then exploring another set of interleavings, hoping to hit hard-to-find safety and liveness bugs. In addition to finding concurrency bugs in one isolated process, we use it to find bugs in our distributed system by effectively running our entire distributed system in one process and having Coyote explore the various states our system can be in.

It sounded mind-bogglingly cool when I first came across this way of testing distributed systems through FoundationDB (https://www.youtube.com/watch?v=4fFDFbi3toc); we're emulating this kind of testing in our distributed system through Coyote. And unlike FoundationDB, which had to develop their own variant of C++ to be able to do this kind of testing (kudos to them for doing it), Coyote allows us to do it on regular C# programs written using async/await asynchrony and benefit from decades of Microsoft Research in exploring large state spaces effectively.
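As a toy illustration of that serialize-rewind-explore loop (the concept only, not Coyote's actual API): split each "task" into atomic steps, then deterministically replay every interleaving and check an invariant after each one.

```python
import itertools

def run_schedule(schedule):
    """Replay one interleaving of two 'tasks', each of which does a
    non-atomic increment (read, then write) of shared state."""
    shared = {"x": 0}
    local = {}

    def read(task):
        local[task] = shared["x"]

    def write(task):
        shared["x"] = local[task] + 1

    steps = {"A": [read, write], "B": [read, write]}
    pc = {"A": 0, "B": 0}
    for task in schedule:
        steps[task][pc[task]](task)
        pc[task] += 1
    return shared["x"]

# Explore every interleaving deterministically, one at a time, and
# check the invariant "two increments yield x == 2".
for schedule in sorted(set(itertools.permutations("AABB"))):
    if run_schedule(schedule) != 2:
        print("invariant violated by schedule", "".join(schedule))
```

The classic lost-update schedules (both tasks read before either writes) fail the check, and because each schedule is just a list, every failure replays exactly.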
Check out FoundationDB's approach:

https://www.youtube.com/watch?v=4fFDFbi3toc

epdlxjmonad
Using a single thread to simulate everything is cool (as stated in my previous comment on FoundationDB at https://news.ycombinator.com/item?id=22382066). Especially if the overhead of building the test framework is small.

In our case, we use a single process to run everything, in conjunction with strong invariants written into the code. When an invariant is violated, we analyze the log to figure out the sequence of events that triggers the violation. As the execution is multi-threaded, the analysis can take a while. If the execution were single-threaded, the analysis would be much easier. In practice, the analysis is usually quick because the log tells us what is going on in each thread.

So, I guess there is a trade-off between the overhead of building the test framework and the extra effort in log analysis.
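A minimal sketch of that style in Python, with invented names and a toy scenario: the invariant lives in the code, and the thread-tagged log is what you analyze after a violation.

```python
import logging
import threading

logging.basicConfig(
    format="%(relativeCreated)6dms %(threadName)s %(message)s",
    level=logging.INFO)
log = logging.getLogger("harness")

balance = 100
lock = threading.Lock()

def check_invariant(cond, what):
    # On a violation, fail loudly: the thread-tagged log accumulated
    # up to this point is the trace analyzed afterwards.
    if not cond:
        log.error("INVARIANT VIOLATED: %s", what)
        raise AssertionError(what)

def withdraw(amount):
    global balance
    with lock:
        log.info("withdraw %d (balance=%d)", amount, balance)
        balance -= amount
        check_invariant(balance >= 0, "balance must never go negative")

threads = [threading.Thread(target=withdraw, args=(60,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Here the second withdrawal trips the invariant, and the log shows which thread did what, in order.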

Mar 27, 2020 · 3 points, 0 comments · submitted by rubyn00bie
There is a great talk about this testing; it goes all the way down to deterministic seeds and other basics. The level of determinism is admirable. https://youtu.be/4fFDFbi3toc
Their development velocity might have been due to their management or internal culture. You can get most of those results faster and on a budget. The best evidence was the startup behind FoundationDB continually improving their product in a highly competitive space while doing QA like this:

https://www.youtube.com/watch?v=4fFDFbi3toc

Generally, on top of code reviews, one can get a high return with minimal labor and hardware from a combo of Design-by-Contract, contract/spec/property-based testing, low-false-positive static analysis, and fuzzing with the contracts left in as runtime checks. That's my default recommendation.
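A small sketch of that combo in Python using the Hypothesis property-based testing library, with the postconditions written as runtime checks that the generated inputs then hammer (the function under test is a made-up example):

```python
from hypothesis import given, strategies as st

def dedupe_sorted(items):
    """Return the distinct items in increasing order. The contracts are
    left in as plain runtime checks, so any fuzzer or property-based
    test that drives this function also exercises them."""
    out = sorted(set(items))
    assert all(a < b for a, b in zip(out, out[1:])), "strictly increasing"
    assert set(out) == set(items), "no elements gained or lost"
    return out

@given(st.lists(st.integers()))
def test_dedupe_sorted(items):
    # Hypothesis generates inputs and shrinks any counterexample to a
    # minimal failing case.
    dedupe_sorted(items)
```

Running it under pytest (or calling test_dedupe_sorted() directly) generates hundreds of inputs and shrinks any failure to a minimal counterexample.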

In Github's case, they also might have access to both closed and open tools that Microsoft Research makes. MSR is awesome if you've never checked them out. Two examples applicable to system reliability and security:

https://www.microsoft.com/en-us/research/project/ironclad/

https://www.microsoft.com/en-us/research/publication/p-safe-...

Plus some of their other tools in various states of usability:

https://rise4fun.com/

I remember seeing the FoundationDB distributed systems testing video for the first time, and being blown away by what it takes to build robust software. Worth a view if you haven't seen it. https://www.youtube.com/watch?v=4fFDFbi3toc

Would love to see more things in this direction, but, I agree that the market doesn't want it. Most users will gladly accept an infrequent bug for an earlier release, or lower cost version of a product.

FWIW I think that DOES happen, but it happens on the "wrong" computers!

Google has at least tens of thousands of cores running builds and tests 24/7. And they're utilized to the hilt. Travis and other continuous build services do essentially the same thing, although I don't know how many cores they have running.

From a larger perspective, Github does significant things "in the background" to make me more productive, like producing statistics about all the projects and making them searchable. (Admittedly, it could do MUCH more.)

I think part of the problem is that it's cheaper to use "the cloud" than to figure out how to use the developer's own machine! There is a lot of heterogeneity in developer machines, and all the system administration overhead isn't worth it. And there's also networking latency.

So it's easiest to just use a homogeneous cloud like Google's data centers or AWS.

There's also stuff like https://github.com/google/oss-fuzz which improves productivity. I do think that most software will be tested 24/7 by computers in the future.

Foundation DB already does this:

"Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson https://www.youtube.com/watch?v=4fFDFbi3toc&t=2s

Autonomous Testing and the Future of Software Development - Will Wilson https://www.youtube.com/watch?v=fFSPwJFXVlw

They have sort of an "adversarial" development model where the tests are "expected" to find bugs that programmers introduce. So it's basically like having an oracle you can consult for correctness, which is a productivity aid. Not exactly, but that would be the ideal.

abecedarius
Neat -- I haven't worked at Google but wondered what they might be doing on this score.

There's probably lots of potential for modern AI to hack at programmer productivity more directly. Machine learning so far has been more of a complement than a substitute, but I'm imagining a workflow where a lot of the time you're writing tests/types/contracts/laws and letting your assistant draft the code to satisfy them. You write a test, when you're done you see there's a new function ready for you coded to satisfy a previous test, you take a look and maybe go "Oh, this is missing a case" and mark it incomplete and add another test to fill it out.

Maybe in the sci-fi future programming looks more like strategic guidance; nearer term perhaps we might see 500 cores going full blast to speed up your coding work by 20% on average. Or maybe not! But it's one idea.

Mar 06, 2019 · drej on FaunaDB 2.5.4
Oh I do appreciate it - I've read Martin Kleppmann's book cover to cover and then watched all of the speeches Kyle Kingsbury has given in the past three years or so. I love this area, my absolute favourite is deterministic testing of FoundationDB [0].

It's because I appreciate this work that I felt the blog post didn't do it justice. And I know Jepsen hardly ever passes (ZooKeeper, I believe, did). And I don't take FaunaDB's hard work for granted.

[0] https://www.youtube.com/watch?v=4fFDFbi3toc

Jan 14, 2019 · wikibob on FoundationDB Record Layer
Is this the video you mention?

https://youtu.be/4fFDFbi3toc

rubyn00bie
Yes it is! Sorry for not linking it and thank you for doing so!
Soooo stoked. This should make futzing with it to get started a lot easier.

I've been thinking about going to the FoundationDB summit to hopefully learn more... Does anyone have any good resources for learning about it? I've tried a few times but find the documentation a bit hard to grok. Maybe I'm just slow.

Also if you haven't seen it, this is one of the most amazing talks on distributed systems (because it's about testing them) I've ever seen (by Will Wilson from FoundationDB pre-Apple merger): https://www.youtube.com/watch?v=4fFDFbi3toc

Truly mind blowing how good FoundationDB is supposed to be for its use cases, and this Mongo layer is just the business when it comes to helping adoption.

andrewflnr
That talk was glorious. "We want to build a database, but first we need to build a simulator for the database, but first we need to write a compiler to help us write the simulator." It would be self-parodying engineering hubris if it hadn't worked.
>>> But for the Web site for my startup, early on it was clear that I needed a good key-value store as a Web user session state store (server). So, I designed one, coded it up, and have been using it. It seems to work fine.

Well, depending on what error scenarios you run into, your code will work until it actually hits those scenarios.

You can get a few ideas of what could go wrong from this video:

https://www.youtube.com/watch?v=4fFDFbi3toc

graycat
Thanks for the video.

Yup.

So far the server side of my Web site has servers of 6 types -- Web server, session state store (server), log server, SQL Server, and two core servers for the real work, doing some applied math to get the users the good stuff that will make them happy. Soon there will also be an ad server for a total of 7.

As in the sense of the video, that sounds like a non-deterministic hell to debug. Ah, in this case, I don't think the debugging is so difficult. Mostly the reason is that so far all the interactions are, conceptually and in the architecture, quite simple. E.g., the code seems to run fine, and I never encountered any significant problems getting the code, first version and first production version, running.

But I do intend to add to the code lots more writes to the log server and be able to turn on/off those writes while the system is live. Then I could write some code to look at the log file and for each user session check all the traffic and work.

Yes, things could fail. I do have a lot of reasonable value checks in the code and, thus, could get some diagnostics then.

Eventually I'll want to implement some fault tolerance: e.g., if a log server becomes unresponsive, then switch to another one -- there may be several anyway due to load and the desire for redundancy. Similarly for the session state store -- a user might lose their session, but that would get logged, and the Web server would then start to use a different session state server.

So far my site is simple enough in purpose, work, architecture, etc. that some reasonably careful thinking, documentation, coding, and testing should continue to be okay.

But, sure, if the site gets to 1000 users a second, I'll have to rethink the architecture in big ways.

Then I'll try to keep using the "keep it simple, stupid" KISS principle and be willing to waste a factor of 2 to 10 or so in checking and being sure and being able to do the right things for some failure scenarios. Uh, my Web site is not for banking, exchanging crypto tokens, etc.!!!

Thanks, especially for the video.

Right, for non-deterministic systems, can be tough to duplicate a problem in order to analyze it! So, tough, in practice, try never to have to confront the non-determinism directly -- i.e., have enough tracing in the logs to have, in effect, converted the work to something deterministic, the randomness removed.

So, first cut lessons, try to avoid non-deterministic designs; if have one, then try to keep it dirt simple and likely relatively isolated from everything else; and if can't do all that, then during development and testing, at various levels, have lots of scaffolding and for production put in lots of checks, redundancy, thought, system management, monitoring, reporting, and analysis infrastructure, etc.

E.g., I did and published some high end work in anomaly detection for real time monitoring to detect problems never seen before, and I intend to use that work when appropriate. The intuitive idea is that if the system is not humming like it usually does and should, then I'll discover that right away and, then, be able to zoom down a tree of partitions of the systems to localize the problem -- hopefully in a usefully large fraction of such cases. My work is good because it is multi-dimensional, distribution-free, has adjustable, known in advance, exact control over false alarm rate, and in a useful sense is the most powerful such monitoring possible from any means of processing the data.

No way should ever have to see that the system occasionally gets sick and just look at the system, in racks in 100,000 square feet of floor space, and start to guess. Instead should have a LOT of darned good additional data from the various over-designed internal checks, logs, redundancies, etc. Really, the system should come close to just making it obvious about where the problem is.

No joke: In 100,000 square feet of racks of humming servers, want some darned good health and wellness monitoring that, at least in the design, architecture, development, testing, and first months of production operation, appears to be able to localize problems.

Suggesting that such work is important; not suggesting that it's easy or will get it good enough right away. Am suggesting that should take the problem seriously and try. For some important systems, may spend several times as much computing checking and monitoring as doing the actual production work.

But no way do I ever want to have to know there is something wrong and have to be forced just to look at 100,000 square feet of servers and start with Ethernet diagnostic hardware, reading files, etc. I hope never have to consider a main memory dump.

My first development computer got flaky -- I became able just to smell flaky. It really stinks!

This talk about FoundationDB was brought up recently, and it's pretty amazing. I recommend watching the whole thing, but to be brief, they are taming the "infinite interleavings" problem through determinism.

"Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson

https://www.youtube.com/watch?v=4fFDFbi3toc

They wrote an interesting Actor DSL that compiles to C++ and is completely deterministic, and they torture this deterministic engine with generated test cases on a cluster every night.

I guess you could say that the whole cluster is necessarily non-deterministic, but an individual node is deterministic, given an ordering of the messages it receives.
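That last point is the load-bearing one. If a node is written as a pure function of (state, message), then replaying a recorded message order reproduces its state exactly. A toy sketch (illustrative only):

```python
class Node:
    """A node as a pure state machine: the next state depends only on
    the current state and the incoming message, never on wall clocks,
    thread timing, or unseeded randomness."""

    def __init__(self):
        self.state = {}

    def receive(self, message):
        key, value = message
        self.state[key] = value

def replay(messages):
    node = Node()
    for m in messages:
        node.receive(m)
    return node.state

# Replaying the same recorded message order always reproduces the same
# state, so the cluster-level nondeterminism reduces to "which order
# did the messages arrive in", which the simulator records.
assert replay([("a", 1), ("b", 2)]) == replay([("a", 1), ("b", 2)])
```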

This is INCREDIBLE news! FoundationDB is the greatest piece of software I’ve ever worked on or used, and an amazing primitive for anybody who’s building distributed systems.

The short version is that FDB is a massively scalable and fast transactional distributed database with some of the best testing and fault-tolerance on earth[1]. It’s in widespread production use at Apple and several other major companies.

But the really interesting part is that it provides an extremely efficient and low-level interface for any other system that needs to scalably store consistent state. At FoundationDB (the company) our initial push was to use this to write multiple different database frontends with different data models and query languages (a SQL database, a document database, etc.) which all stored their data in the same underlying system. A customer could then pick whichever one they wanted, or even pick a bunch of them and only have to worry about operating one distributed stateful thing.

But if anything, that’s too modest a vision! It’s trivial to implement the Zookeeper API on top of FoundationDB, so there’s another thing you don’t have to run. How about metadata storage for a distributed filesystem? Perfect use case. How about distributed task queues? Bring it on. How about replacing your Lucene/ElasticSearch index with something that actually scales and works? Great idea!

And this is why this move is actually genius for Apple too. There are a hundred such layers that could be written, SHOULD be written. But Apple is a focused company, and there’s no reason they should write them all themselves. Each one that the community produces, however, will help Apple to further leverage their investment in FoundationDB. It’s really smart.

I could talk about this system for ages, and am happy to answer questions in this thread. But for now, HUGE congratulations to the FDB team at Apple and HUGE thanks to the executives and other stakeholders who made this happen.

Now I’m going to go think about what layers I want to build…

[1] Yes, yes, we ran Jepsen on it ourselves and found no problems. In fact, our everyday testing was way more brutal than Jepsen, I gave a talk about it here: https://www.youtube.com/watch?v=4fFDFbi3toc
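To give a flavor of what such a layer looks like, here is a toy task-queue layer sketched with FoundationDB's public Python bindings. This is not the official queue layer; the key layout is invented, and FIFO order is only approximate because the timestamp is assigned client-side:

```python
import os
import time

import fdb

fdb.api_version(620)
db = fdb.open()  # assumes a running cluster and default cluster file

QUEUE = ("queue",)  # key-space prefix for this toy layer

@fdb.transactional
def enqueue(tr, item):
    # Key order is (prefix, timestamp, random) so pops come out roughly
    # FIFO and concurrent enqueuers virtually never collide on a key.
    key = fdb.tuple.pack(QUEUE + (time.time(), os.urandom(8)))
    tr[key] = item

@fdb.transactional
def dequeue(tr):
    r = fdb.tuple.range(QUEUE)
    for kv in tr.get_range(r.start, r.stop, limit=1):
        del tr[kv.key]  # the read and the delete commit atomically
        return kv.value
    return None  # queue is empty
```

Because enqueue and dequeue are ordinary transactions (`enqueue(db, b"job-1")`, `dequeue(db)`), the layer inherits the database's consistency and fault tolerance for free, which is the argument being made above.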

ashishnm
I echo what wwilson has said. I work at Snowflake Computing, we're a SQL analytics database in the cloud, and we have been using FoundationDB as our metadata store for over 4 years. It is a truly awesome product and has proven to be rock-solid over this time. It is a core piece in our architecture, and is heavily used by all our services. Some of the layers that wwilson is talking about, we've built them: metadata storage, object-mapping layer, lock manager, notifications system. In conjunction with these layers, FoundationDB has allowed us to build features that are unique to Snowflake. Check out our blog post titled "How FoundationDB powers Snowflake Metadata forward" [1]

Kudos to the FoundationDB team and Apple for open sourcing this wonderful product. We're cheering you all along! And we look forward to contributing to the open source and community.

[1] https://www.snowflake.net/how-foundationdb-powers-snowflake-...

ddorian43
Where is the object-mapping layer? Inside the DB in C++, or a service (different process, etc.) outside of it?
ashishnm
Latter. It's a Java library used by our services.
polskibus
That's impressive! Are you going to contribute your changes to foundationdb back?
jamesblonde
I am one of the designers of probably the best known metadata storage engine for a distributed filesystem, hopsfs - www.hops.io. When I looked at FoundationDB before Apple bought you, you supported transactions - great. But we need much more to scale. Can you tell me which of the following you have:

- row-level locks
- partition-pruned index scans
- non-serialized cross-partition transactions (that is, a transaction coordinator per DB node)
- distribution-aware transactions (hints on which TC to start a transaction on)

The relative performance contribution of most of those features (cross-partition transactions not evaluated -it's a must) can be seen in our paper at Usenix Fast: https://www.usenix.org/system/files/conference/fast17/fast17...

notacoward
> best known metadata storage engine for a distributed filesystem, hopsfs

Not even close. I don't even see anything I'd call a filesystem mentioned on your web page. I missed FAST this year, and apparently you had a paper about using Hops as a building block for a non-POSIX filesystem - i.e. not a filesystem in my and many others' opinion - but it's not clear whether it has ever even been used in production anywhere let alone become "best known" in that or any other domain. I know you're proud, perhaps you have reason to be, but please.

jamesblonde
Yes, it's used in production. There's a company commercializing it: www.logicalclocks.com
agibsonccc
I'm not convinced a paper with a website and some proofs of concept would be considered the "best". You're throwing around a bunch of components into a distro, calling yourselves everything from "deep learning" to a file system. It's not clear what you guys are even trying to do here.
spullara
You don't need to worry about shards / partitions with FDB. Their transactions can involve any number of rows across any number of shards. It is by far the best foundation for your bespoke database.
voidmain
It's somewhat hard to answer your questions because the architecture (and hence, terminology) of FoundationDB is a little different than I think you are used to. But I will give it a shot.

FoundationDB uses optimistic concurrency, so "conflict ranges" rather than "locks". Each range is a (lexicographic) interval of one or more keys read or written by a transaction. The minimum granularity is a single key.

FoundationDB doesn't have a feature for indexing per se at all. Instead indexes are represented directly in the key/value store and kept consistent with the data through transactions. The scalability of this approach is great, because index queries never have to be broadcast to all nodes, they just go to where the relevant part of the index is stored.

FoundationDB delivers serializable isolation and external consistency for all transactions. There's nothing particularly special about transactions that are "cross-partition"; because of our approach to indexing and data model design generally we expect the vast majority of transactions to be in that category. So rather than make single-partition transactions fast and everything else very slow, we focused on making the general case as performant as possible.

Transaction coordination is pretty different in FoundationDB than in 2PC-based systems. The job of determining which conflict ranges intersect is done by a set of internal microservices called "resolvers", which partition up the keyspace totally independently of the way it is partitioned for data storage.

Please tell me if that leaves questions unresolved for you!
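That index approach is easy to show with the Python bindings. In this sketch (invented key layout), an index entry is just another key written in the same transaction as the record, and an index lookup is a range read over the index prefix:

```python
import fdb

fdb.api_version(620)
db = fdb.open()  # assumes a running cluster and default cluster file

@fdb.transactional
def set_user(tr, user_id, email, old_email=None):
    # The record and its index entry are two keys written in one
    # transaction, so they can never drift apart.
    tr[fdb.tuple.pack(("user", user_id))] = email.encode()
    if old_email is not None:
        del tr[fdb.tuple.pack(("email_idx", old_email, user_id))]
    tr[fdb.tuple.pack(("email_idx", email, user_id))] = b""

@fdb.transactional
def user_ids_by_email(tr, email):
    # An index query is a range read over the index prefix; it goes
    # only to the nodes holding that part of the keyspace.
    r = fdb.tuple.range(("email_idx", email))
    return [fdb.tuple.unpack(kv.key)[-1]
            for kv in tr.get_range(r.start, r.stop)]
```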

evanweaver
> Transaction coordination is pretty different in FoundationDB than in 2PC-based systems. The job of determining which conflict ranges intersect is done by a set of internal microservices called "resolvers", which partition up the keyspace totally independently of the way it is partitioned for data storage.

Ok, per my other question that makes sense. Similar to FaunaDB except the "resolvers" (transaction processors) are themselves partitioned within a "keyspace" (logical database) in FaunaDB for high availability and throughput. But FaunaDB transactions are also single-phase and we get huge performance benefits from it.

jamesblonde
Thanks for the detailed answer. Is it actually serializable isolation - does it handle write skew anomalies (https://en.wikipedia.org/wiki/Snapshot_isolation)? Most OCC systems I know have only snapshot isolation.

Systems that sound closest to FoundationDB's transaction model that i can think of are Omid (https://omid.incubator.apache.org/) and Phoenix (https://phoenix.apache.org/transactions.html). They both support MVCC transactions - but I think they have a single coordinator that gives out timestamps for transactions - like your "resolvers". The question is how your "resolvers" reach agreement - are they each responsible for a range (partition)? If transactions cross ranges, how do they reach agreement?

We have talked to many DB designers about including their DBs in HopsFS, but mostly it falls down on something or other. In our case, metadata is stored fully denormalized - all inodes in a FS path are separate rows in a table. In your case, you would fall down on secondary indexes - which are a must. Clustered PK indexes are not enough. For HopsFS/HDFS, there are so many ways in which inodes/blocks/replicas are accessed using different protocols (not just reading/writing files or listing directories, but also listing all blocks for a datanode when handling a block report). Having said that, it's a great DB for other use cases, and it's great that it's open-source.

elcritch
I agree with voidmain’s comment as secondary indexes shouldn’t be any different than the primary KV in your case. Almost seems that you’re focusing on a SQL/Relational database architecture but storing your data demoralized anyways. Odd combination of thoughts.
spullara
I love the idea of demoralized data :)
kovacs
Is that where the data sits around waiting, eager, to be queried into action while watching data around them being used over and over again.... But that time never comes, thus leaving them to question their very worth?
voidmain
Yes, it's really serializable isolation. The real kind, not the "doesn't exhibit any of the named anomalies in the ANSI spec" kind. We can selectively relax isolation (to snapshot) on a per-read basis (by just not creating a conflict range for that read).

I tried to explain distributed resolution elsewhere in the thread.

I believe our approach to indices pretty much totally dominates per-partition indexing. You can easily maintain the secondary indexes you describe; I don't understand your objection.

jwatte
My guess is the objection lies in "Have to manage the index myself."

Also, the main drawback of "indices as data" in NoSQL is when you need to add a new index -- suddenly, you have to scrub all your data and add it to the new index, using some manual walk-the-data function, and you have to make sure that all operations that take place while you're doing this are also aware of the new index and its possibly incomplete state.

Certainly not impossible to do, but it sometimes feels a little bit like "I wanted a house, but I got a pile of drywall and 2x4 framing studs."
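The backfill described here is typically done online, in bounded chunks. A sketch continuing the toy user/email layout from the earlier example (illustrative only; FoundationDB limits a transaction to about five seconds, hence the small batches, and writers are assumed to already be maintaining the new index before the scan starts):

```python
import fdb

fdb.api_version(620)
db = fdb.open()  # assumes a running cluster and default cluster file

@fdb.transactional
def backfill_chunk(tr, cursor, batch=100):
    # Scan a small slice of the records and write the missing index
    # entries. Each chunk commits transactionally, so concurrent
    # writers (already maintaining the index) stay consistent with it.
    end = fdb.tuple.range(("user",)).stop
    last = None
    for kv in tr.get_range(cursor, end, limit=batch):
        user_id = fdb.tuple.unpack(kv.key)[1]
        email = kv.value.decode()
        tr[fdb.tuple.pack(("email_idx", email, user_id))] = b""
        last = kv.key
    # Return where the next chunk should resume, or None when done.
    return fdb.KeySelector.first_greater_than(last) if last else None

cursor = fdb.tuple.range(("user",)).start
while cursor is not None:
    cursor = backfill_chunk(db, cursor)
```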

voidmain
"I wanted a house, but I got a pile of drywall and 2x4 framing studs."

This is a totally legitimate complaint about FoundationDB, which is designed specifically to be, well, a foundation rather than a house. If you try to live in just a foundation you are going to find it modestly inconvenient. (But try building a house on top of another house and you will really regret it!)

The solution is of course to use a higher level database or databases suitable to your needs which are built on FDB, and drop down to the key value level only for your stickiest problems.

Unfortunately Apple didn't release any such to the public so far. So I hope the community is up to the job of building a few :-)

bsaul
Your talk was one of the best talks I've seen, and I keep mentioning it to people whenever they ask me about distributed systems, databases, and testing.

I'm incredibly impatient to have a look at what the community is going to build on top of that very promising technology.

srigi
Can you share that talk here too?
nunb
link to the [talk] from the parent comment [by wwilson]

[talk] https://www.youtube.com/watch?v=4fFDFbi3toc

[by wwilson] https://news.ycombinator.com/item?id=16877401

thefounder
I'm not familiar with FDB but what you say sounds almost too good to be true. Can I use it to implement the Google Datastore API? I've been trying for years to find a suitable backend so that I can leave the Google land. Everything I tried either required a schema or lacked transactions or key namespaces.
wsh91
Hey there--I'm an engineer on the Cloud Datastore team. I'd love to know more about what your needs are if you're willing to share.
thefounder
I've forked the official SDK so that I can get extra functionality, but it's quite hard to keep it updated when internal stuff changes. There is no way I can contribute. I can't use it everywhere I want... in short, it's not open source, and that sucks.
jlgaddis
He "needs" to get rid of you (Google). :-)
julien_c
Honest question, does MongoDB not work for this?
skissane
MongoDB is AGPL or proprietary. Many companies have a policy against using AGPL licensed code. So, if you work at one of those companies, then open source MongoDB is not an option (at least for work projects), and proprietary may not be either (depending on budget etc).

FoundationDB is now licensed under Apache 2, which is a much more permissive license, so most companies' open source policies allow it.

diaz
Unless people want to change the MongoDB code that they would be using, using AGPL software should be a non-issue, and there are no problems with it. People should start understanding the available licenses instead of spreading fear.
skissane
It isn't "spreading fear", it is just reality. Google bans AGPL-licensed software internally: https://opensource.google.com/docs/using/agpl-policy/

I know that multiple other companies have a similar policy (either a complete ban on using AGPL-licensed software, or special approval required to use it), although unlike Google, they don't post their internal policy publicly.

If someone works at one of these companies, what do you want to do – spend your day trying to argue to get the AGPL ban changed, or a special exception for your project; or do you just go with the non-AGPL alternative and get on with coding?

jkaplowitz
The main reason it's a problem at many of the companies which ban it is they have a lot of engineers who readily patch and combine code from disparate sources and might not always apply the right discipline to keep AGPL things appropriately separate. Bright-line rules can be quite useful.

It is true that MongoDB's AGPL contagion and compliance burden, if you don't modify it, is less than many fear. It is also true that those corporate concerns are valid. MongoDB does sell commercial licenses so that such companies can handle their unavoidable MongoDB needs, but they would tend to minimize that usage.

ddorian43
Because MongoDB doesn't (or at least didn't) do anything at all well, except for its little complex document modifiers.

It was a very bad RDBMS (since they marketed it as a replacement) and a very bad sharded DB (they marketed that too).

PeCaN
Honest question, does MongoDB work?
boubiyeah
MongoDB is pretty average at absolutely everything.

It's fast to install on a dev machine though, lol.

dangets
For better or worse, many times UX is more important than functionality.
btilly
Until recently, no. But apparently they got serious about making it work and now it actually does. See https://jepsen.io/analyses/mongodb-3-4-0-rc3 for verification.

I realized this 2 months ago at https://news.ycombinator.com/item?id=16386129

KMag
That's Nostradamus-level crazytown. What next, PHP strongly enforcing a sound static type system with immutable defaults, GADTs and dependent types?

Congrats to the MongoDB team!

smadge
> What next, PHP strongly enforcing a sound static type system with immutable defaults, GADTs and dependent types?

Too funny!

wwilson
As an existence proof: before the acquisition we built an ANSI SQL database and a wire-compatible clone of the MongoDB API.

I see no reason you wouldn't be able to implement Datastore. In fact here's a public source claiming that Firestore (which I believe is its successor) is implemented on top of Spanner: https://www.theregister.co.uk/2017/10/04/google_backs_up_fir...

edwinyzh
I hope this question doesn't feel too dumb, but is it possible to implement a SQL layer using SQLite's Virtual Table mechanism and leverage all of FoundationDB's features?
X-Istence
Is the wire-compatible clone of the MongoDB API available?
wwilson
It doesn't look like Apple open-sourced the Document Layer, which is a slight bummer. But I echo what Dave said below: what we got is incredible, let's not get greedy!

Also TBH now that I don't have commercial reasons to push interop, if I write another document database on top of FDB, I doubt I'd make it Mongo compatible. That API is gnarly.

itp
> That API is gnarly.

The Will Wilson I remember was not prone to such understatement. That API (and the corresponding semantics) was a nightmare.

tjb1982
I was pretty bummed not to see that, myself
chx
> That API is gnarly.

Out of the many MongoDB criticisms, this one is valid. This https://www.linkedin.com/pulse/mongodb-frankenstein-monster-... article is quite right about it (note: endorsing the article does not mean I endorse its author by far).

Other than that, they totally did a fake it until you make it with MongoDB 3.4 passing Jepsen a year ago and MongoDB BI 2.0 containing their own SQL engine instead of wrapping PostgreSQL.

aeorgnoieang
What specifically are you trying to avoid endorsing about the author of the LinkedIn post to which you linked? I couldn't find anything from a cursory web search.
chx
You could try adding controversy to the author name and searching then. As mst correctly notes, I am trying to avoid reigniting said controversy while indicating my distaste.
mst
He runs lambdaconf, and refused to disinvite a speaker who many people felt shouldn't be permitted to speak because of his historical non-technical writings.

(I've tried to keep the above as dry as possible to avoid dragging the arguments around this situation into this thread - and I suspect the phrasing of the previous comment was also intended to try and avoid that, so let's see if we can keep it that way, please)

rthomas900
If I recall correctly, the "SQL layer" you had in FDB before the Apple acquisition was a nice proof of concept, but lacked support for many features (joins in SELECT, for example). Is the SQL layer code from that time available anywhere to the public? (I'm not seeing it in the repo announced by OP.)
rthomas900
Perhaps it's this?

https://github.com/jaytaylor/sql-layer

zwily
That has FoundationDB dependencies that don’t exist in the apple repo (SQL Parser, JDBC Driver, etc)
tjb1982
I used to work there. The SQL layer actually supported the majority of SQL features, including joins, etc. We had an internal Rails app that we used to dogfood for performance monitoring, etc. I used to work on the document layer, and was sad to see it wasn't included here.
lars_francke
> How about replacing your Lucene/ElasticSearch index with something that actually scales and works?

Do you have something to back that up? This to me reads like you imply that Elasticsearch does not work and scale.

It's definitely interesting but I'm cautious. The track record for FoundationDB and Apple has not been great here. IIRC they acquired the company and took the software offline, leaving people out in the rain?

Could this be like it happened with Cassandra at Facebook where they dropped the code and then more or less stopped contributing?

Also I haven't seen many contributions from Apple to open-source projects like Hadoop etc. in the past few years. Looking for "@apple.com" email addresses in the mailing lists doesn't yield a lot of results. I understand that this is a different team and that people might use different E-Mail addresses and so on.

In general I'm also always cautious (but open-minded) when there's lots of enthusiasm and there seems to be no downside. I'm sure FoundationDB has its dirty little secrets and it would be great to know what those are.

jacobparker
> Do you have something to back that up? This to me reads like you imply that Elasticsearch does not work and scale.

https://www.elastic.co/guide/en/elasticsearch/resiliency/cur...

https://aphyr.com/posts/317-call-me-maybe-elasticsearch

https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-...

The statement was blunt but the reputation is not exactly unearned. These problems become worse proportional to scale.

jasode
>It’s in widespread production use at Apple

Anybody know if Apple migrated any projects from Cassandra to FoundationDB?

Or was every project using FoundationDB a greenfield project?

shawn-butler
There were rumors iMessage moved there but ¯\_(ツ)_/¯

Would be great to see from Apple an engineering blog "strengths and challenges at scale" post now that it has been opened again.

itp
I have nothing specific to add to this either other than I also spent years working there, and I'm unbelievably happy about this.
foobaw
It is a genius move from Apple. I just wish they'd apply this logic to a lot of their other stuff.
pritambaral
~~How is it different from when Apple acquired the then-open-source FoundationDB (and shut down public access)? They could have just kept it open source back then.~~

EDIT: My bad, looks like FoundationDB wasn't fully open-source back then.

djrogers
From what I recall (and based on some quick retro-googling) I don't believe FoundationDB was open source. One of the complaints about it on HN back in the day was that it was closed...
piva00
You are correct, FoundationDB was free but never OSS.

You can even check the comments on the HN thread [1] when they removed the downloads.

[1] https://news.ycombinator.com/item?id=9259986

woolvalley
They are, slowly. Swift is open source, clang is open source. They are moving parts of the xcode IDE into open source, like with sourcekitd and now recently clangd.

I don't think they will ever move 'secret sauce' into open source, but infrastructural things like DBs and dev tooling seems to be going in that direction.

mercutio2
Well, clangd is a Google project, which Apple has decided to start contributing to, so probably doesn’t belong on your list.

Apple, like everyone else, wants to commoditize their complements.

mattkrea
Clang was an Apple project from the start. I'm not sure what is telling you it is a Google project.
gnat
In case a citation is needed: https://en.wikipedia.org/wiki/Clang

> Apple chose to develop a new compiler front end from scratch, supporting C, Objective-C and C++. This "clang" project was open-sourced in July 2007.

kalleboo
I believe the "d" in the GP comment was not a typo https://news.ycombinator.com/item?id=16874734
mercutio2
Clangd != clang.

Woolvalley said “and now recently clangd“, which is a source code completion server which began its life at Google.

Clang did start its life with Apple.

catwell
> In fact, our everyday testing was way more brutal than Jepsen, I gave a talk about it here: https://www.youtube.com/watch?v=4fFDFbi3toc

Unrelated to the original topic, but I had never come across that talk and it is great. I use the same basic approach to testing distributed systems (simulating all non-deterministic I/O operations) and that talk is a very good introduction to the principle.

wenc
In reinforcement learning, the same approach (simulation) to addressing a data deficit is used, although the goal there is learning and not testing.

When certain events happen rarely, you have to fake them deterministically so that you have reproducible coverage.

KyleBrandt
Know of any ideas around using this for long-term time-series data? Wondering if maybe something like OpenTSDB could use this as a backend instead of HBase (which can be a sort of operational hell).
wenc
Not quite the same but TimescaleDB is Postgres-based and is showing a lot of promise for time-series data.
mzeier
Yes :) it's the core of how Wavefront stores telemetry.
infinite8s
https://www.wavefront.com/wavefront-foundationdb-open-source...
bryanrasmussen
How would you replace a Lucene/Elasticsearch index with foundationDb?
voidmain
It's more like you would build a better Elasticsearch using Lucene to do the indexing and FoundationDB to do the storage. FoundationDB will make it fault tolerant and scalable; the other pieces will be stateless.
bryanrasmussen
ok thanks, it was sort of confusing me
bryanrasmussen
Actually, since my current side project needs good graph support I went looking for foundationDb graph and found this https://github.com/rayleyva/blueprints-foundationdb-graph may have to check it out.
fizx
It'd take a low number of hours to wire up FoundationDB as a Lucene filesystem (Directory) implementation. Shared filesystem with a local RAM cache has been practical for a while in Lucene, and was briefly supported then deprecated in Elasticsearch. I've used Lucene on top of HDFS and S3 quite nicely.

If you have a reason to use FoundationDB over HDFS, NFS, S3, etc, then this will work well.

Doing a Lucene+DB implementation where each term's posting list is stored natively in the key-value system was explored for Lucene+Cassandra (https://github.com/tjake/Solandra). It was horrifically slow, not because Cassandra was slow, but because posting lists are highly optimized, and putting them in a generalized B-tree or LSM-tree variant removes some locality and many of the possible optimizations.

I'm still holding out some hope for a hybrid implementation where posting list ranges are stored in a kv store.

nl
I wrote the original version of Solandra (which is/was Solr on Cassandra) on top of Jake's Lucene on Cassandra[1].

I can confirm it wasn't fast!

(And to be fair that wasn't the point - back then there were no distributed versions of Solr available so the idea of this was to solve the reliability/failover issue).

I wouldn't use it on a production system nowadays.

[1] http://nicklothian.com/blog/2009/10/27/solr-cassandra-soland...

burntsushi
> I've used Lucene on top of HDFS and S3 quite nicely.

Out of curiosity, what led you to do this? And what does it do better/worse/differently than out-of-the-box things like Elasticsearch or SOLR?

gazarsgo
HDFS is supported by Solr+Lucene, but rather than my own poor paraphrasing, see what you think of this writeup: https://engineering.linkedin.com/search/did-you-mean-galene
burntsushi
Ah, excellent! Thanks. That answers my question. I also found the idea of early termination via static rank very intriguing.
ddorian43
See Elassandra for a better Solandra, keeping Lucene indexes and SSTables separate. It should be better than keeping posting lists in a KV store.
voidmain
I think you are on the right track. Storing every individual (term, document, ...) in the key value store will not be efficient, but you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently. And of course you can do caching (and represent invalidation data structures in FDB), and...

FDB leaves room for a lot of creativity in optimizing higher layers. Transactions mean that you can use data structures with global invariants.

burntsushi
Hmmm. I'm skeptical. A Lucene term lookup is stupidly fast. It traverses an FST, which is small and probably in memory. Traversing the postings lists itself also needs to be smart by following a skip table, which is critical for performance.
derefr
> you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently.

That sounds a lot like Datomic's "Storage Resource" approach, too! Would Datomic-on-FDB make sense, or is there a duplication of effort there?

wll
It most definitely would.

Datomic’s single-writer system requires conditional put (CAS) for index and (transaction) log (trees) roots pointers (mutable writes), and eventual consistency for all other writes (immutable writes) [0].

I would go as far as saying a FoundationDB-specific Datomic may be able to drop its single-writer system due to FoundationDB’s external consistency and causality guarantees [1], drop its 64bit integer-based keys to take advantage of FoundationDB range reads [2], drop its memcached layer due to FoundationDB’s distributed caching [3], use FoundationsDB watches for transactor messaging and tx-report-queue function [4], use FoundationDB snapshot reads [5] for its immutable indexes trees nodes, and maybe more?

Datomic is a FoundationDB layer. It just doesn’t know yet.

[0] https://docs.datomic.com/on-prem/acid.html#how-it-works

[1] https://apple.github.io/foundationdb/developer-guide.html?hi...

[2] https://apple.github.io/foundationdb/developer-guide.html?hi...

[3] https://apple.github.io/foundationdb/features.html#distribut...

[4] https://docs.datomic.com/on-prem/clojure/index.html#datomic....

[5] https://apple.github.io/foundationdb/developer-guide.html?hi...

mvc
Can't see datomic itself ever doing that though because they'd have to support those features in all the backends.
fizx
So from the Lucene perspective, the idea of a filesystem is pretty baked into the format. However, there's also the idea of a Codec which takes the logical data structures and translates to/from the filesystem. If you made a Codec that ignored the filesystem and just interacted with FDB, then that could work.

You can already tune segment sizes (a segment is a self-contained index over a subset of documents). I'd assume that the right thing to do for a first attempt is to use a Codec to write each term's entire posting list for that one segment to a single FDB key (doing similar things for the many auxiliary data structures). If it gets too big, then you should have tuned max segment size to be smaller. Do some sort of caching on the hot spots.

If anyone has any serious interest in trying this, my email is in my profile to discuss further.

jtchang
I just watched the demo of 5 machines and 2 getting unplugged. The remaining 3 can form a quorum. What happens if it was 3 and 3? Do they both form quorums?
wwilson
A subset of the processes in a FoundationDB cluster have the job of maintaining coordination state (via disk Paxos).

In any partition situation, if one of the partitions contains a majority of the coordinators then it will stay live, while minority partitions become unavailable.

jamesblonde
The arbitrator pattern is a great lightweight solution to the split-brain problem. Here's a great blog about it: http://messagepassing.blogspot.se/2012/03/cap-theorem-and-my...
jtchang
What if the number of boxes is even?

Or do you get around this by never deploying an even number of boxes?

voidmain
The number of coordinators is separate from the number of boxes. You don't have to have a coordinator on every box.

I think you can set the number of coordinators to be even, but you never should - the fault tolerance will be strictly better if you decrease it by one.

gregmac
This makes me wonder also what happens with multiple partitions, for example:

5 into 2, 2, and 1

7 into 3, 2, and 2

traviscj
No majorities, I guess?
maccam94
The majority is always N/2 + 1, where N is the number of members. A 6-member cluster is less fault-tolerant than a 5-member cluster (quorum is 4 nodes instead of 3, and it still only allows for 2 nodes to fail).
voidmain
Nitpick: To be fully live, a partition needs a majority of the coordinators and at least one replica of each piece of data (if you don't have any replicas of something unimportant, you might be able to get some work done, but if you have coordinators and nothing else in a partition you aren't going to be making progress)
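Spelling out the arithmetic for the partition examples above (a partition is live only with a strict majority, i.e. more than half of the coordinators):

```python
def live(partition, coordinators):
    # A partition stays live only if it holds a strict majority,
    # i.e. more than half of the coordinators.
    return partition > coordinators // 2

# 5 coordinators split 2/2/1: the majority is 3, so no partition is live.
assert not any(live(p, 5) for p in (2, 2, 1))
# 7 coordinators split 3/2/2: the majority is 4, same result.
assert not any(live(p, 7) for p in (3, 2, 2))
# An even count helps nothing: 6 split 3/3 leaves both sides dead,
# which is why decreasing an even coordinator count by one is strictly
# better for fault tolerance, as noted above.
assert not any(live(p, 6) for p in (3, 3))
```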
ece
Apple also uses HBase for Siri, I believe. What are some of the cluster sizes that FoundationDB scales to? Could it be used to replace HBase or Hadoop?

I was in attendance at your talk, and thought it was one of the best at the conference. Apple, I think, broke some hearts by going completely closed-source for a while, but I'm glad to see them open sourcing a promising technology.

killertypo
https://news.ycombinator.com/item?id=16879392 massive
mzeier
If scale is a function of read/writes, very large. In fact with relatively minimal (virtual) hardware it's not insane to see a cluster doing around 1M writes/second.
ece
I was talking more about large file storage like HDFS, and the MapReduce model of bringing computation to data. HBase does the latter, and it's strongly consistent like FoundationDB, though FoundationDB provides better guarantees. As a K/V I understand what you and OP say.
sydcanem
How does this compare with CockroachDB? I'm planning to use CockroachDB for a project but would love to get an idea if I can get better results with FoundationDB.
capkutay
well FoundationDB for one doesn't resort to clickbaity pseudo-benchmark marketing tactics

https://www.cockroachlabs.com/blog/performance-part-two/

zzzcpan
They might be targeting the wrong market, hence the desperate marketing. For people who use MySQL/PostgreSQL a compatible, slower, but distributed database probably just doesn't solve any problem. Those people need a managed solution, not a distributed one.
avip
"we also simulate dumb sysadmins". That was a really inspiring talk, thanks!
StreamBright
Fantastic talk! Very engaging and insightful.
jahhein
That presentation was really good! The simulations were well explained. If one wanted to get in on this exciting development and create something with FoundationDB, but has no database experience (I do know many programming languages), where would one start? If anyone could point me in the right direction, I'd greatly appreciate it.
edwinyzh
How scalable and feasible would it be to implement a SQL layer on top of SQLite's Virtual Table mechanism (https://www.sqlite.org/vtab.html), redirecting the reads/writes of record data to FoundationDB?
voidmain
Long before we acquired Akiban, I prototyped a SQL layer using the (now defunct) sqlite4, which used a K/V store abstraction as its storage layer. I would guess that a virtual table implementation would be similar: easy to get working, and it would work, but the performance is never going to be amazing.

To get great performance for SQL on FoundationDB you really want an asynchronous execution engine that can take full advantage of the ability to hide latency by submitting multiple queries to the K/V store in parallel. For example if you are doing a nested loop join of tables A and B you will be reading rows of B randomly based on the foreign keys in A, but you want to be requesting hundreds or thousands of them simultaneously, not one by one.

Even our SQL Layer (derived from Akiban) only got this mostly right - its execution engine was not originally designed to be asynchronous, and we modified it to do pipelining, which can get a good amount of parallelism but still leaves something on the table, especially in small fast queries.
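That pipelining falls out naturally in the bindings because reads return futures. A toy sketch with the Python bindings (key layout and names invented; it assumes each order value holds a user id as decimal bytes):

```python
import fdb

fdb.api_version(620)
db = fdb.open()  # assumes a running cluster and default cluster file

@fdb.transactional
def emails_for_orders(tr, order_keys):
    # Pass 1: issue ALL the outer-table reads before blocking on any;
    # in the Python bindings tr[...] returns a future.
    orders = [tr[k] for k in order_keys]
    # Pass 2: as each order arrives, immediately issue the inner-table
    # read for its user; later rows' reads overlap in flight.
    # (.wait() blocks, but the read was already issued in pass 1.)
    users = [tr[fdb.tuple.pack(("user", int(o.wait())))] for o in orders]
    # Only now block for the remaining inner reads.
    return [u.wait() for u in users]
```

All the outer reads and most of the inner reads are in flight concurrently, so the join costs a couple of round trips rather than one per row.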

edwinyzh
@voidmain, thank you, that's very insightful and clear! I mean, I can see the disadvantage if such a SQL layer is implemented directly through SQLite's virtual tables.
tinco
Would it be possible to build a tree DB on top of it, like MonetDB/XQuery? I always wondered why XML databases never took off; I've never seen anything else quite as powerful. The document databases du jour seem comparatively lame.
spullara
They had a document layer that wasn't released that was MongoDB compatible — so you can do this.
tinco
MongoDB is just a database with fewer features than a SQL database. An XML/XQuery database is fundamentally different, so I figured that if FoundationDB layers are really so powerful, they might be able to model a tree DB as well.
spullara
Ah, I misread. Maybe? Storing hierarchical data would be pretty natural.
CMCDragonkai
Yes, you can. You basically need a tree index. Any K/V store can serve as the backing data structure. I've been writing one for bidirectional config-file transformation.
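
A minimal sketch of the idea (hypothetical and in-memory): encode each node's path into an ordered key, so a whole subtree becomes one contiguous key range, which is exactly what an ordered K/V store like FoundationDB gives you.

    import bisect

    class TreeIndex:
        """Toy tree index over an ordered K/V store (a sorted key list + dict).
        A real layer would use proper tuple encoding instead of '/'-joining,
        which breaks if path components themselves contain '/'."""
        def __init__(self):
            self.keys = []   # kept sorted; stands in for the store's key order
            self.data = {}

        def put(self, path, value):
            key = '/'.join(path)  # ('config', 'net', 'port') -> 'config/net/port'
            if key not in self.data:
                bisect.insort(self.keys, key)
            self.data[key] = value

        def subtree(self, path):
            # All descendants of `path` form one contiguous key range.
            prefix = '/'.join(path) + '/'
            i = bisect.bisect_left(self.keys, prefix)
            out = []
            while i < len(self.keys) and self.keys[i].startswith(prefix):
                out.append((self.keys[i], self.data[self.keys[i]]))
                i += 1
            return out

    t = TreeIndex()
    t.put(('config', 'net', 'port'), 8080)
    t.put(('config', 'net', 'host'), 'localhost')
    t.put(('config', 'log', 'level'), 'debug')
    print(t.subtree(('config', 'net')))  # only the net/* descendants
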
tjb1982
That is so exciting! Can't wait to poke around. I wonder if there's anything in there I might recognize
eklavya
Sorry if I missed it but can you or someone else please link where the client wire protocol is documented?
voidmain
The client is complex and needs very sophisticated testing, so there is only one implementation. All the language bindings use the C library.
eklavya
Ok, thanks :)
ddorian43
Curious why the client is complicated compared to other DBs in the same space?
voidmain
In some distributed databases the client just connects to some machine in the cluster and tells it what it wants to do. You pay the extra latency as it redirects these requests where they should go.

In FDB's envisioned architecture, the "client" is usually a (stateless, higher layer) database node itself! So the client encompasses the first layer of distributed database technology, connects directly to services throughout the cluster, does reads directly (1xRTT happy path) from storage replicas, etc. It simulates read-your-writes ordering within a transaction, using a pretty complex data structure. It shares a lot of code with the rest of the database.

If you wanted, you could write a "FDB API service" over the client and connect to it with a thin client, reproducing the more conventional design (but you had better have a good async RPC system!)

ddorian43
Wouldn't layers be hard to build into the server (since you'd also have to change the client), and slow if built as a layer (since it would be another separate service)?
qaq
I am guessing you'd pretty much embed the client into your higher layer.
voidmain
I'm not sure what you are asking, but depending on their individual performance and security needs layers are usually either (a) libraries embedded into their clients, (b) services colocated with their clients, (c) services running in a separate tier, or (d) services co-located with fdbservers. In any of these cases they use the FoundationDB client to communicate with FoundationDB.
mogui
In case (c) or (d), how can a layer leverage the distributed facilities that FDB provides? I mean, if I have clients that connect to a "layer service" which is the one that talks to FDB, I have to manage the layer service's scalability, fault tolerance, etc. by myself.
voidmain
Yes, and that's the main advantage of choosing (a) or (b). But it's not quite as hard as it sounds; since all your state is safely in fdb you "just" have to worry about load balancing a stateless service.
mogui
Got it. What would you suggest for building something like that? I've read "a simple RPC service with a good async framework", but like what? An RPC service on top of Twisted for Python, and similar things in other languages?

Thanks :)

zbentley
> but you had better have a good async RPC system!

The microservices crew with their "our database is behind a REST/Thrift/gRPC/FizzBuzzWhatnot microservice" pattern is still catching up to the significance of this statement.

foodbaby
This might be a dumb question (from someone used to blocking JDBC), but why is async RPC important in this case? Just trying to understand. And can gRPC not provide good async RPC?
ddorian43
Imagine a single-core, single-threaded design. You send 2 requests for 1 row each.

For the first request, the row needs to be read from a spinning disk (HDD). That takes ~2ms.

For the second request, the row is already in RAM; it takes microseconds, but with blocking requests it still has to wait for the first request to finish.

Threads have overhead at high concurrency (thousands to millions of requests/second).

For an extreme async design, see the Seastar framework and ScyllaDB.

TL;DR: high concurrency, low overhead, etc.

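A toy illustration of the point, using plain asyncio with sleeps standing in for the disk and RAM accesses (hypothetical numbers): the fast request completes right away instead of queuing behind the slow one.

    import asyncio
    import time

    async def read_row(latency):
        await asyncio.sleep(latency)  # stands in for disk or RAM access
        return 'row'

    async def main():
        start = time.monotonic()
        slow = asyncio.create_task(read_row(0.002))    # ~2ms cold read from disk
        fast = asyncio.create_task(read_row(0.00001))  # row already in RAM
        await fast  # finishes almost instantly; it never waits behind `slow`
        print(f'fast done at {time.monotonic() - start:.6f}s')
        await slow
        print(f'slow done at {time.monotonic() - start:.6f}s')

    asyncio.run(main())
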
zbentley
I was referring to the trend of splitting up applications into highly distributed collections of services without addressing the fact that every point where they communicate over the network is a potential point of pathological failure (from blocking to duplicate delivery, etc.). This tendency replaces highly reliable network protocols (e.g. the one you use to talk to your RDBMS) with ad hoc and frequently technically shoddy communication patterns, with minimal consideration for how they might fail in complex, distributed ways. While not always done wrong, a lot of microservice-ification efforts are quite hubristic in this area, and suffer for it over the long term.
ccifuentes
Do we really need another rookie database system? To be honest, Postgres and MongoDB are all you need for any project.

I respect hobby projects though, and if that’s the case then great!

15155
> I respect hobby projects though

https://techcrunch.com/2015/03/24/apple-acquires-durable-dat...

FDB raised >$22m and was acquired by Apple.

If you consider that a "hobby project," please, teach me how to hobby.

foobarbazetc
Friends don’t let friends use MongoDB.
qaq
https://www.youtube.com/watch?v=4fFDFbi3toc I am a big fan of PostgreSQL, but it would probably be a good idea to look into the thing you are commenting on.
ianamartin
Postgres operates great as a document store, btw. You don't really need Mongo at all. And if you need to distribute because you've outgrown what you can do on a single Postgres node, you don't want to use Mongo anyway.

If you've read any of the comments or have been following the project, it should be pretty obvious that this is far from rookie.

This is a game changer, not a hobby project. This is the first distributed data store that offers enough safety, and a good enough understanding of CAP-theorem trade-offs, that it can be safely used as a primary data store.

segmondy
Or perhaps it's not so incredible? Maybe it wasn't such a huge hit at Apple and didn't live up to expectations, so they figured they could give it away and earn some community goodwill.
banana-guy
There were rumors that FDB was never adopted at Apple because of some limitations, and that they still ran an in-house version of Cassandra.
voidmain
Will said what I wanted to say, but: me too. I'm super happy about this and grateful to the team that made it happen!

(I was one of the co-founders of FoundationDB-the-company and was the architect of the product for a long time. Now that it's open source, I can rejoin the community!)

rcgenova
Ditto. So glad to see FoundationDB available again! Both the tech and the company have been missed. (Former FoundationDB Solutions Engineer :) )
nagarc
Nice to see you are joining back!
nlavezzo
Another (non-technical) founder here - and I echo everything voidmain just said. We built a product that is unmatched in so many important ways, and it's fantastic that it's available to the world again. It will be exciting to watch a community grow around it - this is a product that can benefit hugely from open-source contributions in the form of layers that sit on top of the core K/V store.
lmorris84
I wrote a Java port of the Python counter many moons ago [1]. I will have to resurrect it!

[1] https://github.com/leemorrisdev/foundationcounter

voidmain
It should probably be pointed out that atomic increment is in most situations a more efficient solution for high contention counters in modern FDB.
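
For reference, a sketch of what that might look like with the Python bindings (the counter key is hypothetical; the point is that the ADD mutation is applied on the storage server, which is why concurrent increments don't conflict):

    import struct

    import fdb

    fdb.api_version(510)
    db = fdb.open()

    @fdb.transactional
    def increment(tr, counter_key):
        # Server-side atomic add: many clients can bump the same hot counter
        # concurrently without causing transaction conflicts.
        tr.add(counter_key, struct.pack('<q', 1))  # little-endian 64-bit delta

    increment(db, b'my-counter')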
lmorris84
Ah I don’t believe that was available last time I used it - I’ll check it out thanks!
Some advice: if you want people to trust and therefore use this feature, you should give this kind of talk:

https://youtu.be/4fFDFbi3toc

One talk that really made me confident about using a new DB technology was this one:

https://www.youtube.com/watch?v=4fFDFbi3toc&feature=youtu.be...

Maybe you could make videos and give talks about your testing procedures. That would probably help build trust.

NB: you're not distributed, but you are enabling multi-threaded access in an environment where your app can be killed at any time by the OS. That's quite hostile as well :)

timanglade
We did record this talk about Core recently: https://realm.io/news/jp-simard-realm-core-database-engine/ . Does it hit anywhere close to the mark?

Great observation on the threading — it’s definitely complicated to reason about and mobile is a very hostile environment due to this and other factors (e.g. mmap limits can be quite aggressive as well).

I'm very excited by this work. As they say, testing for distributed systems leaves a great deal to be desired, but their work is a great step!

I was excited for FoundationDB, who put a great deal of focus on testing[1], but they were acquired into mystery by Apple. To give you an idea of how well tested they were, Aphyr of "Call Me Maybe" fame didn't test them "because their testing appears to be waaaay more rigorous than mine"[2]. FoundationDB ran Jepsen internally anyway[3].

DEMi is a really nice tool that has me excited. I've worked with similar test systems for non-blocking IO models and they're hugely useful. For non-blocking IO, you can artificially accelerate time, since you can leap straight to the next interrupt, which speeds up fuzzing substantially (see: Flow from FoundationDB in [1]).

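A minimal, self-contained sketch of that time-leaping trick (hypothetical code, but the standard discrete-event pattern): the simulated clock jumps directly to the next scheduled event instead of sleeping, so hours of simulated waiting run in milliseconds of wall time.

    import heapq

    class SimClock:
        """Discrete-event scheduler: time advances by leaping to the next event."""
        def __init__(self):
            self.now = 0.0
            self.events = []  # min-heap of (fire_time, seq, callback)
            self.seq = 0      # tie-breaker so callbacks never get compared

        def call_later(self, delay, callback):
            heapq.heappush(self.events, (self.now + delay, self.seq, callback))
            self.seq += 1

        def run(self):
            while self.events:
                when, _, cb = heapq.heappop(self.events)
                self.now = when  # the leap: no real waiting between events
                cb()

    clock = SimClock()
    clock.call_later(3600, lambda: print(f'timeout fired at t={clock.now}s'))
    clock.call_later(0.002, lambda: print(f'disk read done at t={clock.now}s'))
    clock.run()  # an hour of simulated waiting, executed instantly
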
@ikneaddough, the demi-applications repo notes Spark; have you tested it so far? I quite like Spark, but I've had some very temperamental issues at scale.

[1]: https://www.youtube.com/watch?v=4fFDFbi3toc

[2]: https://twitter.com/aphyr/status/405017101804396546

[3]: https://web.archive.org/web/20150325003511/http://blog.found...

ikneaddough
Yup! We ran on Spark as well, though we were only focused on reproducing (and minimizing) known bugs, not on finding bugs. For more info, see our research paper: http://eecs.berkeley.edu/~rcs/research/nsdi_draft.pdf
Every time I read one of these posts (or others about MongoDB failures), I keep thinking about this talk: http://youtu.be/4fFDFbi3toc

And why no one has tried to repeat their strategy for building a robust DB system: start by building an extremely robust failure simulation and testing facility, then build your product.

Actually, I think what those guys at FoundationDB did was so exceptional that by buying the company and killing the product, Apple harmed the software industry for the next 10 years. The fact that FoundationDB is mentioned in the OP as the only distributed DB system one could recommend makes me more confident in that statement.

threeseed
FoundationDB was hardly redefining the database industry. There is more to a database being successful in the market than how well it fares in one particular guy's blog posts (as excellent as they are). Apple isn't harming anyone.
MrBuddyCasino
That was a great talk. If there isn't anyone doing this right now, something's wrong with the world. It's a huge opportunity.
lostcolony
Only distributed SQL store. You want non-distributed SQL, there are options, you want distributed NoSQL, there are options.
rspeer
Aphyr's been testing more than SQL stores.

"NoSQL" is not a magic bullet: concurrency is still hard when you skip the SQL.

lostcolony
I know. Sorry, I should have been clearer; what I meant was Aphyr has had some fine results for SQL DBs, provided you avoided distribution. And he has had some fine results for NoSQL DBs, when distributed, provided you preferred availability to consistency (generally in certain scenarios, or with certain caveats). What he has not had, and so presumably was lamenting, was any distributed DB that offered transactions that passed any sort of muster, except for FoundationDB, which was made defunct. That, or any DBs he could recommend that met every claim their documentation made (as that's largely what he's testing for).

I meant that with this being the first distributed SQL store, he was saying there was nothing he could recommend that actually offered similar guarantees to what this was claiming. That is, ACID style (well, technically --ID I believe) DB transactions in a distributed environment. FoundationDB did (though it was NoSQL), hence his mentioning it. But that's different than there is nothing he can recommend, at large; he can't recommend similar solutions, but for a given problem he could likely recommend a compelling, different, solution.

jhugg
This might be overdoing things a bit. FoundationDB was a rigorously tested K/V store with strong consistency (uncommon for such a product). Yes, that talk from Strange Loop was great, but there are two common misconceptions here:

(What follows is my opinion / guesswork as a VoltDB employee)

1. FoundationDB wasn't killed by Apple; it was rescued by Apple. The product couldn't compete on just being a KV store and wasn't doing well in the market. Apple saw a very bright and now experienced team and scooped them up for a song.

2. Before this happened, FoundationDB realized they needed a way to query their system to compete, so they bought Akiban (a failing SQL DB company) to add SQL to their system. But they assumed they could do this without deep integration, which was wrong. They added a SQL "layer" on top of the K/V store, and it was way too slow to be practical. The benchmarks they published were embarrassing.

I wrote a blog post about this: http://voltdb.com/blog/foundationdbs-lesson-fast-key-value-s...

SUMMARY: FoundationDB had great testing and great engineering, but not a particularly good product...

bsaul
Very interesting POV, thanks for sharing.

Actually, the thing I find most impressive in their tech stack is the approach they took to building it, starting with the simulator + C++ extension. Those are the technologies that I think would benefit the whole community, if they were ever open sourced.

As a VoltDB engineer, how do you ensure your implementation doesn't compromise the theoretical correctness of your system?

jhugg
I'm actually going to be discussing this at Strangeloop next month.

http://www.thestrangeloop.com/2015/all-in-with-determinism-f...

VoltDB is actually a bit simpler in what it promises: full serializable ACID for all transactions. This is much easier to understand and verify than weaker isolation levels.

We think what we've done is pretty clever too. We've built a determinism checker into our replication engine, so that we can verify that each replica has the same state at each logical point in time, and each operation makes identical changes to that state.

Then we built test patterns that are designed to be as co-dependent as possible and run them against a replicated VoltDB cluster. That VoltDB cluster goes through one or more kinds of failure, including multiple simultaneous failures, and then a checker ensures no data is ever lost, corrupted, or run in the wrong order.

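In sketch form, that replica comparison looks something like this (a toy model, not VoltDB's actual implementation): hash each replica's state at every logical point in time and flag the first divergence.

    import hashlib

    def state_digest(state):
        # Canonical, order-independent digest of a replica's state.
        blob = repr(sorted(state.items())).encode()
        return hashlib.sha256(blob).hexdigest()

    def check_determinism(replicas, ops):
        """Apply the same logical op stream to every replica; digests must match."""
        for i, op in enumerate(ops):
            digests = set()
            for state in replicas:
                op(state)  # each op deterministically mutates a replica's state
                digests.add(state_digest(state))
            if len(digests) != 1:
                raise AssertionError(f'replicas diverged after logical op {i}')

    replicas = [{}, {}, {}]
    ops = [lambda s: s.update(x=1), lambda s: s.update(y=s['x'] + 1)]
    check_determinism(replicas, ops)  # passes: all replicas stay identical
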
It's different from the FDB thing. The simulation they did is certainly easier to run on a pure KV store, but keep in mind we also have to test SQL that queries millions of rows, along with upserts, materialized aggregations, etc...

We're working on some blog content on this in addition to my talk. Stay tuned.

bsaul
Great, I'm looking forward to seeing that talk. To me, it seems the only way you can make sure an implementation is correct is either via extensive testing (and when talking about ACID properties over a distributed system, I think the kind of simulator FDB built is a requirement), or by running a theorem prover over your source code, à la Coq. But I've never heard of any code analysis tool that is able to guarantee properties over a distributed system.

But, since I'm absolutely not working in the field, I'm really looking forward to seeing what professionals like you are finding to tackle those issues.

Testing the filesystem? This talk is interesting as hell; it outlines how FoundationDB goes about testing unusual failures in systems like the filesystem and the network stack: https://www.youtube.com/watch?v=4fFDFbi3toc . Crazy beyond all belief.
Mar 25, 2015 · teeray on Apple Acquires FoundationDB
One of their engineers gave a great talk at Strangeloop (https://m.youtube.com/watch?v=4fFDFbi3toc) that shows some of the amazing simulations they ran FoundationDB through. It's well worth a watch if you want to understand the punishment that they put this database through to make it stand out.
Dec 11, 2014 · itp on Databases at 14.4Mhz
I'll give you some brief thoughts from my experience working at FoundationDB, but if you really want the in-depth answer to what makes our simulation framework an enabling technology for everything we do, you should take a look at the talk my colleague Will Wilson gave at Strange Loop [0].

Everything in our core is written in Flow, our asynchronous programming environment. The architecture of our server is essentially single threaded, with (pseudo-)concurrent actors allowing a single process to satisfy multiple roles at once.

The interfaces that these actors use to communicate -- with the disk, over the network, with the OS in general -- are abstracted and implemented in Flow as well. Each of these interfaces not only has at least one "real" implementation, but at least one, and sometimes more, simulated implementations.

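Outside of Flow, the pattern might look roughly like this (a hypothetical sketch: one interface contract, one real implementation, and one simulated implementation whose failures the test controls and can replay from a seed):

    import random
    import socket

    class Network:
        """The contract both the real and simulated networks must fulfill."""
        def send(self, addr, payload):
            raise NotImplementedError

    class RealNetwork(Network):
        def send(self, addr, payload):
            with socket.create_connection(addr) as s:  # actual TCP
                s.sendall(payload)

    class SimNetwork(Network):
        """In-process delivery; failures are injected deterministically."""
        def __init__(self, seed, drop_rate=0.05):
            self.rng = random.Random(seed)  # seeded, so runs are replayable
            self.inboxes = {}
            self.drop_rate = drop_rate

        def send(self, addr, payload):
            if self.rng.random() < self.drop_rate:
                return  # simulated packet loss, reproducible from the seed
            self.inboxes.setdefault(addr, []).append(payload)

    # Simulated "machines" talk over the same interface production code uses.
    net = SimNetwork(seed=42)
    net.send(('worker-1', 4500), b'hello')
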
Since we run multiple actors in a single process all the time, running yet more actors, pretending to be different machines, all still in the same process, was an obvious step. These workers communicate with each other via the _same_ network interfaces that real-world workers do in real clusters.

In the end we are able to become our own adversaries, pushing the limits of the system in ways that just don't happen often enough in the real world to test and debug in the wild. Our simulated implementations are allowed to present any behavior that can be observed in the real world, or justified by the spec, or implied by the contract. And they can do so much more frequently.

But most importantly -- as in, without this our system would never have been developed -- we can simulate these pathological behaviors in a completely deterministic fashion. Having the power to run a million tests, simulating multi-machine clusters trying to complete workloads while suffering from the most unbelievable combination of network partitions, dropped packets, failed disks, etc. etc., all while knowing that any error can be replayed, stepped through in a debugger, in a single process on a single machine... without this capability, we would never have had the confidence to build our product or evolve it as aggressively as we have.

tl;dr: We're not making simulations _of_ the system, we're running the system _in_ simulation, and that makes all the difference.

[0]: https://www.youtube.com/watch?v=4fFDFbi3toc

Sep 20, 2014 · 61 points, 6 comments · submitted by tosh
wwilson
I'm glad people have been getting a kick out of this, but for those who don't want to listen to me drone on for 40 minutes:

(1) Here are the slides I used: http://www.slideshare.net/FoundationDB/deterministic-simulat...

(2) Here is a (significantly out of date) white paper of ours which covers some of the same territory: https://foundationdb.com/key-value-store/white-papers/testin...

mpweiher
Great stuff! Would it be fair to characterize this in software architecture terms as taking your components (under test) and using different connectors that model the real world?
tinco
You actually have a very pleasant presentation style, it drew me in and I enjoyed every minute of it. Though perhaps it's just because I find the subject and your solution very interesting.

What's your standpoint on formal verification? Did you guys think about it and reject it?

jamii
http://db.cs.berkeley.edu/papers/dbtest12-bloom.pdf has a really nice extension of this idea - since their language is unordered by default and has only a few explicit ordered primitives, they can use an SMT solver to determine which event interleavings cannot possibly affect the result. This lets them generate schedules that more efficiently explore the space of all possible schedules and find bugs faster.
tinco
This talk is awesome. The idea that the simulation framework becomes part of the production code is brilliant. I wonder if these ideas could be merged with formal methods, so that perhaps a model of the simulation could be generated, and then through model analysis it could generate simulation stories for itself that humans might overlook.
parley
I did this for a decently complex distributed system for embedded devices, and it practically saved my life.

It was in C and I didn't do any language extending/precompiling, but I had interfaces for everything related to I/O, execution actors, randomness, etc.

On target hardware everything used TCP/UDP/real disk, pthreads, normal rand sources, etc. In simulation everything used virtual networking, a single simulation thread stepping all event loops, test-specified random seeds for reproducibility, etc.

It is completely invaluable. I can concisely write completely deterministic system tests that will execute tens of thousands of lines of code. I can fuzz test actor scheduling, I/O problems like dropped packets/msgs, and everything you can think of. I can run the entire test suite in Valgrind and other nice tools. I can put a big machine in a corner of the office to fuzz test the suite for weeks on end and have it email me when a test fails and tell me exactly which random seed to use to reproduce the failure myself within minutes. I can debug the entire simulation perfectly in GDB.

I've barely begun to describe how great it is to have, how many bugs these tests have caught or what a reliable regression test suite one can build. It doesn't replace testing on target - I do that extensively as well. Big system scenario tests don't replace smaller module and unit tests - I do those too. But deterministic simulation testing saved my sanity. Don't hesitate to evaluate this approach if you're doing something similar.
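
In skeletal form, that fuzz-and-reproduce loop looks something like this (a hypothetical sketch; all the interesting work hides behind run_simulation, and the key property is that the seed is the entire repro):

    import random
    import sys

    def run_simulation(seed):
        """One fully deterministic system test: all randomness flows from `seed`."""
        rng = random.Random(seed)
        # ... drive the simulated cluster: rng decides scheduling order,
        # dropped messages, disk faults, reboots, and so on ...
        return rng.random() > 1e-6  # stand-in for the test's pass/fail verdict

    def fuzz(iterations=100_000):
        for _ in range(iterations):
            seed = random.randrange(2**63)
            if not run_simulation(seed):
                # Rerun with this seed to step through the failure in a debugger.
                print(f'FAILED: rerun run_simulation({seed})', file=sys.stderr)
                return seed
        print('all runs passed')

    fuzz()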

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.