HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
CppCon 2017: Matt Kulukundis “Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step”

CppCon · Youtube · 14 HN points · 16 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention CppCon's video "CppCon 2017: Matt Kulukundis “Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step”".
Youtube Summary
http://CppCon.org

Presentation Slides, PDFs, Source Code and other presenter materials are available at: https://github.com/CppCon/CppCon2017

Hash tables consume a large volume of both compute resources and memory across Google's production system. The design for hash tables in C++ traces its origins to the SGI STL implementation from 20 years ago. Over these years, computer architecture and performance has changed dramatically and we need to evolve this fundamental data structure to follow those changes. This talk describes the process of design and optimization that starts with std::unordered_map and ends with a new design we call "SwissTable", a 2-level N-way associative hash table. Our implementation of this new design gets 2-3x better performance with significant memory reductions (compared to unordered_map) and is being broadly deployed across Google.

Matt Kulukundis: Google, Senior Software Engineer

Matt is a senior software engineer on the C++ libraries team at Google. Prior to Google he has worked on compilers, machine learning, and underwater robotics. In his free time, he scuba dives in warm places.

Videos Filmed & Edited by Bash Films: http://www.BashFilms.com

*-----*
Register Now For CppCon 2022: https://cppcon.org/registration/
*-----*

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Related: this talk https://www.youtube.com/watch?v=ncHmEUmJZf4 about the making of absl::flat_hash_map (a "competitor" to boost::unordered_flat_map) is one of my favorites.
garaetjjte
This talk is also interesting, covering multiple approaches: https://www.youtube.com/watch?v=M2fKMP47slQ
Could someone explain to me in a clear and high level fashion what exactly pointer stability (or pointer instability, for that matter) entails? I've watched some of the cppcon talks on the google hashtable implementation [0] and in that talk Matt Kulukundis also mentioned it affecting the adoption rate, but I always failed to really grasp the concept and its implications. Google searches also ended up rather fruitless to be honest.

[0]: https://youtube.com/watch?v=ncHmEUmJZf4

gpderetta
If you allocate your objects inline in your internal table, rehashing will invalidate existing references to those objects as they will need to be copied into the new, larger, table.

Allocating every object separately in its own node, as used in chaining implementations, guarantees that the object location is stable (i.e. doesn't change), but in addition to making iteration slower it requires a separate allocation for each object.

UncleMeat
An operation maintains pointer stability if, when you take a pointer to an element in a data structure and then execute that operation, the pointer is still valid and points to the same data.

Think about how vector would be implemented under the hood. What happens if you delete an element? Does it shift down all of the elements after it? If you do that, the objects have literally moved in memory. Any pointer that points into those memory locations is now invalid. That's a problem if you wanted to hold onto pointers to elements. But if you declare that your interface has certain pointer stability guarantees then suddenly your implementation is massively constrained. Now your vector implementation will have big gaps that you need to deal with and will waste a ton of memory if you add and remove many elements.
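A minimal sketch of that failure mode (hypothetical snippet, not from the talk):

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v{1, 2, 3, 4};
        int* p = &v[2];               // points at the element holding 3
        v.erase(v.begin());           // later elements shift down one slot
        // The address p holds now refers to shifted data (and is formally
        // invalidated by erase), so dereferencing it would be a bug:
        // std::printf("%d\n", *p);
        std::printf("%d\n", v[1]);    // 3 is still in the vector, just elsewhere
    }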

Koffiepoeder
So, am I correct if I then put it as follows?

Pointer instability means that it is undefined behaviour to:

1) store (both on stack and heap) the pointer address to a value element outside of its respective container

2) then modify the container's content

3) then thereafter use the stored pointer address

This implies that:

- containers without pointer stability require their values to be copied out (pass-by-value, or a pointer to a temp copy) after accessing them, OR

- containers without pointer stability require absolute immutability of the container to guarantee runtime safety, OR

- containers without pointer stability require "relative" immutability during the existence of a pointer to a value element to guarantee runtime safety. Note that I think this last option is a bad idea, because it requires multithreaded blocking/synchronization of the container and is also easy to encapsulate wrongly, leading to UB.

brandmeyer
> 2) then modify the container's content

... with a specific set of mutating operations that (may) alter the container's structure (eg, element insertions and removals).

Altering the value of one element of the container doesn't invalidate iterators or references to any element. You can make as many mutating passes over a std::vector as you like without invalidating anything.

Things only start to get dicey when you push_back(), insert(), erase(), etc.
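In other words, something like this generic sketch (not tied to any particular standard-library implementation):

    #include <vector>

    int main() {
        std::vector<int> v{1, 2, 3};
        int* p = &v[0];
        v[0] = 42;          // mutating an element: no invalidation, p is fine
        *p = 7;             // still valid
        v.push_back(4);     // may reallocate; if it does, p is now dangling
        // *p = 8;          // undefined behaviour if a reallocation happened
    }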

STL has a lot of weird pitfalls. There was std::vector<bool>. Here you can see some pitfalls of std::unordered_map: <https://youtu.be/ncHmEUmJZf4?t=2220>. The whole talk is interesting to watch. In the beginning you can also see the puzzling design of std::unordered_map (puzzling because of the number of indirections).

I'd reach for abseil first: <https://abseil.io/>.

qalmakka
> STL has a lot of weird pitfalls

everything that's old and has legacy is doomed to have several. It's just a fact of life.

The C++ committee doesn't and can't just throw away stuff like `std::vector<bool>`, despite it being a whole fractal of bad ideas, because people have already used that piece of dung in uncountable projects. It would be nice to have a way to get a _normal_ vector of booleans (even though vector<char> just suffices most of the time), but that's life I guess.
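For anyone who hasn't been bitten by it, the core surprise is that std::vector<bool> is bit-packed, so operator[] hands back a proxy object rather than a bool& (a quick illustrative snippet, not from the talk):

    #include <vector>

    int main() {
        std::vector<bool> flags{true, false};
        // bool* p = &flags[0];      // won't compile: operator[] returns a proxy, not bool&
        auto ref = flags[0];         // copies the proxy, which still refers into the vector
        flags[0] = false;
        bool b = ref;                // reads the *current* bit: false, not the value at copy time

        std::vector<char> bytes{1, 0};   // a plain vector: real references, addressable elements
        char* q = &bytes[0];
        (void)b;
        (void)q;
        return 0;
    }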

yakubin
People bring this point up every time C++ is criticized, but no, not all C++ shortcomings come from its age. In fact, most of the shortcomings it is criticized for today are new. The committee just keeps on adding new ones: initialization (mentioned by another reply to my comment), std::optional & std::variant (UB), concepts[1], std::thread (good luck debugging where the std::thread which crashes your program in a destructor came from), std::random_device()() being allowed to return 0 every time (which e.g. won't show up when you compile for your own machine, but will when you cross-compile) and probably many others which I don't remember off the top of my head.

[1]: <https://news.ycombinator.com/item?id=28098153>

But back to the point, my original comment wasn't about C++ being a bad choice. It was about STL. You can use C++ without STL. I pointed to one alternative: abseil.

qalmakka
It does indirectly come from age, because legacy and backward compatibility mean that you can't simply change some core parts of the language at all. For instance, std::optional uses UB because _all_ of C++ does the same, and making std::optional<T>::operator* different from std::unique_ptr<T>::operator* would have caused similar concepts to have a different behaviour - something that would have definitely translated into weird bugs.

Also, C++ doesn't have features like Rust-style (destructive) move semantics and borrow checking, so it's very hard to enforce certain things that just come naturally in Rust. That's simply stating a fact. The real point here is that std::optional<T> is much less likely to cause UB than a plain, old C pointer, so it's a net improvement - even though it could still lead to UB in some circumstances.
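Concretely, the consistency argument is about the two operator*s behaving the same way; both calls below are undefined behaviour rather than guaranteed crashes (illustrative snippet):

    #include <memory>
    #include <optional>

    int main() {
        std::optional<int> o;        // empty
        std::unique_ptr<int> p;      // null
        // Both operator*s state a precondition instead of checking it,
        // so both of these would be undefined behaviour:
        // int a = *o;
        // int b = *p;
        return o.value_or(0);        // the checked accessor optional does offer
    }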

josefx
> https://youtu.be/ncHmEUmJZf4?t=2220

Some good points, but also "calling find with C strings is slow", you are using C strings, don't expect speed from code worshiping strlen of all things. Also the issue with integer hash being its identity? That is the case for every language I checked (Python, Java, C#).

jcelerier
I mean, if Python, Java or C# were good enough for me as a C++ user, I'd use them.

The mindset behind C++ is not a relative "thing must be good enough", but an absolute "there must not be something better possible".

Of course, this is often not realized in practice, but it's the goal.

josefx
> "there must not be something better possible"

It maps all integer values to distinct hashes, which seems rather ideal; what it doesn't give you is perfect performance in an artificial benchmark written to exploit this implementation.

jcelerier
Tangent, but I remember reading not too long ago that there were some cases where perfect hashes weren't optimal, but can't remember where..
josefx
So his first trap is calling find with a C string on a map that expects std::string? Yeah, I totally believe someone passing around a type that needs constant calls to strlen cares about performance and isn't just using bad code to fish for edge cases.
fnord77
after watching that video, I never want to touch C++ again
pjmlp
Then someone has to do it for you, because for the foreseeable future there will be plenty of critical libraries, and compiler infrastructure, that won't be available with anything else as the implementation language.

Hence why I keep my C and C++ skills sharp, even though they aren't my main languages any longer.

fnord77
or it is time to replace these critical libraries with something that has the same features and is far less convoluted and far harder to make serious mistakes with.

In addition to experts accidentally making mistakes like the `ref ref` one in that first video, which resulted in performance degradation due to unwanted copies, C/C++ is a major security risk. Countless vulns come from simple mistakes and complex misunderstandings.

pjmlp
That is a noble goal, first LLVM and GCC need to be rewritten in one of those languages, then CUDA, Metal Shaders, DirectX, Tensorflow, Qt, WinUI, ....

So even though managed languages and better systems programming languages are gradually replacing it, somewhere down the stack there will still be C++.

And then to top it off, as long as people insist on using UNIX clones, C will be there as well.

tekknik
This is a pipe dream. There are systems still running COBOL. It’s a real hard sell to convince those with the money to let someone rewrite something that’s working and has been for some time.
db48x
Several CPPCON talks in recent years have reinforced that for me as well. https://www.youtube.com/watch?v=7DTlWPgX6zs, for example.

Edit: I originally linked to the wrong video (https://www.youtube.com/watch?v=TFMKjL38xAI), which was a different talk by the same person.

cjfd
These pitfalls really aren't that bad. An extra copy here and there only matters in the tight loops. Some of the examples in the talk are really contrived. Am I going to insert the empty string many times as a string in a hash map? No, in most cases I am going to make sure that almost all of the keys that I generate are different. Sure, google can save on its electricity bill by doing micro-optimizations. For me trying that would be a bad trade.

Every language has its pitfalls and I tend to prefer C++ pitfalls to some other pitfalls. Lately I have been doing some Python. Turns out a python -m venv is different from a virtualenv. Turns out pytest some/specific/test/file.py is different from python -m pytest some/specific/test/file.py. I am wishing quite hard this code base consisted of all C++. And of course mypying this is still on the TODO list so the type annotations that are there might well lie.

Jun 05, 2021 · 3 points, 0 comments · submitted by tosh
Mar 08, 2021 · sujayakar on MIPS Becomes RISC-V
Very cool! Independent of the cool use of `aesenc` and `aesdec`, the features for skipping ahead in the random stream and forking a separate stream are awesome.

> My current home project is bump-allocator + semi-space garbage collection in SIMD for GPUs. As far as I can tell, both bump-allocation and semi-space garbage collection are easily SIMDified in an obvious manner. And since cudamalloc is fully synchronous, I wanted a more scalable, parallel solution to the GPU memory allocation problem.

This is a great idea. I wonder if we could speed up LuaJIT even more by SIMD accelerating the GC's mark and/or sweep phases...

If you're interested in more work in this area, a former coworker wrote a neat SPMD implementation of librsync [1]. And, if you haven't seen it, the talk on SwissTable [2] (Google's SIMD accelerated hash table) is excellent.

[1] https://github.com/dropbox/fast_rsync

[2] https://www.youtube.com/watch?v=ncHmEUmJZf4

dragontamer
> Very cool! Independent of the cool use of `aesenc` and `aesdec`, the features for skipping ahead in the random stream and forking a separate stream are awesome.

Ah yeah, those features... I forgot about them until you mentioned them, lol.

I was thinking about 4x (512-bits per iteration) with enc(enc), enc(dec), dec(enc), and dec(dec) as the four 128-bit results (going from 256-bits per iteration to 512-bits per iteration, with only 3-more instructions). I don't think I ever tested that...

But honestly, the thing that really made me stop playing with AESRAND was discovering multiply-bitreverse-multiply random number generators (still unpublished... just sitting in a directory in my home computer).

Bit-reverse is single-cycle on GPUs (NVidia and AMD), and perfectly fixes the "multiplication only randomizes the top bits" problem.

Bit-reverse is unimplemented on x86 for some reason, but bswap64() is good enough. Since bswap64() and 64-bit multiply are both really fast on x86-64, a multiply-bswap64-multiply generator is probably fastest for typical x86 code (since there are penalties for moving between x86 64-bit registers and AVX 256-bit registers).

---------

The key is that multiplying by an odd number (bottom-bit == 1) results in a fully invertible (aka: no information loss) operation.

So multiply-bitreverse-multiply is a 1-to-1 bijection in the 64-bit integer space: all 64-bit integers have a singular, UNIQUE multiply-bitreverse-multiply analog. (with multiply-bitreverse-multiply(0) == 0 being the one edge case where things don't really work out. An XOR or ADD instruction might fix that problem...).
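A quick way to see the invertibility (my own sketch, using bswap64 as the x86 stand-in mentioned above, and GCC/Clang builtins): every odd multiplier has a multiplicative inverse mod 2^64, and a byte swap is its own inverse, so the whole mix can be peeled off step by step.

    #include <cassert>
    #include <cstdint>

    // Inverse of an odd 64-bit multiplier mod 2^64 via Newton's iteration:
    // each step doubles the number of correct low bits.
    static uint64_t inverse(uint64_t a) {
        uint64_t x = a;                              // correct to 3 bits for odd a
        for (int i = 0; i < 5; ++i) x *= 2 - a * x;
        return x;
    }

    static uint64_t mix(uint64_t v, uint64_t k1, uint64_t k2) {
        return __builtin_bswap64(v * k1) * k2;       // multiply-bswap-multiply
    }

    int main() {
        const uint64_t k1 = 0x9E3779B97F4A7C15ull;   // arbitrary odd constants
        const uint64_t k2 = 0xC2B2AE3D27D4EB4Full;
        uint64_t v = 0x0123456789ABCDEFull;
        uint64_t h = mix(v, k1, k2);
        uint64_t back = __builtin_bswap64(h * inverse(k2)) * inverse(k1);
        assert(back == v);                           // bijection: nothing was lost
        return 0;
    }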

---------

> This is a great idea. I wonder if we could speed up LuaJIT even more by SIMD accelerating the GC's mark and/or sweep phases...

Mark and Sweep looks hard to SIMD-accelerate in my opinion. At least, harder than a bump-allocator. I'm not entirely sure what a SIMD-accelerated traversal of the heap is even supposed to look like (aka: simd-malloc() looks pretty hard).

If all allocs are prefix-sum'd across the SIMD-units (ex: malloc({1, 4, 5, 1, 2, 3, 20, 10}) == return (memory + {0, 1, 5, 10, 11, 13, 16, 36, 46}))... for a bump-allocator-like strategy... it's clear to me that such a mark/sweep allocator would have fragmentation issues. But I guess it would work...

Semispace collectors innately fix the fragmentation problem. So prefix-sum(size+header) allocators are just simple and obvious.
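A scalar sketch of what prefix-sum bump allocation looks like (my own illustration, nothing to do with cudamalloc): each lane's pointer is the running bump pointer plus the exclusive prefix sum of the requested sizes, and the bump pointer advances by the total.

    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    struct BumpAllocator {
        uint8_t* base;
        size_t   top = 0;

        // Serve a whole wavefront of requests at once: lane i gets
        // base + top + exclusive_prefix_sum(sizes)[i].
        std::vector<uint8_t*> alloc_group(const std::vector<size_t>& sizes) {
            std::vector<size_t> offsets(sizes.size());
            std::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin(), size_t{0});
            std::vector<uint8_t*> ptrs(sizes.size());
            for (size_t i = 0; i < sizes.size(); ++i) ptrs[i] = base + top + offsets[i];
            if (!sizes.empty()) top += offsets.back() + sizes.back();
            return ptrs;
        }
    };

    int main() {
        std::vector<uint8_t> heap(64);
        BumpAllocator a{heap.data()};
        auto p = a.alloc_group({1, 4, 5, 1, 2, 3, 20, 10});
        // lane offsets: 0, 1, 5, 10, 11, 13, 16, 36; new top: 46
        return p.size() == 8 ? 0 : 1;
    }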

--------

On the "free" side of Mark/sweep... I think the Mark-and-sweep itself can be implemented in GPU-SIMD thanks to easy gather/scatter on GPUs.

However, because gather/scatter is missing (scatter is missing from AVX2), or slow (AVX512 doesn't seem to implement a very efficient vgather or vscatter), I'm not sure if SIMD on CPU-based Mark/Sweep would be a big advantage.

------------

Yup yup. Semispace GC or bust, IMO anyway for the SIMD-world. Maybe mark-compact (since mark-compact would also fix the fragmentation issue).

The mark-phase is just breadth-first-search, which seems like a doable SIMD-pattern with the right data-structure (breadth-first is easier to parallelize than depth-first)

sujayakar
> Bit-reverse is unimplemented on x86 for some reason, but bswap64() is good enough.

You totally nerd-sniped me! I implemented a basic "reverse 128-bit SIMD register" routine with `packed_simd` in Rust. The idea is to process 4 bits at a time:

    let lo_nibbles = input & u8x16::splat(0x0F);
    let hi_nibbles = input >> 4;
Then, we can use `pshufb` to implement a lookup table for reversing each vector of nibbles.

    let lut = u8x16::new(0b0000, 0b1000, 0b0100, 0b1100, 0b0010, 0b1010, 0b0110, 0b1110, 0b0001, 0b1001, 0b0101, 0b1101, 0b0011, 0b1011, 0b0111, 0b1111);
    let lo_reversed = lut.shuffle1_dyn(lo_nibbles);
    let hi_reversed = lut.shuffle1_dyn(hi_nibbles);
Now that each nibble is reversed, we can flip the lo and hi nibbles within a byte when reassembling our u8x16.

    let bytes_reversed = (lo_reversed << 4) | hi_reversed;
Then, we can shuffle the bytes to get the final order. We could use a different permutation for simulating reversing f64s in a f64x2, too.

    let rev_bytes = u8x16::new(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    return bytes_reversed.shuffle1_dyn(rev_bytes);
Looking at the disassembly, if we assume our LUT and shuffle vectors are already in registers, the core shuffle should be pretty fast. (I haven't actually benchmarked this or run it through llvm-mca, though :-p)

    __ZN8arbolito7reverse17he044c5155cfe877bE:
    push    rbp
    mov     rbp, rsp
    movdqa  xmm1, xmmword ptr [rip + 557012]
    movdqa  xmm2, xmm0
    pand    xmm2, xmm1
    psrlw   xmm0, 4
    pand    xmm0, xmm1
    movdqa  xmm1, xmmword ptr [rip + 557003]
    movdqa  xmm3, xmm1
    pshufb  xmm3, xmm2
    pshufb  xmm1, xmm0
    psllw   xmm3, 4
    pand    xmm3, xmmword ptr [rip + 556992]
    por     xmm3, xmm1
    pshufb  xmm3, xmmword ptr [rip + 556995]
    movdqa  xmmword ptr [rdi], xmm3
    pop     rbp
    ret
And, this does a full bitstring reversal, not just reversing the bytes like `bswap64`, right?

> The key is that multiplying by an odd number (bottom-bit == 1) results in a fully invertible (aka: no information loss) operation.

This is neat! How do you choose odd numbers so the final generated numbers are high quality?

dragontamer
> This is neat! How do you choose odd numbers so the final generated numbers are high quality?

This was years ago, so I forget the details. But it was along the lines of...

    uint32_t hash(uint32_t seed, uint32_t k1, uint32_t k2){
        return (brev(seed*k1) * k2);
    }

    uint32_t evaluate(uint32_t seed, uint32_t k1, uint32_t k2){
        return popcnt(hash(seed, k1, k2) ^ seed);
    }
The goal is to find the values of k1 and k2 that resulted in an evaluate(seed, k1, k2) close to 16-bits (aka: 50% of bits change, the definition of "avalanche condition"). There's probably some statistical test I could have done that'd be better, but GPUs have single-cycle popcount and single-cycle XOR.

I forgot exactly which search methods I used, but note that a Vega64 GPU easily reaches 10 Trillion-multiplies / second, so you can exhaustively search a 32-bit space in a ~millisecond, and a 40-bit space in just a couple of seconds.

You can therefore search the values of k1 and k2 ~8-bits at a time every few seconds. From there, plug-and-play your favorite search algorithm (genetic algorithms? Gradient descent? Random search? Simulated annealing?).

--------

After that, I'd of course run it through PractRand or BigCrush (and other tests). In all honesty: random numbers (with bottom bit set to 1) from /dev/urandom are already really good.

---------

Exhausting the 64-bit space seems unreasonable however. I was researching FNV-hashes (another multiplication-based hash), trying to understand how they chose their constants.

Feb 27, 2021 · 4 points, 0 comments · submitted by r4victor
Yes? Why wouldn't people talk about hash tables at conferences?

https://www.youtube.com/watch?v=aXj_DsIx1xs

https://www.youtube.com/watch?v=ncHmEUmJZf4

https://www.youtube.com/watch?v=M2fKMP47slQ

https://www.youtube.com/watch?v=JZE3_0qvrMg

CyberDildonics
Talking about hash tables at a conference and a conference dedicated to hash tables wouldn't be the same thing.
Author here.

A few weeks ago, we wrote about how we implemented SIMD instructions to aggregate a billion rows in milliseconds [1] thanks in great part to Agner Fog’s VCL library [2]. Although the initial scope was limited to table-wide aggregates into a unique scalar value, this was a first step towards very promising results on more complex aggregations. With the latest release of QuestDB, we are extending this level of performance to key-based aggregations.

To do this, we implemented Google’s fast hash table aka “Swisstable” [3] which can be found in the Abseil library [4]. In all modesty, we also found room to slightly accelerate it for our use case. Our version of Swisstable is dubbed “rosti”, after the traditional Swiss dish [5]. There were also a number of improvements thanks to techniques suggested by the community such as prefetch (which interestingly turned out to have no effect in the map code itself) [6]. Besides C++, we used our very own queue system written in Java to parallelise the execution [7].

The results are remarkable: millisecond latency on keyed aggregations that span over billions of rows.

We thought it could be a good occasion to show our progress by making this latest release available to try online with a pre-loaded dataset. It runs on an AWS instance using 23 threads. The data is stored on disk and includes a 1.6 billion row NYC taxi dataset, 10 years of weather data with around 30-minute resolution and weekly gas prices over the last decade. The instance is located in London, so folks outside of Europe may experience different network latencies. The server-side time is reported as “Execute”.

We provide sample queries to get started, but you are encouraged to modify them. However, please be aware that not every type of query is fast yet. Some are still running under an old single-threaded model. If you find one of these, you’ll know: it will take minutes instead of milliseconds. But bear with us, this is just a matter of time before we make these instantaneous as well. Next in our crosshairs is time-bucket aggregations using the SAMPLE BY clause.

If you are interested in checking out how we did this, our code is available open-source [8]. We look forward to receiving your feedback on our work so far. Even better, we would love to hear more ideas to further improve performance. Even after decades in high performance computing, we are still learning something new every day.

[1] https://questdb.io/blog/2020/04/02/using-simd-to-aggregate-b...

[2] https://www.agner.org/optimize/vectorclass.pdf

[3] https://www.youtube.com/watch?v=ncHmEUmJZf4

[4] https://github.com/abseil/abseil-cpp

[5] https://github.com/questdb/questdb/blob/master/core/src/main...

[6] https://github.com/questdb/questdb/blob/master/core/src/main...

[7] https://questdb.io/blog/2020/03/15/interthread

[8] https://github.com/questdb/questdb

seektable
Hi Vlad,

what BI tools can currently connect to QuestDB for ad-hoc reporting (without the need to write SQL queries by hand)? I guess popular solutions like Tableau/PowerBI don't support QuestDB.

Is there a demand for BI tools that can use QuestDB as a data source? I'm asking about that because we can add a connector for QuestDB to our BI tool (SeekTable) with a little effort, but this only makes sense if someone will really use it.

y42
I cannot speak for the whole industry, but I know for sure that I (working in BigData / OnlineMarketing) was looking for exactly this recently: a BI tool that is agile and a database that is fast. I ended up with Tableau Server and ClickHouse (which is impressively fast). Problem was: Tableau does not really support CH and Tableau is not that agile. So to answer your question: yes. If QuestDB is that fast, there is a demand.
seektable
JFYI SeekTable already has a connector for ClickHouse :) It works over the native binary protocol (with connection-level compression). Since you have control over the SQL query and the SQL expressions used for dimensions/measures, the ClickHouse SQL dialect is not a problem.
maltalex
That's a very cool demo, good job! I have a few questions if you don't mind.

- Could you share the specs of the hardware powering the demo?

- For IoT-type usage, one common access pattern is "get the readings of sensor X from device Y in the given time frame". But unfortunately, the data set in the demo doesn't include something like "cab_id" column that would allow me to simulate such a query. Is there a demo that has such a column?

- Does QuestDB require rows to be inserted strictly in order, or is there some leeway?

- Are you using any sort of compression? Timeseries columns are extremely compressible, and a good compression algorithm can greatly reduce IO wait times. We're using the Postgres-based TimescaleDb at work, and its compression capabilities have been a game changer [0].

- Do you plan on supporting GROUP BY clauses?

[0]: https://blog.timescale.com/blog/building-columnar-compressio...

sirffuzzylogik
Thanks for your feedback, I am pleased to know that you like it!

- This is a c5.metal instance (AWS) with 48 physical cores and 192 GB of memory. We use only 24 of those 48 cores.

- If I understand your example correctly, the query "select * from trips where trip_type = 'Dispatch' and pickup_datetime = '2018-03' and vendor_id = 'VTS';" should be similar. There is no limit on how many filters you can specify in the where clause, and they will have a very minor (if any) impact on performance.

- Currently yes, but this is something we are working on. Out-of-order insertion is one of our priorities and you can expect it to be implemented in the very near future.

- Currently no. However, the way data is stored means that columns are extremely compressible as you mentioned. Our internal API (how we store, access/retrieve columns) would make it easy for us to implement compression, we will do it if this is something important for our users. In the meantime, a filesystem like BTRFS or ZFS would do a nice job (if disk usage is your concern).

- GROUP BY is automatically inferred, users don't need to specify it explicitly, you can find more details about this here: https://questdb.io/docs/sqlExtensions#important-differences-...

EDIT: fix typos

valyala
Great demo!

I'm curious why the following query doesn't finish in tens of seconds:

    select sum(passenger_count) from trips where passenger_count=1
The same query executes in about a hundred milliseconds on ClickHouse running on the same hardware (~20 vCPUs).
bluestreak
The use of the where clause right now turns off vectorization completely and falls back on simple single-threaded execution. This is the reason why the query is slow right now. We are going to vectorize 'where' really soon.
osipov
Cool project! You posted a comparison to Clickhouse on the front page. Do you have any insights on how QuestDB compares performance-wise to PrestoDB?

Also, PrestoDB is known to be manageable at scale, for example as part of AWS Athena. What are your thoughts on building out an Athena-like service using QuestDB?

bluestreak
Athena has been benchmarked here with a similar dataset (1.1B rows vs our 1.6Bn):

https://tech.marksblogg.com/benchmarks.html

TheTank
We will definitely make QuestDB distributed in the near future. Today, our focus is to extract the maximum performance out of a single node. If we start scaling out too early, it may become harder to improve performance after the fact. So all improvements we get today are foundational: they will have multiplicative effects once you start to scale out.
May I suggest you have a look at Swiss Tables? There's a good talk from CppCon [0] that explains its design decisions, and implementations are available as part of Abseil [1] and the Rust stdlib [2].

[0] https://www.youtube.com/watch?v=ncHmEUmJZf4 [1] https://github.com/abseil/abseil-cpp [2] https://github.com/rust-lang/hashbrown

Apparently the standard imposes a set of requirements that the implementation must abide by, thus making it a lot slower.

[0] : https://www.reddit.com/r/programming/comments/5pwgtn/hash_ma...

[1] : https://stackoverflow.com/questions/42588264/why-is-stdunord...

[2] : https://www.youtube.com/watch?v=ncHmEUmJZf4

Oct 11, 2019 · nwallin on Hash tables
I think you're underestimating the number of memory indirections that are necessary for chaining. It's more than one. That's the killer.

It's important to be wary of assuming your data will be in cache. It usually isn't. The problem isn't making lots of high density searches in a hash map, the problem is making a search every now and then. A hash map that makes three DRAM accesses will always be worse than one that makes one, regardless of how much work it needs to do, and this is for two reasons. First is the obvious: RAM is slow as balls, we get it. Second is more insidious. You just evicted two cache lines. It's not just the lookup itself that's slower, you also made the consumer of your data structure slower.

Here's a look at what open addressing looks like these days: https://youtu.be/ncHmEUmJZf4 (warning, it's an hour. Takes a while to build up, too. Still worth it.)

Aug 21, 2019 · kllrnohj on Bulk Data Structures C++
This is where we get into O(1) != fast territory. Algorithmic complexity has a weak relationship to CPU performance, not a strong one.

If you want to find something in a set, storing it as an array and doing a linear scan will beat a std::unordered_set up to a shockingly large number of items, due to how CPUs work.

In particular it's the pointer chasing aspect of std::unordered_set that becomes a problem (an issue shared with _most_ hash set implementations). Remember an unordered_set is not an array of items, it's an array of buckets of items (this is how hash collision is handled). Worse still, those buckets are usually linked lists. It typically can't be speculated effectively and it can't be prefetched effectively, so you become memory latency bound during an un-cached lookup. And memory latency is just shy of absolutely terrible. If you're expecting L1/L2/L3 cache hits on lookups then you're probably not dealing with very large sizes, and you're going to get much better cache density with the flat array than the array-of-buckets.

There are alternative hashsets that are flat and avoid this, but they are less common and as far as I know no standard implementation on any language uses such a hash set. There's a good talk about such a dense, flat hash set here: https://www.youtube.com/watch?v=ncHmEUmJZf4

dragontamer
I am unfamiliar with std-library details, but the solution is to use hash tables with linear probing, maybe with Robin Hood hashing.

Linear probing means that if h(key) has a collision, you store the value at h(key)+1. If that has a collision, you store at h(key)+2. Etc. etc. (wrap-around if necessary). Linear probing means that most accesses will have one main-memory fetch, and after that it's all prefetchers linearly scanning for the data.

Robin-hood hashing means that the "poor" steals from the rich. "Poor" values are the ones who have to be probed significantly: maybe 10x or 20x, or 100x, away from their preferred spot. A "rich" value is one which has been placed close to its ideal spot.

The idea is that you average-out your linear probing distances, so that instead of having 100x (on some probes) and 0x on other probes, all of your hash values will tend towards... roughly 3x probes or so. (I forget the math exactly).

It's as simple as checking on insertion: if values[h(foo)+rehash_count].rehash_count < current_rehash_count (the resident is "richer" than the key you're inserting), you insert foo into that location. Then, you take THAT value, and probe it forward.
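A toy sketch of linear probing with the robin-hood swap (my own code, 32-bit keys, power-of-two capacity, no resizing, deletion, or duplicate handling):

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct RobinHoodSet {
        struct Slot { uint32_t key = 0; bool used = false; uint32_t dist = 0; };
        std::vector<Slot> slots = std::vector<Slot>(64);

        size_t home(uint32_t key) const {
            return (key * 2654435761u) & (slots.size() - 1);   // cheap multiplicative hash
        }

        void insert(uint32_t key) {
            uint32_t dist = 0;                      // how far we've probed from key's home
            for (size_t i = home(key);; i = (i + 1) & (slots.size() - 1), ++dist) {
                if (!slots[i].used) { slots[i] = {key, true, dist}; return; }
                if (slots[i].dist < dist) {         // resident is "richer" than us:
                    std::swap(key, slots[i].key);   // steal its slot and keep probing
                    std::swap(dist, slots[i].dist); // forward with the displaced key
                }
            }
        }

        bool contains(uint32_t key) const {
            uint32_t dist = 0;
            for (size_t i = home(key); slots[i].used && slots[i].dist >= dist;
                 i = (i + 1) & (slots.size() - 1), ++dist)
                if (slots[i].key == key) return true;
            return false;
        }
    };
With the swap in place, probe lengths cluster around the mean instead of a few keys sitting very far from home, which is exactly the cache-friendliness described above.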

---------

It seems like hash-tables + linear probing with robinhood hashing is popular these days for L1 caches (everyone at offset 5 away from their ideal is way more cache friendly than most at offset 0, but some at offset 100). But I'm not aware of any standard-lib implementation of the technique.

yxhuvud
Some languages have hash tables (and hence sets, as they tend to be implemented with them) that use open addressing, in a way that doesn't end up being bad cachewise unless there are unreasonably many collisions for the same hash code.

It is also not uncommon for mature implementations to optimize the cases with few elements in the hash to use linear lookup. In some cases that optimization is also used while storing data in small hash tables.

Pointer-chasing hash tables were good when cache was nonexistent or small (i.e. the 90s), but nowadays open addressing is just superior.

Jul 04, 2019 · 1 points, 0 comments · submitted by mnem
They are very, very similar. The tables are two-level: one level is fairly traditional. You take the first k bits from the hash and look that up in a table. The second layer is 128-bit blocks (one SSE/NEON register). Each block contains 14 8-bit slots, where the top bit is a tombstone marker and the bottom 7 bits are the k+1 through k+7 bits from the hash (or the top 7 bits? not sure. it's 7 bits from the hash that aren't used in the top layer). It uses SSE instructions to identify which of the 14 slots to do a full key compare with.
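The SSE step in both designs boils down to something like this (a simplified sketch of the metadata probe, not the actual Abseil or Folly code; Abseil's groups hold 16 control bytes, F14's 14 slots plus extra metadata):

    #include <cstdint>
    #include <emmintrin.h>   // SSE2

    // `group` is one 16-byte block of per-slot metadata: a control value for
    // empty/deleted slots, or 7 bits of the element's hash for full slots.
    // Returns a bitmask with one bit set per slot whose fragment matches h2.
    static uint32_t match(const uint8_t* group, uint8_t h2) {
        __m128i ctrl  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(group));
        __m128i probe = _mm_set1_epi8(static_cast<char>(h2));    // broadcast the 7-bit fragment
        __m128i eq    = _mm_cmpeq_epi8(ctrl, probe);              // 0xFF where bytes are equal
        return static_cast<uint32_t>(_mm_movemask_epi8(eq));      // collapse to a 16-bit mask
    }
The caller then walks the set bits of the mask, doing a full key comparison only on those candidate slots, and moves on to the next group only if nothing matched and the group still has no empty slot.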

Both of them have two variants: one which guarantees that references to keys and values will remain valid even if iterators are invalidated, and another which is faster but where iterator invalidation also invalidates references to keys and values.

IMHO it looks like the Folly team saw the Abseil presentation last year and said, "That's a good idea, we should do that."

abseil presentation: https://www.youtube.com/watch?v=ncHmEUmJZf4 It's super interesting.

Open addressing has better cache performance, and for most workloads is much faster than chaining implementations. Hashbrown's implementation is based on Google's SwissTable, and they explained the reasoning behind their choices in this CppCon talk: https://www.youtube.com/watch?v=ncHmEUmJZf4
mehrdadn
How does open addressing handle deletions? I never figured this out.

Update: So I just read the article (I didn't have the chance earlier) and I see it explains tombstones, which seem like a pretty clever solution I wasn't thinking about. What I had been confused about, though, was the other, more obvious attempt at a solution, which is backshifting. I think where I had gotten stuck was what you do with the spot that opens up after the backshift... it might have been part of another probe chain (or multiple, for that matter) that had jumped over it before, so how do you find one of these chains to move an item back into this slot (which you would have to do)?

munificent
If you want more material, the "Hash Tables" chapter in my book walks through it:

http://www.craftinginterpreters.com/hash-tables.html

Jernik
Reading the article reveals that you can either use tombstones or move the elements back into the slot they "would have been in" if the deleted element was never added
shittyadmin
This is actually described quite well in the OP.
mehrdadn
Oh I see, thanks. I'm on my phone about to go to a meeting and haven't had a chance to read the article yet.
blattimwind
The easiest one is tombstones (i.e. "this item is deleted") to keep the chain alive, or backshifting (i.e. moving all items in the chain forward one slot).
0815test
I'm not sure that backshifting is as easy as "moving all items in the chain forward by one slot". Consider the hashtable [A, B1, B2, _, _], where one element is subsequently added after B2, giving [A, B1, B2, X, _]. Now when we remove B1 and shift B2 forward one slot ([A, B2, _, X, _]), we have to shift X forward if it hashes to the second or the third slot in the table, but not the fourth. So there might be multiple chains involved; if it was one contiguous chain only, we could simply arrange for the "hole" in it to be filled with no need for shifting all the items in it, and it would be quite efficient. However, it seems that it's not so simple.
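The textbook resolution for plain linear probing (essentially Knuth's deletion algorithm, sketched here from memory) handles exactly this interleaving: walk forward from the hole and pull an entry back into it only if the hole lies on that entry's probe path (i.e. cyclically between its home slot and its current slot), then continue with the newly opened slot until an empty slot is reached.

    #include <cstdint>
    #include <vector>

    struct Table {
        struct Slot { uint32_t key = 0; bool used = false; };
        std::vector<Slot> slots = std::vector<Slot>(64);

        size_t home(uint32_t key) const { return (key * 2654435761u) & (slots.size() - 1); }
        size_t dist(size_t from, size_t to) const { return (to - from) & (slots.size() - 1); }

        // Backward-shift deletion: no tombstones, and every surviving entry
        // stays reachable by probing forward from its home slot.
        void erase_at(size_t hole) {
            slots[hole].used = false;
            for (size_t j = (hole + 1) & (slots.size() - 1); slots[j].used;
                 j = (j + 1) & (slots.size() - 1)) {
                size_t h = home(slots[j].key);
                if (dist(h, hole) < dist(h, j)) {   // moving j back keeps it on its probe chain
                    slots[hole] = slots[j];
                    slots[j].used = false;
                    hole = j;                       // the hole migrates forward
                }
            }
        }
    };
In the [A, B1, B2, X, _] example this pulls B2 back, then pulls X back only if it hashes to the second or third slot, which matches the analysis above.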
barrkel
Yup, the chains can be interleaved. Often it's best to save the hash of each key as well as the key itself. Comparing the full hash (rather than the modulus of the hash) will eliminate many keys faster than doing a full key comparison, and having the full hash available means recreation on expansion is cheap, as all the keys don't need rehashing. The full hashes can then be used to distinguish between different chains if you decide to do backfilling, again cheaper than recalculating key hashes.

Of course, having the hashes available also speeds up recreation, should the tombstone approach be used.

Basically, keep the full hashes around :)

rurban
Certainly not with open addressing, as it would destroy all the cache-line advantages. With separate chaining it's very common, especially for resizing.
barrkel
The key is usually indirected, which means a pointer, which is usually bigger than the hash code.

(And of course you could use parallel arrays if you're super concerned about cache lines, though the tradeoffs would very much depend on whether hash misses or hits are the expected mode.)

senderista
And if it’s not a pointer but a word-size integer, just use an invertible hash function (aka permutation aka block cipher) so you can store the hash code in place of the key :)
mjevans
For back-shifting, I'd prefer to actually seek for the tail item in the list (just before you prove the key doesn't exist) and move that BACK to the evicted slot, rather than update all of the keys.

However the tombstone idea fits better with minimizing mutations and improving the potential for concurrency if the design only cares that inserts/deletes are atomic (not that they're propagated in parallel instantly).

For the 'move back' idea to be that safe I'd still want to use a tombstone value, but it would need to also include a pointer to the moved location. The list of tombstones would need to be submitted (as in, message queue) to a kind of garbage collection thread that sat on the messages for a pre-determined validity time and then did a scrub of the key/value to make sure the compaction still made sense. A shorter interval might also allow for that deferred compaction to be replaced by a new entry instead.

I don't like any of that as a generic solution though, as the trade-offs with each bit of extra effort and code complexity seem likely to be premature optimization and sources of potential bugs when not considered in a specific problem domain.

usefulcat
The article mentions a couple of options, one of which is tombstones. In a chaining implementation, the load factor is straightforward: num_items / table_size. Adding items increases the load factor and deleting items reduces it.

With open addressing, deletions do not decrease the load factor (at least not immediately, in general), because a deleted item becomes a tombstone. So for open addressing, load factor is (num_items + num_tombstones) / table_size.

The upshot is that in a scenario where items are being repeatedly added and removed, over time the load factor will gradually increase until eventually the table has to be recreated. This is true even if the average number of items remains relatively constant over time, and is unlike a chaining implementation.

In this scenario, the average insertion time might be lower with open addressing, but the worst case will be much worse than for chaining.

None
None
a1369209993
You can eliminate tombstones over time by:

  # any time a tombstone immediately precedes an empty, it can be marked empty
  [ $ _ ] -> [ _ _ ]
  # any time you look up a key, it can be swapped with a tombstone immediately preceding it
  [ $ A ] -> [ A $ ]
  # (moving the tombstone closer to an empty that will destroy it)
  # if you don't have iterators, you can also jump over other keys
  [ $ B A ] -> [ A B $ ]
  [ $ B C A ] -> [ A B C $ ]  # etc
  # (this will cause an iterator at A to yield B again, or at B to skip A)
How well this keeps the load factor down depends on how aggressively you look for tombstone disposal opportunities, but it does keep it down.
tptacek
Mutating the table on lookup seems pretty gross, though.
dhash
Eh, it's the classic amortized approach. Whenever you can "touch" the data and you're right there already due to a lookup, it makes sense to amortize your data structure housekeeping IMO.

TBH, the right answer always depends on the user's use case (amortization and housekeeping really help with purely functional data structures) and benchmark data.

a1369209993
Well, it still works (just slower) if you only do fixups during {table[key] = val} operations. But honestly, if you're using a probabilistic data structure like a hash table, the ship has already sailed on gross implementation details.
rurban
No, that's the common approach with chained lists: Move found list item to first.
btym
I would argue that "the table" is not mutated, only the internal state of its implementation. Every time you access any information, a cache at some layer below you is updated. Is that also gross?
cperciva
Yes. Normally you can have one thread writing to a data structure OR many threads reading the data structure at any given time and not need to worry about them causing problems. (This situation is common enough that we have things called "reader-writer mutexes" or "shared-exclusive" mutexes.)

As soon as your reads can modify the internal state of the data structure, it might modify the state in a way which trips up another read; so you can no longer have many threads reading the data structure at once.

shittyadmin
But you don't need to write every time, only on occasion, so you can actually use a read write lock and in the nominal case many threads can read just fine.

That said, it's probably still better to avoid this unless it's absolutely necessary to modify the underlying structure sometimes, I recently had to do this for an LRU cache.

pcwalton
Right. And in Rust implementing the hash table that way will suddenly make the table no longer flagged as "Sync" by the compiler, so you will be unable to share it between threads.
btym
That's a great point. It's still a possible optimization with a compare-and-swap or if you can determine that you're in a single threaded context.
None
None
hinkley
Cliff Click (of JVM fame) did a presentation once on an open addressing hash table implementation that he believed was lock free (it used CAS and possibly some memory barriers). I wouldn't want to argue with Cliff Click on concurrency but everybody makes mistakes. I haven't checked back if anyone found any bugs yet.

I'm also curious if or how that implementation would translate to other languages. What he did might not be declarable in the Rust ownership model.

Dec 09, 2018 · 3 points, 1 comments · submitted by Mauricio_
karmakaze
The talk is about SwissTable at Google and also covers dense_hash_set/map. Another recent related article is about hashbrown [1], a port of SwissTable.

[1] https://news.ycombinator.com/item?id=18630563

I believe the design is similar because Google presented the SwissTable design more than a year ago in CppCon [1].

If I'm reading the link you sent correctly, then there was an initial bias in the benchmarks that favored F14, but after adjusting for that, the results are similar (not surprising since the basic idea is the same.)

[1] https://www.youtube.com/watch?v=ncHmEUmJZf4

Google's hash map [1], which is a fresh look at a heavily studied problem.

[1] https://www.youtube.com/watch?v=ncHmEUmJZf4

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.