HN Books @HNBooksMonth

The best books of Hacker News.

Hacker News Comments on
Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition (The Morgan Kaufmann Series in Multimedia Information and Systems)

Ian H. Witten, Alistair Moffat, Timothy C. Bell · 10 HN comments
HN Books has aggregated all Hacker News stories and comments that mention "Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition (The Morgan Kaufmann Series in Multimedia Information and Systems)" by Ian H. Witten, Alistair Moffat, Timothy C. Bell.
View on Amazon [↗]
HN Books may receive an affiliate commission when you make purchases on sites after clicking through links on this page.
Amazon Summary
In this fully updated second edition of the highly acclaimed Managing Gigabytes, authors Witten, Moffat, and Bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. Whatever your field, if you work with large quantities of information, this book is essential reading--an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. It covers the latest developments in compression and indexing and their application on the Web and in digital libraries. It also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the Web.
HN Books Rankings
  • Ranked #14 this year (2024)

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this book.
You had to know things and keep them in your head, and have a pile of textbooks otherwise! Plus no normal app had access to gigabytes of storage, and indeed that remained a vast amount for many more years[1]. Skills learnt then in terms of memory and CPU efficiency are still valuable now, though your typical smartphone is more powerful than the supercomputer replacements I used to look after for an oil company...

[1] https://www.amazon.co.uk/Managing-Gigabytes-Compressing-Inde... (1999)

If by a "search engine" you mean a tool to index and retrieve documents (essentially what the terms mean in the traditional Information Retrieval, e.g Lucene, SOLR, Elastic) then this is a pretty good on the subject that taught me a lot:

https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...

I don't even know if anybody has written a book specifically about search at "web scale" (no MongoDB jokes here, please). But the closest things I know of would be something like:

https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...

https://www.amazon.com/Information-Retrieval-Implementing-Ev...

https://www.amazon.com/Introduction-Information-Retrieval-Ch...

This book has a nice treatment of that kind of compression:

https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...

For instance you might keep track of facts like

   the word "the" is contained in document 1
   the word "john" is contained in document 1
   the word "the" is contained in document 2
   ...
   the word "john" is contained in document 12
and you code the gaps; the word "the" appears in every document, so its gap is always 1, but the gap for "john" is 11. With a variable-sized encoding you use fewer bits for smaller gaps -- with that kind of encoding you don't have to make "the" a stopword, because you can afford to encode all the postings.
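To make that concrete, here is a minimal Python sketch (not code from the book; the document IDs are made up) of gap-encoding a postings list with a variable-byte code, one of several codes the book covers:

    # Gap-encode a postings list, then pack each gap with a
    # variable-byte code: 7 payload bits per byte, with the high
    # bit marking the final byte of each gap.
    def encode_postings(doc_ids):
        out = bytearray()
        prev = 0
        for doc_id in doc_ids:
            gap = doc_id - prev  # stays small for frequent words
            prev = doc_id
            chunks = []
            while True:
                chunks.append(gap & 0x7F)
                gap >>= 7
                if gap == 0:
                    break
            chunks.reverse()
            chunks[-1] |= 0x80  # flag the last byte of this gap
            out.extend(chunks)
        return bytes(out)

    # "the" in every document means gaps of 1, one byte each;
    # "john" in documents 1 and 12 means gaps of 1 and 11.
    print(encode_postings([1, 2, 3, 4, 5]).hex())  # 8181818181
    print(encode_postings([1, 12]).hex())          # 818b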
ot
That book is really dated (1994), but unfortunately there aren't many better options.

It is unfortunate because the understanding of the field has improved dramatically over the past decades, and explanations from a more modern perspective would benefit even novices.

tgv
The Standard MIDI file spec had something similar back in the '80s. It codes the gaps between successive events with increasingly longer byte sequences. Never mind that it wastes a whole bit per byte to do that.
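For reference, the MIDI-style variable-length quantity looks roughly like this in Python (the "wasted" bit is the continuation flag carried in every byte):

    # MIDI-style variable-length quantity: 7 data bits per byte,
    # high bit set on every byte except the last.
    def midi_vlq(delta):
        chunks = [delta & 0x7F]  # final byte, high bit clear
        delta >>= 7
        while delta:
            chunks.append((delta & 0x7F) | 0x80)  # continuation bit
            delta >>= 7
        return bytes(reversed(chunks))

    print(midi_vlq(0x40).hex())    # 40   -> one byte
    print(midi_vlq(0x2000).hex())  # c000 -> two bytes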
atombender
That book is great, and relies on a combination of Golomb-Rice, delta coding, and arithmetic coding. I wonder how this compares with the OP's compression.
PaulHoule
... it doesn't just "rely on" that stuff, but it explains the fundamentals of them really well!
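For intuition, here is a tiny sketch (not the book's code) of the Rice special case of Golomb coding, where the parameter is a power of two, M = 2**k:

    # Rice code (Golomb with M = 2**k): quotient in unary,
    # remainder in k binary bits. Small gaps get short codes,
    # so a well-chosen k fits the gap distribution.
    def rice_encode(n, k):
        q, r = n >> k, n & ((1 << k) - 1)
        return '1' * q + '0' + format(r, f'0{k}b')

    # The gap of 11 from the "john" example above, with k = 2:
    print(rice_encode(11, 2))  # '11011' -- five bits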
Separate out the concepts of "search infrastructure" (how documents and posting lists are stored in terms of bits on disk & RAM) and "ranking functions" (how queries are matched to documents).

The former is basically a solved problem. Lucene/ElasticSearch and Google are using basically the same techniques, and you can read about them in Managing Gigabytes [1], which was first published over 2 decades ago. Google may be a generation or so ahead - they were working on a new system to take full advantage of SSDs (which turn out to be very good for search, because it's a very read-heavy workload) when I left, and I don't really know the details of it. But ElasticSearch is a perfectly adequate retrieval system, and it does basically the same stuff that Google's systems did circa 2013, and even does some stuff better than Google.

The real interesting work in search is in ranking functions, and this is where nobody comes close to Google. Some of this, as other commenters note, is because Google has more data than anyone else. Some of it is just because there've been more man-hours poured into it. IMHO, it's pretty doubtful that an open-source project could attract that sort of focused knowledge-work (trust me; it's pretty laborious) when Google will pay half a mil per year for skilled information-retrieval Ph.Ds.

[1] https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...

ot
> The former is basically a solved problem.

That's a bit of a stretch :) The high-level architecture is quite mature and stable, but there's still a lot of research, both in academia and industry, on the data structures to represent indexes, on query execution (see all the work on top-k retrieval), and distributed search systems (for example query-dependent load balancing, novel sharding methods).

markpapadakis
It's true that different index codecs are being designed with different tradeoffs (index size vs. postings list access time and cost), but the work is very incremental IMHO, with no huge advances to speak of. Also, can you talk about the top-k challenges you mentioned? Aren't priority queues optimal enough?
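A priority queue is indeed the textbook answer; a minimal sketch is below. My understanding is that the top-k research mentioned above (WAND-style dynamic pruning and friends) is about skipping low-scoring documents so that most candidates never reach the heap at all, not about the queue itself:

    import heapq

    # Baseline top-k: score every candidate and keep the k best in
    # a min-heap, so the weakest of the current top k is evicted.
    def top_k(scored_docs, k):
        heap = []  # (score, doc_id) pairs, size capped at k
        for doc_id, score in scored_docs:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
        return sorted(heap, reverse=True)

    print(top_k([(1, 0.3), (2, 0.9), (3, 0.5), (4, 0.7)], 2))
    # [(0.9, 2), (0.7, 4)]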
That name sounds very familiar, as does the feature set. Managing Gigabytes[1], or "mg", was the output of University of Melbourne and RMIT research in the 1990s. It went on to be commercialized as SIM and later TeraText[2], and has largely disappeared into the government intelligence indexing and consulting-heavy systems space (where it is now presumably being trounced by Palantir).

[1] https://www.amazon.com/Managing-Gigabytes-Compressing-Indexi... - Note review from Peter Norvig!

[2] http://www.teratext.com/

timb07
That's exactly what I thought - I worked on index construction for MG back in 1994. (Note, although my name is Tim Bell, I'm not Timothy C. Bell, the coauthor of "Managing Gigabytes".)
If you enjoy reading articles about the rediscovery of indexing large amounts of read-only data, I'd highly recommend reading this book, which is a treasure trove about this kind of work:

http://www.amazon.com/Managing-Gigabytes-Compressing-Multime...

I recommend reading 'Managing Gigabytes' by Witten, Moffat and Bell:

http://www.amazon.com/Managing-Gigabytes-Compressing-Multime...

You can take a look at Managing Gigabytes (http://www.amazon.com/Managing-Gigabytes-Compressing-Multime...)

It is a nice book, but it might be a little bit outdated.

I always have to plug Managing Gigabytes whenever a discussion of computer books comes up. Great reference for anyone dealing with searching or compressing large amounts of information:

http://www.amazon.com/Managing-Gigabytes-Compressing-Multime...

HN Books is an independent project and is not operated by Y Combinator or Amazon.com.