HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
The NOSQL Store that Everyone Ignored

Postgres Conference · YouTube · 97 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention Postgres Conference's video "The NOSQL Store that Everyone Ignored".
YouTube Summary
The talk is based on my original article that I wrote back in 2011 and followed up with a FriendFeed case study. We will walk through the history of HSTORE and how we can now use JSONB support in PostgreSQL, and discuss what makes it enticing and comparable to NoSQL stores like MongoDB or CouchDB.

We will look at how a FriendFeed clone could have been designed today using a basic document store (JSON documents), build indexes on deeply nested fields for our various queries, and use partitioning features to distribute the data. The best part about all this is that you get all of the schema-free storage/lookup benefits, with solid ACID guarantees that other NoSQL stores can't match.
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Sep 03, 2021 · 97 points, 42 comments · submitted by zdw
bob1029
I noticed a lot of people noting that they enjoy using some hybrid of JSON w/ their SQL. I do the same thing and I think this is an incredibly productive way to manage fast-changing things.

One trick I learned along the way is to apply gzip to your JSON columns. For us, we got over a 70% reduction in column size and query performance actually increased. I presume this is because it is faster to read fewer IO blocks than more, especially when doing table scans (which you shouldn't do, but they happen).

The only caveat with this is if you want to query against something in the JSON, but I think this is where the balancing act comes into play regarding what is in (compressed) JSON and what is in a column.
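A minimal sketch of that split, assuming Postgres and application-side gzip into a bytea column (table and column names are illustrative):

    -- Queryable fields live in plain columns; the rest is gzipped JSON written by the app
    CREATE TABLE events (
        id          bigserial PRIMARY KEY,
        user_id     bigint NOT NULL,
        created_at  timestamptz NOT NULL DEFAULT now(),
        payload_gz  bytea NOT NULL   -- gzip(JSON), compressed/decompressed in the application
    );

Anything you need to filter or index on has to stay outside the compressed blob.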

NicoJuicy
In .NET you can use Marten DB.

You can mark queryable properties, which are then duplicated as normal db columns. It works quite nicely, actually.

fabian2k
In Postgres you would generally use JSONB columns; those are stored via TOAST and are automatically compressed. That compression isn't very good, but there is some work on adding other compression algorithms there.
tpetry
PostgreSQL 14 will add lz4 compression for TOAST data. But if I remember correctly, the first ~700 bytes are always stored inline and only the rest is compressed. I could be wrong on this, though.
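For reference, a sketch of how that compression setting is chosen in PostgreSQL 14 (table and column names assumed):

    -- PostgreSQL 14: use lz4 for TOAST compression of a JSONB column
    ALTER TABLE docs ALTER COLUMN doc SET COMPRESSION lz4;

    -- or set the default for newly created columns
    SET default_toast_compression = 'lz4';
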
darkhorse13
The performance of JSONB is also surprisingly great. I played around with around 1M rows of data for a project using JSONB, and compared to traditional SQL, the complex queries were 500-800ms slower, which was totally reasonable for the use case.
eyan
Great data point! Thank you!
sgt
Been doing this for years now and it has proven itself to be a great solution. We use a hybrid approach: traditional columns plus an entity/content-type JSONB column.
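A minimal sketch of such a hybrid table, assuming Postgres; the entity-type discriminator and all names are illustrative:

    CREATE TABLE entities (
        id           bigserial PRIMARY KEY,
        entity_type  text NOT NULL,          -- discriminator for the kind of document
        created_at   timestamptz NOT NULL DEFAULT now(),
        content      jsonb NOT NULL          -- free-form, per-entity-type payload
    );
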
mateuszf
Same here. Have only good experiences with this setup.
hodgesrm
JSON blob + columns is the recommended approach for handling semi-structured data in ClickHouse. It's easy to add columns on the fly, since you just give the expression to haul out the data using a DEFAULT clause. ClickHouse applies it automatically on older blocks without rewriting them. For new blocks it materializes the data as rows are inserted.
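A sketch of what that looks like in ClickHouse, with assumed table and column names:

    -- Raw JSON lands in a String column; extracted fields are added later as needed
    ALTER TABLE events
        ADD COLUMN user_id String DEFAULT JSONExtractString(raw, 'user_id');
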
the_gipsy
A colleague put some data into a JSON column and I absolutely loathed it. You can't query it well, and he didn't type it in the models in the code either.

However, later I found that there are two kinds of JSON: one “bad” kind that we had to patch up with lots of validation against some spec that emerged, and another that was absolutely right to be free-form, because it was never accessed by us. The user simply inputs it freely and reads it out somewhere else, with freedom to evolve it as desired.

mlk
Clob
the_gipsy
Aha, I only knew them as blobs or JSON
bonzini
Yeah, I have a similar experience from a project that had moved from MongoDB to Postgres. In the conversion they left a free form JSON field with information filled in by plugins, which was messy because it was impossible to query efficiently. Plugins also had some code to format the free form data in HTML, which was suffering from code duplication but provided some ideas to build a common schema for a new table and for the corresponding REST API endpoints. That worked great but we still left a free form JSON field in the new table, and also kept the JSON field in the original table to store the configuration for the plugin.

In our case the "user" that just threw JSON data in and read it out was the plugins. You just have to strike the right balance between core parts of the schema, expressed as tables and relations, and auxiliary data that can use JSON.

jf22
There is a .NET project that uses this feature to create a NoSQL db experience. I've been using it for a few years and it's been great.

https://martendb.io/

NicoJuicy
Came here to say that too.

Marten DB is really good!

nomilk
As someone who does indie Ruby stuff, often under time constraints, I find jsonb great for storing messy data that looks unnecessary, but which might be useful down the track. jsonb is just another type as far as postgres/rails is concerned, so it's fast to implement (just like any other Rails migration), fast to query, and easy to access using key/value (similar to a Ruby hash).
jrochkind1
If you are using Rails and jsonb, you may be interested in https://github.com/jrochkind/attr_json (by myself), or https://github.com/madeintandem/jsonb_accessor (an alternative).
bullfightonmars
We wrote a similar library that hooks into ActiveRecord. https://github.com/data-axle/superstore
jrochkind1
Interesting, thanks, I hadn't seen this one before! So many of us inventing the same wheel! The docs on yours are kind of limited; I don't really understand how it maps to your Postgres tables or what schema it's expecting. But I plan to find time to check it out just to compare and contrast with mine! Thanks!
bullfightonmars
Yeah. We built this for our own use and haven't really documented it extensively. We use it in conjunction with an Elasticsearch ORM.

https://github.com/data-axle/elastic_record

jrochkind1
Again, interesting, thanks!

I hadn't seen that ElasticSearch ORM either! The more popular one of course is: https://github.com/elastic/elasticsearch-rails

It can be hard to know what open source is out there!

paulryanrogers
For speculative storage like that my preference is for files, ideally on something like S3. It's harder to analyze but IME the need is infrequent. Storage is usually significantly cheaper than a DB.
stevesimmons
Does anyone know of performance (and usability) comparisons between say Postgres JSONB and Azure CosmosDB?
fisf
We use Azure SQL (basically MSSQL) to do something similar -- this might be a more apt comparison.

We deal with largish amounts of JSON event data and unpack specific fields to computed columns. This allows us to use clustered indexes, standard analytical tools, etc., while leaving the raw data as it is.

Performance is mostly fine for analytics (time series aggregation and such). It is also a very flexible approach.
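A sketch of that pattern in T-SQL, with made-up names:

    -- Unpack a field from the raw JSON into a computed column, then index it
    ALTER TABLE events ADD user_id AS JSON_VALUE(payload, '$.userId');
    CREATE INDEX ix_events_user_id ON events (user_id);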

jf22
Those are two completely different tools with different use cases.
exceptione
Well, tell that to the MS shops... I think it is a very good question; I would love to see some benchmarks.
stevesimmons
To expand on my particular use case for Azure CosmosDB, I have a Python web app built with FastAPI. I find it super convenient to dump Pydantic data models of our primary domain objects to JSON and save them into Cosmos, tagged with the model name.

Then on rereading the Cosmos records, my model registry automatically imports the required Pydantic model and reconstitutes the live Python object.

debug-desperado
Had a great win this last week with JSONB. Basically I needed an aggregated materialized view for performance but still needed access to the non-aggregated row data. Think aggregate by account ID, but the query may access any given month of data for that account. It’s cheating a bit to use a non-relational structure, but the correlated subqueries I had to use before were 20x slower.
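A hedged reconstruction of that idea in Postgres, aggregating per account and month while keeping the raw rows as JSONB (all table, column, and view names are assumptions):

    CREATE MATERIALIZED VIEW account_monthly AS
    SELECT account_id,
           date_trunc('month', created_at) AS month,
           sum(amount)                     AS total_amount,
           jsonb_agg(to_jsonb(o))          AS raw_rows   -- keep the non-aggregated rows alongside the aggregate
    FROM orders o
    GROUP BY account_id, date_trunc('month', created_at);
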
lysecret
I don't know why anyone would use, say, MongoDB when Postgres JSON exists.
Izkata
A terrifying example from work:

"Hey cool, mongodb just accepts whatever you give it! We don't need to worry about writing database migrations!"

...Luckily we convinced them this was a bad idea and ended up using postgres with a handful of json columns instead. Up until that point their only experience had been with mysql so they had no idea json columns existed.

paulryanrogers
Does PostgreSQL have standard operators yet? Last I checked they only had the Pg specific ones, which predate the SQL standard. (Same for SQL/PSM sadly.)
mdaniel
Do you have a link to the standard operators you mean? I know they support [SQL/JSON path function](https://www.postgresql.org/docs/12/functions-json.html#FUNCT...) but as for the operators, I only know of their own which predate the standard
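For reference, the SQL/JSON path support that does exist (since PostgreSQL 12) looks like this; the table and column names here are made up:

    -- SQL/JSON path query: items costing more than 100
    SELECT jsonb_path_query(doc, '$.items[*] ? (@.price > 100)')
    FROM docs;
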
simonw
PostgreSQL 14 is currently in beta and adds JSON subscripting: https://www.postgresql.org/docs/14/datatype-json.html#JSONB-...
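A quick sketch of the new subscripting syntax (table and column names assumed):

    -- PostgreSQL 14 jsonb subscripting
    SELECT doc['address']['city'] FROM docs;
    UPDATE docs SET doc['tags'] = '["postgres", "jsonb"]';
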
fabian2k
I find this approach incredibly useful if any parts of your data don't have a uniform schema.

The one thing you need to look out for is developers putting stuff into JSON that belongs in proper columns. The Postgres JSON support is very powerful, but plain old relational queries are faster and often much easier to write.

tda
I have created an app with such a hybrid approach, as part of the model was very dynamic. The idea was that I would slowly migrate the stable bits to proper columns, but in practice that never happened. Partly that's also because it is not really broken, and the performance is actually quite good. You can add partial indexes etc. to improve things where needed.

But writing analytical queries is definitely more complicated, so the cost of not having a "normal" data model adds up over time. To allow business analysts to work with the data, I ended up creating a bunch of views on the data without JSONB fields as a quick fix. So even though the approach worked, I am not sure I would do it again.
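A minimal sketch of such an analyst-facing view, assuming Postgres and made-up field names:

    -- Flatten commonly used JSONB fields into plain columns for analysts
    CREATE VIEW orders_flat AS
    SELECT id,
           created_at,
           data->>'status'            AS status,
           (data->>'total')::numeric  AS total
    FROM orders;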

kroolik
What you did, in fact, was create read models specific to each use case. This is completely normal and allows you to optimize for different users: the app's schema can be different from the one for analytics.
fabian2k
I'm using the JSON mostly for parts that really have no fixed schema, e.g. fields that vary by customer. The alternative would be something like EAV, and JSON columns are far superior here.

Performance is perfectly reasonable with JSON, but there are many more ways to make it slow if you're not careful. Postgres has to read the entire JSON content even if you only access a small part of it; that alone can kill performance if your JSON blobs are large. The lack of true statistics can also be an issue with JSON columns. They're still easily fast enough for many purposes; they just have a few more caveats than plain columns that you should be aware of.

There are some other aspects that are not that intuitive, but also not a big deal once you know them. For example, you can use a jsonb_path_ops GIN index to speed up queries for arbitrary keys and values inside your JSONB column, but you need to use the @> operator so that the index can be used. If you write queries like WHERE myjson->>'foo' = 'bar', it can't use the index.
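Concretely, a sketch of that index and the two query shapes (table and column names are illustrative):

    CREATE INDEX idx_docs_gin ON docs USING gin (myjson jsonb_path_ops);

    -- Containment query: can use the GIN index
    SELECT * FROM docs WHERE myjson @> '{"foo": "bar"}';

    -- Extract-and-compare: cannot use that index
    SELECT * FROM docs WHERE myjson->>'foo' = 'bar';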

jrochkind1
How would you describe what "belongs" in a proper column? Not just anything that doesn't have a uniform schema?
bob1029
For us, a "proper" column is usually something we want to index against. Things like timestamps, user ids, etc. Anything that would be useful for top-level management of whatever has the json tacked onto it.
conradev
You could pull those out and index them using generated columns, though.

That can make things a lot easier on the application side, where the code just needs to insert the JSON blob.
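A sketch of the generated-column approach (PostgreSQL 12+, names assumed):

    -- Pull a field out of the JSON into an indexed generated column
    ALTER TABLE events
        ADD COLUMN user_id text GENERATED ALWAYS AS (payload->>'user_id') STORED;
    CREATE INDEX idx_events_user_id ON events (user_id);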

jrochkind1
You don't even have to "pull those out and index them using generated columns" to index with Postgres jsonb. Although it's a bit more confusing to figure out the proper index and make sure your queries hit it, jsonb data is indexable on its own.
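For instance, an expression index directly on the JSONB (reusing the same illustrative names as above) lets an ->> comparison use an index without a separate column:

    -- Expression index on a jsonb field
    CREATE INDEX idx_events_payload_user ON events ((payload->>'user_id'));

    -- This query can now use the index
    SELECT * FROM events WHERE payload->>'user_id' = '42';
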
megous
Also for leftovers when scraping website data. Trendy SPAs usually have JSON endpoints, so I use some ID column, maybe a timestamp column (to ease incremental syncing later on), some primary content columns, and the whole scraped JSON object goes into a `data JSONB` column in case some of it is needed in the future. :) Works amazingly.
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.