Hacker News Comments on "Tree-sitter – a new parsing system for programming tools" thestrangeloop.com Video


thestrangeloop.com · 82 HN points · 0 HN comments

HN Theater has aggregated all Hacker News stories and comments that mention thestrangeloop.com's video "Tree-sitter – a new parsing system for programming tools".


thestrangeloop.com Summary

Strange Loop (Sept 12-14, 2019 - St. Louis) is a conference for software developers covering programming langs, databases, distributed systems, security, machine learning, creativity, and more!


Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.

Tree-sitter – a new parsing system for programming tools [video]


Oct 14, 2018 · 82 points, 25 comments · submitted by matt_d

⬐ DannyBee
It's interesting that this appears based on Wagner and Graham's papers/work.
For years that work was the state of the art for incremental parsing but, imho, it languished because all the implementations were closed[1]. A bunch of it was eventually open sourced.
[1] ensemble, pan, etc.

⬐ transreal
The project website is here: http://tree-sitter.github.io/tree-sitter/, and the GitHub repo is here: https://github.com/tree-sitter/tree-sitter.

⬐ mncharity
Here's the js grammar[1].
Despite the worrying talk of tokenization, it looks scannerless. Yay.
One big grammar. JSX is blended in. And ecmascript versions aren't segregated. Which I didn't expect... but the motivation wasn't writing picky compilers.
[1] https://github.com/tree-sitter/tree-sitter-javascript/blob/m...

⬐ maxbrunsfeld
It's not actually implemented as a scannerless parser, but the grammar API mostly abstracts away the parser/lexer distinction.
We have to handle JSX and all language versions combined because for our use case, users need to be able to open up `.js` files and have them Just Work.
It's a similar story with Python and Ruby - we have to handle the union of all language versions.

⬐ mncharity
> open up `.js` files and have them Just Work
Nod. A single big grammar just wasn't what I'd thoughtlessly expected. When I did something similar, there was a gaggle of small grammars slicing up javascript aspects and history, then variously composed into assorted language grammar variants. But part of the motivation for that was sharing grammar fragments across languages, and "parse exactly language version X" in support of parser validation and compiler use. So a bit different use cases.

⬐ mncharity
> It's not actually implemented as a scannerless parser, but the grammar API mostly abstracts away the parser/lexer distinction.
Any aspects of the 'not' part of 'mostly' which one should bear in mind?

⬐ maxbrunsfeld
Yeah, the grammar author does fully control the division between the parser and lexer: every literal (string or regex) in the grammar corresponds to a token. There's also a `token()` function that you can use to specify that an arbitrary rule should be handled by the lexer as a single token.
In most cases, you don't have to think about it; the obvious way to write something is the right way. There are cases where it's helpful to have a mental model of how lexing works.
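
To make the parser/lexer split described above a little more concrete, here is a minimal Python sketch that walks a grammar in JSON form and lists the pieces that would end up as lexer tokens: every string literal, every regex, and every subtree wrapped in `token()`. It assumes the node shapes found in generated `grammar.json` files (`STRING`, `PATTERN`, `TOKEN`, `SEQ`, `CHOICE`, `REPEAT`, `SYMBOL`). The `lexer_tokens` helper and the example rule are hypothetical illustrations, not part of Tree-sitter.

```python
from typing import Any, Dict, List

Rule = Dict[str, Any]


def lexer_tokens(rule: Rule) -> List[str]:
    """Describe every subtree of `rule` that the lexer would own."""
    kind = rule["type"]
    if kind == "STRING":
        return [f"string {rule['value']!r}"]
    if kind == "PATTERN":
        return [f"regex /{rule['value']}/"]
    if kind == "TOKEN":
        # token(...) turns an arbitrary rule into a single token,
        # so the walk does not descend into its content.
        return ["token(...)"]
    if kind in ("SEQ", "CHOICE"):
        found: List[str] = []
        for member in rule["members"]:
            found.extend(lexer_tokens(member))
        return found
    if kind in ("REPEAT", "REPEAT1"):
        return lexer_tokens(rule["content"])
    # SYMBOL (a reference to another rule) and anything else
    # contribute no token of their own in this sketch.
    return []


# Hypothetical rule: "let" /[a-z]+/ "=" token(seq("0x", /[0-9a-f]+/))
example: Rule = {
    "type": "SEQ",
    "members": [
        {"type": "STRING", "value": "let"},
        {"type": "PATTERN", "value": "[a-z]+"},
        {"type": "STRING", "value": "="},
        {
            "type": "TOKEN",
            "content": {
                "type": "SEQ",
                "members": [
                    {"type": "STRING", "value": "0x"},
                    {"type": "PATTERN", "value": "[0-9a-f]+"},
                ],
            },
        },
    ],
}

tokens = lexer_tokens(example)
```

Note that the `"0x"` literal and the hex regex inside `token(...)` are not reported separately: in this model the wrapper makes the whole sequence one lexer-level unit, which is the behaviour the comment above describes.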

⬐ mncharity
> grammar author does fully control the division between the parser and lexer
Nifty. So explicit GLR forks at rule level (`conflicts:`), non-forking token "conflict" resolution[1], no token regex backtracking pressure across tokens, and token resolution at a single code position can(?) differ across conflicting rules? I'm uncertain on that last bit.
> In most cases, you don't have to think about it
Well yes, but, some of us crave much more syntactically flexible languages. :)
[1] https://tree-sitter.github.io/tree-sitter/creating-parsers#c...

⬐ specialp
The Strange Loop videos are in the process of being uploaded still. There are a lot that deserve a look. I was there at this talk as well. This is one of the most diverse and high level conferences out there.
https://www.youtube.com/channel/UC_QIfHvN9auy2CoOdSfMWDw/vid...

⬐ lioeters
Fascinating work!
This talk is a part of an archive not accessible from the site's menus: https://thestrangeloop.com/2018/sessions.html

⬐ crb002
TIL Tim Wagner did a thesis on IDEs before the serverless thing. The man has a passion for developer user experience.

⬐ Avi-D-coder
This is great. Has anyone used Tree-sitter in another editor? How does the performance of Atom compare to VSCode these days?

⬐ dwenzek
Pretty cool! I was impressed by the way Max refactors a bunch of code using extend selection (at 16:45).

⬐ habitue
Ok, now I just need to get this working in emacs.

⬐ maxbrunsfeld
Hey, author of Tree-sitter here. Thanks for sharing this! I'd be happy to answer any questions people have about the project.

⬐ mncharity
Is the JSON representation of a grammar intended as a public interface?
I noticed [1] doesn't have a version field. "No version: means version:1" is ok, but I then wondered.
[1] https://github.com/tree-sitter/tree-sitter-json/blob/master/...

⬐ maxbrunsfeld
Good question.
At one time, I had the thought that folks who disliked JavaScript might want to write their own grammar DSL in another language, which could generate compatible JSON to be consumed by the core Tree-sitter compiler library. Now, I think this flexibility is pretty unnecessary.
I've also thought that some applications might want to inspect the grammar JSON at runtime in order to do some kind of meta-programming. I haven't really thought that through though.
In either case, versioning would probably be a good idea. The actual parser is already versioned so that we can error out if you try to use a parser with an incompatible version of the runtime. We could just add that same version number into the grammar.
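
As a rough illustration of what such a check could look like, here is a small Python sketch of the "no version means version 1" convention combined with an error on incompatible versions. The `version` field, the supported range, and every name below are assumptions made for the example; the grammar JSON discussed in this thread has no such field.

```python
from typing import Any, Dict

# Hypothetical: the grammar versions this imaginary consumer understands.
SUPPORTED_GRAMMAR_VERSIONS = range(1, 2)  # version 1 only


class IncompatibleGrammarError(ValueError):
    """Raised when a grammar declares a version this consumer cannot read."""


def grammar_version(grammar: Dict[str, Any]) -> int:
    # A missing field is treated as version 1, so grammars written
    # before the field existed keep working.
    return int(grammar.get("version", 1))


def check_grammar(grammar: Dict[str, Any]) -> int:
    """Return the grammar's version, or raise if it is unsupported."""
    version = grammar_version(grammar)
    if version not in SUPPORTED_GRAMMAR_VERSIONS:
        raise IncompatibleGrammarError(
            f"grammar {grammar.get('name', '<unnamed>')!r} declares "
            f"version {version}, which is not supported"
        )
    return version
```

Failing early with a clear message mirrors the behaviour described above for parser and runtime versions, where a mismatch is reported instead of producing confusing results later.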

⬐ mncharity
> some kind of meta-programming
Is usually what I end up doing. But there's already the npm package versioning to capture api versions. So one question is whether grammar json is likely to be moved around in space or time in ways for which that's insufficient. One story could be moving generated grammars among different programming language tree-sitter implementations, which can be out of sync, and where there's no out-of-band information on how to deal with it. Another story could be grammar json being captured in repo, as with tree-sitter-json, while avoiding "be careful updating the version of tree-sitter you use". Generalizing that, any other cases where you don't want to generate the json at runtime, and thus it gets stored longer-term.

⬐ mncharity
> versioning
One possible approach might be "part of spec" but "optional" and "not actively in use"? So present in schema and tests, to somewhat reduce the chance of getting locked into not having it by hypothetical others failing to support it. And easily available, so absence doesn't discourage any hypothetical interest. But without spending much time on it, absent non-hypothetical need. Maybe.
> that same version number
Hmm. Parser-runtime versioning would seem only lightly coupled with grammar-parser versioning? Unless there's a grammar-runtime coupling to express?
EDIT: Ok, I was thinking of "grammar" in too general a sense, decoupled from choice of parser tech, rather than the usual "bison grammar tightly coupled to bison parser". Failure to RTFM. Grammar's `conflicts:`, `inline:`, `word:`, `externals:` are all somewhat runtime coupled. Though perhaps looser than parser-runtime? It can be nice to flexibly move work between a compiler and its runtime, without affecting the external api.

⬐ mncharity
`externals` seem the (only) way to mix arbitrary code into the parsing, so...
Can `externals` call back into the parser? That is, is the parser reentrant?
Can `externals` be zero-length assertions? That is, is there a progress requirement for success? If not, could they serve as arbitrary-computation, parse-rejecting "semantic actions"?
Do `externals` have access to GLR parse state?

⬐ maxbrunsfeld
The Externals API is designed to only let you do things that are compatible with incremental parsing. It’s good for things like indentation based delimiters and automatic semicolon insertion.
You cannot call back into the parser; it’s just for tokenization.
You can produce zero length tokens, and we do use this feature a lot.
You don’t have access to the parse state directly, but the external scanner is passed an array of Boolean values that indicates which external tokens are expected in the current state.
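
The shape of that interface can be sketched with a small model. Real external scanners are written against Tree-sitter's C API, so the Python below is only a simplified illustration of the idea: the scanner sees an array of booleans saying which external tokens are expected, it may emit zero-length tokens (here `INDENT` and `DEDENT`), and it declines when nothing applies. The class, the token names, and the `column` parameter are all hypothetical.

```python
from enum import IntEnum
from typing import List, Optional, Sequence


class Tok(IntEnum):
    # Indices into the "expected tokens" boolean array.
    INDENT = 0
    DEDENT = 1


class IndentScanner:
    """Toy model of an indentation-tracking external scanner."""

    def __init__(self) -> None:
        # Stack of indentation columns for the currently open blocks.
        self.stack: List[int] = [0]

    def scan(self, column: int, valid: Sequence[bool]) -> Optional[Tok]:
        """Decide on a token for a line starting at `column`.

        `valid[t]` is True when the parser could accept token `t` in its
        current state. Both tokens are zero-length: no input is consumed.
        """
        top = self.stack[-1]
        if column > top and valid[Tok.INDENT]:
            self.stack.append(column)
            return Tok.INDENT
        if column < top and valid[Tok.DEDENT]:
            self.stack.pop()
            return Tok.DEDENT
        # Nothing applicable: decline so ordinary lexing can proceed.
        return None
```

In this model the scanner never inspects parser internals. The boolean array is its only view of the parse state, which matches the constraint described above.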

⬐ mncharity
Can an `externals` scanner be implemented using a second tree-sitter parser? That is, can multiple parser instances be parsing at the same time in a single thread, with no implementation global-state conflicts?
Thanks again.

⬐ phodge
First up, thanks for dedicating a big chunk of your life to building tree-sitter! I had a go making a similar parser last year (fast, dynamic-grammar, self-correcting) and had to give up when I started to realize what a ludicrously complex undertaking this is. The fact that tree-sitter works at all is nothing short of amazing.
Question: do you have any plans to integrate tree-sitter with the language-server project(s)? If fast, accurate parsing of any programming language is now easily implemented in language servers via tree-sitter, it seems to make sense for LSP to expand its protocol to include syntax highlighting as well.

⬐ mncharity
> thanks
Yes! The state of parsing is such a sad thing, discouraging so much other progress, that it's wonderful to have it improving. It's appreciated.
EDIT: And my thanks for all your guidance this evening. Good night.

⬐ maxbrunsfeld
Thanks for the kind words!
I haven't specifically pursued integration with LSP, but Tree-sitter has been used to build a couple of language servers which work with both VSCode and Atom (and probably other editors):
* Bash - https://github.com/mads-hartmann/bash-language-server
* Ruby - https://github.com/rubyide/vscode-ruby/tree/master/server
