HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Tree-sitter – a new parsing system for programming tools

thestrangeloop.com · 82 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention thestrangeloop.com's video "Tree-sitter – a new parsing system for programming tools".
thestrangeloop.com Summary
Strange Loop (Sept 12-14, 2019 - St. Louis) is a conference for software developers covering programming langs, databases, distributed systems, security, machine learning, creativity, and more!

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Oct 14, 2018 · 82 points, 25 comments · submitted by matt_d
DannyBee
It's interesting that this appears based on Wagner and Graham's papers/work.

A bunch of that was eventually open sourced. For years it was the state of the art in incremental parsing but, imho, it languished because all the implementations were closed[1]

[1] ensemble, pan, etc.

transreal
The project website is here: http://tree-sitter.github.io/tree-sitter/, and the Github repo is here: https://github.com/tree-sitter/tree-sitter.
mncharity
Here's the js grammar[1].

Despite the worrying talk of tokenization, it looks scannerless. Yay.

One big grammar. JSX is blended in. And ecmascript versions aren't segregated. Which I didn't expect... but the motivation wasn't writing picky compilers.

[1] https://github.com/tree-sitter/tree-sitter-javascript/blob/m...

maxbrunsfeld
It's not actually implemented as a scannerless parser, but the grammar API mostly abstracts away the parser/lexer distinction.

We have to handle JSX and all language versions combined because for our use case, users need to be able to open up `.js` files and have them Just Work.

It's a similar story with Python and Ruby - we have to handle the union of all language versions.

mncharity
> open up `.js` files and have them Just Work

Nod. A single big grammar just wasn't what I'd thoughtlessly expected. When I did something similar, there was a gaggle of small grammars slicing up javascript aspects and history, then variously composed into assorted language grammar variants. But part of the motivation for that was sharing grammar fragments across languages, and "parse exactly language version X" in support of parser validation and compiler use. So a bit different use cases.

mncharity
> It's not actually implemented as a scannerless parser, but the grammar API mostly abstracts away the parser/lexer distinction.

Any aspects of the 'not' part of 'mostly' which one should bear in mind?

maxbrunsfeld
Yeah, the grammar author does fully control the division between the parser and lexer: every literal (string or regex) in the grammar corresponds to a token. There's also a `token()` function that you can use to specify that an arbitrary rule should be handled by the lexer as a single token.

In most cases, you don't have to think about it; the obvious way to write something is the right way. There are cases where it's helpful to have a mental model of how lexing works.
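
As an illustrative sketch of that division (a toy model, not Tree-sitter's own code), the Python below walks a rule tree and lists what the lexer would treat as separate tokens. The node shapes (`STRING`, `PATTERN`, `SEQ`, `CHOICE`, `TOKEN`, `SYMBOL`) are modeled loosely on the grammar JSON and should be read as assumptions; `render` and `collect_tokens` are hypothetical helper names.

```python
def render(rule):
    """Flatten a rule subtree into one readable pattern string."""
    kind = rule["type"]
    if kind == "STRING":
        return rule["value"]
    if kind == "PATTERN":
        return "/" + rule["value"] + "/"
    if kind == "SEQ":
        return " ".join(render(m) for m in rule["members"])
    if kind == "CHOICE":
        return "(" + " | ".join(render(m) for m in rule["members"]) + ")"
    raise ValueError("unsupported node inside a token: " + kind)


def collect_tokens(rule):
    """List the lexer-level tokens implied by a rule tree (toy model)."""
    kind = rule["type"]
    if kind in ("STRING", "PATTERN"):
        # Every literal, string or regex, is its own token.
        return [render(rule)]
    if kind == "TOKEN":
        # A token() wrapper makes the whole subtree a single token.
        return [render(rule["content"])]
    if kind in ("SEQ", "CHOICE"):
        tokens = []
        for member in rule["members"]:
            tokens.extend(collect_tokens(member))
        return tokens
    return []  # SYMBOL (a reference to another rule) adds no token here


digits = {"type": "PATTERN", "value": r"\d+"}
dot = {"type": "STRING", "value": "."}
plain = {"type": "SEQ", "members": [digits, dot, digits]}
wrapped = {"type": "TOKEN", "content": plain}

print(collect_tokens(plain))    # three separate tokens
print(collect_tokens(wrapped))  # one combined token
```

In this model the unwrapped decimal rule leaves the parser to assemble three tokens, while the wrapped form hands the lexer one.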

mncharity
> grammar author does fully control the division between the parser and lexer

Nifty. So explicit GLR forks at rule level (`conflicts:`), non-forking token "conflict" resolution[1], no token regex backtracking pressure across tokens, and token resolution at a single code position can(?) differ across conflicting rules? I'm uncertain on that last bit.

> In most cases, you don't have to think about it

Well yes, but, some of us crave much more syntactically flexible languages. :)

[1] https://tree-sitter.github.io/tree-sitter/creating-parsers#c...

specialp
The Strange Loop videos are in the process of being uploaded still. There are a lot that deserve a look. I was there at this talk as well. This is one of the most diverse and high level conferences out there.

https://www.youtube.com/channel/UC_QIfHvN9auy2CoOdSfMWDw/vid...

lioeters
Fascinating work!

This talk is a part of an archive not accessible from the site's menus: https://thestrangeloop.com/2018/sessions.html

crb002
TIL Tim Wagner did a thesis on IDEs before the serverless thing. The man has a passion for developer user experience.
Avi-D-coder
This is great. Has anyone used Tree-sitter in another editor? How does the performance of Atom compare to VS Code these days?
dwenzek
Pretty cool! I was impressed by the way Max refactors a bunch of code using extend selection (at 16:45).
habitue
Ok, now I just need to get this working in emacs.
maxbrunsfeld
Hey, author of Tree-sitter here. Thanks for sharing this! I'd be happy to answer any questions people have about the project.
mncharity
Is the JSON representation of a grammar intended as a public interface?

I noticed [1] doesn't have a version field. "No version: means version:1" is ok, but I then wondered.

[1] https://github.com/tree-sitter/tree-sitter-json/blob/master/...

maxbrunsfeld
Good question.

At one time, I had the thought that folks who disliked JavaScript might want to write their own grammar DSL in another language, which could generate compatible JSON to be consumed by the core Tree-sitter compiler library. Now, I think this flexibility is pretty unnecessary.

I've also thought that some applications might want to inspect the grammar JSON at runtime in order to do some kind of meta-programming. I haven't really thought that through though.

In either case, versioning would probably be a good idea. The actual parser is already versioned so that we can error out if you try to use a parser with an incompatible version of the runtime. We could just add that same version number into the grammar.
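
A minimal sketch of what such a check could look like on the consuming side, assuming a hypothetical top-level `version` field in the grammar JSON and treating a missing field as version 1, as suggested above. The field name, `SUPPORTED_VERSIONS`, and `load_grammar` are all assumptions for illustration; the grammar JSON discussed here has no such field.

```python
import json

# Hypothetical: the grammar versions this consumer understands.
SUPPORTED_VERSIONS = range(1, 3)


def load_grammar(text):
    """Parse grammar JSON and reject versions this consumer cannot handle."""
    grammar = json.loads(text)
    # An absent field is read as version 1, so existing grammars keep working.
    version = grammar.get("version", 1)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported grammar version: {version!r}")
    return grammar


grammar = load_grammar('{"name": "json", "rules": {}}')
print(grammar["name"])
```

Failing early with an explicit error mirrors the parser/runtime check described above, rather than letting an out-of-date consumer misread a newer grammar.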

mncharity
> some kind of meta-programming

Is usually what I end up doing. But there's already the npm package versioning to capture api versions. So one question is whether grammar json is likely to be moved around in space or time in ways for which that's insufficient. One story could be moving generated grammars among different programming language Tree-sitter implementations, which can be out of sync, and where there's no out-of-band information on how to deal with it. Another story could be grammar json being captured in repo, as with tree-sitter-json, while avoiding "be careful updating the version of tree-sitter you use". Generalizing that, any other cases where you don't want to generate the json at runtime, and thus it gets stored longer-term.

mncharity
> versioning

One possible approach might be "part of spec" but "optional" and "not actively in use"? So present in schema and tests, to somewhat reduce the chance of getting locked into not having it by hypothetical others failing to support it. And easily available, so absence doesn't discourage any hypothetical interest. But without spending much time on it, absent non-hypothetical need. Maybe.

> that same version number

Hmm. Parser-runtime versioning would seem only lightly coupled with grammar-parser versioning? Unless there's a grammar-runtime coupling to express?

EDIT: Ok, I was thinking of "grammar" in too general a sense, decoupled from choice of parser tech, rather than the usual "bison grammar tightly coupled to bison parser". Failure to RTFM. Grammar's `conflicts:`, `inline:`, `word:`, `externals:` are all somewhat runtime coupled. Though perhaps looser than parser-runtime? It can be nice to flexibly move work between a compiler and its runtime, without affecting the external api.

mncharity
`externals` seem the (only) way to mix arbitrary code into the parsing, so...

Can `externals` call back into the parser? That is, is the parser reentrant?

Can `externals` be zero-length assertions? That is, is there a progress requirement for success? If not, could they serve as arbitrary-computation, parse-rejecting "semantic actions"?

Do `externals` have access to GLR parse state?

maxbrunsfeld
The Externals API is designed to only let you do things that are compatible with incremental parsing. It’s good for things like indentation based delimiters and automatic semicolon insertion.

You cannot call back into the parser; it’s just for tokenization.

You can produce zero length tokens, and we do use this feature a lot.

You don't have access to the parse state directly, but the external scanner is passed an array of Boolean values that indicates which external tokens are expected in the current state.
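
To make the shape of that interface concrete, here is a toy Python model of an external scanner that emits a zero-length automatic-semicolon token. Real external scanners are written against Tree-sitter's C interface; the `scan` signature, the token indices, and the newline-or-brace rule below are simplifying assumptions, not the actual API.

```python
# Hypothetical indices into the array of expected external tokens.
AUTOMATIC_SEMICOLON = 0
INDENT = 1


def scan(source, position, valid_symbols):
    """Return (symbol, new_position) for a recognized token, else None.

    valid_symbols[i] is True when external token i is expected in the
    current parse state; the scanner sees nothing else of that state.
    """
    if valid_symbols[AUTOMATIC_SEMICOLON]:
        at_end = position >= len(source)
        if at_end or source[position] in "\n}":
            # Zero-length token: the position does not advance.
            return (AUTOMATIC_SEMICOLON, position)
    return None


print(scan("a\nb", 1, [True, False]))   # semicolon expected, newline ahead
print(scan("a\nb", 1, [False, False]))  # semicolon not expected here
```

Because the scanner only looks at the upcoming characters and the expected-token flags, it stays a pure tokenization step, which is what keeps it compatible with incremental parsing.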

mncharity
Can an `externals` be implemented using a second tree-sitter parser? That is, can multiple parser instances be parsing at the same time in a single thread, with no implementation global-state conflicts?

Thanks again.

phodge
First up, thanks for dedicating a big chunk of your life to building tree-sitter! I had a go making a similar parser last year (fast, dynamic-grammar, self-correcting) and had to give up when I started to realize what a ludicrously complex undertaking this is. The fact that tree-sitter works at all is nothing short of amazing.

Question: do you have any plans to integrate tree-sitter with the language-server project(s)? If fast, accurate parsing of any programming language is now easily implemented in language servers via tree-sitter, it seems to make sense for LSP to expand its protocol to include syntax highlighting as well.

mncharity
> thanks

Yes! The state of parsing is such a sad thing, discouraging so much other progress, that it's wonderful to have it improving. It's appreciated.

EDIT: And my thanks for all your guidance this evening. Good night.

maxbrunsfeld
Thanks for the kind words!

I haven't specifically pursued integration with LSP, but Tree-sitter has been used to build a couple of language servers which work with both VSCode and Atom (and probably other editors):

* Bash - https://github.com/mads-hartmann/bash-language-server
* Ruby - https://github.com/rubyide/vscode-ruby/tree/master/server

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.