add support for query language (e.g., a simple subset of JSONPath) #95

Closed
lemire opened this issue Mar 2, 2019 · 26 comments
Labels: enhancement (New feature or request)


@lemire
Member

lemire commented Mar 2, 2019

Currently, clients can navigate the parsed JSON, but there is no support for a query language.

This could be supported either by re-using an existing framework (plugging simdjson into it) or working from simdjson itself.

@lemire lemire added the enhancement New feature or request label Mar 2, 2019
@lemire
Member Author

lemire commented Mar 2, 2019

cc @TkTech

@TkTech
Member

TkTech commented Mar 2, 2019

This is a much bigger project than it sounds if you want comprehensive support. I'd suggest implementing support for simdjson as an extension to jq instead of implementing the query language in simdjson itself. jq and (spec-complete) JSONPath implementations are languages in their own right and typically use a pseudo-VM to run the query.

That said, if you restrict the scope to simple filtering for navigation that is much simpler. I use a simple but inefficient approach in pysimdjson, but I can get away with it because Python object creation is just so expensive in comparison.

@lemire
Member Author

lemire commented Mar 2, 2019

> I'd suggest implementing support for simdjson as an extension to jq instead of implementing the query language in simdjson itself.

I think that is what I mean by "by re-using an existing framework (plugging simdjson into it)". I am betting that this can be done relatively well without too much work.

> This is a much bigger project than it sounds if you want comprehensive support. (...) That said, if you restrict the scope to simple filtering for navigation that is much simpler.

Right. Honestly, I was just thinking of filtering... basically "subset selection".

I am not assuming that this is a small project. It could get a bit challenging, even with a restricted query language.

@TkTech
Member

TkTech commented Mar 2, 2019

If that's all you're looking for at the moment, that is something I can contribute. It would be nice to have the filtering language consistent between the various bindings/forks.

@lemire
Member Author

lemire commented Mar 2, 2019

@TkTech

I think that would be good. Even if it is minimalist, that would still be a huge step forward in some instances.

@geofflangdale
Member

I like the idea of a compatible subset. Is there anything stopping us from saying "ok, here's the bit of JSONPath we support"?

I don't know whether it's feasible to do things entirely from outside simdjson and still get performance that's representative of what it could be. Much of the work in stage 2 could be completely elided if we knew we only needed some of the input.

What might be nice is to figure out how to define a subset of functionality for simdjson that supports outside projects without taking on all the trappings of JSONPath or some big complicated language - i.e. what additional API features would we have to expose?

@TkTech
Member

TkTech commented Mar 4, 2019

JSONPath is probably out-of-scope. You would build a JSONPath implementation on simdjson, but it's a lot to put into what should ideally be a pretty small, solid core.

There are 2 main improvements that I can see as an end-user. One is iterative parsing, which doesn't necessarily offer speed improvements except for possible early termination, but it does offer significantly better peak memory usage. It should probably work in batches or it'll suffer from warmup penalties if the caller takes too long to continue to the next element. The query language won't help with this. (#31)
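As a rough illustration of the iterative idea (not simdjson's API), here is a minimal Python sketch that walks a buffer of whitespace-separated JSON documents one at a time using the standard library's `json.JSONDecoder.raw_decode`. The names `iter_documents` and `first_match` are hypothetical, and treating "iterative parsing" as "one document at a time" is an assumption about what is meant here. Only one decoded document is alive at any moment, and the caller can stop early.

```python
import json
from typing import Any, Callable, Iterator, Optional


def iter_documents(buffer: str) -> Iterator[Any]:
    """Lazily yield each JSON document found in `buffer` (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(buffer)
    while pos < end:
        # raw_decode expects the document to start exactly at `pos`,
        # so whitespace between documents is skipped by hand.
        while pos < end and buffer[pos].isspace():
            pos += 1
        if pos >= end:
            break
        doc, pos = decoder.raw_decode(buffer, pos)
        yield doc


def first_match(buffer: str, predicate: Callable[[Any], bool]) -> Optional[Any]:
    """Return the first document accepted by `predicate`, leaving the rest unparsed."""
    for doc in iter_documents(buffer):
        if predicate(doc):
            return doc
    return None
```

Because the generator is lazy, anything after the first match is never decoded (and never validated), which is the early-termination benefit mentioned above.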

The second is avoiding validation and storing bits we don't care about in the first place (which can also work with iterative parsing). This would certainly improve memory usage but risks being slower than simply parsing everything if we don't implement it properly or make the language too complex.

So I propose 4 simple OPs, same as pysimdjson.

  1. GET [key] (.)
  2. ARRAY ([])
  3. ARRAY INDEX ([<N>])
  4. ARRAY SLICE ([<START>:<STOP>])

A query string is decomposed into these simple operations using a trivial state machine. The parser grabs the next operation off the stack before continuing to parse the next element, so it can immediately discard it if the types mismatch (ex: found the start of an object/dict but query is looking for an array).
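The decomposition step could look something like the following Python sketch. This is not the pysimdjson code; the function name `decompose` and the tuple shapes `("GET", key)`, `("ARRAY",)`, `("INDEX", n)` and `("SLICE", start, stop)` are assumptions made for illustration, as is the rule that quoted keys cannot contain an embedded double quote.

```python
from typing import List, Tuple


def decompose(query: str) -> List[Tuple]:
    """Split a query string into a flat list of simple operations."""
    if not query.startswith("."):
        raise ValueError("query must start with '.'")
    if query == ".":
        return []  # the identity query: no operations, whole document
    ops: List[Tuple] = []
    i, n = 0, len(query)
    while i < n:
        ch = query[i]
        if ch == ".":
            i += 1
            if i < n and query[i] == '"':
                # Quoted key: lets the key itself contain '.' or '['.
                close = query.find('"', i + 1)
                if close == -1:
                    raise ValueError("unterminated quoted key")
                key = query[i + 1:close]
                i = close + 1
            else:
                # Bare key: runs until the next '.' or '['.
                start = i
                while i < n and query[i] not in ".[":
                    i += 1
                key = query[start:i]
                if not key:
                    raise ValueError(f"empty key at offset {start}")
            ops.append(("GET", key))
        elif ch == "[":
            close = query.find("]", i)
            if close == -1:
                raise ValueError("unterminated '['")
            body = query[i + 1:close]
            if body == "":
                ops.append(("ARRAY",))
            elif ":" in body:
                lo, hi = body.split(":", 1)
                ops.append(("SLICE", int(lo) if lo else None, int(hi) if hi else None))
            else:
                ops.append(("INDEX", int(body)))
            i = close + 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return ops
```

Keeping the output a flat list is what makes the early type check cheap: the parser only has to compare the next operation's kind against the token it just saw.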

Usage is simple in practice, and is similar to what jq users would expect.

{
    "hello": "world",
    "list": [
        1,
        2,
        3
    ],
    "list.of.dicts": [
        {"hello": "world"},
        {"hello": "bob"}
    ]
}

Example queries for the above document:

  • . -> entire document
  • .hello -> "world"
  • .list[0] -> 1
  • ."list.of.dicts"[].hello -> ["world", "bob"]
  • ."list.of.dicts"[1].hello -> "bob"
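To make the expected results concrete, here is a small Python sketch that applies already-decomposed operations to an already-parsed document. It assumes each operation is a tuple such as `("GET", "hello")`, `("ARRAY",)`, `("INDEX", 0)` or `("SLICE", 0, 2)`; those shapes and the name `apply_ops` are hypothetical and only meant to mirror the four OPs proposed above. It works on plain Python objects, so it shows the semantics rather than the fast path inside the parser.

```python
from typing import Any, List, Tuple

DOC = {
    "hello": "world",
    "list": [1, 2, 3],
    "list.of.dicts": [{"hello": "world"}, {"hello": "bob"}],
}


def apply_ops(doc: Any, ops: List[Tuple]) -> Any:
    """Evaluate a list of simple operations against a parsed document."""
    results = [doc]   # values currently selected
    fanned = False    # True once an ARRAY op has spread over elements
    for op in ops:
        kind = op[0]
        selected = []
        for value in results:
            if kind == "GET":
                if not isinstance(value, dict):
                    raise TypeError("GET needs an object")
                selected.append(value[op[1]])
            elif kind in ("ARRAY", "INDEX", "SLICE"):
                if not isinstance(value, list):
                    raise TypeError(f"{kind} needs an array")
                if kind == "ARRAY":
                    selected.extend(value)          # every element
                elif kind == "INDEX":
                    selected.append(value[op[1]])   # one element
                else:
                    selected.append(value[op[1]:op[2]])
            else:
                raise ValueError(f"unknown operation {kind!r}")
        if kind == "ARRAY":
            fanned = True
        results = selected
    # A query without ARRAY selects exactly one value; with ARRAY, a list.
    return results if fanned else results[0]


# The example queries above, written as pre-decomposed operations:
whole = apply_ops(DOC, [])
hello = apply_ops(DOC, [("GET", "hello")])
first = apply_ops(DOC, [("GET", "list"), ("INDEX", 0)])
names = apply_ops(DOC, [("GET", "list.of.dicts"), ("ARRAY",), ("GET", "hello")])
```

The type checks are where a parser-integrated version would discard a mismatching element immediately instead of raising.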

These queries should be usable both after the document has been parsed into ParsedJson and beforehand, when calling json_parse, in which case the parser can use the query to skip unwanted elements and their validation.

@TkTech
Member

TkTech commented Mar 5, 2019

@EgorBo @luizperes as folks that have built on top of simdjson already, your input would be appreciated :)

@luizperes
Member

Hi @TkTech,
I liked it! It seems simple and effective!

grammar ::= '.' | '.' binding
binding ::= string | string '[' number* ']' | binding '.' binding


@lemire
Member Author

lemire commented Mar 5, 2019

Wait? We already have a formal syntax?

@geofflangdale
Member

I'm not enthusiastic about this direction.

I can potentially see all manner of helper libraries that sit outside of simdjson and help people query things. This is fine and I don't see any reason to stop this, but there's no reason that simdjson needs to change (I hope) to support this kind of use.

What I'm enthusiastic about is the idea of people being able to control the simdjson parsing step with a query to select out things that are needed and reduce the parsing cost. This has potentially enormous performance benefits, as we can avoid materializing large quantities of the document. Even a straightforward suppression of stage 2 when not needed is a big deal, but beyond that, techniques to accurately search for keys or values (while keeping track of what level we're in, etc) have huge potential (e.g. 4-5x on our existing speeds).
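A scalar, much-simplified stand-in for this idea is sketched below in Python. It is not simdjson's stage 1 or stage 2: the function `extract_top_level` is hypothetical, it assumes well-formed input, and it only handles a key at the top level of an object. What it does show is searching for a key while tracking the nesting level, and materializing only the value that was asked for. Everything skipped is neither decoded nor validated.

```python
import json
from typing import Any


def extract_top_level(text: str, wanted: str) -> Any:
    """Decode only the value stored under top-level key `wanted`."""
    decoder = json.JSONDecoder()
    depth = 0
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # Find the closing quote, stepping over backslash escapes.
            j = i + 1
            while text[j] != '"':
                j += 2 if text[j] == "\\" else 1
            if depth == 1:
                # At depth 1, a string followed by ':' is a top-level key.
                k = j + 1
                while k < n and text[k].isspace():
                    k += 1
                if k < n and text[k] == ":" and json.loads(text[i:j + 1]) == wanted:
                    k += 1
                    while text[k].isspace():
                        k += 1
                    # Materialize just this one value.
                    value, _ = decoder.raw_decode(text, k)
                    return value
            i = j + 1
        elif ch in "{[":
            depth += 1
            i += 1
        elif ch in "}]":
            depth -= 1
            i += 1
        else:
            i += 1  # numbers, literals, commas, colons, whitespace
    raise KeyError(wanted)
```

Nested occurrences of the same key and look-alike text inside string values are ignored because the match is only attempted on strings seen at depth 1.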

@lemire
Member Author

lemire commented Mar 5, 2019

@geofflangdale This context is relevant: TkTech/pysimdjson#22

@geofflangdale
Member

We're not going to get too many cracks at building a query language. I like the start that's been made here, but I don't really want something that is only centered around putting a band-aid on the injury of Python object creation.

I would like something that offers a bit more power and can be supported natively within simdjson. We could really make selective queries blaze along. Stage 2 is a huge millstone around our necks, and half of the reason we stopped optimizing Stage 1 after a point is that Stage 2 is so expensive. So having more opportunity to cut down on Stage 2 work would open up even more ability to go really fast.

So think big! (and also small and tractable and implementable, please :-)). But I think a query language really needs some ability to search and pattern match as well (within reason).

@lemire lemire changed the title from "add support for query language (e.g., JSONPath)" to "add support for query language (e.g., a simple subset of JSONPath)" Mar 13, 2019
@klon

klon commented Apr 10, 2019

Maybe support for JSON Pointer would be a nice start (https://tools.ietf.org/html/rfc6901)
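For readers unfamiliar with RFC 6901, a pointer is either the empty string (the whole document) or a sequence of `/`-prefixed reference tokens, where `~1` stands for `/` and `~0` stands for `~`. The Python sketch below resolves a pointer against an already-parsed document. The function name `resolve_pointer` is hypothetical and this is not the simdjson implementation; it also rejects the `-` array token, which the RFC defines as referring to a nonexistent element past the end.

```python
from typing import Any


def resolve_pointer(doc: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer against parsed JSON (dicts and lists)."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = doc
    for raw in pointer[1:].split("/"):
        # Unescape in the order the RFC requires: '~1' first, then '~0'.
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, dict):
            current = current[token]
        elif isinstance(current, list):
            # Array indices are "0" or ASCII digits without a leading zero.
            valid = token.isascii() and token.isdigit() and (token == "0" or token[0] != "0")
            if not valid:
                raise ValueError(f"invalid array index {token!r}")
            current = current[int(token)]
        else:
            raise TypeError(f"cannot apply {token!r} to a scalar value")
    return current
```

With the example document from earlier in this thread, the pointer `/list.of.dicts/1/hello` would select the same value as the query `."list.of.dicts"[1].hello`. Note that JSON Pointer has no wildcard or slice, so it covers only the single-value subset of the proposed queries.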

@lemire
Member Author

lemire commented Apr 10, 2019

@klon Cookies for you!!! Yes... yes...

@lemire
Member Author

lemire commented Jul 16, 2019

@ioioioio Can we add JSON Pointer to our 0.2 release target? (End of summer)

@ioioioio
Member

Sure. It works on the json_parser branch; it is just not yet as clean as I'd like it to be.

@lemire
Member Author

lemire commented Jul 16, 2019

Marked for inclusion in the next release.

@lemire
Member Author

lemire commented Jul 16, 2019

cc @carinecroteau

@lemire
Member Author

lemire commented Jul 26, 2019

JSON Pointer support has been added by @ioioioio; it follows RFC 6901. I believe that this should go a long way toward addressing part of this issue. Nevertheless, we need to leave it open because it seems like there are several different issues.

Todo: we need to break this wide issue into separate components that can be addressed and closed. It is currently a bit too open-ended.

@davidglavas

Where can I find examples on how to navigate/query a ParsedJson object? Where can I learn how to get values out of the parsed json document?

@lemire
Member Author

lemire commented Nov 26, 2019

JSON Pointer has been implemented and is in master, so I am going to close this issue.

@davidglavas

Yes, but where can I find examples on how to use it? Is there some tutorial (other than the "Navigating the parsed document" section in the readme) to learn how to get values out of a parsed json document?

@lemire
Member Author

lemire commented Nov 26, 2019

@davidglavas

The section in question has been improved:

https://github.com/lemire/simdjson#navigating-the-parsed-document

Currently there is no tutorial; I will create a new issue.

@lemire lemire closed this as completed Nov 26, 2019
@lemire lemire added this to Done in Release 0.3 Nov 26, 2019
@jkeiser jkeiser added this to the 0.3 milestone Feb 12, 2020
@udem1234

Hi.
Is there a chance that JSONPath will eventually be supported?

@lemire
Member Author

lemire commented Sep 26, 2023

@udem1234 Definitely. I have opened an issue regarding JSON Path: #2070

9 participants