Added "MongoDB Query" syntax #2502

Closed
wants to merge 3 commits into from

Conversation

airs0urce
Contributor

No description provided.

Member

@RunDevelopment RunDevelopment left a comment

Thank you for making this pull request @airs0urce.

I gave it a quick review but I mainly want to ask this: Is this a real language?
From what I see, this is just a subset of JS with some additional highlighting for special MongoDB properties. Could you please explain the use case for this?

return keyword.replace('$', '\\$');
});

var keywordsRegex = '(?:' + keywords.join('(?:\\b|:)|') + ')\\b';
Member

This will generate a string value like this:

(?:\$foo(?:\b|:)|...|\$bar(?:\b|:)|\$baz)\b

Assuming that it's a bug that the last keyword doesn't get the (?:\b|:) suffix, we can factor out the common pre- and suffixes like this:

\$(?:foo|...|bar|baz)(?:\b|:)\b

Let's talk about the (?:\b|:)\b suffix. It's equivalent to (?:\b|:\b), where the problem is easier to see. The :\b alternative can never be matched: given a string "$foo:", the \b alternative already accepts right after "$foo", so the ":" branch is never tried.
So we can simplify the whole pattern even further:

\$(?:foo|...|bar|baz)\b

But we know that the \b assertion will always accept because of the way this pattern is used. It's inserted to create the keyword regex, so we know that what follows looks like this: ["']?$. Since we know that \b always accepts, we can just remove it.

\$(?:foo|...|bar|baz)

^ This is the string we want to generate.
So you can remove the $ prefix in all strings of the keywords array, you can remove the mapping adding a \ character, and this line becomes:

Suggested change
- var keywordsRegex = '(?:' + keywords.join('(?:\\b|:)|') + ')\\b';
+ var keywordsRegex = '\\$(?:' + keywords.join('|') + ')';
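
The simplification can be checked with a small standalone sketch. The three-entry keyword list is a hypothetical stand-in for the real array in the pull request, and the surrounding `^["']?` ... `["']?$` anchoring is an assumption based on the usage described above, not the exact code under review.

```javascript
// Hypothetical, shortened keyword list (the real array in the PR is longer).
const names = ['gt', 'lt', 'set'];

// Original construction: every keyword carries its own escaped "$" prefix.
const escaped = names.map((name) => '\\$' + name);
const oldSource = '(?:' + escaped.join('(?:\\b|:)|') + ')\\b';

// Suggested construction: one shared "$" prefix and no trailing \b.
const newSource = '\\$(?:' + names.join('|') + ')';

// Assumed usage: the pattern must cover a whole, optionally quoted, property name.
const oldRegex = new RegExp('^["\']?' + oldSource + '["\']?$');
const newRegex = new RegExp('^["\']?' + newSource + '["\']?$');

// Compare both patterns on a few property names.
const samples = ['$gt', '"$lt"', "'$set'", '$gte', 'gt', '$foo'];
const results = samples.map((s) => [s, oldRegex.test(s), newRegex.test(s)]);
console.log(results);
```

On these samples both patterns give the same answer for every input, which is what the step-by-step reduction predicts.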

},
entity: {
// ipv4
pattern: /\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b/,
Member

Shouldn't the whole string content be an IP address instead of just a substring? E.g. "foo bar 0.0.0.0 baz".

Same for url.
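
To illustrate the difference, here is a rough sketch that compares the substring pattern from the diff with an anchored variant. The anchored regex is only one possible fix and is an assumption; it is applied to the string content without its surrounding quotes.

```javascript
// One IPv4 octet (0-255), copied from the pattern under review.
const octet = '(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)';

// Pattern as written in the PR: matches an address anywhere inside the text.
const substringIp = new RegExp('\\b(?:' + octet + '\\.){3}' + octet + '\\b');

// Assumed alternative: the whole string content must be an address.
const wholeIp = new RegExp('^(?:' + octet + '\\.){3}' + octet + '$');

const text = 'foo bar 0.0.0.0 baz';
console.log(substringIp.test(text)); // the address inside the sentence is found
console.log(wholeIp.test(text)); // the sentence as a whole is not an address
console.log(wholeIp.test('192.168.0.1')); // a bare address still matches
```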

@airs0urce
Contributor Author

airs0urce commented Aug 7, 2020

@RunDevelopment

Thank you for making this pull request @airs0urce.

I gave it a quick review but I mainly want to ask this: Is this a real language?
From what I see, this is just a subset of JS with some additional highlighting for special MongoDB properties. Could you please explain the use case for this?

Thank you for the quick review.
Yes, this is really closer to a subset of JavaScript.

Basically, MongoDB has a query language that you use to fetch data from the database. It is like SQL, but the syntax is a mix of JSON and JavaScript: you can define only one JS object, and you can use only the limited set of functions supported by MongoDB.

This type of highlighting is implemented in many Mongo clients. Examples:

  • MongoDB Compass (official MongoDB GUI client)
    (screenshot)
  • "NoSQL booster for mongoDB"
    (screenshot)
  • Robo 3T
    (screenshot)

and many others.
After checking, I see that each of them highlights slightly differently; there seems to be no standard for this.

I'm working on a client app for MongoDB; here is the code: https://github.com/airs0urce/punkmongo
This is a screenshot of the interface, with a demo query pasted into the "filter" area:
(screenshot)
I need this new syntax to highlight the query the user types, and I also use it to highlight the query results:
(screenshot)

I'm not sure whether this syntax should go into the main prism.js repo, but I decided to send a pull request anyway. I was not able to find any library that can highlight Mongo queries, except this one: https://github.com/mongodb-js/ace-mode

But it has a web worker for syntax checking and does not look lightweight enough to me, especially because I plan to use it for highlighting query results too, and there may be 1000 records shown. So I wanted some simple highlighting, which is why I implemented it on prism.js.

--

Btw, I now think it makes more sense to call it "Mongo Filter", because there are also aggregation queries, which are queries too and can use different keywords.
So it's better to call this syntax "Mongo Filter", and I plan to create another syntax named "Mongo Aggregation Stage".

@airs0urce
Contributor Author

Here is also an example of a query made using the official Mongo Shell (https://docs.mongodb.com/manual/mongo/) and its results:
(screenshot)

Just to show: here I tried to use "RandomFunction" in the Mongo shell and got an error:
(screenshot)

In the syntax I highlight only valid functions and keywords like $set, $unset, $gt, etc.

@airs0urce
Contributor Author

airs0urce commented Aug 7, 2020

Here is the list of syntaxes I came up with after thinking about how to define them more correctly:

  • MongoDB Filter - Only filter-related keywords such as $gt and $lt should be available here. Update-related keywords such as $set and $unset should not be included.
  • MongoDB Update - When you want to update records in the database, you run an update query: a MongoDB Filter selects the records to update, and the "MongoDB Update" part describes how exactly to update them. Filter keywords such as $gt and $lt should not be available here, but update-related keywords should be, for example $set, $unset, $pull.
  • MongoDB Document - This is one record in the results MongoDB returns (like a row in SQL databases). Keywords like $set, $gt, $lt should not be available here.
  • MongoDB Aggregation Stage - This is the aggregation framework, which lets you write more complex queries than MongoDB Filter allows. Additional keywords such as $match and $group are available here.

So done this way it will be 4 different syntaxes. That looks like a lot, but this is how it works in MongoDB. Anyway, let me know what you think. I could create all 4 syntaxes.
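
A minimal sketch of what the split means in practice, assuming each syntax simply gets its own operator list. The object layout and the `operatorRegex` helper are hypothetical; only the operator names come from the list above.

```javascript
// Operators per context, using only the examples named in this comment.
const operatorsByContext = {
  'filter': ['gt', 'lt'],
  'update': ['set', 'unset', 'pull'],
  'document': [],
  'aggregation-stage': ['match', 'group'],
};

// Hypothetical helper: builds a regex that accepts exactly the given operators.
function operatorRegex(names) {
  if (names.length === 0) {
    return /(?!)/; // empty negative lookahead: never matches anything
  }
  return new RegExp('^\\$(?:' + names.join('|') + ')$');
}

const filterOps = operatorRegex(operatorsByContext['filter']);
const documentOps = operatorRegex(operatorsByContext['document']);
// A single merged language would accept every operator in every context.
const mergedOps = operatorRegex(Object.values(operatorsByContext).flat());

console.log(filterOps.test('$gt'), filterOps.test('$set'));
console.log(documentOps.test('$gt'));
console.log(mergedOps.test('$set'), mergedOps.test('$group'));
```

With separate lists, `$set` is not highlighted inside a filter; with one merged list it would be.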

@RunDevelopment
Member

Sorry for the delay!

After your response and looking through the MongoDB doc and projects, I think it is best to implement this as one language that is a superset of JS. (One language because it's easier to use.) The additions to vanilla JS will be MongoDB-specific properties (e.g. $currentDate), properties in general, and highlighting for string URLs (and IP addresses).

The implementation could look like this:

Prism.languages.mongodb = Prism.languages.extend('javascript', {});

Prism.languages.insertBefore('mongodb', 'string', {
	'property': {
		pattern: /(?:(["'])(?:\\(?:\r\n|[\s\S])|(?!\1)[^\\\r\n])*\1|[_$a-zA-Z\xA0-\uFFFF][$\w\xA0-\uFFFF]*)(?=\s*:)/,
		greedy: true,
		inside: {
			'keyword': RegExp('^([\'"]?)\\$(?:' + ['lt', 'gt', ...].join('|') + ')(?:\\1)$')
		}
	}
});

Prism.languages.mongodb.string.inside = {
	'url': { pattern: /.../ },
	'ip-address': { pattern: /.../, alias: 'entity' },
};

If we also wanted special highlighting for MongoDB-specific types (e.g. ObjectId), we can do it like this:

Prism.languages.insertBefore('mongodb', 'function', {
	'builtin': /\b(?:ObjectId|Code|...)\b/
});

The main advantage of implementing it like this is that JS is doing most of the work for us. We don't have to worry about comments, numbers, keywords, etc; JS handles all of this for us.
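
To see what the 'property' pattern and its inner 'keyword' pattern pick up, here is a standalone sketch that runs the two regexes directly, without Prism. The operator list and the sample query are assumptions, and plain `String.prototype.match` only approximates Prism's tokenization (there is no greedy handling and no interaction with other tokens).

```javascript
// Property pattern from the snippet above, with the g flag added for scanning.
const property = /(?:(["'])(?:\\(?:\r\n|[\s\S])|(?!\1)[^\\\r\n])*\1|[_$a-zA-Z\xA0-\uFFFF][$\w\xA0-\uFFFF]*)(?=\s*:)/g;

// Hypothetical, shortened operator list.
const operators = ['lt', 'gt', 'set'];
const keyword = RegExp('^([\'"]?)\\$(?:' + operators.join('|') + ')(?:\\1)$');

// Hypothetical sample query.
const query = '{ age: { $gt: 18 }, \'name\': "x", "$set": 1 }';

// Collect every property name and flag the MongoDB operators among them.
const found = (query.match(property) || []).map((key) => ({
  key,
  isOperator: keyword.test(key),
}));
console.log(found);
```

Quoted and unquoted keys are both recognized as properties, while the string value "x" is not, because it is not followed by a colon.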

Thoughts?

@airs0urce
Contributor Author

airs0urce commented Aug 10, 2020

@RunDevelopment Yes, sounds good. I'll check how that extending works and prepare the changes when I have free time.
Only thing I hope there is way to exclude keywords like "if", "else" etc. something JS specific that doesn't make sense for mongodb syntax. So, I'll play with that.

About this:

highlighting for string URLs (and IP addresses).

Actually, this is something I did for my Mongo client to make the content of strings easier to analyze; it is not directly related to any MongoDB feature. I believe it would be better not to include them in a generic MongoDB syntax that could potentially go into the Prism.js distribution.
Is there any practice about this? Right now I think about preparing mongodb syntax without "url" and "ip" and later I can make a fork and add those additional "decorations".

@RunDevelopment
Member

Only thing I hope there is way to exclude keywords like "if", "else" etc. something JS specific that doesn't make sense for mongodb syntax.

You could say that they are dead weight and that they make the tokenization process slightly slower, but removing them is probably not worth your time.

Also, can't queries include functions that in turn can contain arbitrary JS code?

Is there any practice about this? Right now I think about preparing mongodb syntax without "url" and "ip" and later I can make a fork and add those additional "decorations".

The reason we do syntax highlighting is to improve the readability of code. As long as the feature improves readability, I'm very open to make it a part of Prism. Since URLs are a somewhat common appearance in databases, I think it's a nice idea to highlight them.

@airs0urce
Contributor Author

airs0urce commented Aug 11, 2020

@RunDevelopment

Sorry for bothering you again, but I have been thinking more about the MongoDB syntaxes.
Since I want them to be part of prism.js, I would like to know your opinion.

Here are my thoughts:

Actually, I see two approaches that different Mongo clients use.

1) Approach #1: the client lets you edit the query filter, and all other options are available in the GUI through fields/checkboxes.

So the only thing to highlight here is the query filter, which is one object with a set of available functions like ObjectId:
{'_id': ObjectId('5f31857635a20728b57d6c96'),}

Official client MongoDB Compass:

(screenshot: mongo_query2)

The client I'm working on:

(screenshot: mongo_query)

2) Approach #2 is to give you a shell. In this case you can actually run JavaScript, so extending the JavaScript syntax makes sense here. "NoSQL Mongo Booster" gives even more: you can use libraries like moment.js, etc.

So you can write normal JS code:

db._logs_api.find({'_id': ObjectId('5f31857635a20728b57d6c96')})

Official MongoDB Shell:

(screenshot: mongo_shell)

NoSQL Mongo Booster app:

(screenshot: mongo_shell2)

There is no right answer to the question "Which approach is right?", as even the two official Mongo clients use different approaches.
This means both approaches are OK. Based on this, I believe the syntaxes for MongoDB should be done like this:

4 syntaxes for the basic parts (described in my comment above):

  • MongoDB Filter
  • MongoDB Update
  • MongoDB Document
  • MongoDB Aggregation Stage

I plan to use all of them in different places of my project, so in query filter I'll use "MongoDB Filter" syntax and this
way properties like $match and $group (from "MongoDB Aggregation Stage") will be ignored. For displaying results I'll use "MongoDB Document" and
redundant properties like $set, $unset from "MongoDB Update" will be ignored too, it will speedup parsing when you show many results like I plan to do in my client.

So each part will highlight only what needed which will make it better for UX as user will get better control on what he wrote and same time we will get better parsing speed.

Then later, if somebody wants to implement a syntax for approach #2, they can reuse all 4 parts I created. Maybe I will even create it myself.
They could call that syntax, for example,

  • MongoDB Shell

"MongoDB Shell" will extend the JavaScript syntax and also depend on those 4. Something like this would go in "components.json":

"mongodb-shell": {
    "title": "MongoDB Shell",
    "require": ["javascript", "mongodb-filter", "mongodb-update", "mongodb-document", "mongodb-aggregation-stage"],
    "owner": "<username>"
},

The new "MongoDB Shell" syntax will behave like normal JavaScript, but it will detect the parts that must be highlighted in a Mongo-specific way. For example:

MongoDB Filter

.find(<MongoDB_Filter-syntax>)

MongoDB Update

.updateOne(<...>, <MongoDB_Update-syntax>)
.updateMany(<...>, <MongoDB_Update-syntax>)

MongoDB Document. Any object will be highlighted:

{<MongoDB_Document-syntax>}

MongoDB Aggregation Stage

.aggregate([
    <MongoDB_Aggregation_Stage-syntax>, 
    <MongoDB_Aggregation_Stage-syntax>,  
    ...
])

So, in the end, the list of syntaxes will look like this:

  • MongoDB Filter
  • MongoDB Update
  • MongoDB Document
  • MongoDB Aggregation Stage
  • MongoDB Shell: javascript + reuse of 4 syntaxes above

For now I can create the first 4 syntaxes. I don't need "MongoDB Shell" right now, but I'll probably make it later too.
The list looks big, but at the same time it looks like the optimal way to make sure that both approaches, #1 and #2, can be highlighted.

I would like to know your opinion about this: are you OK with a pull request for all 4 syntaxes together?
If you have doubts about many syntaxes instead of one, let me know - I'll try to explain with more examples.

@RunDevelopment
Member

it will speedup parsing
we will get better parsing speed.

Please don't worry about the tokenization speed. I really mean it. Since #1909, Prism is really fast.

Benchmark

This was run on my Windows 10 computer with an Intel i7-8700K and NodeJS v13.12.0.

This is a benchmark log from #2153. Each section starts with the current language(s) followed by a list of files and the tokenization timings for that file.

Example:

  ../../components.json (25 kB)
  | local     2.23ms ±  2%   48smp

This means that the file ../../components.json is 25kB in size, was tokenized using the javascript language, and it took 2.23ms on average.

Log:

c

  https://raw.githubusercontent.com/git/git/master/mergesort.c (1 kB)
  | local     0.08ms ±  1%   45smp
  https://raw.githubusercontent.com/git/git/master/mergesort.h (1 kB)
  | local     0.02ms ±  1%   49smp
  https://raw.githubusercontent.com/git/git/master/remote.c (58 kB)
  | local     2.80ms ±  1%   48smp
  https://raw.githubusercontent.com/git/git/master/remote.h (10 kB)
  | local     0.35ms ±  1%   52smp

------------------------------------------------------------

css

  ../../style.css (7 kB)
  | local     0.85ms ±  1%   54smp

------------------------------------------------------------

css!+css-extras (css)

  ../../style.css (7 kB)
  | local     1.32ms ±  1%   53smp

------------------------------------------------------------

javascript

  ../../components.json (25 kB)
  | local     2.23ms ±  2%   48smp
  ../../package-lock.json (190 kB)
  | local    11.87ms ±  2%   39smp
  ../../scripts/utopia.js (11 kB)
  | local     1.32ms ±  1%   49smp
  https://cdnjs.cloudflare.com/ajax/libs/prism/1.20.0/prism.js (29 kB)
  | local     3.53ms ±  1%   48smp
  https://cdnjs.cloudflare.com/ajax/libs/prism/1.20.0/prism.min.js (14 kB)
  | local     2.04ms ±  1%   51smp
  https://code.jquery.com/jquery-3.4.1.js (274 kB)
  | local    30.00ms ±  2%   32smp
  https://code.jquery.com/jquery-3.4.1.min.js (86 kB)
  | local    17.37ms ±  2%   33smp

------------------------------------------------------------

json

  ../../components.json (25 kB)
  | local     1.33ms ±  1%   51smp
  ../../package-lock.json (190 kB)
  | local     7.58ms ±  1%   41smp

------------------------------------------------------------

markup

  ../../download.html (4 kB)
  | local     0.34ms ±  0%   51smp
  ../../index.html (19 kB)
  | local     1.81ms ±  1%   53smp
  https://github.com/PrismJS/prism (192 kB)
  | local    16.24ms ±  1%   35smp

------------------------------------------------------------

markup!+css+javascript (markup)

  ../../download.html (4 kB)
  | local     0.61ms ±  1%   53smp
  ../../index.html (19 kB)
  | local     2.52ms ±  1%   51smp
  https://github.com/PrismJS/prism (192 kB)
  | local    21.09ms ±  1%   35smp

------------------------------------------------------------

ruby

  https://raw.githubusercontent.com/rails/rails/master/actionview/lib/action_view/base.rb (12 kB)
  | local     0.53ms ±  0%   53smp
  https://raw.githubusercontent.com/rails/rails/master/actionview/lib/action_view/layouts.rb (16 kB)
  | local     0.64ms ±  1%   53smp
  https://raw.githubusercontent.com/rails/rails/master/actionview/lib/action_view/template.rb (14 kB)
  | local     0.94ms ±  0%   54smp

------------------------------------------------------------

rust

  https://raw.githubusercontent.com/rust-lang/regex/master/src/compile.rs (42 kB)
  | local     2.68ms ±  2%   44smp
  https://raw.githubusercontent.com/rust-lang/regex/master/src/lib.rs (28 kB)
  | local     0.27ms ±  1%   51smp
  https://raw.githubusercontent.com/rust-lang/regex/master/src/utf8.rs (9 kB)
  | local     0.61ms ±  2%   46smp

If you highlight 1MB of text (= >10k lines of code), the tokenization should only take about 100~200ms on a typical desktop computer.
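
As a quick sanity check of that estimate, the javascript timings from the log can be extrapolated to 1 MB. This is only back-of-the-envelope arithmetic: it assumes tokenization time grows linearly with file size and treats 1 MB as 1000 kB.

```javascript
// [size in kB, mean tokenization time in ms], copied from the javascript section of the log.
const samples = {
  'jquery-3.4.1.js': [274, 30.00],
  'jquery-3.4.1.min.js': [86, 17.37],
  'package-lock.json': [190, 11.87],
};

// Assumption: linear scaling with size, 1 MB = 1000 kB.
function estimateMsPerMegabyte([sizeKb, ms]) {
  return (ms / sizeKb) * 1000;
}

const estimates = Object.fromEntries(
  Object.entries(samples).map(([name, sample]) => [
    name,
    Math.round(estimateMsPerMegabyte(sample)),
  ])
);
console.log(estimates);
```

The extrapolated values land between roughly 60 ms and 200 ms per megabyte, which is in line with the figure quoted above.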

It's what comes after the tokenization that is usually the bottleneck. First, we have to create HTML from the token stream (this usually takes half as long as the tokenization itself) and then we hand it off to the browser. The browser has to parse the HTML code, create hundreds and thousands of DOM nodes, and then calculate layout and style for all of them. With asynchronous highlighting, we can offload all of Prism's work (tokenization + HTML creation) to a different thread, so your page remains responsive but we can't do anything about the browser having to display the highlighted code.

Unfortunately, Prism can't do partial highlighting and we don't plan to make it a focus in the future. If you need to regularly highlight megabytes of text (>= 10MB) and still need your webpage to be snappy, you might need a library that dynamically highlights and displays part of the text as you scroll.

If you have doubts about many syntaxes instead of one, let me know

I still do. As I said, tokenization performance isn't usually the bottleneck and it seems to be one of your motivations for making 4 languages.

You said that making it 4 languages is more correct, but I don't really see that. I get that you can get false positives if you merged everything together (e.g. "properties like $match and $group (from "MongoDB Aggregation Stage") will be ignored [in filter queries]"). However, I don't think this is too much of an issue because nobody will be using those properties in the wrong context (or at least nobody should).

Also, in the end, it's just less work to make one language instead of 4 to 5.

@airs0urce
Contributor Author

OK, got it. I'll create a new pull request when done and close the current one for now.
In the new pull request I'll add a "MongoDB" syntax with full coverage that extends JavaScript, which fits the prism.js approach to languages.
Along with this, I'll keep the syntaxes separate in my fork, so it will still be possible to use the sub-syntaxes if anybody needs them.

@airs0urce airs0urce closed this Aug 12, 2020
@airs0urce airs0urce deleted the mongodb-query-syntax branch August 13, 2020 10:24
@airs0urce airs0urce mentioned this pull request Aug 13, 2020
@airs0urce
Contributor Author

For those who read this thread, here is the fork with separate MongoDB syntaxes:
https://github.com/airs0urce/prism-mongodb

Read README.md to understand how to use it.
