PROV-N deserialization? #122

Open
MarcelPa opened this issue Sep 5, 2018 · 17 comments
@MarcelPa

MarcelPa commented Sep 5, 2018

Hello, I would like to know whether a PROV-N deserializer is anywhere on the implementation roadmap. If not, I would like to contribute one, if that is of interest, of course.

@trungdong
Owner

Thanks, @MarcelPa. That'd be fab!

I've been thinking about doing this, but haven't found the time for it. @TomasKulhanek recently wrote an ANTLR grammar for PROV-N, which I believe can be used to build a PROV-N parser.

Are you interested in working on that? I'd be very happy to help with testing/integration when needed.

@MarcelPa
Author

MarcelPa commented Sep 6, 2018

Great, I would like to work on that then. The ANTLR grammar will surely be really helpful for that, thanks @trungdong (and @TomasKulhanek of course).
I just forked the repo and will start looking into ANTLR; I hope to start pushing to the development branch of my fork next week.
Will keep you posted :-)

@trungdong
Owner

Excellent! Thanks a lot, @MarcelPa.

@MarcelPa
Author

Quick question regarding testing: is there a pattern on how to create test files that can be found under tests/rdf for example? My approach would be to just copy the rdf documents and translate them into provn docs step by step.

FYI: ANTLR works fine so far, I got rid of raising NotImplementedError to successfully run all test cases in my virtual python environment.

@trungdong
Owner

There is an extensive suite of round-trip conversion tests that you can use right away. See test_json.py for an example. The following test code will do:

class RoundTripPROVNTests(RoundTripTestCase, AllTestsBase):
    FORMAT = 'provn'

BTW, could you develop from the dev branch, please? I've been reorganising the directory structure there and will update it in the next release. Cheers!

@MarcelPa
Author

MarcelPa commented Nov 17, 2018

It has been quite some time since I last pushed to my forked repo, so I am giving you an update via this issue:
Right now, the ANTLR grammar seems to be erroneous in some cases, such as langtags. Unfortunately, I am not an expert in grammars, but I hope to get my head around them soon. I think these errors can be solved by reordering the lexer rules of the grammar; I will test soon whether this helps.
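
Why reordering lexer rules can matter: ANTLR4 lexers pick the rule that matches the longest input, and when two rules match the same length, the rule defined first wins. The following is a minimal, library-free sketch of that tie-breaking behaviour. It is not the actual PROV-N grammar; the rule names `NAME` and `LANGTAG_BODY` and their patterns are hypothetical and chosen only for illustration.

```python
import re


def tokenize(text, rules):
    """Maximal-munch lexer: longest match wins, ties go to the earliest rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Hypothetical rules (not taken from the real grammar): both NAME and
# LANGTAG_BODY can match the text "fr" with the same length.
NAME = ("NAME", r"[A-Za-z][A-Za-z0-9]*")
LANGTAG_BODY = ("LANGTAG_BODY", r"[A-Za-z]+(-[A-Za-z0-9]+)*")
AT = ("AT", r"@")

generic_first = [AT, NAME, LANGTAG_BODY]
specific_first = [AT, LANGTAG_BODY, NAME]

print(tokenize("@fr", generic_first))   # second token is classified as NAME
print(tokenize("@fr", specific_first))  # second token is classified as LANGTAG_BODY
```

With `generic_first`, the language tag body is swallowed by the more general rule, which is the kind of symptom that moving the more specific rule earlier can fix.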

@trungdong
Owner

Thanks for the update, @MarcelPa. Unfortunately, I won't be of much help on ANTLR.

The PROV-N spec does use grammar rules, which you might find useful.

@MarcelPa
Author

Hey, after quite a while I finally had some spare time to spend on this. I modified the grammar a little bit (basically just reordered some rules), and now it seems to work properly :-)
I am down to 13 test cases which fail or error. The next step for me will be to rebase onto the newest commit of the dev branch and keep on developing.
For now, the failing test cases seem to come from incorrectly parsed float values in typed literals. What would you expect to be inserted into the attributes of an expression? Something like

Literal(somevalue, datatype="xsd:float", langtag=someLangtag)

or parsed native value, like

float(somevalue)

I think I have used those a little inconsistently so far. I will need to refactor this one way or another ;-)
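
To make the two options concrete, here is a minimal sketch of one possible middle ground: convert the lexical form to a native Python value but keep the declared datatype next to it. This is only an illustration under stated assumptions. `TypedValue`, `parse_typed_literal` and the converter table are hypothetical names and are not part of the prov package.

```python
from dataclasses import dataclass
from decimal import Decimal


def _parse_boolean(lexical):
    # XSD booleans allow the lexical forms "true", "false", "1" and "0".
    if lexical in ("true", "1"):
        return True
    if lexical in ("false", "0"):
        return False
    raise ValueError(f"not an xsd:boolean: {lexical!r}")


# Assumed (illustrative) mapping from datatype names to Python converters.
_CONVERTERS = {
    "xsd:float": float,
    "xsd:double": float,
    "xsd:int": int,
    "xsd:integer": int,
    "xsd:decimal": Decimal,
    "xsd:boolean": _parse_boolean,
}


@dataclass(frozen=True)
class TypedValue:
    """A parsed value together with the datatype it was declared with."""
    value: object
    datatype: str


def parse_typed_literal(lexical, datatype):
    """Convert the lexical form if a converter is known; keep the string otherwise."""
    converter = _CONVERTERS.get(datatype)
    value = converter(lexical) if converter else lexical
    return TypedValue(value, datatype)


a = parse_typed_literal("1", "xsd:float")
b = parse_typed_literal("1.0", "xsd:float")
print(a == b)                                       # True: same value, same datatype
print(a == parse_typed_literal("1", "xsd:double"))  # False: datatype differs
```

Keeping the datatype alongside the value avoids having to decide, at parse time, which XSD type a bare Python `float` should stand for.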

@trungdong
Owner

Thank you for the update and the work, @MarcelPa!

float(value) should work, I think. Do you have an example of a problematic case?

BTW, a Python float value is mapped to xsd:double by the package though.

@MarcelPa
Author

I do: running test_entity_with_multiple_attribute fails. Both outputs are almost identical:
Parsed data from a debug print:

document
  prefix ex <http://example.org/>
  prefix ex_1 <http://example4.org/>
  
  entity(ex:emov, [ex:v_0="un lieu", ex:v_1="un lieu"@fr, ex:v_2="a place"@en, ex:v_3=1, ex:v_4=1, ex:v_5="1" %% xsd:short, ex:v_6="2" %% xsd:float, ex:v_7="1" %% xsd:float, ex:v_8="10" %% xsd:decimal, ex:v_9="1" %% xsd:boolean, ex:v_10="0" %% xsd:boolean, ex:v_11="10" %% xsd:byte, ex:v_12="10" %% xsd:unsignedInt, ex:v_13="10" %% xsd:unsignedLong, ex:v_14="10" %% xsd:integer, ex:v_15="10" %% xsd:unsignedShort, ex:v_16="10" %% xsd:nonNegativeInteger, ex:v_17="-10" %% xsd:nonPositiveInteger, ex:v_18="10" %% xsd:positiveInteger, ex:v_19="10" %% xsd:unsignedByte, ex:v_20="http://example.org" %% xsd:anyURI, ex:v_21="http://example.org" %% xsd:anyURI, ex:v_22='ex:abc', ex:v_23='ex:abcd', ex:v_24='ex_1:zabc', ex:v_25='ex_1:zabcd', ex:v_26="2019-03-27T12:52:02.266484" %% xsd:dateTime, ex:v_27="2019-03-27T12:52:02.266486" %% xsd:dateTime])
endDocument

versus the testcase data:

document
  prefix ex <http://example.org/>
  prefix ex_1 <http://example4.org/>
  
  entity(ex:emov, [ex:v_0="un lieu", ex:v_1="un lieu"@fr, ex:v_2="a place"@en, ex:v_3=1, ex:v_4=1, ex:v_5="1" %% xsd:short, ex:v_6="2" %% xsd:float, ex:v_7="1.0" %% xsd:float, ex:v_8="10" %% xsd:decimal, ex:v_9="1" %% xsd:boolean, ex:v_10="0" %% xsd:boolean, ex:v_11="10" %% xsd:byte, ex:v_12="10" %% xsd:unsignedInt, ex:v_13="10" %% xsd:unsignedLong, ex:v_14="10" %% xsd:integer, ex:v_15="10" %% xsd:unsignedShort, ex:v_16="10" %% xsd:nonNegativeInteger, ex:v_17="-10" %% xsd:nonPositiveInteger, ex:v_18="10" %% xsd:positiveInteger, ex:v_19="10" %% xsd:unsignedByte, ex:v_20="http://example.org" %% xsd:anyURI, ex:v_21="http://example.org" %% xsd:anyURI, ex:v_22='ex:abc', ex:v_23='ex:abcd', ex:v_24='ex_1:zabc', ex:v_25='ex_1:zabcd', ex:v_26="2019-03-27T12:52:02.266484" %% xsd:dateTime, ex:v_27="2019-03-27T12:52:02.266486" %% xsd:dateTime])
endDocument

The difference is noticeable at ex:v_7="1.0" %% xsd:float, which will be parsed as a float but returned as ex:v_7="1" %% xsd:float.

So far, I did not notice any changes happening from float to double.

@pohutukawa

@MarcelPa Just a quick question: what is the status of the PROV-N deserialiser? It's been a good year, and it looked like things weren't far off.

@MarcelPa
Author

Oh my, I completely lost track of this issue, thanks for the reminder @pohutukawa! I will rebase later today and give a status update; if I recall correctly, I was "stuck" editing the ANTLR PROV-N grammar. Will keep you posted :-)

@MarcelPa
Author

I am back at finding out how antlr4 works (any help is appreciated!). For reasons I do not yet understand, langtags and some int_literals fail to parse, which gives me 57 failures out of 185 unit tests. Once I figure out how to fix that, PROV-N deserialization should be close to completion.

@pohutukawa

That looks promising!
Even if there are some "glitches" as in the comment above (where a float is parsed to ex:v_7="1" %% xsd:float), I'd be fully happy, as the value and its type are still preserved, and only the formatting (to 1.0) is lost.
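
The distinction between lexical form and value can be checked mechanically. The sketch below is a rough illustration and not how the package's round-trip tests work: it pulls typed attributes out of two PROV-N fragments with a regular expression and compares them by value and datatype instead of by string. The helper name `typed_attributes` and the regular expression are assumptions made for this example and only cover the simple `name="lexical" %% datatype` shape.

```python
import re

# Matches attributes of the form  name="lexical" %% datatype  (simplified).
TYPED = re.compile(r'(?P<name>[\w:]+)="(?P<lexical>[^"]*)" %% (?P<datatype>[\w:]+)')
FLOAT_TYPES = {"xsd:float", "xsd:double"}


def typed_attributes(provn_fragment):
    """Map attribute name to (value, datatype), parsing floats to their numeric value."""
    result = {}
    for match in TYPED.finditer(provn_fragment):
        lexical, datatype = match["lexical"], match["datatype"]
        value = float(lexical) if datatype in FLOAT_TYPES else lexical
        result[match["name"]] = (value, datatype)
    return result


parsed = 'ex:v_6="2" %% xsd:float, ex:v_7="1" %% xsd:float'
expected = 'ex:v_6="2" %% xsd:float, ex:v_7="1.0" %% xsd:float'

print(parsed == expected)                                      # False: the strings differ
print(typed_attributes(parsed) == typed_attributes(expected))  # True: values and types agree
```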

@ChrisJMacdonald

Hi @MarcelPa, I'm wondering if you've made any progress on this deserializer? I'd like to work some more with PROV-N but seem quite limited without the ability to store and extract from PROV-N strings. I'm taking a look at the code and the tools to see if I could help, but it's a little bit beyond me at this stage.
Thanks!

@ChrisJMacdonald

ChrisJMacdonald commented Dec 3, 2020

I also found a mildly hacky way to convert in and out of PROV-N using the Java ProvToolbox and provconvert:
save the file as .provn, use provconvert to spit it out as .json, and then use the Python deserialiser to get it back as a ProvDocument.
Luc Moreau has also put up some of his information about the ANTLR3 grammar for PROV (here).
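
The workaround described above could be scripted roughly as follows. This is a sketch under assumptions, not a tested recipe: it assumes the `provconvert` executable from ProvToolbox is on the PATH and accepts `-infile`/`-outfile`, and that the `prov` package is installed. The function names `provconvert_command` and `load_provn_via_json` are hypothetical.

```python
import subprocess
from pathlib import Path


def provconvert_command(provn_path, json_path, executable="provconvert"):
    """Build the provconvert call that turns a .provn file into PROV-JSON."""
    return [executable, "-infile", str(provn_path), "-outfile", str(json_path)]


def load_provn_via_json(provn_path, executable="provconvert"):
    """Convert PROV-N to PROV-JSON with provconvert, then load it with prov."""
    # Imported lazily so the command builder works without prov installed.
    from prov.model import ProvDocument

    json_path = Path(provn_path).with_suffix(".json")
    # check=True raises CalledProcessError if provconvert reports a failure.
    subprocess.run(provconvert_command(provn_path, json_path, executable), check=True)
    with open(json_path) as stream:
        return ProvDocument.deserialize(source=stream, format="json")


print(provconvert_command("example.provn", "example.json"))
```

Calling `load_provn_via_json("example.provn")` would leave an `example.json` file next to the input, which may or may not be desirable; a temporary directory could be used instead.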

@pohutukawa

We've been trying to use @MarcelPa's feature branch that can parse PROV-N with decent results so far. Though, it's not based on the current 2.0 version, yet, so that's a bit of a pity.

If the ANTLR3 grammar by Luc is more complete, would that be an option to move forward on? (Even though it may be more "sexy" to use a current ANTLR4 grammar.) After all, there is an antlr3_python_runtime Python module as well.

I'm just searching for ways to not create any inconsistencies between individual approaches, for the case that the ANTLR4 grammar may differ from the ANTLR3 one ...


4 participants