
Bug: Unexpected namespace creation during turtle file serialization #2779

Open
mickremedi opened this issue May 3, 2024 · 4 comments

@mickremedi

Hello! I've noticed unexpected behavior when serializing a graph to Turtle: adding a triple to an empty graph and then serializing it adds an auto-generated prefix to the output, and also binds that prefix on the graph itself:

import rdflib

TRIPLE = (
    rdflib.URIRef("http://example1.com/s"),
    rdflib.URIRef("http://example2.com/p"),
    rdflib.Literal("some literal"),
)

g = rdflib.Graph(bind_namespaces="none")
g.add(TRIPLE)

print("Namespaces Before:", list(g.namespaces()))

x = g.serialize(format="turtle")

print(x)
print("Namespaces After:", list(g.namespaces()))

Results in:

Namespaces Before: []
@prefix ns1: <http://example2.com/> .

<http://example1.com/s> ns1:p "some literal" .


Namespaces After: [('ns1', rdflib.term.URIRef('http://example2.com/'))]

Whereas one would expect:

Namespaces Before: []
<http://example1.com/s> <http://example2.com/p> "some literal" .


Namespaces After: []

I've boiled it down to the following line:

self.getQName(node, gen_prefix=(i == VERB))

This creates a new prefix whenever the node being processed is the predicate of a triple. I couldn't trace the change through git blame, and I found no docs stating that serialize modifies the graph. Does anyone know why it was added, and whether it can be changed to self.getQName(node, gen_prefix=False)? This seems to have already been done for TriG files in #2467.

@sardormajano

Running into the same issue

@seo-chang

Please solve this!

@mickremedi
Author

Quick Note: I've been able to patch this bug for now by overriding the getQName() method:

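from rdflib.plugins.serializers.turtle import TurtleSerializer

class FixedTurtleSerializer(TurtleSerializer):
    def getQName(self, uri, gen_prefix=True):
        # Ignore the caller's gen_prefix so no new namespace is ever bound.
        return super().getQName(uri, gen_prefix=False)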

Overriding the method covers every call site, since several places in this serializer call it. I'm considering opening a PR to change serialize so that it does not generate namespaces by default, since:

  • There are no docs explaining this behavior
  • Serializers generally don't modify the data of what they are serializing
  • If generating prefixes is done to help optimize the resulting ttl, it would make more sense to apply this generation to any identifier found in a triple statement rather than just the predicate (this can also be added in the PR)

A possible API could look like this (generate_prefixes is a proposed parameter and does not exist yet):

g = rdflib.Graph(bind_namespaces="none")
serialized_without_prefixes = g.serialize(format="turtle", generate_prefixes=False)
serialized_with_prefixes = g.serialize(format="turtle", generate_prefixes=True)

Any thoughts? I could also go with the reverse approach where the default behavior remains the same and an optional param is added to disable the predicate prefix generation. This would be non-breaking, but could be less intuitive for new users.

@nicholascar
Member

Having the two options - to generate and to not generate prefixes - with a documented default sounds great, please do make a PR!
