getZxy with Azure Function app entry point #197
Conversation
Thanks, this looks like a good starting point. The two vital features needed are

I will try with SAS tokens next week and get back to you.
@tengl the preference is for the authentication between Azure Functions and Blob Storage to be role-based, without needing to specify a hardcoded key/secret. In Cloudflare this is done via Workers Bindings, and in AWS via IAM roles. The principle of the AWS and Cloudflare integrations is to present a well-tested and uniform solution for deploying protomaps within a single provider. It is of course possible to mix and match - e.g. Azure Functions serving from an AWS S3 bucket - but the cost model is less efficient and it becomes difficult to debug across providers. So an integrated solution for Azure will be extremely valuable.
I see. I guess it should be possible using this tutorial: https://learn.microsoft.com/en-us/azure/app-service/tutorial-connect-app-access-storage-javascript?tabs=azure-portal. I've only browsed through it, so I'm not sure. We could make an

The SAS token and private container work, as expected. We will go for that solution until we find a way to use the storage blob SDK.
@bdon I just pushed some updates. I made an

It still supports
@bdon I pushed some fixes. It seems to work now, at least on one of my pmtiles files. I've tested on two different pmtiles files, in both local debug and on Azure. All combinations work when using

A caveat though: it doesn't work if multiple maps are enabled and they mix sources. In that case I get an error saying

A note: using FetchSource and SAS token URLs seems to be a lot faster than using Azure Managed Identities. I'm sure/hope that it can be optimized, because right now it is too slow.

I consider it done, but there are some fixes for edge cases and optimizations to be made. I won't have the time to continue work on this at the moment.
What does "mix sources" mean? In the AWS and Cloudflare setups, a single instance of the function can serve multiple PMTiles files, so the cache in global memory needs to key portions of each file based on the filename. |
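A minimal sketch of the keying scheme described above (the class and method names are hypothetical, not from this PR): the shared in-memory cache prefixes every cached byte range with the archive filename, so one function instance can serve several PMTiles files without cache collisions.

```typescript
// Hypothetical sketch: a shared in-memory cache whose keys combine the
// archive filename with the byte range, so portions of two different
// PMTiles files served by the same instance never collide.
export class SharedRangeCache {
  private entries = new Map<string, Uint8Array>();

  private key(archiveName: string, offset: number, length: number): string {
    return `${archiveName}|${offset}|${length}`;
  }

  get(archiveName: string, offset: number, length: number): Uint8Array | undefined {
    return this.entries.get(this.key(archiveName, offset, length));
  }

  set(archiveName: string, offset: number, length: number, data: Uint8Array): void {
    this.entries.set(this.key(archiveName, offset, length), data);
  }
}
```

A real implementation would also bound the cache size, but the essential point is the composite key.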
I made it a bit more flexible, so that the code can accept any kind of source. You can have one pmtiles file served by a
serverless/azure/src/get.ts
  break;
}

if (p_header.etag) {
I don't think we want this. If I'm reading this correctly, it will pass the ETag of the entire archive as the ETag of every tile; this means the ETag is no longer properly content-aware, since tiles will almost always be different.
No, I've changed that to use `tileData.etag`, but haven't pushed the change. I've also added a `Last-Modified`, which is the reason I haven't pushed it yet. I need #304 to be merged first.

But I don't think this is a problem, because the source pmtiles file will not change, so every tile can have the same ETag. It actually will have the same ETag, at least if you use `FetchSource` or `AzureStorageSource`, because the ETag comes from the file in storage, not the bytes. The ETag is unique in combination with the URL, so it will work in browsers. If you change the pmtiles file, the ETag will change for all tiles.

But anyway, by using `tileData.etag` it will be fine if some other source has unique ETags for each tile.
serverless/azure/src/index.ts
);

let allowed_origin = "*";
if (typeof process.env["ALLOWED_ORIGINS"] !== "undefined") {
This might not interact correctly with a caching CDN in front - if we store the `Access-Control-Allow-Origin` header in the CDN cache, we need to invalidate the entire cache any time we change the set of allowed hosts. For AWS we can work around this by letting CloudFront dynamically handle CORS; for Cloudflare, the Workers code can determine the correct header on every invocation. Is there any Azure CDN functionality to do dynamic CORS like CloudFront?
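The per-invocation approach used in the Workers code can be sketched as a pure helper (the function name is hypothetical; `ALLOWED_ORIGINS` as a comma-separated list matches the snippet above): echo back only an allowed origin, and have the caller set `Vary: Origin` so any CDN in front keeps per-origin cache entries.

```typescript
// Hypothetical helper: resolve the Access-Control-Allow-Origin value per
// request. The handler that uses this should also send "Vary: Origin" so
// a CDN caches separate entries per requesting origin.
export function resolveCorsOrigin(
  requestOrigin: string | undefined,
  allowedOrigins: string | undefined
): string | undefined {
  if (!allowedOrigins) return undefined;
  if (allowedOrigins === "*") return "*";
  const allowed = allowedOrigins.split(",").map((o) => o.trim());
  // Only echo back an origin that is explicitly on the allow list.
  if (requestOrigin && allowed.includes(requestOrigin)) return requestOrigin;
  return undefined;
}
```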
Hmm, I need to check that. I used `*` in our case. I'm pretty sure Azure CDN can set the header dynamically.

Also, I see that the `ALLOWED_ORIGINS` parameter is used in a different place as well.
I asked ChatGPT and got this answer:
- Vary Header: If your service correctly implements CORS, it should include a Vary: Origin header in its responses. This header tells caches, including CDNs, that the response varies based on the value of the Origin request header. When a CDN sees this, it should cache separate copies of the resource for each unique value of the Origin header. In this case, a request from example.com would hit one cache entry, and a request from other.com would hit a different cache entry, assuming both are allowed origins.
- Caching Without Vary Header: If your service does not include a Vary: Origin header, the CDN might cache the response to the first request and then serve that cached response to the second request, regardless of the origin. This can lead to incorrect Access-Control-Allow-Origin headers being sent to clients and potential CORS errors in the browser.
- CDN Configuration: Some CDNs offer more advanced configurations that can help manage CORS headers more effectively. For example, you might be able to configure the CDN to dynamically alter the Access-Control-Allow-Origin header based on the origin of the request. However, this capability depends on the specific CDN and its feature set.
- Potential Issues: Without proper handling, you might encounter issues where a user from an unallowed origin receives a cached response indicating they are allowed, or a user from an allowed origin receives a response indicating they are not allowed, due to the CDN serving a cached response intended for a different origin.
- Testing and Verification: It's crucial to test how your CDN handles requests from different origins, especially if you are dynamically setting the Access-Control-Allow-Origin header. Verify that the correct headers are returned and that CORS is functioning as expected for users from different domains.
In summary, I think that in this case it is probably better to have the CDN handle CORS. That way the tile will be cached in the CDN, based on URL and cache-control headers, and subsequent requests will use the cached entry (I assume).
serverless/azure/src/azure_source.ts
function getAzureDetails(mapName: string) {
  let defaultAccountName = process.env.AZURE_STORAGE_ACCOUNT_NAME;
  let defaultContainerName = process.env.AZURE_STORAGE_CONTAINER_NAME;
  let defaultBlobName = process.env.AZURE_STORAGE_BLOB_NAME;
In the AWS and Cloudflare implementations there is a hard constraint of 1 function = 1 bucket. It looks like this behavior is something specific to your use case in Azure? I would strongly prefer as close to 100% parity between AWS/Cloudflare/Azure as possible, meaning there shouldn't be any way to override using `${mapName}` - was there a situation where the existing design did not meet your requirements?
The reason I did that is that we need to specify a blob name, which is a path that could be `maps/basemaps/protomaps/20231124.pmtiles` for map name `default` and `maps/custom_maps/rail/europe_20231123.pmtiles` for map name `rail`.

If we want the function to support multiple maps by `${mapName}` in the URL, I can't think of an easy way to map that. I would prefer if the Function App didn't decide how we organize our files, like it does in the AWS and Cloudflare cases. A suggestion is to keep the `AZURE_STORAGE_BLOB_NAME_${mapName}` override, and let the users specify only one specific storage account and container.

I don't have a strong opinion in this case; I'll let you decide the solution. We don't store custom maps in the same folder as basemaps, and the files don't have the same name. But we use `PMTILES_PATH` anyway, because it is faster.
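The suggested override could look something like this sketch (the helper name is hypothetical; the environment variable names follow the snippet and suggestion above, and `env` is passed in only for testability): fall back to a single default blob name unless a per-map variable exists.

```typescript
// Hypothetical sketch of the per-map override: look up
// AZURE_STORAGE_BLOB_NAME_<mapName> first, then fall back to the single
// default AZURE_STORAGE_BLOB_NAME.
export function blobNameForMap(
  mapName: string,
  env: Record<string, string | undefined>
): string | undefined {
  return env[`AZURE_STORAGE_BLOB_NAME_${mapName}`] ?? env["AZURE_STORAGE_BLOB_NAME"];
}
```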
Sorry for the late reply (I'm also going offline for a week tomorrow, but will prioritize this when I get back).

> I would prefer if the Function App didn't decide how we organize our files

Then the easiest thing to do is to map the bucket key directly to the URL. It looks like you want to perform "aliasing", where a presentation name like `map1` in the URL hits another key in the storage bucket. If we support this feature in Azure then we need to make it work in CF and AWS as well. This kind of aliasing could happen outside this function's logic, in your own application-specific code.
> Then the easiest thing to do is to map the bucket key directly to the URL. It looks like you want to perform "aliasing", where a presentation name like `map1` in the URL hits another key in the storage bucket. If we support this feature in Azure then we need to make it work in CF and AWS as well. This kind of aliasing could happen outside this function's logic, in your own application-specific code.
Well, you could say that it is in the application code, because `getAzureStorageSource` is called in the application code and you can easily implement your own version; the source takes the details as parameters. But I can change it so that the default `getAzureStorageSource` doesn't support "aliasing", and you would have to implement your own version if you need more advanced logic. I'll also change it so that it just replaces `{name}` in the blob name, just like the other sources.
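That substitution scheme is small enough to sketch (the helper name is hypothetical; the `{name}` placeholder is the one mentioned above): the configured blob name acts as a template, and the map name from the URL is substituted in.

```typescript
// Hypothetical sketch: substitute the map name into a configured blob-name
// template, e.g. "maps/{name}.pmtiles" with map "rail" -> "maps/rail.pmtiles".
// Only the first occurrence is replaced; templates are expected to contain
// a single {name} placeholder.
export function resolveBlobName(template: string, mapName: string): string {
  return template.replace("{name}", mapName);
}
```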
Just pushed some changes to make it more like the Cloudflare and AWS implementations. No environment variables (well, almost) are read in the main "library code", but there is a default implementation to get the Azure storage details, so users can use it or develop their own version if they like.

I removed support for `.pbf`, since I figure no one uses `AzureStorageSource` yet, so there is no need to be backwards compatible.

No need to rush it; look at it when you have the time.
Finally revisiting this, some thoughts:

The SAS token can be scoped to a container's read operations, but it looks like the way to use them is to set an expiration date far into the future. Is there any alternative SAS token generation feature equivalent to AWS roles?

Also, ideally we can deploy the code as-is using
Yes, we use SAS tokens and it's working great so far.

I didn't realize that :-/. We used the

Yes, that might be good enough. The only thing (I think) specific to Azure Functions is the entry point; everything else should be generic. I think the only parameter to the function should be the URL, which includes the SAS token, because that's how you copy it from Azure Portal. No need to split it up into multiple parameters. It also makes it easier to support other hosting scenarios; only the entry point needs to be customized, as long as the pmtiles file is hosted via HTTP.

Not that I am aware of, apart from Managed Identities.

I used the VS Code integration. In my experience that is the best way if you want to make changes to the code. But I'm no expert in Azure Functions; I just do the minimum to get the job done :-) If we want to deploy directly to Azure, the CLI is probably the best way. Maybe just create a bash and ps1 script for that?

It should work fine, but has some logic that is irrelevant for non-browser usage, and won't have efficient

Another option for Azure is to use Azure Container Apps and not use Functions at all, but I don't use Azure enough to get a sense of which is better supported. Azure Functions seems like a mess with multiple runtimes, versions and upload methods, but containers might have slow start times.
const p_header = await p.getHeader();

if (
  (ifNonMatchEtag && ifNonMatchEtag === p_header.etag) ||
@tengl we are working through the ETag logic in protomaps/go-pmtiles#137, and I think the behavior here may not be intended. While it's OK for all the individual tiles to have the same ETag as the whole archive, as implemented here, we can't short-circuit the 304 response, because the underlying archive may have changed on storage; in other words, every request to the Azure Function must make at least one request to the storage blob to properly reflect the present state of the ETag.

Given that this optimization isn't possible any longer, I suggest that we converge the PMTiles decoding implementations on generating content-based ETags for each unique tile, like the Lambda function does now: https://github.com/protomaps/PMTiles/blob/main/serverless/aws/src/index.ts#L238

This would also provide the benefit that updating an archive does not invalidate all ETags; any tiles that remain the same across archives can be cached by intermediate CDNs.

Would making this change to unique ETags still work for your deployment?
Well, the Azure Function would always have to make a request to the storage in order to detect changes, right? Isn't that the point of ETags? The only thing they save is the data transfer. I don't understand how you can short-circuit the 304 response and still detect changes in the archive.
So the process is something like this?

1. Client requests a tile without an ETag.
2. Request tile data from the archive.
3. Store the ETag for the archive.
4. Calculate and store the ETag for the tile.
5. Return the tile ETag to the client.
6. New client request for the same tile, with ETag.
7. Request tile data from the archive, with the ETag for the archive.
8. Calculate the ETag from the archive data; same as the client request. Return 304.
9. UPDATE OF ARCHIVE.
10. Client requests the same tile, with ETag.
11. Request tile data from the archive, with the ETag for the archive.
12. ETag mismatch, new data in response.
13. Overwrite the ETag for the archive.
14. Calculate a new ETag for the tile.
15. Check if the client ETag is the same as the newly calculated ETag.
16. If yes, return unchanged; if no, return data and the new ETag.

That saves the data transfer between the Azure Function and the client, but not from the storage blob to the Azure Function.
I don't think we need to worry about my deployment; I use my own repo, so I can change the code however I want, and it makes it easier to deploy using my current tooling.
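The flow above can be sketched as a small handler with the archive fetch injected (all names here are hypothetical, and md5 is just one choice of hash): the archive is always consulted, and the 304 decision is made only after the tile's content-based ETag is recomputed.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of the conditional-request flow: always fetch the
// tile bytes from the archive, derive a content-based strong ETag, and
// only then decide between 304 and 200.
type TileResult = { status: 200 | 304; etag: string; body?: Uint8Array };

export async function serveTile(
  clientEtag: string | undefined,
  fetchTile: () => Promise<Uint8Array>
): Promise<TileResult> {
  // Always hits storage; no short-circuit on a possibly stale cached header.
  const body = await fetchTile();
  const etag = `"${createHash("md5").update(body).digest("hex")}"`;
  if (clientEtag === etag) {
    // Tile bytes unchanged, even if the archive itself was replaced.
    return { status: 304, etag };
  }
  return { status: 200, etag, body };
}
```

As noted above, this saves the function-to-client transfer on a match, but not the storage-to-function transfer.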
My understanding of the implementation on line 67 is that it short-circuits based on the archive ETag and the If-None-Match request header matching. But the call to `p.getHeader()` could be returning a cached, not fresh, value. So on steps 10-12 above, the Azure Function will short-circuit and not detect the mismatch at step 12. Am I interpreting that right, or missing something? (It's been a while since I've looked at this part of the code.)
No, you're right. We don't want that.
But how would unique ETags work? Like the process I described above?
For an Azure Function written in JavaScript, it would call one of the Node standard library hashing functions, such as `md5`, on the tile contents, then set that as a strong ETag header. So this would add some computational overhead to every request. Same as how the Lambda works here.
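A minimal sketch of that hashing step using Node's built-in `crypto` module (the helper name is hypothetical; the Lambda linked earlier in the thread is the reference implementation):

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: derive a strong ETag from the tile bytes. Identical
// bytes always produce the same tag, so replacing the archive does not
// invalidate tiles whose content is unchanged.
export function tileEtag(tileData: Uint8Array): string {
  return `"${createHash("md5").update(tileData).digest("hex")}"`;
}
```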
Yes, but the overhead would also be a data transfer from the storage to the function. Every request needs to calculate the ETag from the tile data, right? Unless we don't check the ETag of the archive.

An optimized version just for HTTP (Fetch) sources could do an HTTP HEAD request at regular intervals to check if the ETag for the archive has changed. If it has, invalidate the cache and load new data; otherwise short-circuit and return 304. The tiles could still have individual ETags, and they could still be the same after the archive has changed. That would save a data transfer from the function to the client if the ETag is still the same. But I don't know how that would be implemented and still fit other sources.
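The interval-based check could look something like this sketch (class and parameter names are hypothetical; the HEAD request is injected as a callback so the idea is independent of any HTTP client): within the interval the cached archive ETag is trusted, and once it elapses a HEAD request refreshes it and reports whether the archive changed.

```typescript
// Hypothetical sketch: cache the archive's ETag and only re-check it via a
// HEAD request after a configured interval, trading freshness for fewer
// round-trips to storage.
export class ArchiveEtagCache {
  private etag: string | undefined;
  private checkedAt = 0;

  constructor(
    private head: () => Promise<string>, // performs the HEAD, returns the archive ETag
    private intervalMs: number
  ) {}

  // Returns true if a refresh detected a changed archive ETag (so callers
  // should invalidate their caches); false if unchanged or still within
  // the trusted interval.
  async archiveChanged(now: number): Promise<boolean> {
    if (this.etag !== undefined && now - this.checkedAt < this.intervalMs) {
      return false; // within interval: assume unchanged, skip the HEAD
    }
    const fresh = await this.head();
    const changed = this.etag !== undefined && fresh !== this.etag;
    this.etag = fresh;
    this.checkedAt = now;
    return changed;
  }
}
```

The trade-off stated above applies: during the interval the function may serve a stale 304.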
I think these are two different perspectives on the purpose of the Azure Function:

- It is a "view" into the storage, not a cache. Every request into the function generates at least one data transfer to the underlying storage. There is caching going on, but only of the minimal necessary parts (headers, directories) to accomplish efficient behavior. It is paired with a standard HTTP cache in front of it, like Azure CDN or CloudFront.
- It itself acts like a cache and is allowed to return stale data. In this case you can optimize away some requests by using the header ETag, at the cost of possibly returning a version that is older than the underlying storage.
- Yes, I agree that it is a view and that it should cache as little as possible. Leaving all caching to the CDN is fine.
- Sure, depending on how long the ETag for the archive is valid, but I suspect the CDN will return stale data more often (depending on Cache-Control values, I guess). Keeping the function simple is probably better than optimizing for some cases where the CDN doesn't hit the cache.

Anyway, the function is pretty fast, normally about 200-300 ms round-trip time, but it can take up to 700 ms. From my very rudimentary tests, I think that at least half of that time is between the CDN and the function, and the rest is between the function and storage.

My subjective impression is that the map feels pretty responsive even when navigating in areas that the CDN hasn't cached. So the conclusion is:

- Keep the function as simple as possible: translate the incoming request and fetch the data from the storage.
- Recalculate the ETag from the tile content.
- Avoid caching.
- The configuration of the function only needs a URL to a file (with SAS token), which means that most of the code can be used in other hosting environments that have HTTP-hosted archives.
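Under that conclusion, the only provider-specific part of the entry point is translating the request path into tile coordinates. A hypothetical sketch of that parsing (the route pattern and extension list are assumptions, not the PR's actual code; `.pbf` is omitted since support for it was dropped earlier in the thread):

```typescript
// Hypothetical sketch: extract z/x/y and the format extension from a
// request path like "/3/4/5.mvt". Everything else in the handler can stay
// generic across hosting environments.
const TILE_PATH = /^\/(\d+)\/(\d+)\/(\d+)\.(mvt|png|jpg|webp)$/;

export function parseTilePath(
  path: string
): { z: number; x: number; y: number; ext: string } | undefined {
  const m = path.match(TILE_PATH);
  if (!m) return undefined;
  return { z: Number(m[1]), x: Number(m[2]), y: Number(m[3]), ext: m[4] };
}
```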
FYI, I have been mostly working on the Google Cloud integration, which uses a Docker container on Docker Registry: https://docs.protomaps.com/deploy/google-cloud. If this deployment method works scale-to-zero on Azure Containers, that is preferable over a custom Function App because it's simply less code. But I haven't succeeded in deploying this (or getting a scale-to-zero config). Anyone with Azure insight want to try it out?
This PR adds code that can be used to deploy protomaps on Azure. See README.md for more info. It is intended to be deployed in an Azure Function App that is exposed through a CDN.

The `pmtiles` file can be stored at any URL, but if you use Azure then you probably want to deploy it to Azure Blob Storage. Just make sure the Function App has access to the container/file. This exact code has not been tested; I just used it as the base for a Function App, which has some modifications that I didn't want to include here.

Most of the code should work for any `Source` and any back-end, with minor modifications.