
Support API based import/export for pageviews as JSON #2653

Open
whitfin opened this issue Apr 6, 2024 · 9 comments
whitfin commented Apr 6, 2024

I am attempting to migrate to Umami from Mixpanel, due to Mixpanel's new data retention rules. Everything is set up and went fairly smoothly, so for my next step I wanted to import my data from Mixpanel into Umami.

After looking through documentation, discussions, code, etc., it appears that this is a weakness of Umami and something I believe we can improve.

I propose adding (at least) the following routes:

GET /events
POST /events
POST /pageviews
GET /pageviews

Although it's possible to index pageview events via /send by simply not providing an eventName, this isn't really intuitive for the API itself (it's fine for script.js), so I propose splitting them here. They would use the same underlying logic, of course.
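For reference, the workaround looks roughly like this: a pageview goes through /send by omitting the event name. This is a hedged sketch; the field names follow Umami's documented /api/send payload shape, but treat the exact shape (and the website id shown) as assumptions against your Umami version.

```typescript
// Sketch: indexing a pageview via the existing /send endpoint by
// omitting the event name. Field names assumed from Umami's docs.
interface SendPayload {
  website: string;   // website UUID
  url: string;
  hostname?: string;
  referrer?: string;
  name?: string;     // present for custom events, omitted for pageviews
}

function buildSendBody(payload: SendPayload) {
  // "event" is the type /send expects for both pageviews and events
  return { type: "event", payload };
}

const body = buildSendBody({
  website: "02d89813-7a72-41e1-87f0-8d668f85008b", // hypothetical id
  url: "/blog/hello-world",
  referrer: "https://example.com",
});

// await fetch("https://umami.example.com/api/send", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
```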

There is currently no way to access raw documents via the API, which isn't ideal for anyone trying to export their data to process in some other form. There are internal tools that would do the required queries, but ideally we'd also have pagination here.

So I guess my question is... is this on the roadmap? If it were to be contributed, would it be accepted? It's quite a blocker on my side; essentially I cannot use Umami unless I can bring my historical data with me.

whitfin commented Apr 6, 2024

I am not much of a TypeScript user, and I've never touched ClickHouse - but I'm happy to contribute towards this effort. As I mentioned, it's a blocker for me and I'd be happy to avoid having to start over 😄


mikecao commented Apr 6, 2024

Are you referring to self-hosting or cloud? For the cloud product there is a guide on how to import data by providing a CSV file: https://umami.is/docs/cloud/import-data. This is a cloud-only feature because it's made specifically for our infrastructure.

For self-hosting, there is no such feature because 1) it's database specific and 2) you should have full access to your database and access to tools like mysqlimport and psql that will do it for you.


whitfin commented Apr 6, 2024

I'm talking about self-hosting, but I'm not totally sure I understand why it matters. Maybe I didn't express my intent properly, so I'll try to clarify.

> For self-hosting, there is no such feature because 1) it's database specific

I'm not sure I understand this. How is it any more database specific than calling /send? In fact you could already turn something like an Nginx log into a series of calls to /send and accomplish an equivalent import - it's simply that the current API does not provide an intuitive interface for doing this. The database (if there even is one) that I'm coming from is not really relevant.

In my specific case, I pulled my events out of Mixpanel and I have them all in JSON, no database involved. This request is more about user friendly APIs; things like indexing multiple views in one call, etc. This is practically no different whether it's cloud hosted or self hosted, unless I'm missing something.

Edit: to clarify, I'm not asking Umami to accept the Mixpanel JSON format I have, it would be up to me to translate it to the Umami format in advance of passing it to the API.

Lots of people have asked about import/export in various issues and discussions and the answer appears to be that it's not being added. Adding APIs that can be used this way allows the community to do the work to build importers from other analytics services so that the Umami team doesn't have to.

> 2) you should have full access to your database and access to tools like mysqlimport and psql that will do it for you.

Well, sure, if you also happen to know the schema of Umami's tables. I (and anyone else who ever wants this) could sit here and sift through Prisma definitions to figure out how to map data from other services into Umami, but this is definitely a) not productive and b) a HUGE turn off.

The same is true for export; clearly I can log into the database and try to figure out how to stitch all the necessary tables together, but I feel like it's pretty standard to offer this via an API. It's extremely common that people want to pull things out of their main analytics storage for things like scripting and more complex evaluation.


mikecao commented Apr 8, 2024

I think I get what you're saying. You want to be able to use the API to import data so that it uses the same internal logic as send. I think we can support something like that. But session data will be tricky because your session id is calculated on the fly at time of creation. But I'm sure we can figure something out.

So, there are really only two tables that you would care about for import/export and that's website_event and session. Then we will need something like:

GET /websites/:id/pageviews
POST /websites/:id/pageviews
GET /websites/:id/events
POST /websites/:id/events
GET /websites/:id/sessions
POST /websites/:id/sessions

There is already an existing pageviews endpoint, but we can rename it.
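None of these endpoints exist yet, so the following is purely a sketch of what a batch import call to a hypothetical POST /websites/:id/pageviews might look like: an array of rows carrying an overridable createdAt from the source system. The body shape and path are assumptions for illustration.

```typescript
// Hypothetical batch-import request for the proposed endpoint.
// The endpoint and body shape are assumptions, not Umami's API.
interface ImportPageview {
  url: string;
  referrer?: string;
  createdAt: string; // ISO timestamp from the original analytics data
}

function buildImportRequest(websiteId: string, rows: ImportPageview[]) {
  return {
    path: `/api/websites/${websiteId}/pageviews`,
    body: { pageviews: rows },
  };
}

const req = buildImportRequest("hypothetical-website-id", [
  { url: "/docs", createdAt: "2023-01-15T10:00:00Z" },
  { url: "/pricing", referrer: "/docs", createdAt: "2023-01-15T10:01:30Z" },
]);

// await fetch(`https://umami.example.com${req.path}`, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(req.body),
// });
```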


whitfin commented Apr 9, 2024

Yeah, you pretty much hit the nail on the head. I understand there's probably a lot of work here, and I know it'll take some time to get to this state.

As a thought exercise based on your initial response, I actually did this over the weekend because I wanted to use Umami for my personal projects. It took me approximately 6 hours to write something to map my data across to Umami's format (converting my Mixpanel events to both pageview events and session models). Most of this was spent looking at how to duplicate Umami's logic (things like normalizing browser, OS, that type of stuff).
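The mapping step can be sketched roughly like this, assuming Mixpanel's standard export properties ($current_url, $browser, $os, time). The Umami-side field names here are illustrative stand-ins; the real columns live in Umami's Prisma schema, and the lowercasing is only a placeholder for Umami's actual browser normalization.

```typescript
// Illustrative mapping from a Mixpanel export record to an
// Umami-style pageview row. Field names are assumptions.
interface MixpanelEvent {
  event: string;
  properties: {
    time: number;             // unix seconds
    distinct_id: string;
    $current_url?: string;
    $browser?: string;
    $os?: string;
  };
}

interface UmamiPageview {
  websiteId: string;
  urlPath: string;
  browser?: string;
  os?: string;
  createdAt: Date;
}

function toUmamiPageview(ev: MixpanelEvent, websiteId: string): UmamiPageview {
  const url = ev.properties.$current_url ?? "/";
  return {
    websiteId,
    // keep only the path, dropping the origin
    urlPath: url.replace(/^https?:\/\/[^/]+/, "") || "/",
    // stand-in for Umami's browser normalization logic
    browser: ev.properties.$browser?.toLowerCase(),
    os: ev.properties.$os,
    createdAt: new Date(ev.properties.time * 1000),
  };
}
```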

To import, I ditched SQL and instead loaded Prisma from the Umami directory and wrote an import using that. This took another couple of hours, but worked. However, I did hit a case where I had formatted something incorrectly, and of course it wasn't easy to revert without going into SQL. Obviously not an Umami issue, just a note for anyone who tries the same approach.

My hope is that these new endpoints will eventually accept the Umami models, and use whatever validation is necessary to determine the inputs are correct (e.g. things like OS should be "Windows 10" or "Windows 11" but not "Windows"). Then it would be possible to write a generic mixpanel2umami or google2umami or whatever, to aid people to migrate their data over.

I agree the session stuff is a bit awkward, but a lot of other providers you'd be importing from have their own session identification. In the case you don't have an existing session and/or device identifier, the path I took was to simply hash the event identifier and generate a 1:1 session. I know this is far from ideal, but there's really nothing else you can do in this case (unless sessionId ever becomes optional...). The session data for these events is obviously skewed, but at least your other metrics are generally okay.
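The fallback described above (hashing the event identifier into a 1:1 session) can be sketched as follows. Formatting the digest as a UUID-shaped string is an assumption about what the session id column expects.

```typescript
import { createHash } from "node:crypto";

// Derive a stable, deterministic session id from the source event's
// own identifier, for events with no session/device id of their own.
// UUID-shaped output is an assumption about the session table.
function sessionIdFromEvent(eventId: string): string {
  const hex = createHash("sha256").update(eventId).digest("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}
```

Because the hash is deterministic, re-running the import maps each source event to the same session, so a failed import can be re-run without creating duplicate sessions.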

It's currently already pretty close; it's just that /send extracts a lot from the user agent, and of course there is no user agent on events exported from other systems. The other thing is that createdAt is set automatically at the time you call the endpoint, which would also have to be overridable.

> So, there are really only two tables that you would care about for import/export and that's website_event and session.

Yes, although I also wanted to put a 'from': 'mixpanel' in my event.data so I could easily drop and reload it if I made a mistake, or discern where data came from in the future. This was pretty awkward to do manually because I had to learn about the flattening logic, so if it were possible via the API, things like this could be automatic.
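The "flattening logic" mentioned here can be illustrated with a minimal sketch: turning a nested event.data object into dot-separated key/value pairs. Umami's real event_data storage differs in detail (it also tracks value types), so this is only an illustration of the idea.

```typescript
// Flatten a nested event.data object into dot-separated string pairs.
// Simplified illustration; Umami's actual flattening also records types.
function flatten(
  obj: Record<string, unknown>,
  prefix = ""
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // recurse into nested objects, extending the key path
      Object.assign(out, flatten(value as Record<string, unknown>, path));
    } else {
      out[path] = String(value);
    }
  }
  return out;
}

const flat = flatten({ from: "mixpanel", meta: { batch: 3 } });
```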

Sorry for the text dump, just wanted to share from my migration this weekend 😅.

boehs commented Apr 10, 2024

Instead of this API, I think it's better to just document the table structure and maybe how to use pg_dump and pg_restore. If you're self-hosting, you have direct access to the database.

Edit: If you have a CSV or whatever, a script is really simple to make. I have my own here:

https://github.com/boehs/site/blob/master/utils/cloudflare-analytics-to-umami-lord-save-me.ts

@githubyouser

> As a thought exercise based on your initial response, I actually did this over the weekend because I wanted to use Umami for my personal projects. It took me approximately 6 hours to write something to map my data across to Umami's format (converting my Mixpanel events to both pageview events and also session models). Most of this was spent looking at how to duplicate the logic of Umami (things like normalizing browser, os, that type of stuff).

@whitfin did you document this process in more detail anywhere? I have some data I'd like to import to Umami, and I'm wondering how hard it would be.


whitfin commented May 8, 2024

> Instead of this API, I think it's better to just document the table structure and maybe how to use pg_dump and pg_restore.

@boehs the issue with that is the assumption that everyone who wants analytics for their blog (or whatever) knows and wants to interact with a database. It's much more likely you break something doing it directly in the database instead of through the Umami API (which would have model validation, etc).

Plus, as you point out, anyone can write scripts to move their stuff around (like both me and you did), but the advantage of an API is that someone somewhere could make a cloudflare2umami package and there's no need for everyone to write these scripts anymore.

> If you're self-hosting, you have direct access to the database.

I think this is an assumption that doesn't necessarily hold up. Not every user will be a single person hosting a blog on a $5 server.

> @whitfin did you document this process in more detail anywhere?

@githubyouser, I didn't - although I could consider doing so. The script from @boehs looks like a decent starting point though!

@githubyouser

Thanks @whitfin. I took a look at @boehs's script, but I've never used TypeScript before, so I'm not quite sure how to get started. Basically I just want to import the bare minimum of data (pageviews, etc.) from an old Matomo installation for several fairly low-traffic sites. My Matomo install is getting too expensive to maintain, and I'm starting fresh with a lite installation of Umami, which seems to be a lot less resource-hungry.

I've been experimenting with Umami for a few months now, and I was actually able to migrate a website from one test Umami installation to another (following this guide), so that was a win. But moving from a different analytics install is a much bigger challenge. :(
