faster s2i container start idea #220

Closed
praiskup opened this issue Jan 4, 2018 · 16 comments
@praiskup
Contributor

praiskup commented Jan 4, 2018

The current s2i proposal in #208 has one drawback: even if the user provides the initial database state as an SQL dump to be restored after "initdb", it takes several seconds or more to get the database initialized.

I'm curious whether we could also run initdb during the assemble script, and then copy the data directory somewhere within the image -- IOW, whether we could have the binary data directory baked into the built s2i image. Then we could skip the initdb and just copy the baked directory under $PGDATA, and save a lot of time. WDYT?

@omron93
Contributor

omron93 commented Jan 4, 2018

we could skip the initdb and just copy the baked directory under $PGDATA, and save a lot of time

It would save time. On the other hand, the image could be really big and that would slow deployments down. But it's the user's choice, so why not.

@praiskup
Contributor Author

praiskup commented Jan 4, 2018

You mean the re-deployments, where the data directory is already initialized? Well, in such a case you still have the SQL dump file baked into the image, and that would be large per se. If we baked the binary datadir into the image instead of the SQL file, we would likely get an even smaller image...

@pkubatrh
Member

pkubatrh commented Jan 5, 2018

I might be missing something but is the "sql restore" something we are supporting right now? Or is it a future use case using the new hooks that we could possibly make easier for the users to achieve?

@praiskup
Contributor Author

praiskup commented Jan 5, 2018

Discussion-only topic :-)! I would mark this with the question label, if I could. I'm trying to find a use case of my own for #208.

While developing a Python + PostgreSQL project/app, my use case would be:

  • for development purposes I need to get the app into its initial state many times a day
  • I have the initial-state SQL dump, which I can provide during the s2i build (but it is about 60 MB, and the db restore takes about one minute on my box)
  • the s2i build (assemble) could run initdb, import that ^^ dump (only once), and store the datadir content somewhere (but "drop" the original SQL dump to save space)
  • and when instantiating a container from the image, instead of initdb + dump restore we could simply copy the datadir from the image.

I don't know whether (a) I can do that right now with the supported container, (b) #208 is required, or (c) some other pull request is needed. I think (c) is right, but I'm not sure.
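
The build-time half of this use case can be sketched roughly as follows. This is only an illustration under assumptions, not something the image provides: the function name `bake_datadir`, its arguments, and the choice of gzip compression are all hypothetical, and it assumes `initdb`, `pg_ctl`, and `psql` are on `PATH` and that `psql` can reach the temporary server over the default local socket.

```shell
# Hypothetical build-time helper; not part of the image's scripts.
# Usage: bake_datadir DUMP.sql DATADIR OUTPUT.tar.gz
bake_datadir() {
    dump=$1
    datadir=$2
    tarball=$3

    # Create a fresh cluster and start a temporary server for the import.
    initdb -D "$datadir" || return 1
    pg_ctl -D "$datadir" -w start || return 1

    # Import the dump once; abort on the first SQL error.
    if ! psql -v ON_ERROR_STOP=1 -f "$dump" postgres; then
        pg_ctl -D "$datadir" -w stop
        return 1
    fi

    # Stop the server cleanly so the on-disk state is consistent.
    pg_ctl -D "$datadir" -w stop || return 1

    # Archive the directory contents, then drop the dump to save space.
    tar -czf "$tarball" -C "$datadir" . || return 1
    rm -f "$dump"
}
```

The server is stopped before archiving because a copy of a running cluster's files is not guaranteed to be consistent.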

@omron93
Contributor

omron93 commented Jan 9, 2018

You mean the re-deployments, where the data directory is already initialized? Well, in such a case you still have the SQL dump file baked into the image, and that would be large per se. If we baked the binary datadir into the image instead of the SQL file, we would likely get an even smaller image...

I was thinking mainly about storing data in the image. I don't know the details, but I think the image is transferred over the network several times during the app's lifetime (pushed into the registry, pulled onto each node where the image is run, ...). So the bigger the image is, the slower that process is.

On the other hand, if the data is stored in a persistent network volume, it is transferred over the network anyway. So maybe storing the initial data in the image isn't much slower :-)

Also, OpenShift (Online) restricts the size of the persistent storage used. I haven't found any note about restricting image size, so maybe this is even an advantage :-D

@pkubatrh
Member

pkubatrh commented Jan 9, 2018

In my opinion this does not sound like something the image should be taking care of. It is more like work for OpenShift itself (project backup?).

@praiskup
Contributor Author

praiskup commented Jan 9, 2018

@omron93

I was thinking mainly about storing data in the image. I don't know the details, but I think the image is transferred over the network several times during the app's lifetime (pushed into the registry, pulled onto each node where the image is run, ...). So the bigger the image is, the slower that process is.

If you have a plain-text SQL file with the default data baked into the image, the space requirements are asymptotically equivalent.

Of course, the db scenario might be that you fetch the data from the internet after db initialization, but that's no longer a task for s2i.

@pkubatrh

More like work for Openshift itself (project backup?).

Hms, maybe. Do you have a link?

@pkubatrh
Member

pkubatrh commented Jan 9, 2018

Hms, maybe. Do you have a link?

Nope, I'm not sure if such a feature exists yet. It was just an idea of how it should ideally work.

@pkubatrh
Member

pkubatrh commented Jan 9, 2018

A quick search revealed:
https://docs.openshift.com/container-platform/3.6/admin_guide/backup_restore.html#project-backup

But that does not seem like something we would want (it backs up only the project configuration).

@praiskup
Contributor Author

praiskup commented Jan 9, 2018

A full project snapshot would be nice, but that doesn't help with the use case I described -- because even though I want to have the initial state of the database "baked", the rest of the project goes forward during development...

My thought on this is that we shouldn't support this directly, but it would be nice if we allowed users to implement it themselves (via s2i, once merged)... that is, it should be doable without ugly "workarounds". The run-postgresql script does too much, so maybe a separate command would be needed for this. I'll have a look at this later, and probably hack up some "example" project leveraging this...

@omron93
Contributor

omron93 commented Jan 10, 2018

Of course, the db scenario might be that you fetch the data from the internet after db initialization, but that's no longer a task for s2i.

A task for s2i could be to process the SQL with the right PostgreSQL version and put the database "to the internet" (a volume, ...).

@praiskup
Contributor Author

praiskup commented Jan 10, 2018

@omron93 , can you elaborate on the use case more concretely? I'm not sure I follow.

@omron93
Contributor

omron93 commented Jan 10, 2018

I was thinking mainly about storing data in the image. I don't know the details, but I think the image is transferred over the network several times during the app's lifetime (pushed into the registry, pulled onto each node where the image is run, ...). So the bigger the image is, the slower that process is.

If you have a plain-text SQL file with the default data baked into the image, the space requirements are asymptotically equivalent.

Of course, the db scenario might be that you fetch the data from the internet after db initialization, but that's no longer a task for s2i.

I can imagine this scenario (nothing detailed, only the way I understand your goal):
I guess that (in general) to use database files in binary form, the files have to be created with the same version that will use them.
So an s2i build is run every time:

  1. a new version of the database image is created
  2. the SQL form of the database stored in some git repo is changed

Every build does the following:

  1. configure the database, create users, ... (the customization common to our database images now)
  2. import the initial SQL data -> copy the binary form of the database to some shared location -> and "clean the database" (only the imported data, no configuration)
  3. the s2i build commits this state of the container as a new image

And during start, the image could offer an option to obtain the database files from somewhere (for example, copy /var/lib/postgresql/initdata to /var/lib/postgresql/data). The initial data would be mounted there from the shared location, by Kubernetes for example.

The benefit of using s2i is that it will automatically create the right binary initial database when the image or the SQL data change!

(What is wrong with this is that I think OpenShift doesn't support using persistent volumes during a build... and reusing them in deployments.)

On the other hand /var/lib/postgresql/initdata can be stored in the image after s2i build. Currently I don't have any preference which way to use. The above is only another possible implementation.
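
The start-time copy described above, together with the "same version" constraint, might look roughly like this. It is a sketch under assumptions: `seed_datadir` and its arguments are hypothetical names, and the version check relies on the `PG_VERSION` file that PostgreSQL writes into every data directory and that holds the major version.

```shell
# Hypothetical start-time helper; names are illustrative only.
# Usage: seed_datadir INITDATA DATADIR SERVER_MAJOR_VERSION
seed_datadir() {
    src=$1
    dst=$2
    want=$3

    # Existing data wins: never overwrite an initialized cluster.
    if [ -f "$dst/PG_VERSION" ]; then
        return 0
    fi

    # Nothing to seed from; the caller falls back to a normal initdb.
    if [ ! -f "$src/PG_VERSION" ]; then
        return 1
    fi

    # Binary data only works with the major version that created it.
    have=$(cat "$src/PG_VERSION")
    if [ "$have" != "$want" ]; then
        echo "initdata is for PostgreSQL $have, server is $want" >&2
        return 1
    fi

    # Copy the directory contents, preserving ownership and modes.
    mkdir -p "$dst" && cp -a "$src/." "$dst/"
}
```

A call such as `seed_datadir /var/lib/postgresql/initdata /var/lib/postgresql/data 10` would then either leave existing data alone, seed an empty directory, or fail so the caller can run a normal initialization.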

@pkubatrh
Member

@omron93 so basically instead of the result of an s2i build being just the image, it would be an image plus an initial DB living somewhere, which would get re-initialized every time the image or the input SQL changes?

That seems like an unnecessarily difficult way to achieve an always-initialized database. I would rather go with Pavel's original proposal of baking the data directly into the image, since that would work everywhere without too much hassle.

Generally we could provide the users with some hooks that would be called during the assemble process if present and leave them the freedom to do whatever they need to do.

@praiskup
Contributor Author

praiskup commented Feb 8, 2018

Seems like the idea is pretty complicated; the assemble script is now too trivial
and we don't even start the postgresql server (run-postgresql script) when
assembling.

So to make this happen, we would have to have (a) a way to run initialize_database
through assemble, e.g. through some assemble-done hook, and (b) some postgresql-preinit hook. The workflow would be:

  • post-assemble hook: initdb -> import data (from some hook) -> tar cJf /baked-data.tar.xz /var/lib/pgsql/data/userdata
  • the pre-init hook (called right after generate_postgresql_config) checks that userdata is not initialized, and if the tarball exists, it extracts it.

So can we vote whether this makes sense? (likes/dislikes, I could then follow up with PR adding the hook support, and preparing example leveraging this feature)
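
For illustration, the pre-init half of that workflow could be sketched like this. The function name and arguments are hypothetical, and the sketch uses gzip (`tar -xzf`) instead of the xz tarball named above purely to avoid depending on `xz`; it treats the presence of `PG_VERSION` as the "already initialized" test, which is an assumption about how the hook would detect that state.

```shell
# Hypothetical pre-init hook body; not the hook API from any pull request.
# Usage: restore_baked_datadir TARBALL DATADIR
# Returns 0 when DATADIR is ready, non-zero when the caller should run initdb.
restore_baked_datadir() {
    tarball=$1
    datadir=$2

    # An initialized cluster contains PG_VERSION; never overwrite it.
    if [ -f "$datadir/PG_VERSION" ]; then
        return 0
    fi

    # No baked data in the image: let the regular initdb path run.
    if [ ! -f "$tarball" ]; then
        return 1
    fi

    # Unpack the baked cluster into the empty data directory.
    mkdir -p "$datadir" && tar -xzf "$tarball" -C "$datadir"
}
```

Checking the data directory first keeps the hook idempotent, so re-deployments onto an existing volume never lose data.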

praiskup added a commit to praiskup/postgresql-container that referenced this issue Mar 11, 2018
- run 'initdb' from 'assemble', and bake the datadir into image
- install hook which extracts the tarball when data is not
  initialized

Fixes: sclorg#220
@praiskup
Contributor Author

See #251 with WIP example.

praiskup added a commit to praiskup/postgresql-container that referenced this issue Mar 13, 2018
postgresql-pre-start hook example and test

- run 'initdb' from 'assemble', and bake the datadir into image
- install hook which extracts the tarball when data is not
  initialized

Fixes: sclorg#220
praiskup added a commit to praiskup/postgresql-container that referenced this issue Mar 19, 2018
postgresql-pre-start hook example and test

- run 'initdb' from 'assemble', and bake the datadir into image
- install hook which extracts the tarball when data is not
  initialized

Fixes: sclorg#220