Add cache command #3968

Closed
wants to merge 26 commits into from

tdsmith
Contributor

@tdsmith tdsmith commented Sep 14, 2016

Adds an option that prints the configured location of the cache directory and
exits. This is useful if users want to know where the cache is so that they can
delete stale or poisoned wheels.


Member

@dstufft dstufft left a comment

I don't like the use of a flag for this. I think it'd be better off being a command of some sort. I'm not sure what exactly that command would look like though. Off the top of my head I can think of two different ways of going with this.

  1. We make a dedicated pip cache command that allows people to interact with the cache to do things like see the location, clear the cache, etc.
  2. We make a dedicated pip config command that allows people to set and read configuration values, so that they can see what the cache dir, and other settings would be and also modify them via a command.

@tdsmith
Contributor Author

tdsmith commented Sep 15, 2016

I kind of like the idea of pip cache list, pip cache erase, pip cache location...

@dstufft
Member

dstufft commented Sep 15, 2016

Yea that could be real good. Wouldn't need to start off with everything either. I think pip cache erase and pip cache location could be two real good ones. Or maybe pip cache erase and pip cache info which shows things like location and sizes and such.

@dstufft
Member

dstufft commented Sep 15, 2016

Related issues: #2819, #3734, #2851

@pfmoore
Member

pfmoore commented Sep 15, 2016

I kind of like the idea of pip cache list, pip cache erase, pip cache location

+1 from me, too. That sounds like the right way to expose this type of functionality.

@tdsmith changed the title from "Add --print-cache-dir option" to "Add cache command" Sep 16, 2016
@tdsmith
Contributor Author

tdsmith commented Sep 16, 2016

Added a cache location command.

I see this reinvents part of #3734; any feelings about what to do there?

@pfmoore
Member

pfmoore commented Sep 16, 2016

There's a PEP8 issue, and I've just restarted the python 3.6 nightly run, as the failures there seem unrelated. You should probably also have a docs page for the new command. But otherwise it's looking OK to me.

Regarding #3734 let's see what @xavfernandez thinks. I'm inclined to think that this PR is a nice quick win, and the extra functionality from #3734 could be added later if needed - but if #3734 is close to merging, it might be easier to get the whole lot at once.

Member

@dstufft dstufft left a comment

This is looking a lot better! Two small comments and one major one.



class CacheCommand(Command):
    """\
Member

What's the trailing backslash here for?

Contributor Author

Better auto-dedenting behavior; IIRC if the first line of the block quote doesn't start with whitespace the rest of the docstring won't be dedented.
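
The dedenting behaviour described here can be checked with the standard library's `textwrap.dedent`. This is a minimal sketch with a made-up docstring; whether pip's help formatter relies on exactly this call is an assumption, not something this thread confirms.

```python
import textwrap

# Hypothetical docstring text, for illustration only.
# The text starts right after the opening quotes, so the first line has no
# leading whitespace and the common indentation of all lines is empty.
without_backslash = """Show cache info.
    Subcommands:
        location
    """

# The backslash swallows the newline after the opening quotes, so every
# line, including the first, carries the same four-space indentation.
with_backslash = """\
    Show cache info.
    Subcommands:
        location
    """

print(repr(textwrap.dedent(without_backslash)))
print(repr(textwrap.dedent(with_backslash)))
```

In the first case the indented lines keep their leading spaces; in the second the shared four-space margin is stripped from every line.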

"ERROR: Please provide one of these subcommands: %s" %
", ".join(self.actions))
return ERROR
method = self.__getattribute__("action_%s" % args[0])
Member

This would be better written as method = getattr(self, "action_%s" % args[0]).
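
A rough sketch of the suggested dispatch, assuming a simplified stand-in for the command class (the real `Command` base class, `ERROR` constant, and logging are left out, and `options` is a plain dict here purely for illustration):

```python
class CacheCommand:
    # Hypothetical stand-in for the PR's class; only the dispatch is shown.
    actions = ["location", "info"]

    def run(self, options, args):
        if not args or args[0] not in self.actions:
            print("ERROR: Please provide one of these subcommands: %s"
                  % ", ".join(self.actions))
            return 1
        # getattr() is the idiomatic spelling of dynamic attribute lookup.
        # Calling __getattribute__ directly would skip any __getattr__
        # fallback defined on the class.
        method = getattr(self, "action_%s" % args[0])
        return method(options, args[1:])

    def action_location(self, options, args):
        print(options["cache_dir"])
        return 0

    def action_info(self, options, args):
        print("info for", options["cache_dir"])
        return 0
```

Checking `args[0]` against `self.actions` before the lookup keeps arbitrary attribute names from being resolved.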

Contributor Author

👍

        return method(options, args[1:])

    def action_location(self, options, args):
        logger.info(os.path.join(options.cache_dir, "wheels"))
Member

@dstufft dstufft Sep 16, 2016

We probably want this to be a bit more structured yea? I imagine something that outputs something more like:

Cache Locations:
    HTTP   = /Users/dstufft/Library/Caches/pip/http
    Wheels = /Users/dstufft/Library/Caches/pip/wheels

Or something like that? Maybe it'd be better to switch pip cache location for pip cache info and have something like:

HTTP Cache:
    Location = /Users/dstufft/Library/Caches/pip/http
    Size     = 284M
    Items    = 1045

Wheel Cache:
    Location = /Users/dstufft/Library/Caches/pip/wheels
    Size     = 95M
    Items    = 391

I guess the key point here is that we currently have at least two caches, might add more in the future and arbitrarily picking one of them to return as the location feels like the wrong thing to do so we should at least support something that returns them all. Adding more information like Size and number of items might be interesting too to give people an idea of the scope of their caches.
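
One way the per-cache summary could be gathered is sketched below. The function names are hypothetical, and the `http` and `wheels` subdirectory names are assumed from the paths quoted in this comment rather than taken from pip's code.

```python
import os


def cache_stats(path):
    """Return (number_of_files, total_bytes) for everything under path."""
    n_files = 0
    n_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full):  # skips broken symlinks
                n_files += 1
                n_bytes += os.path.getsize(full)
    return n_files, n_bytes


def cache_info(cache_dir, subdirs=("http", "wheels")):
    # Assumption: each cache lives in its own subdirectory of cache_dir.
    # Reporting every cache avoids arbitrarily picking one of them.
    return {name: cache_stats(os.path.join(cache_dir, name))
            for name in subdirs}
```

A missing subdirectory simply reports zero files and zero bytes, because `os.walk` yields nothing for a path that does not exist.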

Contributor Author

@tdsmith tdsmith Sep 16, 2016

I'm hoping this is the output from cache info. :) I'd like $(pip cache location) to be possible. Refraining from adding /wheels is probably wise.

Member

What are you hoping to do with $(pip cache location)?

Member

I ask because I'm trying to think of a use case where you don't need to be able to explicitly ask for http versus wheel cache OR bake in an assumption about the names of those caches relative to the overall cache directory that isn't rm -rf $(pip cache location) or du -h -d 0 $(pip cache location).

Contributor Author

I see your point; I think it follows that any verb that operates on a cache will have to understand which cache it wants to operate on, right?

Is it worth exposing anything about the http cache? It's less mysterious than the wheel cache; I haven't seen people having trouble with it. Should cache just deal with the wheel cache?

Member

From the name I'd expect to be able to deal with any of pip's caches with pip cache. Maybe we could add a --type [wheel|http] with the default being wheel. That's explicit, and clearly documents that you get wheel by default, but maybe it's a bit over-engineered?

Member

Yea I definitely think everything should encompass all of the caches we have. The reason I was asking what your use case was for that command is trying to determine if it makes sense to have pip cache location return the top level cache directory, or if it made sense to have flags to return specific cache directories, or if it made sense to just omit the command altogether and roll whatever specific functionality you were looking at using that command for into its own command (e.g. if you were looking to do rm -rf $(pip cache location) then just add pip cache clear).

However, I think the way to do it, assuming we can't just roll whatever functionality up, is maybe by default have it return the top level cache directory, and then add flags for --wheel and --http (or --type [wheel|http]) that let you scope that down to a specific type instead of the entire thing. Could easily apply this pattern to other commands like pip cache info (only show info for wheel or http) and pip cache clear (only clear one type).
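
The "top level by default, narrower with a flag" idea might look roughly like this. It is a sketch using the standard library's `optparse`; the option spelling, the `CACHE_SUBDIRS` mapping, and the helper names are all assumptions made for illustration, not pip's actual interface.

```python
import os
from optparse import OptionParser

# Hypothetical mapping from cache type to subdirectory name, based on the
# paths quoted earlier in this review.
CACHE_SUBDIRS = {"http": "http", "wheel": "wheels"}


def build_parser():
    parser = OptionParser(prog="pip cache")
    parser.add_option(
        "--type", dest="cache_type", type="choice",
        choices=sorted(CACHE_SUBDIRS), default=None,
        help="Restrict the command to one cache (default: all caches).",
    )
    return parser


def cache_location(cache_dir, cache_type=None):
    """Top-level cache directory by default, one cache when a type is given."""
    if cache_type is None:
        return cache_dir
    return os.path.join(cache_dir, CACHE_SUBDIRS[cache_type])
```

The same `cache_type` value could then be passed to other subcommands such as info or clear to scope them in the same way.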

Contributor Author

Does it make sense to make this an option on the cache command or on each subcommand? Similarly, I guess, does a pip cache list command make sense for the http cache?

@tdsmith
Contributor Author

tdsmith commented Sep 17, 2016

I added an implementation of an info command; let me know if you'd rather handle that in a separate PR.

Output looks like:

(tmp-f5f9d55100f33573) tim@rocketman:pip (show-cache-dir)$ pip cache --type=all info
HTTP cache info:
   Location: /Users/tim/Library/Caches/pip/http
   Number of files: 1155
   Size: 677.3 MiB

Wheel cache info:
   Location: /Users/tim/Library/Caches/pip/wheels
   Number of files: 281
   Size: 55.6 MiB

    return result


def human_size(n_bytes):
Contributor Author

Oops, duplicates pip.utils.format_size; should use that instead.

Contributor Author

fixed

@dstufft dstufft closed this Sep 18, 2016
@dstufft dstufft reopened this Sep 18, 2016
@Ivoz
Contributor

Ivoz commented Sep 19, 2016

A) info and location are practically duplicates of each other. I understand the argument would be "I want to be able to use $(pip cache location) without the hassle of filtering the output of cache info." Firstly, if pip cache clear is indeed going to be implemented, that's mostly a non-argument in the first place; secondly, if the need to clear the cache arises so often that this slight convenience becomes a real time-saver, then I'd suggest that pip's own cache management has other behaviour that needs fixing first (so that it doesn't cause the user such problems in the first place), rather than making the cache easy to delete with command-line one-liners. Therefore I think only the equivalent of pip cache info needs to be implemented here.

B) I'd suggest that Files: <number> is abundantly clear as to the meaning of the information, so Number of... seems verbose to me.

C) Given my first point, I'd suggest that rather than implementing a whole command, this could fit neatly under pip's existing informational command, pip show, e.g. pip show --cache. If we had clearly decided that other cache commands are needed, beyond one to show the cache and one to delete it, such that grouping them under a subcommand would definitely be necessary, I'd concede this point.

I make most of these comments in the interest of keeping feature creep out of pip's existing UI. Not that I never want to see anything added, but there should always be very solid reasons when it is.

@tdsmith
Contributor Author

tdsmith commented Sep 19, 2016

wrt pip cache location, the other use I can think of is commands of the form find $(pip cache location)....

Sort of in that vein, I think pip cache list (list cached artifacts) and pip cache clear should exist; pip cache rm seems plausible.

@dstufft
Member

dstufft commented Sep 19, 2016

One thing to be careful of is anything that bakes in an assumption about how our caches are laid out internally.

For example, right now you can kind of figure out what's in the wheel cache just by doing an exhaustive search of all of the filenames. However, that is not currently true for the HTTP cache, where filenames are sha256 (IIRC?) hashes of the URLs they are caching. For these, it's not currently possible to see what items are cached, only to ask "is this particular URL cached?". While adding a dedicated location command doesn't on its own tell people that they can depend on the structure of our cache, it feels like it's paving a cowpath towards people doing just that (outside of cases like rm -rf $(pip cache location)).

Given that, I think that it probably makes sense to not implement pip cache location unless we can come up with a use case that isn't going to mandate baking in assumptions about the "schema" of our caches themselves and instead bake that logic into pip itself where we can update it as/when we change the cache schema.
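
The point about hashed filenames can be illustrated with a small sketch. It assumes, as the comment itself only tentatively recalls, that an entry's filename is the SHA-256 hex digest of its URL; the real hash and directory layout may well differ, and all function names here are hypothetical.

```python
import hashlib
import os


def cache_path(cache_root, url):
    # Assumption: the filename is the hex SHA-256 digest of the URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_root, digest)


def is_cached(cache_root, url):
    """Answerable: does an entry for this exact URL exist?"""
    return os.path.exists(cache_path(cache_root, url))


def list_cached_urls(cache_root):
    """Not answerable from filenames alone: a digest cannot be reversed."""
    raise NotImplementedError("filenames are one-way hashes of URLs")
```

Any external tool that wanted to enumerate HTTP cache contents would therefore need knowledge of the cache's internal format, which is exactly the coupling this comment warns against.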

I think that pip cache [info|clear|rm] seem reasonable to me (and if we do that, we probably don't need --type flags, I think? It seems OK to just have all three of those commands operate on everything). I'm less sure about pip cache list; spewing hundreds or thousands of lines of text onto the screen doesn't seem as useful to me as the other commands. (To be clear, I don't think you need to implement all of the above in this PR, but if you want to you can!)

Does that all sound reasonable?

@pfmoore
Member

pfmoore commented Sep 19, 2016

On reflection, I also think we should strongly enforce YAGNI here. ISTM that we've ended up at a point where we're implementing a feature without exactly remembering why we wanted it. As @dstufft said, the internals of the cache are undocumented and private, so we should be very clear on intended use cases before implementing features, as otherwise there's a risk that we open up the internals in a way that means people end up relying on them. Sorry if this feels like backing out of what seemed like an almost-complete PR, but sometimes it takes getting to this sort of stage before the implications are really clear.

For example, re-reading the original use case, it was "This is useful if users want to know where the cache is so that they can delete stale or poisoned wheels." Is the best way to clear out invalid wheels from the cache to let people locate the cache and go hunting for the relevant wheels themselves? If there's a need to delete wheels would an explicit command not be better? Something like pip cache clear-wheel <requirement>? Then we can keep the cache structure private.

There's also a reasonable use case for "blow away the whole cache" - something like pip cache clear. Maybe we could allow people to choose HTTP, wheel or both, but I don't know why anyone would need to specify to that level.

But what other use case is there for "tell me where the cache is"? Or indeed for any other cache related commands?

@dstufft
Member

dstufft commented Sep 20, 2016

Looking good!

I'm still on the fence about the cache list command. I grok that its goal is to enable figuring out what arguments need to be passed to cache rm, but I wonder if it doesn't just make sense to make the argument to cache rm be a project name and be done with it. That feels like a simpler UI for end users to me (at the cost of maybe wider than needed cache removal), especially if we ensure that pip cache rm <thing that isn't cached> works successfully (making rm essentially "ensure this isn't cached"). This also seems like it'll be less error prone, as end users don't need to worry about interpreting wheel filenames to figure out which wheel they need to remove.
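
A minimal sketch of "rm takes a project name and always succeeds" follows. It assumes wheels are stored as ordinary `.whl` files somewhere under a wheel cache directory and matches on the distribution field of the wheel filename (PEP 427) after PEP 503 style normalization; the function names are hypothetical.

```python
import os
import re


def normalize(name):
    # PEP 503 style name normalization.
    return re.sub(r"[-_.]+", "-", name).lower()


def remove_project(wheel_cache_dir, project):
    """Delete every cached wheel for project; return how many were removed.

    Removing a project that is not cached is not an error and returns 0,
    which gives the "ensure this isn't cached" semantics.
    """
    wanted = normalize(project)
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(wheel_cache_dir):
        for filename in filenames:
            if not filename.endswith(".whl"):
                continue
            # The distribution name is the first dash-separated field of a
            # wheel filename.
            dist = filename.split("-", 1)[0]
            if normalize(dist) == wanted:
                os.remove(os.path.join(dirpath, filename))
                removed += 1
    return removed
```

Because every version of the project is removed, the worst case is that the next install has to rebuild a wheel once.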

I think it makes sense for the list and rm commands to not operate on the HTTP cache since that should (ideally) be entirely transparent to end users whereas (as you identified) the wheel cache is a bit different as there is a chance of it going stale if the state of the machine has changed.

I kind of don't like the --type flag though. Would it be terrible if we just made info show both cache types always, and just made list and rm only operate on the wheel cache?

@RonnyPfannschmidt
Contributor

what about using cache discard instead of cache rm as name (similar to the difference between set.remove and set.discard)
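
For readers unfamiliar with the analogy, the difference between the two built-in set methods is shown below; the set contents are made up.

```python
cached = {"requests", "numpy"}

# discard(): removing an absent element silently does nothing.
cached.discard("flask")

# remove(): removing an absent element raises KeyError.
try:
    cached.remove("flask")
    raised = False
except KeyError:
    raised = True

print(raised, sorted(cached))
```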

@tdsmith
Contributor Author

tdsmith commented Sep 21, 2016

I wonder if it doesn't just make sense to just make the argument to cache rm be a project name and be done with it

This might be a bridge too far, but what if it's treated like a glob if it contains a metacharacter or ends in .whl and otherwise it's treated like a project name?
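
The heuristic floated here could be sketched as follows. Which characters count as glob metacharacters and how project names are compared are assumptions made for this illustration, and the helper names are hypothetical.

```python
import fnmatch

# Assumption: these are the characters that mark an argument as a glob.
GLOB_CHARS = set("*?[")


def looks_like_pattern(arg):
    """True if arg should be treated as a wheel filename glob."""
    return arg.endswith(".whl") or any(ch in GLOB_CHARS for ch in arg)


def select_wheels(arg, filenames):
    if looks_like_pattern(arg):
        return fnmatch.filter(filenames, arg)
    # Otherwise treat arg as a project name. This simplified comparison only
    # lowercases and maps "-" to "_"; real name normalization is more involved.
    prefix = arg.lower().replace("-", "_") + "-"
    return [f for f in filenames if f.lower().startswith(prefix)]
```

The two branches interpret the same argument very differently, which is part of what makes the behaviour feel magical.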

I kind of don't like the --type flag though. Would it be terrible if we just made info show both cache types always, and just made list and rm only operate on the wheel cache?

I wouldn't be mad. Thoughts about purge?

@dstufft
Member

dstufft commented Sep 21, 2016

I wonder if it doesn't just make sense to just make the argument to cache rm be a project name and be done with it

This might be a bridge too far, but what if it's treated like a glob if it contains a metacharacter or ends in .whl and otherwise it's treated like a project name?

My thinking about restricting it to just project names also means it's not tied to wheel caching either. If we rejigger the HTTP cache to tag cached responses with a project name, we can make it just work with that. If we add caching of VCS repositories, we can make it just work with that as well. By keeping it higher level we eliminate some amounts of tying it to a specific implementation. Of course the flipside of that is with your proposed change, that is still somewhat true, just that glob characters/.whl would restrict it to just deleting stuff from the wheel cache in my hypothetical above.

Overall, it feels somewhat magical to do that and I'm trying to think of a use case where it is vitally important to just delete a specific wheel from the cache and no others. Worst case scenario it seems like it'd just trigger the slow path again (once) but also probably save some space by cleaning up old wheels for versions you're no longer installing?

I kind of don't like the --type flag though. Would it be terrible if we just made info show both cache types always, and just made list and rm only operate on the wheel cache?

I wouldn't be mad. Thoughts about purge?

I'd make it act like info, always operate on both caches. If someone comes along and really wants to only purge one or the other we can look at adding it back then.

As an aside, I don't really like the discard verb that @RonnyPfannschmidt suggested (idk why, just feels harder) but maybe it'd make sense to spell out the entire word "remove", so it'd be pip cache remove <whatever>.

@tdsmith
Contributor Author

tdsmith commented Sep 21, 2016

If you want to extend this to cover arbitrary cache types I'm wondering if it makes sense to retain --type since the different caches aren't necessarily related. I like it for purge too since the reasons you'd want to delete the http cache (it got too big) are different from the reasons you'd want to delete the wheel cache (something was stale).

I almost wonder if --type should be mandatory instead of defaulting to wheel, in that case!?

remove sounds fine!

@xavfernandez
Member

Regarding #3734 let's see what @xavfernandez thinks. I'm inclined to think that this PR is a nice quick win, and the extra functionality from #3734 could be added later if needed - but if #3734 is close to merging, it might be easier to get the whole lot at once.

Hello, I was on holiday :)
Regarding #3734, it was on hold since it did not seem to attract any enthusiasm from anyone and I wouldn't want to add unused commands to pip. So I'm glad the idea has gained traction but disappointed that it was reimplemented from scratch (but I guess that's the fate of pip with its 60+ open PRs).

Note that https://github.com/pypa/pip/pull/3734/files#diff-2695f32c4432acd141c3dbe7e7e3a6b0R803 was also added to track the origin of the wheel cache.

@BrownTruck
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label May 14, 2017
@pradyunsg
Member

@tdsmith @xavfernandez Is one of you going to look into this in the near future?

@xavfernandez
Member

@pradyunsg Well, seeing your comment, I rebased #3734, which still works fine but contains less functionality than @tdsmith's PR.

I won't have much time to work on it though.

@pradyunsg
Member

@tdsmith May I take this forward?

@pradyunsg
Member

Closing in favour of #4685.

@pradyunsg pradyunsg closed this Aug 21, 2017
@duckinator duckinator mentioned this pull request Apr 8, 2019
10 tasks
@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 2, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 2, 2019
@pradyunsg pradyunsg removed the needs rebase or merge PR has conflicts with current master label Apr 2, 2021
Labels
auto-locked Outdated issues that have been locked by automation
8 participants