CLN: clean up data.py #4002

Merged
merged 1 commit into from Jun 28, 2013
Conversation

@cpcloud (Member) commented Jun 23, 2013

closes #4001
closes #3982
closes #4028

@ghost ghost assigned cpcloud Jun 23, 2013
@@ -107,12 +107,13 @@ def get_quote_yahoo(symbols):
request = str.join('', codes.values()) # code request string
header = codes.keys()

data = dict(zip(codes.keys(), [[] for i in range(len(codes))]))
Contributor:

would it make more sense to change this to a defaultdict?

Member Author:

yep will do, does py26 have defaultdict?

Contributor:

@cpcloud came in with 2.5, so yes :)

Member Author:

sweet i like defaultdict.
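The change being discussed, sketched outside of pandas (the `codes` mapping here is an abbreviated stand-in for the real field map in `get_quote_yahoo`):

```python
from collections import defaultdict

# abbreviated stand-in for the Yahoo field-code map in get_quote_yahoo
codes = {'symbol': 's', 'last': 'l1', 'time': 't1'}

# before: pre-build an empty list for every key up front
data = dict(zip(codes.keys(), [[] for i in range(len(codes))]))

# after: equivalent, but each list is created lazily on first access
data = defaultdict(list)
data['symbol'].append('AAPL')
```

`defaultdict(list)` also removes the need to know the key set in advance, which is why it reads better than the `dict(zip(...))` construction.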

@jreback (Contributor) commented Jun 23, 2013

everything still work? (not sure how many tests we actually have on io.data)

can u squash the num of commits down a bit?

@cpcloud (Member Author) commented Jun 23, 2013

@jreback yes everything still works, but not quite finished yet...will squash when i check off the list items

@cpcloud (Member Author) commented Jun 23, 2013

the fact that everything still works after all these changes makes me think that there is potentially a lot of untested code in data.py...

@jtratner (Contributor) commented

@cpcloud I agree with you on that. Do you know why the main entry point is named like a class (DataReader) but isn't really a class? It's confusing because the functions seem really similar and a good candidate for abstracting into a class definition. Not even necessarily because it would reduce code duplication, but because it could make the file easier to follow (i.e., a workflow described as get_data -> clean_data -> package_data [into appropriate type, like Series/Panel]) and would allow us to do a better job in standardizing error handling and making it easier to add in new data sources (or allow people to make plugins or something).
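A rough sketch of the kind of abstraction being floated here (all names hypothetical, not actual pandas API): a base class with the get_data -> clean_data -> package_data workflow, where each data source overrides only the steps it needs.

```python
class DataSource(object):
    """Hypothetical base class: fetch -> clean -> package pipeline."""

    def get_data(self, symbol):
        # fetch raw records for `symbol` from the remote source
        raise NotImplementedError

    def clean_data(self, raw):
        # normalize the raw records; default is a no-op
        return raw

    def package_data(self, cleaned):
        # wrap into the appropriate return type (e.g. Series/Panel)
        return cleaned

    def read(self, symbol):
        # the single entry point, replacing the DataReader function
        return self.package_data(self.clean_data(self.get_data(symbol)))


class EchoSource(DataSource):
    """Toy subclass for illustration only."""

    def get_data(self, symbol):
        return [symbol.lower()]

    def clean_data(self, raw):
        return [s.upper() for s in raw]
```

With this shape, standardized error handling and timeouts could live on the base class, and a new source (or third-party plugin) only implements the hooks.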

@jtratner (Contributor) commented

Would it make sense to set a default timeout on these requests? (you can pass a timeout keyword to urlopen). Helps the test suite and gives more flexibility to users.

@cpcloud (Member Author) commented Jun 24, 2013

possibly...somewhat related is that there's a retry_count all over the place which i think may be unnecessary and might overlap a bit with timeout...timeout might be nice although i'm not sure when you'd want to use it. maybe u have some specific use cases in mind?


lines = urllib2.urlopen(urlStr).readlines()
with closing(urlopen(url_str)) as url:
Contributor:

@cpcloud you could use timeout here...just so you can explicitly set when you want to give up on a server responding...
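The pattern under discussion might look like this (a sketch using the Python 3 spelling of `urlopen`; the `read_lines` helper and the 5-second default are illustrative, not pandas code):

```python
from contextlib import closing
from urllib.request import urlopen  # urllib2.urlopen on Python 2


def read_lines(url_str, timeout=5.0):
    # `timeout` (in seconds) bounds how long urlopen waits for the
    # server before raising, instead of hanging indefinitely
    with closing(urlopen(url_str, timeout=timeout)) as resp:
        return resp.readlines()
```

`closing` guarantees the response is closed even if `readlines` raises, and the timeout turns a hung server into a catchable exception.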

Contributor:

basically every line like this.

Member Author:

i'm not sure that timing out is an issue in these cases...but i suppose that a timeout parameter might be useful. i'm not sure about it though

Contributor:

Well, weren't you saying that the test_google method was hanging? Was it because of iterations? Might help it...but yeah prob not a big deal.

Member Author:

i'm not sure what the reason is. i'll check and let u know

@jreback jreback mentioned this pull request Jun 24, 2013
13 tasks
@cpcloud cpcloud closed this Jun 27, 2013
@cpcloud cpcloud reopened this Jun 28, 2013
@cpcloud (Member Author) commented Jun 28, 2013

@jreback on this clean up i have the failing tests skip if they raise an AttributeError. i feel a little weird about that...how should i handle this error? it seems hard to predict, since lxml will return but then not have any html...how to test for that? could return nan if that happens, or raise...

@jreback (Contributor) commented Jun 28, 2013

I would have an exception fail them. I guess you can't really 'validate' these consistently. I think these return a list, so a len(0) list is valid then; only complete the rest of the test if you have len > 0. Maybe I would print a warning (in the test) if this is the case, so that when running in the command line, we know this has 'failed'.

AssertionWarning? a little bit cheesy...but what can you do
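jreback's suggestion could be sketched like this (`check_scrape` and the `'strike'` field are hypothetical stand-ins for the real test helpers):

```python
import warnings


def check_scrape(fetch):
    """Run detailed assertions only when the scrape returned rows;
    emit a warning rather than a hard failure when it came back empty."""
    rows = fetch()
    if len(rows) == 0:
        warnings.warn("scrape returned no rows; skipping detailed checks")
        return False
    # hypothetical field check standing in for the real assertions
    assert all('strike' in row for row in rows)
    return True
```

An empty list is treated as "valid but unverifiable", so flaky remote data shows up as a visible warning on the command line instead of a spurious test failure.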

@cpcloud (Member Author) commented Jun 28, 2013

hold on maybe there's something different between using expiry vs passing month/year explicitly. only the warning tests seem to be failing

@cpcloud (Member Author) commented Jun 28, 2013

nope that looks ok

@cpcloud (Member Author) commented Jun 28, 2013

after this passes i'm going to do one more rehash and push then im gonna merge

@cpcloud (Member Author) commented Jun 28, 2013

going to also run tox a few times in a row to see if i can get the warnings to show up

@cpcloud (Member Author) commented Jun 28, 2013

@jreback any objections here?

@cpcloud (Member Author) commented Jun 28, 2013

kind of a big overhaul, but badly needed

@jreback (Contributor) commented Jun 28, 2013

looks good

cpcloud added a commit that referenced this pull request Jun 28, 2013
@cpcloud cpcloud merged commit 3b08632 into pandas-dev:master Jun 28, 2013
@cpcloud (Member Author) commented Jun 28, 2013

round 42 on this...here we go.

@cpcloud cpcloud deleted the data-dot-py-cleanup branch June 28, 2013 23:06
@woodb commented Jun 29, 2013

I picked a hell of a time to decide to make a minor change to data.py! Things are moving quickly 👍 I'll have to check back on #4044 once the dust has settled

@cpcloud (Member Author) commented Jun 29, 2013

sorry man there's just a bunch of stuff that needed to essentially be rewritten and network stuff cleared up (still some issues there)
