Skip to content

lookup_linke_turbidity is slow #368

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
adriesse opened this issue Sep 4, 2017 · 7 comments
Closed

lookup_linke_turbidity is slow #368

adriesse opened this issue Sep 4, 2017 · 7 comments
Milestone

Comments

@adriesse
Copy link
Member

adriesse commented Sep 4, 2017

When using time series with millions of entries, lookup_linke_turbidity becomes a bottleneck.

My first suggestion is to set interp_turbidity=False by default.

My second suggestion is to to replace the fragment:

linke_turbidity = pd.DataFrame(time.month, index=time)
# apply monthly data
linke_turbidity = linke_turbidity.apply(lambda x: g[x[0]-1], axis=1)

with something like this:

months = time.month - 1
linke_turbidity = pd.DataFrame(g[months], index=time)

The interpolation could be sped up by expanding g to 366 values and then indexing by dayofyear.

PS. I know this would make a perfect issue for my first pull request, but I'm afraid I'm not ready for that step! :)

@wholmgren
Copy link
Member

I ran line_profiler on a year of data at 1 minute resolution. The bottlenecks appear to be

  1. loading the file
  2. determining if the year is a leap year

Not sure what to do about loading the file, but the leap year list comprehension can be replaced with the pandas DatetimeIndex.is_leap_year attribute. Before/after...

pvlib 0.5

time = pd.DatetimeIndex(start='20170101', end='20171231', freq='1min')
latitude = 32.2
longitude = -110.9

%load_ext line_profiler
%lprun -f pvlib.clearsky.lookup_linke_turbidity pvlib.clearsky.lookup_linke_turbidity(time, latitude, longitude)
Timer unit: 1e-06 s

Total time: 2.28546 s
File: /Users/holmgren/git_repos/pvlib2/pvlib-python/pvlib/clearsky.py
Function: lookup_linke_turbidity at line 153

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   153                                           def lookup_linke_turbidity(time, latitude, longitude, filepath=None,
   154                                                                      interp_turbidity=True):
[omitted]
   195                                           
   196         1            2      2.0      0.0      try:
   197         1            4      4.0      0.0          import scipy.io
   198                                               except ImportError:
   199                                                   raise ImportError('The Linke turbidity lookup table requires scipy. ' +
   200                                                                     'You can still use clearsky.ineichen if you ' +
   201                                                                     'supply your own turbidities.')
   202                                           
   203         1            1      1.0      0.0      if filepath is None:
   204         1           35     35.0      0.0          pvlib_path = os.path.dirname(os.path.abspath(__file__))
   205         1           10     10.0      0.0          filepath = os.path.join(pvlib_path, 'data', 'LinkeTurbidities.mat')
   206                                           
   207         1      1044696 1044696.0     45.7      mat = scipy.io.loadmat(filepath)
   208         1            3      3.0      0.0      linke_turbidity_table = mat['LinkeTurbidity']
   209                                           
   210                                               latitude_index = (
   211         1           49     49.0      0.0          np.around(_linearly_scale(latitude, 90, -90, 0, 2160))
   212         1            9      9.0      0.0          .astype(np.int64))
   213                                               longitude_index = (
   214         1           17     17.0      0.0          np.around(_linearly_scale(longitude, -180, 180, 0, 4320))
   215         1            3      3.0      0.0          .astype(np.int64))
   216                                           
   217         1            4      4.0      0.0      g = linke_turbidity_table[latitude_index][longitude_index]
   218                                           
   219         1            2      2.0      0.0      if interp_turbidity:
   220                                                   # Data covers 1 year. Assume that data corresponds to the value at the
   221                                                   # middle of each month. This means that we need to add previous Dec and
   222                                                   # next Jan to the array so that the interpolation will work for
   223                                                   # Jan 1 - Jan 15 and Dec 16 - Dec 31.
   224         1           14     14.0      0.0          g2 = np.concatenate([[g[-1]], g, [g[0]]])
   225                                                   # Then we map the month value to the day of year value.
   226         1      1200791 1200791.0     52.5          isleap = [calendar.isleap(t.year) for t in time]
   227         1            4      4.0      0.0          if all(isleap):
   228                                                       days = _calendar_month_middles(2016)  # all years are leap
   229         1         2387   2387.0      0.1          elif not any(isleap):
   230         1           76     76.0      0.0              days = _calendar_month_middles(2015)  # none of the years are leap
   231                                                   else:
   232                                                       days = None  # some of the years are leap years and some are not
   233         1            2      2.0      0.0          if days is None:
   234                                                       # Loop over different years, might be slow for large timeserires
   235                                                       linke_turbidity = pd.Series([
   236                                                           np.interp(t.dayofyear, _calendar_month_middles(t.year), g2)
   237                                                           for t in time
   238                                                       ], index=time)
   239                                                   else:
   240         1        35985  35985.0      1.6              linke_turbidity = pd.Series(np.interp(time.dayofyear, days, g2),
   241         1          144    144.0      0.0                                          index=time)
   242                                               else:
   243                                                   linke_turbidity = pd.DataFrame(time.month, index=time)
   244                                                   # apply monthly data
   245                                                   linke_turbidity = linke_turbidity.apply(lambda x: g[x[0]-1], axis=1)
   246                                           
   247         1         1220   1220.0      0.1      linke_turbidity /= 20.
   248                                           
   249         1            2      2.0      0.0      return linke_turbidity

proposed changes

Timer unit: 1e-06 s

Total time: 1.21595 s
File: <ipython-input-51-3596f8f01cdf>
Function: lookup_linke_turbidity at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def lookup_linke_turbidity(time, latitude, longitude, filepath=None,
     2                                                                      interp_turbidity=True):
   [omitted]
    43                                           
    44         1            2      2.0      0.0      try:
    45         1            5      5.0      0.0          import scipy.io
    46                                               except ImportError:
    47                                                   raise ImportError('The Linke turbidity lookup table requires scipy. ' +
    48                                                                     'You can still use clearsky.ineichen if you ' +
    49                                                                     'supply your own turbidities.')
    50                                           
    51         1            2      2.0      0.0      if filepath is None:
    52                                                   pvlib_path = os.path.dirname(os.path.abspath(__file__))
    53                                                   filepath = os.path.join(pvlib_path, 'data', 'LinkeTurbidities.mat')
    54                                           
    55         1      1108545 1108545.0     91.2      mat = scipy.io.loadmat(filepath)
    56         1            3      3.0      0.0      linke_turbidity_table = mat['LinkeTurbidity']
    57                                           
    58                                               latitude_index = (
    59         1           66     66.0      0.0          np.around(_linearly_scale(latitude, 90, -90, 0, 2160))
    60         1            9      9.0      0.0          .astype(np.int64))
    61                                               longitude_index = (
    62         1           20     20.0      0.0          np.around(_linearly_scale(longitude, -180, 180, 0, 4320))
    63         1            3      3.0      0.0          .astype(np.int64))
    64                                           
    65         1            4      4.0      0.0      g = linke_turbidity_table[latitude_index][longitude_index]
    66                                           
    67         1            1      1.0      0.0      if interp_turbidity:
    68                                                   # Data covers 1 year. Assume that data corresponds to the value at the
    69                                                   # middle of each month. This means that we need to add previous Dec and
    70                                                   # next Jan to the array so that the interpolation will work for
    71                                                   # Jan 1 - Jan 15 and Dec 16 - Dec 31.
    72         1           14     14.0      0.0          g2 = np.concatenate([[g[-1]], g, [g[0]]])
    73                                                   # Then we map the month value to the day of year value.
    74         1        24311  24311.0      2.0          years = time.year
    75         1        35334  35334.0      2.9          isleap = time.is_leap_year
    76         1           17     17.0      0.0          if all(isleap):
    77                                                       days = _calendar_month_middles(2016)  # all years are leap
    78         1        10787  10787.0      0.9          elif not any(isleap):
    79         1           76     76.0      0.0              days = _calendar_month_middles(2015)  # none of the years are leap
    80                                                   else:
    81                                                       days = None  # some of the years are leap years and some are not
    82         1            2      2.0      0.0          if days is None:
    83                                                       # Loop over different years, might be slow for large timeserires
    84                                                       linke_turbidity = pd.Series([
    85                                                           np.interp(t.dayofyear, _calendar_month_middles(y), g2)
    86                                                           for y in years
    87                                                       ], index=time)
    88                                                   else:
    89         1        34883  34883.0      2.9              linke_turbidity = pd.Series(np.interp(time.dayofyear, days, g2),
    90         1          130    130.0      0.0                                          index=time)
    91                                               else:
    92                                                   linke_turbidity = pd.DataFrame(time.month, index=time)
    93                                                   # apply monthly data
    94                                                   linke_turbidity = linke_turbidity.apply(lambda x: g[x[0]-1], axis=1)
    95                                           
    96         1         1729   1729.0      0.1      linke_turbidity /= 20.
    97                                           
    98         1            2      2.0      0.0      return linke_turbidity

@wholmgren
Copy link
Member

Turns out that the .is_leap_year attribute is only available for pandas 0.19 or greater. Our minimum pandas requirement is currently 0.14, so that's a big increase in the minimum requirements. 0.19 is almost a year old, but I have code that's tied to 0.17! So, we might want to implement this with a fallback for older pandas versions.

Anyone interested in taking this on?

@adriesse
Copy link
Member Author

adriesse commented Sep 4, 2017

I must admit that my first suggestion is actually more of an indication of what I think the default behaviour should be, and I think the comment in the code about "might be slow" motivated me to include the suggestion here.

The second suggestion improves the speed for the non-interpolated mode--could you profile that too? I suspect the bottleneck there is the apply. I have about 50M records.

@wholmgren
Copy link
Member

Your interp_turbidity=False suggestion improves the speed by orders of magnitude. It took my machine about 10 seconds to run a year of minute resolution data (500k records) with the original code. It took about 20 seconds to run 10 years of 1 second resolution data (30M records) with your code. Roughly a 30x improvement in time/record.

With all of the proposed changes, the difference in speed between interp_turbidity=False and interp_turbidity=True is fairly insignificant and the function itself is unlikely to be a bottleneck for large analyses. So, the issue of which should be the default needs further discussion. Let's keep this issue focused on the performance and make a new issue for the default behavior. I can see arguments for both options.

@cwhanse
Copy link
Member

cwhanse commented Sep 5, 2017

If it helps, we can speed up the leap year calculation by a factor of 10 by writing our own code that avoids the calendar.isleap call within a list comprehension. I don't know why my inserted code doesn't appear as code.

import calendar
import numpy as np
import pandas as pd
from datetime import datetime

from timeit import timeit

def leap(year):
    return (np.mod(year, 4) == 0) & ((np.mod(year, 100) != 0) | (np.mod(year, 400) == 0))

idx = pd.DatetimeIndex(start=datetime(1999, 1, 1, 0, 0, 0), 
                       end=datetime(2005, 12, 31, 0, 0, 0), freq='1H')

isleap = [calendar.isleap(t.year) for t in idx]

print(timeit(stmt="isleap = [calendar.isleap(t.year) for t in idx]", 
             number=1000, 
             globals={'idx': idx, 'calendar': calendar}))

yrs = np.asarray(idx.year)
alt_isleap = leap(yrs)

print(timeit(stmt="alt_isleap = leap(np.asarray(idx.year))", 
             number=1000, 
             globals={'idx': idx, 'np': np, 'leap': leap}))

@wholmgren
Copy link
Member

Thanks @cwhanse. 3 ticks for code blocks. See here for github code formatting.

I incorporated your suggestion into the _is_leap_year function in the profiled code below. It takes 50% longer to compute the year and then test for leap year than to get the same directly from the pandas attribute when possible, but it's still plenty fast.

I realized that the interpolation code is much worse for the case when the time series contains both leap years and non-leap years. 10 years of 1 minute data took ~180 seconds to complete due to the horrible performance of the list comprehension within the if days is None case. The following function takes only ~1 second for the same input, and it simplifies the lookup_linke_turbidity function.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     5                                           def _interpolate_turbidity(g, time):
     6                                               # Data covers 1 year. Assume that data corresponds to the value at the
     7                                               # middle of each month. This means that we need to add previous Dec and
     8                                               # next Jan to the array so that the interpolation will work for
     9                                               # Jan 1 - Jan 15 and Dec 16 - Dec 31.
    10         1           15     15.0      0.0      g2 = np.concatenate([[g[-1]], g, [g[0]]])
    11                                               # Then we map the month value to the day of year value.
    12                                               
    13                                               # use this for the real code
    14                                           #     try:
    15                                           #         isleap = time.is_leap_year
    16                                           #     except AttributeError:
    17                                           #         year = time.year
    18                                           #         isleap = _is_leap_year(time.year)
    19                                           
    20                                               # for testing only, compute both ways
    21         1       317041 317041.0     27.9      isleap = time.is_leap_year
    22         1       235586 235586.0     20.7      year = time.year
    23         1       206882 206882.0     18.2      isleap = _is_leap_year(year)
    24                                               
    25         1       262155 262155.0     23.1      dayofyear = time.dayofyear
    26         1           86     86.0      0.0      days_leap = _calendar_month_middles(2016)
    27         1           26     26.0      0.0      days_no_leap = _calendar_month_middles(2015)
    28         1        51635  51635.0      4.5      lt_leap = np.interp(dayofyear, days_leap, g2)
    29         1        51929  51929.0      4.6      lt_no_leap = np.interp(dayofyear, days_no_leap, g2)
    30         1        10800  10800.0      1.0      linke_turbidity = np.where(isleap, lt_leap, lt_no_leap)
    31         1          151    151.0      0.0      linke_turbidity = pd.Series(linke_turbidity, index=time)
    32         1            1      1.0      0.0      return linke_turbidity

@wholmgren wholmgren added this to the 0.5.1 milestone Sep 5, 2017
@wholmgren
Copy link
Member

This was closed by #369. API changes can be discussed in a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants