Streamlines API #391


Merged — 139 commits merged into nipy:master on Jun 7, 2016

Conversation

@MarcCote (Contributor) commented Dec 3, 2015

This is take two at providing a general API to manage different streamlines file formats (the code can all be found in nibabel.streamlines). This is related to the somewhat outdated BIAP5. I initially wanted to make two PRs, but as @Garyfallidis pointed out, without the entire PR "it will make it difficult to try this with real data".

This new streamlines API is divided into two main parts:

  1. Representation of streamlines in memory (Tractogram and ArraySequence classes)
  2. Loading and saving streamlines from/to different file formats (TractogramFile class)

We use the term 'Tractogram' to refer to a set of streamlines, usually the output of a tractography algorithm.

[diagram: streamlines_api]

ArraySequence

The ArraySequence class needs to be explained first, as it is at the heart of the streamlines representation. It can be seen as a sort of jagged/ragged array storing a list of 2D arrays that have different lengths but the same number of dimensions (e.g. a list of lists of points). For efficiency, we store a list of streamlines contiguously in memory using what we call an ArraySequence. This was needed because @Garyfallidis and I realized that a list of NumPy arrays has too much overhead in terms of memory when handling >1M streamlines (loading a tractogram of 2.5 GB could take up to 5 GB of RAM!).
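The storage scheme can be sketched with plain NumPy. This is only an illustration of the idea (one contiguous buffer plus per-sequence offsets and lengths), not the actual ArraySequence implementation:

```python
import numpy as np

# Sketch of ArraySequence-style storage: all 2D arrays are
# concatenated row-wise into one contiguous buffer.
arrays = [np.random.rand(10, 3), np.random.rand(3, 3)]

data = np.concatenate(arrays, axis=0)         # one big (13, 3) buffer
lengths = np.array([len(a) for a in arrays])  # rows per sequence
offsets = np.concatenate([[0], np.cumsum(lengths)[:-1]])  # start row of each

def get(i):
    """Recover the i-th 2D array as a view into the shared buffer."""
    return data[offsets[i]:offsets[i] + lengths[i]]

assert np.array_equal(get(0), arrays[0])
assert np.array_equal(get(1), arrays[1])
```

Because each `get(i)` is a view into the shared buffer, the per-array Python object overhead is paid only once, which is what makes the contiguous layout so much cheaper for millions of short streamlines.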

An ArraySequence behaves exactly like a list of 2D ndarrays. That is, it supports appending, extending, indexing (including slicing), iterating, etc., just like a list object. For example, indexing an ArraySequence object returns a 2D ndarray, as one would expect, representing the 3D points of the selected streamline. Note that, while ArraySequence does not support deletion (i.e. no del nor remove), one can efficiently use slicing to achieve the same goal.

An ArraySequence object can be instantiated from any iterable yielding 2D arrays:

import numpy as np
import nibabel as nib
data = [np.random.rand(10, 3), np.random.rand(3, 3)]
seq = nib.streamlines.ArraySequence(data)

Slicing an ArraySequence object returns a new ArraySequence object acting as a view (no memory duplication). Of course, one can use the copy method to "force" the memory duplication.
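The view-versus-copy distinction, and the "delete via slicing" trick, mirror NumPy slicing semantics. A hedged sketch of the same ideas using a plain ndarray rather than an ArraySequence:

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

view = arr[1:3]            # a view: no memory duplication
view[0, 0] = 99
assert arr[1, 0] == 99     # the change is visible through the original

dup = arr[1:3].copy()      # "force" the duplication
dup[0, 0] = -1
assert arr[1, 0] == 99     # original untouched this time

# No del/remove needed: keep everything except index 2 via indexing.
keep = [i for i in range(len(arr)) if i != 2]
smaller = arr[keep]
assert smaller.shape == (3, 3)
```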

Tractogram

This part handles streamlines in memory using the Python class Tractogram. This class provides a mechanism to keep data associated with the points of every streamline (the data_per_point property). It can also keep data associated with each streamline (the data_per_streamline property).

Regarding the reference space that the streamlines of a tractogram live in, we decided to follow the NiBabel convention: the streamlines are assumed to be in RAS+, points are expressed in millimetres, and the coordinate of a voxel (e.g. (0,0,0) or (1,2,3)) refers to the centre of that voxel.

A Tractogram object supports indexing, slicing, and iterating. For example, indexing a Tractogram object returns a TractogramItem object, representing a single streamline with its associated data. Note that, while Tractogram does not support deletion (i.e. no del nor remove), one can efficiently use slicing to achieve the same goal. Again, slicing a Tractogram object returns a new Tractogram object acting as a view (no memory duplication).

Here is a small example of instantiating a Tractogram object:

import numpy as np
import nibabel as nib
data = [np.random.rand(10, 3), np.random.rand(3, 3)]
tractogram = nib.streamlines.Tractogram(streamlines=data)
colors = [(1, 0, 0), (0, 1, 0)]
tractogram.data_per_streamline['colors'] = colors
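Per-point data works analogously, except that each entry must provide one value per point of the corresponding streamline. A minimal sketch of that shape invariant in plain NumPy (this is the constraint the data_per_point mechanism enforces, not the actual validation code):

```python
import numpy as np

streamlines = [np.random.rand(10, 3), np.random.rand(3, 3)]

# One RGB colour per *point*: each entry's length must match its streamline.
colors_per_point = [np.tile([1.0, 0.0, 0.0], (len(s), 1)) for s in streamlines]

for s, c in zip(streamlines, colors_per_point):
    assert len(c) == len(s)  # the per-point invariant
```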

Tractogram File

This part handles the saving and loading of Tractogram objects from/to disk. Right now, as a proof of concept, only the TRK file format is supported (support for TCK is already in progress and will be added in a follow-up PR).

The TractogramFile class is abstract. Each file format should have its own class inheriting TractogramFile (e.g., TrkFile).

Realistic example

Here is a small example of loading a TRK file, computing the length of each streamline, adding that information to the tractogram, and saving it back:

import nibabel as nib
trk = nib.streamlines.load('my_streamlines.trk')

# Compute the length of every streamline.
from dipy.tracking.streamline import length
lengths = length(trk.tractogram.streamlines)

# Attach the lengths as per-streamline data.
trk.tractogram.data_per_streamline['lengths'] = lengths

# Save the enriched tractogram back to disk.
new_trk = nib.streamlines.TrkFile(trk.tractogram, trk.header)
nib.streamlines.save(new_trk, "my_streamlines.trk")

EDIT: Renamed CompactList to ArraySequence

@MarcCote mentioned this pull request Dec 3, 2015
@arokem (Member) commented Dec 3, 2015

As a straw man, I would suggest seeing whether it would be easy to apply these same classes to the PDB file format (#313). At first blush, the Tractogram object might be able to handle all the features of that file format as well. I have not looked at the code yet, but I see no mention of the affine transformation being stored together with the Tractogram object. Would that be a good idea? Maybe it's already there -- looking at the code next.

@MarcCote (Contributor, Author) commented Dec 4, 2015

@arokem: at first glance, PDB seems easy to support with this streamlines API. Regarding the affine transformation, I struggled a lot with that: should we keep it? If so, where do we keep it? In the Tractogram object or in a separate TractogramHeader object?

Presently, the affine transformation is not stored in the Tractogram, and the streamlines are assumed to always be in RAS+mm (i.e. world space according to NIfTI). Not all formats contain information about the affine transformation (e.g. TCK from MRtrix), and sometimes it even depends on the software (e.g. VTK files output by Camino, MITK, etc.). For this reason, I think it makes more sense to keep that information in the TractogramFile object, which is responsible for loading/saving tractograms correctly.
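Mapping a format's native voxel coordinates into the assumed RAS+mm world space is a plain affine application. A hedged sketch (the affine values below are made up for illustration, not taken from any real header):

```python
import numpy as np

# Hypothetical voxel-to-RAS+mm affine: 2 mm isotropic voxels plus translation.
affine = np.array([[2.0, 0.0, 0.0, -90.0],
                   [0.0, 2.0, 0.0, -126.0],
                   [0.0, 0.0, 2.0, -72.0],
                   [0.0, 0.0, 0.0, 1.0]])

def apply_affine(aff, points):
    """Map an (N, 3) array of points through a 4x4 affine."""
    return points @ aff[:3, :3].T + aff[:3, 3]

voxel_points = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
world_points = apply_affine(affine, voxel_points)
assert np.allclose(world_points[0], [-90.0, -126.0, -72.0])
assert np.allclose(world_points[1], [-88.0, -122.0, -66.0])
```

A TractogramFile implementation that knows its format's affine can apply this once at load time, so everything downstream only ever sees RAS+mm points.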

At first, I thought about creating a complex and generic TractogramHeader class to store all information that could be shared across different file formats. In the end, it was too much trouble. So, I finally decided to add an attribute named header (a simple dict) to the TractogramFile class and stored the affine in it, along with other information provided by the file format.

To make it easy to convert from one header's format to another, I've created a static class (some sort of enum) named nib.streamlines.header.Field containing the most common header fields (e.g. Field.NB_STREAMLINES). So, for a given file format, whenever I read some header information that I know has a corresponding Field attribute, I use it as the key in the header dictionary of the TractogramFile object.
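A minimal sketch of such a static field class. Only `NB_STREAMLINES` is named in this thread; the other attribute names below are assumptions for illustration:

```python
class Field:
    """Common header fields shared across streamline file formats (sketch)."""
    NB_STREAMLINES = "nb_streamlines"
    VOXEL_SIZES = "voxel_sizes"    # assumed name, for illustration only
    DIMENSIONS = "dimensions"      # assumed name, for illustration only

# A format reader can then key its header dict by the shared field names,
# so converting between formats is just a dict lookup per common field.
header = {}
header[Field.NB_STREAMLINES] = 500000
header[Field.VOXEL_SIZES] = (2.0, 2.0, 2.0)

assert header["nb_streamlines"] == 500000
```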

I hope it's clearer. Don't hesitate to ask questions, give suggestions, or comment about the code or how the streamlines API works.

@Garyfallidis (Member):

We need this in ASAP to reduce the memory usage of tractographies. @MarcCote, is it hard to increase the coverage so we can get the green light for merging? Let us know if you need some help.

@samuelstjean (Contributor):

Does it really reduce memory that much? Fiber compression could probably help you with that in the meantime, but the possibility to convert from one format to another for interoperability (visualization, extraction, ROI drawing in other tools, for example) would be really useful.

@Garyfallidis (Member):

No, unfortunately, even if you use compression you first need to load the initial streamlines before compressing them. When you do that, you will still block around 2-3X more memory than the actual size of your datasets.

@samuelstjean (Contributor):

They can be loaded only one at a time for compressing; it's probably something with the implementation. That being said, I don't know if the original one and the Python rewrite handle loading in the same way though; might be worth asking François about that.

@MarcCote (Contributor, Author):

@Garyfallidis okay, I found the problem: I wasn't adding the tests folder to NiBabel's packages in setup.py.

@samuelstjean, loading one streamline at a time will indeed reduce the memory, but the overhead @Garyfallidis is talking about will still be present. Having a list of numpy.array objects is just not as memory efficient as concatenating all ndarrays into one big NumPy array (which is what nibabel.streamlines.CompactList does).
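The per-array overhead can be measured directly: every small ndarray carries its own object header on top of its data buffer, so thousands of tiny arrays cost more than one concatenated buffer holding the same payload. A rough sketch:

```python
import sys
import numpy as np

n, rows = 1000, 12  # many short streamlines of 12 points each
many = [np.zeros((rows, 3)) for _ in range(n)]  # list of small arrays
one = np.zeros((n * rows, 3))                   # same payload, one buffer

# getsizeof counts the per-object header (shape, strides, etc.)
# on top of the raw float64 data for arrays that own their memory.
many_bytes = sum(sys.getsizeof(a) for a in many)
one_bytes = sys.getsizeof(one)
assert many_bytes > one_bytes  # same data, extra per-object overhead
```

Note this undercounts the real cost of the list-of-arrays layout, since it ignores the Python list itself and allocator fragmentation, which is where the 2-3X blow-up on multi-gigabyte tractograms comes from.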

@samuelstjean (Contributor):

I meant that if it's that bad, try to compress with the C++ version, then work from there; that should get you going in the meantime. I can compress a whole 2 GB of streamlines down to 300 MB in about 7 minutes, so it's not too bad.

@MarcCote (Contributor, Author):

I'm a bit confused about the errors I got on AppVeyor: MemoryError and PermissionError. @matthew-brett, have you encountered these before? Also, I'm not sure what's causing the UnpicklingError with Python 2.6. I'm simply calling np.savez and np.load as usual.

for ext, cls in nib.streamlines.FORMATS.items():
    with tempfile.NamedTemporaryFile(mode="w+b", suffix=ext) as f:
        with clear_and_catch_warnings(record=True, modules=[trk]) as w:
            nib.streamlines.save_tractogram(complex_tractogram, f.name)
Review comment (Member):
I think the problem here (and with the other permission errors) is that Windows doesn't like you re-opening the NamedTemporaryFile.

What I tend to do is to use the nibabel.tmpdirs.InTemporaryDirectory context manager and then load / save with some filename. When you do this, you have to make sure all objects holding onto the file get deleted though, otherwise Windows (again) may refuse to delete the temporary file...
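The same pattern can be sketched with the standard library's `tempfile.TemporaryDirectory`, which mirrors what `nibabel.tmpdirs.InTemporaryDirectory` does: enter a fresh directory, work with ordinary filenames (avoiding the Windows re-open problem), and clean up on exit:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    prev = os.getcwd()
    os.chdir(tmpdir)  # InTemporaryDirectory also chdirs for you
    try:
        # Plain filename: opened and closed normally, so Windows is happy.
        with open("streamlines.trk", "wb") as f:
            f.write(b"\x00" * 16)
        assert os.path.exists("streamlines.trk")
        # Make sure no object still holds the file open before the
        # directory is removed, or Windows may refuse to delete it.
    finally:
        os.chdir(prev)  # leave before the directory is deleted

assert not os.path.exists(os.path.join(tmpdir, "streamlines.trk"))
```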

Reply (Contributor, Author):

Okay, I'll do that. Thanks.

@matthew-brett (Member):

Not sure what's happening with the unpickling - can you replicate with a virtualenv locally?

@MarcCote (Contributor, Author):

I'm trying to install Python 2.6 and will check the unpickling error.

@MarcCote (Contributor, Author):

Almost there. Only the MemoryError remains, and I don't understand what's happening there.

@matthew-brett (Member):

Yes, not sure what's happening there either. How big are the arrays it's trying to create there? Could the process have run out of memory? Could the shape be unexpected?

@MarcCote (Contributor, Author):

The shape of my 2D arrays is random but should be bounded between 10 and 50 rows with 3 columns, and I have 10 of those arrays:
data = [rng.rand(rng.randint(10, 50), 3) for _ in range(10)]

What's the memory limit on that machine?

@matthew-brett (Member):

I don't know the memory limit, but it can't be so low that it cannot make those arrays.

Would you mind sending me your ssh public key by email, so I can let you trigger builds on the buildbots? It's easier to debug on those machines, I'll explain when I've got you set up.

@MarcCote (Contributor, Author):

I was wrong: I use a buffer of 128 MB when creating a CompactList from a generator, to avoid allocating memory for each item of the generator. I'll try reducing it and check if it works.
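The buffering idea — fill one preallocated buffer from the generator and grow it by a fixed chunk only when full, instead of allocating per item — can be sketched like this. It is a simplification of the CompactList code, and the increment size here is just an illustrative constant:

```python
import numpy as np

GROW_MB = 4  # growth increment; the real code used a larger constant
ROWS_PER_GROW = (GROW_MB * 1024 * 1024) // (3 * 8)  # float64 (x, y, z) rows

def from_generator(gen):
    """Concatenate 2D arrays from `gen` into one buffer, growing in chunks."""
    data = np.empty((0, 3))
    end = 0
    for arr in gen:
        if end + len(arr) > len(data):  # buffer full: grow by one chunk
            grow = max(ROWS_PER_GROW, len(arr))
            data = np.concatenate([data, np.empty((grow, 3))])
        data[end:end + len(arr)] = arr
        end += len(arr)
    return data[:end]  # trim the unused tail

rng = np.random.RandomState(42)
arrays = [rng.rand(rng.randint(10, 50), 3) for _ in range(10)]
result = from_generator(iter(arrays))
assert np.array_equal(result, np.concatenate(arrays))
```

The trade-off discussed below is exactly the `GROW_MB` constant: too large and constrained machines (like the AppVeyor workers) hit MemoryError, too small and you reallocate more often.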

@matthew-brett (Member):

Maybe adapt buffer to system memory somehow?

@MarcCote (Contributor, Author):

That was it: I reduced it to 1 MB and now it works. Right now, the value is hardcoded as a constant. Is there a better way to automate that, or make it more flexible?

@MarcCote (Contributor, Author):

There is the psutil library, which is cross-platform and allows one to get the available memory, but that would add an additional dependency to NiBabel.

@matthew-brett (Member):

Maybe benchmark whether there's a difference between 1M and 128M?

@MarcCote (Contributor, Author):

On my computer (using an SSD), with a buffer of 4 MB it took 7.8 s to read a tractogram file of 1.4 GB containing 500k streamlines. With 128 MB, it took 7.5 s.
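A comparison like this can be timed with a small helper; a sketch only, with a stand-in workload since timings depend entirely on the disk and machine:

```python
import time
import numpy as np

def timed(fn, *args):
    """Return (seconds, result) for one call -- crude, but enough to compare."""
    t0 = time.perf_counter()
    out = fn(*args)
    return time.perf_counter() - t0, out

# Stand-in workload: reducing a large array instead of parsing a TRK file.
data = np.random.rand(1_000_000, 3)
elapsed, total = timed(np.sum, data)
assert elapsed >= 0.0 and np.isfinite(total)
```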

@MarcCote (Contributor, Author):

Now, it is all green. 👍

Is there something else to do? From now on, I think the best test we can have is using it in real-life applications. To be sure, is there anyone else who wants to have a look at this PR or play with it but hasn't had time? @arokem, did you try to add support for the PDB file format? @MrBago, any comments on the new framework?

@matthew-brett (Member):

I guess the speed difference should be smaller when you've got a normal hard drive -- so 1M is probably fine. Any benefit from 10M?

@matthew-brett (Member):

Sorry, I mean 4M is probably fine. I guess the chances are that 10 or 8 won't make any difference.

@MarcCote (Contributor, Author):

10 MB was still causing a MemoryError, but no, it didn't make any difference regarding speed. Just tried 8 MB: no MemoryError, speed relatively the same. I'll go for 8 MB then.

@matthew-brett (Member):

Still getting a memory error with 8M on AppVeyor? Back off to 4M again?

@MarcCote (Contributor, Author):

Guess so.

@Garyfallidis (Member):

This PR seems ready to be merged. @matthew-brett, would you like to have the honour of merging this in the next couple of days if nobody else has something to add? After this is merged, together with @MarcCote we will start making the API changes in DIPY to accommodate the new streamlines API. If we can have this fixed for 0.11 it would be fantastic! Much luv to all and many thx to @MarcCote for the excellent work.

Commits added:
- Test reset of file position when reading trackvis header. This revealed that we were using the wrong flag to `seek`.
- Use `assert_arr_dict_equal` to test whether two headers are equal, rather than a custom testing function.
- Add tests for encoding / decoding numerical values in byte string fields.
- Refactor encoding.
- Update comments / docstrings to note new encoding with numbers as ASCII strings.
- Add docstring for function giving RAS+ affine from trackvis header.
        return lazy_tractogram

    @classmethod
    def create_from(cls, data_func):
Review comment (Member):

Maybe rename to from_data_func ?

Reply (Contributor, Author):
Agreed.


# We make sure there is the right amount of data.
if (self.nb_elements is not None and
        value.nb_elements != self.nb_elements):
Review comment (Member):
Did you think of a better name for nb_elements for the ArraySequence?

Reply (Contributor, Author):
If we think about what ArraySequence means, it is an array of sequences. So calling len(arr_seq) gives you the number of sequences, and arr_seq.nb_elements is the total number of sequence items. What about sequences_(items|elements)_count or nb_sequences_(items|elements)?

Reply (Contributor, Author):

After speaking with @jchoude, what about using the term atom to refer to a "row" of an ArraySequence._data? An ArraySequence would then be made of atoms which come from several sequences' elements.

We could then replace element in the docstrings with atom when we refer to a row of _data. nb_elements would become either nb_atoms or atoms_count.

Reply (Member):
No strong preference. How about total_nb_rows?

Reply (Member):

And use row to refer to an element ?

Reply (Member):

Anyway -- choose as you like for nb_elements; then I think we're ready to merge, if you agree.

Reply (Contributor, Author):

Done

@coveralls:

Coverage increased (+0.4%) to 96.102% when pulling ce9b44a on MarcCote:streamlines_api into 83d3249 on nipy:master.

@coveralls:

Coverage increased (+0.4%) to 96.102% when pulling 8952c23 on MarcCote:streamlines_api into 83d3249 on nipy:master.

matthew-brett and others added 3 commits June 7, 2016:
- Try dropping the special cases for the constructors for sliceable dicts, and deal with None as an input value in the tractogram properties.
- Suggestion: simplify constructor for sliceable dicts.
@coveralls:

Coverage increased (+0.4%) to 96.098% when pulling 0324bce on MarcCote:streamlines_api into 83d3249 on nipy:master.

@coveralls:

Coverage increased (+0.4%) to 96.098% when pulling 0a22b93 on MarcCote:streamlines_api into 83d3249 on nipy:master.

@matthew-brett (Member):

OK - finally done - congratulations. Thanks too for your patience and good grace in responding to all the comments.

@matthew-brett merged commit 9959c74 into nipy:master Jun 7, 2016
@arokem (Member) commented Jun 7, 2016

This is fantastic! Congratulations to @MarcCote and everyone else involved!

Question: when might we see this in a released version of nibabel? In other words -- what is the road map towards a 2.1 (assuming that's the next release)?

@jchoude commented Jun 8, 2016

Fantastic! Good job @MarcCote and everyone!

@demianw (Contributor) commented Jun 8, 2016

Congrats @MarcCote and everyone for this titanic piece of work

@MarcCote (Contributor, Author) commented Jun 8, 2016

😃

@MarcCote deleted the streamlines_api branch June 8, 2016 11:39
matthew-brett added a commit that referenced this pull request Apr 18, 2017
MRG: Support TCK streamlines file format

This PR adds support for MRtrix streamlines files to the Streamlines API (#391).

Eventually, in a future PR, I will add some functions that allow easy conversion between all the different formats. For now, going from TRK to TCK is easy. See
https://gist.github.com/MarcCote/ea6842cc4c3950f7596fc3c8a0be0154

Converting from TCK to TRK requires additional information from a reference anatomy. See
https://gist.github.com/MarcCote/ebf0ae83a9202eb96d1dfbb949cd98af
@effigies mentioned this pull request Apr 25, 2019