Adding Multiple Training Files to the Pipeline? #192

Closed
cflint987 opened this issue May 19, 2018 · 4 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@cflint987

cflint987 commented May 19, 2018

System information
OS version/distro: Windows 7 Home
.NET Version (e.g., dotnet --info): ML.NET v0.1.0

Issue:
What is the correct way to add multiple training files to a Learning Pipeline?

In the Taxi Fare example, simply adding another TextLoader and/or ColumnCopier, etc., does not seem to be correct.

Example:
pipeline.Add(new TextLoader(DataPath, useHeader: true, separator: ","));
pipeline.Add(new TextLoader(DataPath2, useHeader: true, separator: ","));

@cflint987 cflint987 changed the title .Net Framework Support? Adding Multiple Training Files to the Pipeline? May 19, 2018
@shauheen shauheen added the question Further information is requested label May 21, 2018
@GalOshri
Contributor

Thanks for asking! This is not currently possible, but let's use this issue to track enabling multiple inputs in a pipeline.

Just to clarify: is your intention to concatenate the two files as soon as they are loaded, or to apply different transforms/trainers to them?

A potential workaround for now is to read in the examples from both files into memory and use the CollectionDataSource (see example usage here). You could also concatenate the two files into one CSV outside of the ML.NET pipeline.
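The second workaround can be sketched as a small pre-processing step outside the ML.NET pipeline. This is only an illustration under assumptions: the file names and paths are hypothetical, and both files are assumed to share the same header row, which is kept only from the first file.

```csharp
using System.IO;
using System.Linq;

class ConcatCsvFiles
{
    static void Main()
    {
        // Hypothetical input files; both are assumed to share the same header row.
        var inputs = new[] { "taxi-fare-train-1.csv", "taxi-fare-train-2.csv" };
        var combined = "taxi-fare-train-combined.csv";

        using (var writer = new StreamWriter(combined))
        {
            for (int i = 0; i < inputs.Length; i++)
            {
                // Skip the header row of every file after the first,
                // so the combined CSV has exactly one header.
                var lines = File.ReadLines(inputs[i]).Skip(i == 0 ? 0 : 1);
                foreach (var line in lines)
                    writer.WriteLine(line);
            }
        }

        // The combined file can then be used with a single loader, e.g.:
        // pipeline.Add(new TextLoader(combined, useHeader: true, separator: ","));
    }
}
```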

@GalOshri GalOshri added the enhancement New feature or request label May 21, 2018
@cflint987
Author

My intention is to make creating and testing ML structures with large datasets modular and less taxing on file transfers to and from servers. For example, moving 100 GB to a server is easier if the data is split by time or another parameter. It also allows ML structures to be updated as new data comes in, without having to concatenate onto what is already a large file.

Reducing the memory footprint by loading subsets of the data would be nice, but as I understand it, that is not possible for all ML structures.

I have concatenated the files and it works properly, but this would be a nice feature to have.

Thanks for the answer.

@glebuk
Contributor

glebuk commented May 23, 2018

@cflint987,
We have a work item to address your exact scenario. Please take a look at PR #61. Feel free to comment and ask @tyclintw.

@Ivanidzo4ka
Contributor

DRI RESPONSE: You can do this with the new API:

// Create the reader: define the data columns and where to find them in the text file.
var reader = TextLoader.CreateReader(env, ctx => (
        // A boolean column depicting the 'target label'.
        IsOver50K: ctx.LoadBool(14),
        // Three text columns.
        Workclass: ctx.LoadText(1),
        Education: ctx.LoadText(3),
        MaritalStatus: ctx.LoadText(5)),
    hasHeader: true);
// Now read the files (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
var data = reader.Read(exampleFile1, exampleFile2);

Please let me know if this satisfies you. I intend to close this issue within the next few days.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022