14 TB of Hundreds of Thousands of Input Files For Training? #1668

Closed
ShinobiWannabe opened this issue Nov 19, 2018 · 2 comments

Comments

@ShinobiWannabe

I am sorry if this was answered, but the closest I could find is this:
#192

The answer there appears to be to pass every file to the reader explicitly:
var data = reader.Read(exampleFile1, exampleFile2);

The other tutorials on Microsoft all used a single file.

I am looking for something that can process about 14 terabytes of data spread over hundreds of thousands of files across multiple hard drives. At that size there is no realistic way to hold it all in memory either.

Could I use ML.NET for this problem?

@Zruty0
Contributor

Zruty0 commented Nov 20, 2018

@ShinobiWannabe, there is no limitation in principle for this type of scenario, as long as you don't try to use an in-memory operation (like training FastTree/LightGBM, for example).

However, there are 2 difficulties I envision right now:

  • We don't have a standard component that can read from multiple files. You could try to create new MultiFileSource("file1.txt+file2.txt+file3.txt"). The + syntax used to work back in the day, but I'm not sure we still support it. If that doesn't work, you'd need to implement your own IMultiStreamSource, which is a bit involved.
  • Our learners, when they deem that they benefit from caching the data, cache it in memory (without the possibility to opt out). This is a bug tracked by "Clean up our auto-caching" #1604.
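
For readers landing here later, the multi-file suggestion might look roughly like the sketch below. It is not taken from this thread: it assumes a later ML.NET release (1.x), where `MultiFileSource` accepts an array of paths and `CreateTextLoader<T>` is available on `mlContext.Data`. The `InputRow` class, the drive paths, and the `*.tsv` pattern are hypothetical placeholders. It only shows how the files are handed to the loader and does nothing about the auto-caching issue in the second bullet.

```csharp
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row schema; adjust the columns to match the real files.
public class InputRow
{
    [LoadColumn(0)] public float Label { get; set; }
    [LoadColumn(1)] public string Text { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Hypothetical root folders, one per hard drive.
        string[] roots = { @"D:\data", @"E:\data" };

        // Collect every matching file path under each root.
        string[] files = roots
            .SelectMany(root => Directory.EnumerateFiles(
                root, "*.tsv", SearchOption.AllDirectories))
            .ToArray();

        // Build a loader from the schema class, then point it at all files.
        TextLoader loader = mlContext.Data.CreateTextLoader<InputRow>(
            separatorChar: '\t', hasHeader: false);
        IDataView data = loader.Load(new MultiFileSource(files));

        // IDataView is lazily evaluated: rows are read only when a
        // downstream component (a trainer, a preview, ...) requests them.
    }
}
```

Whether this stays within memory still depends on the trainer, since an in-memory learner or an auto-cached pipeline would pull the data into RAM regardless of how it is loaded.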

@ShinobiWannabe
Author

Ok thank you for the answer. Good to know.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 26, 2022