Support for parquet file with type inferring #7734

Merged

evanevanevanevannnn
Collaborator

Changelog entry

https://st.yandex-team.ru/YQ-2830

Changelog category

  • Improvement

github-actions bot commented Aug 13, 2024

2024-08-13 15:48:25 UTC Pre-commit check for 0996aee has started.
2024-08-13 15:51:46 UTC Check linux-x86_64-release-clang14 is running...
🟢 2024-08-13 16:00:32 UTC Build successful.

github-actions bot commented Aug 13, 2024

2024-08-13 15:54:40 UTC Pre-commit check for 0996aee has started.
2024-08-13 15:58:43 UTC Check linux-x86_64-relwithdebinfo is running...
🟡 2024-08-13 17:07:17 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
14108 12757 0 1 1335 15

2024-08-13 17:09:39 UTC Failed tests rerun (try 2) linux-x86_64-relwithdebinfo is running...
🟢 2024-08-13 17:16:51 UTC Tests successful.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
18 (only retried tests) 10 0 0 0 8

🟢 2024-08-13 17:16:58 UTC Build successful.
🟡 2024-08-13 17:17:31 UTC ydbd size 8.1 GiB changed* by +1.3 MiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 010b521 merge: 0996aee diff diff %
ydbd size 8 700 764 520 Bytes 8 702 140 872 Bytes +1.3 MiB +0.016%
ydbd stripped size 473 100 616 Bytes 473 156 232 Bytes +54.3 KiB +0.012%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Aug 13, 2024

2024-08-13 15:56:55 UTC Pre-commit check for 0996aee has started.
2024-08-13 16:00:31 UTC Check linux-x86_64-release-asan is running...
🔴 2024-08-13 18:08:29 UTC Some tests failed, follow the links below.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
9769 9724 0 7 24 14

🟢 2024-08-13 18:09:38 UTC Build successful.
🟡 2024-08-13 18:10:06 UTC ydbd size 5.5 GiB changed* by +769.9 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 010b521 merge: 0996aee diff diff %
ydbd size 5 851 672 488 Bytes 5 852 460 840 Bytes +769.9 KiB +0.013%
ydbd stripped size 1 469 404 912 Bytes 1 469 570 448 Bytes +161.7 KiB +0.011%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Aug 14, 2024

2024-08-14 09:46:44 UTC Pre-commit check for 4bf02c2 has started.
2024-08-14 09:49:29 UTC Check linux-x86_64-release-clang14 is running...
🟢 2024-08-14 09:55:03 UTC Build successful.

github-actions bot commented Aug 14, 2024

2024-08-14 09:47:27 UTC Pre-commit check for 4bf02c2 has started.
2024-08-14 09:50:14 UTC Check linux-x86_64-release-asan is running...
🔴 2024-08-14 13:12:18 UTC Some tests failed, follow the links below.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
9823 9746 0 21 22 34

🟢 2024-08-14 13:13:26 UTC Build successful.
🟡 2024-08-14 13:13:58 UTC ydbd size 5.5 GiB changed* by +783.0 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 9ec6f05 merge: 4bf02c2 diff diff %
ydbd size 5 869 832 728 Bytes 5 870 634 480 Bytes +783.0 KiB +0.014%
ydbd stripped size 1 473 498 096 Bytes 1 473 657 296 Bytes +155.5 KiB +0.011%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Aug 14, 2024

2024-08-14 09:49:24 UTC Pre-commit check for 4bf02c2 has started.
2024-08-14 09:52:07 UTC Check linux-x86_64-relwithdebinfo is running...
🟡 2024-08-14 12:14:29 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
14186 12813 0 4 1335 34

2024-08-14 12:16:08 UTC Failed tests rerun (try 2) linux-x86_64-relwithdebinfo is running...
🟡 2024-08-14 12:23:49 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
49 (only retried tests) 17 0 3 0 29

2024-08-14 12:23:57 UTC Failed tests rerun (try 3) linux-x86_64-relwithdebinfo is running...
🟢 2024-08-14 12:32:01 UTC Tests successful.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
44 (only retried tests) 15 0 0 0 29

🟢 2024-08-14 12:32:10 UTC Build successful.
🟡 2024-08-14 12:32:48 UTC ydbd size 8.1 GiB changed* by +1.3 MiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 9ec6f05 merge: 4bf02c2 diff diff %
ydbd size 8 725 240 488 Bytes 8 726 653 680 Bytes +1.3 MiB +0.016%
ydbd stripped size 474 249 384 Bytes 474 296 616 Bytes +46.1 KiB +0.010%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

[actorSystem, selfId = SelfId(), request = std::move(request)](NYql::IHTTPGateway::TResult&& result) mutable {
actorSystem->Send(selfId, new TEvS3DownloadResponse(std::move(request), std::move(result)));
}, {}, RetryPolicy_);
return std::move(headers);
Collaborator

Why wouldn't NRVO apply here?
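The question above concerns `return std::move(headers);`. As a rough illustration only (the real `headers` type and surrounding function are not shown in this thread, so `TTracked`, `MakePlain`, `MakeWithMove`, and `CountMoves` are hypothetical names), the sketch below counts move constructions for both return styles. It assumes `headers` is a plain local variable; if it were a lambda capture or a function parameter, NRVO would not be available in the first place.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the real headers object; it only counts constructions.
struct TTracked {
    static inline int Copies = 0;
    static inline int Moves = 0;
    TTracked() = default;
    TTracked(const TTracked&) { ++Copies; }
    TTracked(TTracked&&) noexcept { ++Moves; }
};

// Plain return of a local: eligible for NRVO, and implicitly moved if NRVO is not applied.
TTracked MakePlain() {
    TTracked headers;
    return headers;
}

// std::move makes the operand an xvalue expression, which is not eligible
// for copy elision, so the move constructor has to run.
TTracked MakeWithMove() {
    TTracked headers;
    return std::move(headers);
}

// Returns how many move constructions one call to `make` performed.
template <class F>
int CountMoves(F make) {
    TTracked::Moves = 0;
    TTracked result = make();
    (void)result;
    return TTracked::Moves;
}
```

With common compilers `CountMoves(MakePlain)` is typically 0, while `CountMoves(MakeWithMove)` is at least 1, which is why compilers tend to flag the second form as a pessimizing move.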

@@ -333,7 +333,7 @@ struct TObjectStorageExternalSource : public IExternalSource {
}
for (const auto& entry : entries.Objects) {
if (entry.Size > 0) {
-return entry.Path;
+return entry;
Collaborator

How did this work before? Why was it changed to entry?

Collaborator Author

Previously we downloaded the first 10 MB of the file, so we did not need the file size.
Now, for Parquet, we need to download the last N bytes, so we need to know the file size, because S3Fetcher takes the (start, end) of the range to fetch as input.

Accordingly, I added the file size to TEvInferFileSchema.
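To make the reasoning above concrete, here is a minimal sketch of why the file size is needed. It is not the PR's code: `TailRange` and `FooterLength` are hypothetical helpers, and the half-open `[start, end)` convention is an assumption, since the thread does not say whether the end offset S3Fetcher takes is inclusive. The footer layout itself comes from the Parquet format: a file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper: half-open byte range [start, end) covering the last
// `tailBytes` of a file. The start offset cannot be computed without fileSize.
std::pair<uint64_t, uint64_t> TailRange(uint64_t fileSize, uint64_t tailBytes) {
    const uint64_t length = std::min(fileSize, tailBytes);
    return {fileSize - length, fileSize};
}

// Hypothetical helper: reads the metadata length from the final 8 bytes of a
// downloaded tail. Returns std::nullopt if the tail is too short or lacks "PAR1".
std::optional<uint32_t> FooterLength(std::string_view tail) {
    if (tail.size() < 8 || tail.substr(tail.size() - 4) != "PAR1") {
        return std::nullopt;
    }
    const auto* p = reinterpret_cast<const unsigned char*>(tail.data() + tail.size() - 8);
    // Assemble the little-endian 32-bit length byte by byte.
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}
```

If the reported footer length exceeds what the first tail request covered, a second request for `TailRange(fileSize, footerLength + 8)` would fetch the whole footer.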

return;
}
case EFileFormat::Undefined:
Y_ABORT("Invalid format should be unreachable");
Collaborator

Y_ABORT is a bad choice here; can you return an error instead, the same way the default branch does? That lets us propagate the problem upward easily and conveniently. Also, this case should be moved above default.
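A minimal sketch of the shape suggested above, under stated assumptions: only `EFileFormat::Undefined` appears in the thread, so the other enumerators, the `CheckInferable` function, and the error texts are hypothetical, and `std::optional<std::string>` stands in for whatever error type the real code returns.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for the format enum used in the PR.
enum class EFileFormat { Undefined, CsvWithNames, Parquet };

// Returns an error description for formats that cannot be inferred,
// or std::nullopt when inference can proceed.
std::optional<std::string> CheckInferable(EFileFormat format) {
    switch (format) {
        case EFileFormat::CsvWithNames:
        case EFileFormat::Parquet:
            return std::nullopt;
        case EFileFormat::Undefined:
            // Reported to the caller instead of aborting the process.
            return std::string("Undefined file format reached schema inference");
        default:
            return std::string("Unsupported file format");
    }
}
```

Keeping the explicit `Undefined` case above `default` keeps the two failure messages distinct while still letting the caller decide how to surface them.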

@@ -118,12 +119,24 @@ struct TEvArrowFile : public NActors::TEventLocal<TEvArrowFile, EvArrowFile> {
TString Path;
};

struct TEvArrowSchema : public NActors::TEventLocal<TEvArrowSchema, EvArrowSchema> {
Collaborator

Maybe name this TEvInferredArrowSchema, by analogy with TEvInferredFileSchema?

}

void HandleFileError(TEvFileError::TPtr& ev, const NActors::TActorContext& ctx) {
Cout << "TArrowInferencinator::HandleFileError" << Endl;
Collaborator

The Cout is unnecessary. Either remove it or move it to the logs if it is needed.

futureSchema.Apply([actorSystem, sender, request](NThreading::TFuture<NYql::IArrowReader::TSchemaResponse> response) {
if (response.HasException()) {
try {
response.TryRethrow();
Collaborator

And what does TryRethrow do if HasException is false?

github-actions bot commented Aug 15, 2024

2024-08-15 07:54:14 UTC Pre-commit check for 77c4e0b has started.
2024-08-15 07:56:57 UTC Check linux-x86_64-release-asan is running...
2024-08-15 08:20:08 UTC Check cancelled

github-actions bot commented Aug 15, 2024

2024-08-15 07:54:24 UTC Pre-commit check for 77c4e0b has started.
2024-08-15 07:57:05 UTC Check linux-x86_64-relwithdebinfo is running...
2024-08-15 08:20:07 UTC Check cancelled

github-actions bot commented Aug 15, 2024

2024-08-15 07:54:24 UTC Pre-commit check for 77c4e0b has started.
2024-08-15 07:57:04 UTC Check linux-x86_64-release-clang14 is running...
2024-08-15 08:20:06 UTC Check cancelled

github-actions bot commented Aug 15, 2024

2024-08-15 08:21:38 UTC Pre-commit check for 6a159eb has started.
2024-08-15 08:24:34 UTC Check linux-x86_64-release-asan is running...
🔴 2024-08-15 10:40:20 UTC Some tests failed, follow the links below.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
9801 9724 0 20 23 34

🟢 2024-08-15 10:41:10 UTC Build successful.
🟢 2024-08-15 10:41:39 UTC ydbd size 5.5 GiB changed* by +54.8 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 4943715 merge: 6a159eb diff diff %
ydbd size 5 877 787 456 Bytes 5 877 843 600 Bytes +54.8 KiB +0.001%
ydbd stripped size 1 476 701 296 Bytes 1 476 716 720 Bytes +15.1 KiB +0.001%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Aug 15, 2024

2024-08-15 08:23:43 UTC Pre-commit check for 6a159eb has started.
2024-08-15 08:26:22 UTC Check linux-x86_64-relwithdebinfo is running...
🟡 2024-08-15 09:35:41 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
14183 12812 0 4 1337 30

2024-08-15 09:36:51 UTC Failed tests rerun (try 2) linux-x86_64-relwithdebinfo is running...
🟡 2024-08-15 09:44:03 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
47 (only retried tests) 16 0 3 0 28

2024-08-15 09:44:11 UTC Failed tests rerun (try 3) linux-x86_64-relwithdebinfo is running...
🟢 2024-08-15 09:51:03 UTC Tests successful.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
43 (only retried tests) 16 0 0 0 27

🟢 2024-08-15 09:51:11 UTC Build successful.
🟡 2024-08-15 09:51:44 UTC ydbd size 8.1 GiB changed* by +162.6 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 93998b8 merge: 6a159eb diff diff %
ydbd size 8 731 626 192 Bytes 8 731 792 656 Bytes +162.6 KiB +0.002%
ydbd stripped size 475 246 088 Bytes 475 252 744 Bytes +6.5 KiB +0.001%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Aug 15, 2024

2024-08-15 08:23:45 UTC Pre-commit check for 6a159eb has started.
2024-08-15 08:26:24 UTC Check linux-x86_64-release-clang14 is running...
🟢 2024-08-15 08:31:56 UTC Build successful.

github-actions bot commented Aug 15, 2024

2024-08-15 12:47:04 UTC Pre-commit check for 070653e has started.
2024-08-15 12:50:33 UTC Check linux-x86_64-relwithdebinfo is running...
🟡 2024-08-15 14:01:29 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
14188 12816 0 4 1336 32

2024-08-15 14:02:43 UTC Failed tests rerun (try 2) linux-x86_64-relwithdebinfo is running...
🟢 2024-08-15 14:10:58 UTC Tests successful.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
47 (only retried tests) 19 0 0 0 28

🟢 2024-08-15 14:11:06 UTC Build successful.
🟡 2024-08-15 14:11:41 UTC ydbd size 8.1 GiB changed* by +164.9 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 1f7017f merge: 070653e diff diff %
ydbd size 8 731 836 008 Bytes 8 732 004 896 Bytes +164.9 KiB +0.002%
ydbd stripped size 475 256 104 Bytes 475 263 080 Bytes +6.8 KiB +0.001%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Aug 15, 2024

2024-08-15 12:47:44 UTC Pre-commit check for 070653e has started.
2024-08-15 12:50:18 UTC Check linux-x86_64-release-clang14 is running...
🟢 2024-08-15 12:55:34 UTC Build successful.

github-actions bot commented Aug 15, 2024

2024-08-15 12:48:52 UTC Pre-commit check for 070653e has started.
2024-08-15 12:51:39 UTC Check linux-x86_64-release-asan is running...
🔴 2024-08-15 14:57:57 UTC Some tests failed, follow the links below.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
9817 9745 0 14 23 35

🟢 2024-08-15 14:58:52 UTC Build successful.
🟡 2024-08-15 14:59:21 UTC ydbd size 5.5 GiB changed* by +130.9 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 52d9c30 merge: 070653e diff diff %
ydbd size 5 877 829 040 Bytes 5 877 963 088 Bytes +130.9 KiB +0.002%
ydbd stripped size 1 476 728 048 Bytes 1 476 753 776 Bytes +25.1 KiB +0.002%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation


std::shared_ptr<arrow::io::RandomAccessFile> BuildParquetFileFromMetadata(const TString& data, const TRequest& request, const NActors::TActorContext& ctx) {
auto arrowData = std::make_shared<arrow::Buffer>(nullptr, 0);
{
Collaborator

What are these braces for?

if (buildRes.ok()) {
buildRes = builder.Finish(&arrowData);
}
if (!buildRes.ok()) {
Collaborator

This condition can be moved up, with the ok path placed below it, so that if (buildRes.ok()) { can be removed.

Collaborator Author

This condition checks both builder.Append and builder.Finish; are you suggesting checking them separately?

Collaborator

Yes, more or less, and ideally with different error messages, so it is clear exactly which step broke: Finish or Append.
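The separate checks discussed above might look roughly like the sketch below. It deliberately avoids the real Arrow API: `TStatus` and `TFakeBuilder` are hypothetical stand-ins for `arrow::Status` and the buffer builder, and `BuildOrDescribeError` is an illustrative name, so only the control flow is meant to carry over.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical minimal stand-in for arrow::Status, for illustration only.
struct TStatus {
    bool Ok = true;
    std::string Message;
    bool ok() const { return Ok; }
};

// Hypothetical builder whose steps can be forced to fail.
struct TFakeBuilder {
    bool FailAppend = false;
    bool FailFinish = false;
    std::string Data;

    TStatus Append(const std::string& chunk) {
        if (FailAppend) return TStatus{false, "append error"};
        Data += chunk;
        return TStatus{};
    }
    TStatus Finish(std::string* out) {
        if (FailFinish) return TStatus{false, "finish error"};
        *out = std::move(Data);
        return TStatus{};
    }
};

// Checks each step on its own and returns early, so the message names the
// step that failed. Returns an empty string on success.
std::string BuildOrDescribeError(TFakeBuilder& builder, const std::string& data) {
    if (auto status = builder.Append(data); !status.ok()) {
        return "Append failed: " + status.Message;
    }
    std::string buffer;
    if (auto status = builder.Finish(&buffer); !status.ok()) {
        return "Finish failed: " + status.Message;
    }
    return {};
}
```

Returning early after each step removes the need for the intermediate `if (buildRes.ok())` guard and gives each failure its own message.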

github-actions bot commented Aug 15, 2024

2024-08-15 15:27:43 UTC Pre-commit check for 41329d9 has started.
2024-08-15 15:30:24 UTC Check linux-x86_64-release-clang14 is running...
🟢 2024-08-15 16:01:30 UTC Build successful.

github-actions bot commented Aug 15, 2024

2024-08-15 15:46:10 UTC Pre-commit check for 41329d9 has started.
2024-08-15 15:49:42 UTC Check linux-x86_64-release-asan is running...
🔴 2024-08-15 18:16:38 UTC Some tests failed, follow the links below.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
9824 9752 0 17 23 32

🟢 2024-08-15 18:17:30 UTC Build successful.
🟡 2024-08-15 18:18:00 UTC ydbd size 5.5 GiB changed* by +310.7 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: d24978e merge: 41329d9 diff diff %
ydbd size 5 878 198 040 Bytes 5 878 516 208 Bytes +310.7 KiB +0.005%
ydbd stripped size 1 476 799 504 Bytes 1 476 872 560 Bytes +71.3 KiB +0.005%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Aug 15, 2024

2024-08-15 15:46:58 UTC Pre-commit check for 41329d9 has started.
2024-08-15 15:49:41 UTC Check linux-x86_64-relwithdebinfo is running...
🟡 2024-08-15 17:12:46 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
14190 12823 0 1 1336 30

2024-08-15 17:14:00 UTC Failed tests rerun (try 2) linux-x86_64-relwithdebinfo is running...
🟢 2024-08-15 17:21:44 UTC Tests successful.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
43 (only retried tests) 16 0 0 0 27

🟢 2024-08-15 17:21:52 UTC Build successful.
🟡 2024-08-15 17:22:29 UTC ydbd size 8.1 GiB changed* by +896.5 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 929e8f5 merge: 41329d9 diff diff %
ydbd size 8 731 908 688 Bytes 8 732 826 712 Bytes +896.5 KiB +0.011%
ydbd stripped size 475 258 760 Bytes 475 296 328 Bytes +36.7 KiB +0.008%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@evanevanevanevannnn evanevanevanevannnn merged commit a6a8b05 into ydb-platform:main Aug 15, 2024
10 of 12 checks passed
evanevanevanevannnn added a commit to evanevanevanevannnn/ydb that referenced this pull request Aug 16, 2024
stanislav-shchetinin pushed a commit to stanislav-shchetinin/ydb that referenced this pull request Aug 30, 2024
@evanevanevanevannnn evanevanevanevannnn deleted the parquet_support branch November 7, 2024 12:31