Skip to content

Adds average document size to DocsStats #27117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 28, 2017
Merged

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Oct 26, 2017

This change is required in order to support a size based check for the
index rollover.

The average document size is estimated by sampling only existing
segments. We prefer using segments rather than StoreStats because
StoreStats is not reliable if indexing or merging operations are in
progress.

Relates #27004

This change is required in order to support a size based check for the
index rollover.

The average document size is estimated by sampling only existing
segments. We prefer using segments rather than StoreStats because
StoreStats is not reliable if indexing or merging operations are in
progress.

Relates elastic#27004
@dnhatn
Copy link
Member Author

dnhatn commented Oct 26, 2017

@jasontedor, Should we also expose this stat to the REST layer (eg. _cat)?

@dnhatn dnhatn changed the title Adds document average size to DocsStats Adds average document size to DocsStats Oct 26, 2017
@dnhatn dnhatn requested a review from s1monw October 26, 2017 04:05
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor comment but in general it LGTM.

@@ -57,16 +66,26 @@ public long getDeleted() {
return this.deleted;
}

public long getAverageSizeInBytes() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add javadocs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Override
public void readFrom(StreamInput in) throws IOException {
count = in.readVLong();
deleted = in.readVLong();
if (in.getVersion().onOrAfter(Version.V_6_1_0)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to make it Version.V_7_0_0 for now and change it back after this change is backported in order not to cause failures in the multi-version cluster qa tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint. I temporarily made this for v7 only (f38e957)

indexShard = newStartedShard(true);
int smallDocNum = randomIntBetween(5, 100);
for (int i = 0; i < smallDocNum; i++) {
indexDoc(indexShard, "test", "small-" + i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we've been trying to use doc as a type name in new tests whenever possible. Can you rename?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a suggestion.

@@ -880,8 +880,15 @@ public FlushStats flushStats() {
}

public DocsStats docStats() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should rather be something like this:

public DocsStats docStats() {
        long numDocs = 0;
        long numDeletedDocs = 0;
        long sizeInByte = 0;
        List<Segment> segments = segments(false);
        for (Segment segment : segments) {
            if (segment.search) {
                numDocs += segment.getNumDocs();
                numDeletedDocs += segment.getDeletedDocs();
                sizeInByte += segment.getSizeInBytes();
            }
        }
        return new DocsStats(numDocs, numDeletedDocs, sizeInByte);
    }

that way we maintain a consistent total and we can calculate the average at read time and aggregation of doc stats will be much simpler? I also think we should make sure the size in bytes is based on the currently used reader which is guaranteed by the Segment#search flag. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this makes the DocsStats simpler and the average value more accurate. I have updated this in 662f062

@dnhatn
Copy link
Member Author

dnhatn commented Oct 26, 2017

@jpountz @simonw I have addressed your feedbacks. Could you please have another review? Thank you.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two really minor comments, this LGTM regardless of what you choose :)

}

public void add(DocsStats docsStats) {
if (docsStats == null) {
public void add(DocsStats that) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally that is a bit too close to this for typo avoidance, so I'd prefer other, but it's so minor that it's totally up to you

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer other over that (addressed in 6cca080) but I used that in order to have the same indention for this expression (removed).

-        long totalBytes = this.averageSizeInBytes * (this.count + this.deleted)
-                        + that.averageSizeInBytes * (that.count + that.deleted);

@Override
public void readFrom(StreamInput in) throws IOException {
count = in.readVLong();
deleted = in.readVLong();
if (in.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {
totalSizeInBytes = in.readVLong();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set totalSizeInBytes to -1 to indicate that it cannot be read and not that there are 0 bytes in use for 6.x nodes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 96e9be7

@Override
public void readFrom(StreamInput in) throws IOException {
count = in.readVLong();
deleted = in.readVLong();
if (in.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should be V_6_1_0 since you will be backporting this to the 6.x branch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz recommended to make this for v7, then change for the backport later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, it should be 7.0.0. Then when you backport set it to 6.1.0 in the 6.x branch and make sure that the BWC tests in master against 6.x pass (you might have to skip some of them). Then push a commit to master flipping the version to 6.1.0 and removing the skips.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh okay, I hadn't realized we were doing it the reverse way now :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean by the reverse way, these are part of the steps to have green CI on all branches every step of the way.

}

@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeVLong(count);
out.writeVLong(deleted);
if (out.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here for V_6_1_0

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dnhatn dnhatn merged commit 07d270b into elastic:master Oct 28, 2017
@dnhatn dnhatn deleted the avg-doc-size branch October 28, 2017 17:21
dnhatn added a commit that referenced this pull request Oct 28, 2017
This change is required in order to support a size based check for the
index rollover.

The index size is estimated by sampling the existing segments only. We
prefer using segments to StoreStats because StoreStats is not reliable
if indexing or merging operations are in progress.

Relates #27004
dnhatn added a commit that referenced this pull request Oct 28, 2017
dnhatn added a commit that referenced this pull request Oct 28, 2017
jasontedor added a commit to olcbean/elasticsearch that referenced this pull request Oct 30, 2017
* master: (63 commits)
  [Docs] Fix note in bucket_selector
  [Docs] Fix indentation of examples (elastic#27168)
  [Docs] Clarify `span_not` query behavior for non-overlapping matches (elastic#27150)
  [Docs] Remove first person "I" from getting started (elastic#27155)
  [Docs] Correct link target for datatype murmur3 (elastic#27143)
  Fix division by zero in phrase suggester that causes assertion to fail
  Enable Docstats with totalSizeInBytes for 6.1.0
  Adds average document size to DocsStats (elastic#27117)
  Upgrade Painless from ANTLR 4.5.1-1 to  ANTLR 4.5.3. (elastic#27153)
  Exists template needs a template name (elastic#25988)
  [Tests] Fix occasional test failure due to two random values being the same
  Fix beidermorse phonetic token filter for unspecified `languageset` (elastic#27112)
  Fix max score tracking with field collapsing (elastic#27122)
  [Doc] Add Ingest CSV Processor Plugin to plugin as a community plugin (elastic#27105)
  Removed the beta tag from cross-cluster search
  fixed typo in ConstructingObjectParse (elastic#27129)
  Allow for the Painless Definition to have multiple instances (elastic#27096)
  Apply missing request options to the expand phase (elastic#27118)
  Only pull SegmentReader once in getSegmentInfo (elastic#27121)
  Fix BWC for discovery stats
  ...
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants