Skip to content

Initializing Public Git Repo Fails/Unclear When Gitbase is Ready #313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
elithrar opened this issue Mar 11, 2019 · 15 comments
Closed

Initializing Public Git Repo Fails/Unclear When Gitbase is Ready #313

elithrar opened this issue Mar 11, 2019 · 15 comments

Comments

@elithrar
Copy link

Summary: Downloading a subset of the PGA and calling srcd init fails to provide a usable environment after several hours.

srcd Engine version: v0.11.0
Container image versions:

elithrar@matt-workstation:~$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
srcd/cli-daemon     v0.11.0             e09406877f03        2 days ago          44MB
srcd/gitbase        v0.19.0             5df5a9b119a9        3 days ago          37.6MB
srcd/gitbase-web    v0.6.2              1c745ac11485        4 days ago          109MB
bblfsh/bblfshd      v2.11.8-drivers     ac9a79330aa9        2 weeks ago         1.42GB

Machine spec: 16 vCPU, 200GB RAM, 200GB SSD (custom GCP VM)

Steps to reproduce:

  1. Initialize repo
~$ srcd init $HOME/repos/pga/siva # I believe this is the correct root as per below?
~$ tree -L 2 repos/pga/
repos/pga/
`-- siva    
     `-- latest
  1. Call srcd sql 'SHOW tables; until it no longer reports "waiting for Gitbase to be ready". This takes multiple hours; htop reports random cores pegged at 100% at intervals, but low memory usage.

  2. Successfully call srcd sql 'SHOW TABLES; and get a table listing:

~$ srcd sql "SHOW tables;"
+--------------+
|    TABLE     |
+--------------+
| blobs        |
| commit_blobs |
| commit_files |
| commit_trees |
| commits      |
| files        |
| ref_commits  |
| refs         |
| remotes      |
| repositories |
| tree_entries |
+--------------+
  1. Attempt to run a test query against a table (any) - fails with an rpc error:
~$ srcd sql 'SELECT 1 from refs;'
2019/03/11 05:30:13 rpc error: code = Unknown desc = closing row iterator: invalid connection

Open to suggestions!

My initial feedback (noting the empathy sessions/issues on this repo) is that determining the status of gitbase is near-impossible as a user without (likely) entering the container; knowing how to initialize the engine on siva files is undocumented (educated guess based on prev. issues; may still be wrong); errors are opaque and lack context (filename, line no. would be useful).

@dpordomingo
Copy link
Contributor

Hi @elithrar thanks for helping us to polish source{d} Engine.

I wonder if that problem could be related with the PGA subset you're using, or even related with your infrastructure.

I tried it with this small PGA subset with no problems:

# install pga from sources (it will require to have `go` installed)
go get -u github.com/src-d/datasets/PublicGitArchive/...

# downloads all repos from 'src-d' repos writen in 'Go' into '$HOME/repos/pga/siva'
pga list -u /src-d/ -l Go -f json | jq -r ".sivaFilenames[]" | pga get -i -o $HOME/repos/pga

# init source{d} Engine from '$HOME/repos/pga/siva' as downloade by pga in prev step
srcd init $HOME/repos/pga/siva

# query 'remotes' table to get some repo info
srcd sql 'select remote_fetch_url from remotes'

Could you maybe share the subset of PGA projects that you're using in order to help us to reproduce the issue?

Also,
if the purpose of:

$ srcd sql 'SELECT 1 from refs;'

is to check everything is ok, I'd try

$ srcd sql 'SELECT count(*) from refs;'
$ srcd sql 'SELECT * from refs limit 1' 

to ensure there is no problem retrieving that much 1.

@elithrar
Copy link
Author

elithrar commented Mar 11, 2019

Thanks for the quick reply @dpordomingo!

  • I'll see whether a subset of the PGA works - I'm using pga get -l go in this case (so ~8k repos). Ultimately, it's important that I get that working for a talk/project I'm working on, but will trim it down a bit further.
  • I had tried SELECT count(*) ... on a few tables, which also failed - the SELECT 1 would be a (useful) load test to make sure I can parse the data.

Will report back later today.

@smacker
Copy link
Contributor

smacker commented Mar 11, 2019

logs from gitbase can be helpful too:

docker logs srcd-cli-gitbase

@elithrar
Copy link
Author

elithrar commented Mar 11, 2019 via email

@dpordomingo
Copy link
Contributor

I tried pga get and download ~9k repos, then I hit the same problem than you.
I found that some random repos take too much time to be initialized by Engine.

This problem is somehow addressed by src-d/go-mysql-server#631 but it has not been yet added into source{d} Engine.

@elithrar
Copy link
Author

elithrar commented Mar 12, 2019 via email

@dpordomingo
Copy link
Contributor

I was told this morning that PGA also contains all remote refferences of every repo (it even includes PullRequests references), which could be too much for your use case.
I didn't confirm it myself, but maybe you could do it easily inspecting the content of a repo having a big siva file ~100MB.

@elithrar
Copy link
Author

elithrar commented Mar 12, 2019 via email

@elithrar
Copy link
Author

elithrar commented Mar 16, 2019

Update:

  • Grabbed ~12k repos (still 97GB...) via the below:
pga list --format json | jq -r 'select(.fileCount > 10) | select(.langs[] == "Go") | select(.commitsCount > 10) | select(.codeLinesCount[] > 1000) | .sivaFilenames[]' | pga get -i -o smaller-pga/
  • Made a 100GB tmpfs (i.e. all in RAM) and copied the directories over. Machine has 200GB RAM.
  • Gitbase still times out - the most recent logs are as below - the Gitbase container doesn't output any further logs, but the srcd daemon does:
~$ docker logs -f srcd-cli-gitbase 
path=/opt/repos/latest/ff/fff08a6b14a1d64934f458b58349c5a9e036e1dd.siva
time="2019-03-16T16:31:55Z" level=debug msg="repository added" path=/opt/repos/latest/ff/fff15fbe527c013370507e7f74085673a2aeff58.siva
time="2019-03-16T16:31:55Z" level=debug msg="repository added" path=/opt/repos/latest/ff/fff2d7b637d108d4973789a50e2cbe618ee9b859.siva
time="2019-03-16T16:31:55Z" level=debug msg="repository added" path=/opt/repos/latest/ff/fff638b259193bf8d2e7cc98ebafd2cab7e38bed.siva
time="2019-03-16T16:31:55Z" level=debug msg="repository added" path=/opt/repos/latest/ff/fff9f7ed5edfe1c8fa597a659b601e6cd6caef09.siva
time="2019-03-16T16:31:55Z" level=debug msg="repository added" path=/opt/repos/latest/ff/fffda083cb28606ee1aa81c730b6c923d112b3ce.siva
time="2019-03-16T16:31:55Z" level=debug msg="repository added" path=/opt/repos/latest/ff/fffe75f3381f183dde26b2734309e186181a5eaa.siva
time="2019-03-16T16:31:55Z" level=debug msg="registered database to catalog" db=gitbase
time="2019-03-16T16:31:55Z" level=debug msg="registered all available functions in catalog"time="2019-03-16T16:31:55Z" level=debug msg="created index storage"
time="2019-03-16T16:31:55Z" level=debug msg="registered pilosa index driver"
time="2019-03-16T16:31:55Z" level=info msg="squash tables rule is enabled"

But it seems the srcd-cli container can't connect to MySQL:

~$ docker logs -f srcd-cli-daemon
time="2019-03-16T21:12:47Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:12:48Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:12:49Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:12:50Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:12:51Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:12:52Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\""time="2019-03-16T21:12:53Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\""time="2019-03-16T21:12:54Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:12:55Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:12:56Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:12:58Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:12:59Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:13:00Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:13:01Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"
"time="2019-03-16T21:13:02Z" level=info msg="connecting to mysql \"root@tcp(srcd-cli-gitbase)/?maxAllowedPacket=65536\"

I don't see MySQL listening in that container:

docker exec -it srcd-cli-gitbase /bin/sh
/ # netstat -tunapl
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.11:42709        0.0.0.0:*               LISTEN      -
udp        0      0 127.0.0.11:54828        0.0.0.0:*            

@dpordomingo
Copy link
Contributor

some insights about your logs:

  • docker logs -f srcd-cli-gitbase is logging the DB engine (gitbase), if last message is squash tables rule is enabled, then it's checking indexes; once it finishes doing it, it will log server started and listening on 0.0.0.0:3306, and then it will be ready for being queried, not before. It (most probably) takes that much because Perf: Avoid call to Checksum if there are no indexes. go-mysql-server#631 is not yet released.
  • docker logs -f srcd-cli-daemon is logging the attempts of Engine to query the DB, that is not yet ready as said above.

@elithrar
Copy link
Author

Yep - that's what I expected.

Querying directly off the tmpfs:

elithrar@matt-workstation:~/repos$ srcd sql 'SELECT count(*) from remotes;'
2019/03/18 21:43:39 rpc error: code = Unknown desc = SQL query failed: Error 1105: unknown error: repository does not exist   

Looking back at the container logs:

time="2019-03-16T21:22:57Z" level=info msg="squash tables rule is enabled"
time="2019-03-17T02:18:02Z" level=info msg="server started and listening on 0.0.0.0:3306"
time="2019-03-18T21:43:39Z" level=info msg="NewConnection: client 1"
time="2019-03-18T21:43:39Z" level=error msg="Error parsing auth server config: unexpected end of JSON input"
time="2019-03-18T21:43:39Z" level=info msg="audit trail" action=authentication address="172.18.0.2:54518" success=true system=audit user=root
time="2019-03-18T21:43:39Z" level=debug msg="executing query" query="SELECT 1"
time="2019-03-18T21:43:39Z" level=info msg="audit trail" action=authorization address="172.18.0.2:54518" connection_id=1 permission=read pid=1 query="SELECT 1" success=true system=audit user=root
time="2019-03-18T21:43:39Z" level=info msg="audit trail" action=query address="172.18.0.2:54518" connection_id=1 duration=10.25411ms pid=1 query="SELECT 1" success=true system=audit user=root
time="2019-03-18T21:43:39Z" level=info msg="NewConnection: client 2"
time="2019-03-18T21:43:39Z" level=info msg="audit trail" action=authentication address="172.18.0.2:54520" success=true system=audit user=root
time="2019-03-18T21:43:39Z" level=debug msg="executing query" query="SELECT count(*) from remotes;"
time="2019-03-18T21:43:39Z" level=info msg="audit trail" action=authorization address="172.18.0.2:54520" connection_id=2 permission=read pid=2 query="SELECT count(*) from remotes;" success=true system=audit user=root
time="2019-03-18T21:43:39Z" level=info msg="audit trail" action=query address="172.18.0.2:54520" connection_id=2 duration=73.830968ms err="repository does not exist" pid=2 query="SELECT count(*) from remotes;" success=false system=audit user=root
  • SHOW tables; works - the schema is there
  • No query against those tables works.

@dpordomingo
Copy link
Contributor

dpordomingo commented Mar 31, 2019

sorry for the delay, but these investigation issues are kind of out of our scope.

I ran it this weekend, and I also found that repository does not exist error. It was related with corrupted database, most probably caused by wrong siva.

What I did is to fetch only HEAD of the repos, in order to avoid references to tags, prs and many strange branches from many repos.
To do so, instead of downloading the database with pga, I build a list of those repos, and downloaded them with a modified version of borges (it took ~3h).
That way, I could query all tables from that dataset:

  • 86GB
  • 7493 siva files
  • 8004 remotes
  • 5.2M commits
  • 122k unique author emails

You can get the code of that modified version of borges, and instructions about how to run it from this PR dpordomingo/borges#1

@dpordomingo
Copy link
Contributor

By the way: initing Engine for that 7.5K repos was fast (less than a minute), and the query to count the commits was ready in similar time, so I'd think that the problem you got was related with a corrupted database (as seen by repository does not exist error)

@elithrar
Copy link
Author

elithrar commented Mar 31, 2019 via email

@dpordomingo
Copy link
Contributor

Glad to know you're trying hard.
Thanks for sharing your experiences with us, because it will help us to enhance our stack.
Looking forward to know if this new approach lets you to run some queries with source{d} Engine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants