proxyhandler-v1.md
We have outlined a plan for a proxy picker that selects proxies which have not been used recently.

The following restrictions are required:

  • Proxies must time out after a total of 500 requests (successful or failed) have been made through them.
  • If a proxy's invalid attempts exceed 30 within a 5 minute window, its next_avaliable should be set 10 minutes in the future.
  • If a proxy is successful, its delta should increment; if it fails, it should decrement.
  • If a GT (Twitter Guest Token) has been updated recently, skip the update process, as a concurrent/parallel thread may have attempted it already.
  • As threading is used, a single proxy should not be used too many times.
  • Proxies should be added to a queue by a proxy gen thread that updates a FILO queue (see the sketch after this list).
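
As a rough sketch of the last bullet (assuming Python and the standard-library queue module; fetch_free_proxies is a hypothetical placeholder for the database lookup), a proxy gen thread could feed a FILO queue like this:

```python
import queue
import threading
import time

proxy_queue = queue.LifoQueue(maxsize=100)  # FILO: the most recently added proxy is handed out first

def fetch_free_proxies():
    # Hypothetical helper: query the database for proxies whose
    # next_avaliable time has passed, sorted by success_delta.
    return []

def proxy_gen_thread():
    # Producer: keeps the FILO queue topped up with usable proxies.
    while True:
        for proxy in fetch_free_proxies():
            proxy_queue.put(proxy)   # blocks if the queue is full
        time.sleep(1)

def worker_thread():
    # Consumer: workers pop proxies off the queue as they need them.
    proxy = proxy_queue.get()
    # ... make the request through `proxy`, then update its counters ...

threading.Thread(target=proxy_gen_thread, daemon=True).start()
```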
Proxy:
	usage_count -> 1 -> 500	# holds information so that one proxy isn't used more than 500 times in quick succession
	next_avaliable -> TIME	# holds the next time this proxy can be used

	gt_last_updated -> TIME	# holds the time the guest token (a token used when making a request; it can be invalid) was last updated
	success_delta -> 0 LIMIT (5,000 -> 10,000)	# holds information on how many times a proxy was successful
	failed_count -> 0 (0 -> 100)	# holds the number of consecutive failed attempts
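
For illustration, the record above could be sketched as a Python dataclass (field names, including the next_avaliable spelling, follow the plan; the address field and the choice of datetime for the TIME fields are assumptions):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Proxy:
    address: str                            # assumed: host:port of the proxy itself
    usage_count: int = 0                    # 1 -> 500 before the proxy is rested
    next_avaliable: datetime = field(default_factory=datetime.now)  # next time this proxy can be used
    gt_last_updated: datetime = datetime.min  # when the guest token was last refreshed
    success_delta: int = 0                  # LIMIT (5,000 -> 10,000) per the plan
    failed_count: int = 0                   # consecutive failures, 0 -> 100
```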

add_proxy_to_queue():	# proxies are added into a queue, and then used at a later point

	# makes a query to the database to get a free proxy using parameters.
	proxy = get_proxy(where next_avaliable <= now, sorted by success_delta)

	if(there are proxies available):

		if(usage_count > 500):	# checks if the proxy has been used more than 500 times recently

			# we are close to the rate limit of the proxy

			usage_count = 0	# reset the usage count
			failed_count = 0 # reset the failed count (should this be done here?)

			next_avaliable = time + 15 minutes	# push next_avaliable 15 minutes into the future.

			<RETURN NEW PROXY>	# get a new proxy, since this one can not be used
			(END)

		if(failed_count > 100): # check if there have been over 100 sequential failed requests

			usage_count = 0	# reset the usage count	(should this be done here?)
			failed_count = 0	# reset the failed count
			next_avaliable = time + 10 minutes	# push next_avaliable 10 minutes into the future.

			<RETURN NEW PROXY>	# get a new proxy, since this one can not be used
			(END)

		usage_count += 1	# add 1 to the usage count

		make_regular_request()

		if(HTTP request success && Proxy is working):	# if the HTTP request was a success using proxy
			success_delta += 1	# increment the success delta
			failed_count = 0 # reset the consecutive fail count {p1}
			if(guest_token_invalid):	# if the Guest token (GT) was invalid
				if(gt_last_updated is longer than 2 minutes ago):	# and hasn't been updated recently
					make_gt_request()	# make a separate request to get a new guest token
					update_gt()	# update the guest token
				next_avaliable = 0	# allow this proxy to be used again
			else:
				save_request()	# if the guest token was correct, and the HTTP request didn't fail due to the proxy
		else:
			success_delta -= 1	# decrease the success delta
			failed_count += 1	# increase the consecutive fail count
			next_avaliable = NOW() + 5 minutes	# make the proxy usable again in 5 minutes
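
Putting the flow above together, a rough Python sketch might look as follows. The request/GT helpers (make_regular_request, make_gt_request, update_gt, save_request) are the names used in the pseudocode and are left undefined here, and the response attributes (ok, proxy_working, guest_token_invalid) are assumed placeholders:

```python
from datetime import datetime, timedelta

def handle_proxy(proxy):
    # Rest the proxy if it is close to its rate limit.
    if proxy.usage_count > 500:
        proxy.usage_count = 0
        proxy.failed_count = 0
        proxy.next_avaliable = datetime.now() + timedelta(minutes=15)
        return None  # caller should pick a new proxy

    # Rest the proxy if it has failed too many times in a row.
    if proxy.failed_count > 100:
        proxy.usage_count = 0
        proxy.failed_count = 0
        proxy.next_avaliable = datetime.now() + timedelta(minutes=10)
        return None  # caller should pick a new proxy

    proxy.usage_count += 1
    response = make_regular_request(proxy)          # placeholder from the pseudocode

    if response.ok and response.proxy_working:
        proxy.success_delta += 1
        proxy.failed_count = 0                      # reset the consecutive fail count
        if response.guest_token_invalid:
            # Only refresh the GT if no other thread has done so recently.
            if datetime.now() - proxy.gt_last_updated > timedelta(minutes=2):
                token = make_gt_request(proxy)      # placeholder
                update_gt(proxy, token)             # placeholder
                proxy.gt_last_updated = datetime.now()
            proxy.next_avaliable = datetime.min     # allow this proxy to be used again
        else:
            save_request(response)                  # placeholder
    else:
        proxy.success_delta -= 1
        proxy.failed_count += 1
        proxy.next_avaliable = datetime.now() + timedelta(minutes=5)

    return response
```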

Additional suggestions:

  • Use an ML model to choose which proxy to use.
  • Backoff strategy: time out for 2^n seconds, where n = consecutive failed attempts (capped at some maximum), plus random jitter r.
  • Use a rolling time window (rather than fixed counters) when judging failures (see the sketch below).
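
A minimal sketch of the last two suggestions in Python (the cap on n and the jitter range are arbitrary choices, not part of the plan):

```python
import random
import time
from collections import deque

def backoff_seconds(consecutive_failures, max_n=8):
    # 2^n seconds, with n capped, plus random jitter so threads don't retry in lockstep.
    n = min(consecutive_failures, max_n)
    return 2 ** n + random.uniform(0, 1)

class RollingFailureWindow:
    """Counts failures within the last window_seconds instead of using a fixed counter."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.failures = deque()

    def record_failure(self):
        self.failures.append(time.monotonic())

    def count(self):
        # Drop failures that have aged out of the window, then report the rest.
        cutoff = time.monotonic() - self.window_seconds
        while self.failures and self.failures[0] < cutoff:
            self.failures.popleft()
        return len(self.failures)
```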