no-https: a plain-HTTP to HTTPS proxy

From	Ivan Shmakov <ivan@siamics.net>
Newsgroups	comp.infosystems.www.misc, comp.misc
Subject	no-https: a plain-HTTP to HTTPS proxy
Date	2018-09-16 07:07 +0000
Organization	A noiseless patient Spider
Message-ID	<87d0tdgbt4.fsf@violet.siamics.net>

	[Cross-posting to news:comp.misc as the issue of plain-HTTP
	unavailability was recently discussed there.]

	It took me about a day to write a crude but apparently (more or
	less) working HTTP to HTTPS proxy, which I hope to beat into
	shape and release via news:alt.sources around next Wednesday
	or so.  (FTR, the code is currently under 600 LoC, or 431 LoC
	excluding comments and empty lines.)  Some design notes are below.


    Basics

	The basic algorithm is as follows:

	1. receive a request header from the client; we only allow
	   GET and HEAD requests for now, as we do not support request
	   /bodies/ as of yet;

	2. determine the server to contact and connect to it;

	3. send the header to the server;

	4. receive the response header;

	5. if that's an https: redirect:

	5.1. connect over TLS, alter the request (Host:, "request target")
	     accordingly, go to step 3;

	6. strip certain headers (such as Strict-Transport-Security: and
	   Upgrade:, but also Set-Cookie:) off the response and send the
	   result to the client;

	7. copy up to Content-Length: octets from the server to the
	   client -- or all the remaining data if no Content-Length:
	   is given (somewhat surprisingly, this also seems to work with
	   the "chunked" coding, which the code does not otherwise
	   consider);

	8. close the connection to the server and repeat from step 1
	   so long as the client connection remains active.
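
	A minimal sketch of steps 5 and 6, in Python purely for
	illustration.  The function names, the set of redirect status
	codes, and the representation of headers as (name, value) pairs
	are assumptions of the sketch, not taken from the actual code.

```python
from urllib.parse import urlsplit

# Headers dropped in step 6; the set mirrors the examples named in the list.
STRIPPED = {"strict-transport-security", "upgrade", "set-cookie"}
# Assumption: which status codes count as a "redirect" for step 5.
REDIRECTS = {301, 302, 303, 307, 308}


def https_redirect(status, headers):
    """Step 5: return (host, port, request_target) for an https: redirect.

    Returns None when the response is not a redirect to an https: URL.
    """
    if status not in REDIRECTS:
        return None
    for name, value in headers:
        if name.lower() == "location":
            url = urlsplit(value.strip())
            if url.scheme != "https" or not url.hostname:
                return None
            target = url.path or "/"
            if url.query:
                target += "?" + url.query
            return url.hostname, url.port or 443, target
    return None


def strip_headers(headers):
    """Step 6: drop the headers a plain-HTTP client should not see."""
    return [(n, v) for n, v in headers if n.lower() not in STRIPPED]
```

	The returned host and request target are what step 5.1 would
	put into Host: and the request line before going back to step 3.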

	The proxy uses select(2) so that socket reads do not block, and
	supports an arbitrary number (up to the system-enforced limits)
	of concurrent connections.  For simplicity, socket writes /are/
	allowed to block.  (Hopefully that is not a problem for
	proxy-to-server connections most of the time, and even less so
	for proxy-to-client ones; assuming no malicious intent on the
	part of either, obviously.  The latter case may be mitigated by
	using a "proper" HTTP proxy, such as Polipo, in front of this one.)
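
	The read side of such a design might look roughly like the
	following Python sketch.  The helper name read_ready and its
	parameters are hypothetical; the only point is that recv() is
	called solely on sockets select() has reported as readable.

```python
import select
import socket


def read_ready(socks, timeout=1.0, bufsize=4096):
    """Read once from every socket that select() reports as readable.

    Returns {socket: bytes}; an empty bytes value means the peer closed.
    """
    readable, _, _ = select.select(socks, [], [], timeout)
    return {s: s.recv(bufsize) for s in readable}
```

	A caller would loop over read_ready() and forward whatever
	arrives with sendall(), which may block until the peer accepts
	the data; that is the simplification described above.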


    Dealing with the https: references

	There was an idea of transparently replacing https: references
	in HTML and XML attributes with scheme-relative ones (like, e. g.,
	https://example.com/ to //example.com/.)  So far, that fails
	more often than it works, for two primary reasons: compression
	(although that can be solved by forcing Accept-Encoding: identity
	in requests) -- and the fact that by the time such filtering can
	take place, we've already sent the Content-Length: (if any) for
	the original (unaltered) body to the client!
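
	To illustrate the Content-Length: problem, here is a small
	Python sketch of such a rewrite.  The regular expression and the
	function name are assumptions made for the example (it only
	looks at quoted attribute values); each rewritten reference
	makes the body six octets shorter than what was announced.

```python
import re

# Matches https:// at the start of a quoted attribute value, e.g. href="https://...
HTTPS_ATTR = re.compile(rb"""(=\s*["'])https://""", re.IGNORECASE)


def to_scheme_relative(body):
    """Rewrite attribute values starting with https:// to start with //."""
    return HTTPS_ATTR.sub(rb"\1//", body)


body = b'<img src="https://static.example.com/useless-stock-photo.jpeg" />'
out = to_scheme_relative(body)
shrunk_by = len(body) - len(out)  # octets by which Content-Length: is now off
```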

	Also, as the code does not currently handle the "chunked" coding,
	references split across chunks will not be handled.  (The code
	should handle references split across bufferfuls of data, though.)
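
	One way to cope with references split across bufferfuls is to
	hold back any trailing octets that could be the beginning of a
	reference until more data arrives.  The Python sketch below is
	an assumption about how that could be done, not the actual code;
	for brevity it matches only the literal "https:// (double quote,
	lower case).

```python
class StreamRewriter:
    """Rewrite '"https://' to '"//' in a byte stream fed in arbitrary pieces."""

    NEEDLE = b'"https://'
    REPLACEMENT = b'"//'

    def __init__(self):
        self._tail = b""

    def feed(self, data):
        """Return the rewritten output that is safe to forward so far."""
        buf = (self._tail + data).replace(self.NEEDLE, self.REPLACEMENT)
        # Hold back the longest suffix that could be the start of NEEDLE.
        keep = 0
        for n in range(min(len(self.NEEDLE) - 1, len(buf)), 0, -1):
            if self.NEEDLE.startswith(buf[-n:]):
                keep = n
                break
        self._tail = buf[len(buf) - keep:]
        return buf[:len(buf) - keep]

    def close(self):
        """Return whatever was held back once the stream has ended."""
        tail, self._tail = self._tail, b""
        return tail
```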

	Two possible ways to solve that would be to, for desired
	Content-Type: values, either retrieve the whole response in full
	before altering and forwarding it to the client, /or/ to implement
	support for the "chunked" coding and force its use there
	(stripping Content-Length: off the original response, if any.)
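
	For reference, producing the "chunked" coding is simple: each
	chunk is its size in hexadecimal, CRLF, the data, CRLF, and a
	zero-sized chunk ends the body.  A Python sketch follows (the
	function names are made up for the example, and trailer fields
	are not produced).

```python
def encode_chunk(data):
    """Frame one non-empty piece of body data as an HTTP/1.1 chunk."""
    if not data:
        raise ValueError("an empty chunk would terminate the body")
    return b"%x\r\n%s\r\n" % (len(data), data)


LAST_CHUNK = b"0\r\n\r\n"  # zero-length chunk, no trailer fields


def encode_body(pieces):
    """Yield the chunked encoding of an iterable of byte strings."""
    for piece in pieces:
        if piece:  # skip empty pieces; they would end the body early
            yield encode_chunk(piece)
    yield LAST_CHUNK
```

	With this, a rewritten bufferful can be forwarded as soon as it
	is ready, whatever its final length, provided the response
	header carries Transfer-Encoding: chunked instead of
	Content-Length:.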

	I suppose both approaches can be implemented, with the first
	used, say, when Content-Length: is below a configured limit,
	although that increases the complexity of the code, which is
	something I'd rather avoid.

	That said, I don't think the https: references /should/ be an
	issue in practice, as most links ought to be relative in the
	first place, such as:

    <p ><a href="page2.html" >Continue reading of this article</a>,
    or <a href="/" >go back to the top page.</a></p>

	However, I suspect that images and such may be a common
	exception in practice, like:

    <img src="https://static.example.com/useless-stock-photo.jpeg" />

	Which of course would've worked just as well (and required no
	specific action on the part of this proxy) if written as:

    <img src="//static.example.com/useless-stock-photo.jpeg" />


    Making responses even better

	Other possible response alterations may include removing <link />
	elements and Link: HTTP headers pointing to JavaScript code
	(running arbitrary software from the Web is a bad idea, and
	doing so while forgoing the meager TLS protection isn't making
	it better) /and/ also <script /> elements.  The latter, in turn,
	will probably either require rather complex state tracking --
	or getting the server response in full before the alterations
	can take place.
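
	As a rough illustration of the "response in full" variant, the
	Python sketch below drops <script /> elements using the standard
	html.parser module.  The class and function names are invented
	for the example, the output is not octet-for-octet identical to
	the input (end tags come out lower-cased, for one), and <link />
	elements and Link: headers are not handled.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit parsed HTML, minus <script> elements and their content."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self._in_script = False

    def _emit(self, text):
        if not self._in_script:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        else:
            self._emit("</%s>" % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)


def strip_scripts(html):
    """Return html (a str holding the whole document) without scripts."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```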


	Thoughts?

-- 
FSF associate member #7257  np. Nine Lives -- Slaygon
