Week 8 Scalability and Security - CS50's Web Programming with Python and JavaScript

Scalability and Security

Scalability

Launch our sites so they can be accessed by anyone on the internet.

In order to do this, we run sites on servers, which are physical pieces of hardware dedicated to running applications. Servers can either be on-premise (we own and maintain the physical servers) or in the cloud (owned by a different company, and we pay to rent server space).

- Customization: Hosting your own servers gives you the ability to decide exactly how they work, allowing for more flexibility than cloud-based hosting.

- Expertise: Hosting an application on the cloud is much simpler than maintaining your own servers.

- Cost: Since server-hosting sites need to make a profit, they charge you more than it costs them to maintain their on-premise servers, making cloud-based servers more expensive. However, the startup costs of running on-premise servers can be high, as you need to purchase physical servers and potentially hire someone with the expertise to set them up.

- Scalability: Scaling is easier when hosting on the cloud. / For example, if we host a site on-premise that gets 500 visits per day, and then it starts getting 500,000 visits per day, we would have to order and set up more physical servers to handle the requests, and in the meantime many users would not be able to access the site. / Most cloud hosting providers will allow you to rent server space flexibly, paying based on how much traffic your site sees.

> When a user sends an HTTP request to this server, the server should send back a response. However, in reality, most servers get far more than one request at a time.

> Issue of scalability: A single server can handle only so many requests at once, forcing us to make plans about what to do when our one server is overworked.

- Whether we decide to host on-premise or on the cloud, we have to determine how many requests a server can handle without crashing, which can be done using any number of benchmarking tools, including Apache Bench (ab).

Scaling

Once we have some upper limit on how many requests our server can handle, we can think about how we want to handle the scaling of our application.

> Two different approaches to scaling include:

1. Vertical Scaling: In vertical scaling, when our server is overwhelmed we simply buy or build a larger server. This is limited, as there is an upper limit on how powerful a single server can be.

2. Horizontal Scaling: In horizontal scaling, when our server is overwhelmed we buy or build more servers, and then split the requests among our multiple servers.

Load Balancing

When we use horizontal scaling, we're faced with the additional problem of how we decide which servers are assigned to which requests.

> Load balancer: Another piece of hardware that intercepts incoming requests, and then assigns those requests to one of our servers.

> A number of different methods for deciding which server receives which requests

- Random: In this simple method, the load balancer will decide randomly which server it should assign a request to

- Round-Robin: In this method, the load balancer will alternate which server receives an incoming request. / If we have three servers, the first request might go to server A, the second to server B, the third to server C, and the fourth back to server A.

- Fewest Connections: In this method, the load balancer looks for the server that is currently handling the fewest requests, and assigns the incoming request to that server. This allows us to make sure we're not overworking one particular server, but it also takes longer for the load balancer to calculate the number of requests each server is currently handling than it does for it to simply choose a random server.

> There's no method of load balancing that is strictly better than all other methods, and there are many different methods used in practice.
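The three methods above can be sketched in a few lines of Python. This is only an illustration: the `LoadBalancer` class, its method names, and the server labels are made up for this example, and a real load balancer would update the connection counts as requests start and finish.

```python
import itertools
import random


class LoadBalancer:
    """Toy load balancer that tracks open connections per server."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(self.servers)
        # Number of requests each server is currently handling.
        self.active = {server: 0 for server in self.servers}

    def pick_random(self):
        # Random: any server, with no bookkeeping at all.
        return random.choice(self.servers)

    def pick_round_robin(self):
        # Round-robin: A, B, C, A, B, C, ...
        return next(self._cycle)

    def pick_fewest_connections(self):
        # Fewest connections: has to look at every server's count,
        # which is the extra work mentioned above.
        return min(self.servers, key=lambda s: self.active[s])


balancer = LoadBalancer(["A", "B", "C"])
round_robin_order = [balancer.pick_round_robin() for _ in range(4)]

# Pretend the servers are currently this busy.
balancer.active.update({"A": 5, "B": 2, "C": 7})
least_busy = balancer.pick_fewest_connections()
```

Here `round_robin_order` cycles back to the first server on the fourth request, and `least_busy` is whichever server has the smallest count in `active`.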

> One problem that can arise when scaling horizontally is that we might have sessions that are stored on one server but not another, and we don't want users to have to re-enter information just because the load balancer pushes their request to a new server.

> Multiple approaches to solving the problem of sessions:

- Sticky Sessions: Once a user visits a site, the load balancer remembers which server they were sent to first, and makes sure to send them to the same one. One big concern with this method is that we could end up having a large number of users sticking to one server, causing it to crash.

- Database Sessions: All sessions are stored in a database that all servers have access to. This way, a user's information will be available no matter which server they are assigned to. / The drawback here is that it takes additional time and computing power to read from and write to a database.

- Client-Side Sessions: Rather than storing information on our server, we can choose to store them locally on the user's web browser as cookies. / The drawbacks to this method include the security concern of users creating false cookies that allow them to log in as another user, and the computational concern of sending cookie information back and forth with every request.

- Like with load balancing, there's no best answer to the sessions problem, and the method you choose will often depend on your specific circumstances.
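One common way to reduce the forged-cookie risk of client-side sessions is to sign the cookie value, so the server can tell whether it was tampered with. The sketch below is a minimal illustration using Python's standard `hmac` module; `SECRET_KEY`, `sign`, and `verify` are hypothetical names for this example, and it is not how any particular framework implements its sessions.

```python
import hashlib
import hmac

# Hypothetical secret known only to the servers; in practice, load it from
# configuration rather than writing it in the source code.
SECRET_KEY = b"replace-with-a-real-secret"


def sign(value: str) -> str:
    """Return 'value.signature' so tampering can be detected later."""
    signature = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{signature}"


def verify(cookie: str):
    """Return the original value if the signature matches, else None."""
    value, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(signature.encode(), expected.encode()):
        return value
    return None


cookie = sign("user_id=42")
# A user edits the value but cannot produce a matching signature.
forged = cookie.replace("user_id=42", "user_id=1")
```

`verify(cookie)` gives back the original value, while `verify(forged)` returns `None` because the signature no longer matches the edited value. Signing only detects tampering; the value itself is still readable by the user.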

Autoscaling

> Many websites are visited much more frequently at certain times.

- For example, if we decide to launch our "Is it New Year's?" app from earlier, we might expect it to get a lot more traffic in late December and early January than at any other time of year.

- If we buy enough servers for the site to stay active during the winter, those servers would be sitting idle for the rest of the year, wasting space and energy.

- Autoscaling: Common in cloud computing, where the number of servers being used by your site can grow and shrink based on the number of requests it gets.

- Autoscaling is not free: it takes time to determine that a new server is needed and to launch that server. And the more servers you have running, the more opportunity there is for one of them to fail.
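The core decision an autoscaler makes can be sketched as a small calculation. This is a simplified illustration, not any cloud provider's actual policy: the function name, the idea of a fixed per-server capacity, and the minimum and maximum limits are all assumptions made for the example.

```python
import math


def desired_servers(requests_per_second, capacity_per_server,
                    min_servers=1, max_servers=20):
    """Return how many servers to run for the current load."""
    # Enough servers to cover the load, rounded up to a whole server.
    needed = math.ceil(requests_per_second / capacity_per_server)
    # Never drop below the minimum or grow past the maximum.
    return max(min_servers, min(max_servers, needed))


quiet_day = desired_servers(50, capacity_per_server=100)
new_years_eve = desired_servers(950, capacity_per_server=100)
```

The minimum keeps the site reachable when traffic is low, and the maximum puts a ceiling on cost.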

Server Failure

> Single Point of Failure: A piece of hardware that, after failing, will cause the entire site to crash. Having multiple servers can help to avoid this.

- When scaling horizontally, the load balancer can detect which servers have crashed by sending periodic heartbeat requests to each server, and then stop assigning new requests to servers that have crashed.

+ Every few seconds, the load balancer sends a quick request to all the servers, and the servers are supposed to respond. Using that information, the load balancer also learns a little about the latency of each server: how long it took to respond to the request.

+ If the load balancer itself happens to fail, nothing is going to work, because the load balancer is responsible for directing traffic to all of the servers.

- At this point, it seems we have simply moved our single point of failure from a server to the load balancer, but we can account for this by having backup load balancers available if our original happens to crash.
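The heartbeat check described above might look roughly like this. The sketch assumes a hypothetical `ping` callable that returns a server's response time in seconds or raises an exception when there is no answer; `fake_ping` stands in for a real network request so the example can run on its own.

```python
def healthy_servers(servers, ping, timeout=1.0):
    """Return the servers that answered the heartbeat in time."""
    alive = []
    for server in servers:
        try:
            latency = ping(server)
        except Exception:
            continue  # no response at all: treat the server as down
        if latency <= timeout:
            alive.append(server)
    return alive


def fake_ping(server):
    """Stand-in for a real network check, used only in this example."""
    latencies = {"A": 0.05, "B": 0.20}
    if server not in latencies:
        raise ConnectionError(f"{server} did not respond")
    return latencies[server]


# Server C has crashed, so it drops out of the rotation.
alive = healthy_servers(["A", "B", "C"], fake_ping)
```

A load balancer would run a check like this on a timer and only assign new requests to the servers in `alive`.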

Scaling Databases

> We use SQLite which stores data inside a file on the server, but as we store more and more data, it sometimes makes more sense to store data in a number of different files, and maybe even on a separate server.

- This brings up the problem then of what to do when our database server can no longer handle all of the requests coming in.

> Methods we can use to mitigate this problem

- Vertical Partitioning: Similar to the one we used when first discussing SQL, where we split our data into multiple different tables rather than having redundant information in one table.

- Horizontal Partitioning: This involves storing multiple tables with the same format, but different information. / For example, we could split a 'flights' table into a 'domestic_flights' table and an 'international_flights' table. This way, when we wish to search for a flight from JFK to LHR, we don't have to waste time searching through a table full of domestic flights. / One drawback is that it can be expensive to join multiple tables once they have been split.
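As a small illustration of horizontal partitioning, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names follow the example above; the column names and sample rows are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Both tables share the same format but hold different rows.
schema = "(id INTEGER PRIMARY KEY, origin TEXT, destination TEXT)"
conn.execute(f"CREATE TABLE domestic_flights {schema}")
conn.execute(f"CREATE TABLE international_flights {schema}")

conn.execute(
    "INSERT INTO domestic_flights (origin, destination) VALUES ('JFK', 'LAX')"
)
conn.execute(
    "INSERT INTO international_flights (origin, destination) VALUES ('JFK', 'LHR')"
)

# A search for JFK to LHR only has to look at the international table.
international = conn.execute(
    "SELECT origin, destination FROM international_flights "
    "WHERE origin = ? AND destination = ?",
    ("JFK", "LHR"),
).fetchall()

# The drawback: seeing every flight now means recombining both tables.
all_flights = conn.execute(
    "SELECT origin, destination FROM domestic_flights "
    "UNION ALL "
    "SELECT origin, destination FROM international_flights"
).fetchall()
```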

Database Replication

> After we've scaled a database, we're still left with a single point of failure: If our database server crashes, all of our data could be lost.

- Just as we added more servers to avoid a single point of failure, we can add copies of our database to make sure the failure of one database does not shut down our application.

> Most popular methods of database replication:

1. Single-Primary Replication: There are multiple databases, but only one of them is considered to be the primary database, meaning you can read from and write to one of the databases, but only read from each of the others.

- When the primary database is updated, the other databases are then updated to match the primary one.

- All of the databases are kept in sync. If you run a query on any of these databases to select some information, you'll get the same results from every one of them.

- One drawback is that it still contains a single point of failure when it comes to writing to the database.

- If the primary database fails, we're no longer able to write data.

2. Multi-Primary Replication: All of the databases can be read from and written to.

This solves the problem of a single point of failure, but it comes with a tradeoff: it is now much more difficult to keep all databases up to date because each database must be aware of changes to all other databases.

> This system also sets us up for the possibility of some conflicts:

- Update Conflict: With multiple databases, one user may attempt to edit a row in one database while another user attempts to edit that same row in a different database, causing a problem when the databases sync up.

- Uniqueness Conflict: Every row in a SQL database must have a unique identifier, and we may run into the problem that we assign the same id to two different entries in two different databases.

- Delete Conflict: One user may delete a row while another user attempts to update it.
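The uniqueness conflict is easy to reproduce. In the sketch below, two separate in-memory SQLite databases stand in for two primaries that have not synced yet; the `users` table and the names are made up for this example.

```python
import sqlite3


def new_primary():
    """Create one stand-alone database that assigns its own ids."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    return db


primary_a = new_primary()
primary_b = new_primary()

# Two different users sign up at about the same time on different primaries.
id_a = primary_a.execute("INSERT INTO users (name) VALUES ('alice')").lastrowid
id_b = primary_b.execute("INSERT INTO users (name) VALUES ('bob')").lastrowid

# Each database handed out the next id it knew about, so the ids collide.
conflict = id_a == id_b
```

When the two databases try to sync, both rows claim the same id, and the system needs some rule for resolving that.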

Caching

> Whenever we're working with larger databases, it is important to recognize that every interaction we have with a database is costly. We wish to minimize the number of calls to our database server.

- For example, the New York Times may have a database of all of their articles that is queried, and a template that is rendered, every time someone loads the home page. This would be a waste of resources, as the articles displayed on the home page likely do not change much from second to second.

- Caching: The idea of storing some information in a more accessible location if we anticipate needing it again in the near future.

> One way that caching can be implemented is by storing data on the user's web browser, so that when a user loads certain pages, no request to the server even needs to be sent.

- To do this, include this line in the header of an HTTP response:

Cache-Control: max-age=86400

- This specifies the number of seconds that the browser should cache this resource for.

> This tells the browser that when visiting a page, as long as it has already loaded that page within the last 86400 seconds (24 hours), no request has to be made to the server.

- This method is used commonly by web browsers especially with files that are less likely to change over short periods such as a CSS file.

- To take more control over this process, we can also add an 'ETag' to the HTTP response header, which is a unique sequence of characters that represents a specific version of a document.

- If the web server were ever to change that CSS file, the corresponding ETag will also change.

- This is useful because future requests can include this tag and compare it to the tag of the latest document on the server, only returning an entire document when the two differ.

- Once the max-age has elapsed and the browser asks for the resource again, the ETag still saves work: if the ETag value hasn't changed, there's no need to download a whole new copy of the file.

- Drawbacks: If the resource changes before the max-age has elapsed, the browser may keep loading the cached copy, so a user might see an outdated version of a web page.
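To make the ETag exchange concrete, here is a minimal sketch of the server side. It assumes the ETag is derived from a hash of the file's content, which is one common choice but not the only one, and `make_etag` and `respond` are hypothetical helper names. The browser sends the tag it already has in the `If-None-Match` request header, and the server answers with status 304 (Not Modified) when the tags match.

```python
import hashlib


def make_etag(content: bytes) -> str:
    """Derive a tag from the content, so any change produces a new tag."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'


def respond(content: bytes, if_none_match=None):
    """Return (status, headers, body) for a request for `content`."""
    etag = make_etag(content)
    headers = {"Cache-Control": "max-age=86400", "ETag": etag}
    if if_none_match == etag:
        return 304, headers, b""  # Not Modified: the browser reuses its copy
    return 200, headers, content


css_v1 = b"body { color: black; }"
css_v2 = b"body { color: navy; }"

# First visit: the full file is sent along with its ETag.
status_first, headers_first, _ = respond(css_v1)
# Later visit, file unchanged: the tags match, so no body is sent.
status_same, _, body_same = respond(css_v1, headers_first["ETag"])
# Later visit, file changed: the tags differ, so the new file is sent.
status_changed, _, body_changed = respond(css_v2, headers_first["ETag"])
```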

> In addition to client-side caching, it can often be helpful to include a cache on the server side. In this setup, all of our servers have access to a shared cache.

> Django provides its own cache framework which will allow us to incorporate caching in our projects. Several ways of implementing a cache:

- Per-View Caching: This allows us to decide that once a specific view has been loaded, that same view can be rendered without going through the function for the next specified amount of time.

- Template-Fragment Caching: This caches specific parts of a template so they do not have to be re-rendered. / For example, we may have a navigation bar that rarely changes, meaning we could save time by not reloading it.

- Low-Level Cache API: This allows you to do more flexible caching, essentially storing any information you would like to.
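The idea behind per-view caching can be sketched in plain Python. This is not Django's implementation (in Django, per-view caching is typically applied with the `cache_page` decorator from `django.views.decorators.cache`); the `cache_for` decorator, the injectable `clock`, and `render_home_page` are hypothetical names used only to show the mechanism.

```python
import functools
import time


def cache_for(seconds, clock=time.monotonic):
    """Cache a function's result per argument tuple for `seconds`."""
    def decorator(func):
        store = {}  # args -> (expiry_time, value)

        @functools.wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the real work
            value = func(*args)
            store[args] = (now + seconds, value)
            return value
        return wrapper
    return decorator


calls = []
fake_time = [0.0]  # a fake clock so the example does not have to wait


@cache_for(60, clock=lambda: fake_time[0])
def render_home_page(section):
    calls.append(section)  # stands in for a database query plus template render
    return f"<h1>{section}</h1>"


first = render_home_page("news")
second = render_home_page("news")  # served from the cache
fake_time[0] = 61.0
third = render_home_page("news")   # the entry has expired, so it is rebuilt
```

Only the first and third calls reach the function body; the second is answered from the cache.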

Security

How to make sure our web applications are secure

Git and GitHub

> One of the greatest strengths of Git and GitHub is how easy they make it to share and contribute to open-source software, which can be seen and contributed to by anyone on the internet.

- Drawback: If at any point you commit a file that includes some private credentials like a password or API key, those credentials could become publicly available. / Someone who has access to the Git repository has access not just to the latest version of your code, but to every version of the code. That person could, theoretically, go back through the history of the repository and find the commit that contains the credentials.

HTML

> There are many vulnerabilities that arise from using HTML.

- Phishing Attack: Occurs when a user who thinks they are going to one page is actually taken to another. / This is worth keeping in mind when interacting with the web ourselves.

- For example, a malicious user might write out this HTML:

<!DOCTYPE html>
<html lang="en">
    <head>
        <title>Link</title>
    </head>
    <body>
        <a href="https://cs50.harvard.edu/">https://www.google.com/</a>
    </body>
</html>

> The fact that HTML is actually sent to a user as part of a response opens up more vulnerabilities, because everyone has access to the layout and style that allowed you to create your site.

- For example, a hacker could go to Bank of America's website, copy all of its HTML, and paste it into their own site, creating a site that looks exactly like Bank of America's. / The hacker could then redirect the login form on the page so all usernames and passwords are sent to them.

HTTPS

> Most interactions that occur online follow HTTP protocol, although now more and more transactions use HTTPS, which is an encrypted version of HTTP.

- While using these protocols, information is transferred from one computer to another through a series of servers.

- There's often no way to ensure that all of these transfers are secure, so it is important that all of the transferred information is encrypted, meaning that the characters of the message are altered so that the sender and receiver of the message can understand it, but no one else can.

Secret-Key Cryptography

> The sender and receiver both have access to a secret key that only they know. Then, the secret key is used by the sender to encrypt a message which is then sent to the recipient who uses the secret key to decrypt the message.

- Extremely secure, but it produces a big problem when it comes to practicality.

- In order for it to work, both the sender and the receiver must have access to the secret key, which means they must meet in person to exchange a key securely.

- With the number of different websites we interact with on a daily basis, it is clear that in-person meetups are not an option.
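As a toy illustration of secret-key cryptography, the sketch below XORs each byte of a message with a random key of the same length. The notes don't name a specific cipher, so this choice is an assumption made for the example; `xor_bytes` is a hypothetical helper and this is not something to use for real traffic. What it shows is that the same key both encrypts and decrypts, which is why both sides must already share it.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the matching byte of key."""
    if len(key) < len(data):
        raise ValueError("key must be at least as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"transfer 10 to alice"
# Both the sender and the receiver must somehow already hold this key.
shared_key = secrets.token_bytes(len(message))

ciphertext = xor_bytes(message, shared_key)    # sender encrypts
recovered = xor_bytes(ciphertext, shared_key)  # receiver decrypts with the same key
```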

Public-Key Cryptography

> There are two keys: Public key, Private key

- Once these keys are established, a sender could look up the public key of a recipient and use it to encrypt a message, and then the recipient could use their private key to decrypt the message that was encrypted using the public key.

- When we use HTTPS rather than HTTP, we know that our request is being secured using public-key encryption.
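The public/private split can be illustrated with textbook RSA, which is one public-key scheme (the notes don't say which scheme is meant, so this is an assumption). The primes here are tiny so the numbers stay readable; real keys are vastly larger, and real systems add padding and other safeguards this sketch leaves out.

```python
# Tiny textbook primes, chosen only so the numbers stay readable.
p, q = 61, 53
n = p * q                  # the modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent
d = pow(e, -1, phi)        # private exponent: modular inverse of e (Python 3.8+)

public_key = (e, n)        # anyone may know this
private_key = (d, n)       # only the recipient knows this


def encrypt(message: int, key) -> int:
    """Encrypt a number smaller than n using the recipient's public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext: int, key) -> int:
    """Recover the number using the recipient's private key."""
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


ciphertext = encrypt(65, public_key)
plaintext = decrypt(ciphertext, private_key)
```

The sender never needs the private key: knowing `public_key` is enough to encrypt, but only the holder of `private_key` can recover the original number.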

Databases

> Make sure that our databases are secure.

- One common thing we'll need to store is user information, including usernames and passwords.

- However, you never actually want to store passwords in plaintext in case an unauthorized person gets access to your database.

- Instead, we'll want to use a hash function, that takes in some text and outputs a seemingly random string, to create a hash of each password.

> A hash function is one-way. It can turn a password into a hash, but cannot turn a hash back into a password.

- Any company that stores user information this way does not actually know any of the users' passwords, meaning each time a user attempts to sign in, the entered password will be hashed and compared to the existing hash.

- This process is already handled for us by Django.

- One implication of this storage technique is that when a user forgets their password, a company has no way of telling them what their old password was, meaning they would have to make a new one.

> There are some cases where you'll have to decide, as a developer, how much information you are willing to leak.
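Django handles this for us, but the hash-and-compare flow can be sketched with Python's standard library. The example below uses `hashlib.pbkdf2_hmac` and adds a random salt to each password, a detail the notes don't cover but which is standard practice; the function names and the iteration count are assumptions for this example, not Django's actual settings.

```python
import hashlib
import hmac
import os


def hash_password(password: str, salt=None):
    """Return (salt, hash). A fresh random salt is made if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest


def check_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Hash the login attempt with the stored salt and compare the results."""
    _, attempt = hash_password(password, salt)
    return hmac.compare_digest(attempt, stored)


# At sign-up, only the salt and the hash are saved, never the password itself.
salt, stored_hash = hash_password("hunter2")
```

At login, `check_password` hashes whatever the user typed and compares it to `stored_hash`; the original password is never recovered.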

- For example, many sites have a page for forgotten password. As a developer, you may want to include either a success or error message after submission.

> But then anyone could use that message to determine whether a given email address is registered with the site.

- This could be totally fine in cases where whether or not a person uses the site is inconsequential, but extremely reckless if the fact that you are a member of a certain site could put you in danger.

> Another way data could be leaked is in the time it takes for a response to come back.

- It probably takes less time to reject someone with an invalid email than a person with a correct email address and a wrong password.

> We must be wary of SQL Injection Attacks whenever we use straight SQL queries in our code.

- For example, a hacker trying to log into a website might enter a username that ends with a double quotation mark followed by two hyphens ("--). The quotation mark closes the string in the query, and the two hyphens begin a comment in SQL, so the rest of the query is ignored, effectively skipping any kind of password checking.
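The attack, and the usual fix, can be reproduced with the built-in `sqlite3` module. The notes describe the attack with a double quotation mark; this sketch quotes strings with single quotes, so the attack string uses a single quote instead. The table, the user, and the function names are made up, and the password is stored in plaintext only to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('harry', '12345')")


def login_vulnerable(username, password):
    # DANGEROUS: user input is pasted straight into the SQL text.
    query = (
        f"SELECT * FROM users WHERE username = '{username}' "
        f"AND password = '{password}'"
    )
    return conn.execute(query).fetchone() is not None


def login_safe(username, password):
    # Placeholders make the driver treat input as data, never as SQL.
    query = "SELECT * FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone() is not None


# The quote closes the username string; the hyphens comment out the rest.
attack = "harry'--"
```

With the vulnerable version, `login_vulnerable(attack, "anything")` succeeds without the password, because the password check ends up inside a comment. The parameterized version rejects the same input.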

APIs

> We often use JavaScript in conjunction with APIs to build single-page applications. When we build our own API, there are a few methods we can use to keep our API secure.

- API Keys: Only process requests from API clients who have a key you have provided to them.

- Rate Limiting: Limit the number of requests any one user can make in a given time frame. This helps protect against Denial of Service (DoS) attacks, in which a malicious user makes so many calls to your API that it crashes.

- Route Authentication: There are many cases where we don't want to give anyone access to all of our data, so we can use route authentication to make sure only specific users can see specific data.
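Rate limiting per API key could be sketched as follows. This is one simple approach (remembering the timestamps of recent requests for each key), not the only one; the `RateLimiter` class, the key string, and the injectable `clock` are hypothetical and exist only for this example.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `limit` requests per `window` seconds for each API key."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.history = defaultdict(deque)  # api_key -> recent request times

    def allow(self, api_key):
        now = self.clock()
        recent = self.history[api_key]
        # Forget requests that have fallen out of the time window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # over the limit: reject this request
        recent.append(now)
        return True


fake_clock = [0.0]  # a fake clock so the example does not have to wait
limiter = RateLimiter(limit=3, window=60, clock=lambda: fake_clock[0])

results = [limiter.allow("key-123") for _ in range(4)]
fake_clock[0] = 61.0
after_window = limiter.allow("key-123")
```

The fourth request inside the window is rejected, and the same key is allowed again once the window has passed.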

Environment Variables

> Just as we want to avoid storing passwords in plaintext, we'll want to avoid including API keys in our source code.

- One common way of avoiding this is to use environment variables, or variables that are stored in your operating system or server's environment.

- Then, rather than including a string of text in our source code, we can include a reference to an environment variable.
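In Python, reading an environment variable is done through `os.environ`. In this sketch the variable name `MY_API_KEY` and the helper `get_api_key` are hypothetical; the point is that the key's value never appears in the source code.

```python
import os


def get_api_key(name="MY_API_KEY"):
    """Read a secret from the environment instead of the source code."""
    key = os.environ.get(name)
    if key is None:
        # Fail early with a clear message rather than sending an empty key.
        raise RuntimeError(f"Set the {name} environment variable first")
    return key
```

The variable is set outside the code, for example in the shell or in the hosting provider's settings, before the application starts.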

JavaScript

> There are a few types of attacks that malicious users may attempt using JavaScript.

- One example is Cross-Site Scripting, which is when a user writes their own JavaScript code and runs it on your website.

- For example, we have a Django application with a single URL

urlpatterns = [
    path("<path:path>", views.index, name="index")
]

-> Here's a single URL that just allows us to provide any path, and it's going to load the index view.

- A single view

def index(request, path):
    return HttpResponse(f"Requested Path: {path}")

- This website essentially tells the user what URL they have navigated to:

- But a user can now easily insert some JavaScript into the page by typing <script>alert("hello")</script> in the url:

- While this 'alert' example is fairly harmless, it wouldn't be much more difficult to include some JavaScript that manipulates the DOM or uses 'fetch' to send a request.
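The usual defense is to escape user-supplied text before placing it in the page, so the browser displays it instead of running it. The sketch below uses `html.escape` from Python's standard library; `render_path` is a hypothetical stand-in for the body that the view above returns.

```python
import html


def render_path(path: str) -> str:
    """Build the response body with the user-supplied path escaped."""
    return f"Requested Path: {html.escape(path)}"


body = render_path('<script>alert("hello")</script>')
```

After escaping, the angle brackets and quotes become HTML entities such as `&lt;` and `&quot;`, so the browser shows the text literally rather than treating it as a script tag.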

Cross-Site Request Forgery(CSRF)

> We discussed how we can use Django to prevent CSRF(Cross-Site Request Forgery) attacks, but what could happen without this protection?

- As an example, imagine a bank has a URL you could visit that transfers money out of your account. A person could easily create a link that would make this transfer:

<a href="http://yourbank.com/transfer?to=brian&amt=2800">
    Click Here!
</a>

> This attack can be more subtle than a link. If the URL is put in an image then it will be accessed as your browser attempts to load the image:

<img src="http://yourbank.com/transfer?to=brian&amt=2800">

- The transfer page is not an image, but that doesn't matter: all an image tag does is make a request to its source URL to get the image and then try to display it in the user's web browser.

> Because of this, whenever you are building an application that can accept some state change, it should be done using a POST request.

- Even if the bank requires a POST request, hidden form fields can still trick users into accidentally submitting a request.

- The following form doesn't wait for the user to click, it automatically submits.

<body onload="document.forms[0].submit()">
    <form action="https://yourbank.com/transfer" method="post">
        <input type="hidden" name="to" value="brian">
        <input type="hidden" name="amt" value="2800">
        <input type="submit" value="Click Here!">
    </form>
</body>

- <body onload="document.forms[0].submit()"> : When the body of the page is done loading, go to document.forms -meaning all of the forms for this web page- get the first one, and submit it.

- Even without the user clicking the button, as soon as this page is loaded, the form submits a POST request to the bank, attempting to transfer funds from one user to another.

> The above is an example of what Cross-Site Request Forgery might look like. We can stop attacks such as these by creating a CSRF token when loading a webpage, and then only accepting forms with a valid token.

- In a Django template, this is done by adding {% csrf_token %} inside the form.
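Conceptually, the token check works like the sketch below. This is a simplified illustration of the idea, not Django's actual implementation, and the function names are hypothetical: the server generates an unguessable token when it renders the form, remembers it, and later accepts a submission only if the form carries the same token.

```python
import hmac
import secrets


def new_csrf_token() -> str:
    """Generate an unguessable token to embed in the form and remember."""
    return secrets.token_urlsafe(32)


def is_valid_submission(session_token, form_token) -> bool:
    """Accept the form only if it carries the token we issued."""
    if not session_token or not form_token:
        return False
    return hmac.compare_digest(session_token.encode(), form_token.encode())


# Stored on the server side when the real page is rendered.
session_token = new_csrf_token()
```

A form on an attacker's page cannot know `session_token`, so its submission either has no token or the wrong one, and the request is rejected.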
