SQL-INFS3200

联系我们: 手动添加方式: 微信>添加朋友>企业微信联系人>13262280223 或者 QQ: 1483266981

1 INFS3200 Advanced Database Systems Prac 3: Data Linkage (5%) Semester 2, 2021 Due time: 4:00pm, Friday 15 Oct 2021 (Week 11) Submission: Submit your work online (INFS3200 Course Website) Introduction Learning objectives: Learn how to use JDBC/cx_Oracle to interact with Oracle DBMS, including the operations of create table, update table, query table, etc. Learn how to measure the performance of data linkage. Understand various string similarity measures, e.g., edit distance and Jaccard coefficient, for field-level data linkage. Marking Scheme: This Prac carries 5 marks for 5% of this course. 1. 2 marks: Complete two missions in Task 1, each mission is accounted for 1 mark. 2. 1 mark: Complete Task 2, the correct answer for each bullet point is worth 0.5 marks. 3. 1 mark: Complete Task 3. 4. 1 mark: Complete Task 4. Please include your screenshots of the data import and data linkage results (precision, recall, f- measure and time) in a word/pdf document, as well as your answers to the task questions. It is not mandatary to include your student ID in every screenshots as we can check the originality through your code submission. The submission should be a zip/rar file, which includes both the answer document and your source code files. Please format your document and code nicely to help tutor’s marking process. A poorly formatted document may receive a reduced mark. Submit your work to the Blackboard website of the course by 16:00 (i.e., 4pm), Friday Oct 15th 2021. Late Penalties (from ECP): “Where an assessment item is submitted after the deadline, without an approved extension, a late penalty will apply. The late penalty shall be 10% of the maximum possible mark for the assessment item will be deducted per calendar day (or part thereof), up to a maximum of seven (7) days. After seven days, no marks will be awarded for the item. A day is considered to be a 24 hour block from the assessment item due time. Negative marks will not be awarded.” 2 Part 1: Preparation of the Restaurant Data Set In this part, our objective is to prepare data for subsequent tasks. There are two options for data storage: (1) import data into Oracle database, or (2) store data as CSV files. Both solutions are implemented and available in either Java or Python. Although both options are available, we recommend you to try the database option, as you can learn more about how to interact with database using Java/Python code. Option 1: Import data to Oracle database through JDBC/cx_Oracle This option requires the following software: (1) Oracle platform used in Prac1 & Prac2, (2) Java JDK/Python library (Java 8/Python 3.8 recommended), and (3) Java/Python IDE. Here we give an example of Java with Eclipse, but others, like Intellij IDEA for Java or PyCharm for Python, will work the same. You need to go through the following 4 steps. Step 1: Log in and Create Users In this part, we first use “SQL Plus” to create a database user, then we need to connect user to “SQL Developer” and interact with an Oracle database. In SQL Plus Command Line window, login to Oracle with a default username “SYS AS SYSDBA” and password “Password1!”, as shown below. Follow the commands below to create a user: /*Enable user creation*/ ALTER SESSION SET “_ORACLE_SCRIPT”=TRUE; /* Create a user named “S1234567” (student id) with password “w” */ CREATE USER S1234567 IDENTIFIED BY w ACCOUNT UNLOCK DEFAULT TABLESPACE “USERS” TEMPORARY TABLESPACE “TEMP” PROFILE “DEFAULT”; /* Grant DBA privilege to “S1234567” */ 3 GRANT DBA TO S1234567; Same as that of the previous Prac1 & Prac2, please change “S1234567” to your student ID. Step 2: Use Oracle SQL Developer Open SQL Developer and connect the user that we have just created. Click the green “+” button to connect to the database. Fill the connection information in the prompted dialog window shown as above. The “connection name” is specified by the user, and the “username” is the “S1234567” we have just created. “Password” is the password for that user, i.e., “w” in this case. You also need to change SID to “orcl”. Then press “Connect” button to connect to the user “S1234567”. Step 3: Import the Data Linkage Project via Java IDE In this step, we import the Java code template to your Java IDE. Here we use Eclipse as our example. Other IDEs work in a similar way. From the Windows 10 “Start” menu, you can search the “Eclipse”. When you open the “eclipse” software, a dialog window will be prompted asking you to select a workspace, as shown below. Choose a folder where you want your Eclipse project to be stored. Connection name specified by user Change SID to orcl Username just created in the database Password for that user, “w” in this prac 4 Extract the DataLinkage Java project downloaded from Blackboard course website to the workspace. At the eclipse “Project Explorer” panel, mouse right-click and then choose the “import” function. From the prompted dialog window, click “General – Existing Projects into Workspace” and then press the “Next” button. Browse the file folder to find the DataLinkage project in your workspace and press “Finish”, as shown below. Now the project has been successfully imported into your workspace. Step 4: Understand JDBC/cx_Oracle Connection to Oracle DBMS The JDBC API is a Java API that can access any kind of tabular data, especially data stored in a Relational Database (the counterpart in Python is cx_Oracle, open a command prompt and enter “python -m pip install cx_Oracle –upgrade” to install cx_Oracle to Python). JDBC helps you to write Java applications that manage these three programming activities: Connect to a data source, like a database, Send queries and update statements to the database, Retrieve and process the results received from the database in answer to your query. In order to connect a Java application with the Oracle database, typically you need to follow five steps to perform database connectivity. In the following example, we use a database supported by Oracle 12c. Hence, we need to know some related information for the Oracle database management: Driver class: The driver class for the Oracle database is “oracle.jdbc.driver.OracleDriver”. URL: The URL for Oracle 12c database is “jdbc:oracle:thin:@localhost:1521:orcl” where “jdbc” is the API, “oracle” is the database, “thin” is the driver, “localhost” is the server name on which Oracle is running (we may also use IP address), “1521” is the port number, and “orcl” is the Oracle service name. Username: In this Prac, please use the “S1234567” user that you have just created. Password: Password is assigned at the time of creating a user, i.e., “w” in this case. 5 The following simple code fragment gives an example of a five-step process to connect to an Oracle database and perform certain queries. Example 1: In this code fragment, we create a connection con to the database and store the query result of “SELECT * FROM table” to the result set rs and display them. Browse the DataLinkage project in your IDE and find the package “Oracle”. We provide various Java classes for interacting with the Oracle database, including: DBConnection: Basic connection settings and functionalities. CreateTable: Create a table in the database InsertTable: Insert tuples into a table DeleteTable: Delete tuples from a table ReadTable: Select tuples from a table DropTable: Drop a table from the database In order to play with these functionalities, you need to first change the username to the user you just created in DBConnection.java class. import java.sql.*; class OracleCon { public static void main(String args[]) { try { // step1 load the driver class Class.forName(“oracle.jdbc.driver.OracleDriver”); // step2 create the connection object Connection con = DriverManager.getConnection(“jdbc:oracle:thin:@localhost:1521:orcl”, “S1234567”, “w”); // step3 create the statement object Statement stmt = con.createStatement(); // step4 execute query ResultSet rs = stmt.executeQuery(“select * from table”); while (rs.next()) System.out.println(rs.getInt(1) + ” ” + rs.getString(2) + ” ” + rs.getString(3)); // step5 close the connection object con.close(); } catch (Exception e) { System.out.println(e); } } } 6 CREATE TABLE The dataset we are going to use in this Prac is a list of restaurants at various locations. The table contains four attributes: ID, Name, Address, and City. We use the following SQL statement to create this table in the database: CREATE TABLE RESTAURANT (ID NUMBER NOT NULL, NAME VARCHAR2(100 BYTE) NOT NULL, ADDRESS VARCHAR2(255 BYTE) NOT NULL, VARCHAR2(50 BYTE) NOT NULL, RESTAURANT_PK PRIMARY KEY(ID) ENABLE); Run the CreateTable.java class to execute this SQL query using JDBC. Open CreateTable.java, right-click the class, and choose “Run As – Java Application”. When the program terminates, check the table you created in SQL Developer. INSERT INTO TABLE The class InsertTable.java reads all the restaurant records from the excel file we have provided in this Prac, i.e. “datarestaurant.csv”, and insert these records one by one into the RESTAURANT table using JDBC. Run InsertTable.java and then check the result in SQL Developer. 7 SELECT FROM TABLE The class ReadTable.java reads all the restaurant records from table RESTAURANT in the database by executing the SQL query, i.e., “SELECT * FROM RESTAURANT”, using JDBC. It then prints out these records (ID, Name, Address, City) on the Eclipse Console, as shown below. There should be 858 records in total, which can be seen at the end of the printed screen. Option 2: Read data from the CSV File In this option, IDE (integrated development environment) is used, which can be either a Java IDE or Python IDE, determined by the language that you prefer to use. 1. Java IDE: In Java, we provide you with the data loader functionality in CSVLoader.java under Data package. The function restaurantLoader() in CSVLoader.java reads the restaurant information from the CSV file. We call this function in three files, namely NestedLoopByName.java, 8 NestedLoopByNameED.java and NestedLoopByNameJaccard.java. Therefore, you should enter these three files respectively and enable the CSV reader by uncommenting the following line: Restaurant[] restaurants = CSVLoader.restaurantLoader(“data\restaurant.csv”); In the meantime, comment the option 1 part in each file, the result is shown as follows: Be sure to perform the same process to all three files. 2. Python IDE: In Python, we provide you with the data loader functionality in csv_loader.py. Since we also provide two data loading option in Python, we choose the data loading from CSV as default in nested_loop_by_name.py, nested_loop_by_name_ed.py and nested_loop_by_name_jaccard.py. You can switch between two options in the following code snippet. Task 1: Read the code in NestedLoopByName.java/nested_loop_by_name.py and focus on the data loading part. Understand how data are loaded into restaurants array and complete the following data statistics tasks in the given class DataStatistics.java/data_statistics.py: Count the number of restaurant records whose city is “new york” and “new york city”, respectively. Count total number of distinct values in city attribute (Hint: use HashSet in Java or set in Python). There are two ways of completing this task: 9 (1) Load data from Oracle database or CSV file to “Restaurant[] restaurants” object using the method implemented in NestedLoopByName.java/ nested_loop_by_name.py, then obtain corresponding results by processing the “restaurants”, or (2) Write SQL queries that retrieve the task results directly from database, send such query through JDBC/cx_Oracle (like the example shown in Example 1) and print the result to screen. (Note: in Java, if you want to complete this task through SQL query, you may also change the ExecuteQuery(query) in DBConnection.java, as it is initially designed for parsing restaurant records to a Restaurant[] array. Writing a new parser for your new SQL is a better idea than changing ExecuteQuery(query) directly as it is called by other classes as well.) Please screenshot both your code in DataStatistics.java/data_statistics.py and the running results. They should be included within one image. The format of the results is similar as follows (The actual value is not the same as the example): Part 2: Measure the Performance of Data Linkage There are some duplications in the original restaurant dataset, and we will use Data Linkage techniques to detect these duplications, i.e., pairs of records that refer to the same real-world entity. For example, the following two records actually represent the same restaurant: 5, “hotel bel-air”, “701 stone canyon rd.”, “bel air” 6, “bel-air hotel”, “701 stone canyon rd.”, “bel air” 1. Nested-Loop Join for Data Linkage The nested-loop join, a.k.a nested iteration, uses one join input as the outer input table and the other one as the inner input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table. The pseudo-code below shows its workflow. In the simplest case, the search scans an entire table or index, which is called a naive nested loop join. In this case, the algorithm runs in O(|R|*|S|) time, where |R| and |S| are the number of tuples contained in tables R and S respectively and can easily be generalized to join any number of relations. Furthermore, the search can exploit an index as well, which is called an index nested loop join. A nested loop join is particularly effective if the outer input is small and the inner input is pre-indexed and large. In many small transactions, such as those affecting only a small set of rows, index nested loop joins are superior to both merge joins and hash joins. In large queries, however, nested loop joins are often not the optimal choice. For each tuple r in R do For each tuple s in S do If r and s satisfy the join condition Then output the tuple 10 In this Prac, we adopt a nested-loop method to self-join the RESTAURANT table for data linkage. We first consider perfect matching on the “Name” attribute. In other words, we link two restaurant records (i.e., they refer to the same entity) only if their names are identical, e.g., 1, “arnie morton’s of chicago”, “435 s. la cienega blv.”, “los angeles” 2, “arnie morton’s of chicago”, “435 s. la cienega blvd.”, “los angeles” Program namely, NestedLoopByName.java/nested_loop_by_name.py is the implementation of this algorithm. It first reads all restaurant records from the Oracle database using JDBC, and then self-joins the table by a nested-loop join. If two records have the same “Name” value, the algorithm outputs this linking pair, i.e., id1_id2 where id1 and id2 are the “ID”s of these records respectively. 2. Precision, Recall, and F-measure We can regard the problem of data linkage as a classification task. Specifically, for each pair of records r and s, we predict a binary class label: “0” or “1”, where “1” means we believe these two records refer to the same entity and hence can be linked. Naturally, a data linkage algorithm is perfect if and only if: (1) Precision: all the linked pairs it predicts are correct, and (2) Recall: all possible linked pairs are discovered. We provide a file “data\restaurant_pair.csv” which stores the gold-standard linking results, i.e., all the restaurant record pairs that refer to the same real-world entity. Suppose that D represents the set of linked pairs obtained by the data linkage algorithm, and D* is the set of gold-standard linking results. The algorithm is regarded as perfect if the two sets are identical, i.e., D=D*. Precision and Recall are well-known performance measures that can capture the above intuitions. For classification tasks, the terms true positives, true negatives, false positives, and false negatives are considered to compare the results of the classifier under test with trusted external judgments (i.e., gold-standard). The terms positive and negative refer to the classifier’s prediction (sometimes known as the expectation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the observation), as shown below. Given a linked pair id1_id2 in this Prac, it is like: True positive, if it belongs to both D and D* False positive, if it only belongs to D False negative, if it only belongs to D* True negative, if it belongs to neither D nor D* Based on these terms, precision and recall are defined as follows, where tp, fp, and fn represent true positive, false positive, and false negative, respectively. = + = + 11 Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. For example, brain surgery can be used as an illustrative example of the trade-off. Consider a brain surgeon tasked with removing a cancerous tumour from a patient’s brain. The surgeon needs to remove all of the tumour cells since any remaining cancer cells will regenerate the tumour. Conversely, the surgeon must not remove healthy brain cells since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain she removes to ensure she has extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the brain she removes to ensure she extracts only cancer cells. This decision increases precision but reduces recall. It can be seen that, greater recall increases the chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome). Therefore, another performance measure is used to consider the both precision and recall together, namely, F-measure, in which precision and recall are combined by their harmonic mean, as shown below. = 2 + Task 2: Program Measurement.java/measurement.py class implements above performance measures, namely precision, recall, and f-measure. Please read and understand the “CalcuMeasure” function in Measurement.java. Explain the following concepts and explain which variable in the code are they correspond to (count, result.size(), benchmark.size()) in our data linkage problem. Note that some concepts may require additional calculation on the existing variables, and some may not derivable from the variables provided (then only explain its meaning). Include your explanation in the submitted document: What are the true positive, true negative, false positive and false negative What are the meanings of precision and recall in this case An example answer could be as such format (the answer in the example is not correct): “The false positive is the records that appear in both the linkage result and the gold-standard, which is corresponding to count/result.size(). There are 120 false positive pairs.” or “The false positive is the records that appear in both the linkage result and the gold-standard, it is not calculated in the code.” Part 3: Similarity Measures for Field-Level Data Linkage In above Part 2, we link two restaurant records only if their names are identical. However, datasets, in practice, are always informal and noisy, full of abbreviations, spelling mistakes, various entity representations, etc. Therefore, a better solution would be to calculate the similarity between records, which is also the main issue of data linkage. We consider two string similarity measures in Task 3, i.e., Jaccard coefficient and edit distance, to estimate the similarity between restaurant names, so to link corresponding restaurant records. You will see later how these similarity measures affect the performance of data linkage. 12 1. Jaccard Coefficient Jaccard coefficient, also known as Jaccard index or Intersection over Union, is a statistic method used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: (, ) = | ∩ | | ∪ | = | ∩ | || + || | ∩ | Here, A and B are two sample sets, and J(A, B)=1 if A and B are both empty. As you can see, the Jaccard coefficient compares members for two sets to see which members are shared and which are distinct. It measures the similarity between two sets, with a range from 0% to 100%. The higher the percentage, the more similar the two sets. Consider the following simple example, how similar are these two sets Obviously, J(A,B) = |A∩B| / |A∪B| = |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}| = 3/9 = 0.33. A = {0,1,2,5,6} B = {0,2,3,4,5,7,9} In order to measure the similarity between two strings using Jaccard coefficient, the strings need to be converted into sets firstly. In this Prac, we consider q-gram representation of a string. Q- gram is a contiguous sequence of q items from a given string. The items can be phonemes, syllables, characters, or words according to the application. A q-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram” (or 3-gram). Larger sizes are sometimes referred to by the value of q in modern language, e.g., “four-gram”, “five-gram”, and so on. As an example mentioned in Week-8 Lecture Notes and Tutorial 7, consider the string “University_of_Queensland”, we can transform it as a set of 3-grams with each character as the item: “University_of_Queensland”: {“##U”, “#Un”, Uni”, “niv”, “ive”, “ver”, “ers”, “rsi”, “sit”, “ity”, “ty_”, “y_o”, “_of”, “of_”, “f_Q”, “_Qu”, “Que”, “uee”, “een”, “ens”, “nsl”, “sla”, “lan”, “and”, “nd#”, “d##”}, where using character “#” is optional to pad the beginning and ending of the string for the completeness of 3-gram. The Jaccard coefficient similarity measure based on q-grams has been implemented in the class Similarity.java/similarity.py. Task 3: In NestedLoopByNameJaccard.java/nested_loop_by_name_jaccard.py, we link two restaurant records if the Jaccard coefficient of corresponding restaurant names exceeds a predefined threshold. Run NestedLoopByNameJaccard.java/nested_loop_by_name_jaccard.py with different settings of “q” and “threshold” to see how these two parameters affect the output of the similarity measure, and therefore affect the performance of data linkage in terms of the output size and measurement result. Test the influence of each parameter by altering its value to five different settings (up to your choice, but make sure your values are valid) as well as fixing the other parameter. Do such process to both parameters and screenshot all your precision, recall and f-measure results. Explain why the precision/recall increases/decreases based on your understanding. Include both your screenshots and your results in your submitted document. Note that, q=0 means we divide the original string into a bag-of-words. 13 2. Edit Distance Given two strings A and B, their edit distance is the minimum number of edit operations required to transform A into B. Most commonly, the edit operations allowed for this purpose are: Insert a character into a string; Delete a character from a string; Replace a character of a string by another character. For these operations, edit distance is sometimes known as Levenshtein distance. For example, the edit distance between “cat” and “dog” is 3. In fact, edit distance can be generalized to allowing different weights for different kinds of edit operations, for instance a higher weight may be placed on replacing the character “s” by the character “p”, than on replacing it by the character “a” (the latter being closer to “s” on the keyboard). Setting weights in this way depending on the likelihood of letters substituting for each other is very effective in practice. However, to simplify the work in this Prac, we will only focus on the case in which all edit operations have the same weight. It is well-known on time complexity of computing the edit distance between two strings is O(|A|*|B|), where |A| and |B| denote the length of strings A and B respectively. The idea is to use the dynamic programming algorithm, where the characters in A and B are given in array form. The algorithm fills the (integer) entries in a matrix namely ED, whose two dimensions equal to the lengths of the two strings whose edit distances is being computed. The ED[i][j] entry of the matrix will hold (after the algorithm is executed) the edit distance between the strings consisting of the first i characters of A and the first j characters of B. There is a relation between ED[i][j] and ED[i-1][j-1]. Assume that we transform from one string to another. The first string has length i and its last character is “x”, and the second string has length j and its last character is “y”. The following diagram shows the relation. Therefore, by using the dynamic programming algorithm, we can calculate the edit distance between two strings based on the edit distance of their substrings. In particular, If “x” and “y” are identical, then ED[i][j] = ED[i-1][j-1]; If “x” and “y” are different, and we insert “y” for the first string, then 14 ED[i][j] = ED[i][j-1] + 1; If “x” and “y” are different, and we delete “x” for the first string, then ED[i][j] = ED[i-1][j] + 1; If “x” and “y” are different, and we replace “x” with “y” for the first string, then ED[i][j] = ED[i-1][j-1] + 1; When “x” and “y” are different, ED[i][j] is the minimum of the three situations. Above is an example of using the dynamic programming algorithm to calculate the edit distance between strings “fast” and “cats”. Some more examples are available in Week-8 Lecture Notes and in T7 Tutorial. It is worth noting that the edit distance is not a direct hint of the similarity between two strings. Naturally, long strings could have more typing errors than short strings. In other words, we need to normalize the edit distance by string length in order to have a fair comparison of two strings. Since the maximum possible edit distance between any two strings A and B is max(|A|, |B|), we further calculate edit similarity as follows: (, ) = 1 (, ) (||, ||) Task 4: Please complete the implementation of edit distance in Similarity.java/similarity.py, run NestedLoopByNameED.java/nested_loop_by_name_ed.py to see how edit distance affects the performance of data linkage. Same as the above Task 3, you need to report the algorithm’s precision, recall, and f-measure with five different settings of the “threshold” and provide your understanding of the trends. Include both the screenshots of the results under different threshold settings and your explanation in your submitted document. Hint: You can construct an example to test if our edit distance implementation works correctly. In Java, find edit distance and Jaccard coefficient using the code in Similarity.java: 1. Open Similarity.java. 2. Make a public main method to implement following lines as shown in the screenshot below: public static void main(String[] args) { String str1 = “University”; String str2 = “Unvesty”; int out = Similarity.CalcuED(str1, str2); System.out.println(“Edit Distance = “+ out); double out2 = Similarity.CalcuJaccard(str1, str2, 2); System.out.println(“Jaccard Coefficient = “+ out2) }; 15 In Python, do the similar process in similarity.py as follows: str1 = “University” str2 = “Unvesty” out = calc_ed(str1, str2) print(“Edit Distance = “, out) out2 = calc_jaccard(str1, str2, 2) print(“Jaccard Coefficient = “, out2) Based on your understanding of these two measures, you can check if your code returns the expected value. Note that edit distance is different from edit distance similarity. The edit distance similarity is implemented as CalcuEDSim/calc_ed_sim in Java/Python, which is a normalized similarity. —ooo000OOO000ooo— 程序

发表评论

了解 KJESSAY历史案例 的更多信息

立即订阅以继续阅读并访问完整档案。

继续阅读