联系我们: 手动添加方式: 微信>添加朋友>企业微信联系人>13262280223 或者 QQ: 1483266981
COM 301
Lab 2: Implementing Indexing Using B-Trees
An index is a file that facilitates insert, delete, and search operations on a database. The
efficiency of these operations depends on the underlying structure of the index. An example of
an index that ensures operations of Θ(log n) time complexity is the B-tree. We’ve already
established that a B-tree of order m is always height-balanced—leaves are always at the same
level. In addition, its internal nodes (except perhaps the root) have between m/2 and m
children, to keep the space overhead of the index reasonable.
Assignment:
For this assignment, you will index a database of student records using B-trees. There will be
two index files as the database will be indexed both by student ID and student last name. A
separate database file will contain the actual student records.
The program will take two input files: a data file and a command file. The data file is a text file
containing student records from which the database and its indexes (binary files) will be initially
built. The command file is a list of insert and search operations to be carried out after the files
have been built. The results of these operations will be sent to an output file.
A name will be provided (through the command line) that is used to obtain the names of the
database and index files. For example, if the name provided is “student,” the database will be
named “student.dat” while the two index files will be named “student.ix1” (for the ID index) and
“student.ix2” (for the name index).
The order (m) of the B-trees will also be specified (the same for both indexes) in the command
line.
Program Invocation:
The program is invoked as follows:
btree
If the command line contains an insufficient number of arguments or if any of the specified input
files do not exist, the program should print an appropriate error message and exit. Input and Output Formats:
The data file will consist of some number of text lines, one student record per line. Each line will
have 6 space-separated fields: ID (9 digits), last name (at most 15 letters), first name (at most
15 letters), year (1 digit), major (at most 4 letters), and email address (at most 20 characters).
This means a record will require at most 64 bytes if stored as text fields.
The command file will contain lines specifying one of five kinds of commands:
find ID
find name
add
dump by name
The first two commands will output the record(s) found, one line per record. Note that the first
command returns at most one record (output “NOT FOUND” if no such record exists), while the
second command may return more than one record. (You may assume that IDs are unique and
that last names may duplicate). For the third command, the program should output STUDENT
are counts of splits performed during the insert operation. The count for the ID index is
while the count for the name index is
indicated order.
For all the five commands, echo the command line itself before you echo the output results
described above.
For all input files, the fields are separated by one or more spaces and there may be leading or
trailing spaces in a line. You may assume, of course, that the fields themselves may not contain
spaces (e.g.,
When printing out a record, print the byte position of that record in the database file, followed by
a colon and 69 (ASCII) characters describing the record: each field is printed with its maximum
number of characters padded with trailing spaces if necessary. Add a single space between
each field—64 characters for the fields plus 5 separator spaces makes 69 characters total.
Finally, output the number of records in the database file and the number of B-tree nodes in the
index files before and after processing the command file, using the following format:
NUMBER OF RECORDS:
been read, while the last line of the output file will contain the status of the files after the
command file has been processed.
Refer to the sample files presented later in this document.
Data Structures:
This assignment is an exercise in B-trees, indexing, and file I/O. Store fixed-length (64-byte)
records for the database (store fields using ASCII characters and then pad each field with
spaces up to its maximum length) and append these records to the database file in the order
that the add commands appear. For the indexes, store the keys in ASCII (9 bytes for ID, 15
bytes for last name) and the file positions as C++ long values. The index file will consist of fixed
sized tree nodes that contain up to m 1 keys, m 1 database record positions (pointers to the
database file), and m index file record positions (pointers to children). You will also need to store
the actual number of children for the node (some number less ≤ m stored as a C++ int). This
means the size of an index file should be a multiple of (m 1)K + (2m 1)L + I, where K is the
size of a key, L is the size of a C++ long value, and I is the size of a C++ int value. The first (m
1)K + (2m 1)L + I bytes of the index file will be the node that corresponds to the root of the B
tree. When new nodes need to be added because of splits, (m 1)K + (2m 1)L + I bytes will
be appended to the index file for each new node. (Note that you have a decision to make when
you perform a split. For example, suppose you want to split node A to result in nodes B and C.
You can either have B occupy the node previously occupied by A, and C occupies the newly
appended node, or, C takes over A, and B occupies the new node. This detail is left up to you.)
The order in which records are added to the database is important, so make sure that when the
initial database is set up and the data file is read, it is equivalent to carrying out an add
operation for each line in the data file in the indicated sequence.
Since duplicate last names may exist, a rule is needed in case the name already exists in the
index. Comply with the following rule: If a record is inserted into the database such that one or
more record(s) with the same name already exists in the index, the new record should occur
after the last record (according to the index) with that name.
When splitting a node, there will be m 1 keys in a node and an incoming key, for a total of m
keys. The split should cause the m2 th key to be promoted (assuming the first key is the 0th
key and the last key is the (m 1)th key).
There is no limit for the value of m. This means the buffers (the variables you use to store the
nodes) for the files must be dynamically allocated. Use of the C++ Standard Template Library (STL) is limited; in general, you should avoid using
such classes for this project, but ask your instructor during class, in case you find it reasonable
to use a particular STL class, while staying within the objectives of this assignment. The point of
this project is to implement the B-Tree data structure from scratch.
Sample Input and Output Files:
datafile1.txt
000000088 Platt John 0 ML jplatt@vt.edu
000000097 Brill Eric 1 NLP brill@vt.edu
000000056 Dumais Susan 2 IR dumais@aol.com
000000003 Chen Zheng 1 CN zhengc@msn.com
cmdfile1.txt
find ID 000000097
findnamePlatt
findnameChen
add 000000060 Shaffer Clifford 1 ALG shaffer@vt.edu add 000000066 Platt Jenny 1 UNK
platt@vt.edu
add 000000021 Chen Jian 0 cSE jchen@vt.edu dump by ID
dump by name
add 000000098 Chen Xiao 2 XX cx@yahoo.com find name Vergara
When the following command is invoked:
btree datafile1.txt student1 3 cmdfile1.txt outfile1.txt
…the program will create one database file named “student1.dat” and two index files named
“student1.ix1” and “student1.ix2”. It will also produce the following output file:
outfile1.txt
NUMBER OF RECORDS: 5, ID INDEX: 4 nodes, NAME INDEX: 4 nodes. COMMAND: find ID
000000097
64:000000097Brill Eric 1NLP brill@vt.edu
COMMAND: find name Platt
0:000000088Platt John 0ML jplatt@vt.edu
COMMAND: find name Chen
192:000000003Chen Zheng 1CN zhengc@msn.com
256:000000010Chen Ming 0ISE mcheng@vt.edu
COMMAND: add 000000060 Shaffer Clifford 1 ALG shaffer@vt.edu
STUDENT 000000060 ADDED (320;0;0)
COMMAND: add 000000066 Platt Jenny 1 UNK platt@vt.edu
STUDENT 000000066 ADDED (384;2;2)COMMAND: add 000000021 Chen Jian 0 cSE jchen@vt.edu
STUDENT 000000021 ADDED (448;0;0)
COMMAND: dump by ID
192:000000003Chen Zheng 1CN zhengc@msn.com
256:000000010Chen Ming 0ISE mcheng@vt.edu
448:000000021Chen Jian 0cSE jchen@vt.edu
128:000000056Dumais Susan 2IR dumais@aol.com
320:000000060Shaffer Clifford 1ALG shaffer@vt.edu
384:000000066Platt Jenny 1UNK platt@vt.edu
0:000000088Platt John 0ML jplatt@vt.edu
64:000000097Brill Eric 1NLP brill@vt.edu
COMMAND: dump by name
564:000000097Brill Eric 1NLP brill@vt.edu
192:000000003Chen Zheng 1CN zhengc@msn.com
256:000000010Chen Ming 0ISE mcheng@vt.edu
448:000000021Chen Jian 0cSE jchen@vt.edu
128:000000056Dumais Susan 2IR dumais@aol.com
0:000000088Platt John 0ML jplatt@vt.edu
384:000000066Platt Jenny 1UNK platt@vt.edu
320:000000060Shaffer Clifford 1ALG shaffer@vt.edu
COMMAND: add 000000098 Chen Xiao 2 XX cx@yahoo.com STUDENT 000000098 ADDED
(512;0;1)
COMMAND: find name Vergara NOT FOUND
NUMBER OF RECORDS: 9, ID INDEX: 7 nodes, NAME INDEX: 8 nodes.
Program Submission and Other Deliverables:
You are required to submit the output file from your program.
Submit Lab 2 to the Assignment box no later than Sunday 11:59 PM EST/EDT.Elements
Criteria
Score
Not Attempted
(Criterion is
missing, not in
evidence, or fails
to meet minimum
requirements.)
Novice
(Does not meet
expectations;
performance is
substandard.)
Basic
(Works towards
meeting
expectations;
performance needs
improvement.)
Proficient
(Meets
expectations;
performance is
satisfactory.)
Exemplary
(Exceeds
expectations;
performance
is
outstanding.)
Completeness and
correctness of the
report.
0-11
12-13
14-15
16-17
18-20
__/20
Code documentation.
0-10
11
12
13
14-15
__/15
Development and
correctness of the
application.
0-11
12-13
14-15
16-17
18-20
__/20
Development and
correctness of sorting
methods.
0-24
25-29
30-34
35-39
40-45
__/45
TOTAL ___/100


发表评论