Method of indexed storage and retrieval of multidimensional information
DCFirst Claim
1. A method of organizing the data records of a database into clusters, comprising the steps of:
- (a) representing one or more variables in each data record in a binary form, whereby the value of each bit is assigned based on the value of a variable;
(b) choosing a set of variables from those represented in all of the data records, whereby principal component analysis of the set of variables yields distinct clusters of the data records;
(c) applying principal component analysis to the chosen set of variables for a sample of the data records, whereby two or more principal component vectors are identified;
(d) examining the scores of the chosen set of variables for the sample data records along the two or more vectors;
(e) selecting those vectors of the two or more vectors for which the examined scores form distinct clusters;
(f) formulating a test based on the selected vectors; and
(g) performing the test formulated in step (f) on each data record, whereby the data records are organized into clusters.
1 Assignment
Litigations
0 Petitions
Accused Products
Abstract
A tree-structured index to multidimensional data is created using naturally occurring patterns and clusters within the data which permit efficient search and retrieval strategies in a database of DNA profiles. A search engine utilizes hierarchical decomposition of the database by identifying clusters of similar DNA profiles and maps to parallel computer architecture, allowing scale up past previously feasible limits. Key benefits of the new method are logarithmic scale up and parallelization. These benefits are achieved by identification and utilization of naturally occurring patterns and clusters within stored data. The patterns and clusters enable the stored data to be partitioned into subsets of roughly equal size. The method can be applied recursively, resulting in a database tree that is balanced, meaning that all paths or branches through the tree have roughly the same length. The method achieves high performance by exploiting the natural structure of the data in a manner that maintains balanced trees. Implementation of the method maps naturally to parallel computer architectures, allowing scale up to very large databases.
103 Citations
49 Claims
-
1. A method of organizing the data records of a database into clusters, comprising the steps of:
-
(a) representing one or more variables in each data record in a binary form, whereby the value of each bit is assigned based on the value of a variable;
(b) choosing a set of variables from those represented in all of the data records, whereby principal component analysis of the set of variables yields distinct clusters of the data records;
(c) applying principal component analysis to the chosen set of variables for a sample of the data records, whereby two or more principal component vectors are identified;
(d) examining the scores of the chosen set of variables for the sample data records along the two or more vectors;
(e) selecting those vectors of the two or more vectors for which the examined scores form distinct clusters;
(f) formulating a test based on the selected vectors; and
(g) performing the test formulated in step (f) on each data record, whereby the data records are organized into clusters. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
projecting the chosen set of variables from a data record onto the selected principal component vectors;
scaling the projected values;
calculating a distance from the vector of scaled projected values to a representative sample vector of each cluster; and
assigning the data record to the clusters associated with the least distance.
-
-
6. The method of claim 5, wherein the representative sample vector of each cluster is the cluster center.
-
7. The method of claim 1, wherein the data comprise DNA profiles.
-
8. The method of claim 7, wherein the represented variables are alleles at two or more polymorphic loci.
-
9. The method of claim 8, wherein the value of each bit is assigned in step (a) based on whether a designated allele is present.
-
10. The method of claim 1, wherein the chosen set of variables and selected principal component vectors yield distinct clusters of approximately equal size.
-
11. The method of claim 1, wherein each test defines a partition of data of the database according to one of entropy/adjacency partition assignment or data clustering using multivariate statistical analysis.
-
12. The method of claim 1 wherein the variable in step (a) comprises more than one value, wherein each of the more than one value is represented in binary form.
-
13. The method of claim 1 wherein the binary form representation of step (a) is provided as a matrix, said matrix comprising a row for each data record and a column for each value of each variable, and wherein the row for each data record contains binary encoded information regarding the presence or absence of the value of each variable in the data record.
-
14. A method of organizing the data records of a database into clusters, comprising the steps of:
-
(a) representing one or more variables in each data record in a binary form, whereby the value of each bit is assigned based on the value of a variable;
(b) choosing a set of variables from those represented in all of the data records, whereby multivariate statistical analysis of the set of variables yields distinct clusters of the data records;
(c) applying multivariate statistical analysis to the chosen set of variables of a sample of the data records, whereby two or more vectors are identified;
(d) examining the vectors of inner products of the chosen set of variables of the sample data records with the identified vectors;
(e) selecting the identified vectors that cause the vectors of inner products to form clusters;
(f) formulating a test based on the selected vectors which assigns each data record to a cluster; and
(g) performing the test formulated in step (f) on each data record, whereby the data records are organized into clusters. - View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25)
projecting the chosen set of variables from a data record onto the selected vectors;
scaling the projected values;
calculating a distance from the vector of scaled projected values to a representative sample vector of each cluster; and
assigning the data record to the clusters associated with the least distance.
-
-
19. The method of claim 18, wherein the representative sample vector of each cluster is the cluster center.
-
20. The method of claim 14, wherein the data comprise DNA profiles.
-
21. The method of claim 20, wherein the represented variables are alleles at two or more polymorphic loci.
-
22. The method of claim 21, wherein the value of each bit is assigned in step (a) based on whether a designated allele is present.
-
23. The method of claim 14, wherein the chosen set of variables and selected vectors yield distinct clusters of approximately equal size.
-
24. The method of claim 14 wherein the variable in step (a) comprises more than one value, wherein each of the more than one value is represented in binary form.
-
25. The method of claim 14 wherein the binary form representation of step (a) is provided as a matrix, said matrix comprising a row for each data record and a column for each value of each variable, and wherein the row for each data record contains binary encoded information regarding the presence or absence of the value of each variable in the data record.
-
26. A method of organizing data records of a database into clusters, comprising:
-
(a) choosing a set of variables to represent data records of a database and a set of principal component vectors with which the set of variables cluster the data records, said step of choosing being performed by;
(i) selecting one or more variables present in all of the data records;
(ii) applying principal component analysis to the selected variables of a sample of the data records to obtain principal component vectors;
(iii) projecting the selected variables of a sample of the data records onto the principal component vectors;
(iv) determining whether the projections of the selected variables of the sample of data records form clusters;
(v) repeating steps (i)-(iv) to obtain the set of variables and the set of principal component vectors that form clusters;
(b) projecting the variables chosen in step (a) for the data records onto the principal component vectors chosen in step (a);
(c) formulating a test which assigns each data record to a cluster, wherein the test is based on the variables chosen in step (a);
(d) performing the test on each data record, whereby the data records are organized into clusters. - View Dependent Claims (27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 49)
projecting a first data record onto the principal component vectors chosen in step (a);
optionally scaling the projected first data record;
calculating a distance between the projected first data record and at least one representative vector of two or more clusters; and
assigning the first data record to the clusters associated with the least distance.
-
-
32. The method of claim 31 wherein each representative vector is a cluster center.
-
33. The method of claim 26 wherein the data records comprise DNA profiles.
-
34. The method of claim 33 wherein the set of variables chosen in step (a) comprise allele values.
-
35. The method of claim 34 wherein the allele values are represented based on whether a designated allele is present.
-
36. The method of claim 26 wherein the set of variables and the set of principal component vectors chosen in step (a) yield distinct clusters of approximately equal size.
-
37. The method of claim 26 wherein the test defines a partition of the data records of the database according to entropy/adjacency partition assignment or data clustering using multivariate statistical analysis.
-
49. The method of claim 26 wherein the test defines a partition of the data records of the database according to entropy/adjacency partition assignment or data clustering using multivariate statistical analysis.
-
38. A method of organizing data records of a database into clusters, comprising:
-
(a) choosing a set of variables to represent data records of a database and a set of vectors with which the set of variables cluster the data records, said step of choosing being performed by;
(i) selecting one or more variables present in all of the data records;
(ii) applying multivariate statistical analysis to the selected variables of a sample of the data records to identify two or more vectors;
(iii) determining if vectors of inner products of the selected variables of the data records and the two or more identified vectors in (ii) form clusters;
(iv) repeating steps (i)-(iii) to obtain the set of variables and the two or more identified vectors that form clusters;
(b) representing the set of variables selected in step (a) in binary form, wherein the value of each bit is assigned based on the value of the variable;
(c) calculating vectors inner products between the selected variables of the data records and the two or more vectors chosen in step (a);
(d) formulating a test which assigns each data record to a cluster, wherein the test is based on the variables chosen in step (a);
(e) performing the test on each data record, whereby the data records are organized into clusters. - View Dependent Claims (39, 40, 41, 42, 43, 44, 45, 46, 47, 48)
projecting a first data record onto the vectors chosen in step (a);
optionally scaling the projected first data record;
calculating a distance between the projected first data record and at least one representative vector of two or more clusters;
assigning the first data record to the clusters associated with the least distance.
-
-
44. The method of claim 43 wherein each representative vector is a cluster center.
-
45. The method of claim 38 wherein the data records comprise DNA profiles.
-
46. The method of claim 45 wherein the set of variables chosen in step (a) comprise allele values.
-
47. The method of claim 46 wherein each value indicates whether a designated allele is present.
-
48. The method of claim 38 wherein the set of variables and the set of vectors chosen in step (a) yield distinct clusters of approximately equal size.
Specification