Method of indexed storage and retrieval of multidimensional information

US 6,741,983 B1
Filed: 09/28/2000
Issued: 05/25/2004
Est. Priority Date: 09/28/1999
Status: Expired due to Term

First Claim

1. A method of organizing the data records of a database into clusters, comprising the steps of:

(a) representing one or more variables in each data record in a binary form, whereby the value of each bit is assigned based on the value of a variable;

(b) choosing a set of variables from those represented in all of the data records, whereby principal component analysis of the set of variables yields distinct clusters of the data records;

(c) applying principal component analysis to the chosen set of variables for a sample of the data records, whereby two or more principal component vectors are identified;

(d) examining the scores of the chosen set of variables for the sample data records along the two or more vectors;

(e) selecting those vectors of the two or more vectors for which the examined scores form distinct clusters;

(f) formulating a test based on the selected vectors; and

(g) performing the test formulated in step (f) on each data record, whereby the data records are organized into clusters.
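
Steps (c) through (g) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the patent: the helper names, the power-iteration stand-in for a library PCA routine, the largest-gap heuristic used to decide whether scores "form distinct clusters", and the sign-based test. The toy records are invented binary rows.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(rows):
    """Subtract the per-column mean from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(r, means)] for r in rows], means

def principal_components(centered, k=2, iters=200, seed=0):
    """Leading k eigenvectors of the sample covariance, found by power
    iteration with deflation (a stand-in for a library PCA call)."""
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    vectors = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        scale = dot(v, v) ** 0.5
        v = [x / scale for x in v]
        for _ in range(iters):
            w = [dot(row, v) for row in cov]
            norm = dot(w, w) ** 0.5
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in cov])
        vectors.append(v)
        # Deflate: remove the component just found before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return vectors

def forms_clusters(scores, min_gap=0.3):
    """Assumed heuristic: scores are 'clustered' if the largest gap between
    neighbouring sorted scores is a sizeable fraction of the total spread."""
    s = sorted(scores)
    spread = s[-1] - s[0]
    if spread == 0:
        return False
    return max(b - a for a, b in zip(s, s[1:])) / spread >= min_gap

def organize(records, sample):
    centered, means = center(sample)
    vectors = principal_components(centered)                       # step (c)
    selected = [v for v in vectors
                if forms_clusters([dot(r, v) for r in centered])]  # steps (d), (e)

    def test(record):                                              # step (f)
        shifted = [x - m for x, m in zip(record, means)]
        return tuple(dot(shifted, v) > 0 for v in selected)

    clusters = {}
    for idx, rec in enumerate(records):                            # step (g)
        clusters.setdefault(test(rec), []).append(idx)
    return clusters

# Invented binary records: the first four share one bit pattern, the last four another.
records = [[1, 1, 0, 0, 1], [1, 1, 0, 0, 0], [1, 1, 0, 0, 1], [1, 1, 0, 0, 0],
           [0, 0, 1, 1, 1], [0, 0, 1, 1, 0], [0, 0, 1, 1, 1], [0, 0, 1, 1, 0]]
clusters = organize(records, sample=records)
```

In this toy case the two selected vectors split the eight records into four equal groups. A production implementation would more likely rely on a numerical library and a more careful cluster-detection criterion.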

Abstract

A tree-structured index to multidimensional data is created using naturally occurring patterns and clusters within the data which permit efficient search and retrieval strategies in a database of DNA profiles. A search engine utilizes hierarchical decomposition of the database by identifying clusters of similar DNA profiles and maps to parallel computer architecture, allowing scale up past previously feasible limits. Key benefits of the new method are logarithmic scale up and parallelization. These benefits are achieved by identification and utilization of naturally occurring patterns and clusters within stored data. The patterns and clusters enable the stored data to be partitioned into subsets of roughly equal size. The method can be applied recursively, resulting in a database tree that is balanced, meaning that all paths or branches through the tree have roughly the same length. The method achieves high performance by exploiting the natural structure of the data in a manner that maintains balanced trees. Implementation of the method maps naturally to parallel computer architectures, allowing scale up to very large databases.
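
The recursive, balanced partitioning described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: records are plain numeric tuples, and a median split on one coordinate per level stands in for the cluster-based tests the abstract refers to. The names `build_tree` and `leaf_depths` are hypothetical.

```python
def build_tree(records, leaf_size=2, depth=0):
    """Recursively split records into two halves of (nearly) equal size."""
    if len(records) <= leaf_size:
        return {"leaf": records}
    axis = depth % len(records[0])            # assumed: cycle through coordinates
    ordered = sorted(records, key=lambda r: r[axis])
    mid = len(ordered) // 2                   # median split keeps halves balanced
    return {
        "axis": axis,
        "threshold": ordered[mid][axis],
        "left": build_tree(ordered[:mid], leaf_size, depth + 1),
        "right": build_tree(ordered[mid:], leaf_size, depth + 1),
    }

def leaf_depths(node, depth=0):
    """Depth of every leaf, used to check that the tree is balanced."""
    if "leaf" in node:
        return [depth]
    return (leaf_depths(node["left"], depth + 1)
            + leaf_depths(node["right"], depth + 1))

# 64 invented two-dimensional records.
records = [(i * 7 % 64, i * 13 % 64) for i in range(64)]
tree = build_tree(records)
depths = leaf_depths(tree)
```

Because every split halves the data, 64 records with a leaf size of 2 give leaves that all sit at depth 5, which is the logarithmic growth the abstract describes. Independent subtrees could also be built or searched on separate processors.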

103 Citations

49 Claims

1. A method of organizing the data records of a database into clusters, comprising the steps of:
- (a) representing one or more variables in each data record in a binary form, whereby the value of each bit is assigned based on the value of a variable;
  
  (b) choosing a set of variables from those represented in all of the data records, whereby principal component analysis of the set of variables yields distinct clusters of the data records;
  
  (c) applying principal component analysis to the chosen set of variables for a sample of the data records, whereby two or more principal component vectors are identified;
  
  (d) examining the scores of the chosen set of variables for the sample data records along the two or more vectors;
  
  (e) selecting those vectors of the two or more vectors for which the examined scores form distinct clusters;
  
  (f) formulating a test based on the selected vectors; and
  
  (g) performing the test formulated in step (f) on each data record, whereby the data records are organized into clusters.

2. The method of claim 1, wherein the value of each bit is assigned in step (a) based on whether the value of a variable is within a designated range of values.
3. The method of claim 1, wherein the value of each bit is assigned in step (a) based on whether a designated value of a variable is present.
4. The method of claim 1, wherein steps (c) through (e) are performed on a sample of data records from a different database.
5. The method of claim 1, wherein the test formulated in step (f) comprises:
6. The method of claim 5, wherein the representative sample vector of each cluster is the cluster center.
7. The method of claim 1, wherein the data comprise DNA profiles.
8. The method of claim 7, wherein the represented variables are alleles at two or more polymorphic loci.
9. The method of claim 8, wherein the value of each bit is assigned in step (a) based on whether a designated allele is present.
10. The method of claim 1, wherein the chosen set of variables and selected principal component vectors yield distinct clusters of approximately equal size.
11. The method of claim 1, wherein each test defines a partition of data of the database according to one of entropy/adjacency partition assignment or data clustering using multivariate statistical analysis.
12. The method of claim 1 wherein the variable in step (a) comprises more than one value, wherein each of the more than one value is represented in binary form.
13. The method of claim 1 wherein the binary form representation of step (a) is provided as a matrix, said matrix comprising a row for each data record and a column for each value of each variable, and wherein the row for each data record contains binary encoded information regarding the presence or absence of the value of each variable in the data record.
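
The binary representations in claims 2, 3, 9 and 13 can be sketched as below. The function names, the dictionary layout of a profile, and the allele numbers are assumptions chosen for illustration; the locus names are used only as example labels.

```python
def range_bit(value, low, high):
    """Claim 2 style: bit is 1 when the value lies in [low, high)."""
    return 1 if low <= value < high else 0

def binary_matrix(profiles, alleles_by_locus):
    """Claim 13 style: one row per record, one column per (locus, allele)
    pair; an entry is 1 when that allele is present in the record."""
    columns = [(locus, allele)
               for locus, alleles in alleles_by_locus.items()
               for allele in alleles]
    rows = [[1 if allele in profile.get(locus, ()) else 0
             for locus, allele in columns]
            for profile in profiles]
    return columns, rows

# Hypothetical allele lists and two invented DNA profiles.
alleles_by_locus = {"D3S1358": [14, 15, 16], "vWA": [16, 17, 18]}
profiles = [
    {"D3S1358": (14, 15), "vWA": (17, 17)},
    {"D3S1358": (16, 16), "vWA": (16, 18)},
]
columns, rows = binary_matrix(profiles, alleles_by_locus)
```

Each row can then be fed directly to a multivariate analysis, since every variable value has become a 0/1 column.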

14. A method of organizing the data records of a database into clusters, comprising the steps of:
- (a) representing one or more variables in each data record in a binary form, whereby the value of each bit is assigned based on the value of a variable;
  
  (b) choosing a set of variables from those represented in all of the data records, whereby multivariate statistical analysis of the set of variables yields distinct clusters of the data records;
  
  (c) applying multivariate statistical analysis to the chosen set of variables of a sample of the data records, whereby two or more vectors are identified;
  
  (d) examining the vectors of inner products of the chosen set of variables of the sample data records with the identified vectors;
  
  (e) selecting the identified vectors that cause the vectors of inner products to form clusters;
  
  (f) formulating a test based on the selected vectors which assigns each data record to a cluster; and
  
  (g) performing the test formulated in step (f) on each data record, whereby the data records are organized into clusters.

15. The method of claim 14, wherein the value of each bit is assigned in step (a) based on whether the value of a variable is within a designated range of values.
16. The method of claim 14, wherein the value of each bit is assigned in step (a) based on whether a designated value of a variable is present.
17. The method of claim 14, wherein steps (c) through (e) are performed on a sample of data records from a different database.
18. The method of claim 14, wherein the test formulated in step (f) comprises:
19. The method of claim 18, wherein the representative sample vector of each cluster is the cluster center.
20. The method of claim 14, wherein the data comprise DNA profiles.
21. The method of claim 20, wherein the represented variables are alleles at two or more polymorphic loci.
22. The method of claim 21, wherein the value of each bit is assigned in step (a) based on whether a designated allele is present.
23. The method of claim 14, wherein the chosen set of variables and selected vectors yield distinct clusters of approximately equal size.
24. The method of claim 14 wherein the variable in step (a) comprises more than one value, wherein each of the more than one value is represented in binary form.
25. The method of claim 14 wherein the binary form representation of step (a) is provided as a matrix, said matrix comprising a row for each data record and a column for each value of each variable, and wherein the row for each data record contains binary encoded information regarding the presence or absence of the value of each variable in the data record.
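
The sub-steps of claim 18 are not reproduced on this page, so the following is only one plausible reading of claims 14(d) and 19: each record is mapped to its vector of inner products with the identified vectors, cluster centers are taken as the mean of those vectors over a labelled sample, and a record is assigned to the nearest center. The function names, the example vectors, and the labels are hypothetical.

```python
def inner_products(record, vectors):
    """Vector of inner products of one record with each identified vector."""
    return [sum(x * w for x, w in zip(record, v)) for v in vectors]

def cluster_centers(score_vectors, labels):
    """Mean score vector of each labelled cluster."""
    sums, counts = {}, {}
    for score, label in zip(score_vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(score))
        for i, x in enumerate(score):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}

def nearest_center(score, centers):
    """Assumed test: pick the cluster whose center is closest (squared distance)."""
    return min(centers,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(score, centers[lab])))

# Invented vectors and a small labelled sample of binary records.
vectors = [[1, 1, 0, 0], [0, 0, 1, 1]]
sample = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels = ["a", "a", "b", "b"]

centers = cluster_centers([inner_products(r, vectors) for r in sample], labels)
assigned = nearest_center(inner_products([1, 1, 0, 1], vectors), centers)
```

Here the new record scores [2, 1], which is closer to the center of cluster "a" at [1.5, 0] than to the center of "b" at [0, 1.5].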

26. A method of organizing data records of a database into clusters, comprising:
- (a) choosing a set of variables to represent data records of a database and a set of principal component vectors with which the set of variables cluster the data records, said step of choosing being performed by;
  
  (i) selecting one or more variables present in all of the data records;
  
  (ii) applying principal component analysis to the selected variables of a sample of the data records to obtain principal component vectors;
  
  (iii) projecting the selected variables of a sample of the data records onto the principal component vectors;
  
  (iv) determining whether the projections of the selected variables of the sample of data records form clusters;
  
  (v) repeating steps (i)-(iv) to obtain the set of variables and the set of principal component vectors that form clusters;
  
  (b) projecting the variables chosen in step (a) for the data records onto the principal component vectors chosen in step (a);
  
  (c) formulating a test which assigns each data record to a cluster, wherein the test is based on the variables chosen in step (a);
  
  (d) performing the test on each data record, whereby the data records are organized into clusters.

27. The method of claim 26, wherein the principal component vectors chosen in step (a) contain two vectors.
28. The method of claim 26 wherein the variables selected in step (a) are represented in binary form based on whether the values of the variables are within a designated range of values.
29. The method of claim 26 wherein the variables chosen in step (a) are represented in binary form, and wherein the value of the variable is assigned based on whether a designated value is present.
30. The method of claim 26 wherein steps (a)(ii)-(a)(iv) are performed on data records from a different database.
31. The method of claim 26 wherein the test comprises:
32. The method of claim 31 wherein each representative vector is a cluster center.
33. The method of claim 26 wherein the data records comprise DNA profiles.
34. The method of claim 33 wherein the set of variables chosen in step (a) comprise allele values.
35. The method of claim 34 wherein the allele values are represented based on whether a designated allele is present.
36. The method of claim 26 wherein the set of variables and the set of principal component vectors chosen in step (a) yield distinct clusters of approximately equal size.
37. The method of claim 26 wherein the test defines a partition of the data records of the database according to entropy/adjacency partition assignment or data clustering using multivariate statistical analysis.
49. The method of claim 26 wherein the test defines a partition of the data records of the database according to entropy/adjacency partition assignment or data clustering using multivariate statistical analysis.
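
The claims above do not define "entropy/adjacency partition assignment", so the sketch below shows only a related, simpler idea: Shannon entropy of the cluster sizes can score how evenly a candidate test divides the records, which ties in with the "approximately equal size" goal of claim 36. The function names and the candidate tests are assumptions for illustration.

```python
from collections import Counter
from math import log2

def partition_entropy(labels):
    """Shannon entropy (bits) of the cluster-size distribution;
    it is largest when all clusters have equal size."""
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in counts.values())

def most_balanced(records, candidate_tests):
    """Name of the candidate test whose partition has the highest entropy."""
    return max(candidate_tests,
               key=lambda name: partition_entropy(
                   [candidate_tests[name](r) for r in records]))

# Invented binary records and two hypothetical single-bit tests.
records = [[1, 0], [1, 1], [0, 1], [1, 0]]
candidate_tests = {
    "bit0": lambda r: r[0],   # splits the records 3 to 1
    "bit1": lambda r: r[1],   # splits the records 2 to 2
}
best = most_balanced(records, candidate_tests)
```

The even split on the second bit reaches the maximum of 1 bit for two clusters, so it is preferred over the 3 to 1 split.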

38. A method of organizing data records of a database into clusters, comprising:
- (a) choosing a set of variables to represent data records of a database and a set of vectors with which the set of variables cluster the data records, said step of choosing being performed by;
  
  (i) selecting one or more variables present in all of the data records;
  
  (ii) applying multivariate statistical analysis to the selected variables of a sample of the data records to identify two or more vectors;
  
  (iii) determining if vectors of inner products of the selected variables of the data records and the two or more identified vectors in (ii) form clusters;
  
  (iv) repeating steps (i)-(iii) to obtain the set of variables and the two or more identified vectors that form clusters;
  
  (b) representing the set of variables selected in step (a) in binary form, wherein the value of each bit is assigned based on the value of the variable;
  
  (c) calculating vectors of inner products between the selected variables of the data records and the two or more vectors chosen in step (a);
  
  (d) formulating a test which assigns each data record to a cluster, wherein the test is based on the variables chosen in step (a);
  
  (e) performing the test on each data record, whereby the data records are organized into clusters.

39. The method of claim 38 wherein the vectors identified in step (a)(iii) are two in number.
40. The method of claim 38 wherein the set of variables chosen in step (a) is represented in binary form based on whether the value of the variable is within a designated range of values.
41. The method of claim 38 wherein the set of variables is represented in binary form based on whether a designated value of a variable is present.
42. The method of claim 38 wherein steps (a)(ii)-(a)(iii) are performed on data records from a different database.
43. The method of claim 38 wherein the test comprises:
44. The method of claim 43 wherein each representative vector is a cluster center.
45. The method of claim 38 wherein the data records comprise DNA profiles.
46. The method of claim 45 wherein the set of variables chosen in step (a) comprise allele values.
47. The method of claim 46 wherein each value indicates whether a designated allele is present.
48. The method of claim 38 wherein the set of variables and the set of vectors chosen in step (a) yield distinct clusters of approximately equal size.

Current Assignee: University of Tennessee Research Foundation (University of Tennessee)
Original Assignee: David J. Icove, John D. Birdwell, Puneet Yadav, Roger D. Horn, Tse-Wei Wang
Inventors: Yadav, Puneet; Icove, David J.; Birdwell, John D.; Wang, Tse-Wei; Horn, Roger D.
Primary Examiner(s): Metjahic, Safet
Assistant Examiner(s): Nguyen, Merilyn P
Application Number: US09/671,304
Time in Patent Office: 1,335 Days
Field of Search: 707/5, 707/2, 707/10, 707/104.1, 707/101
US Class Current: 1/1

CPC Class Codes
G06F 16/2246   Trees, e.g. B+trees
G06F 16/2264   Multidimensional index stru...
G06F 16/285   Clustering or classification
G16B 40/00   ICT specially adapted for b...
G16B 40/30   Unsupervised data analysis
G16B 50/00   ICT programming tools or da...
G16B 50/20   Heterogeneous data integration
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99942   Manipulating data structure...
Y10S 707/99945   Object-oriented database st...
