 |
-
Student Participants
- David Benson-Putnins (Oxford University & University of Michigan)
- Margaret Bonfardin (Washington University in St. Louis)
- Meagan Magnoni (Ithaca College & RPI)
- Daniel Martin (Davidson College)
-
Advisors
- Carl D. Meyer (Primary Faculty Advisor, NC State)
- Chuck Wessell (Graduate Student Advisor, NC State)
-
Project Description
- Given a collection of raw data, the objective is to determine hidden
patterns by partitioning the data into clusters (which may or may not be
disjoint), where the items in a given cluster share or exhibit some sort of
commonality that is not immediately apparent due to the sheer volume or the
diverse nature of the data. Two fundamental application areas will be
under investigation.
-
The first application involves clustering DNA microarray data.
Clustering DNA microarray data is a fundamental problem for genomic
scientists because in addition to helping reveal hidden genetic components
relevant to the development of diseases, it is a valuable diagnostic tool.
For example, without knowing all of the specific genetic factors governing
a disease such as leukemia, an individual's genetic propensity for
developing a particular type of leukemia can be inferred by incorporating
their DNA data into a known paradigm (such as the ALL-AML leukemia data set
described below) and applying an appropriate clustering technique on the
augmented data. By observing which cluster (if any) the individual falls
into (along with the strength of the clustering effect), a preemptive
diagnosis can be made. Much of this research will involve ALL-AML leukemia
data and associated studies from the MIT-Harvard Broad Institute of Genome
Research.
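The diagnostic idea above (add one individual's profile to a reference data set, cluster the augmented data, and see where the individual lands) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-dimensional "expression profiles" are invented toy numbers rather than ALL-AML data, the function name `kmeans` and its deterministic initialization are hypothetical, and plain k-means stands in for whatever technique is appropriate in practice.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Plain k-means; the first k points are used as initial centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented toy profiles forming two well-separated groups (not real microarray data).
reference = [(1.0, 0.9), (5.0, 5.2), (1.2, 1.1), (0.8, 1.0), (5.1, 4.9), (4.8, 5.0)]
new_sample = (4.9, 5.1)  # hypothetical individual to be placed

augmented = reference + [new_sample]
labels, centroids = kmeans(augmented, k=2)
new_label = labels[-1]  # cluster into which the new individual falls
```

Comparing `new_label` with the labels of the reference samples indicates which known group the individual resembles; the distance to the assigned centroid gives a rough sense of how strong that membership is.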
-
In the second application, text data will be extracted from the World
Wide Web. The goal is to first facilitate machine reading of web documents
and then to cluster these documents into sets of common topics. The
motivation stems from the need to automate the evaluation of product
reviews and information that appear across the internet to help formulate a
consensus of opinion about specific products and their features. This
application will typically involve thousands of documents, each of which is
described by the relative presence of hundreds or thousands of vocabulary
words.
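To make "described by the relative presence of vocabulary words" concrete, here is a small sketch that turns documents into rows of relative word frequencies. The function name `document_term_matrix`, the whitespace tokenization, and the three sample reviews are assumptions chosen for illustration; a real pipeline would likely also remove stop words and apply a weighting scheme.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocab, rows); rows[d][t] is the relative frequency of vocab[t] in docs[d]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Relative presence: count of each word divided by the document length.
        rows.append([counts[word] / total if total else 0.0 for word in vocab])
    return vocab, rows


# Invented sample reviews, for illustration only.
docs = ["great battery great screen", "poor battery", "great screen"]
vocab, matrix = document_term_matrix(docs)
```

Each row of `matrix` is one document's vector over the shared vocabulary, which is the form of input that clustering methods typically operate on.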
-
The tools employed are elementary probability and statistics,
networks and graphs, linear algebra, numerical analysis and some
scientific computing principles. A bit of basic knowledge of biology won't
hurt.
-
Results & Conclusions
-
Summary
- We surveyed different clustering algorithms.
- We created a search engine using latent semantic indexing.
- We used the theory of consensus clustering in conjunction with k-means, non-negative matrix
factorization, and the MinMaxCut method for graph partitioning to search for and reveal hidden patterns in the Blackstone document
collection, the well-known leukemia data set of Golub, Slonim, et al., and Fisher's Iris data set. Results are detailed in our
paper "Cluster And Data Analysis."
- We created a computer tool for visualizing cluster connectivity.
- We created a mathematical technique and wrote an algorithm for determining the optimal number of clusters in a data set.
- We researched aspects of the Fiedler method and proved results concerning the Fiedler vector for some nonstandard cases.
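As a rough illustration of the consensus clustering idea mentioned in the summary, the sketch below builds a consensus (co-association) matrix from several clusterings of the same items. This is a generic formulation assumed for illustration, not necessarily the construction used in the paper; the function name `consensus_matrix` and the three label vectors are hypothetical.

```python
def consensus_matrix(clusterings):
    """Entry (i, j) is the fraction of clusterings that place items i and j together."""
    n = len(clusterings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(clusterings)
    return [[value / runs for value in row] for row in counts]


# Three hypothetical clusterings of four items, e.g. from different algorithms.
# Label values are arbitrary; only co-membership matters.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
C = consensus_matrix(runs)
```

Entries near 1 mark pairs of items that the individual methods consistently group together, so the matrix can itself be clustered or thresholded to extract an agreed-upon partition.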
-
Publications & Presentations
-
Final Report (download pdf)
-
Photos
|