CSc 7481 / LIS 7610
Information Retrieval
Spring 2008
Schedule: TTh 9:10-10:30 am, Coates 169 (Tuesdays) and Coates 171 (Thursdays)
Instructor: Donald H. Kraft
Office: 286 Coates
Phone: (225) 388-2253
Office Hours: TTh 10:30 am-Noon
Email:
Web: http://www.csc.lsu.edu/~kraft/courses/csc7481.html
PowerPoint: http://www.csc.lsu.edu/~kraft/courses/csc7481_files/frame.htm
Abstract: Information retrieval is concerned with problems relating to the
effective storage, access, and manipulation of primarily textual information,
which are among the most interesting and challenging problems facing computer
and information scientists. Information continues to grow in volume and is
becoming increasingly available and accessible in computer formats. Moreover,
computer networks, including the Internet and the World Wide Web, are making
communication of information easier, while new computer architectures make it
less expensive. In addition, new technology has made feasible the introduction
of powerful and sophisticated algorithms to store, retrieve, and present
massive volumes of information on a variety of media in new and better ways
(e.g., cross-language, multimedia, hypertext/hypermedia, natural language,
digital libraries, and the web).
Text: Meadow, C.T., Boyce, B.R., and Kraft, D.H. Text Information Retrieval
Systems, 3rd edition, San Diego, CA: Academic Press, 2007
References:
Salton, G. Automatic Text Processing: The Transformation, Analysis,
and Retrieval of Information by Computer,
Frakes, W.B. and Baeza-Yates, R. (Eds.) Information Retrieval: Data
Structures and Algorithms,
Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval,
Grossman, D.A. and Frieder, O. Information Retrieval: Algorithms and
Heuristics, 2nd edition, The Netherlands, Springer, 2004,
Korfhage, R.R., Information Storage and Retrieval,
Sparck Jones, K. and Willett, P. (Eds.),
Walker, G. and Janes, J., Online
Retrieval, 2nd edition,
Other Texts:
Crochemore, M. and Rytter, W. Text Algorithms,
Del Bimbo, A. Visual Information Retrieval,
Lesk, M. Practical Digital Libraries: Books, Bytes & Bucks,
Maybury, M.T., Intelligent Multimedia Information Retrieval,
Miyamoto, S., Fuzzy Systems in Information Retrieval and Cluster
Analysis,
Shneiderman, B., Designing the User Interface: Strategies for
Effective Human & Computer Interaction,
van Rijsbergen, C.J., Information Retrieval, 2nd edition,
Other interesting books certainly exist; feel free to apply information
retrieval techniques to find them. In addition, the American Society for
Information Science (ASIS) annual meeting Proceedings, the Association for
Computing Machinery (ACM)/Special Interest Group on Information Retrieval
(SIGIR) Forum and International Conference on Research and Development in
Information Retrieval (ICRDIR) Proceedings, and the ACM Conference on
Information and Knowledge Management (CIKM) Proceedings have good articles.
Moreover, journals with good articles include Information Processing and
Management, Information Retrieval, and the Journal of the
American Society for Information Science and Technology (JASIST). CACM,
ACM/TOIS, and IEEE Computer may have articles of interest, too.
The Course: The course will be run
as a seminar, with students reading research materials and participating in the
class discussion. This will include doing some homework, using and
experimenting with some of the retrieval, search engine, and digital library
systems. A project, involving design and implementation of a small
retrieval subsystem OR use of a bibliographic database for retrieval
experiments, is also required. In addition, a research paper,
along the lines of a bibliographic essay, is required. The project will count
for 45% of the final grade, the research paper will count for 33% of the grade,
and the homework and class participation will count
for 22% of the grade. The project may be
done individually or, better yet, in small teams (2-3 students). The research
paper must be done individually, as must the homework unless explicitly
specified otherwise.
The Research Paper: Sample topics for the research paper can
be found in the list immediately below. In addition, some ideas might be gained
by perusing the list of sample topics for the project below. However, the topic
chosen for the research paper must not be identical to the topic chosen for the
project! It is anticipated that this paper will be relatively short, a fuzzy 10
pages or so, certainly shorter than the project. It should consist of a
bibliographic essay describing a concept and relaying the state of the art in
terms of the information technology and its relationship to retrieval (in other
words, do not provide a tutorial, rather, focus on research related to
retrieval). The choice of topic must be approved by the instructor and must be
specified in writing and approved by February 12th! The research
paper will be due on April 22nd! Hints: do not simply concatenate
several abstracts from papers; do not rely solely on textbooks; do weave a
pattern of what is going on with the topic selected based upon the open
literature in the field, especially in terms of journals, conference proceedings,
technical reports, and, perhaps, even web pages if needed; do not forget to
list your references, and use them, citing them in the paper; and, most
importantly, do not simply copy entire sections, or even papers, especially
without citing them - it is academically dishonest as well as intellectually
dishonest; you will be caught and violators will be prosecuted.
Sample Research Paper Topics: Relevance Research;
Clustering; Rules Based on Fuzzy Clustering; Text Processing; Storage
Technology (e.g., Disk, CD-ROM, WORM, DVD); Full Text Retrieval; Data
Retrieval; Cross-Language Retrieval; Language Models for Retrieval; Text
Algorithms (e.g., string search, pattern matching, string similarities); Rough
Sets for Retrieval; Bibliometrics for Retrieval; Automatic Abstracting; Text
Summarization; Data Warehousing; Hypertext/Hypermedia Systems; Non-print Media
(e.g., images) Retrieval; Multimedia Retrieval; Digital Libraries; Recommender
Systems; User Interfaces (Graphical, Others); Visualization for Retrieval;
Electronic Publishing; World Wide Web Retrieval (Search Engines, Metasearch);
Retrieval Applets in Web Languages (e.g., HTML, VRML, XML, SGML, Perl, Java,
CGI); Expert Systems (Rule Retrieval, Indexing, Retrieval); Natural Language
Processing and Retrieval; Neural Nets/Connectionist Models of Retrieval;
Evolutionary Computing (e.g., Genetic Algorithms and Genetic Programming) for
Information Retrieval; Uncertainty and Imprecision in Retrieval (e.g., Fuzzy
and/or Rough Sets, Belief Functions); Inference and Retrieval; Retrieval with
Parallel Architectures and/or Distributed Processing; Retrieval Performance and
Evaluation Issues; Data and File Structures for Retrieval (e.g., MAT tries);
Data Encryption; Data Compression; Data Mining (Knowledge Discovery) and
Retrieval; Information Filtering; Intelligent Agents for Retrieval; Query
Expansion; Information Brokers; Data Fusion and Retrieval; Transaction Log
Analysis; Graphical Models; Temporal Information; The Datalog System.
The Project: The project will
consist of a written report, plus the software if a system (even if it is but a
pilot) was implemented or the results if a set of (retrieval) experiments was
conducted. The purpose is to develop mechanisms to improve the state of the art
of information retrieval. Again, one wants to show what is currently being done
in a given subarea of retrieval with a familiarity with the current literature
and activity in that subarea, and show an ability to work with such systems. Do
NOT do a simple database application, since this is a retrieval course; you can
add a retrieval component (e.g., text retrieval, or nonprint, i.e., images
and/or sound) to a database by adding imprecise data and/or queries. Moreover,
the choice of project must be approved by the instructor and must be specified
in writing and approved by February 12th! The topic must not be
exactly identical to the topic chosen for the oral presentation nor to the
topic chosen for the research paper! The project will be due on April 24th.
One may form a small team of two or three people to do the project, but the
more people involved, the more effort is expected.
Sample Projects: Development of a
Graphical User Interface for a Specific Retrieval Situation or System (e.g.,
SMART); Installing a Retrieval System (e.g., Smart, Terrier, Cheshire
(http://www.cheshire3.org), Inquery, MG, Lucene (http://lucene.apache.org) or Glimpse);
Development of a Neural Net Model for Retrieval; Implementing Learning to Rank
(http://research.microsoft.com/users/LETOR);
Testing Various Clustering Methods for Retrieval; Implementing k-nearest
neighbor (KNN) clustering especially for ranking; Implementing Various Text
Algorithms; In-Depth Testing of Various Web Search Engines;
Exploring/Evaluating a Digital Library; Developing a Digital Library Involving
Users; Data Mining on the Web; Development of a Pilot Hypertext System; Implementing
the open-source machine translation platform Apertium for retrieval (http://xixona.dlsi.ua.es/apertium-www/); Development of an Expert System for
Retrieval; Applying Natural Language Processing to Query and/or Document
Analysis; Developing a Retrieval Model to Exploit a Given Parallel
Architecture; Testing Clustering (Fuzzy and Crisp) Methods for Retrieval;
Applying a Belief Function to a Model of Retrieval; Testing Experimental
Retrieval Systems (e.g., Okapi, Smart, Cheshire, Terrier, MG, Glimpse, or
Inquery); Experimenting with Retrieval Data (e.g., TREC Data); Development of
Retrieval Performance Measures for Ranked Output; Experimentation with Various
Means for Boolean Relevance Feedback (e.g., Genetic Algorithms, OCAT); Testing
Various File Structures for Retrieval; Testing Various Data Compression
Methods for Storage; Testing Various Encryption Algorithms for Text;
Development of a Retrieval Model or System Based on Document Components;
Development of a Model to Relate Query Complexity to Retrieval Performance;
Development of a Retrieval Model Applied to Software
Reuse; Application of Bibliometric Laws to a
Retrieval System; Empirically Evaluating Various Aggregation Methods for
Retrieval Status Value Determination for Boolean Queries; Experimentation with
Cross-Language Retrieval; Running Experiments on the BBC Algorithm for Boolean
Queries.
Course Topics to be covered include:

Week 1: Information Retrieval Systems
  Definitions
  Related Information Systems (e.g., DBMS, Q/A, MIS, DSS, Full-Text, DL)
  Social Issues - Intellectual Access, Restricted Access
  Readings: Text, Chaps. 1, 2; Frakes and Baeza-Yates, Chap. 1

Week 2: History of Information Retrieval Systems
  Traditional Systems
  Commercially Available (Online Retrieval Systems, e.g., Dialog, BRS, Medline, OCLC)
  Readings: Text, Chaps. 8, 15; Walker and Janes, Chaps. 1-2, 3, 5-7, 9-10, 13, 15

Week 3: Computing Aspects of Information Retrieval
  Storage, Architecture, Parallel and Distributed Processing, HCI
  File Structures (Primary Key Searching - linear, sorted, binary, hashing;
    indexes - sequential, b-tree, tries); Secondary Key Searching (linear,
    index - what to index, multilist); String Search; Signature Files
  Readings: Text, Chaps. 3, 4, 5, 6; Salton, Chaps. 2, 5, 6, 7;
    Frakes and Baeza-Yates, Chaps. 2, 3, 4, 5, 6, 10, 12, 13, 14

Advanced Retrieval Architecture and Data Structures
  (Hardware, Architectures, Data Compression, Data Encryption)
  Readings: Salton, Chaps. 2, 5, 6, 7; Frakes and Baeza-Yates, Chaps. 17, 18

Week 4: Relevance

Weeks 5-6: Content Analysis (Document and Query Representations)
  Data versus Information versus Knowledge
  Data Relationships
  Database Concepts - Models (e.g., Relational)
  Descriptive Cataloging
  Subject Cataloging
  Controlled Vocabulary - Thesauri
  Free Vocabulary
  Morphology
  Term Identification, Stop Words
  Stemming
  Indexing
  Weighted Indexing
  Summarization
  Abstracting
  Manual versus Automatic
  Queries
  ASKs, Negotiation, Intermediaries (Computers, Agents, Humans)
  Interfaces
  User Modeling
  Tactics - Truncation, Field Specification, Proximity
  Readings: Text, Chaps. 4, 12, 13, 14; Salton, Chaps. 8, 9;
    Frakes and Baeza-Yates, Chaps. 7-9

Weeks 7-9: Models and Algorithms for Query Processing
  Ranking
  Bibliometrics
  Matching
  Vector Space, Probabilistic, Generalized Boolean
  Language Models
  Fuzzy Sets
  Readings: Text, Chaps. 7, 8; Frakes and Baeza-Yates, Chaps. 14, 15, 16;
    Salton, Chap. 10

Week 10: Relevance Feedback
  Readings: Text, Chap. 11; Frakes and Baeza-Yates, Chap. 11

Week 11: Performance Measures (Evaluation)
  Efficiency (Speed, Storage, Cost)
  Effectiveness (Recall, Precision)
  Benefit
  User Considerations
  Special - Ranking, Hockey
  Experimentation (e.g., TREC)
  Readings: Text, Chap. 16; Salton, Chap. 8

Weeks 12-13: Modern Technology
  Telecommunications, Networks and LANs, Remote Access
  Computer Languages (e.g., XML, SGML, HTML, VRML, JavaScript)
  Hypertext and Hypermedia
  Electronic Documents
  Nonprint Media
  Music, Sound, Images; Multimedia
  Data and Graphical Retrieval
  Digital Libraries
  Artificial Intelligence (Knowledge Representation, Expert Systems)
  Machine Learning
  Categorization
  Neural Nets and Retrieval
  Rough Sets and Retrieval
  Natural Language Processing (Vocabulary Terms and Phrases, Parsing)
  Citation Nets
  Cross-Language Retrieval (CLIR)
  Question/Answering (Q/A) Systems
  Human-Computer Interaction (and User and Usage Studies, Design)
  Readings: Text, Chap. 14; Walker and Janes, Chap. 14; Salton, Chap. 11

Week 14: Additional Applications
  GIS
  Software Reuse
  Summarization

Week 15: Demonstration of Projects

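
As a small illustration of the effectiveness measures listed under Performance
Measures (Week 11), the sketch below computes set-based precision and recall
for a single query. It is a minimal example and not taken from the course
text: the function name precision_recall and the document ids are
hypothetical, and it assumes unranked retrieval with binary relevance
judgments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant (binary judgments)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 judged relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d2", "d4", "d7", "d8", "d9"])
print(p, r)  # 0.5 0.4
```

Returning 0.0 when either set is empty is one convention among several; an
experiment report should state which convention it uses.
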
Homework
Consider the resource http://www.csc.lsu.edu/~kraft/retrieval.html.
If any of your searches produces more than one or two pages of output, print
out ONLY the first one or two pages as examples!
1. Due: February 5, 2008
a) Using the LSU Middleton Library catalog, find
out what books in the LSU Middleton Library collection have been written or edited
by the Murphy J. Foster Distinguished Chair Professor in the LSU Department of
Computer Science. Are there any books written by others with the same last name
in the Middleton collection?
b) What other books on the topics of other
courses taught by the instructor of this course does the LSU Middleton Library
collection hold?
c) Using the East Baton Rouge Parish Library
catalog, find out what you can about what the author Robert Ludlum has written.
d) Consider the manuscripts in http://www.csc.lsu.edu/~kraft/IR.html.
Discuss how to describe each of them so a user could retrieve them via your
description in response to a query for some information need.
2. Due: March 6, 2008
a) Select a research group where IR research is
taking place and that has an interesting web site (e.g., University of
Massachusetts Center for Intelligent Information Retrieval,
b) Consider some specific research projects
(e.g.,
c) Find out how often the chairman of the LSU
Department of Computer Science has been cited and by whom. Hint: Web of
Science.
d) Try out one of the bibliographic search
engines, e.g., Smart, Okapi, Glimpse, Dialog, Medline (or one of its derivatives,
e.g., Grateful Med). Try searching for
literature on aggregation operators in fuzzy logic.
3. Due: April 1, 2008
a) Consider at least two search engines. Conduct an experiment to test each against
the three queries listed below.
Repeat this for at least one metasearch engine. Evaluate and rank the search/metasearch
engines based on your assessment of their performance. Feel free to be creative in finding
search/metasearch engines to test.
i) Search 1: Find out about Tefko Saracevic and his contributions to information
retrieval
ii) Search 2: Find out who “invented” the World Wide Web
iii) Search 3: Find out about the “semantic Web”
b) What is the most common search topic
presented to these search engines?
c) What is possibility theory in the context of
fuzzy set theory and how does it relate to probability?
4. Due: April 22, 2008
a) Select a digital library (e.g.,
b) Select a museum (e.g., Louvre, Prado,
Guggenheim, or Getty). Explore what is
there available online, what access is given, and what is displayed to the
user.
c) Discuss what hot topics there are in
information retrieval today that users need and on which researchers are
working. Which one(s) is your favorite
and why?
d) Check out an article on web stuff in
Information Research, an online journal (hint: check out http://www.csc.lsu.edu/~kraft/retrieval.html).
5. Due: May 1, 2008
a) Consider again the list of web sites at the
page given above (http://www.csc.lsu.edu/~kraft/retrieval.html). Organize (reorganize) the list according to
some reasonable, intelligent structure.
This may be done as a team effort.
Extra credit will be given for implementing a clustering algorithm on
web site titles (or other metadata) on the computer.
b) If the authors were to do a fourth edition of
the textbook for this course, what changes (additions, deletions,
modifications) should be made?