CSc 7481 / LIS 7610 Information Retrieval
Schedule: TTh 9:10-10:30 am Coates 169 (Tuesdays) and Coates 171 (Thursdays)
Instructor: Donald H. Kraft Office: 286 Coates Phone: (225) 388-2253
Office Hours: TTh 10:30 am-Noon
Abstract: Information retrieval is concerned with problems relating to the effective storage, access, and manipulation of primarily textual information, which are among the most interesting and challenging problems facing computer and information scientists. Information is continuing to grow in volume and is becoming increasingly available and accessible in computer formats. Moreover, computer networks, including the Internet and the World Wide Web, are making communication of information easier, while new computer architectures make it less expensive. In addition, new technology has made feasible the introduction of powerful and sophisticated algorithms to store, retrieve, and present massive volumes of information on a variety of media in new and better ways (e.g., cross-language, multimedia, hypertext/hypermedia, natural language, digital libraries, and the web).
Text: Meadow, C.T., Boyce, B.R., and Kraft, D.H. Text Information Retrieval Systems, 3rd edition, San Diego, CA: Academic Press, 2007
Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer,
Frakes, W.B. and Baeza-Yates, R. (Eds.) Information Retrieval: Data Structures and Algorithms,
Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval,
Grossman, D.A. and Frieder, O. Information Retrieval: Algorithms and Heuristics, 2nd edition, The Netherlands, Springer, 2004,
Korfhage, R.R., Information Storage and Retrieval,
Sparck Jones, K. and Willett, P. (Eds.),
Walker, G. and Janes, J., Online Retrieval, 2nd edition,
Crochemore, M. and Rytter, W. Text Algorithms,
Del Bimbo, A. Visual Information Retrieval,
Lesk, M. Practical Digital Libraries: Books, Bytes & Bucks,
Maybury, M.T., Intelligent Multimedia Information Retrieval,
Miyamoto, S., Fuzzy Systems in Information Retrieval and Cluster
Shneiderman, B., Designing the User Interface: Strategies for Effective Human-Computer Interaction,
van Rijsbergen, C.J., Information Retrieval, 2nd edition,
Other interesting books certainly exist; feel free to apply information retrieval techniques to find them. In addition, the American Society for Information Science (ASIS) annual meeting Proceedings, the Association for Computing Machinery (ACM)/Special Interest Group on Information Retrieval (SIGIR) Forum and International Conference on Research and Development in Information Retrieval (ICRDIR) Proceedings, and the ACM Conference on Information and Knowledge Management (CIKM) Proceedings have good articles. Moreover, journals with good articles include Information Processing and Management, Information Retrieval, and the Journal of the American Society for Information Science and Technology (JASIST). CACM, ACM/TOIS, and IEEE Computer may have articles of interest, too.
The Course: The course will be run as a seminar, with students reading research materials and participating in the class discussion. This will include doing some homework, using and experimenting with some of the retrieval, search engine, and digital library systems. A project, involving design and implementation of a small retrieval subsystem OR use of a bibliographic database for retrieval experiments, is also required. In addition, a research paper, along the lines of a bibliographic essay, is required. The project will count for 45% of the final grade, the research paper will count for 33% of the grade, and the homework and class participation will count for 22% of the grade. The project may be done individually or, better yet, in small teams (2-3 students). The research paper must be done individually, as must the homework unless explicitly specified otherwise.
The Research Paper: Sample topics for the research paper can be found in the list immediately below. In addition, some ideas might be gained by perusing the list of sample topics for the project below. However, the topic chosen for the research paper must not be identical to the topic chosen for the project! It is anticipated that this paper will be relatively short, a fuzzy 10 pages or so, certainly shorter than the project. It should consist of a bibliographic essay describing a concept and relaying the state of the art in terms of the information technology and its relationship to retrieval (in other words, do not provide a tutorial; rather, focus on research related to retrieval). The choice of topic must be approved by the instructor and must be specified in writing and approved by February 12th! The research paper will be due on April 22nd!
Hints: do not simply concatenate several abstracts from papers; do not rely solely on textbooks; do weave a pattern of what is going on with the topic selected based upon the open literature in the field, especially in terms of journals, conference proceedings, technical reports, and, perhaps, even web pages if needed; do not forget to list your references, and use them, citing them in the paper; and, most importantly, do not simply copy entire sections, or even papers, especially without citing them - it is academically dishonest as well as intellectually dishonest; you will be caught and violators will be prosecuted.
Sample Research Paper Topics: Relevance Research; Clustering; Rules Based on Fuzzy Clustering; Text Processing; Storage Technology (e.g., Disk, CD-ROM, WORM, DVD); Full Text Retrieval; Data Retrieval; Cross-Language Retrieval; Language Models for Retrieval; Text Algorithms (e.g., string search, pattern matching, string similarities); Rough Sets for Retrieval; Bibliometrics for Retrieval; Automatic Abstracting; Text Summarization; Data Warehousing; Hypertext/Hypermedia Systems; Non-print Media (e.g., images) Retrieval; Multimedia Retrieval; Digital Libraries; Recommender Systems; User Interfaces (Graphical, Others); Visualization for Retrieval; Electronic Publishing; World Wide Web Retrieval (Search Engines, Metasearch); Retrieval Applets in Web Languages (e.g., HTML, VRML, XML, SGML, Perl, Java, CGI); Expert Systems (Rule Retrieval, Indexing, Retrieval); Natural Language Processing and Retrieval; Neural Nets/Connectionist Models of Retrieval; Evolutionary Computing (e.g., Genetic Algorithms and Genetic Programming) for Information Retrieval; Uncertainty and Imprecision in Retrieval (e.g., Fuzzy and/or Rough Sets, Belief Functions); Inference and Retrieval; Retrieval with Parallel Architectures and/or Distributed Processing; Retrieval Performance and Evaluation Issues; Data and File Structures for Retrieval (e.g., MAT tries); Data Encryption; Data Compression; Data Mining (Knowledge Discovery) and Retrieval; Information Filtering; Intelligent Agents for Retrieval; Query Expansion; Information Brokers; Data Fusion and Retrieval; Transaction Log Analysis; Graphical Models; Temporal Information; The Datalog System.
The Project: The project will consist of a written report, plus the software if a system (even if it is but a pilot) was implemented or the results if a set of (retrieval) experiments was conducted. The purpose is to develop mechanisms to improve the state of the art of information retrieval. Again, one wants to show what is currently being done in a given subarea of retrieval with a familiarity with the current literature and activity in that subarea, and show an ability to work with such systems. Do NOT do a simple database application, since this is a retrieval course; you can add a retrieval component (e.g., text retrieval, or nonprint, i.e., images and/or sound) to a database by adding imprecise data and/or queries. Moreover, the choice of project must be approved by the instructor and must be specified in writing and approved by February 12th! The topic must not be exactly identical to the topic chosen for the oral presentation nor to the topic chosen for the research paper! The project will be due on April 24th. One may form a small team of two or three people to do the project, but the more people involved, the more effort expected.
Sample Projects: Development of a Graphical User Interface for a Specific Retrieval Situation or System (e.g., SMART); Installing a Retrieval System (e.g., Smart, Terrier, Cheshire (http://www.cheshire3.org), Inquery, MG, Lucene (http://lucene.apache.org) or Glimpse); Development of a Neural Net Model for Retrieval; Implementing Learning to Rank (http://research.microsoft.com/users/LETOR); Testing Various Clustering Methods for Retrieval; Implementing k-nearest neighbor (KNN) clustering especially for ranking; Implementing Various Text Algorithms; In-Depth Testing of Various Web Search Engines; Exploring/Evaluating a Digital Library; Developing a Digital Library Involving Users; Data Mining on the Web; Development of a Pilot Hypertext System; Implementing the open-source machine translation platform Apertium for retrieval (http://xixona.dlsi.ua.es/apertium-www/); Development of an Expert System for Retrieval; Applying Natural Language Processing to Query and/or Document Analysis; Developing a Retrieval Model to Exploit a Given Parallel Architecture; Testing Clustering (Fuzzy and Crisp) Methods for Retrieval; Applying a Belief Function to a Model of Retrieval; Testing Experimental Retrieval Systems (e.g., Okapi, Smart, Cheshire, Terrier, or MG, Glimpse, Inquery); Experimenting with Retrieval Data (e.g., TREC Data); Development of Retrieval Performance Measures for Ranked Output; Experimentation with Various Means for Boolean Relevance Feedback (e.g., Genetic Algorithms, OCAT); Testing Various File Structures for Retrieval; Testing Various Data Compression Methods for Storage; Testing Various Encryption Algorithms for Text; Development of a Retrieval Model or System Based on Document Components; Development of a Model to Relate Query Complexity to Retrieval Performance; Development of a Retrieval Model Applied to Software Reuse; Application of Bibliometric Laws to a Retrieval System; Empirically Evaluating Various Aggregation Methods for Retrieval Status Value Determination for Boolean Queries; Experimentation with Cross-Language Retrieval; Running Experiments on the BBC Algorithm for Boolean Queries
Course Topics to be covered include:
Information Retrieval Systems Week 1
Related Information Systems (e.g., DBMS, Q/A, MIS, DSS, Full-Text, DL)
Text, Chaps. 1,2;
Frakes and Baeza-Yates, Chap. 1
Social Issues - Intellectual Access, Restricted Access
History of Information Retrieval Systems Week 2
Commercially Available (Online Retrieval Systems, e.g., Dialog, BRS, Medline, OCLC)
Text, Chaps. 8,15;
Walker and Janes, Chaps. 1-2,3,5-7,9-10,13,15
Computing Aspects of Information Retrieval Week 3
Storage, Architecture, Parallel and Distributed Processing, HCI
File Structures (Primary Key Searching - linear, sorted, binary, hashing;
indexes - sequential, b-tree, tries); Secondary Key Searching (linear,
index - what to index, multilist); String Search; Signature Files
Text, Chaps. 3,4,5,6;
Salton, Chaps. 2,5,6,7;
Frakes and Baeza-Yates, Chaps. 2,3,4,5,6,10,12,13,14
Advanced Retrieval Architecture and Data Structures
(Hardware, Architectures, Data Compression, Data Encryption)
Salton, Chaps. 2,5,6,7;
Frakes and Baeza-Yates, Chaps. 17,18
Relevance Week 4
Content Analysis (Document and Query Representations) Weeks 5-6
Data versus Information versus Knowledge
Database Concepts - Models (e.g., Relational)
Controlled Vocabulary - Thesauri
Term Identification, Stop Words
Manual versus Automatic
ASKs, Negotiation, Intermediaries (Computers, Agents, Humans)
Tactics - Truncation, Field Specification, Proximity
Text, Chaps. 4,12,13,14;
Salton, Chaps. 8,9;
Frakes and Baeza-Yates, Chaps. 7-9
Models and Algorithms for Query Processing Weeks 7-9
Vector Space, Probabilistic, Generalized Boolean
Text, Chaps. 7,8;
Frakes and Baeza-Yates, Chaps. 14,15,16;
Salton, Chap. 10
Relevance Feedback Week 10
Text, Chap. 11; Frakes and Baeza-Yates, Chap. 11
Performance Measures (Evaluation) Week 11
Efficiency (Speed, Storage, Cost)
Effectiveness (Recall, Precision)
Special - Ranking, Hockey
Experimentation (e.g., TREC)
Text, Chap. 16;
Salton, Chap. 8
Modern Technology Weeks 12-13
Text, Chap. 14
Walker and Janes, Chap. 14
Salton, Chap. 11
Telecommunications, Networks and LANs, Remote Access
Hypertext and Hypermedia
Music, Sound, Images; Multimedia
Data and Graphical Retrieval
Artificial Intelligence (Knowledge Representation, Expert Systems)
Neural Nets and Retrieval
Rough Sets and Retrieval
Natural Language Processing (Vocabulary Terms and Phrases, Parsing)
Cross-Language Retrieval (CLIR)
Question/Answering (Q/A) Systems
Human-Computer Interaction (and User and Usage Studies, Design)
Additional Applications Week 14
Demonstration of Projects Week 15
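As a small illustration of the effectiveness measures listed under Performance Measures above (recall and precision), the following is a minimal sketch in Python. The function name precision_recall and the document identifiers are hypothetical and chosen only for this example; set-based precision and recall are the standard definitions, not a method specific to this course or its texts.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: identifiers of the documents the system returned
    relevant:  identifiers of the documents judged relevant
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents that were retrieved
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical data: four retrieved documents, three relevant ones.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were retrieved (recall about 0.67).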
Consider the resource http://www.csc.lsu.edu/~kraft/retrieval.html. If any of your searches produce more than one or two pages of output, print out ONLY the first one or two pages as examples!
1. Due: February 5, 2008
a) Using the LSU Middleton Library catalog, find out what books in the LSU Middleton Library collection have been written or edited by the Murphy J. Foster Distinguished Chair Professor in the LSU Department of Computer Science. Are there any books written by others with the same last name in the Middleton collection?
b) What other books on the topics of other courses taught by the instructor of this course does the LSU Middleton Library collection hold?
c) Using the East Baton Rouge Parish Library catalog, find out what you can about what the author Robert Ludlum has written.
d) Consider the manuscripts in http://www.csc.lsu.edu/~kraft/IR.html. Discuss how to describe each of them so a user could retrieve them via your description in response to a query for some information need.
2. Due: March 6, 2008
a) Select a research group where IR research is taking place and that has an interesting web site (e.g., University of Massachusetts Center for Intelligent Information Retrieval,
b) Consider some specific research projects
c) Find out how often the chairman of the LSU Department of Computer Science has been cited and by whom. Hint: Web of Science.
d) Try out one of the bibliographic search engines, e.g., Smart, Okapi, Glimpse, Dialog, Medline (or one of its derivatives, e.g., Grateful Med). Try searching for literature on aggregation operators in fuzzy logic.
3. Due: April 1, 2008
a) Consider at least two search engines. Conduct an experiment to test each against the following three queries noted below. Repeat this for at least one metasearch engine. Evaluate and rank the search/metasearch engines based on your assessment of their performances. Feel free to be creative in finding search/metasearch engines to test.
i) Search 1: Find out about Tefko Saracevic and his contributions to information retrieval
ii) Search 2: Find out who “invented” the World Wide Web
iii) Search 3: Find out about the “semantic Web”
b) What is the most common search topic presented to these search engines?
c) What is possibility theory in the context of fuzzy set theory and how does it relate to probability?
4. Due: April 22, 2008
a) Select a digital library (e.g.,
b) Select a museum (e.g., Louvre, Prado, Guggenheim, or Getty). Explore what is there available online, what access is given, and what is displayed to the user.
c) Discuss what hot topics there are in information retrieval today that users need and on which researchers are working. Which one(s) is your favorite and why?
d) Check out an article on web stuff in Information Research, an online journal (hint: check out http://www.csc.lsu.edu/~kraft/retrieval.html).
5. Due: May 1, 2008
a) Consider again the list of web sites at the web site given above (http://www.csc.lsu.edu/~kraft/retrieval.html). Organize (reorganize) the list according to some reasonable, intelligent structure. This may be done as a team effort. Extra credit will be given for implementing a clustering algorithm on web site title (or other metadata) on the computer.
b) If the authors were to do a fourth edition of the textbook for this course, what changes (additions, deletions, modifications) should be made?
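As a starting point for the extra-credit part of 5(a), the following is a minimal sketch in Python of one possible way to cluster web site titles. It is an assumption-laden illustration, not a prescribed solution: the titles are made up, the stop-word list and the 0.25 threshold are arbitrary, and the technique shown (single-link grouping by Jaccard similarity of title words) is only one of many clustering methods that could be used.

```python
import re
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "of", "for", "and", "in", "on"}


def tokens(title):
    """Lowercase a title and return its set of non-stop-word terms."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower())
            if w not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_titles(titles, threshold=0.25):
    """Single-link clustering: titles whose similarity meets the
    threshold end up, directly or transitively, in the same cluster."""
    parent = list(range(len(titles)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    term_sets = [tokens(t) for t in titles]
    for i, j in combinations(range(len(titles)), 2):
        if jaccard(term_sets[i], term_sets[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i, title in enumerate(titles):
        groups.setdefault(find(i), []).append(title)
    return list(groups.values())


# Made-up titles, not taken from the actual list of web sites.
titles = [
    "Digital Library Research Group",
    "Digital Library Initiative",
    "Web Search Engine Watch",
    "Search Engine Evaluation",
]
clusters = cluster_titles(titles)
for group in clusters:
    print(group)
```

With these sample titles, the two "Digital Library" entries fall into one cluster and the two "Search Engine" entries into another. Raising the threshold produces more, smaller clusters; lowering it merges more titles together.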