CSc 7481 / LIS 7610 Information Retrieval

Spring 2008

Schedule: TTh 9:10-10:30 am Coates 169 (Tuesdays) and Coates 171 (Thursdays)

Instructor: Donald H. Kraft Office: 286 Coates Phone: (225) 388-2253

Office Hours: TTh 10:30 am-Noon

Email: kraft@csc.lsu.edu

Web: http://www.csc.lsu.edu/~kraft/courses/csc7481.html

PowerPoint: http://www.csc.lsu.edu/~kraft/courses/csc7481_files/frame.htm

Abstract: Information retrieval is concerned with problems relating to the effective storage, access, and manipulation of primarily textual information, which are among the most interesting and challenging problems facing computer and information scientists. Information is continuing to grow in volume and is becoming increasingly available and accessible in computer formats. Moreover, computer networks, including the Internet and the World Wide Web, are making communication of information easier, while new computer architectures make it less expensive. In addition, new technology has made feasible the introduction of powerful and sophisticated algorithms to store, retrieve, and present massive volumes of information on a variety of media in new and better ways (e.g., cross-language, multimedia, hypertext/hypermedia, natural language, digital libraries, and the web).

Text: Meadow, C.T., Boyce, B.R., and Kraft, D.H. Text Information Retrieval Systems, 3rd edition, San Diego, CA: Academic Press, 2007

References:

Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Reading, MA: Addison-Wesley, 1989, QA76.9T48S25

Frakes, W.B. and Baeza-Yates, R. (Eds.) Information Retrieval: Data Structures and Algorithms, Englewood Cliffs, NJ: Prentice-Hall, 1992, QA76.9.D35.I543

Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval, New York, NY: ACM Press and Harlow, England: Addison Wesley Longman Ltd., 1999, Z667.B34

Grossman, D.A. and Frieder, O. Information Retrieval: Algorithms and Heuristics, 2nd edition, The Netherlands: Springer, 2004

Korfhage, R.R., Information Storage and Retrieval, New York, NY: John Wiley & Sons, Inc., 1997, QA76.9D3K657

Sparck Jones, K. and Willett, P. (Eds.), Readings in Information Retrieval, San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1997, Z695.9.R43

Walker, G. and Janes, J., Online Retrieval, 2nd edition, Englewood, CO: Libraries Unlimited, 1999, Z699.35.055W35

Witten, I.H., Moffat, A., and Bell, T.C. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, San Francisco, CA: Morgan Kaufmann Publishers, 1999, TA1637.W58

Other Texts:

Crochemore, M. and Rytter, W. Text Algorithms, New York, NY: Oxford University Press, 1994, QA76.9T48C76

Del Bimbo, A. Visual Information Retrieval, San Francisco, CA: Morgan Kaufmann Publishers, 1999

Lancaster, F.W. Indexing and Abstracting in Theory and Practice, London, England: Library Association, 1992

Lesk, M. Practical Digital Libraries: Books, Bytes & Bucks, San Francisco, CA: Morgan Kaufmann Publishers, 1997, Z692.C65L47

Maybury, M.T., Intelligent Multimedia Information Retrieval, Cambridge, MA: MIT Press, 1997

Miyamoto, S., Fuzzy Systems in Information Retrieval and Cluster Analysis, Boston, MA: Kluwer, 1990, QA248M5117

Shneiderman, B., Designing the User Interface: Strategies for Effective Human-Computer Interaction, Reading, MA: Addison-Wesley, 1987, QA76.9I58S47

van Rijsbergen, C.J., Information Retrieval, 2nd edition, London, England: Butterworths, 1979, Z699V35

Other interesting books certainly exist; feel free to apply information retrieval techniques to find them. In addition, the American Society for Information Science (ASIS) annual meeting Proceedings, the Association for Computing Machinery (ACM)/Special Interest Group on Information Retrieval (SIGIR) Forum and International Conference on Research and Development in Information Retrieval (ICRDIR) Proceedings, and the ACM Conference on Information and Knowledge Management (CIKM) Proceedings have good articles. Moreover, journals with good articles include Information Processing and Management, Information Retrieval, and the Journal of the American Society for Information Science and Technology (JASIST). CACM, ACM/TOIS, and IEEE Computer may have articles of interest, too.

The Course: The course will be run as a seminar, with students reading research materials and participating in the class discussion. This will include doing some homework and using and experimenting with some of the retrieval, search engine, and digital library systems. A project, involving design and implementation of a small retrieval subsystem OR use of a bibliographic database for retrieval experiments, is also required. In addition, a research paper, along the lines of a bibliographic essay, is required. The project will count for 45% of the final grade, the research paper will count for 33% of the grade, and the homework and class participation will count for 22% of the grade. The project may be done individually or, better yet, in small teams (2-3 students). The research paper must be done individually, as must the homework unless explicitly specified otherwise.

The Research Paper: Sample topics for the research paper can be found in the list immediately below. In addition, some ideas might be gained by perusing the list of sample topics for the project below. However, the topic chosen for the research paper must not be identical to the topic chosen for the project! It is anticipated that this paper will be relatively short, a fuzzy 10 pages or so, certainly shorter than the project. It should consist of a bibliographic essay describing a concept and relaying the state of the art in terms of the information technology and its relationship to retrieval (in other words, do not provide a tutorial; rather, focus on research related to retrieval). The choice of topic must be approved by the instructor and must be specified in writing and approved by February 12th! The research paper will be due on April 22nd!

Hints: do not simply concatenate several abstracts from papers; do not rely solely on textbooks; do weave a pattern of what is going on with the topic selected based upon the open literature in the field, especially in terms of journals, conference proceedings, technical reports, and, perhaps, even web pages if needed; do not forget to list your references, and use them, citing them in the paper; and, most importantly, do not simply copy entire sections, or even papers, especially without citing them - it is academically dishonest as well as intellectually dishonest; you will be caught and violators will be prosecuted.

Sample Research Paper Topics: Relevance Research; Clustering; Rules Based on Fuzzy Clustering; Text Processing; Storage Technology (e.g., Disk, CD-ROM, WORM, DVD); Full Text Retrieval; Data Retrieval; Cross-Language Retrieval; Language Models for Retrieval; Text Algorithms (e.g., string search, pattern matching, string similarities); Rough Sets for Retrieval; Bibliometrics for Retrieval; Automatic Abstracting; Text Summarization; Data Warehousing; Hypertext/Hypermedia Systems; Non-print Media (e.g., images) Retrieval; Multimedia Retrieval; Digital Libraries; Recommender Systems; User Interfaces (Graphical, Others); Visualization for Retrieval; Electronic Publishing; World Wide Web Retrieval (Search Engines, Metasearch); Retrieval Applets in Web Languages (e.g., HTML, VRML, XML, SGML, Perl, Java, CGI); Expert Systems (Rule Retrieval, Indexing, Retrieval); Natural Language Processing and Retrieval; Neural Nets/Connectionist Models of Retrieval; Evolutionary Computing (e.g., Genetic Algorithms and Genetic Programming) for Information Retrieval; Uncertainty and Imprecision in Retrieval (e.g., Fuzzy and/or Rough Sets, Belief Functions); Inference and Retrieval; Retrieval with Parallel Architectures and/or Distributed Processing; Retrieval Performance and Evaluation Issues; Data and File Structures for Retrieval (e.g., MAT tries); Data Encryption; Data Compression; Data Mining (Knowledge Discovery) and Retrieval; Information Filtering; Intelligent Agents for Retrieval; Query Expansion; Information Brokers; Data Fusion and Retrieval; Transaction Log Analysis; Graphical Models; Temporal Information; The Datalog System.

The Project: The project will consist of a written report, plus the software if a system (even if it is but a pilot) was implemented or the results if a set of (retrieval) experiments was conducted. The purpose is to develop mechanisms to improve the state of the art of information retrieval. Again, one wants to show what is currently being done in a given subarea of retrieval, demonstrating familiarity with the current literature and activity in that subarea and an ability to work with such systems. Do NOT do a simple database application, since this is a retrieval course; you can add a retrieval component (e.g., text retrieval, or nonprint, i.e., images and/or sound) to a database by adding imprecise data and/or queries. Moreover, the choice of project must be approved by the instructor and must be specified in writing and approved by February 12th! The topic must not be exactly identical to the topic chosen for the oral presentation nor to the topic chosen for the research paper! The project will be due on April 24th. One may form a small team of two or three people to do the project, but the more people involved, the more effort is expected.

Sample Projects: Development of a Graphical User Interface for a Specific Retrieval Situation or System (e.g., SMART); Installing a Retrieval System (e.g., Smart, Terrier, Cheshire (http://www.cheshire3.org), Inquery, MG, Lucene (http://lucene.apache.org) or Glimpse); Development of a Neural Net Model for Retrieval; Implementing Learning to Rank (http://research.microsoft.com/users/LETOR); Testing Various Clustering Methods for Retrieval; Implementing k-nearest neighbor (KNN) clustering especially for ranking; Implementing Various Text Algorithms; In-Depth Testing of Various Web Search Engines; Exploring/Evaluating a Digital Library; Developing a Digital Library Involving Users; Data Mining on the Web; Development of a Pilot Hypertext System; Implementing the open-source machine translation platform Apertium for retrieval (http://xixona.dlsi.ua.es/apertium-www/); Development of an Expert System for Retrieval; Applying Natural Language Processing to Query and/or Document Analysis; Developing a Retrieval Model to Exploit a Given Parallel Architecture; Testing Clustering (Fuzzy and Crisp) Methods for Retrieval; Applying a Belief Function to a Model of Retrieval; Testing Experimental Retrieval Systems (e.g., Okapi, Smart, Cheshire, Terrier, MG, Glimpse, or Inquery); Experimenting with Retrieval Data (e.g., TREC Data); Development of Retrieval Performance Measures for Ranked Output; Experimentation with Various Means for Boolean Relevance Feedback (e.g., Genetic Algorithms, OCAT); Testing Various File Structures for Retrieval; Testing Various Data Compression Methods for Storage; Testing Various Encryption Algorithms for Text; Development of a Retrieval Model or System Based on Document Components; Development of a Model to Relate Query Complexity to Retrieval Performance; Development of a Retrieval Model Applied to Software Reuse; Application of Bibliometric Laws to a Retrieval System; Empirically Evaluating Various Aggregation Methods for Retrieval Status Value Determination for Boolean Queries; Experimentation with Cross-Language Retrieval; Running Experiments on the BBC Algorithm for Boolean Queries

Course Topics to be covered include:

Information Retrieval Systems Week 1

Definitions

Related Information Systems (e.g., DBMS, Q/A, MIS, DSS, Full-Text, DL)

Text, Chaps. 1,2;

Frakes and Baeza-Yates, Chap. 1

Social Issues - Intellectual Access, Restricted Access

History of Information Retrieval Systems Week 2

Traditional Systems

Commercially Available (Online Retrieval Systems, e.g., Dialog, BRS, Medline, OCLC)

Text, Chaps. 8,15;

Walker and Janes, Chaps. 1-2,3,5-7,9-10,13,15

Computing Aspects of Information Retrieval Week 3

Storage, Architecture, Parallel and Distributed Processing, HCI

File Structures (Primary Key Searching - linear, sorted, binary, hashing; indexes - sequential, b-tree, tries); Secondary Key Searching (linear, index - what to index, multilist); String Search; Signature Files

Text, Chaps. 3,4,5,6;

Salton, Chaps. 2,5,6,7;

Frakes and Baeza-Yates, Chaps. 2,3,4,5,6,10,12,13,14

Advanced Retrieval Architecture and Data Structures (Hardware, Architectures, Data Compression, Data Encryption)

Salton, Chaps. 2,5,6,7;

Frakes and Baeza-Yates, Chaps. 17,18

Relevance Week 4

Content Analysis (Document and Query Representations) Weeks 5-6

Data versus Information versus Knowledge

Data Relationships

Database Concepts - Models (e.g., Relational)

Descriptive Cataloging

Subject Cataloging

Controlled Vocabulary - Thesauri

Free Vocabulary

Morphology

Term Identification, Stop Words

Stemming

Indexing

Weighted Indexing

Summarization

Abstracting

Manual versus Automatic

Queries

ASKs, Negotiation, Intermediaries (Computers, Agents, Humans)

Interfaces

User Modeling

Tactics - Truncation, Field Specification, Proximity

Text, Chaps. 4,12,13,14;

Salton, Chaps. 8,9;

Frakes and Baeza-Yates, Chaps. 7-9

Models and Algorithms for Query Processing Weeks 7-9

Ranking

Bibliometrics

Matching

Vector Space, Probabilistic, Generalized Boolean

Language Models

Fuzzy sets

Text, Chaps. 7,8;

Frakes and Baeza-Yates, Chaps. 14,15,16;

Salton, Chap. 10

Relevance Feedback Week 10

Text, Chap. 11; Frakes and Baeza-Yates, Chap. 11

Performance Measures (Evaluation) Week 11

Efficiency (Speed, Storage, Cost)

Effectiveness (Recall, Precision)

Benefit

User Considerations

Special - Ranking, Hockey

Experimentation (e.g., TREC)

Text, Chap. 16;

Salton, Chap. 8

Modern Technology Weeks 12-13

Text, Chap. 14

Walker and Janes, Chap. 14

Salton, Chap. 11

Telecommunications, Networks and LANs, Remote Access

Computer Languages (e.g., XML, SGML, HTML, VRML, JavaScript)

Hypertext and Hypermedia

Electronic Documents

Nonprint Media

Music, Sound, Images; Multimedia

Data and Graphical Retrieval

Digital Libraries

Artificial Intelligence (Knowledge Representation, Expert Systems)

Machine Learning

Categorization

Neural Nets and Retrieval

Rough Sets and Retrieval

Natural Language Processing (Vocabulary Terms and Phrases, Parsing)

Citation Nets

Cross-Language Retrieval (CLIR)

Question/Answering (Q/A) Systems

Human-Computer Interaction (and User and Usage Studies, Design)

Additional Applications Week 14

GIS

Software Reuse

Summarization

Demonstration of Projects Week 15

Homework

Consider the resource http://www.csc.lsu.edu/~kraft/retrieval.html . If any of your searches have more than one or two pages of output, just print out ONLY the first one or two pages as examples!

1. Due: February 5, 2008

a) Using the LSU Middleton Library catalog, find out what books in the LSU Middleton Library collection have been written or edited by the Murphy J. Foster Distinguished Chair Professor in the LSU Department of Computer Science. Are there any books written by others with the same last name in the Middleton collection?

b) What other books on the topics of other courses taught by the instructor of this course does the LSU Middleton Library collection hold?

c) Using the East Baton Rouge Parish Library catalog, find out what you can about what the author Robert Ludlum has written.

d) Consider the manuscripts in http://www.csc.lsu.edu/~kraft/IR.html. Discuss how to describe each of them so a user could retrieve them via your description in response to a query for some information need.

2. Due: March 6, 2008

a) Select a research group where IR research is taking place and who has an interesting web site (e.g., University of Massachusetts Center for Intelligent Information Retrieval, University of Arizona MIS AI Group, University of Glasgow IDOM-IR, Virginia Tech). Discuss briefly the ongoing projects and work going on in that group.

b) Consider some specific research projects (e.g., Rutgers University and AntWorld, Bar-Ilan University and the Responsa Project, Susan Dumais at Microsoft Research, Drexel University and Xia Lin's Concept Maps, Princeton University and WordNet, CUNY-Queens College and K.L. Kwok's research lab in New York City, Ray Larson and Cheshire at University of California, Berkeley). Explore what one of these projects is, try out any available demonstrations, and report what you find.

c) Find out how often the chairman of the LSU Department of Computer Science has been cited and by whom. Hint: Web of Science.

d) Try out one of the bibliographic search engines, e.g., Smart, Okapi, Glimpse, Dialog, Medline (or one of its derivatives, e.g., Grateful Med). Try searching for literature on aggregation operators in fuzzy logic.

3. Due: April 1, 2008

a) Consider at least two search engines. Conduct an experiment to test each against the three queries noted below. Repeat this for at least one metasearch engine. Evaluate and rank the search/metasearch engines based on your assessment of their performance. Feel free to be creative in finding search/metasearch engines to test.

i) Search 1: Find out about Tefko Saracevic and his contributions to information retrieval

ii) Search 2: Find out who “invented” the World Wide Web

iii) Search 3: Find out about the “semantic Web”

b) What is the most common search topic presented to these search engines?

c) What is possibility theory in the context of fuzzy set theory and how does it relate to probability?
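
One possible way to make the comparison in 3a concrete, offered only as a sketch: judge each of the top k results per query as relevant (1) or not (0), compute precision at k for each query, and rank the engines by their mean score. The engine names, query labels, and judgments below are made-up placeholders, not real results, and precision at k is just one of several measures you might choose.

```python
from statistics import mean


def precision_at_k(judgments, k):
    """Fraction of the top-k results judged relevant (1 = relevant, 0 = not)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(judgments[:k]) / k


def rank_engines(runs, k):
    """Return (engine, mean precision at k) pairs, best engine first."""
    scores = {
        engine: mean(precision_at_k(j, k) for j in per_query.values())
        for engine, per_query in runs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical relevance judgments for the top 5 results of each search.
runs = {
    "engine_A": {
        "saracevic": [1, 1, 0, 1, 0],
        "www_inventor": [1, 0, 0, 0, 0],
        "semantic_web": [1, 1, 1, 0, 0],
    },
    "engine_B": {
        "saracevic": [1, 0, 0, 0, 0],
        "www_inventor": [1, 1, 0, 0, 0],
        "semantic_web": [0, 1, 0, 0, 0],
    },
}

ranking = rank_engines(runs, k=5)
for engine, score in ranking:
    print(f"{engine}: mean P@5 = {score:.2f}")
```

Because the judgments are your own, it is worth recording the criteria you used to call a result relevant so that the ranking can be defended in your write-up.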

4. Due: April 22, 2008

a) Select a digital library (e.g., University of California - Berkeley, University of Michigan, University of California - Santa Barbara, Stanford University, University of Illinois, Carnegie Mellon University, New Zealand). Explore what it has in its collection, what areas are covered, what materials and in what media forms are available, what access is given to those materials, and what is displayed to users in response to queries. Discuss what research has been done or is ongoing to enable all of this.

b) Select a museum (e.g., Louvre, Prado, Guggenheim, or Getty). Explore what is there available online, what access is given, and what is displayed to the user.

c) Discuss what hot topics there are in information retrieval today that users need and on which researchers are working. Which one(s) is your favorite and why?

d) Check out an article on web stuff in Information Research, an online journal (hint: check out http://www.csc.lsu.edu/~kraft/retrieval.html).

5. Due: May 1, 2008

a) Consider again the list of web sites at the web site given above (http://www.csc.lsu.edu/~kraft/retrieval.html). Organize (reorganize) the list according to some reasonable, intelligent structure. This may be done as a team effort. Extra credit will be given for implementing a clustering algorithm on web site title (or other metadata) on the computer.

b) If the authors were to do a fourth edition of the text book for this course, what changes (additions, deletions, modifications) should be made?
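
For the extra-credit option in 5a, a minimal sketch of one way to cluster web site titles is shown here, assuming single-link grouping on Jaccard similarity of title words. The titles, the stop-word list, and the 0.25 threshold are illustrative assumptions; other similarity measures and clustering methods covered in the course would serve equally well.

```python
import re

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def tokens(title):
    """Lowercase word set of a title with stop words removed."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def single_link_clusters(titles, threshold=0.25):
    """Group titles so that any pair with similarity >= threshold is linked."""
    token_sets = [tokens(t) for t in titles]
    parent = list(range(len(titles)))  # union-find forest over title indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if jaccard(token_sets[i], token_sets[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, title in enumerate(titles):
        clusters.setdefault(find(i), []).append(title)
    return list(clusters.values())


# Hypothetical titles standing in for the entries on the course list.
titles = [
    "Example Digital Library Project",
    "Example Digital Library Archive",
    "Sample Web Search Engine",
    "Sample Web Search Directory",
    "Fuzzy Logic Tutorial",
]

clusters = single_link_clusters(titles)
for group in clusters:
    print(group)
```

Single-link grouping can chain loosely related titles together, so trying a few thresholds and inspecting the resulting groups by hand is a reasonable sanity check.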