Write a Java program that interacts with a user to process information retrieval queries. First, prompt for the directory containing the collection of data. Then build an inverted index or an incidence matrix. Each entry in the inverted index should consist of a vocabulary word, the word’s document frequency, and the word’s postings. Each posting should contain a document ID and the term frequency of the word with respect to that document.
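The inverted index described above could be sketched as follows. This is a minimal illustration, not a required design: the class name `InvertedIndexSketch`, the in-memory list of document strings (standing in for files read from the prompted directory), the use of list positions as document IDs, and the tokenizer (lowercase, split on non-alphanumeric characters) are all assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // Builds term -> (document ID -> term frequency).
    // The document ID is assumed to be the document's position in the list.
    public static Map<String, Map<Integer, Integer>> build(List<String> docs) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
            for (String token : docs.get(docId).toLowerCase().split("[^a-z0-9]+")) {
                if (token.isEmpty()) {
                    continue; // split() yields an empty first token if the text starts with a delimiter
                }
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    // Document frequency is the number of postings stored for the term.
    public static int documentFrequency(Map<String, Map<Integer, Integer>> index, String term) {
        Map<Integer, Integer> postings = index.get(term);
        return postings == null ? 0 : postings.size();
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the cat and the hat", "a dog");
        Map<String, Map<Integer, Integer>> index = build(docs);
        // Print each entry: vocabulary word, document frequency, postings.
        index.forEach((term, postings) ->
            System.out.println(term + " df=" + postings.size() + " postings=" + postings));
    }
}
```

Storing the postings as a map from document ID to term frequency means the document frequency does not need a separate field; it is simply the size of the postings map.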
Alternatively, you may build a (non-boolean) incidence matrix: a table in which each row corresponds to a vocabulary word and each column corresponds to a document. Each cell contains the term frequency, an integer giving the number of times the row’s word appears in the column’s document. From these counts, the term frequency and inverse document frequency can be calculated when needed.
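A minimal sketch of the incidence-matrix alternative follows. The question does not fix a TF-IDF formula, so this assumes the common weighting tf × log10(N / df), where N is the number of documents; the class name `IncidenceMatrixSketch`, the tokenizer, and the in-memory document list are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class IncidenceMatrixSketch {

    // Assumed tokenizer: lowercase, split on non-alphanumeric characters, drop empty tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Sorted vocabulary across all documents; row i of the matrix belongs to vocabulary.get(i).
    public static List<String> vocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            vocab.addAll(tokenize(doc));
        }
        return new ArrayList<>(vocab);
    }

    // matrix[row][col] = number of times word `row` appears in document `col`.
    public static int[][] build(List<String> vocab, List<String> docs) {
        Map<String, Integer> rowOf = new HashMap<>();
        for (int i = 0; i < vocab.size(); i++) {
            rowOf.put(vocab.get(i), i);
        }
        int[][] matrix = new int[vocab.size()][docs.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (String token : tokenize(docs.get(d))) {
                Integer row = rowOf.get(token);
                if (row != null) {
                    matrix[row][d]++;
                }
            }
        }
        return matrix;
    }

    // Document frequency: number of columns with a non-zero count in this row.
    public static int documentFrequency(int[] row) {
        int df = 0;
        for (int tf : row) {
            if (tf > 0) {
                df++;
            }
        }
        return df;
    }

    // Assumed weighting: tf * log10(N / df), with N = number of documents (row length).
    public static double tfIdf(int[] row, int docId) {
        int df = documentFrequency(row);
        if (df == 0) {
            return 0.0;
        }
        return row[docId] * Math.log10((double) row.length / df);
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the cat and the hat", "a dog");
        List<String> vocab = vocabulary(docs);
        int[][] matrix = build(vocab, docs);
        int row = vocab.indexOf("the");
        for (int d = 0; d < docs.size(); d++) {
            System.out.println("tf-idf(the, doc " + d + ") = " + tfIdf(matrix[row], d));
        }
    }
}
```

Because the matrix stores raw counts, the document frequency is derived on demand by counting the non-zero cells in a row, which matches the "calculated when needed" approach in the question.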
Next, build the permuterm index, in which each permuterm points back to the vocabulary term that generated it. You will need an array where each record contains a permuterm and its originating vocabulary term.
Finally, build the querying component. The program should prompt the user for a query term and read it in. If the query contains an asterisk, the program should rotate the query into the permuterm form that places the asterisk at the end. It should then search the permuterm index for matching entries, which identify the vocabulary terms to look up in the inverted index or incidence matrix. At that point, the program can compute the TF-IDF score for each matching vocabulary term and return the scores to the user.
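The permuterm construction and wildcard lookup could be sketched as below. Several choices here are assumptions rather than part of the question: the class name `PermutermSketch`, the use of `$` as the end-of-term marker, support for a single asterisk only, and a sorted `TreeMap` in place of the array of records (a sorted array with binary search would serve the same purpose).

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PermutermSketch {

    // Maps every rotation of term + "$" back to the term that generated it.
    public static NavigableMap<String, String> build(Collection<String> vocabulary) {
        NavigableMap<String, String> index = new TreeMap<>();
        for (String term : vocabulary) {
            String marked = term + "$"; // "$" is the assumed end-of-term marker
            for (int i = 0; i < marked.length(); i++) {
                index.put(marked.substring(i) + marked.substring(0, i), term);
            }
        }
        return index;
    }

    // Returns the vocabulary terms matching a query with at most one asterisk.
    public static Set<String> lookup(NavigableMap<String, String> index, String query) {
        Set<String> terms = new TreeSet<>();
        int star = query.indexOf('*');
        if (star < 0) {
            // No wildcard: only the exact rotation "query$" identifies the term.
            String term = index.get(query + "$");
            if (term != null) {
                terms.add(term);
            }
            return terms;
        }
        // Rotate X*Y into Y$X so the asterisk falls at the end; Y$X is then a prefix to match.
        String prefix = query.substring(star + 1) + "$" + query.substring(0, star);
        // Keys sharing a prefix are contiguous in sorted order, so scan until the prefix stops matching.
        for (Map.Entry<String, String> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            terms.add(e.getValue());
        }
        return terms;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> index = build(List.of("hello", "help", "hero", "yellow"));
        System.out.println(lookup(index, "he*o"));
        System.out.println(lookup(index, "hel*"));
        System.out.println(lookup(index, "*low"));
    }
}
```

For example, the query `he*o` is rotated to the prefix `o$he`, which matches the permuterms `o$hell` and `o$her` and therefore the terms `hello` and `hero`. Each returned term would then be looked up in the inverted index or incidence matrix to compute its TF-IDF score.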