.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id$
.\"------------------------------------------------------------
.TH mgquery 1 \*(Dt CITRI
.SH NAME
mgquery \- query program for the mg system
.SH SYNOPSIS
.B mgquery
[
.B \-h
]
[
.B \-D
]
[
.BI \-f " name"
]
[
.BI \-d " directory"
]
.if n .ti +9n
[
.I collection-name
]
.SH DESCRIPTION
.B mgquery
enables users to make Boolean or ranked queries from a data base
generated by the
.BR mg (1)
system.
It accepts queries from
.I stdin
and sends the retrieved documents to
.IR stdout .
Information on the resource usage of
.B mgquery
as it processes queries can be obtained interactively.
.SH OPTIONS
Options may appear in any order, but the
.IR collection-name ,
if specified, must be last.
.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-D
This option causes the entire text to be decompressed and sent to
.IR stdout .
.TP
.BI \-f " name"
This specifies the base name of the document collection that will be used.
If a collection with the specified base
.I name
does not exist, an error message will be displayed and
.B mgquery
will exit.
.TP
.BI \-d " directory"
This specifies the directory where the document collection can be found.
.SH USAGE
Prior to processing the command line arguments, the
.B mgquery
program attempts to read in a startup script called
.IR ./.mgrc .
If that fails, it attempts to read in the file
.IR $HOME/.mgrc .
The startup file can only contain commands\(emno queries are permitted in the
.I .mgrc
file.
Lines starting with \*(lq\fB#\fP\*(rq in the file are comments.
The most common use for the
.I .mgrc
file is to personalise the initial values of the predefined parameters with
.B .set
commands.
.LP
The input to
.B mgquery
consists of a series of input lines.
The backslash character
.RB (\*(lq \e \*(rq)
is used at the end of lines to indicate that input continues on the next line.
.LP
Input lines on which the first character is a dot
.RB (\*(lq . \*(rq)
are commands to the
.B mgquery
program.
Input lines that do not start with a dot are queries.
.LP
A query consists of two parts.
One part is a Boolean or ranked query that identifies documents.
The second part is a post-processing pattern matching operation.
Any text between the first speech mark (\*(lq) and the last speech
mark (\*(rq) is considered to be a post-processing pattern.
.SH COMMANDS
The
.B mgquery
program can accept the following commands.
.TP 17
.B .help
Display several pages of help text.
.TP
.B .quit
Quit the program.
.TP
.B .warranty
Display the
.BR mg (1)
warranty.
.TP
.B .conditions
Display the conditions of use and distribution of
.BR mg (1).
.TP
.BI ".set " "name value"
Set the parameter
.I name
to the specified
.IR value .
If the parameter is Boolean and the
.I value
is omitted, the parameter will be inverted (i.e., if it was
.IR true ,
then it will change to
.IR false ;
if it was
.IR false ,
then it will change to
.IR true ).
.TP
.BI ".unset " name
Delete the parameter
.I name
from the currently-defined parameters.
.TP
.B .reset
Reset the parameters to the state that they had after the
processing of the
.B mgquery
command line.
.TP
.B .display
Display the values of all the currently-defined parameters.
.TP
.B .push
Push the currently-defined parameters onto a stack.
.TP
.B .pop
Pop a set of parameters off the stack, replacing the currently-defined ones.
.TP
.BI ".output " arg
This is used to specify where to send the text of the documents.
Once the
.B .output
command is specified, all subsequent output will be sent to the place
specified by
.IR arg .
If
.I arg
is not specified, subsequent output will be directed to
.IR stdout .
.I Arg
may be any of the following.
.RS
.TP 13
.BI "> " filename
Send output to the specified file.
.TP
.BI ">> " filename
Append output to the specified file.
.TP
.BI "| " command
Pipe the output to
.IR command ,
which is executed by
.IR sh .
.RE
.TP
.BI ".input " arg
This is used to specify where input (queries and commands) comes from.
Once the
.B .input
command is specified, all subsequent input will come from the place
specified by
.IR arg .
If
.I arg
is not specified, subsequent input will come from
.IR stdin .
.RS
.TP 13
.BI "< " filename
Get input from the specified file.
.TP
.BI "| " command
The input comes from the standard output of
.IR command ,
which is executed by
.IR sh .
.RE
.SH PARAMETERS
The following parameters are predefined and have special significance.
Each parameter name is followed by its default value.
Parameters are initialised before the
.I .mgrc
file is read or the command line arguments are processed.
.TP 17
.BI accumulator_method " `array'"
This parameter is used during ranking, and specifies how the weight for
each document should be accumulated.
The following methods are available:
.IR array ,
.IR splay_tree ,
.IR hash_table ,
and
.IR list .
.TP
.BI briefstats " `off'"
This is a Boolean parameter that determines whether the totals for disk,
memory and time usage statistics will be displayed at the end of each query.
.IR Note :
this takes precedence over the parameters
.BR diskstats ,
.BR memstats " and " timestats .
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
.TP
.BI buffer " `1048576'"
When the documents are being read in, they are read into a buffer of this
size and then displayed from this buffer.
If the documents are larger than this buffer, the buffer is expanded
automatically.
Having a large buffer gives a very slight performance improvement,
because it allows the order of disk operations to be optimised.
The buffer size is measured in bytes.
.TP
.BI diskstats " `off'"
This is a Boolean parameter that determines whether the disk usage
statistics for the preceding query will be displayed after each query.
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
.TP
.BI doc_sepstr " `---------------------------------- %n\en'"
This specifies the string that will be used to separate documents when
they are displayed for `Boolean' or `docnums' queries.
The standard C escape character sequences may be used to place special
characters in the string.
For example, a newline would be written as `\en'.
To include a `%', use the sequence `%%'.
To include the
.BR mg (1)
document number, use the sequence `%n'.
The following escape character sequences are available:
.nf
.ta 1.7iL
\fBSequence\tMeaning\fP
`\e\e'\tbackslash
`\eb'\tbackspace
`\ef'\tformfeed
`\en'\tnewline
`\er'\tcarriage return
`\et'\ttab
`\e"'\tspeech marks
`\e''\tquote mark
`\ex\fIhh\fP'\tASCII code in hexadecimal
`\e\fInnn\fP'\tASCII code in octal
.fi
.TP
.BI expert " `false'"
If this is
.IR true ,
then much of the dialogue output is suppressed.
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
.TP
.BI hash_tbl_size " `1000'"
One of the options during ranked queries is to use a hash table to
accumulate the weights for each document.
The hash table is a simple chained type.
This parameter specifies the size of the hash table and may take any value
between 8 and 268435456 (2^28).
.TP
.BI heads_length " `50'"
When the mode is
.BR heads ,
this specifies the number of characters that will be output for each document.
.TP
.BI maxdocs " `all'"
The maximum number of documents to display in response to a query.
This parameter may take on a numeric value between 1 and 4294967295
(2^32 - 1) or the word
.IR all .
.TP
.BI maxparas " `1000'"
The maximum number of paragraphs to identify during a ranked query with
paragraph indexing.
After the paragraphs have been identified, the paragraphs are converted
into documents, and because some of the paragraphs may refer to the same
documents, the final number of answers may be less than
.BR maxparas .
The
.B maxdocs
parameter will then be applied.
This parameter may take on a numeric value between 1 and 4294967295
(2^32 - 1).
.TP
.BI max_accumulators " `50000'"
This parameter limits the number of different paragraph and document
numbers to be accumulated during ranked queries when the parameter
.B accumulator_method
is set to
.IR splay_tree ,
.IR hash_table ,
or
.IR list .
This parameter may take any value between 8 and 268435456 (2^28).
.TP
.BI max_terms " `all'"
This parameter limits the number of terms that will actually be used
during a ranked query.
If more terms than the number specified by
.B max_terms
are entered, then the extra terms will be discarded.
If
.B sorted_terms
is on, then the limiting will be done after the terms have been sorted.
This parameter may take any value between 1 and 4294967295 (2^32 - 1),
or the word
.IR all .
.TP
.BI memstats " `off'"
This is a Boolean parameter that determines whether the memory usage
statistics for the preceding query will be displayed after each query.
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
.TP
.BI mgdir " `.'"
This is set to the directory where the
.BR mg (1)
data files may be found.
If the environment variable
.B MGDATA
exists, then this is instead initialised to the value of
.BR MGDATA .
The value of this parameter may be changed, either in the
.I .mgrc
file with a
.BI ".set mgdir " directory
command, or from the command line using the
.BI \-d " directory"
option.
Once the \*(lq\fB>\fP\*(rq prompt appears, changing this parameter will
have no effect.
.TP
.BI mgname " `bible'"
This is set to the name of the
.BR mg (1)
collection that is to be used for the session.
The value of this parameter may be changed, either in the
.I .mgrc
file with a
.BI ".set mgname " name
command, or from the command line using the
.BI \-f " name"
option.
Once the \*(lq\fB>\fP\*(rq prompt appears, changing this parameter will
have no effect.
.TP
.BI mode " `text'"
This specifies how documents should be displayed when they are retrieved.
It may take six different values:
.IR text ,
.IR hilite ,
.IR docnums ,
.IR heads ,
.IR silent ,
or
.IR count .
.I text
displays the contents of the document.
.I hilite
displays the contents of the document and highlights any of the stemmed
query terms.
.I docnums
displays only the document numbers.
.I heads
displays the head of each document; the number of characters shown is set by
.BR heads_length .
.I silent
retrieves all the documents but displays nothing except how many documents
were retrieved.
This mode is intended to be used in timing experiments.
.I count
does the minimum amount of work required to determine how many documents
would be retrieved, but does not retrieve them.
.TP
.BI optimise_type " `1'"
There are three types of Boolean query optimisation (parse tree rearrangement).
Type 0 leaves the parse tree unaltered.
Type 1 optimises for AND of terms and AND of OR of terms.
Type 2 converts the tree into disjunctive normal form (DNF) and is
experimental.
.TP
.BI pager " `more'"
This is the name of the program that will be used to display the help
and the retrieved documents.
If the environment variable
.B PAGER
is defined, then
.B pager
takes on that value.
.TP
.BI hilite_style " `bold'"
This specifies the type of highlighting method.
It may take one of two different values:
.I bold
or
.IR underline .
.TP
.BI para_sepstr " `\en######## PARAGRAPH %n ########\en'"
This specifies the string that will be used to separate paragraphs.
The standard C escape character sequences may be used to place special
characters in the string.
For example, a newline would be written as `\en'.
To include a `%', use the sequence `%%'.
To include the paragraph number within the document, use the sequence `%n'.
.TP
.BI para_start " `***** Weight = %w *****\en'"
This specifies the string that will be used at the head of paragraphs
for a paragraph-level index following a ranked query.
The standard C escape character sequences may be used to place special
characters in the string.
For example, a newline would be written as `\en'.
To include a `%', use the sequence `%%'.
To include the paragraph weight, use the sequence `%w'.
.TP
.BI qfreq " `true'"
This determines whether ranked queries will take into account the
number of times each query term is specified.
When this is
.IR true ,
the number of times a term appears in the query is used in the ranking.
When this is
.IR false ,
all query terms are assumed to occur only once.
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
.TP
.BI query " `Boolean'"
This specifies the type of query that will be entered.
It can take four different values:
.IR Boolean ,
.IR ranked ,
.IR docnums " or " approx-ranked .
.I Boolean
is for Boolean queries.
The
.BR yacc (1)
grammar for Boolean queries is as follows.
.IP
.nf
query : or ;
.IP
or : or '|' and
   | and ;
.IP
and : and '&' not
    | and not
    | not ;
.IP
not : term
    | '!' not ;
.IP
term : TERM
     | '(' or ')' ;
.fi
.IP
.IR ranked " and " approx-ranked
are for queries ranked by the cosine measure.
.I approx-ranked
uses only the low-precision document lengths, and therefore only
produces an approximation to full cosine ranking.
.IP
.nf
query : TERM
      | query TERM ;
.fi
.IP
.I docnums
allows the entry of document numbers.
Multiple numbers separated by spaces may be specified, or ranges separated
by hyphens.
.IP
.nf
query : range
      | query range ;
.IP
range : num
      | num '-' num ;
.fi
.TP
.BI ranked_doc_sepstr " `-------------------------------- %n %w\en'"
This specifies the string that will be used to separate documents when
they are displayed for `ranked' or `approx-ranked' queries.
The standard C escape character sequences may be used to place special
characters in the string.
For example, a newline would be written as `\en'.
To include a `%', use the sequence `%%'.
To include the
.BR mg (1)
document number, use the sequence `%n'.
To include the document weight, use the sequence `%w'.
.TP
.BI sizestats " `false'"
If this is
.IR true ,
then various numbers are output at the end of each query indicating
what went on during the query.
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
.TP
.BI skip_dump " `skips.%d'"
If this parameter is set, then a file will be produced in the current
directory during ranked queries on skipped inverted files when
.B accumulator_method
is set to
.IR splay_tree ,
.IR hash_table ,
or
.IR list .
The name of the file is the value of this parameter.
A `%d' in the file name will be replaced with the process id of
.BR mgquery .
This file will contain information about the usage of skips during the
query processing.
This option is expensive; use
.B ".unset skip_dump"
to obtain optimal performance.
.TP
.BI sorted_terms " `on'"
This specifies whether or not the terms should be sorted into order of
increasing occurrence in documents, so that the least-often occurring
terms are processed first when ranked queries are being done.
When this is
.IR true ,
the terms are sorted.
When this is
.IR false ,
the terms are not sorted, and are instead processed in order of occurrence.
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
.TP
.BI stop_at_max_accum " `on'"
This specifies what should happen when the maximum number of accumulators
set by
.B max_accumulators
is reached.
When this is
.IR true ,
the processing of terms is stopped at the completion of the current term.
When this is
.IR false ,
processing continues but no new accumulators are created.
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
.TP
.BI terminator " `'"
This specifies the string that will be output after the last document
retrieved by a query has been output.
The standard C escape character sequences may be used to place special
characters in the string.
For example, a newline would be written as `\en'.
To include a `%', use the sequence `%%'.
.TP
.BI timestats " `false'"
If this is
.IR true ,
then the time to process a query is displayed in both real time and CPU time.
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
.TP
.BI verbatim " `off'"
This is a Boolean parameter that determines whether the post-processing
string is matched literally or as a regular expression.
If a post-processing string is specified with the query, it is searched
for in the documents just before they are displayed.
If the string is found, the document will be displayed; if not, the
document will not be displayed.
If verbatim is
.IR on ,
the post-processing string is matched exactly as typed.
If verbatim is
.IR off ,
the post-processing string will be considered a regular expression as in
.BR egrep (1)
or
.BR vi (1).
E.g., if verbatim is
.IR on ,
\*(lq\fBand.*the\fP\*(rq will look for the 8-character sequence
\*(lq\fBand.*the\fP\*(rq.
If verbatim is
.IR off ,
\*(lq\fBand.*the\fP\*(rq will look for the sequence
\*(lq\fBand\fP\*(rq followed somewhere later in the document by the
sequence \*(lq\fBthe\fP\*(rq.
This parameter may take the values
.IR yes ", " no ", "
.IR true ", " false ", "
.IR on " or " off .
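.LP
As an illustration only, a personal
.I .mgrc
file that changes some of these defaults might look like the sketch below.
The parameter names and permitted values are those described in this
section; the particular settings chosen are hypothetical and are not
recommendations.
.IP
.nf
# Hypothetical personal defaults for mgquery.
\&.set query ranked
\&.set mode heads
\&.set heads_length 80
\&.set maxdocs 10
\&.set timestats on
.fi
.LP
The file contains only comments and
.B .set
commands, because queries are not permitted in the startup file.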
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the default
directory where the
.BR mg (1)
collection files are found.
If this variable does not exist, then the directory \*(lq\fB.\fP\*(rq is
used by default.
The command line option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
.SH FILES
.TP 20
.I .mgrc
.B mgquery
startup file.
.TP
.B help.mg
Help file for
.BR mgquery .
The contents of this file are displayed with the
.B .help
command.
.TP
.B *.invf
Inverted file.
.TP
.B *.invf.dict
The `on-disk' stemmed dictionary.
.TP
.B *.text
Compressed documents.
.TP
.B *.text.dict
Compression dictionary.
.TP
.B *.text.idx
Index into the compressed documents.
.TP
.B *.text.idx.wgt
Interleaved index into the compressed documents and document weights.
.TP
.B *.weight.approx
Approximate document weights.
.SH "SEE ALSO"
.na
.BR egrep (1),
.BR mg (1),
.BR mg_compression_dict (1),
.BR mg_fast_comp_dict (1),
.BR mg_get (1),
.BR mg_invf_dict (1),
.BR mg_invf_dump (1),
.BR mg_invf_rebuild (1),
.BR mg_passes (1),
.BR mg_perf_hash_build (1),
.BR mg_text_estimate (1),
.BR mg_weights_build (1),
.BR mgbilevel (1),
.BR mgbuild (1),
.BR mgdictlist (1),
.BR mgfelics (1),
.BR mgstat (1),
.BR mgtic (1),
.BR mgticbuild (1),
.BR mgticdump (1),
.BR mgticprune (1),
.BR mgticstat (1),
.BR vi (1),
.BR yacc (1).
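.ad
.SH EXAMPLES
The following is a sketch of one possible session, assembled from the
options, commands and query grammars described in this page.
The directory
.I /usr/local/mgdata
and the query terms are hypothetical, and
.I bible
is simply the default collection name.
No output is shown, because it depends on the collection.
.LP
Start
.B mgquery
from the shell:
.IP
.nf
mgquery \-d /usr/local/mgdata \-f bible
.fi
.LP
Then enter commands (lines starting with a dot) and queries:
.IP
.nf
\&.set query ranked
\&.set maxdocs 5
shepherd sheep pasture
\&.set query Boolean
(lamb | sheep) & !wolf
sheep & shepherd "green pastures"
\&.quit
.fi
.LP
The first query is ranked and displays at most five documents.
The second and third are Boolean queries.
In the third, the text between the speech marks is a post-processing
pattern; with
.B verbatim
at its default of
.IR off ,
it is treated as a regular expression.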