Commands affecting text and text files
sort
File sort utility, often used as a filter in a pipe. This
command sorts a text stream or file forwards or backwards, or according to various
keys or character positions. Using the -m option, it merges presorted input files. The info
page lists its many capabilities and options.
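As a minimal sketch of the key and merge options described above (the sample data, file names, and variable names are invented for illustration):

```shell
#!/bin/bash
# Hypothetical sample data: "name count" pairs.
# Sort numerically on the second field, largest first.
by_count=$(printf 'pear 3\napple 10\nfig 2\n' | sort -k2,2nr)
echo "$by_count"

# Merge two files that are each already sorted.
# -m interleaves them without re-sorting either input.
tmpdir=$(mktemp -d)
printf 'a\nc\ne\n' > "$tmpdir/first"
printf 'b\nd\n'    > "$tmpdir/second"
merged=$(sort -m "$tmpdir/first" "$tmpdir/second")
echo "$merged"
rm -r "$tmpdir"
```

Note that -m assumes its inputs are presorted; given unsorted files, it does not sort them for you.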
tsort
Topological sort, reading in
pairs of whitespace-separated strings and sorting
according to input patterns. The original purpose of tsort was to sort a list of dependencies
for an obsolete version of the ld linker in an "ancient" version of UNIX.
The results of a tsort will usually
differ markedly from those of the standard sort command, above.
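A small sketch of what "sorting according to input patterns" means in practice: each input pair "A B" is read as "A must come before B". The step names here are hypothetical.

```shell
#!/bin/bash
# Each line is one dependency pair: the left item precedes the right one.
order=$(printf '%s\n' 'configure compile' 'compile link' 'link install' | tsort)
echo "$order"
# Because the pairs form a single chain, only one ordering is possible:
# configure, compile, link, install (one per line).
```

With pairs that do not form a single chain, several orderings can satisfy the constraints, and tsort prints just one of them.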
uniq
This filter removes duplicate lines from a sorted
file. It is often seen in a pipe coupled with sort.
-
cat list-1 list-2 list-3 | sort | uniq > final.list
-
# Concatenates the list files,
-
# sorts them,
-
# removes duplicate lines,
-
# and finally writes the result to an output file.
-
bash$ cat testfile
-
This line occurs only once.
-
This line occurs twice.
-
This line occurs twice.
-
This line occurs three times.
-
This line occurs three times.
-
This line occurs three times.
-
-
-
bash$ uniq -c testfile
-
1 This line occurs only once.
-
2 This line occurs twice.
-
3 This line occurs three times.
-
-
-
bash$ sort testfile | uniq -c | sort -nr
-
3 This line occurs three times.
-
2 This line occurs twice.
-
1 This line occurs only once.
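uniq compares adjacent lines only, which is why its input is normally sorted first. A minimal sketch with made-up input (the variable names are illustrative):

```shell
#!/bin/bash
# Hypothetical sample: "b" appears twice, but not on adjacent lines.
unsorted_count=$(printf 'b\na\nb\n' | uniq | wc -l)         # duplicates survive
sorted_count=$(printf 'b\na\nb\n' | sort | uniq | wc -l)    # duplicates collapse
repeated=$(printf 'b\na\nb\n' | sort | uniq -d)             # -d prints only repeated lines

echo "without sort: $unsorted_count lines, with sort: $sorted_count lines"
echo "repeated line: $repeated"
```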
-
#!/bin/bash
-
# wf.sh: Crude word frequency analysis on a text file.
-
# This is a more efficient version of the "wf2.sh" script.
-
-
-
# Check for input file on command-line.
-
ARGS=1
-
E_BADARGS=85
-
E_NOFILE=86
-
-
if [ $# -ne "$ARGS" ] # Correct number of arguments passed to script?
-
then
-
echo "Usage: `basename $0` filename"
-
exit $E_BADARGS
-
fi
-
-
if [ ! -f "$1" ] # Check if file exists.
-
then
-
echo "File \"$1\" does not exist."
-
exit $E_NOFILE
-
fi
-
-
-
-
########################################################
-
# main ()
-
sed -e 's/\.//g' -e 's/\,//g' -e 's/ /\
/g' "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
-
# =========================
-
# Frequency of occurrence
-
-
# Filter out periods and commas, and
-
#+ change space between words to linefeed,
-
#+ then shift characters to lowercase, and
-
#+ finally prefix occurrence count and sort numerically.
-
-
# Arun Giridhar suggests modifying the above to:
-
# . . . | sort | uniq -c | sort +1 [-f] | sort +0 -nr
-
# This adds a secondary sort key, so instances of
-
#+ equal occurrence are sorted alphabetically.
-
# As he explains it:
-
# "This is effectively a radix sort, first on the
-
#+ least significant column
-
#+ (word or string, optionally case-insensitive)
-
#+ and last on the most significant column (frequency)."
-
#
-
# As Frank Wang explains, the above is equivalent to
-
#+ . . . | sort | uniq -c | sort +0 -nr
-
#+ and the following also works:
-
#+ . . . | sort | uniq -c | sort -k1nr -k
-
########################################################
-
-
exit 0
-
-
# Exercises:
-
# ---------
-
# 1) Add 'sed' commands to filter out other punctuation,
-
#+ such as semicolons.
-
# 2) Modify the script to also filter out multiple spaces and
-
#+ other whitespace.
-
-
bash$ cat testfile
-
This line occurs only once.
-
This line occurs twice.
-
This line occurs twice.
-
This line occurs three times.
-
This line occurs three times.
-
This line occurs three times.
-
-
-
bash$ ./wf.sh testfile
-
6 this
-
6 occurs
-
6 line
-
3 times
-
3 three
-
2 twice
-
1 only
-
1 once
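The "sort +1" and "sort +0" forms quoted in the script comments are obsolescent syntax that recent versions of sort may not accept. A sketch of the same secondary-key idea using -k, with invented input; the awk step is only there to normalize the column spacing that uniq -c produces:

```shell
#!/bin/bash
# Primary key: field 1 (the count), numeric, descending.
# Secondary key: field 2 (the word), ascending, used to break ties.
ranked=$(printf 'b\na\nb\na\nc\n' | sort | uniq -c \
         | sort -k1,1nr -k2,2 | awk '{ print $1, $2 }')
echo "$ranked"
```

Here "a" and "b" both occur twice, so the second key decides that "a" is listed before "b".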
expand,
unexpand
The expand filter converts tabs to
spaces. It is often used in a pipe.
The unexpand filter
converts spaces to tabs. This reverses the effect of expand.
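A short sketch of the round trip, assuming tab stops every 4 columns (the -t value and the sample line are arbitrary choices for illustration):

```shell
#!/bin/bash
# Tab -> spaces: the leading tab becomes 4 spaces.
spaced=$(printf '\tindented\n' | expand -t 4)
echo "[$spaced]"

# Spaces -> tab: the 4 leading spaces become one tab again.
tabbed=$(printf '    indented\n' | unexpand -t 4)

# Running both in sequence should give back the original line.
roundtrip=$(printf '\tindented\n' | expand -t 4 | unexpand -t 4)
```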
cut
A tool for extracting fields from files. It is similar
to the print $N command set in awk, but more limited. It may be
simpler to use cut in a script than awk. Particularly important are the -d (delimiter) and -f (field specifier) options.
Using cut to obtain a listing of the
mounted filesystems:
-
cut -d ' ' -f1,2 /etc/mtab
-
Using cut to list the OS and kernel version:
-
-
uname -a | cut -d" " -f1,3,11,12
-
Using cut to extract message headers from an e-mail folder:
-
-
bash$ grep '^Subject:' read-messages | cut -c10-80
-
Re: Linux suitable for mission-critical apps?
-
MAKE MILLIONS WORKING AT
-
Spam complaint
-
Re: Spam complaint
-
# cut -c selects by character position (here, characters 10 through 80); see man cut.
-
Using cut to parse a file:
-
-
# List all the users in /etc/passwd.
-
-
FILENAME=/etc/passwd
-
-
for user in $(cut -d: -f1 $FILENAME)
-
do
-
echo $user
-
done
-
-
# Thanks, Oleg Philon for suggesting this.
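To illustrate how cut compares with awk's print $N, here is a hedged sketch on a passwd-style record (the sample lines and variable names are made up). It also shows one of cut's limits: every single delimiter character starts a new field, so runs of spaces yield empty fields.

```shell
#!/bin/bash
line='root:x:0:0:root:/root:/bin/bash'    # sample record, not read from /etc/passwd

# Fields 1 and 7 with cut and with awk give the same result here.
with_cut=$(printf '%s\n' "$line" | cut -d: -f1,7)
with_awk=$(printf '%s\n' "$line" | awk -F: '{ print $1 ":" $7 }')
echo "$with_cut"

# Two spaces between words: cut sees an empty second field,
# while awk treats the run of blanks as one separator.
cut_field=$(printf 'a  b\n' | cut -d ' ' -f2)
awk_field=$(printf 'a  b\n' | awk '{ print $2 }')
echo "cut: [$cut_field]  awk: [$awk_field]"
```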
head
Lists the beginning of a file to stdout.
The default is 10 lines, but a different
number can be specified. The command has a number of
interesting options.
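A brief sketch of the two most common options, using invented input: -n sets the number of lines, and -c (not part of POSIX, but available in GNU and BSD head) counts bytes instead.

```shell
#!/bin/bash
# First two lines instead of the default ten.
first_two=$(printf 'one\ntwo\nthree\nfour\n' | head -n 2)
echo "$first_two"

# First three bytes, regardless of line boundaries.
first_bytes=$(printf 'one\ntwo\n' | head -c 3)
echo "$first_bytes"
```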
-
#!/bin/bash
-
# script-detector.sh: Detects scripts within a directory.
-
-
TESTCHARS=2 # Test first 2 characters.
-
SHABANG='#!' # Scripts begin with a "sha-bang."
-
-
for file in * # Traverse all the files in current directory.
-
do
-
if [[ `head -c$TESTCHARS "$file"` = "$SHABANG" ]]
-
# head -c2 #!
-
# The '-c' option to "head" outputs a specified
-
#+ number of characters, rather than lines (the default).
-
then
-
echo "File \"$file\" is a script."
-
else
-
echo "File \"$file\" is *not* a script."
-
fi
-
done
-
-
exit 0
-
-
# Exercises:
-
# ---------
-
# 1) Modify this script to take as an optional argument
-
#+ the directory to scan for scripts
-
#+ (rather than just the current working directory).
-
#
-
# 2) As it stands, this script gives "false positives" for
-
#+ Perl, awk, and other scripting language scripts.
-
# Correct this.
-
Example 16-14. Generating 10-digit random numbers
-
-
#!/bin/bash
-
# rnd.sh: Outputs a 10-digit random number
-
-
# Script by Stephane Chazelas.
-
-
head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
-
-
-
# =================================================================== #
-
-
# Analysis
-
# --------
-
-
# head:
-
# -c4 option takes first 4 bytes.
-
-
# od:
-
# -N4 option limits output to 4 bytes.
-
# -tu4 option selects unsigned decimal format for output.
-
-
# sed:
-
# -n option, in combination with "p" flag to the "s" command,
-
# outputs only matched lines.
-
-
-
-
# The author of this script explains the action of 'sed', as follows.
-
-
# head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
-
# ----------------------------------> |
-
-
# Assume output up to "sed" --------> |
-
# is 0000000 1198195154\n
-
-
# sed begins reading characters: 0000000 1198195154\n.
-
# Here it finds a newline character,
-
#+ so it is ready to process the first line (0000000 1198195154).
-
# It looks at its <range><action>s. The first and only one is
-
-
# range action
-
# 1 s/.* //p
-
-
# The line number is in the range, so it executes the action:
-
#+ tries to substitute the longest string ending with a space in the line
-
# ("0000000 ") with nothing (//), and if it succeeds, prints the result
-
# ("p" is a flag to the "s" command here, this is different
-
#+ from the "p" command).
-
-
# sed is now ready to continue reading its input. (Note that before
-
#+ continuing, if -n option had not been passed, sed would have printed
-
#+ the line once again).
-
-
# Now, sed reads the remainder of the characters, and finds the
-
#+ end of the file.
-
# It is now ready to process its 2nd line (which is also numbered '$' as
-
#+ it's the last one).
-
# It sees it is not matched by any <range>, so its job is done.
-
-
# In short, this sed command means:
-
# "On the first line only, remove any character up to the right-most space,
-
#+ then print it."
-
-
# A better way to do this would have been:
-
# sed -e 's/.* //;q'
-
-
# Here, two <range><action>s (could have been written
-
# sed -e 's/.* //
-
tail
Lists the (tail) end of a file to stdout.
The default is 10 lines, but a different
number can be specified with the -n option.
#!/bin/bash
-
-
filename=sys.log
-
-
cat /dev/null > $filename; echo "Creating / cleaning out file."
-
# Creates the file if it does not already exist,
-
#+ and truncates it to zero length if it does.
-
# : > filename and > filename also work.
-
-
tail /var/log/messages > $filename
-
# /var/log/messages must have world read permission for this to work.
-
-
echo "$filename contains tail end of system log."
-
-
exit 0
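The script above relies on tail printing the last 10 lines by default. A short sketch of the -n option with invented input; the "+N" form, which starts output at line N rather than counting from the end, is easy to confuse with the plain form:

```shell
#!/bin/bash
# Last two lines of the input.
last_two=$(printf 'one\ntwo\nthree\nfour\n' | tail -n 2)
echo "$last_two"

# Everything from line 2 onward (note the plus sign).
from_second=$(printf 'one\ntwo\nthree\nfour\n' | tail -n +2)
echo "$from_second"
```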
grep
A multi-purpose file search tool that uses Regular Expressions.
It was originally a command/filter in the
venerable ed line editor: g/re/p -- global -
regular expression - print.
-
bash$ grep '[rst]ystem.$' osinfo.txt
-
The GPL governs the distribution of the Linux operating system.
-
bash$ ps ax | grep clock
-
765 tty1 S 0:00 xclock
-
901 pts/1 S 0:00 grep clock
-
-
-
The -i option causes a case-insensitive search.
-
-
The -w option matches only whole words.
-
-
The -l option lists only the files in which matches were found, but not the matching lines.
-
-
The -r (recursive) option searches files in the current working directory and all subdirectories below it.
-
-
The -n option lists the matching lines, together with line numbers.
-
-
bash$ grep -n Linux osinfo.txt
-
2:This is a file containing information about Linux.
-
6:The GPL governs the distribution of the Linux operating system.
-
The -v (or --invert-match) option filters out matches.
-
-
grep pattern1 *.txt | grep -v pattern2
-
-
# Matches all lines in "*.txt" files containing "pattern1",
-
# but ***not*** "pattern2".
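The options listed above can be combined. The following sketch exercises them on two throwaway files; the file names and contents are invented for the example:

```shell
#!/bin/bash
tmpdir=$(mktemp -d)
printf 'Linux is a kernel.\nlinuxes differ.\nGNU tools.\n' > "$tmpdir/a.txt"
printf 'Nothing relevant here.\n' > "$tmpdir/b.txt"

ci_count=$(grep -ci linux "$tmpdir/a.txt")       # -i: both "Linux" and "linuxes" lines match
word_count=$(grep -ciw linux "$tmpdir/a.txt")    # -w: "linuxes" no longer counts
files=$(cd "$tmpdir" && grep -il linux *.txt)    # -l: names of matching files only
numbered=$(grep -n GNU "$tmpdir/a.txt")          # -n: prefix the line number
inverted=$(grep -vi linux "$tmpdir/a.txt")       # -v: lines that do NOT match

rm -r "$tmpdir"
echo "$ci_count $word_count $files"
echo "$numbered"
echo "$inverted"
```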
-
grep -c txt *.sgml   # (number of lines containing "txt" in each "*.sgml" file)
-
-
-
# grep -cz .
-
# ^ dot
-
# means count (-c) zero-separated (-z) items matching "."
-
# that is, non-empty ones (containing at least 1 character).
-
#
-
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz . # 3
-
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '$' # 5
-
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '^' # 5
-
#
-
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -c '$' # 9
-
# By default, newline chars (\n) separate items to match.
-
-
# Note that the -z option is GNU "grep" specific.
-
-
-
# Thanks, S.C.
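One point worth checking by hand: grep -c counts matching lines (or, with -z, matching records), not individual occurrences. A sketch with made-up input, assuming a grep that supports the -o extension (GNU and BSD grep do):

```shell
#!/bin/bash
# Three occurrences of "txt" spread over two matching lines.
matching_lines=$(printf 'txt txt\nnone\ntxt\n' | grep -c txt)

# -o prints each match on its own line, so counting lines counts occurrences.
occurrences=$(printf 'txt txt\nnone\ntxt\n' | grep -o txt | wc -l)

echo "lines: $matching_lines, occurrences: $occurrences"
```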
-
tr
-
-
Character translation filter.
-
-
Caution
-
-
Must use quoting and/or brackets, as appropriate. Quotes prevent the shell from reinterpreting the special characters in tr command sequences. Brackets should be quoted to prevent expansion by the shell.
-
-
Either tr "A-Z" "*" <filename or tr A-Z \* <filename changes all the uppercase letters in filename to asterisks (writes to stdout). On some systems this may not work, but tr A-Z '[**]' will.
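A quick way to try the bracketed form mentioned above. The "[c*]" notation in the second set repeats the character c as often as needed to match the length of the first set; the sample string is arbitrary.

```shell
#!/bin/bash
# Single quotes keep the shell from expanding * or the brackets.
masked=$(echo "Hello World" | tr 'A-Z' '[**]')
echo "$masked"
```

Leaving the asterisk unquoted would let the shell replace it with the names of files in the current directory before tr ever sees it.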
-
-
The -d option deletes a range of characters.
-
echo "abcdef" # abcdef
-
echo "abcdef" | tr -d b-d # aef
-
-
-
tr -d 0-9 <filename
-
# Deletes all digits from the file "filename".
-
The --squeeze-repeats (or -s) option deletes all but the first instance of a string of consecutive characters. This option is useful for removing excess whitespace.
-
-
bash$ echo "XXXXX" | tr --squeeze-repeats 'X'
-
X
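The whitespace case can be sketched the same way, here with the short -s form and an invented sample string:

```shell
#!/bin/bash
# Collapse each run of spaces into a single space.
squeezed=$(echo "too    many     spaces" | tr -s ' ')
echo "$squeezed"
```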
-
The -c "complement" option inverts the character set to match. With this option, tr acts only upon those characters not matching the specified set.
-
# -c means "complement", much like the -v (invert match) option of other commands such as grep.
-
bash$ echo "acfdeb123" | tr -c b-d +
-
+c+d+b++++
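A common use of -c is together with -d, to delete everything except a given set. A minimal sketch with a made-up input string; the newline is kept in the set so the output still ends in a line break:

```shell
#!/bin/bash
# Delete every character that is NOT a digit or a newline.
digits_only=$(echo "tel: 555-0199" | tr -cd '0-9\n')
echo "$digits_only"
```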
-
toupper: Transforms a file to all uppercase.
-
-
#!/bin/bash
-
# Changes a file to all uppercase.
-
-
E_BADARGS=85
-
-
if [ -z "$1" ] # Standard check for command-line arg.
-
then
-
echo "Usage: `basename $0` filename"
-
exit $E_BADARGS
-
fi
-
-
tr a-z A-Z <"$1"
-
# Converts lowercase to uppercase.
-
-
# Same effect as above, but using POSIX character set notation:
-
# tr '[:lower:]' '[:upper:]' <"$1"
-
# Thanks, S.C.
-
-
# Or even . . .
-
# cat "$1" | tr a-z A-Z
-
# Or dozens of other ways . . .
-
-
exit 0
-
-
# Exercise:
-
# Rewrite this script to give the option of changing a file
-
#+ to *either* upper or lowercase.
-
# Hint: Use either the "case" or "select" command.
-
#!/bin/bash
-
#
-
# Changes every filename in working directory to all lowercase.
-
#
-
# Inspired by a script of John Dubois,
-
#+ which was translated into Bash by Chet Ramey,
-
#+ and considerably simplified by the author of the ABS Guide.
-
-
-
for filename in * # Traverse all files in directory.
-
do
-
fname=`basename $filename`
-
n=`echo $fname | tr A-Z a-z` # Change name to lowercase.
-
if [ "$fname" != "$n" ] # Rename only files whose names are not already lowercase.
-
then
-
mv $fname $n
-
fi
-
done
-
-
exit $?
-
-
# The two code blocks (above and below) are equivalent.
-
# Code below this line will not execute because of "exit".
-
#--------------------------------------------------------#
-
# To run it, delete script above line.
-
-
# The above script will not work on filenames containing blanks or newlines.
-
# Stephane Chazelas therefore suggests the following alternative:
-
-
-
for filename in * # Not necessary to use basename,
-
# since "*" won't return any file containing "/".
-
do n=`echo "$filename/" | tr '[:upper:]' '[:lower:]'`
-
# POSIX char set notation.
-
# Slash added so that trailing newlines are not
-
# removed by command substitution.
-
# Variable substitution:
-
n=${n%/} # Removes trailing slash, added above, from filename.
-
[[ $filename == $n ]] || mv "$filename" "$n"
-
# Checks if filename already lowercase.
-
done
-
-
exit $?
-
du: DOS to UNIX text file conversion.
-
-
#!/bin/bash
-
# Du.sh: DOS to UNIX text file converter.
-
-
E_WRONGARGS=85
-
-
if [ -z "$1" ]
-
then
-
echo "Usage: `basename $0` filename-to-convert"
-
exit $E_WRONGARGS
-
fi
-
-
NEWFILENAME=$1.unx
-
-
CR='\015' # Carriage return.
-
# 015 is octal ASCII code for CR.
-
# Lines in a DOS text file end in CR-LF.
-
# Lines in a UNIX text file end in LF only.
-
-
tr -d "$CR" < "$1" > "$NEWFILENAME"
-
# Delete CR
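One way to check that the conversion does what is claimed is to count carriage returns before and after. This is a sketch on a throwaway file, not part of the script above; the variable names are illustrative.

```shell
#!/bin/bash
sample=$(mktemp)
printf 'first line\r\nsecond line\r\n' > "$sample"    # DOS-style CR-LF endings

# tr -cd '\015' deletes everything except CRs, so wc -c counts them.
cr_before=$(tr -cd '\015' < "$sample" | wc -c)
cr_after=$(tr -d '\015' < "$sample" | tr -cd '\015' | wc -c)

# The line feeds are untouched, so the line count stays the same.
lines_after=$(tr -d '\015' < "$sample" | wc -l)

rm -f "$sample"
echo "CRs before: $cr_before, after: $cr_after, lines kept: $lines_after"
```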
fmt
Simple-minded file formatter, used as a filter in a
pipe to "wrap" long lines of text
output.
-
#!/bin/bash
-
-
WIDTH=40 # 40 columns wide.
-
-
b=`ls /usr/local/bin` # Get a file listing...
-
-
echo $b | fmt -w $WIDTH
-
-
# Could also have been done by
-
# echo $b | fold - -s -w $WIDTH
-
-
exit 0
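The script's comment notes that fold could be used in place of fmt. A sketch comparing the two on an invented sentence: both keep lines within the requested width, although they may break at different points, so only the maximum line length and the word count are checked here. The helper function longest is hypothetical.

```shell
#!/bin/bash
text='the quick brown fox jumps over the lazy dog near the old mill'

with_fmt=$(echo "$text" | fmt -w 20)         # refills the text into lines of at most 20 columns
with_fold=$(echo "$text" | fold -s -w 20)    # cuts at 20 columns, -s breaks at spaces

# Print the length of the longest line read from stdin.
longest() {
  awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

fmt_longest=$(echo "$with_fmt" | longest)
fold_longest=$(echo "$with_fold" | longest)
echo "fmt: longest line $fmt_longest, fold: longest line $fold_longest"
```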