command line - grep regex .* not matching everything - Ask Ubuntu
I've gotten used to using tools like grep, wc, cat, etc. because I have to deal with large CSV files (>10GB) that aren't quite delimited correctly (for instance, having occurrences of the delimiter character inside of fields).
In working with one of these files, I've run the following command in the process of trying to figure out a way to correctly identify instances of the ; delimiter, and replace them with some other character:
grep -v -n --text "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]" < transactions.csv
The regex can be done better, I know. Anyway, what's surprising is that, among others, the above command outputs the following line:
12345678:2016-10-25;12345678912345;2016-10-25;gobbledegook �idnr: 69 ;12345.67;.00;2003-09-05;12345678;2003-09-03;stuff stuff ;12345 fgadfkjgbsdkb;12/3/45678/9
(As this is transaction data, I've changed the values of the fields, except for the offending �.) Maybe I'm being silly, but why doesn't the above regex match this line? It seems that the regex .* somehow doesn't match that character for some reason.
I suspect the file was saved using UTF-16 encoding, if that makes a difference.
Edit: Thanks @exore for the answer. It turns out the file was encoded in ISO-8859-15, which I was able to figure out by grepping out the lines containing special characters, which were relatively few, into a file, and opening that in gedit. I then used iconv to convert it to UTF-8, after which it worked fine!
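For anyone wanting to repeat that detection step, here is a minimal sketch. The file names and the sample bytes are made up for illustration (I can't share the real data), and ISO-8859-15 is just the candidate encoding that turned out to be right in my case.

```shell
# Hypothetical sample: line 2 carries a lone 0xA4 byte
# (the euro sign in ISO-8859-15, but not valid UTF-8 on its own).
printf 'plain;ascii;line\n' > sample.csv
printf 'gobbledegook \244idnr: 69;12345.67\n' >> sample.csv

# In the C locale every byte is one character, so the bracket expression
# [^ -~] matches any byte outside printable ASCII (space through tilde).
# Note that it also flags tabs and other control characters.
LC_ALL=C grep -a -n '[^ -~]' sample.csv > suspicious.txt

# Preview the suspicious lines under a candidate encoding
# (the command-line equivalent of opening them in an editor).
iconv -f ISO-8859-15 -t UTF-8 suspicious.txt
```

If the preview looks like sensible text, the candidate encoding is probably right; if not, try another one.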
This is a typical character encoding problem. In a regex, . means any character, but which sequences of bytes form a legal character is a matter of encoding. Dealing with text without knowing its encoding is a sure failure. The grep command expects input in the encoding of the current locale, which is typically UTF-8 on Ubuntu. UTF-8 is a multibyte encoding, meaning that a char may be represented by multiple bytes. However, not every sequence of bytes is valid. See, for example, the Wikipedia article on UTF-8.
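When grep encounters a byte sequence that is not a valid char in the expected encoding, it cannot recognise it as a character, so . does not match it. The line therefore doesn't match and, because of -v, it is output. Since your terminal doesn't recognise the char either, it prints �.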
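A quick way to check whether a file really is valid UTF-8 is to ask iconv to convert it from UTF-8 to UTF-8 and look at the exit status. This is only a sketch: the helper name is_valid_utf8 and the sample files are made up for the illustration.

```shell
# One valid UTF-8 line (é encoded as 0xC3 0xA9) and one line with a
# lone 0xE9 byte (é in ISO-8859-15, an invalid sequence in UTF-8).
printf 'caf\303\251\n' > sample_utf8.csv
printf 'caf\351\n' > sample_latin.csv

# Hypothetical helper: succeeds only if iconv can decode the file as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

for f in sample_utf8.csv sample_latin.csv; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: invalid UTF-8"
    fi
done
```

Without the redirection of stderr, iconv also reports where the first illegal sequence sits, which helps to locate the offending line.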
There is a workaround in your case. You can tell grep not to bother with encoding, and to consider that 1 byte is 1 char.
env LANG=C grep ....
or maybe
env LANG=C LC_ALL=C grep ....
You may test this easily:
Create 2 files, one UTF-8 encoded, one UTF-16BE encoded:
$ echo éléphant | tee file.std | iconv -f utf8 -t utf16be >file.utf16be
Check the content of the files:
$ cat file*
éléphant
�l�phant
Try grep. The UTF-16BE string is not recognised, so there is no output for it:
$ grep '^.*$' file*
file.std:éléphant
Now don't use an encoding at all: 1 byte is 1 char. Both strings are matched. The � means that the terminal doesn't recognise the UTF-16BE sequence as a valid UTF-8 char. Note the use of -a to tell grep to treat what looks like binary data as text.
$ env LANG=C grep -a '^.*$' file*
file.std:éléphant
file.utf16be:�l�phant
Alternatively, if you know the encoding, you can use iconv to first convert the file and then use grep. One of the following should work.
iconv -f utf16 -t utf8 < file | grep ...
iconv -f utf16le -t utf8 < file | grep ...
iconv -f utf16be -t utf8 < file | grep ...
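As a rough sketch of the convert-then-grep approach applied to the kind of line in the question: the sample line, the file name latin.csv and the date variable are all made up here, ISO-8859-15 is assumed as the source encoding (as the asker found), and the date pattern is shortened with POSIX interval expressions.

```shell
# Hypothetical sample line with four dates and one ISO-8859-15 byte (0xA4).
printf '2016-10-25;x;2016-10-25;gobbledegook \244idnr;2003-09-05;y;2003-09-03\n' > latin.csv

# One date in basic regex syntax, using \{n\} instead of repeating [0-9].
date='[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}'

# Convert to UTF-8 first, then count the lines containing four dates.
count=$(iconv -f ISO-8859-15 -t UTF-8 < latin.csv | grep -c "$date.*$date.*$date.*$date")
echo "$count"
```

After the conversion every byte sequence is valid UTF-8, so .* can span the whole line in a UTF-8 locale as well.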