command line - grep regex .* not matching everything - Ask Ubuntu


I've gotten used to using tools like grep, wc, cat, etc. because I have to deal with large CSV files (>10 GB) that aren't quite delimited correctly (for instance, having occurrences of the delimiter character inside of fields).

In working with one of these files, I've run the following command in the process of trying to figure out a way to correctly identify stray instances of the ; delimiter and replace them with some other character:

grep -v -n --text "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]" < transactions.csv 

The regex can be done better, I know. Anyway, what is surprising is that, among others, the above command outputs the following line:

12345678:2016-10-25;12345678912345;2016-10-25;gobbledegook �idnr: 69 ;12345.67;.00;2003-09-05;12345678;2003-09-03;stuff stuff         ;12345 fgadfkjgbsdkb;12/3/45678/9 

(As this is transaction data, I've changed all of the fields' values, except the offending one.) Maybe I'm being silly, but why doesn't the above regex match this line? It seems that the regex .* somehow doesn't match the � character for some reason.

I suspect the file was saved using UTF-16 encoding, if that makes a difference.

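Edit: Thanks @exore for the answer. As it turns out, the file was encoded in ISO-8859-15, which I was able to figure out by grepping the lines containing special characters (there were relatively few) out into a file and opening it in gedit. I then used iconv to convert it to UTF-8, after which everything worked fine!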

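This is a typical character encoding problem. The . in a regex means any character, but which sequences of bytes form a legal character is a matter of encoding. Dealing with text without knowing its encoding is bound to fail. Here the grep command expects a UTF-8 encoded string (it uses the encoding of the current locale, which is normally UTF-8 on Ubuntu). UTF-8 is a multibyte encoding, meaning a char may be represented by multiple bytes. However, not every sequence of bytes is valid. See, for example, the Wikipedia article on UTF-8.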

When grep encounters a byte sequence that is not a valid char in the expected encoding, it cannot recognise it as a character, so the line doesn't match and, because of -v, it is output. Since the terminal doesn't recognise the char either, it displays �.
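To see that not every byte sequence is valid UTF-8, one option is to pass the bytes through iconv, which refuses input that is malformed in the declared source encoding. This is only a sketch: the sample strings are made up, and the helper name is_valid_utf8 is hypothetical.

```shell
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
# iconv exits non-zero when it meets an illegal input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "é" encoded as UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).
if printf 'caf\303\251\n' | is_valid_utf8; then utf8_ok=yes; else utf8_ok=no; fi

# "é" encoded as ISO-8859-15 is the single byte 0xE9 (octal 351),
# which cannot stand alone in UTF-8.
if printf 'caf\351\n' | is_valid_utf8; then latin_ok=yes; else latin_ok=no; fi

echo "UTF-8 bytes valid: $utf8_ok; ISO-8859-15 bytes valid: $latin_ok"
```

The same helper can be pointed at a real file with is_valid_utf8 < file, which gives a quick yes/no before deciding whether a conversion is needed.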

There is a workaround in this case. You can tell grep not to bother with encoding, and to consider 1 byte as 1 char.

env LANG=C grep .... 

or maybe

env LANG=C LC_ALL=C grep .... 
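As a quick check of this workaround, the sketch below feeds grep a line containing a lone 0xE9 byte (an ISO-8859-15 "é") and counts the lines matched by a pattern in which . has to cover that byte. The sample line is invented, and the behaviour assumed is that of a reasonably recent GNU grep. LC_ALL is set rather than LANG because LC_ALL takes precedence when both are present in the environment.

```shell
# One input line: "a", the byte 0xE9 (octal 351), "b".
# In the C locale every byte counts as one character, so "." can match 0xE9.
# -a treats the input as text; -c prints the number of matching lines.
matches=$(printf 'a\351b\n' | LC_ALL=C grep -a -c '^a.b$')
echo "matching lines in the C locale: $matches"
```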

You may test this easily:

Create 2 files, one UTF-8 encoded and one UTF-16BE encoded:

$ echo éléphant | tee file.std | iconv -f utf8 -t utf16be >file.utf16be 

Check the content of the files:

$ cat file*
éléphant
�l�phant

Try grep. The UTF-16BE string is not recognised, so there is no output for it:

$ grep '^.*$' file*
file.std:éléphant

Now don't use any encoding at all: 1 byte is 1 char. Both strings are matched. The � means the terminal doesn't recognise the UTF-16BE sequence as a valid UTF-8 char. Note the use of -a to tell grep to treat the binary file as text.

$ env LANG=C grep -a '^.*$' file*
file.std:éléphant
file.utf16be:�l�phant

Alternatively, if you know the encoding, you can use iconv to convert the file first and then use grep. One of the following should work.

iconv -f utf16   -t utf8 < file | grep ...
iconv -f utf16le -t utf8 < file | grep ...
iconv -f utf16be -t utf8 < file | grep ...
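When the encoding is not known up front, it can help to first pull out the few lines containing non-ASCII bytes, as the asker did, and only then convert. The following is a sketch under the assumption that the file turns out to be ISO-8859-15, as in the question; sample.csv and its contents are made up for illustration.

```shell
# Build a made-up two-line sample; the second line holds a lone 0xE9 byte.
tmp=$(mktemp -d)
printf '2016-10-25;plain ascii\n2016-10-26;caf\351\n' > "$tmp/sample.csv"

# Line numbers of lines holding bytes outside printable ASCII and whitespace.
# In the C locale, [:print:] covers the ASCII printable characters only.
suspect_lines=$(LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$tmp/sample.csv" | cut -d: -f1)
echo "lines with non-ASCII bytes: $suspect_lines"

# Convert from the assumed source encoding to UTF-8, then grep the result.
converted=$(iconv -f ISO-8859-15 -t UTF-8 < "$tmp/sample.csv" | grep -a '^2016-10-26')
echo "converted line: $converted"

rm -r "$tmp"
```

Looking at the suspect lines in an editor that lets you pick the encoding is one way to confirm the guess before converting a multi-gigabyte file.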
