unimportant vs. default text

  • unimportant vs. default text

    I work with special (read: not so commonly used proprietary format) text files which contain record numbers.

    When comparing them I have 2 requirements:

    1. equal record numbers should always be aligned (even if other lines before or after them are identical and might suggest some other alignment)

    2. everything else but record numbers can be set as unimportant (at least optionally by pressing the "similar" button)

    I managed to create a file format with the right regexp to match my record numbers. Now BC3 recognizes record numbers as their own grammar item and the rest as "default text".

    But this file format has no visible effect on the comparison.

    What is the meaning of the line weights in the grammar tab? Could that be used to influence the alignment according to my #1? My record numbers are not separate lines, so I haven't experimented with that.

    In my comparison everything seems to be always important. Can I make the default text unimportant? I vaguely remember a setting from BC2 times "everything else is unimportant".

    Of course it is mathematically possible to define a regexp for everything other than my record numbers, and then deselect this "everything else" from being important. But that seems more complicated than what I would like to do (or my regexp algebra skills are not good enough...)

    P.S. Yes, I probably somehow mix alignment and importance here. The problem is just that it's not really clear to me how these are related, if at all.
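
The split described above, into record numbers and "everything else", can be sketched outside Beyond Compare in a few lines of Python. This is only an illustration: the thread never shows the proprietary format, so the pattern `REC` followed by digits and the helper name `split_line` are assumptions.

```python
import re

# Hypothetical record-number syntax ("REC" + digits); the real format
# is not given in the thread.
RECORD_RE = re.compile(r"REC\d+")

def split_line(line):
    """Return (record_numbers, default_text) for one line."""
    record_numbers = RECORD_RE.findall(line)
    # Whatever the pattern does not match is the "everything else" text,
    # so no second, complementary regexp is needed.
    default_text = RECORD_RE.split(line)
    return record_numbers, default_text

numbers, rest = split_line("abc REC0042 payload=17 REC0043 end")
print(numbers)  # ['REC0042', 'REC0043']
print(rest)     # ['abc ', ' payload=17 ', ' end']
```

The point of the sketch is that "everything else" falls out as the complement of the record-number match, which is also what the "Everything else" checkbox mentioned in the replies provides.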

  • #2
    Originally posted by ugeuder
    In my comparison everything seems to be always important. Can I make the default text unimportant? I vaguely remember a setting from BC2 times "everything else is unimportant".
    Under the Session Settings Importance tab, remove the check from the "Everything else" checkbox.
    BC v4.0.7 build 19761
    ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯



    • #3
      Hello,

      You define the grammar elements in the File Format, and then set importance in the Session Settings. As Michael suggests, you probably just need to uncheck the necessary values in your Session Settings (the Everything Else value, as well as any other defined Grammars you do not want to be considered Important). Then, Important text will be shown in red, and Unimportant text will be shown in blue. You can toggle Ignore Unimportant Text so that the blue text is displayed as black text instead.

      You can also enable "Never align differences" on the Alignment tab of the Session Settings, so that only your defined important text is used during the alignment and aligned lines always match 100%. Or simply increase the slider for more accurate alignment results.
      Aaron P Scooter Software
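
To make the link between importance and alignment concrete, here is a minimal sketch of aligning two files on their important text only. It is not Beyond Compare's algorithm: it uses Python's `difflib.SequenceMatcher` as a stand-in, and the record-number pattern (`REC` + digits) and the function names are assumptions.

```python
import re
from difflib import SequenceMatcher

RECORD_RE = re.compile(r"REC\d+")  # hypothetical record-number syntax

def important_key(line):
    """Return the record number in a line, or None if it has none."""
    match = RECORD_RE.search(line)
    return match.group(0) if match else None

def align_by_record(left, right):
    """Pair 0-based line indices whose record numbers are equal."""
    # Keep only lines that carry important text; unimportant lines
    # play no part in the alignment.
    lk = [(i, important_key(l)) for i, l in enumerate(left)]
    lk = [(i, k) for i, k in lk if k is not None]
    rk = [(j, important_key(l)) for j, l in enumerate(right)]
    rk = [(j, k) for j, k in rk if k is not None]
    matcher = SequenceMatcher(None, [k for _, k in lk],
                              [k for _, k in rk], autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for off in range(block.size):
            pairs.append((lk[block.a + off][0], rk[block.b + off][0]))
    return pairs

left = ["REC1 a", "noise", "REC2 b", "noise", "REC3 c"]
right = ["noise", "REC2 x", "REC3 y"]
print(align_by_record(left, right))  # [(2, 1), (4, 2)]
```

Identical "noise" lines never pull the alignment off course here because they are filtered out before matching, which is the effect the poster wants from requirement #1.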



      • #4
        Originally posted by Aaron
        Hello,

        As Michael suggests, you probably just need to uncheck the necessary values in your Session Settings.
        Perfect answers, thanks. This does indeed do most of the job.

        I had looked at the Importance tab of the Session Settings window, but somehow I had misread the checkboxes in the "Default text" frame as defining what belongs to the default text category.

        Improvement suggestion: Just add "is important" to all 5 lines in the default text to make it really clear to the confused user.

        - Leading whitespace is important
        ...
        ...
        ...
        - Character case is important

        Yes, it somewhat duplicates the headline, but I don't think that does any harm.


        Originally posted by Aaron
        You can also use the Alignment option under the Session Settings: Alignment tab: Never align differences, so that only your defined important text is used during the alignment, and that it will always match 100%.
        Great, so importance really does have an effect on alignment, too. I wasn't sure whether importance is only for coloring and navigating differences. Improvement suggestion: Add help to the alignment tab and explain how it works and what the various selections really mean.

        Yes, "never align differences" works 95% as expected for my files. My files have different lengths, one covering just 10% of the other. Very close to the end of the shorter file the alignment bails out, starts to align unimportant stuff (which happens to be identical) and after that it can no longer align the important record numbers. But that's not a big issue now, because I can correct it with one manual alignment.

        Thanks again! Problem solved.

        Just for curiosity: What are the line weights in File Formats / Grammar used for?



        • #5
          Line Weights can also help with alignment. Create a definition to match on, and then give it a high priority to help the alignment align that first.

          In your scenario, create a line weight that matches your Manual Alignment (the one you do at the end); that may help you avoid that last step.
          Aaron P Scooter Software



          • #6
            Originally posted by Aaron
            ... that matches your Manual Alignment (that you do at the end), ...
            Hmm, not sure how I would do that. My manual alignment is something like "align line 95 left to line 950 right". How would I define something like that?

            I can of course repeat the regexp of my record number (which tells me how I want it aligned and which I already have in the grammar) and give lines containing the record number priority 5. But that does not seem to make any difference at all.

            If the regexp is in the grammar and this grammar element is the only important one, alignment works for 95% of the file.

            If the grammar is empty and I try to achieve the alignment using line weights, it doesn't work at all. (The same as with no line weight at all)

            If both grammar and line weight are defined it can again align the first 95% correctly but bails out towards the end.
            Last edited by ugeuder; 19-Apr-2009, 12:23 PM. Reason: clarified after reading the manual once more



            • #7
              You can also try the "alternate method" option on the alignment tab in session settings. This compares from both directions (top down & bottom up) which can improve alignment in certain types of files... especially when the top and bottom of both files are similar to each other with a change section in the middle.
              BC v4.0.7 build 19761
              ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯



              • #8
                Originally posted by Michael Bulgrien
                try the "alternate method" option on the alignment tab in session settings. This compares from both directions (top down & bottom up) ...
                Thanks for the hint, Michael. I already wondered in BC2 what this alternate method really is. I think I could have used it earlier (with completely different files), when alignment bailed out and I had to do tens of manual alignments.

                However, this time I don't think it would help. As I wrote above, one of my files has a long tail which the other one doesn't have. And because there are plenty of unimportant lines which match in many places, I'd expect only problems when starting from the end, too. (I'm not at the machine with the files in question right now.)



                • #9
                  The alternate method is actually a completely different way of aligning files.

                  The standard alignment starts at the top and works down using a sliding window of "Skew tolerance" lines. It calculates line similarity based on the number of matching/different characters.

                  The alternate method compares the entirety of both files to each other, so it can catch cases where there's only two matching lines in a 50,000 line file. The fact that it compares from both the top and bottom is an implementation detail and doesn't actually affect the quality of the alignment. It does have a couple of negatives though: It can't compute line similarity, so it just knows whether the important text matches or not, and for large files with lots of differences it can become much slower, but that's pretty rare.

                  In any case, I second Michael's suggestion to try it. It's quite possible it will be able to handle the added tail.
                  Zoë P Scooter Software
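
The difference between the two approaches described in this post can be shown with a small sketch. It is a simplification under stated assumptions, not Beyond Compare's code: the windowed version uses exact line equality rather than character-level similarity, the whole-file version uses `difflib.SequenceMatcher` as a stand-in, and the function names are made up for the example.

```python
from difflib import SequenceMatcher

def windowed_align(left, right, skew):
    """Top-down pass: each left line is searched for only within
    `skew` lines of the current position on the right."""
    pairs, j = [], 0
    for i, line in enumerate(left):
        window = right[j:j + skew + 1]
        if line in window:
            k = j + window.index(line)
            pairs.append((i, k))
            j = k + 1  # never move backwards on the right side
    return pairs

def whole_file_align(left, right):
    """Compare the entirety of both files; only match/no-match is known."""
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    return [(b.a + off, b.b + off)
            for b in matcher.get_matching_blocks()
            for off in range(b.size)]

# One matching line, far outside the window.
left = ["KEY"] + ["l%d" % n for n in range(20)]
right = ["r%d" % n for n in range(20)] + ["KEY"]
print(windowed_align(left, right, skew=5))  # []
print(whole_file_align(left, right))        # [(0, 20)]
```

The windowed pass misses the single match because it lies more than `skew` lines away, while the whole-file comparison finds it at the cost of examining both files in full.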



                  • #10
                    Ok, I've tried the alternate method.

                    My test files are both exactly 1000 lines long. Some lines contain record numbers; these should always be aligned. The left file contains lots of additional lines, so after the last identical record number the right file has a long tail of stuff which never appears on the left.

                    The normal algorithm works as desired until it correctly aligns line 680 to line 69. Thereafter it bails out.

                    If I select "Never align differences" it works until line 947 aligned to line 94. Thereafter it bails out. (With one additional manual alignment everything gets correct up to line 997 / line 96)

                    The alternate method works right away until lines 947 / 94, if "Never align differences" is not selected. "Never align differences" brings no further improvement. One manual alignment solves everything until line 994 if "never align differences" is not selected and until line 997 if "never align differences" is selected.

                    So, yes the alternate method works somewhat better for the patterns in my files.

                    At least with these example files it is easy to test. However, my real "production" files can be longer than a hundred million lines. The standard method takes about 4 hours. Not sure whether I dare to try that with the alternate method...



                    • #11
                      When experimenting a bit more I found a strange behavior in the Alignment tab.

                      It appears to me that the slider and the numerical field are connected to each other.

                      However, when I ...

                      1.) drag the slider to the right end (number changes to 4000)
                      2.) select "never align differences"
                      3.) close the dialog using "OK"
                      4.) open the dialog again

                      ... the slider is back in the default position (equals 2000) but the numerical field still displays 4000.

                      Is it supposed to work like this?



                      • #12
                        Umm, yeah. The alternate alignment probably won't work for hundred-million line comparisons, or if it does, it's quite possible it will take years to finish. Can you email [email protected] with your sample files? Are the record numbers ordered and increasing?
                        Zoë P Scooter Software



                        • #13
                          Yes, the slider is working correctly. It isn't actually stored in your settings; it just provides easy presets for setting the skew tolerance and "Use closeness matching" settings. If those settings match one of the presets the slider reflects them, otherwise it just sticks in the middle.
                          Zoë P Scooter Software
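
The slider behavior described in this post, a control that is derived from the stored settings rather than stored itself, can be sketched as follows. Everything concrete here is an assumption for illustration: only the values 2000 and 4000 appear in the thread, the 500 preset is invented, and it is assumed that selecting "never align differences" switches closeness matching off.

```python
# Hypothetical presets: (skew_tolerance, use_closeness_matching).
# Only 2000 and 4000 are mentioned in the thread; 500 is made up.
PRESETS = [(500, True), (2000, True), (4000, True)]
MIDDLE = 1  # slider position used when no preset matches

def slider_position(skew_tolerance, use_closeness_matching):
    """Derive the slider position from the stored settings."""
    try:
        return PRESETS.index((skew_tolerance, use_closeness_matching))
    except ValueError:
        # The settings match no preset, so the slider sits in the middle
        # while the numeric field keeps showing the stored value.
        return MIDDLE

print(slider_position(4000, True))   # 2
print(slider_position(4000, False))  # 1
```

Under these assumptions the second call reproduces the observation from post #11: the stored skew tolerance is still 4000, but the slider shows the middle position.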



                          • #14
                            If/When you email support, also please include a link to this forum post, so we can reply with results here after we have found a solution.
                            Aaron P Scooter Software



                            • #15
                              Originally posted by Craig
                              Can you email [email protected] with your sample files?
                              Well, obviously I cannot mail you real production files, because they are several gigabytes in size, and still hundreds of megabytes when compressed with bzip2. While it would take only 2 shell commands to split them into acceptable chunks and mail them in hundreds or thousands of mails, it might be a bit more tedious for you to assemble them again.

                              Even if you provided an ftp upload, I would not be happy to send them, because such data volumes might trigger our corporate security watchdogs. Even though I believe there should be no critical confidential data in the files I wouldn't like to argue with somebody from security.

                              With smaller test files I can already get pretty much perfect results (with the help of your advice above). I just tried to understand the options really well in order to get perfect results on the first attempt with huge files, too, because they take several hours to align and trial and error is not feasible.
